RL - Reinforcement Learning

Problem statement
Reinforcement learning (RL) is a set of methods that allow an agent (an intelligent system) to develop an optimal strategy while interacting with an environment (the external world).
At each (usually discrete) moment in time t, the agent can apply an action a_t to the environment. In response, it receives information: an observation o_{t+1}, which fully or partially characterizes the new state of the environment s_{t+1}, and a numerical measure of the success of the action, called the reward r_{t+1}: o_0~\xrightarrow{a_0}~\{o_1,r_1\}~\xrightarrow{a_1}~\{o_2,r_2\}~\xrightarrow{a_2}~... Based on the information received by time t (the trajectory \{...;~ o_{t-2},a_{t-2};~ o_{t-1},a_{t-1};~ o_t,\,\_\}), the agent must choose the next action a_t so as to maximize the cumulative reward R=r_1+r_2+... over a sufficiently long period of time.
The environment can be deterministic or stochastic, fully or partially observable, stationary or non-stationary. If the environment is non-stationary, its properties change over time. Partial observability means that the observation o_t does not carry all the information about the environment (there are "hidden variables"). In a fully observable environment, the observation is equivalent to the state of the environment: s_t=o_t. Finally, in a stochastic environment, even with full observability, there are random factors. Starting from the same state and performing the same sequence of actions can result in different trajectories.
The environment may have terminal states, upon reaching which the "game" episode ends. Typically, episodes also have a time limit (a finite number of time steps). An observation is usually a vector with real or integer components.
Possible actions a_t can belong to a finite set of integers (0: step left, 1: step right) or be a vector of real numbers: {step speed, turn angle}. Many methods can work only with discrete actions or only with continuous ones.
In typical RL tasks, the reward r_{t+1} is often not directly related to the action a_t. Non-zero rewards can be rare, delayed, and may appear only after a series of negative values (the agent suffers for a long time, but then receives a lot). Sometimes rewards come only at the end of the episode (for example, in the game of Go). Rewards can also be consistently negative. For instance, if the task must be solved in minimal time, the reward at each step could be r_t=-1; maximizing the total reward then means minimizing the time to reach the terminal state.
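To make the interaction loop above concrete, here is a minimal sketch assuming a Gym/Gymnasium-style environment interface (env.reset, env.step); the random action choice merely stands in for a real policy.
    import gymnasium as gym                     # assumed environment library

    env = gym.make("CartPole-v1")               # any environment with the standard API

    obs, info = env.reset()                     # initial observation o_0
    total_reward, done = 0.0, False             # cumulative reward R = r_1 + r_2 + ...
    while not done:
        action = env.action_space.sample()      # a_t: random here; an agent would use pi(a|o)
        obs, reward, terminated, truncated, info = env.step(action)   # o_{t+1}, r_{t+1}
        total_reward += reward
        done = terminated or truncated          # terminal state or time limit reached
    print("episode return:", total_reward)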
Environment model
Usually, instead of the observation of the environment o_t, we will refer to its state s_t. It is important to note that, in general, these values might not coincide. For an environment with a complete description s_t=o_t. For an incomplete description, the history of observations and actions must be considered. Even taking into account the entire trajectory s_0,a_0,\,s_1,a_1,..., the state of the environment may be probabilistic or even turn out to be a "thing-in-itself". For instance, suppose a recurrent network is used to "compress" the history of observations: o_{t+1}=g(s_t;~o_t,a_t),~~~~s_{t+1}=f(s_t;~o_t,a_t). In this case, the "auxiliary" memory vector s_t is naturally interpreted as the state of the environment.
During training, the agent can form an environment model, which predicts the probability of the next observation based on the trajectory: P(s_{t+1},r_{t+1}\,|\,...,s_t,a_t). If the reward is determined solely by the state, then the model is P(s_{t+1}\,|\,...,s_t,a_t) with r_t=r(s_t).
If the current state of the environment s_t contains all the information necessary for decision-making, such an environment is called Markovian. Formally, this means that the conditional probability of transitioning to a new state depends only on the current state and action: P(s_{t+1},r_{t+1}\,|\,...,s_{t-1},a_{t-1},\,s_t,a_t)=P(s_{t+1},r_{t+1}\,|\,s_t,a_t). In this case, the future is determined by the present and does not require knowledge of the past. Choosing an optimal strategy in such an environment is called a Markov decision process (MDP).
If the observation does not provide all the information about the state (incomplete description, partial observability), decision-making in such an environment is called a partially observable Markov decision process (POMDP).
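As an illustration of an environment model P(s',r'|s,a), the sketch below stores the transition probabilities of a toy Markovian environment in a dictionary; the states and numbers are invented for the example.
    import numpy as np

    # model[(s, a)] -> list of (probability, next state s', reward r') triples
    model = {
        (0, 0): [(1.0, 1, 0.0)],                 # deterministic transition
        (0, 1): [(0.8, 2, 1.0), (0.2, 0, 0.0)],  # stochastic transition
    }

    def sample_transition(s, a, rng):
        """Draw (s', r') according to P(s', r' | s, a)."""
        probs = [p for p, _, _ in model[(s, a)]]
        i = rng.choice(len(probs), p=probs)
        _, s_next, r = model[(s, a)][i]
        return s_next, r

    rng = np.random.default_rng(0)
    print(sample_transition(0, 1, rng))          # e.g. (2, 1.0)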
Agent policy
The algorithm by which an agent selects an action is commonly referred to as its policy. This can be a straightforward (deterministic) function a_t=\pi(s_t) (a policy function) whose argument is the current state. Previous actions and rewards can also be included in s_t.
To explore the environment, the agent must constantly "experiment". Even in a stationary environment with a complete description, experiments are necessary to avoid getting "stuck" in some region of the state space, as there might be more "profitable" states of the environment. Therefore, in general, the policy is the conditional probability π(a|s) of the agent choosing action a in state s.
A deterministic policy function can easily be transformed into a probabilistic one. For instance, with probability p the action a=\pi(s) that is optimal at the current stage of training is chosen, and with probability 1-p any other action is chosen with equal probability. If the set of actions is discrete, one can, for example, approximate \pi(a|s) with a neural network whose number of outputs equals the number of possible actions and which produces (after a softmax) the probability distribution over the corresponding decisions.
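A minimal sketch of such a network for discrete actions, assuming PyTorch; the layer sizes and the state/action dimensions are illustrative.
    import torch
    import torch.nn as nn

    class PolicyNet(nn.Module):
        """pi(a|s): one output (logit) per discrete action, softmax on top."""
        def __init__(self, state_dim=4, n_actions=2):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_actions))

        def forward(self, state):
            return torch.softmax(self.net(state), dim=-1)   # action probabilities

    policy = PolicyNet()
    s = torch.randn(1, 4)                                   # a dummy state
    probs = policy(s)                                       # e.g. tensor([[0.47, 0.53]])
    a = torch.multinomial(probs, num_samples=1)             # sample a ~ pi(a|s)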
Lack of training data
A significant difference between reinforcement learning and standard supervised learning is the absence of training data of the form \{(s,a),\,(s',a'),\,...\}, where a are the optimal actions "known to the teacher" that should be performed when the environment is in the given state. The agent must solve the task even if a human does not know how to solve it (i.e., does not know which actions are optimal).
There are also technical challenges. For instance, suppose the policy function \pi_\theta(a|s) is approximated by a neural network with parameters \theta, and we aim to maximize the average total reward \bar{R}=\langle r_1+r_2+...\rangle over a large number of finite episodes. Essentially, this is a standard optimization problem. Given a policy, the average reward \bar{R} is also a function of the parameters, and the task is to find its extremum: \theta^*=\arg\max_\theta \bar{R}(\theta).
When there are few parameters, classic optimization methods (like Nelder-Mead, Powell, etc.) can be applied. With more parameters, stochastic methods such as genetic algorithms or simulated annealing might work. However, for deep learning these approaches are practically useless due to the very large number of parameters. Gradient-based methods, which are effective in this case, cannot be applied directly because there is no mathematical model of the environment (it is a "black box" for the agent). This prevents the computation of the derivatives of the objective function \bar{R}(\theta) with respect to the parameters \theta, necessitating various tricks that constitute the essence of reinforcement learning.
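As a contrast to gradient methods, the sketch below treats \bar{R}(\theta) as a black-box objective and tunes the parameters of a simple linear policy by pure random search (Gymnasium and the CartPole environment are assumed); this is only an illustration of the optimization view, not a practical RL method.
    import gymnasium as gym
    import numpy as np

    env = gym.make("CartPole-v1")

    def average_return(theta, n_episodes=5):
        """Average total reward of the deterministic policy a = [theta . o > 0]."""
        total = 0.0
        for _ in range(n_episodes):
            obs, _ = env.reset()
            done = False
            while not done:
                action = int(np.dot(theta, obs) > 0.0)
                obs, r, terminated, truncated, _ = env.step(action)
                total += r
                done = terminated or truncated
        return total / n_episodes

    rng = np.random.default_rng(0)
    best_theta, best_R = None, -np.inf
    for _ in range(100):                         # blind search in parameter space
        theta = rng.normal(size=4)
        R = average_return(theta)
        if R > best_R:
            best_theta, best_R = theta, R
    print("best average return:", best_R)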
Examples of RL problems
Many complex problems have been solved using reinforcement learning. Interest in this field surged after DeepMind successfully applied deep learning to decision-making in Atari video games with an efficiency comparable to that of a typical human player (2013). Among recent achievements, the most notable is the AlphaGo system (from the same company), which managed to defeat the world champion in the game of Go (2016). Let's consider a few typical RL problems.
Multi-Armed Bandit
There is a set of one-armed bandits – slot machines with a lever. When the agent pulls the lever of a machine, he may receive a reward.
Each machine has its own (unknown) probability distribution of winnings: P(r|a). The agent's task, as usual, is to obtain the maximum cumulative reward. In this problem, there are no observations. The agent's action is an integer representing the number of the machine whose lever the agent pulls at the current step.
This is a classic educational problem typically used to illustrate the balance between exploration and exploitation. If the agent initially pulls the lever of each machine sequentially and then only pulls the levers of those that provided the highest reward, he is unlikely to achieve the maximum average income over a long period. At the beginning of learning, he needs to explore the environment (assess probabilities and rewards) to develop an optimal strategy. For example, some machines may have a low probability of delivering a very large reward.
A simple yet reasonable method is to compute the action value Q_t(a)=\langle r(a)\rangle_t, equal to the average income that the a-th machine has brought up to the current moment t. When choosing an action, an \varepsilon-greedy strategy is used: a parameter \varepsilon \in [0...1] is set; with probability \varepsilon a machine is chosen at random, and with probability 1-\varepsilon the lever of the most profitable machine is pulled: a_t = \left\{ \begin{array}{ll} \text{random action}, &~~~~~\text{with probability}~ \varepsilon\\ \displaystyle \arg\max_a Q_t(a), &~~~~~\text{with probability}~ 1-\varepsilon \end{array} \right. As Q_t(a) stabilizes, the parameter \varepsilon is decreased. From a theoretical point of view, the problem is interesting because many theorems about the efficiency of different methods can be proven for it.
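A sketch of the \varepsilon-greedy strategy for Bernoulli bandits; the true winning probabilities p_true are invented for the illustration and are, of course, unknown to the agent.
    import numpy as np

    rng = np.random.default_rng(0)
    p_true = np.array([0.3, 0.5, 0.7])      # unknown to the agent: P(r=1|a)
    n_arms = len(p_true)

    Q = np.zeros(n_arms)                    # action values Q_t(a): running average reward
    N = np.zeros(n_arms)                    # number of times each arm was pulled
    eps = 0.1

    total = 0.0
    for t in range(10_000):
        if rng.random() < eps:
            a = rng.integers(n_arms)        # explore: random arm
        else:
            a = int(np.argmax(Q))           # exploit: currently best arm
        r = float(rng.random() < p_true[a]) # reward 0 or 1
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]           # incremental average
        total += r
    print(Q)                                # the estimates approach p_true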
Frozen Lake
There is a square, unchanging map of a frozen lake, with some cells containing holes. The task is to guide a gnome from the top-left corner of the map (the starting state) to the bottom-right corner (the target state) without falling into a hole. The gnome is somewhat tipsy and wobbles left and right. Therefore, with a probability of 1/3 he will move where the agent intends, and with a probability of 1/3 in each of the two perpendicular directions. If he tries to move off the map, the gnome remains in the same cell (or wobbles in one of the perpendicular directions).
Thus, there are 16 discrete states (cell numbers) and 4 discrete actions (0: LEFT, 1: DOWN, 2: RIGHT, 3: UP). Falling into a hole or reaching the target cell ends the episode. Reaching the target cell (terminal state) gives a reward of 1; otherwise, the reward is 0. The FrozenLake environment has a complete description o_t=s_t (the agent knows the number of the cell where the gnome is located), but it is highly stochastic and has very sparse rewards (only at the end). The optimal actions of the gnome are shown with arrows in the illustration; explaining them is suggested as a simple exercise.
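The environment is available in the Gym/Gymnasium collection; a short sketch of inspecting it and playing one random episode (is_slippery=True gives the stochastic dynamics described above).
    import gymnasium as gym

    env = gym.make("FrozenLake-v1", is_slippery=True)
    print(env.observation_space)                 # Discrete(16): the cell number
    print(env.action_space)                      # Discrete(4): LEFT, DOWN, RIGHT, UP

    obs, _ = env.reset()
    done = False
    while not done:
        obs, r, terminated, truncated, _ = env.step(env.action_space.sample())
        done = terminated or truncated
    print("final cell:", obs, "reward:", r)      # r = 1 only if the goal was reached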
Cart Pole
The environment is a pole attached by a hinge to a cart that moves frictionlessly along a track. In the initial state, the pole is vertical (with a slight random deviation) and tends to fall under the influence of gravity. The task is to prevent the pole from falling by pushing the cart to the right or left.
In this problem, the observation consists of four physical parameters: o_t=\{x,\,\,v,\,\phi,\,\omega\}, where x is the position of the cart, v is its velocity, \phi is the angle of deviation of the pole from vertical, \omega is the angular velocity of the pole. The agent's actions are binary: push the cart to the right (a=1) or to the left (a=0). After each action, the reward is r=1 as long as the angle |\phi| < 12^\circ and the cart is within the screen. Otherwise, the episode ends. Thus, the agent's task is to choose actions that keep the pole vertical for as long as possible.
The CartPole environment has a complete description, and the optimal action of the agent is fully determined by the current observation s_t=o_t. If the velocities are removed from the observation: o_t=\{x,\,\phi\}, the environment will have an incomplete description. For a quality solution in this case, it is necessary to consider two consecutive observations: s_t = \{o_t,\, o_{t-1}\}.
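A small sketch of this idea for the incomplete-description case: drop the velocities from the CartPole observation and build the state from two consecutive observations, s_t = \{o_t,\, o_{t-1}\} (the function names here are made up for the illustration).
    import numpy as np

    def restrict(obs):
        """Incomplete description: keep only position x and angle phi."""
        x, v, phi, omega = obs
        return np.array([x, phi])

    def stacked_state(o_t, o_prev):
        """State for the partially observable case: two consecutive observations."""
        return np.concatenate([restrict(o_t), restrict(o_prev)])

    o_prev = np.array([0.00, 0.10, 0.020, -0.1])   # dummy CartPole observations
    o_t    = np.array([0.01, 0.12, 0.015, -0.2])
    print(stacked_state(o_t, o_prev))              # x_t, phi_t, x_{t-1}, phi_{t-1}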
Mountain Car
In this environment, a car is situated between two hills of different heights. The observation consists of the car's position and velocity o_t=\{x,\,\,v\}. The agent can push the car to the right (a=2), do nothing (a=1) or push the car to the left (a=0). The goal is to reach the right (higher) hill. If the car reaches the left edge of the screen, it stops (inelastic collision). Each action results in a reward of r=-1, and the episode ends when the car reaches the target (flag in the illustration).
If the agent consistently chooses the action a=2 (towards the goal), in most cases, the car won't have enough "power" to climb the right hill. Therefore, the agent needs to swing the car back and forth to build up enough speed to reach the target. This must be done optimally to minimize the overall time (number of actions) spent solving the task.
This is a deterministic task with a complete description. It is noticeably more complex than the previous one. In Cart Pole, random actions can keep the pole upright for some time, but in this environment, random actions almost never lead to the top. Since the episode is limited to 200 time steps, the total reward remains unchanged (equal to -200) over a large region of parameter space. Therefore, there are no clear indicators of where the optimal average reward lies. Additionally, incorrect actions in Cart Pole have an immediate effect, whereas in Mountain Car their effect is "delayed" until the end of the episode.
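To see that "swinging" is indeed the key, here is a hand-crafted (not learned) policy sketch that always pushes in the direction of the current velocity (Gymnasium assumed); it typically reaches the flag well before the 200-step limit.
    import gymnasium as gym

    env = gym.make("MountainCar-v0")
    obs, _ = env.reset()
    done, total = False, 0.0
    while not done:
        x, v = obs
        action = 2 if v >= 0 else 0              # push in the direction of motion
        obs, r, terminated, truncated, _ = env.step(action)
        total += r
        done = terminated or truncated
    print("return:", total)                      # noticeably better than -200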

Pong
Pong is a classic game from the Atari collection in which you move a paddle up and down to hit a ball flying between the players. In the OpenAI Gym collection, the observation is a 210 x 160 pixel image. If the ball passes the opponent's paddle, the agent receives a positive reward; if it passes the agent's paddle, a negative one. For other events there are no rewards. Thus, non-zero rewards are very rare (most of the time they are zero).
In Atari Pong, the opponent follows a fairly simple paddle control algorithm (unknown to the agent). Therefore, this is, generally speaking, an environment with an incomplete description. Especially if the opponent is not the Atari algorithm, but another version of an agent.

Bipedal Walker
This is one of the tasks related to controlling robotic mechanical models. A bipedal robot needs to learn to move across slightly rough terrain. Each leg has two joints with motors controlled by the agent.
The agent's action consists of four real numbers (the torques applied to the joints of the robot). The observation consists of 24 real numbers. The first 14 describe the state of the hull and legs (angles, velocities, ground contact). Another 10 components of the observation vector are the readings of a lidar (a laser range scanner). Rewards are positive for moving the robot forward and equal to -100 if it falls. Active use of the motors slightly reduces the reward. The task is to advance as far as possible without falling within 1600 time steps.
Unlike the previous tasks, actions in this environment are continuous.
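For continuous actions, a stochastic policy is often modeled as a Gaussian whose mean is produced by a network; a sketch assuming PyTorch, with dimensions matching the description above (24 observations, 4 torques).
    import torch
    import torch.nn as nn

    class GaussianPolicy(nn.Module):
        """pi(a|s) = Normal(mu(s), sigma) for a vector of continuous actions."""
        def __init__(self, obs_dim=24, act_dim=4):
            super().__init__()
            self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                      nn.Linear(64, act_dim), nn.Tanh())  # torques in [-1, 1]
            self.log_std = nn.Parameter(torch.zeros(act_dim))             # learned spread

        def forward(self, obs):
            dist = torch.distributions.Normal(self.mean(obs), self.log_std.exp())
            return dist.sample()                                          # a ~ pi(a|s)

    policy = GaussianPolicy()
    action = policy(torch.randn(24))             # four real-valued torques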
Recommendation System
Imagine users on the internet are shown some information (a news article, a product, etc.). Along with it, brief descriptions of "related" information (other news articles or products) are also provided. The task is to display the most relevant related information for the given user.
The observation is a combination of the feature vector of the information and the features characterizing the user. The action is the selection (from a very large set) of a finite number of elements (related information). The reward is the fact of clicking on one of the displayed elements.
If the agent learns in real-world conditions, it is clear that the environment it interacts with has an incomplete description and is stochastic and non-stationary.
Q and V functions
Let's assume the agent's policy \pi(a|s) is fixed. It can be deterministic or probabilistic, but it is assumed that the agent always follows it when making decisions. Then, being in state s_t and taking action a_t (and then continuing to follow \pi), the agent will receive some cumulative reward in the future, which is typically discounted: R_{t+1} = r_{t+1} + \gamma\, r_{t+2} + \gamma^2\, r_{t+3} + ...~=~r_{t+1} + \gamma\,R_{t+2}. The discount factor \gamma \in [0..1] is a real hyperparameter of the learning process. If \gamma = 1, then R_{t+1} is simply the total future reward. When \gamma is less than 1, it implies that future rewards are considered less important to the agent than immediate ones. This reflects uncertainty about these future rewards (uncertainty of the future). Finally, the parameter \gamma technically allows working with divergent infinite sums. For instance, if all r_t=r=\mathrm{const}, due to the fact that 1+\gamma+\gamma^2+...=1/(1-\gamma), such a sum will be equal to R_{t+1}=r/(1-\gamma). Typically, \gamma ranges from 0.9 to 1.
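The recursion R_{t+1} = r_{t+1} + \gamma\,R_{t+2} translates directly into code; a small sketch that turns a list of episode rewards into discounted returns.
    def discounted_returns(rewards, gamma=0.99):
        """Returns R_t for every step, computed backwards via R_t = r_t + gamma * R_{t+1}."""
        R, returns = 0.0, []
        for r in reversed(rewards):
            R = r + gamma * R
            returns.append(R)
        return returns[::-1]

    print(discounted_returns([1, 1, 1], gamma=0.9))   # [2.71, 1.9, 1.0]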
We assume that the agent, when choosing the next action, aims to maximize the future cumulative reward. Since both the policy and the environment can be probabilistic, it is necessary to maximize the average value of this reward. If the environment is in a given state s and the agent takes action a, then this average value of R_{t+1} (denoted by angle brackets) is conditional (the conditions appear after the vertical bar): Q^\pi(s,\,a) = \langle R_{t+1} ~|~ s_t=s,\,a_t=a \rangle_\pi. The index \pi indicates that the averaging over different trajectories (sequences of states and actions) is done under the fixed agent policy \pi. If the optimal utility Q^*(s,a), taken over all possible policies, is known, then the choice of the optimal action in a given state is obvious: Q^*(s,a) = \max_\pi Q^\pi(s,a),~~~~~~~~~~~~~~~a^* = \arg\max_a Q^*(s,a), i.e., among all possible policies \pi, the one for which Q is maximal is chosen. The optimal action a^* is the one for which Q^* is maximal among all possible actions.
The function Q(s,a) is called the utility function of the current state and action, or the Q-function. It plays an important role in many reinforcement learning methods. Some of these methods, instead of searching for the optimal policy function, build the "optimal" utility function, and then use it to find the optimal policy.
If for a given policy \pi(a_1|s) > \pi(a_2|s) while Q^\pi(s,a_1) < Q^\pi(s,a_2), this means that there exists a better policy \pi' with \pi'(a_2|s) > \pi'(a_1|s). Dynamic programming (DP) is the iterative evaluation of a policy, i.e., computing Q^\pi(s,a), followed by improving the policy: a=\pi(s)=\arg\max_a Q(s,a), and then repeating the evaluation and improvement steps. If the environment is Markovian, DP converges to the optimal policy \pi^*.
Similarly, we can talk about the utility function of a state for a given policy \pi: V^\pi(s)=\langle R_{t+1} ~|~ s_t=s \rangle_\pi. Sometimes the advantage function is also introduced: A(s,a) = Q(s,a) - V(s), which shows how much better or worse a specific action a is in state s compared to choosing an action according to the policy \pi(a|s). Clearly, for the optimal policy and values A^*(s,a^*)=0, and for all other actions A^*(s,a)\le 0. With a non-optimal policy, the sign of A(s,a) can be arbitrary.

Let's consider a simple example to illustrate the given definitions. Suppose we have a deterministic environment with seven states, four of which are terminal (gray circles). In each state, one of two actions a\in\{0,1\} (up or down) can be taken.
First, let's consider a policy where the agent chooses either action with equal probability in each state. In this environment, there are four equally probable trajectories starting from state s_0. Therefore, the value of state s_0 is (\gamma=1): V^\pi(s_0)=\frac{1}{4}\,\bigl[(4+0)+(4+4)+(2+16)+(2-4)\bigr]= \color{red}{\bf 7}. Similarly, the action values in state s_0 are computed as the next reward plus the average over the subsequent trajectories: Q^\pi(s_0, 0)=4 + \frac{0+4}{2}=\color{red}{\bf 6},~~~~~~~~~~~~~~~ Q^\pi(s_0, 1)=2 +\frac{16-4}{2}=\color{red}{\bf 8}. Thus, the advantage function A(s,a) for this policy takes the values \pm 1 and the action a=1 is more "profitable."
Now, let's take the optimal policy, which is obviously deterministic: \pi^*(s_0) = 1, ~~~~~~~\pi^*(s_1) = 1,~~~~~~~\pi^*(s_2) = 0. For it, the state values for s_0 are: V^*(s_0)=Q^*(s_0,1)= \color{red}{\bf 18},~~~~~~~~Q^*(s_0,0)=\color{red}{\bf 8}. In this case, V^*(s)=\max_a Q^*(s,a) and A^*(s_0,0)=-10, A^*(s_0,1)=0. As an exercise, consider the worst policy and compute its V and Q.
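To check these numbers, the example can be encoded directly; the concrete tree below is reconstructed from the four trajectories listed above (\gamma=1), so its structure is an assumption made for the illustration.
    # rewards and transitions reconstructed from the four trajectories above
    tree = {
        's0': {0: (4, 's1'), 1: (2, 's2')},
        's1': {0: (0, None), 1: (4, None)},      # None marks a terminal state
        's2': {0: (16, None), 1: (-4, None)},
    }

    def Q(s, a, policy, gamma=1.0):
        r, s_next = tree[s][a]
        return r if s_next is None else r + gamma * V(s_next, policy, gamma)

    def V(s, policy, gamma=1.0):
        return sum(p * Q(s, a, policy, gamma) for a, p in policy[s].items())

    random_pi  = {s: {0: 0.5, 1: 0.5} for s in tree}
    optimal_pi = {'s0': {1: 1.0}, 's1': {1: 1.0}, 's2': {0: 1.0}}

    print(V('s0', random_pi))                                  # 7.0
    print(Q('s0', 0, random_pi), Q('s0', 1, random_pi))        # 6.0  8.0
    print(V('s0', optimal_pi), Q('s0', 0, optimal_pi))         # 18.0  8.0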
A bit of math
Let's formalize the definitions given above. These formulas will become clearer when considered in specific cases. Let s_t be the state of the environment at time t. For environments with a complete description s_t=o_t. In general, the state is some function of the current observation and all previous observations and actions s_t=\{o_0,a_0,...,o_{t-1},a_{t-1},o_{t}\}. We define:
- Policy function: \pi(a|s) - the probabilities of taking action a in state s.
- Environment model: P(s',r'|s,a) - the probabilities that after taking action a, the environment transitions from state s to state s' and gives reward r'.
- Initial state probability: p(s_0). In most RL tasks, the initial state s_0 is not fixed but is chosen with equal probability from a small region of the state space.
- Episode trajectory: \tau_0 = \{s_0,a_0,\,s_1,a_1,r_1,\,s_2,a_2,r_2,...\} - a sequence of states (observations), actions taken by the agent, and rewards received.
The utility of a state and action (s,a) for a given policy \pi is the average discounted reward the agent will receive in the future when taking action a in state s: Q^\pi(s,a) = \langle R_{t+1} | s_t=s,\,a_t=a \rangle_\pi. The angle brackets denote averaging over all subsequent states, actions, and rewards. Explicitly, this conditional average is: Q^\pi(s_t,a_t) = \sum_{s_{t+1},\,a_{t+1},\,r_{t+1},...} \bigl[ P(s_{t+1},r_{t+1}|s_t,a_t)\cdot\pi(a_{t+1}|s_{t+1})\,P(s_{t+2},r_{t+2}|s_{t+1},a_{t+1})\cdot...\bigr]\cdot R_{t+1}. The utility of a state s for a given policy \pi is the average discounted reward the agent will receive starting from state s (the index \pi is omitted hereafter): V(s) = \sum_a \pi(a|s)\, Q(s,a). The utility functions of the state and of the state-action pair satisfy the Bellman equations: Q(s,a) = \sum_{s',\,r'} P(s',r'\,|\,s, a)\,\bigl[ r' + \gamma\,V(s')\bigr]. These equations follow from the relation R_t=r_t+\gamma\,R_{t+1}. Thus, the utility of a state V(s) equals the reward received upon transition to the next state s' plus the discounted utility of that state (future rewards from it), averaged over all possible actions, states s', and rewards r': V(s) = \sum_{a,\,s',\,r'} \pi(a|s)\,P(s',r'|s,a)\, \bigl[ r' + \gamma\,V(s') \bigr]. These equations are discussed in more detail in the following document.
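A sketch of iterative policy evaluation that repeatedly applies the Bellman equation for V(s) to a tabular MDP; the dictionary formats of P and \pi used below are assumptions made for the illustration.
    import numpy as np

    def policy_evaluation(P, pi, n_states, gamma=0.99, tol=1e-8):
        """P[(s, a)] = list of (prob, s_next, reward, terminal); pi[s] = {a: probability}."""
        V = np.zeros(n_states)
        while True:
            delta = 0.0
            for s in range(n_states):
                # V(s) = sum_a pi(a|s) sum_{s',r'} P(s',r'|s,a) [r' + gamma V(s')]
                v = sum(pi[s][a] * sum(p * (r + gamma * V[sn] * (not term))
                                       for p, sn, r, term in P[(s, a)])
                        for a in pi[s])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                return V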
Classification of ML methods
Many different algorithms have been developed in reinforcement learning. Unlike standard supervised learning, these methods are very sensitive to hyperparameters and take a long time to train. Furthermore, there are no universal methods. Depending on the task, a particular method may be ineffective or not work at all. Some methods work only with discrete actions, while others only with continuous ones.
✒ When comparing methods by efficiency, pay attention to three main criteria:
- Speed is characterized by the number of episodes needed to reach certain target values of the average total reward (over the last N episodes). An algorithm that is "fast" in this sense may still take a long wall-clock time if its training cycles are computationally expensive or run frequently.
- Optimality refers to how high an average total reward the method can ultimately achieve, regardless of speed.
- Stability means the conditional monotonicity of convergence to the target values. Unstable methods may collapse to suboptimal values before reaching the maximum. However, they sometimes achieve good results midway through training, which is why intermediate training results are usually saved.
✒ According to the use of data for training, there are:
- Online methods: new data is received and immediately used for training
- Offline methods: data is accumulated and then used for training
✒ Based on the policy used to evaluate quality, methods are distinguished as:
- On-policy - actions are chosen using the current policy
- Off-policy - actions are chosen using a different policy (old or \epsilon-greedy)
✒ Based on the use of models, policies, and value functions in the method, methods are distinguished as:
- those that model the environment (model-based) or not (model-free)
- those that directly compute:
- the policy
- state value
- both the policy and value (actor-critic)
Some idea of the diversity of methods can be gained from the following diagram from OpenAI.
Additional sources
- Sutton R.S., Barto A.G. "Reinforcement Learning: An Introduction"
- Deep Reinforcement Learning - Julien Vitay (and pdf)
- Introduction to Reinforcement Learning - Markel Sanz Ausin
- Simple Reinforcement Learning with Tensorflow
- UCL Course on RL
