RL - Reinforcement Learning

Problem statement
Reinforcement learning (RL) is a set of methods that allow an agent (an intelligent system) to develop an optimal strategy while interacting with an environment (the external world).
At each (usually discrete) moment in time t, the agent can apply an action a_t to the environment. In response, it receives information: an observation o_{t+1}, which fully or partially characterizes the new state of the environment s_{t+1}, and a numerical measure of the success of the action, called the reward r_{t+1}: o_0~\xrightarrow{a_0}~\{o_1,r_1\}~\xrightarrow{a_1}~\{o_2,r_2\}~\xrightarrow{a_2}~... Based on the information received by time t (the trajectory \{...;~ o_{t-2},a_{t-2};~ o_{t-1},a_{t-1};~ o_t,\,\_\}), the agent must choose the next action a_t so as to maximize the cumulative reward R=r_1+r_2+... over a sufficiently long period of time.
The environment can be deterministic or stochastic, fully or partially observable, stationary or non-stationary. If the environment is non-stationary, its properties change over time. Partial observability means that the observation o_t does not carry all the information about the environment (there are "hidden variables"). In a fully observable environment, the observation is equivalent to the state of the environment: s_t=o_t. Finally, in a stochastic environment, even with full observability, there are random factors. Starting from the same state and performing the same sequence of actions can result in different trajectories.
The environment may have terminal states, upon reaching which the "game" episode ends. Typically, episodes also have a time limit (a finite number of time steps). An observation is usually a vector with real or integer components.
Possible actions a_t can belong to a finite set of integers (0: step left, 1: step right) or be a vector of real numbers: {step speed, turn angle}. Many methods can work only with discrete actions or only with continuous ones.
In typical RL tasks, the reward r_{t+1} is often not directly related to the action a_t. Non-zero rewards can be rare, delayed, and may appear only after a series of negative values (the agent suffers for a long time, but then receives a lot). Sometimes rewards come only at the end of the episode (for example, in the game of Go). Rewards can also be consistently negative. For instance, if the task must be solved in minimal time, the reward at each step could be r_t=-1; maximizing the total reward then means minimizing the time to reach the terminal state.
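To make the interaction loop above concrete, here is a minimal sketch assuming a Gym/Gymnasium-style environment interface (env.reset, env.step); the random action choice merely stands in for a real policy.
    import gymnasium as gym                     # assumed environment library

    env = gym.make("CartPole-v1")               # any environment with the standard API

    obs, info = env.reset()                     # initial observation o_0
    total_reward, done = 0.0, False             # cumulative reward R = r_1 + r_2 + ...
    while not done:
        action = env.action_space.sample()      # a_t: random here; an agent would use pi(a|o)
        obs, reward, terminated, truncated, info = env.step(action)   # o_{t+1}, r_{t+1}
        total_reward += reward
        done = terminated or truncated          # terminal state or time limit reached
    print("episode return:", total_reward)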
Environment model
Usually, instead of the observation of the environment o_t, we will refer to its state s_t. It is important to note that, in general, these values might not coincide. For an environment with a complete description s_t=o_t. For an incomplete description, the history of observations and actions must be considered. Even taking into account the entire trajectory s_0,a_0,\,s_1,a_1,..., the state of the environment may be probabilistic or even turn out to be a "thing-in-itself". For instance, suppose a recurrent network is used to "compress" the history of observations: o_{t+1}=g(s_t;~o_t,a_t),~~~~s_{t+1}=f(s_t;~o_t,a_t). In this case, the "auxiliary" memory vector s_t is naturally interpreted as the state of the environment.
During training, the agent can form an environment model, which predicts the probability of the next observation based on the trajectory: P(s_{t+1},r_{t+1}\,|\,...,s_t,a_t). If the reward is determined solely by the state, then the model is P(s_{t+1}\,|\,...,s_t,a_t) with r_t=r(s_t).
If the current state of the environment s_t contains all the information necessary for decision-making, such an environment is called Markovian. Formally, this means that the conditional probability of transitioning to a new state depends only on the current state and action: P(s_{t+1},r_{t+1}\,|\,...,s_{t-1},a_{t-1},\,s_t,a_t)=P(s_{t+1},r_{t+1}\,|\,s_t,a_t). In this case, the future is determined by the present and does not require knowledge of the past. Choosing an optimal strategy in such an environment is called a Markov decision process (MDP).
If the observation does not provide all the information about the state (incomplete description, partial observability), decision-making in such an environment is called a partially observable Markov decision process (POMDP).
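As an illustration of an environment model P(s',r'|s,a), the sketch below stores the transition probabilities of a toy Markovian environment in a dictionary; the states and numbers are invented for the example.
    import numpy as np

    # model[(s, a)] -> list of (probability, next state s', reward r') triples
    model = {
        (0, 0): [(1.0, 1, 0.0)],                 # deterministic transition
        (0, 1): [(0.8, 2, 1.0), (0.2, 0, 0.0)],  # stochastic transition
    }

    def sample_transition(s, a, rng):
        """Draw (s', r') according to P(s', r' | s, a)."""
        probs = [p for p, _, _ in model[(s, a)]]
        i = rng.choice(len(probs), p=probs)
        _, s_next, r = model[(s, a)][i]
        return s_next, r

    rng = np.random.default_rng(0)
    print(sample_transition(0, 1, rng))          # e.g. (2, 1.0)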
Agent policy
The algorithm by which an agent selects an action is commonly referred to as its policy. This can be a straightforward (deterministic) function a_t=\pi(s_t) (a policy function) whose argument is the current state. Previous actions and rewards can also be included in s_t.
To explore the environment, the agent must constantly "experiment". Even in a stationary environment with a complete description, experiments are necessary to avoid getting "stuck" in some region of the state space, as there might be more "profitable" states of the environment. Therefore, in general, the policy is the conditional probability π(a|s) of the agent choosing action a in state s.
A deterministic policy function can easily be transformed into a probabilistic one. For instance, with probability p the action a=\pi(s) that is optimal at the current stage of training is chosen, and with probability 1-p any other action is chosen with equal probability. If the set of actions is discrete, one can, for example, approximate \pi(a|s) with a neural network whose number of outputs equals the number of possible actions and which produces (after a softmax) the probability distribution over the corresponding decisions.
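A minimal sketch of such a network for discrete actions, assuming PyTorch; the layer sizes and the state/action dimensions are illustrative.
    import torch
    import torch.nn as nn

    class PolicyNet(nn.Module):
        """pi(a|s): one output (logit) per discrete action, softmax on top."""
        def __init__(self, state_dim=4, n_actions=2):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_actions))

        def forward(self, state):
            return torch.softmax(self.net(state), dim=-1)   # action probabilities

    policy = PolicyNet()
    s = torch.randn(1, 4)                                   # a dummy state
    probs = policy(s)                                       # e.g. tensor([[0.47, 0.53]])
    a = torch.multinomial(probs, num_samples=1)             # sample a ~ pi(a|s)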
Lack of training data
A significant difference between reinforcement learning and standard supervised learning is the absence of training data of the form \{(s,a),\,(s',a'),\,...\}, where a are the optimal actions "known to the teacher" that should be performed when the environment is in the given state. The agent must solve the task even if a human does not know how to solve it (i.e., does not know which actions are optimal).
There are also technical challenges. For instance, suppose the policy function \pi_\theta(a|s) is approximated by a neural network with parameters \theta, and we aim to maximize the average total reward \bar{R}=\langle r_1+r_2+...\rangle over a large number of finite episodes. Essentially, this is a standard optimization problem. Given a policy, the average reward \bar{R} is also a function of the parameters, and the task is to find its extremum: \theta^*=\arg\max_\theta \bar{R}(\theta).
When there are few parameters, classic optimization methods (like Nelder-Mead, Powell, etc.) can be applied. With more parameters, stochastic methods such as genetic algorithms or simulated annealing might work. However, for deep learning these approaches are practically useless due to the very large number of parameters. Gradient-based methods, which are effective in this case, cannot be applied directly because there is no mathematical model of the environment (it is a "black box" for the agent). This prevents the computation of the derivatives of the objective function \bar{R}(\theta) with respect to the parameters \theta, necessitating various tricks that constitute the essence of reinforcement learning.
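As a contrast to gradient methods, the sketch below treats \bar{R}(\theta) as a black-box objective and tunes the parameters of a simple linear policy by pure random search (Gymnasium and the CartPole environment are assumed); this is only an illustration of the optimization view, not a practical RL method.
    import gymnasium as gym
    import numpy as np

    env = gym.make("CartPole-v1")

    def average_return(theta, n_episodes=5):
        """Average total reward of the deterministic policy a = [theta . o > 0]."""
        total = 0.0
        for _ in range(n_episodes):
            obs, _ = env.reset()
            done = False
            while not done:
                action = int(np.dot(theta, obs) > 0.0)
                obs, r, terminated, truncated, _ = env.step(action)
                total += r
                done = terminated or truncated
        return total / n_episodes

    rng = np.random.default_rng(0)
    best_theta, best_R = None, -np.inf
    for _ in range(100):                         # blind search in parameter space
        theta = rng.normal(size=4)
        R = average_return(theta)
        if R > best_R:
            best_theta, best_R = theta, R
    print("best average return:", best_R)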
Examples of RL problems
Many complex problems have been solved using reinforcement learning. Interest in this field surged after DeepMind successfully applied deep learning to decision-making in Atari video games with an efficiency comparable to that of a typical human player (2013). Among recent achievements, the most notable is the AlphaGo system (from the same company), which managed to defeat the world champion in the game of Go (2016). Let's consider a few typical RL problems.
Multi-Armed Bandit
There is a set of one-armed bandits – slot machines with a lever. When the agent pulls the lever of a machine, he may receive a reward.
Each machine has its own (unknown) probability distribution of winnings: P(r|a). The agent's task, as usual, is to obtain the maximum cumulative reward. In this problem, there are no observations. The agent's action is an integer representing the number of the machine whose lever the agent pulls at the current step.
This is a classic educational problem typically used to illustrate the balance between exploration and exploitation. If the agent initially pulls the lever of each machine sequentially and then only pulls the levers of those that provided the highest reward, he is unlikely to achieve the maximum average income over a long period. At the beginning of learning, he needs to explore the environment (assess probabilities and rewards) to develop an optimal strategy. For example, some machines may have a low probability of delivering a very large reward.
A simple yet reasonable method is to compute the action value Q_t(a)=\langle r(a)\rangle_t, equal to the average income that the a-th machine has brought up to the current moment t. When choosing an action, an \varepsilon-greedy strategy is used: a parameter \varepsilon \in [0...1] is set; with probability \varepsilon a machine is chosen at random, and with probability 1-\varepsilon the lever of the most profitable machine is pulled: a_t = \left\{ \begin{array}{ll} \text{random action}, &~~~~~\text{with probability}~ \varepsilon\\ \displaystyle \arg\max_a Q_t(a), &~~~~~\text{with probability}~ 1-\varepsilon \end{array} \right. As Q_t(a) stabilizes, the parameter \varepsilon is decreased. From a theoretical point of view, the problem is interesting because many theorems about the efficiency of different methods can be proven for it.
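A sketch of the \varepsilon-greedy strategy for Bernoulli bandits; the true winning probabilities p_true are invented for the illustration and are, of course, unknown to the agent.
    import numpy as np

    rng = np.random.default_rng(0)
    p_true = np.array([0.3, 0.5, 0.7])      # unknown to the agent: P(r=1|a)
    n_arms = len(p_true)

    Q = np.zeros(n_arms)                    # action values Q_t(a): running average reward
    N = np.zeros(n_arms)                    # number of times each arm was pulled
    eps = 0.1

    total = 0.0
    for t in range(10_000):
        if rng.random() < eps:
            a = rng.integers(n_arms)        # explore: random arm
        else:
            a = int(np.argmax(Q))           # exploit: currently best arm
        r = float(rng.random() < p_true[a]) # reward 0 or 1
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]           # incremental average
        total += r
    print(Q)                                # the estimates approach p_true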
Frozen Lake
There is a square, unchanging map of a frozen lake, with some cells containing holes. The task is to guide a gnome from the top-left corner of the map (the starting state) to the bottom-right corner (the target state) without falling into a hole. The gnome is somewhat tipsy and wobbles left and right. Therefore, with a probability of 1/3 he will move where the agent intends, and with a probability of 1/3 in each of the two perpendicular directions. If he tries to move off the map, the gnome remains in the same cell (or wobbles in one of the perpendicular directions).
Thus, there are 16 discrete states (cell numbers) and 4 discrete actions (0: LEFT, 1: DOWN, 2: RIGHT, 3: UP). Falling into a hole or reaching the target cell ends the episode. Reaching the target cell (terminal state) gives a reward of 1; otherwise, the reward is 0. The FrozenLake environment has a complete description o_t=s_t (the agent knows the number of the cell where the gnome is located), but it is highly stochastic and has very sparse rewards (only at the end). The optimal actions of the gnome are shown with arrows in the illustration; explaining them is suggested as a simple exercise.
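The environment is available in the Gym/Gymnasium collection; a short sketch of inspecting it and playing one random episode (is_slippery=True gives the stochastic dynamics described above).
    import gymnasium as gym

    env = gym.make("FrozenLake-v1", is_slippery=True)
    print(env.observation_space)                 # Discrete(16): the cell number
    print(env.action_space)                      # Discrete(4): LEFT, DOWN, RIGHT, UP

    obs, _ = env.reset()
    done = False
    while not done:
        obs, r, terminated, truncated, _ = env.step(env.action_space.sample())
        done = terminated or truncated
    print("final cell:", obs, "reward:", r)      # r = 1 only if the goal was reached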
Cart Pole
The environment is a pole attached by a hinge to a cart that moves frictionlessly along a track. In the initial state, the pole is vertical (with a slight random deviation) and tends to fall under the influence of gravity. The task is to prevent the pole from falling by pushing the cart to the right or left.
In this problem, the observation consists of four physical parameters: o_t=\{x,\,\,v,\,\phi,\,\omega\}, where x is the position of the cart, v is its velocity, \phi is the angle of deviation of the pole from vertical, \omega is the angular velocity of the pole. The agent's actions are binary: push the cart to the right (a=1) or to the left (a=0). After each action, the reward is r=1 as long as the angle |\phi| < 12^\circ and the cart is within the screen. Otherwise, the episode ends. Thus, the agent's task is to choose actions that keep the pole vertical for as long as possible.
The CartPole environment has a complete description, and the optimal action of the agent is fully determined by the current observation s_t=o_t. If the velocities are removed from the observation: o_t=\{x,\,\phi\}, the environment will have an incomplete description. For a quality solution in this case, it is necessary to consider two consecutive observations: s_t = \{o_t,\, o_{t-1}\}.
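A small sketch of this idea for the incomplete-description case: drop the velocities from the CartPole observation and build the state from two consecutive observations, s_t = \{o_t,\, o_{t-1}\} (the function names here are made up for the illustration).
    import numpy as np

    def restrict(obs):
        """Incomplete description: keep only position x and angle phi."""
        x, v, phi, omega = obs
        return np.array([x, phi])

    def stacked_state(o_t, o_prev):
        """State for the partially observable case: two consecutive observations."""
        return np.concatenate([restrict(o_t), restrict(o_prev)])

    o_prev = np.array([0.00, 0.10, 0.020, -0.1])   # dummy CartPole observations
    o_t    = np.array([0.01, 0.12, 0.015, -0.2])
    print(stacked_state(o_t, o_prev))              # x_t, phi_t, x_{t-1}, phi_{t-1}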
Mountain Car
In this environment, a car is situated between two hills of different heights. The observation consists of the car's position and velocity o_t=\{x,\,\,v\}. The agent can push the car to the right (a=2), do nothing (a=1) or push the car to the left (a=0). The goal is to reach the right (higher) hill. If the car reaches the left edge of the screen, it stops (inelastic collision). Each action results in a reward of r=-1, and the episode ends when the car reaches the target (flag in the illustration).
If the agent consistently chooses the action a=2 (towards the goal), in most cases, the car won't have enough "power" to climb the right hill. Therefore, the agent needs to swing the car back and forth to build up enough speed to reach the target. This must be done optimally to minimize the overall time (number of actions) spent solving the task.
This is a deterministic task with a complete description. It is noticeably more complex than the previous one. In Cart Pole, random actions can keep the pole upright for some time, but in this environment, random actions almost never lead to the top. Since the episode is limited to 200 time steps, the total reward remains unchanged (equal to -200) over a large region of parameter space. Therefore, there are no clear indicators of where the optimal average reward lies. Additionally, incorrect actions in Cart Pole have an immediate effect, whereas in Mountain Car their effect is "delayed" until the end of the episode.
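To see that "swinging" is indeed the key, here is a hand-crafted (not learned) policy sketch that always pushes in the direction of the current velocity (Gymnasium assumed); it typically reaches the flag well before the 200-step limit.
    import gymnasium as gym

    env = gym.make("MountainCar-v0")
    obs, _ = env.reset()
    done, total = False, 0.0
    while not done:
        x, v = obs
        action = 2 if v >= 0 else 0              # push in the direction of motion
        obs, r, terminated, truncated, _ = env.step(action)
        total += r
        done = terminated or truncated
    print("return:", total)                      # noticeably better than -200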

Pong
Pong is a classic game from the Atari collection in which you move a paddle up and down to hit a ball flying between the players. In the OpenAI Gym collection, the observation is a 210 x 160 pixel image. If the ball passes the opponent's paddle, the agent receives a positive reward; if it passes the agent's paddle, a negative one. For other events there are no rewards. Thus, non-zero rewards are very rare (most of the time they are zero).
In Atari Pong, the opponent follows a fairly simple paddle control algorithm (unknown to the agent). Therefore, this is, generally speaking, an environment with an incomplete description. Especially if the opponent is not the Atari algorithm, but another version of an agent.

Bipedal Walker
This is one of the tasks related to controlling robotic mechanical models. A bipedal robot needs to learn to move across slightly rough terrain. Each leg has two joints with motors controlled by the agent.
The agent's action consists of four real numbers (the torques applied to the joints of the robot). The observation consists of 24 real numbers. The first 14 describe the state of the hull and legs (angles, velocities, ground contact). Another 10 components of the observation vector are the readings of a lidar (a laser range scanner). Rewards are positive for moving the robot forward and equal to -100 if it falls. Active use of the motors slightly reduces the reward. The task is to advance as far as possible without falling within 1600 time steps.
Unlike the previous tasks, actions in this environment are continuous.
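For continuous actions, a stochastic policy is often modeled as a Gaussian whose mean is produced by a network; a sketch assuming PyTorch, with dimensions matching the description above (24 observations, 4 torques).
    import torch
    import torch.nn as nn

    class GaussianPolicy(nn.Module):
        """pi(a|s) = Normal(mu(s), sigma) for a vector of continuous actions."""
        def __init__(self, obs_dim=24, act_dim=4):
            super().__init__()
            self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                      nn.Linear(64, act_dim), nn.Tanh())  # torques in [-1, 1]
            self.log_std = nn.Parameter(torch.zeros(act_dim))             # learned spread

        def forward(self, obs):
            dist = torch.distributions.Normal(self.mean(obs), self.log_std.exp())
            return dist.sample()                                          # a ~ pi(a|s)

    policy = GaussianPolicy()
    action = policy(torch.randn(24))             # four real-valued torques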
Recommendation System
Imagine users on the internet are shown some information (a news article, a product, etc.). Along with it, brief descriptions of "related" information (other news articles or products) are also provided. The task is to display the most relevant related information for the given user.
The observation is a combination of the feature vector of the information and the features characterizing the user. The action is the selection (from a very large set) of a finite number of elements (related information). The reward is the fact of clicking on one of the displayed elements.
If the agent learns in real-world conditions, it is clear that the environment it interacts with has an incomplete description and is stochastic and non-stationary.
Q and V functions
Let's assume the agent's policy \pi(a|s) is fixed. It can be deterministic or probabilistic, but it is assumed that the agent always follows it when making decisions. Then, being in state s_t and taking action a_t (and then continuing to follow \pi), the agent will receive some cumulative reward in the future, which is typically discounted: R_{t+1} = r_{t+1} + \gamma\, r_{t+2} + \gamma^2\, r_{t+3} + ...~=~r_{t+1} + \gamma\,R_{t+2}. The discount factor \gamma \in [0..1] is a real hyperparameter of the learning process. If \gamma = 1, then R_{t+1} is simply the total future reward. When \gamma is less than 1, it implies that future rewards are considered less important to the agent than immediate ones. This reflects uncertainty about these future rewards (uncertainty of the future). Finally, the parameter \gamma technically allows working with divergent infinite sums. For instance, if all r_t=r=\mathrm{const}, due to the fact that 1+\gamma+\gamma^2+...=1/(1-\gamma), such a sum will be equal to R_{t+1}=r/(1-\gamma). Typically, \gamma ranges from 0.9 to 1.
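The recursion R_{t+1} = r_{t+1} + \gamma\,R_{t+2} translates directly into code; a small sketch that turns a list of episode rewards into discounted returns.
    def discounted_returns(rewards, gamma=0.99):
        """Returns R_t for every step, computed backwards via R_t = r_t + gamma * R_{t+1}."""
        R, returns = 0.0, []
        for r in reversed(rewards):
            R = r + gamma * R
            returns.append(R)
        return returns[::-1]

    print(discounted_returns([1, 1, 1], gamma=0.9))   # [2.71, 1.9, 1.0]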
We assume that the agent, when choosing the next action, aims to maximize the future cumulative reward. Since both the policy and the environment can be probabilistic, it is necessary to maximize the average value of this reward. If the environment is in a given state s and the agent takes action a, then this average value of R_{t+1} (denoted by angle brackets) is conditional (the conditions appear after the vertical bar): Q^\pi(s,\,a) = \langle R_{t+1} ~|~ s_t=s,\,a_t=a \rangle_\pi. The index \pi indicates that the averaging over different trajectories (sequences of states and actions) is done under the fixed agent policy \pi. If the optimal utility Q^*(s,a), taken over all possible policies, is known, then the choice of the optimal action in a given state is obvious: Q^*(s,a) = \max_\pi Q^\pi(s,a),~~~~~~~~~~~~~~~a^* = \arg\max_a Q^*(s,a), i.e., among all possible policies \pi, the one for which Q is maximal is chosen. The optimal action a^* is the one for which Q^* is maximal among all possible actions.
The function Q(s,a) is called the utility function of the current state and action, or the Q-function. It plays an important role in many reinforcement learning methods. Some of these methods, instead of searching for the optimal policy function, build the "optimal" utility function, and then use it to find the optimal policy.
If for a given policy \pi(a_1|s) > \pi(a_2|s) while Q^\pi(s,a_1) < Q^\pi(s,a_2), this means that there exists a better policy \pi' with \pi'(a_2|s) > \pi'(a_1|s). Dynamic programming (DP) is the iterative evaluation of a policy, i.e., computing Q^\pi(s,a), followed by improving the policy: a=\pi(s)=\arg\max_a Q(s,a), and then repeating the evaluation and improvement steps. If the environment is Markovian, DP converges to the optimal policy \pi^*.
Similarly, we can talk about the utility function of a state for a given policy \pi: V^\pi(s)=\langle R_{t+1} ~|~ s_t=s \rangle_\pi. Sometimes the advantage function is also introduced: A(s,a) = Q(s,a) - V(s), which shows how much better or worse a specific action a is in state s compared to choosing an action according to the policy \pi(a|s). Clearly, for the optimal policy and values A^*(s,a^*)=0, and for all other actions A^*(s,a)\le 0. With a non-optimal policy, the sign of A(s,a) can be arbitrary.

Let's consider a simple example to illustrate the given definitions. Suppose we have a deterministic environment with seven states, four of which are terminal (gray circles). In each state, one of two actions a\in\{0,1\} (up or down) can be taken.
First, let's consider a policy where the agent chooses either action with equal probability in each state. In this environment, there are four equally probable trajectories starting from state s_0. Therefore, the value of state s_0 is (\gamma=1): V^\pi(s_0)=\frac{1}{4}\,\bigl[(4+0)+(4+4)+(2+16)+(2-4)\bigr]= \color{red}{\bf 7}. Similarly, the action values in state s_0 are computed as the next reward plus the average over the subsequent trajectories: Q^\pi(s_0, 0)=4 + \frac{0+4}{2}=\color{red}{\bf 6},~~~~~~~~~~~~~~~ Q^\pi(s_0, 1)=2 +\frac{16-4}{2}=\color{red}{\bf 8}. Thus, the advantage function A(s,a) for this policy takes the values \pm 1 and the action a=1 is more "profitable."
Now, let's take the optimal policy, which is obviously deterministic: \pi^*(s_0) = 1, ~~~~~~~\pi^*(s_1) = 1,~~~~~~~\pi^*(s_2) = 0. For it, the state values for s_0 are: V^*(s_0)=Q^*(s_0,1)= \color{red}{\bf 18},~~~~~~~~Q^*(s_0,0)=\color{red}{\bf 8}. In this case, V^*(s)=\max_a Q^*(s,a) and A^*(s_0,0)=-10, A^*(s_0,1)=0. As an exercise, consider the worst policy and compute its V and Q.
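To check these numbers, the example can be encoded directly; the concrete tree below is reconstructed from the four trajectories listed above (\gamma=1), so its structure is an assumption made for the illustration.
    # rewards and transitions reconstructed from the four trajectories above
    tree = {
        's0': {0: (4, 's1'), 1: (2, 's2')},
        's1': {0: (0, None), 1: (4, None)},      # None marks a terminal state
        's2': {0: (16, None), 1: (-4, None)},
    }

    def Q(s, a, policy, gamma=1.0):
        r, s_next = tree[s][a]
        return r if s_next is None else r + gamma * V(s_next, policy, gamma)

    def V(s, policy, gamma=1.0):
        return sum(p * Q(s, a, policy, gamma) for a, p in policy[s].items())

    random_pi  = {s: {0: 0.5, 1: 0.5} for s in tree}
    optimal_pi = {'s0': {1: 1.0}, 's1': {1: 1.0}, 's2': {0: 1.0}}

    print(V('s0', random_pi))                                  # 7.0
    print(Q('s0', 0, random_pi), Q('s0', 1, random_pi))        # 6.0  8.0
    print(V('s0', optimal_pi), Q('s0', 0, optimal_pi))         # 18.0  8.0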
A bit of math
Let's formalize the definitions given above. These formulas will become clearer when considered in specific cases. Let s_t be the state of the environment at time t. For environments with a complete description s_t=o_t. In general, the state is some function of the current observation and all previous observations and actions s_t=\{o_0,a_0,...,o_{t-1},a_{t-1},o_{t}\}. We define:
- Policy function: \pi(a|s) - the probabilities of taking action a in state s.
- Environment model: P(s',r'|s,a) - the probabilities that after taking action a, the environment transitions from state s to state s' and gives reward r'.
- Initial state probability: p(s_0). In most RL tasks, the initial state s_0 is not fixed but is chosen with equal probability from a small region of the state space.
- Episode trajectory: \tau_0 = \{s_0,a_0,\,s_1,a_1,r_1,\,s_2,a_2,r_2,...\} - a sequence of states (observations), actions taken by the agent, and rewards received.
The utility of a state and action (s,a) for a given policy \pi is the average discounted reward the agent will receive in the future when taking action a in state s: Q^\pi(s,a) = \langle R_{t+1} | s_t=s,\,a_t=a \rangle_\pi. The angle brackets denote averaging over all subsequent states, actions, and rewards. Explicitly, this conditional average is: Q^\pi(s_t,a_t) = \sum_{s_{t+1},\,a_{t+1},\,r_{t+1},...} \bigl[ P(s_{t+1},r_{t+1}|s_t,a_t)\cdot\pi(a_{t+1}|s_{t+1})\,P(s_{t+2},r_{t+2}|s_{t+1},a_{t+1})\cdot...\bigr]\cdot R_{t+1}. The utility of a state s for a given policy \pi is the average discounted reward the agent will receive starting from state s (the index \pi is omitted hereafter): V(s) = \sum_a \pi(a|s)\, Q(s,a). The utility functions of the state and of the state-action pair satisfy the Bellman equations: Q(s,a) = \sum_{s',\,r'} P(s',r'\,|\,s, a)\,\bigl[ r' + \gamma\,V(s')\bigr]. These equations follow from the relation R_t=r_t+\gamma\,R_{t+1}. Thus, the utility of a state V(s) equals the reward received upon transition to the next state s' plus the discounted utility of that state (future rewards from it), averaged over all possible actions, states s', and rewards r': V(s) = \sum_{a,\,s',\,r'} \pi(a|s)\,P(s',r'|s,a)\, \bigl[ r' + \gamma\,V(s') \bigr]. These equations are discussed in more detail in the following document.
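A sketch of iterative policy evaluation that repeatedly applies the Bellman equation for V(s) to a tabular MDP; the dictionary formats of P and \pi used below are assumptions made for the illustration.
    import numpy as np

    def policy_evaluation(P, pi, n_states, gamma=0.99, tol=1e-8):
        """P[(s, a)] = list of (prob, s_next, reward, terminal); pi[s] = {a: probability}."""
        V = np.zeros(n_states)
        while True:
            delta = 0.0
            for s in range(n_states):
                # V(s) = sum_a pi(a|s) sum_{s',r'} P(s',r'|s,a) [r' + gamma V(s')]
                v = sum(pi[s][a] * sum(p * (r + gamma * V[sn] * (not term))
                                       for p, sn, r, term in P[(s, a)])
                        for a in pi[s])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                return V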
Classification of ML methods
Many different algorithms have been developed in reinforcement learning. Unlike standard supervised learning, these methods are very sensitive to hyperparameters and take a long time to train. Furthermore, there are no universal methods. Depending on the task, a particular method may be ineffective or not work at all. Some methods work only with discrete actions, while others only with continuous ones.
✒ When comparing methods by efficiency, pay attention to three main criteria:
- Speed is characterized by the number of episodes needed to reach certain target values of the average total reward (over the last N episodes). An algorithm that is "fast" in this sense may still take a long wall-clock time if its training cycles are computationally expensive or run frequently.
- Optimality refers to how high an average total reward the method can ultimately achieve, regardless of speed.
- Stability means the conditional monotonicity of convergence to the target values. Unstable methods may collapse to suboptimal values before reaching the maximum. However, they sometimes achieve good results midway through training, which is why intermediate training results are usually saved.
✒ According to the use of data for training, there are:
- Online methods: new data is received and immediately used for training
- Offline methods: data is accumulated and then used for training
✒ Based on the policy used to evaluate quality, methods are distinguished as:
- On-policy - actions are chosen using the current policy
- Off-policy - actions are chosen using a different policy (old or \epsilon-greedy)
✒ Based on the use of models, policies, and value functions in the method, methods are distinguished as:
- those that model the environment (model-based) or not (model-free)
- those that directly compute:
- the policy
- state value
- both the policy and value (actor-critic)
Some idea of the diversity of methods can be gained from the following diagram from OpenAI.
Additional sources
- Sutton R.S., Barto A.G. "Reinforcement Learning: An Introduction"
- Deep Reinforcement Learning - Julien Vitay (and pdf)
- Introduction to Reinforcement Learning - Markel Sanz Ausin
- Simple Reinforcement Learning with Tensorflow
- UCL Course on RL
