$$ \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bt}{\boldsymbol{\theta}} \newcommand{\dkl}{\mathrm{d}_{\mathrm{KL}}} \newcommand{\dtv}{\mathrm{d}_{\mathrm{TV}}} \newcommand{\emv}{\hat{\theta}_{\mathrm{emv}}} \newcommand{\ent}{\mathrm{Ent}} $$

3  Course reminders

The following recalls the main theoretical results on Reinforcement Learning for stochastic control problems seen during the course.

3.1 Some Foundations of Reinforcement Learning

3.1.1 Basics of Reinforcement Learning

Definition 3.1 (Markov Decision Processes.) A Markov Decision Process is a quadruple \((\mathcal{X},\mathcal{A},P,r=(f,g))\) such that :

  • \(\mathcal{X}\) denotes the state space in which the discrete-time state process \((X_t)_{t \in \mathbb{N}}\) takes its values
  • \(\mathcal{A}\) denotes the action space in which the control \((\alpha_t)_{t \in \mathbb{N}}\) takes its values
  • State dynamics given by : \[X_{t+1} \sim P_t(X_t,\alpha_t)\] with a transition probability given by a map \((t,x,a) \in \mathbb{N} \times \mathcal{X} \times \mathcal{A} \mapsto P_t(x,a,dx') \in \mathcal{P}(\mathcal{X})\).
  • Reward given by a pair \((f,g)\) such that :
    • \(f(x,a)\) is a running reward obtained in state \(x\) when choosing the action \(a\)
    • Terminal reward \(g(x)\)
    • Discount factor \(\beta \in [0,1]\) in the case of an infinite horizon.

Definition 3.2 (Policy Definition.) A policy \(\pi=(\pi_t)_{t \in \mathbb{N}}\) is a sequence of decision rules chosen in a Markovian way with respect to the state variable. A policy \(\pi\) can be either :

  • deterministic when \(\pi_t : \mathcal{X} \to \mathcal{A}\)
  • randomized when \(\pi_t : \mathcal{X} \to \mathcal{P}(\mathcal{A})\), meaning that \(\pi_t(x)\) is a probability distribution over the actions chosen at time \(t\) in state \(x\).

We will say that a control \(\alpha = (\alpha_t)_{t \in \mathbb{N}}\) is drawn from a policy \(\pi\) if for each \(t \in \mathbb{N}\), we have :

  • \(\alpha_t =\pi_t(X_t)\) in the case of deterministic policies.
  • \(\alpha_t \sim \pi_t(X_t)\) in the case of randomized policies.

The goal of Reinforcement Learning is to learn a policy \(\pi=(\pi_t)_{t \in \mathbb{N}}\) that maximises the expected sum of rewards, which is formalised below through the value function.

Definition 3.3 (Value function.) Given a policy \(\pi=(\pi_t)_{t \in \mathbb{N}}\) and a horizon \(T \in \mathbb{N}\), we define :

  • The state value function as : \[\begin{align} V_t^{\pi}(x) &= \mathbb{E}_{\pi} \Big[\sum_{s=t}^{T-1} f(X_s,\alpha_s) + g(X_T) | X_t = x \Big], \quad x \in \mathcal{X}. \end{align} \] where \(\mathbb{E}_{\pi}\) denotes the expectation when \(\alpha \sim \pi\).

  • The Q value function of \(\pi\), which is defined as : \[\begin{align} Q_t^{\pi}(x,a) = \mathbb{E}_{\pi} \Big[\sum_{s=t}^{T-1} f(X_s,\alpha_s) + g(X_T) | X_t = x,\alpha_t = a \Big], \quad x \in \mathcal{X}, a \in \mathcal{A}. \end{align} \]

  • We note that \[ V_t^{\pi}(x) = \mathbb{E}_{a \sim \pi_t(x)} \big[Q_t^{\pi}(x,a) \big]. \]

The goal is therefore to find a policy \(\pi^{\star}\) such that, for every \(t \in \mathbb{N}\), \[ \begin{align} V_t^{\pi^{\star}}(x) = \underset{\pi \in \Pi}{\text{ sup }} V_t^{\pi}(x), \quad Q_t^{\pi^{\star}}(x,a) = \underset{\pi \in \Pi}{\text{ sup }} Q_t^{\pi}(x,a). \end{align} \]

Theorem 3.1 (Bellman equation for value functions.)  

  • For a fixed policy \(\pi=(\pi_t)_{t \in \mathbb{N}}\), we have

\[ \begin{align} \begin{cases} V_t^{\pi}(x) &= \mathbb{E}_{a \sim \pi_t(x)} \Big[ f(x,a) + \mathbb{E}_{x' \sim P_t(x,a)} \big[ V_{t+1}^{\pi}(x') \big] \Big], \quad t =0, \ldots, T-1 \\ V_T^{\pi}(x) &= g(x), \end{cases} \end{align} \tag{3.1}\] and \[ \begin{align} \begin{cases} Q_t^{\pi}(x,a) &= f(x,a) + \mathbb{E}_{x' \sim P_t(x,a), a' \sim \pi_{t+1}(x')} \big[ Q_{t+1}^{\pi}(x',a') \big] , \quad t =0, \ldots, T-1 \\ Q_T^{\pi}(x,a) &= g(x). \end{cases} \end{align} \tag{3.2}\]

  • For the optimal value function and \(Q\)-function, we have

\[ \begin{align} \begin{cases} V_t(x) &= \underset{a \in A}{\text{ sup }} \big \lbrace f(x,a) + \mathbb{E}_{x' \sim P_t(x,a)} \big[V_{t+1}(x') \big] \big \rbrace, \quad t=0,\ldots, T-1 \\ V_T(x) &= g(x) \end{cases} \end{align} \] and \[ \begin{align} \begin{cases} Q_t(x,a) &= f(x,a) + \mathbb{E}_{x' \sim P_t(x,a)} \big[ \underset{a' \in A}{\text{ sup }} Q_{t+1}(x',a') \big] , \quad t =0, \ldots, T-1 \\ Q_T(x,a) &= g(x). \end{cases} \end{align} \] Then the optimal deterministic policy \(\pi^{\star} = (\pi^{\star}_t)_{t \in \mathbb{N}}\) is given from the \(Q\)-value function as \[ \begin{align} \pi_t^{\star}(x) = \underset{a \in A}{\text{ arg max }} Q_t(x,a) \end{align} \tag{3.3}\]

Proof. Let \(\pi=(\pi_t)_{t \in \mathbb{N}}\) be a fixed policy. Then, by the law of iterated conditional expectations, we have

\[ \begin{align} \begin{cases} V_t^{\pi}(x) &= \mathbb{E}_{a \sim \pi_t(x)} \Big[ f(x,a) + \mathbb{E}_{x' \sim P_t(x,a)} \big[ V_{t+1}^{\pi}(x') \big] \Big], \\ Q_t^{\pi}(x,a)&= f(x,a) + \mathbb{E}_{ x' \sim P_t(x,a)} \Big[ V_{t+1}^{\pi}(x') \Big]. \end{cases} \end{align} \] Recalling that \(V_{t+1}^{\pi}(x) = \mathbb{E}_{a \sim \pi_{t+1}(x)} \big[ Q_{t+1}^{\pi}(x,a) \big]\), we recover Equation 3.1 and Equation 3.2.

Moreover, by definition of the optimal \(Q\)-function and optimal value function, we have

\[ \begin{align} Q_t(x,a) = \underset{\pi}{\text{ sup } } Q_t^{\pi}(x,a) &= \underset{ \pi }{\text{ sup }} \big \lbrace f(x,a) + \mathbb{E}_{x' \sim P_t(x,a)} \big[V_{t+1}^{\pi}(x') \big] \big \rbrace \\ &= f(x,a) + \mathbb{E}_{x' \sim P_t(x,a)} \big[V_{t+1}(x') \big]. \end{align} \] The optimality of the policy \(\pi^{\star}\) in Equation 3.3 is obtained by noticing that \(V_t(x) \leq \underset{a \in A}{\text{ sup }} Q_t(x,a)\).
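
To illustrate the backward induction of Theorem 3.1, here is a minimal tabular sketch in Python, assuming finite state and action spaces encoded by hypothetical arrays `P`, `f` and `g`; it computes the optimal \(Q\)-functions and the greedy policy of Equation 3.3.

```python
import numpy as np

def backward_induction(P, f, g, T):
    """Dynamic programming for a finite-horizon MDP with finite spaces.

    P : array (T, n_states, n_actions, n_states), P[t, x, a, x'] = P_t(x, a, {x'})
    f : array (n_states, n_actions), running reward f(x, a)
    g : array (n_states,), terminal reward g(x)
    Returns the optimal Q-tables and the greedy policy of Equation 3.3.
    """
    n_states, n_actions = f.shape
    Q = np.zeros((T, n_states, n_actions))
    pi = np.zeros((T, n_states), dtype=int)
    V_next = g.copy()                      # V_T = g
    for t in reversed(range(T)):
        # Q_t(x, a) = f(x, a) + E_{x' ~ P_t(x, a)}[V_{t+1}(x')]
        Q[t] = f + P[t] @ V_next
        pi[t] = Q[t].argmax(axis=1)        # pi*_t(x) = arg max_a Q_t(x, a)
        V_next = Q[t].max(axis=1)          # V_t(x) = sup_a Q_t(x, a)
    return Q, pi
```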

3.1.2 Value-based methods

In the case of value-based methods, the goal is to learn a representation of the value functions \(V^{\pi^*}\) and \(Q^{\pi^*}\) and then to derive the optimal policy from them.

3.1.2.1 Q-learning and Sarsa Algorithms

The Q-learning algorithm exploits the Bellman equation for the Q-function given by : \[ \begin{align} Q_t(x,a) = f(x,a) + \mathbb{E}_{x' \sim P_t(x,a)} \Big[ \underset{a' \in A}{\text{ sup }} Q_{t+1}(x',a' ) \Big] \end{align} \notag. \]
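
As an illustration, here is a minimal sketch of one episode of tabular Q-learning, assuming a hypothetical finite-horizon environment exposed through `env.reset()` and `env.step(t, x, a)` methods (names chosen here for illustration) and a \(Q\)-table initialised with \(Q[T, x, \cdot] = g(x)\); the update replaces the expectation in the Bellman equation by a single sampled transition.

```python
import numpy as np

def q_learning_episode(Q, env, T, epsilon=0.1, lr=0.1, rng=None):
    """One episode of tabular Q-learning (off-policy).

    Q   : array (T + 1, n_states, n_actions), with Q[T, x, :] = g(x) kept fixed.
    env : hypothetical environment with reset() -> x0 and step(t, x, a) -> (x', reward).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_actions = Q.shape[2]
    x = env.reset()
    for t in range(T):
        # epsilon-greedy exploration around the current Q-table
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[t, x].argmax())
        x_next, reward = env.step(t, x, a)
        # target f(x, a) + sup_{a'} Q_{t+1}(x', a'), estimated from one transition
        target = reward + Q[t + 1, x_next].max()
        Q[t, x, a] += lr * (target - Q[t, x, a])
        x = x_next
    return Q
```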

The Sarsa algorithm is an on-policy learning method that proceeds by iteration. For every episode \(k\), it computes an approximation of \(Q^{\pi^k}\) via the Bellman equation Equation 3.2:

\[ \begin{align} Q_t^{\pi^k}(x,a) = f(x,a) + \mathbb{E}_{x' \sim P_t(x,a), a' \sim \pi_{t+1}^k(x')} \Big[ Q_{t+1}^{\pi^k}(x',a' ) \Big] \end{align} \notag. \] Then, by policy improvement, we can compute the greedy policy \(\pi_t^{k+1}(x) = \underset{a \in A}{\text{ arg max }} Q_t^{\pi^k}(x,a)\) for any \(t=0,\ldots, T-1\), \(x \in \mathcal{X}\), which implies \(V_t^{\pi^k} \leq V_t^{\pi^{k+1}}\) for any \(t\).

The detailed Q-learning and Sarsa procedures can be found in the notes of Lecture 1.
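
As a complement to those notes, here is a sketch of the Sarsa counterpart under the same hypothetical interface as above; the only difference with Q-learning is the on-policy target, which uses the next action actually sampled from the current \(\varepsilon\)-greedy policy instead of the maximum over \(a'\).

```python
import numpy as np

def sarsa_episode(Q, env, T, epsilon=0.1, lr=0.1, rng=None):
    """One episode of tabular Sarsa (on-policy), same conventions as q_learning_episode."""
    rng = np.random.default_rng() if rng is None else rng
    n_actions = Q.shape[2]

    def eps_greedy(t, x):
        # sample an action from the current epsilon-greedy policy pi^k
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(Q[t, x].argmax())

    x = env.reset()
    a = eps_greedy(0, x)
    for t in range(T):
        x_next, reward = env.step(t, x, a)
        a_next = eps_greedy(t + 1, x_next)   # a' ~ pi_{t+1}^k(x'); recall Q[T, x, :] = g(x)
        # on-policy target f(x, a) + Q_{t+1}(x', a')
        Q[t, x, a] += lr * (reward + Q[t + 1, x_next, a_next] - Q[t, x, a])
        x, a = x_next, a_next
    return Q
```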

3.1.3 Policy-based methods

This subsection will be updated later. We will cover policy gradient methods and Actor-Critic algorithms.

3.2 Reinforcement Learning in Continuous Time

We present in the following the main concepts of Reinforcement Learning in continuous time. This can be motivated by the following applications.

  • Applicability to real-world problems
    • High frequency trading
    • Engineering systems
  • Scalability, flexibility and efficiency
    • Avoid discretization errors
  • Mathematical tools
    • Tools from stochastic calculus, control and PDEs
    • Understanding of RL algorithms and interpretability

3.2.1 The control problem

Definition 3.4 (Randomized policies in continuous time.) We now introduce the concept of randomized policies in continuous time, viewed as the natural extension of the MDP setting.

  • \(\pi : [0,T] \times \mathbb{R}^d \to \mathcal{P}(A)\) with density \(A \ni a \mapsto \pi(t,x,a)\) with respect to a reference measure \(\nu\) on \(A\), and we will, by abuse of notation, write \[ \begin{align} \pi(t,x,da) = \pi(t,x,a) \nu(d a). \end{align} \]

  • A control \(\alpha = (\alpha_t)_{t \in [0,T]}\) is said to be sampled from \(\pi=(\pi_t)_{t \in [0,T]}\) if at each time \(t \in [0,T]\), we have \(\alpha_t \sim \pi(t,X_t,\cdot)\). We will denote \(\alpha \sim \pi\).

There is a technical issue here to ensure that we can in fact sample \(\alpha\) from \(\pi\), using an appropriate randomization of the controls through an uncountable family of i.i.d. uniform random variables \((U_t)_{t \in [0,T]}\). This kind of randomization can be carried out under a suitable enlargement of the probability space \((\Omega,\mathcal{F},\mathbb{P})\) by using a Fubini extension, which ensures the progressive measurability of \(\alpha\) with respect to the filtration given by \[\tilde{\mathbb{F}} = (\tilde{\mathcal{F}_t})_{t \in [0,T]}, \quad \tilde{\mathcal{F}_t} = \sigma((W_s)_{s \leq t}) \vee \sigma(U_s, s \leq t).\]

  • Given a policy \(\pi=(\pi_t)_{0 \leq t \leq T}\), the state dynamics \(X=(X^{\pi}_t)_{0 \leq t \leq T}\) is given by

\[ \begin{align} dX_t = b(X_t,\alpha_t) dt + \sigma(X_t,\alpha_t) dW_t, \quad \text{with } \alpha_t \sim \pi(t,X_t,\cdot). \end{align} \]
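
For simulation purposes, a sampled control can be plugged into a standard Euler-Maruyama scheme. The sketch below is for a one-dimensional state and assumes user-supplied callables `b`, `sigma` and `sample_action` (the latter drawing \(\alpha_t \sim \pi(t,x,\cdot)\)); these names are illustrative only.

```python
import numpy as np

def simulate_state(x0, b, sigma, sample_action, T=1.0, n_steps=200, rng=None):
    """Euler-Maruyama simulation of dX_t = b(X_t, alpha_t) dt + sigma(X_t, alpha_t) dW_t,
    where a new action alpha_t ~ pi(t, X_t, .) is sampled at every time step."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    xs = np.empty(n_steps + 1)
    xs[0] = x0
    for k in range(n_steps):
        t = k * dt
        a = sample_action(t, xs[k])                 # alpha_t ~ pi(t, X_t, .)
        dW = rng.normal(scale=np.sqrt(dt))          # Brownian increment over [t, t + dt]
        xs[k + 1] = xs[k] + b(xs[k], a) * dt + sigma(xs[k], a) * dW
    return xs
```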

Now, we can define the value function in the context of reinforcement learning in continuous time.

Definition 3.5 (Value function in continuous time.) Given a policy \(\pi=(\pi_t)_{t \in [0,T]}\), we consider the performance value function defined as

\[ \begin{align} V_t^{\pi}(x) &= \mathbb{E}_{\pi} \bigg[ \int_{t}^{T} \big( f(X_s,\alpha_s) + \lambda \mathcal{E}(\pi(s,X_s,\cdot)) \big) d s + g(X_T) | X_t = x \bigg] \notag \\ &= \mathbb{E}_{\pi} \bigg[ \int_{t}^{T} \big( f(X_s,\alpha_s) - \lambda \text{ log } \pi(s,X_s,\alpha_s) \big) d s + g(X_T) | X_t = x \bigg] \notag \end{align} \] where \(\mathbb{E}_{\pi}\) denotes the expectation when \(\alpha \sim \pi\) and where we have added the Shannon entropy \(\mathcal{E} : \mathcal{P}(A) \to \mathbb{R}\) with temperature \(\lambda > 0\),

\[ \begin{align} \mathcal{E}(\pi) = - \int_{A} \text{ log } \pi(a) \pi(d a)= - \mathbb{E}_{a \sim \pi} [ \text{ log } \pi(a) ], \end{align} \] for \(\pi \in \mathcal{P}(A)\) with density given by \(a \mapsto \pi(a)\).
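
For instance, when \(A = \mathbb{R}\), \(\nu\) is the Lebesgue measure and \(\pi\) is the Gaussian density \(\mathcal{N}(\mu,\sigma^2)\), a direct computation gives \[ \begin{align} \mathcal{E}(\pi) = \frac{1}{2} \text{ log } (2 \pi e \sigma^2), \end{align} \] so the entropy bonus favours policies with larger variance, i.e., more exploration.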

Denoting by \(\Pi\) the set of admissible policies (here, the set of measures that are absolutely continuous with respect to the Lebesgue measure), the optimal value function is defined as

\[ \begin{align} V_t(x) = \underset{\pi \in \Pi}{\sup } V_t^{\pi}(x), \quad (t,x) \in [0,T] \times \mathbb{R}^d. \end{align} \tag{3.4}\]

Theorem 3.2 (Hamiltonian and Bellman equations.)  

  • For any stochastic policy \(\pi\), the value function \(V^{\pi}\) satisfies the following PDE

\[ \begin{align} \begin{cases} \frac{\partial V^{\pi}}{\partial t} (t,x) + \int_{A} \big[ H(x,a, \nabla_x V^{\pi}(t,x), D_x^2 V^{\pi}(t,x)) - \lambda \text{log}(\pi(t,x,a)) \big] \pi(t,x,a)\nu(da) &= 0,\notag \\ V^{\pi}(T,x) &= g(x), \end{cases} \end{align} \tag{3.5}\] where the map \(H\) is called the Hamiltonian function and is defined as \[ \begin{align} H(x,a,p,M) = b(x,a) \cdot p + \frac{1}{2} \text{tr}(\sigma \sigma^{\top}(x,a) M) + f(x,a), \notag \end{align} \] for \(x \in \mathbb{R}^d\), \(a \in A\), \(p \in \mathbb{R}^d\), and \(M \in \mathbb{R}^{d \times d}\).

  • The Bellman equation for the optimal value function \(V\) is given by \[ \begin{align} \begin{cases} \frac{\partial V}{\partial t} (t,x) + \underset{\pi \in \mathcal{P}(A)}{\text{ sup }} \int_{A} \big[ H(x,a, \nabla_x V(t,x), D_x^2 V(t,x)) - \lambda \text{log}(\pi(t,x,a)) \big] \pi(t,x,a)\nu(da) &= 0,\notag \\ V(T,x) &= g(x), \end{cases} \end{align} \tag{3.6}\]

Proof. The proof of these results follows from the Bellman relation for \(V^{\pi}\) and \(V\), using the law of iterated conditional expectations and applying Itô’s formula to the process \(\big(V^{\pi}(s,X_s) \big)_{t \leq s \leq T}\) between \(t\) and \(t+h\). The proof follows standard computations, so we omit it.
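
As a consequence of the Gibbs form derived in Theorem 3.3 below (a standard entropy-duality computation), the supremum in Equation 3.6 can be written in closed form, \[ \begin{align} \underset{\pi \in \mathcal{P}(A)}{\text{ sup }} \int_{A} \big[ H(x,a, \nabla_x V(t,x), D_x^2 V(t,x)) - \lambda \text{ log } \pi(t,x,a) \big] \pi(t,x,a) \nu(da) = \lambda \text{ log } \int_{A} \text{ exp } \Big( \frac{1}{\lambda} H(x,a, \nabla_x V(t,x), D_x^2 V(t,x)) \Big) \nu(da), \end{align} \] so that the Bellman equation Equation 3.6 takes a log-sum-exp (soft-max) form.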

Theorem 3.3 (Optimal randomized policy.) The optimal randomized policy for the control problem Equation 3.4 is given by

\[ \begin{align} \pi^{\star}(t,x,a) = \frac{ \text{ exp } \big(\frac{1}{\lambda} H(x,a,D_x V(t,x), D^2_x V(t,x)) \big)}{\int_{A} \text{ exp } \big(\frac{1}{\lambda} H(x,a',D_x V(t,x), D^2_x V(t,x)) \big) \nu(d a') }, \quad (t,x,a) \in [0,T] \times \mathbb{R}^d \times \mathbb{R}^m. \end{align} \tag{3.7}\]

The right-hand side of Equation 3.7 is called a Gibbs measure and can be viewed as the continuous-time analogue of the soft-max policy.

Proof. For a fixed \((t,x) \in [0,T] \times \mathbb{R}^d\) and from Equation 3.6, we define \[ \begin{align} A \ni a \mapsto F_x(a) = H(x,a, D_x V(t,x), D^2_x V(t,x)), \end{align} \] where, for simplicity, we omit the dependence of \(\pi\) on \((t,x)\) and write \(\pi(a)\) for \(\pi(t,x,a)\). Then, we aim to solve the optimization problem

\[ \begin{align} \underset{\pi \in \mathcal{P}(A)}{\text{ max }} \int_{A} \big( F_x(a) - \lambda \text{ log } \pi(a) \big) \pi(a) \nu(d a) \end{align} \] with the constraints \(\int_{A} \pi(a) \nu(d a) = 1\) and \(\pi(a) \geq 0\) \(\nu-\text{a.e.}\) Introducing the Lagrange multiplier \(\eta\) associated to the first constraint, we define the map \[ \begin{align} \mathcal{L}(\pi,\eta) := \int_{A} \big( F_x(a) - \lambda \text{ log } \pi(a) \big) \pi(a) \nu(d a) + \eta \Big( \int_{A} \pi(a) \nu(da) - 1 \Big). \end{align} \] Writing the first-order stationarity condition of \(\mathcal{L}\) with respect to \(\pi\), we have \(\nu(da)\text{-a.e.}\) \[ \begin{align} F_x(a) - \lambda \big( \text{ log } \pi(a) +1 \big) + \eta = 0, \end{align} \] and therefore \(\pi(a) = C \, \text{exp} \big(\frac{F_x(a)}{\lambda} \big)\) for a positive constant \(C\). Using the normalization condition for \(\pi\), we get the result.
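
Numerically, the Gibbs density of Equation 3.7 can be evaluated on a discretized action grid once the Hamiltonian has been computed at the current \((t,x)\). The sketch below is a minimal illustration with hypothetical inputs `H_values` (the Hamiltonian on the grid) and `weights` (quadrature weights for \(\nu\)); the usual max-shift is applied for numerical stability.

```python
import numpy as np

def gibbs_policy(H_values, lam, weights):
    """Discretized Gibbs (soft-max) density of Equation 3.7.

    H_values : array of H(x, a_i, D_x V(t, x), D_x^2 V(t, x)) over an action grid (a_i)
    lam      : temperature lambda > 0
    weights  : quadrature weights approximating nu(da_i)
    Returns the density values pi*(t, x, a_i) with respect to nu.
    """
    logits = H_values / lam
    logits = logits - logits.max()             # shift for numerical stability
    unnormalized = np.exp(logits)
    return unnormalized / np.sum(unnormalized * weights)
```

As \(\lambda \downarrow 0\), this density concentrates on the maximizers of the Hamiltonian (greedy policy), while for large \(\lambda\) it flattens towards the uniform distribution with respect to \(\nu\) (when \(\nu(A)\) is finite).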

The following result is the continuous-time version of policy iteration for MDPs, based on the \(Q\)-value function.

Theorem 3.4 (Policy Improvement.) Define the following Gibbs operator as \[ \begin{align} \Pi \ni \pi \mapsto \mathcal{G}(\pi)(t,x,a) = \pi'(t,x,a) := \frac{ \text{ exp } \big(\frac{1}{\lambda} H(x,a,D_x V^{\pi}(t,x), D^2_x V^{\pi}(t,x)) \big)}{\int_{A} \text{ exp } \big(\frac{1}{\lambda} H(x,a',D_x V^{\pi}(t,x), D^2_x V^{\pi}(t,x)) \big) \nu(d a') }, \quad (t,x,a) \in [0,T] \times \mathbb{R}^d \times \mathbb{R}^m. \end{align} \tag{3.8}\] Then, \(V^{\pi'} \geq V^{\pi}\). Moreover, if \(\mathcal{G}\) admits a fixed point \(\pi^{\star}\), then \(\pi^{\star}\) is an optimal policy.

Proof. The proof of Theorem 3.4 can be found in the lecture notes and takes advantage of the Gibbs measure representation of the optimal randomized policy.

3.2.2 The linear quadratic case

Definition 3.6 (Linear quadratic in continuous time.) A specific subclass of control problems is the class of linear quadratic control problems where we set :

  • Linear dynamics for state :

\[ \begin{align} b(x,a) = Bx + Ca, \quad \sigma(x,a) = Dx + Fa, \end{align} \] with \(x \in \mathbb{R}^d\), \(a \in \mathbb{R}^m\), \(B, D \in \mathbb{R}^{d \times d}\) and \(C, F \in \mathbb{R}^{d \times m}\).

  • Quadratic cost functional :

\[ \begin{align} f(x,a) = x^{\top} Qx + a^{\top} N a, \quad g(x)= x^{\top} P x, \end{align} \] with \(Q,P \in \mathbb{S}^d_{+}\) and \(N \in \mathbb{S}^m_{+}\).

One can show, after making a quadratic ansatz, that the optimal value function is of the form

\[ \begin{align} V(t,x) = x^{\top} K(t) x + \chi(t), \end{align} \] for some deterministic functions \(K : [0,T] \rightarrow \mathbb{S}_{+}^d\) and \(\chi : [0,T] \rightarrow \mathbb{R}\), solutions to a Riccati equation and a linear ODE with terminal conditions \(K(T) = P\) and \(\chi(T) = 0\).

Moreover, one can show that the optimal policy \(\pi^{\star}\) is Gaussian, with a variance governed by the temperature \(\lambda\), which therefore measures the degree of exploration. We refer to the course slides for the associated Riccati system.
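
Since the exact entropy-regularized Riccati system is given in the slides, we only sketch here how such a terminal-value ODE can be integrated backward in time, with a hypothetical callable `riccati_rhs` standing for the right-hand side of the Riccati equation to be filled in from the course material.

```python
import numpy as np

def solve_riccati_backward(riccati_rhs, P, T=1.0, n_steps=1000):
    """Explicit Euler integration, backward in time, of dK/dt = riccati_rhs(t, K)
    from the terminal condition K(T) = P down to t = 0.

    riccati_rhs : callable (t, K) -> dK/dt (matrix-valued), taken from the course slides
    P           : terminal condition, symmetric matrix such that V(T, x) = x^T P x
    Returns the time grid and the matrices K(t_k) on this grid.
    """
    dt = T / n_steps
    ts = np.linspace(0.0, T, n_steps + 1)
    Ks = np.empty((n_steps + 1,) + P.shape)
    Ks[-1] = P
    for k in range(n_steps, 0, -1):
        # one Euler step backward: K(t - dt) ~ K(t) - dt * dK/dt(t, K(t))
        Ks[k - 1] = Ks[k] - dt * riccati_rhs(ts[k], Ks[k])
    return ts, Ks
```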