Reinforcement Learning
Bandit Algorithm
Multi-Arm Bandit
For each time step $t$:
- $\epsilon$-Greedy Algorithm:
    - If $rand() < \epsilon$, randomly select an arm $k$ from $1$ to $K$ (explore)
    - If $rand() \ge \epsilon$, select $k = \underset{a}{argmax\ }(Q(a))$ (exploit)
    - Usually $\epsilon = \frac{1}{\sqrt{t}}$
    - Take action $k$, record reward $r$
    - Update the action count $N(k)$
    - Update the action value $Q(k)$
    - Interpretation: $Q(k)$ is the running average of the rewards observed for arm $k$ (see the sketch below)
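A minimal sketch of the loop above, assuming Bernoulli arms whose success probabilities (`true_probs`) are made up for illustration; the $Q(k)$ update is the standard incremental average:

```python
import numpy as np

def epsilon_greedy_bandit(true_probs, T=10_000, seed=0):
    """Epsilon-greedy for a Bernoulli multi-armed bandit (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    K = len(true_probs)
    Q = np.zeros(K)      # estimated value of each arm
    N = np.zeros(K)      # number of times each arm was pulled
    for t in range(1, T + 1):
        eps = 1.0 / np.sqrt(t)                     # epsilon = 1 / sqrt(t)
        if rng.random() < eps:
            k = int(rng.integers(K))               # explore: random arm
        else:
            k = int(np.argmax(Q))                  # exploit: greedy arm
        r = float(rng.random() < true_probs[k])    # Bernoulli reward
        N[k] += 1
        Q[k] += (r - Q[k]) / N[k]                  # incremental average update
    return Q, N

Q, N = epsilon_greedy_bandit([0.2, 0.5, 0.7])
print("estimated values:", Q.round(3), "pull counts:", N)
```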
 
Context-Free - UCB (Upper Confidence Bound)
- For a given option $a_i$, the upper confidence bound on its expected return is $\bar x_{i,t} + \sqrt{\frac{2\ln n}{n_{i,t}}}$, where $n$ is the total number of trials so far and $n_{i,t}$ is the number of times option $i$ has been selected
    - First term - exploitation: the average return of option $i$ over its $n_{i,t}$ trials
    - Second term - exploration: a higher bound for options that have been selected less often (see the sketch below)
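A minimal UCB1 sketch under the same made-up Bernoulli setup as above; each arm is played once first so that $n_{i,t} > 0$ before the exploration bonus is computed:

```python
import numpy as np

def ucb1(true_probs, T=10_000, seed=0):
    """UCB1 for a Bernoulli multi-armed bandit (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    K = len(true_probs)
    mean = np.zeros(K)   # average return of each arm (exploitation term)
    n = np.zeros(K)      # number of times each arm was selected
    for k in range(K):   # play each arm once so the bonus term is defined
        mean[k] = float(rng.random() < true_probs[k])
        n[k] = 1
    for t in range(K, T):
        bonus = np.sqrt(2 * np.log(t) / n)         # exploration term
        k = int(np.argmax(mean + bonus))           # arm with the highest upper bound
        r = float(rng.random() < true_probs[k])
        n[k] += 1
        mean[k] += (r - mean[k]) / n[k]
    return mean, n

mean, n = ucb1([0.2, 0.5, 0.7])
print("means:", mean.round(3), "counts:", n)
```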
 
Context-Free - Thompson Sampling
- Beta distribution: $f(x;\alpha, \beta) = C \cdot x^{\alpha-1} \cdot (1-x)^{\beta-1}$. Note that $\alpha = \beta = 1$ gives the uniform distribution between 0 and 1.
- The beta distribution is the conjugate prior probability distribution for the Bernoulli distribution
- At each time step, sample $\theta_i$ from $Beta(\alpha_i, \beta_i)$ for each option, pick the option with the largest $\theta_i$, and update its parameters $\alpha_i$ and $\beta_i$ based on the reward $r$ (see the sketch below)
- The two parameters control the mean and variance of $\theta$ for each option
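A minimal Beta-Bernoulli Thompson sampling sketch (reward probabilities again made up for illustration): every arm starts from the uniform $Beta(1,1)$ prior, a success increments $\alpha_i$, and a failure increments $\beta_i$:

```python
import numpy as np

def thompson_sampling(true_probs, T=10_000, seed=0):
    """Beta-Bernoulli Thompson sampling (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    K = len(true_probs)
    alpha = np.ones(K)   # Beta parameters; alpha = beta = 1 is the uniform prior
    beta = np.ones(K)
    for _ in range(T):
        theta = rng.beta(alpha, beta)              # sample theta_i from each posterior
        k = int(np.argmax(theta))                  # play the arm with the largest sample
        r = float(rng.random() < true_probs[k])
        alpha[k] += r                              # success count
        beta[k] += 1 - r                           # failure count
    return alpha, beta

alpha, beta = thompson_sampling([0.2, 0.5, 0.7])
print("posterior means:", (alpha / (alpha + beta)).round(3))
```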
 
Contextual Bandit
- The decision becomes choosing action $a$ based on the context vector $x_t$ and the policy $\pi$
- Receive reward $r_{t,a}$ and update the policy $\pi$ (see the sketch below)
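A minimal sketch of this loop, using the disjoint LinUCB algorithm as one concrete choice of policy $\pi$ (one linear model per arm); the simulated context and reward model are made up purely for illustration:

```python
import numpy as np

def linucb(T=5_000, d=5, K=3, alpha=1.0, seed=0):
    """Disjoint LinUCB on a simulated linear-reward environment (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    true_theta = rng.normal(size=(K, d))            # hidden per-arm weights (simulation only)
    A = np.stack([np.eye(d) for _ in range(K)])     # per-arm A = I_d
    b = np.zeros((K, d))                            # per-arm b = 0
    for _ in range(T):
        x = rng.normal(size=d)                      # context vector x_t
        scores = np.empty(K)
        for k in range(K):
            A_inv = np.linalg.inv(A[k])
            theta_hat = A_inv @ b[k]                # ridge estimate for arm k
            scores[k] = theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)  # UCB score
        k = int(np.argmax(scores))                  # choose action a
        r = true_theta[k] @ x + rng.normal(scale=0.1)   # observe reward r_{t,a}
        A[k] += np.outer(x, x)                      # update (revise) the chosen arm's model
        b[k] += r * x
    return A, b

A, b = linucb()
```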
 
Markov Process
Markov Property of a state
- $P(S _{t+1} \vert S_t) = P(S _{t+1} \vert S_t, S _{t-1}, S _{t-2}, \ldots)$, i.e., given the present, the future is independent of past states.
 
Markov Process (MP)
- A sequence of random states $S$ with the Markov property, with the definitions of:
    - $S$: a finite set of states
    - $P$: transition function - $P(S _{t+1} = s' \vert S_t = s)$
 
 
Markov Reward Process
Reward Function:
Return Function:
Value Function & Bellman Expectation Equation
- How good is it to be in a particular state
- Two components:
    - Immediate reward
    - Discounted value of the successor state
- Interpretation (a standard form of the equation is written out below):
    - The expected reward we get upon leaving that state
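For reference, a standard form of the return and of the MRP Bellman expectation equation (the notation $R(s)$, $\gamma$, and $P(s' \vert s)$ is assumed here, since the original equations are not reproduced):

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots, \qquad V(s) = \mathbb{E}[G_t \vert S_t = s] = R(s) + \gamma \sum_{s'} P(s' \vert s)\, V(s')$$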
 
 
Analytical solution to Bellman Equation
- Computational complexity: $O(n^3)$
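In matrix form the Bellman expectation equation reads $V = R + \gamma P V$, so $V = (I - \gamma P)^{-1} R$; the matrix solve is what costs $O(n^3)$. A minimal sketch with a made-up 3-state MRP:

```python
import numpy as np

# Made-up 3-state MRP for illustration
P = np.array([[0.7, 0.2, 0.1],      # transition matrix P(s' | s)
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
R = np.array([1.0, 0.0, 10.0])      # reward for each state
gamma = 0.9

# Analytical solution: V = (I - gamma * P)^{-1} R, O(n^3) in the number of states
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)
```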
 
Markov Decision Process
New State Transition Probability - with action $a$
New Reward Function - with action $a$
Policy:
- Given a fixed policy, the MDP reduces to an MRP.
 
New State Transition Probability - with policy $\pi$
New Reward Function - with policy $\pi$
New Value Function & Bellman Expectation Equation - with policy $\pi$
New Action-Value Function - with policy $\pi$ and action $a$:
- How good is it to take a particular action in a particular state
 
The relationship between $V$ and $q$
Rewrite New Action-Value Function
Rewrite New Value Function
Analytical solution to New Bellman Equation
Bellman Optimality Equation
- Optimal Policy
 
- Optimal Value Function
 
- Optimal Action-Value Function
 
Relationship between optimal functions
Rewrite Optimal Value Function
Model-based Approach (known $P$ and $r$)
Policy Evaluation
- Given the MDP and a policy $\pi$
- Using the analytical solution to the Bellman equation or the rewritten new value function
- Iterate from $k$ to $k+1$ (see the sketch below):
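A minimal sketch of iterative policy evaluation, assuming the model is given as a transition tensor `P[s, a, s']` and a reward table `R[s, a]` (array conventions chosen here for illustration):

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation.
    P: (S, A, S) transition tensor, R: (S, A) rewards, policy: (S, A) probabilities."""
    # Under a fixed policy the MDP reduces to an MRP
    P_pi = np.einsum("sa,sat->st", policy, P)   # P_pi(s' | s)
    R_pi = np.einsum("sa,sa->s", policy, R)     # r_pi(s)
    V = np.zeros(P.shape[0])
    while True:
        V_new = R_pi + gamma * P_pi @ V         # Bellman expectation backup, k -> k+1
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Tiny made-up example: 2 states, 2 actions, uniform random policy
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
print(policy_evaluation(P, R, np.full((2, 2), 0.5)))
```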
 
Policy Iteration
- Policy Evaluation: given $\pi_i$, evaluate $V _{\pi_i}$ for each state $s$
- Policy Improvement: for each state $s$:
    - Calculate $q(s,a)$ for each $a$
    - Get the updated policy $\pi _{i+1}(s)$
- Convergence: can be mathematically proved
- Stop criteria:
    - Assume the current policy $\pi_i$ is already optimal
    - Then $\pi_i$ is the optimal $\pi^*$, and we have (see the sketch below):
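A minimal policy-iteration sketch under the same `P[s, a, s']` / `R[s, a]` conventions as the policy-evaluation sketch above; it alternates evaluation (analytic form here, for brevity) and greedy improvement, and stops when the policy is stable:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration for a small tabular MDP (illustrative sketch)."""
    S, A = R.shape
    policy = np.full((S, A), 1.0 / A)               # start from the uniform policy
    while True:
        # Policy evaluation: reduce to an MRP under the current policy and solve for V
        P_pi = np.einsum("sa,sat->st", policy, P)
        R_pi = np.einsum("sa,sa->s", policy, R)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: q(s,a) for each action, then act greedily
        q = R + gamma * np.einsum("sat,t->sa", P, V)
        new_policy = np.eye(A)[q.argmax(axis=1)]    # one-hot greedy policy pi_{i+1}
        if np.array_equal(new_policy, policy):      # stop: policy is stable, pi_i = pi*
            return new_policy, V
        policy = new_policy

# Usable with the same made-up P, R arrays as in the policy-evaluation sketch
```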
 
Example

Value Iteration
- Directly find the optimal value function of each state instead of explicitly finding the policy
- Value update (see the sketch below):
- Stop criteria: the change in $V$ values falls below a threshold
- Optimal policy: for each state, the action with the max value
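A minimal value-iteration sketch under the same `P[s, a, s']` / `R[s, a]` conventions: apply the Bellman optimality backup until $V$ stops changing, then read off the greedy policy:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a small tabular MDP (illustrative sketch)."""
    V = np.zeros(P.shape[0])
    while True:
        q = R + gamma * np.einsum("sat,t->sa", P, V)   # q(s,a) from the current V
        V_new = q.max(axis=1)                          # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:            # stop: change in V is small
            break
        V = V_new
    policy = q.argmax(axis=1)                          # greedy action for each state
    return V_new, policy
```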
 
Example

Model-free Approach (Unknown $P$ and $r$)
Monte-Carlo (MC) Methods
- Must wait until the return is known (i.e., until the end of the episode) before updating
 

Policy Gradient - Monte Carlo REINFORCE
- The parameters $\theta$ of the $NN$ determine a policy $\pi _{\theta}$
- Define the trajectory $\tau$
- Define the total reward $r(\tau)$
- The loss function for a given policy $\pi _{\theta}$ is
- A mathematical property can be proved
- There are other revisions that replace $\color{blue}{ r(\tau)}$ with $\color{blue}{ \Phi(\tau, t)}$ to account for various problems (e.g., standardization when all rewards are positive, only counting the reward of $a_t$ from $t+1$ onward, etc.). For example:
- Step 1: sample $\tau = (a_1, s_1, a_2, s_2, \ldots, a_T, s_T)$ based on $\pi _{\theta}$, and get $r(\tau)$
- Step 2: Calculate back-propagation for $\nabla_\theta J(\theta)$. For each time step $t$, it is a multiclass-classification $NN$ with target value $\color{blue}{r(\tau)}$ or $\color{blue}{ \Phi(\tau, t)}$.
- Step 3: Update the weights $\theta$ (see the sketch below).
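A minimal REINFORCE sketch in PyTorch on CartPole (the `gymnasium` environment and the hyperparameters are assumptions for illustration); here $\Phi(\tau, t)$ is taken to be the standardized discounted reward-to-go, one of the revisions mentioned above:

```python
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # logits of pi_theta
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    # Step 1: sample one trajectory tau with the current policy pi_theta
    s, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(s, dtype=torch.float32)))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, terminated, truncated, _ = env.step(a.item())
        rewards.append(r)
        done = terminated or truncated

    # Phi(tau, t): discounted reward-to-go, then standardized
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Steps 2-3: loss = -sum_t log pi(a_t|s_t) * Phi(tau, t); backprop and update theta
    loss = -(torch.stack(log_probs) * returns).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```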
 
Drawbacks
- Updates happen after each episode instead of after each time step, so bad actions within an episode still get reinforced.
- Requires a large sample size.

 
Example
see this notebook
Temporal-Difference (TD) Learning
- TD methods need to wait only until the next time step.
- At time $t+1$ they immediately form a target and make a useful update.
 
Q-Learning
- Key idea: the learned action-value function $Q$ directly approximates $q^*$, the optimal action-value function
- Optimal Policy:
- Algorithm (ref: http://www.incompleteideas.net/book/RLbook2020.pdf); see the sketch after this list:
    - Initialize the $Q$ table with $Q(s,a)$; note that $Q(terminal, \cdot)=0$
    - Loop for each episode:
        - Initialize $s$
        - Loop for each step of the episode, until $s$ is terminal:
            - Choose action $a$ from $s$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
            - Take action $a$, observe reward $r$ and next state $s'$
            - Update the $Q$ table (with learning rate $\alpha$): $Q(s,a) \leftarrow Q(s,a) + \alpha \, [r + \gamma \underset{a'}{max\ } Q(s',a') - Q(s,a)]$
            - $s \leftarrow s'$
 
Drawbacks:
- Requires a finite set of actions; cannot handle a continuous or stochastic action space.
 
Example

Deep Q-Learning
- The parameters $w$ of the $NN$ determine the value $\hat Q_w(s,a)$ for each action $a$.
- For episode in range(max_episode):
    - For step in range(max_step):
        - From $s$, run the $NN$ and get $\hat Q_w(s,a)$ for each action $a$
        - Select and take the best action $a^*$
        - Get reward $r$ and the next state $s'$
        - Run the $NN$ again to get $Q(s',a')$ for each $a'$
        - If $s'$ is not terminal, set the target value (equation $[4]$): $\color{blue}{Q(s,a^*) = r + \gamma \underset{a'}{max\ } Q(s',a')}$
        - Set the loss $L = [\hat Q_w(s,a^*) - \color{blue}{Q(s,a^*)}]^2$
        - Update the weights $w$ (see the sketch below)
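A minimal deep Q-learning sketch in PyTorch on CartPole (`gymnasium` assumed installed) following the loop above; for brevity it omits the replay buffer and target network that practical DQN implementations rely on:

```python
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
qnet = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # Q_w(s, a) for each a
opt = torch.optim.Adam(qnet.parameters(), lr=1e-3)
gamma, eps = 0.99, 0.1

for episode in range(300):
    s, _ = env.reset()
    done = False
    while not done:
        # run the NN from s and pick the (epsilon-)greedy action a*
        q_s = qnet(torch.as_tensor(s, dtype=torch.float32))
        a = env.action_space.sample() if torch.rand(1).item() < eps else int(q_s.argmax())
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # target value [4]: r + gamma * max_a' Q(s', a'), or just r at a terminal state
        with torch.no_grad():
            q_next = qnet(torch.as_tensor(s_next, dtype=torch.float32)).max().item()
            target = r + (0.0 if terminated else gamma * q_next)
        # loss = [Q_w(s, a*) - target]^2, then update the weights w
        loss = (q_s[a] - target) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
        s = s_next
```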
        
 
Example
see this notebook
Vanilla Actor-Critic
- Instead of using $r(\tau)$ (which is the empirical, long-term reward based on $T$ steps) and updating $\theta$ after one episode, here we use $Q(s,a)$ (there are also other variations). This enables updating the weights at every step.
- Policy Gradient:
    - Actor:
- Deep Q-Learning:
    - Critic:
 
 
Algorithm
- Initialize $s, \theta, w$, and sample $a$ based on $\pi_\theta$.
- For $t = 1, 2, \ldots, T$ (see the sketch below):
    - Get $s'$ and $r$
    - Get $a'$ based on $\pi_\theta(a' \vert s')$
    - Update $\theta$ based on $[7]$
    - Update $w$ based on $[8]$
    - $a \leftarrow a'$, $s \leftarrow s'$
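A minimal actor-critic sketch in PyTorch on CartPole (environment and hyperparameters are illustrative assumptions). The critic is a $Q_w(s,a)$ network updated with a TD target, and the actor is updated with the policy gradient weighted by the critic's value; these updates stand in for equations $[7]$ and $[8]$, which are not reproduced here:

```python
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # pi_theta: action logits
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q_w(s, a) for each action
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(300):
    s, _ = env.reset()
    s = torch.as_tensor(s, dtype=torch.float32)
    a = torch.distributions.Categorical(logits=actor(s)).sample()      # sample a from pi_theta
    done = False
    while not done:
        s_next, r, terminated, truncated, _ = env.step(a.item())       # get s' and r
        done = terminated or truncated
        s_next = torch.as_tensor(s_next, dtype=torch.float32)
        a_next = torch.distributions.Categorical(logits=actor(s_next)).sample()  # a' ~ pi_theta(.|s')

        # actor update (role of [7]): grad of log pi(a|s) weighted by the critic's Q(s,a)
        q_sa = critic(s)[a].detach()
        actor_loss = -torch.distributions.Categorical(logits=actor(s)).log_prob(a) * q_sa
        opt_actor.zero_grad()
        actor_loss.backward()
        opt_actor.step()

        # critic update (role of [8]): TD target r + gamma * Q(s', a')
        with torch.no_grad():
            target = r + (0.0 if terminated else gamma * critic(s_next)[a_next].item())
        critic_loss = (critic(s)[a] - target) ** 2
        opt_critic.zero_grad()
        critic_loss.backward()
        opt_critic.step()

        s, a = s_next, a_next                                           # s <- s', a <- a'
```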
 
 
Deep Deterministic Policy Gradient
- Both REINFORCE and the vanilla actor-critic method are on-policy: training samples are collected according to the target policy, the very same policy that we try to optimize.
- DDPG is a model-free, off-policy actor-critic algorithm
- https://arxiv.org/pdf/1509.02971.pdf
 


Comparison with Deep Q Learning

Comparison

ref: https://antkillerfarm.github.io/rl/2018/11/18/RL.html

- Action space
    - DQN cannot deal with infinite action spaces (e.g., driving)
    - Policy gradient can learn stochastic policies, while DQN gives a deterministic policy for a given state
- Convergence
    - Policy gradient has better convergence (the policy is updated smoothly, while in DQN a slight change in the Q values may completely change the action/policy space, so there is no convergence guarantee)
    - Policy gradient is guaranteed to reach at least a local optimum
    - But policy gradient usually takes longer to train
- Variance
    - Policy gradient: higher variance and sampling inefficiency (think of the many actions taken in an episode before updating)
    - Actor-Critic: TD update
 
 

- Left: model unknown
- Right: model known
- Top: update each step
- Bottom: update each episode