Description

A Markov reward process where you have one or more “actions” that you can take from each state. Each action has some probability of taking you to another state and giving you some reward.

                                   +-----------+                       
                          p = 0.2  |           |                       
                         +-------->|  state 2  |                       
                         |  r = 3  |           |                       
+-----------+            |         +-----------+                       
|           |            |                                             
|  state 1  |  ---------->                                             
|           |            |                                             
+-----------+            |         +-----------+                       
      |                  | r = 2   |           |                       
      |                  +-------->|  state 3  |                       
      |                   p = 0.8  |           |                       
      |                            +-----------+                       
      |                                                                
      |                                                                
      |                            +-----------+                       
      |                    r = 4   |           |                       
      +--------------------------->|  state 4  |                       
                           p = 1   |           |                       
                                   +-----------+                       

This is the thing we’re really interested in, in RL. In real life, we’re often in some state and have some actions available to us. Each action has some chance of taking us to some other state and giving us some reward. If we know the dynamics, how can we plan so that we make the best decisions (i.e. choose the best action at each state)? If we don’t know the dynamics, can we still learn to behave in such a way as to maximize return?

As usual, if we know the dynamics, we can just solve the Bellman expectation equations (a system of linear equations) to “directly” find the value function of the MDP under a given policy. But this is very inefficient for MDPs with large state spaces (and most interesting/useful MDPs have very large state spaces). So usually, we run simulations and “learn” the value function from those simulations. We can use either MC or TD lambda.
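To make the “just solve it directly” option concrete, here’s a minimal sketch (Python/NumPy) for the toy MDP in the diagram above. The 50/50 policy from state 1, the discount factor of 0.9, and treating states 2–4 as terminal are all my own assumptions for illustration; under a fixed policy the MDP collapses to an MRP, and the Bellman expectation equation becomes a plain linear solve:

```python
import numpy as np

# Toy MDP from the diagram (state numbering mine; states 2-4 assumed terminal).
# Under a fixed policy pi, v = r_pi + gamma * P_pi @ v,
# so v = (I - gamma * P_pi)^{-1} r_pi.

gamma = 0.9  # assumed discount factor

# Example policy: from state 1, take the branching action (to states 2/3)
# with prob 0.5 and the "go to state 4" action with prob 0.5.
P_pi = np.array([
    [0.0, 0.5 * 0.2, 0.5 * 0.8, 0.5],  # state 1
    [0.0, 0.0,       0.0,       0.0],  # state 2 (terminal)
    [0.0, 0.0,       0.0,       0.0],  # state 3 (terminal)
    [0.0, 0.0,       0.0,       0.0],  # state 4 (terminal)
])

# Expected immediate reward from each state under this policy.
r_pi = np.array([0.5 * (0.2 * 3 + 0.8 * 2) + 0.5 * 4, 0.0, 0.0, 0.0])

v = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v)  # value of each state under the example policy
```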

Learning Value Function MC Style

Run a bunch of terminating episodes. For each episode, as soon as you encounter the state of interest, add up all the rewards you see afterwards. This is one “sample” for the value of this state. Look at more episodes and do the same thing (find out how much reward you collect after you see the state of interest). You now have a bunch of samples for the value of this state (each from a different episode), so just average ’em son :grin:.

This is called “first visit MC”. There is a slightly different version, called “every visit MC”. In first visit MC, you ignore subsequent visits to the same state of interest (within the same episode). In every visit MC, you count the cumulative reward obtained after each subsequent visit as yet another sample for the value of the state. A sketch of the first-visit version is below.
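Here’s a minimal first-visit MC prediction sketch (Python, tabular). The episode format (a list of (state, reward) pairs per episode) and the default of no discounting are assumptions for illustration:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging, over episodes, the return observed
    after the *first* visit to s in each episode.

    episodes: list of episodes; each episode is a list of (state, reward)
              pairs, where reward is the reward received on leaving the state.
    """
    returns = defaultdict(list)  # state -> list of sampled returns

    for episode in episodes:
        # Backwards pass: return following each time step.
        G = 0.0
        returns_after = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns_after[t] = G

        # First-visit: only the first occurrence of each state counts.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:
                seen.add(state)
                returns[state].append(returns_after[t])
                # (for every-visit MC, drop the `seen` check entirely)

    return {s: sum(g) / len(g) for s, g in returns.items()}
```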

I should note that, like all mean calculations, you can compute this incrementally, and even incrementally with a constant step size (gradient style) :).
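In symbols (with $G_t$ as the sampled return after visiting $S_t$, $N(S_t)$ as the visit count, and $\alpha$ as a constant step size; this notation is my own, not from the text above):

$$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\bigl(G_t - V(S_t)\bigr) \qquad\text{or}\qquad V(S_t) \leftarrow V(S_t) + \alpha\bigl(G_t - V(S_t)\bigr)$$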

Learning Value Function TD Style

Recall that when we are incrementally (gradient style) calculating an expectation (“mean”, same thing bruh…), we constantly move our current estimate closer to a target. In MC, this target is the cumulative reward collected after you visited the state in an episode. In TD, the target is a combination of the rewards you collect from that state up to some subsequent state, plus the (current estimated) value of that subsequent state!

In TD with n = 0, you only collect one immediate reward, so your update rule is:

$$V(S_t) \leftarrow V(S_t) + \alpha\Bigl(\overbrace{R_{t+1} + \gamma V(S_{t+1})}^{\text{target}} - V(S_t)\Bigr)$$

In TD with n = 1, you collect two immediate rewards, so your update rule is:

$$V(S_t) \leftarrow V(S_t) + \alpha\Bigl(\overbrace{R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})}^{\text{target}} - V(S_t)\Bigr)$$

In TD with n = 2, you collect three immediate rewards, so your update rule is:

$$V(S_t) \leftarrow V(S_t) + \alpha\Bigl(\overbrace{R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 V(S_{t+3})}^{\text{target}} - V(S_t)\Bigr)$$

The “overbrace” stuff in the above equations is your target. Notice that as n increases, you put more actual rewards in your target.
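Here’s what the n = 0 (one-step) version looks like as code, a minimal tabular sketch (Python). The transition-tuple input format is an assumption for illustration:

```python
from collections import defaultdict

def td0_prediction(transitions, alpha=0.1, gamma=0.9):
    """Tabular TD(0): nudge V(s) toward the one-step target r + gamma * V(s').

    transitions: iterable of (state, reward, next_state, done) tuples,
                 e.g. logged while running some fixed policy.
    """
    V = defaultdict(float)
    for state, reward, next_state, done in transitions:
        # Bootstrap from V(next_state) unless the episode just terminated.
        target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (target - V[state])
    return dict(V)
```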

You can actually compute a bunch of these (for n = 0, 1, 2, etc.) and then take a weighted average of them. Since you asked, this is actually what TD lambda does (see below)!
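Concretely, writing $G_t^{(n)}$ for the n-step target (conventional indexing, where $G_t^{(1)}$ is the one-reward target, so it is shifted by one relative to the n above), TD lambda uses the geometrically weighted average known as the λ-return:

$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)}, \qquad V(S_t) \leftarrow V(S_t) + \alpha\bigl(G_t^{\lambda} - V(S_t)\bigr)$$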