Recall, in the Monte Carlo method of finding values, you are keeping a "gradient based" online mean for each state. You update this mean each time you get a new estimate, and you get a new estimate each time you finish a new episode (one that has your state of interest in it). In other words, you need a complete episode every time you want to update the mean value for a state.
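Here's a rough sketch of that running-mean business in Python (names like V, alpha, and gamma are just my own picks for this sketch, not from any library):

```python
from collections import defaultdict

# Every-visit Monte Carlo value update with a constant step size alpha.
# The episode is a full rollout: a list of (state, reward) pairs, where
# reward is what you got for leaving that state.
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    G = 0.0
    # Walk the episode backwards, building up the return G for each state...
    for state, reward in reversed(episode):
        G = reward + gamma * G
        # ...then nudge that state's running mean toward the sampled return.
        V[state] += alpha * (G - V[state])
    return V

V = defaultdict(float)           # state values start at 0
toy_episode = [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)]
mc_update(V, toy_episode)        # one complete episode -> one batch of updates
```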

TD takes a different approach.

It will update the mean value for state s based on the reward it got transitioning out of state s into some other state s' (the "immediate reward"), plus the mean value of s' (the "long term reward")! Whooooooweeeeeeeeeee boy, we call this BOOT strappin!! Whoooweeee!

When you update an estimate based on another estimate, round hea, we call that BOOTstrappin, heeeeeeeee-haaaaww!

We updated our estimate of the value of s based on our estimate of the value of s'.

Anyways, here is the full update equation man…
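(In my own notation, with α as the step size and γ as the discount factor:)

$$V(s) \;\leftarrow\; V(s) + \alpha\,\big[\,r + \gamma V(s') - V(s)\,\big]$$

The r + γV(s') bit is the target: the actual immediate reward plus the bootstrapped guess for everything from s' onward, and α controls how hard you nudge the running mean toward it.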

Notice that in Monte Carlo, the update was simply a new return taken from a complete episode! In TD, each "next" state (within the same episode) provides an update for the previous state!
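In code it looks something like this (again, my own made-up names, nothing library-specific); the key thing is the update fires after every single transition, mid-episode:

```python
from collections import defaultdict

# Tabular TD(0): update V(s) using the immediate reward plus the current
# (bootstrapped) estimate of V(s').
def td0_update(V, s, reward, s_next, alpha=0.1, gamma=1.0):
    td_target = reward + gamma * V[s_next]   # actual reward + "guess" for the rest
    V[s] += alpha * (td_target - V[s])       # nudge the running mean toward it
    return V

V = defaultdict(float)
# Unlike Monte Carlo, no need to wait for the episode to end:
for s, reward, s_next in [("s0", 0.0, "s1"), ("s1", 1.0, "s2")]:
    td0_update(V, s, reward, s_next)
```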

Hmmmm… Here's a question that just popped up in my head. Once you update a state, since you have a new estimate for that state, can you immediately update the state before it? And since you now have a new estimate for the state before it, can you update the state before that one!? This would make sense, because you now have a "better" estimate for state s, and thus you can get a better estimate for state s-1! Whoooooooweeeeeeeeeee!!!! I'm betting my horse ole Betty that this actually happens! We'll find out soon, folks.

Stay tuned.


Oh, and one MOOOOOOOOOORE thing! Above, we say that as soon as you get s+1 in your simulation, you can update the mean value of s. This is "1-step" TD (the classic one-step update, also known as TD(0)). If you wait till s+2 to update s, then we call it "2-step" TD. BUT WAIT, THERE'S MORE! If you wait until s+3 to update the mean of s, then we call it (drumroll) "3-step" TD! The general name for this shenanigan is "n-step" TD. n tells you how many steps you wait before you update the state's value! The higher the n, the more "actual reward" you are taking into account, and the less "guess" you are taking into account.

Also, the longer you wait (the bigger the n), the closer you get to just a plain ole Monte Carlo method, because if you wait long enough, i.e. till the very end of the episode, to update the mean value, then you are just doing Monte Carlo!
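Here's a sketch of what the n-step target looks like (the helper name and everything else here is mine, just to show where the "actual reward" part ends and the "guess" part begins):

```python
from collections import defaultdict

# n-step TD target: n real rewards collected after leaving state s,
# plus a bootstrapped guess from whatever state you landed in after n steps.
def n_step_target(rewards, bootstrap_state, V, n, gamma=1.0):
    target = 0.0
    for k, r in enumerate(rewards[:n]):          # the "actual reward" part
        target += (gamma ** k) * r
    target += (gamma ** n) * V[bootstrap_state]  # the "guess" part
    return target

V = defaultdict(float)
rewards_after_s = [1.0, 0.0, 0.5]
print(n_step_target(rewards_after_s, "s3", V, n=2))  # 2 real rewards + gamma^2 * V(s3)
# n = 1 is plain one-step TD; push n to the end of the episode (where the
# bootstrap term vanishes at the terminal state) and you're back to Monte Carlo.
```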

Man, I keep typing monte-carlo as monte-carlol LOL.

Anyways, that's all for today folks. Nothin' to see here!


I should really note this somewhere, so I’ll do it here!

When I talked about using Monte Carlo or TD to find values for a policy, I talked about finding the state-values. This is definitely correct, but usually when you don't have the dynamics (aka you're sampling episodes, aka doing Monte Carlo or TD), you are more interested in action-values, because you can act greedily over action-values. You can't act greedily over state-values alone (without the dynamics): you don't know what subsequent states the other actions would take you to, so you can't compare their values!
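Quick illustration of why that is (the Q-table and the actions here are completely made up by me):

```python
# Greedy selection over action-values: just take the argmax over actions.
def greedy_action(Q, state, actions):
    return max(actions, key=lambda a: Q[(state, a)])

Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}   # hypothetical Q-table
print(greedy_action(Q, "s0", ["left", "right"]))  # -> "right"

# With only V(s) and no dynamics model, there's no analogous max: you'd need
# to know which s' each action leads to before you could compare V(s') values.
```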

Ok, now for real, that’s all for today!