There are two settings:

  • planning
  • learning

Planning is when you already know the MDP (the transition probabilities), and you either want to evaluate a given policy (“policy evaluation” aka “prediction”), or you want to find an optimal policy (“control”).

Policy Evaluation (aka prediction)

  • iterative policy evaluation (using the Bellman expectation equation as the update rule); see the sketch below
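
A minimal sketch of iterative policy evaluation, assuming a small made-up MDP (the states, transition table P, rewards, discount, and policy below are purely illustrative):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (made up for illustration).
# P[s][a] = list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.7, 1, 1.0), (0.3, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9
policy = {0: 1, 1: 1}  # a fixed deterministic policy to evaluate

V = np.zeros(len(P))
while True:
    # Bellman expectation backup: expected immediate reward
    # plus discounted value of the successor state, under `policy`.
    V_new = np.array(
        [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]]) for s in P]
    )
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

print(V)  # value of each state under `policy`
```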

To find the optimal policy, you just need to find the optimal value function (going from the optimal value function to the optimal policy is trivial: act greedily with respect to it).

To find the optimal value function

  • value iteration (using the Bellman optimality equation as the update rule); see the sketch below
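
A sketch of value iteration on the same kind of made-up MDP. The backup now takes a max over actions, and the final greedy argmax is the “trivial” step from optimal value function to optimal policy:

```python
import numpy as np

# Same hypothetical MDP shape as before: P[s][a] = [(prob, next_state, reward), ...]
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.7, 1, 1.0), (0.3, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def q(V, s, a):
    # One-step lookahead: expected reward plus discounted next-state value.
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

V = np.zeros(len(P))
while True:
    # Bellman optimality backup: max over actions.
    V_new = np.array([max(q(V, s, a) for a in P[s]) for s in P])
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

# Extracting an optimal policy is just a greedy argmax over the lookahead.
policy = {s: max(P[s], key=lambda a: q(V, s, a)) for s in P}
print(V, policy)
```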

You can also use iterative policy evaluation to first evaluate your policy, then use the resulting value function to improve the policy (greedily), then evaluate the new policy to get a new value function, then improve the policy (greedily) again. You keep alternating between policy evaluation (get the value function of the current policy) and greedy improvement (pick the best actions according to that value function), and you stop when the policy stops changing! This is policy iteration, and it is another way of finding the optimal policy, so it is also a “control” solution.
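
A sketch of that alternation (policy iteration), reusing the same made-up MDP shape as above:

```python
import numpy as np

P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.7, 1, 1.0), (0.3, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def q(V, s, a):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def evaluate(policy):
    # Iterative policy evaluation (Bellman expectation backups) to convergence.
    V = np.zeros(len(P))
    while True:
        V_new = np.array([q(V, s, policy[s]) for s in P])
        if np.max(np.abs(V_new - V)) < 1e-8:
            return V_new
        V = V_new

policy = {s: 0 for s in P}  # start from an arbitrary policy
while True:
    V = evaluate(policy)
    # Greedy improvement with respect to the current value function.
    new_policy = {s: max(P[s], key=lambda a: q(V, s, a)) for s in P}
    if new_policy == policy:  # policy stopped changing -> optimal
        break
    policy = new_policy

print(policy, V)
```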

That is planning. Learning is when you don’t know the transition probabilities (aka the dynamics). In this case you must run trials (simulations) and see what happens. You can use Monte Carlo methods or TD methods. In essence, both look at a bunch of trials (episodes) and distribute the rewards seen in an episode to the states visited and actions taken along the way.

Monte Carlo is very straightforward: it moves the value of a state (or state-action pair) towards the return experienced after visiting that state during the episode.
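
A sketch of Monte Carlo prediction, assuming a hypothetical random-walk environment (states 0–4, start in the middle, reward 1 only for exiting on the right):

```python
import random
from collections import defaultdict

# Hypothetical episodic random walk, invented for illustration.
def run_episode():
    s, episode = 2, []
    while 0 <= s <= 4:
        s_next = s + random.choice([-1, 1])
        reward = 1.0 if s_next > 4 else 0.0
        episode.append((s, reward))
        s = s_next
    return episode

gamma, alpha = 1.0, 0.05
V = defaultdict(float)

for _ in range(5000):
    episode = run_episode()
    G = 0.0
    # Walk backwards so G is the return that followed each visit.
    for s, r in reversed(episode):
        G = r + gamma * G
        # Move V(s) towards the observed return (every-visit MC).
        V[s] += alpha * (G - V[s])

print({s: round(V[s], 2) for s in sorted(V)})
```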

One-step TD (TD(0)) keeps updating the value of a state (or state-action pair) using the immediate reward plus the value of the next state. Two-step TD waits until it has collected 2 immediate rewards before bootstrapping, three-step TD waits for 3, and so on. In the limit, n-step TD with n = infinity waits until the end of the episode and never bootstraps, so it is equivalent to Monte Carlo!
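
A sketch of the one-step case, TD(0), on the same made-up random walk as in the Monte Carlo sketch:

```python
import random
from collections import defaultdict

# Same hypothetical random walk, exposed one step at a time.
def step(s):
    s_next = s + random.choice([-1, 1])
    reward = 1.0 if s_next > 4 else 0.0
    done = not (0 <= s_next <= 4)
    return s_next, reward, done

gamma, alpha = 1.0, 0.05
V = defaultdict(float)

for _ in range(5000):
    s, done = 2, False
    while not done:
        s_next, r, done = step(s)
        # One-step TD target: immediate reward plus the (bootstrapped)
        # value of the next state, instead of waiting for the full return.
        target = r if done else r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])
        s = s_next

print({s: round(V[s], 2) for s in sorted(V)})
```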

TD(lambda) is basically computing all of the n-step returns (n = 1 up to infinity), weighting them with geometrically decaying weights in lambda, and averaging the results. The “backward” view implementation is much more efficient: keep an eligibility trace for every state, and when a TD error shows up, “propagate” it backwards to every state in proportion to its eligibility.
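
A sketch of the backward view with accumulating eligibility traces, again on the same made-up random walk:

```python
import random
from collections import defaultdict

# Same hypothetical random walk as before.
def step(s):
    s_next = s + random.choice([-1, 1])
    reward = 1.0 if s_next > 4 else 0.0
    done = not (0 <= s_next <= 4)
    return s_next, reward, done

gamma, lam, alpha = 1.0, 0.8, 0.05
V = defaultdict(float)

for _ in range(5000):
    s, done = 2, False
    traces = defaultdict(float)  # eligibility of each state this episode
    while not done:
        s_next, r, done = step(s)
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]  # one-step TD error
        traces[s] += 1.0       # bump eligibility of the current state
        for x in list(traces):
            # Propagate the TD error backwards, proportional to eligibility,
            # then decay the trace by gamma * lambda.
            V[x] += alpha * delta * traces[x]
            traces[x] *= gamma * lam
        s = s_next

print({s: round(V[s], 2) for s in sorted(V)})
```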