Hello, and welcome back!

In this post, we’ll talk about the general way in which we learn (or compute) a policy. This “general” view is known as “generalized policy iteration”.

Ok, so you always start with an arbitrary value function and an arbitrary policy.

Now, this value function isn’t the “true” value function for this policy, and this policy isn’t greedy with respect to this value function.

So, first we get the true value function for this policy. If we know the dynamics, we can directly solve the Bellman expectation equation (for a fixed policy, it’s just a system of linear equations), or we can use iterative policy evaluation to find the value function. If we don’t know the dynamics, we must get some samples (episodes)! In this case, we can use Monte Carlo or TD methods to find the value function.
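Iterative policy evaluation is easy to sketch. Here’s a minimal version on a made-up two-state, two-action MDP (the states, actions, and dynamics here are all toy assumptions, just for illustration):

```python
# Iterative policy evaluation on a tiny hypothetical 2-state MDP.
# P[(s, a)] is a list of (probability, next_state, reward) transitions.
# All dynamics below are made up for this sketch.
P = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(1.0, 1, 1.0)],
    (1, 0): [(1.0, 0, 0.0)],
    (1, 1): [(1.0, 1, 2.0)],
}
GAMMA = 0.9

def evaluate(policy, theta=1e-8):
    """Sweep over states, backing up expected returns, until values settle."""
    V = [0.0, 0.0]
    while True:
        delta = 0.0
        for s in (0, 1):
            a = policy[s]
            v = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

V = evaluate(policy=[1, 1])  # "always take action 1"
# V approaches [19, 20]: V[1] = 2/(1 - 0.9) = 20, V[0] = 1 + 0.9 * 20 = 19
```

Each sweep just replaces every state’s value with the expected one-step return under the policy; repeating this contracts toward the true value function.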

All good so far!

Now, the value function we have is the true value function for the policy (or, if you are using samples, a close-enough estimate, depending on the number of samples). But the policy is not greedy with respect to this value function. So, we change our policy such that, for each state, it chooses the action that has the highest expected return! This makes our policy greedy with respect to this value function.

But now, the value function is no longer the true value function for this new policy! So, we must again get a new value function, which is the true value function for this new policy! Once we get the new value function, again, our policy is no longer greedy! This cycle (of life) continues until your policy stabilizes (doesn’t change). Once you notice that the policy never changes, you’ve “converged” to the optimal policy and its value function (which we now call the optimal value function, because it is the value function for the optimal policy)!
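The whole evaluate-improve-repeat cycle above is classic policy iteration. A minimal sketch, again on a made-up two-state MDP (all the dynamics are assumptions for illustration):

```python
# Policy iteration on a tiny hypothetical 2-state MDP (made-up dynamics).
# P[(s, a)] is a list of (probability, next_state, reward) transitions.
P = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(1.0, 1, 1.0)],
    (1, 0): [(1.0, 0, 0.0)],
    (1, 1): [(1.0, 1, 2.0)],
}
GAMMA = 0.9

def q(s, a, V):
    """Expected one-step return of taking action a in state s under V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])

def policy_iteration():
    policy, V = [0, 0], [0.0, 0.0]
    while True:
        # Evaluation: sweep until V is (numerically) the true value of policy.
        while True:
            delta = 0.0
            for s in (0, 1):
                v = q(s, policy[s], V)
                delta, V[s] = max(delta, abs(v - V[s])), v
            if delta < 1e-8:
                break
        # Improvement: act greedily; stop once the policy stops changing.
        new_policy = [max((0, 1), key=lambda a: q(s, a, V)) for s in (0, 1)]
        if new_policy == policy:
            return policy, V
        policy = new_policy

policy, V = policy_iteration()  # converges to policy [1, 1], V near [19, 20]
```

The stopping condition is exactly the “policy never changes” check described above.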

The thing is, you can do these cycles on little pieces of the state space. You can update values for some of the states, then select greedily on just those states. So you can make “local” changes! Additionally, even if you know the dynamics, you don’t have to get the true value function; just do a couple of iterations of “iterative policy evaluation” to move the value function a little closer to the true one. When we are using Monte Carlo or TD, we never get the “true” value function anyway, because it is based on samples!

So really, during your evaluation step, just move your value function a little closer to the true one.
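If you push “move it a little closer” to the extreme of a single greedy backup per state per sweep, you get value iteration. A minimal sketch, on the same kind of made-up two-state MDP used in the snippets above (the dynamics are assumptions for illustration):

```python
# Value iteration on a tiny hypothetical 2-state MDP (made-up dynamics).
# Each backup takes a max over actions, so evaluation and improvement
# are collapsed into a single step.
P = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(1.0, 1, 1.0)],
    (1, 0): [(1.0, 0, 0.0)],
    (1, 1): [(1.0, 1, 2.0)],
}
GAMMA = 0.9

def value_iteration(theta=1e-8):
    V = [0.0, 0.0]
    while True:
        delta = 0.0
        for s in (0, 1):
            v = max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])
                    for a in (0, 1))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

V = value_iteration()  # converges to the optimal values, near [19, 20]
```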

# No Dynamics!

Also, I should (but won’t hehe, fine I guess I will ->) note that in order to act greedily on a value function, we either need the dynamics or state-action (as opposed to just state) samples. In other words, if we don’t have the dynamics, we really need to find the action-value function, not the state-value function! Once we have the action-value function, we can act greedily on it to select, at each state, the action that has the greatest value!
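That’s the payoff of having action values: greedy selection needs no dynamics at all, just an argmax over the actions at each state. A tiny sketch (the Q values below are made up):

```python
# Greedy action selection from an action-value table: no dynamics needed.
# These Q values are made up for illustration.
Q = {
    0: [17.1, 19.0],  # hypothetical action values for state 0
    1: [17.1, 20.0],  # hypothetical action values for state 1
}

def greedy_action(Q, s):
    """Pick the action with the largest estimated value at state s."""
    values = Q[s]
    return max(range(len(values)), key=lambda a: values[a])

greedy_action(Q, 0)  # -> 1
```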

Now, the second issue with not having the dynamics (the first being the above paragraph) is that if you select greedily based on your value function (remember: the action-value function, in the case of not having the dynamics), then your new policy is deterministic! It won’t choose any new actions in the future, so future episodes won’t give you information on how valuable other actions might be. In other words, your future episodes (the ones that follow your new greedy bastard policy) don’t explore new actions!

This is a problem because we’ll never see potentially better (more valuable) actions! The solution is to select “almost greedily”. What I mean by this is that you still want a stochastic policy: one that selects the greedy action the majority of the time, but has a small chance of selecting some other random action. We call these kinds of policies “soft” policies, because they are not tough enough. Buncha wimps I tell ya. And we no longer say “select greedily”, we say “select ε-greedily”, because there is an ε chance that a completely random action will be chosen (and at least a 1-ε chance that the established best action (the one with the highest action value) will be chosen — “at least”, because the random draw can land on the best action too!).
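An ε-greedy selector is only a few lines. Here’s a minimal sketch (the action values passed in are whatever your current Q estimates happen to be):

```python
import random

def epsilon_greedy(Q_s, epsilon, rng=random):
    """ε-greedy selection. Q_s is the list of action values for one state."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q_s))                      # explore
    return max(range(len(Q_s)), key=lambda a: Q_s[a])       # exploit

# With epsilon = 0.1, the greedy action is chosen at least 90% of the
# time (a bit more, since the random draw can also land on it).
epsilon_greedy([1.0, 5.0, 2.0], 0.0)  # -> 1 (pure exploitation)
```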

Anyways, that’s all for today folks!