You are already aware of the following methods for finding the value function of a policy:

But the above methods only work if you know the dynamics (transition probabilities, aka the p(s',r | s,a) function).

If you don’t have the dynamics, one straight forward method of finding the value for a particular state is to run the system and see what happens!

Run your system, and “record” what states you go through, what actions are taken (by the current policy), and what rewards are given after each action. So you’ll have a “string” of state-action-reward-state-action-reward-etc.

We call this an “espisode” (it’s just a sample run that followed the policy).

Now, pick one of the states that was visited in the episode, and see how much the return was after that state was visited. Remember, return is the cummulative future discounted reward. So start at that state, and add up all the reward that you got, until the very end of the episode. That is an estimate (based off a single sample) of the value of this state!

You can use the same reasoning to find the value of the other states that occured in your episode.

As you run more and more episodes, you can start to average the esimated values you get for a particular state. You can do this average incrementally (aka online). Furthermore, you can do this using a gradient approach!

By the law of large numbers, as you add more samples, then your “estimatd average” will converge to the expected.

Assuming you are using what I call a “gradient approach” for online mean tracking, here is how you’d update your mean value for a particular state, as you get a new sample value (from a new episode):

# Every-Visit vs First-Visit Monte Carlo

What happens if you see the same state twice (or more) in the same episode?

Should you update the mean value once for every time you see the state in the episode, or should you just ignore the subsequent visits to the same state?

If you choose to update the value once for every time you see the state in the episode, we call that “every-visit monte carlo”. If you choose to ignore all subsequent visits to the same state (and thus only update the value for the first time you visit that state in the episode), we call that “first-visit monte carlo”.