In a previous post I talked about the awesomeness of what I called “approaching” methods. An “approaching method” (my own terminology) is a certain way of approaching the solution to a problem. You start with an initial guess, and then move your guess a little closer to each sample that you see. The idea is, if you see enough samples, your guess will approach the true solution! Very cool!

Calculating the mean incrementally using a “gradient” method, as shown in an earlier post, is an “approaching method”.
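As a reminder of what that looks like, here is a minimal sketch of an incremental mean. The function name `incremental_mean` and the step size of `1/n` are my own choices for illustration, and not necessarily the exact form used in the earlier post:

```python
def incremental_mean(samples, initial_guess=0.0):
    """Move the guess a little closer to each sample, using a step of 1/n."""
    guess = initial_guess
    for n, sample in enumerate(samples, start=1):
        step = 1.0 / n                    # assumed step size: shrinks as more samples arrive
        guess += step * (sample - guess)  # move the guess toward the sample
    return guess


print(incremental_mean([2.0, 4.0, 6.0, 8.0]))
```

With a step of `1/n`, the guess after `n` samples equals the ordinary mean of those samples, so the very first sample completely overrides the initial guess.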

Another example, one we’ll discuss in this post, is the delta rule in supervised learning (or regression in statistics).

Recall that in supervised learning, your goal is to find a function that “best fits” some given input-output pairs.

You always start with a skeletal structure for the function (a linear function, a polynomial, etc.) and then tweak the parameters of this function so that the resulting function “best” fits the provided training examples (the input-output pairs).

The delta rule basically says: Start with an initial guess for the parameters (your weight vector, $\vec{w}$). Then every time you see a training sample, move your guess ($\vec{w}$) so that the new $\vec{w}$ predicts the just-seen sample a little better. So again, you are moving your guess a little closer in response to a single sample.

So how exactly do we change our $\vec{w}$ so that our prediction for the just-seen sample becomes better than it currently is?

Well, here is the train of thought.

If our prediction is too low, then we need to change $\vec{w}$ so that the prediction goes up. That means increasing the components of $\vec{w}$ that push the prediction higher. Those are, by definition, the components for which the derivative of the prediction with respect to the component is positive! The derivative of the prediction with respect to a component of $\vec{w}$ tells us whether a positive change in that component will result in a positive change in the prediction!

If our “skeletal function” is linear, i.e. our prediction is just a weighted sum of the components of $\vec{x}$ with $\vec{w}$ as the weights, then the derivative of our prediction with respect to a particular weight $w_i$ is just the coefficient of that weight, which is the matching component $x_i$! In other words, if the x component that has the same index as a given w component is positive, then we should increase that w, because increasing that w will increase the prediction!
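If you want to convince yourself of the derivative claim, here is a small numerical check. It assumes a linear prediction computed as a dot product; the helper names `predict` and `numerical_derivative` and the example numbers are made up for illustration:

```python
def predict(w, x):
    """Linear prediction: the dot product of the weights and the inputs."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def numerical_derivative(w, x, index, eps=1e-6):
    """Estimate d(prediction)/d(w[index]) by nudging a single weight."""
    bumped = list(w)
    bumped[index] += eps
    return (predict(bumped, x) - predict(w, x)) / eps


w = [0.5, -1.0, 2.0]
x = [3.0, -2.0, 0.0]

for i in range(len(w)):
    # The estimated derivative should match x[i] up to rounding error.
    print(i, numerical_derivative(w, x, i), x[i])
```

The estimate for each weight lines up with the matching component of `x`, including the negative one: increasing a weight that sits on a negative input lowers the prediction.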

Let’s put all of this in equation form!

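Our prediction is:

$$P = \vec{w} \cdot \vec{x} = \sum_i w_i x_i$$

Suppose we update our $\vec{w}$ like so (in response to a new sample):

$$\vec{w} \leftarrow \vec{w} + \alpha (A - P)\, \vec{x}$$

where $A$ is the actual value for the sample, $P$ is the prediction for the sample using the current weights, and $\alpha$ is a small positive step size. If we update our weights like this, then notice that what we described above in words actually happens!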
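Here is a rough sketch of that update in code, assuming a linear prediction and the rule $\vec{w} \leftarrow \vec{w} + \alpha (A - P)\vec{x}$. The names `predict` and `delta_update`, the step size `alpha=0.1`, and the toy data are all my own choices for illustration:

```python
def predict(w, x):
    """Linear prediction: the dot product of the weights and the inputs."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def delta_update(w, x, actual, alpha=0.1):
    """Return new weights, nudged so the prediction for x moves toward `actual`."""
    error = actual - predict(w, x)  # this is (A - P)
    return [w_i + alpha * error * x_i for w_i, x_i in zip(w, x)]


# A single update on a single sample.
w = [0.0, 0.0]
x = [1.0, 2.0]
actual = 3.0
before = predict(w, x)
w = delta_update(w, x, actual)
after = predict(w, x)
print(before, after)  # the prediction moves from 0.0 toward 3.0

# Repeating the update over samples generated by a known linear function.
true_w = [2.0, -1.0]  # hypothetical "true" weights used to generate targets
data = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
fitted = [0.0, 0.0]
for _ in range(200):
    for sample in data:
        fitted = delta_update(fitted, sample, predict(true_w, sample))
print(fitted)  # should end up close to true_w
```

The single update moves the prediction from 0.0 to roughly 1.5, which is closer to the actual value of 3.0 but not all the way there. That is the “little closer” part; seeing many samples does the rest.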

When $(A - P)$ is positive, it means we need to change $\vec{w}$ so that it will result in a bigger prediction for the sample. In this case, our expression will increase the components of $\vec{w}$ that will result in increasing the prediction (in particular, the components of $\vec{w}$ whose matching component in the new sample $\vec{x}$ is positive).
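To see that this works for all components at once, here is a quick check of what a single update does to the prediction for the same sample. It assumes a linear prediction $P = \vec{w} \cdot \vec{x}$ and an update of the form $\vec{w}' = \vec{w} + \alpha (A - P)\vec{x}$ with a small positive step size $\alpha$ (the prime notation for "after the update" is my own):

$$P' = \vec{w}' \cdot \vec{x} = \left(\vec{w} + \alpha (A - P)\,\vec{x}\right) \cdot \vec{x} = P + \alpha (A - P)\,\lVert \vec{x} \rVert^2$$

Since $\alpha \lVert \vec{x} \rVert^2$ is never negative, the prediction moves in the direction of $(A - P)$: up when it was too low, down when it was too high. The components of $\vec{w}$ that sit on negative components of $\vec{x}$ get decreased when $(A - P)$ is positive, and that also pushes the prediction up. The remaining error is $A - P' = (A - P)(1 - \alpha \lVert \vec{x} \rVert^2)$, so as long as $\alpha \lVert \vec{x} \rVert^2$ is less than 1 the new prediction is closer to $A$ without overshooting it.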