Description

Gradient Descent is a numerical method for minimizing a scaler valued multivariate function.

When you want to minimize a function, you want to find the values of the parameters that give you the smallest output.

Gradient Descent says, start at some arbitrary parameter values, calculate your gradient at these values, then move your parameter values a little bit in the opposite direction of the gradient (i.e. climb down). Now, calculate a new gradient based on your new parameters, and move your parameters a little in the opposite direction of the new gradient. Keep doing this until some termination condition. Typically, your termination condition will be when your output (the function’s value at these parameters) stops changing by much, but some people just put a limit on the number of moves.

Algorithm

you have a function; now find the gradient of your function (symbolically)

i.e. $\vec{\nabla f} = [\frac{d}{d p_1},\frac{d}{d p_2},...]$

initialize your parameters to arbitrary values

calculate (evaluate) the gradient at current parameter values

move your current parameter values a little in the opposite direction of the gradient

i.e. for each parameter $p$ , $p = p + \alpha (-\frac{d}{dp})$

or, if you wanna think of it in vector form, $\vec{p} = \vec{p} + \alpha (- \vec{\nabla f)}$

go to step 3

You keep doing this until your output stops changing by much, or you can just do a certain number of iterations and call it a day!

Note: $\alpha$ just controls the size of the steps you take in the direction (opposite) of the gradient.

Concrete Example

You have the function $z = x^2 + y^2$ . We already know that the min of this will be at (x = 0, y = 0), but let’s see how gradient descent moves towards this.

Our derivitive wrt x is 2x, and wrt y is 2y, thus our gradient (symbolically) is $\vec{\nabla f} = [2x,2y]$
We will initialize our parameters arbitrarly to (x = 10, y = 10)
iteratively move in (opposite) direciton of gradient
1. gradient at our current parameters is (2(10), 2(10)), or (20,20)
2. move parameters a little in the opposite direction of the gradient:
  - new_params = old_params + K * -gradient
  - K is some constant (the bigger K is the bigger steps you move, we called k “alpha” above) - <new_x,new_y> = <old_x,old_y> + K * -<gradient_x,gradient_y> - <new_x,new_y> = <10,10> + 0.1 * -<20,20> - <new_x,new_y> = <8,8>

Ya see! We moved from (10,10), to (8,8). If we do more iterations, we will keep approaching (0,0), which is the true min of this function!

Some Notes

When you want to maximize the function, we call it gradient ascent, and instead of moving in the direciton opposite to the gradient, you move in the direction of the gradient! Simple!
Gradient descent will find a local minima. To try to find a global minima, start at a bunch of different parameter values, do gradient descent on each of them, and then take the one that results in the lowest output.
Gradient descent is often used to minimize the error function when doing regression. To be more precise, a slightly modified version of gradient descent is used, called stochastic gradient descent, which is substantially computationally cheaper.