Regression is when you have a bunch of input-output pairs and you are trying to find an approximate mathematical relationship (i.e. an equation, a function) between your inputs and your outputs. In other words, you are trying to find a function given a bunch of input-output pairs for that function. In yet other words, you want to find the function that “best fits” your input-output pairs.

The general approach is like so: you start by guessing what type of generic relationship might exist (linear, polynomial, etc.), but you don’t have to know the specific parameters of that equation (i.e. the coefficients of the linear combination, etc.). You want to find parameters of your equation such that the resulting specific equation “fits” your input-output pairs the best. Note that you will only find the specific version of your chosen generic equation that best fits the data. You are only looking at how different versions of your generic equation perform on the data, so you are not considering other generic equations at all. There may be other equations that fit your data even better. You are starting with a bias (the picked generic equation).
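To make the generic-vs-specific distinction concrete, here is a minimal sketch (the function name and sample values are made up for illustration): the generic equation is “linear,” and each choice of parameters gives one specific version of it.

```python
# Generic relationship: y = w*x + b (linear). The form is fixed up front
# (that's the bias); the parameters w and b are what regression will find.

def predict(x, w, b):
    """Output of one *specific* version of the generic linear equation."""
    return w * x + b

# Two different specific versions of the same generic equation:
print(predict(2.0, w=1.0, b=0.0))   # → 2.0
print(predict(2.0, w=3.0, b=-1.0))  # → 5.0
```

Regression never questions the linear form itself; it only searches over values of `w` and `b`.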

There are several different ways of quantifying “best fit”. A straightforward way is to look at each of your input-output pairs and see the difference between your actual output and the output predicted by the specific equation (i.e. the equation with a specific set of parameters). The specific equation that results in the least total difference between the actual and predicted values is the “best fit” one.
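As a sketch (the data points and candidate parameters here are made up), you can score two candidate parameter sets by summing the absolute differences between actual and predicted outputs; the smaller total wins:

```python
# Hypothetical (input, actual output) pairs:
pairs = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

def total_difference(pairs, w, b):
    """Sum of |actual - predicted| over all pairs, for the linear model w*x + b."""
    return sum(abs(y - (w * x + b)) for x, y in pairs)

fit_a = total_difference(pairs, w=2.0, b=0.0)  # ≈ 0.4
fit_b = total_difference(pairs, w=0.5, b=1.0)  # ≈ 6.2
# fit_a < fit_b, so (w=2.0, b=0.0) is the better fit of the two.
```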


This boils down to an optimization problem. Given a generic equation, you want to find the parameters (i.e. coefficients) of the equation that maximize some definition of “best fit” (or minimize some definition of “worst fit”). If your criterion for best fit is the least total difference between the actual and predicted outputs, you essentially want to minimize a summation (the sum of the magnitudes of the differences between your actual and predicted outputs — raw differences can cancel each other out).
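Here is a sketch of that optimization done with plain (full-batch) gradient descent, using squared differences as the measure of fit (a common choice, since it is easy to differentiate). The data is synthetic, generated from y = 2x + 1; note that every iteration loops over all N pairs:

```python
# Synthetic data from y = 2x + 1 (so we know the "right" answer).
pairs = [(x / 100, 2.0 * (x / 100) + 1.0) for x in range(100)]

w, b, lr = 0.0, 0.0, 0.005  # start with an arbitrary guess, small step size
for _ in range(5000):
    # Gradient of the summed squared differences, over ALL pairs:
    gw = sum(2 * ((w * x + b) - y) * x for x, y in pairs)
    gb = sum(2 * ((w * x + b) - y) for x, y in pairs)
    w -= lr * gw
    b -= lr * gb

print(round(w, 2), round(b, 2))  # ≈ 2.0 and 1.0
```

Each of the 5000 iterations does 100 subtractions; with millions of data points, that per-iteration cost is exactly what becomes painful.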

You know that you can use gradient descent to minimize a function, but if you have a ton of data (i.e. a ton of input-output pairs), each iteration can be expensive because you will have to do N subtractions (subtract each actual output from the corresponding predicted output). Instead, on each iteration of gradient descent, you just randomly pick one of your data points and do the subtraction for that one alone. The idea is that over many iterations you will likely pick very different data points, so you will get decent results. This modified version of gradient descent is called stochastic gradient descent, because you randomly pick one data point to process each time (instead of processing every data point on every iteration).
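A sketch of stochastic gradient descent for the same kind of linear model (again on made-up data from y = 2x + 1): each iteration samples one (x, y) pair at random and updates the parameters using only that pair’s gradient.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
pairs = [(x / 100, 2.0 * (x / 100) + 1.0) for x in range(100)]  # y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05
for _ in range(20000):
    x, y = random.choice(pairs)  # pick ONE data point at random
    err = (w * x + b) - y        # predicted minus actual, for that point only
    w -= lr * 2 * err * x        # gradient step w.r.t. w, this point only
    b -= lr * 2 * err            # gradient step w.r.t. b, this point only

print(round(w, 1), round(b, 1))  # ≈ 2.0 and 1.0
```

Each update is O(1) in the dataset size rather than O(N); the randomness makes individual steps noisier, but over many iterations they average out.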


A common way to define best fit is by taking the difference between your actual and predicted outputs and then squaring it. The squaring does two things:

  1. larger differences are punished more
  2. all differences become positive

The parameters that result in the smallest sum of squared differences are considered the “best fit”. So you want to find the parameters of your equation that minimize the sum of squared differences.
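Point 1 above is worth seeing concretely. Here are two hypothetical sets of prediction errors with the same total absolute error; squaring penalizes the one big miss more than the two small ones:

```python
errors_spiky = [2.0, 0.0]  # one big miss        (total absolute error: 2.0)
errors_even  = [1.0, 1.0]  # two small misses    (total absolute error: 2.0)

def sum_squared(errors):
    """Sum of squared errors."""
    return sum(e * e for e in errors)

print(sum_squared(errors_spiky))  # → 4.0
print(sum_squared(errors_even))   # → 2.0  (the even errors score better)
```

So under squared error, a fit that spreads its misses evenly beats one with the same total error concentrated in a few points.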