Regular Ole Regression Has a Differentiable Objective Function

Remember that in regular ole regression, both the inputs and the output of your data points are real numbers.

You know that in regular ole regression, you find the rate at which your error changes with respect to the model parameters (i.e. you find the gradient of the error with respect to the model parameters) and you move down that gradient (i.e. you move each of your current model parameter values a little in the opposite direction of the corresponding gradient entry). This is possible because your objective function is differentiable.

Conventions:

• $x_1$, $x_2$, $x_3$, etc are input data points
• $x_{11}$, $x_{12}$, $x_{13}$, etc are the 1st, 2nd, and 3rd, etc features of data point 1
• $y_1$, $y_2$, $y_3$, etc are the outputs (actual outputs) of $x_1$, $x_2$, $x_3$, etc
• $p_1$, $p_2$, $p_3$, etc are the predicted outputs
• $w_1$, $w_2$, etc are the weights for the features of a data point

$E = \sum_i{(y_i - p_i)^2}$, where $y_i$ is the actual output of data point $i$ and $p_i$ is the predicted output¹

$p_i$ is a differentiable function, usually just a linear combination of weights (e.g. $w_1*x_{i1} + w_2*x_{i2} + ...$), therefore $E$ is differentiable with respect to any of the model parameters (aka weights)²
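The whole procedure above can be sketched in a few lines of Python. Note that the toy data, learning rate, and step count below are made up purely for illustration:

```python
# A minimal sketch of gradient descent on the squared-error objective
# E = sum_i (y_i - p_i)^2, with a linear model p_i = w1*x_i1 + w2*x_i2 + ...

def predict(w, x):
    # linear combination of weights and features
    return sum(wj * xj for wj, xj in zip(w, x))

def gradient(w, xs, ys):
    # dE/dw_j = sum_i 2 * (p_i - y_i) * x_ij
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = predict(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += 2 * err * xj
    return grad

def fit(xs, ys, lr=0.01, steps=1000):
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        g = gradient(w, xs, ys)
        # move each weight a little in the opposite direction of the gradient
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w

# toy data generated by y = 2*x1 + 3*x2, so fit() should recover w ≈ (2, 3)
xs = [(1, 0), (0, 1), (1, 1), (2, 1)]
ys = [2, 3, 5, 7]
w = fit(xs, ys)
```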

In doing the above, we discovered a little insight:

Your objective function includes your error function as well as your model in it. This is because it uses your model to get predicted values, and then it uses your error function to find the error between the predicted and actual values.

But the main point really was to show that in regular ole regression, your objective function is differentiable, thus you can use gradient descent to efficiently minimize it to find parameters for your model that best fit your data.

Classification Does Not Have a Differentiable Objective Function

In classification, your output is not a real number. Your output is often bi-state (aka boolean, we’ll handle n-state later).

If your model function outputs a boolean, it is not smooth, thus it is not differentiable. Earlier we found out that your model function (along with your error function) is part of your objective function, thus if your model function is non-differentiable, then your objective function is non-differentiable, thus you cannot use gradient descent to efficiently minimize it. I’m sure you can still brute-force minimize it, but that is computationally infeasible.

Thus we need to make our model function smooth. We will make our model function take a linear combination of the features of a data point, and pass it into a logistic (aka sigmoid) function, which will convert it to a real number in the range of 0-1. (This smooth squashing function is sometimes loosely called the “logit function”, but strictly speaking the logit is its inverse.) For really small (aka highly negative) inputs, the logistic function’s output approaches 0. For really large inputs, its output approaches 1, but the transition from 0 to 1 is smooth, thus the function is smooth.
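Here’s a quick sketch of that smooth model function: a linear combination of the features fed through the logistic (sigmoid) function. The example weights are arbitrary, just to show the shape of the thing:

```python
import math

def sigmoid(z):
    # maps any real number smoothly into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    z = sum(wj * xj for wj, xj in zip(w, x))  # linear combination of features
    return sigmoid(z)                          # smooth output between 0 and 1

# highly negative input -> near 0, highly positive -> near 1
print(sigmoid(-10))  # ≈ 0.0000454
print(sigmoid(10))   # ≈ 0.9999546
print(sigmoid(0))    # exactly 0.5, the midpoint of the smooth transition
```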

Ok, so now we have a model function that is differentiable, how about our error function, what should it be? It can be the difference between the predicted output (which is differentiable) and the actual output. The actual output will either be 0 or 1 (since the output is binary). If the actual output is 1, and the predicted is 0.9 (after being passed through the logistic function), then the difference is 0.1, not too shabby. If the actual output is 0, and the predicted is 0.8, the difference is 0.8, much worse. Anyways, you can see how subtracting our logistic-fed model output from the actual output is still a good error measure, and it still keeps our objective function differentiable, so we’ll go with it, homie!
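Those two worked examples, as a quick sanity check. We take the magnitude of the difference here, since predicted minus actual can come out negative:

```python
def error(actual, predicted):
    # magnitude of the gap between the 0/1 label and the 0-to-1 prediction
    return abs(actual - predicted)

print(error(1, 0.9))  # ≈ 0.1, not too shabby
print(error(0, 0.8))  # ≈ 0.8, much worse
```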

Classification With N-State

What if your output is not bi-state (boolean), but say, tri-state (“cat”, “dog”, “mouse”)? Simple. You do binary classification for each state, i.e. a binary classification for “cat” or “not cat”, another for “dog” or “not dog”, and another for “mouse” or “not mouse”. Let’s say you got a 0.2 probability that it is a cat, a 0.9 probability that it is a dog, and a 0.6 probability that it is a mouse. You then normalize these 3 probabilities to make them all sum to 1! That’s it!
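That normalization step is just dividing each score by the total. Using the example numbers above:

```python
def normalize(probs):
    # divide each score by the total so the results sum to 1
    total = sum(probs)
    return [p / total for p in probs]

scores = [0.2, 0.9, 0.6]  # cat, dog, mouse (from 3 one-vs-rest classifiers)
probs = normalize(scores)
print(probs)       # ≈ [0.118, 0.529, 0.353]
print(sum(probs))  # ≈ 1.0, and "dog" is still the clear winner
```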

So, in summary:

• in regular ole regression, your output is real numbers, thus your objective function is differentiable, thus you can efficiently minimize it
• in binary classification, your output is boolean; a function that outputs a boolean is not smooth, thus not differentiable, thus you want your model function to take a linear combination of the features and then pass it to a logistic (sigmoid) function, whose output is smooth.
• to do n-state classification, just do n binary classifications, and then normalize all n outputs so they sum to a probability of 1

¹ We square because we just want the magnitude of the differences. We don’t use absolute value because absolute value is non-differentiable.

² If $p_i$ is differentiable, then everything inside the parentheses is differentiable (because $y_i$ is just a constant), thus $(y_i - p_i)^2$ is differentiable (chain rule), thus the entire summation is differentiable!