This document contains brief notes on machine learning. Kind of like a summary/high level picture of machine learning. Not meant for beginners.

# General

• you have some data, some features of this data are your inputs, one feature is your output
• you want to find a function that “best fits” this data
• give the input features, your function should pretty accurately predict the output feature for a particular data point
• so you want to find a function that “best fits” this data, but can predict future data pretty well too
• this is sometimes called “function approximation”
• because you are finding a function that approximately fits your data
• different ways of representing the function (decision tree, neural network, just an equation, etc)
• different algorithms for finding the function (whether it’s a decision tree, neural network, or equation)
• the algorithms/function representation have trade offs
• in complexity to built the representation
• 2 different algorithms may build the same representation (decision tree, neural network, etc) but have different complexity
• in complexity to classify instances once built
• can the function representation represent all possible functions of your data?
• if it can’t, the representation is termed “restriction biased”
• even if the representation can represent all possible functions of your data, does the algorithm actually consider (i.e. search) all of the representations?
• if it doesn’t, the algorithm is termed “preference biased”
• some representations/algorithms are better for discreet valued features/outputs
• pick a representation/algorithm based on your data (num features, types of features (discreet? real?), statistics of your data)
• once you have built a representation (aka model), how do you judge it’s performance? how “good” is it?
• cross validation - have separate test data, see how your model performs on the test data
• k-fold cross validation - rotate which chunk you use as test data, average the performance
• over fitting occurs when your model is really good at predicting the training data, but it performs poorly on other (i.e. test) data
• one solution to preventing overfitting
• slowly increase the complexity of your model until performance on test data starts to go down

# Decision Tree

• human readable representation for the function
• built using algorithms: ID3 and C4.5
• C4.5 just allows lil things like real valued features
• ID3 algorithm:
• recursively split your data on the attribute that results in the most information gain
• information gain = entropy lost
• there are other “splitting measures” that you can take besides information gain (there is gini impurity and gain ratio)
• if you recursively split until all data chunks are homogenous (i.e. have 0 entropy), you have overfit
• larger trees = more complex function = overfitting
• two general approaches to prevent decision tree overfitting
• stop growing the tree
• over fit the tree then prune it afterwards
• this is the more commonly used approach
• “reduced error pruning”
• evaluate impact of pruning each node (and the children/descendant nodes of course)
• greedily prune the node that results in most increased test performance
• keep doing this until test performance no longer goes up
• this produces the smallest version of the most accurate sub tree
• “rule post pruning”
• make rules from each branch
• prune each rule, as long as it doesn’t decrease test data performance
• arrange rules in order of highest performing (on test data) to lowest
• when classifying, use rules in this order
• decision tree qualities
• human readable representation of the function
• since each split considers the entire data set, you are robust to errors in the data (both wrongly classified data, and data with bad features)
• can handle missing attributes (one approach is just to use the value for other attributes, or an average of them)
• can handle costed attributes
• tries to limit use of high cost attributes (i.e. for most examples, it won’t need to consider the high cost attributes to classify them)
• no restriction bias, but algorithms for building it have preference bias
• shorter trees that put high information gain attributes at root are preferred (because we recursively split on attributes that result in the most info gain, remember?!)
• real world uses
• diagnosing diseases

# Finding Function Parameters

• you can start with an arbitrary function and then find parameters for your function that best fits your data
• you can start with a linear, polynomial, etc function
• you want to find parameters of your function that result in a function that best fits your data
• in other words, you want to find the parameters that result in a function that has the least squared error on predicting your examples
• squared error as opposed to error because squared error gets rid of negatives and penalizes bigger errors more
• alternative metrics to least squared error
• this boils down to a optimization problem (minimization specifically)
• gradient descent can be use to minimize
• stochastic gradient descent can be used to minimize more efficiently
• you can use “logistic regression” to make binary classification (e.g. is this a dog or not?)
• you can break discreet classification into a bunch of binary classifications, for example consider finding it out if a picture is a dog, cat, or mouse
• is it a dog? yes, no?
• is it a cat? yes, no?
• is it a mouse? yes, no?
• actually, it’s more like this
• what is the probability that it is a dog?
• what is the probability that it is a cat?
• what is the probability that it is a mouse?
• take the highest probability?

TODO