This document contains brief notes on machine learning. Kind of like a summary/high level picture of machine learning. Not meant for beginners.

General

you have some data, some features of this data are your inputs, one feature is your output
you want to find a function that “best fits” this data
- give the input features, your function should pretty accurately predict the output feature for a particular data point
so you want to find a function that “best fits” this data, but can predict future data pretty well too
- this is sometimes called “function approximation”
  - because you are finding a function that approximately fits your data
different ways of representing the function (decision tree, neural network, just an equation, etc)
different algorithms for finding the function (whether it’s a decision tree, neural network, or equation)
the algorithms/function representation have trade offs
- in complexity to built the representation
  - 2 different algorithms may build the same representation (decision tree, neural network, etc) but have different complexity
- in complexity to classify instances once built
- can the function representation represent all possible functions of your data?
  - if it can’t, the representation is termed “restriction biased”
- even if the representation can represent all possible functions of your data, does the algorithm actually consider (i.e. search) all of the representations?
  - if it doesn’t, the algorithm is termed “preference biased”
- some representations/algorithms are better for discreet valued features/outputs
- pick a representation/algorithm based on your data (num features, types of features (discreet? real?), statistics of your data)
once you have built a representation (aka model), how do you judge it’s performance? how “good” is it?
- cross validation - have separate test data, see how your model performs on the test data
- k-fold cross validation - rotate which chunk you use as test data, average the performance
over fitting occurs when your model is really good at predicting the training data, but it performs poorly on other (i.e. test) data
- one solution to preventing overfitting
  - slowly increase the complexity of your model until performance on test data starts to go down

Decision Tree

human readable representation for the function
built using algorithms: ID3 and C4.5
- C4.5 just allows lil things like real valued features
ID3 algorithm:
- recursively split your data on the attribute that results in the most information gain
  - information gain = entropy lost
  - there are other “splitting measures” that you can take besides information gain (there is gini impurity and gain ratio)
if you recursively split until all data chunks are homogenous (i.e. have 0 entropy), you have overfit
- larger trees = more complex function = overfitting
- two general approaches to prevent decision tree overfitting
  - stop growing the tree
  - over fit the tree then prune it afterwards
    - this is the more commonly used approach
      - “reduced error pruning”
        
        evaluate impact of pruning each node (and the children/descendant nodes of course)
        
        greedily prune the node that results in most increased test performance
        
        keep doing this until test performance no longer goes up
        
        this produces the smallest version of the most accurate sub tree
      - “rule post pruning”
        
        make rules from each branch
        
        prune each rule, as long as it doesn’t decrease test data performance
        
        arrange rules in order of highest performing (on test data) to lowest
        
        when classifying, use rules in this order
decision tree qualities
- human readable representation of the function
- since each split considers the entire data set, you are robust to errors in the data (both wrongly classified data, and data with bad features)
- can handle missing attributes (one approach is just to use the value for other attributes, or an average of them)
- can handle costed attributes
  - tries to limit use of high cost attributes (i.e. for most examples, it won’t need to consider the high cost attributes to classify them)
- no restriction bias, but algorithms for building it have preference bias
  - shorter trees that put high information gain attributes at root are preferred (because we recursively split on attributes that result in the most info gain, remember?!)
real world uses
- diagnosing diseases

Finding Function Parameters

you can start with an arbitrary function and then find parameters for your function that best fits your data
you can start with a linear, polynomial, etc function
you want to find parameters of your function that result in a function that best fits your data
- in other words, you want to find the parameters that result in a function that has the least squared error on predicting your examples
  - squared error as opposed to error because squared error gets rid of negatives and penalizes bigger errors more
  - alternative metrics to least squared error
this boils down to a optimization problem (minimization specifically)
gradient descent can be use to minimize
stochastic gradient descent can be used to minimize more efficiently
you can use “logistic regression” to make binary classification (e.g. is this a dog or not?)
- you can break discreet classification into a bunch of binary classifications, for example consider finding it out if a picture is a dog, cat, or mouse
  - is it a dog? yes, no?
  - is it a cat? yes, no?
  - is it a mouse? yes, no?
- actually, it’s more like this
  - what is the probability that it is a dog?
  - what is the probability that it is a cat?
  - what is the probability that it is a mouse?
  - take the highest probability?

Neural Network

TODO