This document contains brief notes on machine learning. Kind of like a summary/high level picture of machine learning. Not meant for beginners.

General

  • you have some data, some features of this data are your inputs, one feature is your output
  • you want to find a function that “best fits” this data
    • give the input features, your function should pretty accurately predict the output feature for a particular data point
  • so you want to find a function that “best fits” this data, but can predict future data pretty well too
    • this is sometimes called “function approximation”
      • because you are finding a function that approximately fits your data
  • different ways of representing the function (decision tree, neural network, just an equation, etc)
  • different algorithms for finding the function (whether it’s a decision tree, neural network, or equation)
  • the algorithms/function representation have trade offs
    • in complexity to built the representation
      • 2 different algorithms may build the same representation (decision tree, neural network, etc) but have different complexity
    • in complexity to classify instances once built
    • can the function representation represent all possible functions of your data?
      • if it can’t, the representation is termed “restriction biased”
    • even if the representation can represent all possible functions of your data, does the algorithm actually consider (i.e. search) all of the representations?
      • if it doesn’t, the algorithm is termed “preference biased”
    • some representations/algorithms are better for discreet valued features/outputs
    • pick a representation/algorithm based on your data (num features, types of features (discreet? real?), statistics of your data)
  • once you have built a representation (aka model), how do you judge it’s performance? how “good” is it?
    • cross validation - have separate test data, see how your model performs on the test data
    • k-fold cross validation - rotate which chunk you use as test data, average the performance
  • over fitting occurs when your model is really good at predicting the training data, but it performs poorly on other (i.e. test) data
    • one solution to preventing overfitting
      • slowly increase the complexity of your model until performance on test data starts to go down

Decision Tree

  • human readable representation for the function
  • built using algorithms: ID3 and C4.5
    • C4.5 just allows lil things like real valued features
  • ID3 algorithm:
    • recursively split your data on the attribute that results in the most information gain
      • information gain = entropy lost
      • there are other “splitting measures” that you can take besides information gain (there is gini impurity and gain ratio)
  • if you recursively split until all data chunks are homogenous (i.e. have 0 entropy), you have overfit
    • larger trees = more complex function = overfitting
    • two general approaches to prevent decision tree overfitting
      • stop growing the tree
      • over fit the tree then prune it afterwards
        • this is the more commonly used approach
          • “reduced error pruning”
            • evaluate impact of pruning each node (and the children/descendant nodes of course)
            • greedily prune the node that results in most increased test performance
            • keep doing this until test performance no longer goes up
            • this produces the smallest version of the most accurate sub tree
          • “rule post pruning”
            • make rules from each branch
            • prune each rule, as long as it doesn’t decrease test data performance
            • arrange rules in order of highest performing (on test data) to lowest
              • when classifying, use rules in this order
  • decision tree qualities
    • human readable representation of the function
    • since each split considers the entire data set, you are robust to errors in the data (both wrongly classified data, and data with bad features)
    • can handle missing attributes (one approach is just to use the value for other attributes, or an average of them)
    • can handle costed attributes
      • tries to limit use of high cost attributes (i.e. for most examples, it won’t need to consider the high cost attributes to classify them)
    • no restriction bias, but algorithms for building it have preference bias
      • shorter trees that put high information gain attributes at root are preferred (because we recursively split on attributes that result in the most info gain, remember?!)
  • real world uses
    • diagnosing diseases

Finding Function Parameters

  • you can start with an arbitrary function and then find parameters for your function that best fits your data
  • you can start with a linear, polynomial, etc function
  • you want to find parameters of your function that result in a function that best fits your data
    • in other words, you want to find the parameters that result in a function that has the least squared error on predicting your examples
      • squared error as opposed to error because squared error gets rid of negatives and penalizes bigger errors more
      • alternative metrics to least squared error
  • this boils down to a optimization problem (minimization specifically)
  • gradient descent can be use to minimize
  • stochastic gradient descent can be used to minimize more efficiently
  • you can use “logistic regression” to make binary classification (e.g. is this a dog or not?)
    • you can break discreet classification into a bunch of binary classifications, for example consider finding it out if a picture is a dog, cat, or mouse
      • is it a dog? yes, no?
      • is it a cat? yes, no?
      • is it a mouse? yes, no?
    • actually, it’s more like this
      • what is the probability that it is a dog?
      • what is the probability that it is a cat?
      • what is the probability that it is a mouse?
      • take the highest probability?

Neural Network

TODO