# Abdullah's Machine Learning Notes

**Contents**:

- Math/Machine Learning Terminology
- Rule Post Pruning Summary
- Find-S Summary
- Candidate Elimination Summary
- Decision Trees Summary
- Stochastic Gradient Descent Summary
- Delta Training Rule Summary
- Perceptron Training Rule Summary
- A Summary Of Multilayer Neural Networks

This document contains brief notes on machine learning. Kind of like a summary/high level picture of machine learning. Not meant for beginners.

# General

- you have some data, some features of this data are your inputs, one feature is your output
- you want to find a function that “best fits” this data
- given the input features, your function should pretty accurately predict the output feature for a particular data point

- so you want to find a function that “best fits” this data, but can predict future data pretty well too
- this is sometimes called “function approximation”
- because you are finding a function that approximately fits your data

- different ways of representing the function (decision tree, neural network, just an equation, etc)
- different algorithms for finding the function (whether it’s a decision tree, neural network, or equation)
- the algorithms/function representation have trade offs
- in complexity to build the representation
- 2 different algorithms may build the same representation (decision tree, neural network, etc) but have different complexity

- in complexity to classify instances once built
- can the function representation represent all possible functions of your data?
- if it can’t, the representation is termed “restriction biased”

- even if the representation can represent all possible functions of your data, does the *algorithm* actually consider (i.e. search) all of the representations?
- if it doesn’t, the algorithm is termed “preference biased”

- some representations/algorithms are better for discrete valued features/outputs
- pick a representation/algorithm based on your data (num features, types of features (discrete? real?), statistics of your data)

- once you have built a representation (aka model), how do you judge its performance? how “good” is it?
- cross validation - have separate test data, see how your model performs on the test data
- k-fold cross validation - rotate which chunk you use as test data, average the performance
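The k-fold idea above can be sketched in a few lines. Here `train_fn` and `eval_fn` are hypothetical stand-ins for whatever learning algorithm and performance metric you're using:

```python
import random

def k_fold_cross_validation(data, k, train_fn, eval_fn):
    """Split data into k chunks, rotate which chunk is the test set,
    and average the performance across all k runs."""
    data = list(data)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k roughly equal chunks
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)
        scores.append(eval_fn(model, test))
    return sum(scores) / k  # average performance over the k folds
```

Every data point gets used as test data exactly once, which makes the averaged score less sensitive to a lucky or unlucky train/test split.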

- overfitting occurs when your model is really good at predicting the training data, but it performs poorly on other (i.e. test) data
- one solution to preventing overfitting
- slowly increase the complexity of your model until performance on *test* data starts to go down

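A concrete sketch of that idea, assuming numpy is available and using polynomial degree as the stand-in for "model complexity":

```python
import numpy as np

def grow_until_overfit(x_train, y_train, x_test, y_test, max_degree=8):
    """Fit polynomials of increasing degree; stop once test error rises."""
    best_coeffs, best_err = None, float("inf")
    for degree in range(1, max_degree + 1):
        coeffs = np.polyfit(x_train, y_train, degree)
        test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        if test_err > best_err:
            break  # test performance started going down: stop growing
        best_coeffs, best_err = coeffs, test_err
    return best_coeffs, best_err
```

Training error keeps dropping as the degree grows, so it's specifically the *test* error that tells you when to stop.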

# Decision Tree

- human readable representation for the function
- built using algorithms: ID3 and C4.5
- C4.5 just adds little things like handling real valued features

- ID3 algorithm:
- recursively split your data on the attribute that results in the most information gain
- information gain = entropy lost
- there are other “splitting measures” you can use besides information gain (e.g. gini impurity and gain ratio)
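Entropy and information gain can be computed directly. A minimal sketch, with examples stored as dicts (the `label` key name is just a choice of this sketch):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum over classes of p * log2(p), where p is the class proportion."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute, label):
    """Entropy of the whole set minus the weighted entropy after splitting
    on `attribute` -- i.e. how much entropy the split removes."""
    total = len(examples)
    before = entropy([e[label] for e in examples])
    after = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[label] for e in examples if e[attribute] == value]
        after += len(subset) / total * entropy(subset)
    return before - after
```

ID3 just evaluates `information_gain` for every remaining attribute at each node and splits on the winner.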

- if you recursively split until all data chunks are homogeneous (i.e. have 0 entropy), you have overfit
- larger trees = more complex function = overfitting
- two general approaches to prevent decision tree overfitting
- stop growing the tree
- over fit the tree then prune it afterwards
- this is the more commonly used approach
- “reduced error pruning”
- evaluate the impact of pruning each node (and its children/descendant nodes of course)
- greedily prune the node whose removal most increases test performance
- keep doing this until test performance no longer goes up
- this produces the smallest version of the most accurate subtree
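A minimal sketch of reduced error pruning over a toy tree structure. The dict-based tree format here is invented for illustration, and `validation` plays the role of the held-out test set:

```python
import copy

# A tree is either a class label (a leaf) or a dict:
#   {"attr": attribute, "majority": most_common_label_here,
#    "children": {attribute_value: subtree}}

def classify(tree, example):
    while isinstance(tree, dict):
        tree = tree["children"].get(example[tree["attr"]], tree["majority"])
    return tree

def accuracy(tree, examples):
    return sum(classify(tree, e) == e["cls"] for e in examples) / len(examples)

def internal_nodes(tree, path=()):
    """Yield the path (sequence of attribute values) to every internal node."""
    if isinstance(tree, dict):
        yield path
        for value, child in tree["children"].items():
            yield from internal_nodes(child, path + (value,))

def collapse(tree, path):
    """Copy of the tree with the node at `path` replaced by its majority leaf."""
    tree = copy.deepcopy(tree)
    if not path:
        return tree["majority"]
    parent = tree
    for value in path[:-1]:
        parent = parent["children"][value]
    parent["children"][path[-1]] = parent["children"][path[-1]]["majority"]
    return tree

def reduced_error_prune(tree, validation):
    """Greedily prune whichever node most improves validation accuracy,
    stopping once no single prune improves it further."""
    while isinstance(tree, dict):
        best_score, best_tree = accuracy(tree, validation), None
        for path in internal_nodes(tree):
            candidate = collapse(tree, path)
            score = accuracy(candidate, validation)
            if score > best_score:
                best_score, best_tree = score, candidate
        if best_tree is None:
            return tree
        tree = best_tree
    return tree
```

Each pruning step replaces a subtree with the majority label of the examples that reached it, so a split that only fit noise in the training data gets undone.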

- “rule post pruning”
- make rules from each branch
- prune each rule, as long as it doesn’t decrease test data performance
- arrange rules in order of highest performing (on test data) to lowest
- when classifying, use rules in this order
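A sketch of pruning one rule's preconditions against held-out data. Rules here are (conditions, label) pairs; the format is invented for illustration:

```python
def rule_accuracy(rule, examples):
    """Accuracy of `rule` over the examples its conditions match."""
    conditions, label = rule
    matched = [e for e in examples if all(e.get(a) == v for a, v in conditions)]
    if not matched:
        return 0.0
    return sum(e["cls"] == label for e in matched) / len(matched)

def post_prune_rule(rule, validation):
    """Drop any precondition whose removal doesn't decrease validation accuracy."""
    conditions, label = rule
    conditions = list(conditions)
    best = rule_accuracy((conditions, label), validation)
    for cond in list(conditions):
        trimmed = [c for c in conditions if c != cond]
        score = rule_accuracy((trimmed, label), validation)
        if score >= best:
            conditions, best = trimmed, score
    return (conditions, label)
```

Because rules are pruned independently, a condition near the root can be dropped from one rule while being kept in another, which a whole-tree prune can't do.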


- decision tree qualities
- human readable representation of the function
- since each split considers the entire data set, you are robust to errors in the data (both wrongly classified data, and data with bad features)
- can handle missing attributes (one approach is to use the most common value of that attribute among the other examples, or an average of them)
- can handle attributes that have costs
- tries to limit use of high cost attributes (i.e. for most examples, it won’t need to consider the high cost attributes to classify them)

- no restriction bias, but algorithms for building it have preference bias
- shorter trees that put high information gain attributes at root are preferred (because we recursively split on attributes that result in the most info gain, remember?!)

- real world uses
- diagnosing diseases

# Finding Function Parameters

- you can start with an arbitrary function and then find parameters for your function that best fit your data
- you can start with a linear function, a polynomial, etc.
- you want to find parameters of your function that result in a function that best fits your data
- in other words, you want to find the parameters that result in a function that has the least squared error on predicting your examples
- squared error as opposed to error because squared error gets rid of negatives and penalizes bigger errors more
- alternative metrics to least squared error

- this boils down to an optimization problem (minimization, specifically)
- gradient descent can be used to minimize
- stochastic gradient descent can be used to minimize more efficiently
- you can use “logistic regression” for binary classification (e.g. is this a dog or not?)
- you can break discrete classification into a bunch of binary classifications, for example consider figuring out whether a picture is a dog, cat, or mouse
- is it a dog? yes, no?
- is it a cat? yes, no?
- is it a mouse? yes, no?
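The least-squared-error minimization described above, sketched as stochastic gradient descent on a one-feature linear model (the learning rate and epoch count are arbitrary choices for this toy):

```python
import random

def sgd_linear(examples, lr=0.01, epochs=500, seed=0):
    """Fit y ~ w*x + b by minimizing squared error, updating after
    every single example (that's the "stochastic" part)."""
    rng = random.Random(seed)
    examples = list(examples)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y in examples:
            error = (w * x + b) - y
            # gradient of (1/2)*error^2: d/dw = error*x, d/db = error
            w -= lr * error * x
            b -= lr * error
    return w, b
```

Plain (batch) gradient descent would sum the gradient over all examples before each update; updating per example is noisier but usually much cheaper per step.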

- actually, it’s more like this
- what is the probability that it is a dog?
- what is the probability that it is a cat?
- what is the probability that it is a mouse?
- take the class with the highest probability
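The probability-per-class idea can be sketched with one binary logistic model per class. The weights below are made up, standing in for models that have already been trained:

```python
from math import exp

def sigmoid(z):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1 / (1 + exp(-z))

def predict_class(features, models):
    """One binary logistic model per class; pick the most probable class.
    `models` maps class name -> (weights, bias), assumed already trained."""
    probs = {
        cls: sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        for cls, (weights, bias) in models.items()
    }
    return max(probs, key=probs.get), probs
```

Each model only answers its own yes/no question (“is it a dog?”), and the comparison across models turns those answers into a single multi-class prediction.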


# Neural Network

TODO