Notes on thoughtful machine learning
KNN
Notes:
The example sidesteps the 'curse of dimensionality' by selecting three
features from the data - latitude, longitude, and the square footage of the
lot - which cuts down on the dimensions. It uses Euclidean distance between
houses.
In toying with this, I found the data have other interesting attributes, but
those appear to take at most three distinct values each. I did add
'DevelopmentRightsPurch' to the feature set to test, and it seems to make the
minimum mean absolute error take longer to reach - more than 4 folds before it
falls below 100k.
Summary of Use: Use this to predict a value from a few key attributes shared
with its nearest neighbors, where 'nearest' is determined by Euclidean
distance.
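
A minimal sketch of this kind of nearest-neighbor regression with
scikit-learn. The CSV path and column names (lat, long, sqft_lot, sale_price)
are my own illustrative assumptions, not the repo's actual data file.

# KNN regression sketch; file path and column names are assumptions.
# KNeighborsRegressor uses Euclidean (minkowski, p=2) distance by default.
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv('king_county_houses.csv')   # hypothetical path
X = df[['lat', 'long', 'sqft_lot']]          # three low-dimension features
y = df['sale_price']

knn = KNeighborsRegressor(n_neighbors=5)

# Cross-validated mean absolute error; more folds steady the estimate.
mae = -cross_val_score(knn, X, y, cv=4, scoring='neg_mean_absolute_error')
print(mae.mean())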
Naive Bayes classifier
Notes: I had to change the encoding string used. Other than that, it seems to
work as advertised.
The crossvalidate.py file does two folds, and ultimately the spam trainer and
the author go with the model from the fold that minimizes false positives, so
that legitimate email messages are not incorrectly lost; they favor letting a
small amount of spam through instead.
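
A minimal sketch of that selection rule, assuming each fold yields a 2x2
confusion matrix laid out as [[TN, FP], [FN, TP]] with spam as the positive
class; the layout, numbers, and names are my assumptions, not the repo's code.

# Pick the fold whose model produced the fewest false positives
# (ham wrongly flagged as spam). Layout assumed: [[TN, FP], [FN, TP]].
folds = [
    {'model': 'fold_1_model', 'confusion': [[120, 3], [9, 68]]},  # toy numbers
    {'model': 'fold_2_model', 'confusion': [[118, 5], [4, 73]]},
]

best = min(folds, key=lambda f: f['confusion'][0][1])  # minimize FP count
print(best['model'])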
Summary of Use: Use this to classify when the inputs are (assumed to be)
independent of one another. It produces a probability score for each class and
picks the highest. The 'naive' part is the independence assumption: rather
than modeling the probability of multiple attributes or dimensions occurring
together, it multiplies the individual per-attribute probabilities, which is
far easier to compute. The Bayesian part is applying Bayes' rule to turn those
per-class attribute probabilities into a class probability.
For new information, it assumes any previously unseen attribute has a small
default probability of 1/n, where n is the number of attributes. This ensures
that a new attribute does not zero out the whole product, so the probabilities
can still be calculated.
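
A from-scratch sketch of the technique, showing the independence assumption
and a smoothing term at work; the training data and the +1 smoothing constant
are illustrative assumptions, not the book's exact implementation.

# Hand-rolled naive Bayes for spam/ham. score(class) is the log of
# P(class) * product of P(word | class); toy data, illustrative smoothing.
import math
from collections import Counter

train = [
    ('spam', 'win cash now'),
    ('spam', 'cash prize claim now'),
    ('ham',  'meeting moved to noon'),
    ('ham',  'lunch at noon today'),
]

counts = {'spam': Counter(), 'ham': Counter()}
docs = Counter()
for label, text in train:
    docs[label] += 1
    counts[label].update(text.split())

vocab = {w for c in counts.values() for w in c}

def score(label, text):
    # Log-space avoids underflow; +1 smoothing keeps unseen words from
    # zeroing the product (the 1/n default above plays the same role).
    total = sum(counts[label].values())
    logp = math.log(docs[label] / sum(docs.values()))
    for word in text.split():
        logp += math.log((counts[label][word] + 1) / (total + len(vocab)))
    return logp

msg = 'claim cash now'
print(max(('spam', 'ham'), key=lambda lbl: score(lbl, msg)))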
Decision trees/Random Forests
Notes: To be honest, the output of cross_validate.py is not intuitive. It
prints the output of 'validate', which is the aggregation of all the
confusion matrices. So, I assume looking at those values, and the m
Summary of Uses:
This chapter used three criteria for splitting the data into subcategories:
Information Gain, an entropy-based measure of how much a split reduces
uncertainty; Gini impurity, the probability of mistakenly labeling a randomly
chosen element; and variance reduction, which reduces the dispersion of the
target within each split. Overfitting is controlled through pruning, and the
book also discusses two ensemble techniques - bagging and random forests. The
sample code uses confusion matrices to validate the decision tree, regression,
and random forest models. Random forests fit trees to sub-samples of the data
and average their predictions to pick the best categories for the model. In
this way, the technique can avoid overfitting.
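
A minimal sketch of the splitting criteria and a cross-validated random
forest, using scikit-learn and numpy; the toy data and names are my own
assumptions rather than the repo's code.

# Splitting criteria by hand, plus a random forest validated with a
# confusion matrix aggregated over folds. Toy data, illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

def entropy(labels):
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))    # basis of information gain

def gini(labels):
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)       # chance of mislabeling an element

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
print(entropy(y), gini(y))

# Many trees, each fit to a bootstrap sub-sample; predictions are
# averaged (majority vote for classification).
forest = RandomForestClassifier(n_estimators=100, random_state=0)
preds = cross_val_predict(forest, X, y, cv=4)
print(confusion_matrix(y, preds))     # aggregated over all folds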
Hidden Markov Models
Notes: None of the tests seem to work :(.
Summary of Uses:
There are three stages of using an HMM: evaluation, decoding, and learning.
Evaluation is done in the book with the forward-backward algorithm, which is
basically the process of asking how probable a hidden state is given several
observations. Decoding asks what the most likely sequence of hidden states is
given a sequence of observations (classically done with the Viterbi
algorithm). Learning is the actual prediction part of the algorithm: given the
output of the previous two steps, i.e. the set of highly probable states and
the sequences leading to them, what is the most likely thing to occur in the
future? We do this by figuring out the most probable next observation and the
most probable next state.
HMMs do not rely on a large amount of historical data and can be used to
predict changes in a system over time. They do rely on knowing all the states
and the probabilities of transitions between them. The Markov assumption is
that the probability of any future state depends only on the present state,
not on past states.
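
A minimal sketch of the decoding stage: a from-scratch Viterbi decoder over a
made-up two-state weather model. The states, observations, and probabilities
are toy assumptions, not the book's example.

# Viterbi decoding: most likely hidden-state sequence for an observation
# sequence. All states and probabilities here are toy assumptions.
states = ('Rainy', 'Sunny')
start = {'Rainy': 0.6, 'Sunny': 0.4}
trans = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
         'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
emit  = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
         'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}

def viterbi(obs):
    # prob[s]: probability of the best path ending in state s;
    # path[s]: that path. The Markov assumption means we only need
    # the previous column, never the whole history.
    prob = {s: start[s] * emit[s][obs[0]] for s in states}
    path = {s: [s] for s in states}
    for o in obs[1:]:
        new_prob, new_path = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prob[p] * trans[p][s])
            new_prob[s] = prob[best_prev] * trans[best_prev][s] * emit[s][o]
            new_path[s] = path[best_prev] + [s]
        prob, path = new_prob, new_path
    best = max(states, key=lambda s: prob[s])
    return path[best], prob[best]

print(viterbi(['walk', 'shop', 'clean']))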