Notes on thoughtful machine learning
KNN
Notes:
The example sidesteps the 'curse of dimensionality' by selecting three
features from the data - latitude, longitude, and the square footage of the
lot - which cuts down on the dimensions. It uses Euclidean distance between
houses.
In toying with this, I found the data have other interesting attributes, but
those appear to take at most three distinct values each. I did add
'DevelopmentRightsPurch' to the feature set to test, and it seems to make the
minimum mean absolute error take longer to reach - more than 4 folds before it
falls below 100k.
Summary of Use: Use this to predict a value from a few key attributes shared
with its nearest neighbors, where 'nearest' is determined by Euclidean
distance.
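
A minimal sketch of this kind of nearest-neighbor regression with
scikit-learn. The CSV path and column names (lat, long, sqft_lot, sale_price)
are my own illustrative assumptions, not the repo's actual data file.

# KNN regression sketch; file path and column names are assumptions.
# KNeighborsRegressor uses Euclidean (minkowski, p=2) distance by default.
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv('king_county_houses.csv')   # hypothetical path
X = df[['lat', 'long', 'sqft_lot']]          # three low-dimension features
y = df['sale_price']

knn = KNeighborsRegressor(n_neighbors=5)

# Cross-validated mean absolute error; more folds steady the estimate.
mae = -cross_val_score(knn, X, y, cv=4, scoring='neg_mean_absolute_error')
print(mae.mean())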
Naive Bayes classifier
Notes: I had to change the encoding string used. Other than that, it seems to
work as advertised.
The crossvalidate.py file does two folds, and ultimately the spam trainer and
the author go with the model from the fold that minimizes false positives, so
that legitimate email messages are not incorrectly lost; they favor letting a
small amount of spam through instead.
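
A minimal sketch of that selection rule, assuming each fold yields a 2x2
confusion matrix laid out as [[TN, FP], [FN, TP]] with spam as the positive
class; the layout, numbers, and names are my assumptions, not the repo's code.

# Pick the fold whose model produced the fewest false positives
# (ham wrongly flagged as spam). Layout assumed: [[TN, FP], [FN, TP]].
folds = [
    {'model': 'fold_1_model', 'confusion': [[120, 3], [9, 68]]},  # toy numbers
    {'model': 'fold_2_model', 'confusion': [[118, 5], [4, 73]]},
]

best = min(folds, key=lambda f: f['confusion'][0][1])  # minimize FP count
print(best['model'])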
Summary of Use: Use this to classify when the inputs are (assumed to be)
independent of one another. It produces a probability score for each class and
picks the highest. The 'naive' part is the independence assumption: rather
than modeling the probability of multiple attributes or dimensions occurring
together, it multiplies the individual per-attribute probabilities, which is
far easier to compute. The Bayesian part is applying Bayes' rule to turn those
per-class attribute probabilities into a class probability.
For new information, it assumes any previously unseen attribute has a small
default probability of 1/n, where n is the number of attributes. This ensures
that a new attribute does not zero out the whole product, so the probabilities
can still be calculated.
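
A from-scratch sketch of the technique, showing the independence assumption
and a smoothing term at work; the training data and the +1 smoothing constant
are illustrative assumptions, not the book's exact implementation.

# Hand-rolled naive Bayes for spam/ham. score(class) is the log of
# P(class) * product of P(word | class); toy data, illustrative smoothing.
import math
from collections import Counter

train = [
    ('spam', 'win cash now'),
    ('spam', 'cash prize claim now'),
    ('ham',  'meeting moved to noon'),
    ('ham',  'lunch at noon today'),
]

counts = {'spam': Counter(), 'ham': Counter()}
docs = Counter()
for label, text in train:
    docs[label] += 1
    counts[label].update(text.split())

vocab = {w for c in counts.values() for w in c}

def score(label, text):
    # Log-space avoids underflow; +1 smoothing keeps unseen words from
    # zeroing the product (the 1/n default above plays the same role).
    total = sum(counts[label].values())
    logp = math.log(docs[label] / sum(docs.values()))
    for word in text.split():
        logp += math.log((counts[label][word] + 1) / (total + len(vocab)))
    return logp

msg = 'claim cash now'
print(max(('spam', 'ham'), key=lambda lbl: score(lbl, msg)))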
Decision trees/Random Forests
Notes: To be honest, the output of cross_validate.py is not intuitive. It
prints the output of 'validate', which is the aggregation of all the
confusion matrices. So, I assume looking at those values, and the m
Summary of Uses:
This chapter used three criteria for splitting the data into subcategories:
Information Gain, an entropy-based measure of how much a split reduces
uncertainty; Gini impurity, the probability of mistakenly labeling a randomly
chosen element; and variance reduction, which reduces the dispersion of the
target within each split. Overfitting is controlled through pruning, and the
book also discusses two ensemble techniques - bagging and random forests. The
sample code uses confusion matrices to validate the decision tree, regression,
and random forest models. Random forests fit trees to sub-samples of the data
and average their predictions to pick the best categories for the model. In
this way, the technique can avoid overfitting.
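
A minimal sketch of the splitting criteria and a cross-validated random
forest, using scikit-learn and numpy; the toy data and names are my own
assumptions rather than the repo's code.

# Splitting criteria by hand, plus a random forest validated with a
# confusion matrix aggregated over folds. Toy data, illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

def entropy(labels):
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))    # basis of information gain

def gini(labels):
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)       # chance of mislabeling an element

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
print(entropy(y), gini(y))

# Many trees, each fit to a bootstrap sub-sample; predictions are
# averaged (majority vote for classification).
forest = RandomForestClassifier(n_estimators=100, random_state=0)
preds = cross_val_predict(forest, X, y, cv=4)
print(confusion_matrix(y, preds))     # aggregated over all folds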
Hidden Markov Models
Notes: None of the tests seem to work :(.
Summary of Uses:
There are three stages of using an HMM: evaluation, decoding, and learning.
Evaluation is done in the book with the forward-backward algorithm, which is
basically the process of asking how probable a hidden state is given several
observations. Decoding asks what the most likely sequence of hidden states is
given a sequence of observations (classically done with the Viterbi
algorithm). Learning is the actual prediction part of the algorithm: given the
output of the previous two steps, i.e. the set of highly probable states and
the sequences leading to them, what is the most likely thing to occur in the
future? We do this by figuring out the most probable next observation and the
most probable next state.
HMMs do not rely on a large amount of historical data and can be used to
predict changes in a system over time. They do rely on knowing all the states
and the probabilities of transitions between them. The Markov assumption is
that the probability of any future state depends only on the present state,
not on past states.
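
A minimal sketch of the decoding stage: a from-scratch Viterbi decoder over a
made-up two-state weather model. The states, observations, and probabilities
are toy assumptions, not the book's example.

# Viterbi decoding: most likely hidden-state sequence for an observation
# sequence. All states and probabilities here are toy assumptions.
states = ('Rainy', 'Sunny')
start = {'Rainy': 0.6, 'Sunny': 0.4}
trans = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
         'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
emit  = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
         'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}

def viterbi(obs):
    # prob[s]: probability of the best path ending in state s;
    # path[s]: that path. The Markov assumption means we only need
    # the previous column, never the whole history.
    prob = {s: start[s] * emit[s][obs[0]] for s in states}
    path = {s: [s] for s in states}
    for o in obs[1:]:
        new_prob, new_path = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prob[p] * trans[p][s])
            new_prob[s] = prob[best_prev] * trans[best_prev][s] * emit[s][o]
            new_path[s] = path[best_prev] + [s]
        prob, path = new_prob, new_path
    best = max(states, key=lambda s: prob[s])
    return path[best], prob[best]

print(viterbi(['walk', 'shop', 'clean']))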