Tutorial on supervised learning using pmtk3
To create a model of type 'foo', use one of the following
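A sketch of the generic calling convention, assuming pmtk3's fooFit / fooFitBayes naming (for example, linregFit and linregFitBayes); this is a template, not runnable code:

```matlab
model = fooFit(X, y, ...)       % ML or MAP (plug-in) estimate of the parameters
model = fooFitBayes(X, y, ...)  % Bayesian inference over the parameters
```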
where '...' refers to optional arguments, and 'foo' is a string such as
* 'linreg' (linear regression)
* 'logreg' (logistic regression, binary and multiclass)
* 'mlp' (multilayer perceptron, aka feedforward neural network)
* 'naiveBayesBer' (NB with Bernoulli features)
* 'naiveBayesGauss' (NB with Gaussian features)
* 'discrimAnalysis' (LDA or QDA)
* 'RDA' (regularized LDA)
Here X is an N*D design matrix, where N is the number of training cases and D is the number of features. y is an N*1 response vector, which can be real-valued (regression), 0/1 or -1/+1 (binary classification), or 1:C (multi-class).
Once the model has been fit, you can make predictions on test data (using a plug-in estimate of the parameters) as follows
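A sketch of the corresponding call, assuming the same fooPredict naming convention:

```matlab
[yhat, py] = fooPredict(model, Xtest)   % plug-in prediction on the test design matrix
```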
Here yhat is an Ntest*1 vector of predicted responses of the same type as ytrain, where Ntest is the number of rows in Xtest. For regression this is the predicted mean; for classification it is the predicted mode. The meaning of py depends on the model, as follows:
* For regression, py is an Ntest*1 vector of predicted variances.
* For binary classification, py is an Ntest*1 vector of the probability of being in class 1.
* For multi-class, py is an Ntest*C matrix, where py(i,c) = p(y=c|Xtest(i,:),params).
Below we consider some specific examples.
We can fit a linear regression model, and use it to make predictions on a test set, as follows
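A minimal sketch, assuming pmtk3's linregFit/linregPredict and training data Xtrain, ytrain:

```matlab
model = linregFit(Xtrain, ytrain);        % maximum likelihood (least squares) fit
[yhat, v] = linregPredict(model, Xtest);  % yhat = predicted means, v = predicted variances
```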
See the following demos for examples of this code in action:
* [http://pmtk3.googlecode.com/svn/trunk/docs/demoOutput/Linear_regression/linregDemo1.html linregDemo1] Fit a linear regression model in 1d.
* [http://pmtk3.googlecode.com/svn/trunk/docs/demoOutput/Linear_regression/linregPolyVsDegree.html linregPolyVsDegree] Fit a polynomial regression model in 1d.
We can fit a logistic regression model, and use it to make predictions on a test set, as follows
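A minimal sketch, assuming pmtk3's logregFit/logregPredict:

```matlab
model = logregFit(Xtrain, ytrain);        % maximum likelihood fit (binary or multiclass)
[yhat, p] = logregPredict(model, Xtest);  % yhat = predicted labels, p = class probabilities
```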
In the binary case, y(i) can be in {-1,+1}, {0,1}, or {1,2}; the predicted value yhat will be converted to the same 'domain' as the training data. In the multiclass case, y(i) should be in {1,...,C} or {0,...,C-1}; again the output will be converted to the same form. The number of classes is inferred from the number of unique values in y. If the training set is very small, you should specify nclasses directly.
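For example (a sketch, assuming nclasses is passed as a name/value option to logregFit):

```matlab
model = logregFit(Xtrain, ytrain, 'nclasses', 3);  % force 3 classes even if ytrain omits some of them
```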
See the following demos for examples of this code in action:
* [http://pmtk3.googlecode.com/svn/trunk/docs/demoOutput/Introduction/logregSATdemo.html logregSATdemo] Fit a binary logistic regression model to 1d data.
Using maximum likelihood to train a model often results in overfitting. So it is very common to use regularization, or MAP estimation, instead. We can use various kinds of prior. The simplest and most widely used is the spherical Gaussian prior, which corresponds to using L2 regularization. We can fit an L2-regularized linear or logistic regression model as follows (where lambda is the strength of the regularizer, or equivalently, the precision of the Gaussian prior):
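A sketch, assuming the 'regType' and 'lambda' name/value options:

```matlab
lambda = 0.1;  % strength of the regularizer = precision of the Gaussian prior
model = linregFit(Xtrain, ytrain, 'regType', 'L2', 'lambda', lambda);  % L2-regularized linear regression
model = logregFit(Xtrain, ytrain, 'regType', 'L2', 'lambda', lambda);  % L2-regularized logistic regression
```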
(In the linear regression context, this technique is known as ridge regression.) See the following demos for examples of this code in action:
* [http://pmtk3.googlecode.com/svn/trunk/docs/demoOutput/Introduction/linregPolyVsReg.html linregPolyVsReg] Show the effect of changing the regularization parameter in ridge regression.
It is very important to choose the strength of the regularizer correctly: if it is too large, the fitted function will underfit, but if it is too small, it will overfit. A very common method for selecting such 'tuning parameters' is cross validation (CV). In pmtk, you can specify a matrix of lambda values (one per row); the fit function will try them all and pick the one that CV estimates as the best. To save you the effort of thinking of a good set of lambda values, you can alternatively specify just the number of values to try, and the fit function will try to come up with a good range. Here are some examples:
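A sketch of the first option (treat the candidate values as placeholders):

```matlab
lambdas = logspace(-3, 2, 10)';  % candidate regularizers, one per row; CV picks the best
model = linregFit(Xtrain, ytrain, 'regType', 'L2', 'lambda', lambdas);
% Alternatively, specify only how many values to try and let the fit
% function choose the range itself, as described above.
```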
You can control how many folds CV uses by passing in 'nfolds'. If you are searching over a 1d or 2d grid of values, you can set 'plotCV' to true to see a plot of the CV-estimated risk as a function of the tuning parameters. If you set 'useSErule' to true, fit will use the one-standard-error rule to choose the best lambda; in this case, fit assumes that the values of lambda are sorted such that the first row is the least complex model, and the last row is the most complex model. For fine-grained control, call the 'cvEstimate' function directly.
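A sketch showing these CV options passed as additional name/value arguments (the exact plumbing may differ from this):

```matlab
model = linregFit(Xtrain, ytrain, 'regType', 'L2', 'lambda', lambdas, ...
                  'nfolds', 5, 'plotCV', true, 'useSErule', true);
```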
See the following demos for examples of this code in action:
* [http://pmtk3.googlecode.com/svn/trunk/docs/demoOutput/Decision_theory/linregCvPolyVsRegDemo.html linregCvPolyVsRegDemo] Compare CV estimate of future error with test error, and illustrate one standard error rule for ridge regression. We already saw an example of this above where we illustrated polynomial regression.
We are free to preprocess the data in any way we choose before fitting the model. In pmtk, you can create a preprocessor (pp) 'object', and then pass it to the fitting function; the pp will be applied to the training data before fitting the model, and will be applied again to the test data. The advantage of this approach is that the pp is stored inside the model, which makes sense, since it is part of the model definition.
One very common form of preprocessing is to standardize the data. This is particularly important when using regularization methods such as L2, since the corresponding prior assumes all the parameters have comparable magnitude, which will only be true if the features have comparable magnitude. In pmtk, we can do this as follows. First create a pp object which standardizes our data, then pass it to the fitting function:
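A sketch, assuming pmtk3's preprocessorCreate with a 'standardizeX' option and the 'preproc' argument to the fitting function:

```matlab
pp = preprocessorCreate('standardizeX', true);     % standardize each feature to zero mean, unit variance
model = linregFit(Xtrain, ytrain, 'preproc', pp);  % the preprocessor is stored inside the model
[yhat, v] = linregPredict(model, Xtest);           % the same transformation is applied to Xtest
```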
A more interesting form of preprocessing is basis function expansion. This replaces the original features with a larger set, thus permitting us to fit nonlinear models. A popular approach is to use kernel functions, and to define the new feature vector as follows: phi(x) = [K(x,mu(1)), ..., K(x,mu(K))], where the mu(j) are 'prototypes' and K(x,x') is a 'kernel function', which in this context just means a function of two arguments. A common example is the Gaussian or RBF kernel, K(x,x') = N(x|x', sigma2), where sigma2 is the 'bandwidth'. Often we take the prototypes to be the training vectors, but we don't have to. Below we show an example of 3 different kernels being used to fit a logistic regression model to data that is not linearly separable in the original feature space:
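A sketch of the RBF case, assuming preprocessorCreate's 'kernelFn' option and the kernelRbfSigma helper; other kernels (e.g. polynomial or linear) can be substituted for the kernel function:

```matlab
sigma = 1;  % RBF bandwidth
pp = preprocessorCreate('kernelFn', @(X1, X2) kernelRbfSigma(X1, X2, sigma));
model = logregFit(Xtrain, ytrain, 'preproc', pp);  % prototypes here are the training vectors
[yhat, p] = logregPredict(model, Xtest);
```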
* [http://pmtk3.googlecode.com/svn/trunk/docs/demoOutput/Introduction/logregXorDemo.html logregXorDemo] Kernelized logistic regression applied to the binary XOR problem.