General concept of data science: Extracting knowledge and insights from data. There is a huge variety of methods for doing so, ranging from basic statistics to deep neural networks.
Machine learning systems learn how to combine input to produce useful predictions on never-before-seen data.
General concepts of machine learning:
- Labels: A label is the thing we're predicting: the y variable in simple linear regression (y = ax + b). The label could be the future price of wheat, the kind of animal shown in a picture, the meaning of an audio clip, or just about anything.
- Features: A feature is an input variable: the x variable in simple linear regression (y = ax + b). A simple machine learning project might use a single feature, while a more sophisticated machine learning project could use millions of features, specified as x1, x2, ..., xn.
- Models: A model defines the relationship between features and label: ax + b in simple linear regression (y = ax + b). For example, a spam detection model might associate certain features strongly with "spam". Let's highlight two phases of a model's life. Training means creating or learning the model. That is, you show the model labeled examples and let it gradually learn the relationship between features and label. In the case of linear regression, training consists in finding "good" values for a and b, the parameters of the model. Inference means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (y'). For example, during inference, you can predict whether a new incoming mail is spam or not.
- Supervised learning: In supervised learning, we model the relationship between input and output. Depending on the type of values predicted by the model, we call it either a regression model or a classification model. A regression model predicts continuous values. For example, regression models make predictions that answer questions like "What is the value of a house in California?" or "What is the probability that a user will click on this ad?". A classification model predicts discrete values. For example, classification models make predictions that answer questions like "Is a given message spam or not spam?" or "Is this an image of a dog, a cat, or a hamster?".
- Unsupervised learning: In unsupervised learning, we model the features of a dataset without reference to any label. It is often described as "letting the dataset speak for itself". This type of machine learning includes tasks such as clustering and dimensionality reduction. Clustering algorithms identify distinct groups of data, while dimensionality reduction algorithms search for more succinct representations of the data.
- There exist more learning paradigms, such as reinforcement learning (AlphaGo), semi-supervised learning (a hybrid between supervised and unsupervised learning), and self-supervised learning (a kind of unsupervised learning in which we use the data itself to create a "virtual" supervision; an example of a self-supervised model would be a learned compressor-decompressor).
Deep learning is a family of machine learning methods based on artificial neural networks. Artificial neural networks are a class of computational models vaguely inspired by the biological neural networks that constitute animal brains. We will go briefly over the history of this discipline and then study the example of linear regression. While not a neural network per se, understanding all of its concepts should give a very solid base for studying deep neural networks.
A short history of deep learning (from Wikipedia):
- 1951: First neural network machine by Marvin Minsky
- 1957: Invention of the Perceptron by Frank Rosenblatt
- 1970: Automatic differentiation by Seppo Linnainmaa
- 1980: Neocognitron, the inspiration for convolutional neural networks by Kunihiko Fukushima
- 1982: Hopfield networks, a type of recurrent neural network, by John Hopfield
- 1997: Long short-term memory recurrent neural networks by Sepp Hochreiter and Jürgen Schmidhuber
- 1998: Convolutional neural networks applied to MNIST, a digit image classification dataset, by Yann LeCun et al.
Inspired by ML crash course
Finding "good" parameters a
and b
of a line y = ax + b
to
approximate a set of point is an example of machine learning. The set
of points (x, y)
is the dataset or training data of the
problem. In this case, the model is a Linear Regression (a line of
equation y = ax + b
). Finding "good" parameters for a
and b
using labelled examples is the training process, optimization or
learning of the model and to define what being a "good model" means
we will need a loss function.
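To make this concrete, here is a minimal Python sketch of such a setup; the toy data, the noise level, and the "true" parameters (2 and 1) are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: 100 points scattered around the line y = 2x + 1.
x = rng.uniform(0.0, 10.0, size=100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=100)

def prediction(x, a, b):
    """The linear regression model: y' = ax + b."""
    return a * x + b
```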
The loss function is used to evaluate a model. It penalizes bad predictions. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of parameters (in our case a and b) that have low loss, on average, across all examples.
Notice that the arrows in the left plot are much longer than their counterparts in the right plot. Clearly, the line in the right plot is a much better predictive model than the line in the left plot.
You might be wondering whether you could create a mathematical function - a loss function - that would aggregate the individual losses in a meaningful fashion.
The linear regression models we'll examine here use a loss function called squared loss (also known as L2 loss). The squared loss of a model y' = ax + b for a single example (x, y) is as follows:
- = the square of the difference between the label and the prediction
- = (observation - prediction(x))^2
- = (y - y')^2
- = (y - (ax + b))^2
- = (y - ax - b)^2
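Translated directly into Python, reusing the prediction helper from the sketch above:

```python
def squared_loss(x, y, a, b):
    """Squared (L2) loss of the model y' = ax + b on a single example (x, y)."""
    return (y - prediction(x, a, b)) ** 2
```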
Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:
MSE = (1/N) * Σ_{(x, y) ∈ D} (y - prediction(x))^2

where:
- (x, y) is an example in which x is the set of features (for example, chirps/minute, age, gender) that the model uses to make predictions and y is the example's label (for example, temperature).
- prediction(x) is a function of the weights and bias in combination with the set of features x. In our case, prediction(x) = ax + b.
- D is a data set containing many labeled examples, which are (x, y) pairs.
- N is the number of examples in D.
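A direct, vectorized translation of this definition, continuing the same toy sketch (np.mean performs the sum-then-divide in one step):

```python
def mse(x, y, a, b):
    """Mean squared error over the whole dataset D of N examples."""
    return np.mean((y - prediction(x, a, b)) ** 2)
```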
Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances. Because of the squaring operation, a single large difference between a prediction and a label is penalized more heavily than many smaller ones.
A high loss value signifies that the model's predictions are poor approximations of the labels. Conversely, a small loss value means that our model captures the structure of the data. Now that we know this, we will take a look at algorithms designed to lower the loss of a specific model on a specific dataset by modifying its parameters.
The procedure that we will use to learn our model is iterative. We start with a random guess for each parameter of our model (here a and b), we compute the loss to evaluate how good our current parameters are, and, using this loss, we compute an update hoping to lower the loss on the next iteration.
The following figure summarizes the process:
To be able to apply this procedure to our problem, we have to find a way to compute parameter updates.
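As a sketch, the overall loop looks like this in Python; the gradient computation is left as a parameter (grad_fn) because deriving it is exactly the subject of the next sections, and the step count and learning rate are arbitrary illustrative defaults:

```python
def train(x, y, grad_fn, num_steps=1000, learning_rate=0.01):
    """Iterative training: guess, evaluate, update, repeat."""
    a, b = 0.0, 0.0  # initial guess (could also be random)
    for _ in range(num_steps):
        grad_a, grad_b = grad_fn(x, y, a, b)  # parameter updates, derived below
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b
```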
Suppose we had the time and the computing resources to calculate the loss for all possible values of a (w1 in the figures). For the kind of regression problems we have been examining, the resulting plot of loss vs. a will always be convex. In other words, the plot will always be bowl-shaped, kind of like this:
Convex problems have only one minimum; that is, only one place where the slope is exactly 0. The minimum is where the loss function converges.
As computing the loss function for all values of a to find its minimum would be extremely inefficient, we need a better mechanism. Let's examine such an algorithm: gradient descent.
As explained in the previous section, we start with a random guess for our parameter.
Now that we have an initial value for our parameter a, we can compute the loss value for our linear regression. Next, we would like to know whether we should increase or decrease the value of a to make the loss decrease.
To get this information, the gradient descent algorithm calculates the gradient of the loss curve at the current point. In the next figure, the gradient of the loss is equal to the derivative (slope) of the curve and tells you that you should increase the value of a to make the loss value decrease.
As the gradient of a curve approximates it well in a very small neighborhood, we move the starting point by a small fraction of the gradient (this fraction is called the learning rate), in the direction that decreases the loss.
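Concretely, as a one-line sketch in Python (learning_rate and grad_a are named here for illustration; subtracting moves a against the slope, so a negative slope increases a, exactly as in the figure):

```python
a = a - learning_rate * grad_a  # negative slope => a increases, loss decreases
```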
Now that we have a new (and hopefully better) value for a, we iterate the previous process to improve it even further.
The learning rate is a hyperparameter of this problem: a parameter of the training process, not of the model. It is important to note that the learning rate (which determines the size of the steps during the descent process) should be neither too small (otherwise the training process will take very long to converge) nor too big (which can lead to the process not converging at all).
Interactive graph showing the importance of the learning rate
In the example that we saw earlier, we used gradient descent to find the correct value for a single parameter. In cases where we have more parameters (a and b for a line equation), we compute the gradient of the loss function (the partial derivatives with respect to each of the variables) and update them all at once.
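A sketch of such a simultaneous update in Python; grad_loss is a hypothetical helper returning both partial derivatives evaluated at the current (a, b):

```python
grad_a, grad_b = grad_loss(x, y, a, b)  # both partials at the current point
a, b = a - learning_rate * grad_a, b - learning_rate * grad_b  # updated together
```

The tuple assignment makes the simultaneity explicit: both gradients are computed before either parameter changes.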
Let's compute the gradients of a and b in our example. First, let's recall and expand the MSE loss formula in our case.
Now we want to differentiate this function with respect to a and b separately. Let's start with a. In order to simplify the computation, we factor the previous formula by a.
From there we easily compute the derivative of the loss with respect to a.
Now onto b: we follow a similar process. First we factor by b.
From there we compute the derivative of l with respect to b.
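Written out in the notation used above (a reconstruction of the formulas: the examples are (x_i, y_i) for i = 1, ..., N, and l is the MSE loss):

```latex
l(a, b) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - a x_i - b \right)^2

\frac{\partial l}{\partial a} = -\frac{2}{N} \sum_{i=1}^{N} x_i \left( y_i - a x_i - b \right)

\frac{\partial l}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \left( y_i - a x_i - b \right)
```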
An animation of an application of the gradient descent algorithm on a linear regression problem can be found here.
In real problems, it is often not practical to compute the value of the average loss across the whole training set, as it often contains a huge quantity of data.
To deal with this problem, we run gradient descent using batches. A batch is a subset of the training set (usually 10 to 1000 examples) that we use to compute an approximation of the gradient. The larger the batches, the better the gradient approximation and the more robust the training procedure. By using this algorithm, we trade robustness for efficiency.
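A minimal sketch of mini-batch gradient descent, reusing the toy data and rng from the earlier snippets (the batch size and learning rate are arbitrary illustrative choices):

```python
def gradients(x, y, a, b):
    """Analytic gradients of the MSE loss derived above."""
    residual = y - (a * x + b)
    return -2.0 * np.mean(x * residual), -2.0 * np.mean(residual)

def sgd(x, y, num_steps=1000, learning_rate=0.01, batch_size=32):
    """Gradient descent where each step uses a random batch of examples."""
    a, b = 0.0, 0.0
    for _ in range(num_steps):
        idx = rng.integers(0, len(x), size=batch_size)    # sample a batch
        grad_a, grad_b = gradients(x[idx], y[idx], a, b)  # approximate gradient
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b

# a, b = sgd(x, y)  # on the toy data, should end up roughly near a = 2, b = 1
```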
The main takeaways of this section are the following:
- Artificial neural networks have existed for a long time.
- The two main components that made them "work" were the backpropagation algorithm (automatic differentiation) and the advent of more powerful computers.
- Artificial neural networks are huge differentiable functions that we optimize using a variant of the stochastic gradient descent algorithm.