In this course we will define what are neurons and neural networks and how to train them.
A neuron is a computation system that takes multiple inputs x1,x2, ..., xn and outputs a single value y.
A neuron can be seen as a small machine learning model. In the
previous figure the inputs or features are
x1,...,x6, the parameters or weights of
the neuron are w1,...,w6 and the output or
prediction of the model is y
.
A
is called the activation function of the neuron, in the previous
figure we see a very simple threshold activation function. If its
input is negative it outputs 0, if its input is positive is
outputs 1. This component is there to mimic the behavior of biological
neurons firing or
not
depending on the electrical current of their incoming
synapses. The activation
functions used in neural networks are non
linear in order to
allow neural networks to approximate non linear functions.
However, there is a problem with the previous definition. As we will see later, we want to use some sort of gradient descent algorithm to train our neural networks. These algorithms requires the function we are trying to optimize to be differentiable according to the parameters of the model. The thresholding activation function that we saw is not differentiable so we need to replace it by something else.
Here, A
is the sigmoid
function. It
approximates the behavior of the thresholding activation while being
differentiable.
There are many different activation functions that can be used when building neural networks, the sigmoid function being only one example.
Now that we have defined what neurons are, let's see how to assemble them together to form a neural network.
The previous figure shows what we call a Fully Connected neural Network (FCN) or MultiLayer Perceptron (MLP). In this section, we will describe each of its component.
Each of the node in the previous graph is a neuron. All the incoming
connections are its inputs (the x
s in the previous section). All
the outgoing connections are copies of its output (copies of y
in
the previous section). We can see that the outputs of neurons on the
left hand size of the network are used as inputs of neurons on the
right hand side.
A column of neurons is called a layer. In the example of this
network all the layers are linear or fully connected as every
output of the layer t
is used as input for every neuron of the layer
t + 1
. There are multiple kinds of neural network layers that we
will see later.
When we use a neural network as a predictive model, we feed it our inputs on the leftmost layer, the input layer. Let's say we build a neural network predicts whether an image depicts a cat or a dog, we would feed the pixel values of the image to the input layer.
To read the prediction or output of the neural network, we look at the values it outputs on the rightmost layer, the output layer. In our previous cat and dog image classifier example, each neuron would contain a value that represents the confidence of the model that the image contains a cat or dog. If we had 3 classes of image, we would have 3 neurons in the output layer.
The input and output layers are what the user of the model has access to. The user plugs its data as input on the input layer and get the prediction on the output layer.
The third type of layer contains neurons in the middle of the network, these are hidden layers. These layers are used by the network to refine its understanding of the input and analyze the hierarchical structure of the data.
The "deep" word in "deep learning" comes from "deep neural networks", neural networks with many hidden layers.
Now that we have defined what a neural network is, let's build one that computes a simple function. The network will be composed of a single neuron and we want to make it compute the binary AND function.
We want to find w1, w2 and b such that
The solution to this exercise is available here.
In order to apply the classification loss that we will see in the next section, we need a way to convert a list of outputs from a neural network into a probability distribution. To do this, we will use an activation function called the softmax function.
The softmax is defined as follows:
where k
is the number of classes of our problem and z
is the list
of k
values that we want to normalize.
Let's apply the softmax function to a vector to get a sense of how it works.
>>> model_output = torch.tensor([0.3, -16.2, 5.3, 0.7])
>>> model_output
tensor([ 0.3000, -16.2000, 5.3000, 0.7000])
>>> model_output_exp = model_output.exp()
>>> model_output_exp
tensor([1.3499e+00, 9.2136e-08, 2.0034e+02, 2.0138e+00])
>>> model_output_softmax = model_output_exp / model_output_exp.sum()
>>> print(*[f'{value:5.3f}' for value in model_output_softmax], sep = ', ')
0.007, 0.000, 0.983, 0.010
We see that the softmax values sum to 1 with one being much larger than the others.
The softmax function gives us a differentiable way to convert our model outputs into something we can use to compute the cross entropy loss. It quantifies the confidences the network associates to each of the class relatively to all the others.
In the previous lesson, we saw the MSE loss as a way to evaluate the quality of our model approximations on a regression task. This loss is not suitable for classification problems. To evaluate and train models on classification tasks, we use the cross entropy loss:
The cross entropy
quantifies the difference between two probability distributions p
and q
. An in-depth explanation of the information theory underlying
concepts can be found in the Stanford CS231n
course. As with the MSE
loss, the ideal value for the cross entropy loss is 0
.
Let's try this formula on an example to get more familiar with
it. Let's say we have a classifier that tries to predict what animal
is in a picture, the classes are cat
, dog
and snake
. We want to
compute the loss value for a specific sample that contains a
snake
. Our target probability p
is defined as follow:
We know that the image contains a snake
the probability of the
snake
class is 1
and the probabilities for the other classes are
0
When working with classification problems, neural networks attribute a
probability value for each class of the problem. We note this
probability distribution q
.
In this first example, the model have a 50% confidence that the image
contains a cat
, 30% that it contains a dog
and 20% that is
contains a snake
. Let's now compute the value of the cross entropy
between p
and q
.
The value of the cross entropy is 0.669
. In this case, the model was
wrong: the highest confidence was on cat
although the image
contained a snake
. Let's try again with another model output.
Now the model prediction is correct as the highest confidence is on
snake
. Let's compute the cross entropy.
The value of the cross entropy is 0.097
which is much lower than in
the previous case. We can also see that, even though the model
prediction was correct, the loss value is not 0
. The model could
have improved the loss value even further by being more confident in
its prediction.
When performing a training step on a batch of samples, the value that we are trying to optimize is the mean cross entropy loss between the model predictions and correct labels.
Let's now see how to define and train a neural network using PyTorch. There is still a few tools missing to understand the whole process that we will get to study in the following courses. The complete runnable code for the example described in this section is available here with a lot of commentaries.
The task we want to learn in this example is really simple, we want to
determine whether a point (x, y)
with -1 <= x, y < 1
is in a
disk. Visually, we want to discriminate between red and violet point
in the following figure.
This problem is a classification task, we want the model to output 1 for points inside the disk and 0 for points outside. We will set the radius of the disk in order to have as many points inside than outside in our dataset.
First let's write a function that will generate our dataset.
def generate_data(n_samples):
# We randomly generate points (x, y) with -1 <= x, y < 1
X = torch.rand(n_samples, 2) * 2 - 1
# We compute the Euclidean distance between the origin (0, 0) and
# each sample
dist_to_origin = (X ** 2).sum(axis = 1).sqrt()
# radius value in order to have as many samples inside than
# outside
radius = math.sqrt(2 / math.pi)
# label is 1 for samples inside the disk and 0 for samples outside
y = (dist_to_origin < radius).long()
return X, y
Now let's create our neural network.
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleLinearNetwork(nn.Module):
def __init__(self):
super(SimpleLinearNetwork, self).__init__()
self.input_layer = nn.Linear(2, 64)
self.hidden_layer = nn.Linear(64, 64)
self.output_layer = nn.Linear(64, 2)
def forward(self, x):
x = self.input_layer(x)
x = F.relu(x)
x = self.hidden_layer(x)
x = F.relu(x)
x = self.output_layer(x)
x = torch.log_softmax(x, dim = 1)
return x
In PyTorch, neural networks inherit from nn.Module
, this
automatically enables mechanisms that we will see later like parameter
registering with the optimizer and automatic differentiation.
In the __init__
method, we call the parent object (nn.Module
)
__init__
method to initialize our network module. We then declare
the layers of our neural network. In this case we use will three
layers:
- one input layer with two inputs (the coordinates of our points) and 64 outputs (the size of our hidden layer).
- one hidden layer with 64 inputs and 64 outputs
- one output layer with 64 inputs (the size of our hidden layer) and 2 outputs (the number of classes of our problem, inside or outside the disk).
In this method, we just declare the list of layers that we will use,
how we use them will be declared in the forward
method.
The forward
method takes x
as parameter, x
is a batch of
samples. As our samples are two dimensional, x
will be of shape
[batch size, 2]
. It is much faster to perform computations on
batches of samples rather than one by one. In this method we will make
the information flow through the network by sequentially applying
our layer computations. Let's take a look at the first layer
application.
x = self.input_layer(x)
x = F.relu(x)
x = self.input_layer(x)
will compute the activation of all the 64
neurons of our input layer for each sample of the batch. After this
line, the shape of x
will be [batch size, 64]
, for each element of
the batch, we have the result of the computation for each of the 64
neurons. In PyTorch, applying the activation is done after performing
the layer computations. Here we apply a Rectified Linear Unit
(ReLU)
activation. The activation does not change the shape of our tensor.
x = self.hidden_layer(x)
x = F.relu(x)
Similarly, we compute the result of the application of our hidden
layer. As our hidden layer takes 64 inputs and outputs 64 values, the
shape of x
does not change, it is still [batch size, 64]
.
x = self.output_layer(x)
x = torch.log_softmax(x, dim = 1)
We now take the output of our hidden layer and pass it through our
output layer. After x = self.output_layer(x)
, x
is of shape
[batch size, 2] = [batch size, number of class]
. In order to compute
the cross entropy to evaluate the performance of our model, we need to
normalize the output of the model using a softmax activation. Just as
when we compute the sum along the lines of a 2D tensor, we have to
precise along which dimension we want to perform the sum operation of
the softmax function. After the line x = torch.log_softmax(x, dim = 1)
the shape of x
is [batch size, 2]
with the sum of values being 1
along the lines of the tensor. The first column corresponds to the
confidence of the model in the class 0
(outside the disk) and the
second column corresponds to the confidence of the model in the class
1
(inside the disk). Notice that we compute the log
of the
softmax
values instead of simply the softmax
, this is because we
will use the Negative Log Likelihood Loss
(NLLLoss) of PyTorch
that requires this type of normalized inputs.
Let's write the main function of our project.
from torch.utils.data import TensorDataset
import torch.optim as optim
def main():
n_samples = 3000
epochs = 300
batch_size = 32
learning_rate = 1e-3
X, y = generate_data(n_samples)
dataset = TensorDataset(X, y)
model = SimpleLinearNetwork()
criterion = nn.NLLLoss()
optimizer = optim.SGD(
params = model.parameters(),
lr = learning_rate
)
train(model, criterion, optimizer, dataset, epochs, batch_size)
The beginning of the function is explicit enough, we generate our
dataset and set the epochs
(number of times that we will use the
whole dataset to perform parameters update during training),
learning_rate
and batch_size
hyperparameters.
dataset = TensorDataset(X, y)
wraps our dataset into the PyTorch
Dataset
interface. It will later provide us a very simple way to
generate random batches of samples.
We then instantiate our neural network with model = SimpleLinearNetwork()
. It is at this point that the weights of all
the neurons will be initialized according to the algorithm specified
in the documentation.
We then create the loss function that we will use to evaluate and
train our model with criterion = nn.NLLLoss()
. The negative log
likelihood loss corresponds to the cross entropy loss that we have
seen in the previous section.
After this, we create our optimizer. Here we choose to use a simple
Stochastic Gradient Descent (SGD) algorithm, there are many more
available in the optim
module of PyTorch. The first argument params = model.parameters()
declares the list of parameters that the optimizer will be allowed to
modify to lower the loss. During the creation of our network, all the
weights of the linear layers have been registered as its
parameters. The lr = learning_rate
parameter defines the size of
the steps the optimizer will take in the direction of the gradient.
Once everything is setup, we can call our training function. Let's define it.
from torch.utils.data import DataLoader
def train(model, criterion, optimizer, dataset, epochs, batch_size):
dataloader = DataLoader(dataset, batch_size = batch_size, shuffle = True)
for epoch in range(epochs):
for X, y in dataloader:
y_pred = model(X)
loss = criterion(y_pred, y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
First we create our batch generator. PyTorch provides us with the
DataLoader
class to create batches of data from a PyTorch
Dataset
. It is a very powerful tools that is almost always used to
write training or evaluation loops. When using shuffle = True
the
elements of the dataset will be shuffle every time we create an
iterator on the DataLoader
(here, at every epoch).
The first loop for epoch in range(epochs)
defines the number of
times we will iterate over the dataset.
The second loop for X, y in dataloader
iterates over batches of the
dataset. Typically, the input X
will be of shape [batch size, 2] = [32, 2]
and the corresponding output y
will be of shape [32]
. The
batch size may not be respected for the last batch if the size of the
dataset is not a multiple of the batch size.
The first step of the training loop is to compute the model
predictions on a batch of sample with y_pred = model(X)
. When using
model(X)
the forward
method of the network is called. y_pred
will be of shape [batch size, 2]
with each column corresponding in
the confidence in each of the classes.
Now that we have the model's predictions, we can compute the loss
between these predictions y_pred
and the target y
with loss = criterion(y_pred, y)
. loss
will be a single value, the average
cross entropy loss for all the samples in the batch.
Now that we have the loss value on the batch, we want to perform a
weight update using gradient descent. To do this, we need the gradient
of the loss function with respect to the parameters of the neural
network. This is done with loss.backward()
. This computation is done
using the
Backpropagation
algorithm that we will study later in the course. For now let's just
assume that this computation is done and the gradients are stored in
the neural networks right next their corresponding parameters values.
To perform a parameter update using the gradient descent algorithm and
the gradients that we have just computed, we use
optimizer.step()
. The optimizer already knows what it is allowed to
modify as we have precised the params
argument during its creation
and, as we just explained, the gradients computed using the backward
method are stored next to the parameters their correspond to.
The last step of the training loop is to clean the gradient computations that we have just done in order to perform the next optimization step independently of this one. In very specific cases we do not want to clear the gradients after the end of the loop but this kind of use is beyond the scope of this course.
Now that we have the training loop for our model, we would like to evaluate it to know the proportion of correct answer that it outputs.
def evaluate(model, dataset, batch_size):
dataloader = DataLoader(dataset, batch_size = batch_size, shuffle = False)
correct_pred = 0
total_pred = 0
with torch.no_grad():
for X, y in dataloader:
y_pred = model(X)
y_pred_class = y_pred.argmax(dim = 1)
correct_pred += (y_pred_class == y).sum().item()
total_pred += len(y)
return correct_pred / total_pred
This function looks very similar to the training loop as we are also
iterating over batches to evaluate our model performances. We use the
context torch.no_grad()
to specify that we will not compute the
gradients in this block, because of the backpropagation algorithm this
speeds up the computation and reduce the memory consumption.
y_pred = model(X)
y_pred_class = y_pred.argmax(dim = 1)
With these two lines, we compute the network predicted class for each
sample of the batch. argmax
returns the index of the maximum along
the dimension 1
, the index of the class with the greatest
confidence.
With (y_pred_class == y).sum().item()
we get the number of times the
model prediction corresponds to the label y
. We use .item()
to
extract the result from the PyTorch tensor
into a simple Python
int
. We also compute the total number of prediction done by the
model with total_pred += len(y)
. The accuracy of the model is the
proportion of correct predictions it makes, correct_pred / total_pred
.
If we launch the program containing all this code with a call to the evaluation function every 10 epochs we get the following output.
> python basic_neural_network_example.py
Initial accuracy 49.033%
0 -> 49.033% accuracy
10 -> 75.767% accuracy
20 -> 67.400% accuracy
30 -> 68.900% accuracy
40 -> 72.367% accuracy
50 -> 76.267% accuracy
60 -> 78.933% accuracy
70 -> 81.567% accuracy
80 -> 83.767% accuracy
90 -> 85.600% accuracy
100 -> 87.267% accuracy
110 -> 88.033% accuracy
120 -> 89.267% accuracy
130 -> 90.000% accuracy
140 -> 91.067% accuracy
150 -> 91.967% accuracy
160 -> 93.200% accuracy
170 -> 94.000% accuracy
180 -> 94.667% accuracy
190 -> 95.167% accuracy
200 -> 95.567% accuracy
210 -> 95.900% accuracy
220 -> 96.267% accuracy
230 -> 96.633% accuracy
240 -> 96.967% accuracy
250 -> 97.033% accuracy
260 -> 97.433% accuracy
270 -> 97.467% accuracy
280 -> 97.533% accuracy
290 -> 97.800% accuracy