We have learned that to train neural networks you need:
- To quickly multiply matrices (tensors)
- To compute gradients in order to perform gradient descent optimization

Neural network frameworks make this easier by providing:
- Tensor operations on whatever compute is available: CPU, GPU, or even TPU
- Automatic gradient computation (gradients are explicitly programmed for all built-in tensor functions)
- A neural network constructor / higher-level API (describe a network as a sequence of layers)
- Simple training functions (like `fit` in Scikit-Learn)
- A number of optimization algorithms in addition to gradient descent
- Data handling abstractions (that will ideally work on GPU, too)
The most notable frameworks:
- TensorFlow 1.x - The first widely available framework (from Google). It allowed defining static computation graphs, pushing them to the GPU, and evaluating them explicitly.
- PyTorch - A framework from Facebook that is growing in popularity.
- Keras - A higher-level API on top of TensorFlow/PyTorch that unifies and simplifies the use of neural networks (created by François Chollet).
- TensorFlow 2.x + Keras - A new version of TensorFlow with integrated Keras functionality that supports dynamic computation graphs and allows tensor operations very similar to NumPy (and PyTorch).
A tensor is a multi-dimensional array. It is very convenient to use tensors to represent different types of data:
- 400x400 - black-and-white picture
- 400x400x3 - color picture
- 16x400x400x3 - minibatch of 16 color pictures
- 25x400x400x3 - one second of 25-fps video
- 8x25x400x400x3 - minibatch of 8 one-second videos
You can easily create simple tensors from lists or NumPy arrays, or generate random ones.
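For example, a minimal sketch of a few creation routines:

```python
import torch
import numpy as np

a = torch.tensor([[1, 2], [3, 4]])   # from a nested list
b = torch.from_numpy(np.eye(3))      # from a NumPy array
c = torch.randn(2, 3)                # random normal tensor of shape 2x3
print(a.shape, b.shape, c.shape)
```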
Tensor operations such as `+` / `add` return new tensors. However, sometimes you need to modify the existing tensor in place. Most operations have in-place counterparts, whose names end with `_`.
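A small sketch of the difference:

```python
import torch

u = torch.ones(3)
v = u + 1          # returns a new tensor; u is unchanged
u.add_(5)          # in-place counterpart of add: u itself becomes [6., 6., 6.]
print(u, v)
```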
For backpropagation, you need to compute gradients. By setting any PyTorch tensor's `requires_grad` attribute to `True`, all operations with this tensor will be tracked for gradient calculation. To compute the gradients, you need to call the `backward()` method, after which the gradient becomes available via the `grad` attribute.
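A minimal sketch of this workflow:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # all operations on x are tracked
y.backward()         # compute dy/dx
print(x.grad)        # tensor([2., 4., 6.])
```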
PyTorch automatically accumulates gradients. If you specify `retain_graph=True` when calling `backward`, the computational graph is preserved, and new gradients are added to the `grad` field. To restart computing gradients from scratch, reset the `grad` field to zero explicitly by calling `zero_()`.
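A short sketch of accumulation and resetting:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
y.backward(retain_graph=True)   # x.grad == 4
y.backward(retain_graph=True)   # gradients accumulate: x.grad == 8
print(x.grad)
x.grad.zero_()                  # reset accumulated gradients to zero
print(x.grad)                   # tensor(0.)
```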
To compute gradients, PyTorch creates and maintains a computational graph. For each tensor with the `requires_grad` flag set to `True`, PyTorch maintains a special function called `grad_fn`, which computes the derivative of the expression according to the chain rule of differentiation.
Here, `c` is computed using the `mean` function, so `grad_fn` points to a function called `MeanBackward`.
In most cases, we want PyTorch to compute the gradient of a scalar function (such as a loss function). However, if we want to compute the gradient of a tensor with respect to another tensor, PyTorch allows us to compute the product of a Jacobian matrix and a given vector.
Suppose we have a vector function $\vec{y} = f(\vec{x})$, where $\vec{x} = (x_1, \ldots, x_n)$ and $\vec{y} = (y_1, \ldots, y_m)$. Then the gradient of $\vec{y}$ with respect to $\vec{x}$ is given by the Jacobian matrix:

$$
J = \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}
$$
Instead of giving us access to the entire Jacobian matrix, PyTorch computes the product $v^T \cdot J$ for a given vector $v = (v_1, \ldots, v_m)$. To do that, we call `backward()` and pass $v$ as an argument. The size of $v$ should match the size of the tensor on which we call `backward()`.
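A minimal sketch of such a vector-Jacobian product:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2                            # y is a vector, not a scalar
v = torch.tensor([1.0, 0.1, 0.01])   # must match the shape of y
y.backward(v)                        # computes v^T . J
print(x.grad)                        # tensor([2.0000, 0.2000, 0.0200])
```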
Let's try to use automatic differentiation to find a minimum of a simple two-variable function:
$$
f(x_1, x_2) = (x_1 - 3)^2 + (x_2 + 2)^2
$$

Let tensor $x$ hold the current coordinates of a point. We start from the point $x^{(0)} = (0, 0)$ and compute the next point in the sequence using the gradient descent formula:

$$
x^{(n+1)} = x^{(n)} - \eta \nabla f
$$

Here, $\eta$ is the so-called learning rate (denoted as `lr` in the code), and $\nabla f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\right)$ is the gradient of $f$.
Linear regression is defined by the straight line $f_{W,b}(x) = Wx + b$, where $W$ and $b$ are the model parameters that we need to find.
The error on our dataset $\{x_i, y_i\}_{i=1}^{N}$ (also called the loss function) can be defined as the mean squared error:

$$
\mathcal{L}(W, b) = \frac{1}{N} \sum_{i=1}^{N} (f_{W,b}(x_i) - y_i)^2
$$
We will train the model on a series of minibatches, using gradient descent to adjust the model parameters with the following formulae:

$$
W^{(n+1)} = W^{(n)} - \eta \frac{\partial \mathcal{L}}{\partial W}, \qquad
b^{(n+1)} = b^{(n)} - \eta \frac{\partial \mathcal{L}}{\partial b}
$$
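A minimal full-batch sketch of these updates (the synthetic data, noise level, and learning rate are assumptions; minibatching is discussed below):

```python
import torch

# synthetic data: y = 2x + 1 plus noise
x = torch.rand(128, 1)
y = 2 * x + 1 + 0.1 * torch.randn(128, 1)

W = torch.randn(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.5

for epoch in range(200):
    loss = torch.mean((x * W + b - y) ** 2)   # mean squared error
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad
        b -= lr * b.grad
    W.grad.zero_()
    b.grad.zero_()

print(W.item(), b.item())   # close to 2 and 1
```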
To use the GPU for computations, PyTorch supports moving tensors to the GPU and building computational graphs for the GPU. Traditionally, at the beginning of our code we define the available computation device as `device` (either `cpu` or `cuda`), and then move all tensors to this device using a call like `.to(device)`.
We can also create tensors on the specified device upfront by passing the parameter `device=...` to the tensor creation code. This approach ensures that the code works seamlessly on both CPU and GPU without modification.
Now we will consider a binary classification problem. A good example of such a problem is classifying a tumor as malignant or benign based on its size and age.
The core model is similar to regression, but we need to use a different loss function. Let's start by generating some sample data.
Let's use PyTorch gradient computing machinery to train a one-layer perceptron.
Our neural network will have 2 inputs and 1 output. The weight matrix $W$ will have a size of $2 \times 1$, and the bias vector $b$ will have a size of $1$.
Note that we use `W.data.zero_()` instead of `W.zero_()`. We need to do this because we cannot directly modify a tensor that is being tracked by the autograd mechanism.
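A minimal illustration of why:

```python
import torch

W = torch.randn(2, 1, requires_grad=True)
# W.zero_() would raise: "a leaf Variable that requires grad
# is being used in an in-place operation"
W.data.zero_()   # modifies the underlying data, bypassing autograd tracking
```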
The core model will be the same as in the previous example, but the loss function will be the logistic loss. To apply the logistic loss, we need the output of our network to be a probability, i.e., we need to bring the output $z$ into the range $[0, 1]$ using the sigmoid activation function: $p = \sigma(z)$.

If we get the probability $p_i$ for the $i$-th input value corresponding to the actual class $y_i \in \{0, 1\}$, we compute the loss as:

$$
\mathcal{L}_i = -(y_i \log p_i + (1 - y_i) \log(1 - p_i))
$$
In PyTorch, both of these steps (applying the sigmoid and then the logistic loss) can be done with one call to the `binary_cross_entropy_with_logits` function. Since we are training our network in minibatches, we need to average the loss across all elements of a minibatch, and this is also done automatically by `binary_cross_entropy_with_logits`.
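A sketch of one manual training step using this function (the minibatch here is a random placeholder):

```python
import torch
import torch.nn.functional as F

W = torch.randn(2, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1

features = torch.randn(16, 2)                          # placeholder minibatch
labels = torch.randint(0, 2, (16, 1)).float()

z = features @ W + b                                   # raw network output (logits)
loss = F.binary_cross_entropy_with_logits(z, labels)   # sigmoid + logistic loss, averaged
loss.backward()

W.data -= lr * W.grad                                  # update via .data, bypassing autograd
b.data -= lr * b.grad
W.grad.data.zero_()
b.grad.data.zero_()
```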
The call to `binary_cross_entropy_with_logits` is equivalent to calling `sigmoid`, followed by a call to `binary_cross_entropy`.
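A quick numerical check of this equivalence:

```python
import torch
import torch.nn.functional as F

z = torch.randn(4, 1)                        # logits
y = torch.tensor([[0.], [1.], [1.], [0.]])   # targets

loss1 = F.binary_cross_entropy_with_logits(z, y)
loss2 = F.binary_cross_entropy(torch.sigmoid(z), y)
print(torch.allclose(loss1, loss2))          # True
```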
To loop through our data, we will use the built-in PyTorch mechanism for managing datasets. It is based on two concepts:
- Dataset: The main source of data, which can be either map-style or iterable-style.
- DataLoader: Responsible for loading data from a dataset and splitting it into minibatches.
In our case, we will define a dataset based on a tensor and split it into minibatches of 16 elements, as shown in the sketch below. Each minibatch contains two tensors:
- Input data (size 16x2)
- Labels (a vector of length 16 of integer type, representing the class number)
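A minimal sketch of this setup (the `features` and `labels` tensors are random placeholders standing in for the generated data):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.randn(128, 2)            # placeholder input data
labels = torch.randint(0, 2, (128,))      # placeholder class numbers

dataset = TensorDataset(features, labels)  # map-style dataset backed by tensors
dataloader = DataLoader(dataset, batch_size=16)

for x_batch, y_batch in dataloader:
    print(x_batch.shape, y_batch.shape)    # torch.Size([16, 2]) torch.Size([16])
    break
```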
In PyTorch, a special module `torch.nn.Module` is defined to represent a neural network. There are two methods to define your own neural network:
- Sequential, where you specify a list of layers that comprise your network.
- As a class inherited from `torch.nn.Module`.
The first method allows you to specify standard networks with a sequential composition of layers, while the second method is more flexible and makes it possible to express networks with arbitrarily complex architectures.
Inside modules, you can use standard layers, such as:
- Linear - a dense linear layer, equivalent to a one-layer perceptron. It has the same architecture as defined for our network.
- Softmax, Sigmoid, ReLU - layers that correspond to activation functions.
- There are also other layers for special network types, such as convolutional, recurrent, etc. These will be covered later.
Most of the activation functions and loss functions in PyTorch are available in two forms:
- As a function (inside the `torch.nn.functional` namespace).
- As a layer (inside the `torch.nn` namespace).

For activation functions, it is often easier to use the functional elements from `torch.nn.functional`, without creating separate layer objects.
If we want to train a one-layer perceptron, we can just use one built-in Linear layer.
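For example, a sketch matching the 2-input, 1-output setup above:

```python
import torch.nn as nn

net = nn.Linear(2, 1)           # one dense layer: 2 inputs, 1 output
print(list(net.parameters()))   # the weight and bias tensors, requires_grad=True
```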
As you can see, the `parameters()` method returns all the parameters that need to be adjusted during training. These correspond to the weight matrix $W$ and bias $b$. Note that they have `requires_grad` set to `True`, as we need to compute gradients with respect to these parameters.
PyTorch also contains built-in optimizers, which implement optimization methods such as gradient descent. Here is how we can define a stochastic gradient descent optimizer:
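A sketch (the learning rate is a placeholder):

```python
import torch
import torch.nn as nn

net = nn.Linear(2, 1)
optimizer = torch.optim.SGD(net.parameters(), lr=0.05)

# a typical training step then looks like:
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```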
Now let's train a multi-layered perceptron. It can be defined by specifying a sequence of layers. The resulting object automatically inherits from `Module`, meaning it also has the `parameters()` method that returns all the parameters of the entire network.
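A sketch of such a network (the hidden layer size is an arbitrary choice):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 5),    # hidden layer with 5 neurons
    nn.ReLU(),
    nn.Linear(5, 1),    # output layer
)
print(len(list(model.parameters())))   # weights and biases of both Linear layers
```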
Using a class inherited from `torch.nn.Module` is a more flexible method, because we can define arbitrary computations inside it.
The `Module` class automates many things. For example:
- It automatically registers all member variables that are PyTorch layers.
- It gathers their parameters for optimization.

You only need to define all layers of the network as members of the class, as in the sketch below.
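An equivalent class-based sketch (layer sizes are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # layers defined as members are registered automatically
        self.fc1 = nn.Linear(2, 5)
        self.fc2 = nn.Linear(5, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

model = MyNet()
print(model)
```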
Let's wrap our existing PyTorch model code into a PyTorch Lightning module. This allows us to work with our model more conveniently and flexibly using various Lightning methods for training and accuracy testing.
First, we need to install and import PyTorch Lightning. This can be done with the command:
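```python
# in a shell or a notebook cell:
#   pip install pytorch-lightning
import pytorch_lightning as pl
```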
- PyTorch allows you to operate on tensors at a low level, providing maximum flexibility.
- There are convenient tools to work with data, such as Datasets and DataLoaders.
- You can define neural network architectures using Sequential syntax or by inheriting a class from `torch.nn.Module`.
- For an even simpler approach to defining and training a network, consider using PyTorch Lightning.