The perceptron set the foundations for Neural Network models. It can be used for two-class classification problems, using a nonlinear sign activation function, which outputs $+1$ or $-1$.
According to this, if a record is classified correctly, we get no update; if it is misclassified, we apply the update rule $w \leftarrow w + \eta\, y\, x$, where $\eta$ is the learning rate, $y \in \{-1, +1\}$ is the true label and $x$ is the input vector.
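A minimal NumPy sketch of this training rule (function and variable names here are illustrative, not from the notes):

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=10):
    """X: (n_samples, n_features), y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w + b >= 0 else -1  # sign activation
            if y_hat != yi:                       # misclassified: apply the update rule
                w += lr * yi * xi
                b += lr * yi
    return w, b
```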
Gradient descent is based on the observation that if a multi-variable function $F(x)$ is differentiable in a neighbourhood of a point $a$, then $F$ decreases fastest by moving from $a$ in the direction of the negative gradient $-\nabla F(a)$.
We therefore know the gradient to be a vector pointing in the direction of greatest increase of a function: combining this concept with a cost function, we can find the direction in which this cost is minimized. Parametrising our cost function with respect to the model's parameters (whatever the model is: a neural network, a perceptron, etc.), we can move towards the minimum iteratively: $\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta_t)$.
Remember that normalisation plays a big role in gradient descent: if features differ greatly in magnitude, the gradient will be dominated by the largest ones.
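A minimal sketch of this iterative update, using a simple quadratic cost as an example (names are illustrative):

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.1, steps=100):
    """Repeatedly move against the gradient of the cost."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)   # theta_{t+1} = theta_t - eta * grad J(theta_t)
    return theta

# Example: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta.
print(gradient_descent(lambda t: 2 * t, theta0=[3.0, -4.0]))  # approaches [0, 0]
```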
A shallow neural network is just a NN with one hidden layer:
X1 \
X2 => z1 = XW1 + B1 => a1 = Sigmoid(z1) => z2 = a1W2 + B2 => a2 = Sigmoid(z2) => l(a2,Y)
X3 /
The sigmoid was the first activation function, very commonly used at the beginning. It turns out, though, that it can lead to small updates in the weights, since its gradient saturates. A second function is the $\tanh$, which is zero-centred and usually behaves better in hidden layers.
While in logistic regression we could initialize the parameters to 0, if we do so here, all the hidden units will compute the same function, updates will be identical, and the model will never break symmetry. We therefore start with small random values, as otherwise our activation functions (sigmoid and tanh) would saturate fast.
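A minimal NumPy sketch of the forward pass of the shallow network above, with small random initialization (shapes and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, n_x, n_h = 5, 3, 4                       # batch size, input size, hidden units
rng = np.random.default_rng(0)

# small random values: break symmetry without saturating sigmoid/tanh
W1 = rng.standard_normal((n_x, n_h)) * 0.01
b1 = np.zeros(n_h)
W2 = rng.standard_normal((n_h, 1)) * 0.01
b2 = np.zeros(1)

X = rng.standard_normal((m, n_x))           # one example per row
Z1 = X @ W1 + b1                            # (m, n_h)
A1 = sigmoid(Z1)
Z2 = A1 @ W2 + b2                           # (m, 1)
A2 = sigmoid(Z2)                            # output probabilities
```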
By using NumPy, we are able to perform operations on whole vectors, avoiding costly Python for loops. The NumPy dot product uses vectorization by default and is therefore very fast. Reshapes are not costly.
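A quick illustration of the difference between an explicit loop and the vectorized dot product:

```python
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# explicit loop: interpreted element by element, slow
loop_result = 0.0
for x, y in zip(a, b):
    loop_result += x * y

# vectorized: a single call into optimized native code
vec_result = np.dot(a, b)

assert np.isclose(loop_result, vec_result)
```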
Gradient checking allows us to compute the gradient numerically and compare it to the gradient computed by the backpropagation algorithm. This is useful for debugging.
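A minimal sketch of numerical gradient checking via two-sided finite differences (names are illustrative):

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-7):
    """Approximate each partial derivative as (f(t+eps) - f(t-eps)) / (2*eps)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        grad[i] = (f(plus) - f(minus)) / (2 * eps)
    return grad

# Example: J(theta) = sum(theta^2), whose analytic gradient is 2 * theta.
theta = np.array([1.0, -2.0, 3.0])
analytic = 2 * theta
numeric = numerical_gradient(lambda t: np.sum(t ** 2), theta)
# relative difference should be tiny (~1e-7) if backprop is correct
print(np.linalg.norm(analytic - numeric) / (np.linalg.norm(analytic) + np.linalg.norm(numeric)))
```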
We usually split the dataset into 3 different sets:
- The training dataset, the biggest one, which we use to train the model
- The validation dataset, which we use to validate the model in order to tune the hyperparameters
- The test dataset, which we use to test the model when we reached a final configuration
Usually, these 3 sets are split with the training set taking the large majority of the data (e.g. 60/20/20, or even something like 98/1/1 with very large datasets).
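A minimal sketch of such a split (the ratios and names are illustrative):

```python
import numpy as np

def split_dataset(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the data, then carve out validation and test sets."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_val = int(len(X) * val_frac)
    n_test = int(len(X) * test_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])
```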
The L2 loss allows us to naturally divide our error into two parts: the bias and the variance (overfitting). The first one is also known as internal variance, meaning an error that is introduced by the learning algorithm itself, while the second is known as parametric variance, representing the error that is due to the limitedness of the available data. We're always dealing with a tradeoff between these two. Generally, when we have high bias, we'll want to make our model more complex (for example, a bigger NN), try a different model, or train it for longer. When we have high variance, we'll want more data, some kind of regularization, or a different model.
Regularization is a technique that allows us to reduce the variance. We can cite two different techniques, L1 and L2 regularization. The L1 penalty is the sum of the absolute values of the weights, while the L2 penalty is the sum of their squares. L1 regularization is a good choice when we have a lot of weights and want a sparser model: it tends to push weights to exactly 0, while L2 tends to make them small. Regularization introduces a penalty term in the cost function, weighted by a hyperparameter $\lambda$.
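A minimal sketch of how the L2 penalty is added to the cost, assuming a cross-entropy cost already computed over $m$ examples (names are illustrative):

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weight_matrices, lambd, m):
    """Add (lambda / 2m) * sum of squared weights to the unregularized cost."""
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(W ** 2) for W in weight_matrices)
    # during backprop, each dW then gets an extra (lambd / m) * W term
    return cross_entropy_cost + l2_penalty
```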
Dropout is another technique that allows us to reduce the variance. It randomly sets some units (activations) to zero, while keeping the rest of the model intact. Inverted Dropout is the most common implementation: it draws a mask of 0/1 values, keeping each unit with probability keep_prob, multiplies the activations by the mask, and then divides them by keep_prob so that the expected value of the activations stays unchanged.
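A minimal sketch of inverted dropout applied to a matrix of activations (names are illustrative; at test time no mask is applied):

```python
import numpy as np

def inverted_dropout(A, keep_prob=0.8, rng=None):
    """Zero out units with probability 1 - keep_prob, then rescale."""
    if rng is None:
        rng = np.random.default_rng()
    mask = (rng.random(A.shape) < keep_prob).astype(A.dtype)
    A = A * mask              # drop some activations
    A = A / keep_prob         # rescale so the expected value is unchanged
    return A, mask            # the mask is reused in backprop
```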
Other techniques we can cite are data augmentation, early stopping and ensemble methods.
Normalizing inputs speeds up the training. To do so, we subtract the mean from every feature, then divide by the standard deviation.
This happens when our gradient becomes very small or very large. Due to the chain rule, gradients in the first layers of the network can become very small when the weights (and thus the local derivatives being multiplied together) are consistently smaller than 1, and can explode when they are consistently larger than 1.
Instead of computing the updates on the whole dataset, we can just split the dataset into subsets, and compute the updates on each subset. This is called a mini-batch. When we've run through all the subsets once, an epoch has passed. The cost won't go down perfectly at every step, as the gradient is computed on a subset of the dataset, but in the long run it will behave as if we had computed the gradient on the whole dataset. When the mini-batch size is 1, we're dealing with Stochastic Gradient Descent. This is somewhat noisy and convergence is harder, but it's fast. Mini-batches having a size in the form $2^k$ (e.g. 64, 128, 256, 512) tend to fit memory and hardware better.
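A minimal sketch of iterating over mini-batches (names are illustrative):

```python
import numpy as np

def minibatches(X, y, batch_size=64, seed=0):
    """Yield shuffled mini-batches; one full pass over all of them is an epoch."""
    idx = np.random.default_rng(seed).permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# for epoch in range(num_epochs):
#     for X_batch, y_batch in minibatches(X, y, batch_size=64):
#         ... gradient step on (X_batch, y_batch) ...
```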
Using exponentially weighted averages, we're introducing weighting factors that decrease exponentially. This means that the weighting of each older datum decreases exponentially, though it never reaches zero. This is the idea behind momentum: imagine a ball rolling down the gradient slope, gaining momentum as it goes down. We are basically averaging the past updates to the parameters and using them as a form of momentum.
$$ \begin{aligned} \text{update}_{t} &= \gamma \cdot \text{update}_{t-1} + \eta \nabla w_{t} \\ w_{t+1} &= w_{t} - \text{update}_{t} \end{aligned} $$
This means that in regions having gentle slopes, we're still able to move fast. We can then introduce bias correction, a technique that helps us converge faster and more accurately. It is needed because we normally initialize the running average to $0$, so the first estimates are biased towards zero; dividing by $1 - \beta^{t}$ corrects for this.
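A minimal sketch of an exponentially weighted average with bias correction (names are illustrative):

```python
import numpy as np

def ewa_with_bias_correction(values, beta=0.9):
    """v_t = beta * v_{t-1} + (1 - beta) * x_t, then divide by (1 - beta**t)
    so the early estimates are not biased towards zero."""
    v = 0.0
    corrected = []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        corrected.append(v / (1 - beta ** t))
    return np.array(corrected)
```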
Nesterov Momentum is a technique that computes the gradient at the look-ahead point given by applying the previous update, and then uses it to compute the current one. This anticipation helps avoid overshooting and getting stuck around local minima.
AdaGrad (Adaptive Gradient) solves a problem that lies in the data: features may be dense or sparse, making their updates faster or slower. AdaGrad introduces a different learning rate for each feature, at each iteration. It adapts the learning rate to the parameters, performing smaller updates for parameters associated with frequent features, and larger updates for parameters associated with rare features. Basically, the learning rate decays with respect to the update history, as the squared gradients are accumulated:
$$ G_{t} = G_{t-1} + (\nabla w_{t})^{2}, \qquad w_{t+1} = w_{t} - \frac{\eta}{\sqrt{G_{t} + \epsilon}} \nabla w_{t} $$
RMSProp is used to perform larger updates on the weights than AdaGrad, as it divides by the square root of an exponentially weighted average of the squared gradients rather than their full sum. We know that AdaGrad's accumulated sum can only grow, so its effective learning rate eventually becomes tiny; the decaying average used by RMSProp avoids this.
Adam, standing for adaptive moment estimation, is a technique that mixes RMSprop and momentum. It computes an exponentially weighted average of the gradients (momentum term) and of their squares (RMSprop term), applies bias correction to both, and uses them for the update:
vdW = 0, vdb = 0
sdW = 0, sdb = 0
on iteration t:
    # can be mini-batch or batch gradient descent
    compute dW, db on current mini-batch
    vdW = beta1 * vdW + (1 - beta1) * dW         # momentum
    vdb = beta1 * vdb + (1 - beta1) * db         # momentum
    sdW = beta2 * sdW + (1 - beta2) * dW**2      # RMSprop
    sdb = beta2 * sdb + (1 - beta2) * db**2      # RMSprop
    vdW_corrected = vdW / (1 - beta1**t)         # bias correction
    vdb_corrected = vdb / (1 - beta1**t)         # bias correction
    sdW_corrected = sdW / (1 - beta2**t)         # bias correction
    sdb_corrected = sdb / (1 - beta2**t)         # bias correction
    W = W - learning_rate * vdW_corrected / (sqrt(sdW_corrected) + epsilon)
    b = b - learning_rate * vdb_corrected / (sqrt(sdb_corrected) + epsilon)
Finally, we can use learning rate decay to reduce the learning rate as we get closer to our optimum. This is done by multiplying the initial learning rate by a factor that shrinks with the epoch number, for example $\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch}} \, \alpha_0$.
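A minimal sketch of one common decay schedule (the formula is an example, names are illustrative):

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    """alpha = alpha0 / (1 + decay_rate * epoch); other schedules
    (exponential decay, staircase) follow the same idea."""
    return alpha0 / (1 + decay_rate * epoch)

# e.g. alpha0 = 0.1, decay_rate = 1.0 -> 0.1, 0.05, 0.033, 0.025, ...
```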
Tuning the hyperparameters is a crucial step for any neural network. The order of importance is something along the lines of:
- Learning rate.
- Momentum beta.
- Mini-batch size.
- No. of hidden units.
- No. of layers.
- Learning rate decay.
- Regularization lambda.
- Activation functions.
- Adam beta1, beta2 & epsilon.
Don't tune with a grid: it's better to use a random search and narrow it down once we find decent solutions. Furthermore, it's better to search on a logarithmic scale rather than a linear one. We have two approaches for hyperparameter tuning: the panda approach, in which we babysit a single model and nudge its parameters a little during training, and the caviar approach, running multiple models in parallel and checking the results at the end.
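A minimal sketch of sampling a hyperparameter (here the learning rate) at random on a logarithmic scale (ranges and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_learning_rate(low=1e-4, high=1e-1):
    """Sample uniformly on a log scale between low and high."""
    r = rng.uniform(np.log10(low), np.log10(high))
    return 10 ** r

candidates = [sample_learning_rate() for _ in range(20)]
# evaluate each candidate on the validation set, then narrow the range
# around the best ones and sample again
```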
Batch normalization is a technique that speeds up learning by normalizing the outputs of neural layers. This is usually done before the activation function, but it can also be done after it. It's usually applied with mini-batches. Note that if we're using it, the layer bias becomes useless, since it is removed when we subtract the mean. The technique works because it reduces the problem of the distribution of each layer's inputs changing during training, and it slightly regularizes the network by adding some noise, similarly to dropout (bigger batch sizes reduce this effect). Don't rely on it as a regularizer, though: you should still use L2 or Dropout. When we then use the network to predict a single example, we need an estimate of mean and variance that is better than the one computed on a single element (typically a running average kept during training); this is usually managed by the libraries.
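A minimal sketch of the normalization step, assuming $Z$ holds one column per example and that gamma and beta are learnable per-unit parameters (names are illustrative):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each unit over the mini-batch, then rescale with gamma, beta."""
    mu = Z.mean(axis=1, keepdims=True)           # per-unit mean over the batch
    var = Z.var(axis=1, keepdims=True)           # per-unit variance over the batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta                 # the layer bias b becomes redundant
```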
As computer vision is one of the fastest-growing applications, we're interested in building layers that are optimal for 2D images.
Convolutions are the basic operations that we'll use to build CNNs: we shift the kernel over the image and, at each position, multiply it element-wise with the underlying patch and sum the results.
Notice that with the kernel being shifted, we also need some kind of padding in order to avoid reducing the size of the image over and over. Generally speaking, if an $n \times n$ matrix is convolved with an $f \times f$ kernel, using padding $p$ and stride $s$, each output side has size $\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1$.
In addition to that, we usually use Pooling layers, which reduce the size of the data. MaxPooling is the most common one: it takes the maximum value of each window, while AveragePooling takes the average.
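A naive, illustrative NumPy sketch of the convolution and MaxPooling operations just described, assuming square inputs and kernels:

```python
import numpy as np

def conv2d(image, kernel, padding=0, stride=1):
    """Naive 2D convolution (strictly, cross-correlation, as in most DL libraries)."""
    if padding:
        image = np.pad(image, padding)
    n, f = image.shape[0], kernel.shape[0]
    out = (n - f) // stride + 1                    # (n + 2p - f)/s + 1, p already applied
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            result[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
    return result

def max_pool(image, size=2, stride=2):
    out = (image.shape[0] - size) // stride + 1
    pooled = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            pooled[i, j] = image[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return pooled
```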
ResNets are a family of deep neural networks designed to be trained at much greater depth than traditional CNNs. They are composed of stacked residual blocks: each block contains convolutional layers, and the block's input is added, unchanged, to its output. This is an example of skip-connections, repeated block after block, with the output of the last block being the output of the network. Thanks to these skip-connections, the networks can go deeper and deeper without incurring vanishing gradients.
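A hedged sketch of the skip-connection idea, using fully connected layers as a stand-in for the convolutional ones (names and shapes are illustrative; W2 must map back to the input's dimension):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(a, W1, b1, W2, b2):
    """Two linear+ReLU steps whose output is added to the block's input."""
    z1 = W1 @ a + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a)      # skip-connection: add the input before the final activation
```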
Inception was proposed with the idea of applying multiple convolution kernels in a single step, letting the algorithm learn which works best.
Many types of data are indeed sequential: we need the data that was analysed in the past to understand the current one.
In the next sections, we'll index the first element of a sequence as $x^{<1>}$ and, in general, the $t$-th element as $x^{<t>}$.
We can build a vocabulary containing all the words in our training set. Often, we just need to represent the most frequent ones: we can add an <UNK> token to the vocabulary and use it to represent all the words that are not in it.
Why can't we just use a normal neural network? There are two problems: the inputs and outputs have no standard length, and features are not shared across different positions of the text sequence. The latter means that if a word repeats ten times, the ten repetitions would be treated independently of each other. In an RNN, the activation computed at each time step is fed as an input to the next time step. This is called a recurrent layer. There are 3 weight matrices now:
- The first one is the input-to-hidden matrix, $W_{ax}: (N_{hidden\_neurons}, n_x)$
- The second one is the hidden-to-hidden matrix, $W_{aa}: (N_{hidden\_neurons}, N_{hidden\_neurons})$
- The third one is the hidden-to-output matrix, $W_{ya}: (n_y, N_{hidden\_neurons})$

Now, the forward pass is computed as follows:
$$
\begin{aligned}
a^{<1>} &= g_1(W_{aa} a^{<0>} + W_{ax} x^{<1>} + b_a) \\
\hat{y}^{<1>} &= g_2(W_{ya} a^{<1>} + b_y) \\
a^{<t>} &= g(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a) \\
\hat{y}^{<t>} &= g(W_{ya} a^{<t>} + b_y)
\end{aligned}
$$
Generally, $g_1$ is a $\tanh$/ReLU activation function, and $g_2$ is a sigmoid or softmax activation function. Usually, to perform backpropagation, we use the cross-entropy loss function:
$$
\begin{aligned}
\mathcal{L}\left(\hat{y}^{\langle t\rangle}, y^{\langle t\rangle}\right) &= -\sum_{i} y_{i}^{\langle t\rangle} \log \hat{y}_{i}^{\langle t\rangle} \\
\mathcal{L} &= \sum_{t} \mathcal{L}^{\langle t\rangle}\left(\hat{y}^{\langle t\rangle}, y^{\langle t\rangle}\right)
\end{aligned}
$$
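A minimal NumPy sketch of this forward pass, assuming one column vector per time step and a softmax output (matrix names follow the ones above, but the function itself is illustrative):

```python
import numpy as np

def rnn_forward(X, Waa, Wax, Wya, ba, by):
    """X: list of T input column vectors x^<t>, each of shape (n_x, 1)."""
    a = np.zeros((Waa.shape[0], 1))               # a^<0>
    y_hats = []
    for x_t in X:
        a = np.tanh(Waa @ a + Wax @ x_t + ba)     # a^<t>
        z = Wya @ a + by
        y_hat = np.exp(z) / np.sum(np.exp(z))     # softmax over the vocabulary
        y_hats.append(y_hat)
    return y_hats
```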
A language model is a model that predicts the next word in a sequence. It's usually trained on a corpus of text in the target language, and the model predicts the probabilities for the next word. We just get a training set of target-language text, tokenize it by building the vocabulary and one-hot encoding each word, and add <EOS> and <UNK> tokens.
To predict a whole sentence's probability, we feed one word at a time and multiply the probabilities.
To sample novel sequences, we can just pick a random first word from the distribution obtained by the softmax output at the first time step, feed it back as the input of the next time step, and repeat until we sample an <EOS> token (or reach a maximum length).
Character-level language models are a special case of language models, where the input is a sequence of characters: these tend to create longer sequences and are not as good at capturing long range dependencies.
As with every deep neural network, RNNs are subject to vanishing gradients. This means that RNNs are not good at long-term dependencies. Gradient clipping (i.e. capping gradients at a maximum value) can solve the exploding gradient problem, while careful weight initialization (e.g. He) and echo state networks (RNNs whose recurrent weights are fixed, with only the output weights being trained) can help avoid vanishing gradients. The most popular solution, though, is using GRU/LSTM networks.
GRUs introduce a memory cell $c^{<t>}$ that is updated at every time step. This cell is used to remember the output of the previous time step:
$$
\begin{aligned}
\tilde{c}^{<t>} &= \tanh(W_c [c^{<t-1>}, x^{<t>}] + b_c) \\
u^{<t>} &= \sigma(W_u [c^{<t-1>}, x^{<t>}] + b_u) \\
c^{<t>} &= u^{<t>} \odot \tilde{c}^{<t>} + (1 - u^{<t>}) \odot c^{<t-1>}
\end{aligned}
$$
With the update gate usually taking very small values, the memory cell keeps essentially the same value across many time steps, which is what lets GRUs capture long-term dependencies and mitigates vanishing gradients. The quantities involved have the following shapes:
$a^{<t>}: (N_{hidden\_neurons}, 1)$, $c^{<t>}: (N_{hidden\_neurons}, 1)$, $\tilde{c}^{<t>}: (N_{hidden\_neurons}, 1)$, $u^{<t>}: (N_{hidden\_neurons}, 1)$
This was true for the simplified GRU, but the full GRU introduces a new gate telling us how relevant the previous memory cell is to the computation of the candidate. We'll call this the relevance gate:
$$ r^{<t>} = \sigma(W_r [c^{<t-1>}, x^{<t>}] + b_r), \qquad \tilde{c}^{<t>} = \tanh(W_c [r^{<t>} \odot c^{<t-1>}, x^{<t>}] + b_c) $$
LSTMs have 3 different gates: an update gate, a forget gate and an output gate. The update gate decides how much of the candidate enters the memory cell, the forget gate how much of the old cell is kept, and the output gate how much of the cell is exposed as the activation.
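A hedged NumPy sketch of a single LSTM time step, with the three gates acting on the concatenation of the previous activation and the current input (names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, Wu, Wf, Wo, Wc, bu, bf, bo, bc):
    """One LSTM time step; each W acts on the concatenation [a_prev, x_t]."""
    concat = np.vstack([a_prev, x_t])
    u = sigmoid(Wu @ concat + bu)        # update gate
    f = sigmoid(Wf @ concat + bf)        # forget gate
    o = sigmoid(Wo @ concat + bo)        # output gate
    c_tilde = np.tanh(Wc @ concat + bc)  # candidate memory
    c = u * c_tilde + f * c_prev         # new memory cell
    a = o * np.tanh(c)                   # new hidden state
    return a, c
```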
Some sentences need information from both the past and the future: for example, to understand a word in the sentence "I love you" we may need information from the words both before and after it. BiRNNs solve this issue by having activations that come from both left and right. BiRNNs with LSTM cells appear to be commonly used, but you obviously need the whole sequence before you can process it: this is not optimal in, for example, live speech recognition.
Sometimes, stacking multiple RNN layers is powerful. In feed-forward deep nets, there can be even 200 layers, while in deep RNNs having 3 recurrent layers is already deep and expensive.
Word embeddings are a way to represent words in a vector space. This is useful for tasks like text classification, where words would otherwise be represented by sparse one-hot vectors. Up until now, we used a plain vocabulary index, but this is not optimal: we would like to encode the relationship between words, for example between king and queen.
Algorithms used to generate word embeddings examine unlabeled text and learn the representation. Word embeddings tend to make an extreme difference with smaller datasets, and they reduce the size of the input from a one-hot vector to a much smaller vector of features. The same embedding idea has even been used for face recognition, where encodings are compared to measure similarity.
Word embeddings can be used to analyze analogies: by computing the vector difference between 2 words, you can check whether the difference between them is similar to the one between 2 other words by computing their Cosine Similarity.
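A minimal sketch of the analogy check via cosine similarity, assuming a hypothetical `emb` dictionary mapping words to vectors (names are illustrative):

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# analogy "a is to b as c is to d": compare the two difference vectors,
# e.g. analogy_score(emb, "man", "king", "woman", "queen")
def analogy_score(emb, a, b, c, d):
    return cosine_similarity(emb[b] - emb[a], emb[d] - emb[c])
```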
When implementing a word embedding algorithm, you end up with an embedding matrix: this matrix will be of size (n_words, n_features).