It is standard practice to perform pseudo-random transformations on the inputs and parameters of neural networks before and during training. Shuffling or random selection of training data, initialization of weights and biases from a Gaussian (or Poisson, etc.) distribution, randomized neuron dropout, and most objective function optimization procedures, such as stochastic gradient descent or adaptive moment estimation (Adam), readily spring to mind when one considers current methods.
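All of these forms of randomization appear in just a few lines of a typical training setup. Here is a minimal PyTorch sketch (the layer sizes, dropout rate, and initialization scale are arbitrary assumptions made for illustration):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 100 examples with 20 features each (sizes are arbitrary).
data = TensorDataset(torch.randn(100, 20), torch.randint(0, 2, (100,)))

# Randomization 1: training examples are re-shuffled every epoch.
loader = DataLoader(data, batch_size=10, shuffle=True)

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # Randomization 2: random neuron dropout.
    nn.Linear(64, 2),
)

# Randomization 3: weights initialized from a Gaussian distribution.
for layer in model.modules():
    if isinstance(layer, nn.Linear):
        nn.init.normal_(layer.weight, mean=0.0, std=0.05)

# Stochastic gradient descent operates on the randomly drawn minibatches.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
```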
It is worth asking the following question: why do all these procedures involve randomization, and why not simply arrange the input examples in any arbitrary order and maintain that during training instead? Each case is considered in the following sections.
Some experimentation will convince one that randomization leads to better performance: for example, failing to randomize the initial weights (setting them all to some constant, say) leads to markedly worse training, since every neuron in a layer then receives an identical gradient signal and the layer's neurons never differentiate from one another.
To begin, let's consider gradient descent.
Stochastic gradient descent can be thought of as the foundation upon which most optimization procedures are built, and the online algorithm (one training example at a time, that is) is as follows:
Given a vector of a network's weights and biases $v$ and a training set of $n$ examples, the objective function may be written as an average of per-example losses, $F(v) = \frac{1}{n}\sum_{j=1}^{n} F_j(v)$. Shuffle the training set, then for each example $j$ in turn perform the update

$$v \leftarrow v - \eta \, \nabla F_j(v)$$

where $\eta$ is the learning rate. When every example has been visited once, an epoch is complete and the process repeats.

The randomization step is that the dataset is shuffled. Why take that step in the first place? The objective function $F$ is, after all, the same average regardless of the order in which its terms $F_j$ are visited.
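For concreteness, here is a minimal sketch of the online update in plain numpy (the squared per-example loss and the toy regression data are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression: find v such that X @ v ~ y.
X = rng.normal(size=(100, 5))
v_true = rng.normal(size=5)
y = X @ v_true

def grad_Fj(v, x_j, y_j):
    """Gradient of the per-example squared loss F_j(v) = (x_j . v - y_j)^2."""
    return 2.0 * (x_j @ v - y_j) * x_j

v = rng.normal(size=5)  # random initial configuration
eta = 0.01              # learning rate

for epoch in range(10):
    for j in rng.permutation(len(X)):      # the randomization step: shuffle
        v -= eta * grad_Fj(v, X[j], y[j])  # online update v <- v - eta * grad

print(np.linalg.norm(v - v_true))  # approaches zero as training proceeds
```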
Gradient descent may take many iterations to optimize the loss function. To speed up this process, gradient descent may be modified to include parameters such as momentum, or extended into optimizers like RMSprop or AdaGrad.
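As a point of reference, one common formulation of the momentum modification replaces the plain update above with an exponentially decaying accumulation of past gradients (with $\gamma$ a momentum coefficient, typically set near $0.9$):

$$m \leftarrow \gamma \, m + \eta \, \nabla F_j(v), \qquad v \leftarrow v - m$$

Persistent gradient directions are thereby amplified while oscillating ones partially cancel.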
It has been asserted in the past that a problem with gradient-based optimization algorithms is that they are prone to becoming 'trapped' in local minima or saddle points that are non-optimal. Approaching a minimal point from any direction, the gradient vanishes, $\nabla F_j(v_i) \to 0$, such that any sufficiently small step changes the objective function by an arbitrarily small amount and gradient descent effectively halts at a non-optimal point.
This can easily be tested, and for this author the results are as follows: across a wide array of objective functions and neural net architectures, disappearing gradients have not been observed to occur, and thus it seems that even for very non-convex objective functions there is virtually no likelihood of getting 'stuck' at a local minimum, or even of landing on a saddle point. The important thing to note here is that a function can easily be non-convex without containing any local minima, or even any saddle points, as all that is required for non-convexity is for some gradient vector $\nabla F_j(v_i)$, extended as a tangent, to intersect the cost function's surface.
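One way to run such a test is simply to record the gradient norm at every step and check whether it ever approaches zero while the loss is still far from small. A sketch (the small architecture and synthetic data are arbitrary assumptions):

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(256, 10)
y = torch.randint(0, 2, (256,))

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    # Norm of the full gradient vector across all parameters.
    gnorm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
    if gnorm < 1e-3 and loss > 0.1:
        print(f"step {step}: gradient vanished far from a good loss")
    optimizer.step()
```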
This leads to an intuitive explanation as to why objective functions in high-dimensional space contain few if any local minima: to be a minimal (or saddle) point, the gradient must vanish along every one of the thousands or millions of parameter dimensions simultaneously, and the chance of this occurring at a non-optimal point shrinks rapidly as the number of dimensions grows.
Therefore it is almost impossible for a high-dimensional objective function to have true local minima (or true saddle points), and therefore regularizers and randomization are generally not needed to combat this possibility.
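A standard back-of-the-envelope version of this argument, under the strong simplifying assumption that the curvature along each of $n$ parameter directions at a critical point is independently upward with probability $p$: the chance that such a critical point is a true local minimum is

$$P(\text{minimum}) = p^{\,n}$$

which for $p = 1/2$ and a network with $n = 10^6$ parameters is vanishingly small.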
Now back to the original question: why does stochastic gradient descent work best if it is stochastic? Why not enumerate all training examples and iterate through them in a fixed order, updating the network via gradient descent using each example's gradient $\nabla F_j(v_i)$ in turn?
This section will require some familiarity with recurrent neural networks, which will be given briefly here. A defining characteristic of recurrent neural networks is that they are updated sequentially as time passes. The simplest form, a fully recurrent network, has one hidden layer that is updated at each sequential element, with one element viewed per time step. For example, a text classification network performing sentiment analysis (determining whether a statement expresses positive or negative emotion) on the following sentence:
"The dog went to the store."
would simply iterate through the sentence ('The', then 'dog', then 'went', etc.), and at each word the activation of each hidden neuron would be updated. At the end of the sentence, the hope is that some information from early elements is retained in the hidden layer's activations, as these are then used to generate some output. Stochastic gradient descent (or some other optimization procedure) and backpropagation are then performed on the network, updating the network's weights and biases in accordance with the output that resulted from those activations.
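A minimal sketch of this forward pass (the plain Elman-style recurrence, with a hypothetical five-word vocabulary and arbitrary layer sizes; not any particular library's RNN):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"the": 0, "dog": 1, "went": 2, "to": 3, "store": 4}
embed_dim, hidden_dim = 8, 16

E = rng.normal(scale=0.1, size=(len(vocab), embed_dim))    # word embeddings
W_xh = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(hidden_dim, 2))         # 2 sentiment classes

h = np.zeros(hidden_dim)
for word in "the dog went to the store".split():
    x = E[vocab[word]]
    # The same hidden layer is updated at every time step, so the final
    # activations retain (possibly faint) traces of the early words.
    h = np.tanh(x @ W_xh + h @ W_hh)

logits = h @ W_hy  # output generated from the final hidden activations
```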
Unfortunately, in practice recurrent neural networks suffer from the same problem as very deep networks of all kinds: unstable (vanishing or exploding) gradients.
Now consider the process of training a non-recurrent neural network, perhaps a convolutional net performing image classification. The first training example (or batch of examples) is fed to the network with initial weights and biases (represented together as the configuration $v_0$), an output and loss are computed, and the optimizer updates the configuration to $v_1$. The second example updates $v_1$ to $v_2$, and so on.
Therefore the sequence of weight and bias configurations over a training run is $v_0, v_1, v_2, \ldots, v_n$.
Now note that the final configuration $v_n$ depends on $v_{n-1}$, which in turn depends on $v_{n-2}$, and so on back to the initial configuration $v_0$ and the ordered inputs that produced each update.
Therefore successive configurations and outputs form a directed graph ordered in time, and an acyclic one, since training never revisits an earlier configuration. Recall that a recurrent neural network is commonly described as a network whose nodes form exactly such a directed graph along a temporal sequence.
But if we replace 'node' with 'configuration', the definition of a recurrent neural network is arrived at. This means that the final configuration $v_n$ plays the role of a recurrent network's final hidden state, retaining (perhaps faint) traces of every input and every configuration that preceded it.
To summarize, training any neural network is not instantaneous but instead occurs over a certain time interval, and therefore may be thought of as a dynamical system. In this system, the final network configuration $v_n$ depends on the entire ordered history of inputs and intermediate configurations that led to it.
Considered carefully, therefore, any network undergoing training via updates to its weights and biases over time is a type of recurrent neural network, with training states $v_0, v_1, \ldots, v_n$ in place of sequential hidden-layer activations.
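To make the analogy concrete, the configuration trajectory can be recorded directly during training. A sketch reusing the toy linear-regression setup from earlier (all sizes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5)

v = rng.normal(size=5)    # v_0: the initial configuration
trajectory = [v.copy()]   # the sequence v_0, v_1, ..., v_n

for j in rng.permutation(len(X)):
    grad = 2.0 * (X[j] @ v - y[j]) * X[j]
    v = v - 0.01 * grad          # v_{i+1} is a function of v_i and input j,
    trajectory.append(v.copy())  # exactly like a recurrent hidden state

print(len(trajectory), "states recorded; v_n =", trajectory[-1])
```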
It may be beneficial, therefore, to develop a theory extending memory gates similar to those used in LSTMs and GRUs to the training of any network, such that a network's configuration at any point during training could selectively retain or discard the influence of inputs seen earlier.
It was asserted in the previous section that the sequence of neural network configurations, inputs, and outputs forms a directed, acyclic graph during training. Because the graph is directed, there is no guarantee that reversing the order of the training inputs yields the same final configuration: in general, traversing the inputs in reverse traces a different path through configuration space and produces a different trained network.
Now imagine training for one epoch only, meaning that all training examples are fed to the network exactly once (singly or in batch form). What happens if we first group the training examples by similarity, perhaps using latent space vectorization, and then feed the resulting sequence to the network? For simplicity, suppose there are two types of inputs (in equal number), those belonging to set $A$ and those belonging to set $B$, and that every member of $A$ is presented before any member of $B$.
Assuming that the network's hyperparameters (non-trainable variables) are not changed during training, the updates made while viewing members of $A$ are progressively overwritten by the updates made while viewing members of $B$.
This is to say that the network after training would be expected to classify members of $B$ more accurately than members of $A$, simply because $B$ was seen more recently.
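In code, the grouped and interleaved orderings for a single epoch might be constructed as follows (a sketch; the index sets stand in for a real latent-space similarity grouping):

```python
import numpy as np

# 20 inputs in two equal sets: indices 0-9 stand in for members of A,
# indices 10-19 for members of B.
A = np.arange(10)
B = np.arange(10, 20)

# Grouped ordering for one epoch: all of A, then all of B.
grouped = np.concatenate([A, B])

# Interleaved ordering: alternate between the two sets.
interleaved = np.empty(20, dtype=int)
interleaved[0::2] = A
interleaved[1::2] = B
```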
An example:
Suppose one were to attempt to infer the next letter in a word using a character-level recurrent neural net, that all inputs were slight variations on a single phrase (half the variants ending in one letter and half in another), and that the network was trained on the task of returning the final letter. Assuming the training was effective, the expected output given an input of the first variety would be the first ending, and the expected output given an input of the second variety would be the second. But if all examples of one variety were grouped at the end of training, the network would be biased toward that variety's ending regardless of the input.
Thus although alternating input types would minimize the value of the objective function across both types, a strict alternation is itself a periodic pattern, and such a pattern may leave its own imprint on the final configuration.
Randomization of inputs between or during training epochs is also effective for minimizing objective function loss. Why is this the case? Once the training inputs have been shuffled, what further purpose would shuffling them again serve?
In the above sections, it was claimed that a neural net configuration $v_n$ at the end of training depends not only on the training inputs themselves but also on the order in which they were presented, just as a recurrent network's hidden state depends on the order of its input sequence.
Suppose we have three elements per training set, and we want to perform three epochs of training. After randomizing the inputs, there is some fixed ordering, say $(b, c, a)$, such that the full training session is

$$b, c, a, \; b, c, a, \; b, c, a$$
But this again is a periodic sequence, its period being the size of the dataset. Over many epochs, this periodicity becomes reflected in the final network configuration $v_n$, just as any periodic input leaves its signature on the hidden state of a recurrent network; re-shuffling between epochs removes the period entirely.
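A minimal sketch contrasting the two ordering regimes discussed here, re-shuffling each epoch versus repeating a single fixed permutation (the toy model and data are assumptions; whether the difference matters depends on the task):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5)

def train(reshuffle_each_epoch, epochs=50, eta=0.02):
    v = np.zeros(5)
    order = rng.permutation(len(X))          # shuffled once before training
    for _ in range(epochs):
        if reshuffle_each_epoch:
            order = rng.permutation(len(X))  # break the period-30 pattern
        for j in order:                      # else: the same sequence repeats
            v -= eta * 2.0 * (X[j] @ v - y[j]) * X[j]
    return v

v_fixed = train(reshuffle_each_epoch=False)      # periodic input sequence
v_reshuffled = train(reshuffle_each_epoch=True)  # aperiodic input sequence
```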
Returning to the successive configurations $v_0, v_1, \ldots, v_n$: each ordering of the training inputs traces out a different path through configuration space, ending at a different $v_n$.
Now suppose one epoch is completed, and the network sits at some configuration $v_k$. Which ordering of the inputs should be used for the next epoch?
One option would be to use the same sequence of inputs as for the first epoch, that is, the ordering $(b, c, a)$ above, but there is little reason to expect that an ordering chosen while at $v_0$ remains a good one starting from $v_k$.
To see why this is, observe that we can assign a vector to each input sequence, pointing from the configuration where the epoch begins to the configuration where it ends. The vector associated with a given ordering depends on the starting configuration, so the ordering that was best from $v_0$ need not be best from $v_k$.
Finally, is it very likely that the ideal path from $v_0$ to a configuration minimizing the objective function happens to be the one traced by a single fixed ordering, repeated epoch after epoch? With $n!$ possible orderings of $n$ inputs, almost certainly not.
If many paths are of similar 'quality', but some paths are much better, then reordering the input sequence between epochs can act to minimize the objective function in expectation, sampling many candidate paths rather than committing to one arbitrary choice.
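The size of this path space is easy to state (a simple count, not an empirical claim): a dataset of $n$ examples admits $n!$ orderings per epoch, so over $e$ epochs there are

$$(n!)^{\,e}$$

possible training paths, of which any single fixed permutation explores exactly one.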
Neural networks were studied long before they became feasible computational models, and one of the most significant breakthroughs that allowed for this transition was the development of backpropagation.
Backpropagation can be thought of as an optimal graph traversal method (or a dynamic programming solution) that updates a network's weights and biases with the fewest computations possible. The goal of training is to find some combination of weights and biases (and any other trainable parameters) that yields a small objective function value, and one way this could be achieved is by simply trying all possible weights and biases. That method is grossly infeasible: even minuscule networks with neuron counts in the single digits have far too many possible combinations (assuming 32-bit precision) to enumerate.
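To put numbers to this, a back-of-the-envelope count under the assumption that each parameter may take any of its $2^{32}$ representable values: just nine such parameters already admit

$$(2^{32})^{9} = 2^{288} \approx 5 \times 10^{86}$$

distinct configurations, more than the estimated number of atoms in the observable universe, and a network with nine neurons would typically have more than nine parameters.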