In this tutorial we will see how to implement a ladder network as explained in [1]. In this paper the authors show how unsupervised learning using a denoising autoencoder with lateral connections helps improve classification accuracy in supervised learning.
To reproduce the results reported in the paper, run the following command (the best test error we got was 0.6%). To run this script you will need the following Torch packages: nn, nngraph, dp, dpnn, optim, and, if using CUDA (the --useCuda flag), cunn and cutorch.
th tutorials/ladder.lua --verbose --eta 500 --epochs 100 --learningRate 0.002 --linearDecay --endLearningRate 0 --startEpoch 50 --useCuda --deviceId 1 --noiseSigma 0.3 --useBatchNorm --batchSize 100 --adam --noValidation --attempts 10
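The required packages can be loaded at the top of the script roughly as follows (a minimal sketch; useCuda and deviceId stand in for the parsed --useCuda and --deviceId flags):
require 'nn'
require 'nngraph'
require 'dp'
require 'dpnn'   -- provides nn.WhiteNoise among others
require 'optim'
if useCuda then
   require 'cunn'
   require 'cutorch'
   cutorch.setDevice(deviceId)
end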
The unsupervised learning (denoising) task supplements the supervised learning task (classification in this case). As in autoencoders, this network has an encoder and a decoder. The output of the encoder is also used for classification: it is N-dimensional, where N is the number of classes, and this N-dimensional vector is used for computing the classification cost as well as fed into the decoder.
Encoder/classifier units are defined as
Z = nn.BatchNormalization(hidden_units)(nn.Linear(inputDims, hidden_units)(previous_H))
and the unit's activation is
H = nn.ReLU()(nn.CMul(hidden_units)(nn.Add(hidden_units)(Z)))
For the first layer, previous_H is the corrupted input:
previous_H = nn.WhiteNoise(mean, sigma)(input)
H for the last encoder unit is defined as
H = nn.LogSoftMax()(nn.CMul(hidden_units)(nn.Add(hidden_units)(Z)))
This last H feeds into the negative log likelihood criterion.
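Putting these pieces together, one encoder/classifier unit can be written as a small helper (a sketch under assumed names; inputDims and hidden_units are the unit's input and output sizes):
-- Builds one encoder/classifier unit as nngraph nodes.
-- isLast selects LogSoftMax for the final (classification) unit.
local function encoderUnit(previous_H, inputDims, hidden_units, isLast)
   local Z = nn.BatchNormalization(hidden_units)(
                nn.Linear(inputDims, hidden_units)(previous_H))
   local scaled = nn.CMul(hidden_units)(nn.Add(hidden_units)(Z))
   local H = isLast and nn.LogSoftMax()(scaled) or nn.ReLU()(scaled)
   return Z, H   -- Z is laterally connected to the decoder
end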
Typically in a denoising autoencoder the input samples are corrupted using dropout (nn.Dropout), but in this paper the authors use isotropic Gaussian noise (nn.WhiteNoise) with zero mean.
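To make the contrast concrete, both corruption options are sketched below (noiseSigma stands for the parsed --noiseSigma flag; only the WhiteNoise variant is used in this tutorial):
-- dropout corruption, common in denoising autoencoders (not used here)
-- corrupted = nn.Dropout(0.5)(input)
-- zero-mean isotropic Gaussian corruption, as in the paper
corrupted = nn.WhiteNoise(0, noiseSigma)(input)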
Z units in the encoder are laterally connected to the corresponding units in the decoder. The output of the decoder unit for neuron i is defined by
z^_i = a_i1 * z_i + a_i2 * sigmoid(a_i3 * z_i + a_i4) + a_i5
where
a_ij = c_ij * u_i + d_ij
U is the output of the decoder unit's nn.Linear(). For the topmost layer U is zero. Z is the output of the corresponding encoder unit (this is the lateral connection: the decoder takes input from its previous unit through U as well as from the corresponding encoder unit). For the lowest decoder layer, Z is the corrupted input signal. c_j and d_j are trainable weight vectors. This forms the crux of the ladder network and can be implemented easily using nngraph as follows.
For the topmost layer U = 0 and Z is the batch-normalized output of the corresponding (in this case last) encoder/classifier unit. Since u = 0, each a_j reduces to the constant vector d_j, so Z^ for the topmost layer is defined as
z_hat1 = nn.CMul(hiddens[i])(Z)                 -- a1 * z (elementwise)
z_hat3 = nn.CMul(hiddens[i])(Z)                 -- a3 * z
z_hat34 = nn.Add(hiddens[i])(z_hat3)            -- a3 * z + a4
z_hatSigmoid34 = nn.Sigmoid()(z_hat34)          -- sigmoid(a3 * z + a4)
z_hat234 = nn.CMul(hiddens[i])(z_hatSigmoid34)  -- a2 * sigmoid(a3 * z + a4)
-- Z_hat = z^ = a1 * z + a2 * sigmoid(a3 * z + a4) + a5
Z_hat = nn.Add(hiddens[i])(nn.CAddTable()({z_hat1, z_hat234}))
For lower decoder units, Z^ is defined as
-- previous_hidden_units is the size of previous_Z_hat
u = nn.Linear(previous_hidden_units, hidden_units)(previous_Z_hat)
cu1 = nn.CMul(hidden_units)(u)
du1 = nn.Add(hidden_units)(u)
a1 = nn.CAddTable()({cu1, du1})
cu2 = nn.CMul(hidden_units)(u)
du2 = nn.Add(hidden_units)(u)
a2 = nn.CAddTable()({cu2, du2})
cu3 = nn.CMul(hidden_units)(u)
du3 = nn.Add(hidden_units)(u)
a3 = nn.CAddTable()({cu3, du3})
cu4 = nn.CMul(hidden_units)(u)
du4 = nn.Add(hidden_units)(u)
a4 = nn.CAddTable()({cu4, du4})
cu5 = nn.CMul(hidden_units)(u)
du5 = nn.Add(hidden_units)(u)
a5 = nn.CAddTable()({cu5, du5})
z_hat1 = nn.CMulTable()({a1, Z})                    -- a1 * z (elementwise)
z_hat2 = nn.CMulTable()({a3, Z})                    -- a3 * z
z_hat3 = nn.Sigmoid()(nn.CAddTable()({z_hat2, a4})) -- sigmoid(a3 * z + a4)
z_hat4 = nn.CMulTable()({a2, z_hat3})               -- a2 * sigmoid(a3 * z + a4)
Z_hat = nn.CAddTable()({z_hat1, z_hat4, a5})        -- z^
Z_hat is z^. The final Z_hat is the output of the decoder and feeds into the mean squared error criterion.
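Given the nodes above, the classifier output and the reconstruction can be exposed as the two outputs of a single nngraph module (a sketch; input, H_last and Z_hat_bottom are assumed names for the graph's root node, the last encoder unit's LogSoftMax output and the lowest decoder unit's Z_hat):
-- input is the graph's root node (e.g. nn.Identity()())
-- one graph, two outputs: class log-probabilities and reconstruction
model = nn.gModule({input}, {H_last, Z_hat_bottom})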
The negative log likelihood criterion is used for the classification task.
nll = nn.ClassNLLCriterion()
The mean squared error criterion is used for the auxiliary (denoising) task.
mse = nn.MSECriterion()
These two training criteria are combined using eta, which determines the weight of the auxiliary task. If eta is zero then the model is trained for classification only. The combined criterion is
criterions = nn.ParallelCriterion()
criterions:add(nll)
criterions:add(mse, eta)
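A training step then looks roughly like this (a sketch; model is the gModule above, target holds the class labels, and the clean input x doubles as the denoising target, which the paper batch-normalizes):
-- forward: outputs = {log-probabilities, reconstruction};
-- corruption happens inside the graph via nn.WhiteNoise
local outputs = model:forward(x)
local loss = criterions:forward(outputs, {target, x})
-- backward: gradients from both criteria flow back through the graph
local gradOutputs = criterions:backward(outputs, {target, x})
model:backward(x, gradOutputs)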
[1] Rasmus, Antti, Harri Valpola, and Tapani Raiko. "Lateral Connections in Denoising Autoencoders Support Supervised Learning." arXiv preprint arXiv:1504.08215 (2015).