Paper, Tags: #nlp, #architectures
RNNs have two main problems:
- They cannot capture very long-term dependencies because of vanishing and exploding gradients.
- Because they process the sequence step by step, the computation is difficult, if not impossible, to parallelize, which increases training time.
Transformer (Vaswani et al., 2017) and TCN (Bai et al., 2018):
- non-recurrent sequence models
- built on convolution operation and attention mechanism
- allow information to flow across arbitrary distances with no intermediate loss
- multi-head attention: every position is directly connected to every other position in the sequence (see the sketch after this list)
- capture long-term dependencies
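A minimal sketch of scaled dot-product self-attention, only to illustrate this direct connection between every pair of positions; the single-head form and the toy shapes are simplifications, not the papers' code.

```python
# Simplified single-head attention: every position weighs every other position.
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                   # one row of weights per position
    return weights @ v

x = torch.randn(2, 16, 64)   # toy batch: 2 sequences of length 16
out = attention(x, x, x)     # self-attention: path length 1 between any two positions
```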
Issues with multi-head attention mechanism:
- ignore local structures in sequences
- loss of sequential information of positions
- heavily rely on position embeddings
    - which have limited effects
    - require design efforts
Novel sequence learning model (the R-Transformer): a multi-layered architecture built on RNNs and the standard Transformer that keeps the advantages of both and mitigates their drawbacks. It captures both local structures and global long-term dependencies in sequences without any position embeddings. Code is available on GitHub.
The model consists of a stack of identical layers, each having three components organized hierarchically: a LocalRNN, multi-head attention, and a position-wise feedforward network.
LocalRNN (local recurrent neural network): reorganizes the original long sequence into many short sequences that contain only local information, by constructing a local window of fixed size for each target position.
Positions in each local window form a short sequence, from which the RNN learns a latent representation.
The locality in the sequence is thus explicitly captured. Because the local window slides along the sequence position by position, global sequential information is also incorporated. Short-term dependencies are captured, but not long-term ones.
The LocalRNN takes an input sequence and outputs a sequence of hidden representations that incorporate information of local regions.
Unlike the position embeddings used in the Transformer and TCN, the LocalRNN fully captures the sequential information within each window.
It works similarly to a 1-D CNN, where each local window is processed by convolution operations; convolution, however, completely ignores the sequential order of positions within the local windows.
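A hedged PyTorch sketch of the LocalRNN idea, not the authors' implementation: each position is represented by the last hidden state of an RNN run over its window of the M most recent positions (the GRU cell, the zero-padding, and the window size used below are assumptions for illustration).

```python
# Sketch of a LocalRNN: an RNN over each sliding local window of size M.
import torch
import torch.nn as nn

class LocalRNN(nn.Module):
    def __init__(self, d_model, window_size):
        super().__init__()
        self.M = window_size
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)  # GRU is an assumed choice

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        B, L, D = x.shape
        pad = x.new_zeros(B, self.M - 1, D)     # left-pad so early positions have a full window
        x = torch.cat([pad, x], dim=1)          # (B, L + M - 1, D)
        windows = x.unfold(1, self.M, 1).transpose(-2, -1)  # (B, L, M, D) sliding windows
        _, h = self.rnn(windows.reshape(B * L, self.M, D))  # run the RNN on every short window
        return h[-1].view(B, L, D)              # last hidden state represents each position

h = LocalRNN(d_model=64, window_size=4)(torch.randn(2, 16, 64))  # -> (2, 16, 64)
```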
A multi-head attention sub-layer sits on top of the LocalRNN to capture the global long-term dependencies; it plays a role similar to the pooling operation in CNNs.
Multi-head attention is effective at learning long-term dependencies because it allows a direct connection between every pair of positions. Each position attends to all positions in the past and obtains a set of attention scores used to refine its representation.
The position-wise feedforward network is applied to each position independently and identically and transforms the features non-linearly.
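A sketch of these two sub-layers, assuming that "attends to all positions in the past" is realized with a causal mask; the PyTorch modules, head count, and hidden sizes are illustrative choices, not taken from the paper.

```python
# Causal multi-head attention over LocalRNN outputs, then a position-wise feedforward net.
import torch
import torch.nn as nn

seq_len, d_model, n_heads = 16, 64, 4
mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)  # block future positions

h = torch.randn(2, seq_len, d_model)         # stand-in for LocalRNN outputs
attn_out, _ = mha(h, h, h, attn_mask=mask)   # every position attends to all past positions

ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                    nn.Linear(4 * d_model, d_model))
out = ffn(attn_out)                          # same non-linear transform at every position
```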
- LocalRNN
- Layer normalization
- Multi-head attention
- Layer normalization
- Feed forward
- Layer normalization
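A sketch of one full layer following the list above, reusing the LocalRNN class sketched earlier; the residual connections around each sub-layer are an assumption carried over from the standard Transformer, and note that no position embeddings are added anywhere.

```python
# One R-Transformer-style layer: LocalRNN -> multi-head attention -> feedforward,
# each followed by layer normalization (residual connections are assumed).
import torch
import torch.nn as nn

class RTransformerBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, window_size=4):
        super().__init__()
        self.local_rnn = LocalRNN(d_model, window_size)  # LocalRNN sketch from above
        self.norm1 = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x):                                # x: (batch, seq_len, d_model)
        x = self.norm1(x + self.local_rnn(x))            # local structures
        L = x.size(1)
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), 1)
        a, _ = self.mha(x, x, x, attn_mask=mask)
        x = self.norm2(x + a)                            # global long-term dependencies
        return self.norm3(x + self.ffn(x))               # position-wise feedforward

y = RTransformerBlock()(torch.randn(2, 16, 64))          # -> (2, 16, 64)
```

Stacking several such blocks gives the multi-layered architecture described above.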
Sequential information
- TCN: the locality in sequences is captured by convolution filters, but the sequential information is ignored
- LocalRNN: fully incorporates it through the sequential nature of the RNN
Long-term dependencies
- TCN: uses dilated convolutions that operate on nonconsecutive positions, so it misses a considerable amount of information from a large portion of positions in each layer
- R-Transformer: attends to every past position and therefore takes much more information into account
Both have similar long-term memorization capacities.
Locality in sequences
- Transformer: captures locality only vaguely, since multi-head attention operates on all positions
- R-Transformer: explicitly and effectively captures the locality in sequences with LocalRNN
Sequential information
- Transformer: has position embeddings
    - limited effect
    - require effort to design effective position embeddings
- R-Transformer: doesn't need position embeddings
Take a 28x28x1 image and flatten it into a 784x1 sequence. This reshaping puts pixels that are adjacent in the original image far apart in the sequence, so the sequence model needs to learn very long-term dependencies.
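A sketch of this pixel-sequence conversion (toy tensors standing in for the real images):

```python
# Flatten each 28x28 grayscale image into a sequence of 784 single-pixel steps.
import torch

images = torch.rand(32, 1, 28, 28)     # toy batch standing in for real images
seq = images.view(32, 784, 1)          # (batch, seq_len=784, features=1)
# Vertically adjacent pixels are now 28 positions apart, forcing long-term dependencies.
```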
- RNNs give poor results because they cannot memorize very long-term dependencies
- Transformer and TCN: much better results
- TCN better than Transformer, presumably because the Transformer cannot model locality very well
- R-Transformer better than TCN
Music modeling task, on a dataset of British and American folk tunes:
- LSTM and TCN better than the Transformer, presumably because music tunes exhibit strong local structures
- R-Transformer better than TCN, because TCN ignores the sequential information within local structures
Character-level and word-level language modeling tasks; a corpus of about 1 million words is used to investigate the sequence models.
- Character-level language modeling: the model is required to predict the next character given a context
- Word-level language modeling: the model is required to predict the next word given the contextual words (see the sketch below)
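A sketch of how next-token targets are typically built for both tasks (toy token ids, not the actual corpus):

```python
# Inputs and targets for next-token prediction: targets are the inputs shifted by one step.
import torch

tokens = torch.randint(0, 10000, (2, 65))        # toy token ids, assumed vocab of 10k
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # (batch, 64) each
# At every position the model predicts a distribution over the vocabulary
# and is trained with cross-entropy against the shifted targets.
```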
Comparison
- RNN worse results
- Transformer slightly better; as in the music case, this is likely because language also exhibits strong local structures
- TCN better than Transformer
- R-Transformer better than TCN
- LSTM best results