Take a look at different positional encoding schemes in self-attention:
- Sinusoidal (Attention Is All You Need)
- Learned
- RoPE (RoFormer: Enhanced Transformer with Rotary Position Embedding)
- ALiBi (Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation)
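As a point of reference for the first scheme, here is a minimal sketch of the sinusoidal encoding from Attention Is All You Need, written in plain Python. The function name `sinusoidal_encoding` and the default sizes are illustrative assumptions, not taken from this repository's code.

```python
import math

def sinusoidal_encoding(seq_len, embed_size, base=10000.0):
    """Return a seq_len x embed_size table of sinusoidal position encodings.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    table = []
    for pos in range(seq_len):
        row = []
        for j in range(embed_size):
            # Dimensions 2i and 2i+1 share the frequency base^(-2i / embed_size).
            angle = pos / base ** ((j // 2 * 2) / embed_size)
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Assumed sizes: 10 positions, embedding size 20.
pe = sinusoidal_encoding(seq_len=10, embed_size=20)
```

Each row of `pe` would be added to the token embedding at that position before the first attention layer. The learned scheme replaces this fixed table with trainable parameters, while RoPE and ALiBi act inside the attention computation instead.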
The toy model solves a copy task: the goal is to reproduce the sequence that precedes the <copy>
token immediately after it.
1 7 2 <copy> _ _ _ _ _ _ → 1 7 2 <copy> 1 7 2 _ _ _
9 <copy> _ _ _ _ _ _ _ _ → 9 <copy> 9 _ _ _ _ _ _ _
2 2 4 3 <copy> _ _ _ _ _ → 2 2 4 3 <copy> 2 2 4 3 _
1 2 3 4 5 6 7 <copy> _ _ → 1 2 3 4 5 6 7 <copy> 1 2
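The examples above could be generated with something like the following sketch. The fixed length of 10, the digit vocabulary 1 to 9, and the truncation of copies that do not fit (as in the last example) are inferred from the examples; the names `make_example` and `random_example` are hypothetical.

```python
import random

COPY, PAD = "<copy>", "_"

def make_example(seq, total_len=10):
    """Build (input, target) token lists for one copy-task example."""
    n_free = total_len - len(seq) - 1  # slots left after the <copy> token
    if n_free < 0:
        raise ValueError("sequence plus <copy> does not fit in total_len")
    prefix = list(seq) + [COPY]
    source = prefix + [PAD] * n_free
    # Copy as many tokens as fit; the remainder of the row stays padded.
    copied = list(seq)[:n_free]
    target = prefix + copied + [PAD] * (n_free - len(copied))
    return source, target

def random_example(total_len=10, vocab="123456789", rng=random):
    """Draw a random sequence (assumed vocabulary: digits 1-9) and wrap it."""
    length = rng.randint(1, total_len - 1)
    seq = [rng.choice(vocab) for _ in range(length)]
    return make_example(seq, total_len)

source, target = make_example("1 7 2".split())
print(" ".join(source), "→", " ".join(target))
```

Keeping input and target at the same fixed length means the loss can be computed position by position without extra masking logic.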
The models are trained for 2000 epochs with single-head attention, 2 layers, and an embedding size of 20 (embed_size). Each positional scheme is evaluated 5 times, and we plot the accuracy on the test set.
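The text does not spell out how accuracy is computed. One plausible definition, sketched here purely as an assumption, is the fraction of positions after the <copy> token where the prediction matches the target. The function name `copy_accuracy` is hypothetical.

```python
def copy_accuracy(predicted, target, copy_token="<copy>"):
    """Fraction of positions after the copy token where predicted == target.

    Assumed metric for illustration; the actual evaluation may differ.
    """
    start = target.index(copy_token) + 1
    scored = list(zip(predicted[start:], target[start:]))
    if not scored:
        return 1.0  # nothing to copy, so nothing can be wrong
    return sum(p == t for p, t in scored) / len(scored)

target = "7 1 8 2 <copy> 7 1 8 2 _".split()
predicted = "7 1 8 2 <copy> 7 1 8 8 _".split()
print(copy_accuracy(predicted, target))  # 4 of 5 scored positions match
```

Scoring only the positions after <copy> avoids rewarding the model for the prefix, which is identical in the input and the target.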
Running on:
7 1 8 2 <copy> _ _ _ _ _ → 7 1 8 2 <copy> 7 1 8 2 _