Single-file implementation of Attention Is All You Need
- It's a flexible, interpretable architecture, generally more performant than RNNs and applicable to POMDPs.
- A simple implementation to play with my Titan V and see how much I can get out of mixed-precision training and the matmul-friendly tensor cores, given this matmul-heavy architecture (see the sketch below).
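A minimal sketch of how the attention matmuls can be wrapped in mixed-precision training, assuming PyTorch with `torch.cuda.amp`; the `TinyAttention` module, dummy data, and loss here are illustrative placeholders, not the actual code in this repo.

```python
import math
import torch
import torch.nn as nn

class TinyAttention(nn.Module):
    """Single-head scaled dot-product attention: two large matmuls, a good fit for tensor cores."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.d_model = d_model

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)  # (batch, seq, seq)
        return self.out(torch.softmax(scores, dim=-1) @ v)

# Hypothetical training step with automatic mixed precision:
# fp16 matmuls on tensor cores, fp32 master weights and loss scaling.
device = "cuda"
model = TinyAttention(d_model=512).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 128, 512, device=device)      # dummy batch: (batch, seq, d_model)
target = torch.randn_like(x)

with torch.cuda.amp.autocast():                  # matmuls run in fp16 where safe
    loss = nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()                    # scale loss to avoid fp16 gradient underflow
scaler.step(opt)
scaler.update()
```

Keeping `d_model` and sequence length multiples of 8 helps the fp16 matmuls actually map onto tensor cores.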