Single-file implementation of Attention Is All You Need
- It's a flexible, interpretable architecture, generally more performant than RNNs and applicable to POMDPs.
- A simple implementation to play with my Titan V and see how much I can get out of mixed-precision training and the matmul-friendly tensor cores, given this matmul-heavy architecture (see the sketch below).
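A minimal sketch of how the attention matmuls can be wrapped in mixed-precision training, assuming PyTorch with `torch.cuda.amp`; the `TinyAttention` module, dummy data, and loss here are illustrative placeholders, not the actual code in this repo.

```python
import math
import torch
import torch.nn as nn

class TinyAttention(nn.Module):
    """Single-head scaled dot-product attention: two large matmuls, a good fit for tensor cores."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.d_model = d_model

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)  # (batch, seq, seq)
        return self.out(torch.softmax(scores, dim=-1) @ v)

# Hypothetical training step with automatic mixed precision:
# fp16 matmuls on tensor cores, fp32 master weights and loss scaling.
device = "cuda"
model = TinyAttention(d_model=512).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 128, 512, device=device)      # dummy batch: (batch, seq, d_model)
target = torch.randn_like(x)

with torch.cuda.amp.autocast():                  # matmuls run in fp16 where safe
    loss = nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()                    # scale loss to avoid fp16 gradient underflow
scaler.step(opt)
scaler.update()
```

Keeping `d_model` and sequence length multiples of 8 helps the fp16 matmuls actually map onto tensor cores.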