A simple Vision Transformer (ViT) implementation of 'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale' (Dosovitskiy et al., 2021) in JAX and Equinox. It uses a simple attention mechanism for pedagogical reasons.
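
To illustrate the two core ideas, here is a minimal sketch of patch extraction and a single-head scaled dot-product attention step. It is not this repository's code: it uses plain JAX rather than Equinox modules, and the names `patchify` and `simple_attention`, the 32x32 input, the 64-dimensional projections, and the random weights are all assumptions made for illustration. The class token, position embeddings, multi-head attention, and MLP blocks described in the paper are left out.

```python
import jax
import jax.numpy as jnp


def patchify(image, patch_size):
    """Split a (C, H, W) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, C * patch_size * patch_size).
    Assumes H and W are divisible by patch_size.
    """
    c, h, w = image.shape
    gh, gw = h // patch_size, w // patch_size
    x = image.reshape(c, gh, patch_size, gw, patch_size)
    x = x.transpose(1, 3, 0, 2, 4)  # (gh, gw, C, P, P)
    return x.reshape(gh * gw, c * patch_size * patch_size)


def simple_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention on (tokens, dim) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / jnp.sqrt(q.shape[-1])   # (tokens, tokens)
    weights = jax.nn.softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


# Hypothetical toy input: one 3-channel 32x32 image, 16x16 patches.
key = jax.random.PRNGKey(0)
k_img, k_q, k_k, k_v = jax.random.split(key, 4)
image = jax.random.normal(k_img, (3, 32, 32))

tokens = patchify(image, patch_size=16)  # shape (4, 768)
dim, head_dim = tokens.shape[-1], 64     # head_dim is an arbitrary choice

# Random projection matrices stand in for learned parameters.
w_q = 0.02 * jax.random.normal(k_q, (dim, head_dim))
w_k = 0.02 * jax.random.normal(k_k, (dim, head_dim))
w_v = 0.02 * jax.random.normal(k_v, (dim, head_dim))

out, weights = simple_attention(tokens, w_q, w_k, w_v)  # out: (4, 64)
```

Each row of `weights` describes how strongly one patch attends to every other patch. In a full model the projection matrices would be learned parameters, for example fields of an Equinox module.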