The CPU backend is very limited: it only implements BMM-style NA, not fused NA. It consists of very simple C++ implementations of the BMM-style ops, which enable inference on non-CUDA devices and also serve as a reference for the CUDA kernels in unit tests (read more). These implementations are NOT performance-optimized.
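To make the distinction concrete, below is a minimal, self-contained C++ sketch of what a BMM-style (non-fused) neighborhood attention forward pass looks like in the 1-D, single-head case: the attention weights for each query's neighborhood are materialized, softmaxed, and then applied to the neighboring values as separate steps. The function name, memory layout, and the clamped-window boundary handling are illustrative assumptions, not NATTEN's actual code.

```cpp
// Illustrative sketch only: single-head, 1-D, BMM-style neighborhood attention.
// Assumes row-major [L x D] buffers and L >= K. Not NATTEN's real kernels.
#include <algorithm>
#include <cmath>
#include <vector>

void neighborhood_attention_1d(const std::vector<float>& q,  // [L x D]
                               const std::vector<float>& k,  // [L x D]
                               const std::vector<float>& v,  // [L x D]
                               std::vector<float>& out,      // [L x D]
                               int L, int D, int K) {
  const float scale = 1.0f / std::sqrt(static_cast<float>(D));
  std::vector<float> attn(K);
  for (int i = 0; i < L; ++i) {
    // Every query attends to exactly K keys; the window is shifted
    // (not zero-padded) near the sequence edges.
    const int start = std::clamp(i - K / 2, 0, L - K);
    // Step 1 (PN-style op): Q_i dot K_j for each neighbor j.
    for (int j = 0; j < K; ++j) {
      float dot = 0.0f;
      for (int d = 0; d < D; ++d)
        dot += q[i * D + d] * k[(start + j) * D + d];
      attn[j] = dot * scale;
    }
    // Step 2: softmax over the K neighborhood weights.
    const float mx = *std::max_element(attn.begin(), attn.end());
    float denom = 0.0f;
    for (int j = 0; j < K; ++j) {
      attn[j] = std::exp(attn[j] - mx);
      denom += attn[j];
    }
    // Step 3: weighted sum of the neighboring values.
    for (int d = 0; d < D; ++d) {
      float acc = 0.0f;
      for (int j = 0; j < K; ++j)
        acc += (attn[j] / denom) * v[(start + j) * D + d];
      out[i * D + d] = acc;
    }
  }
}
```

A fused implementation would perform all three steps per query without writing the K attention weights out to memory; that is exactly what this backend does not provide.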
- libtorch: Torch API used for AVX.
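As a rough illustration of what using the Torch API for AVX can look like, the sketch below uses ATen's `at::vec::Vectorized` wrapper, which maps to AVX-width lanes when available. The include path and exact usage are assumptions (they also vary across PyTorch versions) and are not taken from NATTEN's sources.

```cpp
// Hedged example: elementwise out = a * b + c using ATen's CPU vectorization
// wrapper (at::vec::Vectorized). Not taken from NATTEN's kernels.
#include <ATen/cpu/vec/vec.h>
#include <cstdint>

void mul_add(const float* a, const float* b, const float* c,
             float* out, int64_t n) {
  using Vec = at::vec::Vectorized<float>;
  int64_t i = 0;
  // Vectorized main loop: Vec::size() floats per iteration (8 with AVX2).
  for (; i + Vec::size() <= n; i += Vec::size()) {
    const Vec va = Vec::loadu(a + i);
    const Vec vb = Vec::loadu(b + i);
    const Vec vc = Vec::loadu(c + i);
    (va * vb + vc).store(out + i);
  }
  // Scalar tail for the remaining elements.
  for (; i < n; ++i)
    out[i] = a[i] * b[i] + c[i];
}
```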
Originally developed back in 2022, these kernels have since been slightly tuned (launch parameters, among other factors), but they remain very naive implementations.
Tiled variants: also developed back in 2022, these implement only the PN-2D operation, and only when the number of dimensions per attention head is 32.
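In other words, the tiled path only applies under a narrow condition, roughly like the hypothetical check sketched below (the function name and structure are illustrative, not the actual dispatcher):

```cpp
// Hypothetical sketch of the constraint above (not NATTEN's actual
// dispatcher): the tiled path is only an option for the 2-D PN op when the
// per-head dim is exactly 32; other ops, ranks, and head dims fall back to
// the plain naive kernels.
bool can_use_tiled_pn(bool is_pn_op, int spatial_rank, int dim_per_head) {
  return is_pn_op && spatial_rank == 2 && dim_per_head == 32;
}
```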
- libtorch: half atomic add in the RPB backward kernel.
TBD.
- CUTLASS
TBD.
- CUTLASS