write-your-own-operator-library

Write high-performance operators for LLMs with CUDA/OpenCL/Triton.

CUDA

The CUDA kernels are implemented through PyCUDA, and Colab is recommended for trying them out: