TinyKernels.jl

TinyKernels.jl provides a tiny abstraction for GPU (and CPU) kernels, with full support for the CUDA (Nvidia) and ROCm (AMD) backends, limited support for the Metal backend (GPU programming on macOS ARM), and multi-threaded CPU execution.

TinyKernels.jl is mostly a heavily stripped-down version of KernelAbstractions.jl that supports only a bare minimum of features. The package serves as a sandbox for Julia GPU tooling and as a way to measure kernel performance in a GPU-agnostic manner. While the API of KernelAbstractions.jl is in a "transient" state, this package will provide a thin abstraction layer on top of the CUDA.jl, AMDGPU.jl and Metal.jl packages.

TinyKernels.jl lets you explicitly launch GPU kernels asynchronously on different streams or queues with a given priority. This feature facilitates overlapping computations with memory transfers in distributed configurations.

TinyKernels.jl supports automatic differentiation with Enzyme.jl by overloading the Enzyme.autodiff function to enable reverse-mode AD of GPU (and CPU) kernels.

Preliminary benchmarks can be found in TinyBenchmarks.jl, and a Metal playground in MetalGPU.

Stay tuned 🚀

Compat

AMDGPU ≥ v0.4.8
CUDA ≥ v3.13
Metal ≥ v0.3.0

Notes

⚠️ Metal backend:

Only Float32 is supported. For Float64, one could try using a construct from DoubleFloats.jl, which may impact performance.
Automatic differentiation (AD) capabilities (Enzyme.jl) currently do not work on ARM GPUs (Metal) and give erroneous results on ARM CPUs.
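
Because only Float32 is supported on the Metal backend, it can help to check that host-side data and literals do not silently promote to Float64 before anything reaches the GPU. The following is a minimal plain-Julia sketch of that check; it needs no GPU and uses no TinyKernels.jl or Metal.jl API. The names `scale_bad`, `scale_good`, `scale_generic`, `A`, `B`, `C`, `D` and `E` are hypothetical and serve only as illustration.

```julia
# Plain-Julia sketch (no GPU required): keep data and literals in Float32.
# All function and variable names below are hypothetical.

A = rand(Float32, 4, 4)            # request Float32 explicitly; rand(4, 4) would give Float64

# A Float64 literal promotes the whole broadcast result to Float64.
scale_bad(x) = 2.0 .* x
B = scale_bad(A)

# A Float32 literal (note the `f0` suffix) preserves the element type.
scale_good(x) = 2.0f0 .* x
C = scale_good(A)

# A generic variant converts the constant to the array's element type.
scale_generic(x) = convert(eltype(x), 2) .* x
D = scale_generic(A)

# Existing Float64 data can be converted explicitly.
E = Float32.(rand(4, 4))

println(eltype(B))   # Float64
println(eltype(C))   # Float32
println(eltype(D))   # Float32
println(eltype(E))   # Float32
```

Writing constants as in `scale_generic` keeps the same code usable with Float64 on backends that support it.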
