A set of attention operator implementations used to evaluate the performance of their kernels and their integration with PyTorch. We leverage the new native Triton profiler, Proton, which provides a lightweight mechanism for gathering key kernel benchmarking metrics such as the following (a minimal usage sketch appears after this list):
- TFLOP/s
- Execution time
- Memory bandwidth
- Line-level Triton kernel information
- Kernel metadata
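A minimal Proton usage sketch, assuming a Triton build that ships triton.profiler; the session name, scope name, metric key, and tensor shapes are our own illustration rather than values taken from the benchmarks:

import torch
import triton.profiler as proton

B, H, S, D = 8, 16, 1024, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

proton.start("attn_bench")  # results land in a .hatchet file readable by proton-viewer
# Attach user-supplied flops to this scope so TFLOP/s can be derived per region.
with proton.scope("sdpa_fwd", {"flops": 4 * B * H * S * S * D}):
    torch.nn.functional.scaled_dot_product_attention(q, k, v)
proton.finalize()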
We benchmark Torch scaled_dot_product_attention with different backends (e.g. the cuDNN backend, SDPBackend.CUDNN_ATTENTION) against FlashAttention v3 on Hopper; see test/test_sdpa_cudnn.py. We verified that the performance is almost identical and that both generate the same CUDA kernel: cudnn_generated_fort_native_sdpa_sm90_knob_7_64x128x128_4x1x1_kernel_0_0
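For reference, backend selection looks roughly like the sketch below; the shapes are illustrative and test/test_sdpa_cudnn.py remains the source of truth:

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(8, 16, 1024, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict SDPA to the cuDNN backend so the cuDNN-generated sm90 fused kernel is hit.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)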
Future plans for Proton: annotate kernels generated by torch.compile so that flops and bytes can be obtained even for these automatically generated kernels (see codegen and TritonTemplate). In general, the user does not know this information because, in this case, the TorchInductor compiler is the one that "wrote" these kernels (see Jokeren's comment: https://github.com/pytorch/pytorch/pull/136169#issuecomment-2375287772).
We also include several small experimental utility scripts to automatically profile TorchInductor GPU kernels. Our main reference is the PyTorch team's example notebooks. Considerations:
- Sudo privileges are required for NCU execution: our current workaround is a manually hard-coded path.
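A sketch of that workaround; both paths below are hypothetical placeholders, not paths used by the scripts:

import subprocess

NCU_BIN = "/usr/local/cuda/bin/ncu"     # assumption: hard-coded Nsight Compute location
TARGET = "scripts/profile_inductor.py"  # assumption: script that runs the compiled model

subprocess.run(
    [
        "sudo", NCU_BIN,
        "--set", "full",              # collect the full metric set
        "--target-processes", "all",  # follow child processes spawned by the launcher
        "-o", "inductor_kernels",     # writes inductor_kernels.ncu-rep
        "python", TARGET,
    ],
    check=True,
)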
Example output for FLUX DiT model:
Pre-requisites:
- Linux x86_64
- CUDA 12.0+ for Hopper and CUDA 12.1+ for Ada
- NVIDIA Driver supporting CUDA 12.0 or later
- cuDNN 8.1 or later
- For fused attention, CUDA 12.1 or later, NVIDIA Driver supporting CUDA 12.1 or later, and cuDNN 8.9 or later.
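A quick, illustrative way to sanity-check an environment against the list above:

import torch

print("CUDA runtime used by torch:", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: sm_{major}{minor}")  # sm_90 = Hopper, sm_89 = Ada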
We recommend using a Docker dev container for formal reproducibility of the benchmarking results. We provide a Dockerfile, which can be built as follows:
$ docker build -t test-attn-h100-dev:latest -f docker/tritonbench-nightly.dockerfile .
Then, to run the container as an interactive session:
$ docker run -it -d --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --runtime=nvidia --gpus all --name <container_name> <docker_image>
$ docker exec -it <container_name> /bin/bash
The Torch Custom Operators Manual states that, for external CUDA kernels to integrate with Torch subsystems such as the Torch compiler, we need to provide the corresponding wrapping logic on top, to avoid graph breaks and increase the optimization space (a minimal sketch of this wrapping pattern follows this paragraph). We have also found a problem, introduced by the latest FlashAttention updates, that breaks the existing xformers flash3 operator integration. As a temporary workaround until FA3 is fixed upstream and CUTLASS is upgraded, we patch the xformers setup.py with libraries=["cuda"].
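The following is a sketch of that custom-op wrapping pattern, not the exact wrapper used in this repo; the op name, the hopper.flash_attn_interface import, and the return-value handling are assumptions:

import torch

# Register the external FA3 forward as a custom op so torch.compile can trace it
# without graph breaks (hypothetical op name under our own namespace).
@torch.library.custom_op("attn_bench::fa3_fwd", mutates_args=(), device_types="cuda")
def fa3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    from hopper.flash_attn_interface import flash_attn_func  # external CUDA kernel
    res = flash_attn_func(q, k, v)
    return res[0] if isinstance(res, tuple) else res  # return convention varies across FA3 versions

# Fake (meta) implementation: lets the compiler infer shapes/dtypes without launching the kernel.
@fa3_fwd.register_fake
def _(q, k, v):
    return torch.empty_like(q)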
Currently, it imports FlashAttention 3 as follows:
from flashattn_hopper.flash_attn_interface import flash_attn_func as flash_attn_func_v3
This raises ModuleNotFoundError: No module named 'flashattn_hopper'. We do not understand why the official import convention, which we use in this repo, is not employed instead:
from hopper.flash_attn_interface import flash_attn_func as flash_attn_func_hopper
Due to this issue, we have decided to leave TransformerEngine out of the default installation and keep it optional. If using this implementation is crucial, we recommend the following from-source installation:
# Clone repository, checkout stable branch, clone submodules
git clone --branch stable --recursive https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine
export NVTE_FRAMEWORK=pytorch  # Optionally set framework
pip install .  # Build and install
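Once installed, fused attention can be exercised roughly as in the sketch below, assuming TransformerEngine's default "sbhd" (sequence, batch, heads, head_dim) layout; argument names and defaults may vary across TE versions:

import torch
import transformer_engine.pytorch as te

seq, batch, heads, head_dim = 1024, 8, 16, 64
attn = te.DotProductAttention(num_attention_heads=heads, kv_channels=head_dim)

q = torch.randn(seq, batch, heads, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = attn(q, k, v)  # dispatches to the fused attention backend when available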