A set of attention operator implementations used to evaluate the performance of their kernels and their integration with PyTorch. We leverage the new native Triton profiler, Proton, which provides a lightweight mechanism for gathering key kernel benchmarking metrics such as the following (a minimal usage sketch appears after this list):
- TFLOP/s
- Execution time
- Memory bandwidth
- Line-level Triton kernel information
- Kernel metadata
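A minimal Proton usage sketch, assuming a Triton build that ships triton.profiler; the session name, scope name, metric key, and tensor shapes are our own illustration rather than values taken from the benchmarks:

import torch
import triton.profiler as proton

B, H, S, D = 8, 16, 1024, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

proton.start("attn_bench")  # results land in a .hatchet file readable by proton-viewer
# Attach user-supplied flops to this scope so TFLOP/s can be derived per region.
with proton.scope("sdpa_fwd", {"flops": 4 * B * H * S * S * D}):
    torch.nn.functional.scaled_dot_product_attention(q, k, v)
proton.finalize()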
We benchmark Torch scaled_dot_product_attention with different backends (e.g. the cuDNN backend, SDPBackend.CUDNN_ATTENTION) against FlashAttention v3 on Hopper; see test/test_sdpa_cudnn.py. We verified that the performance is almost identical and that both generate the same CUDA kernel: cudnn_generated_fort_native_sdpa_sm90_knob_7_64x128x128_4x1x1_kernel_0_0
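For reference, backend selection looks roughly like the sketch below; the shapes are illustrative and test/test_sdpa_cudnn.py remains the source of truth:

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(8, 16, 1024, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict SDPA to the cuDNN backend so the cuDNN-generated sm90 fused kernel is hit.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)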
Future plans for Proton: annotate kernels generated by torch.compile so that flops and bytes can be obtained even for these automatically generated kernels (see codegen and TritonTemplate). In general, the user does not know this information because, in this case, the TorchInductor compiler is the one that "wrote" these kernels (see Jokeren's comment: https://github.com/pytorch/pytorch/pull/136169#issuecomment-2375287772).
We also include several small experimental utility scripts to automatically profile TorchInductor GPU kernels. Our main reference is the PyTorch team's example notebooks. Considerations:
- Sudo privileges are required for NCU execution: our current workaround is a manually hard-coded path.
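A sketch of that workaround; both paths below are hypothetical placeholders, not paths used by the scripts:

import subprocess

NCU_BIN = "/usr/local/cuda/bin/ncu"     # assumption: hard-coded Nsight Compute location
TARGET = "scripts/profile_inductor.py"  # assumption: script that runs the compiled model

subprocess.run(
    [
        "sudo", NCU_BIN,
        "--set", "full",              # collect the full metric set
        "--target-processes", "all",  # follow child processes spawned by the launcher
        "-o", "inductor_kernels",     # writes inductor_kernels.ncu-rep
        "python", TARGET,
    ],
    check=True,
)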
Example output for FLUX DiT model:
Pre-requisites:
- Linux x86_64
- CUDA 12.0+ for Hopper and CUDA 12.1+ for Ada
- NVIDIA Driver supporting CUDA 12.0 or later
- cuDNN 8.1 or later
- For fused attention, CUDA 12.1 or later, NVIDIA Driver supporting CUDA 12.1 or later, and cuDNN 8.9 or later.
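A quick, illustrative way to sanity-check an environment against the list above:

import torch

print("CUDA runtime used by torch:", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: sm_{major}{minor}")  # sm_90 = Hopper, sm_89 = Ada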
We recommend using a Docker dev container for formal reproducibility of the benchmarking results. We provide a Dockerfile, which can be built as follows:
$ docker build -t test-attn-h100-dev:latest -f docker/tritonbench-nightly.dockerfile .
Then, to run the container as an interactive session:
$ docker run -it -d --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --runtime=nvidia --gpus all --name <container_name> <docker_image>
$ docker exec -it <container_name> /bin/bash
The Torch Custom Operators Manual states that, for external CUDA kernels to integrate with Torch subsystems such as the Torch compiler, we need to provide the corresponding wrapping logic on top, to avoid graph breaks and increase the optimization space (a minimal sketch of this wrapping pattern follows this paragraph). We have also found a problem, introduced by the latest FlashAttention updates, that breaks the existing xformers flash3 operator integration. As a temporary workaround until FA3 is fixed upstream and CUTLASS is upgraded, we patch the xformers setup.py with libraries=["cuda"].
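The following is a sketch of that custom-op wrapping pattern, not the exact wrapper used in this repo; the op name, the hopper.flash_attn_interface import, and the return-value handling are assumptions:

import torch

# Register the external FA3 forward as a custom op so torch.compile can trace it
# without graph breaks (hypothetical op name under our own namespace).
@torch.library.custom_op("attn_bench::fa3_fwd", mutates_args=(), device_types="cuda")
def fa3_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    from hopper.flash_attn_interface import flash_attn_func  # external CUDA kernel
    res = flash_attn_func(q, k, v)
    return res[0] if isinstance(res, tuple) else res  # return convention varies across FA3 versions

# Fake (meta) implementation: lets the compiler infer shapes/dtypes without launching the kernel.
@fa3_fwd.register_fake
def _(q, k, v):
    return torch.empty_like(q)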
Currently, it imports FlashAttention 3 as follows:
from flashattn_hopper.flash_attn_interface import flash_attn_func as flash_attn_func_v3
This raises ModuleNotFoundError: No module named 'flashattn_hopper'. We do not understand why the official import convention, which we use in this repo, is not employed instead:
from hopper.flash_attn_interface import flash_attn_func as flash_attn_func_hopper
Due to this issue, we have decided to leave TransformerEngine out of the default installation and keep it optional. If using this implementation is crucial, we recommend the following from-source installation:
# Clone repository, checkout stable branch, clone submodules
git clone --branch stable --recursive https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine
export NVTE_FRAMEWORK=pytorch  # Optionally set framework
pip install .  # Build and install
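Once installed, fused attention can be exercised roughly as in the sketch below, assuming TransformerEngine's default "sbhd" (sequence, batch, heads, head_dim) layout; argument names and defaults may vary across TE versions:

import torch
import transformer_engine.pytorch as te

seq, batch, heads, head_dim = 1024, 8, 16, 64
attn = te.DotProductAttention(num_attention_heads=heads, kv_channels=head_dim)

q = torch.randn(seq, batch, heads, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = attn(q, k, v)  # dispatches to the fused attention backend when available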