v1.1

Release Notes – Release 1.1.0

Key Features and Enhancements

  • [pyTorch] Memory usage is reduced when using the fp8_model_init API during inference (see the sketch after this list).
  • [pyTorch] Memory usage is reduced when using the LayerNormLinear, LayerNormMLP, and TransformerLayer APIs.
  • [JAX] Transformer Engine has migrated its custom ops to JAX’s new custom partitioning mechanism for parallelism.
  • [JAX] The attention operation’s performance is improved when using cuDNN version 8.9.6 or greater.
  • [C/C++] Transformer Engine can now be built as a subproject.
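
The following is a minimal sketch (not part of the release notes) of the fp8_model_init inference path referenced above. It uses LayerNormLinear, one of the modules whose memory usage this release reduces; the shapes, dtype, and recipe settings are illustrative assumptions, and running it requires a GPU with FP8 support.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling

# Build the module with FP8-only weights (no higher-precision master copy);
# this is the inference-oriented path that fp8_model_init enables.
with te.fp8_model_init(enabled=True):
    layer = te.LayerNormLinear(1024, 1024, bias=True,
                               params_dtype=torch.bfloat16)

recipe = DelayedScaling()  # default delayed-scaling FP8 recipe
inp = torch.randn(32, 8, 1024, device="cuda", dtype=torch.bfloat16)

# Run the forward pass in FP8; no gradients are needed for inference.
with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = layer(inp)
```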

Fixed Issues

  • Fixed an issue where, in some cases, passing non-contiguous tensors as Q, K, or V to DotProductAttention would result in the error “Exception: The provided qkv memory layout is not supported!”

Known Issues in This Release

  • FlashAttention v2, which is a dependency of this release of Transformer Engine, has a known issue with excessive memory usage during installation (Dao-AILab/flash-attention#358). You can work around this issue either by setting the MAX_JOBS=1 environment variable during Transformer Engine installation or by installing FlashAttention v1 (e.g. pip install flash-attn==1.0.9) before attempting to install Transformer Engine.
  • [pyTorch] FlashAttention v2.1 changed the behavior of the causal mask when performing cross-attention (see https://github.com/Dao-AILab/flash-attention#21-change-behavior-of-causal-flag for reference). To keep behavior consistent across versions and backends, Transformer Engine disables FlashAttention for this use case (i.e. cross-attention with causal masking) when FlashAttention version 2.1+ is installed; a sketch of this case follows the list.
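
The sketch below (an illustration, not from the release notes) reproduces the affected case: cross-attention with a causal mask through DotProductAttention, where the query and key/value sequence lengths differ. The NVTE_FLASH_ATTN=0 environment variable shown is Transformer Engine’s switch for disabling the FlashAttention backend explicitly; module arguments, shapes, and dtype are illustrative assumptions.

```python
import os
# Optional: explicitly disable the FlashAttention backend; Transformer Engine
# already avoids it for causal cross-attention when FlashAttention 2.1+ is installed.
os.environ["NVTE_FLASH_ATTN"] = "0"

import torch
import transformer_engine.pytorch as te

# 16 heads with a head size of 64; the causal mask type matches the case
# described in the known issue above.
attn = te.DotProductAttention(num_attention_heads=16, kv_channels=64,
                              attn_mask_type="causal")

# Cross-attention: the query and key/value sequence lengths differ.
# Default qkv_format is "sbhd": (sequence, batch, heads, head_dim).
q = torch.randn(128, 2, 16, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(256, 2, 16, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(256, 2, 16, 64, device="cuda", dtype=torch.bfloat16)

with torch.no_grad():
    out = attn(q, k, v)  # runs on a non-flash attention backend
```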

Breaking Changes in This Release

There are no breaking changes in this release.

Deprecated Features

There are no deprecated features in this release.