feat: 44 make flash attention configurable #60

Open · wants to merge 92 commits into main

Conversation

@theissenhelen (Collaborator) commented Jan 6, 2025

Current setup:

  • If flash-attn is available in the environment, the MultiHeadSelfAttention module automatically imports the corresponding attention function. At inference time, however, that information is not available.
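
For reference, the availability-based import described above typically takes the form of a guard like the following (a sketch; the module's actual flag name may differ):

```python
try:
    from flash_attn import flash_attn_func  # noqa: F401

    _FLASH_ATTENTION_AVAILABLE = True
except ImportError:
    _FLASH_ATTENTION_AVAILABLE = False
```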

Now:

  • Flex Attention is available as an additional backend.
  • The user specifies in the model config whether flash-attn, Flex Attention, or scaled dot-product attention should be used.
  • Adds configurable parameters (softcap, ALiBi) for flash attention;
    for ALiBi, this adds a function to compute the per-head slopes from the number of attention heads (see the sketch after this list).
  • Scaled dot-product attention now supports a sliding window, making it numerically equivalent to the flash/flex backends.
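
The backend is selected via a string in the model config (for example, a key along the lines of attention_implementation: flash_attention; the exact schema is not shown here). Below is a minimal PyTorch sketch of the two numerical ingredients mentioned above, the ALiBi slope computation and a sliding-window mask for scaled dot-product attention. The function names and the symmetric (non-causal) window are illustrative assumptions, not the PR's actual API:

```python
import torch
import torch.nn.functional as F


def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric ALiBi slopes m_i = 2**(-8 * i / num_heads) for i = 1..num_heads,
    # as in the ALiBi paper for power-of-two head counts; the PR's helper may
    # additionally handle non-power-of-two head counts.
    start = 2.0 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])


def sliding_window_mask(seq_len: int, window: int, device=None) -> torch.Tensor:
    # Boolean mask (True = may attend): each query attends only to keys
    # within `window` positions on either side.
    idx = torch.arange(seq_len, device=device)
    return (idx[None, :] - idx[:, None]).abs() <= window


def sdpa_with_window(q, k, v, window: int) -> torch.Tensor:
    # q, k, v: (batch, heads, seq_len, head_dim). A boolean attn_mask restricts
    # scaled_dot_product_attention to the sliding window, which is what makes
    # the SDPA path numerically comparable to the flash/flex window options.
    mask = sliding_window_mask(q.shape[-2], window, device=q.device)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```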

Todo:

  • test various attention options
  • adjust test case coverage

theissenhelen and others added 30 commits September 27, 2024 08:42
* fix: change pre-commit autoupdate schedule to monthly

* fix: change the merge strategy for Changelog to Union

* fix: add .envrc to .gitignore

* ci: ignore pre-commit-config and readthedocs for changelog updates

* ci: fix to correct hpc workflow call

* fix: update precommit config

* chore: update pre-commits

* feat: add codeowners file

* chore: update dependencies

* ci: add hpc-config

* docs: changelog

* fix: respond to review comments

---------

Co-authored-by: Jesper Dramsch <[email protected]>
* feat: add configurability to dropout in MultiHeadSelfAttention

Co-authored-by: Rilwan (Akanni) Adewoyin <[email protected]>

* test: adjust to dropout_p

* doc: update changelog

* Feature/integrate reusable workflows (#16)

* ci: add public pr label

* ci: add readthedocs update check

* ci: add downstream ci

* ci: add ci-config

* chore(deps): remove unused dependency

* docs: update changelog

* ci: switch to main

* chore: changelog 0.2.1

* Update error messages from invalid sub_graph in model instantiation (#20)

* ci: inherit pypi publish flow (#17)

* ci: inherit pypi publish flow

Co-authored-by: Helen Theissen <[email protected]>

* docs: add to changelog

* fix: typo in reusable workflow

* fix: another typo

* chore: bump actions/setup-python to v5

* ci: run downstream-ci for changes in src and tests

* docs: update changelog

---------

Co-authored-by: Helen Theissen <[email protected]>

* Update CHANGELOG.md to KeepChangelog format

* [pre-commit.ci] pre-commit autoupdate (#25)

updates:
- [github.com/psf/black-pre-commit-mirror: 24.4.2 → 24.8.0](psf/black-pre-commit-mirror@24.4.2...24.8.0)
- [github.com/astral-sh/ruff-pre-commit: v0.4.6 → v0.6.2](astral-sh/ruff-pre-commit@v0.4.6...v0.6.2)
- [github.com/tox-dev/pyproject-fmt: 2.1.3 → 2.2.1](tox-dev/pyproject-fmt@2.1.3...2.2.1)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Ci/changelog-release-updater (#26)

* ci: add changelog release updater

* docs: update changelog

* Feature/integrate reusable workflows (#16)

* ci: add public pr label

* ci: add readthedocs update check

* ci: add downstream ci

* ci: add ci-config

* chore(deps): remove unused dependency

* docs: update changelog

* ci: switch to main

* chore: changelog 0.2.1

* Update error messages from invalid sub_graph in model instantiation (#20)

* ci: inherit pypi publish flow (#17)

* ci: inherit pypi publish flow

Co-authored-by: Helen Theissen <[email protected]>

* docs: add to changelog

* fix: typo in reusable workflow

* fix: another typo

* chore: bump actions/setup-python to v5

* ci: run downstream-ci for changes in src and tests

* docs: update changelog

---------

Co-authored-by: Helen Theissen <[email protected]>

* Update CHANGELOG.md to KeepChangelog format

* Ci/changelog-release-updater (#26)

* ci: add changelog release updater

* docs: update changelog

---------

Co-authored-by: Rilwan (Akanni) Adewoyin <[email protected]>
Co-authored-by: Gert Mertes <[email protected]>
Co-authored-by: Mario Santa Cruz <[email protected]>
Co-authored-by: Jesper Dramsch <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
xfail for MultiHeadSelfAttention
@theissenhelen theissenhelen marked this pull request as ready for review January 17, 2025 10:59
@theissenhelen theissenhelen changed the title Feature/44 make flash attention configurable feat: 44 make flash attention configurable Jan 17, 2025
@mishooax (Member) previously approved these changes Jan 20, 2025:
This looks OK to me (I haven't run the code, but Cathal did) - nice work @theissenhelen.

@HCookie (Member) left a comment:

Brilliant work exists here,
Just a few thoughts and questions,
Looks great overall.

Three review threads on models/src/anemoi/models/layers/attention.py (outdated, resolved).
Comment on lines +77 to +80
softcap : float, optional
Anything > 0 activates softcapping attention, by default None
use_alibi_slopes : bool, optional
Adds bias
Reviewer (Member):
Given that these are only used for flash_attention, if kwargs are needed for the other attention types we may end up with a large number of unused kwargs. Could it make sense to add attention_kwargs: dict[str, Any] and use that instead?
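
A minimal sketch of the suggested pattern (the class skeleton and names are illustrative, not the PR's actual code):

```python
from typing import Any

import torch


class MultiHeadSelfAttention(torch.nn.Module):
    def __init__(
        self,
        num_heads: int,
        embed_dim: int,
        attention_implementation: str = "scaled_dot_product_attention",
        attention_kwargs: dict[str, Any] | None = None,
    ) -> None:
        super().__init__()
        self.num_heads = num_heads
        self.embed_dim = embed_dim
        self.attention_implementation = attention_implementation
        # Backend-specific options (e.g. softcap or use_alibi_slopes for
        # flash-attn) travel in one dict, so backends that do not use them
        # never see unused keyword arguments.
        self.attention_kwargs = attention_kwargs or {}
```

This keeps the constructor signature stable as new backends (e.g. flex attention) gain their own options.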

@theissenhelen (Collaborator, Author) replied:
They will at some point also be used for flex_attention.

Projects: Under Review
7 participants