Community contribution: Adding Flash Attention 2 support for more architectures #26350
Comments
Hi @younesbelkada - I want to work on adding Flash Attention 2 support for GPTBigCode (Starcoder). Can I take this task? Can you please assign this task to me?
Will definitely take a look next week
I would like to work on
I would like to work on
Is it possible to add FlashAttention2 to GPT2 models?
@sahilbhosale63 @flozi00 @rajveer43 @susnato thanks very much for your interest! Indeed it would be great if you could help us!
@younesbelkada Yes I have
OK perfect, I will assign you to MPT! Feel free to let me know if you need any help or if you have any questions. As a starting point, I would recommend having a look at #25598 and seeing if you can replicate the PR for MPT. For running the flash attention tests you can just run (once the PR is ready): RUN_SLOW=1 pytest -m flash_attn_test tests/models/mpt/
@younesbelkada yes I have.
Thanks @susnato, perfect then, let me know whenever you start the PR and if you have any questions! Check out my instructions above for more details
@younesbelkada Unfortunately, my GPU is not supported
Sure I will work on it!
@younesbelkada Would like to work on Persimmon. I have access to A4000, A5000, and A6000, which I believe should be compatible with FA2.
Perfect, sounds great, thanks for your help, I will assign you to Persimmon!
Since @sahilbhosale63 is not working on
Yes no problem, thanks very much for proposing your help on this! As a starting point you can have a look at @pacman100's implementation here: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/personal_copilot/training/starcoder_flash_attn_monkey_patch.py
@younesbelkada I would like to implement it for BERT if it hasn't already been done? A lot of the models topping MTEB are still relying on this architecture! I have tested that I can run Flash Attention 2 on my NVIDIA GeForce RTX 3060 Ti.
Awesome, thanks a lot for your help, OK I will assign you to BERT then!
Hi everyone, I would like to help implement this with GPT2 if you want.
I have a working version for Also, any plans on incorporating additional optimizations, e.g., Also, would like to help with
Hi @DougTrajano @jeromeku awesome thanks! Can you move forward for Persimmon by opening a PR so that I can have a look?
If that is something that can nicely fit into the API without any breaking behaviour, that would be great!
I think Mistral's attention has been released in the latest version of FA-2 --> Would you be happy to open a PoC PR so that I can play with it and see what we can do? Again thanks a lot!
Hi @jeromeku
@younesbelkada what is the expected deadline to complete
Hi @younesbelkada, I am taking this up for
Awesome @susnato! Thanks!
Hey thank you for letting me know about the existing PR @EduardoPach. @younesbelkada & @ArthurZucker, are there any available architectures left to implement? I see there are "many more" that are not included on this list. Can you give me an example of one?
It seems that T5 is still open?
@jprivera44 yes it seems so, I cannot find any FA/sdpa version. Would be great if you could get that working. I am also looking at integrating sdpa into ESM. @ddofer or others would love your help if there is interest!
Hello! My classmate @DELTA-DoubleWise and I are trying to write a project proposal for a course, and we would like to extend this to the RAG model (which is not listed above). Would you mind assigning this to us?
@rubenweitzman I wish I could help, but I'm only familiar with keras, not pytorch :\
Hey @William-WANG2, we usually don't assign issues, and rather let the code talk: if a PR is open and pinned then that means someone is working on something and the entire community can check the progress 😉 we try to prevent work duplication with this! |
Does adding FA2 to CLIP make any sense?
@sayakpaul Yes! It would be great to add - had a draft in #27444 but was hitting some large differences on just some of the elements which I wasn't able to track. I don't have bandwidth atm so very happy for anyone to take it up!
Integration question:
Adding @mfuntowicz @tengomucho to answer this
@miladm for now there is a check on models to see if flash attention is available, but
For anyone who is interested in an optimized T5 version, I just finished my project on creating a flash attention version with fused attention bias calculation. It fixes the major drawbacks of T5 and allows running it on 100k sequences on a single L4 GPU (22.5 GB). Check it here.
Can we update the current state of the list of models in the issue description? For instance,
Hi @LysandreJik @ArthurZucker @amyeroberts @younesbelkada @fxmarty @SunMarc @pacman100, I want to try to add flash attention 2 to xlmr-large. Do you have any guidelines?
@michaelshekasta I'd recommend referring to other PRs which have added this for other models, e.g. GPT2, and reading the contributing guidelines
@amyeroberts Thanks for your comment! I noticed that @DavidAfonsoValente has already implemented the majority of the code. You can find their work on this pull request: #28713. What are the differences that still need to be addressed before merging it?
@michaelshekasta Following the PR history, I don't believe there was much more to add; there was just a dependency on another piece of work and the PR eventually became inactive.
Has someone worked on FA2 for T5? I see someone has an SDPA support PR for T5 open (#30375) that is almost done. Is there still a point in adding FA2 for T5?
Any updates about FA2 for BERT?
@jiahuanluo SDPA is available for BERT, which often sees comparable speed-ups; FA2 is yet to be implemented.
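For reference, a minimal sketch of switching BERT to the SDPA backend via the attn_implementation argument (the checkpoint name below is just an example):

```python
import torch
from transformers import AutoModel

# Load BERT with the PyTorch scaled_dot_product_attention (SDPA) backend.
# Requires a recent transformers release and PyTorch >= 2.0; fp16 on GPU
# is where the speed-ups are most visible.
model = AutoModel.from_pretrained(
    "bert-base-uncased",            # example checkpoint
    attn_implementation="sdpa",
    torch_dtype=torch.float16,
).to("cuda")
```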
When will FA or FA2 be supported for T5 in
Is it possible to add flash attention to the BigBird model?
Can we get FA / FA2 added to TimeSeriesTransformer?
@miladm @tengomucho Any updates on the TPU Flash Attention 2 implementation?
ModernBERT was just released, and it uses a number of attention tricks to speed things up. Mentioning it here for those that are following: https://huggingface.co/blog/modernbert
Feature request
Flash Attention 2 is a library that provides attention operation kernels for faster and more memory-efficient inference and training: https://github.com/Dao-AILab/flash-attention
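As a rough illustration of the kernel interface the library exposes (a minimal sketch; tensor layout and dtypes follow the flash-attn README, and the sizes below are arbitrary):

```python
import torch
from flash_attn import flash_attn_func

# flash-attn expects (batch, seq_len, num_heads, head_dim) tensors in
# fp16/bf16 on a supported CUDA GPU (Ampere or newer).
batch, seq_len, num_heads, head_dim = 2, 1024, 16, 64
q = torch.randn(batch, seq_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal self-attention without dropout; the output has the same shape as q.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)
```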
Let's try to add Flash Attention 2 support for more architectures! Currently supported architectures are
It would be great to add support for more architectures, such as:
- [Flash Attention 2] Add flash attention 2 for GPT-Neo-X #26463
- [FA2] Add flash attention for opt #26414
... and many more
Adding this feature would require following the same protocol as in #25598. First, create a new module inside the corresponding modeling file, termed xxxFlashAttention, that inherits from xxxAttention and overrides the forward method to use the public methods from flash-attn. Make sure to have access to a GPU that supports Flash Attention 2.
Given the slight challenge of the issue, labelling it as a good second issue!
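A minimal sketch of that pattern, assuming a hypothetical architecture "Xxx" with the usual q/k/v/output projections (names are illustrative; real attention modules also have to handle attention masks, past key values, and dropout):

```python
import torch
from flash_attn import flash_attn_func

# Hypothetical import path: "xxx" stands for the target architecture.
from transformers.models.xxx.modeling_xxx import XxxAttention


class XxxFlashAttention2(XxxAttention):
    """Variant of XxxAttention that routes the core attention computation
    through the flash-attn kernels (sketch only, not a complete PR)."""

    def forward(self, hidden_states, attention_mask=None, **kwargs):
        bsz, q_len, _ = hidden_states.size()

        # Project and reshape to (batch, seq_len, num_heads, head_dim),
        # the layout flash_attn_func expects. The projection attribute
        # names are assumptions and differ per architecture.
        query = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)
        key = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)
        value = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)

        # flash-attn requires fp16/bf16 inputs on a supported GPU.
        attn_output = flash_attn_func(query, key, value, dropout_p=0.0, causal=True)

        attn_output = attn_output.reshape(bsz, q_len, self.num_heads * self.head_dim)
        return self.o_proj(attn_output), None
```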
If you are interested in taking up the challenge, comment below with the architecture name you want to integrate and open a PR!
Once you open a PR, feel free to ping @LysandreJik @ArthurZucker @amyeroberts @younesbelkada @fxmarty @SunMarc @pacman100 for a review
Motivation
Making LLMs more memory-efficient and faster!
Your contribution
Reviewing PRs and possibly adding the support for more models