Community contribution: Adding Flash Attention 2 support for more architectures #26350
Comments
Hi @younesbelkada - I want to work on adding Flash Attention 2 support for GPTBigCode (Starcoder). Can I take this task? Can you please assign this task to me?
Will definitely take a look next week
I would like to work on
I would like to work on
Is it possible to add FlashAttention2 to GPT2 models?
@sahilbhosale63 @flozi00 @rajveer43 @susnato thanks very much for your interest! Indeed it would be great if you could help us!
@younesbelkada Yes I have
OK perfect, I will assign you to MPT! Feel free to let me know if you need any help or if you have any questions. As a starting point, I would recommend having a look at #25598 and seeing if you can replicate the PR for MPT. For running the flash attention tests you can just run (once the PR is ready): RUN_SLOW=1 pytest -m flash_attn_test tests/models/mpt/
@younesbelkada yes I have.
Thanks @susnato, perfect then, let me know whenever you start the PR and if you have any questions! Check out my instructions above for more details
@younesbelkada Unfortunately, my GPU is not supported
Sure I will work on it!
@younesbelkada Would like to work on Persimmon. I have access to A4000, A5000, and A6000, which I believe should be compatible with FA2.
Perfect, sounds great, thanks for your help, I will assign you to Persimmon!
Since @sahilbhosale63 is not working on
Yes no problem, thanks very much for proposing your help on this! As a starting point you can have a look at @pacman100's implementation here: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/personal_copilot/training/starcoder_flash_attn_monkey_patch.py
@younesbelkada I would like to implement it for BERT if it hasn't already been done? A lot of the models topping MTEB are still relying on this architecture! I have tested that I can run Flash Attention 2 on my NVIDIA GeForce RTX 3060 Ti.
Awesome, thanks a lot for your help, OK I will assign you to BERT then!
Hi everyone, I would like to help implement this with GPT2 if you want.
I have a working version for Also, any plans on incorporating additional optimizations, e.g., Also, would like to help with
Hi @DougTrajano @jeromeku awesome thanks! Can you move forward for Persimmon by opening a PR so that I can have a look?
If that is something that can nicely fit into the API without any breaking behaviour, that would be great!
I think Mistral's attention has been released in the latest version of FA-2 --> Would you be happy to open a PoC PR so that I can play with it and see what we can do? Again thanks a lot!
Hi @jeromeku
@younesbelkada what is the expected deadline to complete
Hi @younesbelkada, I am taking this up for
Awesome @susnato! Thanks!
Hey thank you for letting me know about the existing PR @EduardoPach. @younesbelkada & @ArthurZucker, are there any available architectures left to implement? I see there are "many more" that are not included on this list. Can you give me an example of one?
It seems that T5 is still open?
@jprivera44 yes it seems so, I cannot find any FA/sdpa version. Would be great if you could get that working. I am also looking at integrating sdpa into ESM. @ddofer or others would love your help if there is interest!
Hello! My classmate @DELTA-DoubleWise and I are trying to write a project proposal for a course, and we would like to extend this to the RAG model (which is not listed above). Would you mind assigning this to us?
@rubenweitzman I wish I could help, but I'm only familiar with keras, not pytorch :\
Hey @William-WANG2, we usually don't assign issues, and rather let the code talk: if a PR is open and pinned then that means someone is working on something and the entire community can check the progress 😉 we try to prevent work duplication with this! |
Does adding FA2 to CLIP make any sense?
@sayakpaul Yes! It would be great to add - had a draft in #27444 but was hitting some large differences on just some of the elements which I wasn't able to track. I don't have bandwidth atm so very happy for anyone to take it up!
Integration question:
Adding @mfuntowicz @tengomucho to answer this
@miladm for now there is a check on models to see if flash attention is available, but
For anyone who is interested in an optimized T5 version, I just finished my project on creating a flash attention version with fused attention bias calculation. It fixes the major drawbacks of T5 and allows running it on 100k sequences on a single L4 GPU (22.5 GB). Check it here.
Can we update the current state of the list of models in the issue description? For instance,
Hi @LysandreJik @ArthurZucker @amyeroberts @younesbelkada @fxmarty @SunMarc @pacman100, I want to try to add flash attention 2 to xlmr-large. Do you have any guidelines?
@michaelshekasta I'd recommend referring to other PRs which have added this for other models, e.g. GPT2, and reading the contributing guidelines
@amyeroberts Thanks for your comment! I noticed that @DavidAfonsoValente has already implemented the majority of the code. You can find their work on this pull request: #28713. What are the differences that still need to be addressed before merging it?
@michaelshekasta Following the PR history, I don't believe there was much more to add; there was just a dependency on another piece of work and the PR eventually became inactive.
Has someone worked on FA2 for T5? I see someone has an SDPA support PR for T5 open (#30375) that is almost done. Is there still a point in adding FA2 for T5?
Any updates about FA2 for BERT?
@jiahuanluo SDPA is available for BERT, which often sees comparable speed-ups; FA2 is yet to be implemented.
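For reference, a minimal sketch of switching BERT to the SDPA backend via the attn_implementation argument (the checkpoint name below is just an example):

```python
import torch
from transformers import AutoModel

# Load BERT with the PyTorch scaled_dot_product_attention (SDPA) backend.
# Requires a recent transformers release and PyTorch >= 2.0; fp16 on GPU
# is where the speed-ups are most visible.
model = AutoModel.from_pretrained(
    "bert-base-uncased",            # example checkpoint
    attn_implementation="sdpa",
    torch_dtype=torch.float16,
).to("cuda")
```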
When will FA or FA2 be supported for T5 in
Is it possible to add flash attention to the BigBird model?
Can we get FA / FA2 added to TimeSeriesTransformer?
@miladm @tengomucho Any updates on the TPU Flash Attention 2 implementation?
ModernBERT was just released, and it uses a number of attention tricks to speed things up. Mentioning it here for those that are following: https://huggingface.co/blog/modernbert
Feature request
Flash Attention 2 is a library that provides attention operation kernels for faster and more memory-efficient inference and training: https://github.com/Dao-AILab/flash-attention
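As a rough illustration of the kernel interface the library exposes (a minimal sketch; tensor layout and dtypes follow the flash-attn README, and the sizes below are arbitrary):

```python
import torch
from flash_attn import flash_attn_func

# flash-attn expects (batch, seq_len, num_heads, head_dim) tensors in
# fp16/bf16 on a supported CUDA GPU (Ampere or newer).
batch, seq_len, num_heads, head_dim = 2, 1024, 16, 64
q = torch.randn(batch, seq_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal self-attention without dropout; the output has the same shape as q.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)
```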
Let's try to add Flash Attention 2 support for more architectures! Currently supported architectures are
It would be great to add support for more architectures, such as:
- [Flash Attention 2] Add flash attention 2 for GPT-Neo-X #26463
- [FA2] Add flash attention for opt #26414
... and many more
Adding this feature would require following the same protocol as in #25598. First, create a new module inside the corresponding modeling file, termed xxxFlashAttention, that inherits from xxxAttention and overrides the forward method to use the public methods from flash-attn. Make sure to have access to a GPU that supports Flash Attention 2.
Given the slight challenge of the issue, labelling it as a good second issue!
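A minimal sketch of that pattern, assuming a hypothetical architecture "Xxx" with the usual q/k/v/output projections (names are illustrative; real attention modules also have to handle attention masks, past key values, and dropout):

```python
import torch
from flash_attn import flash_attn_func

# Hypothetical import path: "xxx" stands for the target architecture.
from transformers.models.xxx.modeling_xxx import XxxAttention


class XxxFlashAttention2(XxxAttention):
    """Variant of XxxAttention that routes the core attention computation
    through the flash-attn kernels (sketch only, not a complete PR)."""

    def forward(self, hidden_states, attention_mask=None, **kwargs):
        bsz, q_len, _ = hidden_states.size()

        # Project and reshape to (batch, seq_len, num_heads, head_dim),
        # the layout flash_attn_func expects. The projection attribute
        # names are assumptions and differ per architecture.
        query = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)
        key = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)
        value = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)

        # flash-attn requires fp16/bf16 inputs on a supported GPU.
        attn_output = flash_attn_func(query, key, value, dropout_p=0.0, causal=True)

        attn_output = attn_output.reshape(bsz, q_len, self.num_heads * self.head_dim)
        return self.o_proj(attn_output), None
```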
If you are interested in taking up the challenge, comment below with the architecture name you want to integrate and open a PR!
Once you open a PR, feel free to ping @LysandreJik @ArthurZucker @amyeroberts @younesbelkada @fxmarty @SunMarc @pacman100 for a review
Motivation
Making LLMs more memory-efficient and faster!
Your contribution
Reviewing PRs and possibly adding the support for more models