Need a running script for ‘dist_flash_attn’ #22
Well, after making the input sequence length divisible by world_size * block_size, it runs normally.
What is block_size?
The block_size used by flash-attn.
I'm sorry, I don't understand. I didn't find any block_size in the code.
It seems to be here.
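For reference, a minimal sketch of the padding workaround described above, assuming a (batch, seq_len) tensor of token ids; the function name, the pad_token_id, and the world_size/block_size values are illustrative, not taken from the repo:

```python
import torch


def pad_to_multiple(input_ids: torch.Tensor, multiple: int, pad_token_id: int) -> torch.Tensor:
    """Right-pad a (batch, seq_len) tensor so seq_len is divisible by `multiple`."""
    seq_len = input_ids.shape[1]
    remainder = seq_len % multiple
    if remainder == 0:
        return input_ids
    pad_len = multiple - remainder
    padding = torch.full(
        (input_ids.shape[0], pad_len), pad_token_id,
        dtype=input_ids.dtype, device=input_ids.device,
    )
    return torch.cat([input_ids, padding], dim=1)


# Example: with 2 GPUs and an assumed flash-attn block size of 256,
# the sequence length must be a multiple of 2 * 256 = 512.
world_size, block_size = 2, 256
input_ids = torch.randint(0, 32000, (1, 1000))
input_ids = pad_to_multiple(input_ids, world_size * block_size, pad_token_id=0)
assert input_ids.shape[1] % (world_size * block_size) == 0  # padded from 1000 to 1024
```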
Can you provide a script to run dist_flash_attn? I tried setting parallel_mode to dist_flash_attn, but it did not run successfully.
When trying to use 'dist_flash_attn' with 2x A100, process 0 gets stuck in torch.cuda.synchronize() inside _lightseq_forward of one decoder layer, while process 1 has already reached the same step of the next decoder layer. Strangely, the model only gets stuck on the second sample. What might be causing this bug, and is there any way to solve it?
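This is not the repo's actual code, but an illustrative sketch of why such a hang can occur: ring-style attention exchanges tensors between neighbouring ranks with asynchronous point-to-point ops, and if one rank posts a send/recv that its peer never matches (for example because the ranks disagree on the number of blocks after uneven sequence splitting), the pending op never completes and the later torch.cuda.synchronize() blocks forever. A correctly matched exchange looks roughly like this; the function name and tensor layout are assumptions for the example, and all tensors are assumed to live on the local GPU:

```python
import torch
import torch.distributed as dist


def ring_exchange_kv(k: torch.Tensor, v: torch.Tensor):
    """Send the local k/v block to the next rank and receive one from the previous rank.

    Every isend posted here has a matching irecv on the peer rank, so all
    requests can complete. If the peer skipped one of its ops, the wait()
    below (or a later torch.cuda.synchronize()) would hang, which matches
    the symptom described above.
    """
    rank, world_size = dist.get_rank(), dist.get_world_size()
    send_rank = (rank + 1) % world_size
    recv_rank = (rank - 1) % world_size
    recv_k, recv_v = torch.empty_like(k), torch.empty_like(v)
    ops = [
        dist.P2POp(dist.isend, k, send_rank),
        dist.P2POp(dist.isend, v, send_rank),
        dist.P2POp(dist.irecv, recv_k, recv_rank),
        dist.P2POp(dist.irecv, recv_v, recv_rank),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv_k, recv_v
```

In practice this means all ranks must agree on tensor shapes and on the number of communication rounds, which is consistent with the earlier observation that making the sequence length divisible by world_size * block_size resolves the hang.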
EasyContext/easy_context/dist_flash_attn/lightseq_async_attn.py, line 291 (commit 41324ec)
It seems that the communication issued by process 0 in maybe_send_recv_fwd_qkvo never completes.
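One way to surface more information before the hang (a generic sketch, not specific to EasyContext): enable the standard PyTorch/NCCL debug environment variables so that communicator activity and mismatched ops are logged, and fail fast on the divisibility requirement instead of deadlocking. The environment variables below are standard PyTorch/NCCL settings; the helper function is hypothetical.

```python
import os

# Standard PyTorch / NCCL debug settings. Set these before torch.distributed
# is initialised, ideally exported in the launch environment for every rank.
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # log NCCL communicator activity
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # detect mismatched collective/P2P usage


def check_seq_len(seq_len: int, world_size: int, block_size: int) -> None:
    """Raise immediately if the divisibility requirement is violated, instead of hanging later."""
    if seq_len % (world_size * block_size) != 0:
        raise ValueError(
            f"seq_len={seq_len} must be divisible by world_size * block_size = "
            f"{world_size * block_size}; pad or truncate the batch first."
        )
```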