custom_kernel: fix shape mismatch by sharding segment_ids in flash attn. #8333

dudulightricks · 2024-10-29T22:26:26Z

Description: This PR addresses an issue where segment_ids were not considered when adding sharding support in this module. The absence of segment_ids handling results in a shape mismatch failure when using them in sharded Flash Attention.

Edit:
During training with dummy data using this fix, the loss stalls at 0.2 and does not converge to 0 as expected. Further adjustments are needed to resolve this convergence issue.
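
The shape mismatch from the description can be sketched with plain shape arithmetic. This is an illustration only, not the module's code: it assumes q is laid out as (batch, heads, q_len, head_dim), that the segment ids are (batch, q_len), and that sharding splits the batch dimension evenly across devices. `shard_shape` and `batch_matches` are hypothetical helpers.

```python
def shard_shape(shape, num_devices, dim=0):
    """Per-device shape after splitting `dim` evenly across `num_devices`."""
    if shape[dim] % num_devices:
        raise ValueError(f"dim {dim} of {shape} is not divisible by {num_devices}")
    local = list(shape)
    local[dim] //= num_devices
    return tuple(local)


def batch_matches(q_shape, seg_shape):
    """True when q and its segment ids agree on batch size and sequence length."""
    return q_shape[0] == seg_shape[0] and q_shape[2] == seg_shape[1]


# Assumed layouts: q is (batch, heads, q_len, head_dim), segment ids are (batch, q_len).
q_global = (8, 4, 128, 64)
seg_global = (8, 128)

q_local = shard_shape(q_global, num_devices=4)

# Segment ids left unsharded: the kernel sees batch 2 for q but batch 8 for the ids.
unsharded_ok = batch_matches(q_local, seg_global)

# Segment ids sharded the same way as q: the per-device shapes line up again.
sharded_ok = batch_matches(q_local, shard_shape(seg_global, num_devices=4))

print(q_local, unsharded_ok, sharded_ok)
```

Matching shapes only removes the crash; it says nothing about whether the sharded result is numerically correct, which is the separate problem noted in the edit above.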

miladm

Thanks for the contribution! It looks like your use case runs into numerical inaccuracies, which may suggest that the enable_manual_sharding API calls need correction. I suggest adding a test case to verify and debug the issue on a small example.

Here is a reference for kernel tests you can refer to.

dudulightricks · 2024-11-21T15:13:53Z

@miladm Hi! We have added a test that currently fails (16% of the values are correct, the others are not). I hope it will help you understand what's wrong.
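
A figure such as "16% of the values are correct" is typically obtained by comparing the sharded output against an unsharded reference element by element. The sketch below shows one way to compute such a fraction; `fraction_close` and the tolerances are assumptions for illustration, not the helper used in the PR's test.

```python
import math


def fraction_close(actual, expected, rel_tol=1e-5, abs_tol=1e-8):
    """Return the fraction of elements of `actual` within tolerance of `expected`."""
    if len(actual) != len(expected):
        raise ValueError("inputs must have the same number of elements")
    if not actual:
        raise ValueError("inputs must be non-empty")
    hits = sum(
        math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )
    return hits / len(actual)


# Made-up flattened outputs: the reference run and a run under test.
reference = [0.10, 0.20, 0.30, 0.40, 0.50]
candidate = [0.10, 0.25, 0.30, 0.90, 0.50]

match = fraction_close(candidate, reference)
print(f"{match:.0%} of values match")
```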

miladm · 2024-11-21T19:33:23Z

@dudulightricks Thanks for submitting the test code. We reviewed your code internally. It seems that if you shard both the KV and Q segment_ids, the query is no longer attended to all KV elements in the matmul, hence the numerical inconsistency. Have you tried sharding only the query segment_ids?
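
The effect described here can be reproduced in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the Pallas kernel: it assumes the shard splits the sequence dimension (the thread does not say which dimension was sharded), and `segment_attention` is a hypothetical reference implementation of segment-masked attention.

```python
import math


def segment_attention(q, k, v, q_seg, kv_seg):
    """Single-head attention over one sequence (lists of float vectors).

    Query i may only attend to key j when q_seg[i] == kv_seg[j].
    Assumes every query has at least one key in its own segment.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi, si in zip(q, q_seg):
        scores = [
            scale * sum(a * b for a, b in zip(qi, kj)) if si == sj else -math.inf
            for kj, sj in zip(k, kv_seg)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # masked keys get weight 0
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two lists of vectors."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
seg = [0, 1, 0, 1]  # interleaved segments, so each half holds both segment ids

full = segment_attention(q, k, v, seg, seg)

# Shard Q, K, V and both segment-id arrays along the sequence axis:
# each query now only sees the keys that live in its own shard.
seq_sharded = []
for part in (slice(0, 2), slice(2, 4)):
    seq_sharded += segment_attention(q[part], k[part], v[part], seg[part], seg[part])

# Shard along the batch axis instead: sequences are independent of each other.
batch = [(q, k, v, seg), (k, q, v, seg)]
batch_full = [segment_attention(a, b, c, s, s) for a, b, c, s in batch]
batch_sharded = []
for shard in (batch[:1], batch[1:]):
    batch_sharded += [segment_attention(a, b, c, s, s) for a, b, c, s in shard]

print("sequence-sharded max error:", max_abs_diff(seq_sharded, full))
print("batch-sharded matches:", batch_sharded == batch_full)
```

In this toy setup the batch-sharded result is identical to the unsharded one, while the sequence-sharded result differs because keys from the other shard are silently dropped from the softmax.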

dudulightricks · 2024-11-21T20:32:48Z

@miladm I just did and the test still fails, but why would something like this happen anyway? We are sharding the model and the data and expect consistency in the results in any sharding case. Can't we trust the result in any sharding case?


dudulightricks marked this pull request as draft October 29, 2024 22:26

dudulightricks mentioned this pull request Oct 29, 2024

Bug - Using Sharding in Flash Attention with segment ids. #8334

Open

miladm reviewed Nov 14, 2024


dudulightricks force-pushed the bug-fix/shard-segment-ids-flash-attention branch 27 times, most recently from 4ba9067 to 0608900 Compare November 21, 2024 15:11

miladm added the pallas label Nov 21, 2024

dudulightricks added 2 commits November 25, 2024 15:57

custom_kernel: fix shape mismatch by sharding segment_ids in flash attn.

8b09601

When sharding support was added to this module, segment_ids weren't taken into account, which causes a shape-mismatch failure when they are used in sharded flash attention.

test proposal for segment ids that fails.

b5d1b8f

dudulightricks force-pushed the bug-fix/shard-segment-ids-flash-attention branch from 0608900 to b5d1b8f Compare November 25, 2024 13:57

JackCaoG mentioned this pull request Nov 27, 2024

Support SegmentID when doing data parallel SPMD #8425

Merged
