Enable FP32 Accumulate in Flash Attention and Flash Decode #13364

Open
2 of 4 tasks
caixunshiren opened this issue Oct 2, 2024 · 1 comment
Labels: flash-attention, flash-decode, kernels, llama3, models, P1

Comments

caixunshiren (Contributor) commented Oct 2, 2024

Description

We do not have support for fp32 accumulate in the sdpa family of kernels. This becomes a problem when the number of chunks gets large: the PCC against ground truth starts to diverge. For models that require 128K sequence length, this is problematic.

This issue tracks the enabling of fp32 accumulate in the following kernels:

round 1:

  • sdpa (bf16 cbs, fp32 accum)
  • sdpa decode (bf16 cbs, fp32 accum)

round 2:

  • sdpa (fp32 cbs, fp32 accum)
  • sdpa decode (fp32 cbs, fp32 accum)
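To make the failure mode concrete, here is a minimal host-side sketch (PyTorch, not the kernel code; shapes, sequence length, and chunk size are illustrative) that runs a flash-attention-style online softmax over KV chunks and compares a bf16 accumulator against an fp32 accumulator, using PCC against an fp64 reference:

```python
import torch

def pcc(a, b):
    # Pearson correlation coefficient between two flattened tensors.
    a, b = a.flatten().double(), b.flatten().double()
    return torch.corrcoef(torch.stack([a, b]))[0, 1].item()

def chunked_sdpa(q, k, v, chunk, acc_dtype):
    # Flash-attention-style online softmax; `acc_dtype` is the accumulator
    # precision (a stand-in for the kernel's dest-accumulate mode).
    scale = q.shape[-1] ** -0.5
    out = torch.zeros(q.shape, dtype=acc_dtype)                  # running numerator
    row_max = torch.full(q.shape[:-1] + (1,), float("-inf"), dtype=acc_dtype)
    row_sum = torch.zeros(q.shape[:-1] + (1,), dtype=acc_dtype)  # running denominator
    for s in range(0, k.shape[-2], chunk):
        ks, vs = k[s:s + chunk], v[s:s + chunk]
        scores = (q.to(acc_dtype) @ ks.to(acc_dtype).transpose(-1, -2)) * scale
        new_max = torch.maximum(row_max, scores.amax(-1, keepdim=True))
        corr = torch.exp(row_max - new_max)                      # rescale old stats
        p = torch.exp(scores - new_max)
        row_sum = row_sum * corr + p.sum(-1, keepdim=True)
        out = out * corr + p @ vs.to(acc_dtype)
        row_max = new_max
    return (out / row_sum).to(q.dtype)

torch.manual_seed(0)
seq, d = 32 * 1024, 128                  # the issue targets up to 128K sequence length
q = torch.randn(1, 1, d, dtype=torch.bfloat16)
k = torch.randn(seq, d, dtype=torch.bfloat16)
v = torch.randn(seq, d, dtype=torch.bfloat16)
ref = chunked_sdpa(q.double(), k.double(), v.double(), seq, torch.float64)  # fp64 ground truth
for acc in (torch.bfloat16, torch.float32):
    out = chunked_sdpa(q, k, v, chunk=256, acc_dtype=acc)
    print(acc, "PCC vs fp64 reference:", pcc(out, ref))
```

As the number of chunks grows, the bf16 accumulator's PCC against the reference is expected to degrade while the fp32 accumulator stays near 1.0, which is the gap this issue is about.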

FYI @cglagovichTT

caixunshiren (Contributor, Author) commented:

Update:

  • It appears that most of the PCC drop is attributable to math approximation, not fp32 accumulate. That issue is tracked here: Fix Diverging PCC issue in SDPA Kernels #13866
  • Based on my experiments, fp32 accumulate works fine with bf16 cbs. With fp32 cbs, we need some inputs to mul_block_inplace and add_block_inplace to be bf16 cbs during intermediate and stat accumulation/updates, otherwise we get PCC degradation compared to bf16. My WIP work is on branch sdpa-fp32-investigations.
  • I also found that reconfig_dataformat doesn't work as expected and can hang or produce wrong results for in/out_cb. This explains the flash decode fp32 accumulate hang on reducer cores that I saw earlier: Deterministic Hang with unpack_reconfig_data_format at a specific spot in Flash Decode Kernel #9608
  • Actions for now: support fp32 accumulate with bf16 cbs in the sdpa/sdpa-decode kernels (model-side usage sketched below); leave fp32 cbs for later.
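For reference, the model-side call would look roughly like the sketch below once fp32 accumulate lands. Exact op and config field names are from memory and may differ across ttnn versions; q, k, v are assumed to be bf16 device tensors already allocated. Treat this as a sketch, not the final interface.

```python
import ttnn

# Compute kernel config requesting fp32 dest accumulation and exact math.
compute_kernel_config = ttnn.WormholeComputeKernelConfig(
    math_fidelity=ttnn.MathFidelity.HiFi4,  # full-fidelity multiplies
    math_approx_mode=False,                 # exact exp/recip (see #13866)
    fp32_dest_acc_en=True,                  # fp32 accumulate in DST
    packer_l1_acc=False,
)

out = ttnn.transformer.scaled_dot_product_attention(
    q, k, v,
    is_causal=True,
    compute_kernel_config=compute_kernel_config,
)
```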

ct-clmsn pushed a commit to ct-clmsn/tt-metal that referenced this issue Nov 12, 2024