
Fix CB Overflow issue on certain transposes and permutes #16155

Merged: 6 commits merged into main from sjameel/fix_buffer_overflow on Jan 3, 2025

Conversation


@sjameelTT (Contributor) commented Dec 18, 2024

Ticket

#16109

Problem description

The row-major interleaved transpose WH kernel scales with O(height * width) of the tensor, which causes CB overflows for larger dimensions in the PT2.0 traces. In addition, the kernel requires height and width to be 16-element aligned, which forced us to convert to tiled layout.
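
For illustration only, a minimal sketch of the kind of case described above; the shape, test name, and assert_with_pcc import path are assumptions, not taken from this PR:

import torch
import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc  # import path assumed


def test_transpose_wh_large_row_major(device):
    # Hypothetical shape: H*W large enough that the old row-major WH kernel's
    # per-core CB allocation would overflow; the permute-based path uses
    # constant space instead.
    torch_input = torch.randn((1, 1, 2048, 4096), dtype=torch.bfloat16)
    torch_output = torch_input.transpose(2, 3)

    tt_input = ttnn.from_torch(
        torch_input, dtype=ttnn.DataType.BFLOAT16, layout=ttnn.ROW_MAJOR_LAYOUT, device=device
    )
    tt_output = ttnn.transpose(tt_input, 2, 3)
    assert_with_pcc(torch_output, ttnn.to_torch(tt_output), 0.99)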

What's changed

Use ttnn::prim::permute, which allocates constant space and does not have an alignment requirement. Also enable transpose tests that used to fail due to CB overflows.

This brings ttnn.transpose sweep coverage from 98.4% to 99.2% (3 of the "failures" are actually 0.998 PCC rather than 0.999).
ttnn.permute for row-major layout goes from 87% to 100% with this change. We are at 85.44% overall, as tiled is hanging.

Use permute in fold for performance reasons.
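
As a usage sketch of the row-major permute path on an unaligned input; the shape, permutation, test name, and assert_with_pcc import path are illustrative assumptions:

import torch
import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc  # import path assumed


def test_permute_row_major_unaligned(device):
    # Height and width deliberately not multiples of 16: the permute-based
    # path has no 16-element alignment requirement, so no conversion to
    # tiled layout is needed.
    torch_input = torch.randn((2, 3, 33, 97), dtype=torch.bfloat16)
    torch_output = torch_input.permute(0, 1, 3, 2)

    tt_input = ttnn.from_torch(
        torch_input, dtype=ttnn.DataType.BFLOAT16, layout=ttnn.ROW_MAJOR_LAYOUT, device=device
    )
    tt_output = ttnn.permute(tt_input, (0, 1, 3, 2))
    assert_with_pcc(torch_output, ttnn.to_torch(tt_output), 0.9999)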

Checklist

@@ -747,17 +745,20 @@ def test_transpose_4d_wh_tile(shape, device):
assert_with_pcc(torch_output, tt_output, 0.9999)


@pytest.mark.skip("Skipping due to hang on to_layout to tile where input shape has 1 in it")
Contributor:

Please indicate an issue number.

Contributor Author:

Done

],
)
- @pytest.mark.parametrize("memory_config", [ttnn.L1_MEMORY_CONFIG, ttnn.DRAM_MEMORY_CONFIG])
+ @pytest.mark.parametrize("memory_config", [ttnn.DRAM_MEMORY_CONFIG])
Contributor:

Why remove L1 memory configs?

Contributor Author:

An accident while testing, which was especially dumb because this is a disabled test.
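
Presumably the intended parametrization keeps both configs, matching the original line before the accidental edit; a sketch with a placeholder test name and body:

import pytest
import ttnn


@pytest.mark.parametrize("memory_config", [ttnn.L1_MEMORY_CONFIG, ttnn.DRAM_MEMORY_CONFIG])
def test_transpose_with_memory_config(memory_config, device):
    ...  # test body unchanged from the file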

@sjameelTT requested a review from nardoTT as a code owner on December 20, 2024 23:20
@sjameelTT force-pushed the sjameel/fix_buffer_overflow branch from 3e62f76 to 01c2dce on December 24, 2024 20:53
@sjameelTT requested a review from uaydonat as a code owner on December 24, 2024 20:53
- replace the existing interleaved row-major transpose WH implementation with ttnn::prim::permute
- the current kernel takes up O(H*W) space per core and only supports 16-element-aligned inputs, resulting in CB overflows and conversions to tiled layout
- the new kernel takes up constant space, making it more reliable, though not as performant at the moment
@sjameelTT force-pushed the sjameel/fix_buffer_overflow branch from 01c2dce to cc32355 on January 2, 2025 15:37
@@ -18,7 +18,7 @@ def test_perf_device_bare_metal(batch_size, test):
subdir = "ttnn_squeezebert"
num_iterations = 1
margin = 0.03
- expected_perf = 114.8 if is_grayskull() else 284.5
+ expected_perf = 102.7 if is_grayskull() else 298.7
Contributor:

Great non-Grayskull perf increase! Do we have an idea why Grayskull perf got worse?

Contributor Author:

My guess is that the LLK perf for unpack_tilize and pack_untilize may not be the best, since the operation is now done completely differently.
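
As a quick sanity check on the numbers in the diff above (assuming expected_perf is a throughput target where higher is better), the relative changes work out to roughly -10.5% on Grayskull and +5% elsewhere:

def pct_change(old: float, new: float) -> float:
    return (new - old) / old * 100.0


print(f"Grayskull:     {pct_change(114.8, 102.7):+.1f}%")  # about -10.5%
print(f"non-Grayskull: {pct_change(284.5, 298.7):+.1f}%")  # about +5.0%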


torch_input = torch.randn(shape, dtype=torch.bfloat16)
torch_output = torch_input.transpose(0, 1)

tt_input = ttnn.from_torch(torch_input, dtype=ttnn.DataType.BFLOAT16, layout=layout, device=device)
tt_output = ttnn.transpose(tt_input, 0, 1)
tt_output = ttnn.to_torch(tt_output)
- assert_with_pcc(torch_output, tt_output, 0.9999)
+ assert_with_pcc(torch_output, tt_output, 0.99)
Contributor:

Is this a PCC drop?

Contributor Author:

The previous cases are the same, but for the new, bigger inputs we start seeing around 0.998 PCC.

Contributor Author:

Actually, after rebasing I don't see the drop at all anymore.
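
For context on the thresholds being discussed, a minimal sketch of what a Pearson correlation coefficient (PCC) check computes; this is only an illustration, not the actual assert_with_pcc implementation:

import torch


def pearson_cc(expected: torch.Tensor, actual: torch.Tensor) -> float:
    # Flatten both tensors and compute the Pearson correlation coefficient;
    # assert_with_pcc compares a value like this against thresholds such as
    # 0.99 or 0.9999.
    x = expected.flatten().to(torch.float32)
    y = actual.flatten().to(torch.float32)
    return torch.corrcoef(torch.stack([x, y]))[0, 1].item()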

@ntarafdar (Contributor) left a comment:

Amazing fix, just some clarifications.

@@ -815,7 +817,7 @@ def test_transpose_unaligned(config, memory_config, device):
)
tt_output = ttnn.transpose(tt_input, config[1][0], config[1][1])
tt_output = ttnn.to_torch(tt_output)
- assert_with_pcc(torch_output, tt_output, 0.9999)
+ assert_with_pcc(torch_output, tt_output, 0.99)
Contributor:

Is 0.99 an OK benchmark for PCC, or do we just want to let it pass?

Contributor Author (@sjameelTT, Jan 2, 2025):

Should be okay since the similarity is off by only 1%. The previous tests didn't decrease; it's just the new, really big input I added where we see it drop to 0.997/0.998.

Contributor Author:

Actually, after rebasing it seems like I don't get the drop anymore. Not sure why.

@sjameelTT force-pushed the sjameel/fix_buffer_overflow branch from cc32355 to 1e8bfdf on January 2, 2025 17:40
@llongTT (Contributor) left a comment:

LGTM

@sjameelTT merged commit 9acc400 into main on Jan 3, 2025
355 of 415 checks passed
@sjameelTT deleted the sjameel/fix_buffer_overflow branch on January 3, 2025 15:00