
Fix CB Overflow issue on certain transposes and permutes #16155

Merged: 6 commits merged into main from sjameel/fix_buffer_overflow on Jan 3, 2025

Conversation


@sjameelTT (Contributor) commented Dec 18, 2024

Ticket

#16109

Problem description

The row-major interleaved transpose WH kernel scales with O(height * width) of the tensor, which causes CB overflows for larger dimensions in the PT2.0 traces. In addition, the kernel requires height and width to be 16-element aligned, which forced us to convert to tiled layout.
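
For illustration only, a minimal sketch of the kind of case described above; the shape, test name, and assert_with_pcc import path are assumptions, not taken from this PR:

import torch
import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc  # import path assumed


def test_transpose_wh_large_row_major(device):
    # Hypothetical shape: H*W large enough that the old row-major WH kernel's
    # per-core CB allocation would overflow; the permute-based path uses
    # constant space instead.
    torch_input = torch.randn((1, 1, 2048, 4096), dtype=torch.bfloat16)
    torch_output = torch_input.transpose(2, 3)

    tt_input = ttnn.from_torch(
        torch_input, dtype=ttnn.DataType.BFLOAT16, layout=ttnn.ROW_MAJOR_LAYOUT, device=device
    )
    tt_output = ttnn.transpose(tt_input, 2, 3)
    assert_with_pcc(torch_output, ttnn.to_torch(tt_output), 0.99)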

What's changed

Use ttnn::prim::permute, which allocates constant space and does not have an alignment requirement. Also enable transpose tests that used to fail due to CB overflows.

This brings ttnn.transpose sweep coverage from 98.4% to 99.2% (3 of the "failures" are actually 0.998 PCC rather than 0.999).
ttnn.permute for row-major layout goes from 87% to 100% with this change. We are at 85.44% overall, as tiled is hanging.

Use permute in fold for performance reasons.
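
As a usage sketch of the row-major permute path on an unaligned input; the shape, permutation, test name, and assert_with_pcc import path are illustrative assumptions:

import torch
import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc  # import path assumed


def test_permute_row_major_unaligned(device):
    # Height and width deliberately not multiples of 16: the permute-based
    # path has no 16-element alignment requirement, so no conversion to
    # tiled layout is needed.
    torch_input = torch.randn((2, 3, 33, 97), dtype=torch.bfloat16)
    torch_output = torch_input.permute(0, 1, 3, 2)

    tt_input = ttnn.from_torch(
        torch_input, dtype=ttnn.DataType.BFLOAT16, layout=ttnn.ROW_MAJOR_LAYOUT, device=device
    )
    tt_output = ttnn.permute(tt_input, (0, 1, 3, 2))
    assert_with_pcc(torch_output, ttnn.to_torch(tt_output), 0.9999)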

Checklist

@@ -747,17 +745,20 @@ def test_transpose_4d_wh_tile(shape, device):
assert_with_pcc(torch_output, tt_output, 0.9999)


@pytest.mark.skip("Skipping due to hang on to_layout to tile where input shape has 1 in it")
Contributor:

Please indicate an issue number.

Contributor Author:

Done

],
)
- @pytest.mark.parametrize("memory_config", [ttnn.L1_MEMORY_CONFIG, ttnn.DRAM_MEMORY_CONFIG])
+ @pytest.mark.parametrize("memory_config", [ttnn.DRAM_MEMORY_CONFIG])
Contributor:

Why remove L1 memory configs?

Contributor Author:

An accident while testing, which was especially dumb because this is a disabled test.
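
Presumably the intended parametrization keeps both configs, matching the original line before the accidental edit; a sketch with a placeholder test name and body:

import pytest
import ttnn


@pytest.mark.parametrize("memory_config", [ttnn.L1_MEMORY_CONFIG, ttnn.DRAM_MEMORY_CONFIG])
def test_transpose_with_memory_config(memory_config, device):
    ...  # test body unchanged from the file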

@sjameelTT requested a review from nardoTT as a code owner on December 20, 2024 23:20
@sjameelTT force-pushed the sjameel/fix_buffer_overflow branch from 3e62f76 to 01c2dce on December 24, 2024 20:53
@sjameelTT requested a review from uaydonat as a code owner on December 24, 2024 20:53
- replace the existing interleaved row-major transpose WH implementation with ttnn::prim::permute
- the current kernel takes up O(H*W) space per core and only supports 16-element-aligned inputs, resulting in CB overflows and conversions to tiled layout
- the new kernel takes up constant space, making it more reliable, though not as performant at the moment
@sjameelTT force-pushed the sjameel/fix_buffer_overflow branch from 01c2dce to cc32355 on January 2, 2025 15:37
@@ -18,7 +18,7 @@ def test_perf_device_bare_metal(batch_size, test):
subdir = "ttnn_squeezebert"
num_iterations = 1
margin = 0.03
- expected_perf = 114.8 if is_grayskull() else 284.5
+ expected_perf = 102.7 if is_grayskull() else 298.7
Contributor:

Great non-Grayskull perf increase! Do we have an idea why Grayskull perf got worse?

Contributor Author:

My guess is that the LLK perf for unpack_tilize and pack_untilize may not be the best, since the operation is now done completely differently.
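
As a quick sanity check on the numbers in the diff above (assuming expected_perf is a throughput target where higher is better), the relative changes work out to roughly -10.5% on Grayskull and +5% elsewhere:

def pct_change(old: float, new: float) -> float:
    return (new - old) / old * 100.0


print(f"Grayskull:     {pct_change(114.8, 102.7):+.1f}%")  # about -10.5%
print(f"non-Grayskull: {pct_change(284.5, 298.7):+.1f}%")  # about +5.0%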


torch_input = torch.randn(shape, dtype=torch.bfloat16)
torch_output = torch_input.transpose(0, 1)

tt_input = ttnn.from_torch(torch_input, dtype=ttnn.DataType.BFLOAT16, layout=layout, device=device)
tt_output = ttnn.transpose(tt_input, 0, 1)
tt_output = ttnn.to_torch(tt_output)
- assert_with_pcc(torch_output, tt_output, 0.9999)
+ assert_with_pcc(torch_output, tt_output, 0.99)
Contributor:

Is this a PCC drop?

Contributor Author:

The previous cases are the same, but for the new, bigger inputs we start seeing around 0.998 PCC.

Contributor Author:

Actually, after rebasing I don't see the drop at all anymore.
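
For context on the thresholds being discussed, a minimal sketch of what a Pearson correlation coefficient (PCC) check computes; this is only an illustration, not the actual assert_with_pcc implementation:

import torch


def pearson_cc(expected: torch.Tensor, actual: torch.Tensor) -> float:
    # Flatten both tensors and compute the Pearson correlation coefficient;
    # assert_with_pcc compares a value like this against thresholds such as
    # 0.99 or 0.9999.
    x = expected.flatten().to(torch.float32)
    y = actual.flatten().to(torch.float32)
    return torch.corrcoef(torch.stack([x, y]))[0, 1].item()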

@ntarafdar (Contributor) left a comment:

Amazing fix, just some clarifications.

@@ -815,7 +817,7 @@ def test_transpose_unaligned(config, memory_config, device):
)
tt_output = ttnn.transpose(tt_input, config[1][0], config[1][1])
tt_output = ttnn.to_torch(tt_output)
- assert_with_pcc(torch_output, tt_output, 0.9999)
+ assert_with_pcc(torch_output, tt_output, 0.99)
Contributor:

Is 0.99 an OK benchmark for PCC, or do we just want to let it pass?

Contributor Author (@sjameelTT, Jan 2, 2025):

Should be okay since the similarity is off by only 1%. The previous tests didn't decrease; it's just the new, really big input I added where we see it drop to 0.997/0.998.

Contributor Author:

Actually, after rebasing it seems like I don't get the drop anymore. Not sure why.

@sjameelTT force-pushed the sjameel/fix_buffer_overflow branch from cc32355 to 1e8bfdf on January 2, 2025 17:40
@llongTT (Contributor) left a comment:

LGTM

@sjameelTT merged commit 9acc400 into main on Jan 3, 2025
355 of 415 checks passed
@sjameelTT deleted the sjameel/fix_buffer_overflow branch on January 3, 2025 15:00