Fix CB Overflow issue on certain transposes and permutes #16155
Conversation
@@ -747,17 +745,20 @@ def test_transpose_4d_wh_tile(shape, device):
     assert_with_pcc(torch_output, tt_output, 0.9999)


+@pytest.mark.skip("Skipping due to hang on to_layout to tile where input shape has 1 in it")
Please indicate an issue number.
Done
     ],
 )
-@pytest.mark.parametrize("memory_config", [ttnn.L1_MEMORY_CONFIG, ttnn.DRAM_MEMORY_CONFIG])
+@pytest.mark.parametrize("memory_config", [ttnn.DRAM_MEMORY_CONFIG])
Why remove L1 memory configs?
An accident while testing, which was especially dumb because this is a disabled test.
- Replace the existing interleaved row-major transpose WH implementation with ttnn::prim::permute.
- The current kernel takes up O(H*W) space per core and supports only 16-element-aligned inputs, resulting in CB overflows and conversions to tiled layout.
- The new kernel takes up constant space, making it more reliable, though not as performant at the moment.
@@ -18,7 +18,7 @@ def test_perf_device_bare_metal(batch_size, test):
     subdir = "ttnn_squeezebert"
     num_iterations = 1
     margin = 0.03
-    expected_perf = 114.8 if is_grayskull() else 284.5
+    expected_perf = 102.7 if is_grayskull() else 298.7
Great non-Grayskull perf increase! Do we have an idea why Grayskull perf got worse?
My guess is that the LLK perf for unpack_tilize and pack_untilize may not be the best, since it's done completely differently.
     torch_input = torch.randn(shape, dtype=torch.bfloat16)
     torch_output = torch_input.transpose(0, 1)

     tt_input = ttnn.from_torch(torch_input, dtype=ttnn.DataType.BFLOAT16, layout=layout, device=device)
     tt_output = ttnn.transpose(tt_input, 0, 1)
     tt_output = ttnn.to_torch(tt_output)
-    assert_with_pcc(torch_output, tt_output, 0.9999)
+    assert_with_pcc(torch_output, tt_output, 0.99)
Is this a PCC drop?
The previous ones are the same, but for the new, bigger inputs we start seeing a PCC of around 0.998.
Actually, after rebasing I don't see the drop at all anymore.
Amazing fix, just some clarifications.
@@ -815,7 +817,7 @@ def test_transpose_unaligned(config, memory_config, device):
     )
     tt_output = ttnn.transpose(tt_input, config[1][0], config[1][1])
     tt_output = ttnn.to_torch(tt_output)
-    assert_with_pcc(torch_output, tt_output, 0.9999)
+    assert_with_pcc(torch_output, tt_output, 0.99)
Is 0.99 an OK threshold for PCC, or do we just want to let it pass?
Should be okay, since the similarity is off by only 1%. Previous tests didn't decrease; it's just the new, really big input I added where we see it drop to 0.997/0.998.
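For reference, a minimal sketch of what a check like this computes, assuming assert_with_pcc compares a Pearson correlation coefficient over the flattened tensors against the threshold (the helper below is an illustrative stand-in, not the repo's implementation):

```python
import torch

def pearson_cc(expected: torch.Tensor, actual: torch.Tensor) -> float:
    # Pearson correlation over the flattened tensors, computed in float32.
    x = expected.flatten().float()
    y = actual.flatten().float()
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / (xc.norm() * yc.norm()))

# A 0.99 threshold then asserts the two outputs are ~99% linearly correlated:
# assert pearson_cc(torch_output, tt_output) >= 0.99
```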
Actually, after rebasing it seems like I don't get the drop anymore. Not sure why...
LGTM
Ticket
#16109
Problem description
Transpose WH row-major for interleaved tensors scales with O(Height * Width) of the tensor, which causes CB overflows for bigger dimensions in the PT2.0 traces. In addition, the kernel requires height and width to be 16-element aligned, which forced us to convert to tiled layout.
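For intuition, a rough back-of-the-envelope sketch of the overflow (the bfloat16 element size is real, but the ~1 MB of usable L1 per core is an assumed round figure for illustration, not an exact device spec):

```python
# Why an O(H * W) circular buffer per core overflows L1 for big tensors.
BYTES_PER_ELEMENT = 2               # bfloat16
L1_BUDGET_BYTES = 1 * 1024 * 1024   # ~1 MB usable L1 per core (assumed)

def cb_fits(height: int, width: int) -> bool:
    cb_bytes = height * width * BYTES_PER_ELEMENT  # grows with H * W
    return cb_bytes <= L1_BUDGET_BYTES

print(cb_fits(256, 256))    # True: 128 KiB fits comfortably
print(cb_fits(1024, 1024))  # False: 2 MiB blows the per-core budget
```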
What's changed
Use ttnn::prim::permute, which allocates constant space and has no alignment requirement. Also enable transpose tests that used to fail due to CB overflows.
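As a sketch of the equivalence being relied on (the 4D shape is an illustrative assumption, and `device` stands in for the pytest fixture used in the tests above), a WH transpose is just a permute that swaps the last two dims:

```python
import torch
import ttnn

torch_input = torch.randn(1, 1, 64, 128, dtype=torch.bfloat16)
tt_input = ttnn.from_torch(
    torch_input, dtype=ttnn.DataType.BFLOAT16, layout=ttnn.ROW_MAJOR_LAYOUT, device=device
)

# transpose(-2, -1) and permute((0, 1, 3, 2)) describe the same reordering,
# which is what lets the row-major WH transpose route through ttnn::prim::permute.
via_transpose = ttnn.transpose(tt_input, -2, -1)
via_permute = ttnn.permute(tt_input, (0, 1, 3, 2))

assert torch.equal(ttnn.to_torch(via_transpose), ttnn.to_torch(via_permute))
```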
This brings ttnn.transpose coverage on sweeps from 98.4% to 99.2% (3 of the "failures" are actually 0.998 PCC, not 0.999).
ttnn.permute for ROW_MAJOR layout goes from 87% to 100% with this change; we are at 85.44% overall, as tiled is hanging.
Use permute in fold for perf reasons
Checklist