
ttnn.hypot_bw low PCC #13776

Open
Tracked by #13795
amalbasaTT opened this issue Oct 14, 2024 · 6 comments

amalbasaTT (Contributor) commented Oct 14, 2024

Describe the bug
ttnn.hypot_bw fails with low PCC when input_tensor_a and/or input_tensor_b have the bfloat8_b dtype.

To Reproduce
Steps to reproduce the behavior:
Sweep test for hypot_bw is located in 'tests/sweep_framework/sweeps/eltwise/binary_backward/hypot_bw/hypot_bw.py'

  1. Go to 'tests/sweep_framework/sweeps/eltwise/binary_backward/hypot_bw/hypot_bw.py'
  2. Generate new parameter vectors and run the sweep test:
python3 tests/sweep_framework/sweeps_parameter_generator.py --elastic cloud --module-name eltwise.binary_backward.hypot_bw.hypot_bw
python3 tests/sweep_framework/sweeps_runner.py --elastic cloud --module-name eltwise.binary_backward.hypot_bw.hypot_bw --suite-name xfail
  3. See the error. Results can be found on elastic cloud as explained here: https://github.com/tenstorrent/tt-metal/tree/main/tests/sweep_framework
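
For a quicker local check than the full sweep, a minimal standalone repro along these lines should surface the PCC drop. This is a sketch only: the tensor shape, the PCC helper, and the assumption that ttnn.hypot_bw returns [grad_a, grad_b] are illustrative, not taken from the sweep test.

    # Minimal repro sketch (shapes and return-format are assumptions; the real
    # sweep lives in tests/sweep_framework/sweeps/eltwise/binary_backward/hypot_bw/hypot_bw.py).
    import torch
    import ttnn

    def pcc(golden, actual):
        # Pearson correlation between flattened tensors.
        g = golden.flatten().to(torch.float32)
        a = actual.flatten().to(torch.float32)
        return torch.corrcoef(torch.stack([g, a]))[0, 1].item()

    device = ttnn.open_device(device_id=0)

    torch.manual_seed(0)
    shape = (1, 1, 64, 64)
    grad = torch.randn(shape)
    a = torch.randn(shape, requires_grad=True)
    b = torch.randn(shape, requires_grad=True)

    # Golden gradients from torch autograd.
    torch.hypot(a, b).backward(grad)

    # Device gradients with bfloat8_b inputs (bfloat8_b requires TILE_LAYOUT).
    tt_grad = ttnn.from_torch(grad, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT, device=device)
    tt_a = ttnn.from_torch(a.detach(), dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT, device=device)
    tt_b = ttnn.from_torch(b.detach(), dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT, device=device)

    # Assumed to return the gradients w.r.t. input_tensor_a and input_tensor_b.
    tt_grads = ttnn.hypot_bw(tt_grad, tt_a, tt_b)

    print("PCC grad_a:", pcc(a.grad, ttnn.to_torch(tt_grads[0])))
    print("PCC grad_b:", pcc(b.grad, ttnn.to_torch(tt_grads[1])))

    ttnn.close_device(device)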
KalaivaniMCW (Contributor) commented Nov 5, 2024

ttnn.hypot_bw uses ttnn.reciprocal in its implementation, which has this issue: #14672.
This is what causes the PCC drop when using bfloat8_b.
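
For reference, the reciprocal shows up if the backward is written as a composite of elementwise ops; near-zero bfloat8_b values make 1/hypot(a, b) blow up, which is where the PCC is lost. The torch sketch below illustrates that formula only and is an assumption about the composite structure, not the actual ttnn kernel code.

    # Torch sketch of the composite backward formula (not the ttnn kernel).
    import torch

    def hypot_bw_reference(grad, a, b):
        # grad_a = grad * a * (1 / hypot(a, b)), grad_b = grad * b * (1 / hypot(a, b))
        recip = torch.reciprocal(torch.hypot(a, b))  # <- the ttnn.reciprocal step
        return grad * a * recip, grad * b * recip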

@KalaivaniMCW KalaivaniMCW self-assigned this Nov 5, 2024
@eyonland eyonland added the MCW label Dec 20, 2024
KalaivaniMCW (Contributor) commented:

Hi @amalbasaTT,
When using the bfloat8_b dtype, could you try non-zero inputs to see if these cases pass?
You can find more information on this here: #14672

For bfloat8_b and ttnn.reciprocal-related ops we need separate implementations for zero and non-zero inputs.
Handling this at the kernel level would make the op slower, so we will discuss the best approach, i.e. whether to handle it at the composite level instead. For now, could we test on non-zero inputs to identify any issues other than this one? (A sketch of one way to generate such inputs follows below.)
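
One possible way to generate non-zero test inputs; the value range here is an assumption, chosen so that the values stay non-zero after bfloat8_b quantization.

    # Sketch: draw inputs whose magnitude stays away from zero (range is illustrative).
    import torch

    def gen_nonzero(shape, low=0.1, high=100.0):
        magnitude = torch.empty(shape).uniform_(low, high)
        sign = torch.randint(0, 2, shape) * 2 - 1
        return magnitude * sign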

amalbasaTT (Contributor, Author) commented:

@KalaivaniMCW,
Of course, no problem. Do you want a unit test or a sweep test for each of the ttnn.reciprocal-related ops? Also, is there a list of ttnn.reciprocal-related ops somewhere?

KalaivaniMCW (Contributor) commented:

For now, the issues below seem to be related.
If these op sweeps pass for non-zero inputs, we can rule out bfloat8_b behaviour as their cause.
#13776
#13937
#13975

amalbasaTT (Contributor, Author) commented Dec 23, 2024

  • ttnn.hypot passes in all cases when both inputs (input_tensor_a and input_tensor_b) have non-zero elements and the bfloat8_b datatype
  • ttnn.rdiv_bw (issue) passes in 4.167% of cases when all inputs (grad_tensor, input_tensor_a and factor) have non-zero elements and grad_tensor and input_tensor_a have the bfloat8_b datatype

I will need some additional time to analyse div_bw since it has way more parameters.

Check out the branch amalbasaTT/reciprocal-related-nonzero-sweeps to see the tests (hypot_nonzero.py and rdiv_bw_nonzero.py).

amalbasaTT (Contributor, Author) commented Dec 24, 2024

@KalaivaniMCW

I have tested rdiv_bw, hypot_bw and div_bw with inputs outside of the range [-1, 1]. Before running the ops, I converted the inputs to ttnn tensors with the bfloat8_b dtype and back to torch, so that I could ensure all inputs remain non-zero after any cut-off the conversion to bfloat8_b may introduce (a sketch of this round-trip is at the end of this comment). The issue I found is that when input_tensor has bfloat8_b and a "certain" shape, the op produces an all-zero tensor (or tensors, for binary ops). You can check the details in the issues I filed below; I have provided a unit test and a sweep test for each op:

#16305
#16304
#16306

Besides the issue I reported, all three ops work fine.
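
For reference, a sketch of the round-trip described above; the device handling and the zero check are illustrative assumptions, not copied from the test branch.

    # Sketch: quantize to bfloat8_b on device, convert back to torch, and verify
    # nothing collapsed to zero before using the tensor as an op input.
    import torch
    import ttnn

    def roundtrip_bfloat8_b(torch_tensor, device):
        tt = ttnn.from_torch(torch_tensor, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT, device=device)
        back = ttnn.to_torch(tt)
        assert torch.all(back != 0), "bfloat8_b quantization zeroed out some elements"
        return back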
