Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#7309: Use both reader and writer kernels to read tiled data for Inte… #7310

Closed
wants to merge 4 commits into from

Conversation

bbradelTT
Copy link
Contributor

…rleavedToSharded

Divide the tiles to be read by each core by height across both reader and writer kernels.

Use an extra intermediate CB to communicate between the two kernels when to start and complete writing.

Also, if not converting the data format will skip the cb_write_back for extra performance improvement.

Testing:

@bbradelTT
Copy link
Contributor Author

@pavlejosipovic
Copy link
Contributor

pavlejosipovic commented Apr 11, 2024

@bbradelTT This change looks good for L1 reads, but it doesn't look good for DRAM reads.
On dram BFLOAT4_B is significantly improved ( 47%), BLOAT8_B seems to be the ~same ( 2-10% regression), BFLOAT16 seems to have ~25% regression and Float32 almost 30% regression.
I've used this test for benchmarking https://github.com/tenstorrent-metal/tt-metal/blob/7f9754121a372d9457d05911b7a029c5a28b6748/tests/tt_eager/python_api_testing/unit_testing/misc/test_sharded.py#L2472 at 1GHz frequency.

Also you should double check these numbers on GS, I've checked this change only on WH.

@bbradelTT
Copy link
Contributor Author

I ran the changes on GS against https://github.com/tenstorrent-metal/tt-metal/blob/7f9754121a372d9457d05911b7a029c5a28b6748/tests/tt_eager/python_api_testing/unit_testing/misc/test_sharded.py#L2472 for test_sharded.py::test_interleaved_2_sharded_DRAM and test_sharded.py::test_interleaved_2_sharded_L1.

The following are the values from the DEVICE FW DURATION column and percentage changes:

data type mb L1 Baseline L1 New DRAM Baseline DRAM New L1   DRAM
BFLOAT16 2.25 8180 8679 33739 37928 -6.10%   -12.42%
BFLOAT8_B 1.125 6329 6759 42188 40958 -6.79%   2.92%
FLOAT32 4.5 13423 14340 57733 57447 -6.83%   0.50%
BFLOAT4_B 0.5625 5658 6078 40619 42335 -7.42%   -4.22%
BFLOAT16 4.5 15725 9642 75982 56884 38.68%   25.13%
BFLOAT8_B 2.25 11686 6911 69672 34546 40.86%   50.42%
FLOAT32 9 25843 19317 108653 109101 25.25%   -0.41%
BFLOAT4_B 1.125 10256 6133 83263 33626 40.20%   59.61%
BFLOAT16 9 30386 19495 139522 109573 35.84%   21.47%
BFLOAT8_B 4.5 22646 12408 146972 69334 45.21%   52.83%
FLOAT32 18 52212 38262 216038 210106 26.72%   2.75%
BFLOAT4_B 2.25 19487 10797 158498 64361 44.59%   59.39%
BFLOAT16 18 59985 36993 261937 211507 38.33%   19.25%
BFLOAT8_B 9 44164 24209 290629 138222 45.18%   52.44%
FLOAT32 36     424803 414884     2.33%
BFLOAT4_B 4.5 37673 20191 318118 140002 46.40%   55.99%

These are without the changes to call the barrier less frequently.

Would it be better to restrict using the two processors somehow? E.g. to only L1 and not DRAM, or not DRAM for float32 (and equivalent).

@bbradelTT
Copy link
Contributor Author

I rebased the code and only used the two nocs for L1.

It looks like there is the notion of partial_op that now overlaps with the changes. I'm not sure which unit tests.

The best improvement after @pavlejosipovic 's changes is only about 20% (details below).

@pavlejosipovic @tt-aho

  1. is it worthwhile to get these changes in even when the improvement is a lot lower than before?
  2. are there any specific tests I should run to make sure everything works as expected?

Test run for GS:

DEVICE FW DURATION [ns]                
    L1       DRAM    
    main branch     main branch  
  BFLOAT16 18326 18420 0.51% BFLOAT16 41028 40183 -2.06%
  BFLOAT8_B 15691 16126 2.77% BFLOAT8_B 32601 34641 6.26%
  FLOAT32 24155 24665 2.11% FLOAT32 58953 59569 1.04%
  BFLOAT4_B 13839 14299 3.32% BFLOAT4_B 33804 35174 4.05%
  BFLOAT16 23978 20468 -14.64% BFLOAT16 66339 67150 1.22%
  BFLOAT8_B 19019 17146 -9.85% BFLOAT8_B 61538 63631 3.40%
  FLOAT32 36488 29299 -19.70% FLOAT32 103483 105122 1.58%
  BFLOAT4_B 15692 14496 -7.62% BFLOAT4_B 60582 62639 3.40%
  BFLOAT16 37085 30286 -18.33% BFLOAT16 125507 122777 -2.18%
  BFLOAT8_B 25195 21432 -14.94% BFLOAT8_B 120560 119015 -1.28%
  FLOAT32 58938 49116 -16.66% FLOAT32 196532 197649 0.57%
  BFLOAT4_B 19395 17430 -10.13% BFLOAT4_B 118359 119876 1.28%
  BFLOAT16 59638 48687 -18.36% BFLOAT16 235450 239869 1.88%
  BFLOAT8_B 40177 31359 -21.95% BFLOAT8_B 230968 233934 1.28%
  BFLOAT4_B 26760 22153 -17.22% FLOAT32 382449 385252 0.73%
          BFLOAT4_B 235854 235215 -0.27%

@github-actions github-actions bot closed this Dec 17, 2024
@github-actions github-actions bot deleted the bbradel-7309 branch December 17, 2024 12:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Have both NCRISC and BRISC read tiled data for InterleavedToSharded
3 participants