#7309: Use both reader and writer kernels to read tiled data for Inte… #7310
All post-commit tests pass: https://github.com/tenstorrent-metal/tt-metal/actions/runs/8636931546
...dnn/op_library/sharded/kernels/dataflow/reader_unary_sharded_blocks_interleaved_start_id.cpp
@bbradelTT This change looks good for L1 reads, but not for DRAM reads. Also, please double-check these numbers on GS; I have only checked this change on WH.
I ran the changes on GS against https://github.com/tenstorrent-metal/tt-metal/blob/7f9754121a372d9457d05911b7a029c5a28b6748/tests/tt_eager/python_api_testing/unit_testing/misc/test_sharded.py#L2472 for test_sharded.py::test_interleaved_2_sharded_DRAM and test_sharded.py::test_interleaved_2_sharded_L1. The following are the values from the DEVICE FW DURATION column and percentage changes:
These numbers do not include the changes that call the barrier less frequently. Would it be better to restrict the use of the two processors somehow, e.g. to L1 only and not DRAM, or to skip DRAM only for float32 (and equivalent)?
I rebased the code and now use the two NoCs only for L1. There is now a notion of partial_op that overlaps with these changes. I'm not sure which unit tests cover it. The best improvement after @pavlejosipovic's changes is only about 20% (details below).
Test run for GS:
…rleavedToSharded
Divide the tiles to be read by each core by height across both reader and writer kernels.
Use an extra intermediate CB to communicate between the two kernels when to start and complete writing.
Also, when the data format is not being converted, skip the cb_write_back for an extra performance improvement.
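The height split described above can be sketched as follows. This is only an illustrative sketch, not the code in this PR: `TileRowRange`, `KernelSplit`, `split_rows_by_height`, `first_tile_id`, and the `row_stride_tiles` parameter are hypothetical names, and the sketch assumes tile ids advance by a fixed stride per tile row and that the reader kernel takes the extra row when the height is odd.

```cpp
#include <cstdint>

// Hypothetical helper type (not part of tt-metal): a contiguous range of tile rows.
struct TileRowRange {
    uint32_t start_row;  // first tile row handled by this kernel
    uint32_t num_rows;   // number of tile rows handled by this kernel
};

// Hypothetical result type: the rows assigned to each of the two kernels.
struct KernelSplit {
    TileRowRange reader;
    TileRowRange writer;
};

// Divide a core's block of tile rows by height across the two kernels.
// Assumption: the reader kernel takes the extra row when the height is odd.
inline KernelSplit split_rows_by_height(uint32_t block_height_tiles) {
    const uint32_t writer_rows = block_height_tiles / 2;
    const uint32_t reader_rows = block_height_tiles - writer_rows;
    return KernelSplit{{0, reader_rows}, {reader_rows, writer_rows}};
}

// First tile id a kernel would read.
// Assumption: tile ids advance by row_stride_tiles from one tile row to the next.
inline uint32_t first_tile_id(uint32_t shard_start_id,
                              const TileRowRange& range,
                              uint32_t row_stride_tiles) {
    return shard_start_id + range.start_row * row_stride_tiles;
}
```

With a split like this, each kernel would read its own rows independently, and the intermediate CB would only be needed to signal when writing may start and when it has completed.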
Testing: