#7309: Use both reader and writer kernels to read tiled data for InterleavedToSharded #7310

bbradelTT · 2024-04-10T16:35:12Z

Divide the tiles to be read by each core by height across both reader and writer kernels.

Use an extra intermediate CB so that the two kernels can signal each other when writing starts and when it completes.

Also, when the data format is not being converted, skip the cb_write_back for an extra performance improvement.
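The height-wise split described above can be modeled in plain Python. This is only a sketch of the arithmetic, not the kernel code: the name `split_tile_rows`, the use of two contiguous halves, and giving the extra row to the reader kernel when the height is odd are all assumptions, since the description does not say how rows are assigned.

```python
from typing import Tuple


def split_tile_rows(block_height_tiles: int) -> Tuple[range, range]:
    """Split tile rows [0, block_height_tiles) between two kernels.

    Hypothetical model: the reader kernel takes the first half (rounded up)
    and the writer kernel takes the remaining rows.
    """
    if block_height_tiles < 0:
        raise ValueError("block_height_tiles must be non-negative")
    reader_count = (block_height_tiles + 1) // 2  # reader gets the odd row
    reader_rows = range(0, reader_count)
    writer_rows = range(reader_count, block_height_tiles)
    return reader_rows, writer_rows


if __name__ == "__main__":
    reader_rows, writer_rows = split_tile_rows(5)
    print(list(reader_rows), list(writer_rows))
```

Whatever the actual assignment is, the two row sets have to be disjoint and together cover every row of the shard, which is the property the sketch preserves.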

Testing:

Verified that the output is correct.
Verified that performance improved by about 2x (with the changes for issue #6729, "Dram Bandwidth below 20bytes/clk/channel for InterleavedToSharded").

bbradelTT · 2024-04-10T21:04:27Z

All post-commit tests pass: https://github.com/tenstorrent-metal/tt-metal/actions/runs/8636931546

...dnn/op_library/sharded/kernels/dataflow/reader_unary_sharded_blocks_interleaved_start_id.cpp

pavlejosipovic · 2024-04-11T10:20:15Z

@bbradelTT This change looks good for L1 reads, but it doesn't look good for DRAM reads.
On DRAM, BFLOAT4_B is significantly improved (47%), BFLOAT8_B is about the same (2-10% regression), BFLOAT16 has a ~25% regression, and Float32 an almost 30% regression.
I've used this test for benchmarking https://github.com/tenstorrent-metal/tt-metal/blob/7f9754121a372d9457d05911b7a029c5a28b6748/tests/tt_eager/python_api_testing/unit_testing/misc/test_sharded.py#L2472 at 1GHz frequency.

Also, you should double-check these numbers on GS; I've checked this change only on WH.

bbradelTT · 2024-04-11T21:02:01Z

I ran the changes on GS against https://github.com/tenstorrent-metal/tt-metal/blob/7f9754121a372d9457d05911b7a029c5a28b6748/tests/tt_eager/python_api_testing/unit_testing/misc/test_sharded.py#L2472 for test_sharded.py::test_interleaved_2_sharded_DRAM and test_sharded.py::test_interleaved_2_sharded_L1.

The following are the values from the DEVICE FW DURATION column and percentage changes:

| Data type | MB | L1 baseline | L1 new | DRAM baseline | DRAM new | L1 improvement | DRAM improvement |
|---|---|---|---|---|---|---|---|
| BFLOAT16 | 2.25 | 8180 | 8679 | 33739 | 37928 | -6.10% | -12.42% |
| BFLOAT8_B | 1.125 | 6329 | 6759 | 42188 | 40958 | -6.79% | 2.92% |
| FLOAT32 | 4.5 | 13423 | 14340 | 57733 | 57447 | -6.83% | 0.50% |
| BFLOAT4_B | 0.5625 | 5658 | 6078 | 40619 | 42335 | -7.42% | -4.22% |
| BFLOAT16 | 4.5 | 15725 | 9642 | 75982 | 56884 | 38.68% | 25.13% |
| BFLOAT8_B | 2.25 | 11686 | 6911 | 69672 | 34546 | 40.86% | 50.42% |
| FLOAT32 | 9 | 25843 | 19317 | 108653 | 109101 | 25.25% | -0.41% |
| BFLOAT4_B | 1.125 | 10256 | 6133 | 83263 | 33626 | 40.20% | 59.61% |
| BFLOAT16 | 9 | 30386 | 19495 | 139522 | 109573 | 35.84% | 21.47% |
| BFLOAT8_B | 4.5 | 22646 | 12408 | 146972 | 69334 | 45.21% | 52.83% |
| FLOAT32 | 18 | 52212 | 38262 | 216038 | 210106 | 26.72% | 2.75% |
| BFLOAT4_B | 2.25 | 19487 | 10797 | 158498 | 64361 | 44.59% | 59.39% |
| BFLOAT16 | 18 | 59985 | 36993 | 261937 | 211507 | 38.33% | 19.25% |
| BFLOAT8_B | 9 | 44164 | 24209 | 290629 | 138222 | 45.18% | 52.44% |
| FLOAT32 | 36 | | | 424803 | 414884 | | 2.33% |
| BFLOAT4_B | 4.5 | 37673 | 20191 | 318118 | 140002 | 46.40% | 55.99% |

These are without the changes to call the barrier less frequently.
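For reference, the percentage columns in the table are consistent with (baseline - new) / baseline, so a positive value means the new code is faster. A minimal sketch that recomputes a few entries from the DEVICE FW DURATION values follows; the helper name `improvement_pct` is made up for illustration and is not part of the test suite.

```python
def improvement_pct(baseline: float, new: float) -> float:
    """Percentage reduction in duration relative to the baseline.

    Positive means the new run is faster, negative means it regressed.
    """
    if baseline <= 0:
        raise ValueError("baseline duration must be positive")
    return (baseline - new) / baseline * 100.0


if __name__ == "__main__":
    # (label, baseline, new) copied from the table above.
    samples = [
        ("BFLOAT16 2.25 L1", 8180, 8679),
        ("BFLOAT16 4.5 L1", 15725, 9642),
        ("BFLOAT4_B 1.125 DRAM", 83263, 33626),
    ]
    for label, baseline, new in samples:
        print(f"{label}: {improvement_pct(baseline, new):.2f}%")
```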

Would it be better to restrict using the two processors somehow? E.g. to only L1 and not DRAM, or not DRAM for float32 (and equivalent).


bbradelTT · 2024-06-12T23:18:31Z

I rebased the code and only used the two nocs for L1.

It looks like there is now a notion of partial_op that overlaps with these changes. I'm not sure which unit tests cover it.

The best improvement after @pavlejosipovic's changes is only about 20% (details below).

@pavlejosipovic @tt-aho

Is it worthwhile to get these changes in even though the improvement is a lot lower than before?
Are there any specific tests I should run to make sure everything works as expected?

Test run for GS:


| Data type | L1 main | L1 branch | L1 change | DRAM main | DRAM branch | DRAM change |
|---|---|---|---|---|---|---|
| BFLOAT16 | 18326 | 18420 | 0.51% | 41028 | 40183 | -2.06% |
| BFLOAT8_B | 15691 | 16126 | 2.77% | 32601 | 34641 | 6.26% |
| FLOAT32 | 24155 | 24665 | 2.11% | 58953 | 59569 | 1.04% |
| BFLOAT4_B | 13839 | 14299 | 3.32% | 33804 | 35174 | 4.05% |
| BFLOAT16 | 23978 | 20468 | -14.64% | 66339 | 67150 | 1.22% |
| BFLOAT8_B | 19019 | 17146 | -9.85% | 61538 | 63631 | 3.40% |
| FLOAT32 | 36488 | 29299 | -19.70% | 103483 | 105122 | 1.58% |
| BFLOAT4_B | 15692 | 14496 | -7.62% | 60582 | 62639 | 3.40% |
| BFLOAT16 | 37085 | 30286 | -18.33% | 125507 | 122777 | -2.18% |
| BFLOAT8_B | 25195 | 21432 | -14.94% | 120560 | 119015 | -1.28% |
| FLOAT32 | 58938 | 49116 | -16.66% | 196532 | 197649 | 0.57% |
| BFLOAT4_B | 19395 | 17430 | -10.13% | 118359 | 119876 | 1.28% |
| BFLOAT16 | 59638 | 48687 | -18.36% | 235450 | 239869 | 1.88% |
| BFLOAT8_B | 40177 | 31359 | -21.95% | 230968 | 233934 | 1.28% |
| FLOAT32 | | | | 382449 | 385252 | 0.73% |
| BFLOAT4_B | 26760 | 22153 | -17.22% | 235854 | 235215 | -0.27% |
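In this GS run the percentage columns match (branch - main) / main, so a negative value means the branch is faster. The sketch below recomputes the change for a handful of L1 rows copied from the table and picks the largest reduction, which is where the "about 20%" figure comes from. The names `change_pct` and `largest_reduction` are illustrative only, and only a subset of rows is included.

```python
from typing import List, Tuple

Row = Tuple[str, int, int]  # (data type, main duration, branch duration)

# A subset of the L1 rows from the GS table above.
L1_ROWS: List[Row] = [
    ("FLOAT32", 36488, 29299),
    ("BFLOAT16", 37085, 30286),
    ("BFLOAT16", 59638, 48687),
    ("BFLOAT8_B", 40177, 31359),
    ("BFLOAT4_B", 26760, 22153),
]


def change_pct(main: float, branch: float) -> float:
    """Relative change of branch versus main; negative means faster."""
    if main <= 0:
        raise ValueError("main duration must be positive")
    return (branch - main) / main * 100.0


def largest_reduction(rows: List[Row]) -> Row:
    """Return the row whose duration dropped the most in relative terms."""
    return min(rows, key=lambda row: change_pct(row[1], row[2]))


if __name__ == "__main__":
    best = largest_reduction(L1_ROWS)
    print(best[0], f"{change_pct(best[1], best[2]):.2f}%")
```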

bbradelTT self-assigned this Apr 10, 2024

bbradelTT requested review from tt-aho and TT-BrianLiu April 10, 2024 19:45


tt-aho reviewed Apr 10, 2024


...dnn/op_library/sharded/kernels/dataflow/reader_unary_sharded_blocks_interleaved_start_id.cpp

bbradelTT linked an issue Apr 10, 2024 that may be closed by this pull request

Have both NCRISC and BRISC read tiled data for InterleavedToSharded #7309

Open

tt-aho approved these changes Apr 10, 2024


bbradelTT added 3 commits June 12, 2024 17:12

#7309: Use both reader and writer kernels to read tiled data for InterleavedToSharded

e2f6e82

#7309: Call cb_push_back as soon as possible for tiled i2s

5699d9d

#7309: Add missing constexpr for tiled i2s

514d68c

bbradelTT force-pushed the bbradel-7309 branch from bbe8235 to 514d68c Compare June 12, 2024 17:13

#7309: i2s dual noc for L1 only and fix rebase issues

f4abe94


github-actions bot closed this Dec 17, 2024

github-actions bot deleted the bbradel-7309 branch December 17, 2024 12:11
