
Optimized Llama 3.x perf with sharded residual #15142

Merged: 18 commits from llama3/sharded-residual into main on Nov 25, 2024
Conversation

@yieldthought (Contributor) commented on Nov 17, 2024

Ticket

14273

Problem description

The shared llama3 codebase used interleaved L1 memory for the residual tensor during decode. This performed worse than the earlier specialized per-model implementations.
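For context, a minimal sketch of the interleaved setup, assuming the ttnn Python API; the shapes and config here are illustrative, not the exact code from this PR:

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Illustrative decode-time residual: 32 users x 4096 hidden dim (Llama 3 8b).
residual_torch = torch.randn(1, 1, 32, 4096)

# Interleaved L1: the tensor's pages are striped round-robin across all L1
# banks, so every op consuming the residual gathers pages from many banks
# on every decode layer, paying extra data movement.
residual = ttnn.from_torch(
    residual_torch,
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    memory_config=ttnn.L1_MEMORY_CONFIG,  # interleaved layout, L1 buffers
)
```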

What's changed

  • Sharded the residual path for all combinations of models and devices (illustrated in the sketch after this list)
  • Resolved the L1 corruption issues without deallocate workarounds
  • Improved performance over all previous specialized models: 8b on N150 is now >24 t/s/u and 70b on T3K is >15 t/s/u
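As a rough sketch of the sharded alternative (the core grid and width-shard strategy below are assumptions for illustration, not the exact per-model/per-device configs this PR chooses):

```python
# Width-shard the residual across a core grid so each core keeps its slice
# of the hidden dimension resident in its own L1, next to the compute that
# consumes it, instead of gathering interleaved pages every layer.
sharded_config = ttnn.create_sharded_memory_config(
    shape=(1, 1, 32, 4096),
    core_grid=ttnn.CoreGrid(y=4, x=8),  # illustrative 32-core grid
    strategy=ttnn.ShardStrategy.WIDTH,  # split the 4096-wide hidden dim
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
)
residual = ttnn.to_memory_config(residual, sharded_config)
```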

Checklist

  • [x] [Post commit CI passes](https://github.com/tenstorrent/tt-metal/actions/runs/12007682870)
  • [x] [Single card demo tests](https://github.com/tenstorrent/tt-metal/actions/runs/11975122584)
  • [x] [Single card model perf tests](https://github.com/tenstorrent/tt-metal/actions/runs/11975114297)
  • [ ] [T3K demo + other tests](https://github.com/tenstorrent/tt-metal/actions/runs/11975098695) ([perf rerun](https://github.com/tenstorrent/tt-metal/actions/runs/12009863881))

@yieldthought changed the title from "Llama3/sharded residual" to "Optimized Llama 3.x perf with sharded residual" on Nov 17, 2024
@davorchap (Collaborator) commented:

Nice!

@yieldthought self-assigned this on Nov 20, 2024
@yieldthought force-pushed the llama3/sharded-residual branch from a8d0494 to 9ccefa9 on November 20, 2024 at 08:52
@yieldthought (Contributor, Author) commented:

models/demos/llama3/tests/multimodal/test_llama_cross_attention_transformer_text.py fails with PCC errors; it passes on main. Investigating.

@yieldthought force-pushed the llama3/sharded-residual branch from 64e575c to 4b048cf on November 20, 2024 at 11:48
@yieldthought force-pushed the llama3/sharded-residual branch from 12d480a to a3db43c on November 22, 2024 at 16:27
@yieldthought merged commit 576e612 into main on Nov 25, 2024 (128 of 130 checks passed) and deleted the llama3/sharded-residual branch on November 25, 2024 at 12:21.
@yieldthought (Contributor, Author) commented:

Note that the async dispatch revert in commit 30364957fe888854bda3ae43a00b1867f7c2291c (issue #15324) decreases performance by around 10%, so in practice we see ~22.8 t/s/u for 8b on N150, for example. Revert that commit to see the perf gains promised in the description.

spoojaryTT pushed a commit that referenced this pull request on Nov 25, 2024 (its commit message mirrors the PR description above).