
Optimized Llama 3.x perf with sharded residual #15142

Merged: 18 commits from llama3/sharded-residual into main on Nov 25, 2024
Conversation

@yieldthought (Contributor) commented on Nov 17, 2024

Ticket

14273

Problem description

The shared llama3 codebase used interleaved L1 memory for the residual tensor during decode. This performed worse than the earlier specialized per-model implementations.
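For context, a minimal sketch of the interleaved setup, assuming the ttnn Python API; the shapes and config here are illustrative, not the exact code from this PR:

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Illustrative decode-time residual: 32 users x 4096 hidden dim (Llama 3 8b).
residual_torch = torch.randn(1, 1, 32, 4096)

# Interleaved L1: the tensor's pages are striped round-robin across all L1
# banks, so every op consuming the residual gathers pages from many banks
# on every decode layer, paying extra data movement.
residual = ttnn.from_torch(
    residual_torch,
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    memory_config=ttnn.L1_MEMORY_CONFIG,  # interleaved layout, L1 buffers
)
```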

What's changed

  • Sharded the residual path for all combinations of models and devices (illustrated in the sketch after this list)
  • Resolved the L1 corruption issues without deallocate workarounds
  • Improved performance over all previous specialized models: 8b on N150 is now >24 t/s/u and 70b on T3K is >15 t/s/u
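As a rough sketch of the sharded alternative (the core grid and width-shard strategy below are assumptions for illustration, not the exact per-model/per-device configs this PR chooses):

```python
# Width-shard the residual across a core grid so each core keeps its slice
# of the hidden dimension resident in its own L1, next to the compute that
# consumes it, instead of gathering interleaved pages every layer.
sharded_config = ttnn.create_sharded_memory_config(
    shape=(1, 1, 32, 4096),
    core_grid=ttnn.CoreGrid(y=4, x=8),  # illustrative 32-core grid
    strategy=ttnn.ShardStrategy.WIDTH,  # split the 4096-wide hidden dim
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
)
residual = ttnn.to_memory_config(residual, sharded_config)
```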

Checklist

  • [x] [Post commit CI passes](https://github.com/tenstorrent/tt-metal/actions/runs/12007682870)
  • [x] [Single card demo tests](https://github.com/tenstorrent/tt-metal/actions/runs/11975122584)
  • [x] [Single card model perf tests](https://github.com/tenstorrent/tt-metal/actions/runs/11975114297)
  • [ ] [T3K demo + other tests](https://github.com/tenstorrent/tt-metal/actions/runs/11975098695) ([perf rerun](https://github.com/tenstorrent/tt-metal/actions/runs/12009863881))

@yieldthought changed the title from "Llama3/sharded residual" to "Optimized Llama 3.x perf with sharded residual" on Nov 17, 2024
@davorchap (Collaborator) commented:

Nice!

@yieldthought self-assigned this on Nov 20, 2024
@yieldthought force-pushed the llama3/sharded-residual branch from a8d0494 to 9ccefa9 on November 20, 2024 at 08:52
@yieldthought (Contributor, Author) commented:

models/demos/llama3/tests/multimodal/test_llama_cross_attention_transformer_text.py fails with PCC errors; it passes on main. Investigating.

@yieldthought force-pushed the llama3/sharded-residual branch from 64e575c to 4b048cf on November 20, 2024 at 11:48
@yieldthought force-pushed the llama3/sharded-residual branch from 12d480a to a3db43c on November 22, 2024 at 16:27
@yieldthought merged commit 576e612 into main on Nov 25, 2024 (128 of 130 checks passed) and deleted the llama3/sharded-residual branch on November 25, 2024 at 12:21.
@yieldthought (Contributor, Author) commented:

Note that the async dispatch revert in commit 30364957fe888854bda3ae43a00b1867f7c2291c (issue #15324) decreases performance by around 10%, so in practice we see ~22.8 t/s/u for 8b on N150, for example. Revert that commit to see the perf gains promised in the description.

spoojaryTT pushed a commit that referenced this pull request on Nov 25, 2024 (its commit message mirrors the PR description above).