Releases: tenstorrent/tt-metal
v0.52.0-rc1
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with that release, not to the versions on the main branch. There may be differences between the latest main and the previous release.
The changelog follows, showing the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10588594252
📦 Uncategorized
- #0: Remove run_operation from async_runtime.hpp
- PR: #11757
- #11640: Include simulation device in tt_cluster
- PR: #11766
- #11342: Replace tt_lib with ttnn function in experimental/functional
- PR: #11356
- #11649: update tt_lib with ttnn support for non working folder
- PR: #11654
- Perf dashboard and batching support for Mistral-7B and Llama3.1-8B
- PR: #11603
- Adding fix for llama CI failure caused by ttnn.experimental.tensor.typecast
- PR: #11765
- Fold sharded support
- PR: #11722
- #9450: add env flag to skip recompiling and reloading FW
- PR: #11681
- Move semaphores into kernel config ring buffer
- PR: #11764
- #10874: Enable test cases for concurrent instances in CCL all gather
- PR: #10885
- [Falcon7b] Remove hf reference files and import from transformers instead
- PR: #11758
- #11768: Fix watcher pause feature
- PR: #11780
- [Improvement] Added some graph names in the separate file
- PR: #11732
- Migrate CB configs into kernel config ring buffer
- PR: #11778
- #0: Feed more data to visualizer
- PR: #11400
- #11490: ttnn and tt_metal shapes are mixed
- PR: #11723
- Migrate sharded ops from TTL to TTNN
- PR: #11546
- #8865: Port ttnn ops to dispatch profiling infra
- PR: #11698
- #11700: update write_tensor with copy_host_to_device_tensor
- PR: #11701
- TTNN sweep low pic unit tests
- PR: #11775
- Add sweeps for ops: topk, frac, trunc, ceil to TTNN
- PR: #11771
- LLK Test Coverage Follow-up
- PR: #11715
- Llama3.1 70b Prefill - MLP and Attention
- PR: #11724
- #10866: Read profiler buffer with `EnqueueReadBuffer` in fast dispatch mode
  - PR: #11781
- Lpremovic/0 expand llk ctest coverage
- PR: #11653
- #11313: Migrate layernorm_distributed to ttnn
- PR: #11696
- [Blackhole Bringup] Fixes for maxpool
- PR: #11761
- #11850: Remove Llama3.1-8B output matching to avoid blocking CI
- PR: #11851
- modify keys within device_info
- PR: #11852
- #0: remove extra arch-wormhole labels for single-card workflows
- PR: #11785
- #0: fix cloud-virtual-machine label
- PR: #11863
- #11564: added test for generating sample data with many different use cases to the visualizer
- PR: #11862
- #0: Remove llk_io.cc for WH and BH as well. GS was removed in 7b8e627
- PR: #11864
- #9527: Moving bcast to operations/data_movement
- PR: #11599
- #10332: Make ttnn::event_synchronize block only in the app thread
- PR: #11543
- #11554: Replace tt_lib in sweeps, integration_tests
- PR: #11556
- #11877: Make dispatch core order in the core descriptor match for E75 with 1 and 2 CQs
- PR: #11878
- #11845: fix worker ring direction assignment in reduce scatter
- PR: #11846
- FD Optimizations/Cleanup
- PR: #11872
- #11881: Add `-Wno-vla-cxx-extension` to CMake to fix build on clang18
  - PR: #11882
- Revert "#11881: Add `-Wno-vla-cxx-extension` to CMake to fix build on clang18"
  - PR: #11887
- #10163: Add backward support for remainder op
- PR: #9712
- Added ttnn.hypot_bw unit test
- PR: #11843
- #0: Add another codeowner for conv2d
- PR: #11849
- #11334: Remove unnecessary code for previous ci/cd csvs
- PR: #11898
- #0: Bump timeout for single-card perf tests to see if that helps with timeouts
- PR: #11893
- Removed "" graph_consts.hpp
- PR: #11904
- [Falcon7b] Re-enable decode perplexity test with seq len 2048
- PR: #11868
- [Falcon7b] Fix duplicate loading of rotary embeddings in prefill/decode
- PR: #11871
- [Falcon7b] Re-enable demo perf-mode tests on galaxy, update targets, prevent multinomial errors (during perf-mode) using nan-to-num
- PR: #11876
- [Blackhole Bringup] Add pack_untilize tests & fixes
- PR: #11875
- #0: Consolidate demo tests for single card and t3000 to use impls rather than copy
- PR: #11897
- Collection of small dprint/watcher changes
- PR: #11906
- #11917: disable test
- PR: #11918
- #11706: Use new Conv2D API in UNet Shallow
- PR: #11902
- #11925 Update ttnn.arange binding
- PR: #11926
- #0: Remove test include from packet_demux
- PR: #11924
- #7709: Fix exp like ops ttnn doc issues
- PR: #7879
- #11126: Resnet Demo with new conv API
- PR: #11770
- Added ttnn.argmax sweeps, API calls and unit tests
- PR: #11552
- #10515: For matmul corner case, if CBs don't fit, choose different program config
- PR: #11892
- [Mixtral8x7B] Increase demo max context length to 32k
- PR: #11777
- Added ttnn.topk unit test
- PR: #11935
- #0: (MINOR) Update to v0.52.0
- PR: #11946
- #11847: Add tt-smi reset command environment variable for sweeps
- PR: #11901
- #11000: Enable uint8 A2D and (un)pack reconfig
- PR: #11537
- #0: Do not use mount-cloud-weka label because we may no longer need it as cloud fixed it
- PR: #11941
- #0: fixed External Operation logging
- PR: #11958
- #0: Update matmul_multi_core_reuse to support mixed precision
- PR: #11947
- #11138: Move large global vars in prefetcher and dispatcher to the stack
- PR: #11922
- Enabling BH L1 data cache
- PR: #11909
- #0: Move Unary device operation to tmp
- PR: #11793
- Moved tracked methods out of tensor
- PR: #11921
- #11964: Only write branch if the repo is not detached
- PR: #11965
- #11622: add concat sweep
- PR: #11733
- #0: Refactor Python dynamic modules creation
- PR: #11798
- #0: Update resnet test infra to print total batch size for multi device
- PR: #11966
- #11930: Increase status checks
- PR: #11945
- Convs on BH
- PR: #11977
v0.51.0
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10580177689
Demo models and their metrics
Grayskull (GS) Models
| Model | Batch | Target throughput |
|---|---|---|
| ResNet-50 (fps) | 20 | 10,000 |
| BERT-Large (sen/s) | 12 | 410 |
| Falcon7B-decode (t/s) | 32 | 140 |
| ViT (fps) | 8 | 2000 |
| T5 small (sen/s) | | |
| Bloom (sen/s) | | |
| U-Net | coming soon | |
[1] - Observed from the host. Includes dispatch overhead and kernel execution time. For LLMs, token-to-token decode throughput is reported.
[2] - Ignoring host overhead. Kernel execution time only. For LLMs, token-to-token decode throughput is reported.
Wormhole (WH) Models
Note
All model demos in this table function on both N150 and N300 Wormhole cards, unless otherwise stated.
Furthermore, all performance numbers here were measured on, or are based on, an N300 Wormhole card.
| Model | Gen. Token [3] | Batch | Time to first token [4] | Target throughput |
|---|---|---|---|---|
| Falcon7B | 129th | 32 | 0.08 s | 26 |
| Mistral-7B | 129th | 32 | coming soon | 25 |
| Mamba-2.8B | any | 32 | 0.04 s | 41 |
| LLaMA-3.1-8B | 129th | 8 | coming soon | 23 |
| BERT-Large (sen/s) [5] | - | 8 | - | 400 |
| Stable Diffusion 1.4 512x512 (sec/img) [6] | - | 1 | - | 3 |
| ResNet-50 (fps) | - | 16 | - | 7,000 |
[1] - Observed from the host. Includes dispatch overhead and kernel execution time. For LLMs, token-to-token decode throughput is reported.
[2] - Ignoring host overhead. Kernel execution time only. For LLMs, token-to-token decode throughput is reported.
[3] - Generating the i'th token in a sequence while the kv_cache is filled with i-1 rows.
[4] - Time to fill the kv_cache and generate the first output token (1st user).
[5] - This model demo does not work on N150. It does work on N300.
[6] - This model demo does not work on N300. It does work on N150.
TT-QuietBox & TT-LoudBox (2x4 mesh of WHs) Models
| Model | Technique | Gen. Token [3] | Batch | Target throughput |
|---|---|---|---|---|
| Falcon7B | Data Parallel | 129th | 256 | 26 t/s/u |
| LLaMA-2-70B | Tensor Parallel | 129th | 32 | 20 t/s/u |
| LLaMA-3.1-70B | Tensor Parallel | 129th | 32 | 20 t/s/u |
| Falcon40B | Tensor Parallel | 129th | 32 | 36 t/s/u |
| Mixtral7Bx8 | Tensor Parallel | 129th | 32 | 33 t/s/u |
| ResNet-50 (fps) | Data Parallel | - | 128 | 56,000 |
Single Galaxy (8x4 mesh of WHs) Models
| Model | Last verified release | Technique | Gen. Token [3] | Batch | Time to first token [4] | End-to-end throughput [1] | Device throughput [2] | Target throughput |
|---|---|---|---|---|---|---|---|---|
| Falcon7B | v0.51.0-rc30 | Data Parallel | 129th | 1024 | 0.30 s | 4.0 t/s/u - 4096 t/s | 17.7 t/s/u - 18125 t/s | 26 t/s/u |
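The per-user and aggregate throughput figures in the tables above are related by the batch size: t/s equals t/s/u multiplied by the number of concurrent users. A minimal sketch of that relationship, checked against the Falcon7B Galaxy row (the function name is illustrative, not part of the tt-metal codebase):

```python
def aggregate_throughput(tokens_per_sec_per_user: float, batch_size: int) -> float:
    """Aggregate tokens/sec across all users in a batch.

    Each decode step produces one token per user, so the system-wide
    rate (t/s) is the per-user rate (t/s/u) times the batch size.
    E.g. the Falcon7B Galaxy row: 4.0 t/s/u * 1024 users = 4096 t/s.
    """
    return tokens_per_sec_per_user * batch_size
```

Applied to the device-throughput column, 17.7 t/s/u * 1024 users gives 18124.8 t/s, which the table reports rounded as 18125 t/s.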
📦 Uncategorized
- #10600: renamed execute_on_main_thread to operator()
- PR: #10601
- #0: refactor ttnn device operation code and program cache
- PR: #11223
- #11112: Add forward support for Relational Inplace Ops
- PR: #8790
- Update Synchronize api to barrier for missing transactions
- PR: #11318
- Move sliding_window to TTNN
- PR: #10346
- Update CODEOWNERS of SD tests
- PR: #11360
- #11283: Remove old Stable Diffusion implementation and its tests
- PR: #11284
- #5383: [Falcon7b] Remove per-token printing in single-card ci demo tests
- PR: #11309
- #11349: Add missing include in `kernel_types.hpp`
  - PR: #11352
- #10119: move fold op to ttnn infra
- PR: #11273
- Bump up TRISC0 stack size
- PR: #11317
- #11089: Fix ttnn.line_all_gather(..) to work with async
- PR: #11331
- Add best practices for error messages
- PR: #11337
- #0: updated mistral readme to reflect batching changes
- PR: #11371
- #0: Target specific test file in model perf for ttnn resnet to avoid import conflicts
- PR: #11361
- #11280: Enable sharded buffer l1 read/writes test on BH
- PR: #11281
- Update CODEOWNERS
- PR: #11388
- Add new items to best_practices.md
- PR: #11385
- #9322: Remove lamb_optimizer op
- PR: #11383
- #10550: Enable remote chip routing before profiler init
- PR: #10554
- #11333: Resolve hang with Trace and R-Chip Event Synchronization
- PR: #11334
- #10117: Migrate fast_reduce_nc op to ttnn
- PR: #11311
- #11389: Add a cloud preset to allow easy connection to the tt-cloud elasticsearch instance
- PR: #11390
- Update perf and latest features for llm models (Aug 12)
- PR: #11373
- #0: Update watcher noc_sanitize to internally specify noc_id
- PR: #11394
- Fix rounding in recip causing pcc issues in models
- PR: #11319
- update build_metal.sh to trigger cmake `test` target
  - PR: #11409
- #9322: Remove unused bindings
- PR: #11404
- Migrate Sharded Partial from TTL to TTNN
- PR: #11285
- Move all NLP TMs into experimental/transformers, reorganize the folder, and delete the assorted ttlib bindings
- PR: #11324
- #10360: Cut down on build time by targeting tests target directly
- PR: #11419
- #11042: Overload complex fw ops
- PR: #11047
- #0: remove decorate_as_composite
- PR: #11345
- #11346: Replace tt_lib usage in eltwise backward
- PR: #11347
- Add sweeps for complex bw_ops: polar, recip, add, mul
- PR: #11364
- #0: Add initial t3000 nightly pipeline
- PR: #11372
- #11038: Clean up more runner labels for single card
- PR: #11376
- [CLEANUP] Remove old unused Mistral code inside models/experimental
- PR: #11444
- #5424: GELU and GELU' API calls submodule LLKs
- PR: #11193
- #5424: GELU and GELU' API calls submodule LLKs
- PR: #11150
- #5424: GELU and GELU' API calls submodule LLKs
- PR: #11154
- #10127: Move reduce op from tt_lib to ttnn part 1
- PR: #11299
- #0: Recommend noc_async_write_flushed() on examples
- PR: #11448
- #0: added llama3-tg nightly demo test
- PR: #11399
- #0: re-add install step at end of `build_metal.sh`
  - PR: #11452
- #0: update fold call to new ttnn
- PR: #11455
- #0: Fix watcher sanitization for NOC1
- PR: #11456
- Implementing all_gather to datacopy signaling
- PR: #11231
- #11322: Fix UNet functional and performance demo crash
- PR: #11405
- #9992: Compute-engine add example DRAM NOC fix for WH n300
- PR: #11393
- Ccl/revert datacopy
- PR: #11466
- Fixed default arguments for repacking llama3
- PR: #11326
- #11443: Updated Mistral7B reference
- PR: #11458
- #11241: Replace tt_lib in models/demos/bert and falcon7b_common
- PR: #11249
- #7494: Added unit tests to verify that values to semaphores and circular buffers are being correctly written out when core range sets are used
- PR: #10629
- #10612: Unit tests for Galaxy cluster
- PR: #10705
- #11469: Run ci/cd upload only on main if workflow_run
- PR: #11475
- Fix elt_...
v0.51.0-rc37
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10570109365
- no changes
v0.51.0-rc36
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10551974381
- no changes
v0.51.0-rc35
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10534616656
- no changes
v0.51.0-rc34
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10518337313
- no changes
v0.51.0-rc33
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10500260327
- no changes
v0.51.0-rc32
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10492474513
- no changes
v0.51.0-rc31
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10481931978
- no changes
v0.51.0-rc30
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10477207849
- no changes