Releases: tenstorrent/tt-metal

v0.52.0-rc1

28 Aug 02:29
Pre-release

Note

If you are installing from a release, refer to the README, INSTALLATION instructions, and any other documentation packaged with that release, not the versions on the main branch. The latest main may differ from the previous release.

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10588594252

📦 Uncategorized

  • #0: Remove run_operation from async_runtime.hpp
  • #11640: Include simulation device in tt_cluster
  • #11342: Replace tt_lib with ttnn function in experimental/functional
  • #11649: update tt_lib with ttnn support for non working folder
  • Perf dashboard and batching support for Mistral-7B and Llama3.1-8B
  • Adding fix for llama CI failure caused by ttnn.experimental.tensor.typecast
  • Fold sharded support
  • #9450: add env flag to skip recompiling and reloading FW
  • Move semaphores into kernel config ring buffer
  • #10874: Enable test cases for concurrent instances in CCL all gather
  • [Falcon7b] Remove hf reference files and import from transformers instead
  • #11768: Fix watcher pause feature
  • [Improvement] Added some graph names in the separate file
  • Migrate CB configs into kernel config ring buffer
  • #0: Feed more data to visualizer
  • #11490: ttnn and tt_metal shapes are mixed
  • Migrate sharded ops from TTL to TTNN
  • #8865: Port ttnn ops to dispatch profiling infra
  • #11700: update write_tensor with copy_host_to_device_tensor
  • TTNN sweep low pic unit tests
  • Add sweeps for ops: topk, frac, trunc, ceil to TTNN
  • LLK Test Coverage Follow-up
  • Llama3.1 70b Prefill - MLP and Attention
  • #10866: Read profiler buffer with EnqueueReadBuffer in fast dispatch mode
  • Lpremovic/0 expand llk ctest coverage
  • #11313: Migrate layernorm_distributed to ttnn
  • [Blackhole Bringup] Fixes for maxpool
  • #11850: Remove Llama3.1-8B output matching to avoid blocking CI
  • modify keys within device_info
  • #0: remove extra arch-wormhole labels for single-card workflows
  • #0: fix cloud-virtual-machine label
  • #11564: added test for generating sample data with many different use cases to the visualizer
  • #0: Remove llk_io.cc for WH and BH as well. GS was removed in 7b8e627
  • #9527: Moving bcast to operations/data_movement
  • #10332: Make ttnn::event_synchronize block only in the app thread
  • #11554: Replace tt_lib in sweeps, integration_tests
  • #11877: Make dispatch core order in the core descriptor match for E75 with 1 and 2 CQs
  • #11845: fix worker ring direction assignment in reduce scatter
  • FD Optimizations/Cleanup
  • #11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18
  • Revert "#11881: Add -Wno-vla-cxx-extension to CMake to fix build on clang18"
  • #10163: Add backward support for remainder op
  • Added ttnn.hypot_bw unit test
  • #0: Add another codeowner for conv2d
  • #11334: Remove unnecessary code for previous ci/cd csvs
  • #0: Bump timeout for single-card perf tests to see if that helps with timeouts
  • Removed "" graph_consts.hpp
  • [Falcon7b] Re-enable decode perplexity test with seq len 2048
  • [Falcon7b] Fix duplicate loading of rotary embeddings in prefill/decode
  • [Falcon7b] Re-enable demo perf-mode tests on galaxy, update targets, prevent multinomial errors (during perf-mode) using nan-to-num
  • [Blackhole Bringup] Add pack_untilize tests & fixes
  • #0: Consolidate demo tests for single card and t3000 to use impls rather than copy
  • Collection of small dprint/watcher changes
  • #11917: disable test
  • #11706: Use new Conv2D API in UNet Shallow
  • #11925: Update ttnn.arange binding
  • #0: Remove test include from packet_demux
  • #7709: Fix exp like ops ttnn doc issues
  • #11126: Resnet Demo with new conv API
  • Added ttnn.argmax sweeps, API calls and unit tests
  • #10515: For matmul corner case, if CBs don't fit, choose different program config
  • [Mixtral8x7B] Increase demo max context length to 32k
  • Added ttnn.topk unit test
  • #0: (MINOR) Update to v0.52.0
  • #11847: Add tt-smi reset command environment variable for sweeps
  • #11000: Enable uint8 A2D and (un)pack reconfig
  • #0: Do not use mount-cloud-weka label because we may no longer need it as cloud fixed it
  • #0: fixed External Operation logging
  • #0: Update matmul_multi_core_reuse to support mixed precision
  • #11138: Move large global vars in prefetcher and dispatcher to the stack
  • Enabling BH L1 data cache
  • #0: Move Unary device operation to tmp
  • Moved tracked methods out of tensor
  • #11964: Only write branch if the repo is not detached
  • #11622: add concat sweep
  • #0: Refactor Python dynamic modules creation
  • #0: Update resnet test infra to print total batch size for multi device
  • #11930: Increase status checks
  • Convs on BH

v0.51.0

27 Aug 15:10
e6f4c70

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10580177689

Demo models and their metrics

Grayskull (GS) Models

| Model | Batch | Target throughput |
|---|---|---|
| ResNet-50 (fps) | 20 | 10,000 |
| BERT-Large (sen/s) | 12 | 410 |
| Falcon7B-decode (t/s) | 32 | 140 |
| ViT (fps) | 8 | 2000 |
| T5 small (sen/s) | | |
| Bloom (sen/s) | | |
| U-Net | coming soon | |

[1] - Observed from the host. Includes dispatch overhead and kernel execution time. For LLMs, token-to-token decode throughput is reported.

[2] - Ignoring host overhead. Kernel execution time only. For LLMs, token-to-token decode throughput is reported.

Wormhole (WH) Models

Note

All model demos in this table function on both N150 and N300 Wormhole cards, unless otherwise stated.

All performance numbers in this table were measured on, or are based on, an N300 Wormhole card.

| Model | Gen. Token [3] | Batch | Time to first token [4] | Target throughput |
|---|---|---|---|---|
| Falcon7B | 129th | 32 | 0.08 s | 26 |
| Mistral-7B | 129th | 32 | coming soon | 25 |
| Mamba-2.8B | any | 32 | 0.04 s | 41 |
| LLaMA-3.1-8B | 129th | 8 | coming soon | 23 |
| BERT-Large (sen/s) [5] | - | 8 | - | 400 |
| Stable Diffusion 1.4 512x512 (sec/img) [6] | - | 1 | - | 3 |
| ResNet-50 (fps) | - | 16 | - | 7,000 |

[1] - Observed from the host. Includes dispatch overhead and kernel execution time. For LLMs, token-to-token decode throughput is reported.

[2] - Ignoring host overhead. Kernel execution time only. For LLMs, token-to-token decode throughput is reported.

[3] - Generating the i'th token in a sequence while the kv_cache is filled with i-1 rows.

[4] - Time to fill the kv_cache and generate the first output token (1st user).

[5] - This model demo does not work on N150. It does work on N300.

[6] - This model demo does not work on N300. It does work on N150.

TT-QuietBox & TT-LoudBox (2x4 mesh of WHs) Models

| Model | Technique | Gen. Token [3] | Batch | Target throughput |
|---|---|---|---|---|
| Falcon7B | Data Parallel | 129th | 256 | 26 t/s/u |
| LLaMA-2-70B | Tensor Parallel | 129th | 32 | 20 t/s/u |
| LLaMA-3.1-70B | Tensor Parallel | 129th | 32 | 20 t/s/u |
| Falcon40B | Tensor Parallel | 129th | 32 | 36 t/s/u |
| Mixtral7Bx8 | Tensor Parallel | 129th | 32 | 33 t/s/u |
| ResNet-50 (fps) | Data Parallel | - | 128 | 56,000 |

Single Galaxy (8x4 mesh of WHs) Models

| Model | Last verified release | Technique | Gen. Token [3] | Batch | Time to first token [4] | End-to-end throughput [1] | Device throughput [2] | Target throughput |
|---|---|---|---|---|---|---|---|---|
| Falcon7B | v0.51.0-rc30 | Data Parallel | 129th | 1024 | 0.30 s | 4.0 t/s/u - 4096 t/s | 17.7 t/s/u - 18125 t/s | 26 t/s/u |
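
The Galaxy row reports each throughput twice, once per user (t/s/u) and once in aggregate (t/s). The short sketch below shows how the two figures appear to relate. It assumes that "t/s/u" means tokens per second per user and that the aggregate figure is the per-user figure multiplied by the batch size. The release notes do not state this, and the helper name `aggregate_throughput` is hypothetical.

```python
def aggregate_throughput(per_user_tps: float, batch: int) -> float:
    """Aggregate tokens/s, assuming every user in the batch decodes at the same rate."""
    return per_user_tps * batch


# Figures taken from the Falcon7B Single Galaxy row (batch 1024).
batch = 1024
end_to_end = aggregate_throughput(4.0, batch)   # end-to-end, observed from the host
device = aggregate_throughput(17.7, batch)      # device only, ignoring host overhead

print(end_to_end)     # 4096.0
print(round(device))  # 18125
```

Under that assumption, both aggregate values in the table follow from the per-user values and the batch size of 1024.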

📦 Uncategorized

  • #10600: renamed execute_on_main_thread to operator()
  • #0: refactor ttnn device operation code and program cache
  • #11112: Add forward support for Relational Inplace Ops
  • Update Synchronize api to barrier for missing transactions
  • Move sliding_window to TTNN
  • Update CODEOWNERS of SD tests
  • #11283: Remove old Stable Diffusion implementation and its tests
  • #5383: [Falcon7b] Remove per-token printing in single-card ci demo tests
  • #11349: Add missing include in kernel_types.hpp
  • #10119: move fold op to ttnn infra
  • Bump up TRISC0 stack size
  • #11089: Fix ttnn.line_all_gather(..) to work with async
  • Add best practices for error messages
  • #0: updated mistral readme to reflect batching changes
  • #0: Target specific test file in model perf for ttnn resnet to avoid import conflicts
  • #11280: Enable sharded buffer l1 read/writes test on BH
  • Update CODEOWNERS
  • Add new items to best_practices.md
  • #9322: Remove lamb_optimizer op
  • #10550: Enable remote chip routing before profiler init
  • #11333: Resolve hang with Trace and R-Chip Event Synchronization
  • #10117: Migrate fast_reduce_nc op to ttnn
  • #11389: Add a cloud preset to allow easy connection to the tt-cloud elasticsearch instance
  • Update perf and latest features for llm models (Aug 12)
  • #0: Update watcher noc_sanitize to internally specify noc_id
  • Fix rounding in recip causing pcc issues in models
  • update build_metal.sh to trigger cmake test target
  • #9322: Remove unused bindings
  • Migrate Sharded Partial from TTL to TTNN
  • Move all NLP TMs into experimental/transformers, reorganize the folder, and delete the assorted ttlib bindings
  • #10360: Cut down on build time by targeting tests target directly
  • #11042: Overload complex fw ops
  • #0: remove decorate_as_composite
  • #11346: Replace tt_lib usage in eltwise backward
  • Add sweeps for complex bw_ops: polar, recip, add, mul
  • #0: Add initial t3000 nightly pipeline
  • #11038: Clean up more runner labels for single card
  • [CLEANUP] Remove old unused Mistral code inside models/experimental
  • #5424: GELU and GELU' API calls submodule LLKs
  • #10127: Move reduce op from tt_lib to ttnn part 1
  • #0: Recommend noc_async_write_flushed() on examples
  • #0: added llama3-tg nightly demo test
  • #0: re-add install step at end of build_metal.sh
  • #0: update fold call to new ttnn
  • #0: Fix watcher sanitization for NOC1
  • Implementing all_gather to datacopy signaling
  • #11322: Fix UNet functional and performance demo crash
  • #9992: Compute-engine add example DRAM NOC fix for WH n300
  • Ccl/revert datacopy
  • Fixed default arguments for repacking llama3
  • #11443: Updated Mistral7B reference
  • #11241: Replace tt_lib in models/demos/bert and falcon7b_common
  • #7494: Added unit tests to verify that values to semaphores and circular buffers are being correctly written out when core range sets are used
  • #10612: Unit tests for Galaxy cluster
  • #11469: Run ci/cd upload only on main if workflow_run
  • Fix elt_...

v0.51.0-rc37

27 Aug 02:13
Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10570109365

  • no changes

v0.51.0-rc36

26 Aug 02:13
Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10551974381

  • no changes

v0.51.0-rc35

24 Aug 02:17
14dabb4
Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10534616656

  • no changes

v0.51.0-rc34

23 Aug 02:12
Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10518337313

  • no changes

v0.51.0-rc33

22 Aug 02:13
e6f4c70
Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10500260327

  • no changes

v0.51.0-rc32

21 Aug 15:57
93fbac3
Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10492474513

  • no changes

v0.51.0-rc31

21 Aug 02:12
739e041
Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10481931978

  • no changes

v0.51.0-rc30

20 Aug 19:13
Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10477207849

  • no changes