Releases: tenstorrent/tt-metal
v0.52.0-rc1
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with that release, not to the versions on the main branch. There may be differences between the latest main and the previous release.
The changelog follows, showing the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10588594252
📦 Uncategorized
- #0: Remove run_operation from async_runtime.hpp
- PR: #11757
- #11640: Include simulation device in tt_cluster
- PR: #11766
- #11342: Replace tt_lib with ttnn function in experimental/functional
- PR: #11356
- #11649: update tt_lib with ttnn support for non working folder
- PR: #11654
- Perf dashboard and batching support for Mistral-7B and Llama3.1-8B
- PR: #11603
- Adding fix for llama CI failure caused by ttnn.experimental.tensor.typecast
- PR: #11765
- Fold sharded support
- PR: #11722
- #9450: add env flag to skip recompiling and reloading FW
- PR: #11681
- Move semaphores into kernel config ring buffer
- PR: #11764
- #10874: Enable test cases for concurrent instances in CCL all gather
- PR: #10885
- [Falcon7b] Remove hf reference files and import from transformers instead
- PR: #11758
- #11768: Fix watcher pause feature
- PR: #11780
- [Improvement] Added some graph names in the separate file
- PR: #11732
- Migrate CB configs into kernel config ring buffer
- PR: #11778
- #0: Feed more data to visualizer
- PR: #11400
- #11490: ttnn and tt_metal shapes are mixed
- PR: #11723
- Migrate sharded ops from TTL to TTNN
- PR: #11546
- #8865: Port ttnn ops to dispatch profiling infra
- PR: #11698
- #11700: update write_tensor with copy_host_to_device_tensor
- PR: #11701
- TTNN sweep low pic unit tests
- PR: #11775
- Add sweeps for ops: topk, frac, trunc, ceil to TTNN
- PR: #11771
- LLK Test Coverage Follow-up
- PR: #11715
- Llama3.1 70b Prefill - MLP and Attention
- PR: #11724
- #10866: Read profiler buffer with `EnqueueReadBuffer` in fast dispatch mode
  - PR: #11781
- Lpremovic/0 expand llk ctest coverage
- PR: #11653
- #11313: Migrate layernorm_distributed to ttnn
- PR: #11696
- [Blackhole Bringup] Fixes for maxpool
- PR: #11761
- #11850: Remove Llama3.1-8B output matching to avoid blocking CI
- PR: #11851
- modify keys within device_info
- PR: #11852
- #0: remove extra arch-wormhole labels for single-card workflows
- PR: #11785
- #0: fix cloud-virtual-machine label
- PR: #11863
- #11564: added test for generating sample data with many different use cases to the visualizer
- PR: #11862
- #0: Remove llk_io.cc for WH and BH as well. GS was removed in 7b8e627
- PR: #11864
- #9527: Moving bcast to operations/data_movement
- PR: #11599
- #10332: Make ttnn::event_synchronize block only in the app thread
- PR: #11543
- #11554: Replace tt_lib in sweeps, integration_tests
- PR: #11556
- #11877: Make dispatch core order in the core descriptor match for E75 with 1 and 2 CQs
- PR: #11878
- #11845: fix worker ring direction assignment in reduce scatter
- PR: #11846
- FD Optimizations/Cleanup
- PR: #11872
- #11881: Add `-Wno-vla-cxx-extension` to CMake to fix build on clang18
  - PR: #11882
- Revert "#11881: Add `-Wno-vla-cxx-extension` to CMake to fix build on clang18"
  - PR: #11887
- #10163: Add backward support for remainder op
- PR: #9712
- Added ttnn.hypot_bw unit test
- PR: #11843
- #0: Add another codeowner for conv2d
- PR: #11849
- #11334: Remove unnecessary code for previous ci/cd csvs
- PR: #11898
- #0: Bump timeout for single-card perf tests to see if that helps with timeouts
- PR: #11893
- Removed "" graph_consts.hpp
- PR: #11904
- [Falcon7b] Re-enable decode perplexity test with seq len 2048
- PR: #11868
- [Falcon7b] Fix duplicate loading of rotary embeddings in prefill/decode
- PR: #11871
- [Falcon7b] Re-enable demo perf-mode tests on galaxy, update targets, prevent multinomial errors (during perf-mode) using nan-to-num
- PR: #11876
- [Blackhole Bringup] Add pack_untilize tests & fixes
- PR: #11875
- #0: Consolidate demo tests for single card and t3000 to use impls rather than copy
- PR: #11897
- Collection of small dprint/watcher changes
- PR: #11906
- #11917: disable test
- PR: #11918
- #11706: Use new Conv2D API in UNet Shallow
- PR: #11902
- #11925 Update ttnn.arange binding
- PR: #11926
- #0: Remove test include from packet_demux
- PR: #11924
- #7709: Fix exp like ops ttnn doc issues
- PR: #7879
- #11126: Resnet Demo with new conv API
- PR: #11770
- Added ttnn.argmax sweeps, API calls and unit tests
- PR: #11552
- #10515: For matmul corner case, if CBs don't fit, choose different program config
- PR: #11892
- [Mixtral8x7B] Increase demo max context length to 32k
- PR: #11777
- Added ttnn.topk unit test
- PR: #11935
- #0: (MINOR) Update to v0.52.0
- PR: #11946
- #11847: Add tt-smi reset command environment variable for sweeps
- PR: #11901
- #11000: Enable uint8 A2D and (un)pack reconfig
- PR: #11537
- #0: Do not use mount-cloud-weka label because we may no longer need it as cloud fixed it
- PR: #11941
- #0: fixed External Operation logging
- PR: #11958
- #0: Update matmul_multi_core_reuse to support mixed precision
- PR: #11947
- #11138: Move large global vars in prefetcher and dispatcher to the stack
- PR: #11922
- Enabling BH L1 data cache
- PR: #11909
- #0: Move Unary device operation to tmp
- PR: #11793
- Moved tracked methods out of tensor
- PR: #11921
- #11964: Only write branch if the repo is not detached
- PR: #11965
- #11622: add concat sweep
- PR: #11733
- #0: Refactor Python dynamic modules creation
- PR: #11798
- #0: Update resnet test infra to print total batch size for multi device
- PR: #11966
- #11930: Increase status checks
- PR: #11945
- Convs on BH
- PR: #11977
v0.51.0
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10580177689
Demo models and their metrics
Grayskull (GS) Models
| Model | Batch | Target throughput |
|---|---|---|
| ResNet-50 (fps) | 20 | 10,000 |
| BERT-Large (sen/s) | 12 | 410 |
| Falcon7B-decode (t/s) | 32 | 140 |
| ViT (fps) | 8 | 2000 |
| T5 small (sen/s) | | |
| Bloom (sen/s) | | |
| U-Net | coming soon | |
[1] - Observed from the host. Includes dispatch overhead and kernel execution time. For LLMs, token-to-token decode throughput is reported.
[2] - Ignoring host overhead. Kernel execution time only. For LLMs, token-to-token decode throughput is reported.
Wormhole (WH) Models
Note
All model demos in this table function on both N150 and N300 Wormhole cards, unless otherwise stated.
Furthermore, all performance numbers here were measured on, or are based on, an N300 Wormhole card.
| Model | Gen. Token [3] | Batch | Time to first token [4] | Target throughput |
|---|---|---|---|---|
| Falcon7B | 129th | 32 | 0.08 s | 26 |
| Mistral-7B | 129th | 32 | coming soon | 25 |
| Mamba-2.8B | any | 32 | 0.04 s | 41 |
| LLaMA-3.1-8B | 129th | 8 | coming soon | 23 |
| BERT-Large (sen/s) [5] | - | 8 | - | 400 |
| Stable Diffusion 1.4 512x512 (sec/img) [6] | - | 1 | - | 3 |
| ResNet-50 (fps) | - | 16 | - | 7,000 |
[1] - Observed from the host. Includes dispatch overhead and kernel execution time. For LLMs, token-to-token decode throughput is reported.
[2] - Ignoring host overhead. Kernel execution time only. For LLMs, token-to-token decode throughput is reported.
[3] - Generating the i'th token in a sequence while the kv_cache is filled with i-1 rows.
[4] - Time to fill the kv_cache and generate the first output token (1st user).
[5] - This model demo does not work on N150. It does work on N300.
[6] - This model demo does not work on N300. It does work on N150.
TT-QuietBox & TT-LoudBox (2x4 mesh of WHs) Models
| Model | Technique | Gen. Token [3] | Batch | Target throughput |
|---|---|---|---|---|
| Falcon7B | Data Parallel | 129th | 256 | 26 t/s/u |
| LLaMA-2-70B | Tensor Parallel | 129th | 32 | 20 t/s/u |
| LLaMA-3.1-70B | Tensor Parallel | 129th | 32 | 20 t/s/u |
| Falcon40B | Tensor Parallel | 129th | 32 | 36 t/s/u |
| Mixtral7Bx8 | Tensor Parallel | 129th | 32 | 33 t/s/u |
| ResNet-50 (fps) | Data Parallel | - | 128 | 56,000 |
Single Galaxy (8x4 mesh of WHs) Models
| Model | Last verified release | Technique | Gen. Token [3] | Batch | Time to first token [4] | End-to-end throughput [1] | Device throughput [2] | Target throughput |
|---|---|---|---|---|---|---|---|---|
| Falcon7B | v0.51.0-rc30 | Data Parallel | 129th | 1024 | 0.30 s | 4.0 t/s/u - 4096 t/s | 17.7 t/s/u - 18125 t/s | 26 t/s/u |
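The per-user and aggregate throughput figures in the tables above are related by the batch size: t/s equals t/s/u multiplied by the number of concurrent users. A minimal sketch of that relationship, checked against the Falcon7B Galaxy row (the function name is illustrative, not part of the tt-metal codebase):

```python
def aggregate_throughput(tokens_per_sec_per_user: float, batch_size: int) -> float:
    """Aggregate tokens/sec across all users in a batch.

    Each decode step produces one token per user, so the system-wide
    rate (t/s) is the per-user rate (t/s/u) times the batch size.
    E.g. the Falcon7B Galaxy row: 4.0 t/s/u * 1024 users = 4096 t/s.
    """
    return tokens_per_sec_per_user * batch_size
```

Applied to the device-throughput column, 17.7 t/s/u * 1024 users gives 18124.8 t/s, which the table reports rounded as 18125 t/s.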
📦 Uncategorized
- #10600: renamed execute_on_main_thread to operator()
- PR: #10601
- #0: refactor ttnn device operation code and program cache
- PR: #11223
- #11112: Add forward support for Relational Inplace Ops
- PR: #8790
- Update Synchronize api to barrier for missing transactions
- PR: #11318
- Move sliding_window to TTNN
- PR: #10346
- Update CODEOWNERS of SD tests
- PR: #11360
- #11283: Remove old Stable Diffusion implementation and its tests
- PR: #11284
- #5383: [Falcon7b] Remove per-token printing in single-card ci demo tests
- PR: #11309
- #11349: Add missing include in `kernel_types.hpp`
  - PR: #11352
- #10119: move fold op to ttnn infra
- PR: #11273
- Bump up TRISC0 stack size
- PR: #11317
- #11089: Fix ttnn.line_all_gather(..) to work with async
- PR: #11331
- Add best practices for error messages
- PR: #11337
- #0: updated mistral readme to reflect batching changes
- PR: #11371
- #0: Target specific test file in model perf for ttnn resnet to avoid import conflicts
- PR: #11361
- #11280: Enable sharded buffer l1 read/writes test on BH
- PR: #11281
- Update CODEOWNERS
- PR: #11388
- Add new items to best_practices.md
- PR: #11385
- #9322: Remove lamb_optimizer op
- PR: #11383
- #10550: Enable remote chip routing before profiler init
- PR: #10554
- #11333: Resolve hang with Trace and R-Chip Event Synchronization
- PR: #11334
- #10117: Migrate fast_reduce_nc op to ttnn
- PR: #11311
- #11389: Add a cloud preset to allow easy connection to the tt-cloud elasticsearch instance
- PR: #11390
- Update perf and latest features for llm models (Aug 12)
- PR: #11373
- #0: Update watcher noc_sanitize to internally specify noc_id
- PR: #11394
- Fix rounding in recip causing pcc issues in models
- PR: #11319
- update build_metal.sh to trigger cmake `test` target
  - PR: #11409
- #9322: Remove unused bindings
- PR: #11404
- Migrate Sharded Partial from TTL to TTNN
- PR: #11285
- Move all NLP TMs into experimental/transformers, reorganize the folder, and delete the assorted ttlib bindings
- PR: #11324
- #10360: Cut down on build time by targeting tests target directly
- PR: #11419
- #11042: Overload complex fw ops
- PR: #11047
- #0: remove decorate_as_composite
- PR: #11345
- #11346: Replace tt_lib usage in eltwise backward
- PR: #11347
- Add sweeps for complex bw_ops: polar, recip, add, mul
- PR: #11364
- #0: Add initial t3000 nightly pipeline
- PR: #11372
- #11038: Clean up more runner labels for single card
- PR: #11376
- [CLEANUP] Remove old unused Mistral code inside models/experimental
- PR: #11444
- #5424: GELU and GELU' API calls submodule LLKs
- PR: #11193
- #5424: GELU and GELU' API calls submodule LLKs
- PR: #11150
- #5424: GELU and GELU' API calls submodule LLKs
- PR: #11154
- #10127: Move reduce op from tt_lib to ttnn part 1
- PR: #11299
- #0: Recommend noc_async_write_flushed() on examples
- PR: #11448
- #0: added llama3-tg nightly demo test
- PR: #11399
- #0: re-add install step at end of `build_metal.sh`
  - PR: #11452
- #0: update fold call to new ttnn
- PR: #11455
- #0: Fix watcher sanitization for NOC1
- PR: #11456
- Implementing all_gather to datacopy signaling
- PR: #11231
- #11322: Fix UNet functional and performance demo crash
- PR: #11405
- #9992: Compute-engine add example DRAM NOC fix for WH n300
- PR: #11393
- Ccl/revert datacopy
- PR: #11466
- Fixed default arguments for repacking llama3
- PR: #11326
- #11443: Updated Mistral7B reference
- PR: #11458
- #11241: Replace tt_lib in models/demos/bert and falcon7b_common
- PR: #11249
- #7494: Added unit tests to verify that values to semaphores and circular buffers are being correctly written out when core range sets are used
- PR: #10629
- #10612: Unit tests for Galaxy cluster
- PR: #10705
- #11469: Run ci/cd upload only on main if workflow_run
- PR: #11475
- Fix elt_...
v0.51.0-rc37
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10570109365
- no changes
v0.51.0-rc36
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10551974381
- no changes
v0.51.0-rc35
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10534616656
- no changes
v0.51.0-rc34
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10518337313
- no changes
v0.51.0-rc33
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10500260327
- no changes
v0.51.0-rc32
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10492474513
- no changes
v0.51.0-rc31
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10481931978
- no changes
v0.51.0-rc30
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10477207849
- no changes