
v0.51.0

github-actions released this 27 Aug 15:10
· 2512 commits to main since this release
e6f4c70

Note

If you are installing from a release, please refer to the README, installation instructions, and any other documentation packaged with that release, not to the versions on the main branch. The latest main may differ from the previous release.

The changelog below lists the changes since the last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10580177689

Demo models and their metrics

Grayskull (GS) Models

| Model | Batch | Target throughput |
|-------|-------|-------------------|
| ResNet-50 (fps) | 20 | 10,000 |
| BERT-Large (sen/s) | 12 | 410 |
| Falcon7B-decode (t/s) | 32 | 140 |
| ViT (fps) | 8 | 2,000 |
| T5 small (sen/s) | | |
| Bloom (sen/s) | | |
| U-Net | coming soon | |

[1] - Observed from the host. Includes dispatch overhead and kernel execution time. For LLMs, token-to-token decode throughput is reported.

[2] - Ignoring host overhead. Kernel execution time only. For LLMs, token-to-token decode throughput is reported.

Wormhole (WH) Models

Note

All model demos in this table function on both N150 and N300 Wormhole cards, unless otherwise stated.

Furthermore, all performance numbers here were collected on, or are based on, an N300 Wormhole card.

| Model | Gen. Token [3] | Batch | Time to first token [4] | Target throughput |
|-------|----------------|-------|-------------------------|-------------------|
| Falcon7B | 129th | 32 | 0.08 s | 26 |
| Mistral-7B | 129th | 32 | coming soon | 25 |
| Mamba-2.8B | any | 32 | 0.04 s | 41 |
| LLaMA-3.1-8B | 129th | 8 | coming soon | 23 |
| BERT-Large (sen/s) [5] | - | 8 | - | 400 |
| Stable Diffusion 1.4 512x512 (sec/img) [6] | - | 1 | - | 3 |
| ResNet-50 (fps) | - | 16 | - | 7,000 |

[1] - Observed from the host. Includes dispatch overhead and kernel execution time. For LLMs, token-to-token decode throughput is reported.

[2] - Ignoring host overhead. Kernel execution time only. For LLMs, token-to-token decode throughput is reported.

[3] - Generating the i'th token in a sequence while the kv_cache is filled with i-1 rows.

[4] - Time to fill the kv_cache and generate the first output token (1st user).

[5] - This model demo does not work on N150. It does work on N300.

[6] - This model demo does not work on N300. It does work on N150.
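The decode metrics defined in footnotes [1]-[4] can be made concrete with a small sketch. This is a hypothetical illustration, not tt-metal API code: `decode_step` is a stand-in for a model's forward pass, and the timing shows where time-to-first-token [4] and steady-state decode throughput [1]/[3] are measured.

```python
import time

def run_demo(prompt_tokens, n_new_tokens, decode_step):
    """Illustrate the LLM demo metrics (hypothetical model stub).

    - time to first token [4]: prefill (filling the kv_cache with the
      prompt) plus generation of the first output token
    - Gen. Token [3]: the i'th token is generated while the kv_cache
      already holds i-1 rows, i.e. steady-state decode
    """
    start = time.perf_counter()
    kv_cache = list(prompt_tokens)       # prefill: kv_cache filled from the prompt
    token = decode_step(kv_cache)        # first output token
    ttft = time.perf_counter() - start   # time to first token [4]

    decode_times = []
    for _ in range(n_new_tokens - 1):
        kv_cache.append(token)
        t0 = time.perf_counter()
        token = decode_step(kv_cache)    # i'th token, i-1 rows cached [3]
        decode_times.append(time.perf_counter() - t0)

    # token-to-token decode throughput (tokens/s), as reported in [1] and [2]
    decode_tps = len(decode_times) / sum(decode_times) if decode_times else 0.0
    return ttft, decode_tps
```

Measuring `ttft` from the host gives an end-to-end number in the sense of footnote [1]; footnote [2] would instead time only kernel execution on the device.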

TT-QuietBox & TT-LoudBox (2x4 mesh of WHs) Models

| Model | Technique | Gen. Token [3] | Batch | Target throughput |
|-------|-----------|----------------|-------|-------------------|
| Falcon7B | Data Parallel | 129th | 256 | 26 t/s/u |
| LLaMA-2-70B | Tensor Parallel | 129th | 32 | 20 t/s/u |
| LLaMA-3.1-70B | Tensor Parallel | 129th | 32 | 20 t/s/u |
| Falcon40B | Tensor Parallel | 129th | 32 | 36 t/s/u |
| Mixtral7Bx8 | Tensor Parallel | 129th | 32 | 33 t/s/u |
| ResNet-50 (fps) | Data Parallel | - | 128 | 56,000 |

Single Galaxy (8x4 mesh of WHs) Models

| Model | Last verified release | Technique | Gen. Token [3] | Batch | Time to first token [4] | End-to-end throughput [1] | Device throughput [2] | Target throughput |
|-------|-----------------------|-----------|----------------|-------|-------------------------|---------------------------|------------------------|-------------------|
| Falcon7B | v0.51.0-rc30 | Data Parallel | 129th | 1024 | 0.30 s | 4.0 t/s/u - 4096 t/s | 17.7 t/s/u - 18125 t/s | 26 t/s/u |
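The t/s/u (tokens per second per user) figures above are simply the aggregate decode throughput divided by the batch size, which the Falcon7B Galaxy row lets us verify directly. A minimal sketch (the helper name is ours, not a tt-metal API):

```python
def per_user_throughput(aggregate_tokens_per_s: float, batch: int) -> float:
    """t/s/u is aggregate decode throughput divided by the number of users (batch)."""
    return aggregate_tokens_per_s / batch

# Falcon7B on a single Galaxy (batch 1024), from the table above:
per_user_throughput(4096, 1024)    # end-to-end: 4.0 t/s/u
per_user_throughput(18125, 1024)   # device: ~17.7 t/s/u
```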

📦 Uncategorized

  • #10600: renamed execute_on_main_thread to operator()
  • #0: refactor ttnn device operation code and program cache
  • #11112: Add forward support for Relational Inplace Ops
  • Update Synchronize api to barrier for missing transactions
  • Move sliding_window to TTNN
  • Update CODEOWNERS of SD tests
  • #11283: Remove old Stable Diffusion implementation and its tests
  • #5383: [Falcon7b] Remove per-token printing in single-card ci demo tests
  • #11349: Add missing include in kernel_types.hpp
  • #10119: move fold op to ttnn infra
  • Bump up TRISC0 stack size
  • #11089: Fix ttnn.line_all_gather(..) to work with async
  • Add best practices for error messages
  • #0: updated mistral readme to reflect batching changes
  • #0: Target specific test file in model perf for ttnn resnet to avoid import conflicts
  • #11280: Enable sharded buffer l1 read/writes test on BH
  • Update CODEOWNERS
  • Add new items to best_practices.md
  • #9322: Remove lamb_optimizer op
  • #10550: Enable remote chip routing before profiler init
  • #11333: Resolve hang with Trace and R-Chip Event Synchronization
  • #10117: Migrate fast_reduce_nc op to ttnn
  • #11389: Add a cloud preset to allow easy connection to the tt-cloud elasticsearch instance
  • Update perf and latest features for llm models (Aug 12)
  • #0: Update watcher noc_sanitize to internally specify noc_id
  • Fix rounding in recip causing pcc issues in models
  • update build_metal.sh to trigger cmake test target
  • #9322: Remove unused bindings
  • Migrate Sharded Partial from TTL to TTNN
  • Move all NLP TMs into experimental/transformers, reorganize the folder, and delete the assorted ttlib bindings
  • #10360: Cut down on build time by targeting tests target directly
  • #11042: Overload complex fw ops
  • #0: remove decorate_as_composite
  • #11346: Replace tt_lib usage in eltwise backward
  • Add sweeps for complex bw_ops: polar, recip, add, mul
  • #0: Add initial t3000 nightly pipeline
  • #11038: Clean up more runner labels for single card
  • [CLEANUP] Remove old unused Mistral code inside models/experimental
  • #5424: GELU and GELU' API calls submodule LLKs
  • #10127: Move reduce op from tt_lib to ttnn part 1
  • #0: Recommend noc_async_write_flushed() on examples
  • #0: added llama3-tg nightly demo test
  • #0: re-add install step at end of build_metal.sh
  • #0: update fold call to new ttnn
  • #0: Fix watcher sanitization for NOC1
  • Implementing all_gather to datacopy signaling
  • #11322: Fix UNet functional and performance demo crash
  • #9992: Compute-engine add example DRAM NOC fix for WH n300
  • Ccl/revert datacopy
  • Fixed default arguments for repacking llama3
  • #11443: Updated Mistral7B reference
  • #11241: Replace tt_lib in models/demos/bert and falcon7b_common
  • #7494: Added unit tests to verify that values to semaphores and circular buffers are being correctly written out when core range sets are used
  • #10612: Unit tests for Galaxy cluster
  • #11469: Run ci/cd upload only on main if workflow_run
  • Fix elt_binary with fused silu sharded version
  • #11428: removed manual calls to ttnn::device_operation::run
  • #10612: Added LD_LIBRARY_PATH var to workflows (minus builds)
  • #11487: register sharded_partial with auto launch
  • llama3: Fuse silu with eltwise mul after FF1
  • #11278: Compiling TTNN with GCC-12
  • #0: fix syntax issue with build workflow
  • Support linking mcasts within/across subcmds for kernel bins
  • #11351: Replace tt_lib usage in eltwise complex
  • #11043: Overload complex multiply, divide
  • #8865: Optimize bcast_h and bcast_w binary kernel override_runtime_ar…
  • Add Llama3.1-8B tests to CI
  • #11392: update matmul block sweep pcc and adjust automatic matmul parameters to avoid exceptions
  • Migrate ssm_1d_sum_reduce to ttnn
  • #0: Revert "#0: Migrate ssm_1d_sum_reduce to ttnn" because it breaks build
  • #0: Remove non-ethernet dispatch trace tests
  • #11499: set default parameter of memory_config to nullopt
  • #11483: Bringing program hash computation to op_profiler
  • #11473: Remove models/experimental/llama2_70b. All development is now…
  • #11241: Replace tt_lib in models/demos/metal_BERT_large_11
  • TTNN complex ops sweeps added
  • #0: TG - Frequent tests fix
  • uint8 pack reconfig
  • Migrate ssm_1d_sum_reduce to ttnn
  • Remove header references and forward decl types in several metal include files
  • Aliu/rm wh arch env
  • #11368: Add Single and Multi-Dev Event APIs to TTNN
  • Move 20.04 builds to self-hosted runners
  • Add sharding support for slice
  • fix compile error
  • #0: Add workaround for avoiding i$ on eth cores that was accidentally deleted
  • #11343: move copy, experimental.typecast, assign and clone
  • #11323: Enable all UNet Shallow tests in CI
  • #11538: Re-enable test_ccl_helpers test suite w/ linker fix
  • #11527: Replace tt_lib in tests/tt_eager
  • Move conv to ttnn namespace.
  • Updated some sweeps to be consistent with the documentation
  • #10881: update golden function
  • #8150: Fix unary docs
  • #11341: Replace tt_lib in models/experimental/bert
  • #11241: Replace tt_lib in models/demos/resnet
  • Share host assigned ID with device FW
  • #0: Fix grammar
  • #0: Add tg nightly pipeline
  • #9751: move attn_matmul, attn_matmul_from_cache and group_attn_matmul…
  • #11241: Replace tt_lib in models/demos/t3000
  • #11341: Replace tt_lib in models/experimental/bert_large_perf
  • #11341: Replace tt_lib in bert_tiny, distilbert
  • #0: Hoist row harvesting error message early
  • #0: Update E2E perf thresholds for Bert/Resnet after dispatch optimizations
  • #9527: Swapping out most python usages of bcast
  • #4984: Make dispatch use HAL for core types
  • #10874: enable initial line allgather testing in TG frequent pipelines
  • #9932: Add support to configure static tlbs for dram and eth cores on BH
  • #11566: Remove unused paged_update_cache op
  • #11241: Replace tt_lib in models/demos/ttnn_*
  • #11241: Replace tt_lib in models/demos/wormhole
  • Update CMake infra for UMD and run UMD unit tests in post-commit
  • Update relay_to_next_cb to use stateful noc apis
  • #11340: migrate move to ttnn
  • #0: fix move issue
  • #0: Fix corner case for process_relay_paged_cmd_large and typo for inline write for BH
  • TTNN Reshape on Device Migration
  • #0: Remove sync mode tests for multi-device resnet
  • Enable multi-buffer per channel in EDM
  • #9527: Reverting changes on falcon for t3000
  • #0: Split ttnn normalization .hpp/.cpp
  • Ngrujic/transpose opt
  • #11424: Replace tt_lib in models/experimental/efficientnet,falcon_40b
  • Update Mixtral expected output and test_model PCC
  • Enable compilation of g++-12 Release build
  • #11422: Replace tt_lib in models/experimental/t5,vovnet
  • Move debug build to build on GH runners
  • #11367: Replace tt_lib usage in tests/ttnn/unit_tests
  • #0: Falcon40b demo - update expected output tokens
  • #11422: Replace tt_lib in models/experimental/trocr,vgg,vit
  • Mo/11571 external op attributes
  • #11426: Replace tt_lib in models/experimental/grok,hrnet,lenet,helper…
  • #11461 stimulus seed fix
  • Update Mamba decode performance criteria in demo
  • #9823: enabling FD out of idle eth cores on BH and update eth l1 size to correct value
  • #0: Flip shape assert for converting RM to TILE to TT_FATAL
  • #0: change log level to "warning" when timing out pytest
  • Add more general sharding support for Pad and Transpose HC
  • #10136: Move SDPA ops to ttnn
  • #10136: Remove old SDPA ops
  • Enable double buffered EDM channel mode for all-gather
  • Graph Capture
  • Optimize sdpa attention mask generation
  • Refactor Falcon7b to use ttnn multi-device tensors (data parallel)
  • Implementing all_gather to datacopy signaling
  • #0: Fix length_adjust max value in test_prefetcher
  • #10100: Add support for paged update_cache and fill_cache
  • Rtawfik/untilize a b
  • #0: delete umd_device.cmake
  • #11488: remove -Wno-c++23-extensions flag
  • Update Moreh records in CODEOWNERS
  • #11421: Replace import tt_lib to ttnn in model experimental
  • Added celu and bias_gelu_unary sweeps to ttnn
  • #11425: Remove tt_lib from models
  • #11544: Replace tt_lib with ttnn in mobilenet, nanogpt, mnist
  • #10133: Migrate update_cache, fill_cache op
  • #0: Bump some perf thresholds for non-trace ttnn_resnet tests
  • #11610: Disable schedule on old sweeps workflow
  • #11478: Fix links in INSTALLING.md
  • #11632: fix reduce scatter regression
  • #11658: fix parameters and non-existing variable issues in short matmul sweeps
  • #11659: Remove section attribute conflict
  • #11656: Bump single-card demos timeout to 70 min
  • #11610: Add new sweep test workflow to CI
  • #11610: Add new flags for resetting with tt-smi-metal executable
  • #11669: Profiler slow dispatch unit test
  • Align page sizes once in AddrGens and clean up some dataflow_api fns
  • [Bug Fix] Fixed graph capture not disabling hooks
  • Add git_branch_name and github_pipeline_link to environment CSVs for benchmarking
  • Add galaxy table to front page README
  • #11536: add galaxy umd tests to TG unit tests pipeline
  • #0: Add graph trace tests
  • #0: Align graph event tracking names
  • #0: Add CMake build flag ENABLE_LIBCXX to selectively enable/disable libc++
  • #0: Fix Graph Tracing tracking of CB deallocation
  • #11647: Move logical inplace ops to binary.cpp
  • Ngrujic/profiling
  • Move getting dispatch sem addr into if statement in erisck.cc
  • #11597: Remove tt_lib in models
  • #8885: Add forward support for EQ_, NE_
  • Add sweeps for ttnn ops identity, subalpha_bw, remainder_unary, remainder_eltwise, remainder_unary_bw
  • #2956: Resolve test for addcdiv
  • Replace ttlib to ttnn yolo folder and util function
  • #0: Add map_location to fix CUDA error while using torch.load
  • #6232: Add embedding backwards op for training
  • #0: Revert overloaded ops
  • #11614: Upload benchmarks even on failure
  • #8764: Additional set of changes for WH readiness, including new installation steps
  • #11511: Add option to Enable ASAN in CMake
  • #0: Add graph-based query apis
  • #0: Call op validation when program cache is disabled
  • #11470: Add device generator support for sweep framework
  • #0: Codify model release rules
  • #11379: Add device perf and t3000 demos to package and release
  • #11726: Fix type mismatches in sweeps ES commands
  • #0: Update sweeps README
  • #11704: Add git_branch_name to ci/cd info
  • #0: Add e2e_perf default on test hang
  • #11667: Skip failing SD and resnet tests on single-card WH regression
  • #0: Improve README instructions for ResNet50 running on T3k
  • #11038: Changed runner-labels
  • #11503: Use new ttnn registration for matmul/linear