
v0.51.0

github-actions released this 27 Aug 15:10
· 2512 commits to main since this release
e6f4c70

Note

If you are installing from a release, please refer to the README, installation instructions, and any other documentation packaged with that release, not to the versions on the main branch. The latest main may differ from the previous release.

The changelog below lists the changes since the last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10580177689

Demo models and their metrics

Grayskull (GS) Models

| Model | Batch | Target throughput |
|-------|-------|-------------------|
| ResNet-50 (fps) | 20 | 10,000 |
| BERT-Large (sen/s) | 12 | 410 |
| Falcon7B-decode (t/s) | 32 | 140 |
| ViT (fps) | 8 | 2,000 |
| T5 small (sen/s) | | |
| Bloom (sen/s) | | |
| U-Net | coming soon | |

[1] - Observed from the host. Includes dispatch overhead and kernel execution time. For LLMs, token-to-token decode throughput is reported.

[2] - Ignoring host overhead. Kernel execution time only. For LLMs, token-to-token decode throughput is reported.

Wormhole (WH) Models

Note

All model demos in this table function on both N150 and N300 Wormhole cards, unless otherwise stated.

Furthermore, all performance numbers here were collected on, or are based on, an N300 Wormhole card.

| Model | Gen. Token [3] | Batch | Time to first token [4] | Target throughput |
|-------|----------------|-------|-------------------------|-------------------|
| Falcon7B | 129th | 32 | 0.08 s | 26 |
| Mistral-7B | 129th | 32 | coming soon | 25 |
| Mamba-2.8B | any | 32 | 0.04 s | 41 |
| LLaMA-3.1-8B | 129th | 8 | coming soon | 23 |
| BERT-Large (sen/s) [5] | - | 8 | - | 400 |
| Stable Diffusion 1.4 512x512 (sec/img) [6] | - | 1 | - | 3 |
| ResNet-50 (fps) | - | 16 | - | 7,000 |

[1] - Observed from the host. Includes dispatch overhead and kernel execution time. For LLMs, token-to-token decode throughput is reported.

[2] - Ignoring host overhead. Kernel execution time only. For LLMs, token-to-token decode throughput is reported.

[3] - Generating the i'th token in a sequence while the kv_cache is filled with i-1 rows.

[4] - Time to fill the kv_cache and generate the first output token (1st user).

[5] - This model demo does not work on N150. It does work on N300.

[6] - This model demo does not work on N300. It does work on N150.
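The decode metrics defined in footnotes [1]-[4] can be made concrete with a small sketch. This is a hypothetical illustration, not tt-metal API code: `decode_step` is a stand-in for a model's forward pass, and the timing shows where time-to-first-token [4] and steady-state decode throughput [1]/[3] are measured.

```python
import time

def run_demo(prompt_tokens, n_new_tokens, decode_step):
    """Illustrate the LLM demo metrics (hypothetical model stub).

    - time to first token [4]: prefill (filling the kv_cache with the
      prompt) plus generation of the first output token
    - Gen. Token [3]: the i'th token is generated while the kv_cache
      already holds i-1 rows, i.e. steady-state decode
    """
    start = time.perf_counter()
    kv_cache = list(prompt_tokens)       # prefill: kv_cache filled from the prompt
    token = decode_step(kv_cache)        # first output token
    ttft = time.perf_counter() - start   # time to first token [4]

    decode_times = []
    for _ in range(n_new_tokens - 1):
        kv_cache.append(token)
        t0 = time.perf_counter()
        token = decode_step(kv_cache)    # i'th token, i-1 rows cached [3]
        decode_times.append(time.perf_counter() - t0)

    # token-to-token decode throughput (tokens/s), as reported in [1] and [2]
    decode_tps = len(decode_times) / sum(decode_times) if decode_times else 0.0
    return ttft, decode_tps
```

Measuring `ttft` from the host gives an end-to-end number in the sense of footnote [1]; footnote [2] would instead time only kernel execution on the device.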

TT-QuietBox & TT-LoudBox (2x4 mesh of WHs) Models

| Model | Technique | Gen. Token [3] | Batch | Target throughput |
|-------|-----------|----------------|-------|-------------------|
| Falcon7B | Data Parallel | 129th | 256 | 26 t/s/u |
| LLaMA-2-70B | Tensor Parallel | 129th | 32 | 20 t/s/u |
| LLaMA-3.1-70B | Tensor Parallel | 129th | 32 | 20 t/s/u |
| Falcon40B | Tensor Parallel | 129th | 32 | 36 t/s/u |
| Mixtral7Bx8 | Tensor Parallel | 129th | 32 | 33 t/s/u |
| ResNet-50 (fps) | Data Parallel | - | 128 | 56,000 |

Single Galaxy (8x4 mesh of WHs) Models

| Model | Last verified release | Technique | Gen. Token [3] | Batch | Time to first token [4] | End-to-end throughput [1] | Device throughput [2] | Target throughput |
|-------|-----------------------|-----------|----------------|-------|-------------------------|---------------------------|------------------------|-------------------|
| Falcon7B | v0.51.0-rc30 | Data Parallel | 129th | 1024 | 0.30 s | 4.0 t/s/u - 4096 t/s | 17.7 t/s/u - 18125 t/s | 26 t/s/u |
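The t/s/u (tokens per second per user) figures above are simply the aggregate decode throughput divided by the batch size, which the Falcon7B Galaxy row lets us verify directly. A minimal sketch (the helper name is ours, not a tt-metal API):

```python
def per_user_throughput(aggregate_tokens_per_s: float, batch: int) -> float:
    """t/s/u is aggregate decode throughput divided by the number of users (batch)."""
    return aggregate_tokens_per_s / batch

# Falcon7B on a single Galaxy (batch 1024), from the table above:
per_user_throughput(4096, 1024)    # end-to-end: 4.0 t/s/u
per_user_throughput(18125, 1024)   # device: ~17.7 t/s/u
```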

📦 Uncategorized

  • #10600: renamed execute_on_main_thread to operator()
  • #0: refactor ttnn device operation code and program cache
  • #11112: Add forward support for Relational Inplace Ops
  • Update Synchronize api to barrier for missing transactions
  • Move sliding_window to TTNN
  • Update CODEOWNERS of SD tests
  • #11283: Remove old Stable Diffusion implementation and its tests
  • #5383: [Falcon7b] Remove per-token printing in single-card ci demo tests
  • #11349: Add missing include in kernel_types.hpp
  • #10119: move fold op to ttnn infra
  • Bump up TRISC0 stack size
  • #11089: Fix ttnn.line_all_gather(..) to work with async
  • Add best practices for error messages
  • #0: updated mistral readme to reflect batching changes
  • #0: Target specific test file in model perf for ttnn resnet to avoid import conflicts
  • #11280: Enable sharded buffer l1 read/writes test on BH
  • Update CODEOWNERS
  • Add new items to best_practices.md
  • #9322: Remove lamb_optimizer op
  • #10550: Enable remote chip routing before profiler init
  • #11333: Resolve hang with Trace and R-Chip Event Synchronization
  • #10117: Migrate fast_reduce_nc op to ttnn
  • #11389: Add a cloud preset to allow easy connection to the tt-cloud elasticsearch instance
  • Update perf and latest features for llm models (Aug 12)
  • #0: Update watcher noc_sanitize to internally specify noc_id
  • Fix rounding in recip causing pcc issues in models
  • update build_metal.sh to trigger cmake test target
  • #9322: Remove unused bindings
  • Migrate Sharded Partial from TTL to TTNN
  • Move all NLP TMs into experimental/transformers, reorganize the folder, and delete the assorted ttlib bindings
  • #10360: Cut down on build time by targeting tests target directly
  • #11042: Overload complex fw ops
  • #0: remove decorate_as_composite
  • #11346: Replace tt_lib usage in eltwise backward
  • Add sweeps for complex bw_ops: polar, recip, add, mul
  • #0: Add initial t3000 nightly pipeline
  • #11038: Clean up more runner labels for single card
  • [CLEANUP] Remove old unused Mistral code inside models/experimental
  • #5424: GELU and GELU' API calls submodule LLKs
  • #10127: Move reduce op from tt_lib to ttnn part 1
  • #0: Recommend noc_async_write_flushed() on examples
  • #0: added llama3-tg nightly demo test
  • #0: re-add install step at end of build_metal.sh
  • #0: update fold call to new ttnn
  • #0: Fix watcher sanitization for NOC1
  • Implementing all_gather to datacopy signaling
  • #11322: Fix UNet functional and performance demo crash
  • #9992: Compute-engine add example DRAM NOC fix for WH n300
  • Ccl/revert datacopy
  • Fixed default arguments for repacking llama3
  • #11443: Updated Mistral7B reference
  • #11241: Replace tt_lib in models/demos/bert and falcon7b_common
  • #7494: Added unit tests to verify that values to semaphores and circular buffers are being correctly written out when core range sets are used
  • #10612: Unit tests for Galaxy cluster
  • #11469: Run ci/cd upload only on main if workflow_run
  • Fix elt_binary with fused silu sharded version
  • #11428: removed manual calls to ttnn::device_operation::run
  • #10612: Added LD_LIBRARY_PATH var to workflows (minus builds)
  • #11487: register sharded_partial with auto launch
  • llama3: Fuse silu with eltwise mul after FF1
  • #11278: Compiling TTNN with GCC-12
  • #0: fix syntax issue with build workflow
  • Support linking mcasts within/across subcmds for kernel bins
  • #11351: Replace tt_lib usage in eltwise complex
  • #11043: Overload complex multiply, divide
  • #8865: Optimize bcast_h and bcast_w binary kernel override_runtime_ar…
  • Add Llama3.1-8B tests to CI
  • #11392: update matmul block sweep pcc and adjust automatic matmul parameters to avoid exceptions
  • Migrate ssm_1d_sum_reduce to ttnn
  • #0: Revert "#0: Migrate ssm_1d_sum_reduce to ttnn" because it breaks build
  • #0: Remove non-ethernet dispatch trace tests
  • #11499: set default parameter of memory_config to nullopt
  • #11483: Bringing program hash computation to op_profiler
  • #11473: Remove models/experimental/llama2_70b. All development is now…
  • #11241: Replace tt_lib in models/demos/metal_BERT_large_11
  • TTNN complex ops sweeps added
  • #0: TG - Frequent tests fix
  • uint8 pack reconfig
  • Migrate ssm_1d_sum_reduce to ttnn
  • Remove header references and forward decl types in several metal include files
  • Aliu/rm wh arch env
  • #11368: Add Single and Multi-Dev Event APIs to TTNN
  • Move 20.04 builds to self-hosted runners
  • Add sharding support for slice
  • fix compile error
  • #0: Add workaround for avoiding i$ on eth cores that was accidentally deleted
  • #11343: move copy, experimental.typecast, assign and clone
  • #11323: Enable all UNet Shallow tests in CI
  • #11538: Re-enable test_ccl_helpers test suite w/ linker fix
  • #11527: Replace tt_lib in tests/tt_eager
  • Move conv to ttnn namespace.
  • Updated some sweeps to be consistent with the documentation
  • #10881: update golden function
  • #8150: Fix unary docs
  • #11341: Replace tt_lib in models/experimental/bert
  • #11241: Replace tt_lib in models/demos/resnet
  • Share host assigned ID with device FW
  • #0: Fix grammar
  • #0: Add tg nightly pipeline
  • #9751: move attn_matmul, attn_matmul_from_cache and group_attn_matmul…
  • #11241: Replace tt_lib in models/demos/t3000
  • #11341: Replace tt_lib in models/experimental/bert_large_perf
  • #11341: Replace tt_lib in bert_tiny, distilbert
  • #0: Hoist row harvesting error message early
  • #0: Update E2E perf thresholds for Bert/Resnet after dispatch optimizations
  • #9527: Swapping out most python usages of bcast
  • #4984: Make dispatch use HAL for core types
  • #10874: enable initial line allgather testing in TG frequent pipelines
  • #9932: Add support to configure static tlbs for dram and eth cores on BH
  • #11566: Remove unused paged_update_cache op
  • #11241: Replace tt_lib in models/demos/ttnn_*
  • #11241: Replace tt_lib in models/demos/wormhole
  • Update CMake infra for UMD and run UMD unit tests in post-commit
  • Update relay_to_next_cb to use stateful noc apis
  • #11340: migrate move to ttnn
  • #0: fix move issue
  • #0: Fix corner case for process_relay_paged_cmd_large and typo for inline write for BH
  • TTNN Reshape on Device Migration
  • #0: Remove sync mode tests for multi-device resnet
  • Enable multi-buffer per channel in EDM
  • #9527: Reverting changes on falcon for t3000
  • #0: Split ttnn normalization .hpp/.cpp
  • Ngrujic/transpose opt
  • #11424: Replace tt_lib in models/experimental/efficientnet,falcon_40b
  • Update Mixtral expected output and test_model PCC
  • Enable compilation of g++-12 Release build
  • #11422: Replace tt_lib in models/experimental/t5,vovnet
  • Move debug build to build on GH runners
  • #11367: Replace tt_lib usage in tests/ttnn/unit_tests
  • #0: Falcon40b demo - update expected output tokens
  • #11422: Replace tt_lib in models/experimental/trocr,vgg,vit
  • Mo/11571 external op attributes
  • #11426: Replace tt_lib in models/experimental/grok,hrnet,lenet,helper…
  • #11461 stimulus seed fix
  • Update Mamba decode performance criteria in demo
  • #9823: enabling FD out of idle eth cores on BH and update eth l1 size to correct value
  • #0: Flip shape assert for converting RM to TILE to TT_FATAL
  • #0: change log level to "warning" when timing out pytest
  • Add more general sharding support for Pad and Transpose HC
  • #10136: Move SDPA ops to ttnn
  • #10136: Remove old SDPA ops
  • Enable double buffered EDM channel mode for all-gather
  • Graph Capture
  • Optimize sdpa attention mask generation
  • Refactor Falcon7b to use ttnn multi-device tensors (data parallel)
  • Implementing all_gather to datacopy signaling
  • #0: Fix length_adjust max value in test_prefetcher
  • #10100: Add support for paged update_cache and fill_cache
  • Rtawfik/untilize a b
  • #0: delete umd_device.cmake
  • #11488: remove -Wno-c++23-extensions flag
  • Update Moreh records in CODEOWNERS
  • #11421: Replace import tt_lib to ttnn in model experimental
  • Added celu and bias_gelu_unary sweeps to ttnn
  • #11425: Remove tt_lib from models
  • #11544: Replace tt_lib with ttnn in mobilenet, nanogpt, mnist
  • #10133: Migrate update_cache, fill_cache op
  • #0: Bump some perf thresholds for non-trace ttnn_resnet tests
  • #11610: Disable schedule on old sweeps workflow
  • #11478: Fix links in INSTALLING.md
  • #11632: fix reduce scatter regression
  • #11658: fix parameters and non-existing variable issues in short matmul sweeps
  • #11659: Remove section attribute conflict
  • #11656: Bump single-card demos timeout to 70 min
  • #11610: Add new sweep test workflow to CI
  • #11610: Add new flags for resetting with tt-smi-metal executable
  • #11669: Profiler slow dispatch unit test
  • Align page sizes once in AddrGens and clean up some dataflow_api fns
  • [Bug Fix] Fixed graph capture not disabling hooks
  • Add git_branch_name and github_pipeline_link to environment CSVs for benchmarking
  • Add galaxy table to front page README
  • #11536: add galaxy umd tests to TG unit tests pipeline
  • #0: Add graph trace tests
  • #0: Align graph event tracking names
  • #0: Add CMake build flag ENABLE_LIBCXX to selectively enable/disable libc++
  • #0: Fix Graph Tracing tracking of CB deallocation
  • #11647: Move logical inplace ops to binary.cpp
  • Ngrujic/profiling
  • Move getting dispatch sem addr into if statement in erisck.cc
  • #11597: Remove tt_lib in models
  • #8885: Add forward support for EQ_, NE_
  • Add sweeps for ttnn ops identity, subalpha_bw, remainder_unary, remainder_eltwise, remainder_unary_bw
  • #2956: Resolve test for addcdiv
  • Replace ttlib to ttnn yolo folder and util function
  • #0: Add map_location to fix CUDA error while using torch.load
  • #6232: Add embedding backwards op for training
  • #0: Revert overloaded ops
  • #11614: Upload benchmarks even on failure
  • #8764: Additional set of changes for WH readiness, including new installation steps
  • #11511: Add option to Enable ASAN in CMake
  • #0: Add graph-based query apis
  • #0: Call op validation when program cache is disabled
  • #11470: Add device generator support for sweep framework
  • #0: Codify model release rules
  • #11379: Add device perf and t3000 demos to package and release
  • #11726: Fix type mismatches in sweeps ES commands
  • #0: Update sweeps README
  • #11704: Add git_branch_name to ci/cd info
  • #0: Add e2e_perf default on test hang
  • #11667: Skip failing SD and resnet tests on single-card WH regression
  • #0: Improve README instructions for ResNet50 running on T3k
  • #11038: Changed runner-labels
  • #11503: Use new ttnn registration for matmul/linear