Release v0.54.0-rc5 · tenstorrent/tt-metal

Note

If you are installing from a release, refer to the README, INSTALLATION instructions, and any other documentation packaged with that release, not the versions on the main branch. The latest main may differ from the release you are installing.

The changelog below shows the changes since the last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12459612752

📦 Uncategorized

Add buffering to DPRINT
- PR: #15677
#13405: TTNN implementation of LENET model
- PR: #13473
Unvendor nlohmann json
- PR: #15956
#0: Update Llama3 README
- PR: #16006
#0: Minor fix to Llama3 model config for TG
- PR: #16008
#13944: Redesign memory packing API
- PR: #15980
#0: Get rid of run_pre_post_commit_regressions* scripts and split CPP tests as much as we can
- PR: #15968
Create new FD frequent pipeline to isolate unstable pgm benchmark tests
- PR: #16010
Revert "#13405: TTNN implementation of LENET model (#13473)"
- PR: #16009
#0: Dedup code in pytensor using generic lambdas and duck typing
- PR: #15989
#14353: DRAM Read Alignment for Layernorm
- PR: #15993
Afuller/fix clang tidy scan
- PR: #16017
#0: Support arch-specific sfpi releases
- PR: #15831
Enable too-small-loop-variable check
- PR: #15984
Remove built cache of previous git commits.
- PR: #15344
[tt-train] Make tests to open and close device explicitly
- PR: #15982
Update ttcnn.md
- PR: #16025
#0: Add bc to docker container for pgm dispatch math
- PR: #16030
#16012: Revert conv2d changes because of perf regressions, pcc regressions, and increase in runtime
- PR: #16019
Update ttcnn.md
- PR: #16031
Enable noexcept-move-ctor check
- PR: #16018
More updates to ttcnn.md
- PR: #16032
disable workflow telemetry in prepare-metal-run
- PR: #16034
Add support for pretty printing Conv2dConfig
- PR: #16027
[tt-train] TT-train build is broken in main
- PR: #16035
#0: created interleaved to sharded e2e sweep test
- PR: #16016
Add support for padding along width dimension to ttnn.pad
- PR: #15985
Bump umd
- PR: #15967
#0: Prevent slice from padding up a 0 volume tensor
- PR: #15988
#0: support unequal ranked inputs for broadcast in binary_ng
- PR: #15957
#16014: Fix yolo4 e2e perf measurement
- PR: #16044
Update CODEOWNERS - add experimental CCL section
- PR: #16039
#15780: div ops debug
- PR: #15992
Revert "#16012: Revert conv2d changes because of perf regressions, pc…
- PR: #16045
#13127: Make TensorLayout::compute_physical_shard_shape public
- PR: #16023
Link Tensor.reshape to ttnn.reshape
- PR: #15669
#0: Fix merge conflicts originating from #15289
- PR: #16062
Integrate chunked prefill into t3k Llama3-70B
- PR: #15921
Bump MagicEnum to v0.9.7
- PR: #16065
#15944: Fix pybind of create_sub_device_manager_with_fabric to call the correct function.
- PR: #16056
[tt-train] Add option to disable wandb in examples
- PR: #16069
Update perf and latest features for llm models (Dec 16)
- PR: #16060
#16070: Use the same Docker image as built
- PR: #16071
[tt-train] Bump magic_enum from 0.9.6 to 0.9.7
- PR: #16074
Update ttcnn.md
- PR: #16077
#13643: Extend binary-ng math support to match all primitive binary ops.
- PR: #16068
#14530: remove up front padding from generic reduce
- PR: #16053
Revert "#0: Fix merge conflicts originating from #15289"
- PR: #16080
Revert "Link Tensor.reshape to ttnn.reshape"
- PR: #16081
#15061: Implement multi-device tensor distribution APIs in terms of C++ ttnn tensors
- PR: #15886
#0: Allow ttnn.pad to pad Tensor to an odd width in row major
- PR: #16079
#15565 Add unit test to show sharding ttnn.from_torch problems
- PR: #15827
#14977: conv config to use higher cores.
- PR: #15962
Revert "#15565 Add unit test to show sharding ttnn.from_torch problems"
- PR: #16086
[UMD] Removed set_*_params calls and constants
- PR: #15908
#0: Remove some dead code
- PR: #16084
Updated installation script
- PR: #16101
Python -> Python3
- PR: #16063
Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute
- PR: #15881
Adding ND support for tilize/untilize with padding
- PR: #15933
[Llama3.2-11b vLLM Integration] Add support for paged cross attention, fixes for continuous batching, simplified decode forward call
- PR: #16076
#0: Enable Local Sweeps and Use a Faster Interprocess Queue
- PR: #16098
#15601: Implement support for MeshDevice::reshape(..)
- PR: #16029
Remove setup_core_to_tlb_map
- PR: #16048
#0: Let sharded_to_interleaved handle interleaved input
- PR: #16116
#0: separate validation of conv weight and bias.
- PR: #15990
#0: Minor refactor of pytensor and tensor implementation files
- PR: #16108
C++ files should not be part of the API of a library
- PR: #16123
#15857: Forge sweep test
- PR: #15858
#15857: Unary forge sweep tests
- PR: #15901
Fix some more namespace pollution caused by using namespace tt::tt_metal
- PR: #16090
#15713 Bad Eltwise Binary ZEROACC
- PR: #16094
#15565 Fix unit test to show sharding ttnn.from_torch problems
- PR: #16088
Fix paged SDPA decode CB sizing issue
- PR: #16059
Reland async dispatch with workaround for hang.
- PR: #16121
#16119: Add forge traces to matmul and reduce sweeps
- PR: #16139
#10034: Binary shift operators
- PR: #16055
#0: Remove incorrect memory span assert
- PR: #16136
Add forge sweeps for slice and transpose
- PR: #16112
#0: Move memory config serialization in the corresponding header away from types.hpp
- PR: #16151
#16114: Allow Binarized Programs to be Reused across WH Devices
- PR: #16120
#0: aligning conv2d transpose as conv
- PR: #16128
support missing cases for sweep tests
- PR: #15804
#0: added normalization details in the tech report
- PR: #15124
Fix ttnn.from_torch for 0D/1D tensors with tile layout
- PR: #15882
Port all Moreh OPs to compute_output_specs
- PR: #16160
Bump umd to fix grayskull cluster bug
- PR: #16126
Clean-up the usage of deallocate_activation
- PR: #16099
llm tech report multi device section
- PR: #16180
Add prefill v decode section to LLM tech report [section 3.2]
- PR: #15096
#0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
- PR: #16024
[LLM tech report] Add accuracy evaluation and debugging sections
- PR: #15190
#16165: Disabling test that depends on some machine state to pass
- PR: #16166
enable dps ops for matmul
- PR: #15285
Isolate tracy
- PR: #16161
[TT-Train ]added tests for sum and mean
- PR: #16152
#16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
#15221: Post completion messages to dispatch_s
- PR: #16187
[TT-Train] Added softmax backward
- PR: #16168
Optimized FreeList allocator
- PR: #15536
Set the test data to be relative to the test binary
- PR: #16150
#0: Fix matmul doc string
- PR: #16208
#0: remove spammy warning from conftest
- PR: #16198
Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
LLM tech report sections 2.2, 2.5
- PR: #15121
[TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
Updating Allocator docs to explain first fit usage
- PR: #16214
Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
#15891: improve full accuracy and fix full bugs
- PR: #16182
Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
#15857: Skip abs forge for GS
- PR: #16221
#16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
Add max kernel size for each risc type in an op
- PR: #16203
Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
#12662: add keepdim fixes to reduce
- PR: #16163
Add chunked prefill to Llama family
- PR: #16111
#15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
Update CODEOWNERS
- PR: #16196
support reduction for 3d & 4d dims
- PR: #16236
#5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
Add full support for creating tensors with logical sharding from python
- PR: #16072
update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
#15857: Binary Forge Sweep Tests Set2
- PR: #16087
#14976/#15039: Add Support For ceil_mode=True
- PR: #16124
Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
New FD Init Flow
- PR: #15406
Add support for output sharded embeddings
- PR: #16237
Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
