v0.55.0-rc2

Pre-release
@github-actions github-actions released this 16 Jan 02:07
· 5 commits to main since this release
ca6730b

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with that release, not the documentation on the main branch. The latest main may differ from the previous release.

The changelog below lists the changes since the last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12800305479

📦 Uncategorized

  • Add noc read/write burst command support to CCL command kernel. Also add automated command lowering to these noc commands
  • MeshWorkload: Initial Implementation
  • [CCL] Fix padding issues
  • #15868: use a buffer's size when creating its CB in groupnorm
  • Fix trace region size
  • #0: Bump E2E perf threshold for host bound WH Resnet variants
  • Extract Device interface
  • Extend graph capture to include device information
  • Quick fix replacing Device* with IDevice in graph tracker
  • #0: Add unit_tests_ttnn_tensor to post-commit
  • Xuncai/ccl global sem
  • #16153: Add fused activations to input tensors
  • Remove ARCH_NAME specific includes from erisc_datamover_builder
  • remove unused function
  • [TT-Train] Updates related to the fixed matmul
  • [Llama3] Add max prefill chunk sizes for different model/device combinations
  • Add sharded sweeps: identity, neg, selu, abs
  • Handle padded shards in ttnn.convert_to_chw
  • #16492: Add new APIs for setting which sub_device_ids to stall on
  • #0: Track local_cb_size to ensure that remote cb config is correctly sent by FD
  • support keepdim for prod
  • #16225: Int32 support for abs
  • Sharded sweeps: prelu, softmax, sinh, softplus, relu_max and relu_min
  • Changing output channel size in the readme example
  • Fix double move in TTNN invoke_composite launch_op
  • Quick fix to device storage/access in the DevicePool
  • Add native N-dimensional tiled-interleaved permute support for when the tiles are broken
  • fix multi-iter in reduce scatter and adopt runtime arg overrider infra
  • [tt-train] Add linear regression ddp example
  • Remove eth_l1_address_params.h from device.cpp
  • Sharded sweeps: exp, exp2, expm1, erfc, erfinv, round, log
  • Fix ttnn.concat golden function when groups > 1
  • #16171: Assert that NCRISC NOC is idle at kernel end.
  • Remove eth_l1_address_params.h from tt_cluster.cpp and watcher
  • Remove dev_mem_map.h usage from watcher_device_reader.cpp
  • #14616: Remove ARCH_* ifdefs from tt_cluster.cpp
  • Add support for DRAM Prefetcher op
  • Resolve reduce-scatter-async sharded tensor correctness bug & hang
  • disable flaky t3k test
  • Remove "noc_parameters.h" from device.cpp
  • Remove restriction of input_nsticks_per_core % w == 0
  • Add tt-forge sweep for conv2d.
  • Remove noc header file inclusion from watcher_device_reader.cpp
  • Fix ttnn.from_torch for 0D/1D tensors with tile layout
  • Short list failing conv2d for forge sweeps
  • Remove halo from shard spec
  • Address issues of var & std
  • #16492: Remove sub_device_ids apis from various read/write functions throughout the stack
  • #6344: Update RoBERTa QA demo
  • Remove noc_parameters.h inclusion from ttnn
  • Resubmit #16339: parameterize dispatch_constants
  • #11512: Refactor bitwise sweeps, add bitwise sharded sweeps, modify t…
  • Update CODEOWNERS
  • Enable multi-core and fix bfloat8 for untilize with unpadding
  • Set up targeting idle eth cores on BH - won't enable because of hang debug
  • Reorganize Print Pages Infrastructure
  • lower fabric erisc datamover eth context switching frequency when workload is running
  • Composite binary sweeps: gcd and lcm
  • Remove ARCH_NAME from host library code
  • [tt-train] Add nanogpt ddp mode
  • #16312: Fix full op to query physical shape for buffer volume
  • #16366: Changed default kernel_config_val for 32bit matmul
  • #16621: Add barriers at end of cq_dispatch_slave.cpp
  • Build wheels in models unit tests workflow
  • Mo/10234 eth dispatch profiling
  • Support subcoregrids in concat_heads
  • Build wheels in ttnn unit tests workflow (the tests require them)
  • #16590: profiler trace detection fix
  • #16503: Optimize CoreRangeSets for CBs and semaphores
  • Revert "#16621: Add barriers at end of cq_dispatch_slave.cpp"
  • Fix nightly stable diffusion tests
  • #0: Used github team for conv files
  • Sweeps: fixed abs; added acos and acosh, sharded and non-sharded
  • fix reduce scatter multi-link support bug
  • Support input tensors of all dimensions/ranks for the prod operation
  • Create Infrastructure to exactly calculate L1 Memory Usage for Conv2D #15088
  • #12253: Implement Batch norm operation for inference mode
  • Port all experimental ops to compute_output_specs
  • #16443: Add a programming example of vecadd_multi_core and gtest
  • Enable to/from torch tests for 0D/1D tensors
  • Port all data movements ops to compute_output_specs
  • #15246: Add sweep tests for addcdiv, addcmul, rdiv, rsub, ceil
  • Fix build break
  • Logical sharding for input tensor and halo output
  • #16495: reduce grid for falcon7b mlp matmul
  • Stress NOC mcast test
  • [skip ci] Update subdevice doc
  • Read from and write to partial buffer regions for interleaved buffers when the offset and size of the specified region are divisible by the buffer page size
  • Fix resnet large on GS
  • Fix Pre-allgather Layernorm bad PCC when use 1D reduction
  • #16353: skip no volume tensors
  • Create README.md
  • Update README.md
  • #16367: Added support to enable dram and l1 memory collection without saving to disk
  • Update .clang-format-ignore
  • Tweak BH csrrs init code
  • #0: Clean up confusing refs to Greyskull from ttnn.copy error messages.
  • Update perf and latest features for llm models (Jan 13)
  • Update README.md
  • #16657: Fix to_layout conversion into row major for 1D tensors
  • Tilize with val padding results in L1 cache OOM
  • #0: Fixes from commit ae61802
  • #0: Skip build-docker-image during post-commit code-analysis since the docker image is already built in a previous job
  • Generate test executables per architecture
  • #16587: Update UMD submodule commit for P150 compatibility
  • Replace some instances of Tensor::get_shape with get_logical_shape
  • Update METALIUM_GUIDE.md
  • #16621: Add barriers at end of cq_dispatch_slave.cpp on IERISC
  • Finish porting OPs to compute_output_specs
  • ScopedGraphCapture
  • #15756 Pull in BH LLK fix for maxpool hang
  • #15246: Add sweep tests for logical_and, logical_or, logical_xor
  • #0: (MINOR) Bump to v0.55.0
  • #11512: Add sweeps for eltwise sharded ops 3
  • Add sweeps for unary, unary_sharded and binary_sharded versions of ops: fmod, remainder, maximum, minimum.
  • Don't leak tt_cluster.hpp through kernel_types.hpp
  • #6983: Re-enable skipped TT-NN unit test
  • #15450: Remove default values from circular buffer parameters in LLK compute APIs
  • update build flag on programming examples docs
  • Fix for P100 board type
  • Sever TT-Train's dependency on TT-Metalium's tests
  • [TT-Train] Update generate of LLM
  • [TT-Train] Add bias=false in LinearLayer
  • TT-Fabric Bringup Initial Check-in
  • #0: Sanitize writes to mailbox on ethernet cores.
  • Add Llama11B-N300 and Llama70B-TG (TP=32) to LLM table in README.md
  • [skip ci] Update llms.md
  • Update test_slice.py
  • #16625: Refactor tracking of sub-device managers from Device to a new class
  • Update code-analysis.yaml
  • [skip ci] Update llms.md
  • remove references to LFS
  • Fixes for conversion to row major for 0D and 0-volume tensors
  • #0: Disable BH tools test at workflow level
  • Removing some usages of LegacyShape, improve Tensor::to_string
  • [skip ci] Fix lint on a doc
  • #0: API Unification for Device and MeshDevice
  • Port ttnn::random and uniform from LegacyShape to SimpleShape
  • #16379: make softmax call moreh_softmax if rank above 4
  • #7126: remove skip for test_sd_matmul test
  • #0: Make device an optional parameter in the tensor distribution API
  • Added build-wheels to fast-dispatch-build-and-unit-tests-wrapper.yaml
  • Adding CCL Async test cases to TG nightly and bug fix
  • #11119: Move op_profiler.hpp under the ttnn folder
  • #15979: Switch to google benchmark for pgm dispatch tests
  • [tt-train] Add weight tying option for NanoGPT demo
  • #0: Fix build of test_pgm_dispatch
  • [tt-train] Update serialization of tensor for DDP
  • #0: Fix failing TG regression tests
  • [skip ci] Update llms.md
  • Add tiled interleaved permute for when width dimension doesn't move (row-major tiled invariant)
  • Add Fabric Router Config to Hal
  • [skip ci] Update llms.md
  • Reflect ARCH_NAME Changes in CI Workflows
  • [skip ci] Update llms.md