v0.55.0-rc2
Pre-release
github-actions released this 16 Jan 02:07 · 5 commits to main since this release
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.
The changelog will now follow, showing the changes from last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12800305479
📦 Uncategorized
- Add noc read/write burst command support to CCL command kernel. Also add automated command lowering to these noc commands
- PR: #16461
- MeshWorkload: Initial Implementation
- PR: #16405
- [CCL] Fix padding issues
- PR: #16347
- #15868: use a buffer's size when creating its CB in groupnorm
- PR: #16093
- Fix trace region size
- PR: #16519
- #0: Bump E2E perf threshold for host bound WH Resnet variants
- PR: #16522
- Extract Device interface
- PR: #16482
- Extend graph capture to include device information
- PR: #16408
- Quick fix replacing Device* with IDevice in graph tracker
- PR: #16532
- #0: Add unit_tests_ttnn_tensor to post-commit
- PR: #16211
- Xuncai/ccl global sem
- PR: #16455
- #16153: Add fused activations to input tensors
- PR: #16283
- Remove ARCH_NAME specific includes from erisc_datamover_builder
- PR: #16505
- remove unused function
- PR: #16537
- [TT-Train] Updates related to the fixed matmul
- PR: #16540
- [Llama3] Add max prefill chunk sizes for different model/device combinations
- PR: #16508
- Add sharded sweeps: identity, neg, selu, abs
- PR: #15999
- Handle padded shards in ttnn.convert_to_chw
- PR: #15915
- #16492: Add new APIs for setting which sub_device_ids to stall on
- PR: #16473
- #0: Track local_cb_size to ensure that remote cb config is correctly sent by FD
- PR: #16542
- support keepdim for prod
- PR: #16370
- #16225: Int32 support for abs
- PR: #16226
- Sharded sweeps: prelu, softmax, sinh, softplus, relu_max and relu_min
- PR: #16050
- Changing output channel size in the readme example
- PR: #16303
- Fix double move in TTNN invoke_composite launch_op
- PR: #16551
- Quick fix for how devices are stored/accessed in the DevicePool
- PR: #16550
- Add native N-dimensional tiled-interleaved permute support when the tiles are not broken.
- PR: #16468
- fix multi-iter in reduce scatter and adopt runtime arg overrider infra
- PR: #16531
- [tt-train] Add linear regression ddp example
- PR: #16245
- Remove eth_l1_address_params.h from device.cpp
- PR: #16538
- Sharded sweeps: exp, exp2, expm1, erfc, erfinv, round, log
- PR: #16323
- Fix ttnn.concat golden function when groups > 1
- PR: #16556
- #16171: Assert that NCRISC NOC is idle at kernel end.
- PR: #16471
- Remove eth_l1_address_params.h from tt_cluster.cpp and watcher
- PR: #16568
- Remove dev_mem_map.h usage from watcher_device_reader.cpp
- PR: #16572
- #14616: Remove ARCH_* ifdefs from tt_cluster.cpp
- PR: #13354
- Add support for DRAM Prefetcher op
- PR: #16244
- Resolve reduce-scatter-async sharded tensor correctness bug & hang
- PR: #16548
- disable flaky t3k test
- PR: #16583
- Remove "noc_parameters.h" from device.cpp
- PR: #16582
- Remove restriction of input_nsticks_per_core % w == 0
- PR: #15205
- Add tt-forge sweep for conv2d.
- PR: #16178
- Remove noc header file inclusion from watcher_device_reader.cpp
- PR: #16589
- Fix ttnn.from_torch for 0D/1D tensors with tile layout
- PR: #16484
- Short list failing conv2d for forge sweeps
- PR: #16597
- Remove halo from shard spec
- PR: #15900
- Address issues of var & std
- PR: #16545
- #16492: Remove sub_device_ids apis from various read/write functions throughout the stack
- PR: #16565
- #6344: Update RoBERTa QA demo
- PR: #8896
- Remove noc_parameters.h inclusion from ttnn
- PR: #16593
- Resubmit #16339: parameterize dispatch_constants
- PR: #16478
- #11512: Refactor bitwise sweeps, add bitwise sharded sweeps, modify t…
- PR: #15704
- Update CODEOWNERS
- PR: #16604
- Enable multi-core and fixing bfloat8 for untilize with unpadding
- PR: #16555
- Set up targeting idle eth cores on BH - won't enable because of hang debug
- PR: #14817
- Reorganize Print Pages Infrastructure
- PR: #16463
- lower fabric erisc datamover eth context switching frequency when workload is running
- PR: #16610
- Composite binary sweeps: gcd and lcm
- PR: #16423
- Remove ARCH_NAME from host library code
- PR: #16616
- [tt-train] Add nanogpt ddp mode
- PR: #16614
- #16312: Fix full op to query physical shape for buffer volume
- PR: #16562
- #16366: Changed default kernel_config_val for 32bit matmul
- PR: #16567
- #16621: Add barriers at end of cq_dispatch_slave.cpp
- PR: #16624
- Build wheels in models unit tests workflow
- PR: #16615
- Mo/10234 eth dispatch profiling
- PR: #15609
- Support subcoregrids in concat_heads
- PR: #16223
- Build wheels in ttnn unit tests workflow because the tests need it and we forgot to put it in
- PR: #16605
- #16590: profiler trace detection fix
- PR: #16591
- #16503: Optimize CoreRangeSets for CBs and semaphores
- PR: #16549
- Revert "#16621: Add barriers at end of cq_dispatch_slave.cpp"
- PR: #16645
- Fix nightly stable diffusion tests
- PR: #16629
- #0: Used github team for conv files
- PR: #16563
- Sweeps: fixed abs, added acos and acosh sharded and non sharded
- PR: #16381
- fix reduce scatter multi-link support bug
- PR: #16636
- Support input tensors of all dimensions/ranks for prod operation
- PR: #16301
- Create Infrastructure to exactly calculate L1 Memory Usage for Conv2D #15088
- PR: #15455
- #12253: Implement Batch norm operation for inference mode
- PR: #16432
- Port all experimental ops to compute_output_specs
- PR: #16595
- #16443: Add a programming example of vecadd_multi_core and gtest
- PR: #16446
- Enable to/from torch tests for 0D/1D tensors
- PR: #16653
- Port all data movements ops to compute_output_specs
- PR: #16652
- #15246: Add sweep tests for addcdiv, addcmul, rdiv, rsub, ceil
- PR: #15998
- Fix build break
- PR: #16656
- Logical sharding for input tensor and halo output
- PR: #16517
- #16495: reduce grid for falcon7b mlp matmul
- PR: #16569
- Stress NOC mcast test
- PR: #16639
- [skip ci] Update subdevice doc
- PR: #16669
- Read from and write to partial buffer regions for interleaved buffers where offset and size of specified buffer region are divisible by buffer page size
- PR: #16102
- Fix resnet large on GS
- PR: #16665
- Fix Pre-allgather Layernorm bad PCC when using 1D reduction
- PR: #16622
- #16353: skip no volume tensors
- PR: #16619
- Create README.md
- PR: #16675
- Update README.md
- PR: #16676
- #16367: Added support to enable dram and l1 memory collection without saving to disk
- PR: #16368
- Update .clang-format-ignore
- PR: #16681
- Tweak BH csrrs init code
- PR: #16682
- #0: Clean up confusing refs to Greyskull from ttnn.copy error messages.
- PR: #16647
- Update perf and latest features for llm models (Jan 13)
- PR: #16677
- Update README.md
- PR: #16702
- #16657: Fix to_layout conversion into row major for 1D tensors
- PR: #16684
- Tilize with val padding results in L1 cache OOM
- PR: #16633
- #0: Fixes from commit ae61802
- PR: #16686
- #0: Skip build-docker-image during post-commit code-analysis since the docker image is already built in a previous job
- PR: #16703
- Generate test executables per architecture
- PR: #16594
- #16587: Update UMD submodule commit for P150 compatibility
- PR: #16709
- Replace some instances of Tensor::get_shape with get_logical_shape
- PR: #16655
- Update METALIUM_GUIDE.md
- PR: #16602
- #16621: Add barriers at end of cq_dispatch_slave.cpp on IERISC
- PR: #16666
- Finish porting OPs to compute_output_specs
- PR: #16695
- ScopedGraphCapture
- PR: #15774
- #15756 Pull in BH LLK fix for maxpool hang
- PR: #16663
- #15246: Add sweep tests for logical_and, logical_or, logical_xor
- PR: #16132
- #0: (MINOR) Bump to v0.55.0
- PR: #16714
- #11512: Add sweeps for eltwise sharded ops 3
- PR: #16307
- Add sweeps for unary, unary_sharded and binary_sharded versions of ops: fmod, remainder, maximum, minimum.
- PR: #15911
- Don't leak tt_cluster.hpp through kernel_types.hpp
- PR: #16691
- #6983: Re-enable skipped TT-NN unit test
- PR: #16642
- #15450: Remove default values from circular buffer parameters in LLK compute APIs
- PR: #16389
- update build flag on programming examples docs
- PR: #16635
- Fix for P100 board type
- PR: #16718
- Sever TT-Train's dependency on TT-Metalium's tests
- PR: #16685
- [TT-Train] Update generate of LLM
- PR: #16723
- [TT-Train] Add bias=false in LinearLayer
- PR: #16707
- TT-Fabric Bringup Initial Check-in
- PR: #16343
- #0: Sanitize writes to mailbox on ethernet cores.
- PR: #16574
- Add Llama11B-N300 and Llama70B-TG (TP=32) to LLM table in README.md
- PR: #16724
- [skip ci] Update llms.md
- PR: #16737
- Update test_slice.py
- PR: #16734
- #16625: Refactor tracking of sub-device managers from Device to a new class
- PR: #16683
- Update code-analysis.yaml
- PR: #16738
- [skip ci] Update llms.md
- PR: #16745
- remove references to LFS
- PR: #16722
- Fixes for conversion to row major for 0D and 0-volume tensors
- PR: #16736
- #0: Disable BH tools test at workflow level
- PR: #16749
- Removing some usages of LegacyShape, improve Tensor::to_string
- PR: #16711
- [skip ci] Fix lint on a doc
- PR: #16751
- #0: API Unification for Device and MeshDevice
- PR: #16570
- Port ttnn::random and uniform from LegacyShape to SimpleShape
- PR: #16744
- #16379: make softmax call moreh_softmax if rank above 4
- PR: #16735
- #7126: remove skip for test_sd_matmul test
- PR: #16729
- #0: Make device an optional parameter in the tensor distribution API
- PR: #16746
- Added build-wheels to fast-dispatch-build-and-unit-tests-wrapper.yaml
- PR: #16638
- Adding CCL Async test cases to TG nightly and bug fix
- PR: #16700
- #11119: Move op_profiler.hpp under the ttnn folder
- PR: #11167
- #15979: Switch to google benchmark for pgm dispatch tests
- PR: #16547
- [tt-train] Add weight tying option for NanoGPT demo
- PR: #16768
- #0: Fix build of test_pgm_dispatch
- PR: #16773
- [tt-train] Update serialization of tensor for DDP
- PR: #16778
- #0: Fix failing TG regression tests
- PR: #16776
- [skip ci] Update llms.md
- PR: #16775
- Add tiled interleaved permute for when width dimension doesn't move (row-major tiled invariant)
- PR: #16671
- Add Fabric Router Config to Hal
- PR: #16761
- [skip ci] Update llms.md
- PR: #16791
- Reflect ARCH_NAME Changes in CI Workflows
- PR: #16706
- [skip ci] Update llms.md
- PR: #16792