Releases: tenstorrent/tt-metal
v0.53.0-rc16
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.
The changelog below lists the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/11337677247
📦 Uncategorized
- #11962: remove uint8 unpack reconfig code
- PR: #13218
- Vanilla Unet conv unit_test
- PR: #13267
- #9370: removed ndpcc work around and debug code in sdpa decode and re-enabled CI
- PR: #13299
- #0: Bump trace region size to 20MB for T3K LLAMA2
- PR: #13309
- Not holding state for freshening profiler logs
- PR: #13335
- #13136: Consolidate all_gather and line_all_gather to common api
- PR: #13148
- #11005: Added CreateKernelFromString()
- PR: #12789
- #11622: sweep concat traces
- PR: #13345
- #0: Bump ttnn bert perf threshold to account for recent refactoring
- PR: #13346
- #0: fix CCL nightly and frequent test regression suites
- PR: #13349
- #13142: Add documentation for device ops, memory config
- PR: #13166
- #13128: Add cmake options to control what tests get built
- PR: #13251
- [skip ci] Update CODEOWNERS for CMakeLists.txt
- PR: #13221
- Update matrix_engine.md
- PR: #13350
- #13258: build_metal.sh enhancements
- PR: #13259
- Flash decode improvements r3
- PR: #13351
- #0: shortened flash decode tests to avoid potential timeout in fast dispatch
- PR: #13358
- #12632: Migrate moreh_layer_norm operation from tt_eager to ttnn
- PR: #12633
- #11844: Add dispatch_s for asynchronously sending go signals
- PR: #13069
- #12805: Migrate moreh_sum_backward operation from tt_eager to ttnn
- PR: #12806
- #13187: revise moreh_mean and moreh_mean_backward
- PR: #13260
- #12687: port moreh_group_norm and moreh_group_norm_backward from tt_dnn to ttnn
- PR: #12755
- #12694: Refactor moreh_linear and moreh_linear_backward
- PR: #12812
- #13246: Remove unary_backward_op.hpp
- PR: #13247
- #0: integrate distributed sharded layernorm with llama-tg
- PR: #13225
- Add support for matmul 1D having L1 sharded weights
- PR: #13094
- #11791: linker script cleanups
- PR: #13305
- #0: Add copy sweep
- PR: #13356
- #12214: refactor moreh_sgd from deprecated to ttnn
- PR: #12378
- [Nightly fast dispatch CI] Fix Llama3.1-8B tests running out of memory
- PR: #13362
- Update perf target for one falcon7b config due to CI variation
- PR: #13355
- Add bitwise ops sweeps, add gen_rand_bitwise_left_shift function
- PR: #13366
- Multiple watcher-related updates
- PR: #13029
- #11621: add filler sweeps for expand, fill, split_with_sizes, index_select and .t
- PR: #13359
- #13363: Surface job errors where Set up runner does not complete successfully
- PR: #13379
- #13127: Remove shape_without_padding() pybinding and usage
- PR: #13369
- #11208: Refactor ProgramCache to remove nested type erasure
- PR: #13216
- #11208: Slotmap datastructure for creating resource pools
- PR: #13378
- #13365: added program caching for page tensor for flash decode
- PR: #13381
- Update llama ttft in README.md
- PR: #13389
- #0: Add tech report for inf/nan handling
- PR: #13391
- #11403: SubMesh Support + Porting/Stamping T3K Tests to Galaxy
- PR: #12962
- Add new ttnn sweeps
- PR: #13239
- Remove profiler core flat id look up
- PR: #13377
- #11789: Fix firmware/kernel padding/alignment
- PR: #13367
- #8534: Publish tt-metal docs to the central site
- PR: #10356
- #0: Sweeps Logger Fixes
- PR: #13423
- Mchiou/13011 dump firmware and system logs if ci jobs fail
- PR: #13231
- #13419: Handle cases where GitHub timeout on a job cuts off the data in a test in a Junit XML, leaving no data to use
- PR: #13425
- #12605: Add governor notes and move models steps into separate steps
- PR: #12703
- #13254: switch pgm dispatch to use trace, add it to CI
- PR: #13255
- #10016: jit_build: link substitutes, tdma_xmov, noc
- PR: #13430
- #11208: Slotmap datastructure for creating resource pools
- PR: #13427
- #0: Dispatch_s + Launch Message Ring Buffer Bugfixes
- PR: #13393
- #0: Reduce copy sweep to cover only bf16
- PR: #13436
- #13394: Galaxy 2cq support
- PR: #13422
- #0: Fix ncrisc code overflow problem
- PR: #13442
- Add more pipelines to top-level "Choose your pipeline" workflows
- PR: #13446
- #13127: Update ttnn::Shape struct to maintain API parity with existing tt::tt_metal::LegacyShape usages
- PR: #13382
- #0: SegFormer on n150 - functional
- PR: #13384
- #7091: Add git commit runbook to CONTRIBUTING.md
- PR: #13371
- Moving DRAM/L1_UNRESERVED_BASE into HAL
- PR: #13296
- #11401: Add supplementary tensor parallel example to regression
- PR: #12434
- #13432: fix t3k ethernet tests
- PR: #13453
- #0: fix mesh device fixture selection for test_distributed_layernorm
- PR: #13433
- #13454: Refactor API for MeshDevice::enable_async
- PR: #13455
- deprecate JAWBRIDGE
- PR: #13449
- #8488: Update activation list in doc
- PR: #13282
- #13424: Add documentation for opt output tensor and qid
- PR: #13443
- #8428: Update sweep config and doc for polyval
- PR: #13196
- #7712: Update elu, erf variant sweep config and doc
- PR: #13156
- #7961: Update logical or doc and sweep config
- PR: #13188
- Llama 3.1 8b DRAM-shard the LM head, 23.1 t/s/u
- PR: #13340
- #12559: add ttnn implementation for convnet_mnist model
- PR: #12649
- #13143: Add documentation for core, set_printoptions ops
- PR: #13199
- #13144: Add documentation for tensor creation ops, matmul ops
- PR: #13155
- Jvega/readme changes
- PR: #13431
- #0: TG-Llama3-70b - Add compilation step to demo
- PR: #13416
- TG Llama3-70b prefill frequent tests enabled
- PR: #13472
- #11791: proper bss, stack only on firmware
- PR: #13375
- Add more eltwise unary ops
- PR: #13465
- #11307: Remove l1_buffer
- PR: #13451
- Fix composite ops asserting on perf report generation
- PR: #13480
- #11791: Implement Elf reading
- PR: #13388
- #13482: Resolve 2CQ Trace Hangs on TG
- PR: #13484
- Add DPRINT support for CB rd/wr pointers from BRISC/NCRISC
- PR: #13489
- Refactor TT-NN / TT-Metal Mesh/Multi-device related into separate subdirectory
- PR: #13460
- #13127: Add get_logical_shape/get_padded_shape to Tensor
- PR: #13372
- #0: update CODEOWNERS for distributed subdirectories
- PR: #13503
- #13127: Add simple tensor creation gtest
- PR: #13500
- Fix compilation of test_create_tensor.cpp
- PR: #13506
- #0: add is_ci_env to segformer model
- PR: #13497
- New tests and updates of ttnn sweeps
- PR: #13417
- #11307: Remove l1_data section
- PR: #13483
v0.53.0-rc15
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/11319532326
📦 Uncategorized
- #11962: remove uint8 unpack reconfig code
- PR: #13218
- Vanilla Unet conv unit_test
- PR: #13267
- #11740: Extend post commit coverage and add sweep test
- PR: #13040
- #13269: Revise moreh_norm, moreh_norm_backward operations
- PR: #13270
- #13140: Cleanup Binary Backward ops
- PR: #13286
- #13315: Revise moreh_bmm, moreh_bmm_backward operations
- PR: #13316
- #0: TG Llama3-70b - fix frequent tests
- PR: #13322
- Revert "#11962: remove uint8 unpack reconfig code"
- PR: #13306
- Llama318b continuous batching + Paged Attention Support
- PR: #13205
- #0: Remove demo output files from Llama3.1-8B
- PR: #13325
- #11592: use the semaphore indices returned by CreateSemaphore
- PR: #13297
- #9370: removed ndpcc work around and debug code in sdpa decode and re-enabled CI
- PR: #13299
- #0: Bump trace region size to 20MB for T3K LLAMA2
- PR: #13309
- Not holding state for freshening profiler logs
- PR: #13335
- #13136: Consolidate all_gather and line_all_gather to common api
- PR: #13148
- #11005: Added CreateKernelFromString()
- PR: #12789
- #11622: sweep concat traces
- PR: #13345
- #0: Bump ttnn bert perf threshold to account for recent refactoring
- PR: #13346
- #0: fix CCL nightly and frequent test regression suites
- PR: #13349
- #13142: Add documentation for device ops, memory config
- PR: #13166
- #13128: Add cmake options to control what tests get built
- PR: #13251
- [skip ci] Update CODEOWNERS for CMakeLists.txt
- PR: #13221
- Update matrix_engine.md
- PR: #13350
- #13258: build_metal.sh enhancements
- PR: #13259
- Flash decode improvements r3
- PR: #13351
- #0: shortened flash decode tests to avoid potential timeout in fast dispatch
- PR: #13358
- #12632: Migrate moreh_layer_norm operation from tt_eager to ttnn
- PR: #12633
- #11844: Add dispatch_s for asynchronously sending go signals
- PR: #13069
- #12805: Migrate moreh_sum_backward operation from tt_eager to ttnn
- PR: #12806
- #13187: revise moreh_mean and moreh_mean_backward
- PR: #13260
- #12687: port moreh_group_norm and moreh_group_norm_backward from tt_dnn to ttnn
- PR: #12755
- #12694: Refactor moreh_linear and moreh_linear_backward
- PR: #12812
- #13246: Remove unary_backward_op.hpp
- PR: #13247
- #0: integrate distributed sharded layernorm with llama-tg
- PR: #13225
- Add support for matmul 1D having L1 sharded weights
- PR: #13094
- #11791: linker script cleanups
- PR: #13305
- #0: Add copy sweep
- PR: #13356
- #12214: refactor moreh_sgd from deprecated to ttnn
- PR: #12378
- [Nightly fast dispatch CI] Fix Llama3.1-8B tests running out of memory
- PR: #13362
- Update perf target for one falcon7b config due to CI variation
- PR: #13355
- Add bitwise ops sweeps, add gen_rand_bitwise_left_shift function
- PR: #13366
- Multiple watcher-related updates
- PR: #13029
- #11621: add filler sweeps for expand, fill, split_with_sizes, index_select and .t
- PR: #13359
- #13363: Surface job errors where Set up runner does not complete successfully
- PR: #13379
- #13127: Remove shape_without_padding() pybinding and usage
- PR: #13369
- #11208: Refactor ProgramCache to remove nested type erasure
- PR: #13216
- #11208: Slotmap datastructure for creating resource pools
- PR: #13378
- #13365: added program caching for page tensor for flash decode
- PR: #13381
- Update llama ttft in README.md
- PR: #13389
- #0: Add tech report for inf/nan handling
- PR: #13391
- #11403: SubMesh Support + Porting/Stamping T3K Tests to Galaxy
- PR: #12962
- Add new ttnn sweeps
- PR: #13239
- Remove profiler core flat id look up
- PR: #13377
- #11789: Fix firmware/kernel padding/alignment
- PR: #13367
- #8534: Publish tt-metal docs to the central site
- PR: #10356
- #0: Sweeps Logger Fixes
- PR: #13423
- Mchiou/13011 dump firmware and system logs if ci jobs fail
- PR: #13231
- #13419: Handle cases where GitHub timeout on a job cuts off the data in a test in a Junit XML, leaving no data to use
- PR: #13425
- #12605: Add governor notes and move models steps into separate steps
- PR: #12703
- #13254: switch pgm dispatch to use trace, add it to CI
- PR: #13255
- #10016: jit_build: link substitutes, tdma_xmov, noc
- PR: #13430
- #11208: Slotmap datastructure for creating resource pools
- PR: #13427
- #0: Dispatch_s + Launch Message Ring Buffer Bugfixes
- PR: #13393
- #0: Reduce copy sweep to cover only bf16
- PR: #13436
- #13394: Galaxy 2cq support
- PR: #13422
- #0: Fix ncrisc code overflow problem
- PR: #13442
- Add more pipelines to top-level "Choose your pipeline" workflows
- PR: #13446
- #13127: Update ttnn::Shape struct to maintain API parity with existing tt::tt_metal::LegacyShape usages
- PR: #13382
- #0: SegFormer on n150 - functional
- PR: #13384
- #7091: Add git commit runbook to CONTRIBUTING.md
- PR: #13371
- Moving DRAM/L1_UNRESERVED_BASE into HAL
- PR: #13296
- #11401: Add supplementary tensor parallel example to regression
- PR: #12434
- #13432: fix t3k ethernet tests
- PR: #13453
- #0: fix mesh device fixture selection for test_distributed_layernorm
- PR: #13433
- #13454: Refactor API for MeshDevice::enable_async
- PR: #13455
- deprecate JAWBRIDGE
- PR: #13449
- #8488: Update activation list in doc
- PR: #13282
- #13424: Add documentation for opt output tensor and qid
- PR: #13443
- #8428: Update sweep config and doc for polyval
- PR: #13196
- #7712: Update elu, erf variant sweep config and doc
- PR: #13156
- #7961: Update logical or doc and sweep config
- PR: #13188
- Llama 3.1 8b DRAM-shard the LM head, 23.1 t/s/u
- PR: #13340
- #12559: add ttnn implementation for convnet_mnist model
- PR: #12649
- #13143: Add documentation for core, set_printoptions ops
- PR: #13199
- #13144: Add documentation for tensor creation ops, matmul ops
- PR: #13155
- Jvega/readme changes
- PR: #13431
- #0: TG-Llama3-70b - Add compilation step to demo
- PR: #13416
- TG Llama3-70b prefill frequent tests enabled
- PR: #13472
- #11791: proper bss, stack only on firmware
- PR: #13375
- Add more eltwise unary ops
- PR: #13465
- #11307: Remove l1_buffer
- PR: #13451
- Fix composite ops asserting on perf report generation
- PR: #13480
- #11791: Implement Elf reading
- PR: #13388
- #13482: Resolve 2CQ Trace Hangs on TG
- PR: #13484
- Add DPRINT support for CB rd/wr pointers from BRISC/NCRISC
- PR: #13489
- Refactor TT-NN / TT-Metal Mesh/Multi-device related into separate subdirectory
- PR: #13460
- #13127: Add get_logical_shape/get_padded_shape to Tensor
- PR: #13372
- #0: update CODEOWNERS for distributed subdirectories
- PR: #13503
- #13127: Add simple tensor creation gtest
- PR: #13500
- Fix compilation of test_create_tensor.cpp
- PR: #13506
- #0: add is_ci_env to segformer model
- PR: #13497
- New tests and updates of ttnn sweeps
- PR: #13417
- #11307: Remove l1_data section
- PR: #13483
v0.53.0-rc14
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/11301561247
📦 Uncategorized
- #11962: remove uint8 unpack reconfig code
- PR: #13218
- Update slack notification owner for t3k-model-perf-falcon7b
- PR: #13289
- #12040: add transpose trace sweeps
- PR: #13252
- Divanovic/llama tg demo
- PR: #13105
- #0: Fix bug in perplexity script for Llama
- PR: #13301
- #0: Update cast in ncrisc BH init code
- PR: #13295
- #0: Move remote chip event synchronization to dispatch core
- PR: #13256
- Vanilla Unet conv unit_test
- PR: #13267
- #11740: Extend post commit coverage and add sweep test
- PR: #13040
- #13269: Revise moreh_norm, moreh_norm_backward operations
- PR: #13270
- #13140: Cleanup Binary Backward ops
- PR: #13286
- #13315: Revise moreh_bmm, moreh_bmm_backward operations
- PR: #13316
- #0: TG Llama3-70b - fix frequent tests
- PR: #13322
- Revert "#11962: remove uint8 unpack reconfig code"
- PR: #13306
- Llama318b continuous batching + Paged Attention Support
- PR: #13205
- #0: Remove demo output files from Llama3.1-8B
- PR: #13325
- #11592: use the semaphore indices returned by CreateSemaphore
- PR: #13297
- #9370: removed ndpcc work around and debug code in sdpa decode and re-enabled CI
- PR: #13299
- #0: Bump trace region size to 20MB for T3K LLAMA2
- PR: #13309
- Not holding state for freshening profiler logs
- PR: #13335
- #13136: Consolidate all_gather and line_all_gather to common api
- PR: #13148
- #11005: Added CreateKernelFromString()
- PR: #12789
- #11622: sweep concat traces
- PR: #13345
- #0: Bump ttnn bert perf threshold to account for recent refactoring
- PR: #13346
- #0: fix CCL nightly and frequent test regression suites
- PR: #13349
- #13142: Add documentation for device ops, memory config
- PR: #13166
- #13128: Add cmake options to control what tests get built
- PR: #13251
- [skip ci] Update CODEOWNERS for CMakeLists.txt
- PR: #13221
- Update matrix_engine.md
- PR: #13350
- #13258: build_metal.sh enhancements
- PR: #13259
- Flash decode improvements r3
- PR: #13351
- #0: shortened flash decode tests to avoid potential timeout in fast dispatch
- PR: #13358
- #12632: Migrate moreh_layer_norm operation from tt_eager to ttnn
- PR: #12633
- #11844: Add dispatch_s for asynchronously sending go signals
- PR: #13069
- #12805: Migrate moreh_sum_backward operation from tt_eager to ttnn
- PR: #12806
- #13187: revise moreh_mean and moreh_mean_backward
- PR: #13260
- #12687: port moreh_group_norm and moreh_group_norm_backward from tt_dnn to ttnn
- PR: #12755
- #12694: Refactor moreh_linear and moreh_linear_backward
- PR: #12812
- #13246: Remove unary_backward_op.hpp
- PR: #13247
- #0: integrate distributed sharded layernorm with llama-tg
- PR: #13225
- Add support for matmul 1D having L1 sharded weights
- PR: #13094
- #11791: linker script cleanups
- PR: #13305
- #0: Add copy sweep
- PR: #13356
- #12214: refactor moreh_sgd from deprecated to ttnn
- PR: #12378
- [Nightly fast dispatch CI] Fix Llama3.1-8B tests running out of memory
- PR: #13362
- Update perf target for one falcon7b config due to CI variation
- PR: #13355
- Add bitwise ops sweeps, add gen_rand_bitwise_left_shift function
- PR: #13366
- Multiple watcher-related updates
- PR: #13029
- #11621: add filler sweeps for expand, fill, split_with_sizes, index_select and .t
- PR: #13359
- #13363: Surface job errors where Set up runner does not complete successfully
- PR: #13379
- #13127: Remove shape_without_padding() pybinding and usage
- PR: #13369
- #11208: Refactor ProgramCache to remove nested type erasure
- PR: #13216
- #11208: Slotmap datastructure for creating resource pools
- PR: #13378
- #13365: added program caching for page tensor for flash decode
- PR: #13381
- Update llama ttft in README.md
- PR: #13389
- #0: Add tech report for inf/nan handling
- PR: #13391
- #11403: SubMesh Support + Porting/Stamping T3K Tests to Galaxy
- PR: #12962
- Add new ttnn sweeps
- PR: #13239
- Remove profiler core flat id look up
- PR: #13377
- #11789: Fix firmware/kernel padding/alignment
- PR: #13367
- #8534: Publish tt-metal docs to the central site
- PR: #10356
- #0: Sweeps Logger Fixes
- PR: #13423
- Mchiou/13011 dump firmware and system logs if ci jobs fail
- PR: #13231
- #13419: Handle cases where GitHub timeout on a job cuts off the data in a test in a Junit XML, leaving no data to use
- PR: #13425
- #12605: Add governor notes and move models steps into separate steps
- PR: #12703
- #13254: switch pgm dispatch to use trace, add it to CI
- PR: #13255
- #10016: jit_build: link substitutes, tdma_xmov, noc
- PR: #13430
- #11208: Slotmap datastructure for creating resource pools
- PR: #13427
- #0: Dispatch_s + Launch Message Ring Buffer Bugfixes
- PR: #13393
- #0: Reduce copy sweep to cover only bf16
- PR: #13436
- #13394: Galaxy 2cq support
- PR: #13422
- #0: Fix ncrisc code overflow problem
- PR: #13442
- Add more pipelines to top-level "Choose your pipeline" workflows
- PR: #13446
- #13127: Update ttnn::Shape struct to maintain API parity with existing tt::tt_metal::LegacyShape usages
- PR: #13382
- #0: SegFormer on n150 - functional
- PR: #13384
- #7091: Add git commit runbook to CONTRIBUTING.md
- PR: #13371
- Moving DRAM/L1_UNRESERVED_BASE into HAL
- PR: #13296
- #11401: Add supplementary tensor parallel example to regression
- PR: #12434
- #13432: fix t3k ethernet tests
- PR: #13453
- #0: fix mesh device fixture selection for test_distributed_layernorm
- PR: #13433
- #13454: Refactor API for MeshDevice::enable_async
- PR: #13455
- deprecate JAWBRIDGE
- PR: #13449
- #8488: Update activation list in doc
- PR: #13282
- #13424: Add documentation for opt output tensor and qid
- PR: #13443
- #8428: Update sweep config and doc for polyval
- PR: #13196
- #7712: Update elu, erf variant sweep config and doc
- PR: #13156
- #7961: Update logical or doc and sweep config
- PR: #13188
- Llama 3.1 8b DRAM-shard the LM head, 23.1 t/s/u
- PR: #13340
- #12559: add ttnn implementation for convnet_mnist model
- PR: #12649
- #13143: Add documentation for core, set_printoptions ops
- PR: #13199
- #13144: Add documentation for tensor creation ops, matmul ops
- PR: #13155
- Jvega/readme changes
- PR: #13431
- #0: TG-Llama3-70b - Add compilation step to demo
- PR: #13416
- TG Llama3-70b prefill frequent tests enabled
- PR: #13472
- #11791: proper bss, stack only on firmware
- PR: #13375
- Add more eltwise unary ops
- PR: #13465
- #11307: Remove l1_buffer
- PR: #13451
- Fix composite ops asserting on perf report generation
- PR: #13480
- #11791: Implement Elf reading
- PR: #13388
- #13482: Resolve 2CQ Trace Hangs on TG
- PR: #13484
- Add DPRINT support for CB rd/wr pointers from BRISC/NCRISC
- PR: #13489
- Refactor TT-NN / TT-Metal Mesh/Multi-device related into separate subdirectory
- PR: #13460
- #13127: Add get_logical_shape/get_padded_shape to Tensor
- PR: #13372
- #0: update CODEOWNERS for distributed subdirectories
- PR: #13503
- #13127: Add simple tensor creation gtest
- PR: #13500
- Fix compilation of test_create_tensor.cpp
- PR: #13506
- #0: add is_ci_env to segformer model
- PR: #13497
- New tests and updates of ttnn sweeps
- PR: #13417
- #11307: Remove l1_data section
- PR: #13483
v0.53.0-rc13
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/11284745213
📦 Uncategorized
- #11962: remove uint8 unpack reconfig code
- PR: #13218
- Add initial (very limited) support for line reduce scatter
- PR: #13133
- pack kernel binary memory spans into one
- PR: #12977
- #13242: Cleanup set-5 unary backward ops
- PR: #13243
- [skip ci] Update CODEOWNERS for TT-NN
- PR: #13220
- #13084: fix return vector optional tensor with launch_op
- PR: #13085
- #12757: update math function for ops
- PR: #13001
- #11512: Added sweep for ttnn.bcast
- PR: #13200
- #0: update all-gather tests to remove all_devices test fixture
- PR: #13262
- Llama device perf optimizations
- PR: #12953
- Tensor-parallel Llama3.1 8b bringup on n300
- PR: #13160
- [skip ci] Add last update date to LLM table in README
- PR: #13226
- #13285: Add arch tag for galaxy workflows that didn't have it because a) we should specify and b) we need it for data collection
- PR: #13287
- #0: Optimize untilize_with_unpad for W 16
- PR: #13114
- Update slack notification owner for t3k-model-perf-falcon7b
- PR: #13289
- #12040: add transpose trace sweeps
- PR: #13252
- Divanovic/llama tg demo
- PR: #13105
- #0: Fix bug in perplexity script for Llama
- PR: #13301
- #0: Update cast in ncrisc BH init code
- PR: #13295
- #0: Move remote chip event synchronization to dispatch core
- PR: #13256
- Vanilla Unet conv unit_test
- PR: #13267
- #11740: Extend post commit coverage and add sweep test
- PR: #13040
- #13269: Revise moreh_norm, moreh_norm_backward operations
- PR: #13270
- #13140: Cleanup Binary Backward ops
- PR: #13286
- #13315: Revise moreh_bmm, moreh_bmm_backward operations
- PR: #13316
- #0: TG Llama3-70b - fix frequent tests
- PR: #13322
- Revert "#11962: remove uint8 unpack reconfig code"
- PR: #13306
- Llama318b continuous batching + Paged Attention Support
- PR: #13205
- #0: Remove demo output files from Llama3.1-8B
- PR: #13325
- #11592: use the semaphore indices returned by CreateSemaphore
- PR: #13297
- #9370: removed ndpcc work around and debug code in sdpa decode and re-enabled CI
- PR: #13299
- #0: Bump trace region size to 20MB for T3K LLAMA2
- PR: #13309
- Not holding state for freshening profiler logs
- PR: #13335
- #13136: Consolidate all_gather and line_all_gather to common api
- PR: #13148
- #11005: Added CreateKernelFromString()
- PR: #12789
- #11622: sweep concat traces
- PR: #13345
- #0: Bump ttnn bert perf threshold to account for recent refactoring
- PR: #13346
- #0: fix CCL nightly and frequent test regression suites
- PR: #13349
- #13142: Add documentation for device ops, memory config
- PR: #13166
- #13128: Add cmake options to control what tests get built
- PR: #13251
- [skip ci] Update CODEOWNERS for CMakeLists.txt
- PR: #13221
- Update matrix_engine.md
- PR: #13350
- #13258: build_metal.sh enhancements
- PR: #13259
- Flash decode improvements r3
- PR: #13351
- #0: shortened flash decode tests to avoid potential timeout in fast dispatch
- PR: #13358
- #12632: Migrate moreh_layer_norm operation from tt_eager to ttnn
- PR: #12633
- #11844: Add dispatch_s for asynchronously sending go signals
- PR: #13069
- #12805: Migrate moreh_sum_backward operation from tt_eager to ttnn
- PR: #12806
- #13187: revise moreh_mean and moreh_mean_backward
- PR: #13260
- #12687: port moreh_group_norm and moreh_group_norm_backward from tt_dnn to ttnn
- PR: #12755
- #12694: Refactor moreh_linear and moreh_linear_backward
- PR: #12812
- #13246: Remove unary_backward_op.hpp
- PR: #13247
- #0: integrate distributed sharded layernorm with llama-tg
- PR: #13225
- Add support for matmul 1D having L1 sharded weights
- PR: #13094
- #11791: linker script cleanups
- PR: #13305
- #0: Add copy sweep
- PR: #13356
- #12214: refactor moreh_sgd from deprecated to ttnn
- PR: #12378
- [Nightly fast dispatch CI] Fix Llama3.1-8B tests running out of memory
- PR: #13362
- Update perf target for one falcon7b config due to CI variation
- PR: #13355
- Add bitwise ops sweeps, add gen_rand_bitwise_left_shift function
- PR: #13366
- Multiple watcher-related updates
- PR: #13029
- #11621: add filler sweeps for expand, fill, split_with_sizes, index_select and .t
- PR: #13359
- #13363: Surface job errors where Set up runner does not complete successfully
- PR: #13379
- #13127: Remove shape_without_padding() pybinding and usage
- PR: #13369
- #11208: Refactor ProgramCache to remove nested type erasure
- PR: #13216
- #11208: Slotmap datastructure for creating resource pools
- PR: #13378
- #13365: added program caching for page tensor for flash decode
- PR: #13381
- Update llama ttft in README.md
- PR: #13389
- #0: Add tech report for inf/nan handling
- PR: #13391
- #11403: SubMesh Support + Porting/Stamping T3K Tests to Galaxy
- PR: #12962
- Add new ttnn sweeps
- PR: #13239
- Remove profiler core flat id look up
- PR: #13377
- #11789: Fix firmware/kernel padding/alignment
- PR: #13367
- #8534: Publish tt-metal docs to the central site
- PR: #10356
- #0: Sweeps Logger Fixes
- PR: #13423
- Mchiou/13011 dump firmware and system logs if ci jobs fail
- PR: #13231
- #13419: Handle cases where GitHub timeout on a job cuts off the data in a test in a Junit XML, leaving no data to use
- PR: #13425
- #12605: Add governor notes and move models steps into separate steps
- PR: #12703
- #13254: switch pgm dispatch to use trace, add it to CI
- PR: #13255
- #10016: jit_build: link substitutes, tdma_xmov, noc
- PR: #13430
- #11208: Slotmap datastructure for creating resource pools
- PR: #13427
- #0: Dispatch_s + Launch Message Ring Buffer Bugfixes
- PR: #13393
- #0: Reduce copy sweep to cover only bf16
- PR: #13436
- #13394: Galaxy 2cq support
- PR: #13422
- #0: Fix ncrisc code overflow problem
- PR: #13442
- Add more pipelines to top-level "Choose your pipeline" workflows
- PR: #13446
- #13127: Update ttnn::Shape struct to maintain API parity with existing tt::tt_metal::LegacyShape usages
- PR: #13382
- #0: SegFormer on n150 - functional
- PR: #13384
- #7091: Add git commit runbook to CONTRIBUTING.md
- PR: #13371
- Moving DRAM/L1_UNRESERVED_BASE into HAL
- PR: #13296
- #11401: Add supplementary tensor parallel example to regression
- PR: #12434
- #13432: fix t3k ethernet tests
- PR: #13453
- #0: fix mesh device fixture selection for test_distributed_layernorm
- PR: #13433
- #13454: Refactor API for MeshDevice::enable_async
- PR: #13455
- deprecate JAWBRIDGE
- PR: #13449
- #8488: Update activation list in doc
- PR: #13282
- #13424: Add documentation for opt output tensor and qid
- PR: #13443
- #8428: Update sweep config and doc for polyval
- PR: #13196
- #7712: Update elu, erf variant sweep config and doc
- PR: #13156
- #7961: Update logical or doc and sweep config
- PR: #13188
- Llama 3.1 8b DRAM-shard the LM head, 23.1 t/s/u
- PR: #13340
- #12559: add ttnn implementation for convnet_mnist model
- PR: #12649
- #13143: Add documentation for core, set_printoptions ops
- PR: #13199
- #13144: Add documentation for tensor creation ops, matmul ops
- PR: #13155
- Jvega/readme changes
- PR: #13431
- #0: TG-Llama3-70b - Add compilation step to demo
- PR: #13416
- TG Llama3-70b prefill frequent tests enabled
- PR: #13472
- #11791: proper bss, stack only on firmware
- PR: #13375
- Add more eltwise unary ops
- PR: #13465
- #11307: Remove l1_buffer
- PR: #13451
- Fix composite ops asserting on perf report generation
- PR: #13480
- #11791: Implement Elf reading
- PR: #13388
- #13482: Resolve 2CQ Trace Hangs on TG
- PR: #13484
- Add DPRINT support for CB rd/wr pointers from BRISC/NCRISC
- PR: #13489
- Refactor TT-NN / TT-Metal Mesh/Multi-device related into separate subdirectory
- PR: #13460
- #13127: Add get_logical_shape/get_padded_shape to Tensor
- PR: #13372
- #0: update CODEOWNERS for distributed subdirectories
- PR: #13503
- #13127: Add simple tensor creation gtest
- PR: #13500
- Fix compilation of test_create_tensor.cpp
- PR: #13506
- #0: add is_ci_env to segformer model
- PR: #13497
- New tests and updates of ttnn sweeps
- PR: #13417
- #11307: Remove l1_data section
- PR: #13483
v0.53.0-rc12
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/11265836934
📦 Uncategorized
- Add llk support for cumsum and transpose_wh_dest with relevant tests
- PR: #12925
- Add numeric stable option for softmax
- PR: #13068
- #12878: Add links to job and pipeline for CI/CD analytics
- PR: #13183
- #0: fix CCL nightly tests
- PR: #13164
- #12919: Cleanup set-2 Unary Backward ops
- PR: #13138
- #8865: Add sharded tensor support to dispatch profile infra
- PR: #12871
- #0: Update CODEOWNERS for ttnn/ttnn/operations/moreh.py
- PR: #13185
- #13137: Revise moreh_arange operation
- PR: #13139
- #13095: Refactor moreh_nll_loss operations
- PR: #13097
- #10439: ttnn implementation of vgg model
- PR: #12511
- #13175: Add new category to summary table in sweeps query tool
- PR: #13176
- #5174: Disable command buffer FIFOs on BH
- PR: #13079
- Update CODEOWNERS
- PR: #13209
- Fix demo_trace and add on-device argmax to test_llama_perf
- PR: #13201
- #0: fix program caching bug in post_all_gather
- PR: #13224
- Do not require test dispatch workflow to run on "in-service" runners
- PR: #12660
- Add description to describe typical labels one could use in test dispatch workflow
- PR: #13228
- Add an option to split dprint output by risc
- PR: #13131
- Add new "choose your own pipeline" workflow
- PR: #13230
- #11962: remove uint8 unpack reconfig code
- PR: #13218
- Add tg and tgg frequent tests to "Choose your pipeline" workflow
- PR: #13236
- Add options to select a subset of pipelines that a user would like to run
- PR: #13237
- Update names of perf-models and perf-device-models jobs
- PR: #13238
- #13086: Revising moreh_getitem
- PR: #13087
- Sweeps: log, log1p, log2, log10
- PR: #13045
- #12721: Cleanup set-3 Unary Backward ops
- PR: #13207
- #13212: Cleanup set-4 Unary backward ops
- PR: #13214
- Add initial (very limited) support for line reduce scatter
- PR: #13133
- pack kernel binary memory spans into one
- PR: #12977
- #13242: Cleanup set-5 unary backward ops
- PR: #13243
- [skip ci] Update CODEOWNERS for TT-NN
- PR: #13220
- #13084: fix return vector optional tensor with launch_op
- PR: #13085
- #12757: update math function for ops
- PR: #13001
- #11512: Added sweep for ttnn.bcast
- PR: #13200
- #0: update all-gather tests to remove all_devices test fixture
- PR: #13262
- Llama device perf optimizations
- PR: #12953
- Tensor-parallel Llama3.1 8b bringup on n300
- PR: #13160
- [skip ci] Add last update date to LLM table in README
- PR: #13226
- #13285: Add arch tag for galaxy workflows that didn't have it because a) we should specify and b) we need it for data collection
- PR: #13287
- #0: Optimize untilize_with_unpad for W 16
- PR: #13114
- Update slack notification owner for t3k-model-perf-falcon7b
- PR: #13289
- #12040: add transpose trace sweeps
- PR: #13252
- Divanovic/llama tg demo
- PR: #13105
- #0: Fix bug in perplexity script for Llama
- PR: #13301
- #0: Update cast in ncrisc BH init code
- PR: #13295
- #0: Move remote chip event synchronization to dispatch core
- PR: #13256
- Vanilla Unet conv unit_test
- PR: #13267
- #11740: Extend post commit coverage and add sweep test
- PR: #13040
- #13269: Revise moreh_norm, moreh_norm_backward operations
- PR: #13270
- #13140: Cleanup Binary Backward ops
- PR: #13286
- #13315: Revise moreh_bmm, moreh_bmm_backward operations
- PR: #13316
- #0: TG Llama3-70b - fix frequent tests
- PR: #13322
- Revert "#11962: remove uint8 unpack reconfig code"
- PR: #13306
- Llama318b continuous batching + Paged Attention Support
- PR: #13205
- #0: Remove demo output files from Llama3.1-8B
- PR: #13325
- #11592: use the semaphore indices returned by CreateSemaphore
- PR: #13297
- #9370: removed ndpcc work around and debug code in sdpa decode and re-enabled CI
- PR: #13299
- #0: Bump trace region size to 20MB for T3K LLAMA2
- PR: #13309
- Not holding state for freshening profiler logs
- PR: #13335
- #13136: Consolidate all_gather and line_all_gather to common api
- PR: #13148
- #11005: Added CreateKernelFromString()
- PR: #12789
- #11622: sweep concat traces
- PR: #13345
- #0: Bump ttnn bert perf threshold to account for recent refactoring
- PR: #13346
- #0: fix CCL nightly and frequent test regression suites
- PR: #13349
- #13142: Add documentation for device ops, memory config
- PR: #13166
- #13128: Add cmake options to control what tests get built
- PR: #13251
- [skip ci] Update CODEOWNERS for CMakeLists.txt
- PR: #13221
- Update matrix_engine.md
- PR: #13350
- #13258: build_metal.sh enhancements
- PR: #13259
- Flash decode improvements r3
- PR: #13351
- #0: shortened flash decode tests to avoid potential timeout in fast dispatch
- PR: #13358
- #12632: Migrate moreh_layer_norm operation from tt_eager to ttnn
- PR: #12633
- #11844: Add dispatch_s for asynchronously sending go signals
- PR: #13069
- #12805: Migrate moreh_sum_backward operation from tt_eager to ttnn
- PR: #12806
- #13187: revise moreh_mean and moreh_mean_backward
- PR: #13260
- #12687: port moreh_group_norm and moreh_group_norm_backward from tt_dnn to ttnn
- PR: #12755
- #12694 Refactor moreh_linear and moreh_linear_backward
- PR: #12812
- #13246: Remove unary_backward_op.hpp
- PR: #13247
- #0: integrate distributed sharded layernorm with llama-tg
- PR: #13225
- Add support for matmul 1D having L1 sharded weights
- PR: #13094
- #11791: linker script cleanups
- PR: #13305
- #0: Add copy sweep
- PR: #13356
- #12214: refactor moreh_sgd from deprecated to ttnn
- PR: #12378
- [Nightly fast dispatch CI] Fix Llama3.1-8B tests running out of memory
- PR: #13362
- Update perf target for one falcon7b config due to CI variation
- PR: #13355
- Add bitwise ops sweeps, add gen_rand_bitwise_left_shift function
- PR: #13366
- Multiple watcher-related updates
- PR: #13029
- #11621: add filler sweeps for expand, fill, split_with_sizes, index_select and .t
- PR: #13359
- #13363: Surface job errors where Set up runner does not complete successfully
- PR: #13379
- #13127: Remove shape_without_padding() pybinding and usage
- PR: #13369
- #11208: Refactor ProgramCache to remove nested type erasure
- PR: #13216
- #11208: Slotmap datastructure for creating resource pools
- PR: #13378
- #13365: added program caching for page tensor for flash decode
- PR: #13381
- Update llama ttft in README.md
- PR: #13389
- #0: Add tech report for inf/nan handling
- PR: #13391
- #11403: SubMesh Support + Porting/Stamping T3K Tests to Galaxy
- PR: #12962
- Add new ttnn sweeps
- PR: #13239
- Remove profiler core flat id look up
- PR: #13377
- #11789: Fix firmware/kernel padding/alignment
- PR: #13367
- #8534: Publish tt-metal docs to the central site
- PR: #10356
- #0: Sweeps Logger Fixes
- PR: #13423
- Mchiou/13011 dump firmware and system logs if ci jobs fail
- PR: #13231
- #13419: Handle cases where GitHub timeout on a job cuts off the data in a test in a Junit XML, leaving no data to use
- PR: #13425
- #12605: Add governor notes and move models steps into separate steps
- PR: #12703
- #13254: switch pgm dispatch to use trace, add it to CI
- PR: #13255
- #10016: jit_build: link substitutes, tdma_xmov, noc
- PR: #13430
- #11208: Slotmap datastructure for creating resource pools
- PR: #13427
- #0: Dispatch_s + Launch Message Ring Buffer Bugfixes
- PR: #13393
- #0: Reduce copy sweep to cover only bf16
- PR: #13436
- #13394: Galaxy 2cq support
- PR: #13422
- #0: Fix ncrisc code overflow problem
- PR: #13442
- Add more pipelines to top-level "Choose your pipeline" workflows
- PR: #13446
- #13127: Update ttnn::Shape struct to maintain API parity with existing tt::tt_metal::LegacyShape usages
- PR: #13382
- #0: SegFormer on n150 - functional
- PR: #13384
- #7091: Add git commit runbook to CONTRIBUTING.md
- PR: #13371
- Moving DRAM/L1_UNRESERVED_BASE into HAL
- PR: #13296
- #11401: Add supplementary tensor parallel example to regression
- PR: #12434
- #13432: fix t3k ethernet tests
- PR: #13453
- #0: fix mesh device fixture selection for test_distributed_layernorm
- PR: #13433
- #13454: Refactor API for MeshDevice::enable_async
- PR: #13455
- deprecate JAWBRIDGE
- PR: #13449
- #8488: Update activation list in doc
- PR: #13282
- #13424: Add documentation for opt output tensor and qid
- PR: #13443
- #8428: Update sweep config and doc for polyval
- PR: #13196
- #7712: Update elu, erf variant sweep config and doc
- PR: #13156
- #7961: Update logical or doc and sweep config
- PR: #13188
- Llama 3.1 8b DRAM-shard the LM head, 23.1 t/s/u
- PR: #13340
- #12559: add ttnn implementation for convnet_mnist model
- PR: #12649
- #13143: Add documentation for core, set_printoptions ops
- PR: #13199
- #13144: Add documentation for tensor creation ops, matmul ops
- PR: #13155
- Jvega/readme changes
- PR: #13431
- #0: TG-Llama3-70b - Add compilation step to demo
- PR: #13416
- TG Llama3-70b prefill frequent tests enabled
- PR: #13472
- #11791: proper bss, stack only on firmware
- PR: #13375
- Add more eltwise unary ops
- PR: #13465
- #11307: Remove l1_buffer
- PR: #13451
- Fix composite ops asserting on perf report generation
- PR: #13480
- #11791: Implement Elf reading
- PR: #13388
- #13482: Resolve 2CQ Trace Hangs on TG
- PR: #13484
- Add DPRINT support for CB rd/wr pointers from BRISC/NCR...
v0.53.0-rc11
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.
The changelog will now follow, showing the changes from last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/11246610563
📦 Uncategorized
- #0: Bump distilbert compile time because it keeps failing on it
- PR: #13135
- #13088: Cleanup set-1 unary backward ops
- PR: #13096
- #10033: Add forward support for gcd and lcm
- PR: #10241
- #13150: Cleanup LCM, GCD Macro
- PR: #13151
- Llama3.1 8b demo with tracing
- PR: #13153
- #13058: update matmul bias size validation
- PR: #13104
- #0: (MINOR) Update to v0.53.0
- PR: #13165
- #0: try with python 3.10
- PR: #13168
- #13145: Temporarily revert Resnet on Galaxy to use slower config for first conv to avoid hangs
- PR: #13146
- #0: Remove unnecessary ProgramDeleter
- PR: #13134
- #13127: Switch python get_legacy_shape to shape.with_tile_padding()
- PR: #13124
- Add sweeps for remainder, fmod, minimum, maximum, logical_and eltwise ops, rename eltwise sweeps
- PR: #13099
- Fix Yolo tests after updating weights shape in conv2d
- PR: #13163
- #13172: Use lower python version and cache dependencies
- PR: #13173
- #11830: Move l1/dram/pcie alignment into HAL
- PR: #12983
- #13014: optimize slice by adding a 4D uint32_t array implementation o…
- PR: #13125
- Add llk support for cumsum and transpose_wh_dest with relevant tests
- PR: #12925
- Add numeric stable option for softmax
- PR: #13068
- #12878: Add links to job and pipeline for CI/CD analytics
- PR: #13183
- #0: fix CCL nightly tests
- PR: #13164
- #12919: Cleanup set-2 Unary Backward ops
- PR: #13138
- #8865: Add sharded tensor support to dispatch profile infra
- PR: #12871
- #0: Update CODEOWNERS for ttnn/ttnn/operations/moreh.py
- PR: #13185
- #13137: Revise moreh_arange operation
- PR: #13139
- #13095: Refactor moreh_nll_loss operations
- PR: #13097
- #10439: ttnn implementation of vgg model
- PR: #12511
- #13175: Add new category to summary table in sweeps query tool
- PR: #13176
- #5174: Disable command buffer FIFOs on BH
- PR: #13079
- Update CODEOWNERS
- PR: #13209
- Fix demo_trace and add on-device argmax to test_llama_perf
- PR: #13201
- #0: fix program caching bug in post_all_gather
- PR: #13224
- Do not require test dispatch workflow to run on "in-service" runners
- PR: #12660
- Add description to describe typical labels one could use in test dispatch workflow
- PR: #13228
- Add an option to split dprint output by risc
- PR: #13131
- Add new "choose your own pipeline" workflow
- PR: #13230
- #11962: remove uint8 unpack reconfig code
- PR: #13218
- Add tg and tgg frequent tests to "Choose your pipeline" workflow
- PR: #13236
- Add options to select a subset of pipelines that a user would like to run
- PR: #13237
- Update names of perf-models and perf-device-models jobs
- PR: #13238
- #13086: Revising moreh_getitem
- PR: #13087
- Sweeps: log, log1p, log2, log10
- PR: #13045
- #12721: Cleanup set-3 Unary Backward ops
- PR: #13207
- #13212: Cleanup set-4 Unary backward ops
- PR: #13214
- Add initial (very limited) support for line reduce scatter
- PR: #13133
- pack kernel binary memory spans into one
- PR: #12977
- #13242: Cleanup set-5 unary backward ops
- PR: #13243
- [skip ci] Update CODEOWNERS for TT-NN
- PR: #13220
- #13084: fix return vector optional tensor with launch_op
- PR: #13085
- #12757: update math function for ops
- PR: #13001
- #11512: Added sweep for ttnn.bcast
- PR: #13200
- #0: update all-gather tests to remove all_devices test fixture
- PR: #13262
- Llama device perf optimizations
- PR: #12953
- Tensor-parallel Llama3.1 8b bringup on n300
- PR: #13160
- [skip ci] Add last update date to LLM table in README
- PR: #13226
- #13285: Add arch tag for galaxy workflows that didn't have it because a) we should specify and b) we need it for data collection
- PR: #13287
- #0: Optimize untilize_with_unpad for W 16
- PR: #13114
- Update slack notification owner for t3k-model-perf-falcon7b
- PR: #13289
- #12040: add transpose trace sweeps
- PR: #13252
- Divanovic/llama tg demo
- PR: #13105
- #0: Fix bug in perplexity script for Llama
- PR: #13301
- #0: Update cast in ncrisc BH init code
- PR: #13295
- #0: Move remote chip event synchronization to dispatch core
- PR: #13256
- Vanilla Unet conv unit_test
- PR: #13267
- #11740: Extend post commit coverage and add sweep test
- PR: #13040
- #13269: Revise moreh_norm, moreh_norm_backward operations
- PR: #13270
- #13140: Cleanup Binary Backward ops
- PR: #13286
- #13315: Revise moreh_bmm, moreh_bmm_backward operations
- PR: #13316
- #0: TG Llama3-70b - fix frequent tests
- PR: #13322
- Revert "#11962: remove uint8 unpack reconfig code"
- PR: #13306
- Llama318b continuous batching + Paged Attention Support
- PR: #13205
- #0: Remove demo output files from Llama3.1-8B
- PR: #13325
- #11592: use the semaphore indices returned by CreateSemaphore
- PR: #13297
- #9370: removed ndpcc work around and debug code in sdpa decode and re-enabled CI
- PR: #13299
- #0: Bump trace region size to 20MB for T3K LLAMA2
- PR: #13309
- Not holding state for freshening profiler logs
- PR: #13335
- #13136: Consolidate all_gather and line_all_gather to common api
- PR: #13148
- #11005: Added CreateKernelFromString()
- PR: #12789
- #11622: sweep concat traces
- PR: #13345
- #0: Bump ttnn bert perf threshold to account for recent refactoring
- PR: #13346
- #0: fix CCL nightly and frequent test regression suites
- PR: #13349
- #13142: Add documentation for device ops, memory config
- PR: #13166
- #13128: Add cmake options to control what tests get built
- PR: #13251
- [skip ci] Update CODEOWNERS for CMakeLists.txt
- PR: #13221
- Update matrix_engine.md
- PR: #13350
- #13258: build_metal.sh enhancements
- PR: #13259
- Flash decode improvements r3
- PR: #13351
- #0: shortened flash decode tests to avoid potential timeout in fast dispatch
- PR: #13358
- #12632: Migrate moreh_layer_norm operation from tt_eager to ttnn
- PR: #12633
- #11844: Add dispatch_s for asynchronously sending go signals
- PR: #13069
- #12805: Migrate moreh_sum_backward operation from tt_eager to ttnn
- PR: #12806
- #13187: revise moreh_mean and moreh_mean_backward
- PR: #13260
- #12687: port moreh_group_norm and moreh_group_norm_backward from tt_dnn to ttnn
- PR: #12755
- #12694 Refactor moreh_linear and moreh_linear_backward
- PR: #12812
- #13246: Remove unary_backward_op.hpp
- PR: #13247
- #0: integrate distributed sharded layernorm with llama-tg
- PR: #13225
- Add support for matmul 1D having L1 sharded weights
- PR: #13094
- #11791: linker script cleanups
- PR: #13305
- #0: Add copy sweep
- PR: #13356
- #12214: refactor moreh_sgd from deprecated to ttnn
- PR: #12378
- [Nightly fast dispatch CI] Fix Llama3.1-8B tests running out of memory
- PR: #13362
- Update perf target for one falcon7b config due to CI variation
- PR: #13355
- Add bitwise ops sweeps, add gen_rand_bitwise_left_shift function
- PR: #13366
- Multiple watcher-related updates
- PR: #13029
- #11621: add filler sweeps for expand, fill, split_with_sizes, index_select and .t
- PR: #13359
- #13363: Surface job errors where Set up runner does not complete successfully
- PR: #13379
- #13127: Remove shape_without_padding() pybinding and usage
- PR: #13369
- #11208: Refactor ProgramCache to remove nested type erasure
- PR: #13216
- #11208: Slotmap datastructure for creating resource pools
- PR: #13378
- #13365: added program caching for page tensor for flash decode
- PR: #13381
- Update llama ttft in README.md
- PR: #13389
- #0: Add tech report for inf/nan handling
- PR: #13391
- #11403: SubMesh Support + Porting/Stamping T3K Tests to Galaxy
- PR: #12962
- Add new ttnn sweeps
- PR: #13239
- Remove profiler core flat id look up
- PR: #13377
- #11789: Fix firmware/kernel padding/alignment
- PR: #13367
- #8534: Publish tt-metal docs to the central site
- PR: #10356
- #0: Sweeps Logger Fixes
- PR: #13423
- Mchiou/13011 dump firmware and system logs if ci jobs fail
- PR: #13231
- #13419: Handle cases where GitHub timeout on a job cuts off the data in a test in a Junit XML, leaving no data to use
- PR: #13425
- #12605: Add governor notes and move models steps into separate steps
- PR: #12703
- #13254: switch pgm dispatch to use trace, add it to CI
- PR: #13255
- #10016: jit_build: link substitutes, tdma_xmov, noc
- PR: #13430
- #11208: Slotmap datastructure for creating resource pools
- PR: #13427
- #0: Dispatch_s + Launch Message Ring Buffer Bugfixes
- PR: #13393
- #0: Reduce copy sweep to cover only bf16
- PR: #13436
- #13394: Galaxy 2cq support
- PR: #13422
- #0: Fix ncrisc code overflow problem
- PR: #13442
- Add more pipelines to top-level "Choose your pipeline" workflows
- PR: #13446
- #13127: Update ttnn::Shape struct to maintain API parity with existing tt::tt_metal::LegacyShape usages
- PR: #13382
- #0: SegFormer on n150 - functional
- PR: #13384
- #7091: Add git commit runbook to CONTRIBUTING.md
- PR: #13371
- Moving DRAM/L1_UNRESERVED_BASE into HAL
- PR: #13296
- #11401: Add supplementary tensor parallel example to regression
- PR: #12434
- #13432: fix t3k ethernet tests
- PR: #13453
- #0: fix mesh device fixture selection for test_distributed_layernorm
- PR: #13433
- #13454: Refactor API for MeshDevice::enable_async
- PR: #13455
- deprecate JAWBRIDGE
- PR: #13449
- #8488: Update activation list in doc
- PR: #13282...
v0.53.0-rc10
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.
The changelog will now follow, showing the changes from last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/11226937372
📦 Uncategorized
- Update Llama codeowners
- PR: #12116
- #0: fix uncaught edge case in page update cache and added it in test suite
- PR: #13074
- #12754: Migrate moreh_nll_loss operations (reduced and unreduced) from tt_eager to ttnn
- PR: #12807
- #8633: Add TT_Fatal for full and ones op
- PR: #12921
- #12985: Expose ttnn::ccl::Topology at python level
- PR: #12988
- #12556: Add queue_id and optional output tensors to assign_bw
- PR: #12573
- Support for increasing 1-D row major int32 tensors by one
- PR: #12773
- #12828: update ttnn matmul doc string
- PR: #13071
- Llama 3.1 8b DRAM-sharded matmuls
- PR: #12869
- Update perf and latest features for llm models (Sept 23)
- PR: #13064
- Work around CSV reporting 64 cores for DRAM-sharded matmuls
- PR: #13108
- #0: Fix PCC to correct bound
- PR: #13110
- #0: Simplify llrt/memory API
- PR: #13067
- #0: Fix caching race
- PR: #13063
- #0: Fix merge error with 80d6e48
- PR: #13112
- #11004: moreh: use env var for kernel src search path
- PR: #12541
- #12328: Fix Llama3.1-8B MLP tests running out of L1
- PR: #13113
- #11769: extend support for transposing/permuting bfloat8 tensors on n…
- PR: #13018
- #12141: Fixed matmul shape validation issue
- PR: #12989
- #0: move BufferType to device kernel accessible location
- PR: #12984
- #12658: update sweep export script and create initial graph script
- PR: #13051
- #0: ViT on WH
- PR: #13072
- [skip ci] Update README.md (ViT on n150)
- PR: #13119
- #0: Bump resnet50 ttnn 2cq compile time because it regressed likely due to gcc risc-v upgrade
- PR: #13121
- #0: Update WH Resnet compile time threshold
- PR: #13115
- Flash decode improvements r2
- PR: #13028
- #0: added support for n_heads > 1 for page cache prefill
- PR: #13117
- #0: Bump mamba compile time as it's not that important and the model is still performant, need to unblock people…
- PR: #13130
- #0: move Layout enum to device accessible location
- PR: #13118
- #0: Bump distilbert compile time because it keeps failing on it
- PR: #13135
- #13088: Cleanup set-1 unary backward ops
- PR: #13096
- #10033: Add forward support for gcd and lcm
- PR: #10241
- #13150: Cleanup LCM, GCD Macro
- PR: #13151
- Llama3.1 8b demo with tracing
- PR: #13153
- #13058: update matmul bias size validation
- PR: #13104
- #0: (MINOR) Update to v0.53.0
- PR: #13165
- #0: try with python 3.10
- PR: #13168
- #13145: Temporarily revert Resnet on Galaxy to use slower config for first conv to avoid hangs
- PR: #13146
- #0: Remove unnecessary ProgramDeleter
- PR: #13134
- #13127: Switch python get_legacy_shape to shape.with_tile_padding()
- PR: #13124
- Add sweeps for remainder, fmod, minimum, maximum, logical_and eltwise ops, rename eltwise sweeps
- PR: #13099
- Fix Yolo tests after updating weights shape in conv2d
- PR: #13163
- #13172: Use lower python version and cache dependencies
- PR: #13173
- #11830: Move l1/dram/pcie alignment into HAL
- PR: #12983
- #13014: optimize slice by adding a 4D uint32_t array implementation o…
- PR: #13125
- Add llk support for cumsum and transpose_wh_dest with relevant tests
- PR: #12925
- Add numeric stable option for softmax
- PR: #13068
- #12878: Add links to job and pipeline for CI/CD analytics
- PR: #13183
- #0: fix CCL nightly tests
- PR: #13164
- #12919: Cleanup set-2 Unary Backward ops
- PR: #13138
- #8865: Add sharded tensor support to dispatch profile infra
- PR: #12871
- #0: Update CODEOWNERS for ttnn/ttnn/operations/moreh.py
- PR: #13185
- #13137: Revise moreh_arange operation
- PR: #13139
- #13095: Refactor moreh_nll_loss operations
- PR: #13097
- #10439: ttnn implementation of vgg model
- PR: #12511
- #13175: Add new category to summary table in sweeps query tool
- PR: #13176
- #5174: Disable command buffer FIFOs on BH
- PR: #13079
- Update CODEOWNERS
- PR: #13209
- Fix demo_trace and add on-device argmax to test_llama_perf
- PR: #13201
- #0: fix program caching bug in post_all_gather
- PR: #13224
- Do not require test dispatch workflow to run on "in-service" runners
- PR: #12660
- Add description to describe typical labels one could use in test dispatch workflow
- PR: #13228
- Add an option to split dprint output by risc
- PR: #13131
- Add new "choose your own pipeline" workflow
- PR: #13230
- #11962: remove uint8 unpack reconfig code
- PR: #13218
- Add tg and tgg frequent tests to "Choose your pipeline" workflow
- PR: #13236
- Add options to select a subset of pipelines that a user would like to run
- PR: #13237
- Update names of perf-models and perf-device-models jobs
- PR: #13238
- #13086: Revising moreh_getitem
- PR: #13087
- Sweeps: log, log1p, log2, log10
- PR: #13045
- #12721: Cleanup set-3 Unary Backward ops
- PR: #13207
- #13212: Cleanup set-4 Unary backward ops
- PR: #13214
- Add initial (very limited) support for line reduce scatter
- PR: #13133
- pack kernel binary memory spans into one
- PR: #12977
- #13242: Cleanup set-5 unary backward ops
- PR: #13243
- [skip ci] Update CODEOWNERS for TT-NN
- PR: #13220
- #13084: fix return vector optional tensor with launch_op
- PR: #13085
- #12757: update math function for ops
- PR: #13001
- #11512: Added sweep for ttnn.bcast
- PR: #13200
- #0: update all-gather tests to remove all_devices test fixture
- PR: #13262
- Llama device perf optimizations
- PR: #12953
- Tensor-parallel Llama3.1 8b bringup on n300
- PR: #13160
- [skip ci] Add last update date to LLM table in README
- PR: #13226
- #13285: Add arch tag for galaxy workflows that didn't have it because a) we should specify and b) we need it for data collection
- PR: #13287
- #0: Optimize untilize_with_unpad for W 16
- PR: #13114
- Update slack notification owner for t3k-model-perf-falcon7b
- PR: #13289
- #12040: add transpose trace sweeps
- PR: #13252
- Divanovic/llama tg demo
- PR: #13105
- #0: Fix bug in perplexity script for Llama
- PR: #13301
- #0: Update cast in ncrisc BH init code
- PR: #13295
- #0: Move remote chip event synchronization to dispatch core
- PR: #13256
- Vanilla Unet conv unit_test
- PR: #13267
- #11740: Extend post commit coverage and add sweep test
- PR: #13040
- #13269: Revise moreh_norm, moreh_norm_backward operations
- PR: #13270
- #13140: Cleanup Binary Backward ops
- PR: #13286
- #13315: Revise moreh_bmm, moreh_bmm_backward operations
- PR: #13316
- #0: TG Llama3-70b - fix frequent tests
- PR: #13322
- Revert "#11962: remove uint8 unpack reconfig code"
- PR: #13306
- Llama318b continuous batching + Paged Attention Support
- PR: #13205
- #0: Remove demo output files from Llama3.1-8B
- PR: #13325
- #11592: use the semaphore indices returned by CreateSemaphore
- PR: #13297
- #9370: removed ndpcc work around and debug code in sdpa decode and re-enabled CI
- PR: #13299
- #0: Bump trace region size to 20MB for T3K LLAMA2
- PR: #13309
- Not holding state for freshening profiler logs
- PR: #13335
- #13136: Consolidate all_gather and line_all_gather to common api
- PR: #13148
- #11005: Added CreateKernelFromString()
- PR: #12789
- #11622: sweep concat traces
- PR: #13345
- #0: Bump ttnn bert perf threshold to account for recent refactoring
- PR: #13346
- #0: fix CCL nightly and frequent test regression suites
- PR: #13349
- #13142: Add documentation for device ops, memory config
- PR: #13166
- #13128: Add cmake options to control what tests get built
- PR: #13251
- [skip ci] Update CODEOWNERS for CMakeLists.txt
- PR: #13221
- Update matrix_engine.md
- PR: #13350
- #13258: build_metal.sh enhancements
- PR: #13259
- Flash decode improvements r3
- PR: #13351
- #0: shortened flash decode tests to avoid potential timeout in fast dispatch
- PR: #13358
- #12632: Migrate moreh_layer_norm operation from tt_eager to ttnn
- PR: #12633
- #11844: Add dispatch_s for asynchronously sending go signals
- PR: #13069
- #12805: Migrate moreh_sum_backward operation from tt_eager to ttnn
- PR: #12806
- #13187: revise moreh_mean and moreh_mean_backward
- PR: #13260
- #12687: port moreh_group_norm and moreh_group_norm_backward from tt_dnn to ttnn
- PR: #12755
- #12694 Refactor moreh_linear and moreh_linear_backward
- PR: #12812
- #13246: Remove unary_backward_op.hpp
- PR: #13247
- #0: integrate distributed sharded layernorm with llama-tg
- PR: #13225
- Add support for matmul 1D having L1 sharded weights
- PR: #13094
- #11791: linker script cleanups
- PR: #13305
- #0: Add copy sweep
- PR: #13356
- #12214: refactor moreh_sgd from deprecated to ttnn
- PR: #12378
- [Nightly fast dispatch CI] Fix Llama3.1-8B tests running out of memory
- PR: #13362
- Update perf target for one falcon7b config due to CI variation
- PR: #13355
- Add bitwise ops sweeps, add gen_rand_bitwise_left_shift function
- PR: #13366
- Multiple watcher-related updates
- PR: #13029
- #11621: add filler sweeps for expand, fill, split_with_sizes, index_select and .t
- PR: #13359
- #13363: Surface job errors where Set up runner does not complete successfully
- PR: #13379
- #13127: Remove shape_without_padding() pybinding and usage
- PR: #13369
- #11208: Refactor ProgramCache to remove nested type erasure
- PR: #13216
- #11208: Slotmap datastructure for creating resource pools
- PR: #13378
- #13365: added program caching...
v0.53.0-rc9
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.
The changelog will now follow, showing the changes from last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/11207084989
📦 Uncategorized
- [skip ci] #0: ViT tech report
- PR: #13032
- Mchiou/11762 build tt metal in docker
- PR: #13033
- #13013: Added tests to run in TGG unit tests workflow
- PR: #13016
- [skip ci] #13019 Update remove-stale-branches.yaml
- PR: #13025
- Mchiou/0 fix docker build storage
- PR: #13042
- #11531: Autogenerate API rst stub files, add summary table on API page
- PR: #12075
- Add --no-advice to perf report, small fixes
- PR: #13048
- preserve fp32 precision
- PR: #12794
- #0: Remove unnecessary using declarations
- PR: #13056
- #12775: Cleanup docker run action
- PR: #12777
- #0: Update to gcc-12.x, take 2
- PR: #12999
- #12945: update galaxy/n150 eth dispatch cores
- PR: #13031
- #13070: fix SD
- PR: #13073
- Update Llama codeowners
- PR: #12116
- #0: fix uncaught edge case in page update cache and added it in test suite
- PR: #13074
- #12754: Migrate moreh_nll_loss operations (reduced and unreduced) from tt_eager to ttnn
- PR: #12807
- #8633: Add TT_Fatal for full and ones op
- PR: #12921
- #12985: Expose ttnn::ccl::Topology at python level
- PR: #12988
- #12556: Add queue_id and optional output tensors to assign_bw
- PR: #12573
- Support for increasing 1-D row major int32 tensors by one
- PR: #12773
- #12828: update ttnn matmul doc string
- PR: #13071
- Llama 3.1 8b DRAM-sharded matmuls
- PR: #12869
- Update perf and latest features for llm models (Sept 23)
- PR: #13064
- Work around CSV reporting 64 cores for DRAM-sharded matmuls
- PR: #13108
- #0: Fix PCC to correct bound
- PR: #13110
- #0: Simplify llrt/memory API
- PR: #13067
- #0: Fix caching race
- PR: #13063
- #0: Fix merge error with 80d6e48
- PR: #13112
- #11004: moreh: use env var for kernel src search path
- PR: #12541
- #12328: Fix Llama3.1-8B MLP tests running out of L1
- PR: #13113
- #11769: extend support for transposing/permuting bfloat8 tensors on n…
- PR: #13018
- #12141: Fixed matmul shape validation issue
- PR: #12989
- #0: move BufferType to device kernel accessible location
- PR: #12984
- #12658: update sweep export script and create initial graph script
- PR: #13051
- #0: ViT on WH
- PR: #13072
- [skip ci] Update README.md (ViT on n150)
- PR: #13119
- #0: Bump resnet50 ttnn 2cq compile time because it regressed likely due to gcc risc-v upgrade
- PR: #13121
- #0: Update WH Resnet compile time threshold
- PR: #13115
- Flash decode improvements r2
- PR: #13028
- #0: added support for n_heads > 1 for page cache prefill
- PR: #13117
- #0: Bump mamba compile time as it's not that important and the model is still performant, need to unblock people…
- PR: #13130
- #0: move Layout enum to device accessible location
- PR: #13118
- #0: Bump distilbert compile time because it keeps failing on it
- PR: #13135
- #13088: Cleanup set-1 unary backward ops
- PR: #13096
- #10033: Add forward support for gcd and lcm
- PR: #10241
- #13150: Cleanup LCM, GCD Macro
- PR: #13151
- Llama3.1 8b demo with tracing
- PR: #13153
- #13058: update matmul bias size validation
- PR: #13104
- #0: (MINOR) Update to v0.53.0
- PR: #13165
- #0: try with python 3.10
- PR: #13168
- #13145: Temporarily revert Resnet on Galaxy to use slower config for first conv to avoid hangs
- PR: #13146
- #0: Remove unnecessary ProgramDeleter
- PR: #13134
- #13127: Switch python get_legacy_shape to shape.with_tile_padding()
- PR: #13124
- Add sweeps for remainder, fmod, minimum, maximum, logical_and eltwise ops, rename eltwise sweeps
- PR: #13099
- Fix Yolo tests after updating weights shape in conv2d
- PR: #13163
- #13172: Use lower python version and cache dependencies
- PR: #13173
- #11830: Move l1/dram/pcie alignment into HAL
- PR: #12983
- #13014: optimize slice by adding a 4D uint32_t array implementation o…
- PR: #13125
- Add llk support for cumsum and transpose_wh_dest with relevant tests
- PR: #12925
- Add numeric stable option for softmax
- PR: #13068
- #12878: Add links to job and pipeline for CI/CD analytics
- PR: #13183
- #0: fix CCL nightly tests
- PR: #13164
- #12919: Cleanup set-2 Unary Backward ops
- PR: #13138
- #8865: Add sharded tensor support to dispatch profile infra
- PR: #12871
- #0: Update CODEOWNERS for ttnn/ttnn/operations/moreh.py
- PR: #13185
- #13137: Revise moreh_arange operation
- PR: #13139
- #13095: Refactor moreh_nll_loss operations
- PR: #13097
- #10439: ttnn implementation of vgg model
- PR: #12511
- #13175: Add new category to summary table in sweeps query tool
- PR: #13176
- #5174: Disable command buffer FIFOs on BH
- PR: #13079
- Update CODEOWNERS
- PR: #13209
- Fix demo_trace and add on-device argmax to test_llama_perf
- PR: #13201
- #0: fix program caching bug in post_all_gather
- PR: #13224
- Do not require test dispatch workflow to run on "in-service" runners
- PR: #12660
- Add description to describe typical labels one could use in test dispatch workflow
- PR: #13228
- Add an option to split dprint output by risc
- PR: #13131
- Add new "choose your own pipeline" workflow
- PR: #13230
- #11962: remove uint8 unpack reconfig code
- PR: #13218
- Add tg and tgg frequent tests to "Choose your pipeline" workflow
- PR: #13236
- Add options to select a subset of pipelines that a user would like to run
- PR: #13237
- Update names of perf-models and perf-device-models jobs
- PR: #13238
- #13086: Revising moreh_getitem
- PR: #13087
- Sweeps: log, log1p, log2, log10
- PR: #13045
- #12721: Cleanup set-3 Unary Backward ops
- PR: #13207
- #13212: Cleanup set-4 Unary backward ops
- PR: #13214
- Add initial (very limited) support for line reduce scatter
- PR: #13133
- pack kernel binary memory spans into one
- PR: #12977
- #13242: Cleanup set-5 unary backward ops
- PR: #13243
- [skip ci] Update CODEOWNERS for TT-NN
- PR: #13220
- #13084: fix return vector optional tensor with launch_op
- PR: #13085
- #12757: update math function for ops
- PR: #13001
- #11512: Added sweep for ttnn.bcast
- PR: #13200
- #0: update all-gather tests to remove all_devices test fixture
- PR: #13262
- Llama device perf optimizations
- PR: #12953
- Tensor-parallel Llama3.1 8b bringup on n300
- PR: #13160
- [skip ci] Add last update date to LLM table in README
- PR: #13226
- #13285: Add arch tag for galaxy workflows that didn't have it because a) we should specify and b) we need it for data collection
- PR: #13287
- #0: Optimize untilize_with_unpad for W 16
- PR: #13114
- Update slack notification owner for t3k-model-perf-falcon7b
- PR: #13289
- #12040: add transpose trace sweeps
- PR: #13252
- Divanovic/llama tg demo
- PR: #13105
- #0: Fix bug in perplexity script for Llama
- PR: #13301
- #0: Update cast in ncrisc BH init code
- PR: #13295
- #0: Move remote chip event synchronization to dispatch core
- PR: #13256
- Vanilla Unet conv unit_test
- PR: #13267
- #11740: Extend post commit coverage and add sweep test
- PR: #13040
- #13269: Revise moreh_norm, moreh_norm_backward operations
- PR: #13270
- #13140: Cleanup Binary Backward ops
- PR: #13286
- #13315: Revise moreh_bmm, moreh_bmm_backward operations
- PR: #13316
- #0: TG Llama3-70b - fix frequent tests
- PR: #13322
- Revert "#11962: remove uint8 unpack reconfig code"
- PR: #13306
- Llama318b continuous batching + Paged Attention Support
- PR: #13205
- #0: Remove demo output files from Llama3.1-8B
- PR: #13325
- #11592: use the semaphore indices returned by CreateSemaphore
- PR: #13297
- #9370: removed ndpcc work around and debug code in sdpa decode and re-enabled CI
- PR: #13299
- #0: Bump trace region size to 20MB for T3K LLAMA2
- PR: #13309
- Not holding state for freshening profiler logs
- PR: #13335
- #13136: Consolidate `all_gather` and `line_all_gather` to common api
- PR: #13148
- #11005: Added CreateKernelFromString()
- PR: #12789
- #11622: sweep concat traces
- PR: #13345
- #0: Bump ttnn bert perf threshold to account for recent refactoring
- PR: #13346
- #0: fix CCL nightly and frequent test regression suites
- PR: #13349
- #13142: Add documentation for device ops, memory config
- PR: #13166
- #13128: Add cmake options to control what tests get built
- PR: #13251
- [skip ci] Update CODEOWNERS for CMakeLists.txt
- PR: #13221
- Update matrix_engine.md
- PR: #13350
- #13258: build_metal.sh enhancements
- PR: #13259
- Flash decode improvements r3
- PR: #13351
- #0: shortened flash decode tests to avoid potential timeout in fast dispatch
- PR: #13358
- #12632: Migrate moreh_layer_norm operation from tt_eager to ttnn
- PR: #12633
- #11844: Add dispatch_s for asynchronously sending go signals
- PR: #13069
- #12805: Migrate moreh_sum_backward operation from tt_eager to ttnn
- PR: #12806
- #13187: revise `moreh_mean` and `moreh_mean_backward`
- PR: #13260
- #12687: port `moreh_group_norm` and `moreh_group_norm_backward` from tt_dnn to ttnn
- PR: #12755
- #12694: Refactor moreh_linear and moreh_linear_backward
- PR: #12812
- #13246: Remove unary_backward_op.hpp
- PR: #13247
- #0: integrate distributed sharded layernorm with llama-tg
- PR: #13225
- Add support for matmul 1D having L1 sharded weights
- PR: #13094
- #11791: linker script cleanups
- PR: #13305
- #0: Add copy sweep
- PR: #13356
- #12214: refactor moreh_sgd from deprecated to ttnn
- PR: #12378
- [Nightly fa...
v0.53.0-rc8
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/11189307373
📦 Uncategorized
- [skip ci] #0: ViT report edits
- PR: #13015
- #12879: Use () so that workflow_call actually captures the call when we trigger off completed workflow runs and add them to workflows to properly capture
- PR: #13012
- [skip ci] #13019 Create remove-stale-branches.yaml
- PR: #13020
- #13019 Update remove-stale-branches.yaml
- PR: #13021
- Add tiny tile support for Tensor, matmul
- PR: #12908
- [skip ci] #13019 Add default recipient
- PR: #13023
- build tt metal in docker in CI
- PR: #11923
- Revert "build tt metal in docker in CI"
- PR: #13027
- [skip ci] #0: ViT tech report
- PR: #13032
- Mchiou/11762 build tt metal in docker
- PR: #13033
- #13013: Added tests to run in TGG unit tests workflow
- PR: #13016
- [skip ci] #13019 Update remove-stale-branches.yaml
- PR: #13025
- Mchiou/0 fix docker build storage
- PR: #13042
- #11531: Autogenerate API rst stub files, add summary table on API page
- PR: #12075
- Add --no-advice to perf report, small fixes
- PR: #13048
- preserve fp32 precision
- PR: #12794
- #0: Remove unnecessary using declarations
- PR: #13056
- #12775: Cleanup docker run action
- PR: #12777
- #0: Update to gcc-12.x, take 2
- PR: #12999
- #12945: update galaxy/n150 eth dispatch cores
- PR: #13031
- #13070: fix SD
- PR: #13073
- Update Llama codeowners
- PR: #12116
- #0: fix uncaught edge case in page update cache and added it in test suite
- PR: #13074
- #12754: Migrate moreh_nll_loss operations (reduced and unreduced) from tt_eager to ttnn
- PR: #12807
- #8633: Add TT_Fatal for full and ones op
- PR: #12921
- #12985: Expose `ttnn::ccl::Topology` at python level
- PR: #12988
- #12556: Add queue_id and optional output tensors to assign_bw
- PR: #12573
- Support for increasing 1-D row major int32 tensors by one
- PR: #12773
- #12828: update ttnn matmul doc string
- PR: #13071
- Llama 3.1 8b DRAM-sharded matmuls
- PR: #12869
- Update perf and latest features for llm models (Sept 23)
- PR: #13064
- Work around CSV reporting 64 cores for DRAM-sharded matmuls
- PR: #13108
- #0: Fix PCC to correct bound
- PR: #13110
- #0: Simplify llrt/memory API
- PR: #13067
- #0: Fix caching race
- PR: #13063
- #0: Fix merge error with 80d6e48
- PR: #13112
- #11004: moreh: use env var for kernel src search path
- PR: #12541
- #12328: Fix Llama3.1-8B MLP tests running out of L1
- PR: #13113
- #11769: extend support for transposing/permuting bfloat8 tensors on n…
- PR: #13018
- #12141: Fixed matmul shape validation issue
- PR: #12989
- #0: move BufferType to device kernel accessible location
- PR: #12984
- #12658: update sweep export script and create initial graph script
- PR: #13051
- #0: ViT on WH
- PR: #13072
- [skip ci] Update README.md (ViT on n150)
- PR: #13119
- #0: Bump resnet50 ttnn 2cq compile time because it regressed likely due to gcc risc-v upgrade
- PR: #13121
- #0: Update WH Resnet compile time threshold
- PR: #13115
- Flash decode improvements r2
- PR: #13028
- #0: added support for n_heads > 1 for page cache prefill
- PR: #13117
- #0: Bump mamba compile time as it's not that important and the model is still performant, need to unblock people…
- PR: #13130
- #0: move Layout enum to device accessible location
- PR: #13118
- #0: Bump distilbert compile time because it keeps failing on it
- PR: #13135
- #13088: Cleanup set-1 unary backward ops
- PR: #13096
- #10033: Add forward support for gcd and lcm
- PR: #10241
- #13150: Cleanup LCM, GCD Macro
- PR: #13151
- Llama3.1 8b demo with tracing
- PR: #13153
- #13058: update matmul bias size validation
- PR: #13104
- #0: (MINOR) Update to v0.53.0
- PR: #13165
- #0: try with python 3.10
- PR: #13168
- #13145: Temporarily revert Resnet on Galaxy to use slower config for first conv to avoid hangs
- PR: #13146
- #0: Remove unnecessary ProgramDeleter
- PR: #13134
- #13127: Switch python get_legacy_shape to shape.with_tile_padding()
- PR: #13124
- Add sweeps for remainder, fmod, minimum, maximum, logical_and eltwise ops, rename eltwise sweeps
- PR: #13099
- Fix Yolo tests after updating weights shape in conv2d
- PR: #13163
- #13172: Use lower python version and cache dependencies
- PR: #13173
- #11830: Move l1/dram/pcie alignment into HAL
- PR: #12983
- #13014: optimize slice by adding a 4D uint32_t array implementation o…
- PR: #13125
- Add llk support for cumsum and transpose_wh_dest with relevant tests
- PR: #12925
- Add numeric stable option for softmax
- PR: #13068
- #12878: Add links to job and pipeline for CI/CD analytics
- PR: #13183
- #0: fix CCL nightly tests
- PR: #13164
- #12919: Cleanup set-2 Unary Backward ops
- PR: #13138
- #8865: Add sharded tensor support to dispatch profile infra
- PR: #12871
- #0: Update CODEOWNERS for ttnn/ttnn/operations/moreh.py
- PR: #13185
- #13137: Revise moreh_arange operation
- PR: #13139
- #13095: Refactor moreh_nll_loss operations
- PR: #13097
- #10439: ttnn implementation of vgg model
- PR: #12511
- #13175: Add new category to summary table in sweeps query tool
- PR: #13176
- #5174: Disable command buffer FIFOs on BH
- PR: #13079
- Update CODEOWNERS
- PR: #13209
- Fix demo_trace and add on-device argmax to test_llama_perf
- PR: #13201
- #0: fix program caching bug in post_all_gather
- PR: #13224
- Do not require test dispatch workflow to run on "in-service" runners
- PR: #12660
- Add description to describe typical labels one could use in test dispatch workflow
- PR: #13228
- Add an option to split dprint output by risc
- PR: #13131
- Add new "choose your own pipeline" workflow
- PR: #13230
- #11962: remove uint8 unpack reconfig code
- PR: #13218
- Add tg and tgg frequent tests to "Choose your pipeline" workflow
- PR: #13236
- Add options to select a subset of pipelines that a user would like to run
- PR: #13237
- Update names of perf-models and perf-device-models jobs
- PR: #13238
- #13086: Revising moreh_getitem
- PR: #13087
- Sweeps: log, log1p, log2, log10
- PR: #13045
- #12721: Cleanup set-3 Unary Backward ops
- PR: #13207
- #13212: Cleanup set-4 Unary backward ops
- PR: #13214
- Add initial (very limited) support for line reduce scatter
- PR: #13133
- pack kernel binary memory spans into one
- PR: #12977
- #13242: Cleanup set-5 unary backward ops
- PR: #13243
- [skip ci] Update CODEOWNERS for TT-NN
- PR: #13220
- #13084: fix return vector optional tensor with launch_op
- PR: #13085
- #12757: update math function for ops
- PR: #13001
- #11512: Added sweep for ttnn.bcast
- PR: #13200
- #0: update all-gather tests to remove all_devices test fixture
- PR: #13262
- Llama device perf optimizations
- PR: #12953
- Tensor-parallel Llama3.1 8b bringup on n300
- PR: #13160
- [skip ci] Add last update date to LLM table in README
- PR: #13226
- #13285: Add arch tag for galaxy workflows that didn't have it because a) we should specify and b) we need it for data collection
- PR: #13287
- #0: Optimize untilize_with_unpad for W 16
- PR: #13114
- Update slack notification owner for t3k-model-perf-falcon7b
- PR: #13289
- #12040: add transpose trace sweeps
- PR: #13252
- Divanovic/llama tg demo
- PR: #13105
- #0: Fix bug in perplexity script for Llama
- PR: #13301
- #0: Update cast in ncrisc BH init code
- PR: #13295
- #0: Move remote chip event synchronization to dispatch core
- PR: #13256
- Vanilla Unet conv unit_test
- PR: #13267
- #11740: Extend post commit coverage and add sweep test
- PR: #13040
- #13269: Revise moreh_norm, moreh_norm_backward operations
- PR: #13270
- #13140: Cleanup Binary Backward ops
- PR: #13286
- #13315: Revise moreh_bmm, moreh_bmm_backward operations
- PR: #13316
- #0: TG Llama3-70b - fix frequent tests
- PR: #13322
- Revert "#11962: remove uint8 unpack reconfig code"
- PR: #13306
- Llama318b continuous batching + Paged Attention Support
- PR: #13205
- #0: Remove demo output files from Llama3.1-8B
- PR: #13325
- #11592: use the semaphore indices returned by CreateSemaphore
- PR: #13297
- #9370: removed ndpcc work around and debug code in sdpa decode and re-enabled CI
- PR: #13299
- #0: Bump trace region size to 20MB for T3K LLAMA2
- PR: #13309
- Not holding state for freshening profiler logs
- PR: #13335
- #13136: Consolidate `all_gather` and `line_all_gather` to common api
- PR: #13148
- #11005: Added CreateKernelFromString()
- PR: #12789
- #11622: sweep concat traces
- PR: #13345
- #0: Bump ttnn bert perf threshold to account for recent refactoring
- PR: #13346
- #0: fix CCL nightly and frequent test regression suites
- PR: #13349
- #13142: Add documentation for device ops, memory config
- PR: #13166
- #13128: Add cmake options to control what tests get built
- PR: #13251
- [skip ci] Update CODEOWNERS for CMakeLists.txt
- PR: #13221
- Update matrix_engine.md
- PR: #13350
- #13258: build_metal.sh enhancements
- PR: #13259
- Flash decode improvements r3
- PR: #13351
- #0: shortened flash decode tests to avoid potential timeout in fast dispatch
- PR: #13358
- #12632: Migrate moreh_layer_norm operation from tt_eager to ttnn
- PR: #12633
- #11844: Add dispatch_s for asynchronously sending go signals
- PR: #13069
- #12805: Migrate moreh_sum_backward operation from tt_eager to ttnn
- PR: #12806
- #13187: revise `moreh_mean` and `moreh_me...
v0.53.0-rc7
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/11185971169
📦 Uncategorized
- Aliu/tech reports
- PR: #13010
- #11332: Move `ttnn/examples` to `ttnn/ttnn/examples` so we can enable directly calling them for users, but not meant to be part of ttnn API
- PR: #11612
- Add sweeps for sign, deg2rad, rad2deg, relu6
- PR: #12994
- Revert "#10016: jit_build: link substitutes, tdma_xmov, noc"
- PR: #13009
- #12952: Update test_ccl_on_tg.cpp to work on TGG as well as TG
- PR: #12982
- [skip ci] #0: ViT report edits
- PR: #13015
- #12879: Use () so that workflow_call actually captures the call when we trigger off completed workflow runs and add them to workflows to properly capture
- PR: #13012
- [skip ci] #13019 Create remove-stale-branches.yaml
- PR: #13020
- #13019 Update remove-stale-branches.yaml
- PR: #13021
- Add tiny tile support for Tensor, matmul
- PR: #12908
- [skip ci] #13019 Add default recipient
- PR: #13023
- build tt metal in docker in CI
- PR: #11923
- Revert "build tt metal in docker in CI"
- PR: #13027
- [skip ci] #0: ViT tech report
- PR: #13032
- Mchiou/11762 build tt metal in docker
- PR: #13033
- #13013: Added tests to run in TGG unit tests workflow
- PR: #13016
- [skip ci] #13019 Update remove-stale-branches.yaml
- PR: #13025
- Mchiou/0 fix docker build storage
- PR: #13042
- #11531: Autogenerate API rst stub files, add summary table on API page
- PR: #12075
- Add --no-advice to perf report, small fixes
- PR: #13048
- preserve fp32 precision
- PR: #12794
- #0: Remove unnecessary using declarations
- PR: #13056
- #12775: Cleanup docker run action
- PR: #12777
- #0: Update to gcc-12.x, take 2
- PR: #12999
- #12945: update galaxy/n150 eth dispatch cores
- PR: #13031
- #13070: fix SD
- PR: #13073
- Update Llama codeowners
- PR: #12116
- #0: fix uncaught edge case in page update cache and added it in test suite
- PR: #13074
- #12754: Migrate moreh_nll_loss operations (reduced and unreduced) from tt_eager to ttnn
- PR: #12807
- #8633: Add TT_Fatal for full and ones op
- PR: #12921
- #12985: Expose `ttnn::ccl::Topology` at python level
- PR: #12988
- #12556: Add queue_id and optional output tensors to assign_bw
- PR: #12573
- Support for increasing 1-D row major int32 tensors by one
- PR: #12773
- #12828: update ttnn matmul doc string
- PR: #13071
- Llama 3.1 8b DRAM-sharded matmuls
- PR: #12869
- Update perf and latest features for llm models (Sept 23)
- PR: #13064
- Work around CSV reporting 64 cores for DRAM-sharded matmuls
- PR: #13108
- #0: Fix PCC to correct bound
- PR: #13110
- #0: Simplify llrt/memory API
- PR: #13067
- #0: Fix caching race
- PR: #13063
- #0: Fix merge error with 80d6e48
- PR: #13112
- #11004: moreh: use env var for kernel src search path
- PR: #12541
- #12328: Fix Llama3.1-8B MLP tests running out of L1
- PR: #13113
- #11769: extend support for transposing/permuting bfloat8 tensors on n…
- PR: #13018
- #12141: Fixed matmul shape validation issue
- PR: #12989
- #0: move BufferType to device kernel accessible location
- PR: #12984
- #12658: update sweep export script and create initial graph script
- PR: #13051
- #0: ViT on WH
- PR: #13072
- [skip ci] Update README.md (ViT on n150)
- PR: #13119
- #0: Bump resnet50 ttnn 2cq compile time because it regressed likely due to gcc risc-v upgrade
- PR: #13121
- #0: Update WH Resnet compile time threshold
- PR: #13115
- Flash decode improvements r2
- PR: #13028
- #0: added support for n_heads > 1 for page cache prefill
- PR: #13117
- #0: Bump mamba compile time as it's not that important and the model is still performant, need to unblock people…
- PR: #13130
- #0: move Layout enum to device accessible location
- PR: #13118
- #0: Bump distilbert compile time because it keeps failing on it
- PR: #13135
- #13088: Cleanup set-1 unary backward ops
- PR: #13096
- #10033: Add forward support for gcd and lcm
- PR: #10241
- #13150: Cleanup LCM, GCD Macro
- PR: #13151
- Llama3.1 8b demo with tracing
- PR: #13153
- #13058: update matmul bias size validation
- PR: #13104
- #0: (MINOR) Update to v0.53.0
- PR: #13165
- #0: try with python 3.10
- PR: #13168
- #13145: Temporarily revert Resnet on Galaxy to use slower config for first conv to avoid hangs
- PR: #13146
- #0: Remove unnecessary ProgramDeleter
- PR: #13134
- #13127: Switch python get_legacy_shape to shape.with_tile_padding()
- PR: #13124
- Add sweeps for remainder, fmod, minimum, maximum, logical_and eltwise ops, rename eltwise sweeps
- PR: #13099
- Fix Yolo tests after updating weights shape in conv2d
- PR: #13163
- #13172: Use lower python version and cache dependencies
- PR: #13173
- #11830: Move l1/dram/pcie alignment into HAL
- PR: #12983
- #13014: optimize slice by adding a 4D uint32_t array implementation o…
- PR: #13125
- Add llk support for cumsum and transpose_wh_dest with relevant tests
- PR: #12925
- Add numeric stable option for softmax
- PR: #13068
- #12878: Add links to job and pipeline for CI/CD analytics
- PR: #13183
- #0: fix CCL nightly tests
- PR: #13164
- #12919: Cleanup set-2 Unary Backward ops
- PR: #13138
- #8865: Add sharded tensor support to dispatch profile infra
- PR: #12871
- #0: Update CODEOWNERS for ttnn/ttnn/operations/moreh.py
- PR: #13185
- #13137: Revise moreh_arange operation
- PR: #13139
- #13095: Refactor moreh_nll_loss operations
- PR: #13097
- #10439: ttnn implementation of vgg model
- PR: #12511
- #13175: Add new category to summary table in sweeps query tool
- PR: #13176
- #5174: Disable command buffer FIFOs on BH
- PR: #13079
- Update CODEOWNERS
- PR: #13209
- Fix demo_trace and add on-device argmax to test_llama_perf
- PR: #13201
- #0: fix program caching bug in post_all_gather
- PR: #13224
- Do not require test dispatch workflow to run on "in-service" runners
- PR: #12660
- Add description to describe typical labels one could use in test dispatch workflow
- PR: #13228
- Add an option to split dprint output by risc
- PR: #13131
- Add new "choose your own pipeline" workflow
- PR: #13230
- #11962: remove uint8 unpack reconfig code
- PR: #13218
- Add tg and tgg frequent tests to "Choose your pipeline" workflow
- PR: #13236
- Add options to select a subset of pipelines that a user would like to run
- PR: #13237
- Update names of perf-models and perf-device-models jobs
- PR: #13238
- #13086: Revising moreh_getitem
- PR: #13087
- Sweeps: log, log1p, log2, log10
- PR: #13045
- #12721: Cleanup set-3 Unary Backward ops
- PR: #13207
- #13212: Cleanup set-4 Unary backward ops
- PR: #13214
- Add initial (very limited) support for line reduce scatter
- PR: #13133
- pack kernel binary memory spans into one
- PR: #12977
- #13242: Cleanup set-5 unary backward ops
- PR: #13243
- [skip ci] Update CODEOWNERS for TT-NN
- PR: #13220
- #13084: fix return vector optional tensor with launch_op
- PR: #13085
- #12757: update math function for ops
- PR: #13001
- #11512: Added sweep for ttnn.bcast
- PR: #13200
- #0: update all-gather tests to remove all_devices test fixture
- PR: #13262
- Llama device perf optimizations
- PR: #12953
- Tensor-parallel Llama3.1 8b bringup on n300
- PR: #13160
- [skip ci] Add last update date to LLM table in README
- PR: #13226
- #13285: Add arch tag for galaxy workflows that didn't have it because a) we should specify and b) we need it for data collection
- PR: #13287
- #0: Optimize untilize_with_unpad for W 16
- PR: #13114
- Update slack notification owner for t3k-model-perf-falcon7b
- PR: #13289
- #12040: add transpose trace sweeps
- PR: #13252
- Divanovic/llama tg demo
- PR: #13105
- #0: Fix bug in perplexity script for Llama
- PR: #13301
- #0: Update cast in ncrisc BH init code
- PR: #13295
- #0: Move remote chip event synchronization to dispatch core
- PR: #13256
- Vanilla Unet conv unit_test
- PR: #13267
- #11740: Extend post commit coverage and add sweep test
- PR: #13040
- #13269: Revise moreh_norm, moreh_norm_backward operations
- PR: #13270
- #13140: Cleanup Binary Backward ops
- PR: #13286
- #13315: Revise moreh_bmm, moreh_bmm_backward operations
- PR: #13316
- #0: TG Llama3-70b - fix frequent tests
- PR: #13322
- Revert "#11962: remove uint8 unpack reconfig code"
- PR: #13306
- Llama318b continuous batching + Paged Attention Support
- PR: #13205
- #0: Remove demo output files from Llama3.1-8B
- PR: #13325
- #11592: use the semaphore indices returned by CreateSemaphore
- PR: #13297
- #9370: removed ndpcc work around and debug code in sdpa decode and re-enabled CI
- PR: #13299
- #0: Bump trace region size to 20MB for T3K LLAMA2
- PR: #13309
- Not holding state for freshening profiler logs
- PR: #13335
- #13136: Consolidate `all_gather` and `line_all_gather` to common api
- PR: #13148
- #11005: Added CreateKernelFromString()
- PR: #12789
- #11622: sweep concat traces
- PR: #13345
- #0: Bump ttnn bert perf threshold to account for recent refactoring
- PR: #13346
- #0: fix CCL nightly and frequent test regression suites
- PR: #13349
- #13142: Add documentation for device ops, memory config
- PR: #13166
- #13128: Add cmake options to control what tests get built
- PR: #13251
- [skip ci] Update CODEOWNERS for CMakeLists.txt
- PR: #13221
- Update matrix_engine.md
- PR: #13350
- #13258: build_metal.sh enhancements
- PR: #13259
- Flash decode imp...