Releases: tenstorrent/tt-metal

v0.54.0-rc21

11 Jan 02:07
ca2c867
v0.54.0-rc21 Pre-release

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12719934369

📦 Uncategorized

  • Add buffering to DPRINT
  • Clean-up the usage of deallocate_activation
  • llm tech report multi device section
  • Add prefill v decode section to LLM tech report [section 3.2]
  • #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
  • [LLM tech report] Add accuracy evaluation and debugging sections
  • #16165: Disabling test that depends on some machine state to pass
  • enable dps ops for matmul
  • Isolate tracy
  • [TT-Train] Added tests for sum and mean
  • #16184: Try using ecr to avoid rate limits of docker.io
  • #15221: Post completion messages to dispatch_s
  • [TT-Train] Added softmax backward
  • Optimized FreeList allocator
  • Set the test data to be relative to the test binary
  • #0: Fix matmul doc string
  • #0: remove spammy warning from conftest
  • Update generating unicast go signal commands to ensure dispatch write linear respects alignment
  • LLM tech report sections 2.2, 2.5
  • [TT-Train] Fix tracy deps in the tt-train cmake
  • Updating Allocator docs to explain first fit usage
  • Adding asserts for hanging cases in ND tilize/untilize support
  • Fix ttnn.reallocate when unaligned RM tensors are used
  • #15891: improve full accuracy and fix full bugs
  • Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
  • #15857: Skip abs forge for GS
  • #16213: Use our own forked Docker Run Action that points to ECR
  • Add max kernel size for each risc type in an op
  • Infer Conv2dTranspose parameters during model preprocessing
  • #12662: add keepdim fixes to reduce
  • Add chunked prefill to Llama family
  • #15342: Add mirror_kernels option to conv_transpose2d
  • Update CODEOWNERS
  • support reduction for 3d & 4d dims
  • #5605: Only force-stall ethernet programs on earlier ethernet programs
  • Add full support for creating tensors with logical sharding from python
  • update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
  • #15857: Binary Forge Sweep Tests Set2
  • #14976/#15039: Add Support For ceil_mode=True
  • Add missing cache invalidates + loads before stores noc optimization for BH
  • Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
  • New FD Init Flow
  • Add support for output sharded embeddings
  • Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
  • #0: Enforce tile layout when using bf4/bf8 data types
  • MeshDevice: Support Quanta Galaxy system file
  • Move Device members from public to private
  • Add unary sharded sweeps
  • #0: Added core_grid offset for sharded layernorm
  • fix abs path bug for sweeps tests code
  • #0: Publish TT-Distributed doc under tech_reports
  • #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
  • #16265: Remove creation op
  • Fix unsigned arithmetic bugs in reshape ops
  • Fix compile issue for earlier c++ versions
  • #0: Typo fix in TT distributed tech report
  • [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
  • LLM tech report sections 3.1, 3.4, 3.5
  • LLM Tech report section 4.4
  • Move some Device methods to private section
  • #0: [skip_ci] Update Distributed Tech Report with Discord Server link
  • #15857: Binary Forge Sweep Tests Set1
  • #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
  • #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small

v0.54.0-rc20

10 Jan 02:07
14dac66
v0.54.0-rc20 Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12701599118

📦 Uncategorized

  • Add buffering to DPRINT
  • Python -> Python3
  • #15713 Bad Eltwise Binary ZEROACC
  • #15565 Fix unit test to show sharding ttnn.from_torch problems
  • Fix paged SDPA decode CB sizing issue
  • Reland async dispatch with workaround for hang.
  • #16119: Add forge traces to matmul and reduce sweeps
  • #10034: Binary shift operators
  • #0: Remove incorrect memory span assert
  • Add forge sweeps for slice and transpose
  • #0: Move memory config serialization in the corresponding header away from types.hpp
  • #16114: Allow Binarized Programs to be Reused across WH Devices
  • #0: aligning conv2d transpose as conv
  • support missing cases for sweep tests
  • #0: added normalization details in the tech report
  • Fix ttnn.from_torch for 0D/1D tensors with tile layout
  • Port all Moreh OPs to compute_output_specs
  • Bump umd to fix grayskull cluster bug
  • Clean-up the usage of deallocate_activation
  • llm tech report multi device section
  • Add prefill v decode section to LLM tech report [section 3.2]
  • #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
  • [LLM tech report] Add accuracy evaluation and debugging sections
  • #16165: Disabling test that depends on some machine state to pass
  • enable dps ops for matmul
  • Isolate tracy
  • [TT-Train] Added tests for sum and mean
  • #16184: Try using ecr to avoid rate limits of docker.io
  • #15221: Post completion messages to dispatch_s
  • [TT-Train] Added softmax backward
  • Optimized FreeList allocator
  • Set the test data to be relative to the test binary
  • #0: Fix matmul doc string
  • #0: remove spammy warning from conftest
  • Update generating unicast go signal commands to ensure dispatch write linear respects alignment
  • LLM tech report sections 2.2, 2.5
  • [TT-Train] Fix tracy deps in the tt-train cmake
  • Updating Allocator docs to explain first fit usage
  • Adding asserts for hanging cases in ND tilize/untilize support
  • Fix ttnn.reallocate when unaligned RM tensors are used
  • #15891: improve full accuracy and fix full bugs
  • Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
  • #15857: Skip abs forge for GS
  • #16213: Use our own forked Docker Run Action that points to ECR
  • Add max kernel size for each risc type in an op
  • Infer Conv2dTranspose parameters during model preprocessing
  • #12662: add keepdim fixes to reduce
  • Add chunked prefill to Llama family
  • #15342: Add mirror_kernels option to conv_transpose2d
  • Update CODEOWNERS
  • support reduction for 3d & 4d dims
  • #5605: Only force-stall ethernet programs on earlier ethernet programs
  • Add full support for creating tensors with logical sharding from python
  • update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
  • #15857: Binary Forge Sweep Tests Set2
  • #14976/#15039: Add Support For ceil_mode=True
  • Add missing cache invalidates + loads before stores noc optimization for BH
  • Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
  • New FD Init Flow
  • Add support for output sharded embeddings
  • Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
  • #0: Enforce tile layout when using bf4/bf8 data types
  • MeshDevice: Support Quanta Galaxy system file
  • Move Device members from public to private
  • Add unary sharded sweeps
  • #0: Added core_grid offset for sharded layernorm
  • fix abs path bug for sweeps tests code
  • #0: Publish TT-Distributed doc under tech_reports
  • #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
  • #16265: Remove creation op
  • Fix unsigned arithmetic bugs in reshape ops
  • Fix compile issue for earlier c++ versions
  • #0: Typo fix in TT distributed tech report
  • [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
  • LLM tech report sections 3.1, 3.4, 3.5
  • LLM Tech report section 4.4
  • Move some Device methods to private section
  • #0: [skip_ci] Update Distributed Tech Report with Discord Server link
  • #15857: Binary Forge Sweep Tests Set1
  • #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
  • #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small

v0.54.0-rc19

08 Jan 02:06
v0.54.0-rc19 Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12662398466

📦 Uncategorized

  • Add buffering to DPRINT
  • Python -> Python3
  • #0: separate validation of conv weight and bias.
  • #0: Minor refactor of pytensor and tensor implementation files
  • C++ files should not be part of the API of a library
  • #15857: Forge sweep test
  • #15857: Unary forge sweep tests
  • Fix some more namespace pollution caused by using namespace tt::tt_metal
  • #15713 Bad Eltwise Binary ZEROACC
  • #15565 Fix unit test to show sharding ttnn.from_torch problems
  • Fix paged SDPA decode CB sizing issue
  • Reland async dispatch with workaround for hang.
  • #16119: Add forge traces to matmul and reduce sweeps
  • #10034: Binary shift operators
  • #0: Remove incorrect memory span assert
  • Add forge sweeps for slice and transpose
  • #0: Move memory config serialization in the corresponding header away from types.hpp
  • #16114: Allow Binarized Programs to be Reused across WH Devices
  • #0: aligning conv2d transpose as conv
  • support missing cases for sweep tests
  • #0: added normalization details in the tech report
  • Fix ttnn.from_torch for 0D/1D tensors with tile layout
  • Port all Moreh OPs to compute_output_specs
  • Bump umd to fix grayskull cluster bug
  • Clean-up the usage of deallocate_activation
  • llm tech report multi device section
  • Add prefill v decode section to LLM tech report [section 3.2]
  • #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
  • [LLM tech report] Add accuracy evaluation and debugging sections
  • #16165: Disabling test that depends on some machine state to pass
  • enable dps ops for matmul
  • Isolate tracy
  • [TT-Train] Added tests for sum and mean
  • #16184: Try using ecr to avoid rate limits of docker.io
  • #15221: Post completion messages to dispatch_s
  • [TT-Train] Added softmax backward
  • Optimized FreeList allocator
  • Set the test data to be relative to the test binary
  • #0: Fix matmul doc string
  • #0: remove spammy warning from conftest
  • Update generating unicast go signal commands to ensure dispatch write linear respects alignment
  • LLM tech report sections 2.2, 2.5
  • [TT-Train] Fix tracy deps in the tt-train cmake
  • Updating Allocator docs to explain first fit usage
  • Adding asserts for hanging cases in ND tilize/untilize support
  • Fix ttnn.reallocate when unaligned RM tensors are used
  • #15891: improve full accuracy and fix full bugs
  • Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
  • #15857: Skip abs forge for GS
  • #16213: Use our own forked Docker Run Action that points to ECR
  • Add max kernel size for each risc type in an op
  • Infer Conv2dTranspose parameters during model preprocessing
  • #12662: add keepdim fixes to reduce
  • Add chunked prefill to Llama family
  • #15342: Add mirror_kernels option to conv_transpose2d
  • Update CODEOWNERS
  • support reduction for 3d & 4d dims
  • #5605: Only force-stall ethernet programs on earlier ethernet programs
  • Add full support for creating tensors with logical sharding from python
  • update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
  • #15857: Binary Forge Sweep Tests Set2
  • #14976/#15039: Add Support For ceil_mode=True
  • Add missing cache invalidates + loads before stores noc optimization for BH
  • Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
  • New FD Init Flow
  • Add support for output sharded embeddings
  • Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
  • #0: Enforce tile layout when using bf4/bf8 data types
  • MeshDevice: Support Quanta Galaxy system file
  • Move Device members from public to private
  • Add unary sharded sweeps
  • #0: Added core_grid offset for sharded layernorm
  • fix abs path bug for sweeps tests code
  • #0: Publish TT-Distributed doc under tech_reports
  • #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
  • #16265: Remove creation op
  • Fix unsigned arithmetic bugs in reshape ops
  • Fix compile issue for earlier c++ versions
  • #0: Typo fix in TT distributed tech report
  • [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
  • LLM tech report sections 3.1, 3.4, 3.5
  • LLM Tech report section 4.4
  • Move some Device methods to private section
  • #0: [skip_ci] Update Distributed Tech Report with Discord Server link
  • #15857: Binary Forge Sweep Tests Set1
  • #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
  • #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small

v0.54.0-rc18

07 Jan 02:28
bf94433
v0.54.0-rc18 Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12643496109

📦 Uncategorized

  • Add buffering to DPRINT
  • #0: Remove some dead code
  • Updated installation script
  • Python -> Python3
  • Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute
  • Adding ND support for tilize/untilize with padding
  • [Llama3.2-11b vLLM Integration] Add support for paged cross attention, fixes for continuous batching, simplified decode forward call
  • #0: Enable Local Sweeps and Use a Faster Interprocess Queue
  • #15601: Implement support for MeshDevice::reshape(..)
  • Remove setup_core_to_tlb_map
  • #0: Let sharded_to_interleaved handle interleaved input
  • #0: separate validation of conv weight and bias.
  • #0: Minor refactor of pytensor and tensor implementation files
  • C++ files should not be part of the API of a library
  • #15857: Forge sweep test
  • #15857: Unary forge sweep tests
  • Fix some more namespace pollution caused by using namespace tt::tt_metal
  • #15713 Bad Eltwise Binary ZEROACC
  • #15565 Fix unit test to show sharding ttnn.from_torch problems
  • Fix paged SDPA decode CB sizing issue
  • Reland async dispatch with workaround for hang.
  • #16119: Add forge traces to matmul and reduce sweeps
  • #10034: Binary shift operators
  • #0: Remove incorrect memory span assert
  • Add forge sweeps for slice and transpose
  • #0: Move memory config serialization in the corresponding header away from types.hpp
  • #16114: Allow Binarized Programs to be Reused across WH Devices
  • #0: aligning conv2d transpose as conv
  • support missing cases for sweep tests
  • #0: added normalization details in the tech report
  • Fix ttnn.from_torch for 0D/1D tensors with tile layout
  • Port all Moreh OPs to compute_output_specs
  • Bump umd to fix grayskull cluster bug
  • Clean-up the usage of deallocate_activation
  • llm tech report multi device section
  • Add prefill v decode section to LLM tech report [section 3.2]
  • #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
  • [LLM tech report] Add accuracy evaluation and debugging sections
  • #16165: Disabling test that depends on some machine state to pass
  • enable dps ops for matmul
  • Isolate tracy
  • [TT-Train] Added tests for sum and mean
  • #16184: Try using ecr to avoid rate limits of docker.io
  • #15221: Post completion messages to dispatch_s
  • [TT-Train] Added softmax backward
  • Optimized FreeList allocator
  • Set the test data to be relative to the test binary
  • #0: Fix matmul doc string
  • #0: remove spammy warning from conftest
  • Update generating unicast go signal commands to ensure dispatch write linear respects alignment
  • LLM tech report sections 2.2, 2.5
  • [TT-Train] Fix tracy deps in the tt-train cmake
  • Updating Allocator docs to explain first fit usage
  • Adding asserts for hanging cases in ND tilize/untilize support
  • Fix ttnn.reallocate when unaligned RM tensors are used
  • #15891: improve full accuracy and fix full bugs
  • Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
  • #15857: Skip abs forge for GS
  • #16213: Use our own forked Docker Run Action that points to ECR
  • Add max kernel size for each risc type in an op
  • Infer Conv2dTranspose parameters during model preprocessing
  • #12662: add keepdim fixes to reduce
  • Add chunked prefill to Llama family
  • #15342: Add mirror_kernels option to conv_transpose2d
  • Update CODEOWNERS
  • support reduction for 3d & 4d dims
  • #5605: Only force-stall ethernet programs on earlier ethernet programs
  • Add full support for creating tensors with logical sharding from python
  • update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
  • #15857: Binary Forge Sweep Tests Set2
  • #14976/#15039: Add Support For ceil_mode=True
  • Add missing cache invalidates + loads before stores noc optimization for BH
  • Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
  • New FD Init Flow
  • Add support for output sharded embeddings
  • Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
  • #0: Enforce tile layout when using bf4/bf8 data types
  • MeshDevice: Support Quanta Galaxy system file
  • Move Device members from public to private
  • Add unary sharded sweeps
  • #0: Added core_grid offset for sharded layernorm
  • fix abs path bug for sweeps tests code
  • #0: Publish TT-Distributed doc under tech_reports
  • #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
  • #16265: Remove creation op
  • Fix unsigned arithmetic bugs in reshape ops
  • Fix compile issue for earlier c++ versions
  • #0: Typo fix in TT distributed tech report
  • [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
  • LLM tech report sections 3.1, 3.4, 3.5
  • LLM Tech report section 4.4
  • Move some Device methods to private section
  • #0: [skip_ci] Update Distributed Tech Report with Discord Server link
  • #15857: Binary Forge Sweep Tests Set1
  • #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
  • #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small

v0.54.0-rc17

06 Jan 02:30
cb02e39
v0.54.0-rc17 Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12624900279

📦 Uncategorized

  • Add buffering to DPRINT
  • #0: Remove some dead code
  • Updated installation script
  • Python -> Python3
  • Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute
  • Adding ND support for tilize/untilize with padding
  • [Llama3.2-11b vLLM Integration] Add support for paged cross attention, fixes for continuous batching, simplified decode forward call
  • #0: Enable Local Sweeps and Use a Faster Interprocess Queue
  • #15601: Implement support for MeshDevice::reshape(..)
  • Remove setup_core_to_tlb_map
  • #0: Let sharded_to_interleaved handle interleaved input
  • #0: separate validation of conv weight and bias.
  • #0: Minor refactor of pytensor and tensor implementation files
  • C++ files should not be part of the API of a library
  • #15857: Forge sweep test
  • #15857: Unary forge sweep tests
  • Fix some more namespace pollution caused by using namespace tt::tt_metal
  • #15713 Bad Eltwise Binary ZEROACC
  • #15565 Fix unit test to show sharding ttnn.from_torch problems
  • Fix paged SDPA decode CB sizing issue
  • Reland async dispatch with workaround for hang.
  • #16119: Add forge traces to matmul and reduce sweeps
  • #10034: Binary shift operators
  • #0: Remove incorrect memory span assert
  • Add forge sweeps for slice and transpose
  • #0: Move memory config serialization in the corresponding header away from types.hpp
  • #16114: Allow Binarized Programs to be Reused across WH Devices
  • #0: aligning conv2d transpose as conv
  • support missing cases for sweep tests
  • #0: added normalization details in the tech report
  • Fix ttnn.from_torch for 0D/1D tensors with tile layout
  • Port all Moreh OPs to compute_output_specs
  • Bump umd to fix grayskull cluster bug
  • Clean-up the usage of deallocate_activation
  • llm tech report multi device section
  • Add prefill v decode section to LLM tech report [section 3.2]
  • #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
  • [LLM tech report] Add accuracy evaluation and debugging sections
  • #16165: Disabling test that depends on some machine state to pass
  • enable dps ops for matmul
  • Isolate tracy
  • [TT-Train] Added tests for sum and mean
  • #16184: Try using ecr to avoid rate limits of docker.io
  • #15221: Post completion messages to dispatch_s
  • [TT-Train] Added softmax backward
  • Optimized FreeList allocator
  • Set the test data to be relative to the test binary
  • #0: Fix matmul doc string
  • #0: remove spammy warning from conftest
  • Update generating unicast go signal commands to ensure dispatch write linear respects alignment
  • LLM tech report sections 2.2, 2.5
  • [TT-Train] Fix tracy deps in the tt-train cmake
  • Updating Allocator docs to explain first fit usage
  • Adding asserts for hanging cases in ND tilize/untilize support
  • Fix ttnn.reallocate when unaligned RM tensors are used
  • #15891: improve full accuracy and fix full bugs
  • Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
  • #15857: Skip abs forge for GS
  • #16213: Use our own forked Docker Run Action that points to ECR
  • Add max kernel size for each risc type in an op
  • Infer Conv2dTranspose parameters during model preprocessing
  • #12662: add keepdim fixes to reduce
  • Add chunked prefill to Llama family
  • #15342: Add mirror_kernels option to conv_transpose2d
  • Update CODEOWNERS
  • support reduction for 3d & 4d dims
  • #5605: Only force-stall ethernet programs on earlier ethernet programs
  • Add full support for creating tensors with logical sharding from python
  • update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
  • #15857: Binary Forge Sweep Tests Set2
  • #14976/#15039: Add Support For ceil_mode=True
  • Add missing cache invalidates + loads before stores noc optimization for BH
  • Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
  • New FD Init Flow
  • Add support for output sharded embeddings
  • Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
  • #0: Enforce tile layout when using bf4/bf8 data types
  • MeshDevice: Support Quanta Galaxy system file
  • Move Device members from public to private
  • Add unary sharded sweeps
  • #0: Added core_grid offset for sharded layernorm
  • fix abs path bug for sweeps tests code
  • #0: Publish TT-Distributed doc under tech_reports
  • #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
  • #16265: Remove creation op
  • Fix unsigned arithmetic bugs in reshape ops
  • Fix compile issue for earlier c++ versions
  • #0: Typo fix in TT distributed tech report
  • [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
  • LLM tech report sections 3.1, 3.4, 3.5
  • LLM Tech report section 4.4
  • Move some Device methods to private section
  • #0: [skip_ci] Update Distributed Tech Report with Discord Server link
  • #15857: Binary Forge Sweep Tests Set1
  • #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
  • #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small

v0.54.0-rc16

04 Jan 02:28
v0.54.0-rc16 Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12606309953

📦 Uncategorized

  • Add buffering to DPRINT
  • Revert "#15565 Add unit test to show sharding ttnn.from_torch problems"
  • [UMD] Removed set_*_params calls and constants
  • #0: Remove some dead code
  • Updated installation script
  • Python -> Python3
  • Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute
  • Adding ND support for tilize/untilize with padding
  • [Llama3.2-11b vLLM Integration] Add support for paged cross attention, fixes for continuous batching, simplified decode forward call
  • #0: Enable Local Sweeps and Use a Faster Interprocess Queue
  • #15601: Implement support for MeshDevice::reshape(..)
  • Remove setup_core_to_tlb_map
  • #0: Let sharded_to_interleaved handle interleaved input
  • #0: separate validation of conv weight and bias.
  • #0: Minor refactor of pytensor and tensor implementation files
  • C++ files should not be part of the API of a library
  • #15857: Forge sweep test
  • #15857: Unary forge sweep tests
  • Fix some more namespace pollution caused by using namespace tt::tt_metal
  • #15713 Bad Eltwise Binary ZEROACC
  • #15565 Fix unit test to show sharding ttnn.from_torch problems
  • Fix paged SDPA decode CB sizing issue
  • Reland async dispatch with workaround for hang.
  • #16119: Add forge traces to matmul and reduce sweeps
  • #10034: Binary shift operators
  • #0: Remove incorrect memory span assert
  • Add forge sweeps for slice and transpose
  • #0: Move memory config serialization in the corresponding header away from types.hpp
  • #16114: Allow Binarized Programs to be Reused across WH Devices
  • #0: aligning conv2d transpose as conv
  • support missing cases for sweep tests
  • #0: added normalization details in the tech report
  • Fix ttnn.from_torch for 0D/1D tensors with tile layout
  • Port all Moreh OPs to compute_output_specs
  • Bump umd to fix grayskull cluster bug
  • Clean-up the usage of deallocate_activation
  • llm tech report multi device section
  • Add prefill v decode section to LLM tech report [section 3.2]
  • #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
  • [LLM tech report] Add accuracy evaluation and debugging sections
  • #16165: Disabling test that depends on some machine state to pass
  • enable dps ops for matmul
  • Isolate tracy
  • [TT-Train] Added tests for sum and mean
  • #16184: Try using ecr to avoid rate limits of docker.io
  • #15221: Post completion messages to dispatch_s
  • [TT-Train] Added softmax backward
  • Optimized FreeList allocator
  • Set the test data to be relative to the test binary
  • #0: Fix matmul doc string
  • #0: remove spammy warning from conftest
  • Update generating unicast go signal commands to ensure dispatch write linear respects alignment
  • LLM tech report sections 2.2, 2.5
  • [TT-Train] Fix tracy deps in the tt-train cmake
  • Updating Allocator docs to explain first fit usage
  • Adding asserts for hanging cases in ND tilize/untilize support
  • Fix ttnn.reallocate when unaligned RM tensors are used
  • #15891: improve full accuracy and fix full bugs
  • Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
  • #15857: Skip abs forge for GS
  • #16213: Use our own forked Docker Run Action that points to ECR
  • Add max kernel size for each risc type in an op
  • Infer Conv2dTranspose parameters during model preprocessing
  • #12662: add keepdim fixes to reduce
  • Add chunked prefill to Llama family
  • #15342: Add mirror_kernels option to conv_transpose2d
  • Update CODEOWNERS
  • support reduction for 3d & 4d dims
  • #5605: Only force-stall ethernet programs on earlier ethernet programs
  • Add full support for creating tensors with logical sharding from python
  • update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
  • #15857: Binary Forge Sweep Tests Set2
  • #14976/#15039: Add Support For ceil_mode=True
  • Add missing cache invalidates + loads before stores noc optimization for BH
  • Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
  • New FD Init Flow
  • Add support for output sharded embeddings
  • Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
  • #0: Enforce tile layout when using bf4/bf8 data types
  • MeshDevice: Support Quanta Galaxy system file
  • Move Device members from public to private
  • Add unary sharded sweeps
  • #0: Added core_grid offset for sharded layernorm
  • fix abs path bug for sweeps tests code
  • #0: Publish TT-Distributed doc under tech_reports
  • #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
  • #16265: Remove creation op
  • Fix unsigned arithmetic bugs in reshape ops
  • Fix compile issue for earlier c++ versions
  • #0: Typo fix in TT distributed tech report
  • [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
  • LLM tech report sections 3.1, 3.4, 3.5
  • LLM Tech report section 4.4
  • Move some Device methods to private section
  • #0: [skip_ci] Update Distributed Tech Report with Discord Server link
  • #15857: Binary Forge Sweep Tests Set1
  • #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
  • #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small

v0.54.0-rc15

03 Jan 02:04
v0.54.0-rc15 Pre-release

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12591326491

📦 Uncategorized

  • Add buffering to DPRINT
  • Revert "#0: Fix merge conflicts originating from #15289"
  • Revert "Link Tensor.reshape to ttnn.reshape"
  • #15061: Implement multi-device tensor distribution APIs in terms of C++ ttnn tensors
  • #0: Allow ttnn.pad to pad Tensor to an odd width in row major
  • #15565 Add unit test to show sharding ttnn.from_torch problems
  • #14977: conv config to use higher cores.
  • Revert "#15565 Add unit test to show sharding ttnn.from_torch problems"
  • [UMD] Removed set_*_params calls and constants
  • #0: Remove some dead code
  • Updated installation script
  • Python -> Python3
  • Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute
  • Adding ND support for tilize/untilize with padding
  • [Llama3.2-11b vLLM Integration] Add support for paged cross attention, fixes for continuous batching, simplified decode forward call
  • #0: Enable Local Sweeps and Use a Faster Interprocess Queue
  • #15601: Implement support for MeshDevice::reshape(..)
  • Remove setup_core_to_tlb_map
  • #0: Let sharded_to_interleaved handle interleaved input
  • #0: separate validation of conv weight and bias.
  • #0: Minor refactor of pytensor and tensor implementation files
  • C++ files should not be part of the API of a library
  • #15857: Forge sweep test
  • #15857: Unary forge sweep tests
  • Fix some more namespace pollution caused by using namespace tt::tt_metal
  • #15713 Bad Eltwise Binary ZEROACC
  • #15565 Fix unit test to show sharding ttnn.from_torch problems
  • Fix paged SDPA decode CB sizing issue
  • Reland async dispatch with workaround for hang.
  • #16119: Add forge traces to matmul and reduce sweeps
  • #10034: Binary shift operators
  • #0: Remove incorrect memory span assert
  • Add forge sweeps for slice and transpose
  • #0: Move memory config serialization in the corresponding header away from types.hpp
  • #16114: Allow Binarized Programs to be Reused across WH Devices
  • #0: aligning conv2d transpose as conv
  • support missing cases for sweep tests
  • #0: added normalization details in the tech report
  • Fix ttnn.from_torch for 0D/1D tensors with tile layout
  • Port all Moreh OPs to compute_output_specs
  • Bump umd to fix grayskull cluster bug
  • Clean-up the usage of deallocate_activation
  • llm tech report multi device section
  • Add prefill v decode section to LLM tech report [section 3.2]
  • #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
  • [LLM tech report] Add accuracy evaluation and debugging sections
  • #16165: Disabling test that depends on some machine state to pass
  • enable dps ops for matmul
  • Isolate tracy
  • [TT-Train ]added tests for sum and mean
  • #16184: Try using ecr to avoid rate limits of docker.io
  • #15221: Post completion messages to dispatch_s
  • [TT-Train] Added softmax backward
  • Optimized FreeList allocator
  • Set the test data to be relative to the test binary
  • #0: Fix matmul doc string
  • #0: remove spammy warning from conftest
  • Update generating unicast go signal commands to ensure dispatch write linear respects alignment
  • LLM tech report sections 2.2, 2.5
  • [TT-Train] Fix tracy deps in the tt-train cmake
  • Updating Allocator docs to explain first fit usage
  • Adding asserts for hanging cases in ND tilize/untilize support
  • Fix ttnn.reallocate when unaligned RM tensors are used
  • #15891: improve full accuracy and fix full bugs
  • Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
  • #15857: Skip abs forge for GS
  • #16213: Use our own forked Docker Run Action that points to ECR
  • Add max kernel size for each risc type in an op
  • Infer Conv2dTranspose parameters during model preprocessing
  • #12662: add keepdim fixes to reduce
  • Add chunked prefill to Llama family
  • #15342: Add mirror_kernels option to conv_transpose2d
  • Update CODEOWNERS
  • support reduction for 3d & 4d dims
  • #5605: Only force-stall ethernet programs on earlier ethernet programs
  • Add full support for creating tensors with logical sharding from python
  • update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
  • #15857: Binary Forge Sweep Tests Set2
  • #14976/#15039: Add Support For ceil_mode=True
  • Add missing cache invalidates + loads before stores noc optimization for BH
  • Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
  • New FD Init Flow
  • Add support for output sharded embeddings
  • Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
  • #0: Enforce tile layout when using bf4/bf8 data types
  • MeshDevice: Support Quanta Galaxy system file
  • Move Device members from public to private
  • Add unary sharded sweeps
  • #0: Added core_grid offset for sharded layernorm
  • fix abs path bug for sweeps tests code
  • #0: Publish TT-Distributed doc under tech_reports
  • #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
  • #16265: Remove creation op
  • Fix unsigned arithmetic bugs in reshape ops
  • Fix compile issue for earlier c++ versions
  • #0: Typo fix in TT distributed tech report
  • [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
  • LLM tech report sections 3.1, 3.4, 3.5
  • LLM Tech report section 4.4
  • Move some Device methods to private section
  • #0: [skip_ci] Update Distributed Tech Report with Discord Server link
  • #15857: Binary Forge Sweep Tests Set1
  • #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
  • #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small

v0.54.0-rc14

02 Jan 02:03
v0.54.0-rc14 Pre-release

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12576268538

📦 Uncategorized

  • Add buffering to DPRINT
  • Update CODEOWNERS - add experimental CCL section
  • #15780: div ops debug
  • Revert "#16012: Revert conv2d changes because of perf regressions, pc…
  • #13127: Make TensorLayout::compute_physical_shard_shape public
  • Link Tensor.reshape to ttnn.reshape
  • #0: Fix merge conflicts originating from #15289
  • Integrate chunked prefill into t3k Llama3-70B
  • Bump MagicEnum to v0.9.7
  • #15944: Fix pybind of create_sub_device_manager_with_fabric to call the correct function.
  • [tt-train] Add option to disable wandb in examples
  • Update perf and latest features for llm models (Dec 16)
  • #16070: Use the same Docker image as built
  • [tt-train] Bump magic_enum from 0.9.6 to 0.9.7
  • Update ttcnn.md
  • #13643: Extend binary-ng math support to match all primitive binary ops.
  • #14530: remove up front padding from generic reduce
  • Revert "#0: Fix merge conflicts originating from #15289"
  • Revert "Link Tensor.reshape to ttnn.reshape"
  • #15061: Implement multi-device tensor distribution APIs in terms of C++ ttnn tensors
  • #0: Allow ttnn.pad to pad Tensor to an odd width in row major
  • #15565 Add unit test to show sharding ttnn.from_torch problems
  • #14977: conv config to use higher cores.
  • Revert "#15565 Add unit test to show sharding ttnn.from_torch problems"
  • [UMD] Removed set_*_params calls and constants
  • #0: Remove some dead code
  • Updated installation script
  • Python -> Python3
  • Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute
  • Adding ND support for tilize/untilize with padding
  • [Llama3.2-11b vLLM Integration] Add support for paged cross attention, fixes for continuous batching, simplified decode forward call
  • #0: Enable Local Sweeps and Use a Faster Interprocess Queue
  • #15601: Implement support for MeshDevice::reshape(..)
  • Remove setup_core_to_tlb_map
  • #0: Let sharded_to_interleaved handle interleaved input
  • #0: separate validation of conv weight and bias.
  • #0: Minor refactor of pytensor and tensor implementation files
  • C++ files should not be part of the API of a library
  • #15857: Forge sweep test
  • #15857: Unary forge sweep tests
  • Fix some more namespace pollution caused by using namespace tt::tt_metal
  • #15713 Bad Eltwise Binary ZEROACC
  • #15565 Fix unit test to show sharding ttnn.from_torch problems
  • Fix paged SDPA decode CB sizing issue
  • Reland async dispatch with workaround for hang.
  • #16119: Add forge traces to matmul and reduce sweeps
  • #10034: Binary shift operators
  • #0: Remove incorrect memory span assert
  • Add forge sweeps for slice and transpose
  • #0: Move memory config serialization in the corresponding header away from types.hpp
  • #16114: Allow Binarized Programs to be Reused across WH Devices
  • #0: aligning conv2d transpose as conv
  • support missing cases for sweep tests
  • #0: added normalization details in the tech report
  • Fix ttnn.from_torch for 0D/1D tensors with tile layout
  • Port all Moreh OPs to compute_output_specs
  • Bump umd to fix grayskull cluster bug
  • Clean-up the usage of deallocate_activation
  • llm tech report multi device section
  • Add prefill v decode section to LLM tech report [section 3.2]
  • #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
  • [LLM tech report] Add accuracy evaluation and debugging sections
  • #16165: Disabling test that depends on some machine state to pass
  • enable dps ops for matmul
  • Isolate tracy
  • [TT-Train ]added tests for sum and mean
  • #16184: Try using ecr to avoid rate limits of docker.io
  • #15221: Post completion messages to dispatch_s
  • [TT-Train] Added softmax backward
  • Optimized FreeList allocator
  • Set the test data to be relative to the test binary
  • #0: Fix matmul doc string
  • #0: remove spammy warning from conftest
  • Update generating unicast go signal commands to ensure dispatch write linear respects alignment
  • LLM tech report sections 2.2, 2.5
  • [TT-Train] Fix tracy deps in the tt-train cmake
  • Updating Allocator docs to explain first fit usage
  • Adding asserts for hanging cases in ND tilize/untilize support
  • Fix ttnn.reallocate when unaligned RM tensors are used
  • #15891: improve full accuracy and fix full bugs
  • Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
  • #15857: Skip abs forge for GS
  • #16213: Use our own forked Docker Run Action that points to ECR
  • Add max kernel size for each risc type in an op
  • Infer Conv2dTranspose parameters during model preprocessing
  • #12662: add keepdim fixes to reduce
  • Add chunked prefill to Llama family
  • #15342: Add mirror_kernels option to conv_transpose2d
  • Update CODEOWNERS
  • support reduction for 3d & 4d dims
  • #5605: Only force-stall ethernet programs on earlier ethernet programs
  • Add full support for creating tensors with logical sharding from python
  • update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
  • #15857: Binary Forge Sweep Tests Set2
  • #14976/#15039: Add Support For ceil_mode=True
  • Add missing cache invalidates + loads before stores noc optimization for BH
  • Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
  • New FD Init Flow
  • Add support for output sharded embeddings
  • Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
  • #0: Enforce tile layout when using bf4/bf8 data types
  • MeshDevice: Support Quanta Galaxy system file
  • Move Device members from public to private
  • Add unary sharded sweeps
  • #0: Added core_grid offset for sharded layernorm
  • fix abs path bug for sweeps tests code
  • #0: Publish TT-Distributed doc under tech_reports
  • #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
  • #16265: Remove creation op
  • Fix unsigned arithmetic bugs in reshape ops
  • Fix compile issue for earlier c++ versions
  • #0: Typo fix in TT distributed tech report
  • [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
  • LLM tech report sections 3.1, 3.4, 3.5
  • LLM Tech report section 4.4
  • Move some Device methods to private section
  • #0: [skip_ci] Update Distributed Tech Report with Discord Server link
  • #15857: Binary Forge Sweep Tests Set1
  • #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
  • #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small

v0.54.0-rc13

01 Jan 02:04
416ce55
v0.54.0-rc13 Pre-release

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12565895915

📦 Uncategorized

  • Add buffering to DPRINT
  • Update CODEOWNERS - add experimental CCL section
  • #15780: div ops debug
  • Revert "#16012: Revert conv2d changes because of perf regressions, pc…
  • #13127: Make TensorLayout::compute_physical_shard_shape public
  • Link Tensor.reshape to ttnn.reshape
  • #0: Fix merge conflicts originating from #15289
  • Integrate chunked prefill into t3k Llama3-70B
  • Bump MagicEnum to v0.9.7
  • #15944: Fix pybind of create_sub_device_manager_with_fabric to call the correct function.
  • [tt-train] Add option to disable wandb in examples
  • Update perf and latest features for llm models (Dec 16)
  • #16070: Use the same Docker image as built
  • [tt-train] Bump magic_enum from 0.9.6 to 0.9.7
  • Update ttcnn.md
  • #13643: Extend binary-ng math support to match all primitive binary ops.
  • #14530: remove up front padding from generic reduce
  • Revert "#0: Fix merge conflicts originating from #15289"
  • Revert "Link Tensor.reshape to ttnn.reshape"
  • #15061: Implement multi-device tensor distribution APIs in terms of C++ ttnn tensors
  • #0: Allow ttnn.pad to pad Tensor to an odd width in row major
  • #15565 Add unit test to show sharding ttnn.from_torch problems
  • #14977: conv config to use higher cores.
  • Revert "#15565 Add unit test to show sharding ttnn.from_torch problems"
  • [UMD] Removed set_*_params calls and constants
  • #0: Remove some dead code
  • Updated installation script
  • Python -> Python3
  • Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute
  • Adding ND support for tilize/untilize with padding
  • [Llama3.2-11b vLLM Integration] Add support for paged cross attention, fixes for continuous batching, simplified decode forward call
  • #0: Enable Local Sweeps and Use a Faster Interprocess Queue
  • #15601: Implement support for MeshDevice::reshape(..)
  • Remove setup_core_to_tlb_map
  • #0: Let sharded_to_interleaved handle interleaved input
  • #0: separate validation of conv weight and bias.
  • #0: Minor refactor of pytensor and tensor implementation files
  • C++ files should not be part of the API of a library
  • #15857: Forge sweep test
  • #15857: Unary forge sweep tests
  • Fix some more namespace pollution caused by using namespace tt::tt_metal
  • #15713 Bad Eltwise Binary ZEROACC
  • #15565 Fix unit test to show sharding ttnn.from_torch problems
  • Fix paged SDPA decode CB sizing issue
  • Reland async dispatch with workaround for hang.
  • #16119: Add forge traces to matmul and reduce sweeps
  • #10034: Binary shift operators
  • #0: Remove incorrect memory span assert
  • Add forge sweeps for slice and transpose
  • #0: Move memory config serialization in the corresponding header away from types.hpp
  • #16114: Allow Binarized Programs to be Reused across WH Devices
  • #0: aligning conv2d transpose as conv
  • support missing cases for sweep tests
  • #0: added normalization details in the tech report
  • Fix ttnn.from_torch for 0D/1D tensors with tile layout
  • Port all Moreh OPs to compute_output_specs
  • Bump umd to fix grayskull cluster bug
  • Clean-up the usage of deallocate_activation
  • llm tech report multi device section
  • Add prefill v decode section to LLM tech report [section 3.2]
  • #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
  • [LLM tech report] Add accuracy evaluation and debugging sections
  • #16165: Disabling test that depends on some machine state to pass
  • enable dps ops for matmul
  • Isolate tracy
  • [TT-Train ]added tests for sum and mean
  • #16184: Try using ecr to avoid rate limits of docker.io
  • #15221: Post completion messages to dispatch_s
  • [TT-Train] Added softmax backward
  • Optimized FreeList allocator
  • Set the test data to be relative to the test binary
  • #0: Fix matmul doc string
  • #0: remove spammy warning from conftest
  • Update generating unicast go signal commands to ensure dispatch write linear respects alignment
  • LLM tech report sections 2.2, 2.5
  • [TT-Train] Fix tracy deps in the tt-train cmake
  • Updating Allocator docs to explain first fit usage
  • Adding asserts for hanging cases in ND tilize/untilize support
  • Fix ttnn.reallocate when unaligned RM tensors are used
  • #15891: improve full accuracy and fix full bugs
  • Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
  • #15857: Skip abs forge for GS
  • #16213: Use our own forked Docker Run Action that points to ECR
  • Add max kernel size for each risc type in an op
  • Infer Conv2dTranspose parameters during model preprocessing
  • #12662: add keepdim fixes to reduce
  • Add chunked prefill to Llama family
  • #15342: Add mirror_kernels option to conv_transpose2d
  • Update CODEOWNERS
  • support reduction for 3d & 4d dims
  • #5605: Only force-stall ethernet programs on earlier ethernet programs
  • Add full support for creating tensors with logical sharding from python
  • update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
  • #15857: Binary Forge Sweep Tests Set2
  • #14976/#15039: Add Support For ceil_mode=True
  • Add missing cache invalidates + loads before stores noc optimization for BH
  • Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
  • New FD Init Flow
  • Add support for output sharded embeddings
  • Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
  • #0: Enforce tile layout when using bf4/bf8 data types
  • MeshDevice: Support Quanta Galaxy system file
  • Move Device members from public to private
  • Add unary sharded sweeps
  • #0: Added core_grid offset for sharded layernorm
  • fix abs path bug for sweeps tests code
  • #0: Publish TT-Distributed doc under tech_reports
  • #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
  • #16265: Remove creation op
  • Fix unsigned arithmetic bugs in reshape ops
  • Fix compile issue for earlier c++ versions
  • #0: Typo fix in TT distributed tech report
  • [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
  • LLM tech report sections 3.1, 3.4, 3.5
  • LLM Tech report section 4.4
  • Move some Device methods to private section
  • #0: [skip_ci] Update Distributed Tech Report with Discord Server link
  • #15857: Binary Forge Sweep Tests Set1
  • #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
  • #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small

v0.54.0-rc12

31 Dec 02:03
3949130
v0.54.0-rc12 Pre-release

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12554143182

📦 Uncategorized

  • Add buffering to DPRINT
  • #0: Prevent slice from padding up a 0 volume tensor
  • #0: support unequal ranked inputs for broadcast in binary_ng
  • #16014: Fix yolo4 e2e perf measurement
  • Update CODEOWNERS - add experimental CCL section
  • #15780: div ops debug
  • Revert "#16012: Revert conv2d changes because of perf regressions, pc…
  • #13127: Make TensorLayout::compute_physical_shard_shape public
  • Link Tensor.reshape to ttnn.reshape
  • #0: Fix merge conflicts originating from #15289
  • Integrate chunked prefill into t3k Llama3-70B
  • Bump MagicEnum to v0.9.7
  • #15944: Fix pybind of create_sub_device_manager_with_fabric to call the correct function.
  • [tt-train] Add option to disable wandb in examples
  • Update perf and latest features for llm models (Dec 16)
  • #16070: Use the same Docker image as built
  • [tt-train] Bump magic_enum from 0.9.6 to 0.9.7
  • Update ttcnn.md
  • #13643: Extend binary-ng math support to match all primitive binary ops.
  • #14530: remove up front padding from generic reduce
  • Revert "#0: Fix merge conflicts originating from #15289"
  • Revert "Link Tensor.reshape to ttnn.reshape"
  • #15061: Implement multi-device tensor distribution APIs in terms of C++ ttnn tensors
  • #0: Allow ttnn.pad to pad Tensor to an odd width in row major
  • #15565 Add unit test to show sharding ttnn.from_torch problems
  • #14977: conv config to use higher cores.
  • Revert "#15565 Add unit test to show sharding ttnn.from_torch problems"
  • [UMD] Removed set_*_params calls and constants
  • #0: Remove some dead code
  • Updated installation script
  • Python -> Python3
  • Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute
  • Adding ND support for tilize/untilize with padding
  • [Llama3.2-11b vLLM Integration] Add support for paged cross attention, fixes for continuous batching, simplified decode forward call
  • #0: Enable Local Sweeps and Use a Faster Interprocess Queue
  • #15601: Implement support for MeshDevice::reshape(..)
  • Remove setup_core_to_tlb_map
  • #0: Let sharded_to_interleaved handle interleaved input
  • #0: separate validation of conv weight and bias.
  • #0: Minor refactor of pytensor and tensor implementation files
  • C++ files should not be part of the API of a library
  • #15857: Forge sweep test
  • #15857: Unary forge sweep tests
  • Fix some more namespace pollution caused by using namespace tt::tt_metal
  • #15713 Bad Eltwise Binary ZEROACC
  • #15565 Fix unit test to show sharding ttnn.from_torch problems
  • Fix paged SDPA decode CB sizing issue
  • Reland async dispatch with workaround for hang.
  • #16119: Add forge traces to matmul and reduce sweeps
  • #10034: Binary shift operators
  • #0: Remove incorrect memory span assert
  • Add forge sweeps for slice and transpose
  • #0: Move memory config serialization in the corresponding header away from types.hpp
  • #16114: Allow Binarized Programs to be Reused across WH Devices
  • #0: aligning conv2d transpose as conv
  • support missing cases for sweep tests
  • #0: added normalization details in the tech report
  • Fix ttnn.from_torch for 0D/1D tensors with tile layout
  • Port all Moreh OPs to compute_output_specs
  • Bump umd to fix grayskull cluster bug
  • Clean-up the usage of deallocate_activation
  • llm tech report multi device section
  • Add prefill v decode section to LLM tech report [section 3.2]
  • #0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
  • [LLM tech report] Add accuracy evaluation and debugging sections
  • #16165: Disabling test that depends on some machine state to pass
  • enable dps ops for matmul
  • Isolate tracy
  • [TT-Train ]added tests for sum and mean
  • #16184: Try using ecr to avoid rate limits of docker.io
  • #15221: Post completion messages to dispatch_s
  • [TT-Train] Added softmax backward
  • Optimized FreeList allocator
  • Set the test data to be relative to the test binary
  • #0: Fix matmul doc string
  • #0: remove spammy warning from conftest
  • Update generating unicast go signal commands to ensure dispatch write linear respects alignment
  • LLM tech report sections 2.2, 2.5
  • [TT-Train] Fix tracy deps in the tt-train cmake
  • Updating Allocator docs to explain first fit usage
  • Adding asserts for hanging cases in ND tilize/untilize support
  • Fix ttnn.reallocate when unaligned RM tensors are used
  • #15891: improve full accuracy and fix full bugs
  • Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
  • #15857: Skip abs forge for GS
  • #16213: Use our own forked Docker Run Action that points to ECR
  • Add max kernel size for each risc type in an op
  • Infer Conv2dTranspose parameters during model preprocessing
  • #12662: add keepdim fixes to reduce
  • Add chunked prefill to Llama family
  • #15342: Add mirror_kernels option to conv_transpose2d
  • Update CODEOWNERS
  • support reduction for 3d & 4d dims
  • #5605: Only force-stall ethernet programs on earlier ethernet programs
  • Add full support for creating tensors with logical sharding from python
  • update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
  • #15857: Binary Forge Sweep Tests Set2
  • #14976/#15039: Add Support For ceil_mode=True
  • Add missing cache invalidates + loads before stores noc optimization for BH
  • Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
  • New FD Init Flow
  • Add support for output sharded embeddings
  • Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
  • #0: Enforce tile layout when using bf4/bf8 data types
  • MeshDevice: Support Quanta Galaxy system file
  • Move Device members from public to private
  • Add unary sharded sweeps
  • #0: Added core_grid offset for sharded layernorm
  • fix abs path bug for sweeps tests code
  • #0: Publish TT-Distributed doc under tech_reports
  • #15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
  • #16265: Remove creation op
  • Fix unsigned arithmetic bugs in reshape ops
  • Fix compile issue for earlier c++ versions
  • #0: Typo fix in TT distributed tech report
  • [Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
  • LLM tech report sections 3.1, 3.4, 3.5
  • LLM Tech report section 4.4
  • Move some Device methods to private section
  • #0: [skip_ci] Update Distributed Tech Report with Discord Server link
  • #15857: Binary Forge Sweep Tests Set1
  • #0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
  • #0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small