Add Grouped Convolution and GEMM documentation #1719

bartekxk · 2024-12-05T00:29:12Z

No description provided.

aosewski · 2024-12-05T08:06:11Z

@bartekxk What about using BF16, FP16 and FP32 instead of `bhalf_t`, `half_t` and `float`?

aosewski · 2024-12-18T11:12:48Z

docs/markdown/tensor_operation/gemm.md

+* **BLayout** - B layout (RowMajor/ColumnMajor).
+* **CLayout** - C layout (RowMajor/ColumnMajor).
+* **ADataType** - A data type.
+* **BDataType** - B data type.
+* **CDataType** - C data type.


A/B/C Matrix

aosewski · 2024-12-18T11:13:22Z

docs/markdown/tensor_operation/gemm.md

+* **BElementwiseOperation** - fused operation on tensor B.
+* **CElementwiseOperation** - fused operation on tensor C.


Let's start with a capital letter.

aosewski · 2024-12-18T11:13:43Z

docs/markdown/tensor_operation/gemm.md

It is important to note that this elementwise operation is applied before the two tensors are multiplied, that is, to the elements of the A and B tensors.

aosewski · 2024-12-18T11:14:54Z

docs/markdown/tensor_operation/gemm.md

+* **AElementwiseOperation** - fused operation on tensor A.
+* **BElementwiseOperation** - fused operation on tensor B.
+* **CElementwiseOperation** - fused operation on tensor C.


The operation is applied to the elements of the output matrix (the result of multiplying the two matrices).
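To make the ordering in the comments above concrete, here is a minimal host-side sketch. It assumes row-major storage for all three matrices, and `reference_gemm` is a hypothetical helper written for illustration, not a function from the library. The A and B operations run on each input element before the multiply, and the C operation runs once on each accumulated output element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not the library's API.
// Computes C[m][n] = c_op(sum over k of a_op(A[m][k]) * b_op(B[k][n]))
// with A, B and C all stored row-major (an assumption of this sketch).
template <typename AOp, typename BOp, typename COp>
std::vector<float> reference_gemm(const std::vector<float>& a, // M x K
                                  const std::vector<float>& b, // K x N
                                  std::size_t M, std::size_t N, std::size_t K,
                                  AOp a_op, BOp b_op, COp c_op)
{
    assert(a.size() == M * K && b.size() == K * N);
    std::vector<float> c(M * N, 0.0f);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            float acc = 0.0f;
            for(std::size_t k = 0; k < K; ++k)
            {
                // A and B operations act on input elements, before the multiply.
                acc += a_op(a[m * K + k]) * b_op(b[k * N + n]);
            }
            // The C operation acts on the accumulated output element.
            c[m * N + n] = c_op(acc);
        }
    }
    return c;
}
```

For example, passing a doubling lambda as `a_op`, an identity lambda as `b_op`, and a ReLU lambda as `c_op` scales A first, multiplies, and clamps negative results last.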

aosewski · 2024-12-18T11:16:37Z

docs/markdown/tensor_operation/gemm.md

+* **CElementwiseOperation** - fused operation on tensor C.
+
+For matrices with large K dimension `DeviceGemmSplitK` implementation is available. This implementation allows user to split K dimension between work groups. This implementation requires AtomicAdd operation on global memory (output buffer must be set to zeroes).


Suggested change

-For matrices with large K dimension `DeviceGemmSplitK` implementation is available. This implementation allows user to split K dimension between work groups. This implementation requires AtomicAdd operation on global memory (output buffer must be set to zeroes).

+For matrices with large K dimension `DeviceGemmSplitK` implementation is available. This implementation allows user to split K dimension between work groups. This implementation uses `AtomicAdd` operation on global memory, thus need to zero-out output buffer for correct results.
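The reason the output buffer has to be zeroed can be shown with a small host-side model. This is only a sketch: `split_k_gemm_accumulate` and its `k_batch` parameter are hypothetical names, the matrices are assumed row-major, and a plain `+=` stands in for the atomic add that the device kernel would perform on global memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of split-K, not the library's kernel.
// Each "work group" g owns one slice of K and adds its partial sum into C.
void split_k_gemm_accumulate(const std::vector<float>& a, // M x K, row-major
                             const std::vector<float>& b, // K x N, row-major
                             std::vector<float>& c,       // M x N, accumulated into
                             std::size_t M, std::size_t N, std::size_t K,
                             std::size_t k_batch)
{
    assert(k_batch > 0);
    assert(a.size() == M * K && b.size() == K * N && c.size() == M * N);
    const std::size_t k_per_group = (K + k_batch - 1) / k_batch; // ceil(K / k_batch)
    for(std::size_t g = 0; g < k_batch; ++g)
    {
        const std::size_t k_begin = std::min(g * k_per_group, K);
        const std::size_t k_end   = std::min(k_begin + k_per_group, K);
        for(std::size_t m = 0; m < M; ++m)
        {
            for(std::size_t n = 0; n < N; ++n)
            {
                float partial = 0.0f;
                for(std::size_t k = k_begin; k < k_end; ++k)
                {
                    partial += a[m * K + k] * b[k * N + n];
                }
                // "+=" models the atomic add: whatever was in C before is kept.
                c[m * N + n] += partial;
            }
        }
    }
}
```

Because every slice adds into `c` rather than overwriting it, any stale value already in the buffer ends up in the final result, which is why the caller must clear the buffer first.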

aosewski · 2024-12-18T11:18:08Z

docs/markdown/tensor_operation/gemm.md

+
+For fused operations with additional tensor there are `DeviceGemmMultipleABD` or `DeviceGemmMultipleD` operation which require following parameters:
+* **DsLayout** - layouts for additional tensors for fused operations.


Please add a note that all DsLayout entries have to be the same as the output matrix C layout.
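One way to express the constraint requested above is a compile-time check. The sketch below is an assumption-laden illustration, not the library's code: `RowMajor`, `ColumnMajor` and `all_ds_match_c` are hypothetical names, `std::tuple` stands in for whatever tuple type the library uses for `DsLayout`, and C++17 is assumed.

```cpp
#include <tuple>
#include <type_traits>

// Hypothetical layout tags, used only for illustration.
struct RowMajor {};
struct ColumnMajor {};

// Primary template: only the std::tuple specialization below is defined.
template <typename CLayout, typename DsLayout>
struct all_ds_match_c;

// True when every layout in the DsLayout tuple is the same type as CLayout.
// An empty tuple (no additional D tensors) trivially satisfies the check.
template <typename CLayout, typename... DLayouts>
struct all_ds_match_c<CLayout, std::tuple<DLayouts...>>
    : std::bool_constant<(std::is_same_v<CLayout, DLayouts> && ...)>
{
};

template <typename CLayout, typename DsLayout>
inline constexpr bool all_ds_match_c_v = all_ds_match_c<CLayout, DsLayout>::value;
```

A device operation could then reject mismatched configurations early with something like `static_assert(all_ds_match_c_v<CLayout, DsLayout>, "all D layouts must match the C layout");`.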

aosewski · 2024-12-18T11:18:54Z

docs/markdown/tensor_operation/gemm.md

+* **DeviceGemmDl** - Device operation with DL instructions.
+* **DeviceGemmDpp** - Device operation with DL instructions with DPP instructions during data load.
+* **DeviceGemmWmma_CShuffle** - Device operation with WMMA instructions with CShuffle optimization for more optimized data store.


Suggested change

-* **DeviceGemmWmma_CShuffle** - Device operation with WMMA instructions with CShuffle optimization for more optimized data store.

+* **DeviceGemmWmma_CShuffle** - Device operation with WMMA instructions and CShuffle optimization for more optimized data store.

aosewski · 2024-12-18T12:00:39Z

docs/markdown/tensor_operation/gemm.md

+* **DeviceGemm_Xdl_CShuffleV2** - Device operation with XDL instructions with CShuffle optimization for more optimized data store. GEMM pipeline has been optimized compared to **DeviceGemm_Xdl_CShuffle**.
+* **DeviceGemmXdlSkipBLds** - Device operation with XDL instructions. Load to shared memory has been skiped for B matrix.
+* **DeviceGemm_Xdl_WaveletModel_CShuffle** - Device operation with XDL instructions with CShuffle optimization for more optimized data store.


Please note that it uses a producer/consumer cooperation scheme between waves in a workgroup.

aosewski · 2024-12-18T12:01:53Z

docs/markdown/tensor_operation/gemm.md

+* **DeviceGemmXdl** - Device operation with XDL instructions.
+
+Table of supported cases by instance factory with XDL instruction for Row/Row/Row, Row/Column/Row, Column/Row/Row or Column/Column/Row:


Are you positive that we support all those layouts for all data types?

aosewski · 2024-12-18T12:03:10Z

docs/markdown/tensor_operation/gemm.md

+* **BLayout** - B layout (RowMajor/ColumnMajor).
+* **CLayout** - C layout (RowMajor/ColumnMajor).
+* **ADataType** - A data type.
+* **BDataType** - B data type.
+* **CDataType** - C data type.
+* **AElementwiseOperation** - fused operation on tensor A.
+* **BElementwiseOperation** - fused operation on tensor B.
+* **CElementwiseOperation** - fused operation on tensor C.
+
+This implementation allows user to split K dimension between work groups. This implementation requires AtomicAdd operation on global memory (output buffer must be set to zeroes if splitK parameter is larger than one).


Like above.

Add Grouped Convolution docs

3f5be90

bartekxk self-assigned this Dec 5, 2024

Add gemm docs

78e4cae

bartekxk marked this pull request as ready for review December 9, 2024 12:30

bartekxk requested review from junliume, illsilin, carlushuang, qianfengz, aosewski, poyenc, geyyer, andriy-ca and a team as code owners December 9, 2024 12:30

aosewski reviewed Dec 18, 2024


