You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Our kernel sees partial 2-way bank conflict for K-contiguous matrices, and full 2-way/4-way conflict for MN-contiguous matrices (dependent on tile sizes). Profiling has shown partial 2-way bank conflict for K-contiguous matrices doesn't result in observable impact on LDS issue latency. Therefore, our focus is to resolve bank conflict for MN-contiguous matrices.
The following LDS layout is chosen to avoid LDS write bank conflict for MN-contiguous matrices:
K0_M_1/K0_N_1 (fp32)
K0_M_2/K0_2_N (bf16/fp16)
K0_M_4/K0_N_4 (int8)
Conflict-free kernels are already added to C-shuffle device GEMM instances defined in library/src/tensor_operation_instance/gpu/gemm/device_gemm_xdl_c_shuffle_(type)_(layout)_instance.cpp. GridwiseGemm_k0mk1_k0nk1_mn_xdl_cshuffle_v1 (the underlying implementation of C-shuffle device GEMM) supports different K1 value for each individual A/B matrix to facilitate preferred conflict-free LDS layout for each A/B tile.
Other GEMM/Conv device kernels that do not yet facilitate the latest C-shuffle gridwise GEMM implementation still observe bank conflicts.
GEMM
Batched GEMM
Group GEMM
Split-K GEMM
Non-C-shuffle ordinary GEMM DeviceGemmXdl
Conv
Backward data (conflict in weight)
Backward weight (conflict in both output gradient and activation)
Forward prop maps to K-contiguous implicit GEMM layout so there's no observable bank conflict.
The text was updated successfully, but these errors were encountered:
@rosenrodt Apologies for the lack of response. Can you please if this is still an issue on the latest ROCm 6.2? If not, please close the ticket. Thanks!
Our kernel sees partial 2-way bank conflict for K-contiguous matrices, and full 2-way/4-way conflict for MN-contiguous matrices (dependent on tile sizes). Profiling has shown partial 2-way bank conflict for K-contiguous matrices doesn't result in observable impact on LDS issue latency. Therefore, our focus is to resolve bank conflict for MN-contiguous matrices.
The following LDS layout is chosen to avoid LDS write bank conflict for MN-contiguous matrices:
Conflict-free kernels are already added to C-shuffle device GEMM instances defined in library/src/tensor_operation_instance/gpu/gemm/device_gemm_xdl_c_shuffle_(type)_(layout)_instance.cpp.
GridwiseGemm_k0mk1_k0nk1_mn_xdl_cshuffle_v1
(the underlying implementation of C-shuffle device GEMM) supports different K1 value for each individual A/B matrix to facilitate preferred conflict-free LDS layout for each A/B tile.Other GEMM/Conv device kernels that do not yet facilitate the latest C-shuffle gridwise GEMM implementation still observe bank conflicts.
GEMM
DeviceGemmXdl
Conv
The text was updated successfully, but these errors were encountered: