
[Bug]: Incorrect results when using GPUs with different architectures #1346

Closed
opcod3 opened this issue Jul 23, 2023 · 18 comments


opcod3 commented Jul 23, 2023

Describe the bug

rocBLAS returns incorrect results when used on two GPUs with different architectures.

This issue was first encountered in turboderp/exllama#173; the provided reproduction code is based on rocBLAS-Examples.

When rocBLAS performs computations on two GPUs with different architectures, the first computation
on each card is correct, while any subsequent computation performed on the first card is incorrect.

To Reproduce

Steps to reproduce the behavior:

  1. Ensure the current system has at least two GPUs and that the architecture of GPU0 is different from GPU1

  2. Install ROCm and rocBLAS v5.6.0 (the bug is also present on 5.5.1, possibly earlier as well)

  3. Run make to compile the example code (bug-report.zip); a rough sketch of what the example does is shown after these steps

  4. Run ./gemm

  5. Observe that the first two calculations pass, while all subsequent ones executed on GPU0 fail
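
For context, here is a hypothetical sketch of what the attached reproducer roughly does: alternating SGEMM calls between the two devices and checking each result on the host. The matrix size, the all-ones inputs, and the tolerance are assumptions on my part, not the exact contents of bug-report.zip.

```cpp
// Hypothetical sketch only -- not the exact contents of bug-report.zip.
// Alternates SGEMM calls between device 0 and device 1 and verifies each
// result on the host.
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 512;                           // assumed square problem size
    const float alpha = 1.0f, beta = 0.0f;
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 1.0f), hC(n * n);

    for (int iter = 0; iter < 3; ++iter) {
        for (int dev = 0; dev < 2; ++dev) {      // GPU0, then GPU1, repeatedly
            hipSetDevice(dev);

            rocblas_handle handle;
            rocblas_create_handle(&handle);

            float *dA, *dB, *dC;
            hipMalloc(&dA, n * n * sizeof(float));
            hipMalloc(&dB, n * n * sizeof(float));
            hipMalloc(&dC, n * n * sizeof(float));
            hipMemcpy(dA, hA.data(), n * n * sizeof(float), hipMemcpyHostToDevice);
            hipMemcpy(dB, hB.data(), n * n * sizeof(float), hipMemcpyHostToDevice);

            // C = alpha * A * B + beta * C on the currently selected device
            rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                          n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

            hipMemcpy(hC.data(), dC, n * n * sizeof(float), hipMemcpyDeviceToHost);

            // With A = B = all-ones, every element of C should equal n.
            float maxErr = 0.0f;
            for (float v : hC)
                maxErr = std::max(maxErr, std::abs(v - n) / n);
            std::printf("device %d: %s (max. relative err. = %g)\n",
                        dev, maxErr < 1e-5f ? "PASS" : "FAIL", maxErr);

            hipFree(dA); hipFree(dB); hipFree(dC);
            rocblas_destroy_handle(handle);
        }
    }
    return 0;
}
```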

Expected behavior

It is expected that all calculations complete correctly.

Log-files

Current device: 0 (gfx906:sramecc+:xnack-)
PASS: max. relative err. = 1.17549e-38

Current device: 1 (gfx1030)
PASS: max. relative err. = 1.17549e-38

Current device: 0 (gfx906:sramecc+:xnack-)
FAIL: max. relative err. = 0.5

Current device: 1 (gfx1030)
PASS: max. relative err. = 1.17549e-38

Current device: 0 (gfx906:sramecc+:xnack-)
FAIL: max. relative err. = 0.5

Current device: 1 (gfx1030)
PASS: max. relative err. = 1.17549e-38

Running AMD_LOG_LEVEL=2 ./gemm produces the following log

Current device: 0 (gfx906:sramecc+:xnack-)
:1:hip_code_object.cpp      :606 : 96578653125 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT64x16x8_SE_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA906_IU1_K1_KLA_LBSPP0_LPA0_LPB1_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_2_TLDS0_USFGRO1_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_8_1_WGM1 
:1:hip_module.cpp           :83  : 96578653147 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT64x16x8_SE_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA906_IU1_K1_KLA_LBSPP0_LPA0_LPB1_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_2_TLDS0_USFGRO1_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_8_1_WGM1 for module: 0x2bbceeb0 

PASS: max. relative err. = 1.17549e-38

Current device: 1 (gfx1030)
:1:hip_code_object.cpp      :606 : 96578659797 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 96578659807 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x2ccfe920 

PASS: max. relative err. = 1.17549e-38

Current device: 0 (gfx906:sramecc+:xnack-)
:1:devprogram.cpp           :1874: 96578660829 us: 167308: [tid:0x7f2bf74c7c00] Error: The program ISA amdgcn-amd-amdhsa--gfx1030 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-Error: create kernel metadata map using COMgr
Error: Cannot Find Global Var Sizes
Error: Cannot create kernels.

:1:hip_code_object.cpp      :606 : 96578660844 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 96578660850 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x2bbceeb0 

:1:hip_code_object.cpp      :606 : 96578660857 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 96578660866 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x2bbee400 

FAIL: max. relative err. = 0.5

Current device: 1 (gfx1030)
PASS: max. relative err. = 1.17549e-38

Current device: 0 (gfx906:sramecc+:xnack-)
:1:devprogram.cpp           :1874: 96578661515 us: 167308: [tid:0x7f2bf74c7c00] Error: The program ISA amdgcn-amd-amdhsa--gfx1030 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-Error: create kernel metadata map using COMgr
Error: Cannot Find Global Var Sizes
Error: Cannot create kernels.

:1:hip_code_object.cpp      :606 : 96578661526 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 96578661532 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x2bbceeb0 

:1:hip_code_object.cpp      :606 : 96578661542 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 96578661549 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x2bbee400 

FAIL: max. relative err. = 0.5

Current device: 1 (gfx1030)
PASS: max. relative err. = 1.17549e-38

I believe the key log entries are the following:

Current device: 0 (gfx906:sramecc+:xnack-)
:1:devprogram.cpp           :1874: 96578660829 us: 167308: [tid:0x7f2bf74c7c00] Error: The program ISA amdgcn-amd-amdhsa--gfx1030 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-Error: create kernel metadata map using COMgr
Current device: 0 (gfx906:sramecc+:xnack-)
:1:devprogram.cpp           :1874: 96578661515 us: 167308: [tid:0x7f2bf74c7c00] Error: The program ISA amdgcn-amd-amdhsa--gfx1030 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-Error: create kernel metadata map using COMgr

Environment

Hardware description
  CPU: AMD Ryzen 9 3950X
  GPU: AMD Radeon VII
  GPU: AMD Radeon RX 6800 XT
Software version
  rocm-core: 5.6.0-1
  rocblas: 5.6.0-1

environment.txt

This has also been reproduced in the rocm/dev-ubuntu-22.04:5.5.1-complete docker container.

Additional context

According to other users in turboderp/exllama#173, the issue also occurs between MI25 and MI50 cards. I can also confirm that it occurs between any combination of the two cards listed above and a 7900 XTX.

Inverting the order of the computations (running a calculation on GPU1 first and then on GPU0) results in exactly the same behavior, but with the failing card being GPU1 instead of GPU0.

Current device: 1 (gfx1030)
PASS: max. relative err. = 1.17549e-38

Current device: 0 (gfx906:sramecc+:xnack-)
PASS: max. relative err. = 1.17549e-38

Current device: 1 (gfx1030)
FAIL: max. relative err. = 0.5

Current device: 0 (gfx906:sramecc+:xnack-)
PASS: max. relative err. = 1.17549e-38

From looking at more logs and rocBLAS internals, I believe the error is related to the Tensile library. The observed behavior suggests that when a second .hsaco file is loaded, it somehow overrides the original one that had the correct architecture for the first card.
I am unsure if this is an issue in Tensile itself or in the way rocBLAS uses it.

In my opinion, attempting to execute a kernel built for an incorrect architecture should produce a crash or an error, instead of carrying on as normal and returning incorrect results.
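
For illustration, this is roughly what "producing an error" would look like from the caller's side: checking the rocblas_status return value and explicitly synchronizing the device afterwards. These are standard rocBLAS/HIP calls, but as described above, in this bug both report success while the result is garbage, which is exactly the problem.

```cpp
// Hedged sketch of caller-side checking around the GEMM call; in the failure
// described in this issue both checks currently pass despite the wrong result.
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <cstdio>

bool sgemm_checked(rocblas_handle handle, int n, const float* alpha,
                   const float* dA, const float* dB, const float* beta, float* dC)
{
    rocblas_status st = rocblas_sgemm(handle, rocblas_operation_none,
                                      rocblas_operation_none, n, n, n,
                                      alpha, dA, n, dB, n, beta, dC, n);
    if (st != rocblas_status_success) {
        std::fprintf(stderr, "rocblas_sgemm: %s\n", rocblas_status_to_string(st));
        return false;
    }
    hipError_t err = hipDeviceSynchronize();  // surfaces asynchronous launch errors
    if (err != hipSuccess) {
        std::fprintf(stderr, "HIP error after sgemm: %s\n", hipGetErrorString(err));
        return false;
    }
    return true;
}
```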


opcod3 commented Jul 23, 2023

I just did some more tests and the issue can be reproduced in the following docker images:

rocm/dev-ubuntu-22.04:5.5.1-complete
rocm/dev-ubuntu-22.04:5.4.2-complete
rocm/dev-ubuntu-22.04:5.3-complete

I did not test any other versions, but I assume the bug has been present in all versions since at least ROCm 5.3.


IMbackK commented Jul 23, 2023

I can confirm that this issue exists when the example above is executed with any combination of MI25, MI50 and RX 6800 XT cards, but does not exist (as expected) when only two MI50s are present.


opcod3 commented Jul 23, 2023

Building rocBLAS without Tensile (BUILD_WITH_TENSILE=OFF) appears to fix the issue.


YellowRoseCx commented Jul 23, 2023

I also have this issue with a 6800 XT and a Vega 64.

I also experience similar issues when using multi-GPU Torch with ROCm. I have a collection of my errors and debugging notes for the Torch case here: https://rentry.org/tcahd


opcod3 commented Jul 23, 2023

Doing some more troubleshooting: apparently, calling rocblas_initialize() before using any other rocBLAS functions fixes the issue.
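
In code form, the workaround is a single call at startup; this is a minimal sketch, and the surrounding per-device GEMM code is assumed to stay as in the reproducer above.

```cpp
// Minimal sketch of the workaround: call rocblas_initialize() once at startup,
// before creating any handles, so Tensile code objects for every supported
// architecture are loaded eagerly.
#include <rocblas/rocblas.h>

int main()
{
    rocblas_initialize();   // eager, one-time load of all Tensile code objects

    // ... then create handles and run rocblas_sgemm() per device as before ...
    return 0;
}
```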

@YellowRoseCx

> Doing some more troubleshooting: apparently, calling rocblas_initialize() before using any other rocBLAS functions fixes the issue.

This is true for AMD, but I've had people report that it breaks hipBLAS usage on NVIDIA and Intel GPUs, since it is a call into rocBLAS rather than a HIP function.


IMbackK commented Jul 24, 2023

It's also an optional call; this is still a serious bug.


rkamd commented Jul 25, 2023

Thanks for reporting the issue. We are currently investigating it and will provide an update as soon as possible.
rocblas_initialize() loads all the Tensile code objects (for all supported GFX ISA targets), hence the results are as expected when rocblas_initialize() is used.

IMbackK added a commit to IMbackK/pytorch that referenced this issue Aug 9, 2023
IMbackK added a commit to IMbackK/pytorch that referenced this issue Sep 1, 2023
rkamd self-assigned this Sep 7, 2023

IMbackK commented Sep 17, 2023

I would like to note that I find it pretty incredible that a ROCm major release (5.7) was allowed to go forward with this extremely fundamental and trivially reproducible bug, which simply breaks every single setup where heterogeneous architectures are present in a system.


rkamd commented Sep 29, 2023

@opcod3,
Thanks for bringing this to our notice. A fix has been merged and should be available in a future release:
rocBLAS Commit ID: bc4d8f5
Tensile Commit ID: ROCm/Tensile@24d54d7


rkamd commented Sep 29, 2023

> I would like to note that I find it pretty incredible that a ROCm major release (5.7) was allowed to go forward with this extremely fundamental and trivially reproducible bug, which simply breaks every single setup where heterogeneous architectures are present in a system.

@IMbackK, can you please use the workaround above for ROCm 5.7? A fix has been implemented and should be in the next ROCm release.


nktice commented Sep 29, 2023

> I would like to note that I find it pretty incredible that a ROCm major release (5.7) was allowed to go forward with this extremely fundamental and trivially reproducible bug, which simply breaks every single setup where heterogeneous architectures are present in a system.
>
> @IMbackK, can you please use the workaround above for ROCm 5.7? A fix has been implemented and should be in the next ROCm release.

Please note that this page explains the ROCm roadmap and current versions:
https://github.com/RadeonOpenCompute/ROCm/releases
Based on their roadmap, 5.7 is the last release in the 5.x series, and 6.0 may not be compatible with the 5.x versions.
Since this is a simple fix to an existing bug, and there may not be a 5.7.1 release because of the roadmap, I'd like to suggest that the fix be added to the 5.7 version to minimize the wait.


IMbackK commented Sep 30, 2023

> I would like to note that I find it pretty incredible that a ROCm major release (5.7) was allowed to go forward with this extremely fundamental and trivially reproducible bug, which simply breaks every single setup where heterogeneous architectures are present in a system.
>
> @IMbackK, can you please use the workaround above for ROCm 5.7? A fix has been implemented and should be in the next ROCm release.

@rkamd
@rkamd
I can recompile rocBLAS/Tensile with the patch; that is not the issue. I am merely worried about ROCm stability policies, as it would appear from the outside that there is no internal mechanism to block a release when a serious issue is found. I don't see what issue besides "silently returns incorrect results for every operation on a supported platform 100% of the time" could possibly be more serious in the world of scientific compute.

I also concur with @opcod3 that it is worrying that the ROCm runtime does not throw an error when a kernel launch fails due to the wrong architecture, but instead silently continues with garbage data and only logs this as a warning. In my opinion, a failed kernel launch of this kind should cause an assert. Please confirm whether you have raised this problem as a bug internally; otherwise I would like to file a bug against the runtime.

I would also respectfully request that a system with heterogeneous architectures be included in internal conformance testing, if such a system is not available already.

That said, thank you for fixing this issue and for including the unsupported legacy platforms in the fix. Your (and AMD's in general) efforts in providing an open-source compute platform are much appreciated; indeed, great progress has been made in this direction in recent years.


nktice commented Jan 23, 2024

I'd like to report that this issue appears resolved for me at this time!
Here's the guide I wrote with the instructions I used to get it working:
https://github.com/nktice/AMD-AI/blob/main/ROCm6.0.md

@xiaobo1025

First of all, this may be the wrong report for this. A description of the problem:
-- OS detected is ubuntu
/usr/bin/python3.8 -m venv /root/workspace/rocBLAS/build/virtualenv --system-site-packages --clear
The virtual environment was not created successfully because ensurepip is not
available. On Debian/Ubuntu systems, you need to install the python3-venv
package using the following command.

apt install python3.8-venv
You may need to use sudo with that command. After installing the python3-venv
package, recreate your virtual environment.

Failing command: ['/root/workspace/rocBLAS/build/virtualenv/bin/python3.8', '-Im', 'ensurepip', '--upgrade', '--default-pip']

CMake Error at cmake/virtualenv.cmake:23 (message):
1
Call Stack (most recent call first):
cmake/virtualenv.cmake:49 (virtualenv_create)
CMakeLists.txt:139 (virtualenv_install)

-- Configuring incomplete, errors occurred!
Then I ran apt update and apt install python3.8-venv:
Fetched 5452 B in 1s (3877 B/s)
Selecting previously unselected package python3.8-venv.
(Reading database ... 49445 files and directories currently installed.)
Preparing to unpack .../python3.8-venv_3.8.10-0ubuntu1~20.04.9_amd64.deb ...
Unpacking python3.8-venv (3.8.10-0ubuntu1~20.04.9) ...
Setting up python3.8-venv (3.8.10-0ubuntu1~20.04.9) ...
And then, the error is as follows:
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.VersionConflict: (setuptools 44.0.0 (/root/workspace/rocBLAS/build/virtualenv/lib/python3.8/site-packages), Requirement.parse('setuptools>=62.4'))

ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
CMake Error at cmake/virtualenv.cmake:68 (message):
1
Call Stack (most recent call first):
CMakeLists.txt:139 (virtualenv_install)
Then I ran pip install --upgrade setuptools:
Installing collected packages: setuptools
Attempting uninstall: setuptools
Found existing installation: setuptools 69.0.2
Uninstalling setuptools-69.0.2:
Successfully uninstalled setuptools-69.0.2
Successfully installed setuptools-69.0.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.3.1 -> 23.3.2
[notice] To update, run: python3 -m pip install --upgrade pip
But then it still reports the same error:
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.VersionConflict: (setuptools 44.0.0 (/root/workspace/rocBLAS/build/virtualenv/lib/python3.8/site-packages), Requirement.parse('setuptools>=62.4'))

ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
CMake Error at cmake/virtualenv.cmake:68 (message):
1
Call Stack (most recent call first):
CMakeLists.txt:139 (virtualenv_install)

@xiaobo1025

Could you help me to solve this problem? Thank you very much!


IMbackK commented Mar 5, 2024

@xiaobo1025 please don't spam this bug with unrelated issues.

@rkamd I can confirm this seems to be fixed in 6.0.


rkamd commented Mar 8, 2024

@IMbackK , Thanks for verifying.

Closing this issue.

rkamd closed this as completed Mar 8, 2024