[Bug]: Incorrect results when using GPUs with different architectures #1346
Comments
I just did some more tests and the issue can be reproduced in the following docker images:
I did not test any other versions but I assume the bug should be present in all versions since at least ROCm-5.3
I can confirm that this issue exists when the example above is executed with any combination of MI25, MI50 and RX 6800 XT, but does not exist (as expected) when only two MI50s are present.
Building rocBLAS without Tensile (
I also have this issue with a 6800 XT and a Vega 64, and I also experience similar issues when using multi-GPU Torch with ROCm. I have a collection of my errors and debugging notes for the Torch experience here: https://rentry.org/tcahd
Doing some more troubleshooting, apparently calling
This is true for AMD, but I've had people report that it will break hipBLAS usage on NVIDIA and Intel GPUs, since it's a call to rocBLAS instead of a HIP function.
It's also an optional call; this is still a serious bug.
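The exact call referenced above is cut off in this thread. As a hedged illustration only: a commonly used mitigation for this class of bug is to call rocblas_initialize() once per device before doing any GEMM work, and the sketch below assumes that is the call being discussed; the platform guard reflects the concern that it is a rocBLAS call rather than a HIP function.

```cpp
// Hedged sketch: assumes the workaround discussed above is rocblas_initialize(),
// called once per device before any GEMM work. Guarded so it only runs on the
// AMD HIP platform, since the call is rocBLAS-specific rather than a HIP API.
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>

void init_rocblas_on_all_devices()
{
#if defined(__HIP_PLATFORM_AMD__)
    int device_count = 0;
    if (hipGetDeviceCount(&device_count) != hipSuccess)
        return;

    for (int dev = 0; dev < device_count; ++dev)
    {
        // Make each device current, then force rocBLAS to load its kernels
        // for that device's architecture up front.
        if (hipSetDevice(dev) == hipSuccess)
            rocblas_initialize();
    }
    hipSetDevice(0); // restore a known current device
#endif
}
```

As noted above, this is only an optional mitigation; the underlying bug still stands.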
Thanks for reporting the issue. We are currently investigating it and will provide an update as soon as possible.
I would like to note that I find it pretty incredible that a ROCm major release (5.7) was allowed to go forward with this extremely fundamental and trivially reproducible bug, which simply breaks every setup with heterogeneous architectures in the same system.
@opcod3 , |
@IMbackK , Can you please use the workaround above for ROCm 5.7? A fix has been implemented and it should be in the next ROCm release.
Please note that this page explains the ROCm roadmap and current versions...
@rkamd I also concur with @opcod3 that it's worrying that the ROCm runtime does not throw an error when a kernel launch fails due to the arch being wrong, but instead silently continues with garbage data and only logs this as a warning. In my opinion a failed kernel launch of this kind should cause an assert. Please confirm whether you have raised this problem as a bug internally or not, as otherwise I would like to file a bug against the runtime. I would also respectfully request that a system with heterogeneous architectures be included in internal conformance testing, if such a system is not available already. That said, thank you for fixing this issue and including the unsupported legacy platforms in the fix; your (and AMD's in general) efforts in providing an open source compute platform are much appreciated. Indeed great progress has been made in this direction in recent years.
I'd like to report that this issue appears resolved for me at this time!
First of all, this is the wrong report.
apt install python3.8-venv
Failing command: ['/root/workspace/rocBLAS/build/virtualenv/bin/python3.8', '-Im', 'ensurepip', '--upgrade', '--default-pip']
CMake Error at cmake/virtualenv.cmake:23 (message):
-- Configuring incomplete, errors occurred!
Could you help me to solve this problem? Thank you very much!
@xiaobo1025 please don't spam this bug with unrelated issues. @rkamd I can confirm this seems to be fixed in 6.0.
@IMbackK , Thanks for verifying. Closing this issue.
Describe the bug
rocBLAS returns incorrect results when used on two GPUs with different architectures.
This issue was first encountered in turboderp/exllama#173; the provided code to reproduce it is based on rocBLAS-Examples.
When using rocBLAS and performing computations on two GPUs with different architectures, the first computation on each card will be correct, while any subsequent ones performed on the first card will be incorrect.
To Reproduce
Steps to reproduce the behavior:
Ensure the current system has at least two GPUs and that the architecture of GPU0 is different from that of GPU1
Install ROCm and rocBLAS v5.6.0 (also present on 5.5.1, possibly earlier as well)
Run make to compile the example code (bug-report.zip)
Run ./gemm
Observe how the first two calculations pass while all subsequent ones that execute on GPU0 fail (a minimal sketch of such a reproducer follows below)
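The attached bug-report.zip is not reproduced in this thread. The following is a minimal sketch, not the attached code, of the kind of reproducer described: run the same SGEMM on GPU0, then GPU1, then GPU0 again, and check each result against the known answer. Matrix size and fill values are illustrative.

```cpp
// Minimal sketch of the reproduction pattern described above (not bug-report.zip).
// With A and B filled with ones, every element of C = A*B must equal n, so a
// wrong-architecture kernel launch shows up as an immediate value mismatch.
#include <cstdio>
#include <vector>
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>

static bool run_gemm_on_device(int dev, int n)
{
    hipSetDevice(dev);

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    std::vector<float> hA(n * n, 1.0f), hB(n * n, 1.0f), hC(n * n, 0.0f);
    float *dA, *dB, *dC;
    hipMalloc(&dA, hA.size() * sizeof(float));
    hipMalloc(&dB, hB.size() * sizeof(float));
    hipMalloc(&dC, hC.size() * sizeof(float));
    hipMemcpy(dA, hA.data(), hA.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dB, hB.data(), hB.size() * sizeof(float), hipMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                  n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    hipMemcpy(hC.data(), dC, hC.size() * sizeof(float), hipMemcpyDeviceToHost);

    // Every element should equal n; anything else means a bad result.
    bool ok = true;
    for (float v : hC)
        if (v != static_cast<float>(n)) { ok = false; break; }

    hipFree(dA); hipFree(dB); hipFree(dC);
    rocblas_destroy_handle(handle);
    return ok;
}

int main()
{
    const int n = 512;
    // The reported failure pattern: GPU0 passes, GPU1 passes, then GPU0 fails.
    printf("GPU0, first run : %s\n", run_gemm_on_device(0, n) ? "pass" : "FAIL");
    printf("GPU1, first run : %s\n", run_gemm_on_device(1, n) ? "pass" : "FAIL");
    printf("GPU0, second run: %s\n", run_gemm_on_device(0, n) ? "pass" : "FAIL");
    return 0;
}
```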
Expected behavior
It is expected that all calculations complete correctly.
Log-files
Running AMD_LOG_LEVEL=2 ./gemm produces the following log. I believe the key log entries are the following:
Environment
environment.txt
This has also been reproduced in the rocm/dev-ubuntu-22.04:5.5.1-complete docker container.
Additional context
According to other users in turboderp/exllama#173 the issue also occurs between MI25 and MI50 cards. I can also report that it occurs between any combination of the two cards I listed above and a 7900 XTX.
Inverting the order of the computations (running a calculation on GPU1 first and then on GPU0) results in the same exact behavior, but with the failing card being GPU1 instead of GPU0 as before.
From looking at more logs and rocBLAS internals I believe the error is related to the Tensile library. The behavior encountered seems to indicate that when a second .hsaco file is loaded it somehow overrides the original one with the correct architecture for the first card. I am unsure whether this is an issue in Tensile itself or in the way rocBLAS uses it.
In my opinion attempting to execute a kernel with an incorrect architecture should produce a crash or an error, instead of carrying on as normal and returning incorrect results.
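As a small diagnostic aid (not part of the original report), a sketch like the one below can confirm that a system really does pair different architectures, which is the precondition for the silent corruption described here; it simply prints the architecture string HIP reports for each device.

```cpp
// Hedged diagnostic sketch: list each device's reported architecture so a
// heterogeneous setup (the precondition for this bug) is easy to spot.
// gcnArchName carries the target string, e.g. "gfx906:sramecc+:xnack-".
#include <cstdio>
#include <hip/hip_runtime.h>

int main()
{
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess)
        return 1;

    for (int dev = 0; dev < count; ++dev)
    {
        hipDeviceProp_t props{};
        if (hipGetDeviceProperties(&props, dev) == hipSuccess)
            printf("GPU%d: %s (%s)\n", dev, props.name, props.gcnArchName);
    }
    return 0;
}
```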