[Bug]: Incorrect results when using GPUs with different architectures #1346
Comments
I just did some more tests and the issue can be reproduced in the following docker images:
I did not test any other versions but I assume the bug should be present in all versions since at least ROCm-5.3
I can confirm that this issue exists when the example above is executed with any combination of MI25, MI50 and RX 6800 XT, but does not exist (as expected) when only two MI50s are present.
Building rocBLAS without Tensile (
I also have this issue with a 6800 XT and a Vega 64, and I also experience similar issues when using multi-GPU Torch with ROCm. I have a collection of my errors and debugging notes for the Torch experience here: https://rentry.org/tcahd
Doing some more troubleshooting, apparently calling
This is true for AMD, but I've had people report that it will break hipBLAS usage on NVIDIA and Intel GPUs, since it's a call to rocBLAS instead of a HIP function.
It's also an optional call; this is still a serious bug.
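The exact call referenced above is cut off in this thread. As a hedged illustration only: a commonly used mitigation for this class of bug is to call rocblas_initialize() once per device before doing any GEMM work, and the sketch below assumes that is the call being discussed; the platform guard reflects the concern that it is a rocBLAS call rather than a HIP function.

```cpp
// Hedged sketch: assumes the workaround discussed above is rocblas_initialize(),
// called once per device before any GEMM work. Guarded so it only runs on the
// AMD HIP platform, since the call is rocBLAS-specific rather than a HIP API.
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>

void init_rocblas_on_all_devices()
{
#if defined(__HIP_PLATFORM_AMD__)
    int device_count = 0;
    if (hipGetDeviceCount(&device_count) != hipSuccess)
        return;

    for (int dev = 0; dev < device_count; ++dev)
    {
        // Make each device current, then force rocBLAS to load its kernels
        // for that device's architecture up front.
        if (hipSetDevice(dev) == hipSuccess)
            rocblas_initialize();
    }
    hipSetDevice(0); // restore a known current device
#endif
}
```

As noted above, this is only an optional mitigation; the underlying bug still stands.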
Thanks for reporting the issue. We are currently investigating it and will provide an update as soon as possible.
I would like to note that I find it pretty incredible that a ROCm major release (5.7) was allowed to go forward with this extremely fundamental and trivially reproducible bug, which simply breaks every setup with heterogeneous architectures in the same system.
@opcod3 , |
@IMbackK , Can you please use the workaround above for ROCm 5.7? A fix has been implemented and it should be in the next ROCm release.
Please note that this page explains the ROCm roadmap and current versions...
@rkamd I also concur with @opcod3 that it's worrying that the ROCm runtime does not throw an error when a kernel launch fails due to the arch being wrong, but instead silently continues with garbage data and only logs this as a warning. In my opinion a failed kernel launch of this kind should cause an assert. Please confirm whether you have raised this problem as a bug internally or not, as otherwise I would like to file a bug against the runtime. I would also respectfully request that a system with heterogeneous architectures be included in internal conformance testing, if such a system is not available already. That said, thank you for fixing this issue and including the unsupported legacy platforms in the fix; your (and AMD's in general) efforts in providing an open source compute platform are much appreciated. Indeed great progress has been made in this direction in recent years.
I'd like to report that this issue appears resolved for me at this time!
First of all, this is the wrong report.
apt install python3.8-venv
Failing command: ['/root/workspace/rocBLAS/build/virtualenv/bin/python3.8', '-Im', 'ensurepip', '--upgrade', '--default-pip']
CMake Error at cmake/virtualenv.cmake:23 (message):
-- Configuring incomplete, errors occurred!
Could you help me to solve this problem? Thank you very much!
@xiaobo1025 please don't spam this bug with unrelated issues. @rkamd I can confirm this seems to be fixed in 6.0.
@IMbackK , Thanks for verifying. Closing this issue.
Describe the bug
rocBLAS returns incorrect results when used on two GPUs with different architectures.
This issue was first encountered in turboderp/exllama#173; the provided code to reproduce it is based on rocBLAS-Examples.
When using rocBLAS and performing computations on two GPUs with different architectures, the first computation on each card will be correct, while any subsequent ones performed on the first card will be incorrect.
To Reproduce
Steps to reproduce the behavior:
Ensure the current system has at least two GPUs and that the architecture of GPU0 is different from that of GPU1
Install ROCm and rocBLAS v5.6.0 (also present on 5.5.1, possibly earlier as well)
Run make to compile the example code (bug-report.zip)
Run ./gemm
Observe how the first two calculations pass while all subsequent ones that execute on GPU0 fail (a minimal sketch of such a reproducer follows below)
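The attached bug-report.zip is not reproduced in this thread. The following is a minimal sketch, not the attached code, of the kind of reproducer described: run the same SGEMM on GPU0, then GPU1, then GPU0 again, and check each result against the known answer. Matrix size and fill values are illustrative.

```cpp
// Minimal sketch of the reproduction pattern described above (not bug-report.zip).
// With A and B filled with ones, every element of C = A*B must equal n, so a
// wrong-architecture kernel launch shows up as an immediate value mismatch.
#include <cstdio>
#include <vector>
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>

static bool run_gemm_on_device(int dev, int n)
{
    hipSetDevice(dev);

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    std::vector<float> hA(n * n, 1.0f), hB(n * n, 1.0f), hC(n * n, 0.0f);
    float *dA, *dB, *dC;
    hipMalloc(&dA, hA.size() * sizeof(float));
    hipMalloc(&dB, hB.size() * sizeof(float));
    hipMalloc(&dC, hC.size() * sizeof(float));
    hipMemcpy(dA, hA.data(), hA.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dB, hB.data(), hB.size() * sizeof(float), hipMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                  n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    hipMemcpy(hC.data(), dC, hC.size() * sizeof(float), hipMemcpyDeviceToHost);

    // Every element should equal n; anything else means a bad result.
    bool ok = true;
    for (float v : hC)
        if (v != static_cast<float>(n)) { ok = false; break; }

    hipFree(dA); hipFree(dB); hipFree(dC);
    rocblas_destroy_handle(handle);
    return ok;
}

int main()
{
    const int n = 512;
    // The reported failure pattern: GPU0 passes, GPU1 passes, then GPU0 fails.
    printf("GPU0, first run : %s\n", run_gemm_on_device(0, n) ? "pass" : "FAIL");
    printf("GPU1, first run : %s\n", run_gemm_on_device(1, n) ? "pass" : "FAIL");
    printf("GPU0, second run: %s\n", run_gemm_on_device(0, n) ? "pass" : "FAIL");
    return 0;
}
```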
Expected behavior
It is expected that all calculations complete correctly.
Log-files
Running AMD_LOG_LEVEL=2 ./gemm produces the following log. I believe the key log entries are the following:
Environment
environment.txt
This has also been reproduced in the rocm/dev-ubuntu-22.04:5.5.1-complete docker container.
Additional context
According to other users in turboderp/exllama#173 the issue also occurs between MI25 and MI50 cards. I can also report that it occurs between any combination of the two cards I listed above and a 7900 XTX.
Inverting the order of the computations (running a calculation on GPU1 first and then on GPU0) results in the same exact behavior, but with the failing card being GPU1 instead of GPU0 as before.
From looking at more logs and rocBLAS internals I believe the error is related to the Tensile library. The behavior encountered seems to indicate that when a second .hsaco file is loaded it somehow overrides the original one with the correct architecture for the first card. I am unsure whether this is an issue in Tensile itself or in the way rocBLAS uses it.
In my opinion attempting to execute a kernel with an incorrect architecture should produce a crash or an error, instead of carrying on as normal and returning incorrect results.
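As a small diagnostic aid (not part of the original report), a sketch like the one below can confirm that a system really does pair different architectures, which is the precondition for the silent corruption described here; it simply prints the architecture string HIP reports for each device.

```cpp
// Hedged diagnostic sketch: list each device's reported architecture so a
// heterogeneous setup (the precondition for this bug) is easy to spot.
// gcnArchName carries the target string, e.g. "gfx906:sramecc+:xnack-".
#include <cstdio>
#include <hip/hip_runtime.h>

int main()
{
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess)
        return 1;

    for (int dev = 0; dev < count; ++dev)
    {
        hipDeviceProp_t props{};
        if (hipGetDeviceProperties(&props, dev) == hipSuccess)
            printf("GPU%d: %s (%s)\n", dev, props.name, props.gcnArchName);
    }
    return 0;
}
```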