Segmentation fault after model load on ROCm multi-gpu, multi-gfx #4030
Comments
Unclear if related to #3991 |
Make sure that HIP code is compiled for all your architectures. For example:
where "main" is file name of executable that crashes. The output should include something like "amdgcn-amd-amdhsa--gfx1100" and "amdgcn-amd-amdhsa--gfx1030". |
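The command from the comment above did not survive in this copy of the thread. As a rough, tool-independent sketch (not the commenter's method), the snippet below scans an executable's raw bytes for target strings of the form `amdgcn-amd-amdhsa--gfxNNNN`, the pattern quoted above. The function name `find_gfx_targets` is a hypothetical helper, and the sketch assumes the target IDs are stored as plain ASCII in the binary, which may not hold for compressed code objects.

```python
import re
from pathlib import Path

# Pattern for the target IDs quoted in the thread, e.g. "amdgcn-amd-amdhsa--gfx1100".
# Assumption: the IDs appear as plain ASCII inside the executable.
TARGET_RE = re.compile(rb"amdgcn-amd-amdhsa--(gfx[0-9a-f]+)")


def find_gfx_targets(blob: bytes) -> list[str]:
    """Return the sorted, de-duplicated gfx architectures found in blob."""
    return sorted({m.group(1).decode("ascii") for m in TARGET_RE.finditer(blob)})


if __name__ == "__main__":
    # "./main" is the executable name used in this issue; adjust as needed.
    path = Path("./main")
    if path.is_file():
        print(find_gfx_targets(path.read_bytes()))
    else:
        print(f"{path} not found")
```

If the list printed for a multi-gfx build is missing one of the installed architectures, the HIP code was most likely not compiled for that card.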
Thanks for the response. Unfortunately, I think it is indeed compiling as expected for both gfx architectures.
|
I can confirm I have a similar/same issue with rocm + multi-gpu. Running llamacpp ( Running the below cmd and partial output
llama.cpp compiled with gfx architectures
|
I've encountered the same issue, but noticed this Limitations page on AMD which stipulates that multi-GPU support cannot span multiple PCIe paths, meaning that the GPUs must be connected to the CPU directly, as opposed to the CPU and then Chipset-to-CPU. @xangelix - You mentioned it was working before (do you know what commit/tag?). Can you confirm if your motherboard's configuration supports multiple GPUs directly connected to the CPU? I'm in the same boat as you (just a generation down), and I think I'm out of luck with my specific motherboard PCIe lane configuration. |
As per https://dlcdnets.asus.com/pub/ASUS/mb/Socket%20AM5/ProArt%20X670E-CREATOR%20WIFI/E21293_ProArt_X670E-CREATOR_WIFI_UM_V2_WEB.pdf?model=ProArt%20X670E-CREATOR%20WIFI (page vii) my motherboard's top two slots, the ones I use for GPUs, are in 8x 8x bifurcation mode which uses lanes directly from the cpu. I don't at the moment know what commit llamacpp last worked with--but I did remember a few days ago when talking to some koboldcpp folk that it ONLY ever worked for me with the I would be suspicious of any AMD support claims both in the negative and positive direction. Don't let it get your hopes down (but maybe don't expect AMD to directly help either...). I'd guess that page has more to do with enterprise support commitment rather than if it should actually function or not. I haven't gotten a single gfx1100 pytorch error since I purchased that card, almost a year before AMD claimed any support at all for it. |
So I can confirm this bug on an EPYC system with MI50s, i.e. a fully ROCm-supported configuration. I can also confirm that this configuration passes all the rocBLAS tests, and I run lots of multi-GPU workloads with no issue. Below is a backtrace; it seems that hipMemcpy2DAsync is somehow misused.
|
I can confirm this issue for a single GPU as well. I only have an RX 590 (gfx803) and it's giving the same error. This started happening a few weeks ago after a system update. I believe this is ROCm's fault; I even tried a complete reinstall of my system and nothing changed. Other programs that use ROCm, such as Koboldcpp and Stable Diffusion, don't work either.
|
That's a totally unrelated issue. This issue affects P2P transfers in llamacpp only (nowhere else). gfx803 is known to be broken: it fails a lot of ROCm unit tests, and the oldest architecture that passes all ROCm tests is gfx900. |
Man, I should have read your comment more carefully. I just bisected back to b1060 without success, apart from getting it to run on one GPU with HIP_VISIBLE_DEVICES=0 in a two-GPU setup (6600M, both PCIe 3.0 x16, Fedora rawhide, ROCm 6.0 amdgpu-install). I just got these cards for my old desktop, so I don't have much recent AMD experience.
So is there any way to get |
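The single-GPU workaround mentioned above comes down to setting `HIP_VISIBLE_DEVICES` before launching the binary. Here is a minimal sketch of doing that from a script, assuming the `./main` executable from this issue; `single_gpu_env` and `run_on_single_gpu` are hypothetical helper names, not part of llama.cpp.

```python
import os
import subprocess


def single_gpu_env(device_index: int = 0, base_env=None) -> dict:
    """Copy the environment and expose only one HIP device to the child process."""
    env = dict(os.environ if base_env is None else base_env)
    # HIP_VISIBLE_DEVICES is the variable used in the comment above.
    env["HIP_VISIBLE_DEVICES"] = str(device_index)
    return env


def run_on_single_gpu(argv: list[str], device_index: int = 0) -> int:
    """Run argv (e.g. ["./main", "-m", "model.gguf"]) restricted to one GPU."""
    result = subprocess.run(argv, env=single_gpu_env(device_index), check=False)
    return result.returncode
```

This only hides the second card from the process; it sidesteps the multi-GPU segfault rather than fixing it.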
Difference between the one GPU (
vs
So just around 5 tokens per second difference. |
The reason why |
It seems like it somehow does work around a ROCm issue. I just tested it on HEAD and it segfaulted:
|
According to
|
Apologies, it was cut off. I attached the full log, where you can see it is switching devices even though the 7B Q8 model fits into the VRAM of one GPU just fine. |
Sup guys, I'm also getting the segmentation fault when using my two RX 6800s. |
I recently upgraded from an old working version of llama.cpp, where 2x GPUs worked flawlessly with ROCm. I'm unable to use llama.cpp at all now. Other loaders seem to work fine, so it is likely a problem with llama.cpp and not ROCm (6.0). |
I just ran HEAD again and it works again on two AMD GPUs with ROCm 6.
Whatever you changed in #4766 (tag b1843) fixed it!
…On Wed, Dec 20, 2023 at 3:29 PM slaren ***@***.***> wrote:
The reason why --low-ram was removed is because you can get very similar VRAM usage by reducing the batch size and disabling KV offloading, i.e. with --no-kv-offload -b 1. I am not sure if it also happened to work around some issue with ROCm, but that needs to be fixed separately.
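As an illustration of the suggestion quoted above, this sketch assembles a command line that combines `--no-kv-offload` and `-b 1` with the `-m` and `-ngl` options from the reproduce steps of this issue. The helper name `low_vram_args` and the model path are hypothetical; whether these flags exist depends on the llama.cpp version in use.

```python
import shlex


def low_vram_args(model_path: str, gpu_layers: int = 99, batch_size: int = 1) -> list[str]:
    """Build a ./main command line with KV offloading disabled and a small batch."""
    return [
        "./main",
        "-m", model_path,
        "-ngl", str(gpu_layers),   # number of layers to offload, as in the issue
        "--no-kv-offload",         # keep the KV cache off the GPU
        "-b", str(batch_size),     # reduced batch size
    ]


if __name__ == "__main__":
    # Print the command instead of running it; "model.gguf" is a placeholder.
    print(shlex.join(low_vram_args("model.gguf")))
```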
|
Sup bro, try the latest llamacpp version. It's working great now with multi-GPU, and it even works with ROCm 6.
Not sure what I need to do to solve it; I'm going to start reading to see what I can find. |
Sup bros, I installed ROCm 5.6 and did the following when compiling the latest llamacpp
then I set the following environment variables
Now I'm able to run one RX 6900, two RX 6800s and one RX 6700 all together in multi-GPU. Everything is working great now! I'm using an old mobo with PCIe x1 Gen1, though, so it takes a long time to load models, but once they're loaded inference is fast. I have an extra RX 5700 and I'm wondering how to add it to my setup. Any ideas? |
@xangelix bro, confirm whether my steps above work for you so we can close the issue |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
NA
Current Behavior
Segmentation fault after model load for ROCm multi-gpu, multi-gfx. Best I can remember it worked a couple months ago, but has now been broken at least 2 weeks.
Environment and Context
rocminfo
lscpu
uname -a
ROCm 5.7.1
llamacpp 4a4fd3e
python3 --version
make --version
g++ --version
Failure Information (for bugs)
Provided below.
Steps to Reproduce
make LLAMA_HIPBLAS=1
./main -ngl 99 -m ../koboldcpp/models/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/mistral-7b-instruct-v0.1.Q5_K_M.gguf -mg 0 -p "Write a function in TypeScript that sums numbers"
Failure Logs
Provided above.