Splitting model on multiple GPUs is broken (ROCm) #173
I only have a single GPU, so I can't test. But @jmoney7823956789378 successfully ran exllama with two MI60s, so unless a regression happened, it should work. How exactly are you running it?
Yep, I've had this issue with certain models and I don't think I ever figured out a 100% solution. One weird thing I noticed with MI60s specifically was that plain GPTQ-for-LLaMA ran faster than exllama... btw, I am selling them if any of you smart rocm/exllama devs are interested :)
How do the MI60s perform compared to 3090s?
The 3090s are much faster (currently). I think the HBM2 in the MI60s has a lot of potential, but I have no idea how to take advantage of it.
I am not using
@jmoney7823956789378 You say your issue was model-specific: did the models that fail on two cards run on a single one? If they did, then what I am about to say is certainly false, but otherwise I have a feeling that maybe some buffers are architecture-specific and are stored differently between the two cards, and for some reason they are not getting converted when copied from one card to the other. It feels like that is the only issue that could occur, since the models work individually. But I may be completely off base, as I don't have much experience with GPU compute.
I am very tempted by MI60s or MI25s (x4), but I'd have to re-install my 2nd CPU and another power supply. I don't even want to know what it would idle at. Right now it's 200-230 W, and 1000 W during inference or training. The GPUs would be cheap, but the power bill will come get me.
Aren't the MI25 cards unsupported by new ROCm releases? And AFAIK the MI50 (same arch as the MI60) will be EOL next year, so no more new features for them. No idea how big of a deal that is.
I'm not able to recall specifically. I also recommend doing fresh installs of the ROCm drivers, rebooting, then attempting inference if possible. As the ancient IT proverb states, "turn it off and back on again" has worked more often than I'd like to admit. You could be right about the change in arch, but I can't confirm. :(
I've already rebooted a few times. Retrying once more won't hurt, though. I'll also try the Docker images provided by AMD; maybe that'll help.
I know what you mean. Back when I was still messing with the MI60s I ended up with a second cheap EPYC system, since they have all that PCIe on a relatively low power budget.
I just tested out TheBloke/Llama-2-13B-chat-GPTQ and TheBloke/Llama-2-70B-chat-GPTQ on the two MI60s:
- Single MI60, 13B:
- Two MI60s, 13B:
- Two MI60s, 70B:
Same models on GPTQ-for-LLaMa, two MI60s:
- 13B:
- 70B:
There's an update coming soon that should bump this up a little bit. Not sure by how much, but the GQA implementation is a little stupid at the moment, with a relatively expensive reshaping of the K/V cache to make the number of heads align with the queries. Flash Attention 2.0 supports grouping directly, so it's going to be faster on 70B, if only for avoiding that reshaping step. The only holdup at the moment is that it currently does causal masking in kind of a broken way. But they're working on that over there, and I'm trying to find a workaround in the meantime.
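For illustration, here is a toy PyTorch sketch of that reshaping step, with made-up shapes rather than exllama's actual code: the shared K/V heads are materialised with repeat_interleave so plain attention can be applied, which is exactly the allocation a kernel with native GQA support can skip.

```python
import torch

# Toy GQA shapes (illustrative only): 8 query heads sharing 2 K/V heads.
bsz, q_heads, kv_heads, seq, head_dim = 1, 8, 2, 16, 128
q = torch.randn(bsz, q_heads, seq, head_dim)
k = torch.randn(bsz, kv_heads, seq, head_dim)
v = torch.randn(bsz, kv_heads, seq, head_dim)

# The "expensive reshaping": repeat K/V so the head counts match the queries,
# then run ordinary attention over the enlarged tensors.
k_rep = k.repeat_interleave(q_heads // kv_heads, dim=1)
v_rep = v.repeat_interleave(q_heads // kv_heads, dim=1)
scores = q @ k_rep.transpose(-1, -2) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v_rep

# A kernel that understands grouping (e.g. Flash Attention 2) can index the
# shared K/V heads directly and never has to allocate k_rep / v_rep.
```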
Flash Attention doesn't build on ROCm, and supposedly never will (according to their devs).
hooooooooooooly shit. rocm actually developing rocm? preposterous.
Why do you think that PyTorch, TensorFlow, and Triton work on ROCm? They are adding support for that themselves. They just rarely touch end-user software, but they add support in all the big libraries. Intel does the same.
Just joking around. Here's some output from the ROCm flash-attention test docker container. An absolute TON of lines were removed, all stating "...does not support this problem"
Besides that, the numbers look promising!
That isn't horrible, at least. It would be good for 4x MI25, since together they cost what one MI60 does.
PyTorch ROCm doesn't work? It worked on my RX 580.
True, though the MI60s are a slightly newer generation of chip. I only got to test MI25s for a very short time, before exllama was out. If I remember correctly they did about 5 t/s on 13B (stock BIOS).
That's not that great. People were using them for SD and kept saying they had good FP16 perf. I think they were flashing them though.
Yep, I had flashed it after the fact but I don't think I have the perf stats saved.
Guys, back to the issue at hand: does anyone have any tips on how to figure out where the computation is breaking? Is it possible to access the raw tensors with PyTorch, as in the actual bytes that I can look through with a hex editor, to see if they are what I expect?
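In case it helps, here is a minimal sketch of dumping a tensor's raw bytes to a file so it can be inspected in a hex editor; the tensor name in the usage comment is hypothetical.

```python
import torch

def dump_tensor_bytes(t: torch.Tensor, path: str) -> None:
    # Move to CPU, make the layout contiguous, and write the raw buffer.
    # Note: bfloat16 would need a cast first (NumPy has no bfloat16 dtype);
    # fp16/fp32 tensors dump as-is.
    t = t.detach().to("cpu").contiguous()
    with open(path, "wb") as f:
        f.write(t.numpy().tobytes())

# e.g. dump the same tensor from the single-GPU run and the split run, then
# diff the two files:
# dump_tensor_bytes(hidden_states, "/tmp/hidden_states_split.bin")
```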
Another shot in the dark, but are you able to roll back to ROCm 5.5? That's what I have on my MI60s, under bare-metal Ubuntu 22.04.
I would say if the model works on either GPU, but not when split across both, you'll want to start by focusing on where the hidden state is moved from one GPU to the next. I assume you've already tried
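Something along these lines could be dropped in around the point where the hidden state changes device; this is only a sketch, and the function and variable names are illustrative rather than exllama's actual internals.

```python
import torch

def checked_move(hidden_states: torch.Tensor, target_device: str, layer_idx: int) -> torch.Tensor:
    # Wrap the existing .to() call so a NaN can be attributed either to the
    # producing device or to the transfer itself.
    if torch.isnan(hidden_states).any():
        print(f"NaN BEFORE move (layer {layer_idx}, {hidden_states.device})")
    moved = hidden_states.to(target_device)
    torch.cuda.synchronize()
    if torch.isnan(moved).any():
        print(f"NaN AFTER move (layer {layer_idx}, {target_device})")
    return moved
```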
Tried with ROCm 5.5.1, and with the 6800XT instead of the 7900XTX; no difference. Still garbage.
Yeah, I tried. I also already tried instrumenting
Well damn, seems that when splitting the work over two GPUs the hidden_states tensor suddenly becomes NaN
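A quick way to narrow that down, independent of exllama, is to run the same fp16 matmul on each card and then push the result across the device boundary; that separates "compute is wrong on one card" from "the copy corrupts the data". A minimal sketch, assuming at least two visible devices:

```python
import torch

torch.manual_seed(0)
a = torch.randn(512, 512, dtype=torch.float16)
b = torch.randn(512, 512, dtype=torch.float16)
ref = (a.float() @ b.float()).half()  # CPU reference result

# Per-device compute check
for dev in ("cuda:0", "cuda:1"):
    out = (a.to(dev) @ b.to(dev)).cpu()
    print(dev, "max abs diff vs CPU:", (out.float() - ref.float()).abs().max().item())

# Cross-device transfer check
moved = (a.to("cuda:0") @ b.to("cuda:0")).to("cuda:1")
print("NaNs after cuda:0 -> cuda:1 copy:", torch.isnan(moved).any().item())
```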
pretty clearly a bug in rocm or pytorch, def report it upstream
I agree that this should be reported upstream. You can also set
I don't think that's it. If it were a problem with moving tensors between devices, you should see it start out looking correct on cuda:0 and then go bad as it's moved to a different device. According to this, you get the NaN tensor on cuda:0 already.
What kind of PCIe slots are the GPUs in?
I can also reproduce this with any combination of RX 6800XT, MI25, and MI50s. All of my devices are connected with full x16 PCIe 3.0, running a kernel with CONFIG_HSA_AMD_P2P (i.e. ROCm can and does use PCIe P2P transfers).
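For completeness, PyTorch can report whether it considers peer-to-peer access possible between each pair of devices. A small sketch, assuming torch.cuda.can_device_access_peer is available in this PyTorch/ROCm build:

```python
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"cuda:{i} -> cuda:{j}: peer access {'possible' if ok else 'not possible'}")
```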
The GPUs are in x16 slots, running at up to PCIe 4.0, but they are only connected at x8 width. If I place the 7900XTX in the second slot, it hangs when running exllama (nothing ever gets loaded into VRAM); it works fine in the first slot. Other cards work fine in the second slot. Running two cards with
It seems that maybe PyTorch is attempting to load the wrong kernel onto the GPU. Or maybe not all of the correct offload architectures are being compiled into the module.
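One way to sanity-check that theory is to compare what each card reports as its architecture with the offload targets the extension was built for. A rough sketch; gcnArchName and PYTORCH_ROCM_ARCH are assumptions about this ROCm/PyTorch setup, not something confirmed in this thread:

```python
import os
import torch

# What each visible device identifies as
# (gfx1100 = 7900XTX, gfx906 = Radeon VII / MI50 / MI60).
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, getattr(props, "gcnArchName", "n/a"))

# The architectures the extension gets compiled for are commonly controlled
# via PYTORCH_ROCM_ARCH at build time, e.g. PYTORCH_ROCM_ARCH="gfx1100;gfx906".
print("PYTORCH_ROCM_ARCH =", os.environ.get("PYTORCH_ROCM_ARCH"))
```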
Same issue I have with just plain transformers on just one GPU, see ROCm/ROCm#2328 and huggingface/transformers#25007. Please write a comment that you are having a similar issue with exllama, and try the transformers script on your setup. We should probably also refer the issue to PyTorch too.
I can't replicate with your scripts. But I can only use 2 GPUs at a time in my system
After doing some more research, I believe that the issue may be in
As after the failing
If I grep for one of the names inside the ROCm directory, I find that these functions are found in the following files:
What is likely happening is that |
I've had those errors since I started making the ROCm patch, but at least on a single-GPU setup they don't seem to have any impact.
Indeed, the difference is that on a single GPU the function is eventually found:
You can see at the last line:
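One way to capture that kind of trace for both the single-GPU and the split run is to turn up HIP and rocBLAS logging before anything initialises the runtime. A sketch, assuming AMD_LOG_LEVEL and ROCBLAS_LAYER behave as documented for this ROCm version:

```python
import os

# Must be set before the HIP runtime comes up, i.e. before importing torch.
os.environ["AMD_LOG_LEVEL"] = "4"   # verbose HIP runtime logging
os.environ["ROCBLAS_LAYER"] = "1"   # rocBLAS call-trace logging

import torch  # noqa: E402
# ...load the model and run a short generation here, then diff the single-GPU
# and split-GPU traces to see where the function lookup diverges.
```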
I have figured out this is a bug in either
I've reported it upstream: ROCm/rocBLAS#1346
I'll close this issue here then.
Splitting a model between two AMD GPUs (RX 7900XTX and Radeon VII) results in garbage output (gibberish).

Tested with Llama-2-13B-chat-GPTQ and Llama-2-70B-chat-GPTQ.

Running a model on just one of the two cards, the output seems reasonable, although I can't vouch for the correctness of the 70B model as it cannot fit on a single card.

No flags seem to impact the results, although if I split the model and use --fused_mlp_thd 0 the following error occurs: Exception

Compiling #146 does not seem to impact the outcome either.

The system is running Arch Linux with python-pytorch-opt-rocm 2.0.1-7.

Output of rocminfo

I am available to do any testing that may help isolate the issue; I can try to test a third card as well (RX 6800XT).