Multi-GPU issues #281

nktice · 2023-09-09T23:46:12Z

Here's another bug on Oobabooga's project that is unresolved...
oobabooga/text-generation-webui#2923
I realized that the ExLlama team may have a solution....
So thought I'd cross post this issue on this project, in case you've not seen.

Here's the guide I wrote to get everything working on AMD kit...
https://github.com/nktice/AMD-AI
Models load fine when it is only on one card, here are some results :
https://github.com/nktice/AMD-AI/blob/main/SallyAIRiddle.md

Multi-card loading only spits out gibberish, here's an example :

pha golden Riv. Jcatred (ProcSN proc Dre -:// Mindly means for the and in a Nich říct Forest Rav Rav fran fran fran gaz Agrcastle castleasiacliordinate advers Mem advers Basibenkooor paste Singapore refugeermeanny intellectualsafe Shakespe contempor Mallmanual Quantmousektr Ge Mil shadownehfdzekADmobile Und Euenf Next Dominbuchcock Infoengo‭ Hann NAT ]] Ferr' -.-- -,-

    ason, rang,-, –-

(,,

--,.,

alter

,-

(

-on,-.

I,- .

1

V

V. film-

N

    –on.,on,.

(, for.

and of- is. . and –on, –,. and

In in

film school and I on and with and I ":

.

` andon util –

The text was updated successfully, but these errors were encountered:

Ph0rk0z · 2023-09-10T11:11:44Z

Bug in hip or rocm. On nvidia it's working to split. Other bug is OOM if you can't properly dispatch the model so it doesn't run out during inference.

nktice · 2023-09-13T04:44:36Z

Bug in hip or rocm. On nvidia it's working to split. Other bug is OOM if you can't properly dispatch the model so it doesn't run out during inference.

Thanks for your reply... I've raised the issue on HIPs github support thread :
ROCm/HIP#3331

turboderp · 2023-09-13T08:14:59Z

Just in case you haven't tried it yet, the --gpu_peer_fix argument (corresponding entry in ExLlamaConfig) might help. Maybe? It prevents direct inter-device copying even when the driver reports that the capability is there, and copies everything via system RAM instead. There have been some issues with that on NVIDIA at least.

nktice · 2023-09-14T00:22:50Z

Thanks for your reply, and your excellent coding, it's great when it works...

I looked into this, and had trouble finding how to do such a thing...
Whereas such features would be good to have options through their interface, I have requested Oobabooga add features so it can be done :
oobabooga/text-generation-webui#3912

I have been looking for ( but have yet to find again ) a page that I found...
in it they discussed that similar issues came from torch.empty
as sometimes it had not cleared all of the data leading to issues -
in it they suggest using torch.zeros instead, which helped some people.
I went through your code, and tried that for my issue, to little avail -
but thought I'd mention in case you've not heard of it, and it helps others.
[ If I find that page, I'll update this to include a proper link then. ]

turboderp · 2023-09-14T05:00:25Z

Yep, torch.empty isn't supposed to clear the data, which could cause problems if you're incorrectly assuming that an empty tensor is the same as a zeros tensor, but I think I've been mindful enough of the distinction.

--gpu_peer_fix is only a kludge to work around a particular bug in Torch (or CUDA, or the NVIDIA driver, or whatever the case may be). So it's not really a solution or anything, more a diagnostic tool, and the solution would be filing a bug report upstream if that flag fixes something that shouldn't be broken.

I'm thinking another thing to explore would be the use of at::cuda::OptionalCUDAGuard to ensure that the correct CUDA device is selected on entry to each of the extension functions. If that doesn't get properly HIPified, it could lead to ROCm working correctly on single-GPU setups but failing (perhaps even sporadically) on multi-GPU setups.

nktice · 2023-09-14T23:09:48Z

I got a reply on Oobabooga posting about the passing
of parameters such as the one you suggest. "It's on by default"
oobabooga/text-generation-webui#3912

Thanks for your replies... I have been thinking of this, so I'll mention it.
Another issue, that is somewhat related to model loading
is cache and other memory that is involved in handling models...
So for example, forum commenters noted to use split settings
that leave lots of room for caching and memory use around models -
As model tokens, caching, and management bits consume lots of space.
[ bigger the model, the more is used for tokens, and index info... ]
I am wondering if there is a loader option to describe these bits
[ Command line option, benchmark tool parameter, or something like that... ]
so one can predict the whole memory footprint a model will use?

Related to this, instead of splitting model across GPUs,
is it practical to have supporting features on another card?
This would allow for maximizing model size loaded on one card
so as to avoid an issue like I'm having splitting up model,
using 2nd card, ( or perhaps the system RAM ) for caching / tokens.
As tokens go up, supplemental memory could be more helpful.

turboderp · 2023-09-15T00:39:03Z

Cache and state has to reside on the same device as the associated weights. You can't do CUDA operations across devices, and while you could store just the cache on a separate device, it would be slower than just swapping it to system RAM, which is still slow enough to be kind of useless.

ardfork · 2023-09-29T16:19:17Z

Guess, I forgot to answer here, this is the same issue as #173 which was fixed upstream and will be available in next ROCm version.

Note that exllama v2 is also affected and this could have easily been fixed locally in exllama with a small hack like it was done in llama.cpp, but I didn't have the hardware to test.

nktice · 2024-01-23T06:56:06Z

I can now report, that using latest drivers, it seems to work now.
As in I can load a model cross-GPUs and it's responsive.
[ ROCm 6.0 , torch==2.3.0.dev20240118+rocm6.0 ]

nktice mentioned this issue Sep 13, 2023

Multi-GPU loading of models produces gibberish ROCm/HIP#3331

Closed

nktice mentioned this issue Sep 13, 2023

RuntimeError: Input type (c10::Half) and bias type (float) should be the same Stability-AI/stablediffusion#155

Open

nktice mentioned this issue Sep 14, 2023

Support passing ExLlama arguments oobabooga/text-generation-webui#3912

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multi-GPU issues #281

Multi-GPU issues #281

nktice commented Sep 9, 2023 •

edited

Loading

Ph0rk0z commented Sep 10, 2023

nktice commented Sep 13, 2023

turboderp commented Sep 13, 2023

nktice commented Sep 14, 2023 •

edited

Loading

turboderp commented Sep 14, 2023

nktice commented Sep 14, 2023 •

edited

Loading

turboderp commented Sep 15, 2023

ardfork commented Sep 29, 2023

nktice commented Jan 23, 2024

Multi-GPU issues #281

Multi-GPU issues #281

Comments

nktice commented Sep 9, 2023 • edited Loading

Ph0rk0z commented Sep 10, 2023

nktice commented Sep 13, 2023

turboderp commented Sep 13, 2023

nktice commented Sep 14, 2023 • edited Loading

turboderp commented Sep 14, 2023

nktice commented Sep 14, 2023 • edited Loading

turboderp commented Sep 15, 2023

ardfork commented Sep 29, 2023

nktice commented Jan 23, 2024

nktice commented Sep 9, 2023 •

edited

Loading

nktice commented Sep 14, 2023 •

edited

Loading

nktice commented Sep 14, 2023 •

edited

Loading