Unable to use multiple GPUs – CUDA Out of Memory issue #105
Comments
Hey @Devloper-RG, thanks for raising this issue and testing the lib in a multi-GPU setup 🙏
I guess the issue here is that we are pushing to …
Hey @eustlb, thanks for getting back to me! I made some modifications to the code to use the … Thereafter, I ran the server on a Google Cloud Platform (GCP) VM with 2 NVIDIA T4 GPUs.

During testing, I noticed that one of the GPUs consistently overloads, leading to a CUDA out-of-memory error. I tried using the DataParallel method, but it didn't resolve the issue. I also attempted to run the model in lower precision, which worked on a single GPU, but I'd like to use higher-precision models and fully leverage multiple GPUs for better performance.

Any help with getting multi-GPU support working would be greatly appreciated!
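One thing worth noting: `nn.DataParallel` replicates the full model on every GPU, so it does not reduce per-GPU memory usage; sharding the model across devices does. Below is a minimal sketch (not the project's actual loading code, just an illustration assuming the model is loaded through transformers/accelerate) of loading in fp16 with the layers split across the available GPUs:

```python
# Hypothetical sketch: shard a large LM across available GPUs instead of
# replicating it per GPU with nn.DataParallel.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # lower precision so an 8B model can fit on ~14.5 GiB T4s
    device_map="auto",          # let accelerate place layers across the GPUs
)

# Inputs go to the device holding the first layers (usually cuda:0);
# accelerate's hooks move activations between devices during the forward pass.
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```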
Hi @Devloper-RG, if you can give a snippet with some reproducible code, that would be very helpful. Otherwise, we can't know what your issue is. We have run this on a setup with multiple GPUs without any problems.
While using other models like meta-llama/Meta-Llama-3.1-8B-Instruct, I'm encountering a torch.OutOfMemoryError when trying to load the model on multiple GPUs. I have 4 GPUs, each with 14.57 GiB of memory, but the model fails to allocate memory on GPU 0, even though the other GPUs should share the load.
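A common workaround when GPU 0 alone runs out of memory is to cap how much of the model `device_map="auto"` may place on it, since GPU 0 usually also holds inputs, activations, and generation buffers. A hedged sketch, with memory caps that are only illustrative for ~14.5 GiB cards:

```python
# Hypothetical sketch: reserve headroom on GPU 0 so the auto device map
# spreads more of the weights onto the other GPUs.
import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    # Illustrative caps for four ~14.5 GiB GPUs; tune to your hardware.
    max_memory={0: "10GiB", 1: "13GiB", 2: "13GiB", 3: "13GiB"},
)

# Check where each module ended up; all four GPUs should appear here.
print(model.hf_device_map)
```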