Fix: get_balanced_memory when using multiple GPUs with small models or quantized models with a large vocabulary #3244
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
src/accelerate/utils/modeling.py
Outdated
if idx == 0 and not low_zero and module_sizes["model.embed_tokens"] > per_gpu * 0.9:
    max_memory[idx] = min(module_sizes["model.embed_tokens"] * 1.3, max_memory[idx])
elif idx == 1 and low_zero and module_sizes["model.embed_tokens"] > per_gpu * 0.9:
    max_memory[idx] = min(module_sizes["model.embed_tokens"] * 1.3, max_memory[idx])
else:
    max_memory[idx] = min(max_memory[0] if low_zero and idx == 0 else per_gpu, max_memory[idx])
Not every model has its embedding layer named model.embed_tokens. Since this is specific to transformers, I don't think this should live in accelerate; we can modify max_memory directly there as it is computed. We are also trying to tackle similar issues with PR #3066 (comment).
Maybe a good solution would be to check whether there is a module that is > per_gpu and, if so, return a message saying that the model is unbalanced (which will lead to the whole model being put on only one device) and propose that the user switch to "sequential" mode instead? Or we could do as you suggested and modify max_memory with the largest module size.
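A minimal sketch of the check suggested above, assuming a dict mapping leaf module names to their sizes in bytes; the function name and warning text are illustrative only and not part of the actual accelerate API:

import logging

logger = logging.getLogger(__name__)

def warn_if_unbalanced(leaf_module_sizes: dict, per_gpu: int) -> None:
    # Illustrative: warn when the largest leaf module exceeds the balanced per-GPU budget.
    largest_name, largest_size = max(leaf_module_sizes.items(), key=lambda kv: kv[1])
    if largest_size > per_gpu:
        logger.warning(
            f"Module '{largest_name}' needs {largest_size} bytes, more than the balanced per-GPU "
            f"budget of {per_gpu} bytes, so the whole model may end up on a single device. "
            "Consider device_map='sequential' or passing a custom max_memory instead."
        )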
Thanks for the suggestion @SunMarc, I updated the code to use the largest leaf module size instead of hardcoding the embed_tokens layer.
if idx == 0 and not low_zero and max_leave_size > per_gpu * 0.9:
    max_memory[idx] = min(max_leave_size * 1.3, max_memory[idx])
elif idx == 1 and low_zero and max_leave_size > per_gpu * 0.9:
    max_memory[idx] = min(max_leave_size * 1.3, max_memory[idx])
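For context, the largest leaf size could be derived from the per-module sizes along these lines (a sketch only, not necessarily the exact code in the PR; it treats any entry of module_sizes that is not a prefix of another entry as a leaf):

def largest_leaf_size(module_sizes: dict) -> int:
    # A "leaf" here is any named entry with no child entry below it.
    leaves = [
        name
        for name in module_sizes
        if name and not any(other.startswith(name + ".") for other in module_sizes)
    ]
    return max(module_sizes[name] for name in leaves)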
You are taking the minimum, is this expected?
Yes, we take the minimum between the max_memory of the GPU and the space needed on that device. So if the space needed exceeds the space available on the GPU, we only allocate the space available.
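As a toy illustration with made-up numbers: if the device has 10 GiB available and the largest leaf is 8 GiB, the 1.3 headroom factor asks for 10.4 GiB, and the min caps the allocation at what the device actually has:

GiB = 1024**3
available = 10 * GiB                      # max_memory[idx] for this device (hypothetical)
max_leave_size = 8 * GiB                  # largest leaf module size (hypothetical)
allocated = min(int(max_leave_size * 1.3), available)
print(allocated == available)             # True: never allocate more than the device holds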
Could you add a test that fails before this PR but is fixed now? Also, it would be nice to try it with the models that have this issue.
What does this PR do?
Fixes huggingface/transformers#34751, and part of huggingface/transformers#34706.

Explanation of get_balanced_memory Behavior and Handling of Large embed_tokens Layers
When a small model or a relatively large quantized model is loaded using device_map=auto, the function get_balanced_memory calculates the maximum memory usage for each visible device. Its goal is to distribute the model evenly across all devices while reserving all available memory on the last device.

Issue with Large embed_tokens Layers
For models with a small number of parameters but a large vocabulary size (e.g., Gemma2 2B), the embed_tokens layer can consume a significant amount of memory. This layer might exceed the max_memory limit on the initial devices. As a result, the infer_auto_device_map function bypasses all earlier devices, placing the embed_tokens layer (and all subsequent layers) on the last device.
A similar issue arises with quantized models. Since embedding layers are often not quantized, the max_memory per device might be insufficient to accommodate the embed_tokens layer, leading to the same behavior.

Improvement in This PR
This pull request addresses the issue by comparing the size of the embed_tokens layer to the memory available per GPU (per_gpu). It adjusts the memory allocation strategy to ensure that embed_tokens can be distributed appropriately across devices, improving the handling of models with large embeddings.
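For reference, a minimal way to exercise the behavior described above (the model name and dtype are only examples; the issue appears when the embedding layer is large relative to each GPU's share of the model):

import torch
from transformers import AutoModelForCausalLM

# On a multi-GPU machine, device_map="auto" goes through get_balanced_memory
# and infer_auto_device_map under the hood.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Before this fix, the large embed_tokens layer could push everything onto the last GPU.
print(model.hf_device_map)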
Who can review?
@SunMarc