FSDP with SFTTrainer: expected dtype float for `end` but got dtype c10::BFloat16 #34702
Comments
Thanks all for the report, and sorry for the delay; we're looking into it. cc @muellerzr @SunMarc
Same issue, but with DPOTrainer (probably I also have it with SFTTrainer, but haven't tested). The error only occurs for me in multi-worker/multi-GPU/multi-node training; when using FSDP with a single GPU there is no error. The issue is also not present in 4.45.2. I am wondering if it is due to this change?

In v4.45.2 (in modeling_mistral):

hidden_states = outputs[0]
if labels is None and not is_torchdynamo_compiling():
    logger.warning_once(
        "Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)"
    )
# Only compute necessary logits, and do not upcast them to float if we are not computing the loss
# TODO: remove the float() operation in v4.46
logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :]).float()

In v4.46.2:

hidden_states = outputs[0]
# Only compute necessary logits, and do not upcast them to float if we are not computing the loss
logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])

My traceback looks like this:

File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/working_dir_files/_ray_pkg_92dffa2da1edbd43/fine_tune/main.py", line 87, in train_func
trainer.train()
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/transformers/trainer.py", line 2123, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/transformers/trainer.py", line 2534, in _inner_training_loop
self.optimizer.step()
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/accelerate/optimizer.py", line 171, in step
self.optimizer.step(closure)
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper
return func.__get__(opt, opt.__class__)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/torch/optim/optimizer.py", line 487, in wrapper
out = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/torch/optim/optimizer.py", line 91, in _use_grad
ret = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/torch/optim/adamw.py", line 220, in step
adamw(
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/torch/optim/optimizer.py", line 154, in maybe_fallback
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/torch/optim/adamw.py", line 782, in adamw
func(
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/torch/optim/adamw.py", line 375, in _single_tensor_adamw
exp_avg.lerp_(grad, 1 - beta1)
RuntimeError: expected dtype float for `end` but got dtype c10::BFloat16
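
For context, the failing call at the bottom of these tracebacks can be reproduced with plain PyTorch, independent of FSDP or the trainers. A minimal sketch (tensor sizes and values are arbitrary) of the dtype check that fires when an fp32 optimizer state meets a bf16 gradient:

import torch

# AdamW keeps its exp_avg state in float32 here, but the gradient arrives in bfloat16.
exp_avg = torch.zeros(4, dtype=torch.float32)
grad = torch.randn(4, dtype=torch.bfloat16)
beta1 = 0.9

# Tensor.lerp_ requires `end` to have the same dtype as `self`, so this raises:
# RuntimeError: expected dtype float for `end` but got dtype c10::BFloat16
exp_avg.lerp_(grad, 1 - beta1)

So the trainer-level question is why some gradients reach the optimizer in bf16 while the corresponding state tensors are fp32.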
The latest version of TRL (0.12.0) seems to have some issues, but version 0.11.3 works fine.
Same issue. I can't use FSDP with TRL anymore. Everything works again if I downgrade Accelerate, Transformers, and TRL to their September releases.
Hi! Thanks for the bug report. This should be fixed via #34645; can you install transformers via
I confirm that it works with the most recent version of Transformers (already available through pip).
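
If it helps anyone landing on this thread, a quick way to confirm which versions the training job actually imports (worth double-checking in Ray/virtualenv setups like the ones in the tracebacks above); the exact minimum transformers version is whichever release includes #34645:

import accelerate
import torch
import transformers
import trl

# Print the versions active in the environment that actually runs the trainer.
for module in (transformers, trl, accelerate, torch):
    print(module.__name__, module.__version__)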
@muellerzr I am still seeing this error with transformers 4.46.3 and trl 0.12.1, but it only happens occasionally. I had a training run with 351 steps; it made it through 172 steps and I got this error on step 173. I have tried both the SFTTrainer and the DPOTrainer.
@muellerzr it appears to be happening exactly halfway through max_steps. Looking at the fix in the PR you linked, I can't explain this behavior, but I thought it might provide a clue.
I don't have this issue anymore on my side. Could you provide the traceback?
@benjamin-marie thanks, here is my traceback:

File "/tmp/ray/session_2024-12-02_11-03-27_501367_12/runtime_resources/pip/74b671a31be4649681b5b250a141caa5a98ab328/virtualenv/lib/python3.11/site-packages/transformers/trainer.py", line 2123, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-12-02_11-03-27_501367_12/runtime_resources/pip/74b671a31be4649681b5b250a141caa5a98ab328/virtualenv/lib/python3.11/site-packages/transformers/trainer.py", line 2534, in _inner_training_loop
self.optimizer.step()
File "/tmp/ray/session_2024-12-02_11-03-27_501367_12/runtime_resources/pip/74b671a31be4649681b5b250a141caa5a98ab328/virtualenv/lib/python3.11/site-packages/accelerate/optimizer.py", line 171, in step
self.optimizer.step(closure)
File "/tmp/ray/session_2024-12-02_11-03-27_501367_12/runtime_resources/pip/74b671a31be4649681b5b250a141caa5a98ab328/virtualenv/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
return func.__get__(opt, opt.__class__)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-12-02_11-03-27_501367_12/runtime_resources/pip/74b671a31be4649681b5b250a141caa5a98ab328/virtualenv/lib/python3.11/site-packages/torch/optim/optimizer.py", line 484, in wrapper
out = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-12-02_11-03-27_501367_12/runtime_resources/pip/74b671a31be4649681b5b250a141caa5a98ab328/virtualenv/lib/python3.11/site-packages/torch/optim/optimizer.py", line 89, in _use_grad
ret = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-12-02_11-03-27_501367_12/runtime_resources/pip/74b671a31be4649681b5b250a141caa5a98ab328/virtualenv/lib/python3.11/site-packages/torch/optim/adamw.py", line 227, in step
adamw(
File "/tmp/ray/session_2024-12-02_11-03-27_501367_12/runtime_resources/pip/74b671a31be4649681b5b250a141caa5a98ab328/virtualenv/lib/python3.11/site-packages/torch/optim/optimizer.py", line 161, in maybe_fallback
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-12-02_11-03-27_501367_12/runtime_resources/pip/74b671a31be4649681b5b250a141caa5a98ab328/virtualenv/lib/python3.11/site-packages/torch/optim/adamw.py", line 767, in adamw
func(
File "/tmp/ray/session_2024-12-02_11-03-27_501367_12/runtime_resources/pip/74b671a31be4649681b5b250a141caa5a98ab328/virtualenv/lib/python3.11/site-packages/torch/optim/adamw.py", line 380, in _single_tensor_adamw
exp_avg.lerp_(grad, 1 - beta1)
RuntimeError: expected dtype float for `end` but got dtype c10::BFloat16

I can't understand why it can successfully complete 175/351 steps but then fails. I have tried different datasets, both SFT and DPO from trl, and it always fails at the halfway step.
@benjamin-marie I found out how this happened for me. We run our training in a multi-node setup and I was incorrectly calculating the
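
The comment above is cut off, so the exact quantity being miscalculated is unknown. Purely as a hypothetical illustration of the kind of multi-node bookkeeping that is easy to get wrong, here is how a global optimizer-step count is commonly derived; every name and number below is made up and none of it comes from the thread:

# Hypothetical values only.
num_nodes = 2
gpus_per_node = 4
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_samples = 50_000

world_size = num_nodes * gpus_per_node
samples_per_optimizer_step = per_device_train_batch_size * world_size * gradient_accumulation_steps
steps_per_epoch = num_train_samples // samples_per_optimizer_step
print(f"world_size={world_size}, steps_per_epoch={steps_per_epoch}")

# Using a single-node world size in a two-node run would double the computed
# step count, which is one way a schedule can end up off by a factor of two.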
System Info
PyTorch 2.2 and 2.4 were tested.
transformers 4.46.2
4 x A6000 Ada
Who can help?
@muellerzr
Reproduction
I ran the FSDP training code from https://huggingface.co/docs/peft/accelerate/fsdp but got an "expected dtype float for `end` but got dtype c10::BFloat16" error. I changed the dtype (float16, float32, bfloat16) but still failed to run the code. What's the problem?

param:
Expected behavior
FSDP training completes without the dtype error.
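
For anyone trying to reproduce this, below is a rough, self-contained sketch of the kind of run described in the Reproduction section (bf16 fine-tuning with trl's SFTTrainer, with FSDP configured through accelerate as in the PEFT docs). The model name, dataset, and hyperparameters are placeholders rather than the reporter's actual configuration, and argument names may differ slightly across trl versions:

from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# Placeholder model and dataset; the reporter's actual choices are not in the issue.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

# bf16 training arguments; FSDP itself is set up via `accelerate config` and the
# script is launched with `accelerate launch train.py`, as in the linked docs.
training_args = SFTConfig(
    output_dir="./sft-fsdp-test",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    bf16=True,
    max_steps=100,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()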