Error during training: "Expected dtype float for end but got dtype c10::BFloat16" #35106
Comments
cc @philschmid for AWS/SageMaker
I don't think this is strictly a SageMaker issue. In the time since I first posted this issue, I've reproduced it on an EC2 instance. See below for relevant details.
Environment details
Accelerate config
Notably, I've been able to make the issue go away by downgrading
Notably, I don't run into the issue if I downgrade to
Ah, I'm sorry! I misread the initial issue. Let me see if I can get someone from the TRL team to take a look.
Thanks for reporting. Can you try to minimize the code a bit? Remove everything that isn't relevant for reproducing the bug, and make sure we can just copy and paste to reproduce the bug.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
I'm running on an Amazon SageMaker-managed p4d.24xlarge instance, so I'll do my best to provide comprehensive system info below.
Training image
763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.1.0-transformers4.36.0-gpu-py310-cu121-ubuntu20.04
Dependencies
SageMaker training config
Environment variables
SFT Config
Who can help?
@muellerz @SunMarc @ArthurZucker
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Below are my training script, the SageMaker estimator object, and the stack trace showing the error that I'm getting. A curious (but perhaps irrelevant) detail is that I've reproduced this twice, and both times it occurred during training on the 257th step (with max_steps=1000).
Training script
SageMaker / HuggingFace estimator
Stack trace
Expected behavior
Training should complete successfully without encountering errors.
Related issues
end but got dtype c10::BFloat16 #34702 - this seems to be the exact same issue and was active as recently as 3 weeks ago. It was closed only two days ago.
end but got dtype c10::BFloat16 SqueezeAILab/LLM2LLM#5