Could there be a bug in mixed precision? #101
Comments
Hi there, how do you know the script is not launched in mixed precision but uses full precision? I just tried on my side and it runs properly in mixed precision.
Well, I know that because the training speed and memory usage are the same as in full precision.
I think maybe this is due to my code structure (training may be too complex?).
Memory usage won't be very different unless you are using a very large batch size. For the speed, you have to make sure all the dimensions of the tensors you are using are multiples of 8.
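As an illustration of the multiple-of-8 point (my own sketch, not from this thread): rounding sizes up before building the model is usually enough. The helper name below is made up.

```python
# Hypothetical helper: round a dimension up to a multiple of 8 so that
# fp16 matmuls can use Tensor Cores on Volta/Turing GPUs.
def round_up_to_multiple_of_8(n: int) -> int:
    return ((n + 7) // 8) * 8

hidden_size = round_up_to_multiple_of_8(250)   # -> 256
seq_length = round_up_to_multiple_of_8(100)    # -> 104
```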
I did use a batch size of 8. If I use PyTorch's own autocast, the speed-up is normal (the mixed-precision gain is very obvious on a Turing card). Also, in my task there is almost a 40% memory reduction.
It's hard to say what's going wrong without seeing any code.
Sorry, I don't really have a simple reproducible code sample... When I set fp16 in `accelerate config`, nothing happens.
On second thought, it's probably because I wrapped my code with `autocast` myself. But the loss did explode when I used only `fp16=True`.
You are not letting Accelerate handle mixed precision here, you are doing it in your script yourself: when the `fp16` flag is enabled, Accelerate already runs the forward pass under `autocast` and handles gradient scaling in `backward`, so you should remove your own `autocast`/`GradScaler` code.
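A minimal sketch of what this amounts to, assuming a plain classification loop (the model, optimizer and data below are placeholders, not the poster's code):

```python
import torch
from accelerate import Accelerator

# fp16=True mirrors the accelerate 0.3.0-era API discussed in this thread;
# the flag can also come from `accelerate config`.
accelerator = Accelerator(fp16=True)

model = torch.nn.Linear(512, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()
dataset = torch.utils.data.TensorDataset(
    torch.randn(64, 512), torch.randint(0, 10, (64,))
)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    logits = model(inputs)             # no explicit autocast wrapper here
    loss = criterion(logits, targets)
    accelerator.backward(loss)         # no manual GradScaler either
    optimizer.step()
```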
Thanks for your answers! I'll try removing my own mixed precision handling and see if the loss explosion issue can still be reproduced.
@sgugger I removed my own mixed precision handling. If I set mixed precision to `yes` in `accelerate config`, nothing happens (still full precision). If I set `Accelerator(fp16=True)` in the code, AMP is triggered but the loss becomes inf. Could you help me locate the problem here? Thanks in advance.
It's really weird that you get different behavior between the config and setting `fp16=True` directly in `Accelerator`.
Yes, it is indeed weird. Could it be that somehow the GradScaler is not activated? I'll try it again with master tomorrow.
@sgugger I installed from master following these instructions, and the behavior is still the same.
Hi!
Once I turn on FP16 (it doesn't matter whether from config or args) it doesn't break, but the loss stays nan. It also becomes noticeably faster, but I kind of expected that given the AMP setting.
I'm writing to ask whether you found a solution. I also don't have a minimum working example, but I'm working on uploading the code to a repository if required. If it is of any help, I'm running on a Linux machine with this package configuration:
@edornd My solution for now is switching back to torch amp.
If you do get a simple reproducer, I'm happy to investigate more. I have just not been able to reproduce this error on my side.
Hi @sgugger, thanks for the quick reply! Unfortunately I didn't have time to build a proper minimum working example yet, but I managed to adjust the CV example to a segmentation task while minimizing changes to the code; here it is. I apologize for the use of the custom dataset and decoder; however, if you check the code, there's nothing particularly weird about them, just standard PyTorch stuff. The dataset is also nothing out of the ordinary, as you can see here. Reading around, I suspect this has little to do with accelerate and is rather linked to underflow and log transformations in the loss (?). I'll try to adapt the same script to manual AMP and see if the same issue arises; otherwise I'll see what I can do to make it self-contained, so that it can be launched without too many configuration troubles. Cheers!
Just a quick follow-up: no, still no self-contained minimum working example, sorry :) I believe that it's working for two reasons: training became noticeably faster and the memory usage dropped by a modest 25-30%, without, of course, any nan loss issues. However, while debugging and printing stuff around, the only difference I could notice was in the dtype of the loss tensor in my code.
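For what it's worth, a quick way to check which precision the forward pass actually ran in is to print the dtypes coming out of it. This is a sketch under the assumption of a standard model and criterion, not the code from this thread:

```python
import torch

model = torch.nn.Linear(512, 10).cuda()
criterion = torch.nn.CrossEntropyLoss()
inputs = torch.randn(8, 512, device="cuda")
targets = torch.randint(0, 10, (8,), device="cuda")

with torch.cuda.amp.autocast():
    logits = model(inputs)
    loss = criterion(logits, targets)

# Under autocast, matmul-based outputs such as the logits should be float16,
# while the loss itself may still be float32 because log/reduction ops are
# kept in full precision by the autocast policy.
print(logits.dtype, loss.dtype)
```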
Thanks for the analysis and the example you provided. I'll try to dig more into the differences tomorrow.
I was able to investigate this more and I think I found the problem. The PR above should fix the issue; would you mind giving it a try?
Sorry for the late response. I managed to give it a try and I can confirm that the behaviour is now the same as the "manual" AMP version! I noticed I still have some nan issues after a good number of epochs, when the loss is getting smaller, but that's just the risk of AMP I guess (and it happens with the manual override as well), so nothing to do about that apart from going full precision. This looks like it's working properly: I noticed a ~50% speed increase, while GPU memory usage also went down.
Just out of curiosity/ignorance on my part: in your comments on the PR, I didn't get the bit about "computing the loss in the model": is that a thing, or is it just limited to transformer models? And why is that more stable? If you can point me to an example doing so, I'd be glad to give it a try as well! Thank you very much once again @sgugger!
What I meant in the documentation is that ideally, for Accelerate to work best, the loss should be computed directly by the model (if you look at any transformer models, they return loss and logits), but in 90% of cases, just a cross-entropy loss applied after the model should work perfectly. This comment was meant for more complicated loss functions. Glad to know it solved the issue!
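A minimal sketch of what "computing the loss in the model" looks like, in the style of transformers models that return the loss alongside the logits; the names here are illustrative:

```python
import torch
import torch.nn as nn

class ClassifierWithLoss(nn.Module):
    def __init__(self, hidden_size: int = 512, num_labels: int = 10):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)
        self.loss_fct = nn.CrossEntropyLoss()

    def forward(self, features, labels=None):
        logits = self.classifier(features)
        loss = None
        if labels is not None:
            # The loss is computed inside forward, so it runs under the same
            # autocast context that Accelerate wraps around the model call.
            loss = self.loss_fct(logits, labels)
        return loss, logits
```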
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
When I use torch 1.6.0 & accelerate 0.3.0 and set mixed precision to `yes` in `accelerate config`, nothing happens (still full precision training). If I set `Accelerator(fp16=True)` in the code, then AMP is triggered, but the loss becomes inf right away. But if I use the PyTorch way (i.e. `autocast` in the code myself), the training is normal and AMP is enabled.
So I wonder if there is a possible bug in accelerate.
My environment is a single 2080 Ti, local machine.
The code with this problem is here.
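For reference, a minimal sketch of the "PyTorch way" mentioned above (manual autocast plus GradScaler, as in the torch 1.6 AMP recipe); the model and data are placeholders:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()
scaler = GradScaler()

inputs = torch.randn(8, 512, device="cuda")
targets = torch.randint(0, 10, (8,), device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with autocast():                      # forward pass in mixed precision
        logits = model(inputs)
        loss = criterion(logits, targets)
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)                # unscales gradients, then steps
    scaler.update()
```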