
Does it Work on FP16 with latest nvidia amp? #36

Open
578123043 opened this issue Jan 7, 2020 · 6 comments

@578123043

No description provided.

@578123043
Author

[image]

In FP16, the output is NaN. In a middle layer of BERT (maybe the 17th or 18th), I found that the attention scores are around -3000 and the softmax result is NaN.
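
For context, a minimal sketch of how this failure mode arises (the values below are illustrative, not taken from SpanBERT): FP16 saturates beyond ±65504, so large intermediate attention values can overflow to ±inf, and a softmax over a row of -inf yields NaN:

```python
import torch

# FP16 saturates beyond +/-65504, so large intermediates overflow to inf.
logits = torch.tensor([300.0, 300.0], dtype=torch.float16)
print(logits * logits)  # tensor([inf, inf], dtype=torch.float16): 90000 > 65504

# An attention row that has overflowed (or been fully masked) to -inf
# produces NaN under softmax, since the kernel computes exp(x - max(x))
# and -inf - (-inf) is NaN.
row = torch.full((4,), float("-inf"), dtype=torch.float16)
print(torch.softmax(row, dim=-1))  # tensor([nan, nan, nan, nan], dtype=torch.float16)
```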

@mandarjoshi90
Contributor

I will need more information to help you debug. Could you please include the command you're running?

@578123043
Author

Yes. I used {https://dl.fbaipublicfiles.com/fairseq/models/spanbert_hf.tar.gz} to run Hugging Face's run_squad, and it failed.
Does it work with the latest amp?
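
For anyone trying to reproduce, the invocation would look roughly like this (paths and hyperparameters are placeholders; the flags are the standard ones from the Hugging Face run_squad.py of that era):

```bash
python run_squad.py \
  --model_type bert \
  --model_name_or_path /path/to/spanbert_hf \
  --do_train \
  --do_eval \
  --train_file train-v1.1.json \
  --predict_file dev-v1.1.json \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/spanbert_squad \
  --fp16
```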

@mandarjoshi90
Contributor

We haven't tested this with the new HF code. ICYMI, there's a run_squad.py in this repo. I'd recommend using that.

@578123043 changed the title from "Does it Work" to "Does it Work on FP16 with latest nvidia amp?" Jan 18, 2020
@marcos0318

marcos0318 commented Jul 26, 2020

When I tried to fine-tune SpanBERT-large on my own task using FP16 with the "amp" module from apex, I met a similar error.

I am using the code from Hugging Face and tried both "SpanBERT/spanbert-large-cased" and the model binaries provided in this repo. Both give identical results: the trainer keeps reporting gradient overflow and rescaling the loss.

Interestingly, when I replaced spanbert-large with BERT base/large or the spanbert-base model, those models worked perfectly and achieved the expected results.

Also, spanbert-large works very well when I turn off FP16 training.

Here I found someone else who cannot run FP16 with SpanBERT on another task.

In conclusion, I suspect that the spanbert-large model may not work with the new NVIDIA amp.
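
For reference, a minimal sketch of the apex amp setup in question (the model and optimizer here are placeholders, not the actual fine-tuning code); the repeated "gradient overflow" messages come from amp's dynamic loss scaler, which skips the optimizer step and shrinks the loss scale whenever the FP16 gradients overflow:

```python
import torch
from apex import amp

# Placeholder model/optimizer; in practice these come from the training script.
model = torch.nn.Linear(768, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# O1 patches individual ops to run FP16 where safe; O2 casts the whole model.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = model(torch.randn(8, 768, device="cuda")).mean()

# amp scales the loss before backward(); if the scaled FP16 gradients
# overflow, it logs "Gradient overflow. Skipping step, loss scaler ...
# reducing loss scale" and retries with a smaller scale.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```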

@YuxianMeng

Met the same issue here. Is there any solution yet?
