
Why does the loss become NaN? #6

Open
yt7589 opened this issue Mar 13, 2021 · 3 comments

yt7589 commented Mar 13, 2021

This is a great project, and I am very interested in the Transformer in Transformer model.
I used your model to train on Vehicle-1M, a fine-grained visual classification dataset. With this model, the loss becomes NaN after some number of batch iterations. I decreased the learning rate of the Adam optimizer and clipped the gradients with torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2), but the loss still becomes NaN sometimes. The gradients do not seem large, but they point in the same direction for many iterations. How can I solve this?
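
For reference, here is a minimal sketch of the kind of training step described above (model, criterion, and the other names are placeholders for illustration, not the actual Vehicle-1M training code):

```python
import torch

def train_step(model, images, labels, criterion, optimizer):
    # Forward pass and loss computation.
    logits = model(images)
    loss = criterion(logits, labels)

    # Abort on a non-finite loss so the offending batch can be inspected.
    if not torch.isfinite(loss):
        raise RuntimeError(f"Non-finite loss detected: {loss.item()}")

    optimizer.zero_grad()
    loss.backward()

    # Clip the global gradient norm, as described above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2)
    optimizer.step()
    return loss.item()
```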

@panda1949

I have trained TNT on ImageNet. With the hyper-parameters in the paper (https://arxiv.org/pdf/2103.00112.pdf) and the DeiT training code (https://github.com/facebookresearch/deit), I reproduced the reported result: top-1 accuracy of 81.3 for TNT-S.

Have you tried the default hyper-parameters from the TNT paper?
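
As a rough illustration, the DeiT-style recipe corresponds to an optimizer setup along these lines (the values below are assumptions for the sketch and should be checked against the TNT paper, not quoted from it):

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer(model, epochs=300, base_lr=1e-3, weight_decay=0.05):
    # AdamW with a cosine schedule, as in the DeiT recipe; the default values
    # above are illustrative assumptions, not the paper's exact settings.
    optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```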

@Sleepychord

@yt7589 Maybe sandwich-LN and PB-relax in CogView (https://arxiv.org/pdf/2105.13290) can help solve your problem.
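
For context, sandwich-LN adds an extra LayerNorm to the output of each residual branch. A rough sketch, based on the CogView description rather than this repository's code:

```python
import torch.nn as nn

class SandwichLNBlock(nn.Module):
    """Residual branch with sandwich LayerNorm: LN before and after the sub-layer."""

    def __init__(self, dim, sublayer):
        super().__init__()
        self.pre_norm = nn.LayerNorm(dim)
        self.post_norm = nn.LayerNorm(dim)
        self.sublayer = sublayer  # e.g. attention or the MLP of a Transformer block

    def forward(self, x):
        # Pre-LN -> sub-layer -> extra post-LN, then the residual connection.
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))
```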

@haooooooqi

Have you observed any data points showing that sandwich-LN helps with the NaN issue? Could you kindly share your experience?
