
Why does the loss become NaN? #6

Open
yt7589 opened this issue Mar 13, 2021 · 3 comments

yt7589 commented Mar 13, 2021

This is a great project, and I am very interested in the Transformer in Transformer model.
I used your model to train on Vehicle-1M, a fine-grained visual classification dataset. With this model, the loss becomes NaN after some number of batch iterations. I decreased the learning rate of the Adam optimizer and clipped the gradients with torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2), but the loss still becomes NaN sometimes. The gradients do not seem large, but they point in the same direction for many iterations. How can I solve this?
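
For reference, here is a minimal sketch of the kind of training step described above (model, criterion, and the other names are placeholders for illustration, not the actual Vehicle-1M training code):

```python
import torch

def train_step(model, images, labels, criterion, optimizer):
    # Forward pass and loss computation.
    logits = model(images)
    loss = criterion(logits, labels)

    # Abort on a non-finite loss so the offending batch can be inspected.
    if not torch.isfinite(loss):
        raise RuntimeError(f"Non-finite loss detected: {loss.item()}")

    optimizer.zero_grad()
    loss.backward()

    # Clip the global gradient norm, as described above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2)
    optimizer.step()
    return loss.item()
```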

@panda1949

I have trained TNT on ImageNet. With the hyper-parameters in the paper (https://arxiv.org/pdf/2103.00112.pdf) and the DeiT training code (https://github.com/facebookresearch/deit), I reproduced the reported result: top-1 accuracy of 81.3 for TNT-S.

Have you tried the default hyper-parameters from the TNT paper?
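
As a rough illustration, the DeiT-style recipe corresponds to an optimizer setup along these lines (the values below are assumptions for the sketch and should be checked against the TNT paper, not quoted from it):

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer(model, epochs=300, base_lr=1e-3, weight_decay=0.05):
    # AdamW with a cosine schedule, as in the DeiT recipe; the default values
    # above are illustrative assumptions, not the paper's exact settings.
    optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```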

@Sleepychord

@yt7589 Maybe sandwich-LN and PB-relax in CogView (https://arxiv.org/pdf/2105.13290) can help solve your problem.
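
For context, sandwich-LN adds an extra LayerNorm to the output of each residual branch. A rough sketch, based on the CogView description rather than this repository's code:

```python
import torch.nn as nn

class SandwichLNBlock(nn.Module):
    """Residual branch with sandwich LayerNorm: LN before and after the sub-layer."""

    def __init__(self, dim, sublayer):
        super().__init__()
        self.pre_norm = nn.LayerNorm(dim)
        self.post_norm = nn.LayerNorm(dim)
        self.sublayer = sublayer  # e.g. attention or the MLP of a Transformer block

    def forward(self, x):
        # Pre-LN -> sub-layer -> extra post-LN, then the residual connection.
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))
```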

@haooooooqi

Have you observed any data points showing that sandwich-LN helps with the NaN issue? Could you kindly share your experience?
