Hello, when training on my own dataset, why do the outputs d0 and d1 become nan late in training? #47

Open
yihong-97 opened this issue Jan 10, 2021 · 4 comments

Comments

@yihong-97

Hello, while training on my own dataset I found that at around 156.7K iterations the outputs d0 and d1 become nan, which makes the l0 and l1 losses nan:
`[epoch: 308/100000, batch: 2456/ 4085, ite: 156707] train loss: 2.115248, tar: 0.097755
l0: 0.090264, l1: 0.090268, l2: 0.094700, l3: 0.108498, l4: 0.157684, l5: 0.269343, l6: 0.561054

[epoch: 308/100000, batch: 2464/ 4085, ite: 156708] train loss: 2.114669, tar: 0.097731
l0: 0.110660, l1: 0.110660, l2: 0.116909, l3: 0.147194, l4: 0.230913, l5: 0.414125, l6: 0.675684

[epoch: 308/100000, batch: 2472/ 4085, ite: 156709] train loss: 2.115880, tar: 0.097773
l0: 0.101519, l1: 0.101512, l2: 0.107206, l3: 0.128373, l4: 0.198813, l5: 0.377140, l6: 0.674387

[epoch: 308/100000, batch: 2480/ 4085, ite: 156710] train loss: 2.116687, tar: 0.097785
l0: 0.092943, l1: 0.092937, l2: 0.097863, l3: 0.117802, l4: 0.182888, l5: 0.299898, l6: 0.505494

[epoch: 308/100000, batch: 2488/ 4085, ite: 156711] train loss: 2.115991, tar: 0.097769
l0: 0.104595, l1: 0.104529, l2: 0.109673, l3: 0.131785, l4: 0.201885, l5: 0.407138, l6: 0.842563

[epoch: 308/100000, batch: 2496/ 4085, ite: 156712] train loss: 2.118025, tar: 0.097791
l0: nan, l1: nan, l2: 2.413359, l3: 2.419617, l4: 2.441422, l5: 2.419549, l6: 2.403301

[epoch: 308/100000, batch: 2504/ 4085, ite: 156713] train loss: nan, tar: nan
l0: nan, l1: nan, l2: 2.489194, l3: 2.498003, l4: 2.527905, l5: 2.497976, l6: 2.474765
`

@xuebinqin
Owner

xuebinqin commented Jan 10, 2021 via email

We use return F.sigmoid(d0) in the network definition, which may not be numerically reliable in some cases. You can try returning only d0 and replacing the current BCE loss with BCEWithLogitsLoss; that may help solve the issue. It is also worth checking that all of your inputs are valid.

@yihong-97
Author

After I changed the network to return d0 and switched to BCEWithLogitsLoss, the same problem occurred at the same number of steps. The input is valid, and it is worth noting that only d0 and d1 are nan; the other outputs are normal.
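
The reasoning behind the BCEWithLogitsLoss suggestion above can be illustrated in plain Python. This is a minimal sketch of the arithmetic only, not the repository's code: the function names are hypothetical, and the naive version is deliberately unclamped (PyTorch's BCELoss clamps its log terms), so it only shows why applying a sigmoid first and taking logs afterwards is fragile once the sigmoid saturates.

```python
import math

def sigmoid(x):
    # Overflow-safe scalar sigmoid.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def bce_on_probability(p, target):
    """Naive BCE on an already-squashed probability (sigmoid first, loss second)."""
    def safe_log(v):
        # log(0) is treated as -inf so the failure mode is visible instead of raising.
        return math.log(v) if v > 0.0 else -math.inf
    return -(target * safe_log(p) + (1.0 - target) * safe_log(1.0 - p))

def bce_on_logit(x, target):
    """BCE computed directly from the logit x in the stable form
    max(x, 0) - x * target + log(1 + exp(-|x|))."""
    return max(x, 0.0) - x * target + math.log1p(math.exp(-abs(x)))

big_logit = 40.0  # large enough that sigmoid rounds to exactly 1.0 in float64

naive_wrong = bce_on_probability(sigmoid(big_logit), 0.0)  # log(1 - 1.0) = log(0)
naive_right = bce_on_probability(sigmoid(big_logit), 1.0)  # 0 * (-inf) appears
stable_wrong = bce_on_logit(big_logit, 0.0)
stable_right = bce_on_logit(big_logit, 1.0)

print(naive_wrong, naive_right, stable_wrong, stable_right)
```

For moderate logits the two functions agree; they differ only once the sigmoid saturates. In PyTorch the logit-based form corresponds to torch.nn.BCEWithLogitsLoss applied to the raw d0. If the network output is already nan before the loss is computed, as reported here for d0 and d1, this change alone would not be expected to remove it.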

@xuebinqin
Owner

xuebinqin commented Jan 11, 2021 via email

There are several other options you can try, for example: (1) add torch.nn.utils.clip_grad_norm right after loss.backward(); (2) change the dataloader to normalize your input image as image = (image - image.min() + 1e-8)/(image.max() - image.min() + 1e-8).

@yihong-97
Author

Thank you very much. I'll try these options.
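
For reference, the two options suggested above can be sketched in plain Python to show what each one computes. This is an illustration under stated assumptions, not the repository's code: normalize_image and clip_by_global_norm are hypothetical helpers that work on flat lists, the eps in the clipping helper is an assumed small constant, and max_norm=1.0 is an arbitrary example value.

```python
import math

def normalize_image(pixels, eps=1e-8):
    """Min-max normalization as written in the comment above, on a flat list of pixel values."""
    lo, hi = min(pixels), max(pixels)
    # eps keeps the division defined even when every pixel has the same value.
    return [(p - lo + eps) / (hi - lo + eps) for p in pixels]

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients in place so their combined L2 norm is at most max_norm.

    grads is a list of flat lists, one per parameter (a stand-in for each p.grad).
    Returns the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / (total_norm + eps))
    for grad in grads:
        for i in range(len(grad)):
            grad[i] *= scale
    return total_norm

# Normalization: an ordinary image and a constant one.
normalized = normalize_image([0.0, 128.0, 255.0])
constant = normalize_image([5.0, 5.0, 5.0])

# Clipping: an exploding gradient is scaled down, a small one is left alone.
large_grads = [[3.0, 4.0], [12.0]]
norm_before = clip_by_global_norm(large_grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in large_grads for g in grad))

small_grads = [[0.3, 0.4]]
clip_by_global_norm(small_grads, max_norm=1.0)

print(normalized, constant, norm_before, norm_after, small_grads)
```

In a PyTorch training loop the clipping step would sit between loss.backward() and optimizer.step(); the current in-place function is torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm), which replaces the older clip_grad_norm name. The normalization expression can be applied unchanged to a tensor or NumPy array in the dataloader.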
