Hello, when training on my own dataset, why do the d0 and d1 outputs become NaN late in training? #47
Comments
We use return F.sigmoid(d0) in the network definition, which may be
numerically unreliable in some cases. You can try returning only d0 (the raw
logits) and replacing the current BCE loss with BCEWithLogitsLoss; that may
solve the issue. It is also worth checking that all of your inputs are valid.
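To illustrate why a logits-based loss tends to be more robust, here is a minimal pure-Python sketch. It is not the repository's code and not PyTorch's implementation (PyTorch's BCELoss differs in details such as clamping the log terms); the function names are made up for illustration. It compares BCE computed on a saturated sigmoid output with the rearranged form max(x, 0) - x*y + log(1 + exp(-|x|)), the kind of log-sum-exp rewrite that BCEWithLogitsLoss is documented to rely on.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written so that exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce_on_probability(p: float, y: float) -> float:
    """Naive BCE on a probability p; log(0) is treated as -inf."""
    def safe_log(v: float) -> float:
        return math.log(v) if v > 0.0 else -math.inf

    # If p has saturated to exactly 0.0 or 1.0, one log term is -inf,
    # and 0 * -inf evaluates to nan in floating point.
    return -(y * safe_log(p) + (1.0 - y) * safe_log(1.0 - p))


def bce_with_logits(x: float, y: float) -> float:
    """Same loss computed from the logit x in a numerically stable form."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    for logit, target in [(0.5, 1.0), (40.0, 0.0), (40.0, 1.0)]:
        naive = bce_on_probability(sigmoid(logit), target)
        stable = bce_with_logits(logit, target)
        print(f"logit={logit:5.1f} target={target} naive={naive} stable={stable}")
```

For moderate logits the two forms agree; once the sigmoid saturates, the naive form stops being finite while the logits form stays well defined.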
On Sun, Jan 10, 2021 at 3:48 AM Yihong wrote:
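Hello, while training on my own dataset I found that at around iteration 156.7K (see the ite field in the log), the d0 and d1 outputs become NaN, which makes the l0 and l1 losses NaN.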
`[epoch: 308/100000, batch: 2456/ 4085, ite: 156707] train loss: 2.115248, tar: 0.097755
l0: 0.090264, l1: 0.090268, l2: 0.094700, l3: 0.108498, l4: 0.157684, l5: 0.269343, l6: 0.561054
[epoch: 308/100000, batch: 2464/ 4085, ite: 156708] train loss: 2.114669, tar: 0.097731
l0: 0.110660, l1: 0.110660, l2: 0.116909, l3: 0.147194, l4: 0.230913, l5: 0.414125, l6: 0.675684
[epoch: 308/100000, batch: 2472/ 4085, ite: 156709] train loss: 2.115880, tar: 0.097773
l0: 0.101519, l1: 0.101512, l2: 0.107206, l3: 0.128373, l4: 0.198813, l5: 0.377140, l6: 0.674387
[epoch: 308/100000, batch: 2480/ 4085, ite: 156710] train loss: 2.116687, tar: 0.097785
l0: 0.092943, l1: 0.092937, l2: 0.097863, l3: 0.117802, l4: 0.182888, l5: 0.299898, l6: 0.505494
[epoch: 308/100000, batch: 2488/ 4085, ite: 156711] train loss: 2.115991, tar: 0.097769
l0: 0.104595, l1: 0.104529, l2: 0.109673, l3: 0.131785, l4: 0.201885, l5: 0.407138, l6: 0.842563
[epoch: 308/100000, batch: 2496/ 4085, ite: 156712] train loss: 2.118025, tar: 0.097791
l0: nan, l1: nan, l2: 2.413359, l3: 2.419617, l4: 2.441422, l5: 2.419549, l6: 2.403301
[epoch: 308/100000, batch: 2504/ 4085, ite: 156713] train loss: nan, tar: nan
l0: nan, l1: nan, l2: 2.489194, l3: 2.498003, l4: 2.527905, l5: 2.497976, l6: 2.474765
`
--
Xuebin Qin
PhD
Department of Computing Science
University of Alberta, Edmonton, AB, Canada
Homepage:https://webdocs.cs.ualberta.ca/~xuebin/
The same problem occurred at the same number of steps after returning the raw d0 and switching to the BCEWithLogitsLoss function. The input is valid. Note that only d0 and d1 are NaN; the other outputs are normal.
There are several other options you can try, for example:
(1) add torch.nn.utils.clip_grad_norm_ (clip_grad_norm is deprecated) right after loss.backward() and before the optimizer step;
(2) change the dataloader to normalize each input image as image = (image - image.min() + 1e-8)/(image.max() - image.min() + 1e-8).
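The two options above can be sketched in plain Python to show what each one does numerically. This is only an illustration under stated assumptions: `minmax_normalize` and `clip_by_global_norm` are hypothetical helper names, the lists stand in for image tensors and parameter gradients, and the clipping function mimics the general idea of norm-based clipping rather than reproducing PyTorch's implementation.

```python
import math
from typing import List


def minmax_normalize(pixels: List[float], eps: float = 1e-8) -> List[float]:
    """Option (2): (x - min + eps) / (max - min + eps).

    The eps terms keep the division well defined even when every pixel
    has the same value (max - min == 0).
    """
    lo, hi = min(pixels), max(pixels)
    return [(x - lo + eps) / (hi - lo + eps) for x in pixels]


def clip_by_global_norm(grads: List[List[float]], max_norm: float,
                        eps: float = 1e-6) -> float:
    """Option (1), conceptually: rescale all gradients in place so that
    their joint L2 norm does not exceed max_norm.

    Returns the norm measured before clipping.
    """
    total = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / (total + eps))
    for grad in grads:
        for i in range(len(grad)):
            grad[i] *= scale
    return total


if __name__ == "__main__":
    print(minmax_normalize([0.0, 127.5, 255.0]))
    print(minmax_normalize([5.0, 5.0, 5.0]))  # constant image, no division by zero

    grads = [[3.0], [4.0]]
    print(clip_by_global_norm(grads, max_norm=1.0), grads)
```

In an actual PyTorch loop the clipping step would be a call such as `torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm)` placed between `loss.backward()` and `optimizer.step()`, where `net`, `optimizer`, and the value of `max_norm` are placeholders to be chosen for your setup.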
Thank you very much. I'll try these options.