
Training loss is NaN now. #17

Open
Strive21 opened this issue Dec 6, 2024 · 7 comments


Strive21 commented Dec 6, 2024

When will the latest version of the code and data processing code be released?

@SHYuanBest (Member)

Thanks for your interest. Is the training loss NaN right from the beginning, and which dataset did you use? The latest version of the code may not be released very soon; we will prioritize releasing the data processing code and integrating ConsisID into diffusers.


Strive21 commented Dec 6, 2024

> Thanks for your interest. Is the training loss NaN right from the beginning, and which dataset did you use? The latest version of the code may not be released very soon; we will prioritize releasing the data processing code and integrating ConsisID into diffusers.

I downloaded your dataset and processed it accordingly, initializing the weights from CogVideoX-5B-I2V with a batch size of 5 and a learning rate of 3e-7. The loss is normal at the start of training, but it becomes NaN after about 500 iterations. Could it be that I processed the data incorrectly? Also, the warning "fail to detect face using insightface, extract embedding on align face" appears during training.
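
For what it's worth, one way to localize this kind of failure is to guard the training step against non-finite losses, so the run keeps going and you can see which step/batch triggers it. A minimal sketch, assuming placeholder names like `compute_loss`, `batch`, `step`, and `optimizer` (this is not part of the released training code):

```python
import torch

# Hypothetical guard inside the training loop: if the loss is NaN/Inf,
# skip the optimizer step so the weights are not corrupted, and log the
# step so the offending batch can be inspected.
loss = compute_loss(batch)  # placeholder for the actual diffusion loss
if not torch.isfinite(loss):
    print(f"non-finite loss at step {step}; skipping this update")
    optimizer.zero_grad(set_to_none=True)
else:
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```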

@SHYuanBest (Member)

Oh, I see. This may be a problem with MM-DiT: training is very unstable because the activation values in the middle layers can become very large, which leads to a NaN loss. You can try turning on EMA, using gradient accumulation, increasing the batch size, and reducing the learning rate. Another option is to add a regularization term to the output of the middle layers.
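
To make those suggestions concrete, here is a minimal sketch of gradient accumulation, a lower learning rate, gradient clipping, and a hypothetical L2 penalty on a middle block's output. `transformer`, `get_batches`, `compute_diffusion_loss`, and the block attribute name are assumptions, not the actual ConsisID training code; EMA could be layered on top (diffusers ships an `EMAModel` helper in `diffusers.training_utils`).

```python
import torch

optimizer = torch.optim.AdamW(transformer.parameters(), lr=1e-7)  # lower than 3e-7
accum_steps = 4  # effective batch size = per-step batch size * 4

# Capture the output of a middle block with a forward hook.
blocks = transformer.transformer_blocks  # attribute name is an assumption; adjust to the model
mid_out = {}
def save_mid_output(module, inputs, output):
    mid_out["x"] = output[0] if isinstance(output, tuple) else output
blocks[len(blocks) // 2].register_forward_hook(save_mid_output)

for step, batch in enumerate(get_batches()):
    loss = compute_diffusion_loss(transformer, batch)
    # Hypothetical regularization term: penalize large mid-layer activations.
    if "x" in mid_out:
        loss = loss + 1e-4 * mid_out["x"].float().pow(2).mean()
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(transformer.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```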

@SHYuanBest (Member)

The warning "fail to detect face using insightface, extract embedding on align face" cannot be avoided, because facexlib may not be able to detect the face; in that case the code will automatically skip that training sample.
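
For context, the fallback being described presumably looks roughly like the sketch below (function and helper names are hypothetical, not the actual ConsisID data pipeline): try insightface first, fall back to an aligned-face embedding, and return `None` so the caller can skip the sample.

```python
def get_face_embedding(frame, insightface_app, align_and_embed):
    """Hypothetical sketch of the fallback behavior described above."""
    faces = insightface_app.get(frame)          # insightface FaceAnalysis.get()
    if faces:
        return faces[0].embedding               # normal path
    # "fail to detect face using insightface, extract embedding on align face"
    aligned_embedding = align_and_embed(frame)  # placeholder for the facexlib-based path
    if aligned_embedding is not None:
        return aligned_embedding
    return None                                 # caller skips this training sample
```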

@SHYuanBest (Member)

> Oh, I see. This may be a problem with MM-DiT: training is very unstable because the activation values in the middle layers can become very large, which leads to a NaN loss. You can try turning on EMA, using gradient accumulation, increasing the batch size, and reducing the learning rate. Another option is to add a regularization term to the output of the middle layers.

Or you can try training only LoRA instead of all parameters.
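
As a rough illustration of the LoRA-only route, a sketch using peft; the target module names below are an assumption about the attention projection layers, so check them against the actual transformer definition.

```python
import torch
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.0,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumed projection names
)

transformer.requires_grad_(False)                 # freeze all base weights
transformer = get_peft_model(transformer, lora_config)

# Only the (much smaller) set of LoRA parameters is optimized, which is
# generally more stable and tolerates a higher learning rate than full fine-tuning.
optimizer = torch.optim.AdamW(
    (p for p in transformer.parameters() if p.requires_grad), lr=1e-4
)
```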

@SHYuanBest (Member)

> When will the latest version of the code and data processing code be released?

We have released the data processing code; please refer to here for more details.


Strive21 commented Dec 9, 2024

> When will the latest version of the code and data processing code be released?

> We have released the data processing code; please refer to here for more details.

Thank you! I'll give it a try.
