
Training loss is NaN now. #17

Open
Strive21 opened this issue Dec 6, 2024 · 7 comments


Strive21 commented Dec 6, 2024

When will the latest version of the code and data processing code be released?

@SHYuanBest (Member)

Thanks for your interest. Is the training loss NaN right from the beginning, and which dataset did you use? The latest version of the code may not be released very soon; we will prioritize releasing the data processing code and integrating ConsisID into diffusers.


Strive21 commented Dec 6, 2024

> Thanks for your interest. Is the training loss NaN right from the beginning, and which dataset did you use? The latest version of the code may not be released very soon; we will prioritize releasing the data processing code and integrating ConsisID into diffusers.

I downloaded your dataset and processed it accordingly, initializing the weights from CogVideoX-5B-I2V with a batch size of 5 and a learning rate of 3e-7. The loss is normal at the start of training, but it becomes NaN after about 500 iterations. Could it be that I processed the data incorrectly? Also, the warning "fail to detect face using insightface, extract embedding on align face" appears during training.
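
For what it's worth, one way to localize this kind of failure is to guard the training step against non-finite losses, so the run keeps going and you can see which step/batch triggers it. A minimal sketch, assuming placeholder names like `compute_loss`, `batch`, `step`, and `optimizer` (this is not part of the released training code):

```python
import torch

# Hypothetical guard inside the training loop: if the loss is NaN/Inf,
# skip the optimizer step so the weights are not corrupted, and log the
# step so the offending batch can be inspected.
loss = compute_loss(batch)  # placeholder for the actual diffusion loss
if not torch.isfinite(loss):
    print(f"non-finite loss at step {step}; skipping this update")
    optimizer.zero_grad(set_to_none=True)
else:
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```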

@SHYuanBest (Member)

Oh, I see. This may be a problem with MM-DiT: training is very unstable because the activation values in the middle layers can become very large, which leads to a NaN loss. You can try turning on EMA, using gradient accumulation, increasing the batch size, and reducing the learning rate. Another option is to add a regularization term to the output of the middle layers.
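
To make those suggestions concrete, here is a minimal sketch of gradient accumulation, a lower learning rate, gradient clipping, and a hypothetical L2 penalty on a middle block's output. `transformer`, `get_batches`, `compute_diffusion_loss`, and the block attribute name are assumptions, not the actual ConsisID training code; EMA could be layered on top (diffusers ships an `EMAModel` helper in `diffusers.training_utils`).

```python
import torch

optimizer = torch.optim.AdamW(transformer.parameters(), lr=1e-7)  # lower than 3e-7
accum_steps = 4  # effective batch size = per-step batch size * 4

# Capture the output of a middle block with a forward hook.
blocks = transformer.transformer_blocks  # attribute name is an assumption; adjust to the model
mid_out = {}
def save_mid_output(module, inputs, output):
    mid_out["x"] = output[0] if isinstance(output, tuple) else output
blocks[len(blocks) // 2].register_forward_hook(save_mid_output)

for step, batch in enumerate(get_batches()):
    loss = compute_diffusion_loss(transformer, batch)
    # Hypothetical regularization term: penalize large mid-layer activations.
    if "x" in mid_out:
        loss = loss + 1e-4 * mid_out["x"].float().pow(2).mean()
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(transformer.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```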

@SHYuanBest (Member)

The warning "fail to detect face using insightface, extract embedding on align face" cannot be avoided, because facexlib may not be able to detect the face; in that case the code will automatically skip that training sample.
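
For context, the fallback being described presumably looks roughly like the sketch below (function and helper names are hypothetical, not the actual ConsisID data pipeline): try insightface first, fall back to an aligned-face embedding, and return `None` so the caller can skip the sample.

```python
def get_face_embedding(frame, insightface_app, align_and_embed):
    """Hypothetical sketch of the fallback behavior described above."""
    faces = insightface_app.get(frame)          # insightface FaceAnalysis.get()
    if faces:
        return faces[0].embedding               # normal path
    # "fail to detect face using insightface, extract embedding on align face"
    aligned_embedding = align_and_embed(frame)  # placeholder for the facexlib-based path
    if aligned_embedding is not None:
        return aligned_embedding
    return None                                 # caller skips this training sample
```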

@SHYuanBest (Member)

> Oh, I see. This may be a problem with MM-DiT: training is very unstable because the activation values in the middle layers can become very large, which leads to a NaN loss. You can try turning on EMA, using gradient accumulation, increasing the batch size, and reducing the learning rate. Another option is to add a regularization term to the output of the middle layers.

Or you can try training only LoRA instead of all parameters.
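
As a rough illustration of the LoRA-only route, a sketch using peft; the target module names below are an assumption about the attention projection layers, so check them against the actual transformer definition.

```python
import torch
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.0,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumed projection names
)

transformer.requires_grad_(False)                 # freeze all base weights
transformer = get_peft_model(transformer, lora_config)

# Only the (much smaller) set of LoRA parameters is optimized, which is
# generally more stable and tolerates a higher learning rate than full fine-tuning.
optimizer = torch.optim.AdamW(
    (p for p in transformer.parameters() if p.requires_grad), lr=1e-4
)
```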

@SHYuanBest (Member)

> When will the latest version of the code and data processing code be released?

We have released the data processing code; please refer to here for more details.


Strive21 commented Dec 9, 2024

> When will the latest version of the code and data processing code be released?

> We have released the data processing code; please refer to here for more details.

Thank you! I'll give it a try.
