
Adjustments to resuming training #1406

Open · wants to merge 1 commit into dev
Conversation

Cauldrath (Contributor)

Currently, resuming partway through an epoch skips the rest of that epoch, and the global step is calculated as only the number of steps into the current epoch.

These changes make training resume mid-epoch at the appropriate step, with the correct global step count, so max steps will be honored.

This also replaces a for loop over every elapsed epoch with a single multiplication.

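The multiplication change described above can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual sd-scripts code:

```python
def starting_global_step(epochs_completed, steps_per_epoch, steps_into_epoch):
    """Recover the global step when resuming training mid-epoch.

    Illustrative sketch: names are assumptions, not sd-scripts variables.
    """
    # Previous approach: loop over every elapsed epoch, e.g.
    #   global_step = steps_into_epoch
    #   for _ in range(epochs_completed):
    #       global_step += steps_per_epoch
    # The PR replaces the loop with a single multiplication:
    return epochs_completed * steps_per_epoch + steps_into_epoch


# Example: resuming 2 full epochs plus 250 steps in, at 500 steps/epoch
print(starting_global_step(2, 500, 250))  # 1250
```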
@slashedstar

This fixed the problems I was having. When resuming from 200 or 400 steps up to 1000 steps (when it outputted "epoch is incremented. current_epoch: 0, epoch: 1"), it worked as intended. But when resuming from 1200 steps onward (when it outputted "epoch is incremented. current_epoch: 0, epoch: 2"), training continued past the maximum number of steps, and it also didn't save a model when reaching max steps.

@kohya-ss (Owner) commented Jul 8, 2024

Thank you for this! Sorry, I didn't test with the --max_train_steps option. In my understanding, this fixes the issue when --max_train_steps is specified.

@Cauldrath (Contributor, Author)

Yes, --max_train_steps combined with resuming or setting --initial_steps is the main problem when training doesn't start in the first epoch.
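To illustrate why a correct global step matters for --max_train_steps: if the resumed global step counts only steps inside the current epoch, the stop condition never fires once training resumes past epoch 0. A minimal sketch with assumed names:

```python
def should_stop(global_step, max_train_steps):
    """Stop condition that --max_train_steps is meant to enforce.

    Hypothetical helper for illustration; not the actual sd-scripts code.
    """
    return max_train_steps is not None and global_step >= max_train_steps


max_train_steps = 1000
steps_per_epoch = 200
resume_step = 1200  # resumed after 6 full epochs

# If the global step is reset to the offset within the current epoch,
# the check can never trigger and training runs past the limit:
undercounted = resume_step % steps_per_epoch   # 0
print(should_stop(undercounted, max_train_steps))  # False

# With the true total step count restored, training stops immediately:
print(should_stop(resume_step, max_train_steps))   # True
```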
