Is resume training same as single training run? #772
Replies: 4 comments 1 reply
-
Don't your step numbers reset to 0 when you resume? I've been too wary of this to actually resume training for a while now; I don't trust the resume feature and have been actively investigating it. Since the step number becomes zero both in subsequently saved states and in the progress bar, I'm afraid the training engine could also start back at step zero, which would effectively mean overwriting earlier progress. This is a hypothesis, and I've been digging through the code of train_network.py for hours without a firm answer yet. I assume you've already been resuming, since you know about saving states. Are you saving as safetensors files? How did it go? How were your results?
-
I re-read the accelerate repository on GitHub (the library kohya's scripts use for training), specifically the pull request that introduced reading the resume step from the saved state: huggingface/accelerate#2765. The crucial parts happen in C:\fluxgym\env\Lib\site-packages\accelerate\checkpointing.py. As I wrote, I had doubts that resuming actually starts from step 0; maybe that was just in my head and only a progress-bar artifact. Here are the changes I've made (just loggers):
The output in my logs was 0. Can you back up your checkpointing.py, replace the original with that version, try to resume, and tell me your step value, please? At this point, until I get answers from accelerate or someone shows me the problem is on my end, I would NOT trust the resume feature. checkpointing.py is supposed to read the step from the random_states_0.pkl file in the saved-state folder you pass to the script through the --resume parameter. I'll try to hack at it a bit more when I have time, but I'm no expert.
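For anyone who wants to check the same thing without patching the library: here is a minimal diagnostic sketch, not the logger diff mentioned above. It assumes the layout described in this thread, i.e. a random_states_0.pkl inside the --resume folder that may contain a "step" entry (added by accelerate PR #2765 in recent versions); older accelerate versions may not store it at all.

```python
# check_resume_step.py -- hypothetical diagnostic, not part of kohya scripts or accelerate
import pickle
import sys
from pathlib import Path

import torch  # needed so the RNG tensors inside the pickle can be unpickled  # noqa: F401

state_dir = Path(sys.argv[1])          # the folder you would pass to --resume
pkl_path = state_dir / "random_states_0.pkl"

with open(pkl_path, "rb") as f:
    states = pickle.load(f)

# If "step" is missing, your accelerate version likely predates the change.
print("keys in random_states_0.pkl:", list(states.keys()))
print("stored step:", states.get("step", "<not present>"))
```

Run it as `python check_resume_step.py path\to\your\saved-state-folder` and compare the printed step with what the training script reports after resuming.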
-
OK, I've installed a proper IDE and spent many hours in the code now, so I've deleted my previous comments, as this was getting long and confusing. This is complex code and I misunderstood things for a while, but now I think I'm starting to get it. Here are my current conclusions:
- Accelerator does save its own internal step (used for gradient accumulation; this is NOT the global_step) in the saved state (in the .pkl file)... IF your accelerate library is up to date (this changed around August 2024). I think it really only matters if you use gradient accumulation > 1 and save state after a given number of steps, i.e. save in the middle of a batch / a gradient accumulation cycle. That way Accelerator resumes the gradient accumulation from the right internal step, avoiding a desync. I was very scared Accelerator would train back from 0 because of the confusion between its internal gradient accumulation step and the training script's own global_step / initial_step / steps_from_state, etc. The kohya readme is outdated on this: it implies accelerator always resumes at step 1, which is misleading because that's the gradient accumulation step we're talking about (not global_step), and Accelerator does save it.
- There was code in train_network.py that basically always reset global_step to 0, so the current_step later saved in train_state.json was zero as well.
So I've done this locally:
If --resume is used, set global_step to the step count read from the state (from train_state.json); this way we don't start back from zero, and when train_state.json is saved again it won't contain zero either, but global_step + 1. If --resume is not used, global_step is set to zero.
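For clarity, here is a minimal sketch of that logic as I understand it; it is not the actual patch nor the upstream train_network.py code, and the helper name restore_global_step and the exact train_state.json layout are assumptions based on what's described in this thread.

```python
import json
import os

def restore_global_step(args):
    """Hypothetical helper: pick the starting global_step when (not) resuming."""
    if args.resume:
        # Read back the step that was written into train_state.json
        # inside the saved-state folder passed via --resume.
        train_state_file = os.path.join(args.resume, "train_state.json")
        with open(train_state_file, "r", encoding="utf-8") as f:
            train_state = json.load(f)
        global_step = train_state.get("current_step", 0)  # don't restart from zero
    else:
        global_step = 0                                    # fresh run starts at zero

    initial_step = global_step                             # keep the two in sync when resuming
    return global_step, initial_step
```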
initial_step is set quite confusingly (in my opinion :-p), but I've made sure it gets the global_step value when resuming. Now, I don't know for sure, but I assume the actually important progress, like the optimizer and scheduler states, is saved and loaded properly from the state's binary files. @kohya-ss, is my reasoning sound?
-
A small nota bene about resuming, after doing it a lot and watching the loss curves in TensorBoard.
-
When using save_state and resume, are there any downsides to doing this repeatedly in smaller training runs until reaching a desired total (e.g. 2000 steps), compared to a single training run of 2000 steps?