Is resume training same as single training run? #772
Replies: 4 comments 1 reply
-
Don't your step numbers reset to 0 when you resume? I've been too wary of this to actually resume training for a while now; I don't trust the resume feature and have been actively investigating it. Since the step number becomes zero both in subsequently saved states and in the progress bar, I'm afraid the training engine could also start back at step zero, which would effectively mean overwriting earlier progress. This is a hypothesis, and I've been digging through the code of train_network.py for hours without a firm answer yet. I assume you've already been resuming, since you know about saving states. Are you saving as safetensors files? How did it go? How were your results?
-
I re-read the accelerate repository on GitHub (the library kohya's scripts use for training), specifically the pull request that introduced reading the resume step from the saved state: huggingface/accelerate#2765. The crucial parts happen in C:\fluxgym\env\Lib\site-packages\accelerate\checkpointing.py. As I wrote, I had doubts that resuming actually starts from step 0; maybe that was just in my head and only a progress-bar artifact. Here are the changes I've made (just loggers):
The output in my logs was 0. Can you back up your checkpointing.py, replace the original with that version, try to resume, and tell me your step value, please? At this point, until I get answers from accelerate or someone shows me the problem is on my end, I would NOT trust the resume feature. checkpointing.py is supposed to read the step from the random_states_0.pkl file in the saved-state folder you pass to the script through the --resume parameter. I'll try to hack at it a bit more when I have time, but I'm no expert.
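For anyone who wants to check the same thing without patching the library: here is a minimal diagnostic sketch, not the logger diff mentioned above. It assumes the layout described in this thread, i.e. a random_states_0.pkl inside the --resume folder that may contain a "step" entry (added by accelerate PR #2765 in recent versions); older accelerate versions may not store it at all.

```python
# check_resume_step.py -- hypothetical diagnostic, not part of kohya scripts or accelerate
import pickle
import sys
from pathlib import Path

import torch  # needed so the RNG tensors inside the pickle can be unpickled  # noqa: F401

state_dir = Path(sys.argv[1])          # the folder you would pass to --resume
pkl_path = state_dir / "random_states_0.pkl"

with open(pkl_path, "rb") as f:
    states = pickle.load(f)

# If "step" is missing, your accelerate version likely predates the change.
print("keys in random_states_0.pkl:", list(states.keys()))
print("stored step:", states.get("step", "<not present>"))
```

Run it as `python check_resume_step.py path\to\your\saved-state-folder` and compare the printed step with what the training script reports after resuming.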
-
OK, I've installed a proper IDE and spent many hours in the code now, so I've deleted my previous comments, as this was getting long and confusing. This is complex code and I misunderstood things for a while, but now I think I'm starting to get it. Here are my current conclusions:
- Accelerator does save its own internal step (used for gradient accumulation; this is NOT the global_step) in the saved state (in the .pkl file)... IF your accelerate library is up to date (this changed around August 2024). I think it really only matters if you use gradient accumulation > 1 and save state after a given number of steps, i.e. save in the middle of a batch / a gradient accumulation cycle. That way Accelerator resumes the gradient accumulation from the right internal step, avoiding a desync. I was very scared Accelerator would train back from 0 because of the confusion between its internal gradient accumulation step and the training script's own global_step / initial_step / steps_from_state, etc. The kohya readme is outdated on this: it implies accelerator always resumes at step 1, which is misleading because that's the gradient accumulation step we're talking about (not global_step), and Accelerator does save it.
- There was code in train_network.py that basically always reset global_step to 0, so the current_step later saved in train_state.json was zero as well.
So I've done this locally:
If --resume is used, set global_step to the step count read from the state (from train_state.json); this way we don't start back from zero, and when train_state.json is saved again it won't contain zero either, but global_step + 1. If --resume is not used, global_step is set to zero.
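For clarity, here is a minimal sketch of that logic as I understand it; it is not the actual patch nor the upstream train_network.py code, and the helper name restore_global_step and the exact train_state.json layout are assumptions based on what's described in this thread.

```python
import json
import os

def restore_global_step(args):
    """Hypothetical helper: pick the starting global_step when (not) resuming."""
    if args.resume:
        # Read back the step that was written into train_state.json
        # inside the saved-state folder passed via --resume.
        train_state_file = os.path.join(args.resume, "train_state.json")
        with open(train_state_file, "r", encoding="utf-8") as f:
            train_state = json.load(f)
        global_step = train_state.get("current_step", 0)  # don't restart from zero
    else:
        global_step = 0                                    # fresh run starts at zero

    initial_step = global_step                             # keep the two in sync when resuming
    return global_step, initial_step
```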
initial_step is set quite confusingly (in my opinion :-p), but I've made sure it gets the global_step value when resuming. Now, I don't know for sure, but I assume the actually important progress, like the optimizer and scheduler states, is saved and loaded properly from the state's binary files. @kohya-ss, is my reasoning sound?
-
A small nota bene about resuming, after doing it a lot and watching the loss curves in TensorBoard.
-
When using save_state and resume, are there any downsides to doing this repeatedly in smaller training runs until reaching a desired total (e.g. 2000 steps), compared to a single training run of 2000 steps?