
Scheduler does not modify learning rate when stopping and restarting training #11875

Closed
amin-nejad opened this issue Feb 11, 2022 · 2 comments
Labels: bug (Something isn't working), lr scheduler, optimizer
Milestone: 2.0.x

Comments

@amin-nejad (Contributor) commented Feb 11, 2022

🐛 Bug

The learning rate scheduler does not actually modify the learning rate when stopping and restarting training on the same model. For instance, if I set the learning rate scheduler to a simple StepLR that modifies the learning rate every epoch (step_size=1), but I only train the model one epoch at a time, the learning rate never gets modified. It seems like the scheduler.step() call is happening in the wrong place; usually it happens as the very last thing after training an epoch. Stopping and restarting training is very useful for a number of applications (e.g. transfer learning, federated learning), so it's important that this behaviour match that of vanilla PyTorch.
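
A minimal sketch of the setup described above (illustrative module and hyperparameters, not the exact Colab code): a BoringModel-style module with a StepLR attached in configure_optimizers, trained one epoch per fit() call on the same model.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(32)


class SchedulerModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        # any scalar loss works for observing the lr behaviour
        return self.layer(batch).sum()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        # decay the lr once per epoch
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
        return {"optimizer": optimizer, "lr_scheduler": scheduler}


model = SchedulerModel()
# stop and restart training: one epoch per fit() call on the same model
for _ in range(2):
    trainer = Trainer(max_epochs=1)
    trainer.fit(model, DataLoader(RandomDataset(), batch_size=8))
    # per the report, the lr the next epoch trains with is still the initial 0.1
    print(trainer.optimizers[0].param_groups[0]["lr"])
```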

To Reproduce

Please see this in effect with the BoringModel: https://colab.research.google.com/drive/1zZWp5kALBJXz4VcWI-ldYmlCrdkwQNRi?usp=sharing

Here we get the same results as in vanilla PyTorch, despite there being a scheduler in the Lightning model but not in the vanilla model. If we also specify a learning rate scheduler in the vanilla PyTorch model, we can confirm what is actually happening by looking at the difference between the weights at the start of batch_idx == 1 of epoch_idx == 1. With the gamma of StepLR set to 0.5, one can clearly see that the weight update on the weight layer from batch 0 to batch 1 is half of that of the Lightning model.
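
For comparison, a plain PyTorch loop (again only a sketch, not the exact Colab code) keeps the same optimizer and scheduler objects alive between "restarts", so stepping the scheduler as the last thing in each epoch means the next epoch trains with the decayed learning rate:

```python
import torch

model = torch.nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)


def train_one_epoch():
    for _ in range(8):  # a few dummy batches
        optimizer.zero_grad()
        loss = model(torch.randn(8, 32)).sum()
        loss.backward()
        optimizer.step()
    # step the scheduler as the very last thing in the epoch
    scheduler.step()


# "stop and restart": each call is one epoch on the same model/optimizer/scheduler
train_one_epoch()
print(optimizer.param_groups[0]["lr"])  # 0.05 - the next epoch uses the decayed lr
train_one_epoch()
print(optimizer.param_groups[0]["lr"])  # 0.025
```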

Expected behavior

The learning rate should be decayed or otherwise modified according to the scheduler, regardless of when we stop and restart training on the same model. For example, if I am decaying the learning rate every epoch but only training one epoch at a time, my learning rate will never decay. This should not be the case; the learning rate should continue to be modified as specified.
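
Roughly, the expectation amounts to the following check (reusing the illustrative SchedulerModel and RandomDataset from the sketch above; whether a fresh fit() call should pick up where the previous scheduler left off is exactly what this issue is about):

```python
import math

# expected lr after epochs 1, 2 and 3 with initial lr=0.1, step_size=1, gamma=0.5
expected_lrs = [0.05, 0.025, 0.0125]

model = SchedulerModel()
for epoch, expected_lr in enumerate(expected_lrs, start=1):
    trainer = Trainer(max_epochs=1)
    trainer.fit(model, DataLoader(RandomDataset(), batch_size=8))
    lr = trainer.optimizers[0].param_groups[0]["lr"]
    assert math.isclose(lr, expected_lr), f"after epoch {epoch}: expected lr {expected_lr}, got {lr}"
```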

Environment

  • CUDA:
    • GPU:
    • available: False
    • version: 11.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.10.0+cu111
    • pytorch-lightning: 1.5.10
    • tqdm: 4.62.3
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.12
    • version: #1 SMP Tue Dec 7 09:58:10 PST 2021

Additional context

This is part of my effort to ensure I'm getting the same results in Lightning as I do in vanilla PyTorch.

cc @rohitgr7

@amin-nejad added the bug (Something isn't working) label Feb 11, 2022
@rohitgr7 self-assigned this Feb 11, 2022
@rohitgr7 (Contributor) commented:

Sorry for the delay here.

I tried your script and am not seeing this issue anymore.
@amin-nejad, are you still facing this issue?

@Borda self-assigned this Nov 7, 2022
@awaelchli (Contributor) commented:

This was fixed in #18280
See my full reply on another issue: #17296 (comment)

@awaelchli added this to the 2.0.x milestone Sep 20, 2023