
Scheduler does not modify learning rate when stopping and restarting training #11875

Closed
amin-nejad opened this issue Feb 11, 2022 · 2 comments
Labels: bug (Something isn't working), lr scheduler, optimizer
Milestone: 2.0.x

Comments

@amin-nejad (Contributor) commented Feb 11, 2022

🐛 Bug

The learning rate scheduler does not actually modify the learning rate when stopping and restarting training on the same model. For instance, if I set the learning rate scheduler to a simple StepLR that modifies the learning rate every epoch (step_size=1), but I only train the model one epoch at a time, the learning rate never gets modified. It seems like the scheduler.step() call is happening in the wrong place; usually it happens as the very last thing after training an epoch. Stopping and restarting training is very useful for a number of applications (e.g. transfer learning, federated learning), so it's important that this behaviour match that of vanilla PyTorch.
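
A minimal sketch of the setup described above (illustrative module and hyperparameters, not the exact Colab code): a BoringModel-style module with a StepLR attached in configure_optimizers, trained one epoch per fit() call on the same model.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(32)


class SchedulerModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        # any scalar loss works for observing the lr behaviour
        return self.layer(batch).sum()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        # decay the lr once per epoch
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
        return {"optimizer": optimizer, "lr_scheduler": scheduler}


model = SchedulerModel()
# stop and restart training: one epoch per fit() call on the same model
for _ in range(2):
    trainer = Trainer(max_epochs=1)
    trainer.fit(model, DataLoader(RandomDataset(), batch_size=8))
    # per the report, the lr the next epoch trains with is still the initial 0.1
    print(trainer.optimizers[0].param_groups[0]["lr"])
```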

To Reproduce

Please see this in effect with the BoringModel: https://colab.research.google.com/drive/1zZWp5kALBJXz4VcWI-ldYmlCrdkwQNRi?usp=sharing

Here we get the same results as in vanilla PyTorch, despite there being a scheduler in the Lightning model but not in the vanilla model. If we also specify a learning rate scheduler in the vanilla PyTorch model, we can confirm what is actually happening by looking at the difference between the weights at the start of batch_idx == 1 of epoch_idx == 1. With the gamma of StepLR set to 0.5, one can clearly see that the weight update on the weight layer from batch 0 to batch 1 is half of that of the Lightning model.
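
For comparison, a plain PyTorch loop (again only a sketch, not the exact Colab code) keeps the same optimizer and scheduler objects alive between "restarts", so stepping the scheduler as the last thing in each epoch means the next epoch trains with the decayed learning rate:

```python
import torch

model = torch.nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)


def train_one_epoch():
    for _ in range(8):  # a few dummy batches
        optimizer.zero_grad()
        loss = model(torch.randn(8, 32)).sum()
        loss.backward()
        optimizer.step()
    # step the scheduler as the very last thing in the epoch
    scheduler.step()


# "stop and restart": each call is one epoch on the same model/optimizer/scheduler
train_one_epoch()
print(optimizer.param_groups[0]["lr"])  # 0.05 - the next epoch uses the decayed lr
train_one_epoch()
print(optimizer.param_groups[0]["lr"])  # 0.025
```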

Expected behavior

The learning rate should be decayed or otherwise modified according to the scheduler, regardless of when we stop and restart training on the same model. For example, if I am decaying the learning rate every epoch but only training one epoch at a time, my learning rate will never decay. This should not be the case; the learning rate should continue to be modified as specified.
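
Roughly, the expectation amounts to the following check (reusing the illustrative SchedulerModel and RandomDataset from the sketch above; whether a fresh fit() call should pick up where the previous scheduler left off is exactly what this issue is about):

```python
import math

# expected lr after epochs 1, 2 and 3 with initial lr=0.1, step_size=1, gamma=0.5
expected_lrs = [0.05, 0.025, 0.0125]

model = SchedulerModel()
for epoch, expected_lr in enumerate(expected_lrs, start=1):
    trainer = Trainer(max_epochs=1)
    trainer.fit(model, DataLoader(RandomDataset(), batch_size=8))
    lr = trainer.optimizers[0].param_groups[0]["lr"]
    assert math.isclose(lr, expected_lr), f"after epoch {epoch}: expected lr {expected_lr}, got {lr}"
```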

Environment

  • CUDA:
    • GPU:
    • available: False
    • version: 11.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.10.0+cu111
    • pytorch-lightning: 1.5.10
    • tqdm: 4.62.3
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.12
    • version: #1 SMP Tue Dec 7 09:58:10 PST 2021

Additional context

This is part of my effort to ensure I'm getting the same results in Lightning as I do in vanilla PyTorch.

cc @rohitgr7

@amin-nejad added the bug (Something isn't working) label Feb 11, 2022
@rohitgr7 self-assigned this Feb 11, 2022
@rohitgr7 (Contributor) commented:

Sorry for the delay here.

I tried your script and am not seeing this issue anymore.
@amin-nejad, are you still facing this issue?

@Borda self-assigned this Nov 7, 2022
@awaelchli (Contributor) commented:

This was fixed in #18280
See my full reply on another issue: #17296 (comment)

@awaelchli added this to the 2.0.x milestone Sep 20, 2023