🐛 Bug
The learning rate scheduler does not actually modify the learning rate when stopping and re-starting training on the same model. For instance, if I set the scheduler to a simple StepLR that modifies the learning rate every epoch (step_size=1), but I only train the model one epoch at a time, the learning rate never gets modified. It seems the scheduler.step() call is happening in the wrong place; normally it is the very last thing that happens after training an epoch. Stopping and restarting training is useful for a number of applications (e.g. transfer learning, federated learning), so it's important that this behaviour matches that of vanilla PyTorch.
To Reproduce
Please see this in effect with the Boring model: https://colab.research.google.com/drive/1zZWp5kALBJXz4VcWI-ldYmlCrdkwQNRi?usp=sharing
Here we get the same results as in vanilla PyTorch despite there being a scheduler in the Lightning model but not in the vanilla model. If we add a learning rate scheduler to the vanilla PyTorch model as well, we can confirm what is happening by looking at the difference between the weights at the start of batch_idx==1 of epoch_idx==1: with the gamma of StepLR set to 0.5, the weight update on the weight layer from batch 0 to batch 1 is clearly half of that in the Lightning model.
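For reference, here is a minimal vanilla-PyTorch sketch of the behaviour I am comparing against (this is not code from the notebook; the model, data, and loop are placeholders). With scheduler.step() as the very last thing after each epoch, the learning rate is halved even when training is driven one epoch at a time:

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import StepLR

# Placeholder model and data; the notebook uses the Boring model instead.
model = nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=1, gamma=0.5)

def train_one_epoch():
    for _ in range(4):  # a few dummy batches
        optimizer.zero_grad()
        loss = model(torch.randn(8, 32)).sum()
        loss.backward()
        optimizer.step()

# Stop-and-restart style: one epoch per outer iteration.
for epoch in range(3):
    train_one_epoch()
    scheduler.step()  # last thing after the epoch
    print(epoch, optimizer.param_groups[0]["lr"])  # 0.05, 0.025, 0.0125
```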
Expected behavior
The learning rate should be decayed or modified according to the scheduler regardless of when we stop and restart training on the same model. For example, if I decay the learning rate every epoch but only train one epoch at a time, the learning rate currently never decays. This should not be the case; the learning rate should continue to be modified as specified.
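To make the expectation concrete, here is a minimal sketch of the scheduler configuration involved (SchedulerModel and its layers are hypothetical stand-ins for the Boring model; the exact way the notebook restarts training is not reproduced here). The expected learning-rate sequence is noted in the comments:

```python
import torch
from pytorch_lightning import LightningModule

class SchedulerModel(LightningModule):  # hypothetical Boring-model-style module
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        # StepLR with step_size=1: the learning rate should halve after every
        # epoch, whether the epochs come from one long fit or from repeated
        # one-epoch runs on the same model.
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
        return [optimizer], [scheduler]

# Expected LR after each completed epoch: 0.05, 0.025, 0.0125, ...
```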
Environment
CUDA:
    GPU:
    available: False
    version: 11.1
Packages:
    numpy: 1.19.5
    pyTorch_debug: False
    pyTorch_version: 1.10.0+cu111
    pytorch-lightning: 1.5.10
    tqdm: 4.62.3
System:
    OS: Linux
    architecture:
        64bit
    processor: x86_64
    python: 3.7.12
    version: #1 SMP Tue Dec 7 09:58:10 PST 2021
Additional context
This is part of my effort to ensure I'm getting the same results in Lightning as I do in vanilla PyTorch.
cc @rohitgr7