Update PET checkpoint format #399
Conversation
It looks good to me. Just some small comments on the docs (but we can also improve them later) and potentially one bug
@@ -82,7 +68,8 @@ These parameters control whether to use LoRA for pre-trained model fine-tuning
 (``LORA_RANK``), and the regularization factor for the low-rank matrices
 (``LORA_ALPHA``).

-4. Run ``mtt train options.yaml`` to fine-tune the model.
+4. Run ``mtt train options.yaml -c best_model_*.ckpt`` to fine-tune the model.
What does the * stand for here? Perhaps it would be worth explaining what it is
I was referring to the fact that the checkpoint with the best model is now forced to have the `best_model_` prefix. But I think here I will just leave it as `best_model.ckpt`, since this is the default name now.
> I was referring to the fact that the checkpoint with the best model is now forced to have the `best_model_` prefix

Where is this enforced? I don't see it in this PR.
> I was referring to the fact that the checkpoint with the best model is now forced to have the `best_model_` prefix
>
> Where is this enforced? I don't see it in this PR.

It happens in `Trainer.save_checkpoint()`, where I add the `last_checkpoint_` and `best_` prefixes to the path.
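For illustration, a minimal sketch of that idea; the dictionary keys and the exact split between "trainer state" and "model state" are assumptions, not the actual `Trainer.save_checkpoint()` code in metatrain:

```python
# Hypothetical sketch: write two checkpoint files from one requested path,
# distinguished by the "last_checkpoint_" and "best_" prefixes mentioned above.
from pathlib import Path

import torch


def save_checkpoint(state: dict, path: str) -> None:
    path = Path(path)
    # Full training state (model weights + optimizer + scheduler).
    torch.save(state, path.with_name("last_checkpoint_" + path.name))
    # Best model only: drop the trainer state before saving (key names are invented).
    model_only = {k: v for k, v in state.items() if k not in ("optimizer", "scheduler")}
    torch.save(model_only, path.with_name("best_" + path.name))
```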
Does this mean that the published checkpoint for PET-MAD will be named `best_model_{something}.ckpt`? Or can it be renamed and still work?
> It happens in `Trainer.save_checkpoint()`

Ok, but then this is only something for PET, not a general thing. This should be fine here for the checkpoint name, but I now realize that this file (which should apply to any architecture) contains a lot of PET-specific documentation (LoRA hypers, …). Should we move this elsewhere to be explicitly about PET?
Ah, there is a warning on top. Might still make sense to move the file, but this can happen later
> Does this mean that the published checkpoint for PET-MAD will be named `best_model_{something}.ckpt`? Or can it be renamed and still work?

No, the final name of the file can definitely be anything; it's just that the default name (i.e. the one the checkpoint will have after the training) is forced to have this prefix, to distinguish between the checkpoint with only the best model and the one with the last step of the training. You can rename the model later and use it with `-c` normally.
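As a hypothetical usage example (the file names here are invented), renaming the default output and passing it back with `-c` would look like this:

```python
# Rename the best-model checkpoint and resume fine-tuning from it via the CLI.
import shutil
import subprocess

shutil.move("best_model_pet.ckpt", "pet-mad-v1.ckpt")  # any name works
# Equivalent to running ``mtt train options.yaml -c pet-mad-v1.ckpt`` from the shell.
subprocess.run(["mtt", "train", "options.yaml", "-c", "pet-mad-v1.ckpt"], check=True)
```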
> Ah, there is a warning on top. Might still make sense to move the file, but this can happen later

Maybe we can do it while working on a PR on a more general and high-level fine-tuning implementation, but for now I would keep it like this.
This PR fixes issue #396.

From now on, two versions of the PET checkpoint will be saved at the end of the training. The first one will contain the state of the last training step (i.e. optimiser and scheduler states, as well as the model weights). The second one will contain only the best model state, with no trainer state.

The API will accept both formats and load the trainer state only if it is available.
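A hedged sketch of that loading behaviour (the function signature and dictionary keys are illustrative assumptions, not the actual metatrain API):

```python
# Restore the model weights in all cases, and the trainer state only when the
# checkpoint actually contains it, so both checkpoint flavours can be loaded.
import torch


def load_checkpoint(path, model, optimizer=None, lr_scheduler=None):
    checkpoint = torch.load(path, map_location="cpu", weights_only=False)
    model.load_state_dict(checkpoint["model_state_dict"])
    # Optimizer/scheduler states only exist in the "last training step" checkpoint.
    if optimizer is not None and "optimizer_state_dict" in checkpoint:
        optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    if lr_scheduler is not None and "scheduler_state_dict" in checkpoint:
        lr_scheduler.load_state_dict(checkpoint["scheduler_state_dict"])
    return checkpoint
```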
Contributor (creator of pull-request) checklist
- `-c` works (both last and best checkpoint)

📚 Documentation preview 📚: https://metatrain--399.org.readthedocs.build/en/399/