
Update PET checkpoint format #399

Merged: 21 commits into main from new_pet_checkpoint_format on Nov 25, 2024

Conversation

@abmazitov (Contributor) commented Nov 21, 2024

This PR fixes issue #396.
From now on, two versions of the PET checkpoint are saved at the end of training.
The first contains the full training state (i.e. the optimiser and scheduler states, plus the model weights). The second contains only the best model state, with no trainer state.
The API accepts both formats and loads the trainer state only if it is available.
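The two-checkpoint scheme described above could be sketched roughly as follows. This is a minimal illustration, not the actual metatrain implementation; the function names, file names, and checkpoint keys (model_state_dict, optimizer_state_dict, scheduler_state_dict) are all assumptions:

```python
import torch


def save_checkpoints(directory, model, optimizer, scheduler):
    """Sketch: write both checkpoint variants at the end of training."""
    # Full checkpoint: model weights plus trainer state, for exact resumption.
    torch.save(
        {
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "scheduler_state_dict": scheduler.state_dict(),
        },
        f"{directory}/last_checkpoint.ckpt",
    )
    # Best-model checkpoint: weights only, no trainer state.
    torch.save(
        {"model_state_dict": model.state_dict()},
        f"{directory}/best_model.ckpt",
    )


def load_checkpoint(path, model, optimizer=None, scheduler=None):
    """Sketch: accept both formats, restoring trainer state only if present."""
    checkpoint = torch.load(path, weights_only=True)
    model.load_state_dict(checkpoint["model_state_dict"])
    # The trainer state is restored only when the checkpoint carries it.
    if optimizer is not None and "optimizer_state_dict" in checkpoint:
        optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    if scheduler is not None and "scheduler_state_dict" in checkpoint:
        scheduler.load_state_dict(checkpoint["scheduler_state_dict"])
    return checkpoint
```

Loading the best-model file with this sketch simply skips the optimiser and scheduler restoration, which mirrors the "load the trainer state only if it is available" behaviour described above.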

Contributor (creator of pull-request) checklist

  • Issue referenced (for PRs that solve an issue)?
  • Model training works
  • Model training from last checkpoint works
  • Model training from best checkpoint works
  • PET.restart(new_dataset_info) adapted
  • Best model keeper works throughout multiple training runs
  • Fine-tuning with -c works (both last and best checkpoint)
  • Model loads and exports normally
  • PET-MAD checkpoints are uploaded to HF
  • HF fetch works
  • Documentation updated (for new features)?

📚 Documentation preview 📚: https://metatrain--399.org.readthedocs.build/en/399/

@abmazitov abmazitov marked this pull request as ready for review November 25, 2024 10:31
@abmazitov abmazitov requested review from PicoCentauri, frostedoyster and Luthaf and removed request for PicoCentauri November 25, 2024 10:31
@frostedoyster (Collaborator) left a comment


It looks good to me. Just some small comments on the docs (but we can also improve them later) and potentially one bug

docs/src/advanced-concepts/fine-tuning.rst
@@ -82,7 +68,8 @@ These parameters control whether to use LoRA for pre-trained model fine-tuning
(``LORA_RANK``), and the regularization factor for the low-rank matrices
(``LORA_ALPHA``).

- 4. Run ``mtt train options.yaml`` to fine-tune the model.
+ 4. Run ``mtt train options.yaml -c best_model_*.ckpt`` to fine-tune the model.
Collaborator


What does the * stand for here? Perhaps it would be worth explaining what it is.

Contributor Author


I was referring to the fact that the checkpoint with the best model is now forced to have the best_model_ prefix. But I think I will just leave it as best_model.ckpt here, since this is the default name now.

Member


> I was referring to the fact that the checkpoint with the best model is now forced to have the best_model_ prefix

Where is this enforced? I don't see it in this PR

Contributor Author


> I was referring to the fact that the checkpoint with the best model is now forced to have the best_model_ prefix
>
> Where is this enforced? I don't see it in this PR

It happens in Trainer.save_checkpoint(), where I add the last_checkpoint_ and best_ prefixes to the path.
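The prefixing being discussed could look roughly like this. A minimal sketch only: the real logic lives in PET's Trainer.save_checkpoint() in src/metatrain/experimental/pet/trainer.py, and this helper's name and signature are hypothetical:

```python
from pathlib import Path


def prefixed_checkpoint_path(directory: str, name: str, best: bool) -> Path:
    # Hypothetical helper mimicking the prefixing described for
    # Trainer.save_checkpoint(): the best-model checkpoint gets the
    # best_model_ prefix, the full one gets last_checkpoint_.
    prefix = "best_model_" if best else "last_checkpoint_"
    return Path(directory) / (prefix + name)
```

For example, prefixed_checkpoint_path("outputs", "pet.ckpt", best=True) would yield outputs/best_model_pet.ckpt; as the discussion below notes, this only affects the default name written during training, and the file can be renamed afterwards.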

@frostedoyster (Collaborator) commented Nov 25, 2024


Does this mean that the published checkpoint for PET-MAD will be named best_model_{something}.ckpt? Or can it be renamed and still work?

@Luthaf (Member) commented Nov 25, 2024


> It happens in the Trainer.save_checkpoint()

OK, but then this is only something for PET, not a general thing. This should be fine for the checkpoint name, but I now realize that this file (which should apply to any architecture) contains a lot of PET-specific documentation (LoRA hypers, …). Should we move this elsewhere to be explicitly about PET?

Member


Ah, there is a warning on top. Might still make sense to move the file, but this can happen later

Contributor Author


> Does this mean that the published checkpoint for PET-MAD will be named best_model_{something}.ckpt? Or can it be renamed and still work?

No, the final name of the file can certainly be anything; it's just that the default name (i.e. the one the checkpoint has right after training) is forced to have this prefix, to distinguish the checkpoint containing only the best model from the one containing the last step of training. You can rename the model later and use it with -c normally.

@abmazitov (Contributor) commented Nov 25, 2024


> Ah, there is a warning on top. Might still make sense to move the file, but this can happen later

Maybe we can do it while working on a PR for a more general, high-level fine-tuning implementation, but for now I would keep it like this.

src/metatrain/experimental/pet/trainer.py
@abmazitov abmazitov merged commit 266264e into main Nov 25, 2024
13 checks passed
@abmazitov abmazitov deleted the new_pet_checkpoint_format branch November 25, 2024 12:45