
Update PET checkpoint format #399

Merged: 21 commits into main from new_pet_checkpoint_format on Nov 25, 2024

Conversation

@abmazitov (Contributor) commented Nov 21, 2024

This PR fixes issue #396.
From now on, two versions of the PET checkpoint are saved at the end of training.
The first contains the full training state (i.e. the optimiser and scheduler states, plus the model weights). The second contains only the best model state, with no trainer state.
The API accepts both formats and loads the trainer state only if it is available.
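The two-checkpoint scheme described above could be sketched roughly as follows. This is a minimal illustration, not the actual metatrain implementation; the function names, file names, and checkpoint keys (model_state_dict, optimizer_state_dict, scheduler_state_dict) are all assumptions:

```python
import torch


def save_checkpoints(directory, model, optimizer, scheduler):
    """Sketch: write both checkpoint variants at the end of training."""
    # Full checkpoint: model weights plus trainer state, for exact resumption.
    torch.save(
        {
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "scheduler_state_dict": scheduler.state_dict(),
        },
        f"{directory}/last_checkpoint.ckpt",
    )
    # Best-model checkpoint: weights only, no trainer state.
    torch.save(
        {"model_state_dict": model.state_dict()},
        f"{directory}/best_model.ckpt",
    )


def load_checkpoint(path, model, optimizer=None, scheduler=None):
    """Sketch: accept both formats, restoring trainer state only if present."""
    checkpoint = torch.load(path, weights_only=True)
    model.load_state_dict(checkpoint["model_state_dict"])
    # The trainer state is restored only when the checkpoint carries it.
    if optimizer is not None and "optimizer_state_dict" in checkpoint:
        optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    if scheduler is not None and "scheduler_state_dict" in checkpoint:
        scheduler.load_state_dict(checkpoint["scheduler_state_dict"])
    return checkpoint
```

Loading the best-model file with this sketch simply skips the optimiser and scheduler restoration, which mirrors the "load the trainer state only if it is available" behaviour described above.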

Contributor (creator of pull-request) checklist

  • Issue referenced (for PRs that solve an issue)?
  • Model training works
  • Model training from last checkpoint works
  • Model training from best checkpoint works
  • PET.restart(new_dataset_info) adapted
  • Best model keeper works throughout multiple training runs
  • Fine-tuning with -c works (both last and best checkpoint)
  • Model loads and exports normally
  • PET-MAD checkpoints are uploaded to HF
  • HF fetch works
  • Documentation updated (for new features)?

📚 Documentation preview 📚: https://metatrain--399.org.readthedocs.build/en/399/

@abmazitov abmazitov marked this pull request as ready for review November 25, 2024 10:31
@abmazitov abmazitov requested review from PicoCentauri, frostedoyster and Luthaf and removed request for PicoCentauri November 25, 2024 10:31
@frostedoyster (Collaborator) left a comment


It looks good to me. Just some small comments on the docs (but we can also improve them later) and potentially one bug

docs/src/advanced-concepts/fine-tuning.rst
@@ -82,7 +68,8 @@ These parameters control whether to use LoRA for pre-trained model fine-tuning
(``LORA_RANK``), and the regularization factor for the low-rank matrices
(``LORA_ALPHA``).

- 4. Run ``mtt train options.yaml`` to fine-tune the model.
+ 4. Run ``mtt train options.yaml -c best_model_*.ckpt`` to fine-tune the model.
Collaborator


What does the * stand for here? Perhaps it would be worth explaining what it is.

Contributor Author


I was referring to the fact that the checkpoint with the best model is now forced to have the best_model_ prefix. But I think I will just leave it as best_model.ckpt here, since this is the default name now.

Member


> I was referring to the fact that the checkpoint with the best model is now forced to have the best_model_ prefix

Where is this enforced? I don't see it in this PR

Contributor Author


> I was referring to the fact that the checkpoint with the best model is now forced to have the best_model_ prefix
>
> Where is this enforced? I don't see it in this PR

It happens in Trainer.save_checkpoint(), where I add the last_checkpoint_ and best_ prefixes to the path.
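The prefixing being discussed could look roughly like this. A minimal sketch only: the real logic lives in PET's Trainer.save_checkpoint() in src/metatrain/experimental/pet/trainer.py, and this helper's name and signature are hypothetical:

```python
from pathlib import Path


def prefixed_checkpoint_path(directory: str, name: str, best: bool) -> Path:
    # Hypothetical helper mimicking the prefixing described for
    # Trainer.save_checkpoint(): the best-model checkpoint gets the
    # best_model_ prefix, the full one gets last_checkpoint_.
    prefix = "best_model_" if best else "last_checkpoint_"
    return Path(directory) / (prefix + name)
```

For example, prefixed_checkpoint_path("outputs", "pet.ckpt", best=True) would yield outputs/best_model_pet.ckpt; as the discussion below notes, this only affects the default name written during training, and the file can be renamed afterwards.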

@frostedoyster (Collaborator) commented Nov 25, 2024


Does this mean that the published checkpoint for PET-MAD will be named best_model_{something}.ckpt? Or can it be renamed and still work?

@Luthaf (Member) commented Nov 25, 2024


> It happens in the Trainer.save_checkpoint()

OK, but then this is only something for PET, not a general thing. This should be fine for the checkpoint name, but I now realize that this file (which should apply to any architecture) contains a lot of PET-specific documentation (LoRA hypers, …). Should we move this elsewhere to be explicitly about PET?

Member


Ah, there is a warning on top. Might still make sense to move the file, but this can happen later

Contributor Author


> Does this mean that the published checkpoint for PET-MAD will be named best_model_{something}.ckpt? Or can it be renamed and still work?

No, the final name of the file can certainly be anything; it's just that the default name (i.e. the one the checkpoint has right after training) is forced to have this prefix, to distinguish the checkpoint containing only the best model from the one containing the last step of training. You can rename the model later and use it with -c normally.

@abmazitov (Contributor) commented Nov 25, 2024


> Ah, there is a warning on top. Might still make sense to move the file, but this can happen later

Maybe we can do it while working on a PR for a more general, high-level fine-tuning implementation, but for now I would keep it like this.

src/metatrain/experimental/pet/trainer.py
@abmazitov abmazitov merged commit 266264e into main Nov 25, 2024
13 checks passed
@abmazitov abmazitov deleted the new_pet_checkpoint_format branch November 25, 2024 12:45