Resume from last correct checkpoint when restarting a pytorch job #432

Open
ppnaik1890 opened this issue Jan 3, 2025 · 1 comment

@ppnaik1890

Describe the bug

If the PyTorch job is restarted, it should resume from the last complete checkpoint instead of the latest checkpoint folder, since the latest folder may hold an incomplete checkpoint.

Sample Steps

1. Start a PyTorch job
2. Stop and start the job again

Expected behavior

It should cleanly resume from the last complete checkpoint.

Observed behavior

The job is unable to start and errors out with the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/cos1/output/sg/fine-tuned-tiny-llama-new3/checkpoint-25/trainer_state.json'

This is because it tries to resume from the latest checkpoint folder even when it is not a completed checkpoint folder (e.g. the previous run was stopped while the checkpoint was being written, so trainer_state.json is missing).
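A possible workaround until this is handled upstream: instead of `resume_from_checkpoint=True` (which makes the HF Trainer resume from the newest `checkpoint-<step>` folder regardless of whether it was fully written), pick the newest checkpoint that actually contains `trainer_state.json` and pass its path explicitly. Below is a minimal sketch; `get_last_complete_checkpoint` is a hypothetical helper, not part of the Transformers API, and it assumes the Trainer's standard `checkpoint-<step>` folder naming and treats the presence of `trainer_state.json` as the completeness marker.

```python
import os
import re

def get_last_complete_checkpoint(output_dir):
    """Return the newest checkpoint-<step> folder under output_dir that
    contains trainer_state.json (i.e. was fully written), or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    complete = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if (match and os.path.isdir(path)
                and os.path.isfile(os.path.join(path, "trainer_state.json"))):
            complete.append((int(match.group(1)), path))
    # Pick the checkpoint with the highest global step among the complete ones.
    return max(complete)[1] if complete else None
```

`Trainer.train()` accepts an explicit path for `resume_from_checkpoint`, so the restart becomes `trainer.train(resume_from_checkpoint=get_last_complete_checkpoint(training_args.output_dir))`; if no complete checkpoint exists the helper returns `None`, which simply starts training from scratch.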

@ppnaik1890 ppnaik1890 changed the title Resume from last correct checkpoint when restarting a job Resume from last correct checkpoint when restarting a pytorch job Jan 3, 2025
@SilverSoldier

Seems to be HF Trainer related. Added a feature request in Transformers: huggingface/transformers#35525
