Describe the bug
If a PyTorch job is restarted, it should resume from the last complete checkpoint rather than from the most recent checkpoint folder, because that folder may contain an incomplete checkpoint.
Sample Steps
Start a PyTorch job.
Stop the job, then start it again.
Expected behavior
The job should resume cleanly from the last complete checkpoint.
Observed behavior
The job fails to start and exits with the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/cos1/output/sg/fine-tuned-tiny-llama-new3/checkpoint-25/trainer_state.json'
This happens because the job tries to resume from the latest checkpoint folder even when that folder does not contain a complete checkpoint.
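One possible way to pick the resume point is sketched below. It is not the project's actual fix. It assumes that checkpoint folders are named `checkpoint-<step>` (as in the path from the error) and that the presence of `trainer_state.json` marks a checkpoint as complete, since that is the file the failing resume could not find. The function name `last_complete_checkpoint` and the constant `COMPLETION_MARKER` are hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: a checkpoint counts as complete only once this file exists,
# because it is the file the failing resume reported as missing.
COMPLETION_MARKER = "trainer_state.json"
CHECKPOINT_DIR = re.compile(r"^checkpoint-(\d+)$")


def last_complete_checkpoint(output_dir: str) -> Optional[str]:
    """Return the highest-step checkpoint folder that contains the marker file."""
    root = Path(output_dir)
    if not root.is_dir():
        return None

    # Collect (step, path) pairs for every folder named checkpoint-<step>.
    candidates = []
    for entry in root.iterdir():
        match = CHECKPOINT_DIR.match(entry.name)
        if match and entry.is_dir():
            candidates.append((int(match.group(1)), entry))

    # Walk from the newest step to the oldest and skip incomplete folders.
    for _step, path in sorted(candidates, key=lambda c: c[0], reverse=True):
        if (path / COMPLETION_MARKER).is_file():
            return str(path)
    return None
```

If the job uses the Hugging Face `Trainer`, which the `trainer_state.json` file name suggests but the report does not confirm, the returned path could be passed as `resume_from_checkpoint` to `Trainer.train`. A return value of `None` would mean starting training from scratch. A stricter check might also verify model and optimizer files, depending on what the job writes last.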
ppnaik1890 changed the title from "Resume from last correct checkpoint when restarting a job" to "Resume from last correct checkpoint when restarting a pytorch job" on Jan 3, 2025