Resume from last correct checkpoint when restarting a pytorch job #432

Open
ppnaik1890 opened this issue Jan 3, 2025 · 1 comment

@ppnaik1890

Describe the bug

If the PyTorch job is restarted, it should resume from the last complete checkpoint instead of the latest checkpoint folder, since the latest folder may hold an incomplete checkpoint.

Sample Steps

1. Start a PyTorch job
2. Stop and start the job again

Expected behavior

It should cleanly resume from the last complete checkpoint.

Observed behavior

The job is unable to start and errors out with the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/cos1/output/sg/fine-tuned-tiny-llama-new3/checkpoint-25/trainer_state.json'

This is because it tries to resume from the latest checkpoint folder even when it is not a completed checkpoint folder (e.g. the previous run was stopped while the checkpoint was being written, so trainer_state.json is missing).
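A possible workaround until this is handled upstream: instead of `resume_from_checkpoint=True` (which makes the HF Trainer resume from the newest `checkpoint-<step>` folder regardless of whether it was fully written), pick the newest checkpoint that actually contains `trainer_state.json` and pass its path explicitly. Below is a minimal sketch; `get_last_complete_checkpoint` is a hypothetical helper, not part of the Transformers API, and it assumes the Trainer's standard `checkpoint-<step>` folder naming and treats the presence of `trainer_state.json` as the completeness marker.

```python
import os
import re

def get_last_complete_checkpoint(output_dir):
    """Return the newest checkpoint-<step> folder under output_dir that
    contains trainer_state.json (i.e. was fully written), or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    complete = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if (match and os.path.isdir(path)
                and os.path.isfile(os.path.join(path, "trainer_state.json"))):
            complete.append((int(match.group(1)), path))
    # Pick the checkpoint with the highest global step among the complete ones.
    return max(complete)[1] if complete else None
```

`Trainer.train()` accepts an explicit path for `resume_from_checkpoint`, so the restart becomes `trainer.train(resume_from_checkpoint=get_last_complete_checkpoint(training_args.output_dir))`; if no complete checkpoint exists the helper returns `None`, which simply starts training from scratch.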

@ppnaik1890 ppnaik1890 changed the title Resume from last correct checkpoint when restarting a job Resume from last correct checkpoint when restarting a pytorch job Jan 3, 2025
@SilverSoldier

Seems to be HF Trainer related. Added a feature request in Transformers: huggingface/transformers#35525
