Error reporting and wait for preparation task completion #31

Merged: camallen merged 3 commits into main from error-reporting-and-wait-prep-task on Mar 29, 2023

Contributor

@camallen camallen commented Mar 28, 2023

closes #15 and #10

This PR adds Honeybadger error reporting to the batch scripts so that errors raised in the batch processing Python code are surfaced.
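
As a rough illustration of the error reporting idea (not the code in this PR), a batch entry point could be wrapped so that any raised exception is reported and then re-raised, which keeps the task's failure status visible to the batch service. The `run_with_error_reporting` and `default_notifier` names are hypothetical, and the sketch assumes the `honeybadger` Python package is installed and configured elsewhere; it falls back to logging when the package is missing.

```python
import logging
from typing import Callable, Optional, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


def default_notifier() -> Callable[[BaseException], object]:
    """Return Honeybadger's notify function if available, else a logging fallback.

    Assumption: Honeybadger is configured elsewhere (for example its API key).
    """
    try:
        from honeybadger import honeybadger  # optional third-party dependency
    except ImportError:
        return lambda exc: logger.error("unreported error: %r", exc)
    return honeybadger.notify


def run_with_error_reporting(
    job: Callable[[], T],
    notify: Optional[Callable[[BaseException], object]] = None,
) -> T:
    """Run a batch job, report any exception, then re-raise it."""
    if notify is None:
        notify = default_notifier()
    try:
        return job()
    except Exception as exc:
        notify(exc)  # surface the error to the reporting service
        raise  # keep the non-zero exit so the task is still marked as failed
```

A training or prediction script would then call something like `run_with_error_reporting(main)` at its entry point, where `main` stands in for whatever function the script already runs.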

Finally, this PR adds a configurable wait time on main task start. We don't currently wait for the job preparation task to complete successfully, because doing so leaves the node pool scaled if the preparation task doesn't finish properly. Instead, the main task waits for a configurable time at start so that the preparation task can finish and make the training / prediction code available.

Without this, the main task can run before the job preparation task has finished, and we get errors like "python: can't open file '/mnt/batch/tasks/shared/train_model_finetune_on_catalog.py': [Errno 2] No such file or directory".
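
One way to picture the wait is sketched below. It is a variation on a fixed delay rather than the implementation in this PR: it polls for the script up to a configured maximum and returns as soon as the file appears. The `MAIN_TASK_WAIT_SECONDS` variable name, the 60 second default, and both function names are assumptions made for the example.

```python
import os
import time
from pathlib import Path
from typing import Mapping


def configured_wait_seconds(
    env: Mapping[str, str] = os.environ, default: float = 60.0
) -> float:
    """Read the maximum wait from a (hypothetical) MAIN_TASK_WAIT_SECONDS variable."""
    return float(env.get("MAIN_TASK_WAIT_SECONDS", default))


def wait_for_file(path: str, max_wait_seconds: float, poll_seconds: float = 1.0) -> bool:
    """Poll until `path` exists or the configured wait has elapsed.

    Returns True if the file is present, False if the wait timed out.
    """
    deadline = time.monotonic() + max_wait_seconds
    while True:
        if Path(path).is_file():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_seconds, remaining))
```

The main task could call `wait_for_file("/mnt/batch/tasks/shared/train_model_finetune_on_catalog.py", configured_wait_seconds())` before launching the script and fail with a clear message if it returns `False`, instead of hitting the `[Errno 2]` error above.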

report raised errors etc to honeybadger
@camallen camallen force-pushed the error-reporting-and-wait-prep-task branch from eafa23f to abe1ac5 Compare March 29, 2023 12:48
@camallen camallen merged commit 9e8b4d8 into main Mar 29, 2023
@camallen camallen deleted the error-reporting-and-wait-prep-task branch March 29, 2023 13:02
Development

Successfully merging this pull request may close these issues.

add error reporting to prediction and training jobs