NOT_FOUND error on google-batch #5422
Comments
The retry policy could be extended to handle this error; however, in principle NOT_FOUND should be a permanent error. Tagging @tom-seqera @jorgee for visibility.
I think this is a timing issue where Nextflow is checking the status of a task that has already been deleted.
Solved via aa4d19c
@pditommaso I am not sure this retry implementation fully resolves the problem. The retry window isn't long enough to cover the task state transition period in Google Batch. This doesn't seem to be formally documented, but looking at the config constructor for the retry behaviour, can we use the retry config options to extend the retry window and backoff?
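As a rough sketch only, the kind of override being discussed could look like the following in `nextflow.config`; the `google.batch.retryPolicy` scope and its field names are assumptions here, not options confirmed by this thread:

```groovy
// Hypothetical nextflow.config fragment -- scope and option names are assumptions;
// check the Nextflow Google Batch documentation for the authoritative settings.
google {
    batch {
        retryPolicy {
            delay       = '15 sec'   // initial delay before retrying a failed status check
            maxDelay    = '45 sec'   // cap on the exponential backoff between retries
            maxAttempts = 10         // attempts before NOT_FOUND is treated as a real failure
            jitter      = 0.25       // randomise delays so retries are not synchronised
        }
    }
}
```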
Well, configuration options already exist for this. I'd suggest using these config options to validate the best settings before turning them into default values.
Unfortunately, even with aggressive retry policies (e.g. a 15s initial delay, 45s max delay, 10+ attempts, with checks spanning approximately 4 minutes), workflows with a large number of tasks can still fail due to this error. It's also quite hard to reproduce. I am not sure why the Google Batch side has recently been taking so long to return the task state. For example (redacted log):
Can we handle the NOT_FOUND error more gracefully? Just for the sake of discussion, I've proposed this in #5648.
Thoughts @pditommaso?
Is there a more recent example (within 10 days) with a job ID or UID?
Here's one from today:
Thank you!
@pditommaso: while we are looking into the task issue, does it make sense for Nextflow to query the job state instead of the task state, given there is only one task in the job? I think that would be more stable.
We recently moved to the task API to enable the use of task arrays. Even if we fix the single-task case, wouldn't the same problem arise when using arrays?
In PR #5567 I added checkJobStatus to solve issue #5550. In that case, the tasks were not created due to a scheduling problem, so checkJobStatus checks the job status when no tasks appear in the job. I think a similar problem is happening here: the task is created but for some reason it is not accessible through the API. Maybe we could call checkJobStatus when checkTaskStatus returns the NOT_FOUND error, to verify that no errors are reported in the job.
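A minimal sketch of that idea, reusing the `checkTaskStatus`/`checkJobStatus` naming from the comment above; the method signature, fallback state and error handling below are illustrative, not the actual Nextflow implementation:

```groovy
import com.google.api.gax.rpc.NotFoundException
import com.google.cloud.batch.v1.BatchServiceClient
import com.google.cloud.batch.v1.JobStatus
import com.google.cloud.batch.v1.TaskStatus

// Sketch: fall back to the job status when the task lookup returns NOT_FOUND,
// instead of failing the run immediately.
TaskStatus.State checkTaskState(BatchServiceClient client, String jobName, String taskName) {
    try {
        return client.getTask(taskName).status.state
    }
    catch( NotFoundException e ) {
        // The task is not visible (yet) -- check whether the job itself reports an error
        final jobState = client.getJob(jobName).status.state
        if( jobState == JobStatus.State.FAILED )
            throw new IllegalStateException("Job $jobName failed while task $taskName was not found", e)
        // Otherwise assume the task is still being scheduled and keep polling
        return TaskStatus.State.PENDING
    }
}
```

Here `jobName` and `taskName` would be the fully qualified resource names, e.g. `projects/<project>/locations/<region>/jobs/<job>` and `.../taskGroups/group0/tasks/0`.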
That sounds reasonable. Maybe @bolianyin can shed some light on the proper API to use for this case.
We need time to investigate the task NOT_FOUND issue; it does not seem to be easily reproducible. Checking the job state can be a workaround until the issue is fixed. The job status contains a summary of task states, which could be used for jobs with more than one task.
If we can see all of the task states from the job state, I would much rather do that. I think I was not able to figure out how to do that when I refactored the code to query the task state.
When you describe a job, the job status field contains a "taskGroups" field, which shows the task count per task state once tasks have been created. Something like the example below, if a job has two SUCCEEDED tasks:
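A reconstructed sketch of that fragment, as it would appear in a `gcloud batch jobs describe` output (the values here are invented for the two-task case):

```yaml
# Rough illustration of the job status fragment -- values are invented
status:
  state: SUCCEEDED
  taskGroups:
    group0:
      counts:
        SUCCEEDED: '2'
```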
Then, this could be used only for non-array jobs.
For array jobs, do you need to know the exact state of each task, or general stats like how many tasks have been scheduled, finished, etc.?
It needs to know the status at the task level.
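For reference, per-task state for a whole task group can also be retrieved in a single call with the Batch client's `listTasks` method; the snippet below is only a sketch of how that could be consumed, not the approach used by Nextflow:

```groovy
import com.google.cloud.batch.v1.BatchServiceClient

// Sketch: enumerate the state of every task in a task group with one API call
def listTaskStates(BatchServiceClient client, String taskGroupName) {
    // taskGroupName, e.g. projects/<project>/locations/<region>/jobs/<job>/taskGroups/group0
    client.listTasks(taskGroupName).iterateAll().collectEntries { task ->
        [ (task.name): task.status.state ]
    }
}
```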
Bug report
I have an RNA-seq pipeline I've been using for a while on GCP with success. Recently, I noticed that when processing medium/large datasets, I get a NOT_FOUND error and the workflow gets interrupted. It feels like a temporarily unavailable service, but it keeps recurring when resuming the run.
Expected behavior and actual behavior
No interruption, or a retry, when a given service is not available
Steps to reproduce the problem
Hard to reproduce; it could be due to temporarily unavailable services
Program output
Environment