NOT_FOUND error on google-batch #5422

Open
Puumanamana opened this issue Oct 22, 2024 · 19 comments · May be fixed by #5690
@Puumanamana

Bug report

I have an RNA-seq pipeline I've been using on GCP with success for a while. Recently, I noticed that when processing medium/large datasets, I get a NOT_FOUND error and the workflow is interrupted. It feels like a temporarily unavailable service, but the error keeps recurring when resuming the run.

Expected behavior and actual behavior

The workflow should retry, rather than abort, when a given service is temporarily unavailable

Steps to reproduce the problem

Hard to reproduce; it could be due to temporarily unavailable services

Program output

Oct-22 16:33:28.660 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=ROME_QUANT_HUMAN:RNASEQC (ERR10232708); work-dir=gs://nxf-work/romequant/cedric/c9/cec2987387dd8a25f738dab2db92a9
  error [com.google.api.gax.rpc.NotFoundException]: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/114884310888/locations/us-central1/jobs/nf-c9cec298-1729608617842/taskGroups/group0/tasks/0" was not found
Oct-22 16:33:28.683 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'ROME_QUANT_HUMAN:RNASEQC (ERR10232708)'

Caused by:
  NOT_FOUND: Resource "projects/114884310888/locations/us-central1/jobs/nf-c9cec298-1729608617842/taskGroups/group0/tasks/0" was not found

com.google.api.gax.rpc.NotFoundException: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/114884310888/locations/us-central1/jobs/nf-c9cec298-1729608617842/taskGroups/group0/tasks/0" was not found
        at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:90)
        at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:98)
        at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
        at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
        at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:84)
        at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1127)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
        at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:568)
        at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:538)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:570)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:567)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:71)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:735)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:716)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)
        Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
                at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57)
                at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
                at com.google.cloud.batch.v1.BatchServiceClient.getTask(BatchServiceClient.java:789)
                at com.google.cloud.batch.v1.BatchServiceClient.getTask(BatchServiceClient.java:732)
                at nextflow.cloud.google.batch.client.BatchClient$_describeTask_lambda4.doCall(BatchClient.groovy:113)
                at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:236)
                at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
                at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:75)
                at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:176)
                at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:437)
                at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:115)
                at nextflow.cloud.google.batch.client.BatchClient.apply(BatchClient.groovy:187)
                at nextflow.cloud.google.batch.client.BatchClient.describeTask(BatchClient.groovy:113)
                at nextflow.cloud.google.batch.client.BatchClient.getTaskStatus(BatchClient.groovy:122)
                at nextflow.cloud.google.batch.GoogleBatchTaskHandler.getTaskState(GoogleBatchTaskHandler.groovy:454)
                at nextflow.cloud.google.batch.GoogleBatchTaskHandler.checkIfRunning(GoogleBatchTaskHandler.groovy:479)
                at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:642)
                at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:571)
                at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:441)
                at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
                at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.base/java.lang.reflect.Method.invoke(Method.java:568)
                at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
                at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
                at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
                at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
                at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
                at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
                at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
                at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
                at nextflow.processor.TaskPollingMonitor$_start_closure2.doCall(TaskPollingMonitor.groovy:316)
                at nextflow.processor.TaskPollingMonitor$_start_closure2.call(TaskPollingMonitor.groovy)
                at groovy.lang.Closure.run(Closure.java:505)
                ... 1 common frames omitted
Caused by: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/114884310888/locations/us-central1/jobs/nf-c9cec298-1729608617842/taskGroups/group0/tasks/0" was not found
        at io.grpc.Status.asRuntimeException(Status.java:537)
        ... 14 common frames omitted
Oct-22 16:33:28.698 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/114884310888/locations/us-central1/jobs/nf-c9cec298-1729608617842/taskGroups/group0/tasks/0" was>
Oct-22 16:33:28.750 [Task monitor] DEBUG nextflow.Session - The following nodes are still active:

Environment

  • Nextflow version: 24.09.2-edge build 5927
  • Java version: openjdk version "11.0.24" 2024-07-16
  • Operating system: Ubuntu
  • Bash version: zsh 5.8
@pditommaso
Member

The retry policy could be extended to handle this error; however, in principle NOT_FOUND should be a permanent error.

Tagging @tom-seqera @jorgee for visibility.

@bentsherman
Member

bentsherman commented Oct 28, 2024

I think this is a timing issue where Nextflow is checking the status of a task that has been deleted.

@pditommaso
Member

Solved via aa4d19c

@ejseqera

ejseqera commented Dec 5, 2024

@pditommaso I am not sure if this retry implementation is fully resolving the problem. The NOT_FOUND errors still occur and fail workflows despite retries, as seen in this recent log from a user encountering this issue with 24.11.0-edge:

Dec-04 19:23:39.091 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 1; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXX/locations/us-central1/jobs/nf-da4de5a0-1733338269428/taskGroups/group0/tasks/0" was not found
Dec-04 19:23:39.997 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 2; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXX/locations/us-central1/jobs/nf-da4de5a0-1733338269428/taskGroups/group0/tasks/0" was not found
Dec-04 19:23:41.192 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 3; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXX/locations/us-central1/jobs/nf-da4de5a0-1733338269428/taskGroups/group0/tasks/0" was not found
Dec-04 19:23:44.532 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 4; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXX/locations/us-central1/jobs/nf-da4de5a0-1733338269428/taskGroups/group0/tasks/0" was not found
Dec-04 19:23:44.576 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
 task: name=PROCESS_NAME:XXXXX (SRR27494952); work-dir=gs://bucket/path/to/work
 error [com.google.api.gax.rpc.NotFoundException]: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXX/locations/us-central1/jobs/nf-da4de5a0-1733338269428/taskGroups/group0/tasks/0" was not found
Dec-04 19:23:44.597 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'PROCESS_NAME:XXXXX (SRR27494952)'

Caused by:
 NOT_FOUND: Resource "projects/XXXXXXXXX/locations/us-central1/jobs/nf-da4de5a0-1733338269428/taskGroups/group0/tasks/0" was not found


com.google.api.gax.rpc.NotFoundException: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXX/locations/us-central1/jobs/nf-da4de5a0-1733338269428/taskGroups/group0/tasks/0" was not found

The retry window isn't long enough to cover the task state transition period in Google Batch. This doesn't seem to be formally documented, but looking at the config constructor for the retry behaviour, can we use the following to extend the retry window and backoff?

google.batch.delay = '1s'
google.batch.maxDelay = '15s'
google.batch.maxAttempts = 5
google.batch.jitter = 0.5

@pditommaso
Member

Well, configuration options exist for this. I'd suggest validating the best settings using these config options before turning them into default values:

google.batch.retryPolicy.delay
google.batch.retryPolicy.maxDelay
google.batch.retryPolicy.maxAttempts
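
Putting those options together, a nextflow.config fragment might look like the following. The values are illustrative only, not validated defaults:

```groovy
// Illustrative values only -- tune before adopting as defaults
google.batch.retryPolicy.delay       = '5s'   // initial delay between status-check attempts
google.batch.retryPolicy.maxDelay    = '30s'  // upper bound for the backoff
google.batch.retryPolicy.maxAttempts = 10     // give the Batch API more time to register the task
```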

@ejseqera

ejseqera commented Jan 7, 2025

Unfortunately, even with aggressive retry policies (e.g. 15s initial delay, 45s max delay, 10+ attempts, with checks spanning approximately 4 minutes), workflows with a large number of tasks can still fail due to this error. It's also quite hard to reproduce. I am not sure why the Google Batch side has recently been taking so long to return the task state.

For example (redacted log):

Dec-18 14:58:52.150 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 2; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found
Dec-18 14:59:04.904 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 3; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found
Dec-18 15:00:00.955 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 4; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found
Dec-18 15:00:17.717 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 5; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found
Dec-18 15:00:44.573 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 6; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found
Dec-18 15:00:47.109 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 7; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found
Dec-18 15:01:12.791 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 8; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found
Dec-18 15:02:30.451 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 9; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found

Can we handle the NotFoundException directly in getTaskState() rather than in the retry conditions? This would:

  1. Allow us to explicitly return a PENDING state, which more accurately reflects that the task exists but isn't yet queryable through the API
  2. Avoid unnecessary retry attempts when we know we just need to wait for task registration, acknowledging it's an expected transitional state
  3. Provide clearer logging about what's actually happening vs generic retry messages
  4. Only fail if the task still cannot be found after some reasonable timeout

Just for the sake of discussion, I've proposed this in #5648

A maxStatusDuration config option is added because:

  • It provides a clear upper bound for how long we'll treat "not found" as a pending state
  • It's configurable, allowing users to tune based on their workload patterns/known resource limitations
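
For illustration only, assuming the option were exposed under the existing retryPolicy scope (the exact name and placement are up to PR #5648, and this is not a released setting):

```groovy
// Hypothetical sketch -- option name/placement per PR #5648, not a released setting
google.batch.retryPolicy.maxStatusDuration = '10m'  // treat NOT_FOUND as PENDING for up to 10 minutes
```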

Thoughts @pditommaso?

@bolianyin

Is there a more recent example (within 10 days) with a job id or uid?
In addition, I see there is only one task in the created Google Batch job. Is there a reason to query the task state instead of the job state?

@Puumanamana
Author

Here's one from today: projects/114884310888/locations/us-central1/jobs/nf-79c480cc-1736434667029/taskGroups/group0/tasks/0
The task is successful in my Google Batch logs, but the workflow failed with NOT_FOUND for this task.

@bolianyin

Thank you!

@bolianyin

@pditommaso: while we are looking into the task issue, does it make sense for Nextflow to query the job state instead of the task state, given there is only one task in the job? I think that would be more stable.

@pditommaso pditommaso reopened this Jan 10, 2025
@pditommaso
Member

We recently moved to the task API to enable task arrays. Even if we fix the single-task case, wouldn't the same problem arise when using arrays?

@jorgee
Contributor

jorgee commented Jan 10, 2025

In PR #5567 I added checkJobStatus to solve issue #5550. In that case, the tasks were not created due to a scheduling problem, so I added checkJobStatus to check the job status when no tasks appear in the job. I think a similar problem is happening here: the task is created, but for some reason it is not accessible through the API. Maybe we could call checkJobStatus when checkTaskStatus returns the NOT_FOUND error, to check whether any errors are reported in the job.

@pditommaso
Member

That sounds reasonable. Maybe @bolianyin can shed some light on the proper API to be used to cover this case.

@bolianyin

We need time to investigate the task NOT_FOUND issue; it does not seem to be easily reproducible. Checking the job state can be a workaround before the issue is fixed. The job status contains a summary of task states, which could be used for jobs with more than one task.

@bentsherman
Member

If we can see all of the task states from the job state, I would much rather do that. I could not figure out how to do that when I refactored the code to query the task state.

@bolianyin

When you describe a job, the job status contains a "taskGroups" field, which shows the task count per task state after tasks are created. Something like the following, if a job has two SUCCEEDED tasks:

  taskGroups:
    group0:
      counts:
        SUCCEEDED: '2'

@pditommaso
Member

Then this could be used only for non-array jobs.

@bolianyin

For array jobs, do you need to know the exact state of each task or general stats like how many tasks have scheduled, finished, etc?

@pditommaso
Member

Nextflow needs to know the status at the task level.

@jorgee jorgee self-assigned this Jan 21, 2025
@jorgee jorgee linked a pull request Jan 21, 2025 that will close this issue