NOT_FOUND error on google-batch #5422

Open
Puumanamana opened this issue Oct 22, 2024 · 19 comments · May be fixed by #5690
@Puumanamana

Bug report

I have an RNA-seq pipeline I've been using on GCP with success for a while. Recently, I noticed that when processing medium/large datasets, I get a NOT_FOUND error and the workflow is interrupted. It feels like a temporarily unavailable service, but the error keeps recurring when resuming the run.

Expected behavior and actual behavior

The workflow should retry, rather than abort, when a given service is temporarily unavailable

Steps to reproduce the problem

Hard to reproduce; it could be due to temporarily unavailable services

Program output

Oct-22 16:33:28.660 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=ROME_QUANT_HUMAN:RNASEQC (ERR10232708); work-dir=gs://nxf-work/romequant/cedric/c9/cec2987387dd8a25f738dab2db92a9
  error [com.google.api.gax.rpc.NotFoundException]: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/114884310888/locations/us-central1/jobs/nf-c9cec298-1729608617842/taskGroups/group0/tasks/0" was not found
Oct-22 16:33:28.683 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'ROME_QUANT_HUMAN:RNASEQC (ERR10232708)'

Caused by:
  NOT_FOUND: Resource "projects/114884310888/locations/us-central1/jobs/nf-c9cec298-1729608617842/taskGroups/group0/tasks/0" was not found

com.google.api.gax.rpc.NotFoundException: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/114884310888/locations/us-central1/jobs/nf-c9cec298-1729608617842/taskGroups/group0/tasks/0" was not found
        at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:90)
        at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:98)
        at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
        at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
        at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:84)
        at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1127)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
        at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:568)
        at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:538)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:570)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:567)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:71)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:735)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:716)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)
        Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
                at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57)
                at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
                at com.google.cloud.batch.v1.BatchServiceClient.getTask(BatchServiceClient.java:789)
                at com.google.cloud.batch.v1.BatchServiceClient.getTask(BatchServiceClient.java:732)
                at nextflow.cloud.google.batch.client.BatchClient$_describeTask_lambda4.doCall(BatchClient.groovy:113)
                at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:236)
                at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
                at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:75)
                at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:176)
                at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:437)
                at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:115)
                at nextflow.cloud.google.batch.client.BatchClient.apply(BatchClient.groovy:187)
                at nextflow.cloud.google.batch.client.BatchClient.describeTask(BatchClient.groovy:113)
                at nextflow.cloud.google.batch.client.BatchClient.getTaskStatus(BatchClient.groovy:122)
                at nextflow.cloud.google.batch.GoogleBatchTaskHandler.getTaskState(GoogleBatchTaskHandler.groovy:454)
                at nextflow.cloud.google.batch.GoogleBatchTaskHandler.checkIfRunning(GoogleBatchTaskHandler.groovy:479)
                at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:642)
                at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:571)
                at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:441)
                at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
                at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.base/java.lang.reflect.Method.invoke(Method.java:568)
                at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
                at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
                at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
                at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
                at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
                at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
                at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
                at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
                at nextflow.processor.TaskPollingMonitor$_start_closure2.doCall(TaskPollingMonitor.groovy:316)
                at nextflow.processor.TaskPollingMonitor$_start_closure2.call(TaskPollingMonitor.groovy)
                at groovy.lang.Closure.run(Closure.java:505)
                ... 1 common frames omitted
Caused by: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/114884310888/locations/us-central1/jobs/nf-c9cec298-1729608617842/taskGroups/group0/tasks/0" was not found
        at io.grpc.Status.asRuntimeException(Status.java:537)
        ... 14 common frames omitted
Oct-22 16:33:28.698 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/114884310888/locations/us-central1/jobs/nf-c9cec298-1729608617842/taskGroups/group0/tasks/0" was>
Oct-22 16:33:28.750 [Task monitor] DEBUG nextflow.Session - The following nodes are still active:

Environment

  • Nextflow version: 24.09.2-edge build 5927
  • Java version: openjdk version "11.0.24" 2024-07-16
  • Operating system: Ubuntu
  • Bash version: zsh 5.8
@pditommaso
Member

The retry policy could be extended to handle this error; however, in principle NOT_FOUND should be a permanent error.

Tagging @tom-seqera @jorgee for visibility.

@bentsherman
Member

bentsherman commented Oct 28, 2024

I think this is a timing issue where Nextflow is checking the status of a task that has been deleted.

@pditommaso
Member

Solved via aa4d19c

@ejseqera

ejseqera commented Dec 5, 2024

@pditommaso I am not sure if this retry implementation is fully resolving the problem. The NOT_FOUND errors still occur and fail workflows despite retries, as seen in this recent log from a user encountering this issue with 24.11.0-edge:

Dec-04 19:23:39.091 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 1; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXX/locations/us-central1/jobs/nf-da4de5a0-1733338269428/taskGroups/group0/tasks/0" was not found
Dec-04 19:23:39.997 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 2; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXX/locations/us-central1/jobs/nf-da4de5a0-1733338269428/taskGroups/group0/tasks/0" was not found
Dec-04 19:23:41.192 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 3; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXX/locations/us-central1/jobs/nf-da4de5a0-1733338269428/taskGroups/group0/tasks/0" was not found
Dec-04 19:23:44.532 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 4; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXX/locations/us-central1/jobs/nf-da4de5a0-1733338269428/taskGroups/group0/tasks/0" was not found
Dec-04 19:23:44.576 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
 task: name=PROCESS_NAME:XXXXX (SRR27494952); work-dir=gs://bucket/path/to/work
 error [com.google.api.gax.rpc.NotFoundException]: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXX/locations/us-central1/jobs/nf-da4de5a0-1733338269428/taskGroups/group0/tasks/0" was not found
Dec-04 19:23:44.597 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'PROCESS_NAME:XXXXX (SRR27494952)'

Caused by:
 NOT_FOUND: Resource "projects/XXXXXXXXX/locations/us-central1/jobs/nf-da4de5a0-1733338269428/taskGroups/group0/tasks/0" was not found


com.google.api.gax.rpc.NotFoundException: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXX/locations/us-central1/jobs/nf-da4de5a0-1733338269428/taskGroups/group0/tasks/0" was not found

The retry window isn't long enough to cover the task state transition period in Google Batch. This doesn't seem to be formally documented, but looking at the config constructor for the retry behaviour, can we use the following to extend the retry window and backoff?

google.batch.delay = '1s'
google.batch.maxDelay = '15s'
google.batch.maxAttempts = 5
google.batch.jitter = 0.5

@pditommaso
Member

Well, configuration options exist for this. I'd suggest validating the best settings using these config options before turning them into default values:

google.batch.retryPolicy.delay
google.batch.retryPolicy.maxDelay
google.batch.retryPolicy.maxAttempts
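
Putting those options together, a nextflow.config fragment might look like the following. The values are illustrative only, not validated defaults:

```groovy
// Illustrative values only -- tune before adopting as defaults
google.batch.retryPolicy.delay       = '5s'   // initial delay between status-check attempts
google.batch.retryPolicy.maxDelay    = '30s'  // upper bound for the backoff
google.batch.retryPolicy.maxAttempts = 10     // give the Batch API more time to register the task
```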

@ejseqera

ejseqera commented Jan 7, 2025

Unfortunately, even with aggressive retry policies (e.g. 15s initial delay, 45s max delay, 10+ attempts, with checks spanning approximately 4 minutes), workflows with a large number of tasks can still fail due to this error. It's also quite hard to reproduce. I am not sure why the Google Batch side has recently been taking so long to return the task state.

For example (redacted log):

Dec-18 14:58:52.150 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 2; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found
Dec-18 14:59:04.904 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 3; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found
Dec-18 15:00:00.955 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 4; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found
Dec-18 15:00:17.717 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 5; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found
Dec-18 15:00:44.573 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 6; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found
Dec-18 15:00:47.109 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 7; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found
Dec-18 15:01:12.791 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 8; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found
Dec-18 15:02:30.451 [Task monitor] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] response error - attempt: 9; reason: io.grpc.StatusRuntimeException: NOT_FOUND: Resource "projects/XXXXXXXXXXXX/locations/us-central1/jobs/nf-32d5ac52-1734532523619/taskGroups/group0/tasks/0" was not found

Can we handle the NotFoundException directly in getTaskState() rather than in the retry conditions? This would:

  1. Allow us to explicitly return a PENDING state, which more accurately reflects that the task exists but isn't yet queryable through the API
  2. Avoid unnecessary retry attempts when we know we just need to wait for task registration, acknowledging it's an expected transitional state
  3. Provide clearer logging about what's actually happening vs generic retry messages
  4. Only fail if the task still cannot be found after some reasonable timeout

Just for the sake of discussion, I've proposed this in #5648

A maxStatusDuration config option is added because:

  • It provides a clear upper bound for how long we'll treat "not found" as a pending state
  • It's configurable, allowing users to tune based on their workload patterns/known resource limitations
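
For illustration only, assuming the option were exposed under the existing retryPolicy scope (the exact name and placement are up to PR #5648, and this is not a released setting):

```groovy
// Hypothetical sketch -- option name/placement per PR #5648, not a released setting
google.batch.retryPolicy.maxStatusDuration = '10m'  // treat NOT_FOUND as PENDING for up to 10 minutes
```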

Thoughts @pditommaso?

@bolianyin

Is there a more recent example (within 10 days) with a job id or uid?
In addition, I see there is only one task in the created Google Batch job. Is there a reason to query the task state instead of the job state?

@Puumanamana
Author

Here's one from today: projects/114884310888/locations/us-central1/jobs/nf-79c480cc-1736434667029/taskGroups/group0/tasks/0
The task is successful in my Google Batch logs, but the workflow failed with NOT_FOUND for this task.

@bolianyin

Thank you!

@bolianyin

@pditommaso: while we are looking into the task issue, does it make sense for Nextflow to query the job state instead of the task state, given there is only one task in the job? I think that would be more stable.

@pditommaso pditommaso reopened this Jan 10, 2025
@pditommaso
Member

We recently moved to the task API to enable task arrays. Even if we fix the single-task case, wouldn't the same problem arise when using arrays?

@jorgee
Contributor

jorgee commented Jan 10, 2025

In PR #5567 I added checkJobStatus to solve issue #5550. In that case, the tasks were not created due to a scheduling problem, so I added checkJobStatus to check the job status when no tasks appear in the job. I think a similar problem is happening here: the task is created, but for some reason it is not accessible through the API. Maybe we could call checkJobStatus when checkTaskStatus returns the NOT_FOUND error, to check whether any errors are reported in the job.

@pditommaso
Member

That sounds reasonable. Maybe @bolianyin can shed some light on the proper API to be used to cover this case.

@bolianyin

We need time to investigate the task NOT_FOUND issue; it does not seem to be easily reproducible. Checking the job state can be a workaround before the issue is fixed. The job status contains a summary of task states, which could be used for jobs with more than one task.

@bentsherman
Member

If we can see all of the task states from the job state, I would much rather do that. I could not figure out how to do that when I refactored the code to query the task state.

@bolianyin

When you describe a job, the job status contains a "taskGroups" field, which shows the task count per task state after tasks are created. Something like the following, if a job has two SUCCEEDED tasks:

  taskGroups:
    group0:
      counts:
        SUCCEEDED: '2'

@pditommaso
Member

Then this could be used only for non-array jobs.

@bolianyin

For array jobs, do you need to know the exact state of each task or general stats like how many tasks have scheduled, finished, etc?

@pditommaso
Member

Nextflow needs to know the status at the task level.

@jorgee jorgee self-assigned this Jan 21, 2025
@jorgee jorgee linked a pull request Jan 21, 2025 that will close this issue