-
Notifications
You must be signed in to change notification settings - Fork 410
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: Flow jobs child failure handling is broken #2229
Comments
hi @CodingMeSwiftly, for the first point, worker failed events is emitted only for the current job that is moved to failed, so if the parent is also moved to failed, only the child as the current job will be send in the worker failed event, it's more useful a queueEvents instance in this case. For the 3rd point we have a test case where the event is invoked https://github.com/taskforcesh/bullmq/blob/master/tests/test_flow.ts#L1616-L1624, you can take a look |
btw, the prs and issues your are linking are related to events that are used by queueEvents instance, workers don't read our event streams, workers only emit events for the current job that is being processed |
Thank you for clearing things up about the difference between worker events and queue events. Do you prefer to keep this issue open to track the |
I am experiencing this too, with BullMQ v4.12.3 |
This 'issue' might be more tricky to solve than initially thought. Consider this scenario:
According to https://docs.bullmq.io/guide/flows#jobs-removal:
|
I must say upfront that I just started using BullMQ, but my experience is that, when In consideration of the scenario described by @CodingMeSwiftly, I would therefore dare to say that the expected behavior is the sequence as follows:
The key would be to defer the failure and removal to the end of the process. My understanding is that the former is already implemented this way, so the hope is that the expert maintainers can tap into the same mechanism to implement the latter. |
hi @matpen, is |
@roggervalf in my actual configuration, yes: all children have both Please find my actual config within, if it can help
await flowProducer.add({
name: 'batch0',
queueName: queue.name,
data: { step: 0 },
opts: {
jobId: `${job.data.type}_${job.data.procedure}_batch0`,
failParentOnFailure: true,
removeOnComplete: true,
removeOnFail: true,
},
children: childJobs.map(procedure => ({
name: procedure,
queueName: queue.name,
data: { procedure },
opts: {
jobId: `${job.data.type}_${procedure}`,
attempts: 2,
backoff: {
type: 'fixed',
delay: 300_000,
},
failParentOnFailure: true,
removeOnComplete: true,
removeOnFail: true,
},
})),
}); For the general case I described in #2229 (comment), my guess is that it would make sense to fail the parent if just one of the children has |
pls try to modify one of our test cases to see if you can replicate your case |
Thank you, @roggervalf for trying to reproduce the case I described. By looking at the code you linked, I see now that there is a misunderstanding:
IIUC this means that, differently from what I described in my comment, the current sequence of events is
If I got this right, the next question would be whether this approach is by design, or simply because it is (understandably) the most straightforward solution. And if so, whether it is possible to complicate the system and defer the removal of parents (as per the sequence in my previous comment). Otherwise, it seems to me that |
Same problem here. |
This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [bullmq](https://bullmq.io/) ([source](https://github.com/taskforcesh/bullmq)) | dependencies | minor | [`5.29.1` -> `5.31.1`](https://renovatebot.com/diffs/npm/bullmq/5.29.1/5.31.1) | --- ### Release Notes <details> <summary>taskforcesh/bullmq (bullmq)</summary> ### [`v5.31.1`](https://github.com/taskforcesh/bullmq/releases/tag/v5.31.1) [Compare Source](taskforcesh/bullmq@v5.31.0...v5.31.1) ##### Bug Fixes - **scheduler-template:** remove console.log when getting template information ([#​2950](taskforcesh/bullmq#2950)) ([3402bfe](taskforcesh/bullmq@3402bfe)) ### [`v5.31.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.31.0) [Compare Source](taskforcesh/bullmq@v5.30.1...v5.31.0) ##### Features - **queue:** enhance getJobScheduler method to include template information ([#​2929](taskforcesh/bullmq#2929)) ref [#​2875](taskforcesh/bullmq#2875) ([cb99080](taskforcesh/bullmq@cb99080)) ### [`v5.30.1`](https://github.com/taskforcesh/bullmq/releases/tag/v5.30.1) [Compare Source](taskforcesh/bullmq@v5.30.0...v5.30.1) ##### Bug Fixes - **flow:** allow using removeOnFail and failParentOnFailure in parents ([#​2947](taskforcesh/bullmq#2947)) fixes [#​2229](taskforcesh/bullmq#2229) ([85f6f6f](taskforcesh/bullmq@85f6f6f)) ### [`v5.30.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.30.0) [Compare Source](taskforcesh/bullmq@v5.29.1...v5.30.0) ##### Bug Fixes - **job-scheduler:** upsert template when same pattern options are provided ([#​2943](taskforcesh/bullmq#2943)) ref [#​2940](taskforcesh/bullmq#2940) ([b56c3b4](taskforcesh/bullmq@b56c3b4)) ##### Features - **queue:** add getDelayedCount method \[python] ([#​2934](taskforcesh/bullmq#2934)) ([71ce75c](taskforcesh/bullmq@71ce75c)) - **queue:** add getJobSchedulersCount method ([#​2945](taskforcesh/bullmq#2945)) ([38820dc](taskforcesh/bullmq@38820dc)) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOC4xNDIuNyIsInVwZGF0ZWRJblZlciI6IjM4LjE0Mi43IiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJkZXBlbmRlbmNpZXMiXX0=--> Reviewed-on: https://git.tristess.app/alexandresoro/ouca/pulls/361 Reviewed-by: Alexandre Soro <[email protected]> Co-authored-by: renovate <[email protected]> Co-committed-by: renovate <[email protected]>
Version
v4.6.3
Platform
NodeJS
What happened?
This is a follow up for #1469.
worker.on('failed', () => { ... }
is invoked for the child job.failParentOnFailure
is set on the child.event: failed, jobId: xxx, failedReason: child xyz failed, prev: waiting-children
, so the event is created, but the callback is never invoked.According to #1469, this was fixed by #1481. However, it seems a regression was introduced with #1953.
Additionally,
failParentOnFailure
does not work withremoveOnFail
.Expected behaviour: If
failParentOnFailure
is set for a child and the parent hasremoveOnFail
, the parent should be removed from the queue.How to reproduce.
No response
Relevant log output
No response
Code of Conduct
The text was updated successfully, but these errors were encountered: