
add flag --no-skip-on-failure for command build, run, retry, seed #9020

Open

wants to merge 4 commits into base: main

Conversation

leo-schick
Contributor

@leo-schick leo-schick commented Nov 7, 2023

resolves #2142

Problem

When an upstream model fails, its downstream models are skipped. Sometimes, e.g. when working in big data warehouses, it is desirable to proceed with downstream models even if an upstream model failed.

Solution

Adds a new flag --no-skip-on-failure for the dbt commands build, run, retry, clone and seed. If set, the downstream models will not be added to self._skipped_children and therefore will not be skipped.
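
In rough terms, the change guards the point in core/dbt/task/runnable.py where dbt marks the dependents of a failed node as skipped. A minimal sketch of that guard (import paths, the method signature and the exact placement are approximate; the actual diff is shown in the review excerpt later in this thread):

    # Sketch only: imports and placement are approximate for dbt-core ~1.7/1.8.
    from dbt.flags import get_flags
    from dbt.graph.graph import UniqueId

    class GraphRunnableTask:  # heavily abridged
        def _mark_dependent_errors(self, node_id, result, cause):
            # With --no-skip-on-failure set, dependents of the failed node are not
            # recorded in self._skipped_children, so they are still executed.
            no_skip_on_failure = get_flags().NO_SKIP_ON_FAILURE
            if not no_skip_on_failure:
                for dep_node_id in self.graph.get_dependent_nodes(UniqueId(node_id)):
                    self._skipped_children[dep_node_id] = cause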

Note: user docs might need to be updated.

Testing

I created an initial project with two additional models:

  • must_fail - always fails
  • dependent_model - uses the failing model
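
For reference, this setup could also be expressed with dbt's functional-test utilities (a rough sketch; the class name, model SQL and status checks below are illustrative and assume the standard dbt functional-test conftest that provides the project fixture):

    import pytest
    from dbt.tests.util import run_dbt

    # Illustrative stand-ins for the two models described above.
    must_fail_sql = "select 1 as id from a_table_that_does_not_exist"
    dependent_model_sql = "select * from {{ ref('must_fail') }}"

    class TestNoSkipOnFailure:
        @pytest.fixture(scope="class")
        def models(self):
            return {
                "must_fail.sql": must_fail_sql,
                "dependent_model.sql": dependent_model_sql,
            }

        def test_downstream_runs_with_flag(self, project):
            results = run_dbt(["run", "--no-skip-on-failure"], expect_pass=False)
            statuses = {result.node.name: result.status for result in results}
            # must_fail errors out, but dependent_model should be attempted
            # (and here also error, since must_fail's table never got built)
            # rather than being skipped.
            assert statuses["must_fail"] == "error"
            assert statuses["dependent_model"] != "skipped"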

Test 1 - run without the new flag

image

The model dependent_model is still skipped.

Test 2 - run with flag --no-skip-on-failure

image

The model dependent_model is now executed, but fails since the table for must_fail does not exist. (This is expected behavior: in a real-world scenario the failing model's table might already exist from a previous execution, and dependent_model would process successfully.)

Test 3 - run retry with flag --no-skip-on-failure

image

The failing models are executed as expected.

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR (IMO not required; if wanted, please advise on how to build the tests)
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX
  • This PR includes type annotations for new and modified functions (not required)

@leo-schick leo-schick requested a review from a team as a code owner November 7, 2023 12:42
@leo-schick leo-schick requested a review from emmyoop November 7, 2023 12:42
@cla-bot cla-bot bot added the cla:yes label Nov 7, 2023

codecov bot commented Nov 7, 2023

Codecov Report

Attention: Patch coverage is 60.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 61.90%. Comparing base (e81f7fd) to head (3ec1e03).

Files                       Patch %   Lines
core/dbt/task/runnable.py   0.00%     4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #9020       +/-   ##
===========================================
- Coverage   88.13%   61.90%   -26.24%     
===========================================
  Files         178      178               
  Lines       22456    22464        +8     
===========================================
- Hits        19792    13906     -5886     
- Misses       2664     8558     +5894     
Flag          Coverage Δ
integration   ?
unit          61.90% <60.00%> (+<0.01%) ⬆️


@leo-schick leo-schick requested review from a team as code owners January 15, 2024 07:32
@leo-schick leo-schick requested review from chrismorrisette-dbt and removed request for a team January 15, 2024 07:32
@chancekbarkley

@leo-schick do you know where this PR stands today? Looks like it would be great for our use case.

@leo-schick
Contributor Author

It works

@i-brahimi

It works

@leo-schick you said it works, but you didn't specify which version of dbt.

@leo-schick
Contributor Author

@i-brahimi It works on the main development branch, which uses version tag 1.8.0a1 and is not released yet. I don't know why the dbt core team has not responded to this PR yet, since so many people were requesting this feature. Maybe someone can bump them to take a look at this and possibly merge it for the next upcoming release.

@Aarboush

Aarboush commented Mar 6, 2024

@jtcohen6 There are so many people using dbt waiting for this feature!

@dbeatty10 dbeatty10 added the community This PR is from a community member label Mar 22, 2024

@cwcartmell cwcartmell left a comment


@leo-schick, thank you so much for opening this PR! I just ran into the same issue so I'm thrilled to see that someone is already working on it.

.changes/unreleased/Features-20231107-132842.yaml (review thread resolved)
core/dbt/cli/main.py (review thread resolved)
core/dbt/cli/params.py (review thread resolved)
@dbeatty10 dbeatty10 changed the title from "add flag --no-skip-on-failture for command build, run, retry, seed" to "add flag --no-skip-on-failure for command build, run, retry, seed" on Apr 8, 2024
@leo-schick leo-schick force-pushed the no-skip-on-failture branch from 4321942 to 3ec1e03 on April 8, 2024 18:58
@leo-schick leo-schick requested a review from cwcartmell April 8, 2024 18:59
@leo-schick
Contributor Author

@cwcartmell all requested changes are done now.

@BenLiyanage

Has there been any word on getting this feature into dbt?


@cwcartmell cwcartmell left a comment


I've had some more time on my hands, so I came back to this PR post-dbt 1.8.0. I think changes that have been made recently have broken this PR. When I pull down the fork and pull upstream main into the branch, I get errors when trying to run dbt commands:

$ dbt test               
01:42:42  Running with dbt=1.9.0-a1
01:42:42  Registered adapter: postgres=1.9.0-a1
01:42:42  Unable to do partial parsing because profile has changed
01:42:43  Found 2 models, 4 data tests, 413 macros
01:42:43  
01:42:44  Concurrency: 4 threads (target='dev')
01:42:44  
01:42:44  1 of 4 START test not_null_my_first_dbt_model_id ............................... [RUN]
01:42:44  2 of 4 START test not_null_my_second_dbt_model_id .............................. [RUN]
01:42:44  3 of 4 START test unique_my_first_dbt_model_id ................................. [RUN]
01:42:44  4 of 4 START test unique_my_second_dbt_model_id ................................ [RUN]
01:42:44  2 of 4 ERROR not_null_my_second_dbt_model_id ................................... [ERROR in 0.12s]
01:42:44  1 of 4 ERROR not_null_my_first_dbt_model_id .................................... [ERROR in 0.13s]
01:42:44  4 of 4 ERROR unique_my_second_dbt_model_id ..................................... [ERROR in 0.09s]
01:42:44  3 of 4 ERROR unique_my_first_dbt_model_id ...................................... [ERROR in 0.10s]
Exception in thread Thread-7:
Traceback (most recent call last):
  File "/home/conrad/.pyenv/versions/3.8.13/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/conrad/.pyenv/versions/3.8.13/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/conrad/.pyenv/versions/3.8.13/lib/python3.8/multiprocessing/pool.py", line 592, in _handle_results
    cache[job]._set(i, obj)
  File "/home/conrad/.pyenv/versions/3.8.13/lib/python3.8/multiprocessing/pool.py", line 776, in _set
    self._callback(self._value)
  File "/home/conrad/Documents/dbt-core/core/dbt/task/runnable.py", line 281, in callback
    self._handle_result(result)
  File "/home/conrad/Documents/dbt-core/core/dbt/task/compile.py", line 132, in _handle_result
    super()._handle_result(result)
  File "/home/conrad/Documents/dbt-core/core/dbt/task/runnable.py", line 338, in _handle_result
    self._mark_dependent_errors(node.unique_id, result, cause)
  File "/home/conrad/Documents/dbt-core/core/dbt/task/runnable.py", line 421, in _mark_dependent_errors
    no_skip_on_failure = get_flags().NO_SKIP_ON_FAILURE
  File "/home/conrad/Documents/dbt-core/core/dbt/cli/flags.py", line 360, in __getattr__
    return super().__getattribute__(name)  # type: ignore
AttributeError: 'Flags' object has no attribute 'NO_SKIP_ON_FAILURE'
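
One way the lookup could be made tolerant of commands that do not define the flag (a sketch only, reusing the names from the traceback above; not necessarily how the PR should resolve it) is to fall back to the current skipping behaviour when the attribute is absent:

    # Sketch: default to today's behaviour for commands without --no-skip-on-failure.
    no_skip_on_failure = getattr(get_flags(), "NO_SKIP_ON_FAILURE", False)
    if not no_skip_on_failure:
        for dep_node_id in self.graph.get_dependent_nodes(UniqueId(node_id)):
            self._skipped_children[dep_node_id] = cause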

@BenLiyanage

Seems pretty unfortunate. This is a pretty critical feature for preventing one failure from causing everything downstream to fail.

@leo-schick
Contributor Author

@cwcartmell Thanks for checking and pointing this out.

I am actually quite disappointed with the dbt dev team here. This pull request has now been open for over 6 months (the original ticket is from Feb. 2020), and as far as I can see there has not been any response from the dbt team on whether this feature is wanted or not, or whether it should be implemented in another way. It's not such a difficult code change, and the community responses tell the story that many people would like a way to keep processing downstream models even when an upstream model fails.
I see the point that this feature might break some constraints, and therefore there could be a valid reason not to merge it.

But the lack of response from the dbt team is just disappointing to me right now, and I don't know whether it is worth fixing this now or first spending the time to get someone from the dbt core team to respond on whether such a feature is wanted or not (e.g. by design decision).

Contributor

@jtcohen6 jtcohen6 left a comment


@leo-schick I'm sorry for the delayed response here. This issue had been open for a long time; your PR has helped it gain renewed traction.

The last time I thought about this in detail was ~2 years ago, and we've been trying to define a clearer, narrower, more maintainable scope for dbt-core since then. I've become warier of feature requests that feel like they go against some of dbt's core principles, one of which is the integrity of the DAG. Whether to skip downstream nodes on upstream errors is less of a dogmatic concern than allowing node-level cycles, or running the DAG out of order, but it's one of the simplifying assumptions that has gotten us a lot over time.

That being said:

  • It's already possible to achieve the end state described here by retrying a failed job while excluding the failing models from the previous run (retry from the point of failure, skip the point of failure, keep building downstream), but it requires quite a bit of custom work(around).
  • This is analogous to marking a test as “warn”: notify me that it detected failures, but don’t stop the DAG for it.

I reread #2142 to remind myself of the use cases we're solving for, and the majority of the examples provided deal with issues in upstream data ingestion that can cause flaky model failures. People would rather have stale data coming from one place than stale transformations across the board. There is a judgment call here, model by model. To that end, I am more convinced by the value of this as a model-level config, defined on the parent model, than as a runtime-level config set for all models at once.

Previously I said:

If we decide to support both, I'd see this being similar to how the --full-refresh flag / full_refresh model config work today (docs), where the model-level config (if set) takes precedence over the run-level flag.

That doesn't feel right to me. It's good insofar as it's consistent with full-refresh and store-failures, but I don't think we nailed the interplay of the full-refresh flag and the model-level config.

I could see this as behavior that wants to differ across environments: dbt should always skip (strict) in dev/CI, but allow specific downstream models to build in deployment environments. It would be possible to accomplish that with conditional Jinja configs, but it would probably be better to have some rule that combines both a model configuration and a flag setting.

So here are the options I see, in order of preference (1 then 2 then 3 then 4):

  1. By default, any model in any environment skips children on failures. A specific flaky model can opt into on_error: continue for a specific environment.
  2. By default, any model in any environment skips children on failure. A specific model can opt into on_error: continue for all environments (dev, prod, CI, etc).
  3. By default, any model in any environment skips children on failures. In some invocations/environments, you can optionally switch this behavior for all models to on_error: continue.
  4. By default, any model in any environment skips children on failures. For some models, you can set them to always skip or always continue. In some invocations/environments, can also optionally switch the behavior for all other (unset) models.

This PR implements (3), with the option to add a node-level config later on (4). I don't think we could go from 3 -> 2 without a breaking change, since the continue-on-skip behavior becomes "AND" instead of "OR."
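
As a purely illustrative sketch of the precedence rule in option (4) (the on_error values and the helper below are hypothetical, not an agreed design):

    def should_skip_children(node_on_error, flag_skip_on_failure=True):
        # Hypothetical option-(4) precedence: an explicit node-level on_error
        # setting wins; otherwise the invocation-level flag decides for all
        # other (unset) models.
        if node_on_error == "continue":
            return False
        if node_on_error == "skip":
            return True
        return flag_skip_on_failure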

Where I'm landing is: I'm open to adding this as a node-level config (the exception, not the rule), but I don't think we should start with the global runtime flag. If we decide a flag is appropriate to support conditional per-environment behavior ("strict" in dev/CI, "forgiving" in Prod), we can do that later on.

Next steps:

  • Do you (and the folks who have been following this thread) agree with my line of reasoning above?
  • Would you (or someone else) be interested in carrying that work forward? Again, apologies for the delay on our end, and I understand if it means you're no longer able or interested.

core/dbt/task/runnable.py:

     for dep_node_id in self.graph.get_dependent_nodes(UniqueId(node_id)):
-        self._skipped_children[dep_node_id] = cause
+        no_skip_on_failure = get_flags().NO_SKIP_ON_FAILURE
+        if not no_skip_on_failure:
Contributor


If I'm thinking about this right, it would also have the effect for dbt build of not skipping downstream models if an upstream test fails. That doesn't feel desirable as a global effect.

@mchurley

For my team, our need is specifically for a global runtime flag. The majority of our pipeline is constructed in a way where we still want to run the entire pipeline regardless of the state of parent models, and leave the decision-making regarding the validity of that data to a later point. As an example, there might be one column in a later model that is impacted by a parent having incorrect data or not running due to an error, but that does not mean we would not want to refresh the table as a whole anyway. It is more beneficial to report that inconsistency in a different manner rather than simply not have that child model run.

As a direct consequence of not having this, we are making the decision to maintain specific DAG tasks per model/test, which is incredibly inefficient and butchers a lot of the great features dbt provides.

@trevorwallis13

@jtcohen6 Thanks for reviving this issue! I wanted to provide one use case where it would be helpful to be able to set this on the child model rather than the parent:

We maintain a bunch of user attributes in individual models so we can more easily control each one. We then have an incremental model that unions the new records from all the attribute models so we can sync to an external source. If one of the attribute models fails, we still want to run the final model and insert the records from all the models that succeeded.

In this case, I think it would make the most sense to set the skip-on-failure rule at the child level instead of the parent. Sure, I could set the flag at the folder level for all the attribute models in dbt_project.yml, but it feels like the rule has more to do with how I want that specific child to behave, not necessarily how I expect every potential model downstream of the parent to behave.

I think a good parallel to how I envision this would be Airflow's trigger_rule argument. In my particular use case, I would want to use an all_done rule if things were equivalent.
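
For readers unfamiliar with that Airflow concept, a minimal, purely illustrative sketch of trigger_rule="all_done" (the DAG and task names are made up, and this is Airflow, not a proposed dbt API):

    from airflow.decorators import dag, task
    from airflow.utils.trigger_rule import TriggerRule
    import pendulum

    @dag(start_date=pendulum.datetime(2024, 1, 1), schedule=None, catchup=False)
    def attribute_sync():
        @task
        def attribute_model_a():
            return "ok"

        @task
        def attribute_model_b():
            raise RuntimeError("flaky upstream source")

        # all_done: run once every upstream task has finished, whether it
        # succeeded or failed -- the behavior requested for the child model.
        @task(trigger_rule=TriggerRule.ALL_DONE)
        def union_and_sync():
            return "synced whatever succeeded"

        [attribute_model_a(), attribute_model_b()] >> union_and_sync()

    attribute_sync()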

Labels
cla:yes, community (This PR is from a community member)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Option to disable "skip downstream models on failure"
10 participants