
add flag --no-skip-on-failure for command build, run, retry, seed #9020

Open

wants to merge 4 commits into base: main

Conversation

leo-schick
Contributor

@leo-schick leo-schick commented Nov 7, 2023

resolves #2142

Problem

When an upstream model fails, its downstream models are skipped. Sometimes, e.g. when working in big data warehouses, it is desirable to proceed with downstream models even if an upstream model failed.

Solution

Adds a new flag --no-skip-on-failure for the dbt commands build, run, retry, clone and seed. If set, the downstream models will not be added to self._skipped_children and therefore will not be skipped.
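
In rough terms, the change guards the point in core/dbt/task/runnable.py where dbt marks the dependents of a failed node as skipped. A minimal sketch of that guard (import paths, the method signature and the exact placement are approximate; the actual diff is shown in the review excerpt later in this thread):

    # Sketch only: imports and placement are approximate for dbt-core ~1.7/1.8.
    from dbt.flags import get_flags
    from dbt.graph.graph import UniqueId

    class GraphRunnableTask:  # heavily abridged
        def _mark_dependent_errors(self, node_id, result, cause):
            # With --no-skip-on-failure set, dependents of the failed node are not
            # recorded in self._skipped_children, so they are still executed.
            no_skip_on_failure = get_flags().NO_SKIP_ON_FAILURE
            if not no_skip_on_failure:
                for dep_node_id in self.graph.get_dependent_nodes(UniqueId(node_id)):
                    self._skipped_children[dep_node_id] = cause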

Note: user docs might need to be updated.

Testing

I created an initial project with two additional models:

  • must_fail - always fails
  • dependent_model - uses the failing model
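
For reference, this setup could also be expressed with dbt's functional-test utilities (a rough sketch; the class name, model SQL and status checks below are illustrative and assume the standard dbt functional-test conftest that provides the project fixture):

    import pytest
    from dbt.tests.util import run_dbt

    # Illustrative stand-ins for the two models described above.
    must_fail_sql = "select 1 as id from a_table_that_does_not_exist"
    dependent_model_sql = "select * from {{ ref('must_fail') }}"

    class TestNoSkipOnFailure:
        @pytest.fixture(scope="class")
        def models(self):
            return {
                "must_fail.sql": must_fail_sql,
                "dependent_model.sql": dependent_model_sql,
            }

        def test_downstream_runs_with_flag(self, project):
            results = run_dbt(["run", "--no-skip-on-failure"], expect_pass=False)
            statuses = {result.node.name: result.status for result in results}
            # must_fail errors out, but dependent_model should be attempted
            # (and here also error, since must_fail's table never got built)
            # rather than being skipped.
            assert statuses["must_fail"] == "error"
            assert statuses["dependent_model"] != "skipped"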

Test 1 - run without the new flag

image

The model dependent_model is still skipped.

Test 2 - run with flag --no-skip-on-failure

image

The model dependent_model is now executed, but fails since the table for must_fail does not exist. (This is expected behavior: in a real-world scenario the failing model's table might already exist from a previous execution, and dependent_model would process successfully.)

Test 3 - run retry with flag --no-skip-on-failure

image

The failing models are executed as expected.

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR (IMO not required; if wanted, please advise on how to build the tests)
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX
  • This PR includes type annotations for new and modified functions (not required)

@leo-schick leo-schick requested a review from a team as a code owner November 7, 2023 12:42
@leo-schick leo-schick requested a review from emmyoop November 7, 2023 12:42
@cla-bot cla-bot bot added the cla:yes label Nov 7, 2023

codecov bot commented Nov 7, 2023

Codecov Report

Attention: Patch coverage is 60.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 61.90%. Comparing base (e81f7fd) to head (3ec1e03).

Files                       Patch %   Lines
core/dbt/task/runnable.py   0.00%     4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #9020       +/-   ##
===========================================
- Coverage   88.13%   61.90%   -26.24%     
===========================================
  Files         178      178               
  Lines       22456    22464        +8     
===========================================
- Hits        19792    13906     -5886     
- Misses       2664     8558     +5894     
Flag          Coverage Δ
integration   ?
unit          61.90% <60.00%> (+<0.01%) ⬆️


@leo-schick leo-schick requested review from a team as code owners January 15, 2024 07:32
@leo-schick leo-schick requested review from chrismorrisette-dbt and removed request for a team January 15, 2024 07:32
@chancekbarkley

@leo-schick do you know where this PR stands today? Looks like it would be great for our use case.

@leo-schick
Contributor Author

It works

@i-brahimi

It works

@leo-schick you said it works, but you didn't specify which version of dbt.

@leo-schick
Contributor Author

@i-brahimi It works on the main development branch, which uses version tag 1.8.0a1 and is not released yet. I don't know why the dbt core team has not responded to this PR yet, since so many people were requesting this feature. Maybe someone can bump them to take a look at this and possibly merge it for the next upcoming release.

@Aarboush

Aarboush commented Mar 6, 2024

@jtcohen6 There are so many people using dbt waiting for this feature!

@dbeatty10 dbeatty10 added the community This PR is from a community member label Mar 22, 2024

@cwcartmell cwcartmell left a comment


@leo-schick, thank you so much for opening this PR! I just ran into the same issue so I'm thrilled to see that someone is already working on it.

.changes/unreleased/Features-20231107-132842.yaml (review thread resolved)
core/dbt/cli/main.py (review thread resolved)
core/dbt/cli/params.py (review thread resolved)
@dbeatty10 dbeatty10 changed the title from "add flag --no-skip-on-failture for command build, run, retry, seed" to "add flag --no-skip-on-failure for command build, run, retry, seed" on Apr 8, 2024
@leo-schick leo-schick force-pushed the no-skip-on-failture branch from 4321942 to 3ec1e03 on April 8, 2024 18:58
@leo-schick leo-schick requested a review from cwcartmell April 8, 2024 18:59
@leo-schick
Contributor Author

@cwcartmell all requested changes are done now.

@BenLiyanage

Has there been any word on getting this feature into dbt?


@cwcartmell cwcartmell left a comment


I've had some more time on my hands, so I came back to this PR post-dbt 1.8.0. I think changes that have been made recently have broken this PR. When I pull down the fork and pull upstream main into the branch, I get errors when trying to run dbt commands:

$ dbt test               
01:42:42  Running with dbt=1.9.0-a1
01:42:42  Registered adapter: postgres=1.9.0-a1
01:42:42  Unable to do partial parsing because profile has changed
01:42:43  Found 2 models, 4 data tests, 413 macros
01:42:43  
01:42:44  Concurrency: 4 threads (target='dev')
01:42:44  
01:42:44  1 of 4 START test not_null_my_first_dbt_model_id ............................... [RUN]
01:42:44  2 of 4 START test not_null_my_second_dbt_model_id .............................. [RUN]
01:42:44  3 of 4 START test unique_my_first_dbt_model_id ................................. [RUN]
01:42:44  4 of 4 START test unique_my_second_dbt_model_id ................................ [RUN]
01:42:44  2 of 4 ERROR not_null_my_second_dbt_model_id ................................... [ERROR in 0.12s]
01:42:44  1 of 4 ERROR not_null_my_first_dbt_model_id .................................... [ERROR in 0.13s]
01:42:44  4 of 4 ERROR unique_my_second_dbt_model_id ..................................... [ERROR in 0.09s]
01:42:44  3 of 4 ERROR unique_my_first_dbt_model_id ...................................... [ERROR in 0.10s]
Exception in thread Thread-7:
Traceback (most recent call last):
  File "/home/conrad/.pyenv/versions/3.8.13/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/conrad/.pyenv/versions/3.8.13/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/conrad/.pyenv/versions/3.8.13/lib/python3.8/multiprocessing/pool.py", line 592, in _handle_results
    cache[job]._set(i, obj)
  File "/home/conrad/.pyenv/versions/3.8.13/lib/python3.8/multiprocessing/pool.py", line 776, in _set
    self._callback(self._value)
  File "/home/conrad/Documents/dbt-core/core/dbt/task/runnable.py", line 281, in callback
    self._handle_result(result)
  File "/home/conrad/Documents/dbt-core/core/dbt/task/compile.py", line 132, in _handle_result
    super()._handle_result(result)
  File "/home/conrad/Documents/dbt-core/core/dbt/task/runnable.py", line 338, in _handle_result
    self._mark_dependent_errors(node.unique_id, result, cause)
  File "/home/conrad/Documents/dbt-core/core/dbt/task/runnable.py", line 421, in _mark_dependent_errors
    no_skip_on_failure = get_flags().NO_SKIP_ON_FAILURE
  File "/home/conrad/Documents/dbt-core/core/dbt/cli/flags.py", line 360, in __getattr__
    return super().__getattribute__(name)  # type: ignore
AttributeError: 'Flags' object has no attribute 'NO_SKIP_ON_FAILURE'
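
One way the lookup could be made tolerant of commands that do not define the flag (a sketch only, reusing the names from the traceback above; not necessarily how the PR should resolve it) is to fall back to the current skipping behaviour when the attribute is absent:

    # Sketch: default to today's behaviour for commands without --no-skip-on-failure.
    no_skip_on_failure = getattr(get_flags(), "NO_SKIP_ON_FAILURE", False)
    if not no_skip_on_failure:
        for dep_node_id in self.graph.get_dependent_nodes(UniqueId(node_id)):
            self._skipped_children[dep_node_id] = cause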

@BenLiyanage

Seems pretty unfortunate. This is a pretty critical feature for preventing one failure from causing everything downstream to fail.

@leo-schick
Contributor Author

@cwcartmell Thanks for checking and pointing this out.

I am actually quite disappointed with the dbt dev team here. This pull request has now been open for over 6 months (the original ticket is from Feb. 2020), and as far as I can see there has not been any response from the dbt team on whether this feature is wanted or not, or whether it should be implemented in another way. It's not such a difficult code change, and the community responses tell the story that many people would like a way to keep processing downstream models even when an upstream model fails.
I see the point that this feature might break some constraints, and therefore there could be a valid reason not to merge it.

But the lack of response from the dbt team is just disappointing to me right now, and I don't know whether it is worth fixing this now or first spending the time to get someone from the dbt core team to respond on whether such a feature is wanted or not (e.g. by design decision).

Contributor

@jtcohen6 jtcohen6 left a comment


@leo-schick I'm sorry for the delayed response here. This issue had been open for a long time; your PR has helped it gain renewed traction.

The last time I thought about this in detail was ~2 years ago, and we've been trying to define a clearer, narrower, more maintainable scope for dbt-core since then. I've become warier of feature requests that feel like they go against some of dbt's core principles, one of which is the integrity of the DAG. Whether to skip downstream nodes on upstream errors is less of a dogmatic concern than allowing node-level cycles, or running the DAG out of order, but it's one of the simplifying assumptions that has gotten us a lot over time.

That being said:

  • It's already possible to achieve the end state described here by retrying a failed job while excluding the failing models from the previous run (retry from the point of failure, skip the point of failure, keep building downstream), but it requires quite a bit of custom work(around).
  • This is analogous to marking a test as “warn”: notify me that it detected failures, but don’t stop the DAG for it.

I reread #2142 to remind myself of the use cases we're solving for, and the majority of the examples provided deal with issues in upstream data ingestion that can cause flaky model failures. People would rather have stale data coming from one place than stale transformations across the board. There is a judgment call here, model by model. To that end, I am more convinced by the value of this as a model-level config, defined on the parent model, than as a runtime-level config set for all models at once.

Previously I said:

If we decide to support both, I'd see this being similar to how the --full-refresh flag / full_refresh model config work today (docs), where the model-level config (if set) takes precedence over the run-level flag.

That doesn't feel right to me. It's good insofar as it's consistent with full-refresh and store-failures, but I don't think we nailed the interplay of the full-refresh flag and the model-level config.

I could see this as behavior that wants to differ across environments: dbt should always skip (strict) in dev/CI, but allow specific downstream models to build in deployment environments. It would be possible to accomplish that with conditional Jinja configs, but it would probably be better to have some rule that combines both a model configuration and a flag setting.

So here are the options I see, in order of preference (1 then 2 then 3 then 4):

  1. By default, any model in any environment skips children on failures. A specific flaky model can opt into on_error: continue for a specific environment.
  2. By default, any model in any environment skips children on failure. A specific model can opt into on_error: continue for all environments (dev, prod, CI, etc).
  3. By default, any model in any environment skips children on failures. In some invocations/environments, you can optionally switch this behavior for all models to on_error: continue.
  4. By default, any model in any environment skips children on failures. For some models, you can set them to always skip or always continue. In some invocations/environments, can also optionally switch the behavior for all other (unset) models.

This PR implements (3), with the option to add a node-level config later on (4). I don't think we could go from 3 -> 2 without a breaking change, since the continue-on-skip behavior becomes "AND" instead of "OR."
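
As a purely illustrative sketch of the precedence rule in option (4) (the on_error values and the helper below are hypothetical, not an agreed design):

    def should_skip_children(node_on_error, flag_skip_on_failure=True):
        # Hypothetical option-(4) precedence: an explicit node-level on_error
        # setting wins; otherwise the invocation-level flag decides for all
        # other (unset) models.
        if node_on_error == "continue":
            return False
        if node_on_error == "skip":
            return True
        return flag_skip_on_failure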

Where I'm landing is: I'm open to adding this as a node-level config (the exception, not the rule), but I don't think we should start with the global runtime flag. If we decide a flag is appropriate to support conditional per-environment behavior ("strict" in dev/CI, "forgiving" in Prod), we can do that later on.

Next steps:

  • Do you (and the folks who have been following this thread) agree with my line of reasoning above?
  • Would you (or someone else) be interested in carrying that work forward? Again, apologies for the delay on our end, and I understand if it means you're no longer able or interested.

core/dbt/task/runnable.py:

     for dep_node_id in self.graph.get_dependent_nodes(UniqueId(node_id)):
-        self._skipped_children[dep_node_id] = cause
+        no_skip_on_failure = get_flags().NO_SKIP_ON_FAILURE
+        if not no_skip_on_failure:
Contributor


If I'm thinking about this right, it would also have the effect for dbt build of not skipping downstream models if an upstream test fails. That doesn't feel desirable as a global effect.

@mchurley

For my team, our need is specifically for a global runtime flag. The majority of our pipeline is constructed in a way where we still want to run the entire pipeline regardless of the state of parent models, and leave the decision-making regarding the validity of that data to a later point. As an example, there might be one column in a later model that is impacted by a parent having incorrect data or not running due to an error, but that does not mean we would not want to refresh the table as a whole anyway. It is more beneficial to report that inconsistency in a different manner rather than simply not have that child model run.

As a direct consequence of not having this, we are making the decision to maintain specific DAG tasks per model/test, which is incredibly inefficient and butchers a lot of the great features dbt provides.

@trevorwallis13

@jtcohen6 Thanks for reviving this issue! I wanted to provide one use case where it would be helpful to be able to set this on the child model rather than the parent:

We maintain a bunch of user attributes in individual models so we can more easily control each one. We then have an incremental model that unions the new records from all the attribute models so we can sync to an external source. If one of the attribute models fails, we still want to run the final model and insert the records from all the models that succeeded.

In this case, I think it would make the most sense to set the skip-on-failure rule at the child level instead of the parent. Sure, I could set the flag at the folder level for all the attribute models in dbt_project.yml, but it feels like the rule has more to do with how I want that specific child to behave, not necessarily how I expect every potential model downstream of the parent to behave.

I think a good parallel to how I envision this would be Airflow's trigger_rule argument. In my particular use case, I would want to use an all_done rule if things were equivalent.
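
For readers unfamiliar with that Airflow concept, a minimal, purely illustrative sketch of trigger_rule="all_done" (the DAG and task names are made up, and this is Airflow, not a proposed dbt API):

    from airflow.decorators import dag, task
    from airflow.utils.trigger_rule import TriggerRule
    import pendulum

    @dag(start_date=pendulum.datetime(2024, 1, 1), schedule=None, catchup=False)
    def attribute_sync():
        @task
        def attribute_model_a():
            return "ok"

        @task
        def attribute_model_b():
            raise RuntimeError("flaky upstream source")

        # all_done: run once every upstream task has finished, whether it
        # succeeded or failed -- the behavior requested for the child model.
        @task(trigger_rule=TriggerRule.ALL_DONE)
        def union_and_sync():
            return "synced whatever succeeded"

        [attribute_model_a(), attribute_model_b()] >> union_and_sync()

    attribute_sync()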

Labels
cla:yes, community (This PR is from a community member)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Option to disable "skip downstream models on failure"
10 participants