Option to disable "skip downstream models on failure" #2142
Thanks @ian-whitestone - the example use case you described is a really good one! I think we can make this a model config. I'm picturing a config key whose values would map onto the following behavior:
You buy this? I don't know that we want/need to include the
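The exact config key and values proposed above were lost from this copy of the thread. As a purely hypothetical sketch of a parent-level version of the idea (the `on_failure` key, its values, and the model/source names below are placeholders, not an existing dbt config):

```sql
-- models/dim_country.sql  (hypothetical model, source, and config names)
-- "on_failure" is a placeholder: the parent model declares what should happen
-- to its downstream dependents if it fails.
{{ config(
    materialized = 'table',
    on_failure = 'continue'   -- vs. 'skip_downstream' (today's behavior)
) }}

select
    country_code,
    country_name
from {{ source('reference_data', 'countries') }}
```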
👍 this makes sense to me. The only alternative I can think of is having this specified at the downstream (dependent) model level, i.e. the same values but with different meanings:
The advantage of this approach would be that it gives individual analysts/modellers the ability to specify how they want to handle errors for their specific model, without dictating the response for everyone else.
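A hypothetical sketch of that child-level alternative, with the same caveat that the `on_upstream_failure` key, its values, and the model names are placeholders rather than real dbt configs:

```sql
-- models/daily_order_summary.sql  (hypothetical model and config names)
-- "on_upstream_failure" is a placeholder: the child model declares how it
-- wants to react if one of its parents fails.
{{ config(
    materialized = 'table',
    on_upstream_failure = 'run'   -- vs. 'skip' (today's behavior)
) }}

select
    orders.*,
    country.country_name
from {{ ref('fct_orders') }} as orders
left join {{ ref('dim_country') }} as country
    on orders.country_code = country.country_code
```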
What is the difference between:
Does
@bashyroger I'm not picturing
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
@drewbanin I've been following this, since the use case that @ian-whitestone described would be useful to me as well. Curious if there has been any more work in this area. Thanks!
I would love this feature :-)
This would be very helpful
+1 would be very helpful
Reopening this issue given the renewed interest! Some of this is now possible with metadata, namely the
I'd say I'm open to this possibility of conditionally disabling "skip downstream models on failure," even though it's a core construct.

Where to configure?

I think it feels appropriate to support this config on either the parent node ("I'm worth stopping the DAG for") or the child node ("I don't care about upstream failures"). By picking one OR the other, it's much simpler to reason about, and much simpler to implement. I'm hesitant to support defining this configuration for the full matrix of parent-child relationships, as I think that can get unwieldy very quickly. I think it might be possible to deploy ephemeral "middle models" that would be "passthroughs" for the upstream model's data, while allowing you to customize this behavior (see the sketch after this comment). Similar to the discussion about

How to configure?

There are two ways we could support this:
If we decide to support both, I'd see this being similar to how the

How to implement?

For parents, that would look like adding a check for this config here:
dbt-core/core/dbt/task/runnable.py, lines 337 to 342 (at commit 72c17c4)
For children, that would look like checking whether the "dependent node" has this config set, here:
dbt-core/core/dbt/task/runnable.py, lines 399 to 400 (at commit 72c17c4)
Next steps

I think there are still some open questions here, namely whether parents or children are the right place to put this config. It'd be helpful to get more context on specific use cases, to motivate the choice! Personally, I lean toward parents. This feels analogous to the "warning" behavior of tests, which say: I found some failures, which is good information to know, but it's not worth stopping the DAG for.
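A minimal sketch of the ephemeral "middle model" idea mentioned in the comment above, using hypothetical model names and the same placeholder failure-handling config (none of this exists in dbt today):

```sql
-- models/passthrough_dim_country.sql  (hypothetical)
-- An ephemeral "middle model" that only exists to sit between dim_country and
-- a subset of its children; children that ref() this passthrough instead of
-- dim_country directly would get the customized failure behavior.
{{ config(
    materialized = 'ephemeral',
    on_failure = 'continue'   -- placeholder, not a real dbt config
) }}

select * from {{ ref('dim_country') }}
```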
Our context: In our
As a temporary solution we are no longer enforcing naming conventions or typecasting columns from Salesforce tables. Long-term we would like to trigger a warning instead of a failure when issues are detected in dimension tables.
@nick-heron-zip Thanks for sharing the use case! My instinct would be to include only the columns in your staging model that you are actually using downstream, those which are actually relevant to your "analytical universe." I realize it doesn't say that in the docs or discourse explicitly, but that's always been, for me, a significant consideration when writing a staging model. I also know that I've occasionally used staging model definitions as a form of documentation / quick look-up to see what's available in the source—but that's why you can document and describe columns in sources, and
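For illustration, a staging model trimmed to the downstream-relevant columns might look like this (the source, table, and column names are made up for the example):

```sql
-- models/staging/stg_salesforce__accounts.sql  (illustrative names)
-- Keep only the columns that downstream models actually use, rather than
-- restating every column available in the source.
select
    id                          as account_id,
    name                        as account_name,
    cast(created_date as date)  as created_date,
    is_deleted
from {{ source('salesforce', 'account') }}
```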
Our context: Our source is JSON, and with Snowpipe it is loaded into a table as a single column. Another scenario is when creating a merge statement in Snowflake: it requires that the rows are unique, otherwise an error is raised. In both cases dataset X returns an error and the dbt Cloud job stops. Datasets Z, Y, and Q work perfectly, but because X returns an error all models are blocked, even if there are reports that depend only on the Z and Y datasets. Hence we are wondering how to avoid the all-or-nothing scenario we have today.
Our use case: I can also see myself using this setting as a quick fix to get the load running again after a failure, just until I can find the root cause and fix it properly. Excluding the model in the dbt run command means messing with the devops/build stuff, risking the entire load. Setting a config on one model is low impact, and I can do it in my normal workflow of editing models.
We're getting pretty desperate for this. There are all kinds of models we have where we pull in data from a variety of data providers, and one of those providers having out-of-date data (due to e.g. a schema breakage) doesn't invalidate the rest of the providers' data in the downstream models. I'm about to have to break up our DAG purely to avoid this breakage, which nullifies a significant part of dbt's value: DAG management.
@boje @jens-koster @davidsr2r I've marked this issue. Do you have thoughts on the questions I raised in my comment above, around how you'd want to configure this? Would you look for an all-or-nothing runtime config, or a per-model config?
We'd be needing a per-model config, so that the rerun of skipped models is intentional and explicit, and avoids downstream runs that shouldn't happen.
Re: parents vs. children, I'd say it probably depends on your use-case preferences. If it's on parents, you need to configure it in fewer places and it is information specific to that model, which makes a lot of sense; however, if it's on children, the query that needs to be aware of the possibility of failure is in the same file that contains the config option, making the dependency and the failure case more explicit for the developer. I always prefer maintainability, so while I could be persuaded otherwise, I'd want us to use the latter option.
Why not support both, or just one or the other?
@jtcohen6 In our situation it would make sense to have the configuration on the parent model, since a large number of models reference the failing model. We would appreciate a feature like the one described earlier in this thread.
Btw. our workaround for now is to just temporarily exclude the failing model from our runs.
My only worry about putting the config in the child would be this use case:
So C would only have data from B from the last time that A passed. I think this could lead analysts (or, more likely, the business side) to wrongly trust that the data in C is up to date (even though it will be stale for the B portion), because of some
Our use case:
We have a large project building our entire data warehouse - there are very few points of failure that would make us want to terminate our entire downstream build. We would rather have downstream continuation by default and define our points of failure with tests. I'd be happy with a --no-skip-on-failures option.
Would still be very interested in this. It would be a game changer for our flow. At the moment we have to do a lot of convoluted workarounds.
I've left a comment on the linked PR: After rereading the use cases described in the comments above (thank you everyone!), I'm interested in moving forward with this as a model-level config set on "parent" models, whose failures are generally flaky ones caused by changes in unreliable upstream data sources. I don't believe we should support this on "child" models (at all), or as a global flag broadly applied (yet).
I run analytics pipelines for reports, not data engineering pipelines. If a test fails (esp. anomaly detection tests) I want the whole pipeline to run. I'll get an alert about the anomaly or other failure and deal with it. Two bad things about the current behavior:
edit:
Describe the feature
When a model run fails, all dependent, downstream model runs are skipped. This is a great default, but there can be some cases where you still want to run the downstream model.
For example, one case where I would actually want this default behaviour overridden is for jobs that rely on an infrequently updated dimension, something like a country dimension or currency dimension that changes very rarely. Say you have a daily summary/rollup model that uses order transaction data and a country dimension. If the country dimension starts failing, I'd still want my model to keep running off the new order transaction data, and not have to wait until the responsible team fixes it.
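As a sketch of that example (model and column names are illustrative), the rollup model ref()s both parents, so under today's behavior a failure in the country dimension skips it even when the order data built successfully:

```sql
-- models/daily_order_summary.sql  (illustrative names)
-- Depends on both parents; currently, a failure in dim_country skips this
-- model even though fct_orders built fine.
select
    orders.order_date,
    country.country_name,
    sum(orders.order_total) as total_revenue
from {{ ref('fct_orders') }} as orders
left join {{ ref('dim_country') }} as country
    on orders.country_code = country.country_code
group by 1, 2
```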
Describe alternatives you've considered
Individually running models and keeping track of which ones ran successfully and which ones failed. This would be quite time consuming.
Who will this benefit?
Can't comment on how many people this will benefit, but the added flexibility is always a plus.