
Option to disable "skip downstream models on failure" #2142

Open
ian-whitestone opened this issue Feb 17, 2020 · 24 comments · May be fixed by #9020
Labels
enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors Refinement Maintainer input needed

Comments

@ian-whitestone

ian-whitestone commented Feb 17, 2020

Describe the feature

When a model run fails, all dependent, downstream model runs are skipped. This is a great default, but there can be some cases where you still want to run the downstream model.

For example, one case where I would actually want this default behaviour overridden is for jobs that rely on an infrequently updated dimension. Something like a country dimension or currency dimension which changes very rarely. Say you have a daily summary/rollup model that uses order transaction data and a country dimension. If the country dimension starts failing, I'd still want my model to keep running off the new order transaction data, and not have to wait until the responsible team fixes it.

Describe alternatives you've considered

Individually running models and keeping track of which ones ran successfully and which ones failed. This would be quite time consuming.

Who will this benefit?

Can't comment on how many people this will benefit, but the added flexibility is always a plus.

@ian-whitestone ian-whitestone added enhancement New feature or request triage labels Feb 17, 2020
@drewbanin drewbanin removed the triage label Feb 18, 2020
@drewbanin
Contributor

Thanks @ian-whitestone - the example use case you described is a really good one!

I think we can make this a model config. I'm picturing a config key called on_error with values like continue, skip_children, and abort.

The values map onto the following behavior:

  • skip_children - the default, and the current behavior. Skips children of this model
  • continue - try running child models, letting them fail or succeed as they will
  • abort - this is my addition, and it would stop the entire run (ie. skip all remaining nodes)

You buy this? I don't know that we want/need to include the abort parameter, but I thought it was worth discussing :)
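For illustration only, a hypothetical model-level config along these lines (on_error is a proposed key, not an existing dbt config, and the model/source names are placeholders) might look like:

-- models/dim_country.sql (hypothetical sketch; on_error is not an existing dbt config)
{{ config(
    materialized='table',
    on_error='continue'  -- proposed values: 'skip_children' (current default), 'continue', 'abort'
) }}

select * from {{ source('reference', 'country') }}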

@ian-whitestone
Author

ian-whitestone commented Feb 19, 2020

👍 this makes sense to me. The only alternative I can think of is having this specified at the downstream (dependent) model level, i.e. the same values but with different meanings:

  • skip - the default, and the current behaviour. do not run this model if any of the upstream dependencies fail
  • continue - try running this model even if the upstream dependencies fail

The advantage of this approach would be that it gives individual analysts/modellers the ability to specify how they want to handle errors for their specific model without dictating the response for everyone else.
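As a sketch of that alternative (using the key name on_upstream_error that comes up later in this thread; neither the key nor the model names exist in dbt today), the downstream model could declare:

-- models/daily_order_summary.sql (hypothetical sketch)
{{ config(
    on_upstream_error='continue'  -- proposed values: 'skip' (current default), 'continue'
) }}

select * from {{ ref('dim_country') }}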

@bashyroger

bashyroger commented Feb 20, 2020

What is the difference between:

  • skip_children - the default, and the current behavior. Skips children of this model

  • abort - this is my addition, and it would stop the entire run (ie. skip all remaining nodes)

Does abort do a ROLLBACK of the full tree (so, parents too?)

@drewbanin
Contributor

@bashyroger I'm not picturing abort doing any sort of rollback. I was instead thinking that it would compel dbt to skip all models that have not yet run, regardless of their relationship to the failed model. It's sort of a solution in search of a problem, and we probably shouldn't implement it until we have a better handle on some good use cases :)

@github-actions
Contributor

github-actions bot commented Dec 4, 2021

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Dec 4, 2021
@cmdugan13

@drewbanin I've been following this, since the use case that @ian-whitestone described would be useful to me as well. Curious if there has been any more work in this area. Thanks!

@smitsrr

smitsrr commented May 3, 2022

I would love this feature :-)

@uttam17

uttam17 commented May 3, 2022

This would be very helpful

@nick-heron-zip

+1 would be very helpful

@jtcohen6
Contributor

jtcohen6 commented May 13, 2022

Reopening this issue given the renewed interest!

Some of this is now possible with metadata, namely the result status selector, which can pick up skipped models to rerun (docs). Think of it as selecting all nodes where both are true: they were "skipped" because of an upstream failure, and they're tagged rerun_despite_upstream_failure.

dbt run --select result:skipped,rerun_despite_upstream_failure --state previous/run_results.json
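For that selector to work, the models you want to rerun would need to carry the tag, and the second invocation points --state at the run_results.json written by the failed run. For example (the model and tag names are just illustrative):

-- models/daily_order_summary.sql
{{ config(
    tags=['rerun_despite_upstream_failure']
) }}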

I'd say I'm open to this possibility of conditionally disabling "skip downstream models on failure." It's a core construct of dbt-core's execution model, but I don't know that it's an absolutely essential component of the opinionated workflow, or of DAGs more generally.

Where to configure?

I think it feels appropriate to support this config on either the parent node ("I'm worth stopping the DAG for"), or the child node ("I don't care about upstream failures"). By picking one OR the other, it's much simpler to reason about, and much simpler to implement.

I'm hesitant to support defining this configuration for the full matrix of parent-child relationships, as I think that can get unwieldy very quickly. I think it might be possible to deploy ephemeral "middle models" that would be "passthroughs" for the upstream model's data, while allowing you to customize this behavior. Similar to the discussion about dbt build and test failures in #4591. Unlike test failures, though, which access actual data, the ephemeral model would need some way to know that the upstream model errored. Query logs...?

How to configure?

There are two ways we could support this:

  • a per-model config, which controls behavior about this node's children when it fails
  • an overall "run-level" config (--skip-on-failures=...)

If we decide to support both, I'd see this being similar to how the --full-refresh flag / full_refresh model config work today (docs), where the model-level config (if set) takes precedence over the run-level flag.

How to implement?

For parents, that would look like adding a check here for result.node.config.get('on_error') == 'continue', before "marking dependent errors":

if result.status in self.MARK_DEPENDENT_ERRORS_STATUSES:
    if is_ephemeral:
        cause = result
    else:
        cause = None
    self._mark_dependent_errors(node.unique_id, result, cause)
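A rough sketch of that change (on_error is the proposed config, not one dbt recognizes today; everything else follows the snippet above):

# sketch: skip "marking dependent errors" when the failed parent opts in to 'continue'
if result.status in self.MARK_DEPENDENT_ERRORS_STATUSES:
    if result.node.config.get('on_error') != 'continue':
        cause = result if is_ephemeral else None
        self._mark_dependent_errors(node.unique_id, result, cause)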

For children, that would look like checking whether the "dependent node" has on_upstream_error: continue before marking it to skip:

for dep_node_id in self.graph.get_dependent_nodes(node_id):
    self._skipped_children[dep_node_id] = cause
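And a rough sketch for the child-side variant (on_upstream_error is the proposed config; this also assumes the dependent node can be looked up from the manifest):

# sketch: only mark a child as skipped if it has not opted in to 'continue'
for dep_node_id in self.graph.get_dependent_nodes(node_id):
    dep_node = self.manifest.nodes.get(dep_node_id)
    if dep_node is not None and dep_node.config.get('on_upstream_error') == 'continue':
        continue  # let this child attempt to run despite the upstream failure
    self._skipped_children[dep_node_id] = cause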

Next steps

I think there are still some open questions here—namely, whether parents or children are the right place to put this config. It'd be helpful to get more context on specific use cases, to motivate the choice!

Personally, I lean toward parents. This feels analogous to the "warning" behavior of tests, which say: I found some failures, which is good information to know, but it's not worth stopping the DAG for.

@jtcohen6 jtcohen6 reopened this May 13, 2022
@jtcohen6 jtcohen6 added Team:Execution help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors and removed stale Issues that have gone stale labels May 13, 2022
@nick-heron-zip

Our context:

In our source layer we rename and recast all fields where necessary. One of the Salesforce tables (dimension) we ingest via Fivetran has 400+ columns. Less than 10 of these columns are used downstream in business-critical models. Recently one of the ~390 unused columns was dropped from the source schema, leading to a DAG failure and delayed business data while we fixed the error.

As a temporary solution we are no longer enforcing naming conventions or typecasting columns from Salesforce tables. Long-term we would like to trigger a warning instead of a failure when issues are detected in dimension tables.

@jtcohen6
Contributor

@nick-heron-zip Thanks for sharing the use case!

My instinct would be to include only the columns in your staging model that you are actually using downstream, those which are actually relevant to your "analytical universe." I realize it doesn't say that in the docs or discourse explicitly, but that's always been, for me, a significant consideration when writing a staging model.

I also know that I've occasionally used staging model definitions as a form of documentation / quick look-up to see what's available in the source—but that's why you can document and describe columns in sources, and dbt docs generate produces catalog info (including full column list) for sources, too.

@boje

boje commented Jun 21, 2022

Our context:

Our source data is JSON, and with Snowpipe it is loaded into a table with column type VARIANT. In some cases Snowflake recommends loading data into a VARIANT, since new schema changes won't break the load into Snowflake.
When staging this data, the JSON elements are flattened out into a table with a number of columns and data types.
If a column was defined as int but is now sent as char, the staging model will break.

Another scenario is when creating a merge statement in Snowflake: the rows are required to be unique, otherwise the error Duplicate row detected during DML action is raised.

In both cases dataset X returns an error and the dbt Cloud job stops. Datasets Z, Y, and Q work perfectly, but because X errors all models are blocked, even though some reports depend only on the Z and Y datasets.

Hence we are wondering how to avoid the all-or-nothing scenario we have today.
One option is to split the job into smaller jobs, but that creates dependencies across jobs, which is not supported in dbt Cloud today.

@jens-koster

our use case:
Some of our sources are Google Sheets of rarely changing data, edited by humans, and notoriously prone to breaking due to typos. This data is not worth skipping the downstream models for. I would stage this source to a table, which would occasionally fail without affecting downstream loads. On failure, the downstream loads would use the data I materialised in the previous run. That works well enough, and I will catch the problem using freshness alerts on the table of materialised data.
I can achieve this by running the staging of these tables in a separate step before the main run. Having this feature would make it a little neater and less work.
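For example, with the sheet-backed staging models grouped under an illustrative tag, that separate step could be:

dbt run --select tag:google_sheet_sources
dbt run --exclude tag:google_sheet_sources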

I can also see myself using this setting as a quick fix to get the load running again after failure, just until I can find the root cause and fix it properly. Excluding the model in the dbt run command means messing with the devops/build stuff, risking the entire load. Setting a config on one model is low impact and I can do it in my normal work flow of editing models.

@davidsr2r

We're getting pretty desperate for this. There are all kinds of models where we pull in data from a variety of data providers, and one of those providers having out-of-date data (due to e.g. a schema breakage) doesn't invalidate the rest of the providers' data in the downstream models.

I'm about to have to break up our DAG purely to avoid this breakage, which nullifies a significant part of dbt's value, in DAG management.

@jtcohen6
Contributor

@boje @jens-koster @davidsr2r I've marked this issue as help_wanted, meaning that we're not planning to prioritize it ourselves in the next few months, but we'd be happy to have a motivated community member work on contributing it.

Do you have thoughts on the questions I raised in my comment above, around how you'd want to configure this? Would you look for an all-or-nothing runtime config (--no-skip-on-failures), or a config you can specify in project code for specific upstream/parent models?

@davidsr2r

@jtcohen6

We'd be needing a per-model config, so that the rerun of skipped models is intentional and explicit, and avoids downstream runs that shouldn't happen.

Re: parents vs. children, I'd say it probably depends on your use-case preferences. If it's on parents, you need to configure it in fewer places and the information is specific to that model, which makes a lot of sense. However, if it's on children, the query that needs to be aware of the possibility of failure is in the same file that contains the config option, making the dependency and failure case more explicit for the developer. I always prefer maintainability, so while I could be persuaded otherwise, I'd want us to use the latter option. Why not support both, or just one or the other?

@jtcohen6 jtcohen6 added the Refinement Maintainer input needed label Nov 28, 2022
@Inviso-LiseBohr

@jtcohen6
We have a similar situation where we are waiting for the responsible team to fix an issue causing a staging model to fail. This model contains very rarely changing data, so we can live without having it updated for some time.
In the meantime the rest of our DAG is skipped, and none of our most important models are updated until the issue is fixed.

In our situation it would make sense to have the configuration on the parent model, since a large number of models reference the failing model.

We would appreciate the feature described earlier in this thread: an on_error key with values like skip_children and continue.
Is there a good way to upvote the implementation of this?

Btw, our workaround for now is to just temporarily exclude the failing model from our dbt build;
the downside is that it's a little manual and we need to remember to remove the exclusion again once the issue is fixed.
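For example (the model name is just illustrative):

dbt build --exclude stg_rarely_changing_dim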

@VDFaller

My only worry about putting the config in the child would be this use case

  • We have a chain A->B->C
  • B is configured to skip if upstream fails
  • C is configured not to skip if upstream fails
  • A fails

So C would only have data from B as of the last time that A passed. I think this could lead analysts (or, more likely, the business side) to wrongly trust that the data in C is up to date (even though it will be stale for the B portion), because some loaded_date column is up to date.

@pengyusong

our use case:
Our BI tool uses several wide tables to show metrics, and each wide table is composed of multiple upstream tables. When one upstream table fails, the wide table fails too, so the impact is significant. I want to be able to ignore some upstream table failures so that the wide table can still run successfully, which would give me enough time to fix the abnormal upstream model.

@CitationJamespaul

CitationJamespaul commented Nov 1, 2023

We have a large project building our entire data warehouse - there are very few points of failure that would make us want to terminate our entire downstream build.

We would rather have downstream continuation by default and define our points of failure with tests.

I'd be happy with a --no-skip-on-failures option.

@maxrostron

Would still be very interested in the skip_children, abort, continue proposal and in being able to specify that at the top of specific models, especially as we use dbt as part of Dagster, so we don't always trigger dbt flows using command-line flags like --no-skip-on-failures.

Would be a game changer for our flow. At the moment we have to do a lot of convoluted workarounds.

@jtcohen6
Contributor

I've left a comment on the linked PR:

After rereading the use cases described in the comments above (thank you everyone!), I'm interested in moving forward with this as a model-level config set on "parent" models, whose failures are generally flaky ones caused by changes in unreliable upstream data sources. I don't believe we should support this on "child" models (at all), or as a global flag broadly applied (yet).

@mahiki

mahiki commented Sep 22, 2024

I run analytics pipelines for reports, not data engineering pipelines.

If a test fails (especially anomaly detection tests), I want the whole pipeline to run. I'll get an alert about the anomaly or other failure and deal with it.

Two bad things about current behavior:

  1. I miss downstream errors that are unrelated to the first test failure. Yes, there can be two unrelated failures, and I want to see them both after a single build or run/test. I don't want to fix one thing, rebuild, find a new error, fix, rebuild, and so on, multiple times.
  2. What I cannot have is a stack of empty reports with nothing in them for days while I work out the bug in one small part of the outputs. A test failure in my domain is something the business team would prefer to talk about rather than skip review of many blank pages of reports, or skip the business review altogether.

edit:
Oh, I would just do a dbt run, and then a dbt test in a different process that could handle failures in a customized way. Sorry, I just started using dbt and Elementary. Carry on.
