[Feature]: Snapshots of "source" tables? #410

Open
codigo-ergo-sum opened this issue Nov 30, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@codigo-ergo-sum

Overview

I'm not sure if this is really a feature request or if I'm misunderstanding something about how this (very helpful and much appreciated) package works.

The "source" tables (e.g. models, tests, etc.), where the dbt execution info is loaded via INSERT INTO statements, are defined as incremental models. I understand on one level why this is done: with a standard table materialization, the data would be removed every time the model was run. The issue for us is that we run --full-refresh on our existing dbt project fairly often, about once a week. We do this because we have a number of incremental models that cut down on query processing time and costs, but occasionally some really late-arriving facts or other edge cases sneak past our incremental windowing filters. To be safe, it's easier for us to do a full truncate and reload of our incremental tables every so often.

But this causes a problem with this package: we'll lose all the historical invocation information every time we do a --full-refresh. Yes, we could add exclusions such as --exclude dbt-artifacts to our dbt job invocations in our dev and prod environments, but that's fragile and vulnerable to somebody forgetting to keep the exclusion when they edit dbt job parameters in dbt Cloud. We want to keep all of that history over time and never have to worry about losing it just because somebody accidentally misconfigured a job or didn't understand the implications.

Am I understanding the situation correctly here? Wouldn't others want to ensure that they keep the full invocation history even when running a full refresh? And in that case, wouldn't dbt snapshots be the proper method to do this? That is, keep the source tables as incremental models, or even as regular table materializations, but put a snapshot on top of them to capture and persist the data?
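To make the idea concrete, here is a rough, untested sketch of the kind of snapshot I have in mind. The snapshot name, the target schema, the referenced model name (`model_executions`), and the unique key column are all assumptions for illustration and are not taken from the package. As far as I understand, with default settings a snapshot does not delete rows that later disappear from the underlying table, which is the property that would matter here.

```sql
{% snapshot model_executions_history %}

{# Hypothetical names throughout: adjust the schema, ref, and key to the real tables. #}
{{
    config(
      target_schema='snapshots',
      unique_key='model_execution_id',
      strategy='check',
      check_cols='all'
    )
}}

-- Select everything from the (assumed) source model so each run's rows get captured.
select * from {{ ref('model_executions') }}

{% endsnapshot %}
```

The snapshot would have to run (via `dbt snapshot`) after each invocation's results are loaded and before the next full refresh, otherwise rows from the runs in between would still be lost.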

Let me know if I misunderstood the intention and use case here, maybe I'm not "getting" it.

What problem would this solve?

It would ensure persistence of dbt artifact data over time regardless of how other incremental models are being used and refreshed in the project.

Would you be willing to contribute?

Yes, but I'm not able to at this exact moment.

@codigo-ergo-sum codigo-ergo-sum added the enhancement New feature or request label Nov 30, 2023
@codigo-ergo-sum
Author

Just checking whether anyone has feedback on this. I'm now looking again at adding dbt-artifacts to our BigQuery project, since some of the previous issues around managing large numbers of tests seem to have been resolved, but I'm still scratching my head about the question I posted above.
