Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Snapshot contract errors for dbt_change_type and dbt_unique_key #9542

Closed
2 tasks done
david-burke5000 opened this issue Feb 8, 2024 · 7 comments
Closed
2 tasks done
Assignees
Labels
bug Something isn't working model_contracts multi_project Refinement Maintainer input needed snapshots Issues related to dbt's snapshot functionality user docs [docs.getdbt.com] Needs better documentation wontfix Not a bug or out of scope for dbt-core

Comments

@david-burke5000
Copy link

david-burke5000 commented Feb 8, 2024

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Defining in the config of a snapshot schema that contracts are to be enforced, one gets error relating to fields

  • dbt_change_type
  • dbt_unique_key

If one includes them in the schema for the first run of a snapshot, this generates an error since the fields do not exist yet. In subsequent snapshots (after an initial snapshot, and where a change has occurred) one gets an error if they are missing in the schema definition. This then requires one, especially in the development phase, to continually put these contracts in or take them out. Understandably, one has to define a contract for all columns in the schema that finally appear in the snapshot table. However, for internally generated columns, this is wasting development time. This doesn not happen for dbt_scd_id column for example.

Expected Behavior

If columns do not appear in the snapshot tables, and are internal, it should not cause a problem to not define their contracts directly in a schema.

Steps To Reproduce

  1. Create schema for snapshot, with enforced contract in the config. Do not include contract for dbt_change_type and dbt_unique_key
  2. Create a snapshot
  3. Try and create a further snapshot where changes have occurred

Relevant log output

No response

Environment

- Python: 3.11
- dbt:1.7.1
- db: postgres

Which database adapter are you using with dbt?

postgres

Additional Context

No response

@david-burke5000 david-burke5000 added bug Something isn't working triage labels Feb 8, 2024
@dbeatty10 dbeatty10 added snapshots Issues related to dbt's snapshot functionality model_contracts labels Feb 8, 2024
@dbeatty10
Copy link
Contributor

Thanks for reporting this @david-burke5000 !

Initial assessment

I suspect that we didn't intentionally include support model contracts for snapshots -- it's not listed here either way.

But at the very least, you did figure out how to get model contracts to work for snapshots in dbt-postgres (albeit with behavior that is a bit wonky).

When I tried it locally with dbt-postgres, I got the same behavior as you, but it didn't really work at all when I tried with dbt-snowflake.

Options

We've got a couple options moving forward:

  1. Leave things as-is but update the documentation here to indicate that contracts are not supported for snapshots
  2. Explicitly disable support by detect if a contract is defined on a snapshot and raising an error or a warning
  3. Explicitly enable support by adding test cases and making code updates to dbt-postgres, dbt-snowflake, etc.

I'm going to label this as "refinement" for us to take a closer look.

@dbeatty10
Copy link
Contributor

We talked about this briefly today (@graciegoheen and myself). We thought we'd leave this in the Court of the Mesh King 👑 to decide. What do you think @jtcohen6?

@jtcohen6 jtcohen6 added wontfix Not a bug or out of scope for dbt-core user docs [docs.getdbt.com] Needs better documentation multi_project and removed triage labels Feb 20, 2024
@jtcohen6
Copy link
Contributor

@david-burke5000 @dbeatty10 My inclination is that we should not seek to officially support contracts on snapshots. Here's my rationale:

For snapshots that depend on sources (best practice), when a column in the upstream source system is added or changes data type, the snapshot will handle this by adding a new column with that name / data type. If a column is removed, the snapshot does not remove its corresponding column. In this way, we ensure that data is not lost (good!), but we trade off having a fully predictable schema for snapshots.

For snapshots that depend on models, and sit in the middle of the DAG (not something we encourage outright, but it definitely happens): I would recommend limiting the transformation logic in the snapshot to just select * from {{ ref('upstream_model') }} — something @graciegoheen and I have been discussing! — and applying the model contract on upstream_model instead. Then any schema mismatches will be caught by a build failure in upstream_model (good!), rather than the snapshot. Build failures for snapshots are scarier, because there's the potential implication of data loss.

To that end, I've opened a docs PR:

What say ye? Any strong dissent?

@jtcohen6 jtcohen6 closed this as not planned Won't fix, can't repro, duplicate, stale Feb 20, 2024
mirnawong1 added a commit to dbt-labs/docs.getdbt.com that referenced this issue Feb 20, 2024
…4954)

## What are you changing in this pull request and why?

Clarify that model contracts are only supported for models, not for
other resource types.

Context:
- dbt-labs/dbt-core#9542
@graciegoheen
Copy link
Contributor

No dissent from me, but I do think if we go this route we should add an error message when you attempt to add a contract to a snapshot, so folks don't get into this tricky spot.

@dbeatty10
Copy link
Contributor

To round out this story, here's the components that would be the most helpful:

@dbeatty10
Copy link
Contributor

Workaround using a pass-through model

Suppose you have a snapshot (or seed) named not_a_model. (Without loss of generalization, we can imagine a source with a similar name, but it would use source() instead of ref() below.)

To add a pass-through model contract, you can do the following:

  1. Make a simple pass-through view
  2. Enforce a contract on the pass-through view

1. Make a simple pass-through view

models/contract_for_not_a_model

{{ config(materialized="view") }}

select * from {{ ref("not_a_model") }}

2. Enforce a contract on the pass-through view

models/_models.yml

models:
  - name: contract_for_not_a_model

    contract:
      enforced: true

    columns:

      # snapshot-specific columns
      
      # your columns here

      - name: column_1
        data_type: dtype_1

      # etc.

      # snapshot-specific columns

      - name: dbt_scd_id
        data_type: text

      - name: dbt_updated_at
        data_type: timestamp

      - name: dbt_valid_from
        data_type: timestamp

      - name: dbt_valid_to
        data_type: timestamp

@slessl
Copy link

slessl commented Mar 5, 2024

Dear @graciegoheen, @dbeatty10, @jtcohen6 ,

Thank you for your support and feedback on this issue! I am working together with @david-burke5000.
Unfortunately I am a bit late to this discussion but please allow me to add my two cents.

I am commenting the situation with the dbt-postgres-adapter in mind.

We work in a highly regulated industry where it is imperative that we track how data has changed over time. This is where we rely on the SCD2 snapshots to build the history of the data, incl. the time intervals a record is valid. All our downstream models are based on these snapshots.
Consequently, we want to ensure that the data in these snapshot tables is trustworthy by defining the appropriate data contracts (including database constraints, especially foreign key references).

Our current DAG: source table => view (adding metadata and derive surrogate key column + dbt tests) => snapshot (with data contracts) => any downstream model

Snapshots vs. incremental models:

@jtcohen6 argued:

if a column is removed, the snapshot does not remove its corresponding column. In this way, we ensure that data is not lost (good!), but we trade off having a fully predictable schema for snapshots.

Currently contracts are supported for incremental models. As far as I understand, the behavior for incremental models with on_schema_change: append_new_columns is the same as for snapshots: new columns are added to the table, deleted columns are left within the table.
What is the reasoning to treat snapshots differently from increment models? In both cases, if the structure of an upstream model should change, it would be required to update the respective contract of the table.

Workaround using a pass-through model:

The workaround to add a view-based pass-through model would not (fully) work, as database constraints (e.g. not_null, primary_key, foreign_key) can only be applied on an actual table (in our case the snapshot table) and not on a view.
Adding a materialized pass-through model also seems rather unintuitive.

Kindly ask for further consideration and feedback on our thoughts.

Thank you and best regards!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working model_contracts multi_project Refinement Maintainer input needed snapshots Issues related to dbt's snapshot functionality user docs [docs.getdbt.com] Needs better documentation wontfix Not a bug or out of scope for dbt-core
Projects
None yet
Development

No branches or pull requests

5 participants