Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] DBT unit tests don't work properly when source table name matches another source or model #10433

Open
2 tasks done
yengibar-manasyan-sp opened this issue Jul 11, 2024 · 9 comments
Labels
bug Something isn't working unit tests Issues related to built-in dbt unit testing functionality

Comments

@yengibar-manasyan-sp
Copy link

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

One of my models had the same name as the source Snowflake table. They live in different schemas. dbt run works fine and generates all the data as expected. However, DBT unit tests fail because unit tests can not differentiate mocked source tables from these tables. Here is an example:

unit_tests:
  - name: test_model_1
    description:
    model: MODEL_1
    overrides:
      macros:
        is_incremental: true

    given:
      - input: this
        rows:
          - { column_1: "1", column_2: "2" }

      - input: source('src', 'MODEL_1')
        rows:
          - { column_1: "1", column_2: "3" }

    expect:
      rows:
        - { column_1: "1", column_2: "3" }

When I checked the generated SQL query, I saw that it adds __dbt__cte__ prefixes to the CTE names and replaced raws for source('src', 'MODEL_1') with values from this.

Expected Behavior

Fix unit tests to support identical model and source table names in different schemas.

Steps To Reproduce

  1. Create a model that relies on a table with the table's exact name
  2. Create an incremental unit test and add data both for the source table and this (the model itself)
  3. Run the unit test and see its outcome

Relevant log output

No response

Environment

- OS:
- Python:
- dbt:

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

@yengibar-manasyan-sp yengibar-manasyan-sp added bug Something isn't working triage labels Jul 11, 2024
@dbeatty10 dbeatty10 added the unit tests Issues related to built-in dbt unit testing functionality label Jul 11, 2024
@dbeatty10 dbeatty10 self-assigned this Jul 18, 2024
@dbeatty10
Copy link
Contributor

Thanks for reporting this @yengibar-manasyan-sp !

Root cause

It appears that the root cause is that during unit testing, sources aren't guaranteed to have a CTE that is unique from refs or other sources.

Example

See below for a reproducible example ("reprex") of the issue you reported when a model and a source have the same name (case_1). It also includes a separate issue when two sources have the same name (case_2).

Reprex

dbt_project.yml

name: "my_project"
version: "1.0.0"
config-version: 2

models/model_a.sql

select -1 as id

models/model_b.sql

select -2 as id

models/case_0.sql

select * from {{ ref("model_a") }}

union all

select * from {{ ref("model_b") }}

models/case_1.sql

select * from {{ ref("model_a") }}

union all

select * from {{ source("src_2", "model_a") }}

models/case_2.sql

select * from {{ source("src_1", "model_a") }}

union all

select * from {{ source("src_2", "model_a") }}

Create some sources that just point to the database objects created by a couple of our models 😂

models/_sources.yml

sources:

  - name: src_1
    schema: "{{ target.schema }}"
    tables:
      - name: model_a

  - name: src_2
    schema: "{{ target.schema }}"
    tables:
      - name: model_a

models/_unit_tests.yml

unit_tests:

  - name: test_case_0
    description: "Unit test SQL: target/run/my_project/models/_unit_tests.yml/models/test_case_0.sql"
    model: case_0
    given:
      - input: ref("model_a")
        rows:
          - { id: 1 }
      - input: ref("model_b")
        rows:
          - { id: 2 }
    expect:
      rows:
          - { id: 1 }
          - { id: 2 }

  - name: test_case_1
    description: "Unit test SQL: target/run/my_project/models/_unit_tests.yml/models/test_case_1.sql"
    model: case_1
    given:
      - input: ref("model_a")
        rows:
          - { id: 1 }
      - input: source("src_2", "model_a")
        rows:
          - { id: 2 }
    expect:
      rows:
          - { id: 1 }
          - { id: 2 }

  - name: test_case_2
    description: "Unit test SQL: target/run/my_project/models/_unit_tests.yml/models/test_case_2.sql"
    model: case_2
    given:
      - input: source("src_1", "model_a")
        rows:
          - { id: 1 }
      - input: source("src_2", "model_a")
        rows:
          - { id: 2 }
    expect:
      rows:
          - { id: 1 }
          - { id: 2 }

Commands:

dbt run -s model_a model_b
dbt build -s case_0
dbt build -s case_1
dbt build -s case_2

I get the following output:

$ dbt build -s case_0

20:42:08  1 of 2 START unit_test case_0::test_case_0 ..................................... [RUN]
20:42:08  1 of 2 PASS case_0::test_case_0 ................................................ [PASS in 0.21s]
20:42:08  2 of 2 START sql view model feature_456.case_0 ................................. [RUN]
20:42:08  2 of 2 OK created sql view model feature_456.case_0 ............................ [OK in 0.12s]
20:42:08  
20:42:08  Finished running 1 unit test, 1 view model in 0 hours 0 minutes and 0.54 seconds (0.54s).
20:42:08  
20:42:08  Completed successfully
$ dbt build -s case_1       

20:41:15  Failure in unit_test test_case_1 (models/_unit_tests.yml)
20:41:15    

actual differs from expected:

@@ ,id
---,1
   ,2
+++,2
$ dbt build -s case_2

20:42:57  1 of 2 START unit_test case_2::test_case_2 ..................................... [RUN]
20:42:57  1 of 2 ERROR case_2::test_case_2 ............................................... [ERROR in 0.03s]
20:42:57  2 of 2 SKIP relation feature_456.case_2 ........................................ [SKIP]
20:42:57  
20:42:57  Finished running 1 unit test, 1 view model in 0 hours 0 minutes and 0.23 seconds (0.23s).
20:42:57  
20:42:57  Completed with 1 error and 0 warnings:
20:42:57  
20:42:57    Compilation Error in unit_test test_case_2 (models/_unit_tests.yml)
  Unit_Test 'unit_test.my_project.case_2.test_case_2' (models/_unit_tests.yml) depends on a source named 'src_1.model_a' which was not found

Looking at the output files like the following helps highlight what is going wrong:

  • target/run/my_project/models/_unit_tests.yml/models/test_case_0.sql
  • target/run/my_project/models/_unit_tests.yml/models/test_case_1.sql

@dbeatty10 dbeatty10 removed the triage label Jul 18, 2024
@dbeatty10 dbeatty10 removed their assignment Jul 18, 2024
@dbeatty10 dbeatty10 changed the title [Bug] DBT unit tests fail when source table and model name matches [Bug] DBT unit tests don't work properly when source table name matches another source or model Jul 19, 2024
@dbeatty10
Copy link
Contributor

dbeatty10 commented Jul 19, 2024

Acceptance criteria

The key outcome we'd need is:

  • each node in the DAG to have a corresponding CTE name that is guaranteed to be unique from that of every other node in the DAG (regardless of node type like source, model, seed, snapshot, etc.).

A possible building block

One thing that is already unique across the DAG is unique_id, so possibly that could be transformed into a suitable CTE name.

Related code

Some related code is described in #5273 (comment).

dbt-labs/dbt-adapters#236 and #10290 have proposed updates to the code that generates the CTE name, but I don't think either would fix this particular issue with unit tests.

matthewshaver pushed a commit to dbt-labs/docs.getdbt.com that referenced this issue Jul 19, 2024
Previews:
-
[ref](https://docs-getdbt-com-git-dbeatty10-patch-1-dbt-labs.vercel.app/reference/dbt-jinja-functions/ref)
-
[source](https://docs-getdbt-com-git-dbeatty10-patch-1-dbt-labs.vercel.app/reference/dbt-jinja-functions/source)

## What are you changing in this pull request and why?

This update is useful to understand some of the differences between
[`ref()`](https://docs.getdbt.com/reference/dbt-jinja-functions/ref) and
[`source()`](https://docs.getdbt.com/reference/dbt-jinja-functions/source).
In turn, that is useful context for issues like
dbt-labs/dbt-core#10433.

The
[`source()`](https://docs.getdbt.com/reference/dbt-jinja-functions/source)
and [`ref()`](https://docs.getdbt.com/reference/dbt-jinja-functions/ref)
functions are complementary:

`source()` applies to:
- [sources](https://docs.getdbt.com/docs/build/sources)

Whereas `ref()` applies to:
- [models](https://docs.getdbt.com/docs/build/models) (both [SQL
models](https://docs.getdbt.com/docs/build/sql-models) and [Python
models](https://docs.getdbt.com/docs/build/python-models))
- [seeds](https://docs.getdbt.com/docs/build/seeds)
- [snapshots](https://docs.getdbt.com/docs/build/snapshots)

`ref()` includes:
- [versioned
models](https://docs.getdbt.com/reference/dbt-jinja-functions/ref#versioned-ref)
- [package-specific
nodes](https://docs.getdbt.com/reference/dbt-jinja-functions/ref#ref-project-specific-models)
- [cross-project
nodes](https://docs.getdbt.com/docs/collaborate/govern/project-dependencies#how-to-write-cross-project-ref)

## Checklist
- [x] Review the [Content style
guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md)
so my content adheres to these guidelines.

---------

Co-authored-by: Mirna Wong <[email protected]>
@dbeatty10
Copy link
Contributor

Relevant code

def this(self) -> Optional[str]:
if self.model.this_input_node_unique_id:
this_node = self.manifest.expect(self.model.this_input_node_unique_id)
self.model.set_cte(this_node.unique_id, None) # type: ignore
return self.adapter.Relation.add_ephemeral_prefix(this_node.name)
return None

Specifically, we could generate some kind of globally_unique_identifier and then use it like this:

            return self.adapter.Relation.add_ephemeral_prefix(globally_unique_identifier)

@dbeatty10
Copy link
Contributor

Implementation ideas

Here are a couple ideas of how to create the CTE in unit tests so that they are unique:

  1. Use a quoted version of unique_id + quoted identifiers
  2. Use a hexadecimal formatted hash of the unique_id + quoted identifiers

1. Use a quoted version of unique_id + quoted identifiers

Example:

with "__dbt__cte__source.my_project.src_1.model_a" as (
...
),
"__dbt__cte__model.my_project.model_a" as (

Pros:

  • Retains and utilizes uniqueness of the unique_id
  • Doesn't transform the unique_id at all -- just quotes it
  • Preserves readability of the CTE name
  • Would also solve the problem described in dbt-core #5273

Cons:

  • Not sure if data warehouses commonly allow CTE names to have quoted identifiers or not 🤷

2. Use a hexadecimal formatted hash of the unique_id + quoted identifiers

Examples:

with __dbt__cte__e83c5163316f89bfbde7d9ab23ca2e25604af290 as (

or

with __dbt__cte__dim_customers_e83c5163316f as (

Pros:

  • Utilizes uniqueness of the unique_id
  • Doesn't rely upon quoted identifiers

Cons:

  • CTE name is either less readable or not readable at all
  • Not 100% guaranteed to be unique after hashing

Details

We could hash the unique_id similar to _create_sha1_hash:

return sha1("\n".join(package_strs).encode("utf-8")).hexdigest()

If we don't care about readability of the CTE name, then we could just use the hexdigest as-is which for SHA1 would be a 40-character value like this:

e83c5163316f89bfbde7d9ab23ca2e25604af290

If there are readability concerns, then we could combine an abbreviated version of the hash along with the node name (or identifier), which might look like this:

dim_customers_e83c5163316f

@dbeatty10
Copy link
Contributor

dbeatty10 commented Jul 24, 2024

Implementation ideas (continued)

Here's some other ideas:

  1. Allow configuration of a source alias that is only used during unit tests
  2. Do option 1 or 2, but only for sources

3. Allow configuration of a source alias that is only used during unit tests

Similar to #10290

Example:

with "__dbt__cte__PROVIDED_SOURCE_ALIAS_HERE" as (
...
),
"__dbt__cte__model_a" as (

Pros:

  • Provides a way to work around the issue when it comes up
  • Preserves readability of the CTE name

Cons:

  • The user would likely still get a non-unique error, and then they'd need to somehow know to add this new configuration to resolve the issue.
  • If the new configuration were named alias, it would be confusing because it wouldn't behave the exact same as alias for other resources.
  • We'd prefer not to add new configurations when we can avoid it.

4. Do option 1 or 2, but only for sources

Pros:

  • Preserves readability of the CTE name for models by leaving the CTE for models as-is
  • Same pros as those for 1 or 2

Cons:

  • Same cons as those for 1 or 2

@larsstromholm
Copy link

We've just come across this issue ourselves. Are you planning contributing to this in the nearest future, @dbeatty10 ?

@jhoffland
Copy link

Did anyone find a workaround?

@devmessias
Copy link
Contributor

There’s no workaround, @jhoffland. However, if you’re using dbt-core, you could modify the code to use the full path (source name + model) or hash like previous suggested

@Kayrnt
Copy link
Contributor

Kayrnt commented Dec 4, 2024

Different bug but that workaround should work with this issue too:
#11075 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working unit tests Issues related to built-in dbt unit testing functionality
Projects
None yet
Development

No branches or pull requests

6 participants