[Bug] DBT unit tests don't work properly when source table `name` matches another source or model #10433

yengibar-manasyan-sp · 2024-07-11T21:39:26Z

Is this a new bug in dbt-core?

I believe this is a new bug in dbt-core
I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

One of my models had the same name as the source Snowflake table. They live in different schemas. dbt run works fine and generates all the data as expected. However, DBT unit tests fail because unit tests can not differentiate mocked source tables from these tables. Here is an example:

unit_tests:
  - name: test_model_1
    description:
    model: MODEL_1
    overrides:
      macros:
        is_incremental: true

    given:
      - input: this
        rows:
          - { column_1: "1", column_2: "2" }

      - input: source('src', 'MODEL_1')
        rows:
          - { column_1: "1", column_2: "3" }

    expect:
      rows:
        - { column_1: "1", column_2: "3" }

When I checked the generated SQL query, I saw that it adds __dbt__cte__ prefixes to the CTE names and replaced raws for source('src', 'MODEL_1') with values from this.

Expected Behavior

Fix unit tests to support identical model and source table names in different schemas.

Steps To Reproduce

Create a model that relies on a table with the table's exact name
Create an incremental unit test and add data both for the source table and this (the model itself)
Run the unit test and see its outcome

Relevant log output

No response

Environment

- OS:
- Python:
- dbt:

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

The text was updated successfully, but these errors were encountered:

dbeatty10 · 2024-07-18T20:47:55Z

Thanks for reporting this @yengibar-manasyan-sp !

Root cause

It appears that the root cause is that during unit testing, sources aren't guaranteed to have a CTE that is unique from refs or other sources.

Example

See below for a reproducible example ("reprex") of the issue you reported when a model and a source have the same name (case_1). It also includes a separate issue when two sources have the same name (case_2).

Reprex

dbt_project.yml

name: "my_project"
version: "1.0.0"
config-version: 2

models/model_a.sql

select -1 as id

models/model_b.sql

select -2 as id

models/case_0.sql

select * from {{ ref("model_a") }}

union all

select * from {{ ref("model_b") }}

models/case_1.sql

select * from {{ ref("model_a") }}

union all

select * from {{ source("src_2", "model_a") }}

models/case_2.sql

select * from {{ source("src_1", "model_a") }}

union all

select * from {{ source("src_2", "model_a") }}

Create some sources that just point to the database objects created by a couple of our models 😂

models/_sources.yml

sources:

  - name: src_1
    schema: "{{ target.schema }}"
    tables:
      - name: model_a

  - name: src_2
    schema: "{{ target.schema }}"
    tables:
      - name: model_a

models/_unit_tests.yml

unit_tests:

  - name: test_case_0
    description: "Unit test SQL: target/run/my_project/models/_unit_tests.yml/models/test_case_0.sql"
    model: case_0
    given:
      - input: ref("model_a")
        rows:
          - { id: 1 }
      - input: ref("model_b")
        rows:
          - { id: 2 }
    expect:
      rows:
          - { id: 1 }
          - { id: 2 }

  - name: test_case_1
    description: "Unit test SQL: target/run/my_project/models/_unit_tests.yml/models/test_case_1.sql"
    model: case_1
    given:
      - input: ref("model_a")
        rows:
          - { id: 1 }
      - input: source("src_2", "model_a")
        rows:
          - { id: 2 }
    expect:
      rows:
          - { id: 1 }
          - { id: 2 }

  - name: test_case_2
    description: "Unit test SQL: target/run/my_project/models/_unit_tests.yml/models/test_case_2.sql"
    model: case_2
    given:
      - input: source("src_1", "model_a")
        rows:
          - { id: 1 }
      - input: source("src_2", "model_a")
        rows:
          - { id: 2 }
    expect:
      rows:
          - { id: 1 }
          - { id: 2 }

Commands:

dbt run -s model_a model_b
dbt build -s case_0
dbt build -s case_1
dbt build -s case_2

I get the following output:

$ dbt build -s case_0

20:42:08  1 of 2 START unit_test case_0::test_case_0 ..................................... [RUN]
20:42:08  1 of 2 PASS case_0::test_case_0 ................................................ [PASS in 0.21s]
20:42:08  2 of 2 START sql view model feature_456.case_0 ................................. [RUN]
20:42:08  2 of 2 OK created sql view model feature_456.case_0 ............................ [OK in 0.12s]
20:42:08  
20:42:08  Finished running 1 unit test, 1 view model in 0 hours 0 minutes and 0.54 seconds (0.54s).
20:42:08  
20:42:08  Completed successfully

$ dbt build -s case_1       

20:41:15  Failure in unit_test test_case_1 (models/_unit_tests.yml)
20:41:15    

actual differs from expected:

@@ ,id
---,1
   ,2
+++,2

$ dbt build -s case_2

20:42:57  1 of 2 START unit_test case_2::test_case_2 ..................................... [RUN]
20:42:57  1 of 2 ERROR case_2::test_case_2 ............................................... [ERROR in 0.03s]
20:42:57  2 of 2 SKIP relation feature_456.case_2 ........................................ [SKIP]
20:42:57  
20:42:57  Finished running 1 unit test, 1 view model in 0 hours 0 minutes and 0.23 seconds (0.23s).
20:42:57  
20:42:57  Completed with 1 error and 0 warnings:
20:42:57  
20:42:57    Compilation Error in unit_test test_case_2 (models/_unit_tests.yml)
  Unit_Test 'unit_test.my_project.case_2.test_case_2' (models/_unit_tests.yml) depends on a source named 'src_1.model_a' which was not found

Looking at the output files like the following helps highlight what is going wrong:

target/run/my_project/models/_unit_tests.yml/models/test_case_0.sql
target/run/my_project/models/_unit_tests.yml/models/test_case_1.sql

dbeatty10 · 2024-07-19T02:36:36Z

Acceptance criteria

The key outcome we'd need is:

each node in the DAG to have a corresponding CTE name that is guaranteed to be unique from that of every other node in the DAG (regardless of node type like source, model, seed, snapshot, etc.).

A possible building block

One thing that is already unique across the DAG is unique_id, so possibly that could be transformed into a suitable CTE name.

Related code

Some related code is described in #5273 (comment).

dbt-labs/dbt-adapters#236 and #10290 have proposed updates to the code that generates the CTE name, but I don't think either would fix this particular issue with unit tests.

dbeatty10 · 2024-07-20T21:17:45Z

Relevant code

dbt-core/core/dbt/context/providers.py

Lines 1595 to 1600 in c668846

    
           def this(self) -> Optional[str]: 
        
               if self.model.this_input_node_unique_id: 
        
                   this_node = self.manifest.expect(self.model.this_input_node_unique_id) 
        
                   self.model.set_cte(this_node.unique_id, None)  # type: ignore 
        
                   return self.adapter.Relation.add_ephemeral_prefix(this_node.name) 
        
               return None

Specifically, we could generate some kind of globally_unique_identifier and then use it like this:

            return self.adapter.Relation.add_ephemeral_prefix(globally_unique_identifier)

dbeatty10 · 2024-07-20T22:27:21Z

Implementation ideas

Here are a couple ideas of how to create the CTE in unit tests so that they are unique:

Use a quoted version of unique_id + quoted identifiers
Use a hexadecimal formatted hash of the unique_id + quoted identifiers

1. Use a quoted version of `unique_id` + quoted identifiers

Example:

with "__dbt__cte__source.my_project.src_1.model_a" as (
...
),
"__dbt__cte__model.my_project.model_a" as (

Pros:

Retains and utilizes uniqueness of the unique_id
Doesn't transform the unique_id at all -- just quotes it
Preserves readability of the CTE name
Would also solve the problem described in dbt-core #5273

Cons:

Not sure if data warehouses commonly allow CTE names to have quoted identifiers or not 🤷

2. Use a hexadecimal formatted hash of the `unique_id` + quoted identifiers

Examples:

with __dbt__cte__e83c5163316f89bfbde7d9ab23ca2e25604af290 as (

or

with __dbt__cte__dim_customers_e83c5163316f as (

Pros:

Utilizes uniqueness of the unique_id
Doesn't rely upon quoted identifiers

Cons:

CTE name is either less readable or not readable at all
Not 100% guaranteed to be unique after hashing

Details

We could hash the unique_id similar to _create_sha1_hash:

dbt-core/core/dbt/task/deps.py

Line 56 in c668846

return sha1("\n".join(package_strs).encode("utf-8")).hexdigest()

If we don't care about readability of the CTE name, then we could just use the hexdigest as-is which for SHA1 would be a 40-character value like this:

e83c5163316f89bfbde7d9ab23ca2e25604af290

If there are readability concerns, then we could combine an abbreviated version of the hash along with the node name (or identifier), which might look like this:

dim_customers_e83c5163316f

dbeatty10 · 2024-07-24T16:20:39Z

Implementation ideas (continued)

Here's some other ideas:

Allow configuration of a source alias that is only used during unit tests
Do option 1 or 2, but only for sources

3. Allow configuration of a source `alias` that is only used during unit tests

Similar to #10290

Example:

with "__dbt__cte__PROVIDED_SOURCE_ALIAS_HERE" as (
...
),
"__dbt__cte__model_a" as (

Pros:

Provides a way to work around the issue when it comes up
Preserves readability of the CTE name

Cons:

The user would likely still get a non-unique error, and then they'd need to somehow know to add this new configuration to resolve the issue.
If the new configuration were named alias, it would be confusing because it wouldn't behave the exact same as alias for other resources.
We'd prefer not to add new configurations when we can avoid it.

4. Do option 1 or 2, but only for sources

Pros:

Preserves readability of the CTE name for models by leaving the CTE for models as-is
Same pros as those for 1 or 2

Cons:

Same cons as those for 1 or 2

larsstromholm · 2024-09-13T12:42:38Z

We've just come across this issue ourselves. Are you planning contributing to this in the nearest future, @dbeatty10 ?

jhoffland · 2024-10-25T08:35:50Z

Did anyone find a workaround?

devmessias · 2024-10-25T17:19:37Z

There’s no workaround, @jhoffland. However, if you’re using dbt-core, you could modify the code to use the full path (source name + model) or hash like previous suggested

Kayrnt · 2024-12-04T12:54:58Z

Different bug but that workaround should work with this issue too:
#11075 (comment)

yengibar-manasyan-sp added bug Something isn't working triage labels Jul 11, 2024

dbeatty10 added the unit tests Issues related to built-in dbt unit testing functionality label Jul 11, 2024

dbeatty10 mentioned this issue Jul 18, 2024

Explain which resource types the ref function applies to dbt-labs/docs.getdbt.com#5804

Merged

1 task

dbeatty10 self-assigned this Jul 18, 2024

dbeatty10 removed the triage label Jul 18, 2024

dbeatty10 removed their assignment Jul 18, 2024

dbeatty10 changed the title ~~[Bug] DBT unit tests fail when source table and model name matches~~ [Bug] DBT unit tests don't work properly when source table name matches another source or model Jul 19, 2024

jhoffland mentioned this issue Oct 25, 2024

[Bug] Unit test fixture is returning syntax errors for sources with hyphens ('-') in their name #10345

Open

2 tasks

devmessias mentioned this issue Oct 25, 2024

[Bug] Unit tests not working for incremental models #10686

Closed

2 tasks

dbeatty10 mentioned this issue Dec 4, 2024

[Regression] BigQuery: Unit tests using sources with sharded tables no longer work after CTE naming change #11075

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] DBT unit tests don't work properly when source table `name` matches another source or model #10433

[Bug] DBT unit tests don't work properly when source table `name` matches another source or model #10433

yengibar-manasyan-sp commented Jul 11, 2024

dbeatty10 commented Jul 18, 2024

Reprex

dbeatty10 commented Jul 19, 2024 •

edited

Loading

dbeatty10 commented Jul 20, 2024

dbeatty10 commented Jul 20, 2024

dbeatty10 commented Jul 24, 2024 •

edited

Loading

larsstromholm commented Sep 13, 2024

jhoffland commented Oct 25, 2024

devmessias commented Oct 25, 2024

Kayrnt commented Dec 4, 2024

[Bug] DBT unit tests don't work properly when source table name matches another source or model #10433

[Bug] DBT unit tests don't work properly when source table name matches another source or model #10433

Comments

yengibar-manasyan-sp commented Jul 11, 2024

Is this a new bug in dbt-core?

Current Behavior

Expected Behavior

Steps To Reproduce

Relevant log output

Environment

Which database adapter are you using with dbt?

Additional Context

dbeatty10 commented Jul 18, 2024

Root cause

Example

Reprex

dbeatty10 commented Jul 19, 2024 • edited Loading

Acceptance criteria

A possible building block

Related code

dbeatty10 commented Jul 20, 2024

Relevant code

dbeatty10 commented Jul 20, 2024

Implementation ideas

1. Use a quoted version of unique_id + quoted identifiers

2. Use a hexadecimal formatted hash of the unique_id + quoted identifiers

Details

dbeatty10 commented Jul 24, 2024 • edited Loading

Implementation ideas (continued)

3. Allow configuration of a source alias that is only used during unit tests

4. Do option 1 or 2, but only for sources

larsstromholm commented Sep 13, 2024

jhoffland commented Oct 25, 2024

devmessias commented Oct 25, 2024

Kayrnt commented Dec 4, 2024

[Bug] DBT unit tests don't work properly when source table `name` matches another source or model #10433

[Bug] DBT unit tests don't work properly when source table `name` matches another source or model #10433

dbeatty10 commented Jul 19, 2024 •

edited

Loading

1. Use a quoted version of `unique_id` + quoted identifiers

2. Use a hexadecimal formatted hash of the `unique_id` + quoted identifiers

dbeatty10 commented Jul 24, 2024 •

edited

Loading

3. Allow configuration of a source `alias` that is only used during unit tests