Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

using {{ ref('anything') }} in a set_sql_header call resolves to the current model and not 'anything' #2793

Open
1 of 5 tasks
yuzeh opened this issue Sep 26, 2020 · 15 comments
Labels
bug Something isn't working user docs [docs.getdbt.com] Needs better documentation

Comments

@yuzeh
Copy link

yuzeh commented Sep 26, 2020

Describe the bug

Here's a minimal reproducible example (running on BigQuery).

-- test_tmp_1.sql
{{
    config(materialized="table")
}}

select a.* from unnest([
    struct(1 as key, 1 as value)
]) as a
-- test_tmp_2.sql
{{
    config(materialized="table")
}}

{% call set_sql_header(config) %}
    select * from {{ ref('test_tmp_1') }};
{% endcall %}

select * from {{ ref('test_tmp_1') }}

When I dbt run this project, I get an error when test_tmp_2 is run saying something along the lines of "test_tmp_2 does not exist".

The output SQL for test_tmp_2 after a dbt run shows:

    select * from `project`.`dataset`.`test_tmp_2`;


  create or replace table `project`.`dataset`.`test_tmp_2`
  
  
  OPTIONS()
  as (
    



select * from `project`.`dataset`.`test_tmp_1`
  );
    

Expected behavior

The ref('test_tmp_1') call in the call set_sql_header(config) block should resolve to test_tmp_1.

Instead it resolves to test_tmp_2.

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.18.0
   latest version: 0.18.0

Up to date!

Plugins:
  - bigquery: 0.18.0
  - snowflake: 0.18.0
  - redshift: 0.18.0
  - postgres: 0.18.0

The operating system you're using:

macOS 10.15.6

The output of python --version:

Python 3.8.1

Additional context

N/A

@yuzeh yuzeh added bug Something isn't working triage labels Sep 26, 2020
@jtcohen6 jtcohen6 removed the triage label Sep 29, 2020
@jtcohen6
Copy link
Contributor

jtcohen6 commented Sep 29, 2020

@yuzeh This is a weird one, all right. I've confirmed that this behavior occurs on other databases (Postgres) and regardless of whether you use call set_sql_header(config) or config(sql_header = ...). I also confirmed that this isn't a v0.18 regression, since the same thing happens with v0.17.2.

I think what's going on here is that, in its first parse through the project, dbt captures all refs (to construct the DAG) but renders the code with a "dummy" value. That "dummy" value is, for no great reason, the current node reference. It's clear that sql_header is not re-rendered during dbt's second walk through the project, when it comes time to actually run the model.

I admit it feels quite strange to see two different values interpolated for rel below:

{{config(
    materialized="table"
)}}

{% set rel = ref('test_tmp_1') %}

{% call set_sql_header(config) %}
    select * from {{ rel }};
{% endcall %}

select * from {{ rel }}

I'd welcome a community contribution for this one, though I need to get to the bottom of what's going on here, and it's likely too tricky to be a good first issue. I don't think we'll prioritize an immediate fix here because the intended purpose of set_sql_header is quite narrow: creating UDFs, setting script variables (BQ), setting temporary session parameters (Snowflake), etc.

In the meantime, I think the near-term fix is to document that Jinja + macros (including ref + source) will not work as expected inside of sql_header.

@danielefrigo
Copy link
Contributor

The same issue appears also with other config parameters: I built a custom materialization and wanted to pass ref to another model as a parameter, but I ended with the same issue highlighted here.
So the issue is not restricted to SQL header setting.

@pichot
Copy link

pichot commented Nov 20, 2021

Also seeing this issue when when passing ref(model_name) as a config param to a macro in config(post_hook=[]) on a model.

When logging what the macro is generating, I see two values:

  • First, the current node reference "dummy" value @jtcohen6 mentions above
  • Then the correct reference

But for some reason, the executed SQL does not use the correct reference, but instead the current node reference "dummy" value.

@gmailMichaelSha
Copy link

gmailMichaelSha commented Apr 8, 2022

I am seeing this issue when I am trying to create a temporary table using set_sql_header with a source.
Using snowflake database

{{
    config(
        materialized='table',
    )
}}
{%- call set_sql_header(config) -%}
    CREATE OR REPLACE TEMPORARY TABLE {{ target.database }}.{{target.schema}}.temp_tbl_1 AS
    SELECT * FROM {{ source('source_1', 'world_cities') }};
{%- endcall -%}

SELECT *
FROM {{ target.database ~ "." ~ target.schema ~ ".temp_tbl_1" }}

@elyobo
Copy link

elyobo commented Jun 28, 2022

Ran into this too, my use case is the same as @gmailMichaelSha's - creating a temp table. In my case to split a large complex query in two to avoid BigQuery's Not enough resources for query planning error. Creating a temp table before execution so that it's cleaned up after is how we previously did this with dataform using pre_operations.

@noahyetter-nb
Copy link

Any updates here? This has been broken for a long time, and remains undocumented.

@elyobo
Copy link

elyobo commented Oct 26, 2022

Another use case: setting a variable with the a value to use in incremental builds for BQ, as BQ doesn't prune partitions if you use a subquery to find this (there are other ways around this though, e.g. using run_query()).

@elyobo
Copy link

elyobo commented Oct 30, 2022

Another use case: setting a variable to the result of part of a query, so that it can be referenced in multiple places. BigQuery reevaluates CTEs if they're queried more than once (I assume there's some rational behaviour for this, I guess due to parallelisation?) and so the recommended approach is to set a variable to the result and reference it instead.

@louis-vines
Copy link

louis-vines commented Mar 10, 2023

I think I've come across another bug related to this too: I'm trying to create my own custom version of _dbt_max_partitions that only looks at the last 3 months of data instead of scanning the column for all time: (all time is ~ 10gb of data), when doing:

{%- call set_sql_header(config) -%}
    declare max_date date default (
      select max(date_) from {{ this }} where date_ is not null and date_ >= '2023-01-01'
    );
{%- endcall %}

this doesn't work because I'm using a custom generate_alias_name and {{ this }} isn't using it in this context

I've raised this as a seperate issue with a bit more explanation: #7151

@moltar
Copy link

moltar commented Jun 6, 2023

Another use case is setting the config value from a ref:

Inside a model:

-- my_model.sql

{{
  config(
    meta = {
      'ref': ref('some_other_model').identifier
    },
  )
}}

After compiling, the value of node.my_model.config.meta.ref (and node.my_model.meta.ref) will be my_model (= this.identifier) rather than some_other_model.

@luksfarris
Copy link

Ran into this today, except that I'm using source instead of ref inside the set_sql_header.
Here's a simplified model:

{{ config(materialized = 'view') }}

{% call set_sql_header(config) %}
DECLARE latest_date DEFAULT (SELECT MAX(dt) AS dt FROM {{ source('sources', 'table') }});
{%- endcall %}

SELECT * FROM source('sources', 'table') WHERE dt = latest_date

This renders the DECLARE statement with the incorrect schema and table name.

@carolinabtt
Copy link

I ran into the same issue and I could solved it declaring the referenced table name before the set_sql_reader:

{% set target_relation = api.Relation.create(database=database, schema=schema, identifier='table_referenced') -%}

{% call set_sql_header(config) %}
 select * from {{ target_relation }}
{%- endcall %}

{{ config(materialized = 'incremental') }}

select * from source('sources', 'table') 
where date = latest_date

@tapmz
Copy link

tapmz commented Jun 20, 2024

The ability to create temporary tables during incremental executions is a really important capability that I am missing in the current behavior of set_sql_header() to be honest. I am using postgres and the difference in terms of performance with or without this capability is huge. Is there any solution to the fact that is_incremental() always returns false during the first parsing step? Thank you in advance!

Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Dec 20, 2024
@elyobo
Copy link

elyobo commented Dec 20, 2024

Keep alive please, worth resolving.

@github-actions github-actions bot removed the stale Issues that have gone stale label Dec 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working user docs [docs.getdbt.com] Needs better documentation
Projects
None yet
Development

No branches or pull requests