
[CT-278] [Feature] Allow meta property to be accessible at compile time #4776

Closed
1 task done
DVAlexHiggs opened this issue Feb 24, 2022 · 21 comments
Labels
enhancement (New feature or request), stale (Issues that have gone stale)

Comments

DVAlexHiggs commented Feb 24, 2022

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

It would be great if the config.meta property were populated at compile time. This would allow users to use the meta dictionary as a source of parameters to generate SQL for their dbt models at scale. For highly parameterised SQL via macros, such as in dbtvault, the following approaches are currently possible:

  • providing parameters in the model itself example
  • as a yaml string (using fromyaml) example
  • or even as individual global variables in dbt_project.yml

All three of these approaches work, but aren't particularly scalable or easy to maintain.

Some downsides of the above approaches:

  • fromyaml is sensitive to formatting and is harder to debug because it's a YAML string rather than a YAML file
  • Providing parameters in the model itself is fine for small amounts of parameters/metadata, but when providing large numbers of parameters or mixed data types (e.g. dictionaries), it becomes unwieldy and quite ugly.
  • In the case of global variables, a dbt_project.yml file can quickly grow and become hard or even impossible to manage.

I requested a similar feature here, and I believe allowing config and meta to be declared in property (.yml) files was one of the original answers to this idea. Whilst this is a fantastic feature and has ensured dbt has come on in leaps and bounds with metadata management, I believe it's not quite there yet.

I recently tried using meta to replace the previously described approaches (fromyaml, params in model, global variables) and unfortunately came across the issue that we couldn't access the data at compile time. The snippets below demonstrate this:

-- my_stage.sql
SELECT 1 AS "MY_TEST_COLUMN"
-- my_table.sql

{{ config(materialized='table') }}

{% do log('meta: ' ~ config.get('meta'), true) %}
{% do log('param_1: ' ~ config.get('meta')['param_1'], true) %}
{% do log('param_2: ' ~ config.get('meta')['param_2'], true) %}
{% do log('param_3: ' ~ config.get('meta')['param_3'], true) %}
{% do log('param_4: ' ~ config.get('meta')['param_4'], true) %}
{% do log('source_model_name: ' ~ config.get('meta')['source_model_name'], true) %}

{% set param_1 = config.get('meta')['param_1'] -%}
{% set param_2 = config.get('meta')['param_2'] -%}
{% set param_3 = config.get('meta')['param_3'] -%}
{% set param_4 = config.get('meta')['param_4'] -%}
{% set source_model_name = config.get('meta')['source_model_name'] -%}

SELECT 1 AS "{{ param_1 }}"
       , 2 AS "{{ param_2 }}"
       , 3 AS "{{ param_3 }}"
       , 4 AS "{{ param_4 }}"
FROM {{ ref(source_model_name) }}
# properties.yml
version: 2

models:
  - name: my_table
    meta:
      param_1: COLUMN_1
      param_2: COLUMN_2
      param_3: COLUMN_3
      param_4: COLUMN_4
      source_model_name: my_stage

Expected/Desired outcome

-- my_table.sql (compiled)

SELECT 1 AS "COLUMN_1"
       , 2 AS "COLUMN_2"
       , 3 AS "COLUMN_3"
       , 4 AS "COLUMN_4"
FROM MY_DATABASE.MY_SCHEMA.my_stage

Current outcome (dbt 1.0.3)

00:10:38  meta:
00:10:38  param_1:
00:10:38  param_2:
00:10:38  param_3:
00:10:38  param_4:
00:10:38  source_model_name:
00:10:38 Encountered an error:
Compilation Error in model my_table (models/my_table.sql)
  The name argument to ref() must be a string, got <class 'jinja2.runtime.Undefined'>

I realise the original intent was probably that metadata would be static data to have alongside models, similar to tags and documentation, however there is a definite use case for it to drive SQL generation, as well.

Thank you!

Describe alternatives you've considered

  • providing parameters in the model itself example
  • as a yaml string (using fromyaml) example
  • or even as individual global variables in dbt_project.yml

As described above.

Who will this benefit?

  • dbt users who wish to highly parameterise their dbt models at scale
  • Users who make use of pattern-based SQL generation via macros
  • Users who need to store or make use of their metadata in their dbt models
  • dbtvault users
  • dbt users who want to make their dbt models entirely metadata-driven

Anything else?

@DVAlexHiggs DVAlexHiggs added enhancement New feature or request triage labels Feb 24, 2022
@github-actions github-actions bot changed the title [Feature] Allow meta property to be accessible at compile time [CT-278] [Feature] Allow meta property to be accessible at compile time Feb 24, 2022
gshank (Contributor) commented Mar 16, 2022

I think the problem isn't so much that meta isn't available at compile time, it's that the config from properties.yml isn't available when the ref is processed. A dbt project is parsed in the order: 1) macros, 2) models, 3) schema files. When the model is "parsed", we extract the ref, source, and config calls and use them to set information in the nodes. It doesn't even help to put the meta config in the model, because that data is extracted at the same time as the ref. The model that is ref'd must resolve to a string at parse time, or we cannot construct the DAG properly.
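The parse ordering described above can be sketched with a toy illustration in Python (this is not dbt's actual internals; the regex, model snippets, and names are all hypothetical): dependencies are extracted from the raw model source before schema .yml files are merged in, so a ref() whose argument comes from meta is invisible at the moment the DAG is built.

```python
import re

def extract_refs(model_source):
    # Parse phase: collect only ref() calls whose argument is a string literal.
    return re.findall(r"ref\('([^']+)'\)", model_source)

literal_model = "SELECT * FROM {{ ref('my_stage') }}"
meta_model = "SELECT * FROM {{ ref(config.get('meta')['source_model_name']) }}"

# The literal ref yields a DAG edge; the meta-driven ref yields nothing,
# even though a later phase would merge in the schema data below.
print(extract_refs(literal_model))  # ['my_stage']
print(extract_refs(meta_model))     # []

# Schema files are only parsed afterwards, so this arrives too late:
schema_yml = {"models": [{"name": "my_table",
                          "meta": {"source_model_name": "my_stage"}}]}
```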

We do have discussions on a regular basis about whether there are advantages to tweaking the order that things are processed, or about moving some flavors of setting to a different place. But these are all pretty complicated and delicate things to change since they are so central to the way that dbt works.

Let us know if you have other ideas about how to make this easier to work with.

@gshank gshank removed the triage label Mar 16, 2022
@DVAlexHiggs DVAlexHiggs reopened this Mar 17, 2022
DVAlexHiggs (Author) commented Mar 17, 2022

Apologies, accidentally closed.

Thanks for your response. Back in the old days we could scope variables specifically to models, and these variables were provided in the dbt_project.yml and used the directory structure to assign the variables to a specific model.

This was good but it quickly became hard to manage as the dbt_project.yml became larger and larger for even the smallest projects.

We should not revert to that, but I'm wondering if model-scoped variables, declared in properties.yml, would provide the same capability whilst being vastly superior in terms of scaling: they would inherit all of the advantages of properties.yml, such as arbitrary nesting, and would allow a user to split the variables across multiple files. Example:

# properties.yml
version: 2

models:
  - name: my_table
    variables:
      param_1: COLUMN_1
      param_2: COLUMN_2
      param_3: COLUMN_3
      param_4: COLUMN_4
      source_model_name: my_stage
-- my_table.sql

{{ config(materialized='table') }}

SELECT 1 AS "{{ var('param_1') }}"
       , 2 AS "{{ var('param_2') }}"
       , 3 AS "{{ var('param_3') }}"
       , 4 AS "{{ var('param_4') }}"
FROM {{ ref(var('source_model_name')) }}

Perhaps then, adding variable processing at the same stage as macros, or between macros and models, would be a minimal solution to this without undesirable impact on the fundamental operation of dbt as a whole?

Really eager to find a solution to this, as it will have huge implications, as described in my initial proposal.

EDIT:

I've thought about this a little more, and the downside I can see is that it's perhaps not intuitive: these parameters fundamentally configure the models, rather than being variables per se.

scajanus commented Mar 18, 2022

@DVAlexHiggs does replacing e.g.:

{% set source_model_name = config.get('meta')['source_model_name'] -%}
with
{% set source_model_name = model.config.meta.source_model_name -%}

work for you as of now? Otherwise can you provide a minimal working example of a project, and I could check if your issue is the same I ran across.

Edit: This is actually advised against here. The workaround would be to use the execute flag as advised, but that requires you to add -- depends_on: ref(... back to the referencing model, which is exactly what you were hoping to avoid.

DVAlexHiggs (Author) commented:

Hi @scajanus. Yes, the advice against this is the reason I had not explored that option.

I can confirm it does not work reliably, due to the reasons mentioned in the docs you linked (incomplete graph).

There must be a way we can get this feature implemented in a dbt-esque way.

DVAlexHiggs (Author) commented Mar 27, 2022

Putting the meta in the model's config block also doesn't seem to work as desired:

Datavault-UK/automate-dv#106 (comment)

@gshank It's odd that using a config block in the model doesn't work either. Is the config processed after the model? That doesn't seem to make sense to me.

scajanus commented Mar 29, 2022

@DVAlexHiggs my understanding is that none of the config is treated as final until the full graph is parsed, i.e. there is no (pre)parser that would check if a part of the config is already safe to use -- as in the model case.

The easiest way forward would seem to be allowing defining globals (or template/model locals) that cannot be changed by the graph/template parsing. Jinja has The Global Namespace and Template Globals.

Otherwise you would need to define the order in which each of the tags are expanded and do multiple passes of parsing, which seems nasty, as you would also need to check whether the config is mutated and determine whether that is allowed and what should happen then. If I've understood the discussion here correctly, that is :)

Either your config is fixed before parsing the graph (as it used to be?), or you allow the graph to change the config. If you allow the graph to change the config, you cannot use it to alter the way the graph is parsed (and expect consistent results). So if you want to use some variables during the parsing, they need to be immutable by the parsing, as the current variables are. We seem to be looking for a way to have variables/config that is model-specific but immutable.
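The circularity described above can be sketched in a few lines of hypothetical Python (not dbt code; the function and names are invented for illustration): once rendering is allowed to mutate the same config that steers parsing, parsing the same template twice need not give the same answer.

```python
def render(config):
    # The template both reads and mutates the config mid-render.
    target = config.get("source_model_name", "default_stage")
    config["source_model_name"] = "other_stage"
    return f"SELECT * FROM {target}"

config = {}
first = render(config)   # resolves against the pre-mutation value
second = render(config)  # same template, same config object, different SQL
print(first == second)   # False: parsing is no longer deterministic
```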

I went through some of the related issues/comments from before (#2377 (comment), #2401, #1790) but none of those change the fact that variables are package global and config can be changed by the graph. They just allow new locations for setting them. Subsequently the feature requested here should not apply for config, subset of config ('meta') or variables, but be its own entity.

DVAlexHiggs (Author) commented Mar 29, 2022

@scajanus Thank you for looking into this! This makes sense and helps me to understand why it's not a simple feature to implement. Any chance we can explore this further @gshank ?

DVAlexHiggs (Author) commented Apr 28, 2022

I've found a semi-working solution for this, though it's a bit messy due to the need for if execute and -- depends_on.

Primarily, the need for if execute severely limits this approach, as users would lose the ability to compile without running, as well as other useful dbt features which don't rely on dbt run.

# properties.yml 
version: 2

models:
  - name: my_meta_table
    meta:
      param_1: COLUMN_1
      param_2: COLUMN_2
      param_3: COLUMN_3
      param_4: COLUMN_4
      source_model_name: v_stg_orders
-- my_meta_table.sql
{{ config(materialized='table') }}

{% do log('meta: ' ~ config.get('meta'), true) %}
{% do log('param_1: ' ~ config.get('meta')['param_1'], true) %}
{% do log('param_2: ' ~ config.get('meta')['param_2'], true) %}
{% do log('param_3: ' ~ config.get('meta')['param_3'], true) %}
{% do log('param_4: ' ~ config.get('meta')['param_4'], true) %}
{% do log('source_model_name: ' ~ config.get('meta')['source_model_name'], true) %}

{% set param_1 = config.get('meta')['param_1'] -%}
{% set param_2 = config.get('meta')['param_2'] -%}
{% set param_3 = config.get('meta')['param_3'] -%}
{% set param_4 = config.get('meta')['param_4'] -%}
{% set source_model_name = config.get('meta')['source_model_name'] -%}


-- depends_on: {{ ref('v_stg_orders') }}

{% if execute %}

SELECT 1 AS "{{ param_1 }}"
       , 2 AS "{{ param_2 }}"
       , 3 AS "{{ param_3 }}"
       , 4 AS "{{ param_4 }}"

FROM {{ ref(source_model_name) }}
{% endif %}

Logs:

08:56:46  Running with dbt=1.0.6
08:56:46  meta: 
08:56:46  param_1: 
08:56:46  param_2: 
08:56:46  param_3: 
08:56:46  param_4: 
08:56:46  source_model_name: 
08:56:46  Found 32 models, 0 tests, 0 snapshots, 0 analyses, 497 macros, 0 operations, 0 seed files, 8 sources, 0 exposures, 0 metrics
08:56:46  
08:56:48  Concurrency: 4 threads (target='dev')
08:56:48  
08:56:48  1 of 1 START table model ALEX_HIGGS.my_meta_table............................... [RUN]
08:56:48  meta: {'param_1': 'COLUMN_1', 'param_2': 'COLUMN_2', 'param_3': 'COLUMN_3', 'param_4': 'COLUMN_4', 'source_model_name': 'v_stg_orders'}
08:56:48  param_1: COLUMN_1
08:56:48  param_2: COLUMN_2
08:56:48  param_3: COLUMN_3
08:56:48  param_4: COLUMN_4
08:56:48  source_model_name: v_stg_orders
08:56:52  1 of 1 OK created table model ALEX_HIGGS.my_meta_table.......................... [SUCCESS 1 in 3.66s]
08:56:52  
08:56:52  Finished running 1 table model in 5.63s.
08:56:52  
08:56:52  Completed successfully
08:56:52  
08:56:52  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Compiled SQL:

-- depends_on: SANDBOX.ALEX_HIGGS.v_stg_orders



SELECT 1 AS "COLUMN_1"
       , 2 AS "COLUMN_2"
       , 3 AS "COLUMN_3"
       , 4 AS "COLUMN_4"

FROM SANDBOX.ALEX_HIGGS.v_stg_orders

rumbin commented Aug 2, 2022

I can only second this feature request.

We are looking for a solution which would allow us to reuse as much as possible from existing marts' models for new marts that are schematically identical but differ only in some filters or source model names. Thus, we seek to inject the mart's name into the models and into the ref() functions therein.

However, extracting the schema name of the current model is harder than I had expected:

  • If we try to extract it from {{ this }}, we end up with the (dummy) schema name configured for the current user instead of the schema name configured for the current folder (via dbt_project.yml)
  • The {{ node.name }} property does not seem to be accessible from within the model itself, so we cannot extract the schema name from the model's name (delimited by double underscores in our case)
  • We can extract the schema name via {% set mart_id = config.get('schema') %}, however, the mart_id variable then cannot be used further below in ref() functions, apparently. In this case, mart_id will be resolved to an empty string. I suppose that this is an issue with the execution order of the Jinja code.

I am mentioning this use case and the not-working solutions at this point, as I feel that the proposed solutions above fail for the same reason.

On a side note, I also want to mention that folder-level variables might serve some of the abovementioned use cases, as long as they would be available at compile time: #4938 (comment)

@lostmygithubaccount lostmygithubaccount self-assigned this Aug 2, 2022
@lostmygithubaccount lostmygithubaccount removed their assignment Aug 10, 2022
github-actions bot commented Feb 7, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Feb 7, 2023
rumbin commented Feb 7, 2023

This is too useful to be closed due to staleness...

DVAlexHiggs (Author) commented:

Yes this should remain open

@github-actions github-actions bot removed the stale Issues that have gone stale label Feb 8, 2023
Turbine1991 commented:

I'm confused as to why schema files are parsed after the model.

Furthermore, I can print the variable, just not use it. Which also feels odd.

jolsby commented Aug 10, 2023

This would be a powerful addition to dbt, and be highly valuable. ⬆

datacom-bozhu commented:

Voting for this as well

kylienhu commented:

Voting for this

darkcofy commented:

voting for this

github-actions bot commented:

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Sep 23, 2024
github-actions bot commented:

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 30, 2024
fredrikhgrelland commented:

I do not believe this should be closed.

MinhCT commented Nov 8, 2024

Upvote this one!
