Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAP-1058] [Feature] Implement refresh strategy for dynamic tables #859

Closed
3 tasks done
mikealfare opened this issue Dec 2, 2023 · 9 comments
Closed
3 tasks done
Labels

Comments

@mikealfare
Copy link
Contributor

mikealfare commented Dec 2, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-snowflake functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Provide an option for analytic engineers to delay the refresh of data until the next scheduled refresh instead of forcing a refresh upon object creation. The primary benefit of this feature is that the entire pipeline can be built and the logic can be validated with needing to run a potentially expensive query to initially populate the dynamic table. The proposed name for this configuration is refresh_strategy, as identified in #603.

Describe alternatives you've considered

No response

Who will this benefit?

This will benefit analytics engineers that heavily use dynamic tables and who want to validate logic in their pipelines without reloading data. This also allows a little more flexibility and control over when data is populated. For example, the engineer may not want to run an expensive query in the middle of the day, but also may not want to wait until overnight to deploy the dynamic table.

Are you interested in contributing this feature?

No response

Anything else?

It's worth noting that Snowflake does not currently support this functionality, making this impossible. However, if Snowflake were to add this, then we can work that into dbt-snowflake. This issue is a placeholder for if/when that happens.

@mikealfare mikealfare added enhancement New feature or request triage labels Dec 2, 2023
@github-actions github-actions bot changed the title [Feature] Implement refresh strategy for dynamic tables [ADAP-1058] [Feature] Implement refresh strategy for dynamic tables Dec 2, 2023
@tim-sparq
Copy link

@mikealfare

Please can you explain this in a little more detail?

It's worth noting that Snowflake does not currently support this functionality, making this impossible. However, if Snowflake were to add this, then we can work that into dbt-snowflake. This issue is a placeholder for if/when that happens.

I found this issue while looking for a reason why dbt deployments containing chains of dynamic tables are taking so long to deploy.

The problem, as far as I can tell, is that every dynamic table deployment executed by dbt includes an ALTER DYNAMIC TABLE <blah> REFRESH statement after the dynamic table is created.

What this means is that, given a DAG of dynamic tables like this:

A <- B <- C <- D

where <- indicates a SELECT FROM, every deployment of a dynamic table along the DAG requires the refresh of everything before it, i.e. the deployment process proceeds like this:

  1. Deploy A
  2. Refresh A
  3. Deploy B
  4. Refresh A
  5. Refresh B
  6. Deploy C
  7. Refresh A
  8. Refresh B
  9. Refresh C
  10. Deploy D
  11. Refresh A
  12. Refresh B
  13. Refresh C
  14. Refresh D

My question I suppose is, what is the reason that ALTER DYNAMIC TABLE <blah> REFRESH must ALWAYS be included in the dynamic table deployment script?

@mikealfare
Copy link
Contributor Author

@mikealfare

Please can you explain this in a little more detail?

It's worth noting that Snowflake does not currently support this functionality, making this impossible. However, if Snowflake were to add this, then we can work that into dbt-snowflake. This issue is a placeholder for if/when that happens.

I found this issue while looking for a reason why dbt deployments containing chains of dynamic tables are taking so long to deploy.

The problem, as far as I can tell, is that every dynamic table deployment executed by dbt includes an ALTER DYNAMIC TABLE <blah> REFRESH statement after the dynamic table is created.

What this means is that, given a DAG of dynamic tables like this:

A <- B <- C <- D

where <- indicates a SELECT FROM, every deployment of a dynamic table along the DAG requires the refresh of everything before it, i.e. the deployment process proceeds like this:

1. Deploy A

2. Refresh A

3. Deploy B

4. Refresh A

5. Refresh B

6. Deploy C

7. Refresh A

8. Refresh B

9. Refresh C

10. Deploy D

11. Refresh A

12. Refresh B

13. Refresh C

14. Refresh D

My question I suppose is, what is the reason that ALTER DYNAMIC TABLE <blah> REFRESH must ALWAYS be included in the dynamic table deployment script?

When dynamic tables were first introduced, they did not automatically refresh. This required us to run the refresh statement to populate them. That statement should be in 1.6 as a result. Then Snowflake decided to always run the initial refresh automatically. We should have removed that statement in 1.7 as a result. The former would have allowed us to control whether a dynamic table was initially refreshed. The latter does not as Snowflake runs this internally. I suspect at some point Snowflake will make this configurable and we can tie into that.

@tim-sparq
Copy link

@mikealfare
Please can you explain this in a little more detail?

It's worth noting that Snowflake does not currently support this functionality, making this impossible. However, if Snowflake were to add this, then we can work that into dbt-snowflake. This issue is a placeholder for if/when that happens.

I found this issue while looking for a reason why dbt deployments containing chains of dynamic tables are taking so long to deploy.
The problem, as far as I can tell, is that every dynamic table deployment executed by dbt includes an ALTER DYNAMIC TABLE <blah> REFRESH statement after the dynamic table is created.
What this means is that, given a DAG of dynamic tables like this:
A <- B <- C <- D
where <- indicates a SELECT FROM, every deployment of a dynamic table along the DAG requires the refresh of everything before it, i.e. the deployment process proceeds like this:

1. Deploy A

2. Refresh A

3. Deploy B

4. Refresh A

5. Refresh B

6. Deploy C

7. Refresh A

8. Refresh B

9. Refresh C

10. Deploy D

11. Refresh A

12. Refresh B

13. Refresh C

14. Refresh D

My question I suppose is, what is the reason that ALTER DYNAMIC TABLE <blah> REFRESH must ALWAYS be included in the dynamic table deployment script?

When dynamic tables were first introduced, they did not automatically refresh. This required us to run the refresh statement to populate them. That statement should be in 1.6 as a result. Then Snowflake decided to always run the initial refresh automatically. We should have removed that statement in 1.7 as a result. The former would have allowed us to control whether a dynamic table was initially refreshed. The latter does not as Snowflake runs this internally. I suspect at some point Snowflake will make this configurable and we can tie into that.

@mikealfare thanks for explaining - I have upgraded to 1.7 and can see that the refresh statement is no longer present in the deployment scripts

@sfc-gh-dflippo
Copy link

I have a related recommendation that we should add a refresh when we are not creating/replacing a dynamic table to ensure that any downstream table materializations and tests are querying the current information during the dbt batch run.

I have customers using the new feature and we discovered that switching tables to dynamic table materializations, our downstream table models could be querying stale data because although a downstream dynamic table can trigger an upstream dynamic table, the same is not true for other queries. We have been adding post hooks to trigger the refresh as an interim solution.

I welcome feedback if this was intentional but otherwise, I recommend we add two lines to the snowflake__get_alter_dynamic_table_as_sql macro that calls snowflake__refresh_dynamic_table similar to the existing call in the replace.sql:

{% macro snowflake__get_alter_dynamic_table_as_sql(
    existing_relation,
    configuration_changes,
    target_relation,
    sql
) -%}
    {{- log('Applying ALTER to: ' ~ existing_relation) -}}

    {% if configuration_changes.requires_full_refresh %}
        {{- get_replace_sql(existing_relation, target_relation, sql) -}}

    {% else %}

        {%- set target_lag = configuration_changes.target_lag -%}
        {%- if target_lag -%}{{- log('Applying UPDATE TARGET_LAG to: ' ~ existing_relation) -}}{%- endif -%}
        {%- set snowflake_warehouse = configuration_changes.snowflake_warehouse -%}
        {%- if snowflake_warehouse -%}{{- log('Applying UPDATE WAREHOUSE to: ' ~ existing_relation) -}}{%- endif -%}

        alter dynamic table {{ existing_relation }} set
            {% if target_lag %}target_lag = '{{ target_lag.context }}'{% endif %}
            {% if snowflake_warehouse %}warehouse = {{ snowflake_warehouse.context }}{% endif %}
        ;
        {{ snowflake__refresh_dynamic_table(target_relation) }}

    {%- endif -%}

{%- endmacro %}

I also suggest that we should update the snowflake__dynamic_table_get_build_sql macro to perform a refresh as the default when there are no other changes:

{% macro snowflake__dynamic_table_get_build_sql(existing_relation, target_relation) %}

    {% set full_refresh_mode = should_full_refresh() %}

    -- determine the scenario we're in: create, full_refresh, alter, refresh data
    {% if existing_relation is none %}
        {% set build_sql = get_create_sql(target_relation, sql) %}
    {% elif full_refresh_mode or not existing_relation.is_dynamic_table %}
        {% set build_sql = get_replace_sql(existing_relation, target_relation, sql) %}
    {% else %}

        -- get config options
        {% set on_configuration_change = config.get('on_configuration_change') %}
        {% set configuration_changes = snowflake__get_dynamic_table_configuration_changes(existing_relation, config) %}

        {% if configuration_changes is none %}
            {% set build_sql = snowflake__refresh_dynamic_table(target_relation) %}
            {{ exceptions.warn("No configuration changes were identified on: `" ~ target_relation ~ "`. Refreshing.") }}

        {% elif on_configuration_change == 'apply' %}
            {% set build_sql = snowflake__get_alter_dynamic_table_as_sql(existing_relation, configuration_changes, target_relation, sql) %}
        {% elif on_configuration_change == 'continue' %}
            {% set build_sql = snowflake__refresh_dynamic_table(target_relation) %}
            {{ exceptions.warn("Configuration changes were identified and `on_configuration_change` was set to `continue` for `" ~ target_relation ~ "`. Refreshing.") }}
        {% elif on_configuration_change == 'fail' %}
            {{ exceptions.raise_fail_fast_error("Configuration changes were identified and `on_configuration_change` was set to `fail` for `" ~ target_relation ~ "`") }}

        {% else %}
            -- this only happens if the user provides a value other than `apply`, 'continue', 'fail'
            {{ exceptions.raise_compiler_error("Unexpected configuration scenario: `" ~ on_configuration_change ~ "`") }}

        {% endif %}

    {% endif %}

    {% do return(build_sql) %}

{% endmacro %}

@sfc-gh-yostrinsky
Copy link

@tim-sparq @mikealfare

I suspect at some point Snowflake will make this configurable and we can tie into that.

It's now a part of the create statement, there is a new "INITIALIZE" parameter that sets if the DT should be refreshed on creation or not. Can this be added as a configuration (with the lag and warehouse)? and remove any "DBT logic" around the refresh?
BTW, there is also a new parameter to enforce a refresh mode, can this be added as well?
https://docs.snowflake.com/en/sql-reference/sql/create-dynamic-table

Thanks!

@mikealfare
Copy link
Contributor Author

@Fleid I want to bring this to your attention as I slotted this for refinement. Let us know where this falls regarding prioritization. Also, while I was reviewing the docs linked above, I noticed that there's also a refresh mode option now (separate feature).

@yanithx
Copy link

yanithx commented Jun 13, 2024

@tim-sparq Snowflake is changing the behavior of the default refresh mode and they suggest explicitly setting the refresh mode on all production dynamic tables. Could we add the option to specify the refresh_mode on the dynamic table config?

@mikealfare
Copy link
Contributor Author

I put this back into triage because I think it's actually completed. We added refresh_mode as a config, which is equivalent to this feature. I would like Product to chime in (or close this as done) in case I'm missing something.

@dataders
Copy link
Contributor

dataders commented Aug 1, 2024

@mikealfare I'm happy to close this as completed by #1081. sounds like the thing that cannot still be done is to create a DT that does not auto-populate (i.e. without a refresh right away).

Folks are welcome to comment on this issue should they still want this

@dataders dataders closed this as completed Aug 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants