[ADAP-1058] [Feature] Implement refresh strategy for dynamic tables #859
Comments
Please can you explain this in a little more detail?

I found this issue while looking for a reason why dbt deployments containing chains of dynamic tables take so long to deploy. The problem, as far as I can tell, is that every dynamic table deployment executed by dbt includes an explicit refresh statement. What this means is that, given a DAG of chained dynamic tables, each deployment blocks on its own initial refresh, so the total deployment time is the sum of the individual refresh times along the chain.

My question, I suppose, is: what is the reason that refresh statement is included at all?
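For illustration, the pattern being described: for each dynamic table in a chain, dbt emits both the create statement and a blocking refresh, so every node waits for its own initial population before the next one deploys (database, schema, and table names here are illustrative, not from the original report):

```sql
-- What dbt emits for one dynamic table in the chain (illustrative names):
create or replace dynamic table my_db.my_schema.int_orders
    target_lag = '1 hour'
    warehouse = transform_wh
as
    select * from my_db.my_schema.stg_orders;

-- The blocking refresh that runs as part of each deployment:
alter dynamic table my_db.my_schema.int_orders refresh;
```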
When dynamic tables were first introduced, they did not automatically refresh; we had to run the refresh statement ourselves to populate them. That's why the statement is there in 1.6. Then Snowflake changed dynamic tables to always run the initial refresh automatically, so we should have removed that statement in 1.7. The original behavior would have allowed us to control whether a dynamic table was initially refreshed; the current behavior does not, as Snowflake runs the refresh internally. I suspect at some point Snowflake will make this configurable and we can tie into that.
@mikealfare thanks for explaining. I have upgraded to 1.7 and can see that the refresh statement is no longer present in the deployment scripts.
I have a related recommendation: when we are not creating/replacing a dynamic table, we should add a refresh to ensure that any downstream table materializations and tests query current information during the dbt batch run. I have customers using the new feature, and we discovered that after switching tables to dynamic table materializations, our downstream table models could be querying stale data, because although a downstream dynamic table can trigger an upstream dynamic table, the same is not true for other queries. We have been adding post hooks to trigger the refresh as an interim solution. I welcome feedback if this was intentional, but otherwise I recommend we add two lines to the alter macro:

```sql
{% macro snowflake__get_alter_dynamic_table_as_sql(
    existing_relation,
    configuration_changes,
    target_relation,
    sql
) -%}
    {{- log('Applying ALTER to: ' ~ existing_relation) -}}
    {% if configuration_changes.requires_full_refresh %}
        {{- get_replace_sql(existing_relation, target_relation, sql) -}}
    {% else %}
        {%- set target_lag = configuration_changes.target_lag -%}
        {%- if target_lag -%}{{- log('Applying UPDATE TARGET_LAG to: ' ~ existing_relation) -}}{%- endif -%}
        {%- set snowflake_warehouse = configuration_changes.snowflake_warehouse -%}
        {%- if snowflake_warehouse -%}{{- log('Applying UPDATE WAREHOUSE to: ' ~ existing_relation) -}}{%- endif -%}
        alter dynamic table {{ existing_relation }} set
            {% if target_lag %}target_lag = '{{ target_lag.context }}'{% endif %}
            {% if snowflake_warehouse %}warehouse = {{ snowflake_warehouse.context }}{% endif %}
        ;
        {{ snowflake__refresh_dynamic_table(target_relation) }}
    {%- endif -%}
{%- endmacro %}
```

I also suggest that we should update the `snowflake__dynamic_table_get_build_sql` macro:

```sql
{% macro snowflake__dynamic_table_get_build_sql(existing_relation, target_relation) %}
    {% set full_refresh_mode = should_full_refresh() %}

    -- determine the scenario we're in: create, full_refresh, alter, refresh data
    {% if existing_relation is none %}
        {% set build_sql = get_create_sql(target_relation, sql) %}
    {% elif full_refresh_mode or not existing_relation.is_dynamic_table %}
        {% set build_sql = get_replace_sql(existing_relation, target_relation, sql) %}
    {% else %}

        -- get config options
        {% set on_configuration_change = config.get('on_configuration_change') %}
        {% set configuration_changes = snowflake__get_dynamic_table_configuration_changes(existing_relation, config) %}

        {% if configuration_changes is none %}
            {% set build_sql = snowflake__refresh_dynamic_table(target_relation) %}
            {{ exceptions.warn("No configuration changes were identified on: `" ~ target_relation ~ "`. Refreshing.") }}
        {% elif on_configuration_change == 'apply' %}
            {% set build_sql = snowflake__get_alter_dynamic_table_as_sql(existing_relation, configuration_changes, target_relation, sql) %}
        {% elif on_configuration_change == 'continue' %}
            {% set build_sql = snowflake__refresh_dynamic_table(target_relation) %}
            {{ exceptions.warn("Configuration changes were identified and `on_configuration_change` was set to `continue` for `" ~ target_relation ~ "`. Refreshing.") }}
        {% elif on_configuration_change == 'fail' %}
            {{ exceptions.raise_fail_fast_error("Configuration changes were identified and `on_configuration_change` was set to `fail` for `" ~ target_relation ~ "`") }}
        {% else %}
            -- this only happens if the user provides a value other than `apply`, `continue`, `fail`
            {{ exceptions.raise_compiler_error("Unexpected configuration scenario: `" ~ on_configuration_change ~ "`") }}
        {% endif %}

    {% endif %}

    {% do return(build_sql) %}
{% endmacro %}
```
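The interim post-hook workaround mentioned above might look something like the following sketch (model names, the warehouse, and the lag value are placeholders, not from the original comment):

```sql
-- Hypothetical dynamic table model: the post_hook triggers a manual
-- refresh after the build so that downstream table models and tests
-- querying this relation later in the run see current data.
{{ config(
    materialized = 'dynamic_table',
    target_lag = '1 hour',
    snowflake_warehouse = 'TRANSFORM_WH',
    post_hook = "alter dynamic table {{ this }} refresh"
) }}

select id, amount
from {{ ref('stg_orders') }}
```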
It's now part of the create statement: there is a new INITIALIZE parameter that controls whether the DT should be refreshed on creation or not. Can this be added as a configuration (alongside the lag and warehouse), and any "dbt logic" around the refresh removed? Thanks!
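For reference, Snowflake's DDL for this looks roughly like the following (table, warehouse, and source names are placeholders):

```sql
-- INITIALIZE = ON_CREATE populates the table immediately (the default);
-- INITIALIZE = ON_SCHEDULE defers population to the first scheduled refresh.
create or replace dynamic table my_db.my_schema.my_dynamic_table
    target_lag = '1 hour'
    warehouse = transform_wh
    initialize = on_schedule
as
    select id, amount
    from my_db.my_schema.source_table;
```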
@Fleid I want to bring this to your attention as I slotted this for refinement. Let us know where this falls regarding prioritization. Also, while I was reviewing the docs linked above, I noticed that there's also a refresh mode option now (separate feature).
@tim-sparq Snowflake is changing the behavior of the default refresh mode, and they suggest explicitly setting the refresh mode on all production dynamic tables. Could we add the option to specify the refresh mode as a model configuration?
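As a sketch, such a configuration might look like this; the `refresh_mode` config name is assumed here to mirror Snowflake's REFRESH_MODE parameter, and the model/warehouse names are placeholders:

```sql
{{ config(
    materialized = 'dynamic_table',
    target_lag = '4 hours',
    snowflake_warehouse = 'TRANSFORM_WH',
    refresh_mode = 'INCREMENTAL'  -- assumed config name, mirroring Snowflake's REFRESH_MODE
) }}

select customer_id, sum(amount) as total_amount
from {{ ref('stg_payments') }}
group by customer_id
```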
I put this back into refinement.
@mikealfare I'm happy to close this as completed by #1081. It sounds like the one thing that still cannot be done is creating a DT that does not auto-populate (i.e. without a refresh right away). Folks are welcome to comment on this issue should they still want this.
Is this your first time submitting a feature request?
Describe the feature
Provide an option for analytics engineers to delay the refresh of data until the next scheduled refresh instead of forcing a refresh upon object creation. The primary benefit of this feature is that the entire pipeline can be built and the logic validated without needing to run a potentially expensive query to initially populate the dynamic table. The proposed name for this configuration is `refresh_strategy`, as identified in #603.

Describe alternatives you've considered
No response
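A sketch of how the feature described above might be used; `refresh_strategy` and its values are proposals from this issue, not an existing dbt-snowflake config, and the model names are placeholders:

```sql
{{ config(
    materialized = 'dynamic_table',
    target_lag = '1 hour',
    snowflake_warehouse = 'TRANSFORM_WH',
    refresh_strategy = 'on_schedule'  -- proposed: defer initial population to the next scheduled refresh
) }}

select * from {{ ref('stg_events') }}
```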
Who will this benefit?
This will benefit analytics engineers that heavily use dynamic tables and who want to validate logic in their pipelines without reloading data. This also allows a little more flexibility and control over when data is populated. For example, the engineer may not want to run an expensive query in the middle of the day, but also may not want to wait until overnight to deploy the dynamic table.
Are you interested in contributing this feature?
No response
Anything else?
It's worth noting that Snowflake does not currently support this functionality, making this impossible to implement today. However, if Snowflake were to add it, we could work it into `dbt-snowflake`. This issue is a placeholder for if/when that happens.