
Support composite key (multiple columns) when merging an increment #238

Closed
JCZuurmond opened this issue Oct 22, 2021 · 3 comments
Labels
good_first_issue (Good for newcomers), type:enhancement (New feature or request)

Comments

@JCZuurmond
Collaborator

Describe the feature

Support a composite key that results in a join on multiple columns when upserting an increment.
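
For illustration, the merge this feature should generate looks roughly like the following (table and column names are made up):

merge into analytics.orders as DBT_INTERNAL_DEST
    using orders__dbt_tmp as DBT_INTERNAL_SOURCE
    on DBT_INTERNAL_SOURCE.order_id = DBT_INTERNAL_DEST.order_id
        and DBT_INTERNAL_SOURCE.order_date = DBT_INTERNAL_DEST.order_date
    when matched then update set *
    when not matched then insert *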

Describe alternatives you've considered

The dbt docs recommend creating a new column that combines the key columns. I do not like adding a column just for that purpose. Moreover, I expect it is slower than having a composite key.
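
For reference, that workaround looks roughly like this (a minimal sketch; the model and column names are hypothetical):

-- models/orders.sql
{{ config(materialized='incremental', unique_key='order_key') }}

select
    -- surrogate key combining the two real key columns
    order_id || '-' || cast(order_date as string) as order_key,
    *
from {{ ref('stg_orders') }}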

Additional context

This PR in dbt-core is about allowing unique_key to be a list.

Who will this benefit?

Me. I think it is "nicer" not to have to combine your composite key into a single column just to make a join on multiple columns work. I expect it will be faster too; however, I did not benchmark it.

Are you interested in contributing this feature?

This is the macro I use currently:

{% macro spark__get_merge_sql(target, source, unique_key, dest_columns, predicates=none) %}
  {# skip dest_columns, use merge_update_columns config if provided, otherwise use "*" #}
  {%- set update_columns = config.get("merge_update_columns") -%}

  {% set merge_condition %}
    {% if unique_key is not none %}
      {%- if unique_key is string -%}
        {%- set unique_key = [unique_key] -%}
      {%- endif -%}
      on
      {%- for column in unique_key %}
        DBT_INTERNAL_SOURCE.{{ column }} = DBT_INTERNAL_DEST.{{ column }}
        {%- if not loop.last %} and {%- endif %}
      {%- endfor %}
    {% else %}
      on false
    {% endif %}
  {% endset %}

  merge into {{ target }} as DBT_INTERNAL_DEST
    using {{ source.include(schema=false) }} as DBT_INTERNAL_SOURCE

    {{ merge_condition }}

    when matched then update set
      {% if update_columns -%}{%- for column_name in update_columns %}
        {{ column_name }} = DBT_INTERNAL_SOURCE.{{ column_name }}
        {%- if not loop.last %}, {%- endif %}
      {%- endfor %}
      {%- else %} * {% endif %}

    when not matched then insert *
{% endmacro %}
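
With this override in the project's macros/ directory, an incremental model can then pass a list directly. A hypothetical model config, for illustration:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='delta',
    unique_key=['order_id', 'order_date']
) }}

select * from {{ ref('stg_orders') }}
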
@jtcohen6
Contributor

jtcohen6 commented Oct 22, 2021

@JCZuurmond I'd welcome this change to dbt-spark, along with parallel changes to dbt-core and any other adapters where we'd need to make the update. Luckily, I think both dbt-bigquery and dbt-snowflake use or shell out to the default get_merge_sql implementation.

I wonder if the merge_condition conditional you have above could even be rolled into its own macro. It could be dispatched, with a reasonable default, from dbt-core:

{% macro get_merge_condition(unique_key) %}
  {% set merge_condition %}
    {% if unique_key is not none %}
      {%- if unique_key is string -%}
        {%- set unique_key = [unique_key] -%}
      {%- endif -%}
      on
      {%- for column in unique_key %}
        DBT_INTERNAL_SOURCE.{{ column }} = DBT_INTERNAL_DEST.{{ column }}
        {%- if not loop.last %} and {%- endif %}
      {%- endfor %}
    {% else %}
      on false
    {% endif %}
  {% endset %}
  {{ return(merge_condition) }}
{% endmacro %}

Then, the only change we'd need in spark__get_merge_sql would be to call that macro.
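
A sketch of what that call site could look like, assuming get_merge_condition is dispatched from dbt-core as proposed:

{% macro spark__get_merge_sql(target, source, unique_key, dest_columns, predicates=none) %}
  {# skip dest_columns, use merge_update_columns config if provided, otherwise use "*" #}
  {%- set update_columns = config.get("merge_update_columns") -%}

  merge into {{ target }} as DBT_INTERNAL_DEST
    using {{ source.include(schema=false) }} as DBT_INTERNAL_SOURCE

    {{ get_merge_condition(unique_key) }}

    when matched then update set
      {% if update_columns -%}{%- for column_name in update_columns %}
        {{ column_name }} = DBT_INTERNAL_SOURCE.{{ column_name }}
        {%- if not loop.last %}, {%- endif %}
      {%- endfor %}
      {%- else %} * {% endif %}

    when not matched then insert *
{% endmacro %}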

Additionally, I'm all about adding unique_key as an actual validated config in dbt-core, and having that config accept either a string or a list, as described in dbt-labs/dbt-core#2479 (comment).

So: I think a PR in dbt-core, resolving #2479, is the right way to start. I see you've already jumped onto that thread, so I'll say no more :)

jtcohen6 added the good_first_issue label and removed the triage:product label on Oct 22, 2021
@JCZuurmond
Collaborator Author

PR and issue in dbt-core that make unique_key a list

@jtcohen6
Contributor

jtcohen6 commented Feb 7, 2022

I forgot we had this issue open! Going to close this one in favor of #282, which will have us make the same change to spark__get_merge_sql and add a test for it.

jtcohen6 closed this as completed Feb 7, 2022