[ADAP-614] [Bug] unique_key config for snapshots using hudi file_format #801

darcyMaJiyue · 2022-10-28T03:04:21Z

Contributions

I have read the contribution docs, and understand what's expected of me.

What page(s) or areas on docs.getdbt.com are affected?

What changes are you suggesting?

I think that for a snapshot Hudi table, the primaryKey should be ‘dbt_scd_id’, which is the column used in the ‘merge into ... on’ statement generated by the ‘dbt snapshot’ command. The dbt_scd_id column is derived from the primary key of the source table and, when the timestamp strategy is used, from the column referenced by updated_at.

But currently the primaryKey of the snapshot CTAS Hudi table, which is created the first time we run the ‘dbt snapshot’ command, is the same as the primary key of the source table.

In my experiment, I changed two parts of the SQL generated by dbt. The first is the primaryKey of the snapshot CTAS Hudi table, which now uses ‘dbt_scd_id’. The second is the ‘merge into ... insert’ statement, which now uses ‘insert [referenced columns]’ instead of ‘insert *’. After these changes, I get an SCD2 table stored in Hudi.
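To make the concern above concrete, here is a minimal Python sketch of why the choice of record key matters for an SCD2 table. It is a toy model, not dbt or Hudi code: the `scd_id` and `upsert` helpers are hypothetical, the hash expression is only an assumption about how dbt_scd_id combines the source key with updated_at, and the upsert simply lets a later row overwrite an earlier row with the same key.

```python
import hashlib


def scd_id(primary_key: str, updated_at: str) -> str:
    """Hypothetical stand-in for dbt_scd_id: a hash of the key and updated_at.

    Assumption: the exact expression dbt uses may differ; only the idea that
    the value changes whenever updated_at changes matters here.
    """
    return hashlib.md5(f"{primary_key}|{updated_at}".encode()).hexdigest()


def upsert(table: dict, rows: list, key_column: str) -> dict:
    """Toy upsert: a later row with the same record key overwrites the earlier one."""
    for row in rows:
        table[row[key_column]] = row
    return table


# Two versions of the same source row (illustrative data).
versions = [
    {"id": "1", "status": "placed", "updated_at": "2022-10-01"},
    {"id": "1", "status": "shipped", "updated_at": "2022-10-02"},
]
for row in versions:
    row["dbt_scd_id"] = scd_id(row["id"], row["updated_at"])

keyed_on_id = upsert({}, versions, "id")
keyed_on_scd_id = upsert({}, versions, "dbt_scd_id")

print(len(keyed_on_id))      # 1: history collapses to the latest version
print(len(keyed_on_scd_id))  # 2: both versions survive
```

In this model, keying on the source primary key keeps only the latest version of each row, while keying on dbt_scd_id keeps one row per version, which is what an SCD2 table needs.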

Additional information

In my solution, I copied the files include/spark/macros/adapters.sql and include/spark/macros/materializations/snapshot.sql to the macro directory in my own project.

Then I changed the relevant code, and that works.

runleonarun · 2023-06-09T00:42:29Z

Hi @darcyMaJiyue, thank you so much for opening this issue! We will try to update the docs as soon as I have a clear understanding of what to update.

@dbeatty10 Can you help me understand what the change to the docs should be?

dbeatty10 · 2023-06-09T14:17:13Z

@runleonarun I can't tell if this is a request to update the docs or a request to update the implementation in dbt-spark.

It looks more like the latter, so I'm going to transfer this issue to dbt-spark -- we can always transfer it back again if needed 😎

dbeatty10 · 2023-06-09T14:23:10Z

@darcyMaJiyue which of these two are you requesting / suggesting?

1. update the implementation of spark__options_clause and/or spark__snapshot_merge_sql in dbt-spark
2. update the docs for unique_key

My colleague @dataders will help determine next steps either way.

darcyMaJiyue · 2023-06-12T04:10:58Z

Oh, it's been a long time. @dbeatty10, thanks for your response. I'm suggesting the first one.

github-actions · 2023-12-10T01:47:28Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions · 2023-12-18T01:45:21Z

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

darcyMaJiyue · 2023-12-18T03:14:28Z

1. update the implementation of spark__options_clause


dbeatty10 · 2023-12-18T15:58:09Z

Sorry for the long delay, @darcyMaJiyue.

Could you share a reproducible example?

Could you provide an example snapshots SQL + config so that we can try to reproduce the issue that you are seeing?

We'd be looking for two things:

1. Code of your snapshots SQL + config
2. Commands to execute

We'd need both in order to add new tests to dbt-spark that trigger this error so that we can be sure any fix works as we expect.

Example

Here's what those two things should look similar to:

{% snapshot orders_snapshot %}

{{
    config(
      target_database='analytics',
      target_schema='snapshots',
      unique_key='id',
      file_format='hudi',

      strategy='timestamp',
      updated_at='updated_at',
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}

And then also whichever command(s) trigger the error. So maybe in your case running the snapshot twice in a row causes the issue?

dbt snapshot -s orders_snapshot
dbt snapshot -s orders_snapshot

Could you confirm the necessary code changes?

To be clear, is this the only change you are proposing? i.e., remove these lines from spark__options_clause:

9,10d8
<     {%- elif options is not none and 'primaryKey' in options and options['primaryKey'] != unique_key -%}
<       {{ exceptions.raise_compiler_error("unique_key and options('primaryKey') should be the same column(s).") }}

Or is there some change needed in spark__snapshot_merge_sql, e.g.

2a3,4
>     {%- set insert_cols_csv = insert_cols | join(', ') -%}
>   
14c16,17
<         then insert *
---
>         then insert ({{ insert_cols_csv }})
>         values ({{ insert_cols_csv }})
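
For readers less familiar with Jinja, the second diff can be sketched in Python as follows. This is only an illustration of the string that the proposed change would render, not dbt-spark code: `render_insert_clause` is a hypothetical helper, and the column list is made up for the example.

```python
from typing import Sequence


def render_insert_clause(insert_cols: Sequence[str]) -> str:
    """Render an explicit-column insert clause (hypothetical helper, not dbt code)."""
    # Plays the same role as `insert_cols | join(', ')` in the Jinja diff.
    insert_cols_csv = ", ".join(insert_cols)
    return f"then insert ({insert_cols_csv})\nvalues ({insert_cols_csv})"


# Illustrative column list; a real snapshot would pass its own columns.
cols = ["id", "updated_at", "dbt_scd_id", "dbt_valid_from", "dbt_valid_to"]
clause = render_insert_clause(cols)
print(clause)
```

The rendered clause names every inserted column explicitly, in place of the `insert *` shorthand in the current macro.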

dbeatty10 · 2023-12-18T16:03:13Z

P.S. Support for Hudi was originally added in #210, which gives an indication of the testing that was included. I didn't notice any new tests related to snapshots, so that could explain why you are seeing unexpected behavior.

We would need to add relevant tests for snapshots for hudi in order to resolve this issue.

github-actions · 2024-03-18T01:43:28Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions · 2024-03-25T01:43:57Z

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.


dbeatty10 transferred this issue from dbt-labs/docs.getdbt.com Jun 9, 2023

github-actions bot changed the title ~~unique_key config for snapshots~~ [ADAP-614] unique_key config for snapshots Jun 9, 2023

dbeatty10 added bug Something isn't working triage labels Jun 9, 2023

dbeatty10 changed the title ~~[ADAP-614] unique_key config for snapshots~~ [ADAP-614] [Bug] unique_key config for snapshots Jun 9, 2023

dbeatty10 added awaiting_response and removed triage labels Jun 9, 2023

github-actions bot added triage and removed awaiting_response labels Jun 12, 2023

github-actions bot added the Stale label Dec 10, 2023

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 18, 2023

dbeatty10 changed the title ~~[ADAP-614] [Bug] unique_key config for snapshots~~ [ADAP-614] [Bug] unique_key config for snapshots using hudi file_format Dec 18, 2023

dbeatty10 reopened this Dec 18, 2023

dbeatty10 added awaiting_response and removed triage Stale labels Dec 18, 2023

github-actions bot added triage and removed awaiting_response labels Dec 18, 2023

dbeatty10 added snapshots Issues related to dbt's snapshot functionality awaiting_response and removed triage labels Dec 18, 2023

github-actions bot added the Stale label Mar 18, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 25, 2024
