Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAP-614] [Bug] unique_key config for snapshots using hudi file_format #801

Closed
1 task done
darcyMaJiyue opened this issue Oct 28, 2022 · 12 comments
Closed
1 task done
Labels
awaiting_response bug Something isn't working snapshots Issues related to dbt's snapshot functionality Stale

Comments

@darcyMaJiyue
Copy link

Contributions

  • I have read the contribution docs, and understand what's expected of me.

What page(s) or areas on docs.getdbt.com are affected?

unique_key config for snapshots
image

What changes are you suggesting?

I think for the snapshot hudi table, the primaryKey should be ‘dbt_scd_id’, which is used in the ‘merge into on’ statement generated by ‘dbt snapshot’ command. And the dbt_scd_id column is related to the primary key of the source table and the update_at referenced column when we reference to timestamp strategy.

But now the primaryKey of the snapshot CTAS hudi table, which is created as we run the first ‘dbt snapshot’ command, is the same with the primary key of the source table.

In my experiment, I change two parts of the sql generated by the dbt, one is the primaryKey of the snapshot CTAS hudi table using ‘dbt_scd_id’, the other one is the ‘insert [referenced columns]’ in the 'merge into insert ’ statement instead of ‘insert *’. After these changes, I get the scd2 table, stored in hudi.

Additional information

In my solution, I copy the files include/spark/macros/adapters.sql and include/spark/macros/materializations/snapshot.sql to the macro directory in my own project
image
And, I change the related code, that works.

image

image

@runleonarun
Copy link

Hi @darcyMaJiyue thank you so much for opening this issue! We will try to update the docs as soon as I have a clear understanding what to update.

@dbeatty10 Can you help me understand what the change to the docs should be?

@dbeatty10
Copy link
Contributor

@runleonarun I can't tell if this is a request to update the docs or a request to update the implementation in dbt-spark.

It looks more like the latter, so I'm going to transfer this issue to dbt-spark -- we can always transfer it back again if needed 😎

@dbeatty10 dbeatty10 transferred this issue from dbt-labs/docs.getdbt.com Jun 9, 2023
@github-actions github-actions bot changed the title unique_key config for snapshots [ADAP-614] unique_key config for snapshots Jun 9, 2023
@dbeatty10 dbeatty10 added bug Something isn't working triage labels Jun 9, 2023
@dbeatty10 dbeatty10 changed the title [ADAP-614] unique_key config for snapshots [ADAP-614] [Bug] unique_key config for snapshots Jun 9, 2023
@dbeatty10
Copy link
Contributor

@darcyMaJiyue which of these two are you requesting / suggesting?

  1. update the implementation of spark__options_clause and/or spark__snapshot_merge_sql in dbt-spark
  2. update the docs for unique_key

My colleague @dataders will help determine next steps either way.

@darcyMaJiyue
Copy link
Author

oh, It's a long time. @dbeatty10 thanks for your response. I'm suggesting the first one.

Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Dec 10, 2023
Copy link
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 18, 2023
@darcyMaJiyue
Copy link
Author

darcyMaJiyue commented Dec 18, 2023 via email

@dbeatty10 dbeatty10 changed the title [ADAP-614] [Bug] unique_key config for snapshots [ADAP-614] [Bug] unique_key config for snapshots using hudi file_format Dec 18, 2023
@dbeatty10
Copy link
Contributor

Sorry for the long time @darcyMaJiyue.

Could you share reproducible example?

Could you provide an example snapshots SQL + config so that we can try to reproduce the issue that you are seeing?

We'd be looking for two things:

  1. Code of your snapshots SQL + config
  2. Commands to execute

We'd need both in order to add new tests to dbt-spark that trigger this error so that we can be sure any fix works as we expect.

Example

Here's what those two things should look similar to:

{% snapshot orders_snapshot %}

{{
    config(
      target_database='analytics',
      target_schema='snapshots',
      unique_key='id',
      file_format='hudi'

      strategy='timestamp',
      updated_at='updated_at',
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}

And then also whichever command(s) trigger the error. So maybe in your case running the snapshot twice in a row causes the issue?

dbt snapshot -s orders_snapshot
dbt snapshot -s orders_snapshot

Could you confirm the necessary code changes?

To be clear, is this the only change you are proposing? i.e., remove these lines from spark__options_clause:

9,10d8
<     {%- elif options is not none and 'primaryKey' in options and options['primaryKey'] != unique_key -%}
<       {{ exceptions.raise_compiler_error("unique_key and options('primaryKey') should be the same column(s).") }}

Or is there some change needed in spark__snapshot_merge_sql, e.g.

2a3,4
>     {%- set insert_cols_csv = insert_cols join(', ') -%}
>   
14c16,17
<         then insert *
---
>         then insert ({{ insert cols csv }})
>         values ({{ insert_cols_csv}})

@dbeatty10
Copy link
Contributor

p.s. support for hudi was originally added in #210 which gives an indication of the testing that was included. I didn't notice anything new tests related to snapshots, so that could explain if you are seeing unexpected behavior.

We would need to add relevant tests for snapshots for hudi in order to resolve this issue.

@dbeatty10 dbeatty10 added snapshots Issues related to dbt's snapshot functionality awaiting_response and removed triage labels Dec 18, 2023
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Mar 18, 2024
Copy link
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

1 similar comment
Copy link
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 25, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting_response bug Something isn't working snapshots Issues related to dbt's snapshot functionality Stale
Projects
None yet
Development

No branches or pull requests

3 participants