Update BQ deduplicate macro to support partition pruning downstream #929

Open · wants to merge 1 commit into base: main
11 changes: 9 additions & 2 deletions macros/sql/deduplicate.sql
@@ -93,12 +93,19 @@
-- clause in BigQuery:
-- https://github.com/dbt-labs/dbt-utils/issues/335#issuecomment-788157572
#}
-{%- macro bigquery__deduplicate(relation, partition_by, order_by) -%}
+{%- macro bigquery__deduplicate(relation, partition_by, order_by, partition_by_pass_thru=false) -%}
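For context, a model invoking the macro with the new flag might look like the following. This is a hypothetical call: the model name and columns are made up, and it assumes the top-level dbt_utils.deduplicate dispatcher is also updated to accept and forward partition_by_pass_thru.

```sql
-- Hypothetical usage; assumes the dispatching deduplicate macro
-- forwards partition_by_pass_thru to the BigQuery implementation.
{{ dbt_utils.deduplicate(
    relation=ref('stg_events'),
    partition_by='event_date',
    order_by='loaded_at desc',
    partition_by_pass_thru=true
) }}
```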
Contributor:

Thanks for raising this PR, @austinclooney!

I'd prefer not to introduce a new parameter if we can avoid it.

BigQuery supports the qualify clause. Is there any reason we shouldn't / couldn't just use qualify in the same way as Snowflake and Databricks?

The original implementation for BigQuery was for performance reasons. But I'm not sure if those performance reasons still exist or if working for views outweighs those considerations.
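For reference, the qualify-based approach used for Snowflake and Databricks looks roughly like this (a sketch of the pattern, not the exact macro source):

```sql
select *
from {{ relation }}
qualify row_number() over (
    partition by {{ partition_by }}
    order by {{ order_by }}
) = 1
```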

Author:

To my knowledge the memory issue still exists, yes. Deduplicating with array_agg uses fewer resources than qualify, which lets it run on larger datasets before hitting memory limits. The reason I added a parameter here is that explicitly selecting the partition_by columns by default would change the column order in the output, so I think it would technically be a breaking change.
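As a pure-Python illustration (hypothetical names, just to make the semantics concrete), the array_agg(... order by ... limit 1)[offset(0)] pattern keeps exactly one row per partition key, chosen by the ordering:

```python
# Hypothetical illustration of the array_agg dedup semantics:
# keep exactly one row per partition key, the one that sorts first.
def deduplicate(rows, partition_by, sort_key):
    best = {}
    for row in rows:
        key = tuple(row[k] for k in partition_by)
        # keep the row that sorts first within its partition
        if key not in best or sort_key(row) < sort_key(best[key]):
            best[key] = row
    return list(best.values())

rows = [
    {"id": 1, "loaded_at": "2024-01-02"},
    {"id": 1, "loaded_at": "2024-01-01"},
    {"id": 2, "loaded_at": "2024-01-05"},
]
# one surviving row per id, keeping the earliest loaded_at
deduped = deduplicate(rows, ["id"], lambda r: r["loaded_at"])
```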

Contributor:

There are several different scenarios to consider for the relation being deduplicated:

  • table
  • view
  • materialized view
  • CTE

What would be the consequences if we did not add a new parameter named partition_by_pass_thru and just treated it as always being true instead?

Author:

I believe the only difference between the current method and the changes added in this PR is this PR would move the partition_by columns to the end of the output (or the beginning if preferred) and the current output retains the original column order.

Since we're only explicitly selecting the partitioning keys, this won't affect the actual deduplication, as those are the groups we're deduplicating within.
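To make the column-order point concrete, here is a small Python sketch (hypothetical, not part of the macro) of what selecting unique.* except(partition_by) followed by partition_by does to the output column order:

```python
# Mimics: select unique.* except(<keys>), <keys>
# Same columns as before, but the partitioning keys move to the end.
def pass_thru_column_order(columns, partition_keys):
    keys = set(partition_keys)
    return [c for c in columns if c not in keys] + list(partition_keys)

new_order = pass_thru_column_order(
    ["user_id", "event_date", "value"], ["event_date"]
)
# same set of columns, with event_date moved to the end
```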

Contributor (@dbeatty10, Jul 2, 2024):

[Edit: I just re-read the prior message and updated the question below.]

So the net consequence would be the following?

  • partition pruning allowing faster queries for less cost
  • output relation has the same columns but possibly in a different order than before

And would these consequences apply equally to tables, views, materialized views, and CTEs?

Author:

Yes

Contributor:

Can you think of any way to do the following?

  • output relation with the same columns in the same order as before

If so, that would allow us to ship these performance improvements without any breaking changes or new parameters.

Author:

I've thought about this for a while, but I don't believe it's possible without running another query, since the input to the macro can be a CTE.

Something like this would work:

{%- macro bigquery__deduplicate(relation, partition_by, order_by) -%}
    {%- set columns_query -%}
        select * from {{ relation }} limit 1
    {%- endset -%}

    {%- set columns_result = run_query(columns_query) -%}
    {%- set columns = columns_result.column_names -%}

    select
        {%- for column in columns %}
            unique.{{ column }}{%- if not loop.last %},{% endif %}
        {%- endfor %}
    from (
        select
            {{ partition_by }},
            array_agg(
                original
                order by {{ order_by }}
                limit 1
            )[offset(0)] unique
        from {{ relation }} original
        group by {{ partition_by }}
    )
{%- endmacro -%}

But I don't think we'd want to run an external query to get the macro to work.
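One partial alternative would be dbt's adapter.get_columns_in_relation, which reads column metadata rather than running a full query against the data. But, as noted above, it only works for real database relations, not CTEs, so it wouldn't cover the general case. A sketch of what that would look like:

```sql
{#- Sketch only: works when relation is a table or view,
    but fails when relation is a CTE -#}
{%- set columns = adapter.get_columns_in_relation(relation) -%}
select
    {%- for column in columns %}
    unique.{{ column.name }}{%- if not loop.last %},{% endif %}
    {%- endfor %}
from ( ... )
```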


 select unique.*
+    {%- if partition_by_pass_thru %}
+    except({{ partition_by }}),
+    {{ partition_by }}
+    {%- endif %}
 from (
     select
+        {%- if partition_by_pass_thru %}
+        {{ partition_by }},
+        {%- endif %}
-        array_agg (
+        array_agg(
             original
             order by {{ order_by }}
             limit 1
Expand Down
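To illustrate what the pass-through renders to: with partition_by='user_id', order_by='loaded_at desc', and partition_by_pass_thru=true, the generated SQL would look roughly like the following (the table and column names are made up for illustration):

```sql
select unique.*
    except(user_id),
    user_id
from (
    select
        -- selecting user_id directly (rather than pulling it out of the
        -- struct) is what lets BigQuery prune partitions/clusters when a
        -- downstream query filters on user_id
        user_id,
        array_agg(
            original
            order by loaded_at desc
            limit 1
        )[offset(0)] unique
    from `project.dataset.events` original
    group by user_id
)
```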