-
Notifications
You must be signed in to change notification settings - Fork 504
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update BQ deduplicate macro to support partition pruning downstream #929
Open
austinclooney
wants to merge
1
commit into
dbt-labs:main
Choose a base branch
from
austinclooney:bq-dedup-prune
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising this PR @austinclooney !
I'd prefer not to introduce a new parameter if we can avoid it.
BigQuery supports the
qualify
clause. Is there any reason we shouldn't / couldn't just usequalify
in the same way as Snowflake and Databricks?The original implementation for BigQuery was for performance reasons. But I'm not sure if those performance reasons still exist or if working for views outweighs those considerations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my knowledge the memory issue still exists, yes. Deduplicating with
array_agg
uses fewer resources than withqualify
which allows it to be done on larger datasets before hitting memory issues. The reason I added a parameter here is because explicitly selecting thepartition_by
by default will change the column order in the output so I think it would technically be a breaking change?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are several different scenarios to consider for the relation being deduplicated:
What would be the consequences if we did not add a new parameter named
partition_by_pass_thru
and just treated it as always beingtrue
instead?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the only difference between the current method and the changes added in this PR is this PR would move the
partition_by
columns to the end of the output (or the beginning if preferred) and the current output retains the original column order.Since we're only explicitly selecting the partitioning keys it won't affect the actual deduplication as those are the groups we're deduplicating within.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Edit: I just re-read the prior message and update the question below]
So the net consequence would be the following?
And these consequences would equally apply to all of tables, views, materialized views, and CTE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you think of any way to do the following?
If so, that would allow us to ship these performance improvements without any breaking changes or new parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've thought about this for a while but I don't believe it's possible without running another query since the input to the macro can be a CTE.
Something like this would work:
But I don't think we'd want to run an external query to get the macro to work.