[CT-1617] Materialization to create external tables (Redshift first) #24
To take it a step beyond a simple table: for cluster management in Redshift, you may want to "age" data as part of an incremental model (data that becomes read-only but is still relevant to the model). This could help with full refreshes as well as storage management and cost control, since offloading the data to S3 and leveraging Spectrum can be more cost-effective.
Thanks for opening @matt-winkler! In my view, there's a crucial distinction here between "read-only" and "write-read" external tables—sources and sinks, if you will. I believe Redshift/Spectrum is unique in its support of writing query results out to external tables. A lot of my thinking about external tables to date has centered on the read-only case. So: I think it's fair to treat the write-read case as a Redshift-first feature.
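(For illustration, the "sink" capability on Redshift is `CREATE EXTERNAL TABLE ... AS`, which writes query results out to S3; all schema, bucket, and table names below are illustrative.)

```sql
-- The "write-read" case: Redshift persists the query result to S3
-- as Parquet and registers it as a Spectrum external table.
create external table spectrum_schema.daily_events
stored as parquet
location 's3://my-bucket/daily_events/'
as
select event_date, count(*) as event_count
from analytics.events
group by event_date;
```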
Pretty darn cool!

Next steps

I think the right move here is to experiment with this in a local project, and see if we can develop something robust enough to add into the package. (Of course, there might be a few rough edges along the way, such as the bug reported in https://github.com/dbt-labs/dbt/issues/2289. That's a dbt-core bug, though.)

There are two big challenges I can foresee, both relating to implementation details of Redshift/Spectrum.
I'd be remiss not to ask, as Bruno did in #19: is this entire workflow better accomplished by using a `run-operation`?
@gboro54 If I understand your comment correctly, you'd really be interested in the following capabilities:
Interested in your thoughts!
@matt-winkler - apologies for the delay in response. To answer your questions:
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
@jtcohen6 @matt-winkler my team is investigating building this feature for our own needs, and would be interested in learning whether we could contribute it back. I've done some digging and experimentation on approach, isolated a few definite next steps, and narrowed the final approach down to two primary options. Would you mind reviewing and weighing in on (a) which option you suggest and (b) whether you'd be open to this getting merged in?

Motivation
Context

Sharing to make sure I'm understanding correctly:

dbt materializations
"Materializations generally take the following shape:"
Naive table materialization

A (hypothetical) naive table materialization might look like:
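(A minimal sketch in dbt's materialization syntax; hook handling, grants, and quoting are elided.)

```jinja
{% materialization naive_table, adapter='redshift' %}
  {%- set target_relation = this -%}

  {# Tear down whatever is there, then rebuild from the model's SQL. #}
  {% call statement('main') -%}
    drop table if exists {{ target_relation }};
    create table {{ target_relation }} as (
      {{ sql }}
    );
  {%- endcall %}

  {{ return({'relations': [target_relation]}) }}
{% endmaterialization %}
```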
However, this would have some problems: there's a window between the drop and the create (the full duration of the rebuild) during which the table doesn't exist for readers, and a failure partway through leaves you with no table at all.
Pseudo-transactional table materialization

(Hypothetical) improvement:
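(Again a sketch; `auto_begin=False` keeps dbt's statement helper from opening its own transaction, since we manage one explicitly here.)

```jinja
{% materialization pseudo_transactional_table, adapter='redshift' %}
  {%- set target_relation = this -%}

  {# Atomic drop-and-rebuild: readers see either the old table or the
     new one, and a failure rolls back to the prior state. #}
  {% call statement('main', auto_begin=False) -%}
    begin;
    drop table if exists {{ target_relation }};
    create table {{ target_relation }} as (
      {{ sql }}
    );
    commit;
  {%- endcall %}

  {{ return({'relations': [target_relation]}) }}
{% endmaterialization %}
```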
This would resolve the first two problems: the drop and rebuild happen atomically, so readers never observe a missing table, and a failure rolls back to the prior state. But it introduces one of its own: the drop takes an exclusive lock inside the transaction, so concurrent readers block for the full duration of the (potentially long) rebuild.
Current actual table materialization

According to my reading of the source, the built-in table materialization as implemented today essentially does the following:
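(Approximately the following SQL; relation names are illustrative, but the suffixes match dbt's `__dbt_tmp` / `__dbt_backup` convention.)

```sql
-- Build the new data under a temporary name first:
create table analytics.my_model__dbt_tmp as (
  select * from analytics.stg_events  -- the model's SQL goes here
);

-- Then swap, inside one transaction, via cheap metadata-only renames:
alter table analytics.my_model rename to my_model__dbt_backup;
alter table analytics.my_model__dbt_tmp rename to my_model;

-- Finally, drop the old data:
drop table if exists analytics.my_model__dbt_backup;
```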
This addresses all three problems: the old table stays queryable right up to the swap, a failed build leaves it untouched, and the renames are metadata-only, so the exclusive lock is held only briefly.
Transactional "external table" materialization?
Unfortunately, this doesn't work, because of Redshift/Spectrum limitations: external-table DDL can't run inside a transaction block, and external tables can't be renamed, which rules out the built-in materialization's swap trick.
What options do we have?
Hey all - apologies for the big wall-o'-text in that last comment. I'm very interested in moving this forward and would love any feedback on approach before we start down an implementation path. Any questions I can answer about it?
Our current approach to deciding if/how to solve these issues is:

Can this problem be solved in user land? Is there a decent workaround?

Here it seems like yes, as demonstrated in #175:
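(The general shape of such a workaround is a post-hook on a regular table model; in this sketch `export_to_external_table` is a hypothetical macro name, not necessarily what #175 uses.)

```sql
-- models/my_model.sql
-- Materialize normally, then mirror the result out to an external
-- table in a post-hook. The macro name here is hypothetical.
{{ config(
    materialized='table',
    post_hook="{{ export_to_external_table(this) }}"
) }}

select * from {{ ref('stg_events') }}
```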
That's not the most seamless experience, but it works.

If not, can this problem be solved via a contribution?

You can create your own materialization, as has been discussed in this thread. Now we need to look again at the 3 objections you brought up, @vergenzt, at the end of the immensely useful "wall of text" above. Are they absolutely blocking? What can we do about them?

If not, then maybe dbt Labs should solve it

Which we are considering, not for its own sake but to support longer-term plans toward cross-warehouse projects for 1.5+ (we will need all sorts of external tables for that). There's a lot of hypothetical here, and we would need to design a common interface for external tables that works across most warehouses. I'm sure we'll have discussions on the topic when the time comes.
Thanks @Fleid! TL;DR: I do think this is achievable in user space. My current recommendation would be to create a custom Redshift materialization for external tables, along the lines of the sketch below.
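(A rough, illustrative shape only: Redshift won't run external-table DDL inside a transaction block, so each statement is issued without an explicit transaction, and the S3 location and file format are placeholders.)

```jinja
{% materialization external_table, adapter='redshift' %}
  {%- set target_relation = this -%}

  {# No transaction is possible around external-table DDL, so there is
     an unavoidable window where the table does not exist. #}
  {% call statement('drop_existing', auto_begin=False) -%}
    drop table if exists {{ target_relation }}
  {%- endcall %}

  {% call statement('main', auto_begin=False) -%}
    create external table {{ target_relation }}
    stored as parquet
    location 's3://my-bucket/{{ this.identifier }}/'
    as {{ sql }}
  {%- endcall %}

  {{ return({'relations': [target_relation]}) }}
{% endmaterialization %}
```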
Note that the user of this materialization would be responsible for ensuring that their models' materialization type is configured appropriately.

Also with this approach, the external schema has to exist before the dbt run starts (see the sketch after this comment). If it doesn't exist, then as part of task initialization dbt automatically creates it as an on-cluster schema before invoking on-run-start hooks.

We'd definitely need to merge a fix to #17 first, without which existing Redshift external tables aren't detected by dbt when it scans to see which relations are already materialized. (There's already a solution posted in that thread; it just needs to be PR'd and merged into the mainline.)

As an FYI, my team has pivoted away from trying to materialize into external tables from within Redshift, so I likely won't be pushing this further forward at the moment.
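(For reference, pre-creating the external schema on Redshift looks something like this; the catalog database and IAM role are placeholders.)

```sql
-- Run this ahead of `dbt run` (e.g. via a run-operation), since dbt
-- would otherwise auto-create the schema as a regular on-cluster schema.
create external schema if not exists spectrum_schema
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::123456789012:role/my-spectrum-role'
create external database if not exists;
```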
Thanks for the update @vergenzt.
Hi @Fleid! I've since moved teams, so I'm not sure if this is still accurate, but I believe the new plan was to use Databricks/Spark more broadly rather than trying to force external-table materialization into dbt-redshift.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@dataders FYI

@dataders opened a Discussion about adding external table capabilities into dbt Core: https://github.com/dbt-labs/dbt-core/discussions/8617

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Describe the feature
When using Redshift Spectrum, dbt is able to stage external tables stored in S3 and do useful things such as declaring the source data format (csv, json, parquet, etc.). The idea is to create a new materialization option (potentially "external" would work) which would handle persisting query results and enable a few capabilities:

- `CLEAR PATH` to overwrite existing partitions in the lake

Describe alternatives you've considered
The implementation of `create_external_table` here accomplishes this when triggered by a `run-operation`. The goal here is to make that logic a materialization so that it can become part of the `dbt run` pipeline.
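(For context, the `run-operation` flow referenced above is invoked from the CLI; in dbt-external-tables the documented entrypoint is `stage_external_sources`, which calls `create_external_table` for each external source.)

```sh
# Stage all external sources defined in the project's YAML:
dbt run-operation stage_external_sources
```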
Additional context
Believe this is relevant for any of the databases currently supported in the external tables package (Redshift, Snowflake, BigQuery, Spark, and so on).
Who will this benefit?
dbt users who have existing infrastructure that leverages a more data-lake-centric approach to managing persistence will benefit from this. They can use dbt and the warehouse as an ephemeral compute/transform layer, and then persist the data to a file store, which enables other tools (e.g. AWS Glue / Athena) to query the results using existing analytical patterns.
Are you interested in contributing this feature?
Yes.