
[CT-1617] Materialization to create external tables (Redshift first) #24

Closed

matt-winkler opened this issue Aug 26, 2021 · 17 comments

@matt-winkler
Contributor

Describe the feature

When using Redshift Spectrum, dbt is able to stage external tables stored in S3 and do useful things such as declaring the source data format (csv, json, parquet, etc.). The idea is to create a new materialization (potentially named "external") that would handle persisting query results and enable a few capabilities (a hypothetical example of such a model follows the list below):

  • Specifying a location to output partitioned results generated by a model's sql
  • Specifying the output format (csv, json, parquet) and compression where applicable
  • Control over the partitioning logic (e.g. use a defined set of columns to create partitions in the lake)
  • For Redshift, ability to CLEAR PATH to overwrite existing partitions in the lake
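For illustration, a model using such a materialization might be configured roughly as follows. This is only a sketch: the materialization name and the location, format, partition_by, and overwrite_partitions config keys are assumptions, not an existing dbt API.

    -- models/fct_orders_external.sql (hypothetical)
    -- None of these config keys exist in dbt today; they only illustrate the idea.
    {{
      config(
        materialized='external',
        location='s3://my-data-lake/analytics/fct_orders/',
        format='parquet',
        partition_by=['order_date'],
        overwrite_partitions=true
      )
    }}

    select
        order_id,
        customer_id,
        order_total,
        order_date
    from {{ ref('stg_orders') }}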

Describe alternatives you've considered

The implementation of create_external_table in the dbt-external-tables package accomplishes this when triggered by a run-operation. The goal here is to make that logic a materialization so that it can become part of the dbt run pipeline.

Additional context

I believe this is relevant for any of the databases currently supported in the dbt-external-tables package:

  • Redshift (Spectrum)
  • Snowflake
  • BigQuery
  • Spark
  • Synapse
  • Azure SQL

Who will this benefit?

dbt users with existing infrastructure that takes a more data-lake-centric approach to managing persistence will benefit from this. They can use dbt and the warehouse as an ephemeral compute / transform layer, and then persist the data to a file store, which enables other tools (e.g. AWS Glue / Athena) to query the results using existing analytical patterns.

Are you interested in contributing this feature?

Yes.

@gboro54

gboro54 commented Aug 31, 2021

To take it a step beyond a simple table: for cluster management in Redshift, you may want to "age" data as part of an incremental model (data that becomes read-only but is still relevant to the model). This could help with full refreshes as well as storage management and cost control, since offloading the data to S3 and leveraging Spectrum can be more cost-effective.

@jtcohen6
Contributor

jtcohen6 commented Sep 2, 2021

Thanks for opening @matt-winkler!

In my view, there's a crucial distinction here between "read-only" and "write-read" external tables—sources and sinks, if you will. I believe Redshift/Spectrum is unique in its support of create external table ... as <sql_query> (docs). The other major data warehouses conceive of external tables only as sources of data in external storage—there's always the option of offloading some data back to external storage via copy or unload, but the two aren't tightly linked. A key outlier here is Apache Spark, where all tables are external tables; some are managed, others have their location specified explicitly (which dbt-spark accomplishes via a location config).

A lot of my thinking about how the dbt-external-tables package should work is based on this "read-only" view, reflected in dbt-labs/dbt-external-tables#1. That issue is a bit lengthy, but it includes the rationale for the decision to treat external tables as sources (metadata pointers only) instead of models (logic-bearing, env-specific, materialized objects). I agree that the lines can be blurry, and there are points in favor of either side. (There's also a bit more of this viewpoint reflected in dbt-labs/dbt-core#3391.)

So: I think it's fair to treat create external table ... as <sql_query> as a pretty special and different thing, which just happens to also be called external table. This version feels much more akin to a model (logic-bearing, materialized object) than a source (pointer):

  • you write to S3 directly from model SQL
  • a ref to that model downstream resolves to its Spectrum location, and the downstream model reads its data from S3 (a sketch of the underlying DDL follows)
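To make that concrete, the Spectrum DDL such a materialization would ultimately run looks something like the following. The schema, bucket, and column names are placeholders; with CREATE EXTERNAL TABLE ... AS, the partition columns must come last in the select list.

    create external table spectrum_schema.my_model
    partitioned by (order_date)
    stored as parquet
    location 's3://my-data-lake/analytics/my_model/'
    as
    select order_id, customer_id, order_total, order_date
    from analytics.stg_orders;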

Pretty darn cool! We opened a good first issue for that idea a long time ago, but we didn't get any immediate takers: https://github.com/dbt-labs/dbt/issues/2516. I'd happily close #19 in favor of this issue if we can get some more traction this time around.

Next steps

I think the right move here is to experiment with this in a local project, and see if we can develop something robust enough to add into dbt-redshift as a built-in materialization. There's nothing stopping someone from implementing their own external_table materialization, just as there's very little standing in the way of a user-defined materialized_view materialization. There's some quite detailed code already in https://github.com/dbt-labs/dbt/issues/2516#issuecomment-654322952.

(Of course, there might be a few rough edges along the way, such as the bug reported in https://github.com/dbt-labs/dbt/issues/2289. That's a good first issue, and a fix for it would be an uncontroversial improvement, one for which I'd happily welcome a PR. Any takers?)

There are two big challenges I can foresee, both relating to implementation details of Redshift/Spectrum:

  • Querying data from Spectrum is quite limited, relative to standard Redshift SQL. If I remember right, there are strict limitations around CTEs and subqueries, and the support for nested data is pretty impossible to use in practice. This isn't a hard blocker to using an external table materialization, but it is a ceiling on what's possible with this workflow.
  • Spectrum external schemas are trickier than normal schemas, and everyone needs to be able to create their own external schema ("To create external tables, you must be the owner of the external schema or a superuser"). Creating them requires more complex DDL than just create schema, including an IAM role or secret keys (a sketch of that DDL follows this list). Should creating an external schema be something dbt tries to do automatically? How can it know whether a schema is meant to be external? (Perhaps, if models with that resolved schema value have materialized: external_table.) Should dbt seek to reuse the iam profile config? Env vars for secrets? I think we can work around this in the meantime via manual run-operations, but it's a question we'd need to answer before making this a built-in.
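For reference, the external-schema DDL in question looks roughly like this; the schema name, catalog database, and IAM role ARN are placeholders.

    create external schema if not exists spectrum_schema
    from data catalog
    database 'spectrum_db'
    iam_role 'arn:aws:iam::123456789012:role/my-spectrum-role'
    create external database if not exists;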

I'd be remiss not to ask, as Bruno did in #19: Is this entire workflow better accomplished by using a dbt-athena adapter (e.g. https://github.com/Tomme/dbt-athena) for heavy-lifting transformations? And, to make the handoff more seamless, building capabilities within dbt that make it possible to run against multiple adapters in the same project / DAG / invocation?

@matt-winkler
Contributor Author

@gboro54 If I understand your comment correctly, you'd really be interested in the following capabilities:

  • Setting an aging policy on the underlying table. So, when an incremental model runs, we age part of the dataset by checking for records older than X date. Thoughts on what this might look like as configured on the incremental (or some new materialization) itself?
  • We likely need to pass information related to Jeremy's comment about authorization and schema configuration for the external table as well
  • Does there need to be a downstream model (likely a view) that unions the "hot" and "cold" slices of the data (see the sketch below)? For reference, this is where I thought there may be some overlap with the lambda views architecture.
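On that last point, a minimal sketch of such a unioning view, assuming a "hot" on-cluster incremental model and a "cold" aged-off external table registered as a source; all of the relation names here are hypothetical.

    -- models/orders_unioned.sql (hypothetical)
    {{ config(materialized='view') }}

    select * from {{ ref('orders_hot') }}                   -- recent slice, on-cluster
    union all
    select * from {{ source('spectrum', 'orders_cold') }}   -- aged slice, external table in S3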

Interested in your thoughts!

@gboro54

gboro54 commented Oct 5, 2021

@matt-winkler - apologies for the delay in response. To answer your questions:

  • Perhaps this should be a filter representing a SQL snippet, to allow for more complex filtering. In terms of it being a new materialization vs. incremental, I'm not sure I have a strong opinion on that.

  • Agreed. Perhaps this can be set in the Redshift profile rather than in the model itself.

  • I would think a view on top of the hot and cold data is desired. This is what I am doing after creating a table for last year's data and an incremental model for this year's data.

@jtcohen6 jtcohen6 transferred this issue from dbt-labs/dbt-core Oct 12, 2021
@jtcohen6 jtcohen6 added the type:enhancement New feature or request label Oct 12, 2021
@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@vergenzt

vergenzt commented May 27, 2022

@jtcohen6 @matt-winkler my team is investigating building this feature for our own needs, and would be interested in learning about whether we could contribute it back. I've done some digging & experimentation on approach, isolated a few definite next steps, & narrowed down the final approach to two primary options. Would you mind reviewing and weighing in on (a) which option you suggest and (b) whether you'd be open to this getting merged in?

Motivation

  • My team wants to store model results outside the database (in S3).
  • dbt, on the other hand, expects to "... [only do] the T in ELT (Extract, Load, Transform) processes – it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse." (What is dbt? | dbt Docs)
  • Redshift lets us pretend that data in S3 is "in the warehouse".
    • But it has some important differences from standard tables.
    • Those differences make it not (yet) compatible with dbt for model output materialization.
    • Can we make it compatible?

Context

Sharing to make sure I'm understanding correctly:

dbt materializations

  • A dbt materialization is a registered^ adapter-specific^ Jinja macro^.

    • ^registered → declared via {% materialization <name>… %} instead of {% macro <name>… %}
    • ^adapter-specific → different database adapters can have different implementations, e.g.
      {% materialization <name>, default %} … {% endmaterialization %}
      {% materialization <name>, adapter='redshift' %} ... {% endmaterialization %}
    • ^macro → compiles down to a function with parameters (incl. "context") & return value
  • Called by dbt run for each model, with a Jinja context that includes (among other things) the model's compiled sql, its config, this, and the adapter object.

"Materializations generally take the following shape:"

  • [1.] Prepare the database for the new model [...]
  • [2.] Execute any sql required to implement the desired materialization [...]
  • [3.] Clean up the database as required [...]

(Source: "Creating new materializations", Anatomy of a materialization, dbt Docs)
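Putting those pieces together, the skeleton of a registered materialization macro looks roughly like this. This is a minimal sketch of the general shape, not the actual built-in table implementation; the name and body are placeholders.

    {% materialization my_materialization, adapter='redshift' %}

      {{ run_hooks(pre_hooks) }}

      {# 1. Prepare the database for the new model (drop leftovers, etc.) #}

      {# 2. Execute the SQL that implements the materialization; the statement
            named 'main' is the one dbt reports on in the run results #}
      {% call statement('main') %}
        create table {{ this }} as ({{ sql }})
      {% endcall %}

      {# 3. Clean up, run hooks, and tell dbt which relations now exist #}
      {{ run_hooks(post_hooks) }}
      {{ return({'relations': [this]}) }}

    {% endmaterialization %}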

Naive table materialization

A (hypothetical) naive table materialization might look like:

  • [1.] Prepare the database for the new model [...]

    drop table if exists [m]

  • [2.] Execute any sql required to implement the desired materialization [...]

    create table [m] as (sql)

  • [3.] Clean up the database as required [...]

    // no-op

However this would have some problems:

  • Prior data unavailable while running CTAS [m].
    (Could be a while if m is big.)
  • Prior data unavailable until manually restored/rerun if CTAS [m] fails.
    (Could be a long while, regardless of size of m.)

Pseudo-transactional table materialization

(Hypothetical) improvement:

  • [1.] Prepare the database for the new model [...]

    drop table if exists [m_tmp] &&
    drop table if exists [m_bak]

  • [2.] Execute any sql required to implement the desired materialization [...]

    create table [m_tmp] as (sql)
    && alter table [m] rename to [m_bak]
    && alter table [m_tmp] rename to [m]

  • [3.] Clean up the database as required [...]

    drop table if exists [m_bak]

This would resolve the first two problems:

  • Prior data only unavailable briefly, for duration of table rename.
  • Prior data remain in place if create table [m_tmp] as (sql) fails.

but introduces one of its own:

  • Leaves garbage behind if create table [m_tmp] as (sql) fails.

Current actual table materialization

According to my reading of the source, the built-in table materialization as implemented today essentially does the following:

  • [1.] Prepare the database for the new model [...]

    drop table if exists [m_tmp] &&
    drop table if exists [m_bak]

  • [2.] Execute any sql required to implement the desired materialization [...]

    In transaction: [
    create table [m_tmp] as (sql)
    && alter table [m] rename to [m_bak]
    && alter table [m_tmp] rename to [m]
    ]

  • [3.] Clean up the database as required [...]

    drop table if exists [m_bak]

This addresses all three problems:

  • Prior data only unavailable briefly, for duration of table rename.
  • Prior data remain in place if create table [m_tmp] as (sql) fails.
  • No garbage left behind if create table [m_tmp] as (sql) fails.

Transactional "external table" materialization?

  • [1.] Prepare the database for the new model [...]

    drop table if exists [m_tmp] && drop table if exists [m_bak]

  • [2.] Execute any sql required to implement the desired materialization [...]

    In transaction: [
    create external table [m_tmp] as (sql)
    && alter external table [m] rename to [m_bak]
    && alter external table [m_tmp] rename to [m]
    ]

  • [3.] Clean up the database as required [...]

    drop table if exists [m_bak]

Unfortunately, this doesn't work because:

  1. attempting to "create external table as" in a transaction leads to ERROR: CREATE EXTERNAL TABLE cannot run inside a transaction block, ruling out fully transactional external table creation; and
  2. attempting to rename an external table in any context leads to ERROR: Unsupported operation. [Cannot rename external tables.], ruling out even pseudo-transactional external table creation.

What options do we have?

Option 0. Naive materialization. Accept availability constraints.

Less than ideal; we would like to find a better way.

Option 1. Pseudo-transactional via rename = CET w/ same config && delete old.

Use the external-table-config-DDL-generation script (seemingly written by someone at AWS) to "rename" external tables by creating an external table with the same config at the new name (not create external table as; just create external table). Manually clean up failures in an on-run-end hook.

  • Pros: No on-cluster storage used.
  • Cons: Seems very fragile. Generated DDL could break. DDL generation script could break. Underlying views could break. Unclear maintenance ownership.

→ We're not comfortable taking on that maintenance burden / risk. Pretty sure another option would be better.

Option 2a. "Transactional", via table materialization + on-run-end hook

Literally configure the models to materialize as table, then add an on-run-end hook to "create external table as select * from {{ model }}" + "drop table {{ model }}" for every model that got built (the per-model statements are sketched below).

Pros: Simple. No change to materializations.

Cons: dbt manifest's recorded model output locations would be inaccurate, which precludes partial rebuilds and inter-project dependencies. Also, the cluster has to have sufficient capacity to store all model results during the run -- though this storage is reclaimed after the on-run-end hook.
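For concreteness, the pair of statements that hook would issue for each model might look like this; the schema names, S3 location, and parquet format are placeholders.

    create external table spectrum_schema.my_model
    stored as parquet
    location 's3://my-data-lake/analytics/my_model/'
    as select * from analytics.my_model;

    drop table analytics.my_model;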

Option 2b. "Transactional", via front-loading option 2a logic into per-model materialization

So the steps would be:

  • [1.] Prepare the database for the new model [...]

    drop table if exists [m_tmp] [in on-cluster schema]

  • [2.] Execute any sql required to implement the desired materialization [...]

    CTAS [m_tmp] [in on-cluster schema]
    && drop table if exists [m] [in external schema]
    && CET [m] [in external schema] as select * from [m_tmp]

  • [3.] Clean up the database as required [...]

    drop table if exists [m_tmp] [from on-cluster schema]

I assume that in order for dbt's manifest to be accurate, the schema returned by generate_schema_name would need to refer to the final destination, correct? In which case, for this option, the "schema" for a model as declared by generate_schema_name should refer to an already-existing external schema. And we would probably specify the name of the temporary on-cluster schema (used for its temporary transactional guarantees) via a model configuration.

Obviously, this option introduces substantial added complexity. But it gets you more "correctness" -- future partial dbt runs would find the outputs in the correct spot, dbt docs' schema names would be accurate, etc.

Option 2c. "Transactional", via 2b but with private per-session temp tables instead of real on-cluster schema?

This option just occurred to me as I was preparing to post this comment! This would obviate the need for the model configuration & having an extra schema lying around.

Are there limitations I'm not aware of that would prevent this approach from working?

Option 3. Pseudo-transactional at schema level

This option would be the most different from what dbt does currently, but it's what I've personally been interested in for a while. Essentially, can we broaden the scope of dbt's "create as temp" + "rename on success" approach to apply at the schema level instead of the individual table level?

Consider the following:

  • on-run-start = create external schema if not exists [s_tmp]
  • materialization impl = naive, into [s_tmp] (won't even have to "drop table [m]" though since s_tmp is fresh for this run!)
  • on-run-end =
    on success → drop schema if exists cascade [s] && rename schema [s_tmp → s]
    on failure → drop schema if exists cascade [s_tmp] (or keep it around for troubleshooting if you want!)

This would require:

  • configurations for S3 location, file type, IAM role, etc. - presumably in model configs?
  • permission to create new external schemas and new Glue catalogs

Things we'll want to do regardless of approach

  1. Add an adapter configuration parameter to allow setting connection.set_session(autocommit=True) in the dbt-postgres connection class? Not sure if that's the right place for this, but that setting would be very helpful so we don't have to constantly "commit;" our way out of psycopg2's auto-started transactions.

    Would you be open to merging something like this?

  2. Fix dbt-labs/dbt#2289. I think I've got a workable solution for this.

  3. Figure out how to clean up prior runs' data in S3. The only SQL-runnable option for doing this that I've found is the CLEANPATH option of Redshift's UNLOAD statement, with which we could unload an empty select statement to clear an S3 prefix (sketched after this list). (Even that leaves a single metadata object in the S3 path, though, so we'd still need to use new prefixes for every run. But at least we wouldn't be leaving whole datasets around to store indefinitely.)

    Alternatively, we could just run a recurring cleanup job out of band from dbt runs.
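For reference, a sketch of that UNLOAD trick; the bucket, prefix, and IAM role are placeholders, and CLEANPATH removes existing files under the TO prefix before unloading.

    unload ('select 1 where false')
    to 's3://my-data-lake/analytics/my_model/'
    iam_role 'arn:aws:iam::123456789012:role/my-redshift-role'
    cleanpath;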

@vergenzt

vergenzt commented Jun 7, 2022

Hey all - apologies for the big wall-o'-text in that last comment. I'm very interested in moving this forward and would love any feedback on approach before we start down an implementation path.

Any questions I can answer about it?

@Fleid
Contributor

Fleid commented Sep 19, 2022

Re-opening as #175 has brought the topic back on top (#175 was closed as a duplicate).

@Fleid Fleid reopened this Sep 19, 2022
@Fleid
Contributor

Fleid commented Sep 19, 2022

Our current approach to decide if/how to solve these issues is:

Can this problem be solved in user land? Is there a decent workaround?

Here it seems like yes, as demonstrated in #175:

"materializing the model as view and then use a post hook to create an external table from the view."

That's not the most seamless experience, but it works.

If not, can this problem be solved via a contribution?

You can create your own materialization, as has been discussed in this thread.

Now we need to look again at the three points you brought up, @vergenzt, at the end of the immensely useful "wall of text" above. Are these absolutely blocking? What can we do about them?

If not, then maybe dbt Labs should solve it

Which we are considering here, not in itself but to support longer-term plans toward cross-warehouse projects for 1.5+ (we will need all sorts of external tables for that).

A lot of hypotheticals here, and we need to think about a common interface for external tables that works across most warehouses. I'm sure we'll have discussions on the topic when the time comes.

@Fleid Fleid self-assigned this Sep 19, 2022
@vergenzt

vergenzt commented Sep 29, 2022

Thanks @Fleid!

TL;DR: I do think this is achievable in user-space. My current recommendation would be to create a custom Redshift external_table materialization that:

  1. throws an error if a model with this materialization has any inside_transaction: true pre- or post-hooks;
  2. runs inside_transaction: false pre-hooks;
  3. runs create view {{ temp_relation }} as {{ sql }} (probably using get_create_view_as_sql and a temp relation that has its database and schema set to empty and its identifier prefixed with # to indicate it's temporary);
  4. runs create external table {{ this }} as select * from {{ temp_relation }}; and finally
  5. runs inside_transaction: false post-hooks (a rough sketch of these steps follows this list).
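A rough Jinja sketch of those steps, glossing over error handling: the location config key and the parquet format are assumptions, step 1's validation is only stubbed out, and whether Redshift accepts #-prefixed view names would need verification.

    {% materialization external_table, adapter='redshift' %}

      {# step 1 (stubbed): raise a compiler error here if any pre- or post-hooks
         are configured with inside_transaction: true #}

      {# a session-local relation: database/schema blanked out, identifier
         prefixed with '#' per step 3 above #}
      {%- set tmp_relation = this.incorporate(path={
            "identifier": "#" ~ this.identifier ~ "__dbt_tmp",
            "schema": none,
            "database": none}) -%}

      {{ run_hooks(pre_hooks, inside_transaction=False) }}

      {# step 3: stage the model's logic as a view #}
      {% call statement('create_tmp_view') %}
        create view {{ tmp_relation }} as ({{ sql }})
      {% endcall %}

      {# step 4: CETAS into the pre-existing external schema; Spectrum expects
         a two-part external_schema.table name #}
      {% call statement('main') %}
        create external table {{ this.schema }}.{{ this.identifier }}
        stored as parquet
        location '{{ config.get("location") }}'
        as select * from {{ tmp_relation }}
      {% endcall %}

      {{ run_hooks(post_hooks, inside_transaction=False) }}
      {{ return({'relations': [this]}) }}

    {% endmaterialization %}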

Note that the user of this materialization would be responsible for ensuring that their models' materialization type (external_table vs any others) matches the type of the schema it's configured to materialize into (i.e. an external schema vs internal schema).

Also, with this approach the external schema has to exist before the dbt run starts. If it doesn't exist, then as part of task initialization dbt automatically creates it as a regular on-cluster schema before invoking on-run-start hooks.

Would definitely need to merge a fix to #17 first, without which existing Redshift external tables aren't detected by dbt when scanning to see what relations are already materialized. (There's already a solution posted in that thread; it just needs to get PR'd and merged into the mainline.)


As an FYI, my team has pivoted away from trying to materialize into external tables from within Redshift, so I likely won't be pushing this further forward at the moment.

@Fleid
Contributor

Fleid commented Nov 15, 2022

Thanks for the update @vergenzt.
Do you mind sharing why you pivoted away from external tables, and toward what alternative? That is, if it's not too specific to your own situation/context. Just wondering if your new approach could be recommended to others :)

@vergenzt

Hi @Fleid! I've since moved teams so not sure if this is still accurate, but I believe the new plan was to use Databricks/Spark more broadly rather than trying to force external table materialization into dbt-redshift.

@Fleid Fleid added the jira label Dec 6, 2022
@github-actions github-actions bot changed the title Materialization to create external tables (Redshift first) [CT-1617] Materialization to create external tables (Redshift first) Dec 6, 2022
@github-actions
Contributor

github-actions bot commented Jun 5, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jun 5, 2023
@Fleid
Contributor

Fleid commented Jun 12, 2023

@dataders FYI

@dbeatty10
Contributor

@dataders opened a Discussion about adding external table capabilities into dbt Core: https://github.com/dbt-labs/dbt-core/discussions/8617

@mikealfare mikealfare added the epic label Feb 6, 2024
@mikealfare mikealfare removed the type:enhancement New feature or request label Feb 7, 2024
Contributor

github-actions bot commented Aug 6, 2024

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 6, 2024
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Aug 13, 2024