-
Notifications
You must be signed in to change notification settings - Fork 235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CT-276] Apache Iceberg Support #294
Comments
It sounds like the three-part namespacing, with catalogs, may be coming to the Apache Spark standard as well: #294 dbt supports three-part namespaces by default, so it's actually much simpler from the standpoint of plugin code. What might be tricky is ongoing conditional support for both 2- or 3-part namespacing, configurable depending on the cluster / file format you're using |
okay so what I understand is that there is no show stoppers with supporting 3-part namespacing. But the adapter should be able to work in either mode based on table format and sql engine. I mostly wanted to know if there would be any objections in adding such support. |
Hi all! I work on Iceberg project and recently came across this issue. I also work very closely with many contributors to the Spark project and occasionally contribute there as well. I'm relatively green to dbt admittedly, but having dbt work properly with Iceberg is important to us, as dbt continues to be a more and more common part of the standard data stack. I can confirm that Spark 3 makes heavy use of catalogs / multipart namespacing and is by all intents "the future" (and really the present) of Spark. If no catalog is declared, spark 3.x uses the implicit I'm going to play around over the weekend with dbt a bit and will follow up after that. 👍 |
Hey @kbendick I've updated my repo with iceberg support. It's incomplete work but feel free to take all the iceberg specific parts. It's in the main branch here https://github.com/cccs-jc/dbt-spark I will focus on support for a pyspark driver means of interfacing with spark. #305 |
[CT-276] Apache Iceberg Support dbt-labs#294 The _schema variable was used for non-iceberg tables but was being overridden by work for iceberg v2 tables. I've made it so the iceberg condition will set _schema rather than blanket changing the schema for all providers.
On second look I wasn't happy with my name choices for macro name and method, hopefully what I have now makes more sense. [CT-276] Apache Iceberg Support dbt-labs#294
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days. |
Hi all! I have been out of office for health reasons but I might have some cycles now to look at this. Is the linked PR still relevant or is there some area specifically I can help out in? |
[CT-276] Apache Iceberg Support dbt-labs#294 The _schema variable was used for non-iceberg tables but was being overridden by work for iceberg v2 tables. I've made it so the iceberg condition will set _schema rather than blanket changing the schema for all providers.
On second look I wasn't happy with my name choices for macro name and method, hopefully what I have now makes more sense. [CT-276] Apache Iceberg Support dbt-labs#294
Hi all, is there any update on this? |
commit 8e1a8f4 Merge: 28909f2 908d9fe Author: dparent1 <[email protected]> Date: Thu Feb 2 13:22:23 2023 -0500 Merge pull request dbt-labs#16 from Fokko/fd-revert-some-changes Revert some stuff commit 908d9fe Merge: b286fd0 f877d1e Author: Fokko Driesprong <[email protected]> Date: Thu Feb 2 10:00:18 2023 -0800 Merge branch 'main' of https://github.com/dbt-labs/dbt-spark into fd-revert-some-changes commit b286fd0 Author: Fokko Driesprong <[email protected]> Date: Wed Feb 1 12:00:21 2023 -0800 Revert some stuff I noticed that two Spark tests are failing, so tried to revert some of the changes that seemed related. They are now passing and also my Iceberg dbt project is still running fine commit 28909f2 Merge: 3ba0a72 3c03d9a Author: Dan Parent <[email protected]> Date: Mon Jan 30 15:05:00 2023 -0500 Merge branch 'main' into dparent1/iceberg - in impl.py changed: List[agate.Row] to AttrDict This fixed the pre-commit error after merging in changes from main. commit 3c03d9a Merge: 10c9dac 4d179e0 Author: dparent1 <[email protected]> Date: Mon Jan 30 14:31:18 2023 -0500 Merge branch 'dbt-labs:main' into main commit 3ba0a72 Merge: 83f2d61 10c9dac Author: Dan Parent <[email protected]> Date: Mon Jan 23 16:39:22 2023 -0500 Merge branch 'main' into dparent1/iceberg - Fixing up merge conflicts around Exception/Error types - Putting back accidental newline removal in tox.ini commit 10c9dac Merge: 3a7ca7c 39800e0 Author: dparent1 <[email protected]> Date: Mon Jan 23 15:25:29 2023 -0500 Merge branch 'dbt-labs:main' into main commit 83f2d61 Merge: beac4b4 3a7ca7c Author: Dan Parent <[email protected]> Date: Mon Jan 9 11:12:28 2023 -0500 Merge branch 'main' into dparent1/iceberg commit 3a7ca7c Merge: cad11e2 d9a3d76 Author: dparent1 <[email protected]> Date: Mon Jan 9 11:02:55 2023 -0500 Merge branch 'dbt-labs:main' into main commit beac4b4 Merge: 29610a5 6e6daf5 Author: dparent1 <[email protected]> Date: Tue Dec 20 16:24:54 2022 -0500 Merge pull request dbt-labs#12 from Fokko/fd-comments Cleanup based on comments commit 6e6daf5 Author: Fokko Driesprong <[email protected]> Date: Tue Dec 20 21:49:01 2022 +0100 Cleanup based on comments commit 29610a5 Merge: 34c5549 cad11e2 Author: Dan Parent <[email protected]> Date: Tue Dec 20 13:26:20 2022 -0500 Merge branch 'main' into dparent1/iceberg Fixed conflicts: dbt/adapters/spark/impl.py dbt/adapters/spark/relation.py dbt/include/spark/macros/materializations/incremental/incremental.sql dbt/include/spark/macros/materializations/incremental/strategies.sql Updated tox.ini removing my own addition of allowlist_externals, main has it set now. commit cad11e2 Merge: 37307d1 512b3d0 Author: dparent1 <[email protected]> Date: Tue Dec 20 12:50:27 2022 -0500 Merge branch 'dbt-labs:main' into main commit 34c5549 Author: Dan Parent <[email protected]> Date: Thu Dec 8 10:47:25 2022 -0500 Allowing the use of /bin/bash by tox tox on certain platforms will complain that /bin/bash is not allowed to be used. I'm allowing it to be used with this change. commit 3ffcf5b Author: Dan Parent <[email protected]> Date: Thu Dec 8 10:36:57 2022 -0500 Removed use of ParsedSourceDefinition, add iceberg - upstream dbt changed breaking the use of ParsedSourceDefinition, using SourceDefinition appears to work instead - Added in change to include iceberg in adapters.sql: macro spark__alter_column_comment commit edb6f01 Merge: 65eac08 37307d1 Author: Dan Parent <[email protected]> Date: Thu Dec 8 10:02:36 2022 -0500 Merge changes in main to bring branch up to date commit 37307d1 Merge: d482d66 9511847 Author: dparent1 <[email protected]> Date: Thu Dec 8 09:22:16 2022 -0500 Merge branch 'dbt-labs:main' into main commit 65eac08 Author: Dan Parent <[email protected]> Date: Thu Dec 8 09:05:48 2022 -0500 Backing out previous merge which broke unit tests commit 513293b Merge: 294b7fb b936b7a Author: dparent1 <[email protected]> Date: Thu Dec 8 08:55:36 2022 -0500 Merge pull request dbt-labs#10 from Fokko/fd-enable-docs Add Iceberg to the list commit b936b7a Author: Fokko Driesprong <[email protected]> Date: Wed Dec 7 19:04:24 2022 +0100 Add Iceberg to the list commit 294b7fb Merge: 8d3984f d282b21 Author: dparent1 <[email protected]> Date: Mon Nov 28 09:55:51 2022 -0500 Merge pull request dbt-labs#9 from Fokko/fd-fix-incremental-runs Fix incremental runs commit d282b21 Author: Fokko Driesprong <[email protected]> Date: Fri Nov 25 15:27:14 2022 +0100 Fix incremental runs commit d482d66 Author: dparent1 <[email protected]> Date: Mon Nov 21 16:05:10 2022 -0500 Set up CI with Azure Pipelines [skip ci] commit 8d3984f Merge: 93d6b97 b759267 Author: Dan Parent <[email protected]> Date: Mon Nov 7 14:35:22 2022 -0500 Merge branch 'main' into dparent1/iceberg commit 93d6b97 Merge: 402be01 7f233b1 Author: Dan Parent <[email protected]> Date: Thu Sep 29 09:58:24 2022 -0400 Merge branch 'main' into dparent1/iceberg commit 402be01 Merge: 50f7b94 80dc029 Author: Dan Parent <[email protected]> Date: Wed Sep 28 09:02:05 2022 -0400 Merge branch 'main' into dparent1/iceberg commit 50f7b94 Author: Dan Parent <[email protected]> Date: Tue Sep 27 14:18:53 2022 -0400 Removing the is_iceberg check it is not needed Upon further investigation this check is not needed since self.database will not be set. commit bbf79e4 Merge: 6764e55 ebd011e Author: Dan Parent <[email protected]> Date: Mon Sep 12 11:21:35 2022 -0400 Merge branch 'main' into dparent1/iceberg commit 6764e55 Author: Dan Parent <[email protected]> Date: Fri Aug 19 14:13:58 2022 -0400 Adding changelog entry. commit ee8c7b0 Author: Dan Parent <[email protected]> Date: Fri Aug 19 13:45:24 2022 -0400 Renaming macro and method name On second look I wasn't happy with my name choices for macro name and method, hopefully what I have now makes more sense. [CT-276] Apache Iceberg Support dbt-labs#294 commit 6f96705 Author: Dan Parent <[email protected]> Date: Fri Aug 19 10:09:02 2022 -0400 Setting _schema rather than replacing it [CT-276] Apache Iceberg Support dbt-labs#294 The _schema variable was used for non-iceberg tables but was being overridden by work for iceberg v2 tables. I've made it so the iceberg condition will set _schema rather than blanket changing the schema for all providers. commit 35bc12e Author: Dan Parent <[email protected]> Date: Tue Aug 16 12:51:53 2022 -0400 Adding in additional support for iceberg v2 tables Found a way to identify iceberg tables given that spark returns an error when trying to execute "SHOW TABLE EXTENDED..." See https://issues.apache.org/jira/browse/SPARK-33393 Instead of show table extended a "DESCRIBE EXTENDED" is performed to retrieve the provider information. This allows for identification of iceberg through an is_iceberg member variable. Allow for multiple join conditions to allow for mutliple columns to make a row distinct Use is_iceberg everywhere handling iceberg tables differs from other sources of data.
* Adding in additional support for iceberg v2 tables Found a way to identify iceberg tables given that spark returns an error when trying to execute "SHOW TABLE EXTENDED..." See https://issues.apache.org/jira/browse/SPARK-33393 Instead of show table extended a "DESCRIBE EXTENDED" is performed to retrieve the provider information. This allows for identification of iceberg through an is_iceberg member variable. Allow for multiple join conditions to allow for mutliple columns to make a row distinct Use is_iceberg everywhere handling iceberg tables differs from other sources of data. * Setting _schema rather than replacing it [CT-276] Apache Iceberg Support #294 The _schema variable was used for non-iceberg tables but was being overridden by work for iceberg v2 tables. I've made it so the iceberg condition will set _schema rather than blanket changing the schema for all providers. * Renaming macro and method name On second look I wasn't happy with my name choices for macro name and method, hopefully what I have now makes more sense. [CT-276] Apache Iceberg Support #294 * Adding changelog entry. * Removing the is_iceberg check it is not needed Upon further investigation this check is not needed since self.database will not be set. * Set up CI with Azure Pipelines [skip ci] * Fix incremental runs * Add Iceberg to the list * Backing out previous merge which broke unit tests * Removed use of ParsedSourceDefinition, add iceberg - upstream dbt changed breaking the use of ParsedSourceDefinition, using SourceDefinition appears to work instead - Added in change to include iceberg in adapters.sql: macro spark__alter_column_comment * Allowing the use of /bin/bash by tox tox on certain platforms will complain that /bin/bash is not allowed to be used. I'm allowing it to be used with this change. * Cleanup based on comments * Revert some stuff I noticed that two Spark tests are failing, so tried to revert some of the changes that seemed related. They are now passing and also my Iceberg dbt project is still running fine * Correct makefile -e omission. * Rewrite impl using first order funcs. * Rewrite comment to be clearer. * Add comment to revise str to dict type. * Fix mypy ignore statement. * Move inline funcs to instance methods for performance. * Do some python cleanup as Mike suggested. * added project fixture * added version tag * removed whitespace from seed data --------- Co-authored-by: Dan Parent <[email protected]> Co-authored-by: Fokko Driesprong <[email protected]> Co-authored-by: Mila Page <[email protected]> Co-authored-by: Mike Alfare <[email protected]>
Twas merged, to be sent out in the beta, and thus, at last, we may close this. |
Does this also support the incremental models ?? |
@akshay-kokate-06 it does :) |
How can i config catalog for iceberg table ? catalog_type(hadoop, hive), catalog_name, ... |
You need to set up a catalog through the spark configuration and then set the default catalog to the newly created catalog. An example through the Once #781 has been merged, you can also set it when configuring the session:
Also, once #755 is in, we can also select different catalogs. |
Any plans to support iceberg table format. In preliminary testing it seems to work but I wonder if there are particularities which has not been addressed.
Also Iceberg uses catalogs (database) which the current dbt-spark implementation does not support. https://iceberg.apache.org/docs/latest/spark-configuration/#catalogs
Adding support for catalogs (database) in dbt-spark would be required to properly support iceberg tables.
The text was updated successfully, but these errors were encountered: