
[CT-276] Apache Iceberg Support #294

Closed
cccs-jc opened this issue Feb 23, 2022 · 12 comments
@cccs-jc

cccs-jc commented Feb 23, 2022

Are there any plans to support the Iceberg table format? In preliminary testing it seems to work, but I wonder if there are particularities that have not been addressed.

Also, Iceberg uses catalogs (databases), which the current dbt-spark implementation does not support. https://iceberg.apache.org/docs/latest/spark-configuration/#catalogs

Adding support for catalogs (databases) in dbt-spark would be required to properly support Iceberg tables.
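
For concreteness, the Spark-side catalog setup from the docs above looks roughly like this (a minimal PySpark sketch; the catalog, schema, and table names are placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-catalog-sketch")
    # Register an Iceberg catalog named "my_catalog" backed by a Hive metastore.
    # (Assumes the iceberg-spark-runtime jar is on the classpath.)
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "hive")
    .getOrCreate()
)

# With the catalog registered, tables take a three-part name:
spark.sql("SELECT * FROM my_catalog.my_schema.my_table").show()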

@cccs-jc cccs-jc added enhancement New feature or request triage labels Feb 23, 2022
@github-actions github-actions bot changed the title Apache Iceberg Support [CT-276] Apache Iceberg Support Feb 23, 2022
@jtcohen6
Contributor

It sounds like the three-part namespacing, with catalogs, may be coming to the Apache Spark standard as well: #294

dbt supports three-part namespaces by default, so from the standpoint of the plugin code it's actually quite simple. What might be tricky is ongoing conditional support for both 2-part and 3-part namespacing, configurable depending on the cluster / file format you're using.
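
Conceptually, the conditional rendering would look something like this (a hypothetical sketch, not the actual adapter code):

from typing import Optional

def render_relation(table: str, schema: str, catalog: Optional[str] = None) -> str:
    """Render a dotted identifier, adding the catalog part only when one is set."""
    parts = [catalog, schema, table] if catalog else [schema, table]
    return ".".join(parts)

print(render_relation("events", "analytics"))             # analytics.events (2-part)
print(render_relation("events", "analytics", "iceberg"))  # iceberg.analytics.events (3-part)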

@jtcohen6 jtcohen6 removed the triage label Feb 25, 2022
@cccs-jc
Author

cccs-jc commented Feb 25, 2022

Okay, so what I understand is that there are no show-stoppers to supporting 3-part namespacing, but the adapter should be able to work in either mode based on the table format and SQL engine.

I mostly wanted to know if there would be any objections in adding such support.

@kbendick

Hi all! I work on the Iceberg project and recently came across this issue. I also work very closely with many contributors to the Spark project and occasionally contribute there as well.

I'm admittedly relatively green to dbt, but having dbt work properly with Iceberg is important to us, as dbt continues to become a more and more common part of the standard data stack.

I can confirm that Spark 3 makes heavy use of catalogs / multipart namespacing and is for all intents and purposes "the future" (and really the present) of Spark. If no catalog is declared, Spark 3.x uses the implicit spark_catalog (which is a SessionCatalog). Happy to answer more questions about it as well.
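
A quick way to see this (a sketch; the output varies by Spark version and configuration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With no catalog configured, Spark 3.x falls back to the built-in session
# catalog, exposed under the name "spark_catalog".
print(spark.conf.get("spark.sql.defaultCatalog", "spark_catalog"))
spark.sql("SHOW CURRENT NAMESPACE").show()  # e.g. catalog=spark_catalog, namespace=default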

I'm going to play around over the weekend with dbt a bit and will follow up after that. 👍

@cccs-jc
Author

cccs-jc commented Mar 27, 2022

Hey @kbendick, I've updated my repo with Iceberg support. It's incomplete work, but feel free to take all the Iceberg-specific parts. It's in the main branch here: https://github.com/cccs-jc/dbt-spark

I will focus on support for a PySpark driver as a means of interfacing with Spark. #305

dparent1 added a commit to dparent1/dbt-spark that referenced this issue Aug 19, 2022
[CT-276] Apache Iceberg Support dbt-labs#294

The _schema variable was used for non-iceberg tables but was being
overridden by work for iceberg v2 tables.  I've made it so the iceberg
condition will set _schema rather than blanket changing the schema for
all providers.
dparent1 added a commit to dparent1/dbt-spark that referenced this issue Aug 19, 2022
On second look I wasn't happy with my name choices for macro name and
method, hopefully what I have now makes more sense.

[CT-276] Apache Iceberg Support dbt-labs#294
@dparent1 dparent1 mentioned this issue Aug 19, 2022
4 tasks
@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 24, 2022
@kbendick

Hi all! I have been out of office for health reasons but I might have some cycles now to look at this. Is the linked PR still relevant or is there some area specifically I can help out in?

@github-actions github-actions bot removed the Stale label Sep 25, 2022
cccs-jc pushed a commit to dparent1/dbt-spark that referenced this issue Nov 21, 2022
[CT-276] Apache Iceberg Support dbt-labs#294

The _schema variable was used for non-iceberg tables but was being
overridden by work for iceberg v2 tables.  I've made it so the iceberg
condition will set _schema rather than blanket changing the schema for
all providers.
cccs-jc pushed a commit to dparent1/dbt-spark that referenced this issue Nov 21, 2022
On second look I wasn't happy with my name choices for macro name and
method, hopefully what I have now makes more sense.

[CT-276] Apache Iceberg Support dbt-labs#294
@stevenayers

Hi all, is there any update on this?

Fokko added a commit to Fokko/dbt-spark that referenced this issue Feb 2, 2023
commit 8e1a8f4
Merge: 28909f2 908d9fe
Author: dparent1 <[email protected]>
Date:   Thu Feb 2 13:22:23 2023 -0500

    Merge pull request dbt-labs#16 from Fokko/fd-revert-some-changes

    Revert some stuff

commit 908d9fe
Merge: b286fd0 f877d1e
Author: Fokko Driesprong <[email protected]>
Date:   Thu Feb 2 10:00:18 2023 -0800

    Merge branch 'main' of https://github.com/dbt-labs/dbt-spark into fd-revert-some-changes

commit b286fd0
Author: Fokko Driesprong <[email protected]>
Date:   Wed Feb 1 12:00:21 2023 -0800

    Revert some stuff

    I noticed that two Spark tests are failing, so tried to revert
    some of the changes that seemed related. They are now passing
    and also my Iceberg dbt project is still running fine

commit 28909f2
Merge: 3ba0a72 3c03d9a
Author: Dan Parent <[email protected]>
Date:   Mon Jan 30 15:05:00 2023 -0500

    Merge branch 'main' into dparent1/iceberg

    - in impl.py changed:
        List[agate.Row]
        to
        AttrDict

    This fixed the pre-commit error after merging in changes from main.

commit 3c03d9a
Merge: 10c9dac 4d179e0
Author: dparent1 <[email protected]>
Date:   Mon Jan 30 14:31:18 2023 -0500

    Merge branch 'dbt-labs:main' into main

commit 3ba0a72
Merge: 83f2d61 10c9dac
Author: Dan Parent <[email protected]>
Date:   Mon Jan 23 16:39:22 2023 -0500

    Merge branch 'main' into dparent1/iceberg

    - Fixing up merge conflicts around Exception/Error types
    - Putting back accidental newline removal in tox.ini

commit 10c9dac
Merge: 3a7ca7c 39800e0
Author: dparent1 <[email protected]>
Date:   Mon Jan 23 15:25:29 2023 -0500

    Merge branch 'dbt-labs:main' into main

commit 83f2d61
Merge: beac4b4 3a7ca7c
Author: Dan Parent <[email protected]>
Date:   Mon Jan 9 11:12:28 2023 -0500

    Merge branch 'main' into dparent1/iceberg

commit 3a7ca7c
Merge: cad11e2 d9a3d76
Author: dparent1 <[email protected]>
Date:   Mon Jan 9 11:02:55 2023 -0500

    Merge branch 'dbt-labs:main' into main

commit beac4b4
Merge: 29610a5 6e6daf5
Author: dparent1 <[email protected]>
Date:   Tue Dec 20 16:24:54 2022 -0500

    Merge pull request dbt-labs#12 from Fokko/fd-comments

    Cleanup based on comments

commit 6e6daf5
Author: Fokko Driesprong <[email protected]>
Date:   Tue Dec 20 21:49:01 2022 +0100

    Cleanup based on comments

commit 29610a5
Merge: 34c5549 cad11e2
Author: Dan Parent <[email protected]>
Date:   Tue Dec 20 13:26:20 2022 -0500

    Merge branch 'main' into dparent1/iceberg

    Fixed conflicts:
    dbt/adapters/spark/impl.py
    dbt/adapters/spark/relation.py
    dbt/include/spark/macros/materializations/incremental/incremental.sql
    dbt/include/spark/macros/materializations/incremental/strategies.sql

    Updated tox.ini removing my own addition of allowlist_externals, main
    has it set now.

commit cad11e2
Merge: 37307d1 512b3d0
Author: dparent1 <[email protected]>
Date:   Tue Dec 20 12:50:27 2022 -0500

    Merge branch 'dbt-labs:main' into main

commit 34c5549
Author: Dan Parent <[email protected]>
Date:   Thu Dec 8 10:47:25 2022 -0500

    Allowing the use of /bin/bash by tox

    tox on certain platforms will complain that /bin/bash is not allowed to
    be used.  I'm allowing it to be used with this change.

commit 3ffcf5b
Author: Dan Parent <[email protected]>
Date:   Thu Dec 8 10:36:57 2022 -0500

    Removed use of ParsedSourceDefinition, add iceberg

    - upstream dbt changed breaking the use of ParsedSourceDefinition, using
      SourceDefinition appears to work instead
    - Added in change to include iceberg in adapters.sql:
       macro spark__alter_column_comment

commit edb6f01
Merge: 65eac08 37307d1
Author: Dan Parent <[email protected]>
Date:   Thu Dec 8 10:02:36 2022 -0500

    Merge changes in main to bring branch up to date

commit 37307d1
Merge: d482d66 9511847
Author: dparent1 <[email protected]>
Date:   Thu Dec 8 09:22:16 2022 -0500

    Merge branch 'dbt-labs:main' into main

commit 65eac08
Author: Dan Parent <[email protected]>
Date:   Thu Dec 8 09:05:48 2022 -0500

    Backing out previous merge which broke unit tests

commit 513293b
Merge: 294b7fb b936b7a
Author: dparent1 <[email protected]>
Date:   Thu Dec 8 08:55:36 2022 -0500

    Merge pull request dbt-labs#10 from Fokko/fd-enable-docs

    Add Iceberg to the list

commit b936b7a
Author: Fokko Driesprong <[email protected]>
Date:   Wed Dec 7 19:04:24 2022 +0100

    Add Iceberg to the list

commit 294b7fb
Merge: 8d3984f d282b21
Author: dparent1 <[email protected]>
Date:   Mon Nov 28 09:55:51 2022 -0500

    Merge pull request dbt-labs#9 from Fokko/fd-fix-incremental-runs

    Fix incremental runs

commit d282b21
Author: Fokko Driesprong <[email protected]>
Date:   Fri Nov 25 15:27:14 2022 +0100

    Fix incremental runs

commit d482d66
Author: dparent1 <[email protected]>
Date:   Mon Nov 21 16:05:10 2022 -0500

    Set up CI with Azure Pipelines

    [skip ci]

commit 8d3984f
Merge: 93d6b97 b759267
Author: Dan Parent <[email protected]>
Date:   Mon Nov 7 14:35:22 2022 -0500

    Merge branch 'main' into dparent1/iceberg

commit 93d6b97
Merge: 402be01 7f233b1
Author: Dan Parent <[email protected]>
Date:   Thu Sep 29 09:58:24 2022 -0400

    Merge branch 'main' into dparent1/iceberg

commit 402be01
Merge: 50f7b94 80dc029
Author: Dan Parent <[email protected]>
Date:   Wed Sep 28 09:02:05 2022 -0400

    Merge branch 'main' into dparent1/iceberg

commit 50f7b94
Author: Dan Parent <[email protected]>
Date:   Tue Sep 27 14:18:53 2022 -0400

    Removing the is_iceberg check it is not needed

    Upon further investigation this check is not needed since
    self.database will not be set.

commit bbf79e4
Merge: 6764e55 ebd011e
Author: Dan Parent <[email protected]>
Date:   Mon Sep 12 11:21:35 2022 -0400

    Merge branch 'main' into dparent1/iceberg

commit 6764e55
Author: Dan Parent <[email protected]>
Date:   Fri Aug 19 14:13:58 2022 -0400

    Adding changelog entry.

commit ee8c7b0
Author: Dan Parent <[email protected]>
Date:   Fri Aug 19 13:45:24 2022 -0400

    Renaming macro and method name

    On second look I wasn't happy with my name choices for macro name and
    method, hopefully what I have now makes more sense.

    [CT-276] Apache Iceberg Support dbt-labs#294

commit 6f96705
Author: Dan Parent <[email protected]>
Date:   Fri Aug 19 10:09:02 2022 -0400

    Setting _schema rather than replacing it

    [CT-276] Apache Iceberg Support dbt-labs#294

    The _schema variable was used for non-iceberg tables but was being
    overridden by work for iceberg v2 tables.  I've made it so the iceberg
    condition will set _schema rather than blanket changing the schema for
    all providers.

commit 35bc12e
Author: Dan Parent <[email protected]>
Date:   Tue Aug 16 12:51:53 2022 -0400

    Adding in additional support for iceberg v2 tables

    Found a way to identify iceberg tables given that spark returns
    an error when trying to execute "SHOW TABLE EXTENDED..."  See
    https://issues.apache.org/jira/browse/SPARK-33393

    Instead of show table extended a "DESCRIBE EXTENDED" is
    performed to retrieve the provider information.  This allows
    for identification of iceberg through an is_iceberg member
    variable.

    Allow for multiple join conditions to allow for multiple columns to
    make a row distinct

    Use is_iceberg everywhere handling iceberg tables differs from other
    sources of data.
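
As a sketch of the detection approach described in commit 35bc12e above (the table name is hypothetical, and the adapter's actual parsing may differ):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DESCRIBE EXTENDED returns (col_name, data_type, comment) rows; for v2
# sources the detailed table information section includes a "Provider" row.
rows = spark.sql("DESCRIBE EXTENDED my_schema.my_table").collect()
provider = next((r.data_type for r in rows if r.col_name.strip() == "Provider"), None)
is_iceberg = provider is not None and provider.lower() == "iceberg"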
@VersusFacit VersusFacit self-assigned this Feb 13, 2023
mikealfare added a commit that referenced this issue Mar 2, 2023
* Adding in additional support for iceberg v2 tables

Found a way to identify iceberg tables given that spark returns
an error when trying to execute "SHOW TABLE EXTENDED..."  See
https://issues.apache.org/jira/browse/SPARK-33393

Instead of show table extended a "DESCRIBE EXTENDED" is
performed to retrieve the provider information.  This allows
for identification of iceberg through an is_iceberg member
variable.

Allow for multiple join conditions to allow for multiple columns to
make a row distinct

Use is_iceberg everywhere handling iceberg tables differs from other
sources of data.

* Setting _schema rather than replacing it

[CT-276] Apache Iceberg Support #294

The _schema variable was used for non-iceberg tables but was being
overridden by work for iceberg v2 tables.  I've made it so the iceberg
condition will set _schema rather than blanket changing the schema for
all providers.

* Renaming macro and method name

On second look I wasn't happy with my name choices for macro name and
method, hopefully what I have now makes more sense.

[CT-276] Apache Iceberg Support #294

* Adding changelog entry.

* Removing the is_iceberg check it is not needed

Upon further investigation this check is not needed since
self.database will not be set.

* Set up CI with Azure Pipelines

[skip ci]

* Fix incremental runs

* Add Iceberg to the list

* Backing out previous merge which broke unit tests

* Removed use of ParsedSourceDefinition, add iceberg

- upstream dbt changed breaking the use of ParsedSourceDefinition, using
  SourceDefinition appears to work instead
- Added in change to include iceberg in adapters.sql:
   macro spark__alter_column_comment

* Allowing the use of /bin/bash by tox

tox on certain platforms will complain that /bin/bash is not allowed to
be used.  I'm allowing it to be used with this change.

* Cleanup based on comments

* Revert some stuff

I noticed that two Spark tests are failing, so tried to revert
some of the changes that seemed related. They are now passing
and also my Iceberg dbt project is still running fine

* Correct makefile -e omission.

* Rewrite impl using first order funcs.

* Rewrite comment to be clearer.

* Add comment to revise str to dict type.

* Fix mypy ignore statement.

* Move inline funcs to instance methods for performance.

* Do some python cleanup as Mike suggested.

* added project fixture

* added version tag

* removed whitespace from seed data

---------

Co-authored-by: Dan Parent <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Mila Page <[email protected]>
Co-authored-by: Mike Alfare <[email protected]>
@VersusFacit
Contributor

VersusFacit commented Mar 3, 2023

Twas merged, to be sent out in the beta, and thus, at last, we may close this.

@akshay-kokate-06

Does this also support incremental models?

@Fokko
Contributor

Fokko commented Apr 13, 2023

@akshay-kokate-06 it does :)
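
Under the hood, an incremental run against an Iceberg table boils down to roughly a MERGE INTO keyed on the model's unique_key. A sketch with placeholder names (dbt-spark generates the actual statement from the model config):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Roughly what an incremental merge executes; all names are placeholders.
spark.sql("""
    MERGE INTO my_catalog.analytics.events AS target
    USING new_events_batch AS source
    ON target.event_id = source.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")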

@QuangVuMinh

How can I configure a catalog for an Iceberg table? E.g. catalog_type (hadoop, hive), catalog_name, ...

@Fokko
Contributor

Fokko commented Jun 27, 2023

You need to set up a catalog through the Spark configuration and then set the default catalog to the newly created one. An example using spark-defaults.conf can be found at https://github.com/tabular-io/docker-spark-iceberg/blob/main/spark/spark-defaults.conf and you can also set it using startup arguments, as shown in https://tabular.io/blog/dbt/

Once #781 has been merged, you can also set it when configuring the session:

dbt_tabular:
  outputs:
    dev:
      method: session
      schema: dbt_tabular
      type: spark
      host: NA
      server_side_parameters:
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        "spark.sql.defaultCatalog": "sandbox"
        "spark.sql.catalog.sandbox": "org.apache.iceberg.spark.SparkCatalog"
        "spark.sql.catalog.sandbox.catalog-impl": "org.apache.iceberg.rest.RESTCatalog"
        "spark.sql.catalog.sandbox.uri": "https://api.tabular.io/ws/"
        "spark.sql.catalog.sandbox.credential": "abc"
        "spark.sql.catalog.sandbox.warehouse": "sandbox"

And once #755 is in, we can also select different catalogs.
