[CT-783] Spark with Iceberg tables: catalog.json is empty #376
Comments
Hi @zsvoboda, thank you for opening this issue! Apache Iceberg is not yet an officially supported file format: https://docs.getdbt.com/reference/resource-configs/spark-configs#configuring-tables Would you be interested in contributing this? If so, we can likely take some time to give guidance on how this could be implemented. I'll mark this as help wanted, as it's likely not something we can prioritize in the near future.
I'm not sure what the longer-term solution is here. If using OSS Delta, Iceberg, or other file formats, do we need to revert to the much older way of doing this (
I think there will eventually be a migration to some sort of information_schema, but we'd need a generic API to support it (like MERGE INTO does) so that data sources could implement it. That will probably be a while, so having format v1 vs. v2 for the provider in the table's general configuration would be a good idea. That's the difference in Spark between the two statements (and why the schema looks the way it does) and the SQL queries they need. But information_schema is not yet part of the Spark catalog API at all, so I wouldn't recommend relying on it if more formats are to be supported. My 2 cents. Happy to help where I can when I get back from my break if there's interest!
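To make the v1/v2 difference concrete: the per-schema metadata statement dbt-spark issues is rejected by v2 catalogs such as Iceberg's SparkCatalog, while a per-table statement succeeds. A rough Spark SQL sketch (the schema and table names here are illustrative, not from this issue):

```sql
-- Per-schema call dbt-spark uses today; rejected by v2 (Iceberg) catalogs
-- with "SHOW TABLE EXTENDED is not supported for v2 tables":
SHOW TABLE EXTENDED IN analytics LIKE '*';

-- Per-table fallback that v2 catalogs do accept (one call per relation):
DESCRIBE TABLE EXTENDED analytics.orders;
```

The trade-off is the one the thread discusses: the per-table form means one round trip per relation instead of one per schema, which is the "much older way of doing this" referenced above.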
Is there some workaround for this at the moment?
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
This issue has been fixed in #294.
Describe the bug
Generating the docs catalog for Iceberg tables produces an empty catalog.json. I'm using Spark 3.2.1 with Iceberg 0.13.2 (session started via spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.2).
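For anyone trying to reproduce this, the runtime jar alone is not enough; an Iceberg-enabled session also needs catalog configuration. A sketch of one such invocation, with the session-extension and catalog settings taken from the Iceberg quickstart (the catalog type here is an assumption; the original report does not say which was used):

```
spark-shell \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.2 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive
```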
Steps To Reproduce
Expected behavior
catalog.json is populated with the table schema and docs.
Screenshots and log output
System information
The output of dbt --version:
The operating system you're using: macOS Monterey
The output of python --version: Python 3.8.12
Additional context
This seems to be a similar problem to the one with Delta tables (#295):
SQL Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: SHOW TABLE EXTENDED is not supported for v2 tables.;
ShowTableExtended *, [namespace#21, tableName#22, isTemporary#23, information#24]
+- ResolvedNamespace org.apache.iceberg.spark.SparkCatalog@7929bdd7
Caused by: org.apache.spark.sql.AnalysisException: SHOW TABLE EXTENDED is not supported for v2 tables.;
ShowTableExtended *, [namespace#21, tableName#22, isTemporary#23, information#24]
+- ResolvedNamespace org.apache.iceberg.spark.SparkCatalog@7929bdd7