using hive for metastore #11

Merged
merged 5 commits into from
May 28, 2024

Conversation

Tianhao-Gu
Collaborator

No description provided.

codecov bot commented May 23, 2024

Codecov Report

Attention: Patch coverage is 71.42857%, with 6 lines in your changes missing coverage. Please review.

Project coverage is 78.57%. Comparing base (2120545) to head (0a42335).

| Files | Patch % | Lines |
|---|---|---|
| src/spark/utils.py | 71.42% | 6 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #11      +/-   ##
==========================================
+ Coverage   71.42%   78.57%   +7.14%     
==========================================
  Files           1        1              
  Lines          14       28      +14     
==========================================
+ Hits           10       22      +12     
- Misses          4        6       +2     

for key, value in delta_conf.items():
    spark_conf.set(key, value)

return SparkSession.builder.config(conf=spark_conf).enableHiveSupport().getOrCreate()
Member

I don't know anything about hive, really - why is it needed here? Doesn't DeltaLake take care of the metadata?

Collaborator Author

With this approach, we can establish a permanent view that everyone can query without rebuilding or reloading the table. I removed enableHiveSupport() from the general Spark session builder and now enable Hive only through the conf used for Delta Lake.
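
A minimal sketch of what enabling Hive only through the Delta Lake configuration could look like. The helper name `build_delta_conf` is hypothetical and is not taken from this PR. The three keys are standard Spark and Delta Lake settings, and setting `spark.sql.catalogImplementation` to `hive` is what `enableHiveSupport()` does on the PySpark builder.

```python
def build_delta_conf(enable_hive=True):
    """Return Spark settings for Delta Lake, optionally with a Hive catalog.

    Hypothetical helper for illustration; not the code from this PR.
    """
    conf = {
        # Standard Delta Lake settings for a Spark session.
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }
    if enable_hive:
        # Same effect as calling enableHiveSupport() on the builder.
        conf["spark.sql.catalogImplementation"] = "hive"
    return conf


delta_conf = build_delta_conf()
```

The resulting dictionary can be applied with the same `spark_conf.set(key, value)` loop shown in the diff, so sessions that do not need Delta Lake never load the Hive catalog.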

Member

So the delta lake SW doesn't maintain the table metadata anywhere other than local memory?

Where are the metadata persisted?

Collaborator Author

Yes, in memory by default. With this change, the metadata is persisted in the metastore_db folder, which is created by default under /cdm_shared_folder and mounted by the Rancher settings.
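
For context, a sketch of how an embedded (Derby) Hive metastore can be pointed at a shared folder. The function name is hypothetical, and whether this repository sets these keys explicitly or simply relies on the working directory is an assumption. `javax.jdo.option.ConnectionURL` is the standard Hive metastore connection property, passed through Spark with the `spark.hadoop.` prefix. The `warehouse` subfolder is an assumed layout.

```python
from pathlib import PurePosixPath


def metastore_conf(shared_dir="/cdm_shared_folder"):
    """Build settings that place the Derby metastore under shared_dir."""
    base = PurePosixPath(shared_dir)
    db_path = base / "metastore_db"
    return {
        # Embedded Derby database that holds table and view definitions.
        "spark.hadoop.javax.jdo.option.ConnectionURL": (
            f"jdbc:derby:;databaseName={db_path};create=true"
        ),
        # Default location for managed table data (assumed layout).
        "spark.sql.warehouse.dir": str(base / "warehouse"),
    }
```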

Member

So anyone wanting to use the shared tables needs the same volume mount for their notebook container? What happens if two people try to create conflicting tables with the same name at the same time?

Member

What's the lock mechanism?

Collaborator Author

There is a .lck file. Just delete that file.
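
A small sketch for checking which lock files are present before removing anything. It assumes the metastore is the embedded Derby database in `metastore_db`, which writes `db.lck` (and sometimes `dbex.lck`) while a JVM has it open. The function name is illustrative. Removing a lock while another session still holds the database can corrupt it, so this sketch only lists the files.

```python
from pathlib import Path


def find_lock_files(metastore_dir):
    """Return the sorted .lck files directly inside metastore_dir."""
    root = Path(metastore_dir)
    if not root.is_dir():
        # Nothing to report if the metastore folder does not exist.
        return []
    return sorted(p for p in root.glob("*.lck") if p.is_file())
```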

Member

Ok, that seems like that should work regardless of docker

Member

MrCreosote May 28, 2024

Can you check and see if there's any way for the tables to be stored in minio so other systems that can talk deltalake can access them?

Collaborator Author

Yes, I will look into that.
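
One possible direction, sketched under assumptions: Delta tables can be written to an S3-compatible store such as MinIO through Hadoop's S3A connector, so that other Delta-aware systems read the same files. The endpoint, bucket, and credential values are placeholders supplied by the caller, and the helper name is hypothetical. The `fs.s3a.*` keys are standard S3A options, and the `warehouse` prefix is an assumed layout.

```python
def minio_conf(endpoint, access_key, secret_key, bucket):
    """Build Spark settings for reading and writing Delta tables on MinIO."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO is usually addressed by path rather than by virtual host.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        # Managed tables land in the bucket (assumed layout).
        "spark.sql.warehouse.dir": f"s3a://{bucket}/warehouse",
    }
```

This only moves the table data. The table names and view definitions would still live in whichever metastore the session uses.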

Tianhao-Gu merged commit 720d734 into main on May 28, 2024
9 checks passed
Tianhao-Gu deleted the dev_using_hive branch on May 28, 2024 00:42