Replies: 5 comments 6 replies
-
Delta format, using Spark Thrift Server instead of Hive Server. Currently the connector has no option to pass Hadoop options, so it can't connect to a private S3.
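For reference, a minimal sketch of the kind of Hadoop/S3A options we would need to be able to pass through. The endpoint, credentials and paths below are placeholders, and it assumes the Delta Lake package is available on the Spark classpath:

```python
from pyspark.sql import SparkSession

# Hypothetical endpoint and credentials for a private (non-AWS) S3 service.
spark = (
    SparkSession.builder.appName("delta-private-s3")
    # Hadoop options forwarded via the spark.hadoop.* prefix.
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.internal.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Once the S3A options are in place, Delta tables in the private bucket are readable.
df = spark.read.format("delta").load("s3a://my-bucket/path/to/delta-table")
df.printSchema()
```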
-
We have JSON files stored in S3 and there is no external metastore service in use. The JSON files do not have a filename extension, so we cannot rely on them being identified as JSON based on their extension. As a first step, we would like to be able to extract a common schema from these JSON files, perhaps by sampling a limited number of files in the bucket, since their total number is high.
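A rough sketch of how such sampling-based schema extraction could look, assuming the objects contain newline-delimited JSON; the bucket name, prefix and sample size are placeholders:

```python
import json

import boto3
import pyarrow as pa

# Hypothetical bucket, prefix and sample size; adjust to the real layout.
BUCKET, PREFIX, SAMPLE_SIZE = "my-data-bucket", "events/", 20

s3 = boto3.client("s3")
keys = [
    obj["Key"]
    for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
][:SAMPLE_SIZE]

# Infer a schema per sampled object, then merge them into one common schema.
# unify_schemas raises if the sampled files have conflicting field types.
schemas = []
for key in keys:
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    records = [json.loads(line) for line in body.splitlines() if line.strip()]
    schemas.append(pa.Table.from_pylist(records).schema)

common_schema = pa.unify_schemas(schemas)
print(common_schema)
```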
-
Hi again, an additional use case we are facing for ingestion from S3 is when the schema file is available in the same or another S3 bucket. Currently, the way OpenMetadata works seems to be to look for the three dbt files (manifest.json, catalog.json and run_results.json) in the S3 bucket configured as the dbt source. All three files must be present in the dbt source bucket for ingestion to take place.
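To illustrate the current behaviour as we understand it, a small sketch of the presence check this implies, with hypothetical bucket and prefix names:

```python
import boto3

# Hypothetical bucket/prefix configured as the dbt source.
BUCKET, PREFIX = "my-dbt-artifacts", "target/"
REQUIRED = {"manifest.json", "catalog.json", "run_results.json"}

s3 = boto3.client("s3")
present = {
    obj["Key"].rsplit("/", 1)[-1]
    for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
}

missing = REQUIRED - present
if missing:
    print(f"Ingestion would not run; missing dbt artifacts: {sorted(missing)}")
else:
    print("All three dbt artifacts found; ingestion can proceed.")
```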
-
We have the following scenario: we use Arrow Flight to write Arrow datasets to parquet or csv format. Our datasets are partitioned into several csv/parquet objects (e.g. part-0.csv, part-1.csv, etc.). We would like all csv/parquet objects in the same directory (or that share the same prefix) to be treated as a single table, sharing the same schema and the same tags.
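A sketch of the behaviour we are after, using pyarrow.dataset with a hypothetical bucket and prefix (S3 credentials are picked up from the environment):

```python
import pyarrow.dataset as ds

# Hypothetical prefix holding part-0.parquet, part-1.parquet, ...
# Every object under the prefix is discovered and exposed as one logical table.
dataset = ds.dataset("s3://my-bucket/orders/", format="parquet")

print(dataset.schema)       # a single unified schema for all part files
table = dataset.to_table()  # read every part as one Arrow table
```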
-
Hi everyone, I am currently working with parquet and csv data. The S3 path where the data resides is as follows: s3://{bucket_name}/{table_name}/year={partition_value}/year_month={partition_value}/date={partition_value}/<parquet files>. It would be nice for multiple parquet or csv files laid out in this partition format to be treated as the same table.
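For illustration, a minimal sketch of reading such a layout as a single table with pyarrow.dataset and Hive-style partition discovery; the bucket and table names are placeholders:

```python
import pyarrow.dataset as ds

# Hypothetical bucket/table; the year=/year_month=/date= directories are
# discovered as Hive-style partition columns.
dataset = ds.dataset(
    "s3://my-bucket/my_table/",
    format="parquet",
    partitioning="hive",
)

# Partition keys appear as regular columns in the unified schema.
print(dataset.schema)

# Partition pruning: only files under year=2023 are scanned.
table = dataset.to_table(filter=ds.field("year") == 2023)
```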
-
Dear OpenMetadata community ✨
Since we introduced the DataLake connector, we have been getting feedback from different members, both explaining what works for them and what pieces they are missing.
Thanks to these discussions, we have noticed that how we handle DataLakes (buckets, paths, partitions, formats...) might need an upgrade to cover all the use cases. We want to ensure that we get this right, and as always, this starts by putting our community at the center.
WE WOULD LOVE TO LEARN ABOUT YOUR USE CASE 👉
Based on all your feedback, we will be preparing the design of the new solution, aiming to cover as much ground as possible. We will also share early drafts with you to make sure we are heading in the right direction.
Thanks for your contributions! Looking forward to hearing from you all in the comments section!