Replies: 5 comments 6 replies
-
Delta format, using Spark Thrift Server instead of Hive Server. Currently the connector has no option to pass Hadoop options, so it can't connect to a private S3.
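For reference, a minimal sketch of the kind of Hadoop/S3A options we would need to be able to pass through. The endpoint, credentials and paths below are placeholders, and it assumes the Delta Lake package is available on the Spark classpath:

```python
from pyspark.sql import SparkSession

# Hypothetical endpoint and credentials for a private (non-AWS) S3 service.
spark = (
    SparkSession.builder.appName("delta-private-s3")
    # Hadoop options forwarded via the spark.hadoop.* prefix.
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.internal.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Once the S3A options are in place, Delta tables in the private bucket are readable.
df = spark.read.format("delta").load("s3a://my-bucket/path/to/delta-table")
df.printSchema()
```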
-
We have JSON files stored in S3 and there is no external metastore service in use. The JSON files do not have a filename extension, so we cannot rely on them being identified as JSON based on their extension. As a first step, we would like to be able to extract a common schema from these JSON files, perhaps by sampling a limited number of files in the bucket, since their total number is high.
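A rough sketch of how such sampling-based schema extraction could look, assuming the objects contain newline-delimited JSON; the bucket name, prefix and sample size are placeholders:

```python
import json

import boto3
import pyarrow as pa

# Hypothetical bucket, prefix and sample size; adjust to the real layout.
BUCKET, PREFIX, SAMPLE_SIZE = "my-data-bucket", "events/", 20

s3 = boto3.client("s3")
keys = [
    obj["Key"]
    for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
][:SAMPLE_SIZE]

# Infer a schema per sampled object, then merge them into one common schema.
# unify_schemas raises if the sampled files have conflicting field types.
schemas = []
for key in keys:
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    records = [json.loads(line) for line in body.splitlines() if line.strip()]
    schemas.append(pa.Table.from_pylist(records).schema)

common_schema = pa.unify_schemas(schemas)
print(common_schema)
```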
-
Hi again, an additional use case we are facing for ingestion from S3 is when the schema file is available in the same or another S3 bucket. Currently, the way OpenMetadata works seems to be to look for the three dbt files (manifest.json, catalog.json and run_results.json) in the S3 bucket configured as the dbt source. All three files must be present in the dbt source bucket for ingestion to take place.
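To illustrate the current behaviour as we understand it, a small sketch of the presence check this implies, with hypothetical bucket and prefix names:

```python
import boto3

# Hypothetical bucket/prefix configured as the dbt source.
BUCKET, PREFIX = "my-dbt-artifacts", "target/"
REQUIRED = {"manifest.json", "catalog.json", "run_results.json"}

s3 = boto3.client("s3")
present = {
    obj["Key"].rsplit("/", 1)[-1]
    for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
}

missing = REQUIRED - present
if missing:
    print(f"Ingestion would not run; missing dbt artifacts: {sorted(missing)}")
else:
    print("All three dbt artifacts found; ingestion can proceed.")
```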
-
We have the following scenario: we use Arrow Flight to write Arrow datasets to parquet or csv format. Our datasets are partitioned into several csv/parquet objects (e.g. part-0.csv, part-1.csv, etc.). We would like all csv/parquet objects in the same directory (or that share the same prefix) to be treated as a single table, sharing the same schema and the same tags.
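A sketch of the behaviour we are after, using pyarrow.dataset with a hypothetical bucket and prefix (S3 credentials are picked up from the environment):

```python
import pyarrow.dataset as ds

# Hypothetical prefix holding part-0.parquet, part-1.parquet, ...
# Every object under the prefix is discovered and exposed as one logical table.
dataset = ds.dataset("s3://my-bucket/orders/", format="parquet")

print(dataset.schema)       # a single unified schema for all part files
table = dataset.to_table()  # read every part as one Arrow table
```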
-
Hi everyone, I am currently working with parquet and csv data. The S3 path where the data resides is as follows: s3://{bucket_name}/{table_name}/year={partition_value}/year_month={partition_value}/date={partition_value}/<parquet files>. It would be nice for multiple parquet or csv files laid out in this partition format to be treated as the same table.
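For illustration, a minimal sketch of reading such a layout as a single table with pyarrow.dataset and Hive-style partition discovery; the bucket and table names are placeholders:

```python
import pyarrow.dataset as ds

# Hypothetical bucket/table; the year=/year_month=/date= directories are
# discovered as Hive-style partition columns.
dataset = ds.dataset(
    "s3://my-bucket/my_table/",
    format="parquet",
    partitioning="hive",
)

# Partition keys appear as regular columns in the unified schema.
print(dataset.schema)

# Partition pruning: only files under year=2023 are scanned.
table = dataset.to_table(filter=ds.field("year") == 2023)
```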
-
Dear OpenMetadata community ✨
Since we introduced the DataLake connector, we have been getting feedback from different members, both explaining what works for them and what pieces they are missing.
Thanks to these discussions, we have noticed that how we handle DataLakes (buckets, paths, partitions, formats...) might need an upgrade to cover all the use cases. We want to ensure that we get this right, and as always, this starts by putting our community at the center.
WE WOULD LOVE TO LEARN ABOUT YOUR USE CASE 👉
Based on all your feedback, we will be preparing the design of the new solution, aiming to cover as much ground as possible. We will also share early drafts with you to make sure we are heading in the right direction.
Thanks for your contributions! Looking forward to hearing from you all in the comments section!