read_parquet not picking up column names from partitioned dataset #154
Comments
The docstring for `read_parquet` suggests that reading a whole dataset directory is supported; I'm pretty sure the intention is for the partition columns to be picked up from the directory structure.
This is a helpful reference about Spark partition discovery, from @tk3369.
There is type inference from the column path; per the Spark repo doc sql-data-sources-parquet.md, the data types of the partitioning columns are automatically inferred (numeric, date, timestamp, and string types are supported).
Parquet files can encode metadata into the files, and column type may be stored in that metadata by some parquet writer applications, but I think for Spark the column names come from the path and the type is inferred. If Parquet.jl inferred column types the way CSV.jl does, I think that would be a fine way to solve the problem, matching what Spark provides for partitioned datasets.
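A minimal sketch of that approach in Julia, assuming Spark's `key=value` path convention; `partition_columns` and `infer_value` are hypothetical helper names, not Parquet.jl API, and the inference order (Int, then Float64, then Date, else String) mimics CSV.jl-style value inference:

```julia
using Dates

# Infer a partition value's type the way CSV.jl infers cell types:
# try Int, then Float64, then Date; otherwise keep it as a String.
function infer_value(v::AbstractString)
    for T in (Int, Float64, Date)
        parsed = tryparse(T, v)
        parsed === nothing || return parsed
    end
    return String(v)
end

# Spark-style partition discovery: collect `key=value` path segments.
function partition_columns(path::AbstractString)
    cols = Dict{String,Any}()
    for segment in splitpath(path)
        m = match(r"^([^=]+)=(.+)$", segment)
        m === nothing && continue
        cols[m.captures[1]] = infer_value(m.captures[2])
    end
    return cols
end

partition_columns("data/dataset_date=2021-06-01/device_family=phone/part-00000.parquet")
# Dict{String,Any}("device_family" => "phone", "dataset_date" => Date("2021-06-01"))
```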
I do not have much experience with how Spark does it, but it does seem possible: https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files
Does the dataset you are consuming have the metadata files?
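If the writer did produce them, checking is simple; a small sketch, with `mydata` as a placeholder for the dataset root:

```julia
# Look for the pyarrow-style summary files alongside the data files;
# "mydata" is a placeholder for the actual dataset directory.
dataset_dir = "mydata"
isfile(joinpath(dataset_dir, "_metadata"))         # full row-group metadata
isfile(joinpath(dataset_dir, "_common_metadata"))  # schema-only metadata
```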
Hi -- sorry for the long lapse. Generally there are no metadata files. I know this is less of a burden in the data science area, where data is expected to be pre-processed before ingesting into models or stats work. I would like to use Julia in various data transformations as part of the data engineering components of the work.
I'm a package user, not a developer, so I apologize in advance if I am missing something obvious in the API descriptions.
My use case:
I'm consuming the output of some data engineers to produce aggregated reports using Julia, my preferred language.
I have datasets that are written from Spark in parquet format, partitioned by certain columns in the typical fashion; an illustrative file layout is below. The original complete dataset includes the partition columns `dataset_date` and `device_family`: the first is the date of observations, allowing incremental builds of historical data, and the second is a natural partition of the data because of upstream logic. The parquet file and DataFrame load operations that are not detecting the partitions follow the version details.
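A Spark-style layout of the shape described; the dates and part-file names here are hypothetical, and only the two partition column names come from the dataset above:

```
mydata/
├── dataset_date=2021-06-01/
│   ├── device_family=phone/
│   │   └── part-00000-xxxx.snappy.parquet
│   └── device_family=tablet/
│       └── part-00000-xxxx.snappy.parquet
└── dataset_date=2021-06-02/
    └── device_family=phone/
        └── part-00000-xxxx.snappy.parquet
```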
Julia Version 1.6.1
Parquet v0.8.3
DataFrames v1.2.0
Parquet files written by Spark:
Spark version 2.4.4
Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
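A minimal sketch of the failing read, with `mydata` again standing in for the dataset root:

```julia
using Parquet, DataFrames

# Read the partitioned dataset root and materialize it as a DataFrame.
# "mydata" is a placeholder for the actual dataset directory.
df = DataFrame(read_parquet("mydata"))

# The partition columns are absent from the result:
names(df)  # neither "dataset_date" nor "device_family" appears
```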
Expected behavior is the same result as when reading the dataset back in Spark, where the last two columns of the returned frame reflect the partition values encoded in the file paths.