Matching optional GeoParquet with default dataframe #180
-
As of commit 0247d95 the following has been added/updated to the parquet server-side code and is available for private clusters using the
Still to do:
To discuss:
-
@jpswinski Looks like the geometry in the parquet output is also flipped: it should be (Lon, Lat) but is currently (Lat, Lon), for the grandmesa example above.
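One quick way to confirm the ordering (a sketch, not from the original post; it assumes the grandmesa-sliderule.parquet file from the comparison below and point geometries):

```python
import geopandas as gpd

# Hypothetical check against the SlideRule-written file from the comparison below
gdf = gpd.read_parquet("grandmesa-sliderule.parquet")

# With the correct (lon, lat) ordering, geometry.x should hold longitudes
# (around -108 for Grand Mesa) and geometry.y latitudes (around 39).
print(gdf.geometry.x.head())
print(gdf.geometry.y.head())

# The file also carries explicit lat/lon columns, so the two can be compared directly
print((gdf.geometry.x == gdf["lon"]).all(), (gdf.geometry.y == gdf["lat"]).all())
```

If the geometry was written (lat, lon), the x/y values come back swapped relative to the lon/lat columns.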
-
I think it's just another column from the parquet point of view. This is why pandas adds in the custom metadata to re-create the dataframe as it appeared when saving. The key seems to be a metadata mapping to tell the software library (pandas) which column to use as the index:

```python
# Consider time as just another column
gf = gfs[:3].drop(columns=['extent_id', 'lat', 'lon'])
gf['time'] = gf.delta_time.apply(lambda x: pd.Timestamp('2018-01-01') + pd.Timedelta(x, unit='s'))
# gf.set_index('time', inplace=True)
gf.sort_index(inplace=True)
```

In the same way as adding the …
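For reference, that mapping lives in the file-level schema metadata that pyarrow writes under the b"pandas" key. A minimal way to inspect it (a sketch, assuming pyarrow as the engine and the pandas-written file from the original post):

```python
import json
import pyarrow.parquet as pq

# Read only the schema; the pandas metadata is stored as JSON in the
# key/value metadata of the parquet file footer.
schema = pq.read_schema("grandmesa-frompandas.parquet")
pandas_meta = json.loads(schema.metadata[b"pandas"])

# 'index_columns' tells pandas which column(s) to restore as the index;
# 'columns' records the name and dtype of every column (including the index).
print(pandas_meta["index_columns"])
print([c["name"] for c in pandas_meta["columns"]])
```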
-
@scottyhq, as of commit 15feadd the following has been added/updated to the parquet server-side code and is available for private clusters using the
For convenience, the functionality previously added was:
Still to do:
Questions:
-
The initial GeoParquet output option is great, and opens up possibilities for a lot of exciting workflows! I see there are already some relevant issues open (#159, #160, #171), but thought a meta-level discussion of the initial implementation might be useful.
I think a good goal could be parity with the existing default pandas dataframe output.
As a Python user, I did a quick test of the grandmesa example notebook, returning a dataframe and then writing to disk with:
```python
atl06_sr.to_parquet('grandmesa-frompandas.parquet', version='2.6')
```
Compared to directly using an output config with:
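Roughly along these lines (a sketch, not the exact request: the polygon, endpoint, and parameter values are illustrative placeholders, and the `output` keys follow my reading of the SlideRule request parameters):

```python
from sliderule import icesat2

icesat2.init("slideruleearth.io")  # endpoint/initialization may differ by client version

# Illustrative bounding polygon around Grand Mesa, CO (placeholder coordinates)
poly = [
    {"lon": -108.3, "lat": 38.8},
    {"lon": -107.8, "lat": 38.8},
    {"lon": -107.8, "lat": 39.2},
    {"lon": -108.3, "lat": 39.2},
    {"lon": -108.3, "lat": 38.8},
]

parms = {
    "poly": poly,
    "srt": icesat2.SRT_LAND,
    "cnf": icesat2.CNF_SURFACE_HIGH,
    # Ask the server to write a parquet file directly instead of
    # streaming records back into a dataframe
    "output": {
        "path": "grandmesa-sliderule.parquet",
        "format": "parquet",
        "open_on_complete": True,
    },
}
atl06_sr = icesat2.atl06p(parms)
```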
Comparing these files I notice a few things:
- Both files use parquet format version 2.6, which is necessary to support nanosecond timestamps (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html)
- grandmesa-sliderule.parquet has no familiar time index but instead a RangeIndex with additional columns (extent_id, lat, lon). What is extent_id? I think lat, lon can be dropped since they are easily extracted from the geometry objects. It would be great to keep that sorted time index by default.
- pandas does not automatically generate row_groups. For this case the total file size is 20MB, so I think the row_group heuristic could be adjusted (see the sketch after this list). Glancing around, it seems the default row_group size is 128MB+ (https://parquet.apache.org/docs/file-format/configurations/)

library versions:
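To compare the row-group layout of the two files (a sketch, assuming pyarrow and the filenames above):

```python
import pyarrow.parquet as pq

# Print row-group statistics for both outputs
for path in ["grandmesa-frompandas.parquet", "grandmesa-sliderule.parquet"]:
    md = pq.ParquetFile(path).metadata
    print(path)
    print("  format version:  ", md.format_version)
    print("  num rows:        ", md.num_rows)
    print("  num row groups:  ", md.num_row_groups)
    if md.num_row_groups > 0:
        print("  rows in group 0: ", md.row_group(0).num_rows)
```

When writing from pandas/geopandas, the row-group size can also be set explicitly, e.g. `atl06_sr.to_parquet(..., row_group_size=100_000)`, since extra keyword arguments are passed through to pyarrow.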