Matching optional GeoParquet with default dataframe #180
-
As of commit 0247d95 the following has been added/updated to the parquet server-side code and is available for private clusters using the
Still to do:
To discuss:
-
@jpswinski Looks like the geometry in the parquet output is also flipped: it should be (Lon, Lat) but is currently (Lat, Lon), for the grandmesa example above.
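One quick way to confirm the ordering (a sketch, not from the original post; it assumes the grandmesa-sliderule.parquet file from the comparison below and point geometries):

```python
import geopandas as gpd

# Hypothetical check against the SlideRule-written file from the comparison below
gdf = gpd.read_parquet("grandmesa-sliderule.parquet")

# With the correct (lon, lat) ordering, geometry.x should hold longitudes
# (around -108 for Grand Mesa) and geometry.y latitudes (around 39).
print(gdf.geometry.x.head())
print(gdf.geometry.y.head())

# The file also carries explicit lat/lon columns, so the two can be compared directly
print((gdf.geometry.x == gdf["lon"]).all(), (gdf.geometry.y == gdf["lat"]).all())
```

If the geometry was written (lat, lon), the x/y values come back swapped relative to the lon/lat columns.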
-
I think it's just another column from the parquet point of view. This is why pandas adds in the custom metadata to re-create the dataframe as it appeared when saving. The key seems to be a metadata mapping to tell the software library (pandas) which column to use as the index:

```python
# Consider time as just another column
gf = gfs[:3].drop(columns=['extent_id', 'lat', 'lon'])
gf['time'] = gf.delta_time.apply(lambda x: pd.Timestamp('2018-01-01') + pd.Timedelta(x, unit='s'))
# gf.set_index('time', inplace=True)
gf.sort_index(inplace=True)
```

In the same way as adding the …
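For reference, that mapping lives in the file-level schema metadata that pyarrow writes under the b"pandas" key. A minimal way to inspect it (a sketch, assuming pyarrow as the engine and the pandas-written file from the original post):

```python
import json
import pyarrow.parquet as pq

# Read only the schema; the pandas metadata is stored as JSON in the
# key/value metadata of the parquet file footer.
schema = pq.read_schema("grandmesa-frompandas.parquet")
pandas_meta = json.loads(schema.metadata[b"pandas"])

# 'index_columns' tells pandas which column(s) to restore as the index;
# 'columns' records the name and dtype of every column (including the index).
print(pandas_meta["index_columns"])
print([c["name"] for c in pandas_meta["columns"]])
```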
-
@scottyhq, as of commit 15feadd the following has been added/updated to the parquet server-side code and is available for private clusters using the
For convenience, the functionality previously added was:
Still to do:
Questions:
-
The initial GeoParquet output option is great, and opens up possibilities for a lot of exciting workflows! I see there are already some relevant issues open (#159, #160, #171), but thought a meta-level discussion of the initial implementation might be useful.
I think a good goal could be parity with the existing default pandas dataframe output.
As a Python user, I did a quick test of the grandmesa example notebook, returning a dataframe and then writing to disk with:
```python
atl06_sr.to_parquet('grandmesa-frompandas.parquet', version='2.6')
```
Compared to directly using an output config with:
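Roughly along these lines (a sketch, not the exact request: the polygon, endpoint, and parameter values are illustrative placeholders, and the `output` keys follow my reading of the SlideRule request parameters):

```python
from sliderule import icesat2

icesat2.init("slideruleearth.io")  # endpoint/initialization may differ by client version

# Illustrative bounding polygon around Grand Mesa, CO (placeholder coordinates)
poly = [
    {"lon": -108.3, "lat": 38.8},
    {"lon": -107.8, "lat": 38.8},
    {"lon": -107.8, "lat": 39.2},
    {"lon": -108.3, "lat": 39.2},
    {"lon": -108.3, "lat": 38.8},
]

parms = {
    "poly": poly,
    "srt": icesat2.SRT_LAND,
    "cnf": icesat2.CNF_SURFACE_HIGH,
    # Ask the server to write a parquet file directly instead of
    # streaming records back into a dataframe
    "output": {
        "path": "grandmesa-sliderule.parquet",
        "format": "parquet",
        "open_on_complete": True,
    },
}
atl06_sr = icesat2.atl06p(parms)
```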
Comparing these files I notice a few things:
- Both files use parquet format version 2.6, which is necessary to support nanosecond timestamps (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html)
- grandmesa-sliderule.parquet has no familiar time index but instead a RangeIndex with additional columns (extent_id, lat, lon). What is extent_id? I think lat, lon can be dropped since they are easily extracted from the geometry objects. It would be great to keep that sorted time index by default.
- pandas does not automatically generate row_groups. For this case the total file size is 20MB, so I think the row_group heuristic could be adjusted (see the sketch after this list). Glancing around, it seems the default row_group size is 128MB+ (https://parquet.apache.org/docs/file-format/configurations/)

library versions:
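To compare the row-group layout of the two files (a sketch, assuming pyarrow and the filenames above):

```python
import pyarrow.parquet as pq

# Print row-group statistics for both outputs
for path in ["grandmesa-frompandas.parquet", "grandmesa-sliderule.parquet"]:
    md = pq.ParquetFile(path).metadata
    print(path)
    print("  format version:  ", md.format_version)
    print("  num rows:        ", md.num_rows)
    print("  num row groups:  ", md.num_row_groups)
    if md.num_row_groups > 0:
        print("  rows in group 0: ", md.row_group(0).num_rows)
```

When writing from pandas/geopandas, the row-group size can also be set explicitly, e.g. `atl06_sr.to_parquet(..., row_group_size=100_000)`, since extra keyword arguments are passed through to pyarrow.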