-
-
Notifications
You must be signed in to change notification settings - Fork 270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pyarrow parquet provider #1722
Pyarrow parquet provider #1722
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@barbuz this is a very welcome addition! I am not (yet) very well versed in (Geo)Parquet, so some general comments/suggestions first:
- documentation: add, in particular to Feature Providers: https://docs.pygeoapi.io/en/latest/data-publishing/ogcapi-features.html#providers
- CRS support: see https://docs.pygeoapi.io/en/latest/crs.html . So the Parquet provider may itself support or let pygeoapi do (
@crs_transform
on the Provider functionsquery()
andget()
) - Q: will Provider support both Parquet and GeoParquet ? pygeoapi also supports geometry-less data (ok, I see
self.has_geometry
test) - Q: so
source
may be a local file or remote URL (with HTTP range support)? - see CRS support above: looks like result is always in WGS84.
That's it for now, don't hesitate if you have questions!
Today we also discussed this PR at the pygeoapi PSC meeting. It was mentioned that the OGR provider also may support Parquet, but like with other providers like PostgreSQL a native/direct implementation can be more efficient, so yes, we like to go ahead. One question was if the GeoPandas dependency is required. Background is that GeoPandas itself may have quite some dependencies. We try to avoid too many dependencies for now (builds, Docker image size). Like said, I am not an expert on Parquet, but the question arose: is it possible to only have |
We are implementing a geopandas generic provider in cgs-earth/pygeoapi-plugins for the exact same reason you mention @justb4. OGR provider technically means you can serve CSV with all of the OAF queryables, but it can be a barrier to use for someone new to OGR. Much easier to have the same configuration as the CSV provider with a slightly more capable backend implemented. We understood this to be out of scope in pygeoapi, hence hosting it in our public repo of pygeoapi plugins. This seems like a hybrid between the generic geopandas provider that I mentioned above and a geoparquet specific provider. @barbuz which one is this PR supposed to be? |
Thanks for your comments! I'll try to answer to everything:
Added.
geoParquet encodes the CRS of the data together with the data, so I found it more convenient to rely on that than to specify the CRS of each dataset in the pygeoapi configuration. Looking at it again I probably did it in a way that does not work well with pygeoapi's management of bbox and crs, so I have now switched to using
Yes, I wrote this mainly to work with GeoParquet, but also added support for geometry-less data (plain Parquet)
Currently a local file or AWS S3 URL. That was what I specifically needed, but it should be possible to add support for HTTP (or other protocols) if there is desire for that.
Result is now in the data's native CRS, pygeoapi takes care of transformations.
I am not sure about this, it would surely need some research as I am not too familiar with geoarrow. The current state of the code basically only relies on geopandas to build the "geo_interface" to the data (i.e. the dictionary pygeoapi expects to be returned by a query or get call), so I think it should not be too hard to replace this with something else. Could you just help me understand exactly what structure should this dictionary have? I found out that this was working but I am not sure if it follows some particular standard.
This is intended to be a GeoParquet provider. It also supports plain Parquet as an edge case (a GeoParquet file without a geometry column is basically just a Parquet file), but it was never meant to be any more generic than that, surely not a generic geopandas provider. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@barbuz thanks for these answers and additions!
To answer on the query()
return structure: this is basically an in-memory GeoJSON FeatureCollection
, with some (optional) additions. You may look into some other providers how this is populated. For example mongo.py
feature_collection = {
'type': 'FeatureCollection',
'features': featurelist, // array of Feature dict, see https://geojson.org/
'numberMatched': matchcount, // OPTIONAL
'numberReturned': len(featurelist) // OPTIONAL
}
To add:
- Dockerfile : add deps. As Debian packages are preferred, these would be
python3-arrow
andpython3-geopandas
Still, the main concern is the growth in disk space of the deps: on MacOS: pyarrow
about 100MB and geopandas
around 105MB. This is added to the current about 306MB for all current pygeoapi deps. geoarrow
adds only a couple of MB. But like you indicated more expertise is needed, basically to convert .parquet files to our internal GeoJSON structure (see above). I looked a bit around but could not find a concrete example. I ping here @bertt and @cholmes . Maybe they can be of help for GeoArrow..
Also the Docker Image (now over 1GB) will grow.
This is a generic problem that we, devteam, want to address: how to distribute pygeoapi with only required/minimal deps as needed for a user's use case...
I'll just help by pinging the true geoarrow experts - @paleolimbot @jorisvandenbossche @kylebarron Joris also works directly on geopandas. They can sound in if there's some lightweight way to do things, but unfortunately parquet is often a pretty heavy.
It might be interesting to explore using geoarrow as the internal structure instead of GeoJSON. Obviously that'd be a big change, but something to think about, as there's potentially some really interesting benefits - you'd align with the memory structure of other geospatial tools for very efficient interoperability between libraries. |
It sounds like you need two things: to read GeoParquet and convert to GeoJSON. pyarrow isn't enough on its own, because pyarrow has no knowledge of spatial types, and you need something to understand the geospatial nature of GeoParquet and convert to GeoJSON. You could implement something by hand using pyarrow and shapely, but that wouldn't be trivial. pyarrow and geopandas are indeed very large dependencies. This is a large part of the reason I've been working on https://github.com/geoarrow/geoarrow-rs. Its Python bindings are 10MB total, including both the GeoParquet reader/writer and other general geospatial functionality, like GeoJSON conversion, and it has no dependencies of its own. But it's also not fully production ready just yet. Indeed, you will lose most of the performance benefits of GeoParquet by converting the data in memory GeoJSON, but it will at least work. Using GeoArrow internally would be a big performance improvement, but would also be a big change. |
Parquet/GeoParquet support would definitively be valuable! I think the use case here is to have Parquet/GeoParquet as a "backend" provider to be able to emit GeoJSON. Using Parquet/GeoParquet would be an interesting experiment. Given the way pygeoapi data providers work, GeoJSON in memory is a much smaller footprint given OGC API workflow (subsetting, no bulk data transfer per se) and GeoJSON are used for query results (and are simply Python dictionaries). Dependency wise, it's clear the GeoParquet dependencies would be optional and not affecting the core. Which is great/by design given our architecture. |
Do you mean that it could be fine to leave geopandas as a dependency for this provider? Replacing geopandas with geoarrow is something I could try to work on, as is adding support for HTTP access to data, properly managing the CRS internally, and a number of other things... I'm just trying to understand what would be the best use of my (sadly limited) time: what is actually needed to get this PR approved? |
Thanks @cholmes and @kylebarron! If @tomkralidis agrees, I suggest the way forward for now:
Indeed through And at same time open a new issue to investigate the use of GeoArrow i.s.o. GeoPandas, plus for possibly other extensions like internal CRS transforms, HTTP Urls, CQL support etc. I would love to access Overture data via OGC API Features, makes compelling use case e.g. in QGIS! |
Thanks @justb4; yes, we could add to Having said this, on a clean virtualenv, for example, @kylebarron do ping us when geoarrow-rs is close to production ready. |
We might need to pin geopandas or numpy if we introduce geopandas as a dependency. Geopandas <2.2.2 is only compatible with numpy<2. Just had some issues in our geopandas provider last week managing these deps. |
Yes, about my figures. The "geoarrow-pyarrow adds 222MB " is mostly IMO @barbuz think we are good to go. As last possibly can you look in maybe pinning, as @webb-ben states (meaning < Pandas < 2.2.2 I guess)? In my venv Python 3.10.12, I have |
Yes you are correct I meant pandas 2.2.2. I am able to run on my local environment. Challenge has been getting a Docker image to build. See this workflow for reference. Running
and yields this error:
|
To be clear there are multiple core geoarrow libraries for Python. geoarrow-pyarrow is developed by @paleolimbot and depends on pyarrow and his own |
The above workflow run errors refer to CKAN, not sure if related. The Docker Image can have a very different version and dep tree compared to local custom pypi installs. This has to do with the Ubuntu (Jammy) base image, Python versions, the use of UbuntuGIS (unstable), the use/preference for
For this PR we will hold off adding to Docker, also as a new issue is underway to streamline different Docker Images for varying plugin support. This will take some deep thinking and hopefully help from the community. As to progress this PR, if @barbuz indicates that the current PR version (unit tests) works with the current deps and |
I can confirm that the unit tests for this provider can be passed with the current However, I must say that there are tests for other providers that I could not get to pass neither in the main branch nor in this branch, likely because I am missing some dependencies (e.g. elasticsearch, mongo, oracle, postgresql). Should I try to get these to work before merging the PR or is this acceptable? |
I would worry only about the parquet tests. Our Once this PR is ready for review, we will enable CI which will run all tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, we're good, thanks for patience @barbuz ! @tomkralidis will also do final review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Various comments/change requests -- fantastic work here!
Overall, we need to decide how to document the plugin (Parquet, GeoParquet?).
As well, see comments in the actual plugin on opening/closing dataset files (may need attention, or not).
Thank you for all your comments! This provider can support both Parquet and GeoParquet. I am not sure what the best way to call it is; any GeoParquet file is also valid Parquet (so calling it a Parquet provider is correct), but as a geospatial API maybe we care more about the "Geo" part (so calling it GeoParquet could be better)? We could also call it a Pyarrow provider, as it mainly leverages that library to work with Parquet data. What do you think? Unfortunately I likely won't have much time to work on this in the next two weeks because I will be travelling, if it is not a problem I will pick it up on my return. |
OK great! Let's go with calling it a Parquet plugin, then. And in the plugin documentation mention that GeoParquet is supported also.
No problem, will look forward to your updates in a couple of weeks. |
@tomkralidis wouldn't be more clear calling it the other way around so GeoParquet? |
Thinking more, for me, GeoParquet would mean that "geo" is required, much like our GeoJSON (JSON) plugin (which requires geometry, but can be I don't have strong opinions on this one, though. If folks feel strongly enough, let's name to GeoParquet (ensuring that the documentation mentioned that plain Parquet is also supported). |
I've checked GDAL and they use |
+1 |
Parquet as a name is more future proof, as our hope is that eventually
geoparquet goes away and geo is just part of the core parquet spec.
…On Sun, Jul 28, 2024 at 11:41 AM Tom Kralidis ***@***.***> wrote:
@tomkralidis <https://github.com/tomkralidis> wouldn't be more clear
calling it the other way around so GeoParquet?
Thinking more, for me, GeoParquet would mean that "geo" is *required*,
much like our GeoJSON (JSON) plugin (which requires geometry, but can be
null). We've named the PostgreSQL in this manner as well (in addition to
elasticsearch, xarray, etc.).
I don't have strong opinions on this one, though. If folks feel strongly
enough, let's name to GeoParquet (ensuring that the documentation mentioned
that plain Parquet is also supported).
I've checked GDAL and they use Parquet as the short name and (Geo)Parquetas
the long one. So then, what about to keep the provider name Parquet in the
code as you suggested and rename it (Geo)Parquet in the documentation page?
+1
—
Reply to this email directly, view it on GitHub
<#1722 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADDL2M2YZ5FPANQQFPCWLDZOUNMPAVCNFSM6AAAAABK3UWURSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENJUGU3TMMZVGY>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thanks @cholmes ! Think this nails the name to We are really almost there for this PR. Remains:
Tip: there is a pre-commit hook: https://github.com/geopython/pygeoapi/blob/master/.pre-commit-config.yaml or when |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we overlooked but the Provider class name in plugin.p
y reads: 'pygeoapi.provider.parquet.parquet.ParquetProvider'
. Think one .parquet
too many (?).
Sorry for the delay, I've just pushed some changes to address your reviews |
Overview
I have been using pygeoapi for my work and ended up developing a Features provider based on
pyarrow
andgeopandas
to manage parquet and geoparquet data. I am happily using it in my local instance, and I thought it could be useful to other people and projects as well, so here is my proposed contribution.Additional information
This is my first PR on pygeoapi so please let me know if I have missed anything. I am also happy to address any comments or suggestions for this code.
Other options I have tried to manage geoparquet data are
dask-geopandas
and plaingeopandas
, butpyarrow
turned out to have the most efficient and complete implementation (in fact it is used internally by both other libraries).Dependency policy (RFC2)
Updates to public demo
Contributions and licensing
(as per https://github.com/geopython/pygeoapi/blob/master/CONTRIBUTING.md#contributions-and-licensing)