-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for spherical geometry (S2) in Python #10
Comments
And the previous community meeting, we had a discussion about this with @benbovy. Pasting the notes from that meeting here:
|
Some updates on this: I've tried using pybind11 and xtensor-python to implement some vectorized functions working on numpy There has been some effort to support numpy |
Thanks for the update @benbovy ! I haven't been able to fully get my head around the use of pybind11 / xtensor-python parts here, but a couple thoughts came to mind:
|
@benbovy on a less technical note, if you still have a capacity and an interest in applying for the NumFOCUS SDG, the timeline for the next round is following:
If we should go ahead, I suggest organising a call in the first half of February to discuss the details. |
Thanks for your feedback @brendan-ward !
I agree, we could certainly start with 1-dimensional arrays. That said, longer term I'd be interested in exploring the possibility of reusing n-dimensional vectorized S2 Python wrappers (as well as pygeos) with Xarray, which is undergoing some heavy refactoring to allow custom indexes. Xarray + geospatial indexes would be great for processing outputs of climate / ocean / landscape evolution / earth system models.
That's my main concern. Unfortunately, there's no On the other hand, if we can make this possible without much effort, simply using pybind11 (and xtensor-python) to create the Python wrappers would allow us to develop new features / iterate very quickly and would avoid the burden of maintaining an intermediate C API + dealing with the low-level Numpy C API (+ vendoring S2). It not yet clear to me what is needed to make this approach work, though. It would be great to have thoughts from folks at QuantStack. @wolfv, have you ever thought about adding vectorized functions / ufuncs working with numpy arrays of wrapped CGAL objects (as |
@martinfleis that sounds good. I'm happy to discuss the details within the next two weeks. Before going further with a NumFOCUS application, I'd like to clarify the technical challenges of the C API vs. pybind11 approaches. TBH I'm not sure to have the full capacity of going deep into either the design of a S2 C API or heavy customization of pybind11 internals, especially within the context of a NumFOCUS small development grant. |
@paleolimbot I'd be curious about your thoughts on a shared C API against S2 that could be leveraged for both Python, Specifically, if this is something of interest to the broader community, should we try and take this up directly with S2? Long term, it seems ideal that the C API lives inside that project, as it does with GEOS, and given the approach with GEOS it seems also reasonable that it not be expected to have full coverage of the underlying C++ API, at least not initially. However, that might lead to other challenges:
Since it looks like |
From watching the FOSS4G talk of @paleolimbot mentioned above (https://www.youtube.com/watch?v=zqRhF2XM1CE), it is my understanding that the S2 Geometry library itself it quite low level, and all the functionality to work on "simple features" (as what we get from GEOS for planar features) actually lives in the R If that's correct, we shouldn't put effort in directly interfacing with S2, but rather with the C or C++ layer in R's |
Thanks all for looping me in here! @jorisvandenbossche is right that anything resembling a GEOS-like interface lives in the R package. More specifically, I copied the BigQuery Geography API (which copies the PostGIS API more or less), because I could test it as I went. I'm conflicted about a C API for S2 because it robs S2 of some of the things that make it so great (despite the fact that I opened that ticket for myself). That said, S2 C++ is complicated to use and an awesome library that nobody has access to in Python is not doing anybody any good. I have some ideas about how to organize a C API and will give it a try since it seems like there may be interest in helping if I get it started. The nice part about doing this within the S2 R package is that there's near 100% test coverage, although eventually a C API should live elsewhere. Another approach I've been thinking about is one that operates on vectors of geometries (as ABI-stable C ArrowArrays) rather than individual geometries. Operating on chunks of geometries is nice because you don't have to sacrifice performance on points to support multipolygons (et al.)...you just send more features in a chunk and the implementation can loop without a bunch of tiny allocs or virtual method calls. This depends on the somewhat experimental stuff that Joris, I, and others have been working on with how geometry can be passed around as a Or a combination of both! I'll start my own C API ticket and see what happens. I probably missed a few things from the thread...let me know if I didn't catch something important! |
This sounds very interesting! I've been wondering where some of the boundaries might be with those (mostly a separate topic), namely serialization to / from file storage vs on the compute side between a client library (e.g., something in Python or R) and the underlying geometry implementation (e.g., GEOS or S2). Is the idea that the core object that is used in the client library becomes the ArrowArray, which is then passed to a layer in C/C++ that translates from those to GEOS or S2 objects? Is that limited to geometry construction / serialization from the underlying library (GEOS / S2), and otherwise the client holds on to wrapped pointers to GEOS or S2 objects?
I was thinking about this too in terms of trying to figure out which operations in a hypothetical C API for S2 would need to work on scalar geometries vs 1D arrays of geometries and have the underlying implementation do the looping rather than the caller (passing arrays seems very appealing). The part where I got hung up was that on the Python side (in pygeos/shapely), we're leveraging looping mechanics from numpy to give us the inner loop over multi-dimensional arrays and then operating on scalar geometries. I.e., it is the client that knows how to do the looping rather than the underlying implementation. Though our most common use cases so far are likely 1D arrays of geometries (though @benbovy 's comment re: n-dimensional arrays is well taken), and I might not be thinking about this correctly at all.
Yes - interested! (may not have tons of capacity to throw at this, but interested in being involved). I was thinking of drafting a Google Doc or similar to outline some thoughts on the C API, but your thoughts on this are going to be a much better starting point. Is there someplace where we (community) can start collaborating around details about the API more specifically? Is there interest in a meeting to discuss this topic, and shared needs in the S2 / ArrowArray space for both Python and R? |
From this discussion it seems that the best path forward would be to develop a common, GEOS-like C API for S2 that can be leveraged from different "client" ecosystems such as R or Python geospatial libraries. Although you guys certainly have much more experience than I have with GEOS, Arrow and/or the Numpy C API, I'd still be happy to help somehow if I'm able to do that, as I would really like to see more S2 capabilities accessible from within Python. Some naive questions: @paleolimbot how much effort do you think it would require to refactor bits from the S2 R package into a target-language agnostic C API? Do you think that short or mid-term we could already have something (even with a few features only) so that we can start building Python wrappers on it? Perhaps a very naive question: would it be possible to separate the building of a GEOS-like C API from the building of the S2 library itself? @brendan-ward's comment re: ideally the C API would live inside the S2 project, although I don't know if there's interest in that from the S2 maintainers point of view and this would also probably further slower the iteration speed. Note: there are conda packages for the S2 library. Could we easily reuse with Numpy arrays a C API that is working with C Arrow arrays? I.e., perform broadcasting, convert to Arrow array, call the C API and then convert back to Numpy array at little overhead cost? As a user, I indeed find the pygeos/shapely approach based on Numpy ufuncs quite appealing :), even though I acknowledge that it is probably overkill for 80% of use cases.
I was also wondering about that, but I guess that S2-specific wrappers may live alongside (and more-or-less independently from) a GEOS-like "standardized" S2 C API, depending on what we want to achieve using Python, R,...? |
@benbovy (speaking for myself here) - I don't think we've mady any decisions about which direction to pursue or even recommend, we're still gathering info to weigh pros / cons. So please don't take the suggestions around a C API for S2 as trying to shut down your work on the pybind11 / xtensor-python approach! It does seem that in elaborating how we'd want a C API to work would also inform how we'd want an API via pybind11/xtensor-python to work (though a good bit of that would be inside the C++ tier rather than API exposed to python). Question: if for the short term we wanted to stick to the 1D approach to get this implemented most quickly, and manage all the looping ourselves as a simple loop over that array, would that sidestep the issues around how to handle object arrays in xtensor-python that you were bumping into? I.e., sidestep the ufunc related issues specifically. Or is the issue more at a fundamental level of trying to have an object array of S2 objects at any dimension? (sorry - not entirely following your Yet another idea, no idea if it would work. I wonder if there is middle ground of using pybind11 but with a more C style interface rather than object-oriented interface (i.e., don't use pybind to expose C++ classes with methods). So in pygeos our core geometry object is just a struct that holds a couple pointers and is defined as a type: typedef struct {
PyObject_HEAD void* ptr;
void* ptr_prepared;
} GeometryObject; (other mechanics around managing the python type wrapping around this omitted here) Which are then always passed around as So (if I follow correctly), it seems like it would be possible (in theory, not up on pybind11 yet) to do something similar: constructors return a pointer to a struct (let's call this Then I think you could expose functions like So in this purely hypothetical example, pybind11 is really just helping with the boilerplate around creating a Python C extension, but allowing you to more easily have a C++ implementation underneath rather than having to define a C API in the middle. No idea what that means for working with numpy at the C++ level though. |
No worries at all! I'm really open about which direction to pursue and I'm enjoying the discussion here! Sorry if my tone suggested otherwise. The
That is what I had in mind too (instead of exposing more specific S2 derived types to Python).
Yes this is what I wanted to check. Taking your example, I was wondering if pybind11 could even help with the boilerplate around the
However, I couldn't get it to work with the |
@benbovy I did some testing with pybind11 around returning numpy arrays of custom objects here. Let me preface this by saying I am not at all up to speed on modern C++, esp. pointer semantics, still quite unfamiliar with pybind11, and basically don't know what. I'm doing here. I think this is working for a hypothetical example of wrapping a foreign object into a class that we then return as a pointer. I only tried the 1D array case and did not use The things that seemed to make this work were using For the example I construct a custom Point object from x and y input arrays (looping ourselves), where a Point holds an instance of a foreign object, meant to simulate an S2 object, which allocates and deallocates memory (I haven't yet profiled for memory leaks); it seems like it is calling the destructors properly and should release it. Point was just meant to stand in for a wrapper around S2 objects, not a specific concrete Point type. Not sure if this solves the specific issues you were bumping into above. |
Hi all...sorry for the radio silence, I wanted to put some fingers to keyboard before I said that something was going to be easy or hard! This thread has helped me a lot, in particular helping clarify that S2 doesn't necessarily need a C API, it really just needs the simple features compatability layer to live outside the code that interfaces with R. The real crux of it is that there is no I've opened a PR to start the process of extracting out a standalone With respect to the possibility of an |
Just a reminder that if we want to apply for the SDG grant in this round, the Proposal Submission Deadline is March 4, 2022. Cc @benbovy |
Hello everyone, getting back on this thread after a few busy weeks, sorry! @brendan-ward, @paleolimbot, many thanks for sharing what you've done! @brendan-ward - Your test was very helpful. I could also make work the 1D array case with xtensor-python, using @paleolimbot - I like your approach in @martinfleis - We're close to the dealine, indeed :/ I'll try to see if I can draft a proposal early next week, using as a reference the former proposals that you mentioned during the community meeting. I'll keep you updated! We can have a call to discuss the details if needed and if you're available next Monday or Tuesday. |
@benbovy we can always postpone it to another round (June 3). You can also check our GSoC project idea on S2 and salvage some text from that (https://github.com/geopandas/geopandas/wiki/Google-Summer-of-Code-2022#s2---bringing-spherical-geometry-to-python). I am available on Monday after 5pm UTC and Tuesday at almost any time feasible for Europe. But while I am happy to help organising, I am of no use technically in this. |
@benbovy I should be available after 16:00 UTC on Monday/Tuesday if it would be helpful for me to join, but definitely do not hold it up on my account. Happy to help review proposal / ideas. |
Ok, it's done! There's a long way to go, but at least the simple features <-> s2geometry mapping is now in its own repo: https://github.com/paleolimbot/s2geography . I'm not awesome at CMake but it seems to work locally and on GitHub actions! |
That's really great @paleolimbot ! |
We had a chat with Dewey, Benoît and Brendan yesterday, and some (partial/unstructured) notes can be found at https://hackmd.io/FiNFZ6wkQCufZyyxBewFYA?both |
Hi all, I’m happy to report some good progress on this front. s2geometry
s2geography
s2shapely (Python bindings)https://github.com/benbovy/s2shapely/
What do you think about the name “s2shapely”? I kind of like it, especially if we are able to replicate most of shapely’s API. Perhaps it would make sense to eventually move the repository into the shapely Github organization? The pybind11 workarounds used here still need to prove their "robustness", but if it works well in the general case (I'm optimistic 🙂) then I don’t see any major blocker for the next steps. I’m going to work on implementing more things in s2shapely by copying shapely’s API. (*) EDIT: not real Numpy ufuncs but pybind11 vectorized functions. |
@benbovy this is great!
I understand the choice but can already imagine a situation when debugging some geopandas bug:
I would probably try to pick another name that does not cause a confusion. The similar issue may happen when you are talking about it. It will always require clarification that this is the one with s2, not a shapely you already know. |
@martinfleis I agree this may be confusing in such situation. On the other hand, if the goal is to closely replicate shapely's API using a different geometry backend, choosing a name that is too unrelated may be confusing too for newcomers. To help clarifying between the two packages maybe we could provide different class names, e.g., Some alternative names:
|
I think s2shapely is probably accurate for what you're trying to do here and I doubt it will be all that confusing; however, you could also use the name 's2geography'. |
@benbovy thanks for the overview of todo items and progress!
I would personally lean to rather use a more distinct package name, but then the same class names to make the API as similar as possible (except things like base class geography vs geometry, etc). I was thinking the same as what @paleolimbot already mentioned as well: we could just use "s2geography", to keep it close to the C++ library it is wrapping? And since it is python vs C++, there is no naming conflict. Or do you think it is confusing because it might not be clear which of the two you are talking about? (most users will just use the python bindings, and so won't run into that confusion though) |
Although there is of course some conflict. For a package manager like conda, we would then need to use |
Also, 's2geography' for the C++ library is a name I pulled out of my bum one day when my daughter was napping and could totally be changed (although maybe it's already a conda package?) |
I like the name "s2geography" for the C++ library! Some other options: arcly Those names are short, clear, at the same time close and distinct enough from shapely. They are also available on pypi / conda. |
Maybe a bit hard to pronounce, though? Any other (curved-shape-word)ly idea? |
Another option: spherically |
From the -ly options, I think I might like I would personally still consider |
I'm on the same side as @jorisvandenbossche. |
+0.001 for "s2geography" (just because it opens the Python package up to doing more than just providing a compatible shapely interface). |
I like I'd prefer |
Yes that makes sense too! Although I think it is OK if "spherely" vs. "shapely" do not share the same exact API and features. That is actually my concern with using the same name "s2geography" for both the C++ and Python libraries: it restricts more the Python package to just providing Python bindings to the C++ library. |
Fair enough! 'spherely' is nicer to say too 🙂 |
General issue to track the idea of improving the support for working with spherical geometries (instead of planar) in GeoPandas and more generally the Python geospatial ecosystem.
Some background links:
sf
that they will use S2 by default: https://r-spatial.org/r/2020/06/17/s2.html#sf-10-goodbye-flat-earth-welcome-s2-spherical-geometryThe text was updated successfully, but these errors were encountered: