-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: add xvec accessor and to_crs, set_crs, geom_coords_names #11
Conversation
Codecov ReportBase: 97.72% // Head: 98.53% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #11 +/- ##
==========================================
+ Coverage 97.72% 98.53% +0.80%
==========================================
Files 1 2 +1
Lines 132 205 +73
==========================================
+ Hits 129 202 +73
Misses 3 3
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Great! A signature very familiar to Xarray users is this: Dataset.xvec.coords_to_crs(coord_crs=None, **coord_crs_kwargs)
# e.g.
ds.xvec.coords_to_crs(geom1=crs, geom2=other_crs)
# or
ds.xvec.coords_to_crs({"geom1": crs, "geom2": other_crs}) But I also agree that setting a common CRS for all geometry coordinates may be convenient and is consistent with geopandas. What about a
I think it is fine to loop through the coordinates and chain the |
I am not sure about the |
Since I wrote https://github.com/martinfleis/xvec/pull/11#issuecomment-1326097578 I read from discussions in GeoPandas and other places that a single CRS is used for the whole Dataframe / object. Rioxarray / odc-geo also consider one global CRS per DataArray / Dataset. So I agree and changed my mind regarding https://github.com/martinfleis/xvec/pull/7#discussion_r1029284508, we might consider a single CRS for the whole Dataset / DataArray too. How to enforce that, though? Unlike GeoPandas, there's no concept of an active geometry coordinate in xvec and it probably shouldn't be one? I'm uncertain on whether we want to add a # internal attribute for getting all geometry coordinate names
# (set in the accessor __init__)
Dataset.xvec._geom_coord_names
# property returning all geometry coordinates (subset of Dataset.coords)
Dataset.xvec.geom_coords
# read (write?) property returning a single CRS
# or raising an error if conflicting CRS are found
Dataset.xvec.crs
# (re)assign the CRS of all or given geometry coordinates without re-projection
Dataset.xvec.set_crs(coords=None, crs=None, ...)
# (re)assign the CRS of all or given geometry coordinates with re-projection
Dataset.xvec.to_crs(coords=None, crs=None...) |
I tend to disagree here. GeoPandas does not have a concept of a single CRS for the whole DataFrame. You can easily have many GeoSeries in the gdf each with its own CRS and there are use cases for that I've used myself. The feeling that there is one per gdf si coming from the active geometry concept only. Treating each coords as a single GeoSeries, hence allowing it to have a different CRS makes sense to me. I am not sure how the single CRS per DataArray here would work and why we should do it. I think that with the transformation, we will have two use cases - 1) change a CRS of a single coords array, likely to match the CRS or the other or to use 3857 for easy plotting (that is the exact use case where you may want to have multiple projections in one DataArray and use one coords for geometry ops and the other for plotting); 2) harmonise all projections among all geometry coords. Having an API that supports both at the same time feels a bit unclean and can be confusing. So I would use the API you suggested above for the first case and figure out another one for the second one. Resulting CRS api would then be: # internal attribute for getting all geometry coordinate names
# (set in the accessor __init__)
Dataset.xvec._geom_coord_names
# property returning all geometry coordinates (subset of Dataset.coords)
Dataset.xvec.geom_coords
# (re)assign the CRS of all or given geometry coordinates without re-projection
Dataset.xvec.set_crs(coords=None, crs=None, ...)
# transform one of more coords to a CRS set for each
Dataset.xvec.coords_to_crs(coord_crs=None, **coord_crs_kwargs)
# (re)assign the CRS of all or given geometry coordinates with re-projection
Dataset.xvec.harmonize_crs(coords=None, crs=None...) |
Ah ok, I must have misunderstood something (so many great discussions happening in a lot a different places!) and I guess I was confused by the description (header) of GeoDataFrame.set_crs. You are re-changing my mind then 🙂. I like I'm confused by I guess one of the following would confuse me less:
|
Is see. It seems to be an artefact from the past when more geometry columns were not supported. That line should change to "Set the Coordinate Reference System (CRS) of the active geometry column". But the line below slightly specifies it:
I see the confusion on
this seems to be a reasonable and consistent solution. Removing it in favour of harmonising will not work as for the transformation you need to know the original CRS and you need a way to set it if you don't have one. |
As far as I understand it, geometry / geography variables don't have to be coordinates (though most often they will be, to be able to use indexes). As such, I wonder if it would make sense to change the api to something like this: Dataset.xvec.assign_crs(variables=None, **variable_kwargs)
Dataset.xvec.to_crs(variables=None, **variable_kwargs) This is what Edit: I now see that this is somewhat implied in the suggestions for |
I would try to avoid taking this complexity (which I believe R's There should be the ability to convert geometries associated with a dimension from one CRS to another, but I would try to avoid the complexity of catering for (and having to manage) multiple of them and have the user specify which one is active here. The geometry data associated with the dimension is typically small compared to the payload of the values in the data cube. |
theoretically, they can be also data but it is currently unclear how operations on N-D array of shapely geometries would work. Shapely currently assumes one dimension and we would likely need to flatten and then reshape the arrays to do spatial ops on top. And given we wouldn't have a GeometryIndex on top of data, we would need to find another way of storing the CRS. That said, I think it is in scope of xvec so it make sense to think forward and set the API future-proof. |
@edzer I agree there's no need to specify an active geometry coordinate for vector data cubes, we can just treat those cubes as collections with none, one or more geometry variables and iterate over all or a given subset of them.
Is it because there is usually only one (spatial) index in a Table or DataFrame? Xarray supports well co-existing indexes for different dimensions (and now even for different coordinates of the same dimension), so having an active geometry coordinate is not very relevant indeed.
@martinfleis this is only for certain features like STRTree, right? For all element-wise operations (predicates, measurement, etc.), there's no particular reason not to support them for any n-dimensional coordinate or data variable.
Two possible ways are (1) as an Xarray variable attribute and (2) by wrapping the variable data in a "duck" nd-array with a CRS attribute. We could define some general rules to extract the CRS from coordinates or data variables (https://github.com/martinfleis/xvec/issues/5#issuecomment-1316603760).
I would rather avoid methods with too many arguments. If we want to update the CRS for all (data | coordinate) variables, perhaps |
Yeah, it is a small subset but not only STRtree. Functions like So, compiling all the suggestion made above, does this API look reasonable? # internal attribute for getting all geometry coordinate names
# (set in the accessor __init__)
Dataset.xvec._geom_coord_names
# property returning all geometry coordinates (subset of Dataset.coords)
Dataset.xvec.geom_coords
# (re)assign the CRS one of more variables without re-projection (allow_override to match GeoPandas API)
# in case of coordinates this updates GeometryIndex
Dataset.xvec.set_crs(variables=None, allow_override=False, **variable_kwargs)
# transform one of more variables to a CRS set for each
# in case of coordinates this updates GeometryIndex
Dataset.xvec.to_crs(variables=None, **variable_kwargs) I've removed |
👍 (nit: I'd rather use |
This should now be ready for a review. Thanks all for the fruitful discussion! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @martinfleis! I left a couple of comments.
xvec/accessor.py
Outdated
def __init__(self, xarray_obj: Union[xr.Dataset, xr.DataArray]): | ||
"""xvec init, nothing to be done here.""" | ||
self._obj = xarray_obj | ||
self._geom_coords_names = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't we return all geometry coordinates even those which don't have a GeometryIndex (yet)? This would be convenient, e.g., for setting a new GeometryIndex for each geometry coordinate.
I guess it could be possible to check for valid coordinates at a reasonable cost?
is_geom_variable(name, obj):
if isinstance(obj.xindexes.get(name), GeometryIndex):
return True
elif obj[name].dtype is np.dtype("O"):
# try first on a small subset
subset = obj[name].data[0:10]
if np.all(shapely.is_valid_input(subset)):
return np.all(shapely.is_valid_input(obj[name].data))
return False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possibly, but I also find it useful to return just those backed by the GeometryIndex
. I am using it below myself. What about having geom_coords
returning all and geom_indexes
returning the list (or a collection as you suggest below) of those backed by the index?
Co-authored-by: Benoit Bovy <[email protected]>
@benbovy I went ahead and implemented you suggestions.
|
LGTM! Thanks @martinfleis |
Closes #10 and resolves #5.
This is very much WIP. Implementing
.xvec
accessor and acoords_to_crs
method, that practically copies the code from GeoPandas. We may consider going throughGeometryArray.to_crs
instead but I'd wait until shapely 2.0 is the only supported engine. Right now, it may cause a bit of trouble and unnecessary conversions.It currently supports only a single coords. While
drop_indexes
support dropping multiple indexes at the same time,set_xindex
seems to support only a single one, so we will need a workaround. I think that it may be useful to use a singlecoords_to_crs
call to consolidate projections of all geometries instead of going one by one.To-do: