-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GeometryIndex #7
Add GeometryIndex #7
Conversation
As replacement of ShapelySTRTreeIndex.
Codecov ReportBase: 24.39% // Head: 97.72% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #7 +/- ##
===========================================
+ Coverage 24.39% 97.72% +73.33%
===========================================
Files 1 1
Lines 41 132 +91
===========================================
+ Hits 10 129 +119
+ Misses 31 3 -28
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Remove Index API that is meant to be only used via Xarray internals.
I think this is ready if someone wants to have a look (@martinfleis @keewis). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is superb!
What about calling the index GeometryIndex
instead of GeoVectorIndex
? Would that be ambiguous? We can later add something like GeographyIndex
when S2 becomes ready.
Later, we should also make a decorator fetching the docstring for isel
, create_variables
etc. from the original PandasIndex.
xvec/index.py
Outdated
*, | ||
options: Mapping[str, Any], | ||
): | ||
# TODO: try getting CRS from coordinate attrs or GeometryArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# TODO: try getting CRS from coordinate attrs or GeometryArray | |
# TODO: try getting CRS from coordinate attrs or GeometryArray or SRID |
Just so we don't forget about this option as well. It is not common but when loading geoms from PostGIS, this may be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can somehow get xarray
to extract the CRS from the GeometryArray
and store it on the object? That would probably have to be done in a hook that would be defined here... some sort of array type → hook callable
mapping where the hooks return either the full Variable
object or a data, attrs
tuple, maybe?
Also, that would require some thought into how to store the crs
. Not sure if we could copy whatever rioxarray
/ odc-geo
are doing, since those seem to have one global crs
per DataArray
/ Dataset
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it would be nice to extract the CRS from the GeometryArray, but for dimension coordinates Xarray currently converts the GeometryArray into a plain numpy array, therefore losing the CRS info. We need to fix that when refactoring IndexVariable in Xarray.
Also, that would require some thought into how to store the crs. Not sure if we could copy whatever rioxarray / odc-geo are doing, since those seem to have one global crs per DataArray / Dataset?
With a CRSIndex
(corteva/rioxarray#588) or RasterIndex
built from three coordinates (2D spatial + spatial_ref), having one global crs per DataArray / Dataset rioxarray may not be required anymore and the conventions regarding the coordinate names may even be relaxed.
While for rasters it makes sense to have a scalar coordinate holding the CRS info to avoid duplicating it, for vector data the coordinate <-> index relationship is 1:1 so we could add the CRS info directly as attribute(s) of the vector coordinate.
It may be a good idea to add CRS info (a serializable version) as a coordinate attribute, if not present. So basic CRS info (identifier) would be accessible from the DataArray / Dataset repr, and the full CRS object would be accessible from the index object. E.g.,
>>> ds.geom_coord.attrs["crs"]
'EPSG:4267'
>>> ds.xindexes["geom_coord"].crs
<pyproj.crs.CRS>
xvec/index.py
Outdated
# check for possible CRS of geometry labels | ||
# (by default assume same CRS than the index) | ||
if hasattr(label_array, "crs") and label_array.crs != self.crs: | ||
raise ValueError("conflicting CRS for input geometries") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will also need a change if crs becomes optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should we do when label CRS is not None and index CRS is None? Raise or ignore the label CRS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In geopandas we warn that the two CRS are not matching but let the operation happen. I'd do the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for being consistent with geopandas. Is it for all operations that could depend on a CRS?
(Note: it might be tricky to set the right stacklevel
for the warnings here since the index API is called in various places of Xarray's internals, but not a big deal I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When two CRS are checked against each other like in overlay or sjoin, if one gdf has a CRS and the other does not, it always warns and executes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I wonder if we shouldn't have a special case for .sel()
. I find it convenient to just pass one or more (list/array) shapely geometries as input labels without an explicit CRS. A warning may be annoying.
Co-authored-by: Martin Fleischmann <[email protected]>
Co-authored-by: Martin Fleischmann <[email protected]>
Co-authored-by: Martin Fleischmann <[email protected]>
Thanks @martinfleis for the review! I agree CRS could be optional.
Yes I agree it would make sense to rename it if later we add another index for geographic (S2) features, although alternatively we might want to reuse the same index class for both shapely/GEOS and S2 and select the right backend depending on the coordinate elements and/or the CRS? Should we pick a name that refer to both vector and georeferenced (even if optional) geometrical features? |
I'd like to have the behaviour of this the same as we will have in geopandas which is unfortunately not been discussed yet... If geopandas implements S2 under the
to understand properly, by "vector" you mean an numpy array of shapely geoms and georeferenced the same with assigned CRS? I am not sure I follow the distinction here. |
Ah I better understand your point now. I'm +1 for consistency with geopandas and for using |
Try being consistent with geopandas.
In the last commit I took and adapted some utility functions from geopandas. For |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For concat this is similar than in geopandas (i.e., raises if different non-empty CRS are found, otherwise maybe warns). For join and reindex_like it maybe warns (no error) but I'm not sure if this is what we really want for alignment. Shouldn't we do like for concat?.
I think that in all these cases, we should raise. I know that geopandas only warns in some cases, but all these are more similar to concat than sjoin. I even remember discussion that not even sjoin should allow that and raise but we never proposed a change there.
Apart from this and the notes in code I'd merge this and make further changes in follow-up PRs.
xvec/index.py
Outdated
|
||
# check for possible CRS of geometry labels | ||
# (by default assume same CRS than the index) | ||
label_crs = getattr(label_array, "crs", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this situation when label_array
has crs
attribute actually happen? Since we cast all inputs to the numpy.array
above, even if we have something like a GeometryArray
, we would strip it of CRS. Codecov marking l. 193 as untested seems to prove that.
I think we need to check for crs
of labels
not label_array
.
label_crs = getattr(label_array, "crs", None) | |
label_crs = getattr(labels, "crs", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it depends (note: labels
is a dict and label
is the extracted value).
- If
label
is aGeometryArray
then yes we need to checklabel.crs
- If
label
is axarray.Variable
orxarray.DataArray
wrapping aGeometryArray
(i.e., the._data
attribute), we need to checklabel_array.crs
The latter is not supported yet in Xarray, though, and we still need to figure out a more general way to extract CRS from various input types. So I suggest to remove the CRS check here and address that in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine with me. Let's move this to an issue to keep track.
xvec/index.py
Outdated
def _repr_inline_(self, max_width: int): | ||
srs = _format_crs(self.crs, max_width=max_width) | ||
return f"{self.__class__.__name__} (crs={srs})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _repr_inline_(self, max_width: int): | |
srs = _format_crs(self.crs, max_width=max_width) | |
return f"{self.__class__.__name__} (crs={srs})" | |
def _repr_inline_(self, max_width: int): | |
if max_width is None: | |
max_width = 50 | |
srs = _format_crs(self.crs, max_width=max_width) | |
return f"{self.__class__.__name__} (crs={srs})" |
Somehow this is raising a TypeError. Cell 6 in the Quickstart notebook:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/IPython/core/formatters.py:342, in BaseFormatter.__call__(self, obj)
340 method = get_real_method(obj, self.print_method)
341 if method is not None:
--> 342 return method()
343 return None
344 else:
File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/common.py:167, in AbstractArray._repr_html_(self)
165 if OPTIONS["display_style"] == "text":
166 return f"<pre>{escape(repr(self))}</pre>"
--> 167 return formatting_html.array_repr(self)
File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/formatting_html.py:318, in array_repr(arr)
316 if hasattr(arr, "xindexes"):
317 indexes = _get_indexes_dict(arr.xindexes)
--> 318 sections.append(index_section(indexes))
320 sections.append(attr_section(arr.attrs))
322 return _obj_repr(arr, header_components, sections)
File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/formatting_html.py:195, in _mapping_section(mapping, name, details_func, max_items_collapse, expand_option_name, enabled)
188 expanded = _get_boolean_with_default(
189 expand_option_name, n_items < max_items_collapse
190 )
191 collapsed = not expanded
193 return collapsible_section(
194 name,
--> 195 details=details_func(mapping),
196 n_items=n_items,
197 enabled=enabled,
198 collapsed=collapsed,
199 )
File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/formatting_html.py:155, in summarize_indexes(indexes)
154 def summarize_indexes(indexes):
--> 155 indexes_li = "".join(
156 f"<li class='xr-var-item'>{summarize_index(v, i)}</li>"
157 for v, i in indexes.items()
158 )
159 return f"<ul class='xr-var-list'>{indexes_li}</ul>"
File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/formatting_html.py:156, in <genexpr>(.0)
154 def summarize_indexes(indexes):
155 indexes_li = "".join(
--> 156 f"<li class='xr-var-item'>{summarize_index(v, i)}</li>"
157 for v, i in indexes.items()
158 )
159 return f"<ul class='xr-var-list'>{indexes_li}</ul>"
File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/formatting_html.py:139, in summarize_index(coord_names, index)
136 name = "<br>".join([escape(str(n)) for n in coord_names])
138 index_id = f"index-{uuid.uuid4()}"
--> 139 preview = escape(inline_index_repr(index))
140 details = short_index_repr_html(index)
142 data_icon = _icon("icon-database")
File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/formatting.py:413, in inline_index_repr(index, max_width)
411 def inline_index_repr(index, max_width=None):
412 if hasattr(index, "_repr_inline_"):
--> 413 repr_ = index._repr_inline_(max_width=max_width)
414 else:
415 # fallback for the `pandas.Index` subclasses from
416 # `Indexes.get_pandas_indexes` / `xr_obj.indexes`
417 repr_ = repr(index)
File ~/Git/xvec/xvec/index.py:261, in GeometryIndex._repr_inline_(self, max_width)
260 def _repr_inline_(self, max_width: int):
--> 261 srs = _format_crs(self.crs, max_width=max_width)
262 return f"{self.__class__.__name__} (crs={srs})"
File ~/Git/xvec/xvec/index.py:21, in _format_crs(crs, max_width)
18 else:
19 srs = "None"
---> 21 return srs if len(srs) <= max_width else " ".join([srs[:max_width], "..."])
TypeError: '<=' not supported between instances of 'int' and 'NoneType'
This bit seems to override our default with None
File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/formatting.py:413, in inline_index_repr(index, max_width)
411 def inline_index_repr(index, max_width=None):
412 if hasattr(index, "_repr_inline_"):
--> 413 repr_ = index._repr_inline_(max_width=max_width)
We need to catch that and override in that case. Either here like in the code suggestion above or in _format_crs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, we can fix it here but I think this should rather be fixed in xarray.core.formatting.inline_index_repr
? cc @keewis.
Prevent long diffs.
Let's address later the extraction of CRS info from various input types in a more general way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to go! Thanks a lot!
Just adding a few more tests before merging |
Good to go now :). Thanks again for the helpful comments and suggestions! |
This new index replaces
ShapelySTRTreeIndex
and provides the same functionality in addition to all the basic features ofxarray.indexes.PandasIndex
. The STRTree is not fully initialized until we use it.After trying both, I found that encapsulating
PandasIndex
was a bit easier and cleaner than inheriting from it, i.e., we can define an alternative__init__
signature (crs) that works well with the.from_variables()
class method (note:Index.__init__()
is rarely used directly).