Avoid auto creation of indexes in concat #8872

Merged on May 8, 2024 (81 commits)
Changes from 3 commits
Commits
81 commits
22995e9
test not creating indexes on concatenation
TomNicholas Mar 25, 2024
7142c9f
construct result dataset using Coordinates object with indexes passed…
TomNicholas Mar 25, 2024
7fb075a
remove unnecessary overwriting of indexes
TomNicholas Mar 25, 2024
285c1de
ConcatenatableArray class
TomNicholas Mar 25, 2024
cc24757
use ConcatenableArray in tests
TomNicholas Mar 28, 2024
90a2592
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas Mar 28, 2024
beb665a
add regression tests
TomNicholas Mar 28, 2024
22f361d
fix by performing check
TomNicholas Mar 28, 2024
55166fc
refactor assert_valid_explicit_coords and rename dims->sizes
TomNicholas Mar 28, 2024
322b76e
Merge branch 'forbid_invalid_coordinates' into concat-avoid-index-aut…
TomNicholas Mar 28, 2024
da6692b
Revert "add regression tests"
TomNicholas Mar 28, 2024
35dfb67
Revert "fix by performing check"
TomNicholas Mar 28, 2024
fd3de2b
Revert "refactor assert_valid_explicit_coords and rename dims->sizes"
TomNicholas Mar 28, 2024
0a60172
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas Mar 28, 2024
21afbb1
fix failing test
TomNicholas Mar 28, 2024
6e9ead6
possible fix for failing groupby test
TomNicholas Mar 28, 2024
2534712
Revert "possible fix for failing groupby test"
TomNicholas Mar 29, 2024
3e848eb
test expand_dims doesn't create Index
TomNicholas Apr 19, 2024
95d453c
add option to not create 1D index in expand_dims
TomNicholas Apr 19, 2024
ba5627e
refactor tests to consider data variables and coordinate variables se…
TomNicholas Apr 20, 2024
3719ba7
test expand_dims doesn't create Index
TomNicholas Apr 19, 2024
018e74b
add option to not create 1D index in expand_dims
TomNicholas Apr 19, 2024
f680505
refactor tests to consider data variables and coordinate variables se…
TomNicholas Apr 20, 2024
f10509a
fix bug causing new test to fail
TomNicholas Apr 20, 2024
8152c0a
test index auto-creation when iterable passed as new coordinate values
TomNicholas Apr 20, 2024
aa813cf
make test for iterable pass
TomNicholas Apr 20, 2024
e78de7d
added kwarg to dataarray
TomNicholas Apr 20, 2024
b1329cc
whatsnew
TomNicholas Apr 20, 2024
a9f7e0c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 20, 2024
2ce3dec
Revert "refactor tests to consider data variables and coordinate vari…
TomNicholas Apr 20, 2024
87a08b4
Revert "add option to not create 1D index in expand_dims"
TomNicholas Apr 20, 2024
e0c6db1
Merge branch 'expand_dims_create_1d_index' into concat-avoid-index-au…
TomNicholas Apr 20, 2024
214ed7d
test that concat doesn't raise if create_1d_index=False
TomNicholas Apr 20, 2024
78d2798
make test pass by passing create_1d_index down through concat
TomNicholas Apr 20, 2024
fc206b0
assert that an UnexpectedDataAccess error is raised when create_1d_in…
TomNicholas Apr 20, 2024
ce797f1
eliminate possibility of xarray internals bypassing UnexpectedDataAcc…
TomNicholas Apr 20, 2024
62e750f
update tests to use private versions of assertions
TomNicholas Apr 26, 2024
f86c82f
create_1d_index->create_index
TomNicholas Apr 26, 2024
4dd8d3c
Merge branch 'main' into expand_dims_create_1d_index
TomNicholas Apr 26, 2024
d5d90fd
Update doc/whats-new.rst
TomNicholas Apr 26, 2024
e00dbab
Merge branch 'expand_dims_create_1d_index' into concat-avoid-index-au…
TomNicholas Apr 26, 2024
5bb88b8
Rename create_1d_index -> create_index
TomNicholas Apr 26, 2024
1d471b1
fix ConcatenatableArray
TomNicholas Apr 26, 2024
766605d
formatting
TomNicholas Apr 26, 2024
971287f
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas Apr 26, 2024
10c0ed5
whatsnew
TomNicholas Apr 26, 2024
51eea5d
add new create_index kwarg to overloads
TomNicholas Apr 26, 2024
bde9f2b
split vars into data_vars and coord_vars in one loop
TomNicholas Apr 26, 2024
d5241ce
avoid mypy error by using new variable name
TomNicholas Apr 26, 2024
7e8f895
warn if create_index=True but no index created because dimension vari…
TomNicholas Apr 27, 2024
ed85446
add string marks in warning message
TomNicholas Apr 27, 2024
39571ba
Merge branch 'main' into expand_dims_create_1d_index
TomNicholas Apr 27, 2024
206985b
Merge branch 'expand_dims_create_1d_index' into concat-avoid-index-au…
TomNicholas Apr 27, 2024
86998e4
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas Apr 27, 2024
5894724
regression test for dtype changing in to_stacked_array
TomNicholas Apr 29, 2024
dad9433
correct doctest
TomNicholas Apr 29, 2024
b235c09
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas Apr 29, 2024
36a2223
Remove outdated comment
TomNicholas Apr 29, 2024
e17c13f
test we can skip creation of indexes during shape promotion
TomNicholas Apr 29, 2024
e8fa857
make shape promotion test pass
TomNicholas Apr 29, 2024
648d5bc
Merge branch 'concat-avoid-index-auto-creation' of https://github.com…
TomNicholas Apr 29, 2024
deb292c
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas Apr 29, 2024
6dd57a9
point to issue in whatsnew
TomNicholas Apr 29, 2024
b0e3612
don't create dimension coordinates just to drop them at the end
TomNicholas May 1, 2024
b2f06a0
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas May 1, 2024
ff70fc7
Remove ToDo about not using Coordinates object to pass indexes
TomNicholas May 1, 2024
2f97a5c
get rid of unlabeled_dims variable entirely
TomNicholas May 1, 2024
6d825e5
move ConcatenatableArray and similar to new file
TomNicholas May 8, 2024
b88b5a6
formatting nit
TomNicholas May 8, 2024
30c7408
Merge branch 'concat-avoid-index-auto-creation' of https://github.com…
TomNicholas May 8, 2024
b243150
renamed create_index -> create_index_for_new_dim in concat
TomNicholas May 8, 2024
9e9e168
renamed create_index -> create_index_for_new_dim in expand_dims
TomNicholas May 8, 2024
dca2fb9
fix incorrect arg name
TomNicholas May 8, 2024
c979672
add example to docstring
TomNicholas May 8, 2024
ac27ce0
add example of using new kwarg to docstring of expand_dims
TomNicholas May 8, 2024
d73ac48
add example of using new kwarg to docstring of concat
TomNicholas May 8, 2024
9ebbb33
Merge branch 'main' into concat-avoid-index-auto-creation
TomNicholas May 8, 2024
d1b656d
re-nit the nit
TomNicholas May 8, 2024
ac998e9
more instances of the nit
keewis May 8, 2024
0849b94
fix docstring doctest formatting nit
TomNicholas May 8, 2024
25764ca
Merge branch 'concat-avoid-index-auto-creation' of https://github.com…
TomNicholas May 8, 2024
25 changes: 17 additions & 8 deletions xarray/core/concat.py
@@ -8,6 +8,7 @@

 from xarray.core import dtypes, utils
 from xarray.core.alignment import align, reindex_variables
+from xarray.core.coordinates import Coordinates
 from xarray.core.duck_array_ops import lazy_array_equiv
 from xarray.core.indexes import Index, PandasIndex
 from xarray.core.merge import (
@@ -646,14 +647,26 @@ def get_indexes(name):
         # preserves original variable order
         result_vars[name] = result_vars.pop(name)

-    result = type(datasets[0])(result_vars, attrs=result_attrs)
-
-    absent_coord_names = coord_names - set(result.variables)
+    absent_coord_names = coord_names - set(result_vars)
     if absent_coord_names:
         raise ValueError(
             f"Variables {absent_coord_names!r} are coordinates in some datasets but not others."
         )
-    result = result.set_coords(coord_names)
+    coord_vars = {
+        name: result_var
+        for name, result_var in result_vars.items()
+        if name in coord_names
+    }
+    coords = Coordinates(coord_vars, indexes=result_indexes)
+
+    # TODO: this is just the complement of the set of coord_vars
+    result_data_vars = {
+        name: result_var
+        for name, result_var in result_vars.items()
+        if name not in coord_names
+    }
+
+    result = type(datasets[0])(result_data_vars, coords=coords, attrs=result_attrs)
     result.encoding = result_encoding

     result = result.drop_vars(unlabeled_dims, errors="ignore")
@@ -665,10 +678,6 @@ def get_indexes(name):
         else:
             index_vars = index.create_variables()
         result[dim] = index_vars[dim]
-        result_indexes[dim] = index
-
-    # TODO: add indexes at Dataset creation (when it is supported)
-    result = result._overwrite_indexes(result_indexes)
@benbovy (Member), Mar 26, 2024

Removing those lines doesn't break any existing test because the line above, result[dim] = index_vars[dim], actually re-creates a default PandasIndex when assigning the new dim variable. However, this unnecessarily re-creates a new index (or re-wraps an existing one), and it may not work in the future if we allow passing a custom xarray index as the dim argument to concat.

It would be better to explicitly add both index and index_vars to result. The best way would be to assign them to result_indexes and coord_vars respectively before constructing the Coordinates object and then the result object, unless there are cases where result.drop_vars(unlabeled_dims) would delete the index coordinate.
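
To illustrate the difference this comment describes, here is a minimal sketch (not code from the PR) contrasting the three ways a dimension coordinate can end up on a Dataset. The names ds_default, ds_explicit and ds_assigned are made up for illustration, and the behaviour of the assignment step is assumed to match what the comment above reports for result[dim] = index_vars[dim].

```python
import numpy as np
import xarray as xr

# Default construction: a 1D coordinate named after its own dimension
# gets a default PandasIndex.
ds_default = xr.Dataset(coords={"x": [1, 2, 3]})
default_has_index = "x" in ds_default.xindexes

# Explicit construction: passing indexes={} keeps the "x" coordinate
# variable but attaches no index to it.
coords = xr.Coordinates({"x": np.array([1, 2, 3])}, indexes={})
ds_explicit = xr.Dataset({"a": (("x", "y"), np.zeros((3, 3)))}, coords=coords)
explicit_has_index = "x" in ds_explicit.xindexes

# Item assignment goes back through the default path, so an index for "x"
# is expected to be (re-)created, as the review comment describes.
ds_assigned = ds_explicit.copy()
ds_assigned["x"] = ("x", np.array([1, 2, 3]))
assigned_has_index = "x" in ds_assigned.xindexes

print(default_has_index, explicit_has_index, assigned_has_index)
```

This is why building the Coordinates object with the intended indexes up front, rather than assigning the dimension variable afterwards, seems the safer route when an index must not be created implicitly.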

Member Author

Thanks @benbovy. I think your comment might explain the behaviour I just noticed in zarr-developers/VirtualiZarr#18 (comment)

Member Author

I think I've addressed your comment now @benbovy
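
As a side note on the TODO in the concat.py hunk above (result_data_vars is just the complement of coord_vars), both mappings could be built in a single pass, which is what the later commit "split vars into data_vars and coord_vars in one loop" suggests. The following is only a sketch of that idea with plain dictionaries; split_coords_and_data is a hypothetical helper name, not a function in xarray.

```python
def split_coords_and_data(result_vars: dict, coord_names: set) -> tuple[dict, dict]:
    """Partition variables into coordinate and data variables in one loop.

    Hypothetical helper for illustration; insertion order is preserved
    within each output dictionary.
    """
    coord_vars: dict = {}
    data_vars: dict = {}
    for name, var in result_vars.items():
        target = coord_vars if name in coord_names else data_vars
        target[name] = var
    return coord_vars, data_vars


# Strings stand in for Variable objects here.
result_vars = {"a": "var_a", "x": "var_x", "b": "var_b"}
coord_vars, data_vars = split_coords_and_data(result_vars, {"x"})
print(coord_vars)  # {'x': 'var_x'}
print(data_vars)   # {'a': 'var_a', 'b': 'var_b'}
```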


     return result

53 changes: 53 additions & 0 deletions xarray/tests/test_concat.py
@@ -978,6 +978,32 @@ def test_concat_str_dtype(self, dtype, dim) -> None:

         assert np.issubdtype(actual.x2.dtype, dtype)

+    def test_concat_avoids_index_auto_creation(self) -> None:
Member Author

Reminder to myself: this test didn't catch the problem described in zarr-developers/VirtualiZarr#18 (comment) because it checks that datasets that start without indexes stay without indexes after concatenation, whereas that problem comes from concatenating datasets with indexes and having the coordinate variables silently replaced by IndexVariable objects created from the index data.
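
The distinction between the two cases can be sketched as follows. This is an illustrative snippet rather than part of the PR's test suite; the variable names are made up, and it only inspects the type of the coordinate variable at construction time, not after concat.

```python
import numpy as np
import xarray as xr

# Index-backed coordinate: default construction stores "x" as an
# IndexVariable, whose data is backed by a pandas index.
indexed = xr.Dataset(coords={"x": [1, 2, 3]})
indexed_is_index_variable = isinstance(indexed.variables["x"], xr.IndexVariable)

# Index-less coordinate: built through Coordinates(..., indexes={}),
# as in the test in this hunk.
coords = xr.Coordinates({"x": np.array([1, 2, 3])}, indexes={})
unindexed = xr.Dataset(coords=coords)
unindexed_is_index_variable = isinstance(unindexed.variables["x"], xr.IndexVariable)

print(indexed_is_index_variable, unindexed_is_index_variable)
print(len(unindexed.xindexes))
```

A regression test for the VirtualiZarr case would presumably need to start from the first kind of object and check the variable type after concatenation, which the test here does not do.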

+        # TODO once passing indexes={} directly to DataArray constructor is allowed then no need to create coords first
+        coords = Coordinates({"x": np.array([1, 2, 3])}, indexes={})
+        datasets = [
+            Dataset(
+                {"a": (["x", "y"], np.zeros((3, 3)))},
+                coords=coords,
+            )
+            for _ in range(2)
+        ]
+        # should not raise on concat
+        combined = concat(datasets, dim="x")
+        assert combined["a"].shape == (6, 3)
+        assert combined["a"].dims == ("x", "y")
+
+        # nor have auto-created any indexes
+        assert combined.indexes == {}
+
+        # should not raise on stack
+        combined = concat(datasets, dim="z")
+        assert combined["a"].shape == (2, 3, 3)
+        assert combined["a"].dims == ("z", "x", "y")
+
+        # nor have auto-created any indexes
+        assert combined.indexes == {}
+

 class TestConcatDataArray:
     def test_concat(self) -> None:
@@ -1051,6 +1077,33 @@ def test_concat_lazy(self) -> None:
         assert combined.shape == (2, 3, 3)
         assert combined.dims == ("z", "x", "y")

+    def test_concat_avoids_index_auto_creation(self) -> None:
+        # TODO once passing indexes={} directly to DataArray constructor is allowed then no need to create coords first
+        coords = Coordinates({"x": np.array([1, 2, 3])}, indexes={})
+        arrays = [
+            DataArray(
+                np.zeros((3, 3)),
+                dims=["x", "y"],
+                coords=coords,
+            )
+            for _ in range(2)
+        ]
+        # should not raise on concat
+        combined = concat(arrays, dim="x")
+        assert combined.shape == (6, 3)
+        assert combined.dims == ("x", "y")
+
+        # nor have auto-created any indexes
+        assert combined.indexes == {}
+
+        # should not raise on stack
+        combined = concat(arrays, dim="z")
+        assert combined.shape == (2, 3, 3)
+        assert combined.dims == ("z", "x", "y")
+
+        # nor have auto-created any indexes
+        assert combined.indexes == {}
+
     @pytest.mark.parametrize("fill_value", [dtypes.NA, 2, 2.0])
     def test_concat_fill_value(self, fill_value) -> None:
         foo = DataArray([1, 2], coords=[("x", [1, 2])])