Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn on repeated dimension names during construction #8491

Merged
merged 18 commits into from
Dec 1, 2023

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Nov 29, 2023

@github-actions github-actions bot added the topic-NamedArray Lightweight version of Variable label Nov 29, 2023
@TomNicholas
Copy link
Member Author

Interesting - some of the h5netcdf tests apparently attempt to open hdf5 files that have repeated dimension names

FAILED xarray/tests/test_backends.py::TestH5NetCDFDataRos3Driver::test_get_variable_list - ValueError: Duplicate dimension names are forbidden, but dimensions {'PRESSURE'} appear more than once in dims=('DATETIME', 'PRESSURE', 'PRESSURE')
FAILED xarray/tests/test_backends.py::TestH5NetCDFDataRos3Driver::test_get_variable_list_empty_driver_kwds - ValueError: Duplicate dimension names are forbidden, but dimensions {'PRESSURE'} appear more than once in dims=('DATETIME', 'PRESSURE', 'PRESSURE')

@andersy005
Copy link
Member

thank you for putting this together, @TomNicholas.

Interesting - some of the h5netcdf tests apparently attempt to open hdf5 files that have repeated dimension names

if my understanding is accurate, neither h5netcdf nor netCDF appears to enforce restrictions against duplicate dimensions. are we to modify the HDF5 files accordingly or xfail these tests?

@TomNicholas
Copy link
Member Author

TomNicholas commented Nov 29, 2023

I guess depends what the point of those tests was. If it's just to check that we can correctly read the structure of those particular files then we could change to just check that the exact error we now expect is raised. If we wanted to check more than that then we should change the files. Obviously xfail would be the quickest way to move forward however.

@dcherian
Copy link
Contributor

dcherian commented Nov 30, 2023

I'm not sure we should raise here since netcdf4 at least allows repeated dimension names (i bet zarr does too)

import netCDF4

ds = netCDF4.Dataset("test.nc", mode="w")
ds.createDimension("time", 24)
ds.createVariable("test", int, dimensions=("time", "time"))
ds["test"][:] = 0
ds.close()

!ncdump -h test.nc
netcdf test {
dimensions:
	time = 24 ;
variables:
	int64 test(time, time) ;
}

One of the takeaways in #2368 and #1378 (comment) was that we want to be able to open all netCDF files even if the resulting Xarray object isn't as useful as it could be.

@andersy005
Copy link
Member

I'm not sure we should raise here since netcdf4 at least allows repeated dimension names (i bet zarr does too)

this may be a distinct issue, but should we consider updating the repr to indicate the presence of duplicated dimensions? as it stands, the current repr only displays unique dimensions.

In [92]: data = np.zeros((2, 3, 3))

In [93]: dims = ['a', 'b', 'b']

In [94]: var = xr.Variable(data=data, dims=dims)

In [95]: var.shape
Out[95]: (2, 3, 3)

In [96]: var.dims
Out[96]: ('a', 'b', 'b')

In [97]: var
Out[97]: 
<xarray.Variable (a: 2, b: 3)>
array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])

@dcherian
Copy link
Contributor

Updating repr sounds good to me.

@TomNicholas
Copy link
Member Author

TomNicholas commented Nov 30, 2023

I'm not sure we should raise here since netcdf4 at least allows repeated dimension names

(Apparently Zarr v3 does allow them, even whilst it recommends against them)

We have three choices:

  1. Forbid duplicate dimension names at construction time. I prefer this because it's clear, and it completely prevents getting into a world of undefined state, unclear functionality, and where currently internal implementation details can bleed out to affect what works and what doesn't.

  2. Allow duplicate dimension names at construction time, but try to forbid them whenever something we know isn't going work work is attempted. This is sort of what we half-heartedly currently do.

  3. Support duplicate dimensions, either for a subset or for all functionality. This is really hard, perhaps impossible for the full case, and isn't going to happen in this PR. It would require someone interested and committed to work on it.

I was thinking we should do (1) now then either (2) or (3) later, but @dcherian's comment
makes it sound like people would be annoyed if we did (1), and perhaps we should work towards a limited version of (2) immediately instead. In that case we should at least:

  • Raise before attempting to broadcast (this already happens)
  • Raise inside get_axis_num (added in this PR)
  • Test reductions over all axes still work (or just raise on all reductions)
  • Fix the repr (not simple, see Warn on repeated dimension names during construction #8491 (comment))
  • Raise in concat
  • Raise in stack
  • Add tests for all the above
  • Put a warning in the docs somewhere documenting this approach (warns in constructor now too)
  • Anything else?

@dcherian
Copy link
Contributor

(let's move your comment/discussion to an issue)

@TomNicholas
Copy link
Member Author

A new issue? Or one of the existing ones?

@max-sixty
Copy link
Collaborator

I would vote to add a warning to the constructor and mostly be done with it...

That list of items are all good, and the library would be better off with them. But I would guess that even raising in the right places has a long tail of work, and setting clear expectations of no support is realistic (unless someone wants to champion this, in which case great)

If we just say "we don't support this yet, but we will grudgingly load the file. Please rename the dims immediately", then that's quite clear...

Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect!

(needs whatsnew changing)

@TomNicholas
Copy link
Member Author

Okay, so I've changed the error in the constructor to merely a warning, and I've added an actual error that gets raised by any call to _get_axis_num.

this may be a distinct issue, but should we consider updating the repr to indicate the presence of duplicated dimensions? as it stands, the current repr only displays unique dimensions.

I looked at this and it's because the repr calls .sizes, which uses .dims as keys for a dictionary. To make the repr display duplicate dimensions we would either have to change the return type of (.sizes) or redirect the repr to use a new ._non_unique_sizes, both of which don't seem great.

Copy link
Member

@andersy005 andersy005 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks solid to me 👍🏽

@max-sixty
Copy link
Collaborator

FWIW one thing I'd be keen for to do generally — though maybe this isn't the place to start it — is handle warnings in the test suite when we add a new warning — i.e. filter them out where we expect them.

In this case, that would be the loading the netCDF files that have duplicate dims.

Otherwise warnings become a huge block of text without much salience. I mostly see the 350 lines of them and think "meh mostly units & cftime", but then something breaks on a new upstream release that was buried in there, or we have a supported code path that is raising warnings internally.

(I'm not sure whether it's possible to generally enforce that — maybe we could raise on any warnings coming from within xarray? Would be a non-trivial project to get us there though...)

@TomNicholas TomNicholas changed the title Forbid repeated dimension names Warn on repeated dimension names during construction Nov 30, 2023
xarray/namedarray/core.py Outdated Show resolved Hide resolved
@TomNicholas
Copy link
Member Author

FWIW one thing I'd be keen for to do generally — though maybe this isn't the place to start it — is handle warnings in the test suite when we add a new warning — i.e. filter them out where we expect them.

I agree @max-sixty , so I raised #8494.

In this case, that would be the loading the netCDF files that have duplicate dims.

I've added @pytest.mark.filterwarnings("ignore:Duplicate dimension names") to those tests.

@TomNicholas TomNicholas added the plan to merge Final call for comments label Nov 30, 2023
@andersy005 andersy005 enabled auto-merge (squash) November 30, 2023 23:36
@andersy005 andersy005 merged commit d46c5b6 into pydata:main Dec 1, 2023
30 checks passed
@TomNicholas TomNicholas deleted the forbid_repeated_dimension_names branch December 1, 2023 01:37
TomNicholas added a commit to TomNicholas/xarray that referenced this pull request Dec 1, 2023
dcherian added a commit that referenced this pull request Dec 6, 2023
* raise FutureWarning

* change some internal instances of ds.dims -> ds.sizes

* improve clarity of which unexpected errors were raised

* whatsnew

* return a class which warns if treated like a Mapping

* fix failing tests

* avoid some warnings in the docs

* silence warning caused by #8491

* fix another warning

* typing of .get

* fix various uses of ds.dims in tests

* fix some warnings

* add test that FutureWarnings are correctly raised

* more fixes to avoid warnings

* update tests to avoid warnings

* yet more fixes to avoid warnings

* also warn in groupby.dims

* change groupby tests to match

* update whatsnew to include groupby deprecation

* filter warning when we actually test ds.dims

* remove error I used for debugging

---------

Co-authored-by: Deepak Cherian <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments topic-NamedArray Lightweight version of Variable
Projects
None yet
Development

Successfully merging this pull request may close these issues.

duplicate coord names in DataArray constructor
4 participants