Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADD: objects can override is_dict_like() #8613

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Sevans711
Copy link

@Sevans711 Sevans711 commented Jan 16, 2024

By setting obj._xarray_dict_like_, any object obj can now tell xarray to treat it like it is (or is not) a dict. Note: objects without the _xarray_dict_like_ attribute will be unaffected; this change is fully backwards-compatible.

There are no relevant issue reports, tests added, user visible changes, or new functions / methods.

This is my first contribution to xarray, I tried my best to follow the contribution guidelines; please let me know if I am missing anything essential!

Motivation

I have been using xarray in my personal project pretty successfully (xarray is a great package, and I am very grateful for it)! However, I sometimes use dict-like objects as coordinates. xarray seems to be okay with this most of the time, but fails when it comes to things like the DataArray.sel method. I would really like to be able to use sel, since my dict-like objects are only dict-like for reasons completely unrelated to xarray.

For example, I built a PlasmaFluid class which stores information about different fluids. And, I use instances of that class as coordinates for my DataArray. E.g.:

import xarray as xr
import mypackage
f0 = mypackage.PlasmaFluid(0, 'e-', q=-1, density=1e10)     # fluid composed of electrons
f1 = mypackage.PlasmaFluid(1, 'H neutral', q=0, density=1e14)     # fluid composed of neutral hydrogen
f2 = mypackage.PlasmaFluid(2, 'p', q=1, density=1e10)     # fluid composed of protons
# fluids can have many parameters, including charge, mass, density, and more.
#   so, each PlasmaFluid object has a .keys() method which tells the parameters,
#   and a __getitem__ method to get the values. E.g.:
f0['q']     # -1
f0['density']     # 1e10
# xarray seems happy enough to allow PlasmaFluid objects to be used as coordinates, e.g.:
u_x = xr.DataArray([2, 7, 3], coords=dict(fluid=[f0, f1, f2]))
density = xr.DataArray([f0['density'], f1['density'], f2['density']], coords=dict(fluid=[f0, f1, f2]))
momentum_density_x = density * u_x   # == xr.DataArray([2e10, 7e14, 3e10], coords=dict(fluid=[f0, f1, f2]))

This code works just fine, and produces a nicely-labeled DataArray, where the fluid dimension contains the actual PlasmaFluid objects, e.g. u_x.isel(fluid=0) is f0.

However, it fails when I try to use the sel method, e.g.:

u_x.sel(fluid=f1)
# --> ValueError: cannot use a dict-like object for selection on a dimension that does not have a MultiIndex

After implementing the change in this PR, I am able to avoid this issue, and can use the sel method by updating the class I wrote to include PlasmaFluid._xarray_dict_like_ = False.

Minimal Reproducible Example

As per @max-sixty's suggestion (thank you!) here is an example minimal reproducible example you can copy paste to see the issue I'm referring to:

class PlasmaFluid():
    '''minimal working example of a dict-like class I want to use as a coordinate'''
    def __init__(self, i, name, **params):
        self.i = i
        self.name = name
        self.params = params
    
    def keys(self):
        return self.params.keys()
    
    def __getitem__(self, key):
        return self.params[key]

f0 = PlasmaFluid(0, 'e-', q=-1, m=9.10938e-31)   # fluid composed of electrons
f1 = PlasmaFluid(1, 'H', q=0, m=1.67262e-27)     # fluid composed of neutral hydrogen
f2 = PlasmaFluid(2, 'p', q=1, m=1.67356e-27)     # fluid composed of protons
arr = xr.DataArray([2, 7, 3], coords=dict(fluid=[f0, f1, f2]))

arr.sel(fluid=f2)  # <-- causes crash because is_dict_like(f2)

By setting obj._xarray_dict_like_, any object obj can now tell xarray to treat it like it is (or is not) a dict. This is useful if using custom objects for coordinates, when those objects store information in a dict-like format, but that information is not related to xarray and there's no reason for xarray to treat the objects as dicts.
Copy link

welcome bot commented Jan 16, 2024

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@max-sixty
Copy link
Collaborator

Thanks for the PR.

I don't have a strong view but one quick question — could you instead implement the methods that get checked on the custom object?

@Sevans711
Copy link
Author

Thanks for the PR.

I don't have a strong view but one quick question — could you instead implement the methods that get checked on the custom object?

Thanks for your response. I had considered that, but unfortunately the is_dict_like function before my changes only checks for the existence of attributes. The code previously looked like:

def is_dict_like(value: Any) -> TypeGuard[Mapping]:
    return hasattr(value, "keys") and hasattr(value, "__getitem__")

Ideally I would like a solution that allows me to keep the keys and __getitem__ method in my PlasmaFluid class (and other similar classes I've written), since those classes store a bunch of different parameters' values, and IMO the most natural way to do that is with a dictionary-like object.

@max-sixty
Copy link
Collaborator

Ah I see, thanks.


I wonder if we could alternatively check that doesn't have Hashable implemented. So if you implemented Hashable, then xarray would not see it as dict-like.

The downside is that something like https://pypi.org/project/frozendict/ — which is Hashable — won't work to lookup on MultiIndexes.


Ideally I think we should avoid these sorts of custom attributes — they make xarray's implementation more complicated, increase the cardinality of its interface. And often this sort of issue is emblematic of a deeper issue — should we be doing something quite different depending on the type of the argument / is there a more lower-level API even if .sel tries to be helpful by guessing?

On the lower-level API point — does the equivalent of da.loc[dict(space='IA')] work?


FYI @Sevans711 one thing that would be useful here is a reproducible example, with a minimal class that has those methods.

@headtr1ck
Copy link
Collaborator

I don't see a reason why we could not simply remove this check (I didn't look at the code, possibly requires some refactoring).

This would solve the issue at hand and for other use cases raise a more low level error (I assume).

Just a note: @Sevans711 are you sure you want to build your code like this? You are giving away lots of the power of xarray by using this approach. It would be much more intuitive (and faster!) if you would use the Plasma fluid properties directly as coordinates.

@keewis
Copy link
Collaborator

keewis commented Jan 17, 2024

another option would be to wrap the value you're looking for in a 0D array with object dtype (so instead of u_x.sel(fluid=f1) try u_x.sel(fluid=np.array(f1, dtype=object)).

In general I agree, though, this is a pretty un-idiomatic use of xarray.

@Sevans711
Copy link
Author

Thank you all for your comments. Replying to the various points you made:

@max-sixty

  • I wouldn't want to affect frozendict's interaction with xarray, that seems not ideal, mostly because I don't understand exactly how MultiIndexes are used here and whether frozendict should work with them or not. If the is_dict_like check added any check other than hasattr for a few attrs, I would be happy; it definitely doesn't need to be the check I proposed with this PR. I am mostly looking for some way to tell xarray it's okay to use PlasmaFluid (and similar classes) in sel (and similar methods).
  • I agree maybe my solution is not the best one, I appreciate your suggestions and would be happy for a different solution to be implemented. Maybe a bigger change is a good idea, but I didn't want to poke around too much without understanding what it might break. Regarding da.loc, it produces the same error message for me; e.g. arr.loc(dict(fluid=f2)) crashes just like arr.sel(dict(fluid=f2)).
  • edited the original message to include a reproducible example, thank you for this recommendation!

@headtr1ck

  • I also don't understand why we need the check, though I did get the impression it is there to help prevent mistakes for people who are intending to use MultiIndexes, but input them incorrectly. I wouldn't want to make changes which lead to a hard-to-understand error or unexpected behavior, for those people.
  • I appreciate your concern for the way I built the code, but I am pretty sure it is how I want it to be. There have been numerous benefits, including some unexpected benefits, which signals "good object oriented programming" to me. I don't want to discuss it too much here but some examples include:
    • My code accepts multiple different kinds of data input, where each kind has its own quirks. E.g., PlasmaFluidKind1 only knows about charge and mass, while PlasmaFluidKind2 knows about charge, mass, and 12+ other parameters. By using the PlasmaFluid objects as coordinates, my arrays look the same regardless of the kind of input.
    • fluid is not the only array dimension; there are other important coordinates too, and for better readability I would prefer the arrays to not be flooded with lots of fluid-related coordinates.
    • I implemented PlasmaFluid.__str__, so the coordinates appear as easy-to-read strings such as ('e-', 'H', 'p'). xarray includes the array type (object) to help hint that they aren't actually strings.
    • When I expand the coordinates via the clickable stack-thing in Jupyter, I see the __repr__ showing PlasmaFluidKind1(...) (or PlasmaFluidKind2), so I can easily tell what kind of data this array came from.
  • I didn't realize it would be faster to use fluid properties directly as coordinates. I suspect it isn't important in my use-case (my arrays usually have a few fluids but >500x500 points along x&y coordinates which are floats), but I could check by profiling the code. Would you be willing to explain when / why it would be faster to avoid object coordinates?

@keewis

  • Thank you for letting me know this trick. Indeed, wrapping the value in a numpy array does work, and numpy is even smart enough to figure out it should use dtype=object without me explicitly specifying it. E.g., u_x.sel(fluid=np.array(f1)) works too! This is a good workaround in the meantime, while this PR is still being discussed.

@keewis
Copy link
Collaborator

keewis commented Jan 17, 2024

Just to throw out some ideas, the data you're describing seems like a particularly well fit for awkward, a library for ragged arrays (basically, "arrays of structs") which was built with high-energy physics in mind. Over the past month, @jpivarski, one of the maintainers of awkward, has also been working on ragged, a library that describes the subset of awkward that xarray can wrap (see #7988).

@Sevans711
Copy link
Author

Just to throw out some ideas, the data you're describing seems like a particularly well fit for awkward, a library for ragged arrays (basically, "arrays of structs") which was built with high-energy physics in mind. Over the past month, @jpivarski, one of the maintainers of awkward, has also been working on ragged, a library that describes the subset of awkward that xarray can wrap (see #7988).

Thanks for pointing this out, I didn't know about this! I will continue to consider it, though I don't know if it is the best fit for me. To clarify my use-case: all my data is homogeneous, of the same type and with no holes or variable-length lists. I believe the only thing I'm doing that doesn't quite follow the usual patterns is using objects for coordinates, instead of a simple type like int/float/str. There's no reason my data couldn't work with numpy, but I'm using xarray to make it easier to track all the coordinates and dimensions :)

@jpivarski
Copy link

If all of the objects are fixed-size data structures (i.e. all instances in the array have the same set of named fields), then Awkward Array would be overkill. Looking at PlasmaFluid with parameters q and m (assuming that they're always q and m, for a given array), then what you really have is a two-tier structure to your coordinate names, right?

Such a thing could be implemented as records within records, but it could also be represented as nested columns, like using Pandas's MultiIndex on column names, like

>>> pd.DataFrame(
...     data=[[1, 1.1, "one"], [2, 2.2, "two"], [3, 3.3, "three"]],
...     columns=pd.MultiIndex.from_tuples([("a1", "x"), ("a1", "y"), ("a2", "z")])
... )
  a1          a2
   x    y      z
0  1  1.1    one
1  2  2.2    two
2  3  3.3  three

In xarray, would xarray-datatree work?

On the other hand, maybe I've misunderstood and only the i field of PlasmaFluid is relevant for indexing. In that case, maybe just make the class interpretable as an integer index?

class PlasmaFluid:
    ...
    def __index__(self):    # or maybe __int__
        return self.i

@Sevans711
Copy link
Author

Sevans711 commented Jan 17, 2024

On the other hand, maybe I've misunderstood and only the i field of PlasmaFluid is relevant for indexing. In that case, maybe just make the class interpretable as an integer index?

This is correct, only the i field is relevant for indexing. I see why it was confusing, it is due to somewhat poorly-chosen examples on my part. In general, I can have lots of other fluids too with the same mass and/or charge (e.g. PlasmaFluid(3, 'He+', q=1), PlasmaFluid(4, 'C+', q=1)). But, every fluid should be assigned a unique integer, so I can unambiguously refer to it.

It is a good idea to implement __int__ or __index__, and in fact I have already done that (in the same way you suggested), and some parts of my code utilize that to convert PlasmaFluid objects to integers. However, even when using a PlasmaFluid implementation which includes __index__ and/or __int__, I get the same error when I try to use sel.

What are your thoughts on having the is_dict_like() check return False for objects with an __index__ method? This is similar to @max-sixty's idea, but maybe less likely to cause conflicts. I am guessing that __index__-able dict-like objects are far less common than Hashable dict-like objects. I would also think that anyone using a dict-like object with __index__ shouldn't complain if that object is not treated like a dictionary when it comes to DataArray.sel, which relates to indexing.


I want to briefly mention another benefit of including the PlasmaFluid objects as coordinates, maybe it will be more convincing than my other examples:

  • Consider a function plot_with_fluid_info(array) which will make one plot per fluid in the array, and in each plot add a textbox containing PlasmaFluid.get_plot_info(). PlasmaFluid subclasses can control what goes onto plots, by overriding this method. In this way, my plotting code can be exactly the same for each kind of input, but the plots can contain relevant info specific to the kind of input I'm plotting.
  • Consider any number of other analysis routines which also take one array as input, yet want access to the corresponding PlasmaFluid objects and methods therein. By using PlasmaFluid as coordinates, I don't need to pass more info into these functions; they can read the info directly from the array.
  • This behavior would be difficult to replicate if I insist on using the properties of PlasmaFluids as coordinates instead of the PlasmaFluids themselves.

@Sevans711
Copy link
Author

Bumping, since it has been a little while. What are the steps required to get this PR accepted?

I would be happy to implement a different check for is_dict_like, such as hasattr(obj, 'keys') and hasattr(obj, '__getitem__') and not hasattr(obj, '__index__') if that sounds better.

Also happy to wait longer if you need more time to consider. Mostly I just want to avoid this PR getting buried / forgotten. I am motivated to work on getting this finished and pulled into main! Ty for any assistance.

@dcherian
Copy link
Contributor

Seems like the general request is to allow indexing of object-type arrays with objects.

Can we just skip the check if the dimension being indexed is backed by an object dtype array?

@max-sixty
Copy link
Collaborator

Can we just skip the check if the dimension being indexed is backed by an object dtype array?

This seems worth trying. I'm not sure exactly how multiindexes will work, but if tests pass it's probably fine...

@keewis
Copy link
Collaborator

keewis commented Jan 31, 2024

I'm not entirely sure, but I believe @benbovy has been working on making MultiIndexes less of a special case, and by doing so I believe we'd be removing the special casing for dict objects as well. Can you confirm that, @benbovy?

@benbovy
Copy link
Member

benbovy commented Jan 31, 2024

Yes I think we can safely remove this special case now.

Selection on both single and multi-indexes used to be handled internally through the same function that was supporting only dimension coordinate indexers. This is not the case anymore.

@Sevans711
Copy link
Author

Awesome!

If this will be getting fixed imminently and sometime in the not-too-distant future (maybe within a few months?) I am happy to wait. I suspect it would be much less work for someone who is more familiar with the codebase than me. However if you need me to attempt to remove this special case please let me know, and I'll see what I can do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants