[FEA] Define __hash__ on CategoricalDtype #14027

Open
mroeschke opened this issue Aug 31, 2023 · 1 comment
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@mroeschke

Is your feature request related to a problem? Please describe.
Would be nice if hash(CategoricalDtype(...)) was supported

In [9]: import cudf

In [10]: import pandas

In [11]: hash(cudf.CategoricalDtype(list("abc")))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 hash(cudf.CategoricalDtype(list("abc")))

TypeError: unhashable type: 'CategoricalDtype'

In [12]: hash(pandas.CategoricalDtype(list("abc")))
Out[12]: 1532899084736511412

Describe the solution you'd like
Define __hash__ on CategoricalDtype

Describe alternatives you've considered
Custom hash based on order and categories


mroeschke added the labels "feature request", "Needs Triage", and "Python", and removed the label "Needs Triage", on Aug 31, 2023.
@wence-

wence- commented Sep 1, 2023

Hmmm, this opens a can of worms:

import cudf
dt = cudf.CategoricalDtype(["a", "b", "c"])
hash(dt) # TypeError
hash(dt.categories) # works, dt.categories is a `StringIndex`, so that's immutable, so hashing is ok.
hash(dt.categories._column) # works?! This seems bad, since a column is _not_ immutable
hash(cudf.StringIndex(["a", "b", "c"])) == hash(cudf.StringIndex(["a", "b", "c"])) # False, uh-oh

What is going on?

Recall that if you define no dunder ops on a class, then you get __eq__ and __hash__ from object, which uses id for equality and a hash of id for __hash__.

If you define a class that sets __eq__ but not __hash__, the interpreter automatically sets __hash__ = None to indicate that the object is unhashable, since it can't automatically construct a hash function that satisfies the invariant x == y => hash(x) == hash(y). See https://docs.python.org/3/reference/datamodel.html#object.__hash__
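
For reference, the two default behaviours can be reproduced in plain Python without cuDF. The class names and the `is_hashable` helper below are made up for illustration only:

```python
class IdentityOnly:
    """Defines neither __eq__ nor __hash__: both are inherited from object."""


class EqOnly:
    """Defines __eq__ but not __hash__, so Python sets __hash__ = None."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if not isinstance(other, EqOnly):
            return NotImplemented
        return self.value == other.value


def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(is_hashable(IdentityOnly()))  # identity-based hash inherited from object
print(EqOnly.__hash__)              # None: set implicitly at class creation
print(is_hashable(EqOnly(1)))       # hash() raises TypeError for this class
```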

Columns and Indexes have __eq__ defined, so why can we hash them? It turns out the __eq__ method is set programmatically by a mixin class, through __init_subclass__, and it appears that this is sufficiently dynamic that the interpreter doesn't spot the __eq__ method and set __hash__ = None.
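
A minimal sketch of that effect in plain Python, assuming a mixin that assigns __eq__ from __init_subclass__. `EqMixin` and `ToyColumn` are hypothetical names and this is not the actual cuDF mixin, only the same general pattern:

```python
class EqMixin:
    """Hypothetical mixin that attaches __eq__ after the subclass is created."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)

        def _eq(self, other):
            return self.values == other.values

        # Assigned after the class body was processed, so the implicit
        # "__eq__ without __hash__ means __hash__ = None" rule has already
        # been evaluated and does not fire again.
        cls.__eq__ = _eq


class ToyColumn(EqMixin):
    def __init__(self, values):
        self.values = list(values)


a = ToyColumn(["a", "b", "c"])
b = ToyColumn(["a", "b", "c"])

print(a == b)                                  # value-based equality from the mixin
print(ToyColumn.__hash__ is object.__hash__)   # still the identity-based default
print(hash(a) == hash(b))                      # equal objects, but hashes are id-based
```

The last comparison is the problem in miniature: the objects compare equal but their hashes come from their identities, which breaks the invariant that equal objects hash equal.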

CategoricalDtype explicitly defines __eq__, hence we get __hash__ = None.

To hash this we need to compute a deterministic hash of the categories, and mix with the ordered flag (as you note). This requires, ideally, a libcudf implementation of a hash of a column of values: right now all of the hashes that are implemented produce a per-row hash.
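
As a rough host-side illustration of "hash the categories and mix with the ordered flag", here is a toy class in plain Python. `ToyCategoricalDtype` is hypothetical and is not the cuDF class. It assumes pandas-like semantics, where category order only matters when ordered=True, and it sidesteps the real difficulty, which is hashing a device column without copying it to the host:

```python
class ToyCategoricalDtype:
    """Hypothetical stand-in for a categorical dtype; not the cuDF class."""

    def __init__(self, categories, ordered=False):
        self.categories = tuple(categories)
        self.ordered = bool(ordered)

    def _key(self):
        # Assumption: for unordered dtypes the order of categories is
        # irrelevant, so compare and hash them as a set.
        if self.ordered:
            cats = self.categories
        else:
            cats = frozenset(self.categories)
        return (self.ordered, cats)

    def __eq__(self, other):
        if not isinstance(other, ToyCategoricalDtype):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Built from the same key as __eq__, so x == y implies
        # hash(x) == hash(y).
        return hash(self._key())


x = ToyCategoricalDtype(list("abc"))
y = ToyCategoricalDtype(list("cba"))
print(x == y, hash(x) == hash(y))

lookup = {x: "found"}
print(lookup[y])
```

Deriving both __eq__ and __hash__ from one key is a simple way to keep them consistent. Whatever column-level hash ends up being used would need to agree with the dtype's equality in the same way.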
