[FEA] Define `hash` on CategoricalDtype #14027

mroeschke · 2023-08-31T21:45:05Z

Is your feature request related to a problem? Please describe.
Would be nice if hash(CategoricalDtype(...)) was supported

In [9]: import cudf

In [10]: import pandas

In [11]: hash(cudf.CategoricalDtype(list("abc")))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 hash(cudf.CategoricalDtype(list("abc")))

TypeError: unhashable type: 'CategoricalDtype'

In [12]: hash(pandas.CategoricalDtype(list("abc")))
Out[12]: 1532899084736511412

Describe the solution you'd like
Define __hash__ on CategoricalDtype

Describe alternatives you've considered
Custom hash based on order and categories


wence- · 2023-09-01T11:00:44Z

Hmmm, this opens a can of worms:

import cudf
dt = cudf.CategoricalDtype(["a", "b", "c"])
hash(dt) # TypeError
hash(dt.categories) # works, dt.categories is a `StringIndex`, so that's immutable, so hashing is ok.
hash(dt.categories._column) # works?! This seems bad, since a column is _not_ immutable
hash(cudf.StringIndex(["a", "b", "c"])) == hash(cudf.StringIndex(["a", "b", "c"])) # False, uh-oh

What is going on?

Recall that if you define no dunder ops on a class, then you get __eq__ and __hash__ from object, which uses id for equality and a hash of id for __hash__.

If you define a class that sets __eq__ but not __hash__, then the interpreter automatically sets __hash__ = None to indicate that the object is unhashable, since it can't automatically construct a hash function that satisfies the invariant x == y => hash(x) == hash(y). (The default pair inherited from object satisfies this only because x == y also implies x is y.) See https://docs.python.org/3/reference/datamodel.html#object.__hash__

Columns and Indexes have __eq__ defined, so why can we hash them? Turns out the __eq__ method is set programmatically by a mixin class, through __init_subclass__, and it appears that this is sufficiently dynamic that the interpreter doesn't spot the __eq__ method and set __hash__ = None.

CategoricalDtype explicitly defines __eq__, hence we get __hash__ = None.
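A minimal pure-Python sketch of the two cases, in case it helps. The class names (`EqFromMixin`, `DynamicEq`, `ExplicitEq`) are hypothetical and this is not cuDF's actual mixin; it only assumes that the mixin assigns `__eq__` from `__init_subclass__`, i.e. after the class body has already been processed.

```python
class EqFromMixin:
    """Hypothetical mixin that attaches __eq__ after the subclass is created."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)

        def _eq(self, other):
            return type(other) is type(self) and self.value == other.value

        # Assigned after class creation, so "__eq__" was never in the class
        # body and the interpreter does not add __hash__ = None.
        cls.__eq__ = _eq


class DynamicEq(EqFromMixin):
    def __init__(self, value):
        self.value = value


class ExplicitEq:
    def __init__(self, value):
        self.value = value

    # Defined in the class body without __hash__: __hash__ becomes None.
    def __eq__(self, other):
        return type(other) is type(self) and self.value == other.value


a, b = DynamicEq(1), DynamicEq(1)
print(ExplicitEq.__hash__ is None)            # True
print(DynamicEq.__hash__ is object.__hash__)  # True
print(a == b)                                 # True
print(hash(a) == hash(b))                     # False: equal objects, different hashes
```

The last two lines show the same broken invariant as the `StringIndex` comparison above: the objects compare equal by value but still hash by identity.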

To hash this we need to compute a deterministic hash of the categories and mix it with the ordered flag (as you note). Ideally this requires a libcudf implementation that hashes a whole column of values down to a single value: right now all of the hashes that are implemented produce a per-row hash.
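As a rough host-side illustration of the shape this could take, here is a sketch using a hypothetical stand-in class (`SketchCategoricalDtype`, not the cuDF type). It assumes the categories fit in a Python tuple and that equality is sensitive to category order; the real dtype's equality semantics may differ, and whatever `__eq__` compares is what `__hash__` has to be built from.

```python
class SketchCategoricalDtype:
    """Hypothetical stand-in, not cudf.CategoricalDtype."""

    def __init__(self, categories, ordered=False):
        # Assumption: categories are hashable scalars held on the host.
        self.categories = tuple(categories)
        self.ordered = bool(ordered)

    def _key(self):
        # Single source of truth shared by __eq__ and __hash__.
        return (self.categories, self.ordered)

    def __eq__(self, other):
        if not isinstance(other, SketchCategoricalDtype):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Mixes the categories with the ordered flag.
        return hash(self._key())


x = SketchCategoricalDtype(list("abc"))
y = SketchCategoricalDtype(list("abc"))
z = SketchCategoricalDtype(list("abc"), ordered=True)
print(x == y, hash(x) == hash(y))  # True True
print(x == z)                      # False
print(len({x, y, z}))              # 2
```

Two caveats: Python's built-in `hash` of strings is randomized per process, so this is only stable within one interpreter session, and pulling the categories to the host is exactly the cost a column-level hash in libcudf would avoid.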

mroeschke added the labels feature request (New feature or request), Needs Triage (Need team to review and classify), and Python (Affects Python cuDF API), and removed the Needs Triage label · Aug 31, 2023

vyasr added this to cuDF Python Nov 5, 2024

github-project-automation bot moved this to Todo in cuDF Python Nov 5, 2024
