[FEA] Use libcudf Dictionary type for CategoricalColumn in Python #8573

Open
beckernick opened this issue Jun 21, 2021 · 3 comments
Assignees
Labels
feature request New feature or request Performance Performance related issue Python Affects Python cuDF API.

Comments

@beckernick
Member

cuDF Python would like to back the CategoricalColumn with the libcudf Dictionary type. Work toward this goal has started in #8567.

@beckernick beckernick added feature request New feature or request Python Affects Python cuDF API. Cython labels Jun 21, 2021
@isVoid isVoid self-assigned this Jun 21, 2021
@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@wence-
Contributor

wence- commented Sep 21, 2023

This desire came up again recently in relation to #14138, where it is noted that we implement a lot of "heavyweight" algorithms as a sequence of calls in Python, rather than pushing down into libcudf.

@isVoid's implementation work in #8567 stalled due to some differences in the way libcudf and pandas (and hence cudf) choose to model dictionary-encoded columns.

In libcudf, the keys of the dictionary are required to be sorted, and the encoding looks up the value by indexing into the keys array. This restricts dictionary encoding to keys that admit a total order, and (I think) doesn't have a hook for a user-provided comparator.
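As a rough illustration of the sorted-keys model described above, here is a minimal pure-Python sketch. It is not the libcudf API; the function names `dictionary_encode_sorted` and `dictionary_decode` are hypothetical and only model the idea that keys are kept sorted and each row stores an index into the keys array.

```python
from bisect import bisect_left


def dictionary_encode_sorted(values):
    """Return (sorted unique keys, per-row indices into those keys).

    Hypothetical helper: sorting the keys is what forces the key type
    to admit a total order.
    """
    keys = sorted(set(values))
    # Every value is present in `keys`, so bisect_left returns its position.
    indices = [bisect_left(keys, v) for v in values]
    return keys, indices


def dictionary_decode(keys, indices):
    """Recover the original values by indexing into the keys."""
    return [keys[i] for i in indices]


keys, indices = dictionary_encode_sorted(["b", "a", "c", "a"])
decoded = dictionary_decode(keys, indices)
```

Note that both `sorted` and `bisect_left` rely on `<` being defined for the keys, which is the restriction the paragraph above points out.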

In pandas, categoricals (dictionary-encoded columns) come in two flavours:

  1. ordered
  2. unordered

The latter do not require that the keys admit a total order (or indeed a partial one), and can be applied even in the case where the key type does have a "natural" ordering, e.g.:

In [5]: col = pd.Categorical([1, 2, 3], ordered=False)

In [6]: col.min() # => TypeError

Ordered categoricals either use the natural ordering induced by the key type (this matches libcudf), or allow for a user-defined ordering. This enables the user to impose a total order on naturally unordered key types (for example floats), and/or provide one that is different from the natural order:

col = pd.Categorical([3, 2, 1], ordered=True)
col.min() # => 1

col = pd.Categorical([3, 2, 1], categories=[3, 1, 2], ordered=True)
col.min() # => 3
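To make the semantics of these examples explicit, here is a small pure-Python model of how the ordering is derived. It is a sketch, not pandas internals: `categorical_min` is a hypothetical helper that assumes each value is encoded as its position in the user-supplied category list and that comparison happens on those positions.

```python
def categorical_min(values, categories, ordered):
    """Model of min() on a categorical: compare by position in `categories`."""
    if not ordered:
        # Mirrors the TypeError raised for unordered categoricals.
        raise TypeError("min is not defined for an unordered categorical")
    position = {cat: i for i, cat in enumerate(categories)}
    codes = [position[v] for v in values]
    return categories[min(codes)]


# Natural ordering: the categories are the sorted unique values.
natural = categorical_min([3, 2, 1], categories=[1, 2, 3], ordered=True)

# User-defined ordering: 3 sorts before 1, which sorts before 2.
custom = categorical_min([3, 2, 1], categories=[3, 1, 2], ordered=True)
```

Under this model the key type itself never needs a `<` operator; only the integer positions are compared.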

AIUI, it was bridging these differences that required too many hacks and workarounds on the Python side.

In light of this, we should consider whether libcudf needs some extensions to support cudf's use of dictionary encoding, or whether there is a smart way of managing things in a translation layer that doesn't require huge amounts of special-casing.
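One possible shape for such a translation layer, sketched in pure Python under stated assumptions: keep the keys sorted (as libcudf requires) and carry a separate rank array recording where each sorted key sits in the user's ordering. The names `to_sorted_layout` and `ordered_min` are hypothetical, and nothing here is an existing cudf or libcudf interface.

```python
def to_sorted_layout(values, categories):
    """Re-encode a user-ordered categorical onto sorted keys.

    Returns (sorted_keys, indices, rank), where indices[j] points into
    sorted_keys and rank[i] is the user-order position of sorted_keys[i].
    """
    sorted_keys = sorted(categories)
    key_pos = {k: i for i, k in enumerate(sorted_keys)}
    user_pos = {c: i for i, c in enumerate(categories)}
    indices = [key_pos[v] for v in values]
    rank = [user_pos[k] for k in sorted_keys]
    return sorted_keys, indices, rank


def ordered_min(sorted_keys, indices, rank):
    """Minimum under the user's ordering, computed via the rank array."""
    best = min(indices, key=lambda i: rank[i])
    return sorted_keys[best]


sorted_keys, indices, rank = to_sorted_layout([3, 2, 1], categories=[3, 1, 2])
result = ordered_min(sorted_keys, indices, rank)
```

This only addresses user-defined orderings on keys that can themselves be sorted; key types with no total order would still need support on the libcudf side.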

@wence- wence- added the Performance Performance related issue label Sep 21, 2023
@vyasr
Contributor

vyasr commented Sep 22, 2023

Another reason Michael's work stalled is that, because categorical data in cudf does not map directly to a libcudf type, it is special-cased all over the place and therefore takes a large amount of work to track down. We were hoping it would be simpler to work on this after we had refactored cudf internals so that the categorical logic was better isolated to just the categorical column, or at least more contained in some other way. I'm not opposed to revisiting the work now, but as an FYI, I'd hope this becomes substantially easier after we restructure cudf internals around pylibcudf over the next couple of releases.
