-
Notifications
You must be signed in to change notification settings - Fork 917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix categorical-accessor support and testing in dask-cudf #15591
Changes from 4 commits
cb046bb
7a45bde
057fdcb
fde1651
c0d972f
c2bc812
0f01712
b4e7c66
d901e20
f38d62e
e229267
ea0616b
6f0ee4c
5432d20
164cc2d
5913c8d
e41ad69
88e8383
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -111,7 +111,8 @@ def test_categorical_accessor_initialization2(data): | |
dsr.cat | ||
|
||
|
||
@xfail_dask_expr("TODO: Unexplained dask-expr failure") | ||
# TODO: Remove this once we are pinned to dask>=2024.5.0 | ||
@xfail_dask_expr("Requires: https://github.com/dask/dask/pull/11059") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wonder if the That way, in addition to leaving this TODO we could also do something like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I was thinking the same thing. I was actually going to submit a dedicated PR to revise the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay - Thanks again for the suggestion. The |
||
@pytest.mark.parametrize("data", [data_cat_1()]) | ||
def test_categorical_basic(data): | ||
cat = data.copy() | ||
|
@@ -203,7 +204,6 @@ def test_categorical_compare_unordered(data): | |
dsr < dsr | ||
|
||
|
||
@xfail_dask_expr("TODO: Unexplained dask-expr failure") | ||
@pytest.mark.parametrize("data", [data_cat_3()]) | ||
def test_categorical_compare_ordered(data): | ||
cat1 = data[0].copy() | ||
|
@@ -274,7 +274,6 @@ def test_categorical_categories(): | |
) | ||
|
||
|
||
@xfail_dask_expr("TODO: Unexplained dask-expr failure") | ||
def test_categorical_as_known(): | ||
df = dask_cudf.from_cudf(DataFrame({"col_1": [0, 1, 2, 3]}), npartitions=2) | ||
df["col_1"] = df["col_1"].astype("category") | ||
|
@@ -283,7 +282,19 @@ def test_categorical_as_known(): | |
pdf = dd.from_pandas(pd.DataFrame({"col_1": [0, 1, 2, 3]}), npartitions=2) | ||
pdf["col_1"] = pdf["col_1"].astype("category") | ||
expected = pdf["col_1"].cat.as_known() | ||
dd.assert_eq(expected, actual) | ||
|
||
# Note: Categories may be ordered differently in | ||
# cudf and pandas. Therefore, we need to compare | ||
# the global set of categories (before and after | ||
# calling `compute`), then we need to check that | ||
# the initial order of rows was preserved. | ||
assert set(expected.cat.categories) == set( | ||
actual.cat.categories.values_host | ||
) | ||
assert set(expected.compute().cat.categories) == set( | ||
actual.compute().cat.categories.values_host | ||
) | ||
dd.assert_eq(expected, actual.astype(expected.dtype)) | ||
|
||
|
||
def test_str_slice(): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm honestly not sure why the
Frame.__dask_tokenize__
definition (which looks very similar) isn't being used?Whatever the reason may be,
test_categorical_compare_ordered
fails without this fix, because differentSeries
objects end up being tokenized to the same value, and the corresponding expressions are cached between tests. (general message: Unique/proper tokenization is very important when query-planning is active)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like before this was using
IndexedFrame.__dask_tokenize__
where the data was being hashed asself.hash_values().values_host
as opposed toself.to_pandas()
(might be difference here for categorical)Might be a better fix if
IndexedFrame.__dask_tokenize__
usesto_pandas()
instead ofself.hash_values
? Additionally if we want to use this fix as-is I think we would need to also incorporate theself.index
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha! Thanks for pointing that out @mroeschke !
This is actually a problem we ran into before and fixed in
Frame.__dask_tokenize__
. It turns out thatnormalize_token(self._dtypes)
doesn't work very well. The more reliable thing to do is actually usestr(self._dtypes)
. With that said, dtypes with many categories may not be completely/well represented bystr(self._dtypes)
. Therefore, I just added an extra line to explicitly normalize the actualcategories
for each categorical dtype.I think you are right that this is probably the safest and most robust thing to do. However, I am still hesitant to remove the
hash_values
code path. Right now, we avoid moving more than two columns (the hashed values, and the index) to host memory when a cudf object is tokenized. The overhead difference may not be dramatic, but it would be nice to avoid moving the whole thing to pandas.