-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove cudf._lib.groupby in favor of inlining pylibcudf #17582
base: branch-25.02
Are you sure you want to change the base?
Conversation
if ( | ||
is_string_dtype(col) | ||
and agg not in _STRING_AGGS | ||
and ( | ||
str_agg in {"cumsum", "cummin", "cummax"} | ||
or not ( | ||
any( | ||
a in str_agg | ||
for a in { | ||
"count", | ||
"max", | ||
"min", | ||
"first", | ||
"last", | ||
"nunique", | ||
"unique", | ||
"nth", | ||
} | ||
) | ||
or (agg is list) | ||
) | ||
) | ||
): | ||
raise TypeError( | ||
f"function is not supported for this dtype: {agg}" | ||
) | ||
elif ( | ||
_is_categorical_dtype(col) | ||
and agg not in _CATEGORICAL_AGGS | ||
and ( | ||
str_agg in {"cumsum", "cummin", "cummax"} | ||
or not ( | ||
any( | ||
a in str_agg | ||
for a in {"count", "max", "min", "unique"} | ||
) | ||
) | ||
) | ||
): | ||
raise TypeError( | ||
f"{col.dtype} type does not support {agg} operations" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be okay with moving this logic out into separate functions that determine if an aggregation is not supported for string and categorical types? They can be nested in _aggregate
. Eg.
def _is_unsupported_for_string(col, str_agg):
cumulative_agg = str_agg in {"cumsum", "cummin", "cummax"}
basic_agg = any(a in str_agg for a in {
"count", "max", "min", "first", "last", "nunique", "unique", "nth"
})
return (
is_string_dtype(col)
and str_agg not in _STRING_AGGS
and (cumulative_agg or (not basic_agg and str_agg != "list"))
)
def _is_unsupported_for_categorical(col, str_agg):
cumulative_agg = atr_agg in {"cumsum", "cummin", "cummax"}
basic_agg = any(a in str_agg for a in {"count", "max", "min", "unique"})
return (
_is_categorical_dtype(col)
and agg_str not in _CATEGORICAL_AGGS
and (cumulative_agg or not basic_agg)
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use a singledispatch
function on the dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing. Was able adapt this to use singledispatch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, minor suggestions. I also like the singledispatch
function you added.
|
||
def _is_all_scan_aggregate(all_aggs: list[list[str]]) -> bool: | ||
""" | ||
Returns true if all are scan aggregations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick
Returns true if all are scan aggregations. | |
Returns True if all are scan aggregations. |
if self._dropna | ||
else plc.types.NullPolicy.INCLUDE, | ||
) | ||
# Do we need this because we just check _spill_locks in test_spillable_df_groupby? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we rephrase this as a note?
# Do we need this because we just check _spill_locks in test_spillable_df_groupby? | |
# Note: We return a SimpleNamespace with an additional _spill_locks | |
# attribute solely to verify that the spill_lock is set correctly. |
or agg_obj.kind in valid_aggregations | ||
): | ||
included_aggregations_i.append((agg, agg_obj.kind)) | ||
col_aggregations.append(agg_obj.c_obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we call this a plc_obj
? And change this line too https://github.com/rapidsai/cudf/blob/branch-25.02/python/cudf/cudf/core/_internals/aggregation.py#L32
col_aggregations.append(agg_obj.c_obj) | |
col_aggregations.append(agg_obj.plc_obj) |
Description
Contributes to #17317
Checklist