Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove cudf._lib.groupby in favor of inlining pylibcudf #17582

Open
wants to merge 17 commits into
base: branch-25.02
Choose a base branch
from

Conversation

mroeschke
Copy link
Contributor

Description

Contributes to #17317

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Dec 12, 2024
@mroeschke mroeschke self-assigned this Dec 12, 2024
@mroeschke mroeschke requested a review from a team as a code owner December 12, 2024 00:01
@github-actions github-actions bot added the CMake CMake build issue label Dec 12, 2024
Comment on lines 763 to 804
if (
is_string_dtype(col)
and agg not in _STRING_AGGS
and (
str_agg in {"cumsum", "cummin", "cummax"}
or not (
any(
a in str_agg
for a in {
"count",
"max",
"min",
"first",
"last",
"nunique",
"unique",
"nth",
}
)
or (agg is list)
)
)
):
raise TypeError(
f"function is not supported for this dtype: {agg}"
)
elif (
_is_categorical_dtype(col)
and agg not in _CATEGORICAL_AGGS
and (
str_agg in {"cumsum", "cummin", "cummax"}
or not (
any(
a in str_agg
for a in {"count", "max", "min", "unique"}
)
)
)
):
raise TypeError(
f"{col.dtype} type does not support {agg} operations"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be okay with moving this logic out into separate functions that determine if an aggregation is not supported for string and categorical types? They can be nested in _aggregate. Eg.

def _is_unsupported_for_string(col, str_agg):
    cumulative_agg = str_agg in {"cumsum", "cummin", "cummax"}
    basic_agg = any(a in str_agg for a in {
        "count", "max", "min", "first", "last", "nunique", "unique", "nth"
    })
    return (
        is_string_dtype(col)
        and str_agg not in _STRING_AGGS
        and (cumulative_agg or (not basic_agg and str_agg != "list"))
    )

def _is_unsupported_for_categorical(col, str_agg):
    cumulative_agg = atr_agg in {"cumsum", "cummin", "cummax"}
    basic_agg = any(a in str_agg for a in {"count", "max", "min", "unique"})
    return (
        _is_categorical_dtype(col)
        and agg_str not in _CATEGORICAL_AGGS
        and (cumulative_agg or not basic_agg)
    )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use a singledispatch function on the dtype.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. Was able adapt this to use singledispatch

Copy link
Contributor

@Matt711 Matt711 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, minor suggestions. I also like the singledispatch function you added.


def _is_all_scan_aggregate(all_aggs: list[list[str]]) -> bool:
"""
Returns true if all are scan aggregations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick

Suggested change
Returns true if all are scan aggregations.
Returns True if all are scan aggregations.

if self._dropna
else plc.types.NullPolicy.INCLUDE,
)
# Do we need this because we just check _spill_locks in test_spillable_df_groupby?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we rephrase this as a note?

Suggested change
# Do we need this because we just check _spill_locks in test_spillable_df_groupby?
# Note: We return a SimpleNamespace with an additional _spill_locks
# attribute solely to verify that the spill_lock is set correctly.

or agg_obj.kind in valid_aggregations
):
included_aggregations_i.append((agg, agg_obj.kind))
col_aggregations.append(agg_obj.c_obj)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this a plc_obj? And change this line too https://github.com/rapidsai/cudf/blob/branch-25.02/python/cudf/cudf/core/_internals/aggregation.py#L32

Suggested change
col_aggregations.append(agg_obj.c_obj)
col_aggregations.append(agg_obj.plc_obj)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

3 participants