Remove cudf._lib.groupby in favor of inlining pylibcudf #17582

mroeschke · 2024-12-12T00:01:18Z

Description

Contributes to #17317

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Matt711 · 2024-12-18T02:26:47Z

python/cudf/cudf/core/groupby/groupby.py

+                if (
+                    is_string_dtype(col)
+                    and agg not in _STRING_AGGS
+                    and (
+                        str_agg in {"cumsum", "cummin", "cummax"}
+                        or not (
+                            any(
+                                a in str_agg
+                                for a in {
+                                    "count",
+                                    "max",
+                                    "min",
+                                    "first",
+                                    "last",
+                                    "nunique",
+                                    "unique",
+                                    "nth",
+                                }
+                            )
+                            or (agg is list)
+                        )
+                    )
+                ):
+                    raise TypeError(
+                        f"function is not supported for this dtype: {agg}"
+                    )
+                elif (
+                    _is_categorical_dtype(col)
+                    and agg not in _CATEGORICAL_AGGS
+                    and (
+                        str_agg in {"cumsum", "cummin", "cummax"}
+                        or not (
+                            any(
+                                a in str_agg
+                                for a in {"count", "max", "min", "unique"}
+                            )
+                        )
+                    )
+                ):
+                    raise TypeError(
+                        f"{col.dtype} type does not support {agg} operations"
+                    )


Would you be okay with moving this logic out into separate functions that determine whether an aggregation is unsupported for string and categorical types? They can be nested in _aggregate. E.g.:

def _is_unsupported_for_string(col, str_agg):
    cumulative_agg = str_agg in {"cumsum", "cummin", "cummax"}
    basic_agg = any(
        a in str_agg
        for a in {
            "count", "max", "min", "first", "last", "nunique", "unique", "nth"
        }
    )
    return (
        is_string_dtype(col)
        and str_agg not in _STRING_AGGS
        and (cumulative_agg or (not basic_agg and str_agg != "list"))
    )


def _is_unsupported_for_categorical(col, str_agg):
    cumulative_agg = str_agg in {"cumsum", "cummin", "cummax"}
    basic_agg = any(a in str_agg for a in {"count", "max", "min", "unique"})
    return (
        _is_categorical_dtype(col)
        and str_agg not in _CATEGORICAL_AGGS
        and (cumulative_agg or not basic_agg)
    )

Maybe use a singledispatch function on the dtype.

Sure thing. Was able to adapt this to use singledispatch.
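For readers following the thread, here is a minimal sketch of what dispatching the support check on the dtype could look like with `functools.singledispatch`. It is not the code merged in this PR: the dtype classes are hypothetical stand-ins so the example runs without cudf, the function name `is_unsupported_agg` is invented for illustration, and the `_STRING_AGGS` / `_CATEGORICAL_AGGS` membership checks from the diff are left out.

```python
from functools import singledispatch


# Hypothetical stand-in dtype classes (not cudf types), so the sketch is runnable.
class StringDtype: ...
class CategoricalDtype: ...
class NumericDtype: ...


_CUMULATIVE = {"cumsum", "cummin", "cummax"}
_STRING_OK = {"count", "max", "min", "first", "last", "nunique", "unique", "nth"}
_CATEGORICAL_OK = {"count", "max", "min", "unique"}


@singledispatch
def is_unsupported_agg(dtype, str_agg: str) -> bool:
    """Fallback for dtypes with no special restriction."""
    return False


@is_unsupported_agg.register
def _(dtype: StringDtype, str_agg: str) -> bool:
    # Check cumulative names first: "cummin" contains "min" as a substring,
    # so the substring test below would otherwise let it through.
    if str_agg in _CUMULATIVE:
        return True
    basic_agg = any(a in str_agg for a in _STRING_OK)
    return not basic_agg and str_agg != "list"


@is_unsupported_agg.register
def _(dtype: CategoricalDtype, str_agg: str) -> bool:
    if str_agg in _CUMULATIVE:
        return True
    return not any(a in str_agg for a in _CATEGORICAL_OK)
```

The dispatch happens on the type of the first argument, so the caller passes a dtype instance and the nested `if`/`elif` on dtype disappears from the call site.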

Matt711

Thanks, minor suggestions. I also like the singledispatch function you added.

Matt711 · 2024-12-19T02:32:07Z

python/cudf/cudf/core/groupby/groupby.py

+
+def _is_all_scan_aggregate(all_aggs: list[list[str]]) -> bool:
+    """
+    Returns true if all are scan aggregations.


nitpick

Suggested change

Returns true if all are scan aggregations.

Returns True if all are scan aggregations.
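As context for the docstring above, a rough sketch of a check with the same signature might look like the following. Only the signature and the one-line docstring come from the diff; the set of scan names, the un-prefixed function name, and the behavior on mixed input are assumptions made for illustration.

```python
# Assumed scan names, for illustration only; not taken from the PR.
_SCANS = {"cumsum", "cummin", "cummax", "cumcount"}


def is_all_scan_aggregate(all_aggs: list[list[str]]) -> bool:
    """
    Returns True if all are scan aggregations.

    Raises NotImplementedError if scans and non-scans are mixed
    (an assumed policy for this sketch).
    """
    flat = [name for aggs in all_aggs for name in aggs]
    all_scan = all(name in _SCANS for name in flat)
    any_scan = any(name in _SCANS for name in flat)
    if any_scan and not all_scan:
        raise NotImplementedError(
            "Cannot perform both aggregation and scan in one operation"
        )
    # An empty request has no scans, so it is not "all scan".
    return all_scan and any_scan
```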

Matt711 · 2024-12-19T02:32:12Z

python/cudf/cudf/core/groupby/groupby.py

+                if self._dropna
+                else plc.types.NullPolicy.INCLUDE,
+            )
+            # Do we need this because we just check _spill_locks in test_spillable_df_groupby?


Should we rephrase this as a note?

Suggested change

# Do we need this because we just check _spill_locks in test_spillable_df_groupby?

# Note: We return a SimpleNamespace with an additional _spill_locks

# attribute solely to verify that the spill_lock is set correctly.
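A small sketch of the pattern the suggested note describes: wrapping an object in a `types.SimpleNamespace` and attaching an extra `_spill_locks` attribute that a test can inspect. The helper name `wrap_with_spill_locks` and the `obj` attribute are hypothetical, and plain `object()` instances stand in for the real groupby object and spill locks.

```python
from types import SimpleNamespace


def wrap_with_spill_locks(plc_groupby, spill_locks):
    # Hypothetical helper: keep a reference to the locks next to the wrapped
    # object so a test can check that they were set.
    return SimpleNamespace(obj=plc_groupby, _spill_locks=spill_locks)


# Stand-ins for the real objects.
groupby_obj = object()
locks = (object(), object())
handle = wrap_with_spill_locks(groupby_obj, locks)
```

A test can then assert on `handle._spill_locks` without the wrapped object needing to know about spilling.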

Matt711 · 2024-12-19T02:32:22Z

python/cudf/cudf/core/groupby/groupby.py

+                    or agg_obj.kind in valid_aggregations
+                ):
+                    included_aggregations_i.append((agg, agg_obj.kind))
+                    col_aggregations.append(agg_obj.c_obj)


Can we call this a plc_obj? And change this line too https://github.com/rapidsai/cudf/blob/branch-25.02/python/cudf/cudf/core/_internals/aggregation.py#L32

Suggested change

col_aggregations.append(agg_obj.c_obj)

col_aggregations.append(agg_obj.plc_obj)

mroeschke added 8 commits December 10, 2024 18:59

Move groupby to core._internals

c0adb97

Use plc instead of pylibcudf

8d0ceb8

Merge remote-tracking branch 'upstream/branch-25.02' into cudf/_lib/groupby

b3aa2e4

migrate groups

34e7540

Migrate aggregate

293fca8

migrate shift

bd597c9

Migrate replace_nulls

9f47e75

inline plc groupby object

eac75c3

mroeschke added the Python (Affects Python cuDF API), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels Dec 12, 2024

mroeschke self-assigned this Dec 12, 2024

mroeschke requested a review from a team as a code owner December 12, 2024 00:01

mroeschke requested review from wence- and galipremsagar December 12, 2024 00:01

github-actions bot added the CMake (CMake build issue) label Dec 12, 2024

mroeschke added 7 commits December 12, 2024 09:10

Merge remote-tracking branch 'upstream/branch-25.02' into cudf/_lib/groupby

ccd5c12

Keep _spill_lock for testing

ccc7af1

Undo breakpoint in groupby

77a7ce1

retype as types.SimpleNamespace

8672628

Merge remote-tracking branch 'upstream/branch-25.02' into cudf/_lib/groupby

240a3a4

Merge remote-tracking branch 'upstream/branch-25.02' into cudf/_lib/groupby

b6d23ff

Merge remote-tracking branch 'upstream/branch-25.02' into cudf/_lib/groupby

3df6310

Matt711 reviewed Dec 18, 2024

mroeschke added 2 commits December 18, 2024 15:21

Merge remote-tracking branch 'upstream/branch-25.02' into cudf/_lib/groupby

3649b10

Use singledispatch function

04c9896

mroeschke requested review from Matt711 and vyasr December 18, 2024 23:50

mroeschke mentioned this pull request Dec 19, 2024

Remove cudf._lib.utils in favor of python APIs #17625

Draft

3 tasks

Matt711 approved these changes Dec 19, 2024
