Fix order-preservation in pandas-compat unsorted groupby (#16942)
When sort=False is requested in groupby aggregations and pandas compatibility mode is enabled, we are on the hook to produce the grouped aggregation result in an order which matches the input key order. We previously nearly did this, but the reordering relied on the (incorrect) assumption that when joining two tables with a left join, the resulting gather map for the left table is the identity. This is not the case.

To fix this, we must permute the right (result) table gather map by the ordering that makes the left map the identity (AKA, sort by key with the left map as keys) before permuting the result.

While here, replace the (bounds-checking) IndexedFrame.take call with usage of the internal (non-bounds-checking) _gather method. This avoids a redundant reduction over the indices, since by construction they are in bounds.

- Closes #16908

Authors:
- Lawrence Mitchell (https://github.com/wence-)

Approvers:
- Matthew Roeschke (https://github.com/mroeschke)

URL: #16942
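The reordering described in the commit message can be sketched in plain Python. This is only an illustration under stated assumptions, not the cudf implementation: `left_join_gather_maps` and `gather` are hypothetical helpers, and the join deliberately emits its matches in reverse order to mimic a join that does not preserve left-row order.

```python
def left_join_gather_maps(left_keys, right_keys):
    """Hypothetical stand-in for a left join returning (left_map, right_map).

    Entry i of the maps says: row left_map[i] of the left table matches row
    right_map[i] of the right table. The pairs are reversed on purpose to
    mimic a join whose output does not follow left-row order.
    """
    position = {key: i for i, key in enumerate(right_keys)}
    pairs = [(i, position[key]) for i, key in enumerate(left_keys)]
    pairs.reverse()
    left_map = [left for left, _ in pairs]
    right_map = [right for _, right in pairs]
    return left_map, right_map


def gather(column, gather_map):
    """Hypothetical helper: pick rows of `column` in the order given by `gather_map`."""
    return [column[i] for i in gather_map]


# Keys in order of first appearance in the input (the order we must produce).
first_seen_keys = [3, 1, 2]
# Aggregation result as a groupby backend might return it (arbitrary order).
result_keys = [1, 2, 3]
result_sums = [70, 40, 40]

left_map, right_map = left_join_gather_maps(first_seen_keys, result_keys)

# Previous behaviour: gather with right_map directly, which is only correct
# if left_map happens to be the identity [0, 1, 2].
naive_keys = gather(result_keys, right_map)

# Fix: sort right_map using left_map as the sort key, so the implied left map
# becomes the identity, then gather the result with the reordered map.
ordered_right_map = [right for _, right in sorted(zip(left_map, right_map))]
fixed_keys = gather(result_keys, ordered_right_map)
fixed_sums = gather(result_sums, ordered_right_map)
```

With the reversed join output, `naive_keys` does not match `first_seen_keys`, while `fixed_keys` does and `fixed_sums` carries the matching sums.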
Showing 2 changed files with 51 additions and 5 deletions.
29 changes: 29 additions & 0 deletions
python/cudf/cudf/tests/groupby/test_ordering_pandas_compat.py
@@ -0,0 +1,29 @@
# Copyright (c) 2024, NVIDIA CORPORATION.
import numpy as np
import pytest

import cudf
from cudf.testing import assert_eq


@pytest.fixture(params=[False, True], ids=["without_nulls", "with_nulls"])
def with_nulls(request):
    return request.param


@pytest.mark.parametrize("nrows", [30, 300, 300_000])
@pytest.mark.parametrize("nkeys", [1, 2, 4])
def test_groupby_maintain_order_random(nrows, nkeys, with_nulls):
    key_names = [f"key{key}" for key in range(nkeys)]
    key_values = [np.random.randint(100, size=nrows) for _ in key_names]
    value = np.random.randint(-100, 100, size=nrows)
    df = cudf.DataFrame(dict(zip(key_names, key_values), value=value))
    if with_nulls:
        for key in key_names:
            df.loc[df[key] == 1, key] = None
    with cudf.option_context("mode.pandas_compatible", True):
        got = df.groupby(key_names, sort=False).agg({"value": "sum"})
    expect = (
        df.to_pandas().groupby(key_names, sort=False).agg({"value": "sum"})
    )
    assert_eq(expect, got, check_index_type=not with_nulls)
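The property this test checks is that, with sort=False, groups appear in the order in which their keys first occur in the input. A pure-Python model of that ordering is sketched below; `grouped_sum_first_seen` is a hypothetical helper written for illustration, not part of pandas or cudf, and it ignores the null-key handling that the with_nulls variant of the test exercises.

```python
def grouped_sum_first_seen(keys, values):
    """Sum values per key, keeping groups in order of first appearance.

    Relies on dicts preserving insertion order (Python 3.7+).
    """
    totals = {}
    for key, value in zip(keys, values):
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())


# Multi-column keys are modelled as tuples, one entry per key column.
keys = [(3, "a"), (1, "b"), (3, "a"), (2, "c"), (1, "b")]
values = [10, 20, 30, 40, 50]
ordered = grouped_sum_first_seen(keys, values)
```

Here `ordered` lists the group `(3, "a")` first, then `(1, "b")`, then `(2, "c")`, which is the row order the test expects from both libraries.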