Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix order-preservation in pandas-compat unsorted groupby #16942

Merged
merged 1 commit into from
Sep 30, 2024

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Sep 27, 2024

Description

When sort=False is requested in groupby aggregations and pandas compatibility mode is enabled, we are on the hook to produce the grouped aggregation result in an order which matches the input key order. We previously nearly did this, but the reordering relied on the (incorrect) assumption that when joining two tables with a left join, the resulting gather map for the left table is the identity. This is not the case.

To fix this, we must permute the right (result) table gather map by the ordering that makes the left map the identity (AKA, sort by key with the left map as keys) before permuting the result.

While here, replace the (bounds-checking) IndexedFrame.take call with usage of the internal (non-bounds-checking) _gather method. This avoids a redundant reduction over the indices, since by construction they are in bounds.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

When sort=False is requested in groupby aggregations and pandas
compatibility mode is enabled, we are on the hook to produce the
grouped aggregation result in an order which matches the input key
order. We previously nearly did this, but the reordering relied on
the (incorrect) assumption that when joining two tables with a left
join, the resulting gather map for the left table is the identity.
This is not the case.

To fix this, we must permute the right (result) table gather map by
the ordering that makes the left map the identity (AKA, sort by key
with the left map as keys) before permuting the result.

While here, replace the (bounds-checking) IndexedFrame.take call with
usage of the internal (non-bounds-checking) _gather method. This
avoids a redundant reduction over the indices, since by construction
they are in bounds.

- Closes rapidsai#16908
@wence- wence- requested a review from a team as a code owner September 27, 2024 09:37
@wence- wence- requested review from mroeschke and isVoid September 27, 2024 09:37
@github-actions github-actions bot added the Python Affects Python cuDF API. label Sep 27, 2024
@wence- wence- added bug Something isn't working non-breaking Non-breaking change cudf.pandas Issues specific to cudf.pandas cuDF (Python) labels Sep 27, 2024
@wence-
Copy link
Contributor Author

wence- commented Sep 30, 2024

/merge

@rapids-bot rapids-bot bot merged commit 9b2f892 into rapidsai:branch-24.12 Sep 30, 2024
106 checks passed
@wence- wence- deleted the wence/fix/16908 branch September 30, 2024 10:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cudf.pandas Issues specific to cudf.pandas non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG] groupby aggregation does not match pandas in pandas-compat mode when sort=False
2 participants