feat: support std and var with ddof !=1 in pandas-like group by #1645

Merged
MarcoGorelli merged 6 commits into main from feat/group-by-specific-paths
Dec 22, 2024

Conversation

FBruzzesi
Member

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Comment on lines 270 to 272
# Invert the dict to have root_name: output_name
# TODO(FBruzzesi): Account for duplicates
columns={v: k for k, v in output_to_root_name_mapping.items()},
Member

🤔 in case someone does .agg(nw.col('a').std(ddof=1).alias('b'), nw.col('a').std(ddof=1).alias('c'))? hmm yeah i guess it's possible

btw, since cuDF has now introduced cudf.NamedAgg, we could use that if it makes all this logic a bit easier
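For reference, a rough sketch of the named-aggregation style mentioned here, written against `pandas.NamedAgg` rather than cuDF. Whether `cudf.NamedAgg` behaves the same way is an assumption this sketch does not check, and the frame and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data: group key "a", one value column "x".
df = pd.DataFrame({"a": [1, 1, 2, 2], "x": [1.0, 3.0, 2.0, 6.0]})

# Each keyword is an output name; NamedAgg says which root column and
# which aggregation it comes from, so one root column can feed several outputs.
out = df.groupby("a").agg(
    b=pd.NamedAgg(column="x", aggfunc="std"),
    c=pd.NamedAgg(column="x", aggfunc="var"),
)
print(out)
```

The output-name to root-column relationship is stated directly in the call, so no separate renaming step is needed afterwards.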

Member Author

Yes, correct, similar to simple aggs.
To avoid an explosion in the number of lists to keep track of, maybe there is a better data structure. I will think about it a bit, but aside from that specific case, this PR should be ready.
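To make the duplicate case concrete, here is a minimal sketch of what the dict inversion in the reviewed lines does when two output names share one root column. The mapping contents are made up for illustration; only the name `output_to_root_name_mapping` comes from the diff.

```python
# Hypothetical contents: two aliases ("b" and "c") both derived from root column "a".
output_to_root_name_mapping = {"b": "a", "c": "a"}

# Same inversion as in the reviewed lines: root_name -> output_name.
inverted = {v: k for k, v in output_to_root_name_mapping.items()}

# Dict keys are unique, so the later alias overwrites the earlier one.
print(inverted)  # {'a': 'c'}
```

One of the two requested outputs is silently dropped, which is why a plain root-to-output mapping cannot represent this case.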

[
grouped[std_root_names]
.std(ddof=ddof)
.set_axis(std_output_names, axis="columns", copy=False)
Member Author

@FBruzzesi FBruzzesi Dec 22, 2024

@MarcoGorelli how bad is it to use set_axis to rename the columns? What's the alternative? Our rename does a mapping, yet here we could do:

grouped[["b", "b"]].std(ddof=2).set_axis(["c", "d"], axis="columns")

For once we are exploiting some pandas weirdness 😁
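A small self-contained sketch of that trick, assuming a toy frame with a group key `a` and a value column `b` (both invented here). It is only meant to show the positional renaming, not the actual narwhals code path.

```python
import pandas as pd

# Hypothetical data: two groups of three rows each.
df = pd.DataFrame(
    {"a": [1, 1, 1, 2, 2, 2], "b": [1.0, 2.0, 4.0, 3.0, 5.0, 10.0]}
)
grouped = df.groupby("a")

# Select the same root column twice, aggregate once, then rename by position.
result = (
    grouped[["b", "b"]]
    .std(ddof=2)
    .set_axis(["c", "d"], axis="columns")
)

# Reference: the same aggregation on the single column.
expected = grouped["b"].std(ddof=2)
print(result)
```

Because `set_axis` assigns labels by position, the two identically named intermediate columns can receive different output names, which a label-to-label mapping cannot do.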

Member

Should be fine, let's just use copy=False

@FBruzzesi FBruzzesi marked this pull request as ready for review December 22, 2024 15:24
@FBruzzesi FBruzzesi changed the title from "WIP, feat: support std and var with ddof !=1 in pandas-like group by" to "feat: support std and var with ddof !=1 in pandas-like group by" Dec 22, 2024
@MarcoGorelli
Member

oh nice

_______ test_group_by_depth_1_std_var[pandas_pyarrow_constructor-var-2] ________
[XPASS(strict)] 
_______ test_group_by_depth_1_std_var[pandas_pyarrow_constructor-var-0] ________
[XPASS(strict)] 
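As a reminder of what the `ddof` argument changes (the parametrized ids above appear to end in the `ddof` value, though that reading is an assumption), here is a plain-Python sketch of variance with an arbitrary `ddof`. The helper name `var_with_ddof` and the data are hypothetical.

```python
import math
import statistics


def var_with_ddof(values, ddof):
    """Sum of squared deviations from the mean, divided by (n - ddof)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


data = [1.0, 2.0, 4.0, 7.0]

# ddof=0 is the population variance, ddof=1 the usual sample variance.
assert math.isclose(var_with_ddof(data, 0), statistics.pvariance(data))
assert math.isclose(var_with_ddof(data, 1), statistics.variance(data))

# Any other ddof only changes the divisor.
print(var_with_ddof(data, 2))  # 10.5
```

The standard deviation is the square root of the same quantity.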

Member

@MarcoGorelli MarcoGorelli left a comment

seriously awesome work here @FBruzzesi !

@MarcoGorelli MarcoGorelli added the enhancement New feature or request label Dec 22, 2024
@MarcoGorelli MarcoGorelli merged commit 9f09ea0 into main Dec 22, 2024
24 checks passed
@FBruzzesi FBruzzesi deleted the feat/group-by-specific-paths branch December 22, 2024 20:04
2 participants