Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make column_empty mask buffer creation consistent with libcudf #16715

Merged

Conversation

mroeschke
Copy link
Contributor

Description

Based on offline discussions, this PR makes column_empty consistent with libcudf where

  • A size 0 "empty" column should not have a mask buffer
  • A size > 0 "empty" (i.e all null) column should have a mask buffer

Additionally removes column_empty_like which can be subsumed by column_empty (I didn't find any active usage of this method across RAPIDS https://github.com/search?q=org%3Arapidsai%20column_empty_like&type=code)

column_empty will have an unused masked argument, but since there is usage of this method across RAPIDS I'll need to adjust them before removing that keyword here (https://github.com/search?q=org%3Arapidsai%20column_empty&type=code)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 30, 2024
@mroeschke mroeschke requested a review from a team as a code owner August 30, 2024 22:32
python/cudf/cudf/core/column/column.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/column/column.py Outdated Show resolved Hide resolved
@@ -1486,7 +1454,12 @@ def _has_any_nan(arbitrary: pd.Series | np.ndarray) -> bool:
def column_empty(
row_count: int, dtype: Dtype = "object", masked: bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the masked parameter no longer used?

I also want to raise a concern: the nullability of a column is conveyed by the presence/absence of a mask. While a column with zero rows has no data, the nullability is moreso a part of its dtype which is independent of the row count. We have to track nullability carefully for things like round-tripping Parquet data (where nullability is part of the schema). I would check carefully if this changes any I/O behaviors for zero-row files -- we may not have adequate test coverage today. libcudf/cudf are notoriously weak/imprecise in this area and I want to get stricter, not less strict, over time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the masked parameter no longer used?

Correct.

I would check carefully if this changes any I/O behaviors for zero-row files -- we may not have adequate test coverage today.

I added a small, zero row roundtrip parquet unit test in 34fbc8a to validate this was unaffected. Agreed that it would be good to make libcudf/cudf stricter w.r.t. mask presence, and at least this PR makes it more consistent for zero row vs all null columns generated in cudf

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the masked parameter now that it is unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to remove this parameter in a follow up since there are some usages of masked in other RAPIDS libraries. I'll address those first (after this PR) before opening up a PR to remove it here

@mroeschke mroeschke changed the base branch from branch-24.10 to branch-24.12 September 25, 2024 20:25
@vyasr vyasr requested a review from bdice September 26, 2024 22:39
@vyasr
Copy link
Contributor

vyasr commented Sep 26, 2024

@bdice could you take another look here?

@mroeschke mroeschke marked this pull request as draft October 10, 2024 19:23
@mroeschke
Copy link
Contributor Author

mroeschke commented Oct 10, 2024

Just putting this in draft to not merge while I'm out (might, but not intended to, require some related fixed in other RAPIDS projects), but this PR is fully ready to review.

@mroeschke mroeschke marked this pull request as ready for review October 30, 2024 20:10
@mroeschke
Copy link
Contributor Author

Opened back up now that I'm back from PTO.

Could you take another look when you have the time @bdice?

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked a potentially breaking question about removing the masked parameter. Also this feels like a potentially disruptive change. This should probably target 25.02, to avoid breaking 24.12 just before freeze.

@@ -1486,7 +1454,12 @@ def _has_any_nan(arbitrary: pd.Series | np.ndarray) -> bool:
def column_empty(
row_count: int, dtype: Dtype = "object", masked: bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the masked parameter now that it is unused?

@@ -154,7 +154,10 @@ def jit_groupby_apply(offsets, grouped_values, function, *args):
offsets = cp.asarray(offsets)
ngroups = len(offsets) - 1

output = cudf.core.column.column_empty(ngroups, dtype=return_type)
# numba doesn't support masks
output = cudf.core.column.column_empty(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like we're losing performance by allocating a null mask with all nulls, and then dropping it. Is there a better way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. I added a for_numba argument in this function to avoid creating a mask in these cases

@vyasr vyasr changed the base branch from branch-24.12 to branch-25.02 November 20, 2024 16:38
@vyasr
Copy link
Contributor

vyasr commented Nov 20, 2024

Thanks for taking a look Bradley! Moving to 25.02

@vyasr
Copy link
Contributor

vyasr commented Dec 2, 2024

@mroeschke sorry for not getting this merged earlier, unfortunately there are conflicts now. Could you please resolve?

@mroeschke
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 541e7e8 into rapidsai:branch-25.02 Dec 3, 2024
105 checks passed
@mroeschke mroeschke deleted the ref/column_empty/consistent branch December 3, 2024 22:58
rapids-bot bot pushed a commit to rapidsai/cuspatial that referenced this pull request Dec 4, 2024
rapidsai/cudf#16715 makes size 0 cudf columns have no mask buffer which I think aligns with the intention in `empty_geometry_column`. Additionally, it makes the `masked` keyword unnecessary and unused so removing it here.

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Michael Wang (https://github.com/isVoid)

URL: #1496
rapids-bot bot pushed a commit that referenced this pull request Dec 13, 2024
Follow up to #16715.

Now that the usages of the `masked` keyword in RAPIDS have been address (rapidsai/cuspatial#1496 is the only one I could find), I think we can remove this keyword all together in this method

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #17530
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants