Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multi-partition Scan support to cuDF-Polars #17494

Merged
merged 10 commits into from
Dec 19, 2024

Conversation

rjzamora
Copy link
Member

@rjzamora rjzamora commented Dec 3, 2024

Description

Adds multi-partition Scan support following the same design as #17441

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora self-assigned this Dec 3, 2024
Copy link

copy-pr-bot bot commented Dec 3, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added Python Affects Python cuDF API. cudf.polars Issues specific to cudf.polars labels Dec 3, 2024
@rjzamora rjzamora added feature request New feature or request 2 - In Progress Currently a work in progress non-breaking Non-breaking change and removed Python Affects Python cuDF API. cudf.polars Issues specific to cudf.polars labels Dec 3, 2024
@github-actions github-actions bot added Python Affects Python cuDF API. cudf.polars Issues specific to cudf.polars labels Dec 3, 2024
@rjzamora rjzamora added cudf.polars Issues specific to cudf.polars and removed cudf.polars Issues specific to cudf.polars labels Dec 3, 2024
@rjzamora
Copy link
Member Author

rjzamora commented Dec 4, 2024

/ok to test

@rjzamora rjzamora marked this pull request as ready for review December 4, 2024 15:40
@rjzamora rjzamora requested a review from a team as a code owner December 4, 2024 15:40
@rjzamora rjzamora requested review from vyasr and mroeschke December 4, 2024 15:40
@rjzamora rjzamora added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Dec 4, 2024
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here is quite complicated, could you please add some documentation/comments on what is going on?

python/cudf_polars/cudf_polars/experimental/io.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/experimental/io.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/experimental/io.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/experimental/io.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/experimental/io.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/experimental/io.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/experimental/io.py Outdated Show resolved Hide resolved
Comment on lines +218 to +221
config_options["parquet_options"] = config_options.get(
"parquet_options", {}
).copy()
config_options["parquet_options"]["chunked"] = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we require py 3.10 now, I think this simpler as:

Suggested change
config_options["parquet_options"] = config_options.get(
"parquet_options", {}
).copy()
config_options["parquet_options"]["chunked"] = False
config_options["parquet_options"] |= {"chunked": False}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this works if the "parquet_options" key is missing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, sorry

file_size: float = 0
# TODO: Use system info to set default blocksize
parallel_options = ir.config_options.get("executor_options", {})
blocksize: int = parallel_options.get("parquet_blocksize", 1024**3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 1GiB a good size, or should we pick something larger?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 1GiB a good size, or should we pick something larger?

My experience tells me that 1GiB is a good default, but that most users with datacenter-class GPUs will usually want to go bigger. In Dask cuDF we use pynvml to query the "real" device size. The details of this can get sticky, so I'd rather revisit this kind of improvement after we start benchmarking.

@rjzamora
Copy link
Member Author

@wence- Any sense for how far away we are on this one?

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small suggestions, but I think this looks good now, thanks for the documentation/refactoring.

python/cudf_polars/cudf_polars/experimental/io.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/experimental/io.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/experimental/io.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/experimental/io.py Outdated Show resolved Hide resolved
python/cudf_polars/cudf_polars/experimental/io.py Outdated Show resolved Hide resolved
Comment on lines +58 to +59
class ScanPartitionFlavor(IntEnum):
"""Flavor of Scan partitioning."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class ScanPartitionFlavor(IntEnum):
"""Flavor of Scan partitioning."""
class ScanPartitionFlavour(IntEnum):
"""Flavour of Scan partitioning."""

😉

python/cudf_polars/cudf_polars/experimental/io.py Outdated Show resolved Hide resolved
@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Dec 19, 2024
@rjzamora
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 253b0d8 into rapidsai:branch-25.02 Dec 19, 2024
132 checks passed
@rjzamora rjzamora deleted the cudf-polars-multi-scan branch December 19, 2024 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cudf.polars Issues specific to cudf.polars feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants