Add multi-partition `Scan` support to cuDF-Polars #17494

rjzamora · 2024-12-03T20:14:37Z

Description

Adds multi-partition `Scan` support, following the same design as #17441.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-12-03T20:14:42Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

rjzamora · 2024-12-04T00:35:44Z

/ok to test


wence-

The logic here is quite complicated. Could you please add some documentation/comments on what is going on?


wence- · 2024-12-12T16:55:21Z

python/cudf_polars/cudf_polars/experimental/io.py

+            config_options["parquet_options"] = config_options.get(
+                "parquet_options", {}
+            ).copy()
+            config_options["parquet_options"]["chunked"] = False


Since we require py 3.10 now, I think this is simpler as:

Suggested change

-            config_options["parquet_options"] = config_options.get(
-                "parquet_options", {}
-            ).copy()
-            config_options["parquet_options"]["chunked"] = False
+            config_options["parquet_options"] |= {"chunked": False}

rjzamora · 2024-12-12

I don't think this works if the "parquet_options" key is missing?

wence- · 2024-12-19

Oh yeah, sorry
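To illustrate the exchange above: the in-place `|=` form looks up `config_options["parquet_options"]` before merging, so it raises `KeyError` when that key is absent. The following is a minimal sketch of one way to keep both the default and the copy semantics. The helper name `disable_chunked` is hypothetical and is not part of the PR or of cudf_polars.

```python
def disable_chunked(config_options: dict) -> dict:
    """Return a shallow copy of config_options with chunked parquet reads off.

    Hypothetical helper for illustration only; not part of cudf_polars.
    """
    new_options = dict(config_options)
    # `dict | dict` (Python 3.9+) builds a new dict, so the caller's nested
    # "parquet_options" mapping is left untouched and a missing key is fine.
    new_options["parquet_options"] = config_options.get("parquet_options", {}) | {
        "chunked": False
    }
    return new_options


# The in-place form fails when the key is missing, because the left-hand
# side is read before the merge happens:
try:
    empty: dict = {}
    empty["parquet_options"] |= {"chunked": False}
except KeyError as err:
    missing_key_error = err
```

The non-mutating `|` operator replaces both the `.copy()` call and the item assignment from the original four lines, while `.get(..., {})` still covers the missing-key case.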

wence- · 2024-12-12T16:58:34Z

python/cudf_polars/cudf_polars/experimental/io.py

+        file_size: float = 0
+        # TODO: Use system info to set default blocksize
+        parallel_options = ir.config_options.get("executor_options", {})
+        blocksize: int = parallel_options.get("parquet_blocksize", 1024**3)


Is 1GiB a good size, or should we pick something larger?

rjzamora · 2024-12-12

My experience tells me that 1GiB is a good default, but that most users with datacenter-class GPUs will usually want to go bigger. In Dask cuDF we use pynvml to query the "real" device size. The details of this can get sticky, so I'd rather revisit this kind of improvement after we start benchmarking.
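The blocksize lookup in the diff above hints at how a scan might be divided. As a rough sketch only, assuming a per-file size estimate in bytes, one plausible policy is to split a file when it exceeds the blocksize and otherwise fuse several files into one partition. The names `PlanKind` and `plan_partitions` are hypothetical and are not taken from the PR. Only the `executor_options` and `parquet_blocksize` keys and the `1024**3` default come from the diff.

```python
import math
from enum import IntEnum, auto


class PlanKind(IntEnum):
    """Hypothetical partitioning strategies (illustrative names only)."""

    SINGLE_FILE = auto()  # one partition per file
    SPLIT_FILES = auto()  # several partitions per file
    FUSED_FILES = auto()  # several files per partition


def plan_partitions(file_size: float, config_options: dict) -> tuple[PlanKind, int]:
    """Pick a strategy and a factor from an estimated per-file size in bytes."""
    executor_options = config_options.get("executor_options", {})
    blocksize: int = executor_options.get("parquet_blocksize", 1024**3)
    if file_size <= 0:
        # No usable size estimate: fall back to one partition per file.
        return PlanKind.SINGLE_FILE, 1
    if file_size > blocksize:
        # Large file: split it into ceil(file_size / blocksize) pieces.
        return PlanKind.SPLIT_FILES, math.ceil(file_size / blocksize)
    # Small file: group as many whole files as fit in one block (at least 1).
    return PlanKind.FUSED_FILES, max(int(blocksize // file_size), 1)
```

With the default blocksize, a 256 MiB estimate gives four files per partition and a 2.5 GiB estimate gives three partitions per file. A larger default, as discussed above for datacenter-class GPUs, would only change the `blocksize` value fed into this calculation.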


rjzamora · 2024-12-18T18:50:59Z

@wence- Any sense for how far away we are on this one?

wence-

A few small suggestions, but I think this looks good now, thanks for the documentation/refactoring.


wence- · 2024-12-19T10:31:03Z

python/cudf_polars/cudf_polars/experimental/io.py

+class ScanPartitionFlavor(IntEnum):
+    """Flavor of Scan partitioning."""


Suggested change

-class ScanPartitionFlavor(IntEnum):
-    """Flavor of Scan partitioning."""
+class ScanPartitionFlavour(IntEnum):
+    """Flavour of Scan partitioning."""

😉



rjzamora · 2024-12-19T18:59:49Z

/merge

add multi-partition scan support

1236053

rjzamora self-assigned this Dec 3, 2024

github-actions bot added the labels Python ("Affects Python cuDF API") and cudf.polars ("Issues specific to cudf.polars") Dec 3, 2024

rjzamora added the labels feature request ("New feature or request"), 2 - In Progress ("Currently a work in progress"), and non-breaking ("Non-breaking change"), and removed the labels Python ("Affects Python cuDF API") and cudf.polars ("Issues specific to cudf.polars") Dec 3, 2024

Merge branch 'branch-25.02' into cudf-polars-multi-scan

6ee3114

github-actions bot added the labels Python ("Affects Python cuDF API") and cudf.polars ("Issues specific to cudf.polars") Dec 3, 2024

rjzamora added and then removed the label cudf.polars ("Issues specific to cudf.polars") Dec 3, 2024

Merge branch 'branch-25.02' into cudf-polars-multi-scan

67bb928

rjzamora added 2 commits December 4, 2024 07:28

Merge remote-tracking branch 'upstream/branch-25.02' into cudf-polars…

c0f319d

…-multi-scan

update coverage

ddb5f71

rjzamora marked this pull request as ready for review December 4, 2024 15:40

rjzamora requested a review from a team as a code owner December 4, 2024 15:40

rjzamora requested review from vyasr and mroeschke December 4, 2024 15:40

rjzamora added the label 3 - Ready for Review ("Ready for review by team") and removed the label 2 - In Progress ("Currently a work in progress") Dec 4, 2024

wence- requested changes Dec 12, 2024


rjzamora added 3 commits December 12, 2024 10:38

Merge remote-tracking branch 'upstream/branch-25.02' into cudf-polars…

e62099d

…-multi-scan

use ScanPartitionPlan to clarify the logic a bit (maybe)

ecbc104

Merge branch 'branch-25.02' into cudf-polars-multi-scan

1dae35a

wence- approved these changes Dec 19, 2024


rjzamora added 2 commits December 19, 2024 09:25

Merge remote-tracking branch 'upstream/branch-25.02' into cudf-polars…

6bf6afe

…-multi-scan

address review comments

ecc8443

rjzamora added the label 5 - Ready to Merge ("Testing and reviews complete, ready to merge") and removed the label 3 - Ready for Review ("Ready for review by team") Dec 19, 2024

rapids-bot bot merged commit 253b0d8 into rapidsai:branch-25.02 Dec 19, 2024
132 checks passed

rjzamora deleted the cudf-polars-multi-scan branch December 19, 2024 19:51
