-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add multi-partition Scan
support to cuDF-Polars
#17494
Add multi-partition Scan
support to cuDF-Polars
#17494
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here is quite complicated, could you please add some documentation/comments on what is going on?
config_options["parquet_options"] = config_options.get( | ||
"parquet_options", {} | ||
).copy() | ||
config_options["parquet_options"]["chunked"] = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we require py 3.10 now, I think this simpler as:
config_options["parquet_options"] = config_options.get( | |
"parquet_options", {} | |
).copy() | |
config_options["parquet_options"]["chunked"] = False | |
config_options["parquet_options"] |= {"chunked": False} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this works if the "parquet_options"
key is missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, sorry
file_size: float = 0 | ||
# TODO: Use system info to set default blocksize | ||
parallel_options = ir.config_options.get("executor_options", {}) | ||
blocksize: int = parallel_options.get("parquet_blocksize", 1024**3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 1GiB a good size, or should we pick something larger?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 1GiB a good size, or should we pick something larger?
My experience tells me that 1GiB is a good default, but that most users with datacenter-class GPUs will usually want to go bigger. In Dask cuDF we use pynvml to query the "real" device size. The details of this can get sticky, so I'd rather revisit this kind of improvement after we start benchmarking.
@wence- Any sense for how far away we are on this one? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small suggestions, but I think this looks good now, thanks for the documentation/refactoring.
class ScanPartitionFlavor(IntEnum): | ||
"""Flavor of Scan partitioning.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class ScanPartitionFlavor(IntEnum): | |
"""Flavor of Scan partitioning.""" | |
class ScanPartitionFlavour(IntEnum): | |
"""Flavour of Scan partitioning.""" |
😉
/merge |
Description
Adds multi-partition
Scan
support following the same design as #17441Checklist