Change the Parquet writer's default_row_group_size_bytes from 128MB to inf #16750
Conversation
Benchmarks with
With this change we are going to include 1 million rows in a row group, no matter how large that ends up being? Have you done any testing of read performance on distributed setups, especially with predicate pushdown enabled? How about with different numbers of columns being read? How about with very long nested types, like a LIST with thousands of entries in it, or JSON strings that can be multiple KiB per string?

I am concerned that the performance testing is not real-world enough. I can see a number of advantages and disadvantages with this change, but the testing only shows the advantages. For example, predicate pushdown only happens at the row-group level, so with larger row groups I potentially need to read/decode more data because my filtering is not as fine-grained. On the other hand, it can reduce the total size of the files, so it could end up being a win overall. It is hard to tell, which is why it would be nice to see more realistic tests. I'm just not 100% sure what those tests should look like.

@abellina do you think that an NDS run with these changes when transcoding the data would be enough? Is anyone on the dask side concerned about this, and do they have tests they want to run?
No. I think it is a good data point, but the data we have is simple (

I am also concerned about changing the default in cuDF. Why are these hard-coded configs rather than values chosen dynamically based on the data or some statistics of the columns being written? (This is maybe a naive question.)
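To make the row-group-granularity concern concrete, here is a minimal sketch (assuming cudf and pyarrow are installed; the table shape, file names, and data are purely illustrative, and the exact row-group counts depend on how well the data encodes) that writes the same wide, low-cardinality table with and without the 128MB byte cap and inspects the resulting layout:

```python
import numpy as np
import cudf
import pyarrow.parquet as pq

# Illustrative wide, low-cardinality table (column count and row count are made up;
# this needs a few GB of GPU memory at this size).
ncols, nrows = 600, 1_000_000
df = cudf.DataFrame({f"c{i}": np.arange(nrows, dtype="int64") % 10
                     for i in range(ncols)})

# Old default: cap each row group at 128 MB of data.
df.to_parquet("capped.parquet", row_group_size_bytes=128 * 1024 * 1024)
# New default: no byte cap, so only the 1M-row limit applies.
df.to_parquet("uncapped.parquet")

# Predicate pushdown skips whole row groups, so fewer/larger row groups
# mean coarser filtering. Compare the layouts the two settings produce.
for path in ("capped.parquet", "uncapped.parquet"):
    md = pq.ParquetFile(path).metadata
    print(path, "num_row_groups:", md.num_row_groups,
          "rows in first group:", md.row_group(0).num_rows)
```

Because a reader can only skip whole row groups when applying pushed-down filters, the uncapped layout trades finer filter granularity for fewer, larger groups.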
python/cudf/cudf/_lib/parquet.pyx
Outdated
row_group_size_bytes: int, default 18446744073709551615
    Maximum size of each stripe of the output.
-   By default, 134217728 (128MB) will be used.
+   By default, an infinite value equal to uint64 max (~18446744074GB) will be used.
Should we just use uint64 max here? Nobody reading this needs to know the exact value.
python/cudf/cudf/utils/ioutils.py
Outdated
With the new value, does row_group_size_bytes_val_in_mb and its use still make sense?
Also, should we change the Python default to None? In C++ it would be a breaking change, but Python does not have this limitation.
None makes the most sense on the Python side. Let's wait for Spark to let us know if they want a specific value in bytes or if infinity (None) is good to go.
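A rough sketch of what a None default could look like on the Python side; the helper name and the sentinel handling here are hypothetical, not the actual cudf internals:

```python
# Hypothetical helper: translate a user-facing None into "no byte cap",
# i.e. the uint64 max sentinel the writer treats as unlimited.
_UNLIMITED_ROW_GROUP_BYTES = 2**64 - 1  # uint64 max, i.e. 18446744073709551615

def _resolve_row_group_size_bytes(row_group_size_bytes=None):
    """Translate the Python-level default (None) into the value passed down to libcudf."""
    if row_group_size_bytes is None:
        # No explicit byte cap: row groups are bounded only by the row-count limit.
        return _UNLIMITED_ROW_GROUP_BYTES
    return int(row_group_size_bytes)
```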
Removed Java changes as PR #16805 takes care of that.
Title changed: default_row_group_size_bytes from 128MB to unlimited
Approving Python changes.
Approving dask-cudf changes.
Title changed: default_row_group_size_bytes from 128MB to unlimited → from 128MB to 1M rows → from 128MB to inf
For visibility: PR #16805 resets the default limit for Java/Spark back to 128MB. Dask is good with the 1M-row default for now and may revert or change it in the future if needed.
…#16751) Related to #16750. This PR adds a benchmark to study the read throughput of the Parquet reader for wide tables.
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
Approvers:
- Paul Mattione (https://github.com/pmattione-nvidia)
- Vukasin Milovanovic (https://github.com/vuule)
URL: #16751
/merge
Description
Closes #16733.
This PR changes the Parquet writer's default maximum row group size from 128MB to 1 million rows. This avoids thin row-group slices when writing wide (> 512 column) tables, resulting in significantly improved read throughput for wide tables (especially with low-cardinality data), with virtually no impact on narrow-table read performance.
Benchmarked using: #16751
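For anyone who prefers the previous layout, the old cap can still be requested explicitly per call; a minimal usage sketch (the file name and data are illustrative):

```python
import cudf

df = cudf.DataFrame({"x": list(range(1_000))})
# Opt back into the pre-change behavior by passing the old 128 MB cap explicitly.
df.to_parquet("out.parquet", row_group_size_bytes=128 * 1024 * 1024)
```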
Results
Hardware
Read Throughput
Checklist