-
Notifications
You must be signed in to change notification settings - Fork 824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parquet writer does not encode null count = 0
correctly
#6256
Comments
I'm a bit curious here, since like in page-index ( https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1088 ) and page header v2 ( https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L633 ), the null-count is better to have. Should it also |
I think the parquet-rs writer always has the null count when writing statistics so it is always possible to write I think we should change the writer to write The difference is when reading a parquet file from some other writer that may not have written a value for the The current parquet-rs code will assume a missing null count field means "0" which might not be correct Previously to #6216 I will try and document this as clearly as I can |
I've checked parquet-java and parquet-c++, all of these seems to write null-count with 0, but I don't know the case of legacy data. Maybe @wgtmac knows? |
I agree it is pretty unlikely to find one of these files in practice |
seems parquet-java distinguishes None and 0. Maybe I can draft a pr in parquet-format to enforcing this or get more info about this...🤔 But otherplace it's checked: https://github.com/apache/parquet-java/blob/d4384d3f2e7703fab6363fd9cd80a001e9561db2/parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java#L128 |
I am not sure how important it is to enforce that this statistics field is set Note that all the parquet files written by the rust writer so far will have had their null counts set to |
From my aspects, 0 and None is different, and parquet-c++ / parquet-java always writes the null_count if has any value. Perhaps there're some old file write it to 0, and make filter pruning becomes buggy, lol Anyway, we all agree that
I'll draft a discuss in mailling list about this tomorrow, a bit late in utc-8 and I have to sleep :-) |
Thank you . Pleasant dreams 💤 |
Yes, this makes sense to me
Yes, this is what I think the rust writer should do too (I am making a PR to make it so) |
I made a PR with a proposed fix: #6257 which turned out to be larger than I expected (though it is mostly tests) |
Is there a semantic difference between
arrow-rs/parquet/src/column/writer/mod.rs Lines 994 to 1004 in 0f7116b
This stuck out to me because #6257 adds the checked conversion for |
I am not sure to be honest 🤔 |
null count = 0
correctly
null count = 0
correctly null count = 0
correctly
@Michael-J-Ward I think no 🤔 Besides the DataPage V2 seems not used by most community. I can also mention this is maillist. |
I've update it here: https://lists.apache.org/thread/oqd9lt7p0jtf217hg0667xhk0zvwbgvt |
Describe the bug
TLDR is that the Rust parquet writer does not write out
null_counts
correctly in row group statistics. However, the reader has the same mistake so systems that both write and read with Rust are unaffectedWhile reviewing #6216 from @Michael-J-Ward (🙏 ) I am pretty sure the arrow-rs parquet writer does not do the correct thing with respect to null statistics
Specifically, when there are no nulls in the data, the writer does not emit a value for null_count in the thrift metadata (it writes the equivalent of
None
) -- it should instead write the equivalent ofSome(0)
This will not cause issues for people using parquet-rs to read and write data as the reader also (incorrectly) reports
Some(0)
when the thrift metadata hasNone
To Reproduce
TBD
Expected behavior
Some(0)
None
Additional context
The text was updated successfully, but these errors were encountered: