
Write null counts in parquet files when they are present #6257

Draft · wants to merge 5 commits into main from alamb/parquet_null_counts
Conversation

alamb
Contributor

@alamb alamb commented Aug 15, 2024

Which issue does this PR close?

Closes #6256

Note this PR contains some of the code that was originally in @Michael-J-Ward's PR #6256.

Rationale for this change

See #6256.

Current behavior:

  • parquet-rs writer always has the null count when writing statistics, but writes None to thrift when the null count is zero
  • parquet-rs reader treats a missing null count (None) as Some(0) (aka that it is known there are no nulls)
  • parquet-rs will write negative numbers if the null count or distinct count is greater than what fits in i64 (e.g. u64::MAX) -- this is likely a theoretical concern only

This is inconsistent with the parquet spec as well as with what parquet-java and parquet-cpp do.
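The negative-number hazard can be shown with a tiny standalone sketch (illustrative only, not the parquet-rs code itself): the thrift statistics fields are i64, so a plain `as` cast of a large u64 count wraps to a negative value, while `i64::try_from` lets a writer skip unrepresentable counts instead.

```rust
// Illustrative sketch: converting a u64 count into the i64 thrift field.
fn to_thrift_count(count: u64) -> Option<i64> {
    // Reject counts that do not fit in i64 instead of wrapping.
    i64::try_from(count).ok()
}

fn main() {
    assert_eq!(to_thrift_count(7), Some(7));
    // u64::MAX does not fit in i64, so no statistic would be written.
    assert_eq!(to_thrift_count(u64::MAX), None);
    // A raw `as` cast wraps instead: u64::MAX becomes -1.
    assert_eq!(u64::MAX as i64, -1);
}
```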

What changes are included in this PR?

  • Update the parquet reader/writer to follow the spec
  • Add error checking for values that are too large to fit into i64
  • Document that older versions of parquet-rs wrote None
  • Add tests

Are there any user-facing changes?

Yes

Changes

  • parquet-rs writer always writes Some(..) to thrift
  • parquet-rs reader returns None (aka that it is unknown if there are nulls) if there are no null counts in the thrift
  • parquet-rs writer writes None if the null count / distinct count is too large to fit in i64
  • documented that older versions of parquet-rs wrote None.
  • Changed the StatisticsConverter code to read statistics consistently with older versions of parquet-rs (treat missing null counts as known zero) and added a flag to alter the behavior

This change means the generated parquet files are slightly larger (as now they encode Some(0) for null counts) but the behavior is more correct and consistent.
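The new semantics can be sketched with plain Options (an illustration of the behavior described above, not the actual parquet-rs API): the writer always serializes a known count, and the reader keeps a missing thrift value as None instead of coercing it to Some(0).

```rust
// Sketch of the post-PR semantics; the function names are illustrative.
fn write_null_count(known: Option<u64>) -> Option<i64> {
    // A known count is always written, even when it is zero;
    // only counts too large for i64 are dropped.
    known.and_then(|n| i64::try_from(n).ok())
}

fn read_null_count(thrift: Option<i64>) -> Option<u64> {
    // A missing field stays "unknown" rather than becoming Some(0).
    thrift.and_then(|n| u64::try_from(n).ok())
}

fn main() {
    // Zero nulls now round-trips as Some(0).
    assert_eq!(read_null_count(write_null_count(Some(0))), Some(0));
    // An unknown count stays unknown.
    assert_eq!(read_null_count(write_null_count(None)), None);
}
```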

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 15, 2024
@alamb alamb force-pushed the alamb/parquet_null_counts branch from 6070619 to 2d70413 Compare August 15, 2024 18:20
@alamb alamb force-pushed the alamb/parquet_null_counts branch from 2d70413 to e00e160 Compare August 15, 2024 18:31
@@ -189,7 +189,7 @@ fn test_primitive() {
pages: (0..8)
.map(|_| Page {
rows: 250,
page_header_size: 36,
page_header_size: 38,
Contributor Author

The page headers (and files) are larger now because Some(0) takes more space than None


#[test]
fn test_count_encoding() {
statistics_count_test(None, None);
Contributor Author

This test fails like this without the changes in this PR:

assertion `left == right` failed
  left: Boolean({min: Some(true), max: Some(false), distinct_count: None, null_count: Some(0), ...
 right: Boolean({min: Some(true), max: Some(false), distinct_count: None, null_count: None, ...

@@ -1195,6 +1197,23 @@ impl<'a> StatisticsConverter<'a> {
self.arrow_field
}

/// Set the statistics converter to treat missing null counts as missing
Contributor Author

By default reading null counts will work with files written with older versions of parquet-rs

Contributor

For a breaking / major release, is it acceptable to include an upgrade instruction of "add this config to maintain the old behavior"?

Contributor Author

Absolutely

It is important to note that the default behavior in this PR is the old behavior (in other words, no changes should be needed in downstream consumers of this code)

Contributor

The default in this PR is missing_null_counts_as_zero = true, which maintains the old behavior, right?

If "add this config to maintain old behavior" is acceptable for a breaking release, then I would expect the default to be the new behavior.

IOW, I'd expect what you said on the parquet mailing list

Applications that use parquet-rs to read parquet_files and interpret the
null_count will need to be changed after the upgrade to explicitly continue
the old behavior of "treat no null_count as 0" which is also documented
now.

Contributor Author

The default in this PR is missing_null_counts_as_zero = true, which maintains the old behavior, right?

Yes

then I would expect the default to be the new behavior.

My thinking was

Since there are two different APIs:

Statistics::null_count would now return Option<..>, so users of the library will need to update their code anyway and can choose at that time which behavior they want

StatisticsConverter's API didn't change and thus it keeps the previous behavior. This is what I would personally want for a system -- no change for reading parquet files that were written with previous versions of the rust writer.

// Number of nulls recorded, when it is not available, we just mark it as 0.
// TODO this should be `None` if there is no information about NULLS.
// see https://github.com/apache/arrow-rs/pull/6216/files
let null_count = stats.null_count.unwrap_or(0);
Contributor Author

the removal of this unwrap_or is what changes the semantics while reading
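A minimal standalone illustration of that semantic change (a sketch, not the actual reader code):

```rust
fn main() {
    // A file whose statistics carry no null count.
    let stats_null_count: Option<u64> = None;

    // Old read path: unwrap_or(0) silently claims "known zero nulls".
    assert_eq!(stats_null_count.unwrap_or(0), 0);

    // New read path: keep the Option, so "unknown" stays
    // distinguishable from a real zero.
    assert_eq!(stats_null_count, None);
}
```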

let null_count = stats
.null_count_opt()
.map(|value| value as i64)
.filter(|&x| x > 0);
Contributor Author

The removal of this filter is what fixes the statistics while writing

Maybe it was intended to be x >= 0 originally 🤔
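The effect of dropping the filter can be checked in isolation (a sketch, not the actual writer code):

```rust
fn main() {
    let null_count: Option<u64> = Some(0);

    // Old write path: the filter throws away a known zero count.
    let old = null_count.map(|v| v as i64).filter(|&x| x > 0);
    assert_eq!(old, None);

    // New write path: keep Some(0); only reject values that overflow i64.
    let new = null_count.and_then(|v| i64::try_from(v).ok());
    assert_eq!(new, Some(0));
}
```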

Contributor

I was intrigued, so went to do some code archaeology, but that filter was only introduced a few days ago in #6216, by you! 😄

For 6 years before it was:

null_count: if stats.has_nulls() {
    Some(stats.null_count() as i64)
} else {
    None
},

Your commit:
7d4e650

Prior code:
https://github.com/apache/arrow-rs/blame/25bfccca58ff219d9f59ba9f4d75550493238a4f/parquet/src/file/statistics.rs#L228-L242

@alamb alamb marked this pull request as ready for review August 15, 2024 18:40
@alamb
Contributor Author

alamb commented Aug 20, 2024

@etseidl I wonder if you have any thoughts on this code / the writing of null counts

@etseidl
Contributor

etseidl commented Aug 20, 2024

@etseidl I wonder if you have any thoughts on this code / the writing of null counts

Sorry, I was away when this discussion started (interesting things always happen when I'm on vacation 😉). I think this PR is heading the right direction. Writing Some(0) when the null count is known is the right behavior IMO. On the read side I've always treated missing null counts as unknown rather than 0, so the changes here are welcome. I think it's fine for the read side to continue with the old behavior for some time.

@alamb
Contributor Author

alamb commented Aug 28, 2024

Unless someone else has time to review this PR and thinks it should go into 53.0.0 (#6016) my personal preference would be to merge this PR early in the next major release cycle (e.g. 54.0.0) so it gets maximum bake / testing time before release

@etseidl
Contributor

etseidl commented Aug 28, 2024

Unless someone else has time to review this PR and thinks it should go into 53.0.0 (#6016) my personal preference would be to merge this PR early in the next major release cycle (e.g. 54.0.0) so it gets maximum bake / testing time before release

I've looked this PR over several times and haven't found any issues with it. The only thing giving me pause is this apache/parquet-format#449 (comment)

It seems java and cpp ignore the definition levels and write Some(0) regardless, so I'm fine with merging this and worrying about micro optimizations down the road (in step with the rest of the parquet community).

@alamb
Contributor Author

alamb commented Aug 31, 2024

Thank you @etseidl -- I think

  1. given the potential for this PR to cause unintended consequences
  2. we haven't actually had any bug reports related to this issue,
  3. the 53.0.0 release is imminent and has several people waiting on features

I am not going to merge this PR until after we have released 53.0.0

@alamb
Contributor Author

alamb commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

Contributor

@etseidl etseidl left a comment

Took another look and found no nits. I'd say ship it.

Member

@andygrove andygrove left a comment

LGTM. Thanks @alamb

@alamb alamb added next-major-release the PR has API changes and is waiting on the next major version api-change Changes to the arrow API and removed next-major-release the PR has API changes and is waiting on the next major version api-change Changes to the arrow API labels Sep 22, 2024
@alamb
Contributor Author

alamb commented Sep 22, 2024

I go back and forth on this PR -- it isn't technically a breaking API change, but it changes the content of written parquet files. Arguably the content better matches the parquet spec; I am just really worried about unintended consequences of doing this

Maybe I am overly concerned.

@etseidl
Contributor

etseidl commented Sep 22, 2024

I go back and forth on this PR -- it isn't technically a breaking API change, but it changes the content of written parquet files. Arguably the content better matches the parquet spec; I am just really worried about unintended consequences of doing this

Maybe I am overly concerned.

Then how about splitting this up? First change the read behavior so None isn't treated as 0, keep the missing_null_counts_as_zero setting, but default to false. If anyone is broken by that behavior they can workaround by setting it true. After that's been around a release, then change the behavior on write.

@alamb alamb added the next-major-release the PR has API changes and is waiting on the next major version label Sep 24, 2024
@alamb
Contributor Author

alamb commented Sep 24, 2024

I go back and forth on this PR -- it isn't technically a breaking API change, but it changes the content of written parquet files. Arguably the content better matches the parquet spec; I am just really worried about unintended consequences of doing this
Maybe I am overly concerned.

Then how about splitting this up? First change the read behavior so None isn't treated as 0, keep the missing_null_counts_as_zero setting, but default to false. If anyone is broken by that behavior they can workaround by setting it true. After that's been around a release, then change the behavior on write.

That is a (very) good idea -- I will plan to do that when I can find some time

@etseidl
Contributor

etseidl commented Sep 24, 2024

That is a (very) good idea -- I will plan to do that when I can find some time

@alamb Let me know if you'd like some help with this. I have spare cycles right now.

@alamb
Contributor Author

alamb commented Sep 25, 2024

That is a (very) good idea -- I will plan to do that when I can find some time

@alamb Let me know if you'd like some help with this. I have spare cycles right now.

that would be super helpful @etseidl -- I do not have many spare cycles now (as you have probably guessed). All your help is most appreciated

@etseidl
Contributor

etseidl commented Oct 1, 2024

To clear up my own thinking on this, I made a table of what happens with a round trip of the statistics.

null_count read and write behaviors
|-------|-------------------|-------------------|-------------------|-------------------|
| start |       current     |    change write   |    change read    |    change both    |
| value |  write  |   read  |  write  |   read  |  write  |   read  |  write  |   read  |
|-------|---------|---------|---------|---------|---------|---------|---------|---------|
| None  |  None   | Some(0) |  None   | Some(0) |  None   |  None   |  None   |  None   |
| 0     |  None   | Some(0) | Some(0) | Some(0) |  None   |  None   | Some(0) | Some(0) |
| n > 0 | Some(n) | Some(n) | Some(n) | Some(n) | Some(n) | Some(n) | Some(n) | Some(n) |
|-------|---------|---------|---------|---------|---------|---------|---------|---------|

write means what ends up in the parquet file
read is the return value of null_count_opt()

Currently we write None for None or 0, and Some(n) otherwise, and on read these will become either Some(0) or Some(n). If we just change the current write behavior, the written Parquet will be consistent with other writers (parquet-java, parquet-cpp), and we get the same result on a round trip. This should not break old parquet-rs readers, and will allow written files to be more in line with the spec (albeit a few bytes larger). Changing both read and write will return None when the original value was None, which is correct, but could potentially break old readers.

So in theory we can write the null counts properly now, but should wait until 54.0.0 to make the last change on the read side. Does this sound right to you @alamb?
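The table can also be checked mechanically with a toy model (assumed semantics, not the real API): `old = true` selects the pre-PR behavior on each side.

```rust
// Toy model of the round-trip table above.
fn write(old: bool, count: Option<u64>) -> Option<i64> {
    let c = count.and_then(|v| i64::try_from(v).ok());
    // The old writer dropped zero counts entirely.
    if old { c.filter(|&x| x > 0) } else { c }
}

fn read(old: bool, thrift: Option<i64>) -> Option<u64> {
    match thrift {
        // Legacy readers treat a missing count as a known zero.
        None if old => Some(0),
        t => t.and_then(|x| u64::try_from(x).ok()),
    }
}

fn main() {
    // "current" column: a zero count vanishes on write, reads back Some(0).
    assert_eq!(read(true, write(true, Some(0))), Some(0));
    // "change write" column: zero survives the round trip explicitly.
    assert_eq!(read(true, write(false, Some(0))), Some(0));
    // "change both" column: unknown stays unknown.
    assert_eq!(read(false, write(false, None)), None);
    // n > 0 is unchanged in every column.
    assert_eq!(read(false, write(false, Some(3))), Some(3));
}
```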

@alamb
Contributor Author

alamb commented Oct 1, 2024

Changing both read and write will return None when the original value was None, which is correct, but could potentially break old readers.

I think it also would break reading of OLD files (aka files written with the older software / older versions of arrow-rs would now be interpreted as "unknown null count" rather than "0 null count") which I worry about a lot.

So in theory we can write the null counts properly now, but should wait until 54.0.0 to make the last change on the read side.

Yes, I suppose writing the null counts properly seems like a good idea. I am (likely overly) paranoid about the read side changes

@etseidl
Contributor

etseidl commented Oct 1, 2024

Changing both read and write will return None when the original value was None, which is correct, but could potentially break old readers.

I think it also would break reading of OLD files (aka files written with the older software / older versions of arrow-rs would now be interpreted as "unknown null count" rather than "0 null count") which I worry about a lot.

Yes, that's the "change read" column from my table, I suppose. So we agree it's the read side where the danger lies.

So in theory we can write the null counts properly now, but should wait until 54.0.0 to make the last change on the read side.

Yes, I suppose writing the null counts properly seems like a good idea. I am (likely overly) paranoid about the read side changes

As I am often told, I am not paranoid enough, so maybe we balance out 😆

@alamb
Contributor Author

alamb commented Oct 1, 2024

@etseidl
Contributor

etseidl commented Nov 25, 2024

Gentle bump...do we want to merge this now?

@alamb
Contributor Author

alamb commented Nov 26, 2024

I am still quite worried about the subtle semantic implications of this change -- see for example the discussion on

This is the kind of change that could easily lead to lots of debugging / head scratching I think, and even worse would be hard to catch with tests as it would only affect files written in the past

So I am torn, to be honest

@etseidl
Contributor

etseidl commented Nov 26, 2024

Not my call, but I would just throw out there that with this change, there is still a way to get the old behavior if desired. But without this change, there is no way for a user to figure out if null_count is truly 0 or is not present. Food for thought.

@tustvold
Contributor

This is inconsistent with the parquet spec as well as what parquet-java and parquet-cpp do

FWIW the current behaviour is a bug IMO and so my vote would be to proceed with this change. We've postponed it to a breaking release, and are calling it out as a major change in the changelog, so I think we've done all that we can reasonably be expected to.

Comment on lines +438 to +439
/// To preserve the prior behavior and read null counts properly from older files
/// you should default to zero:
Contributor

Perhaps we should make it clearer that this behaviour is actually incorrect, it will claim a null count of 0, when it actually isn't known

Contributor

@etseidl etseidl left a comment

One last nit if we're doing this

/// Note this API returns Some(0) even if the null count was not present
/// in the statistics.
/// See <https://github.com/apache/arrow-rs/pull/6216/files>
/// Note: Versions of this library prior to `53.0.0` returned 0 if the null count was
Contributor

Suggested change
/// Note: Versions of this library prior to `53.0.0` returned 0 if the null count was
/// Note: Versions of this library prior to `54.0.0` returned 0 if the null count was

@alamb
Contributor Author

alamb commented Dec 10, 2024

After more thought, I would like to hold off merging this into arrow 54.0.0 because:

  1. It isn't super urgent as I understand: It has been wrong for many years, and I don't know anyone who is actually affected by this
  2. It has the potential for causing subtle downstream performance issues.

Maybe it is just that I am burned out on the fallout from the DataFusion 43 release (where there were a bunch of regressions and other challenges) but I don't want to introduce some other potentially subtle issue for a while longer

(I agree this may just be me being weak)

@etseidl
Contributor

etseidl commented Dec 10, 2024

Fair enough...I'll check back in February or March 😉

@tustvold tustvold assigned tustvold and unassigned tustvold Dec 15, 2024
@tustvold tustvold marked this pull request as draft December 15, 2024 12:06
Successfully merging this pull request may close these issues:

parquet writer does not encode null count = 0 correctly