-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-45073: Update encryption test files to correct invalid repetition levels #65
Conversation
Have you changed the compression type? It seems that all files have at least 80% size reduction. |
No, they were using Snappy before and so I changed the writing code to keep using Snappy. I'm not sure why there's a size reduction. I'll see if I can figure out why. |
It looks like the size difference is caused by this change which stopped writing column metadata at the end of chunks: apache/arrow#43428 If I revert that commit then the file sizes match exactly. |
Thanks for the investigation! That makes sense. |
BTW, these files are used by parquet-java: https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestEncryptionOptions.java. We need to make sure that these tests won't fail after migrating to the new files. |
I tested updating parquet-java to reference my fork of parquet-testing with the new commit SHA and it looks like it worked fine: https://github.com/adamreeve/parquet-java/actions/runs/12407983545 Relevant lines from the log:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind change a "corrupt" file to the bad_file?
Also cc @ggershinsky
Just to confirm I've understood what you mean, do you mean add a file to the |
You're right, lets move forward |
OK thanks. I'm happy to add a |
I think so, we can do it separately |
See apache/arrow#45073. This fixes the encryption test files generated by Arrow C++ so that the repetition levels are correct. I tried to avoid changing other properties of these files like the compression and encodings used.
After applying the change to fix the repetition level (apache/arrow#45074), I generated the files by running the parquet-encryption-test tests with the following changes: