GH-45185: Add bad_data file with invalid repetition levels #67

Open: wants to merge 1 commit into base: master

adamreeve
Contributor

Follow up to #65. For apache/arrow#45185

This adds a file generated with the same bad logic previously used for generating encryption test files but without any encryption. It also contains only the problematic int64 list column rather than all the test data columns, and I've disabled compression so that this can be used for tests without needing Snappy enabled.

Member

raulcd commented Jan 7, 2025

CC @wgtmac

Member

wgtmac commented Jan 8, 2025

File path:  bad_data/ARROW-GH-45185.parquet
Created by: parquet-cpp-arrow version 19.0.0-SNAPSHOT
Properties: (none)
Schema:
message schema {
  repeated int64 int64_field;
}

Row group 0:  count: 50  19.10 B records  start: 4  total(compressed): 955 B total(uncompressed):955 B
--------------------------------------------------------------------------------
             type      encodings count     avg size   nulls   min / max
int64_field  INT64     _ _ R     100       9.55 B     0       "0" / "99000000000000"


Column: int64_field
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  100     8.00 B     800 B
  0-1    data  _ R  100     1.18 B     118 B

      "columnIndexReference" : {
        "offset" : 959,
        "length" : 31
      },
      "offsetIndexReference" : {
        "offset" : 990,
        "length" : 12
      },

The file size is 1.2K. Could we reduce it as much as possible? For example:

  • leverage compression like zstd
  • disable dictionary encoding
  • disable page index
  • reduce row count

BTW, repeated int64 int64_field is an unannotated repeated field, a special-case list representation that we should avoid: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md?plain=1#L607-L624. Should we replace it with a LIST-annotated type? cc @pitrou @mapleFU
