Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48898][SQL] Set nullability correctly in the Variant schema #49151

Closed

Conversation

cashmand
Copy link
Contributor

What changes were proposed in this pull request?

The variantShreddingSchema method converts a human-readable schema for Variant to one that's a valid shredding schema. According to the shredding schema in apache/parquet-format#461, each shredded field in an object should be a required group - i.e. a non-nullable struct. This PR fixes the variantShreddingSchema to mark that struct as non-nullable.

Why are the changes needed?

If we use variantShreddingSchema to construct a schema for Parquet, the schema would be technically non-conformant with the spec by setting the group as optional. I don't think this should really matter to readers, but it would waste a bit of space in the Parquet file by adding an extra definition level.

Does this PR introduce any user-facing change?

No, this code is not used yet.

How was this patch tested?

Added a test to do some minimal validation of the variantShreddingSchema function.

Was this patch authored or co-authored using generative AI tooling?

No.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (Pending CIs).

Thank you, @cashmand .

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you re-trigger the failed test pipeline, @cashmand ?

@dongjoon-hyun
Copy link
Member

Could you re-trigger once more, @cashmand ?

@cashmand
Copy link
Contributor Author

Hi @dongjoon-hyun, it looks like the tests are passing now.

@cashmand
Copy link
Contributor Author

Actually @dongjoon-hyun, I just realized that ArrayType has the same nullability issue. Let me update the PR to also set the nullability correctly on ArrayType.

@dongjoon-hyun
Copy link
Member

Thank you!

@cashmand cashmand requested a review from chenhao-db December 12, 2024 19:29
@cashmand
Copy link
Contributor Author

@dongjoon-hyun, the PR should be good to merge now. Thanks for your patience!

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in fbc061d Dec 13, 2024
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Dec 17, 2024
### What changes were proposed in this pull request?

The variantShreddingSchema method converts a human-readable schema for Variant to one that's a valid shredding schema. According to the shredding schema in apache/parquet-format#461, each shredded field in an object should be a required group - i.e. a non-nullable struct. This PR fixes the variantShreddingSchema to mark that struct as non-nullable.

### Why are the changes needed?

If we use variantShreddingSchema to construct a schema for Parquet, the schema would be technically non-conformant with the spec by setting the group as optional. I don't think this should really matter to readers, but it would waste a bit of space in the Parquet file by adding an extra definition level.

### Does this PR introduce _any_ user-facing change?

No, this code is not used yet.

### How was this patch tested?

Added a test to do some minimal validation of the variantShreddingSchema function.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#49151 from cashmand/SPARK-48898-nullability-again.

Authored-by: cashmand <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants