Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-enable from_json / JsonToStructs #9423

Merged
merged 20 commits into from
Nov 6, 2023

Conversation

andygrove
Copy link
Contributor

@andygrove andygrove commented Oct 12, 2023

Depends on rapidsai/cudf#14309

This PR re-enables from_json / JsonToStructs. It was previously disabled (see #8558) because it can cause incorrect results in some cases, but we have similar issues when reading directly from JSON files (see xfailed cases in #9304).

Given that this expression does work correctly in many cases where the inputs are valid JSON, and that it is disabled by default, I would like to discuss adding this back in.

@andygrove andygrove self-assigned this Oct 12, 2023
Signed-off-by: Andy Grove <[email protected]>
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me personally I would disable JSON reading and CSV reading for that matter by default, instead of compounding the problem.

@andygrove andygrove marked this pull request as draft October 12, 2023 21:07
@andygrove andygrove changed the title Re-enable from_json / JsonToStructs WIP: Re-enable from_json / JsonToStructs Oct 12, 2023
@andygrove
Copy link
Contributor Author

I am seeing two errors with the current code:

  • Some valid JSON lines are being replaced with null when recoverWithNulls is enabled
  • Also seeing the following exception in some cases
- E                   Caused by: java.lang.IllegalStateException: The input data didn't parse correctly and we read a different number of rows than was expected. Expected 512, but got 511
E                       at org.apache.spark.sql.rapids.GpuJsonToStructs.$anonfun$doColumnar$6(GpuJsonToStructs.scala:179)

@@ -3556,17 +3556,19 @@ object GpuOverrides extends Logging {
expr[JsonToStructs](
"Returns a struct value with the given `jsonStr` and `schema`",
ExprChecks.projectOnly(
TypeSig.MAP.nested(TypeSig.STRING).withPsNote(TypeEnum.MAP,
"MAP only supports keys and values that are of STRING type"),
TypeSig.STRUCT.nested(TypeSig.all) +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we really support all under a Struct. We don't support maps under a struct and we don't support UDTs under a struct. Could we be more explicit about what we do and do not support here?

@andygrove andygrove changed the title WIP: Re-enable from_json / JsonToStructs Re-enable from_json / JsonToStructs Nov 1, 2023
@andygrove andygrove marked this pull request as ready for review November 1, 2023 14:20
@andygrove
Copy link
Contributor Author

I am seeing two errors with the current code:

* Some valid JSON lines are being replaced with null when recoverWithNulls is enabled

* Also seeing the following exception in some cases
- E                   Caused by: java.lang.IllegalStateException: The input data didn't parse correctly and we read a different number of rows than was expected. Expected 512, but got 511
E                       at org.apache.spark.sql.rapids.GpuJsonToStructs.$anonfun$doColumnar$6(GpuJsonToStructs.scala:179)

These issues are now resolved by recent bug fixes in cuDF

@andygrove
Copy link
Contributor Author

build

@andygrove andygrove merged commit d3ef70c into NVIDIA:branch-23.12 Nov 6, 2023
36 checks passed
@andygrove andygrove deleted the enable-from-json branch November 6, 2023 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants