Improve support for reading CSV and JSON floating-point values #4637
Conversation
Signed-off-by: Andy Grove <[email protected]>
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTextBasedPartitionReader.scala
This is a really great step. Ultimately I would like to use this same trick to support other types, like decimal and date/time. But I am not convinced that we want to make it common everywhere because JSON has some odd configs/requirements with how it parses some numbers, which will be different from CSV and from casting. But these are corner cases and it might not matter that much anyways.
On a side note @GaryShen2008 we need to make sure that we take this into account when we are gathering requirements for JSON parsing in CUDF. We already know that they are not going to want to provide all of the parsing options that Spark supports, so we probably want to instead work on supporting the parsing ourselves as much as we can.
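For reference, a minimal, hypothetical sketch (plain Spark APIs, not plugin code) of one of those JSON-specific options, `allowNonNumericNumbers`, which has no direct CSV or cast equivalent:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object JsonNonNumericNumbersSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Unquoted NaN/Infinity tokens are not strictly valid JSON, but the parser can accept them.
    val lines = Seq("""{"value": 1.5}""", """{"value": NaN}""", """{"value": Infinity}""").toDS()
    val schema = StructType(Seq(StructField("value", DoubleType)))

    // allowNonNumericNumbers controls whether those tokens are accepted as doubles.
    spark.read.schema(schema).option("allowNonNumericNumbers", "true").json(lines).show()
    spark.read.schema(schema).option("allowNonNumericNumbers", "false").json(lines).show()
  }
}
```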
Signed-off-by: Andy Grove <[email protected]>
Sorry, we should also update the compatibility doc so it explains that the parsing code has the same limitations as casting.
https://github.com/NVIDIA/spark-rapids/blob/branch-22.04/docs/compatibility.md#csv-floating-point
But reading it again, I think we want to do this for almost all types, even integers, to get the proper overflow checks.
Also, can you add a test for ANSI mode with invalid float values? I want to be sure that we are doing the same thing: either both throw or neither does.
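A rough sketch of the scenario such a test would need to cover (plain Spark only; the file name and its contents are placeholders, and the real test should compare CPU and GPU runs):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object AnsiInvalidFloatSketch {
  def main(args: Array[String]): Unit = {
    // ANSI mode is the configuration under test here.
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.ansi.enabled", "true")
      .getOrCreate()

    // Placeholder input containing a mix of valid floats and values such as "not-a-number".
    val schema = StructType(Seq(StructField("value", DoubleType)))
    val df = spark.read.schema(schema).csv("invalid_floats.csv")

    // The point of the test: CPU and GPU must agree, either both throw or both return the same rows.
    try {
      df.collect().foreach(println)
    } catch {
      case e: Exception => println(s"read failed: ${e.getMessage}")
    }
  }
}
```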
Signed-off-by: Andy Grove <[email protected]>
…w tests. Also update compatibility guide. Signed-off-by: Andy Grove <[email protected]>
-Inf
INF
-INF
NaN and Inf values are already covered in nan_and_inf.csv
build
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTextBasedPartitionReader.scala
build
There are test failures in …
build
tests/src/test/scala/com/nvidia/spark/rapids/HashAggregatesSuite.scala
build
Signed-off-by: Andy Grove [email protected]
Closes #124 and partially addresses #125, #126, and #1986 (only for floating-point types)
This PR changes both the CSV and JSON readers to read floating-point columns as strings and then cast them to the requested floating-point type. This means that we can now support special values such as NaN and Inf more consistently with Spark.
If this approach is acceptable then I will follow up with PRs to do this for other data types.
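As a rough illustration of the approach (plain Spark APIs with a hypothetical file name, not the plugin's internal reader code), the column is read as a string and then cast, so special values go through string-to-double cast semantics:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ReadFloatsAsStringsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Step 1: read the floating-point column as a string.
    val stringSchema = StructType(Seq(StructField("value", StringType)))
    val asStrings = spark.read.schema(stringSchema).csv("floats.csv") // hypothetical input

    // Step 2: cast the string column to double, so values like "NaN" and "-Infinity"
    // are handled by the cast rather than by the CSV parser itself.
    val asDoubles = asStrings.select(col("value").cast("double").as("value"))
    asDoubles.show()
  }
}
```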
Note that this isn't perfect and there are still some follow-on issues (some existing and some new):

- Support for the `positiveInf`, `negativeInf`, and `nanValue` options #4644
- Making `NaN` and `Infinity` values fully compatible with Spark #4646

Out of these, issue 4647 bothers me the most.
Status:
allowNonNumericNumbers