Improve support for reading CSV and JSON floating-point values #4637
Conversation
Signed-off-by: Andy Grove <[email protected]>
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTextBasedPartitionReader.scala
This is a really great step. Ultimately I would like to use this same trick to support other types, like decimal and date/time. But I am not convinced that we want to make it common everywhere because JSON has some odd configs/requirements with how it parses some numbers, which will be different from CSV and from casting. But these are corner cases and it might not matter that much anyways.
On a side note @GaryShen2008 we need to make sure that we take this into account when we are gathering requirements for JSON parsing in CUDF. We already know that they are not going to want to provide all of the parsing options that Spark supports, so we probably want to instead work on supporting the parsing ourselves as much as we can.
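For reference, a minimal, hypothetical sketch (plain Spark APIs, not plugin code) of one of those JSON-specific options, `allowNonNumericNumbers`, which has no direct CSV or cast equivalent:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object JsonNonNumericNumbersSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Unquoted NaN/Infinity tokens are not strictly valid JSON, but the parser can accept them.
    val lines = Seq("""{"value": 1.5}""", """{"value": NaN}""", """{"value": Infinity}""").toDS()
    val schema = StructType(Seq(StructField("value", DoubleType)))

    // allowNonNumericNumbers controls whether those tokens are accepted as doubles.
    spark.read.schema(schema).option("allowNonNumericNumbers", "true").json(lines).show()
    spark.read.schema(schema).option("allowNonNumericNumbers", "false").json(lines).show()
  }
}
```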
Signed-off-by: Andy Grove <[email protected]>
Sorry, we should also update the compatibility doc so it explains that the parsing code has the same limitations as casting.
https://github.com/NVIDIA/spark-rapids/blob/branch-22.04/docs/compatibility.md#csv-floating-point
But reading it again, I think we want to do this for almost all types, even integers, to get the proper overflow checks.
Also, can you add a test for ANSI mode with invalid float values? I want to be sure that we are doing the same thing: either both throw or neither does.
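A rough sketch of the scenario such a test would need to cover (plain Spark only; the file name and its contents are placeholders, and the real test should compare CPU and GPU runs):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object AnsiInvalidFloatSketch {
  def main(args: Array[String]): Unit = {
    // ANSI mode is the configuration under test here.
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.ansi.enabled", "true")
      .getOrCreate()

    // Placeholder input containing a mix of valid floats and values such as "not-a-number".
    val schema = StructType(Seq(StructField("value", DoubleType)))
    val df = spark.read.schema(schema).csv("invalid_floats.csv")

    // The point of the test: CPU and GPU must agree, either both throw or both return the same rows.
    try {
      df.collect().foreach(println)
    } catch {
      case e: Exception => println(s"read failed: ${e.getMessage}")
    }
  }
}
```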
Signed-off-by: Andy Grove <[email protected]>
…w tests. Also update compatibility guide. Signed-off-by: Andy Grove <[email protected]>
-Inf
INF
-INF
NaN and Inf values are already covered in nan_and_inf.csv
build
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTextBasedPartitionReader.scala
build
There are test failures in …
build
tests/src/test/scala/com/nvidia/spark/rapids/HashAggregatesSuite.scala
build
Signed-off-by: Andy Grove [email protected]
Closes #124 and partially addresses #125, #126, and #1986 (only for floating-point types)
This PR changes both the CSV and JSON readers to read floating-point columns as strings and then cast them to the requested floating-point type. This means that we can now support special values such as NaN and Inf more consistently with Spark.
If this approach is acceptable then I will follow up with PRs to do this for other data types.
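As a rough illustration of the approach (plain Spark APIs with a hypothetical file name, not the plugin's internal reader code), the column is read as a string and then cast, so special values go through string-to-double cast semantics:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ReadFloatsAsStringsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Step 1: read the floating-point column as a string.
    val stringSchema = StructType(Seq(StructField("value", StringType)))
    val asStrings = spark.read.schema(stringSchema).csv("floats.csv") // hypothetical input

    // Step 2: cast the string column to double, so values like "NaN" and "-Infinity"
    // are handled by the cast rather than by the CSV parser itself.
    val asDoubles = asStrings.select(col("value").cast("double").as("value"))
    asDoubles.show()
  }
}
```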
Note that this isn't perfect and there are still some follow-on issues (some existing and some new):

- Support for the `positiveInf`, `negativeInf`, and `nanValue` options #4644
- Making `NaN` and `Infinity` values fully compatible with Spark #4646

Out of these, issue 4647 bothers me the most.
Status:
allowNonNumericNumbers