Improve dateFormat support in GpuJsonScan and make tests consistent with GpuStructsToJson [databricks] #9975
Conversation
Signed-off-by: Andy Grove <[email protected]>
Force-pushed from eefedec to fa09f44
```diff
@@ -25,8 +25,7 @@ import ai.rapids.cudf.{CaptureGroups, ColumnVector, DType, HostColumnVector, Hos
 import com.nvidia.spark.rapids.Arm.{closeOnExcept, withResource}
 import com.nvidia.spark.rapids.DateUtils.{toStrf, TimestampFormatConversionException}
 import com.nvidia.spark.rapids.jni.CastStrings
 import com.nvidia.spark.rapids.shims.GpuTypeShims
 import java.util
```
I'm surprised that the Scala style check didn't catch this misplaced import.
I think we should stop importing Java classes, let alone pretending one is a Scala package object. IMO, having to explicitly reference the Java class on initialization makes the code clearer.
```python
@pytest.mark.parametrize('date_format', ['', 'yyyy-MM-dd'] if is_before_spark_320 else json_supported_date_formats)
@pytest.mark.parametrize('ansi_enabled', [True, False])
@pytest.mark.parametrize('allow_numeric_leading_zeros', [True, False])
def test_json_read_generated_dates(spark_tmp_table_factory, spark_tmp_path, date_gen_pattern, schema, date_format, \
```
This is the main new test that uncovered many issues
In general this looks really good. My main ask is to make sure that any special requirements for parsing dates/timestamps that are specific to JSON are captured in #10032, so that we can make sure the final solution satisfies everyone.
```python
full_format = date_format + ts_part
@pytest.mark.parametrize('timestamp_format', json_supported_timestamp_formats)
@pytest.mark.parametrize('v1_enabled_list', ["", "json"])
@pytest.mark.xfail(condition = is_not_utc(), reason = 'xfail non-UTC time zone tests because of https://github.com/NVIDIA/spark-rapids/issues/9653')
```
Do we actually fall back to the CPU in these cases? Typically the new pattern for these types of tests is that if a timezone is not supported, we add the operators to the allow list so the plan can fall back and we still verify that we got the right result.
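A minimal sketch of that allow-list pattern, assuming the suite's usual helpers (`allow_non_gpu` from `marks.py` and `assert_gpu_fallback_collect` from `asserts.py`); the test name and data path here are hypothetical:

```python
from asserts import assert_gpu_fallback_collect  # spark-rapids test helper
from marks import allow_non_gpu                  # spark-rapids test marker

# Permit FileSourceScanExec on the CPU instead of xfail-ing the whole test.
@allow_non_gpu('FileSourceScanExec')
def test_json_dates_non_utc_fallback(spark_tmp_path):
    data_path = spark_tmp_path + '/JSON_DATA'
    # The scan falls back to the CPU, and the helper still verifies that the
    # GPU-configured run produces the same result as the CPU run.
    assert_gpu_fallback_collect(
        lambda spark: spark.read.schema('a date').json(data_path),
        'FileSourceScanExec')
```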
@pytest.mark.parametrize("timestamp_type", ["TIMESTAMP_LTZ", "TIMESTAMP_NTZ"]) | ||
def test_json_ts_formats_round_trip_ntz_v1(spark_tmp_path, date_format, ts_part, timestamp_type): | ||
json_ts_formats_round_trip_ntz(spark_tmp_path, date_format, ts_part, timestamp_type, 'json', 'FileSourceScanExec') | ||
@pytest.mark.xfail(condition = is_not_utc(), reason = 'xfail non-UTC time zone tests because of https://github.com/NVIDIA/spark-rapids/issues/9653') |
Same here, and for all of the tests that have issues with non-UTC time zones.
```python
def create_test_data(spark):
    write = gen_df(spark, gen).write
    if len(date_format) > 0:
```
Nit: Could write `if date_format:` here, and that would allow the use of `None` instead of `''` in the definitions, which matches the intent more closely, i.e. to provide no format. This comment applies to other uses of this code pattern below.
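For illustration, a tiny sketch of that suggestion (the function and its defaults here are hypothetical):

```python
# With None as the "no format" default, a plain truthiness check covers
# both None and the empty string.
def write_json(df, data_path, date_format=None):
    writer = df.write
    if date_format:
        writer = writer.option('dateFormat', date_format)
    writer.json(data_path)
```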
Nit: update the copyright headers to 2024.
```scala
def tagDateFormatSupport(meta: RapidsMeta[_, _, _], dateFormat: Option[String]): Unit = {
}
```
Do we really support all possible date formats? Or is any specified date format ignored on Spark 3.2+?
I added a clarifying comment here:

```scala
// dateFormat is ignored by JsonToStructs in Spark 3.2.x and 3.3.x because it just
// performs a regular cast from string to date
```
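If that comment's description holds, a small PySpark repro might look like this (assumes an active SparkSession named `spark`; the input value is made up):

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([('{"d": "2023-12-01"}',)], 'json_str string')
# On Spark 3.2.x/3.3.x the dateFormat option below should have no effect,
# since JsonToStructs just casts the string to a date.
parsed = df.select(
    F.from_json('json_str', 'd date', {'dateFormat': 'dd/MM/yyyy'}).alias('s'))
parsed.show()  # parses anyway, even though the value does not match dd/MM/yyyy
```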
build
```python
def create_test_data(spark):
    write = gen_df(spark, gen).write
    if len(timestamp_format) > 0:
```
Suggested change:

```diff
-if len(timestamp_format) > 0:
+if timestamp_format:
```
```python
    .option('timestampFormat', full_format) \
    .json(data_path)
read = spark.read.schema(schema)
if len(timestamp_format) > 0:
```
Suggested change:

```diff
-if len(timestamp_format) > 0:
+if timestamp_format:
```
Co-authored-by: Jason Lowe <[email protected]>
This reverts commit 4b183a4.
build

build

build
Revert "Improve dateFormat support in GpuJsonScan and make tests consistent with GpuStructsToJson [databricks] (NVIDIA#9975)" This reverts commit 47047a9.

Revert "Revert "Improve dateFormat support in GpuJsonScan and make tests consistent with GpuStructsToJson [databricks] (NVIDIA#9975)"" This reverts commit 90a7dca.
Closes #9905
Closes #9667
Part of #9750
Motivation
The motivation for this PR was twofold:
- We found bugs in `GpuJsonScan` when reading dates. We did not have extensive tests that used randomly generated inputs, and we were not testing with different values for `dateFormat`. We also did not have tests that did not specify `dateFormat`, and that is a different code path in some Spark versions.
- Make the behavior consistent between `GpuJsonScan` and `GpuJsonToStructs`.
Changes in this PR
- Fall back to the CPU in `GpuJsonToStructs` if a `dateFormat` is specified that is not `yyyy-MM-dd`, because we cannot easily support the behavior in Spark 3.1.1 where out-of-range days and months are supported by Spark. `GpuJsonScan`, however, does support a custom `dateFormat`.
- Added `dateFormat` support in `GpuJsonScan` for the case where no `dateFormat` is specified, and for when a `dateFormat` is specified, even if it is the default (the two code paths are sketched below). Supports single-digit months and days.
- `GpuJsonToStructs` just performs a regular cast from string to date and does not respect `dateFormat`.
- Added `dateFormat` support in `GpuJsonScan` and `GpuJsonToStructs` for the case where no `dateFormat` is specified, and for when a `dateFormat` is specified, even if it is the default. Does not support single-digit dates or months.

Note that we still need to do similar work for `timestampFormat`, so I filed #10044 to track that.
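A minimal sketch of those two code paths, assuming an active SparkSession named `spark` and a made-up input path:

```python
schema = 'number int, date date'

# No dateFormat specified: one code path.
df_default = spark.read.schema(schema).json('/tmp/dates.json')

# dateFormat specified explicitly, even with the default value: the other path.
df_explicit = (spark.read.schema(schema)
               .option('dateFormat', 'yyyy-MM-dd')
               .json('/tmp/dates.json'))
```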
Test Status
Follow-on issues