Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve dateFormat support in GpuJsonScan and make tests consistent with GpuStructsToJson [databricks] #9975

Merged
merged 50 commits into from
Jan 9, 2024

Conversation

andygrove
Copy link
Contributor

@andygrove andygrove commented Dec 6, 2023

Closes #9905
Closes #9667
Part of #9750

Motivation

The motivation for this PR was twofold:

  • We had very limited testing of GpuJsonScan when reading dates. We did not have extensive tests that used randomly generated inputs, and we were not testing with different values for dateFormat. We also did not have tests that did not specify dateFormat and that is a different code path in some Spark versions.
  • The tests and implementation for parsing JSON strings had diverged between GpuJsonScan and GpuJsonToStructs

Changes in this PR

  • Add tests for reading randomly generated JSON strings as dates using a variety of date formats, as well as not specifying a date format. This caused a number of test failures, which are now resolved by the other changes in this PR.
  • Move some code for casting string to date into the shim layer because the behavior differs between Spark versions:
    • 311 shim falls back to CPU in GpuJsonToStructs if a dateFormat is specified that is not yyyy-MM-dd because we cannot easily support the behavior in 311 where out-of-range days and months are supported by Spark. GpuJsonScan however does support custom dateFormat.
    • 320 shim has different code paths for GpuJsonScan for the case where no dateFormat is specified, and for when a dateFormat is specified, even if it is the default. Supports single digit months and days. GpuJsonToStructs just performs a regular cast from string to date and does not respect dateFormat.
    • 340 shim has different code paths for both GpuJsonScan and GpuJsonToStructs for the case where no dateFormat is specified, and for when a dateFormat is specified, even if it is the default. Does not support single digit date or month.

Note that we still need to do similar work for timestampFormat, so I filed #10044 to track that.

Test Status

  • 3.1.x
  • 3.2.x
  • 3.3.x
  • 3.4.x
  • 3.5.x

Follow-on issues

@andygrove andygrove force-pushed the json-scan-date-format branch from eefedec to fa09f44 Compare December 6, 2023 17:39
@andygrove andygrove self-assigned this Dec 6, 2023
@andygrove andygrove added the bug Something isn't working label Dec 6, 2023
@andygrove andygrove changed the title [WIP] Improve dateFormat support in GpuJsonScan and make tests consistent with GpuStructsToJson [WIP] Improve dateFormat support in GpuJsonScan and make tests consistent with GpuStructsToJson [databricks] Dec 12, 2023
@@ -25,8 +25,7 @@ import ai.rapids.cudf.{CaptureGroups, ColumnVector, DType, HostColumnVector, Hos
import com.nvidia.spark.rapids.Arm.{closeOnExcept, withResource}
import com.nvidia.spark.rapids.DateUtils.{toStrf, TimestampFormatConversionException}
import com.nvidia.spark.rapids.jni.CastStrings
import com.nvidia.spark.rapids.shims.GpuTypeShims
import java.util
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that scala style didn't catch this misplaced import

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should stop importing Java classes, let alone pretend it is a Scala package object. IMO, having to explicitly reference the Java class on initialization makes the code more clear.

@pytest.mark.parametrize('date_format', ['', 'yyyy-MM-dd'] if is_before_spark_320 else json_supported_date_formats)
@pytest.mark.parametrize('ansi_enabled', [True, False])
@pytest.mark.parametrize('allow_numeric_leading_zeros', [True, False])
def test_json_read_generated_dates(spark_tmp_table_factory, spark_tmp_path, date_gen_pattern, schema, date_format, \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main new test that uncovered many issues

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this looks really good. My main ask would be to make sure that if there are special requirements for parsing dates/timestamps that are specific to JSON that they are captured in #10032 so that we can make sure that we satisfy everyone with a final solution.

full_format = date_format + ts_part
@pytest.mark.parametrize('timestamp_format', json_supported_timestamp_formats)
@pytest.mark.parametrize('v1_enabled_list', ["", "json"])
@pytest.mark.xfail(condition = is_not_utc(), reason = 'xfail non-UTC time zone tests because of https://github.com/NVIDIA/spark-rapids/issues/9653')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually fall back to the CPU in these cases? Typically the new pattern for these types of tests it that if a timezone is not supported that we add the operators to the allow list so it can fall back and we still verify that we got the right result.

@pytest.mark.parametrize("timestamp_type", ["TIMESTAMP_LTZ", "TIMESTAMP_NTZ"])
def test_json_ts_formats_round_trip_ntz_v1(spark_tmp_path, date_format, ts_part, timestamp_type):
json_ts_formats_round_trip_ntz(spark_tmp_path, date_format, ts_part, timestamp_type, 'json', 'FileSourceScanExec')
@pytest.mark.xfail(condition = is_not_utc(), reason = 'xfail non-UTC time zone tests because of https://github.com/NVIDIA/spark-rapids/issues/9653')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here and for all of the tests that have issues with non utc time zones.


def create_test_data(spark):
write = gen_df(spark, gen).write
if len(date_format) > 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Could write if date_format: here, and that would allow the use of None instead of '' in the definitions which matches more closely what the intent is, i.e.: to provide no format.

Comment applies to other uses of this code pattern below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2024 copyrights

Comment on lines 44 to 45
def tagDateFormatSupport(meta: RapidsMeta[_, _, _], dateFormat: Option[String]): Unit = {
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really support all possible date formats? Or is any specified date format ignored on Spark 3.2+?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a clarifying comment here:

    // dateFormat is ignored by JsonToStructs in Spark 3.2.x and 3.3.x because it just
    // performs a regular cast from string to date

@andygrove
Copy link
Contributor Author

build


def create_test_data(spark):
write = gen_df(spark, gen).write
if len(timestamp_format) > 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if len(timestamp_format) > 0:
if timestamp_format:

.option('timestampFormat', full_format) \
.json(data_path)
read = spark.read.schema(schema)
if len(timestamp_format) > 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if len(timestamp_format) > 0:
if timestamp_format:

integration_tests/src/main/python/json_test.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/json_test.py Outdated Show resolved Hide resolved
@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

@revans2 @jlowe I have addressed feedback, and the tests are now passing.

@andygrove andygrove merged commit 47047a9 into NVIDIA:branch-24.02 Jan 9, 2024
40 checks passed
@andygrove andygrove deleted the json-scan-date-format branch January 9, 2024 17:32
andygrove added a commit to andygrove/spark-rapids that referenced this pull request Jan 10, 2024
…sistent with GpuStructsToJson [databricks] (NVIDIA#9975)"

This reverts commit 47047a9.
andygrove added a commit to andygrove/spark-rapids that referenced this pull request Jan 10, 2024
…ests consistent with GpuStructsToJson [databricks] (NVIDIA#9975)""

This reverts commit 90a7dca.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
4 participants