Improve dateFormat support in GpuJsonScan and make tests consistent with GpuStructsToJson [databricks] #9975
Conversation
Signed-off-by: Andy Grove <[email protected]>
Force-pushed from eefedec to fa09f44
```diff
@@ -25,8 +25,7 @@ import ai.rapids.cudf.{CaptureGroups, ColumnVector, DType, HostColumnVector, Hos
 import com.nvidia.spark.rapids.Arm.{closeOnExcept, withResource}
 import com.nvidia.spark.rapids.DateUtils.{toStrf, TimestampFormatConversionException}
 import com.nvidia.spark.rapids.jni.CastStrings
 import com.nvidia.spark.rapids.shims.GpuTypeShims
 import java.util
```
I'm surprised that the Scala style check didn't catch this misplaced import.
I think we should stop importing Java classes, let alone pretending one is a Scala package object. IMO, having to explicitly reference the Java class on initialization makes the code clearer.
```python
@pytest.mark.parametrize('date_format', ['', 'yyyy-MM-dd'] if is_before_spark_320 else json_supported_date_formats)
@pytest.mark.parametrize('ansi_enabled', [True, False])
@pytest.mark.parametrize('allow_numeric_leading_zeros', [True, False])
def test_json_read_generated_dates(spark_tmp_table_factory, spark_tmp_path, date_gen_pattern, schema, date_format, \
```
This is the main new test that uncovered many issues
In general this looks really good. My main ask is to make sure that any special requirements for parsing dates/timestamps that are specific to JSON are captured in #10032, so that we can make sure the final solution satisfies everyone.
```python
full_format = date_format + ts_part
@pytest.mark.parametrize('timestamp_format', json_supported_timestamp_formats)
@pytest.mark.parametrize('v1_enabled_list', ["", "json"])
@pytest.mark.xfail(condition = is_not_utc(), reason = 'xfail non-UTC time zone tests because of https://github.com/NVIDIA/spark-rapids/issues/9653')
```
Do we actually fall back to the CPU in these cases? Typically the new pattern for these types of tests is that if a timezone is not supported, we add the operators to the allow list so the plan can fall back and we still verify that we got the right result.
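A minimal sketch of that allow-list pattern, assuming the suite's usual helpers (`allow_non_gpu` from `marks.py` and `assert_gpu_fallback_collect` from `asserts.py`); the test name and data path here are hypothetical:

```python
from asserts import assert_gpu_fallback_collect  # spark-rapids test helper
from marks import allow_non_gpu                  # spark-rapids test marker

# Permit FileSourceScanExec on the CPU instead of xfail-ing the whole test.
@allow_non_gpu('FileSourceScanExec')
def test_json_dates_non_utc_fallback(spark_tmp_path):
    data_path = spark_tmp_path + '/JSON_DATA'
    # The scan falls back to the CPU, and the helper still verifies that the
    # GPU-configured run produces the same result as the CPU run.
    assert_gpu_fallback_collect(
        lambda spark: spark.read.schema('a date').json(data_path),
        'FileSourceScanExec')
```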
@pytest.mark.parametrize("timestamp_type", ["TIMESTAMP_LTZ", "TIMESTAMP_NTZ"]) | ||
def test_json_ts_formats_round_trip_ntz_v1(spark_tmp_path, date_format, ts_part, timestamp_type): | ||
json_ts_formats_round_trip_ntz(spark_tmp_path, date_format, ts_part, timestamp_type, 'json', 'FileSourceScanExec') | ||
@pytest.mark.xfail(condition = is_not_utc(), reason = 'xfail non-UTC time zone tests because of https://github.com/NVIDIA/spark-rapids/issues/9653') |
Same here, and for all of the tests that have issues with non-UTC time zones.
```python
def create_test_data(spark):
    write = gen_df(spark, gen).write
    if len(date_format) > 0:
```
Nit: Could write `if date_format:` here, and that would allow the use of `None` instead of `''` in the definitions, which matches the intent more closely, i.e. to provide no format. This comment applies to other uses of this code pattern below.
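For illustration, a tiny sketch of that suggestion (the function and its defaults here are hypothetical):

```python
# With None as the "no format" default, a plain truthiness check covers
# both None and the empty string.
def write_json(df, data_path, date_format=None):
    writer = df.write
    if date_format:
        writer = writer.option('dateFormat', date_format)
    writer.json(data_path)
```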
Nit: update the copyright headers to 2024.
```scala
def tagDateFormatSupport(meta: RapidsMeta[_, _, _], dateFormat: Option[String]): Unit = {
}
```
Do we really support all possible date formats? Or is any specified date format ignored on Spark 3.2+?
I added a clarifying comment here:

```scala
// dateFormat is ignored by JsonToStructs in Spark 3.2.x and 3.3.x because it just
// performs a regular cast from string to date
```
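If that comment's description holds, a small PySpark repro might look like this (assumes an active SparkSession named `spark`; the input value is made up):

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([('{"d": "2023-12-01"}',)], 'json_str string')
# On Spark 3.2.x/3.3.x the dateFormat option below should have no effect,
# since JsonToStructs just casts the string to a date.
parsed = df.select(
    F.from_json('json_str', 'd date', {'dateFormat': 'dd/MM/yyyy'}).alias('s'))
parsed.show()  # parses anyway, even though the value does not match dd/MM/yyyy
```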
build
```python
def create_test_data(spark):
    write = gen_df(spark, gen).write
    if len(timestamp_format) > 0:
```
Suggested change:

```diff
-if len(timestamp_format) > 0:
+if timestamp_format:
```
```python
    .option('timestampFormat', full_format) \
    .json(data_path)
read = spark.read.schema(schema)
if len(timestamp_format) > 0:
```
Suggested change:

```diff
-if len(timestamp_format) > 0:
+if timestamp_format:
```
Co-authored-by: Jason Lowe <[email protected]>
This reverts commit 4b183a4.
build

build

build
Revert "Improve dateFormat support in GpuJsonScan and make tests consistent with GpuStructsToJson [databricks] (NVIDIA#9975)" This reverts commit 47047a9.

Revert "Revert "Improve dateFormat support in GpuJsonScan and make tests consistent with GpuStructsToJson [databricks] (NVIDIA#9975)"" This reverts commit 90a7dca.
Closes #9905
Closes #9667
Part of #9750
Motivation
The motivation for this PR was twofold:
- We found bugs in `GpuJsonScan` when reading dates. We did not have extensive tests that used randomly generated inputs, and we were not testing with different values for `dateFormat`. We also did not have tests that did not specify `dateFormat`, and that is a different code path in some Spark versions.
- Make the behavior consistent between `GpuJsonScan` and `GpuJsonToStructs`.
Changes in this PR
- Fall back to the CPU in `GpuJsonToStructs` if a `dateFormat` is specified that is not `yyyy-MM-dd`, because we cannot easily support the behavior in Spark 3.1.1 where out-of-range days and months are supported by Spark. `GpuJsonScan`, however, does support a custom `dateFormat`.
- Added `dateFormat` support in `GpuJsonScan` for the case where no `dateFormat` is specified, and for when a `dateFormat` is specified, even if it is the default (the two code paths are sketched below). Supports single-digit months and days.
- `GpuJsonToStructs` just performs a regular cast from string to date and does not respect `dateFormat`.
- Added `dateFormat` support in `GpuJsonScan` and `GpuJsonToStructs` for the case where no `dateFormat` is specified, and for when a `dateFormat` is specified, even if it is the default. Does not support single-digit dates or months.

Note that we still need to do similar work for `timestampFormat`, so I filed #10044 to track that.
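A minimal sketch of those two code paths, assuming an active SparkSession named `spark` and a made-up input path:

```python
schema = 'number int, date date'

# No dateFormat specified: one code path.
df_default = spark.read.schema(schema).json('/tmp/dates.json')

# dateFormat specified explicitly, even with the default value: the other path.
df_explicit = (spark.read.schema(schema)
               .option('dateFormat', 'yyyy-MM-dd')
               .json('/tmp/dates.json'))
```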
Test Status
Follow-on issues