Support date_format via Gpu for non-UTC time zone [databricks] #9721
Conversation
The core changes to dateTimeExpressions.scala look good, but I don't feel comfortable checking in code that changes how we tag TimeZoneAware/TimestampTypes until we have tests in place that verify we are still falling back in all the right places.
build
sql-plugin/src/main/scala/com/nvidia/spark/rapids/TypeChecks.scala
build
build
Signed-off-by: Chong Gao <[email protected]>
Force-pushed from 4fbfbb5 to e3941c2
Please review the last commit.
build
# get from Java:
# ZoneId.getAvailableZoneIds
# id.getRules.isFixedOffset || id.getRules.getTransitionRules.isEmpty
Hmm, can we use the Python Java gateway (e.g., Py4j) to obtain the list instead of hardcoding it here? Besides the maintenance burden, I just want to ensure we're consistent with Java on this behavior: id.getRules.isFixedOffset || id.getRules.getTransitionRules.isEmpty.
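For reference, a rough sketch of what such a Py4J lookup could look like once a Spark session is active (the helper name is made up for illustration, and as the reply below notes, this cannot run at pytest mark-collection time):

```python
# Illustrative sketch: enumerate zones that satisfy the same Java check,
#   id.getRules.isFixedOffset || id.getRules.getTransitionRules.isEmpty,
# through Spark's Py4J gateway. Requires an active SparkSession.
def non_transition_zone_ids(spark):
    jvm = spark.sparkContext._jvm
    zone_id_cls = jvm.java.time.ZoneId
    supported = []
    for zid in zone_id_cls.getAvailableZoneIds():
        rules = zone_id_cls.of(zid).getRules()
        if rules.isFixedOffset() or rules.getTransitionRules().isEmpty():
            supported.append(zid)
    return supported
```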
It's invoked in the pytest marks, and the Python Java gateway is not available when the Spark session has not been started.
Done. Now it is using the following:
@pytest.mark.skipif(is_supported_time_zone() ......)
The is_supported_time_zone helper works in the pytest marks.
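A minimal sketch of how such a mark-time check can work against a hardcoded list (the helper body, environment-variable lookup, and zone subset below are assumptions for illustration, not necessarily the PR's exact implementation):

```python
import os
import pytest

# Hypothetical: zones generated offline with the Java check above and hardcoded,
# so the helper can run while marks are collected (no JVM is available yet).
_non_dst_time_zones = frozenset(['UTC', 'Asia/Shanghai', 'Etc/GMT+5'])

def is_supported_time_zone():
    return os.environ.get('TZ', 'UTC') in _non_dst_time_zones

@pytest.mark.skipif(not is_supported_time_zone(),
                    reason="date_format on GPU only supports non-DST time zones")
def test_date_format_gpu_only_example():
    ...
```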
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, data_gen).selectExpr("date_format(a, '{}')".format(date_format)))

@pytest.mark.parametrize('date_format', supported_date_formats, ids=idfn)
@pytest.mark.parametrize('data_gen', [timestamp_gen], ids=idfn)
@pytest.mark.xfail(is_dst_time_zone(), reason="only support non-DST time zone, refer to https://github.com/NVIDIA/spark-rapids/issues/6839")
Can we avoid xfail here?
If is_dst_time_zone = false, the test result should match without fallback.
If is_dst_time_zone = true, the test result should match with a CPU operator fallback. This one can be combined with test_date_format_for_date.
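Something along these lines would cover both branches without xfail (the test name is illustrative; the helpers and generators are the ones this PR's integration tests already use from asserts, data_gen, and marks):

```python
@pytest.mark.parametrize('date_format', supported_date_formats, ids=idfn)
@pytest.mark.parametrize('data_gen', [timestamp_gen], ids=idfn)
@allow_non_gpu('ProjectExec')
def test_date_format_any_time_zone(data_gen, date_format):
    query = lambda spark: unary_op_df(spark, data_gen).selectExpr(
        "date_format(a, '{}')".format(date_format))
    if is_dst_time_zone():
        # DST zone: the expression falls back to a CPU ProjectExec, results still match
        assert_gpu_fallback_collect(query, 'ProjectExec')
    else:
        # non-DST zone: the expression stays on the GPU and results match the CPU
        assert_gpu_and_cpu_are_equal_collect(query)
```

Whether allow_non_gpu belongs on the whole test or only on the fallback branch is a judgment call; the sketch keeps it on the test for simplicity.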
Using xfail is to record the xfail/xpass cases.
I'm OK with using skipif, but we may forget about it in the future.
Can we use a fallback check instead? I am nervous about xfail, which hides errors.
OK, will do it
Done
build
Test passed locally:
build
@allow_non_gpu(*non_utc_allow)
def test_date_format(data_gen, date_format):

@pytest.mark.parametrize('data_gen', [date_gen], ids=idfn)
@allow_non_gpu('ProjectExec')
Do we need to check whether it's a supported time zone here?
date_format(date) will introduce cast(date as timestamp), which is not supported in non-UTC now.
After we support cast(date as timestamp), we will update this case.
    assert_gpu_fallback_collect(
        lambda spark : unary_op_df(spark, data_gen).selectExpr("date_format(a, '{}')".format(date_format)),
        'ProjectExec',
        conf)
nit: you can merge https://github.com/NVIDIA/spark-rapids/pull/9721/files#diff-ec41a643cf16a9946f6df2498c6d853812f526eb0a3f8d3299b4a911346e1703R491 here directly without creating a conf variable.
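i.e., roughly this shape; the conf entry below is only a placeholder for whatever the linked line actually sets:

```python
assert_gpu_fallback_collect(
    lambda spark: unary_op_df(spark, data_gen).selectExpr(
        "date_format(a, '{}')".format(date_format)),
    'ProjectExec',
    # placeholder entry; inline the real settings from the linked line here
    conf={'spark.sql.session.timeZone': 'America/Los_Angeles'})
```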
@@ -1681,8 +1681,9 @@ object GpuOverrides extends Logging {
        .withPsNote(TypeEnum.STRING, "A limited number of formats are supported"),
      TypeSig.STRING)),
    (a, conf, p, r) => new UnixTimeExprMeta[DateFormatClass](a, conf, p, r) {
      override def isTimeZoneSupported = true
      override def convertToGpu(lhs: Expression, rhs: Expression): GpuExpression =
nit: code style. Not sure whether a line break is needed here.
@@ -1681,8 +1681,9 @@ object GpuOverrides extends Logging {
        .withPsNote(TypeEnum.STRING, "A limited number of formats are supported"),
      TypeSig.STRING)),
    (a, conf, p, r) => new UnixTimeExprMeta[DateFormatClass](a, conf, p, r) {
      override def isTimeZoneSupported = true
Why do we need this?
This is due to the fact that DateFormatClass is not considered a TimeZoneAwareExpression, but it requires support for non-UTC time zones. It's the last check for the time zone requirement in Expressions. By default, this flag is false.
It seems not a good practice to set some value to false by default and then override it to true like this. Can we compute that value based on the input in the object constructor?
build
pre-merge failed due to #10050
build
PR includes:
- test_date_format: test date and time separately. date_format(date) will introduce cast(date as timestamp), which is not supported in non-UTC now; only date_format(timestamp) is tested.

Signed-off-by: Chong Gao [email protected]