Support TimeAdd for non-UTC time zone #10068
Conversation
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
```diff
 def test_timeadd(data_gen):
     days, seconds = data_gen
     assert_gpu_and_cpu_are_equal_collect(
         # We are starting at year 0005 to make sure we don't go before year 0001
         # and beyond year 10000 while doing TimeAdd
         lambda spark: unary_op_df(spark, TimestampGen(start=datetime(5, 1, 1, tzinfo=timezone.utc), end=datetime(15, 1, 1, tzinfo=timezone.utc)), seed=1)
-            .selectExpr("a + (interval {} days {} seconds)".format(days, seconds)))
+            .selectExpr("a + (interval {} days {} seconds)".format(days, seconds)),
+        conf = {'spark.rapids.sql.nonUTC.enabled': True})
```
We don't need this configuration any more.
```diff
@@ -473,7 +473,7 @@ object GpuScalar extends Logging {
  *
  * This class is introduced because many expressions require both the cudf Scalar and its
  * corresponding Scala value to complete their computations. e.g. 'GpuStringSplit',
- * 'GpuStringLocate', 'GpuDivide', 'GpuDateAddInterval', 'GpuTimeMath' ...
+ * 'GpuStringLocate', 'GpuDivide', 'GpuDateAddInterval', 'GpuTimeAdd' ...
```
Q: Why was the name changed? It seems to differ across Spark versions. We could mention both GpuTimeAdd and GpuTimeMath in the comment.
GpuTimeMath was an abstract class implemented by GpuTimeAdd and GpuDateAddInterval. I removed it because there was actually very little code the two classes could share through it.
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
```scala
        GpuTimeZoneDB.fromTimestampToUtcTimestamp(utcRes, zoneId)
      }
    }
    GpuColumnVector.from(res, dataType)
```
Be careful, it seems res is leaked here.
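One common way to make sure the intermediate column cannot leak is the close-on-exception pattern the plugin's Arm utilities provide. A minimal, self-contained sketch; buildResult() below is a hypothetical stand-in for the code that produces res, not anything from this PR:

```scala
// Self-contained sketch of the "close on exception" pattern: if the block
// throws, the resource is closed instead of leaking.
def closeOnExcept[T <: AutoCloseable, V](r: T)(block: T => V): V = {
  try {
    block(r)
  } catch {
    case t: Throwable =>
      r.close()
      throw t
  }
}

// Hypothetical usage for the quoted snippet, where buildResult() stands in for
// the code that produces `res`:
//   closeOnExcept(buildResult()) { res =>
//     GpuColumnVector.from(res, dataType) // takes ownership of res on success
//   }
```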
The behavior seems to mismatch the CPU when the timestamp addition overflows a long. I will check and fix it; converting to draft.
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Perf test results on 50,000,000 timestamps from the big data gen: [results table not captured]
We can use this class to test perf: add a new case in this suite.
Signed-off-by: Haoyang Li <[email protected]>
Perf test results with the perf suite suggested above: [results table not captured]. Also added mean and speedup stats for it.
```diff
@@ -133,10 +164,56 @@ case class GpuTimeAdd(start: Expression,
     }
   }
 
+  // A tricky way to check overflow. The result is overflow when positive + positive = negative
+  // or negative + negative = positive, so we can check the sign of the result is the same as
+  // the sign of the operands.
+  private def timestampAddDuration(cv: ColumnView, duration: BinaryOperable): ColumnVector = {
```
Duplicated. Could we extract this function into a file like datetimeExpressionsUtils.scala? It seems that it applies to all Spark versions, so do not put this function into a shim.
There is already an overflow check utility: AddOverflowChecks.basicOpOverflowCheck
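For reference, the sign-based trick described in the new comment boils down to something like this plain-Scala sketch; it operates on single Long values rather than cudf columns and is not the plugin's actual AddOverflowChecks utility:

```scala
// Signed 64-bit addition can only overflow when both operands share a sign,
// and in that case the wrapped result has the opposite sign. The xor
// expression is negative exactly when the result's sign bit differs from both
// operands' sign bits.
def addOverflows(a: Long, b: Long): Boolean = {
  val res = a + b // Long addition wraps silently on overflow
  ((a ^ res) & (b ^ res)) < 0
}

// Long.MaxValue + 1 wraps around to Long.MinValue, so it is flagged.
assert(addOverflows(Long.MaxValue, 1L))
assert(addOverflows(Long.MinValue, -1L))
assert(!addOverflows(1L, 2L))
```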
Done, thanks.
```diff
@@ -60,6 +67,15 @@ def test_timeadd_daytime_column():
     assert_gpu_and_cpu_are_equal_collect(
         lambda spark: gen_df(spark, gen_list).selectExpr("t + d", "t + INTERVAL '1 02:03:04' DAY TO SECOND"))
 
+@pytest.mark.skipif(is_before_spark_330(), reason='DayTimeInterval is not supported before Pyspark 3.3.0')
+@allow_non_gpu(*non_supported_tz_allow)
+def test_timeadd_daytime_column_long_overflow():
```
How can we ensure the random df will 100% overflow? Maybe specify some constant values to guarantee overflow.
By not making it actually random. DayTimeIntervalGen has both a min_value and a max_value. You could set it up so all of the values generated would overflow. You might need to also remove the special cases and disable nulls to be 100% sure of it.
```python
def __init__(self, min_value=MIN_DAY_TIME_INTERVAL, max_value=MAX_DAY_TIME_INTERVAL, start_field="day", end_field="second",
```
You could also use SetValuesGen with only values in it that would overflow.
```python
class SetValuesGen(DataGen):
```
Updated to SetValuesGen.
Signed-off-by: Haoyang Li <[email protected]>
The current code will randomly fail tests: if the results fall inside a DST overlap or gap, they may not match the CPU. I think this is because we can't access the offset values of the data on the plugin side, so matching this behavior needs a kernel. That won't be difficult, and we could also add the long-overflow check to the kernel. For now, the TimeAdd in this PR can be off by default, or we can note the difference in the compatibility doc, or just leave the PR waiting for the kernel if we will do it soon. Two cases that will always fail:

```python
@pytest.mark.skipif(is_before_spark_330(), reason='DayTimeInterval is not supported before Pyspark 3.3.0')
@allow_non_gpu(*non_supported_tz_allow)
def test_timeadd_daytime_column_normal():
    gen_list = [
        # timestamp column max year is 1000
        ('t', TimestampGen(start=datetime(1900, 12, 31, 15, tzinfo=timezone.utc), end=datetime(1900, 12, 31, 16, tzinfo=timezone.utc))),
        # max days is 8000 year, so added result will not be out of range
        ('d', DayTimeIntervalGen(min_value=timedelta(seconds=0), max_value=timedelta(seconds=0)))]
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: gen_df(spark, gen_list, length=2048).selectExpr("t", "d", "t + d"))

@pytest.mark.parametrize('data_gen', [(0, 1)], ids=idfn)
@allow_non_gpu(*non_supported_tz_allow)
def test_timeadd_special(data_gen):
    days, seconds = data_gen
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, TimestampGen(start=datetime(1900, 12, 31, 15, 55, tzinfo=timezone.utc), end=datetime(1900, 12, 31, 16, tzinfo=timezone.utc)), length=100)
            .selectExpr("a + (interval {} days {} seconds)".format(days, seconds)))
```
Signed-off-by: Haoyang Li <[email protected]>
@revans2 Sorry, I forgot to revert my test code when investigating the failed cases. This PR has something wrong now, please check my comment above.
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
New perf test results: [results table not captured]
Signed-off-by: Haoyang Li <[email protected]>
Depends on NVIDIA/spark-rapids-jni#1700
```scala
import com.nvidia.spark.rapids.jni.GpuTimeZoneDB

object datetimeExpressionsUtils {
  def timestampAddDuration(cv: ColumnVector, duration: BinaryOperable,
```
Could we add some comments/checks here about the types expected? They can be assertions that get turned off in production if we want. I just want it to be very clear what is and is not supported for the types here. As soon as we start trying to do things like bitCastTo, I get a little nervous that we might get errors showing up over time if we are not clear/defensive now.
done.
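For illustration, the kind of defensive check being asked for might look like the sketch below. The DType names are cudf's, but the helper name and exact messages are assumptions, not necessarily the code that landed in the PR:

```scala
import ai.rapids.cudf.{BinaryOperable, ColumnView, DType, Scalar}

// Sketch of defensive type checks for a timestamp-plus-duration helper: the
// column must hold microsecond timestamps and the duration must be a
// microsecond-duration scalar or column. Anything else is rejected up front
// instead of surfacing later as a confusing bit-cast error.
def checkTimestampAddArgs(cv: ColumnView, duration: BinaryOperable): Unit = {
  require(cv.getType == DType.TIMESTAMP_MICROSECONDS,
    s"expected a TIMESTAMP_MICROSECONDS column, got ${cv.getType}")
  duration match {
    case s: Scalar =>
      require(s.getType == DType.DURATION_MICROSECONDS,
        s"expected a DURATION_MICROSECONDS scalar, got ${s.getType}")
    case c: ColumnView =>
      require(c.getType == DType.DURATION_MICROSECONDS,
        s"expected a DURATION_MICROSECONDS column, got ${c.getType}")
    case other =>
      throw new IllegalArgumentException(s"unsupported duration operand: $other")
  }
}
```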
```scala
  override def left: Expression = start
  override def right: Expression = interval

  override def toString: String = s"$left - $right"
```
If this is an add, why do we show it as left - right?
done.
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Closing, as this is not planned for the near term.
Closes #10067
This PR supports TimeAdd for non-UTC time zones. Removed GpuTimeMath because it is only extended by GpuTimeAdd (only in the 311 shim) and GpuDateAddInterval (which overrides columnarEval). Also cleaned up some integration tests. Perf test results to be updated.