Support TimeAdd for non-UTC time zone #10068
Conversation
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
```diff
 def test_timeadd(data_gen):
     days, seconds = data_gen
     assert_gpu_and_cpu_are_equal_collect(
         # We are starting at year 0005 to make sure we don't go before year 0001
         # and beyond year 10000 while doing TimeAdd
         lambda spark: unary_op_df(spark, TimestampGen(start=datetime(5, 1, 1, tzinfo=timezone.utc), end=datetime(15, 1, 1, tzinfo=timezone.utc)), seed=1)
-            .selectExpr("a + (interval {} days {} seconds)".format(days, seconds)))
+            .selectExpr("a + (interval {} days {} seconds)".format(days, seconds)),
+        conf = {'spark.rapids.sql.nonUTC.enabled': True})
```
We don't need this configuration any more.
```diff
@@ -473,7 +473,7 @@ object GpuScalar extends Logging {
  *
  * This class is introduced because many expressions require both the cudf Scalar and its
  * corresponding Scala value to complete their computations. e.g. 'GpuStringSplit',
- * 'GpuStringLocate', 'GpuDivide', 'GpuDateAddInterval', 'GpuTimeMath' ...
+ * 'GpuStringLocate', 'GpuDivide', 'GpuDateAddInterval', 'GpuTimeAdd' ...
```
Q: Why was the name changed? It seems to differ across Spark versions. We could mention both GpuTimeAdd and GpuTimeMath in the comment.
GpuTimeMath was an abstract class implemented by GpuTimeAdd and GpuDateAddInterval. I removed it because there was actually very little code the two classes could share through it.
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
```scala
        GpuTimeZoneDB.fromTimestampToUtcTimestamp(utcRes, zoneId)
      }
    }
    GpuColumnVector.from(res, dataType)
```
Be careful, it seems res is leaked here.
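One common way to make sure the intermediate column cannot leak is the close-on-exception pattern the plugin's Arm utilities provide. A minimal, self-contained sketch; buildResult() below is a hypothetical stand-in for the code that produces res, not anything from this PR:

```scala
// Self-contained sketch of the "close on exception" pattern: if the block
// throws, the resource is closed instead of leaking.
def closeOnExcept[T <: AutoCloseable, V](r: T)(block: T => V): V = {
  try {
    block(r)
  } catch {
    case t: Throwable =>
      r.close()
      throw t
  }
}

// Hypothetical usage for the quoted snippet, where buildResult() stands in for
// the code that produces `res`:
//   closeOnExcept(buildResult()) { res =>
//     GpuColumnVector.from(res, dataType) // takes ownership of res on success
//   }
```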
The behavior seems to mismatch the CPU when the timestamp addition overflows a long. I will check and fix it; converting to draft.
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Perf test results on 50,000,000 timestamps from the big data gen: [results table not captured]
We can use this class to test perf: add a new case in this suite.
Signed-off-by: Haoyang Li <[email protected]>
Perf test results with the perf suite suggested above: [results table not captured]. Also added mean and speedup stats for it.
```diff
@@ -133,10 +164,56 @@ case class GpuTimeAdd(start: Expression,
     }
   }
 
+  // A tricky way to check overflow. The result is overflow when positive + positive = negative
+  // or negative + negative = positive, so we can check the sign of the result is the same as
+  // the sign of the operands.
+  private def timestampAddDuration(cv: ColumnView, duration: BinaryOperable): ColumnVector = {
```
Duplicated. Could we extract this function into a file like datetimeExpressionsUtils.scala? It seems that it applies to all Spark versions, so do not put this function into a shim.
There is already an overflow check utility: AddOverflowChecks.basicOpOverflowCheck
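For reference, the sign-based trick described in the new comment boils down to something like this plain-Scala sketch; it operates on single Long values rather than cudf columns and is not the plugin's actual AddOverflowChecks utility:

```scala
// Signed 64-bit addition can only overflow when both operands share a sign,
// and in that case the wrapped result has the opposite sign. The xor
// expression is negative exactly when the result's sign bit differs from both
// operands' sign bits.
def addOverflows(a: Long, b: Long): Boolean = {
  val res = a + b // Long addition wraps silently on overflow
  ((a ^ res) & (b ^ res)) < 0
}

// Long.MaxValue + 1 wraps around to Long.MinValue, so it is flagged.
assert(addOverflows(Long.MaxValue, 1L))
assert(addOverflows(Long.MinValue, -1L))
assert(!addOverflows(1L, 2L))
```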
Done, thanks.
```diff
@@ -60,6 +67,15 @@ def test_timeadd_daytime_column():
     assert_gpu_and_cpu_are_equal_collect(
         lambda spark: gen_df(spark, gen_list).selectExpr("t + d", "t + INTERVAL '1 02:03:04' DAY TO SECOND"))
 
+@pytest.mark.skipif(is_before_spark_330(), reason='DayTimeInterval is not supported before Pyspark 3.3.0')
+@allow_non_gpu(*non_supported_tz_allow)
+def test_timeadd_daytime_column_long_overflow():
```
How can we ensure the random df will 100% overflow? Maybe specify some constant values to guarantee overflow.
By not making it actually random. DayTimeIntervalGen has both a min_value and a max_value. You could set it up so all of the values generated would overflow. You might need to also remove the special cases and disable nulls to be 100% sure of it.
```python
def __init__(self, min_value=MIN_DAY_TIME_INTERVAL, max_value=MAX_DAY_TIME_INTERVAL, start_field="day", end_field="second",
```
You could also use SetValuesGen with only values in it that would overflow.
```python
class SetValuesGen(DataGen):
```
Updated to SetValuesGen.
Signed-off-by: Haoyang Li <[email protected]>
The current code will randomly fail tests: if the results fall inside a DST overlap or gap, they may not match the CPU. I think this is because we can't access the offset values of the data on the plugin side, so matching this behavior needs a kernel. That won't be difficult, and we could also add the long-overflow check to the kernel. For now, the TimeAdd in this PR can be off by default, or we can note the difference in the compatibility doc, or just leave the PR waiting for the kernel if we will do it soon. Two cases that will always fail:

```python
@pytest.mark.skipif(is_before_spark_330(), reason='DayTimeInterval is not supported before Pyspark 3.3.0')
@allow_non_gpu(*non_supported_tz_allow)
def test_timeadd_daytime_column_normal():
    gen_list = [
        # timestamp column max year is 1000
        ('t', TimestampGen(start=datetime(1900, 12, 31, 15, tzinfo=timezone.utc), end=datetime(1900, 12, 31, 16, tzinfo=timezone.utc))),
        # max days is 8000 year, so added result will not be out of range
        ('d', DayTimeIntervalGen(min_value=timedelta(seconds=0), max_value=timedelta(seconds=0)))]
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: gen_df(spark, gen_list, length=2048).selectExpr("t", "d", "t + d"))

@pytest.mark.parametrize('data_gen', [(0, 1)], ids=idfn)
@allow_non_gpu(*non_supported_tz_allow)
def test_timeadd_special(data_gen):
    days, seconds = data_gen
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, TimestampGen(start=datetime(1900, 12, 31, 15, 55, tzinfo=timezone.utc), end=datetime(1900, 12, 31, 16, tzinfo=timezone.utc)), length=100)
            .selectExpr("a + (interval {} days {} seconds)".format(days, seconds)))
```
Signed-off-by: Haoyang Li <[email protected]>
@revans2 Sorry, I forgot to revert my test code when investigating the failed cases. This PR has something wrong now, please check my comment above.
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
New perf test results: [results table not captured]
Signed-off-by: Haoyang Li <[email protected]>
Depends on NVIDIA/spark-rapids-jni#1700
```scala
import com.nvidia.spark.rapids.jni.GpuTimeZoneDB

object datetimeExpressionsUtils {
  def timestampAddDuration(cv: ColumnVector, duration: BinaryOperable,
```
Could we add some comments/checks here about the types expected? They can be assertions that get turned off in production if we want. I just want it to be very clear what is and is not supported for the types here. As soon as we start trying to do things like bitCastTo, I get a little nervous that we might get errors showing up over time if we are not clear/defensive now.
done.
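For illustration, the kind of defensive check being asked for might look like the sketch below. The DType names are cudf's, but the helper name and exact messages are assumptions, not necessarily the code that landed in the PR:

```scala
import ai.rapids.cudf.{BinaryOperable, ColumnView, DType, Scalar}

// Sketch of defensive type checks for a timestamp-plus-duration helper: the
// column must hold microsecond timestamps and the duration must be a
// microsecond-duration scalar or column. Anything else is rejected up front
// instead of surfacing later as a confusing bit-cast error.
def checkTimestampAddArgs(cv: ColumnView, duration: BinaryOperable): Unit = {
  require(cv.getType == DType.TIMESTAMP_MICROSECONDS,
    s"expected a TIMESTAMP_MICROSECONDS column, got ${cv.getType}")
  duration match {
    case s: Scalar =>
      require(s.getType == DType.DURATION_MICROSECONDS,
        s"expected a DURATION_MICROSECONDS scalar, got ${s.getType}")
    case c: ColumnView =>
      require(c.getType == DType.DURATION_MICROSECONDS,
        s"expected a DURATION_MICROSECONDS column, got ${c.getType}")
    case other =>
      throw new IllegalArgumentException(s"unsupported duration operand: $other")
  }
}
```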
```scala
  override def left: Expression = start
  override def right: Expression = interval

  override def toString: String = s"$left - $right"
```
If this is an add, why do we show it as left - right?
done.
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Closing, as this is not planned for the near term.
Closes #10067
This PR supports TimeAdd for non-UTC time zones. Removed GpuTimeMath because it is only extended by GpuTimeAdd (only in the 311 shim) and GpuDateAddInterval (which overrides columnarEval). Also cleaned up some integration tests. Perf test results to be updated.