[WIP] Move timezone check to each operator [databricks] #9482
Conversation
Signed-off-by: Chong Gao <[email protected]>
It is more than just those operators. In Spark, time-zone-aware code is controlled by TimeZoneAwareExpression. Notably, most of these implementations are in datetimeExpressions.scala, but there are a number outside of it (this came from Spark 3.5.0, using IntelliJ to find all of the implementations of it).
Many of these we don't have GPU equivalents for, and in some cases we do have GPU versions. For all of the others we need to make sure that they are covered, and if they don't support timestamps or don't need to worry about the time zone for some reason, we need to document it and preferably file a follow-on issue to come back and implement the timestamp functionality if it is missing.
There are also some execs that use timezones that we might need to be careful of, like CSV and JSON parsing.
We should make sure that we test all of our code with a different time zone, not just the handful of tests that we expect to fall back. Can we update the tests that we expect not to support a different time zone to skip themselves if the timezone is not what they expect? And for all of them, have a test that skips itself if the timezone is UTC, but verifies that we fall back to the CPU in those cases.
That should help us know exactly what we need to update to support timezones.
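A rough sketch of the suggested pairing, assuming the integration-test helpers that already exist in this repo (is_utc, allow_non_gpu, assert_gpu_and_cpu_are_equal_collect, assert_gpu_fallback_collect, unary_op_df, TimestampGen) and using hour() purely as an illustrative example, not as the actual tests in this PR:

    import pytest
    from asserts import assert_gpu_and_cpu_are_equal_collect, assert_gpu_fallback_collect
    from data_gen import unary_op_df, TimestampGen
    from marks import allow_non_gpu
    from conftest import is_utc  # assumed location of the helper

    # Correctness test: only runs in the timezone the operator supports today (UTC).
    @pytest.mark.skipif(not is_utc(), reason="hour() currently only supports UTC")
    def test_hour_utc():
        assert_gpu_and_cpu_are_equal_collect(
            lambda spark: unary_op_df(spark, TimestampGen()).selectExpr("hour(a)"))

    # Fallback test: only runs in a non-UTC timezone and verifies we fall back to
    # the CPU instead of producing wrong answers on the GPU.
    @allow_non_gpu('ProjectExec')
    @pytest.mark.skipif(is_utc(), reason="only meaningful when the session timezone is not UTC")
    def test_hour_non_utc_fallback():
        assert_gpu_fallback_collect(
            lambda spark: unary_op_df(spark, TimestampGen()).selectExpr("hour(a)"),
            cpu_fallback_class_name='Hour')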
@@ -1733,6 +1762,11 @@ object GpuOverrides extends Logging {
      TypeSig.TIMESTAMP, TypeSig.TIMESTAMP),
    (second, conf, p, r) => new UnaryExprMeta[Second](second, conf, p, r) {
      override def tagExprForGpu(): Unit = {
Could we try to have a TimeZoneAwareExprMeta, or something similar, that makes it super simple to do this? We might even be able to bake it into ExprMeta itself, just by checking if the class that this wraps is also TimeZoneAware.
I'm guessing the best approach is to put it directly in ExprMeta, since otherwise we would have to mix in the TimeZoneAwareExprMeta for the different functions. I'm guessing that functions requiring a timezone will span the gamut of Unary/Binary/Ternary/Quaternary/Agg/etc.
Maybe wrap the check in a method and override it whenever a function starts supporting alternate timezones.
Looks like GpuCast will be a first exception to this idea: #6835
I'm testing all the existing test cases with a non-UTC time zone config added, to identify all the failing cases:

    def _set_all_confs(conf):
        _spark.conf.set("spark.sql.session.timeZone", "+08:00")
Then I'll update the failed cases.
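A self-contained sketch of that experiment (not the actual _set_all_confs helper; the SparkSession is passed in explicitly here): layer one non-UTC session timezone on top of whatever configs a test asks for, so every existing test runs in a non-UTC zone.

    from pyspark.sql import SparkSession

    def set_all_confs_with_non_utc(spark: SparkSession, conf: dict, tz: str = "+08:00"):
        # Apply the per-test configs, forcing a non-UTC session timezone unless the
        # test explicitly sets one itself.
        confs = dict(conf or {})
        confs.setdefault("spark.sql.session.timeZone", tz)
        for key, value in confs.items():
            spark.conf.set(key, value)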
@@ -11,10 +11,10 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
nit: one line break after license header.
done
@@ -840,7 +839,7 @@ object TypeChecks {
    areTimestampsSupported(ZoneId.systemDefault()) &&
      areTimestampsSupported(SQLConf.get.sessionLocalTimeZone)
  }
nit: extra space.
return spark.createDataFrame(SparkContext.getOrCreate().parallelize(data), schema)

# used by timezone test cases, specify all the sqls that will be impacted by non-utc timezone
time_zone_sql_conf_pairs = [
nit: There are some timezone-related functions (not supported yet) mentioned on the Spark built-in functions page. We could add some comments mentioning them here:
convert_timezone
-- SELECT convert_timezone('Europe/Brussels', 'America/Los_Angeles', timestamp_ntz'2021-12-06 00:00:00');
current_timezone()
make_timestamp()
make_timestamp_ltz()
For current_timezone, it just returns the session timezone, so we can ignore it for this PR. The Spark config "spark.sql.session.timeZone" sets this value.
For MakeTimestamp and ConvertTimezone, they are recorded in this follow-on issue: #9570
pom.xml (outdated)
@@ -1045,7 +1045,7 @@
  <arg>-Yno-adapted-args</arg>
  <arg>-Ywarn-unused:imports,locals,patvars,privates</arg>
  <arg>-Xlint:missing-interpolator</arg>
  <arg>-Xfatal-warnings</arg>
  <!-- <arg>-Xfatal-warnings</arg> -->
Nit: Revert this back when we try to commit it.
@@ -374,13 +374,12 @@ class RapidsExecutorPlugin extends ExecutorPlugin with Logging {
      case Some(value) => ZoneId.of(value)
      case None => throw new RuntimeException(s"Driver time zone cannot be determined.")
    }
    if (TypeChecks.areTimestampsSupported(driverTimezone)) {
Maybe off-topic: considering the configuration spark.sql.session.timeZone, should both the driver and executor respect it? Then do we still need the check for a driver/executor timezone mismatch?
Considering the configuration spark.sql.session.timeZone, should both driver and executor respect it?

Here driverTimezone comes from the driver's ZoneId.systemDefault(), not from spark.sql.session.timeZone, refer to: PR. Spark itself does not have this kind of check, but for our spark-rapids we check that the executor and driver have the same JVM time zone.

Then do we still need the check on driver and executor's timezone mismatch?

I think yes, because we want to avoid the issue.
case TimestampType =>
  TypeChecks.areTimestampsSupported(ZoneId.systemDefault()) &&
    TypeChecks.areTimestampsSupported(SQLConf.get.sessionLocalTimeZone)
case TimestampType => true
Do we need to consider the timezone check for the scan and writer parts? AFAIK, when scanning data from Parquet, spark.sql.session.timeZone is supposed to be respected.
If it applies, we should add some Python tests as well.
This check is used by InternalColumnarRddConverter and HostToGpuCoalesceIterator. Coalesce can handle non-UTC timestamps. Not sure about InternalColumnarRddConverter, but it seems it's also OK.
I think we will need to check these. For me, anything that does not have a test that shows it works fully in at least one other time zone must fall back to the CPU if it sees a timestamp that is not UTC.
Parquet, for example, has the rebase mode for older timestamps that requires knowing the timezone to handle properly.
build
@@ -363,8 +363,7 @@ final class TypeSig private(
    case FloatType => check.contains(TypeEnum.FLOAT)
    case DoubleType => check.contains(TypeEnum.DOUBLE)
    case DateType => check.contains(TypeEnum.DATE)
    case TimestampType if check.contains(TypeEnum.TIMESTAMP) =>
      TypeChecks.areTimestampsSupported()
Originally this was invoked by the shuffle meta, FileFormatChecks, AST tagging, and others.
- Shuffle meta: it's safe to remove this check, because shuffle definitely supports non-UTC timezones.
- FileFormatChecks: Spark always writes Parquet with UTC timestamps, so it's safe. For ORC, Spark maps the ORC type "timestamp with local time zone" to the Spark type TIMESTAMP_NTZ (with no time zone). spark-rapids does not support TIMESTAMP_NTZ currently, so it's safe to remove the check. Refer to link.
- AST tagging: not sure if removing this UTC check is OK, need to investigate.
Took a quick look at cudf. For AST, I noticed timezone info is not respected yet.
Still WIP, need to check more exprs.
Filed a sub-issue: #9570
@pytest.mark.parametrize('sql, extra_conf', time_zone_sql_conf_pairs)
def test_timezone_for_operators_with_non_utc(sql, extra_conf):
    # timezone is non-utc, should fall back to CPU
    timezone_conf = {"spark.sql.session.timeZone": "+08:00",
Should we make the time zone string a param to the test? Just because I would like to test a few more time zones than just +08:00
Done
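A hedged sketch of what the parametrized version could look like, reusing the existing helpers (assert_gpu_fallback_collect, allow_non_gpu, unary_op_df, TimestampGen); the zone list and the minute() example are placeholders, not the final choices:

    import pytest
    from asserts import assert_gpu_fallback_collect
    from data_gen import unary_op_df, TimestampGen
    from marks import allow_non_gpu

    # Hypothetical set of zones to exercise; the real list would be agreed on in review.
    non_utc_time_zones = ["+08:00", "Asia/Shanghai", "America/Los_Angeles"]

    @allow_non_gpu('ProjectExec')
    @pytest.mark.parametrize('time_zone', non_utc_time_zones)
    def test_minute_fallback_for_non_utc(time_zone):
        # With a non-UTC session timezone the expression should fall back to the CPU.
        conf = {"spark.sql.session.timeZone": time_zone}
        assert_gpu_fallback_collect(
            lambda spark: unary_op_df(spark, TimestampGen()).selectExpr("minute(a)"),
            cpu_fallback_class_name='Minute',
            conf=conf)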
case TimestampType =>
  TypeChecks.areTimestampsSupported(ZoneId.systemDefault()) &&
    TypeChecks.areTimestampsSupported(SQLConf.get.sessionLocalTimeZone)
case TimestampType => true
I think we will need to check these. For me, anything that does not have a test that shows it works fully in at least one other time zone must fall back to the CPU if it sees a timestamp that is not UTC.
Parquet, for example, has the rebase mode for older timestamps that requires knowing the timezone to handle properly.
Added a pytest mark.
build
Here is the check of UTC for the rebase mode for older timestamps:

    SparkShimImpl.parquetRebaseWrite(sqlConf) match {
      case "EXCEPTION" | "CORRECTED" => // Good
      case "LEGACY" =>
        if (!TypeChecks.areTimestampsSupported()) {
          meta.willNotWorkOnGpu("Only UTC timezone is supported in LEGACY rebase mode. " +
            s"Current timezone settings: (JVM : ${ZoneId.systemDefault()}, " +
            s"session: ${SQLConf.get.sessionLocalTimeZone}). " +
            " Set both of the timezones to UTC to enable LEGACY rebase support.")
        }

I think it's safe to remove the following:

    case TimestampType =>
      TypeChecks.areTimestampsSupported(ZoneId.systemDefault()) &&
        TypeChecks.areTimestampsSupported(SQLConf.get.sessionLocalTimeZone)
Looks good. I would like to see the following either in this PR or in a follow-on PR.
1. How the time zone testing is inserted into the CI runs, so we can continue to verify that it works as expected.
2. Whether there is a way for us to test more time zones than just UTC and Asia/Shanghai. Perhaps we can randomly select one from a list of them based off of a random seed that can be passed into the tests. But this should probably be a follow-on issue (see the sketch below).
3. For all of the tests that are skipped because of the time zone issue, an automated test that verifies that we fell back to the CPU for them instead of producing the wrong answer.
Number 3 I see as a blocker for us to ship a product with this patch in it. We cannot ship something with data corruption in it. I am fine with checking this in for now so long as we have manually verified that the tests failed because we fell back to the CPU. But I want automated tests ASAP.
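A small sketch of the idea in item 2, assuming the seed is handed to the tests through an environment variable (the variable names and the zone list are made up for illustration):

    import os
    import random

    _CANDIDATE_ZONES = ["UTC", "Asia/Shanghai", "America/Los_Angeles", "Europe/Amsterdam"]

    def pick_session_time_zone() -> str:
        # An explicit zone wins, which makes it easy to reproduce a failing run.
        override = os.environ.get("TEST_TZ")
        if override:
            return override
        # Otherwise pick one zone deterministically from the seed for this CI run.
        seed = int(os.environ.get("TEST_TZ_SEED", "0"))
        return random.Random(seed).choice(_CANDIDATE_ZONES)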
@@ -354,7 +359,9 @@ def get_sequence_data(gen, len):
        SparkContext.getOrCreate().parallelize(get_sequence_data(data_gen, length)),
        mixed_schema)

@disable_timezone_test
nit: Why two of these?
done
build
Talked to Peixin: CI will pass an environment variable containing the timezone, and I'll update the test cases to use this environment variable to set the session timezone. Spark resolves the timezone for time-zone-aware expressions from the session config:

    object ResolveTimeZone extends Rule[LogicalPlan] {
      private val transformTimeZoneExprs: PartialFunction[Expression, Expression] = {
        case e: TimeZoneAwareExpression if e.timeZoneId.isEmpty =>
          e.withTimeZone(conf.sessionLocalTimeZone)
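A minimal sketch of how the CI-provided environment variable might be consumed, assuming a variable name of TEST_TIME_ZONE (the actual name would be whatever CI exports):

    import os
    from pyspark.sql import SparkSession

    def apply_ci_time_zone(spark: SparkSession) -> str:
        # Copy the timezone CI exported into the Spark session config so that
        # every test runs under that zone.
        tz = os.environ.get("TEST_TIME_ZONE", "UTC")  # assumed variable name
        spark.conf.set("spark.sql.session.timeZone", tz)
        return tz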
Talked to Peixin: we'd better not use a random timezone, since a random timezone will cause confusion.
About automating the tests that verify fallback to the CPU, currently a test looks like:

    @disable_timezone_test
    @pytest.mark.parametrize('start_gen,stop_gen', sequence_normal_no_step_integral_gens, ids=idfn)
    def test_sequence_without_step(start_gen, stop_gen):
        assert_gpu_and_cpu_are_equal_collect(
            lambda spark: two_col_df(spark, start_gen, stop_gen).selectExpr(
                "sequence(a, b)",
                "sequence(a, 20)",
                "sequence(20, b)"))

It's hard to automatically generate a case like:

    @pytest.mark.parametrize('start_gen,stop_gen', sequence_normal_no_step_integral_gens, ids=idfn)
    def test_sequence_without_step(start_gen, stop_gen):
        assert_gpu_fallback_collect(
            lambda spark: two_col_df(spark, start_gen, stop_gen).selectExpr(
                "sequence(a, b)",
                "sequence(a, 20)",
                "sequence(20, b)"),
            cpu_fallback_class_name = "Sequence"  # We should analyze the code to determine that the operator is "Sequence"
        )

And after we finish the timezone feature, this case should be removed.
Some tests failed on Databricks:
More info:
If I set the rebase mode to "CORRECTED", then:
Not sure why Databricks has different behavior.

    # set the int96 rebase mode values because its LEGACY in databricks which will preclude this op from running on GPU
    'spark.sql.legacy.parquet.int96RebaseModeInWrite' : 'CORRECTED',
    'spark.sql.legacy.parquet.int96RebaseModeInRead' : 'CORRECTED'}

Do not know why some test code needs to set extra config to make the test cases pass.
This is the reason:
So you need to set your Spark session timezone to UTC too. On Databricks, the default rebase mode is LEGACY.
Signed-off-by: Firestarman <[email protected]>
…tests Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
@@ -204,25 +205,66 @@ def test_json_ts_formats_round_trip(spark_tmp_path, date_format, ts_part, v1_ena
            .json(data_path),
        conf=updated_conf)

@allow_non_gpu('FileSourceScanExec', 'ProjectExec')
@allow_non_gpu('FileSourceScanExec', 'BatchScanExec')
@pytest.mark.skipif(is_utc(), reason="TODO sub-issue in https://github.com/NVIDIA/spark-rapids/issues/9653 to support non-UTC")
should be xfail?
We are testing the fallback logic for a non-UTC TZ, and will remove this in the future.
For a UTC TZ, just skipping is OK.
Changes about checker:
Other changes:
After this commit, more failed cases occur:
My plan:

    if (TypeChecks.areTimestampsSupported(driverTimezone)) {
      val executorTimezone = ZoneId.systemDefault()
      if (executorTimezone.normalized() != driverTimezone.normalized()) {
        throw new RuntimeException(s" Driver and executor timezone mismatch. " +
          s"Driver timezone is $driverTimezone and executor timezone is " +
          s"$executorTimezone. Set executor timezone to $driverTimezone.")
      }
    }
I'll update this PR to xfail all the failing cases. And please review these first:
There is kind of too much going on here. We are changing the DB tests to be consistent with the non-DB tests, but I don't have confidence that we did it right. We are splitting up the timezone checks in the plugin, and we are changing all of the tests to support testing in different time zones. Can we split this up into at least 3 separate PRs. Doing it all at once is too much for me to really follow.
@@ -158,6 +229,22 @@ def test_cast_string_ts_valid_format(data_gen):
        conf = {'spark.rapids.sql.hasExtendedYearValues': 'false',
                'spark.rapids.sql.castStringToTimestamp.enabled': 'true'})

@allow_non_gpu('ProjectExec')
@pytest.mark.skipif(is_utc(), reason="TODO sub-issue in https://github.com/NVIDIA/spark-rapids/issues/9653 to support non-UTC tz for Cast from StringType to TimeStampType")
@pytest.mark.parametrize('data_gen', [StringGen('[0-9]{1,4}-[0-9]{1,2}-[0-9]{1,2}'),
For the fallback tests we don't need to do all of the combinations. We just need one to be sure that we are falling back.
@allow_non_gpu('ProjectExec')
@pytest.mark.skipif(is_before_spark_320(), reason="Spark versions(< 320) not support Ansi mode when casting string to date")
@pytest.mark.skipif(is_utc(), reason="TODO sub-issue in https://github.com/NVIDIA/spark-rapids/issues/9653 to support non-UTC tz for Cast from StringType to DateType")
def test_cast_string_date_valid_ansi_for_non_utc():
nit: I'm not sure that we need ANSI fallback tests for cast. The fallback tests can be far fewer than the "validate we do the right thing" tests. We just need to cover the operator/types when we expect a fallback to happen.
@@ -294,14 +381,37 @@ def _assert_cast_to_string_equal (data_gen, conf):
        conf
    )

# split all_array_gens_for_cast_to_string
# remove below split and merge tests: "TODO sub-issue in https://github.com/NVIDIA/spark-rapids/issues/9653 to support non-UTC tz for Cast from Date/Timestamp to String"
gens_for_non_utc_strs = [
nit: Again here I think we really only need Array(date) and Array(Timestamp). The rest are nice, but not required.
# currently does not support. On Databricks, the default datetime rebase mode is LEGACY,
# which is different from regular Spark. Some of the cases will fail if the timezone is non-UTC on DB.
# The following configs are for DB and ensure the rebase mode is not LEGACY on DB.
writer_confs_for_DB = {
I still don't want this in here. I am fine if we xfail the DB write tests and point to why when the timezone is not UTC. But I don't want to change what we are testing unless we go through each test and verify that we are not losing coverage. This PR is already big enough. If you want to do this change it should be split off into another PR.
@@ -1728,11 +1726,11 @@ object GpuOverrides extends Logging {
        GpuMinute(expr)
      }),
    expr[Second](
nit: not needed.
@@ -363,8 +363,7 @@ final class TypeSig private(
    case FloatType => check.contains(TypeEnum.FLOAT)
    case DoubleType => check.contains(TypeEnum.DOUBLE)
    case DateType => check.contains(TypeEnum.DATE)
    case TimestampType if check.contains(TypeEnum.TIMESTAMP) =>
      TypeChecks.areTimestampsSupported()
Took a quick look at cudf. For AST, I noticed timezone info is not respected yet.
@@ -1,5 +1,5 @@
/*
 * Copyright (c) 2021-2022, NVIDIA CORPORATION.
 * Copyright (c) 2021-2023, NVIDIA CORPORATION.
Not touched?
For 1 and 2 here, I've filed #9737 to track them separately. I think this will be important. The issue is that we need to really ensure that there is no data corruption when merging many of the changes here.
Replace this with #9719
closes #6832
Changes:
- Move the timezone check to each operator in datetimeExpressions.scala, and add the check to the affected operators.
- Handle the expression that is a TimeZoneAwareExpression but takes a timezone parameter.

Signed-off-by: Chong Gao [email protected]