Custom kernel for converting timestamps b/n UTC and non-UTC non-DST timezones #1553

NVnavkumar · 2023-11-14T02:09:36Z

This is a custom kernel implementation for converting timestamps to and from UTC for non-UTC, non-DST timezones. It caches the timezone transition database from Java for use on the GPU so that results stay compatible with the Spark implementation.

This passes the test suite used for the CPU POC (TimeZoneSuite.scala) in the spark-rapids codebase, which checks compatibility with Apache Spark. (See NVIDIA/spark-rapids#9739 for updates to that test suite.)
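
For readers who want a feel for the lookup, here is a minimal CPU-side sketch in Java of converting a UTC timestamp using a sorted transition table. It is not the kernel from this PR: the class name, the two parallel arrays, and the `Long.MIN_VALUE` sentinel in the first slot are all assumptions made for illustration. It uses an upper-bound search, so a timestamp that falls exactly on a transition instant picks up the new offset, which is the edge case mentioned in the commit history.

```java
// Hypothetical sketch, not the PR's implementation or table layout.
final class TransitionLookup {

  // Returns the index of the first element strictly greater than key (upper bound).
  static int upperBound(long[] sorted, long key) {
    int lo = 0;
    int hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] <= key) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  // transitionUtc: ascending UTC epoch seconds; slot 0 is assumed to hold
  //   Long.MIN_VALUE so that every timestamp maps to some entry.
  // offsetSeconds: the UTC offset in effect from the matching transition onward.
  static long fromUtc(long utcSeconds, long[] transitionUtc, int[] offsetSeconds) {
    int idx = upperBound(transitionUtc, utcSeconds) - 1;
    return utcSeconds + offsetSeconds[idx];
  }
}
```

With transitions `{Long.MIN_VALUE, 1000}` and offsets `{3600, 7200}`, a timestamp of 999 uses the first offset, while 1000 and anything later use the second. The reverse direction (local time to UTC) needs extra care around gaps and overlaps and is left out of this sketch.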

Signed-off-by: Navin Kumar <[email protected]>


NVnavkumar · 2023-11-16T08:02:54Z

build

revans2 · 2023-11-16T15:26:17Z

src/main/java/com/nvidia/spark/rapids/jni/GpuTimeZoneDB.java

+   * parts of the database. I prefer the former solution at least until we see a performance hit
+   * where we are waiting on the database to finish loading.
+   */
+  public static void cacheDatabase() {


Can we have a way to pass in a HostMemoryAllocator to this so we can do retry if needed in the future?

Note that this is fine to do as a follow on PR. We just need it for host memory limits at some point.

Filed #1570

revans2 · 2023-11-16T15:30:27Z

src/main/java/com/nvidia/spark/rapids/jni/GpuTimeZoneDB.java

+            new HostColumnVector.BasicType(false, DType.INT32));
+        HostColumnVector.DataType resultType =
+            new HostColumnVector.ListType(false, childType);
+        HostColumnVector fixedTransitions = HostColumnVector.fromLists(resultType,


Do we want or need a way to keep this from triggering a leak warning? We are looking at making leaks fail unit tests, and in Spark there are often races when shutting things down, especially after a failure.

This also is fine to do as a follow on issue.

Filed #1571

revans2 · 2023-11-16T15:37:02Z

src/main/java/com/nvidia/spark/rapids/jni/GpuTimeZoneDB.java

+          try {
+            zoneId = ZoneId.of(tzId).normalized(); // we use the normalized form to dedupe
+          } catch (ZoneRulesException e) {
+            continue;


Can we have a comment about when this would happen? It feels odd to just eat it and skip the timezone.

It actually would never happen in this case. This try/catch might have been added by the IDE, but it's not necessary here. The exception occurs when you pass this method an invalid timezone ID that can't be found in the IANA database.

I should add that the source of these IDs is Java itself (TimeZone.getAvailableIDs()), so the data comes from the same place.

Never mind, it seems that this data is somewhat inconsistent (probably because there is some ambiguity to be resolved, i.e. the 3-letter abbreviations, which are available but deprecated). Maybe we should file a follow-up issue to handle that case?

This all depends on how many time zones we are going to be able to support, and if we end up supporting them dynamically or not. A comment and a follow on issue should be fine. I am happy to have them skipped for now, but eventually we will need a full fix.
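
A small, self-contained sketch of the skip-on-failure behavior discussed in this thread. The class and method names are hypothetical; only `ZoneId.of`, `ZoneId.normalized`, and `ZoneRulesException` come from the JDK. It resolves each ID and silently drops the ones that `java.time` cannot find rules for, which is roughly what the `continue` in the hunk above does.

```java
import java.time.ZoneId;
import java.time.zone.ZoneRulesException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of GpuTimeZoneDB.
final class ZoneIdLoader {

  // Resolves each ID to its normalized ZoneId, skipping IDs with no known rules.
  static Map<String, ZoneId> resolve(String[] ids) {
    Map<String, ZoneId> resolved = new LinkedHashMap<>();
    for (String id : ids) {
      try {
        // normalized() collapses fixed-offset regions to a plain ZoneOffset.
        resolved.put(id, ZoneId.of(id).normalized());
      } catch (ZoneRulesException e) {
        // Well-formed ID, but java.time has no rules for it: skip for now.
        // IDs with an invalid format throw the broader DateTimeException,
        // which is deliberately not caught here.
      }
    }
    return resolved;
  }
}
```

Calling `ZoneIdLoader.resolve(java.util.TimeZone.getAvailableIDs())` would show which legacy IDs get dropped on a given JDK; the exact set may vary with the bundled timezone data.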

revans2 · 2023-11-16T15:58:13Z

src/main/java/com/nvidia/spark/rapids/jni/GpuTimeZoneDB.java

+
+  // TODO: Deprecate this API when we support all timezones 
+  // (See https://github.com/NVIDIA/spark-rapids/issues/6840)
+  public static boolean isSupportedTimeZone(ZoneId desiredTimeZone) {


It is going to take a lot to really get rid of this. We are likely going to need some special-case processing for ZoneOffsets. From what I can tell, ZoneId.of uses ZoneOffset.of if the string looks like a ZoneOffset. There also appears to be some parsing for UT, GMT, and UTC offsets that I don't fully understand yet. We might need a follow-on issue to look at how to handle offsets from UTC dynamically.
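
To make the offset parsing concrete, here is a short probe in Java. The helper class is hypothetical; the behavior it exercises is documented for `ZoneId.of`: a bare offset such as `+08:00` yields a `ZoneOffset`, and IDs prefixed with `UTC`, `GMT`, or `UT` followed by a sign are parsed as offset-based IDs whose `normalized()` form is a `ZoneOffset`.

```java
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical probe for illustration only.
final class OffsetIdProbe {

  // True when the ID has a single fixed offset and no transitions.
  static boolean isFixedOffset(String id) {
    return ZoneId.of(id).getRules().isFixedOffset();
  }

  // Returns the fixed offset in seconds, or throws if the zone is not fixed-offset.
  static int fixedOffsetSeconds(String id) {
    ZoneId normalized = ZoneId.of(id).normalized();
    if (!(normalized instanceof ZoneOffset)) {
      throw new IllegalArgumentException("Not a fixed-offset zone: " + id);
    }
    return ((ZoneOffset) normalized).getTotalSeconds();
  }
}
```

For example, `fixedOffsetSeconds("UTC+08:00")` and `fixedOffsetSeconds("GMT+8")` both yield 28800, while a region such as `Asia/Shanghai` is rejected because its rules contain historical transitions. A dynamic path for offsets could key off a check like this instead of a table lookup, though that is only one possible direction.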


revans2 · 2023-11-16T16:21:38Z

src/main/java/com/nvidia/spark/rapids/jni/GpuTimeZoneDB.java

+public class GpuTimeZoneDB {
+
+  private CompletableFuture<Map<String, Integer>> zoneIdToTableFuture;
+  private CompletableFuture<HostColumnVector> fixedTransitionsFuture;


So what exactly is the data type stored here? It looks to be a LIST<STRUCT<startSeconds: int64, endSecond: Int64, offsetSeconds: Int64>>?
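
As a rough picture of what one zone's list of rows could hold, here is a sketch that pulls the fixed transitions out of `java.time`. The `Row` shape (transition instant plus the offset that takes effect) is an assumption for illustration and is not the column layout used in this PR, which this thread does not spell out. The `java.time` calls (`ZoneRules.getTransitions`, `getTransitionRules`, `ZoneOffsetTransition.toEpochSecond`) are standard JDK API.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.zone.ZoneOffsetTransition;
import java.time.zone.ZoneRules;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real table layout may differ.
final class FixedTransitions {

  static final class Row {
    final long utcInstant;   // UTC epoch seconds at which the offset takes effect
    final int offsetSeconds; // offset from UTC in effect from that instant onward

    Row(long utcInstant, int offsetSeconds) {
      this.utcInstant = utcInstant;
      this.offsetSeconds = offsetSeconds;
    }
  }

  // True when the zone has no recurring (DST-style) rules, only a finite transition list.
  static boolean hasOnlyFixedTransitions(ZoneId zone) {
    return zone.getRules().getTransitionRules().isEmpty();
  }

  // Builds one row per historical transition, preceded by a sentinel row
  // carrying the offset that applied before the first transition.
  static List<Row> rowsFor(ZoneId zone) {
    ZoneRules rules = zone.getRules();
    List<ZoneOffsetTransition> transitions = rules.getTransitions();
    int initialOffset = transitions.isEmpty()
        ? rules.getOffset(Instant.EPOCH).getTotalSeconds()
        : transitions.get(0).getOffsetBefore().getTotalSeconds();
    List<Row> rows = new ArrayList<>();
    rows.add(new Row(Long.MIN_VALUE, initialOffset));
    for (ZoneOffsetTransition t : transitions) {
      rows.add(new Row(t.toEpochSecond(), t.getOffsetAfter().getTotalSeconds()));
    }
    return rows;
  }
}
```

A zone such as `Asia/Shanghai` has historical transitions but no recurring rules, so it fits this shape; a zone with active DST rules would need the recurring rules expanded or handled separately.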


NVnavkumar · 2023-11-16T22:38:13Z

build

revans2

Looking good.

ttnghia

Please format C++ code using clang-format v16. The style file should be ./src/main/cpp/.clang-format.

ttnghia · 2023-11-16T22:55:57Z

src/main/cpp/src/timezones.cu

+ * 
+ * @tparam typestamp_type type of the input and output timestamp
+ * @param timestamp input timestamp
+ * @param transitions the transitions 


Missing two more @param.

ttnghia · 2023-11-16T22:57:25Z

src/main/cpp/src/GpuTimeZoneDBJni.cpp

+        auto input = reinterpret_cast<cudf::column_view const*>(input_handle);
+        auto transitions = reinterpret_cast<cudf::table_view const*>(transitions_handle);
+        auto index = static_cast<cudf::size_type>(tz_index);


It is recommended to use auto const.


ttnghia · 2023-11-21T18:38:19Z

src/main/java/com/nvidia/spark/rapids/jni/GpuTimeZoneDB.java

+  }
+
+
+  public static void shutdown() {


Also should this be synchronized?

It's synchronized in the close method.

ttnghia · 2023-11-21T18:40:35Z

src/main/java/com/nvidia/spark/rapids/jni/GpuTimeZoneDB.java

+    Table transitions = instance.getTransitions();
+    ColumnVector result = new ColumnVector(convertTimestampColumnToUTC(input.getNativeView(),
+        transitions.getNativeView(), tzIndex));
+    transitions.close();


I feel this is very expensive since we upload data to GPU and create a new transition table every time we call this function. Can we cache the transition table inside instance?

Yeah. I plan to update this in a future PR, since right now we just need to see whether we are computing the right thing on the GPU. I think it's still an open question how to cache it and what makes sense. I will file a follow-up issue on optimizing this.

Also, right now the functionality will be hidden behind a configuration flag, so there is time to optimize before fully exposing.

Filed this as #1588
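
Purely as an illustration of the caching idea raised in this thread, here is a generic build-once holder in Java. Everything in it is hypothetical (the class name, the `Supplier`-based builder, the decision to rebuild after `close()`), and it says nothing about how #1588 will actually be resolved. It shows only the shape: build the expensive resource on first use, hand back the same instance afterwards, and release it in a synchronized `close()`.

```java
import java.util.function.Supplier;

// Hypothetical build-once cache for an expensive closeable resource.
final class CachedResource<T extends AutoCloseable> implements AutoCloseable {
  private final Supplier<T> builder;
  private T cached; // guarded by this

  CachedResource(Supplier<T> builder) {
    this.builder = builder;
  }

  // Builds the resource on first call and reuses it on later calls.
  synchronized T get() {
    if (cached == null) {
      cached = builder.get();
    }
    return cached;
  }

  // Releases the cached resource if present; a later get() rebuilds it.
  @Override
  public synchronized void close() {
    if (cached != null) {
      try {
        cached.close();
      } catch (Exception e) {
        throw new IllegalStateException("Failed to close cached resource", e);
      } finally {
        cached = null;
      }
    }
  }
}
```

One open design point with any holder like this is ownership: callers must not close the shared instance themselves, which differs from the per-call `transitions.close()` in the hunk above.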


ttnghia · 2023-11-21T22:29:36Z

src/main/cpp/src/timezones.cu

+#include <cudf/column/column.hpp>
+#include <cudf/column/column_device_view.cuh>
+#include <cudf/column/column_factories.hpp>
+#include <cudf/detail/null_mask.hpp>
+#include <cudf/lists/list_device_view.cuh>
+#include <cudf/lists/lists_column_device_view.cuh>
+#include <cudf/table/table.hpp>
+#include <cudf/types.hpp>
+#include <rmm/cuda_stream_view.hpp>
+#include <rmm/exec_policy.hpp>
+#include <thrust/binary_search.h>
+
+#include "timezones.hpp"


The headers should be grouped in "near to far" order: local headers first, then cudf_test, then cudf/, then thrust, then rmm, and finally the C++ standard headers. This applies to all C++ files.


NVnavkumar · 2023-11-21T22:53:39Z

build


NVnavkumar · 2023-11-21T22:59:08Z

build


NVnavkumar · 2023-11-21T23:23:13Z

build


ttnghia · 2023-11-21T23:46:42Z

src/main/cpp/src/timezones.hpp

+#include <rmm/cuda_stream_view.hpp>
+
+#include <cstddef>
+


Suggested change


NVnavkumar · 2023-11-21T23:49:16Z

build

NVnavkumar added 13 commits November 9, 2023 12:21

Semi-working kernel for timestamp timezone conversion

616e368

Signed-off-by: Navin Kumar <[email protected]>

Updated gtest with transition list

ac5aa18

Refactor tests to use transitions as fixture

8c53cce

Add more items in the column to test each transition

c87f7fc

Updated unit gtests for timezone kernel

558b882

Implementation of GpuTimeZoneDB with matching interface with CPU POC.

e33bb3a

Add minimal convert from UTC test

ca8502a

Signed-off-by: Navin Kumar <[email protected]>

Fix wrong offset bug in creating transition DB and update tests to sync with real timezone DB

3a22b6d

Signed-off-by: Navin Kumar <[email protected]>

Cleanup and sync test with CPP version.

10476bc

Signed-off-by: Navin Kumar <[email protected]>

Merge branch 'branch-23.12' into gpu-timezone-non-repeating-transition

2f0f32a

Fix bug that happens when we pass a timestamp on the transition time exactly, switch to upper bound. Update tests for edge case.

3094fe7

Signed-off-by: Navin Kumar <[email protected]>

Update timezone handling for convert to UTC and update tests

5b7f09e

Signed-off-by: Navin Kumar <[email protected]>

Internalize the daemon thread running to cache the timezone db

d78159a

Signed-off-by: Navin Kumar <[email protected]>

NVnavkumar mentioned this pull request Nov 16, 2023

Update timezone test framework to support both GPU and CPU POC NVIDIA/spark-rapids#9739

Closed

revans2 reviewed Nov 16, 2023


NVnavkumar added 6 commits November 16, 2023 10:37

Fix null pointer exception by creating the instance automatically

8b016b7

Signed-off-by: Navin Kumar <[email protected]>

Fix the visibility of these methods.

058c5cd

Signed-off-by: Navin Kumar <[email protected]>

Add comment to note the type of the column vector stored in the database.

9d71bd4

Signed-off-by: Navin Kumar <[email protected]>

Remove the TIMESTAMP_DAYS code here.

21f7364

Signed-off-by: Navin Kumar <[email protected]>

Update this. I think the subtracting one second now doesn't make sense, because the transition would still happen on that exact time.

2c13b6d

Signed-off-by: Navin Kumar <[email protected]>

Update tests to handle around the instant of transition.

7afbf1c

Signed-off-by: Navin Kumar <[email protected]>

NVnavkumar marked this pull request as ready for review November 16, 2023 22:38

NVnavkumar changed the title ~~[WIP] Custom kernel for converting timestamps b/n UTC and non-UTC non-DST timezones~~ Custom kernel for converting timestamps b/n UTC and non-UTC non-DST timezones Nov 16, 2023

revans2 reviewed Nov 16, 2023


NVnavkumar mentioned this pull request Nov 16, 2023

[FEA] Add retry to GPU timezone database caching operation #1570

Open

ttnghia reviewed Nov 16, 2023
