-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Custom kernel for converting timestamps b/n UTC and non-UTC non-DST timezones #1553
Custom kernel for converting timestamps b/n UTC and non-UTC non-DST timezones #1553
Conversation
Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
…nc with real timezone DB Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
…exactly, switch to upper bound. Update tests for edge case. Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
build |
* parts of the database. I prefer the former solution at least until we see a performance hit | ||
* where we are waiting on the database to finish loading. | ||
*/ | ||
public static void cacheDatabase() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a way to pass in a HostMemoryAllocator to this so we can do retry if needed in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this is fine to do as a follow on PR. We just need it for host memory limits at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #1570
new HostColumnVector.BasicType(false, DType.INT32)); | ||
HostColumnVector.DataType resultType = | ||
new HostColumnVector.ListType(false, childType); | ||
HostColumnVector fixedTransitions = HostColumnVector.fromLists(resultType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want/need a way to make this so we don't warn about leaking this? We are looking at making leaks fail unit tests. In Spark a lot of times there are races when trying to shut down things, especially if there is a failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also is fine to do as a follow on issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #1571
try { | ||
zoneId = ZoneId.of(tzId).normalized(); // we use the normalized form to dedupe | ||
} catch (ZoneRulesException e) { | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a comment about when this would happen? It feels odd to just eat it and skip the timezone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It actually would never happen in this case. This try/catch might have been added by the IDE, but it's not necessary here. This is an exception that occurs when you pass in an invalid Timezone Id to this method that can't be found in the IANA database.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add, that the source of this is Java itself (TimeZone.getAvailableIds()
), in which the data is coming from the same place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind, it seems that this data is somewhat inconsistent (probably because there is some ambiguity to be resolved, ie the 3-letter abbreviations which are available but deprecated). Maybe we should file a followup issue to handle that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all depends on how many time zones we are going to be able to support, and if we end up supporting them dynamically or not. A comment and a follow on issue should be fine. I am happy to have them skipped for now, but eventually we will need a full fix.
|
||
// TODO: Deprecate this API when we support all timezones | ||
// (See https://github.com/NVIDIA/spark-rapids/issues/6840) | ||
public static boolean isSupportedTimeZone(ZoneId desiredTimeZone) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is going to take a lot to really get rid of this. We are likely going to have to have some special case processing for ZoneOffsets. From what I can tell ZoneId.of
uses ZoneOffset.of
if the string looks like a ZoneOffset. There also appears to be some parsing going on for UT, GMT, and UTC offsets that I don't fully understand yet. We might need a follow on issue to look at how to dynamically look at offsets for UTC.
public class GpuTimeZoneDB { | ||
|
||
private CompletableFuture<Map<String, Integer>> zoneIdToTableFuture; | ||
private CompletableFuture<HostColumnVector> fixedTransitionsFuture; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what exactly is the data type stored here? It looks to be a LIST<STRUCT<startSeconds: int64, endSecond: Int64, offsetSeconds: Int64>>
?
Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
…ase. Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
…e, because the transition would still happen on that exact time. Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please format C++ code using clang-format v16. The style file should be ./src/main/cpp/.clang-format
.
src/main/cpp/src/timezones.cu
Outdated
* | ||
* @tparam typestamp_type type of the input and output timestamp | ||
* @param timestamp input timestamp | ||
* @param transitions the transitions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing two more @param
.
auto input = reinterpret_cast<cudf::column_view const*>(input_handle); | ||
auto transitions = reinterpret_cast<cudf::table_view const*>(transitions_handle); | ||
auto index = static_cast<cudf::size_type>(tz_index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is recommended to use auto const
.
} | ||
|
||
|
||
public static void shutdown() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also should this be synchronized
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's synchronized in the close method.
Table transitions = instance.getTransitions(); | ||
ColumnVector result = new ColumnVector(convertTimestampColumnToUTC(input.getNativeView(), | ||
transitions.getNativeView(), tzIndex)); | ||
transitions.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel this is very expensive since we upload data to GPU and create a new transition table every time we call this function. Can we cache the transition table inside instance
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I plan to update this in a future PR since right now we just need to see if we are computing the right thing on the GPU. I think it's a still an open question as to how to cache it and what makes sense. I will file a follow up issue on optimizing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, right now the functionality will be hidden behind a configuration flag, so there is time to optimize before fully exposing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed this #1588
…o gpu-timezone-non-repeating-transition
Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
Signed-off-by: Navin Kumar <[email protected]>
src/main/cpp/src/timezones.cu
Outdated
#include <cudf/column/column.hpp> | ||
#include <cudf/column/column_device_view.cuh> | ||
#include <cudf/column/column_factories.hpp> | ||
#include <cudf/detail/null_mask.hpp> | ||
#include <cudf/lists/list_device_view.cuh> | ||
#include <cudf/lists/lists_column_device_view.cuh> | ||
#include <cudf/table/table.hpp> | ||
#include <cudf/types.hpp> | ||
#include <rmm/cuda_stream_view.hpp> | ||
#include <rmm/exec_policy.hpp> | ||
#include <thrust/binary_search.h> | ||
|
||
#include "timezones.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The headers should be grouped by a "near to far" order: local headers first, then cudf_test
, then cudf/
, then thrust
, then rmm
, finally C++ built-in. This applies for all C++ files.
Signed-off-by: Navin Kumar <[email protected]>
build |
Signed-off-by: Navin Kumar <[email protected]>
build |
Signed-off-by: Navin Kumar <[email protected]>
build |
Signed-off-by: Navin Kumar <[email protected]>
#include <rmm/cuda_stream_view.hpp> | ||
|
||
#include <cstddef> | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: Navin Kumar <[email protected]>
build |
For NVIDIA/spark-rapids#6831.
This is a custom kernel implementation of transition timezones to and from UTC. This caches the timezone transition database from Java for use on the GPU to be compatible with the Spark implementation.
This passes the test suite used for the CPU POC (
TimeZoneSuite.scala
) in the spark-rapids codebase to be compatible with Apache Spark (See NVIDIA/spark-rapids#9739 for updates to that test suite)