-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Support timestamp transitions to and from UTC for repeating rules (daylight savings time zones) #6840
Comments
After doing some research into the JDK's API, here are some interesting findings:
So we have 2 approaches we can try here:
For now, I'm inclined to try out (2) first since I think (1) feels more like too much additional cost for what looks like premature optimization. However, we could keep in mind to implement both approaches (or a combination) since it could be something that is tuned to the data involved. For example if we are working with a lot of future timestamps, (1) could be better since that algorithm is a lot cleaner than doing a lot of processing and extraction from each timestamp to determine if it falls under a specific rule. Also, if we can partially tune an implementation of (1), maybe we can cover more than enough with less overhead. |
Should also add, at minimum for approach (2), we should ensure the transitions for the current year are already computed since that logic is potentially more complicated given the overlap between fixed transitions and choosing rules for a daylight savings cycle that hasn't yet completed. It might be much easier to start the following year with fresh rule choice at that point. |
I did some initial work looking at approach 1 when I filed the issue. The size of the database is not likely to be too horrible on a per time-zone basis. It would probably be about 11 MiB per time zone because it would need to go out to the year 294247 or so to fully support everything that Spark can also support.
There are about 206 time zones that have transition rules
Which means we would need about 2.25 GiB to store all of the rules for those timezones. That is small enough That I don't think we would need to worry about the Column size limits in Spark, but it is large enough that I don't think we want to try and keep the entire thing cached in either CPU or GPU memory. So option 2 is the ideal solution if we can make it work. If we need to we could try and play some games where we load timezones on demand as they are needed, as we probably need only one or two timezone ever. We could also look for the maximum timestamp and load up to that point, as it should be very very uncommon to need all of the data. But that would open up a number of corner cases which I would not be super happy about. So if we can make the dynamic version work, then it would be awesome. Also the dynamic version would set us up to be able to support things like fixed offset rules that don't actually have a name. |
Not planing work on this for release 24.04 |
Is your feature request related to a problem? Please describe.
This is a follow on to #6831. In order to support all time zones we will need a way to be able to deal with repeating rules. These are the instances of
ZoneOffsetTransitionRule
. The idea is to that we would cache these on the CPU similar to what we do forZoneOffsetTransition
. However, we would need a lot more columns, because we would need to include the majority of the information included in theZoneOffsetTransitionRule
. Then we would need to recreate the same kind of logic that exists inZoneRules.getOffset
. But we have to be very careful. JDK code is under the a GPL license that makes it so you cannot copy it. Instead we are going to have to try and do a clean room implementation or we might be able to look at what joda time does, which is an Apache licensed library that also supports time zone conversions.https://github.com/JodaOrg/joda-time
Specifically https://github.com/JodaOrg/joda-time/blob/main/src/main/java/org/joda/time/tz/DateTimeZoneBuilder.java and https://github.com/JodaOrg/global-tz
But I am not sure everything lines up exactly the same so it could take a bit of work.
After this we need to update the API that restricts which time zones we support so it includes all of them and deprecate it. Then we need to update all of the tests that have been written to check that we do not fall back and test the new time zones.
The text was updated successfully, but these errors were encountered: