Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support timestamp transitions to and from UTC for repeating rules (daylight savings time zones) #6840

Open
revans2 opened this issue Oct 18, 2022 · 4 comments
Assignees
Labels
feature request New feature or request

Comments

@revans2
Copy link
Collaborator

revans2 commented Oct 18, 2022

Is your feature request related to a problem? Please describe.
This is a follow on to #6831. In order to support all time zones we will need a way to be able to deal with repeating rules. These are the instances of ZoneOffsetTransitionRule. The idea is to that we would cache these on the CPU similar to what we do for ZoneOffsetTransition. However, we would need a lot more columns, because we would need to include the majority of the information included in the ZoneOffsetTransitionRule. Then we would need to recreate the same kind of logic that exists in ZoneRules.getOffset. But we have to be very careful. JDK code is under the a GPL license that makes it so you cannot copy it. Instead we are going to have to try and do a clean room implementation or we might be able to look at what joda time does, which is an Apache licensed library that also supports time zone conversions.

https://github.com/JodaOrg/joda-time

Specifically https://github.com/JodaOrg/joda-time/blob/main/src/main/java/org/joda/time/tz/DateTimeZoneBuilder.java and https://github.com/JodaOrg/global-tz

But I am not sure everything lines up exactly the same so it could take a bit of work.

After this we need to update the API that restricts which time zones we support so it includes all of them and deprecate it. Then we need to update all of the tests that have been written to check that we do not fall back and test the new time zones.

@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify labels Oct 18, 2022
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Oct 18, 2022
@NVnavkumar NVnavkumar self-assigned this Dec 1, 2023
@NVnavkumar NVnavkumar changed the title [FEA] Support timestamp transitions to and from UTC for repeating rules [FEA] Support timestamp transitions to and from UTC for repeating rules (daylight savings) Dec 8, 2023
@NVnavkumar
Copy link
Collaborator

After doing some research into the JDK's API, here are some interesting findings:

  1. Java stores the transitions for the timezones with daylight savings after they have already happened. So for past dates, the transition list has all the previous daylight savings transitions in the transition list as fixed transitions (which makes sense). This means that the transition list will actually grow each year for these timezones. Also, the implication is that for dates in the past, we would adopt the same approach we take with non-repeating rules.

  2. Transition rules exist for dates in the future. The idea is for each current and future year, you can compute the ZoneOffsetTransition for that future year.

So we have 2 approaches we can try here:

  1. We can use ZoneOffsetTransitionRule.createTransitition(...) to create transitions for a set of future years and append these to fixed transition list for that timezone. This would incur a longer initialization time and more memory required to store the extra transitions, but computing the conversion to and from UTC would simply use the existing kernel as if the transitions were fixed and non-repeating. This is definitely the simplest to implement and is parallel to the implementation that cuDF currently uses in parsing non-UTC timestamps from ORC files.

  2. We would store the rules in another column vector, and then implement an addition to the kernel that would use rules if we would fall out of the existing transition list (ie use a transition that doesn't exist yet). We would have to process the timestamp to figure out which rule would apply and then apply a created transition from that rule. The logic is more complicated in this case so it might not perform as well as the simpler fixed transition kernel since we would need some additional conditional logic in this case. But initialization time would only increase a nominal amount to store the rules, and the storage cost would only increase on the yearly basis that Java uses and for the additional columns to store rules.

For now, I'm inclined to try out (2) first since I think (1) feels more like too much additional cost for what looks like premature optimization. However, we could keep in mind to implement both approaches (or a combination) since it could be something that is tuned to the data involved. For example if we are working with a lot of future timestamps, (1) could be better since that algorithm is a lot cleaner than doing a lot of processing and extraction from each timestamp to determine if it falls under a specific rule. Also, if we can partially tune an implementation of (1), maybe we can cover more than enough with less overhead.

@NVnavkumar
Copy link
Collaborator

Should also add, at minimum for approach (2), we should ensure the transitions for the current year are already computed since that logic is potentially more complicated given the overlap between fixed transitions and choosing rules for a daylight savings cycle that hasn't yet completed. It might be much easier to start the following year with fresh rule choice at that point.

@NVnavkumar NVnavkumar changed the title [FEA] Support timestamp transitions to and from UTC for repeating rules (daylight savings) [FEA] Support timestamp transitions to and from UTC for repeating rules (daylight savings time zones) Dec 13, 2023
@revans2
Copy link
Collaborator Author

revans2 commented Dec 27, 2023

I did some initial work looking at approach 1 when I filed the issue. The size of the database is not likely to be too horrible on a per time-zone basis. It would probably be about 11 MiB per time zone because it would need to go out to the year 294247 or so to fully support everything that Spark can also support.

294247 years * 2 transitions per year * (8 + 8 + 4 bytes) => 11.22 MiB

There are about 206 time zones that have transition rules

scala> java.time.ZoneId.getAvailableZoneIds.asScala.map(f => java.time.ZoneId.of(f).normalized).filter(f => !f.getRules().getTransitionRules.isEmpty()).size
res18: Int = 206

Which means we would need about 2.25 GiB to store all of the rules for those timezones. That is small enough That I don't think we would need to worry about the Column size limits in Spark, but it is large enough that I don't think we want to try and keep the entire thing cached in either CPU or GPU memory. So option 2 is the ideal solution if we can make it work. If we need to we could try and play some games where we load timezones on demand as they are needed, as we probably need only one or two timezone ever. We could also look for the maximum timestamp and load up to that point, as it should be very very uncommon to need all of the data. But that would open up a number of corner cases which I would not be super happy about. So if we can make the dynamic version work, then it would be awesome.

Also the dynamic version would set us up to be able to support things like fixed offset rules that don't actually have a name.

@res-life
Copy link
Collaborator

Not planing work on this for release 24.04
@sameerz can we move it to next release?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants