-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize huglin and bedd for dask with flox #1495
Conversation
@fjetter, turns out we already had the tools needed for an optimized version of |
@aulemahal Dang you work fast. Since we've opened this Pandora's Box, I'm wondering about this warning:
This is likely raised because of some datasets with latitudes reaching into the polar circle will have "perpetual days" according to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well played!
end=end_date, | ||
op="sum", | ||
freq=freq, | ||
day_length = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last place I can see aggregate_between_dates
being used is in effective_growing_degree_days
. Would it be safe to assume that that index also has the same dask-related issues?
Should we pull that generic index altogether, or would it be possible to rewrite it? Is it useful to for end-users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, the same unoptimized behaviour will happen for this index. As it is the only one now using aggregate_between_dates
, one option we have is to rewrite the index a single custom function. We might be able to achieve better performance then.
However, I think this idea of time-variable time bounds is still relevant. So I see two ways forward:
- A
select_time
that accepts variable and multidimensional bounds. We want to compute bounds for each period and then go back to the daily input and mask the dates outside those bounds. The mask may have spatial dimensions too. - An optimized
resample().map()
. By this I mean, that passing a generic function to the resampler would not fragment the chunks to 1 per period.
Solution 2 is clearly outside the scope of xclim. This is work for developpers of flox
and xarray
, and I don't think it will be easy, if even possible.
Solution 1 will optimize the case where bounds must be dynamic, but the resampling function is simple (mean, sum, etc). I think this is the case for effective_growing_degree_days
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that solution 2 sounds a bit above our pay grade. How much effort are we talking if you were to implement solution 1?
I'm wondering if there are any other indicators that could benefit from dynamic start and end dates. Maybe there are some statistically-informed indexes that could be fascinating down the line? I don't think we'd lose much from keeping it around and optimizing it a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Effort for 1 might be large. Dask and xarray have changed a lot since we wrote aggregate_between_dates
so it might be easier, but for the moment, I don't think ``effective_growing_degree_days ` (as the single use case) is worth it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Just wanted to be sure. This PR is good to go as is!
I think you are right about the warning. This warning should be safe to ignore. |
This is definitely very possible using flox directly. 80% of a groupby problem is just constructing the group labels. You should be able to do this somewhat easily with pandas. Here's how xarray does it for the resampling problem: https://github.com/pydata/xarray/blob/1b0012a44aa45c67858489bc815928e1712dbd00/xarray/core/groupby.py#L536-L541 These codes are passed to |
Pull Request Checklist:
huglin_index
causes heavy dataset fragmentation #1494number
) and pull request (:pull:number
) has been addedWhat kind of change does this PR introduce?
Changes the use of
aggregate_between_dates
to aselect_time(...).resample().sum()
. One of the "strength" ofaggregate_between_dates
is the possibility to have those days change from year to year. Inhuglin_index
andbiologically_effective_degree_days
, we are using fixed start and end dates, so I was able to perform the same operation with a simpler function.This change makes xarray aware of our resampling-sum computation and thus performance optimizations implemented by
flox
are made accessible!In order to preserve the signature and functionality, I had to make
select_time
a bit more flexible so that we can choose if the bounds are inclusive or not. (aggregate_between_dates
has a non-inclusive end-bound, whichselect_time
is inclusine by default).The results are different, but with a relative error around 1e-7, my guess is that this comes from the optimization itself (
dask
) and should be considered noise.I think calendar support is preserved. I have found issues with using leap years, non-uniform calendars, etc. However, this change now allows passing
02-30
as a start or end date if the input uses360_day
. Haha, pretty sure that's almost useless, but it's there anyway.The biggest difference is that the
xclim.indices
versions do not mask incomplete periods anymore. Withaggregate_between_dates
, if theend_date
did not exist in the period, the output would be NaN. This is not the case withselect_time
. The indicator is unchanged since the missing check will mask the incomplete period.Does this PR introduce a breaking change?
I don't think the last element in a breaking change because:
Other information:
This brings down the usage of
aggregate_between_dates
to a single function :effective_growing_degree_days
. As the dates are there a function of the year, we can't useselect_time
. To be optimized in the future!