Optimize huglin and bedd for dask with flox #1495

aulemahal · 2023-10-06T15:10:02Z

Pull Request Checklist:

This PR addresses an already opened issue (for bug fixes / features)
- This PR fixes huglin_index causes heavy dataset fragmentation #1494
Tests for the changes have been added (for bug fixes / features)
- (If applicable) Documentation has been added / updated (for bug fixes / features)
CHANGES.rst has been updated (with summary of main changes)
- Link to issue (:issue:number) and pull request (:pull:number) has been added

What kind of change does this PR introduce?

Changes the use of aggregate_between_dates to a select_time(...).resample().sum(). One of the "strength" of aggregate_between_dates is the possibility to have those days change from year to year. In huglin_index and biologically_effective_degree_days, we are using fixed start and end dates, so I was able to perform the same operation with a simpler function.

This change makes xarray aware of our resampling-sum computation and thus performance optimizations implemented by flox are made accessible!

In order to preserve the signature and functionality, I had to make select_time a bit more flexible so that we can choose if the bounds are inclusive or not. (aggregate_between_dates has a non-inclusive end-bound, which select_time is inclusine by default).

The results are different, but with a relative error around 1e-7, my guess is that this comes from the optimization itself (dask) and should be considered noise.

I think calendar support is preserved. I have found issues with using leap years, non-uniform calendars, etc. However, this change now allows passing 02-30 as a start or end date if the input uses 360_day. Haha, pretty sure that's almost useless, but it's there anyway.

The biggest difference is that the xclim.indices versions do not mask incomplete periods anymore. With aggregate_between_dates, if the end_date did not exist in the period, the output would be NaN. This is not the case with select_time. The indicator is unchanged since the missing check will mask the incomplete period.

Does this PR introduce a breaking change?

I don't think the last element in a breaking change because:

Indicators should always be used (especially if missing data are an issue)
I prefer that the indice doesn't perform "checks" that the indicator should be doing

Other information:

This brings down the usage of aggregate_between_dates to a single function : effective_growing_degree_days. As the dates are there a function of the year, we can't use select_time. To be optimized in the future!

aulemahal · 2023-10-06T15:11:11Z

@fjetter, turns out we already had the tools needed for an optimized version of huglin_index!

Zeitsperre · 2023-10-06T15:31:24Z

@aulemahal Dang you work fast.

Since we've opened this Pandora's Box, I'm wondering about this warning:

tests/test_indices.py::TestAgroclimaticIndices::test_huglin_index[jones-10-01-1729.12]
tests/test_indices.py::TestAgroclimaticIndices::test_huglin_index[jones-11-01-2219.51]
  /home/runner/work/xclim/xclim/.tox/py39/lib/python3.9/site-packages/xarray/core/computation.py:808: RuntimeWarning: invalid value encountered in arccos
    result_data = func(*input_data)

This is likely raised because of some datasets with latitudes reaching into the polar circle will have "perpetual days" according to the day_lengths calculator. Would it be fine to fix this with a np.errstate(invalid="ignore") context when calculating the day_length_hours?

Zeitsperre

Well played!

Zeitsperre · 2023-10-06T15:41:25Z

xclim/indices/_agro.py

-            end=end_date,
-            op="sum",
-            freq=freq,
+        day_length = (


The last place I can see aggregate_between_dates being used is in effective_growing_degree_days. Would it be safe to assume that that index also has the same dask-related issues?

Should we pull that generic index altogether, or would it be possible to rewrite it? Is it useful to for end-users?

Indeed, the same unoptimized behaviour will happen for this index. As it is the only one now using aggregate_between_dates, one option we have is to rewrite the index a single custom function. We might be able to achieve better performance then.

However, I think this idea of time-variable time bounds is still relevant. So I see two ways forward:

A select_time that accepts variable and multidimensional bounds. We want to compute bounds for each period and then go back to the daily input and mask the dates outside those bounds. The mask may have spatial dimensions too.

An optimized resample().map(). By this I mean, that passing a generic function to the resampler would not fragment the chunks to 1 per period.

Solution 2 is clearly outside the scope of xclim. This is work for developpers of flox and xarray, and I don't think it will be easy, if even possible.

Solution 1 will optimize the case where bounds must be dynamic, but the resampling function is simple (mean, sum, etc). I think this is the case for effective_growing_degree_days.

Agreed that solution 2 sounds a bit above our pay grade. How much effort are we talking if you were to implement solution 1?

I'm wondering if there are any other indicators that could benefit from dynamic start and end dates. Maybe there are some statistically-informed indexes that could be fascinating down the line? I don't think we'd lose much from keeping it around and optimizing it a bit.

Effort for 1 might be large. Dask and xarray have changed a lot since we wrote aggregate_between_dates so it might be easier, but for the moment, I don't think ``effective_growing_degree_days ` (as the single use case) is worth it...

Sounds good. Just wanted to be sure. This PR is good to go as is!

aulemahal · 2023-10-06T15:56:42Z

I think you are right about the warning. This warning should be safe to ignore.

dcherian · 2023-10-07T02:09:45Z

One of the "strength" of aggregate_between_dates is the possibility to have those days change from year to year. I

This is definitely very possible using flox directly. 80% of a groupby problem is just constructing the group labels. You should be able to do this somewhat easily with pandas.

Here's how xarray does it for the resampling problem: https://github.com/pydata/xarray/blob/1b0012a44aa45c67858489bc815928e1712dbd00/xarray/core/groupby.py#L536-L541

These codes are passed to flox.xarray.xarray_reduce

Optimize huglin and bedd for dask with flox

272c78c

aulemahal requested a review from Zeitsperre October 6, 2023 15:10

github-actions bot added the indicators Climate indices and indicators label Oct 6, 2023

Merge branch 'master' into opt-huglin

2665a6f

Zeitsperre approved these changes Oct 6, 2023

View reviewed changes

github-actions bot added the approved Approved for additional tests label Oct 6, 2023

ignore invalid values in arccos calculation

19dc5a2

Zeitsperre assigned aulemahal Oct 6, 2023

Zeitsperre added this to the v0.46.0 milestone Oct 6, 2023

Merge branch 'master' into opt-huglin

50cec23

Zeitsperre merged commit 0e12a27 into master Oct 6, 2023
9 checks passed

Zeitsperre deleted the opt-huglin branch October 6, 2023 20:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize huglin and bedd for dask with flox #1495

Optimize huglin and bedd for dask with flox #1495

aulemahal commented Oct 6, 2023

aulemahal commented Oct 6, 2023

Zeitsperre commented Oct 6, 2023 •

edited

Loading

Zeitsperre left a comment

Zeitsperre Oct 6, 2023

aulemahal Oct 6, 2023

Zeitsperre Oct 6, 2023

aulemahal Oct 6, 2023

Zeitsperre Oct 6, 2023

aulemahal commented Oct 6, 2023

dcherian commented Oct 7, 2023

Optimize huglin and bedd for dask with flox #1495

Optimize huglin and bedd for dask with flox #1495

Conversation

aulemahal commented Oct 6, 2023

Pull Request Checklist:

What kind of change does this PR introduce?

Does this PR introduce a breaking change?

Other information:

aulemahal commented Oct 6, 2023

Zeitsperre commented Oct 6, 2023 • edited Loading

Zeitsperre left a comment

Choose a reason for hiding this comment

Zeitsperre Oct 6, 2023

Choose a reason for hiding this comment

aulemahal Oct 6, 2023

Choose a reason for hiding this comment

Zeitsperre Oct 6, 2023

Choose a reason for hiding this comment

aulemahal Oct 6, 2023

Choose a reason for hiding this comment

Zeitsperre Oct 6, 2023

Choose a reason for hiding this comment

aulemahal commented Oct 6, 2023

dcherian commented Oct 7, 2023

Zeitsperre commented Oct 6, 2023 •

edited

Loading