Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize huglin and bedd for dask with flox #1495

Merged
merged 4 commits into from
Oct 6, 2023
Merged

Optimize huglin and bedd for dask with flox #1495

merged 4 commits into from
Oct 6, 2023

Conversation

aulemahal
Copy link
Collaborator

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
  • Tests for the changes have been added (for bug fixes / features)
    • (If applicable) Documentation has been added / updated (for bug fixes / features)
  • CHANGES.rst has been updated (with summary of main changes)
    • Link to issue (:issue:number) and pull request (:pull:number) has been added

What kind of change does this PR introduce?

Changes the use of aggregate_between_dates to a select_time(...).resample().sum(). One of the "strength" of aggregate_between_dates is the possibility to have those days change from year to year. In huglin_index and biologically_effective_degree_days, we are using fixed start and end dates, so I was able to perform the same operation with a simpler function.

This change makes xarray aware of our resampling-sum computation and thus performance optimizations implemented by flox are made accessible!

In order to preserve the signature and functionality, I had to make select_time a bit more flexible so that we can choose if the bounds are inclusive or not. (aggregate_between_dates has a non-inclusive end-bound, which select_time is inclusine by default).

The results are different, but with a relative error around 1e-7, my guess is that this comes from the optimization itself (dask) and should be considered noise.

I think calendar support is preserved. I have found issues with using leap years, non-uniform calendars, etc. However, this change now allows passing 02-30 as a start or end date if the input uses 360_day. Haha, pretty sure that's almost useless, but it's there anyway.

The biggest difference is that the xclim.indices versions do not mask incomplete periods anymore. With aggregate_between_dates, if the end_date did not exist in the period, the output would be NaN. This is not the case with select_time. The indicator is unchanged since the missing check will mask the incomplete period.

Does this PR introduce a breaking change?

I don't think the last element in a breaking change because:

  • Indicators should always be used (especially if missing data are an issue)
  • I prefer that the indice doesn't perform "checks" that the indicator should be doing

Other information:

This brings down the usage of aggregate_between_dates to a single function : effective_growing_degree_days. As the dates are there a function of the year, we can't use select_time. To be optimized in the future!

@aulemahal aulemahal requested a review from Zeitsperre October 6, 2023 15:10
@github-actions github-actions bot added the indicators Climate indices and indicators label Oct 6, 2023
@aulemahal
Copy link
Collaborator Author

@fjetter, turns out we already had the tools needed for an optimized version of huglin_index!

@Zeitsperre
Copy link
Collaborator

Zeitsperre commented Oct 6, 2023

@aulemahal Dang you work fast.

Since we've opened this Pandora's Box, I'm wondering about this warning:

tests/test_indices.py::TestAgroclimaticIndices::test_huglin_index[jones-10-01-1729.12]
tests/test_indices.py::TestAgroclimaticIndices::test_huglin_index[jones-11-01-2219.51]
  /home/runner/work/xclim/xclim/.tox/py39/lib/python3.9/site-packages/xarray/core/computation.py:808: RuntimeWarning: invalid value encountered in arccos
    result_data = func(*input_data)

This is likely raised because of some datasets with latitudes reaching into the polar circle will have "perpetual days" according to the day_lengths calculator. Would it be fine to fix this with a np.errstate(invalid="ignore") context when calculating the day_length_hours?

Copy link
Collaborator

@Zeitsperre Zeitsperre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well played!

end=end_date,
op="sum",
freq=freq,
day_length = (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last place I can see aggregate_between_dates being used is in effective_growing_degree_days. Would it be safe to assume that that index also has the same dask-related issues?

Should we pull that generic index altogether, or would it be possible to rewrite it? Is it useful to for end-users?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, the same unoptimized behaviour will happen for this index. As it is the only one now using aggregate_between_dates, one option we have is to rewrite the index a single custom function. We might be able to achieve better performance then.

However, I think this idea of time-variable time bounds is still relevant. So I see two ways forward:

  1. A select_time that accepts variable and multidimensional bounds. We want to compute bounds for each period and then go back to the daily input and mask the dates outside those bounds. The mask may have spatial dimensions too.
  2. An optimized resample().map(). By this I mean, that passing a generic function to the resampler would not fragment the chunks to 1 per period.

Solution 2 is clearly outside the scope of xclim. This is work for developpers of flox and xarray, and I don't think it will be easy, if even possible.

Solution 1 will optimize the case where bounds must be dynamic, but the resampling function is simple (mean, sum, etc). I think this is the case for effective_growing_degree_days.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that solution 2 sounds a bit above our pay grade. How much effort are we talking if you were to implement solution 1?

I'm wondering if there are any other indicators that could benefit from dynamic start and end dates. Maybe there are some statistically-informed indexes that could be fascinating down the line? I don't think we'd lose much from keeping it around and optimizing it a bit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Effort for 1 might be large. Dask and xarray have changed a lot since we wrote aggregate_between_dates so it might be easier, but for the moment, I don't think ``effective_growing_degree_days ` (as the single use case) is worth it...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Just wanted to be sure. This PR is good to go as is!

@github-actions github-actions bot added the approved Approved for additional tests label Oct 6, 2023
@aulemahal
Copy link
Collaborator Author

I think you are right about the warning. This warning should be safe to ignore.

@Zeitsperre Zeitsperre added this to the v0.46.0 milestone Oct 6, 2023
@Zeitsperre Zeitsperre merged commit 0e12a27 into master Oct 6, 2023
9 checks passed
@Zeitsperre Zeitsperre deleted the opt-huglin branch October 6, 2023 20:34
@dcherian
Copy link

dcherian commented Oct 7, 2023

One of the "strength" of aggregate_between_dates is the possibility to have those days change from year to year. I

This is definitely very possible using flox directly. 80% of a groupby problem is just constructing the group labels. You should be able to do this somewhat easily with pandas.

Here's how xarray does it for the resampling problem: https://github.com/pydata/xarray/blob/1b0012a44aa45c67858489bc815928e1712dbd00/xarray/core/groupby.py#L536-L541

These codes are passed to flox.xarray.xarray_reduce

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Approved for additional tests indicators Climate indices and indicators
Projects
None yet
Development

Successfully merging this pull request may close these issues.

huglin_index causes heavy dataset fragmentation
3 participants