Skip to content

Commit

Permalink
Optimize huglin and bedd for dask with flox (#1495)
Browse files Browse the repository at this point in the history
### What kind of change does this PR introduce?

Changes the use of `aggregate_between_dates` to a
`select_time(...).resample().sum()`. One of the "strength" of
`aggregate_between_dates` is the possibility to have those days change
from year to year. In `huglin_index` and
`biologically_effective_degree_days`, we are using fixed start and end
dates, so I was able to perform the same operation with a simpler
function.

This change makes xarray aware of our resampling-sum computation and
thus, performance optimizations implemented by `flox` are made
accessible!

In order to preserve the signature and functionality, I had to make
`select_time` a bit more flexible so that we can choose if the bounds
are inclusive or not. (`aggregate_between_dates` has a non-inclusive
end-bound, which `select_time` is inclusive by default).

The results are different, but with a relative error around 1e-7, my
guess is that this comes from the optimization itself (`dask`) and
should be considered noise.

I think calendar support is preserved. I have found issues with using
leap years, non-uniform calendars, etc. However, this change now allows
passing `02-30` as a start or end date if the input uses `360_day`.
Haha, pretty sure that's almost useless, but it's there anyway.

The biggest difference is that the `xclim.indices` versions do not mask
incomplete periods anymore. With `aggregate_between_dates`, if the
`end_date` did not exist in the period, the output would be NaN. This is
not the case with `select_time`. The indicator is unchanged since the
missing check will mask the incomplete period.

### Does this PR introduce a breaking change?
I don't think the last element in a breaking change because:
- Indicators should always be used (especially if missing data are an
issue)
- I prefer that the indice doesn't perform "checks" that the indicator
should be doing

### Other information:
This brings down the usage of `aggregate_between_dates` to a single
function : `effective_growing_degree_days`. As the dates are there a
function of the year, we can't use `select_time`. To be optimized in the
future!
  • Loading branch information
Zeitsperre authored Oct 6, 2023
2 parents 5517f5e + 50cec23 commit 0e12a27
Show file tree
Hide file tree
Showing 5 changed files with 54 additions and 27 deletions.
1 change: 1 addition & 0 deletions CHANGES.rst
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ Bug fixes
* Fixed an error in the `pytest` configuration that prevented copying of testing data to thread-safe caches of workers under certain conditions (this should always occur). (:pull:`1473`).
* Coincidentally, this also fixes an error that caused `pytest` to error-out when invoked without an active internet connection. Running `pytest` without network access is now supported (requires cached testing data). (:issue:`1468`).
* Calling a ``sdba.map_blocks``-wrapped function with data chunked along the reduced dimensions will raise an error. This forbids chunking the trained dataset along the distribution dimensions, for example. (:issue:`1481`, :pull:`1482`).
* Optimization of indicators ``huglin_index`` and ``biologically_effective_degree_days`` when used with dask and flox. As a side effect, the indice functions (i.e. under ``xc.indices``) no longer mask incomplete periods. The indicators' output is unchanged under the default "check_missing" setting (:issue:`1494`, :pull:`1495`).

Breaking changes
^^^^^^^^^^^^^^^^
Expand Down
8 changes: 4 additions & 4 deletions tests/test_indices.py
Original file line number Diff line number Diff line change
Expand Up @@ -320,18 +320,18 @@ def test_bedd(self, method, end_date, deg_days, max_deg_days):
bedd[1], bedd[0]
) # Leap-year has slightly higher values
np.testing.assert_allclose(
bedd, np.array([deg_days, deg_days, deg_days, np.NaN]), rtol=6e-4
bedd[:3], np.array([deg_days, deg_days, deg_days]), rtol=6e-4
)
np.testing.assert_allclose(
bedd_hot, [max_deg_days, max_deg_days, max_deg_days, np.NaN], rtol=0.15
bedd_hot[:3], [max_deg_days, max_deg_days, max_deg_days], rtol=0.15
)

else:
np.testing.assert_allclose(
bedd, np.array([deg_days, deg_days, deg_days, np.NaN])
bedd[:3], np.array([deg_days, deg_days, deg_days])
)
np.testing.assert_array_equal(
bedd_hot, [max_deg_days, max_deg_days, max_deg_days, np.NaN]
bedd_hot[:3], [max_deg_days, max_deg_days, max_deg_days]
)
if method == "gladstones":
np.testing.assert_array_less(bedd, bedd_high_lat)
Expand Down
25 changes: 19 additions & 6 deletions xclim/core/calendar.py
Original file line number Diff line number Diff line change
Expand Up @@ -1465,6 +1465,7 @@ def select_time(
month: int | Sequence[int] = None,
doy_bounds: tuple[int, int] = None,
date_bounds: tuple[str, str] = None,
include_bounds: bool | tuple[bool, bool] = True,
) -> xr.DataArray | xr.Dataset:
"""Select entries according to a time period.
Expand All @@ -1488,11 +1489,13 @@ def select_time(
The bounds as (start, end) of the period of interest expressed in day-of-year,
integers going from 1 (January 1st) to 365 or 366 (December 31st). If calendar
awareness is needed, consider using ``date_bounds`` instead.
Bounds are inclusive.
date_bounds: 2-tuple of strings
The bounds as (start, end) of the period of interest expressed as dates in the
month-day (%m-%d) format.
Bounds are inclusive.
include_bounds: bool or 2-tuple of booleans
Whether the bounds of `doy_bounds` or `date_bounds` should be inclusive or not.
Either one value for both or a tuple.
Default is True, meaning bounds are inclusive.
Returns
-------
Expand Down Expand Up @@ -1529,10 +1532,19 @@ def select_time(
if N == 0:
return da

def get_doys(start, end):
def get_doys(start, end, inclusive):
if start <= end:
return np.arange(start, end + 1)
return np.concatenate((np.arange(start, 367), np.arange(0, end + 1)))
doys = np.arange(start, end + 1)
else:
doys = np.concatenate((np.arange(start, 367), np.arange(0, end + 1)))
if not inclusive[0]:
doys = doys[1:]
if not inclusive[1]:
doys = doys[:-1]
return doys

if isinstance(include_bounds, bool):
include_bounds = (include_bounds, include_bounds)

if season is not None:
if isinstance(season, str):
Expand All @@ -1545,7 +1557,7 @@ def get_doys(start, end):
mask = da.time.dt.month.isin(month)

elif doy_bounds is not None:
mask = da.time.dt.dayofyear.isin(get_doys(*doy_bounds))
mask = da.time.dt.dayofyear.isin(get_doys(*doy_bounds, include_bounds))

elif date_bounds is not None:
# This one is a bit trickier.
Expand All @@ -1564,6 +1576,7 @@ def get_doys(start, end):
doys = get_doys(
to_cftime_datetime("2000-" + start, calendar).dayofyr,
to_cftime_datetime("2000-" + end, calendar).dayofyr,
include_bounds,
)
mask = time.time.dt.dayofyear.isin(doys)
# Needed if we converted calendar, this puts back the correct coord
Expand Down
40 changes: 26 additions & 14 deletions xclim/indices/_agro.py
Original file line number Diff line number Diff line change
Expand Up @@ -263,12 +263,14 @@ def huglin_index(
)
k_aggregated = 1
elif method.lower() == "jones":
day_length = aggregate_between_dates(
day_lengths(dates=tas.time, lat=lat, method="simple"),
start=start_date,
end=end_date,
op="sum",
freq=freq,
day_length = (
select_time(
day_lengths(dates=tas.time, lat=lat, method="simple"),
date_bounds=(start_date, end_date),
include_bounds=(True, False),
)
.resample(time=freq)
.sum()
)
k = 1
k_aggregated = 2.8311e-4 * day_length + 0.30834
Expand All @@ -277,10 +279,13 @@ def huglin_index(

hi = (((tas + tasmax) / 2) - thresh).clip(min=0) * k
hi = (
aggregate_between_dates(hi, start=start_date, end=end_date, freq=freq)
select_time(
hi, date_bounds=(start_date, end_date), include_bounds=(True, False)
)
.resample(time=freq)
.sum()
* k_aggregated
)

hi.attrs["units"] = ""
return hi

Expand Down Expand Up @@ -418,11 +423,14 @@ def biologically_effective_degree_days(
)
k_aggregated = 1
else:
day_length = aggregate_between_dates(
day_lengths(dates=tasmin.time, lat=lat, method="simple"),
start=start_date,
end=end_date,
freq=freq,
day_length = (
select_time(
day_lengths(dates=tasmin.time, lat=lat, method="simple"),
date_bounds=(start_date, end_date),
include_bounds=(True, False),
)
.resample(time=freq)
.sum()
)
k = 1
k_huglin = 2.8311e-4 * day_length + 0.30834
Expand All @@ -439,7 +447,11 @@ def biologically_effective_degree_days(
)

bedd = (
aggregate_between_dates(bedd, start=start_date, end=end_date, freq=freq)
select_time(
bedd, date_bounds=(start_date, end_date), include_bounds=(True, False)
)
.resample(time=freq)
.sum()
* k_aggregated
)

Expand Down
7 changes: 4 additions & 3 deletions xclim/indices/helpers.py
Original file line number Diff line number Diff line change
Expand Up @@ -429,9 +429,10 @@ def day_lengths(
lat = convert_units_to(lat, "rad")
# arccos gives the hour-angle at sunset, multiply by 24 / 2π to get hours.
# The day length is twice that.
day_length_hours = (
(24 / np.pi) * np.arccos(-np.tan(lat) * np.tan(declination))
).assign_attrs(units="h")
with np.errstate(invalid="ignore"):
day_length_hours = (
(24 / np.pi) * np.arccos(-np.tan(lat) * np.tan(declination))
).assign_attrs(units="h")

return day_length_hours

Expand Down

0 comments on commit 0e12a27

Please sign in to comment.