From 377284d9cc8d8ab749c6076baad4f42f6f724443 Mon Sep 17 00:00:00 2001
From: Pascal Bourgault <bourgault.pascal@ouranos.ca>
Date: Fri, 27 Oct 2023 16:06:58 -0400
Subject: [PATCH 1/7] Add IPCC robustness to change_significance with example

---
 docs/notebooks/ensembles.ipynb | 182 +++++++++++++++++++++++++++++++++
 docs/references.bib            |  13 +++
 xclim/ensembles/_robustness.py |  21 +++-
 xclim/indices/generic.py       |  28 +++++
 4 files changed, 240 insertions(+), 4 deletions(-)

diff --git a/docs/notebooks/ensembles.ipynb b/docs/notebooks/ensembles.ipynb
index 0a48b600c..80222ac4a 100644
--- a/docs/notebooks/ensembles.ipynb
+++ b/docs/notebooks/ensembles.ipynb
@@ -275,6 +275,188 @@
     "ax.legend()\n",
     "plt.show()"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Change significance and model agreement\n",
+    "\n",
+    "Aside from arithmetic reductions of the ensemble like shown above, xclim also implements helper functions for characterizing the significance of simulated climatic change and whether the different members agree or not over the sign of the change. See [xc.ensembles.change_significance](../apidoc/xclim.ensembles.html#xclim.ensembles._robustness.change_significance). Here we will use that function to generate masks replicating the ones in the [IPCC Atlas](https://interactive-atlas.ipcc.ch/), based on the approaches outlined in Cross-Chapter Box 1 of the [IPCC Atlas (AR6, WG1)](https://doi.org/10.1017/9781009157896.021). Approach A concerns observational trend and is out-of-scope for this the ensembles module.\n",
+    "\n",
+    "First we generate some fake annual mean temperature data. `ref` is the data on the reference period and `fut` is a future projection. In our example, there will be 5 different members in the ensemble."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import xarray as xr\n",
+    "from matplotlib.patches import Rectangle\n",
+    "\n",
+    "xr.set_options(keep_attrs=True)\n",
+    "\n",
+    "# Reference period\n",
+    "ref = xr.DataArray(\n",
+    "    20 * np.random.random_sample((5, 30, 10, 10)) + 275,\n",
+    "    dims=(\"realization\", \"time\", \"lat\", \"lon\"),\n",
+    "    coords={\n",
+    "        \"time\": xr.date_range(\"1990\", periods=30, freq=\"YS\"),\n",
+    "        \"lat\": np.arange(40, 50),\n",
+    "        \"lon\": np.arange(-70, -60),\n",
+    "    },\n",
+    "    attrs={\"units\": \"K\"},\n",
+    ")\n",
+    "\n",
+    "# Future\n",
+    "fut = xr.DataArray(\n",
+    "    20 * np.random.random_sample((5, 30, 10, 10)) + 275,\n",
+    "    dims=(\"realization\", \"time\", \"lat\", \"lon\"),\n",
+    "    coords={\n",
+    "        \"time\": xr.date_range(\"2070\", periods=30, freq=\"YS\"),\n",
+    "        \"lat\": np.arange(40, 50),\n",
+    "        \"lon\": np.arange(-70, -60),\n",
+    "    },\n",
+    "    attrs={\"units\": \"K\"},\n",
+    ")\n",
+    "# Add change.\n",
+    "fut = fut + xr.concat(\n",
+    "    [\n",
+    "        xr.DataArray(np.linspace(15, north_delta, num=10), dims=(\"lat\",))\n",
+    "        for north_delta in [15, 10, 0, -7, -10]\n",
+    "    ],\n",
+    "    \"realization\",\n",
+    ")\n",
+    "\n",
+    "deltas = (fut.mean(\"time\") - ref.mean(\"time\")).assign_attrs(\n",
+    "    long_name=\"Temperature change\"\n",
+    ")\n",
+    "mean_delta = deltas.mean(\"realization\")\n",
+    "deltas.plot(col=\"realization\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Approach B: The simple method.\n",
+    "\n",
+    "Zones where there is no agreement on the sign of the change are to be hatched. We call the function without any change significance test."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chg_frac, pos_frac = ensembles.change_significance(fut, ref, test=None)\n",
+    "agreement = (pos_frac > 0.8) | (pos_frac < 0.2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here, `chg_frac` is the fraction of models showing significant change. Since `test = None`, this array is all ones, significance was not tested. `pos_frac` is the fraction of members showing positive change. Usually, when less than 80% of the members agree on the sign of the change, the region is hatched."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fig, ax = plt.subplots(figsize=(6, 4))\n",
+    "mean_delta.plot.contourf(ax=ax)\n",
+    "agreement.plot.contourf(\n",
+    "    ax=ax,\n",
+    "    levels=[-0.5, 0.5, 1.5],\n",
+    "    cmap=\"none\",\n",
+    "    hatches=[\"///\", None],\n",
+    "    add_colorbar=False,\n",
+    ")\n",
+    "ax.legend(\n",
+    "    handles=[\n",
+    "        Rectangle((0, 0), 2, 2, fill=False, hatch=\"///\", label=\"Low model agreement\")\n",
+    "    ],\n",
+    "    bbox_to_anchor=(0.0, 1.1),\n",
+    "    loc=\"upper left\",\n",
+    ");"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Approach C: The advanced alternative.\n",
+    "\n",
+    "Here, we define three categories : robust change, conflicting change and no significant change or robust change. Change is said to be significant if the delta exceed an internal variability. The IPCC document suggests two different ways for computing this variability threshold:\n",
+    "\n",
+    "- $\\gamma = \\sqrt{2}\\cdot1.645\\cdot\\sigma_{20yr}$ where $\\sigma_{20yr}$ is the standard deviation of 20-year means, computed from non-overlapping periods in the pre-industrical control runs, after detrending with a quadratic fit.\n",
+    "- $\\gamma = \\sqrt{\\frac{2}{20}}\\cdot1.645\\cdot\\sigma_{1yr}$ where $\\sigma_{1yr}$ is the interanual standard deviation computed over the linearly detrended reference period.\n",
+    "\n",
+    "The first method involves data that is not always easily available. When such data is at hand, the xclim function could be used with a custom computation of the $\\gamma$ (`gamma`) threshold:\n",
+    "\n",
+    "```python\n",
+    "chg_frac, pos_frac = ensembles.change_significance(fut, ref, test='threshold', abs_thresh=gamma)\n",
+    "```\n",
+    "\n",
+    "The second method is more accessible and xclim implements it directly. However, `pos_frac` output of xclim is the fraction of the members showing _significant_ positive change, while the IPCC defines the agreement separately from the change significance. So we run the function twice."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chg_frac, _ = ensembles.change_significance(fut, ref, test=\"ipcc-advanced\")\n",
+    "_, pos_frac = ensembles.change_significance(fut, ref, test=None)\n",
+    "\n",
+    "# Fraction of agreement, i.e.: max of positive or negative fraction\n",
+    "agree_frac = xr.concat((chg_frac * pos_frac, chg_frac * (1 - pos_frac)), \"sign\").max(\n",
+    "    \"sign\"\n",
+    ")\n",
+    "\n",
+    "robust = xr.where((chg_frac >= 0.66) & (agree_frac >= 0.8), 1, 0)\n",
+    "nochange_notrobust = xr.where((chg_frac < 0.66), 2, 0)\n",
+    "conflicting = xr.where((chg_frac >= 0.66) & (agree_frac < 0.8), 3, 0)\n",
+    "\n",
+    "mask = robust + nochange_notrobust + conflicting"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fig, ax = plt.subplots(figsize=(6, 4))\n",
+    "mean_delta.plot.contourf(ax=ax)\n",
+    "mask.plot.contourf(\n",
+    "    ax=ax,\n",
+    "    levels=[0.5, 1.5, 2.5, 3.5],\n",
+    "    cmap=\"none\",\n",
+    "    hatches=[None, \"\\\\\\\\\\\\\", \"xxx\"],\n",
+    "    add_colorbar=False,\n",
+    ")\n",
+    "ax.legend(\n",
+    "    handles=[\n",
+    "        Rectangle((0, 0), 2, 2, fill=False, hatch=h, label=lbl)\n",
+    "        for h, lbl in (\n",
+    "            [\"\\\\\\\\\\\\\", \"No significant change\"],\n",
+    "            [\"xxx\", \"Conflicting signals\"],\n",
+    "        )\n",
+    "    ],\n",
+    "    bbox_to_anchor=(0.0, 1.1),\n",
+    "    loc=\"upper left\",\n",
+    "    ncols=2,\n",
+    ");"
+   ]
   }
  ],
  "metadata": {
diff --git a/docs/references.bib b/docs/references.bib
index 1ccf861a5..4d9afb8ce 100644
--- a/docs/references.bib
+++ b/docs/references.bib
@@ -2073,3 +2073,16 @@ @article{tobin_2018
 title = {Vulnerabilities and resilience of European power generation to 1.5°C, 2°C and 3°C warming},
 journal = {Environmental Research Letters}
 }
+
+
+@inbook{
+ipccatlas_ar6wg1,
+place={Cambridge},
+title={Atlas},
+DOI={10.1017/9781009157896.021},
+booktitle={Climate Change 2021 – The Physical Science Basis: Working Group I Contribution to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change},
+publisher={Cambridge University Press},
+author={Intergovernmental Panel on Climate Change (IPCC)},
+year={2023},
+pages={1927–2058}
+}
diff --git a/xclim/ensembles/_robustness.py b/xclim/ensembles/_robustness.py
index ea52180b2..8ae4307c0 100644
--- a/xclim/ensembles/_robustness.py
+++ b/xclim/ensembles/_robustness.py
@@ -17,6 +17,7 @@
 from packaging.version import Version
 
 from xclim.core.formatting import update_history
+from xclim.indices.generic import detrend
 
 
 def change_significance(  # noqa: C901
@@ -49,7 +50,7 @@ def change_significance(  # noqa: C901
         a distribution across the future period.
         `fut` and `ref` must be of the same type (Dataset or DataArray). If they are
         Dataset, they must have the same variables (name and coords).
-    test : {'ttest', 'welch-ttest', 'mannwhitney-utest', 'brownforsythe-test', 'threshold', None}
+    test : {'ttest', 'welch-ttest', 'mannwhitney-utest', 'brownforsythe-test', 'ipcc-advanced', 'threshold', None}
         Name of the statistical test used to determine if there was significant change. See notes.
     weights : xr.DataArray
         Weights to apply along the 'realization' dimension. This array cannot contain missing values.
@@ -111,13 +112,19 @@ def change_significance(  # noqa: C901
         Brown-Forsythe test assuming skewed, non-normal distributions. Same significance criterion as 'ttest'.
       'threshold' :
         Change is considered significative if the absolute delta exceeds a given threshold (absolute or relative).
+      'ipcc-advanced' :
+        An approximation of the "advanced approach" used in the IPCC Atlas and described in :cite:t:`ipccatlas_ar6wg1`.
+        Change is considered significant if the delta exceeds :math:`1.645\sqrt{\frac{2}{10}}\sigma_{ref'}$ where $\sigma_{ref'}` is
+        the interannual standard deviation of the linearly detrended reference (`ref`). See notebook :ref:`notebooks/ensembles:Ensembles`
+        for more details.
       None :
-        Significant change is not tested and, thus, members showing no change are
-        included in the `sign_frac` output.
+        Significant change is not tested. Members showing any non-zero positive change are
+        included in the `pos_frac` output.
 
     References
     ----------
     :cite:cts:`tebaldi_mapping_2011`
+    :cite:cts:`ipccatlas_ar6wg1`
 
     Example
     -------
@@ -167,6 +174,7 @@ def change_significance(  # noqa: C901
         "mannwhitney-utest": ["p_change"],
         "brownforsythe-test": ["p_change"],
         "threshold": ["abs_thresh", "rel_thresh"],
+        "ipcc-advanced": [],
     }
 
     # Get delta, either from fut or from fut - ref
@@ -323,7 +331,12 @@ def mwu_wrapper(f, r):  # This specific test can't manage an all-NaN slice
             changed = abs(delta / ref.mean("time")) > kwargs["rel_thresh"]
         else:
             raise ValueError("Invalid argument combination for test='threshold'.")
-
+    elif test == "ipcc-advanced":
+        # Detrend ref
+        refy = ref.resample(time="YS").mean()
+        ref_detrended = detrend(refy, dim="time", deg=1)
+        gamma = np.sqrt(2 / 20) * 1.645 * ref_detrended.std("time")
+        changed = abs(delta) > gamma
     elif test is not None:
         raise ValueError(
             f"Statistical test {test} must be one of {', '.join(test_params.keys())}."
diff --git a/xclim/indices/generic.py b/xclim/indices/generic.py
index 674dc2185..f552eb5b3 100644
--- a/xclim/indices/generic.py
+++ b/xclim/indices/generic.py
@@ -40,6 +40,7 @@
     "count_occurrences",
     "cumulative_difference",
     "default_freq",
+    "detrend",
     "diurnal_temperature_range",
     "domain_count",
     "doymax",
@@ -1004,3 +1005,30 @@ def _get_zone(da):
         )
 
     return zones
+
+
+def detrend(ds, dim="time", deg=1):
+    """Detrend data along a given dimension computing a polynomial trend of a given order.
+
+    Parameters
+    ----------
+    ds : xr.Dataset or xr.DataArray
+      The data to detrend. If a Dataset, detrending is done on all data variables.
+    dim : str
+      Dimension along which to compute the trend.
+    deg : int
+      Degree of the polynomial to fit.
+
+    Returns
+    -------
+    detrended : xr.Dataset or xr.DataArray
+      Same as `ds`, but with its trend removed (subtracted).
+    """
+    if isinstance(ds, xr.Dataset):
+        return ds.map(detrend, keep_attrs=False, dim=dim, deg=deg)
+    # is a DataArray
+    # detrend along a single dimension
+    coeff = ds.polyfit(dim=dim, deg=deg)
+    trend = xr.polyval(ds[dim], coeff.polyfit_coefficients)
+    with xr.set_options(keep_attrs=True):
+        return ds - trend

From e026cb468b8a22e0767d58bd14992d582d79c7e8 Mon Sep 17 00:00:00 2001
From: Pascal Bourgault <bourgault.pascal@ouranos.ca>
Date: Fri, 27 Oct 2023 16:09:44 -0400
Subject: [PATCH 2/7] upd chgs

---
 CHANGES.rst | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/CHANGES.rst b/CHANGES.rst
index 30de3b8c8..ff7841f93 100644
--- a/CHANGES.rst
+++ b/CHANGES.rst
@@ -2,6 +2,14 @@
 Changelog
 =========
 
+v0.47.0 (unreleased)
+--------------------
+Contributors to this version: Pascal Bourgault (:user:`aulemahal`).
+
+New features and enhancements
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+* New ``ipcc-advanced`` test added to ``xclim.ensembles.change_significance``.
+
 v0.46.0 (2023-10-24)
 --------------------
 Contributors to this version: David Huard (:user:`huard`), Trevor James Smith (:user:`Zeitsperre`), Pascal Bourgault (:user:`aulemahal`), Éric Dupuis (:user:`coxipi`).

From a42a65d84edc2ccbac76abb99e6dfbc333a4fa62 Mon Sep 17 00:00:00 2001
From: Pascal Bourgault <bourgault.pascal@ouranos.ca>
Date: Mon, 30 Oct 2023 11:05:11 -0400
Subject: [PATCH 3/7] Try rst in link instead

---
 docs/notebooks/ensembles.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/notebooks/ensembles.ipynb b/docs/notebooks/ensembles.ipynb
index 80222ac4a..83d15d3f1 100644
--- a/docs/notebooks/ensembles.ipynb
+++ b/docs/notebooks/ensembles.ipynb
@@ -282,7 +282,7 @@
    "source": [
     "### Change significance and model agreement\n",
     "\n",
-    "Aside from arithmetic reductions of the ensemble like shown above, xclim also implements helper functions for characterizing the significance of simulated climatic change and whether the different members agree or not over the sign of the change. See [xc.ensembles.change_significance](../apidoc/xclim.ensembles.html#xclim.ensembles._robustness.change_significance). Here we will use that function to generate masks replicating the ones in the [IPCC Atlas](https://interactive-atlas.ipcc.ch/), based on the approaches outlined in Cross-Chapter Box 1 of the [IPCC Atlas (AR6, WG1)](https://doi.org/10.1017/9781009157896.021). Approach A concerns observational trend and is out-of-scope for this the ensembles module.\n",
+    "Aside from arithmetic reductions of the ensemble like shown above, xclim also implements helper functions for characterizing the significance of simulated climatic change and whether the different members agree or not over the sign of the change. See [xc.ensembles.change_significance](../apidoc/xclim.ensembles.rstl#xclim.ensembles._robustness.change_significance). Here we will use that function to generate masks replicating the ones in the [IPCC Atlas](https://interactive-atlas.ipcc.ch/), based on the approaches outlined in Cross-Chapter Box 1 of the [IPCC Atlas (AR6, WG1)](https://doi.org/10.1017/9781009157896.021). Approach A concerns observational trend and is out-of-scope for this the ensembles module.\n",
     "\n",
     "First we generate some fake annual mean temperature data. `ref` is the data on the reference period and `fut` is a future projection. In our example, there will be 5 different members in the ensemble."
    ]

From 880c963fc86c83f9adf5de760a8f148c3467e45f Mon Sep 17 00:00:00 2001
From: Pascal Bourgault <bourgault.pascal@ouranos.ca>
Date: Wed, 1 Nov 2023 14:53:30 -0400
Subject: [PATCH 4/7] Apply suggestions from code review

Co-authored-by: David Huard <huard.david@ouranos.ca>
---
 docs/notebooks/ensembles.ipynb | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/notebooks/ensembles.ipynb b/docs/notebooks/ensembles.ipynb
index 83d15d3f1..a4e1ef24b 100644
--- a/docs/notebooks/ensembles.ipynb
+++ b/docs/notebooks/ensembles.ipynb
@@ -282,7 +282,7 @@
    "source": [
     "### Change significance and model agreement\n",
     "\n",
-    "Aside from arithmetic reductions of the ensemble like shown above, xclim also implements helper functions for characterizing the significance of simulated climatic change and whether the different members agree or not over the sign of the change. See [xc.ensembles.change_significance](../apidoc/xclim.ensembles.rstl#xclim.ensembles._robustness.change_significance). Here we will use that function to generate masks replicating the ones in the [IPCC Atlas](https://interactive-atlas.ipcc.ch/), based on the approaches outlined in Cross-Chapter Box 1 of the [IPCC Atlas (AR6, WG1)](https://doi.org/10.1017/9781009157896.021). Approach A concerns observational trend and is out-of-scope for this the ensembles module.\n",
+    "Aside from arithmetic reductions of the ensemble as shown above, xclim also implements helper functions characterizing the significance of simulated climatic change, and whether the different members agree or not over the sign of the change. Here we will use function [xc.ensembles.change_significance](../apidoc/xclim.ensembles.rstl#xclim.ensembles._robustness.change_significance) to generate masks replicating the ones in the [IPCC Atlas](https://interactive-atlas.ipcc.ch/), based on two model ensemble approaches outlined in Cross-Chapter Box 1 of the [IPCC Atlas chapter (AR6, WG1)](https://doi.org/10.1017/9781009157896.021).\n",
     "\n",
     "First we generate some fake annual mean temperature data. `ref` is the data on the reference period and `fut` is a future projection. In our example, there will be 5 different members in the ensemble."
    ]
@@ -394,12 +394,12 @@
    "source": [
     "#### Approach C: The advanced alternative.\n",
     "\n",
-    "Here, we define three categories : robust change, conflicting change and no significant change or robust change. Change is said to be significant if the delta exceed an internal variability. The IPCC document suggests two different ways for computing this variability threshold:\n",
+    "Here, we define three categories : robust change, conflicting change, and no significant change or robust change. Change is said to be significant if the signal emerges from internal variability. The IPCC AR6 suggests two different ways for computing this variability threshold:\n",
     "\n",
-    "- $\\gamma = \\sqrt{2}\\cdot1.645\\cdot\\sigma_{20yr}$ where $\\sigma_{20yr}$ is the standard deviation of 20-year means, computed from non-overlapping periods in the pre-industrical control runs, after detrending with a quadratic fit.\n",
+    "- $\\gamma = \\sqrt{2}\\cdot1.645\\cdot\\sigma_{20yr}$ where $\\sigma_{20yr}$ is the standard deviation of 20-year means, computed from non-overlapping periods in the pre-industrial control runs, after detrending with a quadratic fit.\n",
     "- $\\gamma = \\sqrt{\\frac{2}{20}}\\cdot1.645\\cdot\\sigma_{1yr}$ where $\\sigma_{1yr}$ is the interanual standard deviation computed over the linearly detrended reference period.\n",
     "\n",
-    "The first method involves data that is not always easily available. When such data is at hand, the xclim function could be used with a custom computation of the $\\gamma$ (`gamma`) threshold:\n",
+    "The first method involves data from pre-industrial control simulations that are not always easily available. When such data is at hand, the xclim function could be used with a custom computation of the $\\gamma$ (`gamma`) threshold:\n",
     "\n",
     "```python\n",
     "chg_frac, pos_frac = ensembles.change_significance(fut, ref, test='threshold', abs_thresh=gamma)\n",

From 683b37629bfcc297f0c71756483e621ca51414d1 Mon Sep 17 00:00:00 2001
From: Pascal Bourgault <bourgault.pascal@ouranos.ca>
Date: Thu, 2 Nov 2023 17:10:45 -0400
Subject: [PATCH 5/7] Complete rewrite of the robustness tools

---
 CHANGES.rst                    |   2 +-
 docs/notebooks/ensembles.ipynb | 124 +++---
 xclim/core/formatting.py       |   2 +-
 xclim/ensembles/__init__.py    |   7 +-
 xclim/ensembles/_robustness.py | 762 ++++++++++++++++++++-------------
 5 files changed, 537 insertions(+), 360 deletions(-)

diff --git a/CHANGES.rst b/CHANGES.rst
index d61b621b0..7c728e595 100644
--- a/CHANGES.rst
+++ b/CHANGES.rst
@@ -8,7 +8,7 @@ Contributors to this version: Juliette Lavoie (:user:`juliettelavoie`), Pascal B
 
 New features and enhancements
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-* New ``ipcc-advanced`` test added to ``xclim.ensembles.change_significance``.
+* New functions ``xclim.ensembles.robustness_fractions`` and ``xclim.ensembles.robustness_categories``. The former will replace ``xclim.ensembles.change_significance`` which is now deprecated and will be removed in xclim 0.49.
 
 Bug fixes
 ^^^^^^^^^
diff --git a/docs/notebooks/ensembles.ipynb b/docs/notebooks/ensembles.ipynb
index a4e1ef24b..2dffc73d0 100644
--- a/docs/notebooks/ensembles.ipynb
+++ b/docs/notebooks/ensembles.ipynb
@@ -164,6 +164,7 @@
     "# Set display to HTML style (for fancy output)\n",
     "xr.set_options(display_style=\"html\", display_width=50)\n",
     "\n",
+    "import matplotlib as mpl\n",
     "import matplotlib.pyplot as plt\n",
     "\n",
     "%matplotlib inline\n",
@@ -282,9 +283,14 @@
    "source": [
     "### Change significance and model agreement\n",
     "\n",
-    "Aside from arithmetic reductions of the ensemble as shown above, xclim also implements helper functions characterizing the significance of simulated climatic change, and whether the different members agree or not over the sign of the change. Here we will use function [xc.ensembles.change_significance](../apidoc/xclim.ensembles.rstl#xclim.ensembles._robustness.change_significance) to generate masks replicating the ones in the [IPCC Atlas](https://interactive-atlas.ipcc.ch/), based on two model ensemble approaches outlined in Cross-Chapter Box 1 of the [IPCC Atlas chapter (AR6, WG1)](https://doi.org/10.1017/9781009157896.021).\n",
+    "When communicating climate change through plots of projected change, it is often useful to add information on the statistical significance of the values. A common way to represent this information without overloading the figures is through hatching patterns surimposed on the primary data. Two aspects are usually shown : \n",
     "\n",
-    "First we generate some fake annual mean temperature data. `ref` is the data on the reference period and `fut` is a future projection. In our example, there will be 5 different members in the ensemble."
+    "- change significance : whether most of the ensemble members project a climate change signal statistically significant in comparison to their internal variability.\n",
+    "- model agreement : whether the different ensemble members agree on the sign of the change.\n",
+    "\n",
+    "We can then divide the plotted points into categories each with its own hatching pattern, usually leaving the robust data (models agree and all show significant change) without hatching. \n",
+    "\n",
+    "Xclim provides some tools to help in generating these hatching masks. First is [xc.ensembles.robustness_fractions](../apidoc/xclim.ensembles.rst#xclim.ensembles._robustness.robustness_fractions) that can characterize the change significance and sign agreement accross ensemble members. To demonstrate its usage, we'll first generate some fake annual mean temperature data. Here, `ref` is the data on the reference period and `fut` is a future projection. There are be 5 different members in the ensemble. We tweaked the generation so that all models agree on significant change in the \"south\" while agreement and signifiance of change decreases as we go north and east."
    ]
   },
   {
@@ -342,9 +348,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "#### Approach B: The simple method.\n",
+    "Change significance can be determined in a lot of different ways. Xclim provides some simple and some more complicated statistical test in `robustness_fractions`. In this example, we'll follow the suggestions found in the Cross-Chapter Box 1 of the [IPCC Atlas chapter (AR6, WG1)](https://doi.org/10.1017/9781009157896.021). Specifically, we are following Approach C, using the alternative for when pre-industrial control data is not available.\n",
     "\n",
-    "Zones where there is no agreement on the sign of the change are to be hatched. We call the function without any change significance test."
+    "We first compute the different fractions for each robustness aspect."
    ]
   },
   {
@@ -353,15 +359,23 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "chg_frac, pos_frac = ensembles.change_significance(fut, ref, test=None)\n",
-    "agreement = (pos_frac > 0.8) | (pos_frac < 0.2)"
+    "fractions = ensembles.robustness_fractions(fut, ref, test=\"ipcc-ar6-c\")\n",
+    "fractions"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Here, `chg_frac` is the fraction of models showing significant change. Since `test = None`, this array is all ones, significance was not tested. `pos_frac` is the fraction of members showing positive change. Usually, when less than 80% of the members agree on the sign of the change, the region is hatched."
+    "In this output we have:\n",
+    "\n",
+    "- `changed` : The fraction of members showing significant change.\n",
+    "- `positive` : The fraction of members showing positive change, no matter if it is significant or not.\n",
+    "- `changed_positive` : The fraction of members showing significant AND positive change.\n",
+    "- `agree` : The fraction of members agreeing on the sign of change. This is the maximum between `positive` and `1 - positive`.\n",
+    "- `valid` : The fraction of \"valid\" members. A member is valid is there are no NaNs along the time axes of `fut` and  `ref`. In our case, it is 1 everywhere.\n",
+    "\n",
+    "For example, here's the plot of the fraction of members showing significant change."
    ]
   },
   {
@@ -370,42 +384,14 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "fig, ax = plt.subplots(figsize=(6, 4))\n",
-    "mean_delta.plot.contourf(ax=ax)\n",
-    "agreement.plot.contourf(\n",
-    "    ax=ax,\n",
-    "    levels=[-0.5, 0.5, 1.5],\n",
-    "    cmap=\"none\",\n",
-    "    hatches=[\"///\", None],\n",
-    "    add_colorbar=False,\n",
-    ")\n",
-    "ax.legend(\n",
-    "    handles=[\n",
-    "        Rectangle((0, 0), 2, 2, fill=False, hatch=\"///\", label=\"Low model agreement\")\n",
-    "    ],\n",
-    "    bbox_to_anchor=(0.0, 1.1),\n",
-    "    loc=\"upper left\",\n",
-    ");"
+    "fractions.changed.plot(figsize=(6, 4))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "#### Approach C: The advanced alternative.\n",
-    "\n",
-    "Here, we define three categories : robust change, conflicting change, and no significant change or robust change. Change is said to be significant if the signal emerges from internal variability. The IPCC AR6 suggests two different ways for computing this variability threshold:\n",
-    "\n",
-    "- $\\gamma = \\sqrt{2}\\cdot1.645\\cdot\\sigma_{20yr}$ where $\\sigma_{20yr}$ is the standard deviation of 20-year means, computed from non-overlapping periods in the pre-industrial control runs, after detrending with a quadratic fit.\n",
-    "- $\\gamma = \\sqrt{\\frac{2}{20}}\\cdot1.645\\cdot\\sigma_{1yr}$ where $\\sigma_{1yr}$ is the interanual standard deviation computed over the linearly detrended reference period.\n",
-    "\n",
-    "The first method involves data from pre-industrial control simulations that are not always easily available. When such data is at hand, the xclim function could be used with a custom computation of the $\\gamma$ (`gamma`) threshold:\n",
-    "\n",
-    "```python\n",
-    "chg_frac, pos_frac = ensembles.change_significance(fut, ref, test='threshold', abs_thresh=gamma)\n",
-    "```\n",
-    "\n",
-    "The second method is more accessible and xclim implements it directly. However, `pos_frac` output of xclim is the fraction of the members showing _significant_ positive change, while the IPCC defines the agreement separately from the change significance. So we run the function twice."
+    "Xclim provides all this so that one can construct their own robustness maps the way they want. Often, hatching overlays are based on categories defined by some thresholds on the significant change and agreement fractions. The [`xclim.ensembles.robustness_categories`](../apidoc/xclim.ensembles.rst#xclim.ensembles._robustness.robustness_categories) function helps for that common case and defaults to the categories and thresholds used by the IPCC in its Atlas."
    ]
   },
   {
@@ -414,19 +400,15 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "chg_frac, _ = ensembles.change_significance(fut, ref, test=\"ipcc-advanced\")\n",
-    "_, pos_frac = ensembles.change_significance(fut, ref, test=None)\n",
-    "\n",
-    "# Fraction of agreement, i.e.: max of positive or negative fraction\n",
-    "agree_frac = xr.concat((chg_frac * pos_frac, chg_frac * (1 - pos_frac)), \"sign\").max(\n",
-    "    \"sign\"\n",
-    ")\n",
-    "\n",
-    "robust = xr.where((chg_frac >= 0.66) & (agree_frac >= 0.8), 1, 0)\n",
-    "nochange_notrobust = xr.where((chg_frac < 0.66), 2, 0)\n",
-    "conflicting = xr.where((chg_frac >= 0.66) & (agree_frac < 0.8), 3, 0)\n",
-    "\n",
-    "mask = robust + nochange_notrobust + conflicting"
+    "robustness = ensembles.robustness_categories(fractions)\n",
+    "robustness"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The output is a categorical map following the \"flag variables\" CF conventions. Parameters needed for plotting are found in the attributes. "
    ]
   },
   {
@@ -435,22 +417,40 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "robustness.plot(figsize=(6, 4))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Matplotlib doesn't provide an easy way of plotting categorial data with a proper legend, so our real plotting script is a bit more complicated, but xclim's output makes it easier."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cmap = mpl.colors.ListedColormap([\"none\"])  # So we can deactivate pcolor's colormapping\n",
+    "\n",
     "fig, ax = plt.subplots(figsize=(6, 4))\n",
-    "mean_delta.plot.contourf(ax=ax)\n",
-    "mask.plot.contourf(\n",
-    "    ax=ax,\n",
-    "    levels=[0.5, 1.5, 2.5, 3.5],\n",
-    "    cmap=\"none\",\n",
-    "    hatches=[None, \"\\\\\\\\\\\\\", \"xxx\"],\n",
-    "    add_colorbar=False,\n",
-    ")\n",
+    "mean_delta.plot(ax=ax)\n",
+    "# For each flag value plot the corresponding hatch.\n",
+    "for val, ha in zip(robustness.flag_values, [None, \"\\\\\\\\\\\\\", \"xxx\"]):\n",
+    "    ax.pcolor(\n",
+    "        robustness.lon,\n",
+    "        robustness.lat,\n",
+    "        robustness.where(robustness == val),\n",
+    "        hatch=ha,\n",
+    "        cmap=cmap,\n",
+    "    )\n",
+    "\n",
     "ax.legend(\n",
     "    handles=[\n",
     "        Rectangle((0, 0), 2, 2, fill=False, hatch=h, label=lbl)\n",
-    "        for h, lbl in (\n",
-    "            [\"\\\\\\\\\\\\\", \"No significant change\"],\n",
-    "            [\"xxx\", \"Conflicting signals\"],\n",
-    "        )\n",
+    "        for h, lbl in zip([\"\\\\\\\\\\\\\", \"xxx\"], robustness.flag_descriptions[1:])\n",
     "    ],\n",
     "    bbox_to_anchor=(0.0, 1.1),\n",
     "    loc=\"upper left\",\n",
diff --git a/xclim/core/formatting.py b/xclim/core/formatting.py
index 121072a5d..af0d359ef 100644
--- a/xclim/core/formatting.py
+++ b/xclim/core/formatting.py
@@ -434,7 +434,7 @@ def _call_and_add_history(*args, **kwargs):
         attr = update_history(
             gen_call_string(func.__name__, **bound_args.arguments),
             *da_list,
-            new_name=out.name,
+            new_name=out.name if isinstance(out, xr.DataArray) else None,
             **da_dict,
         )
         out.attrs["history"] = attr
diff --git a/xclim/ensembles/__init__.py b/xclim/ensembles/__init__.py
index 57c3e8d0f..27deb868b 100644
--- a/xclim/ensembles/__init__.py
+++ b/xclim/ensembles/__init__.py
@@ -17,4 +17,9 @@
     make_criteria,
     plot_rsqprofile,
 )
-from ._robustness import change_significance, robustness_coefficient
+from ._robustness import (
+    change_significance,
+    robustness_categories,
+    robustness_coefficient,
+    robustness_fractions,
+)
diff --git a/xclim/ensembles/_robustness.py b/xclim/ensembles/_robustness.py
index 8ae4307c0..64d982ee6 100644
--- a/xclim/ensembles/_robustness.py
+++ b/xclim/ensembles/_robustness.py
@@ -9,6 +9,7 @@
 from __future__ import annotations
 
 import warnings
+from inspect import Parameter, signature
 
 import numpy as np
 import scipy
@@ -16,110 +17,117 @@
 import xarray as xr
 from packaging.version import Version
 
-from xclim.core.formatting import update_history
-from xclim.indices.generic import detrend
+from xclim.core.formatting import gen_call_string, update_xclim_history
+from xclim.indices.generic import compare, detrend
 
+__all__ = [
+    "change_significance",
+    "robustness_categories",
+    "robustness_coefficient",
+    "robustness_fractions",
+]
 
-def change_significance(  # noqa: C901
-    fut: xr.DataArray | xr.Dataset,
-    ref: xr.DataArray | xr.Dataset = None,
-    test: str | None = "ttest",
+
+SIGNIFICANCE_TESTS = {}
+"""Registry of change significance tests.
+
+New tests must be decorated with :py:func:`significance_test` and fullfill the following requirements:
+
+- Function name should begin by "_", registered test name is the function name without its first character and with _ replaced by -.
+- Function must accept 2 positional arguments : fut and ref (see :py:func:`change_significance` for definitions)
+- Function may accept other keyword-only arguments.
+- Function must return 2 values :
+    + `changed` : 1D boolean array along `realization`. True for realization with significant change.
+    + `pvals` : 1D float array along `realization`. P-values of the statistical test. Should be `None` for test where is doesn't apply.
+"""
+
+
+def significance_test(func):
+    """Register a significance test for use in :py:func:`change_significance`.
+
+    See :py:data:`SIGNIFICANCE_TESTS`.
+    """
+    SIGNIFICANCE_TESTS[func.__name__[1:].replace("_", "-")] = func
+    return func
+
+
+# This function's docstring is modified to inlude the registered test names and docs.
+# See end of this file.
+@update_xclim_history
+def robustness_fractions(  # noqa: C901
+    fut: xr.DataArray,
+    ref: xr.DataArray | None = None,
+    test: str | None = None,
     weights: xr.DataArray = None,
-    p_vals: bool = False,
     **kwargs,
-) -> (
-    tuple[xr.DataArray | xr.Dataset, xr.DataArray | xr.Dataset]
-    | tuple[
-        xr.DataArray | xr.Dataset,
-        xr.DataArray | xr.Dataset,
-        xr.DataArray | xr.Dataset | None,
-    ]
-):
+) -> xr.Dataset:
     r"""Robustness statistics qualifying how members of an ensemble agree on the existence of change and on its sign.
 
     Parameters
     ----------
-    fut : xr.DataArray or xr.Dataset
+    fut : xr.DataArray
         Future period values along 'realization' and 'time' (..., nr, nt1)
         or if `ref` is None, Delta values along `realization` (..., nr).
-    ref : Union[xr.DataArray, xr.Dataset], optional
-        Reference period values along realization' and 'time'  (..., nt2, nr).
+    ref : xr.DataArray, optional
+        Reference period values along realization' and 'time'  (..., nr, nt2).
         The size of the 'time' axis does not need to match the one of `fut`.
-        But their 'realization' axes must be identical.
+        But their 'realization' axes must be identical and the other coordinates should be the same.
         If `None` (default), values of `fut` are assumed to be deltas instead of
         a distribution across the future period.
-        `fut` and `ref` must be of the same type (Dataset or DataArray). If they are
-        Dataset, they must have the same variables (name and coords).
-    test : {'ttest', 'welch-ttest', 'mannwhitney-utest', 'brownforsythe-test', 'ipcc-advanced', 'threshold', None}
+    test : {tests_list}, optional
         Name of the statistical test used to determine if there was significant change. See notes.
     weights : xr.DataArray
         Weights to apply along the 'realization' dimension. This array cannot contain missing values.
-        Only tests "threshold" and "None" are currently supported with weighted arrays.
-    p_vals : bool
-        If True, return the estimated p-values.
     \*\*kwargs
-        Other arguments specific to the statistical test.
-
-        For 'ttest', 'welch-ttest', 'mannwhitney-utest' and 'brownforsythe-test':
-            p_change : float (default : 0.05)
-                p-value threshold for rejecting the hypothesis of no significant change.
-        For 'threshold': (Only one of those must be given.)
-            abs_thresh : float (no default)
-                Threshold for the (absolute) change to be considered significative.
-            rel_thresh : float (no default, in [0, 1])
-                Threshold for the relative change (in reference to ref) to be significative.
-                Only valid if `ref` is given.
+        Other arguments specific to the statistical test. See notes.
 
     Returns
     -------
-    change_frac :  xr.DataArray or xr.Dataset
-        The fraction of members that show significant change [0, 1].
-        Passing `test=None` yields change_frac = 1 everywhere. Same type as `fut`.
-    pos_frac : xr.DataArray or xr.Dataset
-        The fraction of members showing significant change that show a positive change ]0, 1].
-        Null values are returned where no members show significant change.
-    pvals [Optional] : xr.DataArray or xr.Dataset or None
-        The p-values estimated by the significance tests. Only returned if `p_vals` is True. None
-        if `test` is one of 'ttest', 'welch-ttest', 'mannwhitney-utest' or 'brownforsythe-test'.
-
-        The table below shows the coefficient needed to retrieve the number of members
-        that have the indicated characteristics, by multiplying it to the total
-        number of members (`fut.realization.size`).
-
-        +-----------------+------------------------------+------------------------+
-        |                 | Significant change           | Non-significant change |
-        +-----------------+------------------------------+------------------------+
-        | Any direction   | change_frac                  | 1 - change_frac        |
-        +-----------------+------------------------------+------------------------+
-        | Positive change | pos_frac * change_frac       | N.A.                   |
-        +-----------------+------------------------------+                        |
-        | Negative change | (1 - pos_frac) * change_frac |                        |
-        +-----------------+------------------------------+------------------------+
+    xr.Dataset
+        Same coordinates as `fut` and  `ref`, but no `time` and no `realization`.
+
+        Variables:
+
+        changed :
+                The weighted fraction of valid members showing significant change.
+                Passing `test=None` yields change_frac = 1 everywhere. Same type as `fut`.
+        positive :
+                The weighted fraction of valid members showing positive change, no matter if it is significant or not.
+        changed_positive :
+                The weighted fraction of valid members showing significant and positive change (]0, 1]).
+        agree :
+                The weighted fraction of valid members agreeing on the sign of change. It is the maximum between positive and 1 - positive.
+        valid :
+                The weighted fraction of valid members. A member is valid is there are no NaNs along the time axes of `fut` and  `ref`.
+        pvals :
+                The p-values estimated by the significance tests. Only returned if the test uses `pvals`. Has the  `realization` dimension.
+
+    The table below shows the coefficient needed to retrieve the number of members
+    that have the indicated characteristics, by multiplying it by the total
+    number of members (`fut.realization.size`) and by `valid_frac`, assuming uniform weights.
+    For compactness, we rename the outputs cf, cpf and pf.
+
+    +-----------------+--------------------+------------------------+------------+
+    |                 | Significant change | Non-significant change | Any change |
+    +-----------------+--------------------+------------------------+------------+
+    | Any direction   | cf                 | 1 - cf                 | 1          |
+    +-----------------+--------------------+------------------------+------------+
+    | Positive change | cpf                | pf - cpf               | pf         |
+    +-----------------+--------------------+------------------------+------------+
+    | Negative change | (cf - cpf)         | 1 - pf - (cf -cpf)     | 1 - pf     |
+    +-----------------+--------------------+------------------------+------------+
 
     Notes
     -----
     Available statistical tests are :
 
-      'ttest' :
-        Single sample T-test. Same test as used by :cite:t:`tebaldi_mapping_2011`.
-        The future values are compared against the reference mean (over 'time').
-        Change is qualified as 'significant' when the test's p-value is below the user-provided `p_change` value.
-      'welch-ttest' :
-        Two-sided T-test, without assuming equal population variance. Same significance criterion as 'ttest'.
-      'mannwhitney-utest' :
-        Two-sided Mann-Whiney U-test. Same significance criterion as 'ttest'.
-      'brownforsythe-test' :
-        Brown-Forsythe test assuming skewed, non-normal distributions. Same significance criterion as 'ttest'.
-      'threshold' :
-        Change is considered significative if the absolute delta exceeds a given threshold (absolute or relative).
-      'ipcc-advanced' :
-        An approximation of the "advanced approach" used in the IPCC Atlas and described in :cite:t:`ipccatlas_ar6wg1`.
-        Change is considered significant if the delta exceeds :math:`1.645\sqrt{\frac{2}{10}}\sigma_{ref'}$ where $\sigma_{ref'}` is
-        the interannual standard deviation of the linearly detrended reference (`ref`). See notebook :ref:`notebooks/ensembles:Ensembles`
-        for more details.
-      None :
-        Significant change is not tested. Members showing any non-zero positive change are
-        included in the `pos_frac` output.
+    {tests_doc}
+        threshold :
+                Change is considered significant when it exceeds an absolute or relative threshold.
+                Accepts one argument, either "abs_thresh" or "rel_thresh".
+        None :
+                Significant change is not tested. Members showing any positive change are
+                included in the `pos_frac` output.
 
     References
     ----------
@@ -136,15 +144,9 @@ def change_significance(  # noqa: C901
     >>> tgmean = xclim.atmos.tg_mean(tas=ens.tas, freq="YS")
     >>> fut = tgmean.sel(time=slice("2020", "2050"))
     >>> ref = tgmean.sel(time=slice("1990", "2020"))
-    >>> chng_f, pos_f = ensembles.change_significance(fut, ref, test="ttest")
-
-    If the deltas were already computed beforehand, the 'threshold' test can still
-    be used, here with a 2 K threshold.
-
-    >>> delta = fut.mean("time") - ref.mean("time")
-    >>> chng_f, pos_f = ensembles.change_significance(
-    ...     delta, test="threshold", abs_thresh=2
-    ... )
+    >>> fractions = ensembles.robustness_fractions(fut, ref, test="ttest")
+    # Binary map, True where at least 80% of members agree on the sign of change.
+    >>> agreement_map = (fractions.pos_frac > 0.8) | ((1 - fractions.pos_frac) > 0.8)
     """
     # Realization dimension name
     realization = "realization"
@@ -167,240 +169,245 @@ def change_significance(  # noqa: C901
             coords={"realization": fut[realization]},
         )
 
-    # Significance tests parameter names
-    test_params = {
-        "ttest": ["p_change"],
-        "welch-ttest": ["p_change"],
-        "mannwhitney-utest": ["p_change"],
-        "brownforsythe-test": ["p_change"],
-        "threshold": ["abs_thresh", "rel_thresh"],
-        "ipcc-advanced": [],
-    }
-
-    # Get delta, either from fut or from fut - ref
-    changed = None
     if ref is None:
         delta = fut
-        n_valid_real = w.where(delta.notnull()).sum(realization)
-        if test not in ["threshold", None]:
+        valid = delta.notnull()
+        if test not in [None, "threshold"]:
             raise ValueError(
-                "When deltas are given (ref=None), 'test' must be one of ['threshold', None]"
+                "When deltas are given (ref=None), 'test' must be None or 'threshold'."
             )
     else:
         delta = fut.mean("time") - ref.mean("time")
-        n_valid_real = w.where(fut.notnull().all("time")).sum(realization)
-
-    pvals = None
-    if test == "ttest":
-        if weights is not None:
-            raise NotImplementedError(
-                "'ttest' is not currently supported for weighted arrays."
-            )
-        p_change = kwargs.setdefault("p_change", 0.05)
-
-        if Version(scipy.__version__) < Version("1.9.0"):
-            warnings.warn(
-                "`xclim` will be dropping support for `scipy<1.9.0` in a future release. "
-                "Please consider updating your environment dependencies accordingly",
-                FutureWarning,
-                stacklevel=3,
-            )
-
-            def _ttest_func(f, r):
-                if np.isnan(f).all() or np.isnan(r).all():
-                    return np.NaN
-
-                return spstats.ttest_1samp(f, r, axis=-1, nan_policy="omit")[1]
-
-        else:
-
-            def _ttest_func(f, r):
-                # scipy>=1.9: popmean.axis[-1] must equal 1 for both fut and ref
-                if np.isnan(f).all() or np.isnan(r).all():
-                    return np.NaN
-
-                return spstats.ttest_1samp(
-                    f, r[..., np.newaxis], axis=-1, nan_policy="omit"
-                )[1]
-
-        # Test hypothesis of no significant change
-        pvals = xr.apply_ufunc(
-            _ttest_func,
-            fut,
-            ref.mean("time"),
-            input_core_dims=[["time"], []],
-            output_core_dims=[[]],
-            vectorize=True,
-            dask="parallelized",
-            output_dtypes=[float],
-        )
-        # When p < p_change, the hypothesis of no significant change is rejected.
-        changed = pvals < p_change
-
-    elif test == "welch-ttest":
-        if weights is not None:
-            raise NotImplementedError(
-                "'welch-ttest' is not currently supported for weighted arrays."
-            )
-        p_change = kwargs.setdefault("p_change", 0.05)
-
-        # Test hypothesis of no significant change
-        # equal_var=False -> Welch's T-test
-        def wtt_wrapper(f, r):  # This specific test can't manage an all-NaN slice
-            if np.isnan(f).all() or np.isnan(r).all():
-                return np.NaN
-            return spstats.ttest_ind(f, r, axis=-1, equal_var=False, nan_policy="omit")[
-                1
-            ]
-
-        pvals = xr.apply_ufunc(
-            wtt_wrapper,
-            fut,
-            ref,
-            input_core_dims=[["time"], ["time"]],
-            output_core_dims=[[]],
-            exclude_dims={"time"},
-            vectorize=True,
-            dask="parallelized",
-            output_dtypes=[float],
-        )
-
-        # When p < p_change, the hypothesis of no significant change is rejected.
-        changed = pvals < p_change
-    elif test == "mannwhitney-utest":
-        if weights is not None:
-            raise NotImplementedError(
-                "'mannwhitney-utest' is not currently supported for weighted arrays."
-            )
-        if Version(scipy.__version__) < Version("1.8.0"):
-            raise ImportError(
-                "The Mann-Whitney test requires `scipy>=1.8.0`. "
-                "`xclim` will be dropping support for `scipy<1.9.0` in a future release. "
-                "Please consider updating your environment dependencies accordingly"
-            )
-
-        p_change = kwargs.setdefault("p_change", 0.05)
+        valid = fut.notnull().all("time") & ref.notnull().all("time")
 
-        # Test hypothesis of no significant change
-        # -> Mann-Whitney U-test
-
-        def mwu_wrapper(f, r):  # This specific test can't manage an all-NaN slice
-            if np.isnan(f).all() or np.isnan(r).all():
-                return np.NaN
-            return spstats.mannwhitneyu(f, r, axis=-1, nan_policy="omit")[1]
-
-        pvals = xr.apply_ufunc(
-            mwu_wrapper,
-            fut,
-            ref,
-            input_core_dims=[["time"], ["time"]],
-            output_core_dims=[[]],
-            exclude_dims={"time"},
-            vectorize=True,
-            dask="parallelized",
-            output_dtypes=[float],
-        )
-        # When p < p_change, the hypothesis of no significant change is rejected.
-        changed = pvals < p_change
-    elif test == "brownforsythe-test":
-        if weights is not None:
-            raise NotImplementedError(
-                "'brownforsythe-test' is not currently supported for weighted arrays."
-            )
-
-        p_change = kwargs.setdefault("p_change", 0.05)
-        # Test hypothesis of no significant change
-        # -> Brown-Forsythe test
-        pvals = xr.apply_ufunc(
-            lambda f, r: spstats.levene(f, r, center="median")[1],
-            fut,
-            ref,
-            input_core_dims=[["time"], ["time"]],
-            output_core_dims=[[]],
-            exclude_dims={"time"},
-            vectorize=True,
-            dask="parallelized",
-            output_dtypes=[float],
-        )
-        # When p < p_change, the hypothesis of no significant change is rejected.
-        changed = pvals < p_change
+    if test is None:
+        test_params = {}
+        changed = xr.ones_like(delta).astype(bool)
+        pvals = None
     elif test == "threshold":
-        if "abs_thresh" in kwargs and "rel_thresh" not in kwargs:
-            changed = abs(delta) > kwargs["abs_thresh"]
-        elif "rel_thresh" in kwargs and "abs_thresh" not in kwargs and ref is not None:
-            changed = abs(delta / ref.mean("time")) > kwargs["rel_thresh"]
+        abs_thresh = kwargs.get("abs_thresh")
+        rel_thresh = kwargs.get("rel_thresh")
+        if abs_thresh is not None and rel_thresh is None:
+            changed = abs(delta) > abs_thresh
+            test_params = {"abs_thresh": abs_thresh}
+        elif rel_thresh is not None and abs_thresh is None:
+            changed = abs(delta / ref.mean("time")) > rel_thresh
+            test_params = {"rel_thresh": rel_thresh}
         else:
-            raise ValueError("Invalid argument combination for test='threshold'.")
-    elif test == "ipcc-advanced":
-        # Detrend ref
-        refy = ref.resample(time="YS").mean()
-        ref_detrended = detrend(refy, dim="time", deg=1)
-        gamma = np.sqrt(2 / 20) * 1.645 * ref_detrended.std("time")
-        changed = abs(delta) > gamma
-    elif test is not None:
+            raise ValueError(
+                "One and only one of abs_thresh or rel_thresh must be given if test='threshold'."
+            )
+        pvals = None
+    elif test in SIGNIFICANCE_TESTS:
+        test_func = SIGNIFICANCE_TESTS[test]
+        test_params = {
+            n: kwargs.get(n, p.default)
+            for n, p in signature(test_func).parameters.items()
+            if p.kind == Parameter.KEYWORD_ONLY
+        }
+
+        changed, pvals = test_func(fut, ref, **test_params)
+    else:
         raise ValueError(
-            f"Statistical test {test} must be one of {', '.join(test_params.keys())}."
+            f"Statistical test {test} must be one of {', '.join(SIGNIFICANCE_TESTS.keys())}."
         )
 
-    # Compute `change_frac`: ratio of realizations with significant changes.
-    if test is not None:
-        delta_chng = delta.where(changed)
-        change_frac = changed.weighted(w).sum(realization) / n_valid_real
-    else:
-        delta_chng = delta
-        change_frac = xr.ones_like(delta.isel({realization: 0}))
-
-    # Test that models agree on the sign of the change
-    # This returns NaN (cause 0 / 0) where no model show significant change.
-    pos_frac = (delta_chng > 0).weighted(w).sum(realization) / (
-        change_frac * n_valid_real
-    )
+    valid_frac = valid.weighted(w).sum(realization) / fut[realization].size
+    n_valid = valid.weighted(w).sum(realization)
+    change_frac = changed.where(valid).weighted(w).sum(realization) / n_valid
+    pos_frac = (delta > 0).where(valid).weighted(w).sum(realization) / n_valid
+    change_pos_frac = ((delta > 0) & changed).where(valid).weighted(w).sum(
+        realization
+    ) / n_valid
+    agree_frac = xr.concat((pos_frac, 1 - pos_frac), "sign").max("sign")
 
     # Metadata
-    kwargs_str = ", ".join(
-        [f"{k}: {v}" for k, v in kwargs.items() if k in test_params[test]]
-    )
+    kwargs_str = gen_call_string("", **test_params)[1:-1]
     test_str = (
-        f"Significant change was tested with test {test} with parameters {kwargs_str}."
+        f"Significant change was tested with test {test} and parameters {kwargs_str}."
+    )
+
+    out = xr.Dataset(
+        {
+            "changed": change_frac.assign_attrs(
+                description="Fraction of members showing significant change. "
+                + test_str,
+                units="",
+                test=str(test),
+            ),
+            "positive": pos_frac.assign_attrs(
+                description="Fraction of valid members showing positive change.",
+                units="",
+            ),
+            "changed_positive": change_pos_frac.assign_attrs(
+                description="Fraction of valid members showing significant and positive change. "
+                + test_str,
+                units="",
+                test=str(test),
+            ),
+            "valid": valid_frac.assign_attrs(
+                description="Fraction of valid members (No missing values along time).",
+                units="",
+            ),
+            "agree": agree_frac.assign_attrs(
+                description="Fraction of valid members agreeing on the sign of change. Maximum between pos_frac and 1 - pos_frac.",
+                units="",
+            ),
+        },
+        attrs={"description": "Significant change and sign of change fractions."},
     )
-    das = {"fut": fut} if ref is None else {"fut": fut, "ref": ref}
 
     if pvals is not None:
         pvals.attrs.update(
             description="P-values from change significance test. " + test_str,
             units="",
-            test=str(test),
-            history=update_history(
-                f"pvals from change_significance(fut=fut, ref=ref, test={test}, {kwargs_str})",
-                **das,
-            ),
         )
-    pos_frac.attrs.update(
-        description="Fraction of members showing significant change that agree on a positive change. "
-        + test_str,
-        units="",
-        test=str(test),
-        history=update_history(
-            f"pos_frac from change_significance(fut=fut, ref=ref, test={test}, {kwargs_str})",
-            **das,
-        ),
-    )
-    change_frac.attrs.update(
-        description="Fraction of members showing significant change. " + test_str,
-        units="",
-        test=str(test),
-        history=update_history(
-            f"change_frac from change_significance(fut=fut, ref=ref, test={test}, {kwargs_str})",
-            **das,
+        out = out.assign(pvals=pvals)
+
+    return out
+
+
+def change_significance(  # noqa: C901
+    fut: xr.DataArray | xr.Dataset,
+    ref: xr.DataArray | xr.Dataset = None,
+    test: str | None = "ttest",
+    weights: xr.DataArray = None,
+    p_vals: bool = False,
+    **kwargs,
+) -> (
+    tuple[xr.DataArray | xr.Dataset, xr.DataArray | xr.Dataset]
+    | tuple[
+        xr.DataArray | xr.Dataset,
+        xr.DataArray | xr.Dataset,
+        xr.DataArray | xr.Dataset | None,
+    ]
+):
+    """Backwards-compatible implementaton of :py:func:`robustness_fractions`."""
+    warnings.warn(
+        (
+            "Function change_significance is deprecated as of xclim 0.47 and will be removed in 0.49. "
+            "Please use robustness_fractions instead."
         ),
+        FutureWarning,
     )
 
-    # Returns either two (2) or three (3) variables. This should be adjusted.
+    if isinstance(fut, xr.Dataset):
+        outs = {
+            v: robustness_fractions(
+                fut[v],
+                ref[v] if isinstance(ref, xr.Dataset) else ref,
+                test=test,
+                weights=weights[v] if isinstance(weights, xr.Dataset) else weights,
+                **kwargs,
+            )
+            for v in fut.data_vars.keys()
+        }
+        change_frac = xr.merge([fracs.changed.rename(v) for v, fracs in outs.items()])
+        pos_frac = xr.merge(
+            [
+                (fracs.changed_positive / fracs.changed).rename(v)
+                for v, fracs in outs.items()
+            ]
+        )
+        if p_vals:
+            if "pvals" in list(outs.values())[0]:
+                pvals = xr.merge([fracs.pvals.rename(v) for v, fracs in outs.items()])
+            else:
+                pvals = None
+            return change_frac, pos_frac, pvals
+        return change_frac, pos_frac
+
+    fracs = robustness_fractions(fut, ref, test=test, weights=weights, **kwargs)
+    # different def.
+    # Old "pos_frac" is fraction of change_frac that is positive
+    # New change_pos_frac is fraction of all that is both postive and significant
+    pos_frac = fracs.changed_positive / fracs.changed
+
     if p_vals:
-        return change_frac, pos_frac, pvals
-    return change_frac, pos_frac
+        return fracs.changed, pos_frac, fracs.pvals if "pvals" in fracs else None
+    return fracs.changed, pos_frac
+
+
+def robustness_categories(
+    changed_or_fractions: xr.Dataset | xr.DataArray,
+    agree: xr.DataArray = None,
+    *,
+    categories: list[str] = [
+        "Robust signal",
+        "No change or no signal",
+        "Conflicting signal",
+    ],
+    ops: list[tuple[str, str]] = [(">=", ">="), ("<", None), (">=", "<")],
+    thresholds: list[tuple[float, float]] = [(0.66, 0.8), (0.66, None), (0.66, 0.8)],
+) -> xr.DataArray:
+    """Create a categorical robustness map for mapping hatching patterns.
+
+    Each robustness category is defined by a double threshold, one on the fraction of members showing significant change (`change_frac`)
+    and one on the fraction of member agreeing on the sign of change (`agree_frac`). When the two thresholds are fullfilled, the point
+    is assigned to the given category.
+    The default values for the comparisons are the ones suggested by the IPCC for its "Advanced approach" described
+    in the Cross-Chapter Box 1 of the Atlas of the AR6 WGI report (:cite:t:`ipccatlas_ar6wg1`).
+
+    Parameters
+    ----------
+    changed_or_fractions : xr.Dataset or xr.DataArray
+      Either the fraction of members showing significant change as an array or
+      directly the output of :py:func:`robustness_fractions`.
+    agree : xr.DataArray, optional
+      The fraction of members agreeing on the sign of change. Only needed if the first argument is
+      the `changed` array.
+    categories : list or strings
+      The label of each robustness categories. They are stored in the semi-colon separated flag_descriptions
+      attribute as well as in a compressed form in the flag_meanings attribute.
+      If a point is mapped to two categories, priority is given to the first one in this list.
+    ops : list of tuples of 2 strings
+      For each category, the comparison operators for `change_frac` and `agree_frac`.
+      None or an empty string means the variable is not needed for this category.
+    thresholds : list of tuples of 2 floats
+      For each categroy, the threshold to be used with the corresponding operator. All should be between 0 and 1.
+
+    Returns
+    -------
+    xr.DataArray
+        Categorical (int) array following the flag variables CF conventions.
+        99 is used as a fill value for points that do not fall in any category.
+    """
+    if isinstance(changed_or_fractions, xr.Dataset):
+        # Output of robustness fractions
+        changed = changed_or_fractions.changed
+        agree = changed_or_fractions.agree
+    else:
+        changed = changed_or_fractions
+
+    # Initial map is all 99, same shape as change_frac
+    robustness = changed * 0 + 99
+
+    # We go in reverse gear so that the first categories have precedence in the case of multiple matches.
+    for i, ((chg_op, agr_op), (chg_thresh, agr_thresh)) in reversed(
+        list(enumerate(zip(ops, thresholds), 1))
+    ):
+        if not agr_op:
+            cond = compare(changed, chg_op, chg_thresh)
+        elif not chg_op:
+            cond = compare(agree, agr_op, agr_thresh)
+        else:
+            cond = compare(changed, chg_op, chg_thresh) & compare(
+                agree, agr_op, agr_thresh
+            )
+        robustness = xr.where(cond, i, robustness)
+
+    robustness = robustness.astype(np.uint8).assign_attrs(
+        flag_values=list(range(1, len(categories) + 1)),
+        _FillValue=99,
+        flag_descriptions=categories,
+        flag_meanings=" ".join(
+            map(lambda cat: cat.casefold().replace(" ", "_"), categories)
+        ),
+    )
+    return robustness
 
 
+@update_xclim_history
 def robustness_coefficient(
     fut: xr.DataArray | xr.Dataset, ref: xr.DataArray | xr.Dataset
 ) -> xr.DataArray | xr.Dataset:
@@ -482,6 +489,171 @@ def diff_cdf_sq_area_int(x1, x2):
         description="Ensemble robustness coefficient as defined by Knutti and Sedláček (2013).",
         reference="Knutti, R. and Sedláček, J. (2013) Robustness and uncertainties in the new CMIP5 climate model projections. Nat. Clim. Change.",
         units="",
-        history=update_history("knutti_sedlacek(fut, ref)", ref=ref, fut=fut),
     )
     return R
+
+
+@significance_test
+def _ttest(fut, ref, *, p_change=0.05):
+    """Single sample T-test. Same test as used by :cite:t:`tebaldi_mapping_2011`.
+    The future values are compared against the reference mean (over 'time').
+    Accepts argument p_change (float, default : 0.05) the p-value threshold for rejecting the hypothesis of no significant change.
+    """
+    if Version(scipy.__version__) < Version("1.9.0"):
+        warnings.warn(
+            "`xclim` will be dropping support for `scipy<1.9.0` in a future release. "
+            "Please consider updating your environment dependencies accordingly",
+            FutureWarning,
+            stacklevel=4,
+        )
+
+        def _ttest_func(f, r):
+            if np.isnan(f).all() or np.isnan(r).all():
+                return np.NaN
+
+            return spstats.ttest_1samp(f, r, axis=-1, nan_policy="omit")[1]
+
+    else:
+
+        def _ttest_func(f, r):
+            # scipy>=1.9: popmean.axis[-1] must equal 1 for both fut and ref
+            if np.isnan(f).all() or np.isnan(r).all():
+                return np.NaN
+
+            return spstats.ttest_1samp(
+                f, r[..., np.newaxis], axis=-1, nan_policy="omit"
+            )[1]
+
+    # Test hypothesis of no significant change
+    pvals = xr.apply_ufunc(
+        _ttest_func,
+        fut,
+        ref.mean("time"),
+        input_core_dims=[["time"], []],
+        output_core_dims=[[]],
+        vectorize=True,
+        dask="parallelized",
+        output_dtypes=[float],
+    )
+    # When p < p_change, the hypothesis of no significant change is rejected.
+    changed = pvals < p_change
+    return changed, pvals
+
+
+@significance_test
+def _welch_ttest(fut, ref, *, p_change=0.05):
+    """Two-sided T-test, without assuming equal population variance. Same significance criterion and argument as 'ttest'."""
+
+    # Test hypothesis of no significant change
+    # equal_var=False -> Welch's T-test
+    def wtt_wrapper(f, r):  # This specific test can't manage an all-NaN slice
+        if np.isnan(f).all() or np.isnan(r).all():
+            return np.NaN
+        return spstats.ttest_ind(f, r, axis=-1, equal_var=False, nan_policy="omit")[1]
+
+    pvals = xr.apply_ufunc(
+        wtt_wrapper,
+        fut,
+        ref,
+        input_core_dims=[["time"], ["time"]],
+        output_core_dims=[[]],
+        exclude_dims={"time"},
+        vectorize=True,
+        dask="parallelized",
+        output_dtypes=[float],
+    )
+
+    # When p < p_change, the hypothesis of no significant change is rejected.
+    changed = pvals < p_change
+    return changed, pvals
+
+
+@significance_test
+def _mannwhitney_utest(ref, fut, *, p_change=0.05):
+    """Two-sided Mann-Whiney U-test. Same significance criterion and argument as 'ttest'."""
+    if Version(scipy.__version__) < Version("1.8.0"):
+        raise ImportError(
+            "The Mann-Whitney test requires `scipy>=1.8.0`. "
+            "`xclim` will be dropping support for `scipy<1.9.0` in a future release. "
+            "Please consider updating your environment dependencies accordingly"
+        )
+
+    def mwu_wrapper(f, r):  # This specific test can't manage an all-NaN slice
+        if np.isnan(f).all() or np.isnan(r).all():
+            return np.NaN
+        return spstats.mannwhitneyu(f, r, axis=-1, nan_policy="omit")[1]
+
+    pvals = xr.apply_ufunc(
+        mwu_wrapper,
+        fut,
+        ref,
+        input_core_dims=[["time"], ["time"]],
+        output_core_dims=[[]],
+        exclude_dims={"time"},
+        vectorize=True,
+        dask="parallelized",
+        output_dtypes=[float],
+    )
+    # When p < p_change, the hypothesis of no significant change is rejected.
+    changed = pvals < p_change
+    return changed, pvals
+
+
+@significance_test
+def _brownforsythe_test(fut, ref, *, p_change=0.05):
+    """Brown-Forsythe test assuming skewed, non-normal distributions. Same significance criterion and argument as 'ttest'."""
+    pvals = xr.apply_ufunc(
+        lambda f, r: spstats.levene(f, r, center="median")[1],
+        fut,
+        ref,
+        input_core_dims=[["time"], ["time"]],
+        output_core_dims=[[]],
+        exclude_dims={"time"},
+        vectorize=True,
+        dask="parallelized",
+        output_dtypes=[float],
+    )
+    # When p < p_change, the hypothesis of no significant change is rejected.
+    changed = pvals < p_change
+    return changed, pvals
+
+
+@significance_test
+def _ipcc_ar6_c(fut, ref, *, ref_pi=None):
+    r"""The advanced approach used in the IPCC Atlas chapter (:cite:t:`ipccatlas_ar6wg1`).
+    Change is considered significant if the delta exceeds a threshold related to the internal variability.
+    If pre-industrial data is given in argument `ref_pi`, the threshold is defined as
+    :math:`\sqrt{2}*1.645*\sigma_{20yr}`, where :math:`\sigma_{20yr}` is the standard deviation of 20-year
+    means computed from non-overlapping periods after detrending with a quadratic fit.
+    Otherwise, when such pre-industrial control data is not available, the threshold is defined in relation to
+    the historical data (`ref`) as :math:`\sqrt{\frac{2}{20}}*1.645*\sigma_{1yr}, where :math:`\sigma_{1yr}`
+    is the interannual standard deviation measured after linearly detrending the data.
+    See notebook :ref:`notebooks/ensembles:Ensembles` for more details.
+    """
+    # Ensure annual
+    refy = ref.resample(time="YS").mean()
+    if ref_pi is None:
+        ref_detrended = detrend(refy, dim="time", deg=1)
+        gamma = np.sqrt(2 / 20) * 1.645 * ref_detrended.std("time")
+    else:
+        ref_detrended = detrend(refy, dim="time", deg=2)
+        gamma = (
+            np.sqrt(2) * 1.645 * ref_detrended.resample(time="20YS").mean().std("time")
+        )
+
+    delta = fut.mean("time") - ref.mean("time")
+    changed = abs(delta) > gamma
+    return changed, None
+
+
+# Add doc of each significance test to the `change_significance` output.
+def _gen_test_entry(namefunc):
+    name, func = namefunc
+    doc = func.__doc__.replace("\n    ", "\n\t\t").rstrip()
+    return f"\t{name}:\n\t\t{doc}"
+
+
+change_significance.__doc__ = change_significance.__doc__.format(
+    tests_list="{" + ", ".join(list(SIGNIFICANCE_TESTS.keys()) + ["threshold"]) + "}",
+    tests_doc="\n".join(map(_gen_test_entry, SIGNIFICANCE_TESTS.items())),
+)

From 6277febc6ec0861f1a021a753e48047a116b4e55 Mon Sep 17 00:00:00 2001
From: Pascal Bourgault <bourgault.pascal@ouranos.ca>
Date: Fri, 3 Nov 2023 11:53:35 -0400
Subject: [PATCH 6/7] change tests for robustnessfrac - add for robustcat

---
 CHANGES.rst             |   2 +-
 tests/test_ensembles.py | 145 +++++++++++++++++-----------------------
 2 files changed, 64 insertions(+), 83 deletions(-)

diff --git a/CHANGES.rst b/CHANGES.rst
index 7c728e595..937556bb4 100644
--- a/CHANGES.rst
+++ b/CHANGES.rst
@@ -8,7 +8,7 @@ Contributors to this version: Juliette Lavoie (:user:`juliettelavoie`), Pascal B
 
 New features and enhancements
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-* New functions ``xclim.ensembles.robustness_fractions`` and ``xclim.ensembles.robustness_categories``. The former will replace ``xclim.ensembles.change_significance`` which is now deprecated and will be removed in xclim 0.49.
+* New functions ``xclim.ensembles.robustness_fractions`` and ``xclim.ensembles.robustness_categories``. The former will replace ``xclim.ensembles.change_significance`` which is now deprecated and will be removed in xclim 0.49 (:pull:`1514`).
 
 Bug fixes
 ^^^^^^^^^
diff --git a/tests/test_ensembles.py b/tests/test_ensembles.py
index 7cee2aaf5..9321139c4 100644
--- a/tests/test_ensembles.py
+++ b/tests/test_ensembles.py
@@ -542,8 +542,8 @@ def test_make_criteria(self, tas_series):
 
 
 # ## Tests for Robustness ##
-@pytest.fixture(params=[True, False])
-def robust_data(request, random):
+@pytest.fixture
+def robust_data(random):
     norm = get_dist("norm")
     ref = np.tile(
         np.array(
@@ -592,9 +592,6 @@ def robust_data(request, random):
     ref["time"] = xr.cftime_range("2000-01-01", periods=40, freq="YS")
     fut = xr.DataArray(fut, dims=("lon", "realization", "time"), name="tas")
     fut["time"] = xr.cftime_range("2040-01-01", periods=40, freq="YS")
-    if request.param:
-        ref = ref.chunk({"lon": 1}).to_dataset()
-        fut = fut.chunk({"lon": 1}).to_dataset()
     return ref, fut
 
 
@@ -604,7 +601,7 @@ def robust_data(request, random):
         (
             "ttest",
             [0.75, 1, 1, 1],
-            [2 / 3, 0.5, 1, 1],
+            [0.5, 0.5, 1, 1],
             [
                 [False, True, True, True],
                 [True, True, True, True],
@@ -616,7 +613,7 @@ def robust_data(request, random):
         (
             "welch-ttest",
             [0.25, 1, 1, 1],
-            [1, 0.5, 1, 1],
+            [0.25, 0.5, 1, 1],
             [
                 [False, False, False, True],
                 [True, True, True, True],
@@ -628,7 +625,7 @@ def robust_data(request, random):
         (
             "mannwhitney-utest",
             [0.5, 1, 1, 1],
-            [0.5, 0.5, 1, 1],
+            [0.25, 0.5, 1, 1],
             [
                 [False, False, True, True],
                 [True, True, True, True],
@@ -640,7 +637,7 @@ def robust_data(request, random):
         (
             "brownforsythe-test",
             [0.25, 0.25, 0.25, 0],
-            [1, 0.0, 1, np.nan],
+            [0.25, 0.0, 0.25, 0],
             [
                 [False, True, False, False],
                 [True, False, False, False],
@@ -649,28 +646,25 @@ def robust_data(request, random):
             ],
             {},
         ),
+        (
+            "ipcc-ar6-c",
+            [0.25, 1, 1, 1],
+            [0.25, 0.5, 1, 1],
+            None,
+            {},
+        ),
         (
             "threshold",
             [0.25, 1, 1, 1],
-            [1, 0.5, 1, 1],
-            [
-                [False, False, False, True],
-                [True, True, True, True],
-                [True, True, True, True],
-                [False, False, True, True],
-            ],
+            [0.25, 0.5, 1, 1],
+            None,
             {"rel_thresh": 0.002},
         ),
         (
             "threshold",
             [0, 0, 0.5, 0],
-            [np.nan, np.nan, 1, np.nan],
-            [
-                [False, False, False, False],
-                [False, False, False, False],
-                [False, False, True, True],
-                [False, False, False, False],
-            ],
+            [0, 0, 0.5, 0],
+            None,
             {"abs_thresh": 2},
         ),
         (
@@ -682,87 +676,61 @@ def robust_data(request, random):
         ),
     ],
 )
-def test_change_significance(
+def test_robustness_fractions(
     robust_data, test, exp_chng_frac, exp_pos_frac, exp_changed, kws
 ):
     ref, fut = robust_data
 
     if test == "ttest" and Version(__scipy_version__) < Version("1.9.0"):
         with pytest.warns(FutureWarning):
-            chng_frac, pos_frac = ensembles.change_significance(
-                fut, ref, test=test, **kws
-            )
+            fracs = ensembles.robustness_fractions(fut, ref, test=test, **kws)
     else:
-        chng_frac, pos_frac = ensembles.change_significance(fut, ref, test=test, **kws)
+        fracs = ensembles.robustness_fractions(fut, ref, test=test, **kws)
 
-    assert chng_frac.attrs["test"] == str(test)
-    if isinstance(ref, xr.Dataset):
-        chng_frac = chng_frac.tas
-        pos_frac = pos_frac.tas
+    assert fracs.changed.attrs["test"] == str(test)
 
-    np.testing.assert_array_almost_equal(chng_frac, exp_chng_frac)
-    np.testing.assert_array_almost_equal(pos_frac, exp_pos_frac)
+    np.testing.assert_array_almost_equal(fracs.positive, [0.5, 0.5, 1, 1])
+    np.testing.assert_array_almost_equal(fracs.agree, [0.5, 0.5, 1, 1])
+    np.testing.assert_array_almost_equal(fracs.valid, [1, 1, 1, 0.5])
+    np.testing.assert_array_almost_equal(fracs.changed, exp_chng_frac)
+    np.testing.assert_array_almost_equal(fracs.changed_positive, exp_pos_frac)
 
-    # With p-values
-    chng, sign, pvals = ensembles.change_significance(
-        fut, ref, test=test, p_vals=True, **kws
-    )
-    if pvals is not None:
-        if isinstance(ref, xr.Dataset):
-            pvals = pvals.tas
-        pvals.load()  # Otherwise it acts weirdly for welch-ttest
+    if "pvals" in fracs:
         # 0.05 is the default p_change parameter
-        changed = pvals < 0.05
+        changed = fracs.pvals < 0.05
         np.testing.assert_array_almost_equal(changed, exp_changed)
 
 
-def test_change_significance_weighted(robust_data):
+def test_robustness_fractions_weighted(robust_data):
     ref, fut = robust_data
     weights = xr.DataArray([1, 0.1, 3.5, 5], coords={"realization": ref.realization})
-    chng_frac, pos_frac = ensembles.change_significance(
-        fut, ref, test=None, weights=weights
-    )
-    assert chng_frac.attrs["test"] == "None"
-    if isinstance(ref, xr.Dataset):
-        chng_frac = chng_frac.tas
-        pos_frac = pos_frac.tas
-
-    np.testing.assert_array_equal(chng_frac, [1, 1, 1, 1])
-    np.testing.assert_array_almost_equal(pos_frac, [0.53125, 0.88541667, 1.0, 1.0])
+    fracs = ensembles.robustness_fractions(fut, ref, test=None, weights=weights)
+    assert fracs.changed.attrs["test"] == "None"
 
-
-def test_change_significance_delta(robust_data):
-    ref, fut = robust_data
-    delta = fut.mean("time") - ref.mean("time")
-    chng_frac, pos_frac = ensembles.change_significance(
-        delta, test="threshold", abs_thresh=2
+    np.testing.assert_array_equal(fracs.changed, [1, 1, 1, 1])
+    np.testing.assert_array_almost_equal(
+        fracs.changed_positive, [0.53125, 0.88541667, 1.0, 1.0]
     )
 
-    if isinstance(ref, xr.Dataset):
-        chng_frac = chng_frac.tas
-        pos_frac = pos_frac.tas
 
-    exp_chng_frac = [0, 0, 0.5, 0]
-    exp_pos_frac = [np.nan, np.nan, 1, np.nan]
-    np.testing.assert_array_equal(chng_frac, exp_chng_frac)
-    np.testing.assert_array_equal(pos_frac, exp_pos_frac)
+def test_robustness_fractions_delta(robust_data):
+    delta = xr.DataArray([-2, 1, -2, -1], dims=("realization",))
+    fracs = ensembles.robustness_fractions(delta, test="threshold", abs_thresh=1.5)
+    np.testing.assert_array_equal(fracs.changed, [0.5])
+    np.testing.assert_array_equal(fracs.changed_positive, [0.0])
+    np.testing.assert_array_equal(fracs.positive, [0.25])
+    np.testing.assert_array_equal(fracs.agree, [0.75])
 
-    weights = xr.DataArray([1, 0.1, 3.5, 5], coords={"realization": delta.realization})
-    chng_frac, pos_frac = ensembles.change_significance(
-        delta, test="threshold", abs_thresh=2, weights=weights
+    weights = xr.DataArray([4, 3, 2, 1], dims=("realization",))
+    fracs = ensembles.robustness_fractions(
+        delta, test="threshold", abs_thresh=1.5, weights=weights
     )
-    if isinstance(ref, xr.Dataset):
-        chng_frac = chng_frac.tas
-        pos_frac = pos_frac.tas
-
-    exp_chng_frac = [0, 0, 0.88541667, 0]
-    exp_pos_frac = [np.nan, np.nan, 1, np.nan]
-
-    np.testing.assert_array_almost_equal(chng_frac, exp_chng_frac)
-    np.testing.assert_array_equal(pos_frac, exp_pos_frac)
+    np.testing.assert_array_equal(fracs.changed, [0.6])
+    np.testing.assert_array_equal(fracs.positive, [0.3])
+    np.testing.assert_array_equal(fracs.agree, [0.7])
 
 
-def test_change_significance_empty():
+def test_robustness_fractions_empty():
     """Test that NaN is returned if input arrays are full of NaNs."""
     r = np.full((20, 10), np.nan)
     f = np.full((20, 10), np.nan)
@@ -770,8 +738,21 @@ def test_change_significance_empty():
     ref = xr.DataArray(r, dims=("realization", "time"), name="tas")
     fut = xr.DataArray(f, dims=("realization", "time"), name="tas")
 
-    c, f = ensembles.change_significance(fut, ref, test="ttest")
-    assert np.all(np.isnan(c))
+    f = ensembles.robustness_fractions(fut, ref, test="ttest")
+    assert np.all(np.isnan(f.changed))
+
+
+def test_robustness_categories():
+    changed = xr.DataArray([0.5, 0.8, 1, 1])
+    agree = xr.DataArray([1, 0.5, 0.5, 1])
+
+    categories = ensembles.robustness_categories(changed, agree)
+    np.testing.assert_array_equal(categories, [2, 3, 3, 1])
+    assert categories.flag_values == [1, 2, 3]
+    assert (
+        categories.flag_meanings
+        == "robust_signal no_change_or_no_signal conflicting_signal"
+    )
 
 
 def test_robustness_coefficient():

From 1ab658c2406efa1f48b7f3717dc543da86fb6f55 Mon Sep 17 00:00:00 2001
From: Pascal Bourgault <bourgault.pascal@ouranos.ca>
Date: Mon, 6 Nov 2023 16:39:50 -0500
Subject: [PATCH 7/7] fixes after review

---
 docs/notebooks/ensembles.ipynb | 8 ++++----
 xclim/ensembles/_robustness.py | 2 --
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/docs/notebooks/ensembles.ipynb b/docs/notebooks/ensembles.ipynb
index 2dffc73d0..665b60d19 100644
--- a/docs/notebooks/ensembles.ipynb
+++ b/docs/notebooks/ensembles.ipynb
@@ -283,14 +283,14 @@
    "source": [
     "### Change significance and model agreement\n",
     "\n",
-    "When communicating climate change through plots of projected change, it is often useful to add information on the statistical significance of the values. A common way to represent this information without overloading the figures is through hatching patterns surimposed on the primary data. Two aspects are usually shown : \n",
+    "When communicating climate change through plots of projected change, it is often useful to add information on the statistical significance of the values. A common way to represent this information without overloading the figures is through hatching patterns superimposed on the primary data. Two aspects are usually shown : \n",
     "\n",
-    "- change significance : whether most of the ensemble members project a climate change signal statistically significant in comparison to their internal variability.\n",
+    "- change significance : whether most of the ensemble members project a statistically significant climate change signal, in comparison to their internal variability.\n",
     "- model agreement : whether the different ensemble members agree on the sign of the change.\n",
     "\n",
-    "We can then divide the plotted points into categories each with its own hatching pattern, usually leaving the robust data (models agree and all show significant change) without hatching. \n",
+    "We can then divide the plotted points into categories each with its own hatching pattern, usually leaving the robust data (models agree and enough show a significant change) without hatching. \n",
     "\n",
-    "Xclim provides some tools to help in generating these hatching masks. First is [xc.ensembles.robustness_fractions](../apidoc/xclim.ensembles.rst#xclim.ensembles._robustness.robustness_fractions) that can characterize the change significance and sign agreement accross ensemble members. To demonstrate its usage, we'll first generate some fake annual mean temperature data. Here, `ref` is the data on the reference period and `fut` is a future projection. There are be 5 different members in the ensemble. We tweaked the generation so that all models agree on significant change in the \"south\" while agreement and signifiance of change decreases as we go north and east."
+    "Xclim provides some tools to help in generating these hatching masks. First is [xc.ensembles.robustness_fractions](../apidoc/xclim.ensembles.rst#xclim.ensembles._robustness.robustness_fractions) that can characterize the change significance and sign agreement accross ensemble members. To demonstrate its usage, we'll first generate some fake annual mean temperature data. Here, `ref` is the data on the reference period and `fut` is a future projection. There are 5 different members in the ensemble. We tweaked the generation so that all models agree on significant change in the \"south\" while agreement and signifiance of change decreases as we go north and east."
    ]
   },
   {
diff --git a/xclim/ensembles/_robustness.py b/xclim/ensembles/_robustness.py
index 64d982ee6..9b59a347e 100644
--- a/xclim/ensembles/_robustness.py
+++ b/xclim/ensembles/_robustness.py
@@ -145,8 +145,6 @@ def robustness_fractions(  # noqa: C901
     >>> fut = tgmean.sel(time=slice("2020", "2050"))
     >>> ref = tgmean.sel(time=slice("1990", "2020"))
     >>> fractions = ensembles.robustness_fractions(fut, ref, test="ttest")
-    # Binary map, True where at least 80% of members agree on the sign of change.
-    >>> agreement_map = (fractions.pos_frac > 0.8) | ((1 - fractions.pos_frac) > 0.8)
     """
     # Realization dimension name
     realization = "realization"