Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spei, proper use of loc parameter in scipy dists #1653

Closed
wants to merge 18 commits into from
Closed

Spei, proper use of loc parameter in scipy dists #1653

wants to merge 18 commits into from

Conversation

coxipi
Copy link
Contributor

@coxipi coxipi commented Feb 19, 2024

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
    • This PR fixes #xyz
  • Tests for the changes have been added (for bug fixes / features)
    • (If applicable) Documentation has been added / updated (for bug fixes / features)
  • CHANGES.rst has been updated (with summary of main changes)
    • Link to issue (:issue:number) and pull request (:pull:number) has been added

What kind of change does this PR introduce?

  • Spei used to rely on an offset to ensure that distribution of values are positive when using distributions such as gamma, fisk. Now, instead, we properly use the loc parameter in those distributions that generalize the distributions and allow negative values. For instance, the gamma disitribution is a 2-parameter function, but in scipy it's generalized to 3 parameters, with a loc parameter. This extends the support of gamma from $x \in [0,\infty]$ to $x \in [\text{loc}, \infty]$.

The code was already compatible with this idea, just had to change _fit_start so it can compute estimates of the parameters in the cases with negatives values.

Does this PR introduce a breaking change?

Yes, removing the offset from SPEI.

Other information:

@github-actions github-actions bot added the indicators Climate indices and indicators label Feb 19, 2024
CHANGES.rst Outdated Show resolved Hide resolved
@coxipi
Copy link
Contributor Author

coxipi commented Feb 22, 2024

I'm pretty much done here I think, but I would like to test a bit more situations where we have negative values. I need to find realistic cases, not too extreme I suppose, although I don't see why the method would not work with very negative values. I have to see!

Shifting the values like:

pr = pr - f*pr.mean()

for $f \in [0,1]$, I find that the SPI values change more for gamma. For method ML, the error is somewhat constant, and for method APP, results get worse and worse as the shift is increased:

Mean[(spi_shifted - spi)/spi]
gamma-ML ~7%
fisk-ML ~0.1%
gamma-APP from 0.001% to 69%
fisk-APP from 0.001% to 10%

I say error because I assume that the SPI should not change if we apply a constant change to all precipitations

@coxipi
Copy link
Contributor Author

coxipi commented Mar 15, 2024

So I tried to obtain a proper estimation of loc parameter in the case of the gamma distribution, resorting to approximations as it's already the case when we have the two-parameter distribution.

It becomes a mess. Even with a significantly worse approximation, the expression is quite messy. It can be done, but maybe there is a simple way I'm not seeing.

loc is estimated in some of the distributions in _fit_start (e.g. GEV) so I will see if there's something obvious I'm not seeing. Otherwise, _fit_start will probably still work well to do what its name implies, start a fit, but otherwise should not be used as an approximate method (unless the intent IS to use a two parameter distribution). So, either improve _fit_start or limit its application as a method for SPI/SPEI.

@coxipi
Copy link
Contributor Author

coxipi commented Mar 22, 2024

I will implement a very rudimentary estimation of the loc parameter using the quantiles:
Consider an ordered time series $X_1, X_2, \dots, X_n$. The rawest approximation would be: $\text{loc}^* = X_1$ which was the spirit of having an offset. By the definition of loc, we know we must have $\text{loc} \leq X_1$. A more refined approximation is: $\text{loc}^* = X_1 + (X_1-X_2)$, which is smaller. This might undershoot, i.e. $\text{loc}^* < \text{loc}^{\rm Truth}$. And so on.

If we want to use a specific loc parameter, this could be used through fitkwargs. For instance, in some cases, we might want to impose loc==0. We should probably restrict the access to APP method. I could see it be used when loc==0, but otherwise I would forbid this use.

In any cases, we should keep the zero inflated distributions for SPI/precips. I have a good idea of what would be good for many examples, I just need to figure out a proper way to organize things around these ideas.

I think adding a zero_inflated option to fit would be useful

@coxipi
Copy link
Contributor Author

coxipi commented Mar 26, 2024

I had to change some results in test_indices. The most significant changes are for the fisk distribution since I changed significantly the approximations. We now have a more rigorous implementation of the method of moments (I used and <x^2>/^2.

The choice of starting parameters can significantly affect even the full-fledged numerical approach (e.g. "ML" method)

@coxipi
Copy link
Contributor Author

coxipi commented Apr 15, 2024

I've been exploring the sensitivity to initial guesses in our procedure.

Consider what is done in test_indices.py (test_standardized_precipitation_index, etc), computing a few values of SPI/SPEI. I show the example with the fisk distribution, but I also saw this phenomena for gamma dist

freq = "D"
window = 1
dist = "fisk"
method = "ML"
ds = open_dataset("sdba/CanESM2_1950-2100.nc", cache=True, cache_dir=Path("/home/eridup1/tank/xclim_open_dataset/")).isel(location=1)
pr = ds.pr.sel(time=slice("1998", "2000"))
pr_cal = ds.pr.sel(time=slice("1950", "1980"))
params = xci.stats.standardized_index_fit_params(
    pr_cal, freq=freq, window=window, dist=dist, method=method, zero_inflated=True,
)
spi  = xci.standardized_precipitation_index(pr, params=params)
print(spi.isel(time=slice(-11, -1, 2)))

I used two ways to estimate loc in _fit_start, the spi results sometimes largely, sometimes barely

# loc=0
>> [-0.12736419  1.30823431  0.67651811  0.10446318 -0.50246943]
# loc = (x1 * xn - x2**2) / (x1 + xn - 2 * x2)
# x1 is the smallest precip in the dataset, x2 the second smallest, xn the largest
>> [-0.15894928  1.30822495  0.44984607  0.14669885 -0.50273746]

It makes me think in some cases, the same minimum is found, in other cases no.

The difficulty of optimization for 3-parameter gamma distribution is discussed in this issue, confirms my suspicion.
scipy/scipy#18775

They mention scipy.stats.fit may be more robust than scipy.stats.gamma.fit in the case of 3-params dist for instance

CHANGES.rst Outdated Show resolved Hide resolved
coxipi and others added 2 commits April 16, 2024 15:57
@coxipi
Copy link
Contributor Author

coxipi commented Apr 16, 2024

I did the computation for 20 values of loc0 between 0 and the guess I outlined above. There is some variability in the spi output, and that's true for any day of year:

doy=361 (the 0.104 vs. 0.147 spi values above):

image

another random doy I took, 180

image

The worst doy mentioned above, the one with 0.68 vs. 0.45

image

BONUS: using scipy.stats.fit instead of scipy.stats.fisk.fit

All in all, it does not fare much better. Also, in this case, you have to input reasonable range of values to search within, so maybe this part could be optimized a bit more to avoid large span of values like this.

image

All in all, I think things are controlled well enough. As a last study, I will consider how the R package performs in these circumstances.

@coxipi
Copy link
Contributor Author

coxipi commented Apr 18, 2024

A bit of a tangent, but I was wondering how does this compare to the error in a single fit. I generate 30 points in a gamma dist with R, computed a fit + std errors, then computed a SPI for a hundred set of parameters within the error bars, this shows the inherent variability with this measure.

image

@coxipi
Copy link
Contributor Author

coxipi commented Apr 18, 2024

I've been battling with R trying to obtain a confirmation through another package that indeed, MLE is difficult for 3-parameters function. I've had good success working with 2-params distributions that come with fitdistr and fitdistrplus, but as I try to use 3-parameters in additional packages (e.g. FAdist or qualityTools), it becomes very difficult. I keep getting error message concerning initial value in 'vmmin' is not finite, which as I understand should be solved by giving different starting parameters.

library(FAdist)
library(fitdistrplus)
data <- rgamma3(30, shape = 0.5, scale = 1e-4, thres=0)  # Using Gamma distribution for sample data
params = fitdistr(data,dgamma3, start=list(shape=0.5,scale=1e-4, thres=0))

Maybe there is something simple I'm missing, I will simply wrap up the PR now

@coxipi coxipi closed this Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
indicators Climate indices and indicators
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants