
lin_cdf_match and 'x must be strictly increasing' on SciPy > 1.0 #144

Open
awst-baum opened this issue Jul 25, 2018 · 16 comments
awst-baum (Collaborator) commented Jul 25, 2018

I've run into ISMN timeseries files that are turned into a series of NaNs by scaling (e.g. ARM/Anthony/ARM_ARM_Anthony_sm_0.025000_0.025000_SMP1-A_19780101_20180712.stm at lon: -98.0969, lat: 37.2134).

This seems to happen in pytesmo.scaling.gen_cdf_match (scaling.py, line 403) because SciPy's InterpolatedUnivariateSpline throws ValueError('x must be strictly increasing'). Pytesmo catches this exception and instead gives a warning: "Too few percentiles for chosen k." (Are those two things really equivalent?). Then pytesmo returns a column of NaNs.

The exception seems to have been introduced in SciPy 1.0 according to this SciPy ticket: scipy/scipy#8535

If I downgrade to scipy 0.19.1 the scaling works (however the metrics calculated later on are still NaNs).

Is there a way to scale these timeseries anyway (and not get just NaNs)? Or are the timeseries just somehow corrupted?

The issue can be reproduced with SciPy 1.1.0 when running:

    import numpy as np
    from pytesmo import scaling

    def test_strictly_increasing():
        n = 1000
        x = np.ones(n) * 0.5
        y = np.ones(n)

        s = scaling.lin_cdf_match(y, x)

cpaulik (Collaborator) commented Jul 26, 2018

This depends on the time series. If there are a lot of identical values then the piecewise linear CDF matching will not work correctly.

In your example the percentiles are

(Pdb) perc_ref
array([ 0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5])
(Pdb) perc_src
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

So there is no sensible way to scale that using CDF matching because there are too few unique percentiles. I would argue that the correct scaling in this case would not be NaN but a constant 0.5.

There is also the function cdf_match, which tries to convert the percentiles to unique values before attempting the CDF matching. This might help with your time series, but not with the example above.
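The failure itself can be reproduced with SciPy alone. A minimal sketch, using stand-ins for the duplicate percentile arrays shown above (on SciPy < 1.0 the constructor silently accepts such input instead of raising):

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

# stand-ins for the duplicate percentile arrays shown above
perc_src = np.ones(9)
perc_ref = np.full(9, 0.5)

try:
    # k=5 matches the spline order passed to gen_cdf_match
    InterpolatedUnivariateSpline(perc_src, perc_ref, k=5)
except ValueError as err:
    # on SciPy >= 1.0 this reports that x must be strictly increasing
    print(err)
```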

awst-baum (Collaborator, Author) commented Jul 26, 2018

OK, some figures on this: the lin_cdf_match issue occurs for 263 out of 1780 (15%) of the ISMN stations in the US from 1978 to now (the dataset that Tracy picked for the C3S use case we're working on). That's quite a lot, isn't it?

The lin_cdf_match percentiles look something like this: [-0.03 0.06 0.06 0.07 0.09 0.1 0.12 0.16 0.23]

I've also tried using cdf_match for the same data and run into a different but probably related issue:
InterpolatedUnivariateSpline in gen_cdf_match (scaling.py:402) now throws

error: (xb<=x[0]) failed for 2nd keyword xb: fpcurf0:xb=-nan

because perc_src now looks like this: [ nan nan nan nan nan 0.0501\n 0.0501 0.0501 ...

To be honest, I'm way out of my depth here... ;-)

cpaulik (Collaborator) commented Jul 26, 2018

A few general questions/comments:

  • Does your original time series still contain NaN values?
  • Does unique_percentiles_interpolate (imported from pytesmo.utils and used in cdf_match) introduce the NaN values into the percentiles, or are they already there after np.percentile?
  • Are you using the validation framework? If so then I think it should drop nan values before doing any scaling. See https://github.com/TUW-GEO/pytesmo/blob/master/pytesmo/validation_framework/validation.py#L483
  • If not and you want to use any kind of CDF matching then you have to make sure that the percentiles are unique. Otherwise it will not work.
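That duplicate-percentile situation is easy to reproduce with np.percentile alone. A sketch with synthetic data (not the actual ISMN series):

```python
import numpy as np

# synthetic series where 90% of the values are identical,
# a stand-in for a sensor that reports the same reading for long stretches
src = np.concatenate([np.full(900, 0.05), np.linspace(0.06, 0.2, 100)])

percentiles = np.linspace(0, 100, 100)  # nbins=100, as in cdf_match
perc_src = np.percentile(src, percentiles)

# most of the 100 percentile values collapse onto 0.05
print(len(np.unique(perc_src)))
```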

awst-baum (Collaborator, Author) commented:

Does your original time series still contain NaN values?

Doesn't seem so. data.isnull().values.any() returns False for at least some of the timeseries that have the problem.

Does unique_percentiles_interpolate (from pytesmo.utils, used in cdf_match) introduce the NaN values into the percentiles?

Indeed, it does in my tests (with scipy 0.19.1).

Are you using the validation framework?

Yes - and NaNs in the input data don't seem to be the problem, see above.

Should we add one of the affected ISMN files to the test-data?

cpaulik (Collaborator) commented Jul 26, 2018

Does unique_percentiles_interpolate (from pytesmo.utils, used in cdf_match) introduce the NaN values into the percentiles?

Indeed, it does in my tests (with scipy 0.19.1).

Then we should add a test for this function with the original percentiles that cause this problem. No need for the whole timeseries in the testdata yet.

We have three ways of making sure that the percentiles are unique:

So the best start would be to add the problematic percentiles to the tests for these three functions.

awst-baum (Collaborator, Author) commented:

OK!

I'll take a look at this after I'm through the latest proposal drive.

In the meantime, so that I don't forget: the percentiles for
ARM_ARM_Anthony_sm_0.025000_0.025000_SMP1-B_19780101_20180712.stm
are in the attached textfile.

perc_src.zip

cpaulik (Collaborator) commented Jul 26, 2018

Something strange seems to be going on here. Normally you should only have percentile values for the percentiles [0, 5, 10, 30, 50, 70, 90, 95, 100]. So 9 values. Your zip file has many more.

awst-baum (Collaborator, Author) commented:

True.

Just to be clear: I'm talking about cdf_match (scaling.py:344):

percentiles = np.linspace(0, 100, nbins)
perc_src = np.array(np.percentile(src, percentiles))

nbins is the default 100, so len(percentiles) is 100 and len(perc_src) is also 100.

What's in the textfile is perc_src.
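The quoted lines behave as described; with the default nbins the percentile array has 100 entries. A quick sketch with random stand-in data (the real input comes from the ISMN file):

```python
import numpy as np

# any stand-in series; the real data comes from the ISMN file
src = np.random.default_rng(0).normal(size=5000)

nbins = 100
percentiles = np.linspace(0, 100, nbins)
perc_src = np.array(np.percentile(src, percentiles))

print(len(percentiles), len(perc_src))  # 100 100
```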

awst-baum (Collaborator, Author) commented:

Attaching the offending (reference data) file: data.zip

awst-baum (Collaborator, Author) commented Aug 8, 2018

A unit test that reproduces the problem:

def test_cdf_nan():
    from os import path

    import numpy as np
    from pytesmo.scaling import cdf_match

    testdata_path = '/space/projects/qa4sm/data/testdata'
    data_file = path.join(testdata_path, 'thedata.txt')
    ref_file = path.join(testdata_path, 'theref.txt')
    the_data = np.loadtxt(data_file)
    the_ref = np.loadtxt(ref_file)

    result = cdf_match(the_data, the_ref)
    print(result)

    assert not np.isnan(result).any()

The necessary data: unittest.zip

Lmoesinger commented Aug 23, 2018

I just tested your unit test with my default environment (scipy 0.16.0) and there weren't any NaNs in result. So there seem to be SciPy versions where it a) works (scipy 0.16.0), b) returns NaNs (scipy 0.19?), and c) raises an error (scipy 1.0).

But some other module might also be the problem. Could you upload a file that specifies your installed modules? E.g. with

conda env export > environment.txt

Then I can install your environment and do some debugging.

Attached is my output, but as a warning, it has far more modules installed than necessary for the above test.

awst-baum (Collaborator, Author) commented:

Thanks for looking at this, greatly appreciated! :)

Here's my environment
environment.txt
which may also contain more than necessary - plus some installs from non-public EODC GitLab repositories...

Perhaps we should identify a minimum environment for running the test and for reproducing the error.

Lmoesinger commented Aug 24, 2018

Both new pytesmo and scipy contribute to this problem:

1) pytesmo>=0.6.8 problem:

scaling.cdf_match() works like this:
a) get the values of the percentiles, perc_src
b) remove duplicate perc_src values and then fill in the removed values by interpolation. But here is the problem: if the differences between the perc_src values are very small, the interpolated values might again be duplicates. This then breaks the subsequent matching.

IMHO there are two ways to solve this:
a) Go back to how it is done in pytesmo 0.6.7 (don't interpolate the removed percentiles). In very specific cases this might lead to a poor fit, though.
b) Use a threshold while determining what is a duplicate value. np.unique() is very susceptible to floating point errors. But the threshold would have to be set with care (probably dependent on the relative difference and the longest duplicate streak).

But I'm not the one to make that decision.
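Option b) could look roughly like this. A hypothetical sketch (the helper name and the default tolerance are made up; picking the tolerance is exactly the open question):

```python
import numpy as np

def drop_near_duplicates(perc, rtol=1e-9):
    # keep an index only if its value exceeds the previously kept
    # value by more than a relative tolerance
    keep = [0]
    for i in range(1, len(perc)):
        if perc[i] - perc[keep[-1]] > rtol * max(abs(perc[keep[-1]]), 1.0):
            keep.append(i)
    return np.asarray(keep)

perc = np.array([0.05, 0.05, 0.05, 0.05, 0.05, 0.06, 0.07])
print(drop_near_duplicates(perc))  # [0 5 6]
```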

2) scipy>=1.0 problem:

(occurring for pytesmo 0.6.7; can't test with the current pytesmo because of the above problem):

scipy.interpolate.InterpolatedUnivariateSpline breaks if x is not ascending. For some reason perc_src is not always ascending, even if there are no duplicate values in it. This should be fixed with a simple np.sort() in line 351 (current version) of scaling.py:

return gen_cdf_match(src, np.sort(perc_src), np.sort(perc_ref),
                     min_val=min_val, max_val=max_val,
                     k=5)
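The effect of the suggested np.sort() fix can be checked in isolation. A sketch (the perc_src values here are made up, not from the ISMN data):

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

# made-up perc_src with unique but non-ascending values
perc_src = np.array([0.1, 0.3, 0.2, 0.5, 0.4, 0.7, 0.6, 0.9, 0.8])
perc_ref = np.linspace(0.0, 1.0, 9)

# both arrays hold percentile values of monotone CDFs,
# so sorting each one independently preserves the pairing
spline = InterpolatedUnivariateSpline(np.sort(perc_src), np.sort(perc_ref), k=5)
print(float(spline(0.2)))
```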

@awst-baum It seems highly unlikely that real floating point data has that many duplicates - something else seems to have gone wrong in an earlier processing step, and I'd suggest looking into that. But in the meantime you could use an older pytesmo (0.6.7) and scipy (<1.0) version.

awst-baum (Collaborator, Author) commented:

@Lmoesinger Thanks for the analysis! I guess I have to discuss this with Tracy to figure out what to do.

tracyscanlon (Collaborator) commented:

@Lmoesinger now that you've finished investigating CDF matching, would you be able to comment further on this issue?

tracyscanlon self-assigned this Jul 15, 2019
Lmoesinger commented Jul 15, 2019

It has been a long time and my memory is a bit fuzzy on this subject, but we need to discuss this point in person: the removal of duplicate percentiles. I'd argue that you can only get those if a) there are very few values given the number of percentile bins or b) your data is "bad": natural data usually does not consist of more than 10% duplicate (floating point) values. As such, I'd generally suggest returning an error with a meaningful message rather than not telling the user that they should either a) increase their bin size or b) check their data.
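Such a guard could be as simple as the following sketch (hypothetical; the function name and message are made up and this is not part of pytesmo):

```python
import numpy as np

def check_unique_percentiles(perc_src):
    # raise a meaningful error instead of silently returning NaNs
    n_unique = len(np.unique(perc_src))
    if n_unique != len(perc_src):
        raise ValueError(
            f"only {n_unique} of {len(perc_src)} percentile values are "
            "unique; increase the bin size or check the input data")

check_unique_percentiles(np.arange(9))  # unique values pass silently
```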
