
lin_cdf_match and 'x must be strictly increasing' on SciPy > 1.0 #144

Open
awst-baum opened this issue Jul 25, 2018 · 16 comments
awst-baum (Collaborator) commented Jul 25, 2018

I've run into ISMN timeseries files that are turned into a series of NaNs by scaling (e.g. ARM/Anthony/ARM_ARM_Anthony_sm_0.025000_0.025000_SMP1-A_19780101_20180712.stm at lon: -98.0969, lat: 37.2134).

This seems to happen in pytesmo.scaling.gen_cdf_match (scaling.py, line 403) because SciPy's InterpolatedUnivariateSpline throws ValueError('x must be strictly increasing'). Pytesmo catches this exception and instead gives a warning: "Too few percentiles for chosen k." (Are those two things really equivalent?). Then pytesmo returns a column of NaNs.

The exception seems to have been introduced in SciPy 1.0 according to this SciPy ticket: scipy/scipy#8535

If I downgrade to scipy 0.19.1 the scaling works (however the metrics calculated later on are still NaNs).

Is there a way to scale these timeseries anyway (and not get just NaNs)? Or are the timeseries just somehow corrupted?

The issue can be reproduced with SciPy 1.1.0 when running:

    import numpy as np
    from pytesmo import scaling

    def test_strictly_increasing():
        n = 1000
        x = np.ones(n) * 0.5
        y = np.ones(n)

        s = scaling.lin_cdf_match(y, x)

cpaulik (Collaborator) commented Jul 26, 2018

This depends on the time series. If there are a lot of identical values then the piecewise linear CDF matching will not work correctly.

In your example the percentiles are

(Pdb) perc_ref
array([ 0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5])
(Pdb) perc_src
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

So there is no sensible way to scale that using CDF matching because there are too few unique percentiles. I would argue that the correct scaling in this case would not be NaN but a constant 0.5.

There is also the function cdf_match, which tries to convert the percentiles to unique values before attempting the CDF matching. This might help with your time series, but not with the example above.
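The failure itself can be reproduced with SciPy alone. A minimal sketch, using stand-ins for the duplicate percentile arrays shown above (on SciPy < 1.0 the constructor silently accepts such input instead of raising):

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

# stand-ins for the duplicate percentile arrays shown above
perc_src = np.ones(9)
perc_ref = np.full(9, 0.5)

try:
    # k=5 matches the spline order passed to gen_cdf_match
    InterpolatedUnivariateSpline(perc_src, perc_ref, k=5)
except ValueError as err:
    # on SciPy >= 1.0 this reports that x must be strictly increasing
    print(err)
```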

awst-baum (Collaborator, Author) commented Jul 26, 2018

OK, some figures on this: the lin_cdf_match issue occurs for 263 out of 1780 (15%) of the ISMN stations in the US from 1978 to now (the dataset that Tracy picked for the C3S use case we're working on). That's quite a lot, isn't it?

The lin_cdf_match percentiles look something like this: [-0.03 0.06 0.06 0.07 0.09 0.1 0.12 0.16 0.23]

I've also tried using cdf_match for the same data and run into a different but probably related issue:
InterpolatedUnivariateSpline in gen_cdf_match (scaling.py:402) now throws

error: (xb<=x[0]) failed for 2nd keyword xb: fpcurf0:xb=-nan

because perc_src now looks like this: [ nan nan nan nan nan 0.0501\n 0.0501 0.0501 ...

To be honest, I'm way out of my depth here... ;-)

cpaulik (Collaborator) commented Jul 26, 2018

A few general questions/comments:

  • Does your original time series still contain NaN values?
  • Does unique_percentiles_interpolate (imported from pytesmo.utils and used in cdf_match) introduce the NaN values into the percentiles, or are they already there after np.percentile?
  • Are you using the validation framework? If so then I think it should drop nan values before doing any scaling. See https://github.com/TUW-GEO/pytesmo/blob/master/pytesmo/validation_framework/validation.py#L483
  • If not and you want to use any kind of CDF matching then you have to make sure that the percentiles are unique. Otherwise it will not work.
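That duplicate-percentile situation is easy to reproduce with np.percentile alone. A sketch with synthetic data (not the actual ISMN series):

```python
import numpy as np

# synthetic series where 90% of the values are identical,
# a stand-in for a sensor that reports the same reading for long stretches
src = np.concatenate([np.full(900, 0.05), np.linspace(0.06, 0.2, 100)])

percentiles = np.linspace(0, 100, 100)  # nbins=100, as in cdf_match
perc_src = np.percentile(src, percentiles)

# most of the 100 percentile values collapse onto 0.05
print(len(np.unique(perc_src)))
```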

awst-baum (Collaborator, Author) commented:

Does your original time series still contain NaN values?

Doesn't seem so. data.isnull().values.any() returns False for at least some of the timeseries that have the problem.

Does unique_percentiles_interpolate (from pytesmo.utils, used in cdf_match) introduce the NaN values into the percentiles?

Indeed, it does in my tests (with scipy 0.19.1).

Are you using the validation framework?

Yes - and NaNs in the input data don't seem to be the problem, see above.

Should we add one of the affected ISMN files to the test-data?

cpaulik (Collaborator) commented Jul 26, 2018

Does unique_percentiles_interpolate (from pytesmo.utils, used in cdf_match) introduce the NaN values into the percentiles?

Indeed, it does in my tests (with scipy 0.19.1).

Then we should add a test for this function with the original percentiles that cause this problem. No need for the whole timeseries in the testdata yet.

We have three ways of making sure that the percentiles are unique:

So the best start would be to add the problematic percentiles to the tests for these three functions.

awst-baum (Collaborator, Author) commented:

OK!

I'll take a look at this after I'm through the latest proposal drive.

In the meantime, so that I don't forget: the percentiles for
ARM_ARM_Anthony_sm_0.025000_0.025000_SMP1-B_19780101_20180712.stm
are in the attached textfile.

perc_src.zip

cpaulik (Collaborator) commented Jul 26, 2018

Something strange seems to be going on here. Normally you should only have percentile values for the percentiles [0, 5, 10, 30, 50, 70, 90, 95, 100]. So 9 values. Your zip file has many more.

awst-baum (Collaborator, Author) commented:

True.

Just to be clear: I'm talking about cdf_match (scaling.py:344):

percentiles = np.linspace(0, 100, nbins)
perc_src = np.array(np.percentile(src, percentiles))

nbins is the default 100, so len(percentiles) is 100 and len(perc_src) is also 100.

What's in the textfile is perc_src.
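The quoted lines behave as described; with the default nbins the percentile array has 100 entries. A quick sketch with random stand-in data (the real input comes from the ISMN file):

```python
import numpy as np

# any stand-in series; the real data comes from the ISMN file
src = np.random.default_rng(0).normal(size=5000)

nbins = 100
percentiles = np.linspace(0, 100, nbins)
perc_src = np.array(np.percentile(src, percentiles))

print(len(percentiles), len(perc_src))  # 100 100
```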

awst-baum (Collaborator, Author) commented:

Attaching the offending (reference data) file: data.zip

awst-baum (Collaborator, Author) commented Aug 8, 2018

A unit test that reproduces the problem:

def test_cdf_nan():
    from os import path

    import numpy as np
    from pytesmo.scaling import cdf_match

    testdata_path = '/space/projects/qa4sm/data/testdata'
    data_file = path.join(testdata_path, 'thedata.txt')
    ref_file = path.join(testdata_path, 'theref.txt')
    the_data = np.loadtxt(data_file)
    the_ref = np.loadtxt(ref_file)

    result = cdf_match(the_data, the_ref)
    print(result)

    assert not np.isnan(result).any()

The necessary data: unittest.zip

Lmoesinger commented Aug 23, 2018

I just tested your unit test with my default environment (scipy 0.16.0) and there weren't any NaNs in result. So there seem to be SciPy versions where it a) works (scipy 0.16.0), b) returns NaNs (scipy 0.19?), and c) raises an error (scipy 1.0).

But some other module might also be the problem. Could you upload a file that specifies your installed modules? E.g. with

conda env export > environment.txt

Then I can install your environment and do some debugging.

Attached is my output, but as a warning, it has far more modules installed than necessary for the above test.

awst-baum (Collaborator, Author) commented:

Thanks for looking at this, greatly appreciated! :)

Here's my environment
environment.txt
which may also contain more than necessary - plus some installs from non-public EODC GitLab repositories...

Perhaps we should identify a minimum environment for running the test and for reproducing the error.

Lmoesinger commented Aug 24, 2018

Both new pytesmo and scipy contribute to this problem:

1) pytesmo>=0.6.8 problem:

scaling.cdf_match() works like this:
a) get the values of the percentiles, perc_src
b) remove duplicate perc_src values and then fill in the removed values by interpolation. But here is the problem: if the differences between the perc_src values are very small, the interpolated values might again be duplicates. This then breaks the subsequent matching.

IMHO there are two ways to solve this:
a) Go back to how it is done in pytesmo 0.6.7 (don't interpolate the removed percentiles). In very specific cases this might lead to a poor fit, though.
b) Use a threshold while determining what is a duplicate value. np.unique() is very susceptible to floating point errors. But the threshold would have to be set with care (probably dependent on the relative difference and the longest duplicate streak).

But I'm not the one to make that decision.
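Option b) could look roughly like this. A hypothetical sketch (the helper name and the default tolerance are made up; picking the tolerance is exactly the open question):

```python
import numpy as np

def drop_near_duplicates(perc, rtol=1e-9):
    # keep an index only if its value exceeds the previously kept
    # value by more than a relative tolerance
    keep = [0]
    for i in range(1, len(perc)):
        if perc[i] - perc[keep[-1]] > rtol * max(abs(perc[keep[-1]]), 1.0):
            keep.append(i)
    return np.asarray(keep)

perc = np.array([0.05, 0.05, 0.05, 0.05, 0.05, 0.06, 0.07])
print(drop_near_duplicates(perc))  # [0 5 6]
```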

2) scipy>=1.0 problem:

(occurring for pytesmo 0.6.7; can't test with the current pytesmo because of the above problem):

scipy.interpolate.InterpolatedUnivariateSpline breaks if x is not ascending. For some reason perc_src is not always ascending, even if there are no duplicate values in it. This should be fixed with a simple np.sort() in line 351 (current version) of scaling.py:

return gen_cdf_match(src, np.sort(perc_src), np.sort(perc_ref),
                     min_val=min_val, max_val=max_val,
                     k=5)
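The effect of the suggested np.sort() fix can be checked in isolation. A sketch (the perc_src values here are made up, not from the ISMN data):

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

# made-up perc_src with unique but non-ascending values
perc_src = np.array([0.1, 0.3, 0.2, 0.5, 0.4, 0.7, 0.6, 0.9, 0.8])
perc_ref = np.linspace(0.0, 1.0, 9)

# both arrays hold percentile values of monotone CDFs,
# so sorting each one independently preserves the pairing
spline = InterpolatedUnivariateSpline(np.sort(perc_src), np.sort(perc_ref), k=5)
print(float(spline(0.2)))
```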

@awst-baum It seems highly unlikely that real floating point data has that many duplicates - something else seems to have gone wrong in an earlier processing step, and I'd suggest looking into that. But in the meantime you could use an older pytesmo (0.6.7) and scipy (<1.0) version.

awst-baum (Collaborator, Author) commented:

@Lmoesinger Thanks for the analysis! I guess I have to discuss this with Tracy to figure out what to do.

tracyscanlon (Collaborator) commented:

@Lmoesinger now that you've finished investigating CDF matching, would you be able to comment further on this issue?

tracyscanlon self-assigned this Jul 15, 2019
Lmoesinger commented Jul 15, 2019

It has been a long time and my memory is a bit fuzzy on this subject, but we need to discuss this point in person: the removal of duplicate percentiles. I'd argue that you can only get those if a) there are very few values given the number of percentile bins or b) your data is "bad": natural data usually does not consist of more than 10% duplicate (floating point) values. As such, I'd generally suggest returning an error with a meaningful message rather than not telling the user that they should either a) increase their bin size or b) check their data.
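Such a guard could be as simple as the following sketch (hypothetical; the function name and message are made up and this is not part of pytesmo):

```python
import numpy as np

def check_unique_percentiles(perc_src):
    # raise a meaningful error instead of silently returning NaNs
    n_unique = len(np.unique(perc_src))
    if n_unique != len(perc_src):
        raise ValueError(
            f"only {n_unique} of {len(perc_src)} percentile values are "
            "unique; increase the bin size or check the input data")

check_unique_percentiles(np.arange(9))  # unique values pass silently
```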
