-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lin_cdf_match and 'x must be strictly increasing' on SciPy > 1.0 #144
Comments
This depends on the time series. If there are a lot of identical values then the piecewise linear CDF matching will not work correctly. In your example the percentiles are
So there is no sensible way scaling that using CDF matching because there are too few unique percentiles. I would argue that the correct scaling in this case would not be to There is the function cdf_match which will try to convert the percentiles to unique values before attempting the cdf matching. This might help in your time series but not in the example. |
OK, some figures on this: the lin_cdf_match issue occurs for 263 out of 1780 (15%) of ISMN stations in the US from 1978-now (the dataset that Tracy picked for the C3S usecase we're working on). That's quite a lot, isn't it? The lin_cdf_match percentiles look something like this: [-0.03 0.06 0.06 0.07 0.09 0.1 0.12 0.16 0.23] I've also tried using cdf_match for the same data and run into a different but probably related issue:
because perc_src now looks like this: [ nan nan nan nan nan 0.0501\n 0.0501 0.0501 ... To be honest, I'm way out of my depth here... ;-) |
A few general questions/comments:
|
Doesn't seem so.
Indeed, it does in my tests (with scipy 0.19.1).
Yes - and NaNs in the input data don't seem to be the problem, see above. Should we add one of the affected ISMN files to the test-data? |
Then we should add a test for this function with the original percentiles that cause this problem. No need for the whole timeseries in the testdata yet. We have three ways of making sure that the percentiles are unique: So the best start would be to add the problematic percentiles to the tests for these three functions. |
OK! I'll take a look a this after I'm through the latest proposal drive. In the meantime, so that I don't forget: the percentiles for |
Something strange seems to be going on here. Normally you should only have percentile values for the percentiles |
True. Just to be clear: I'm talking about cdf_match [scaling.py:344] :
nbins is the default 100, so What's in the textfile is perc_src. |
Attaching the offending (reference data) file: data.zip |
A unit test that reproduce the problem:
The necessary data: unittest.zip |
I just tested your unittest with my default environment (scipy-0.16.0) and there weren't any nans in But also some other module might be the problem. Could you upload a file that specifies your installed modules? E.g. with
Then I can install your environment and do some debugging. Attached is my output , but as a warning, it has way more modules installed than necessary for the above test. |
Thanks for looking at this, greatly appreciated! :) Here's my environment Perhaps we should identify a minimum environment for running the test and for reproducing the error. |
Both new pytesmo and scipy contribute to this problem: 1) pytesmo>=0.6.8 problem:
IMHO there are two ways to solve this: But I'm not the one to make that decision. 2) scipy>=1.0 problem:(occuring for pytesmo 0.6.7, can´t test with current pytesmo because of above problem):
@awst-baum It seems highly unlikely that real floating point data has that many duplicates - something else seems to have gone wrong in an earlier processing step, I'd suggest looking into that. But in the meanwhile you could use an older pythesmo (0.6.7) and scipy (<1.0) version. |
@Lmoesinger Thanks for the analysis! I guess I have to discuss this with Tracy to figure out what to do. |
@Lmoesinger now that you've finished investigating CDF matching, would you be able to comment further on this issue? |
Has been a long time, my memory is a bit fuzzy on this subject, but we need to discuss this point in person: |
I've run into ISMN timeseries files that are turned into a series of NaNs by scaling (e.g. ARM/Anthony/ARM_ARM_Anthony_sm_0.025000_0.025000_SMP1-A_19780101_20180712.stm at lon: -98.0969, lat: 37.2134).
This seems to happen in
pytesmo.scaling.gen_cdf_match
(scaling.py, line 403) because SciPy's InterpolatedUnivariateSpline throws ValueError('x must be strictly increasing'). Pytesmo catches this exception and instead gives a warning: "Too few percentiles for chosen k." (Are those two things really equivalent?). Then pytesmo returns a column of NaNs.The exception seems to have been introduced in SciPy 1.0 according to this SciPy ticket: scipy/scipy#8535
If I downgrade to scipy 0.19.1 the scaling works (however the metrics calculated later on are still NaNs).
Is there a way to scale these timeseries anyway (and not get just NaNs)? Or are the timeseries just somehow corrupted?
The issue can be reproduced with SciPy 1.1.0 when running:
The text was updated successfully, but these errors were encountered: