Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Divide outside of floor #6130

Merged
merged 1 commit into from
Sep 28, 2023
Merged

Divide outside of floor #6130

merged 1 commit into from
Sep 28, 2023

Conversation

Blunde1
Copy link
Contributor

@Blunde1 Blunde1 commented Sep 21, 2023

Issue
Resolves #6129

Approach
The part

0.5
            * (1 + math.erf((x + _skew) / (_width * math.sqrt(2.0))))
        ) 

is the cdf of a gaussian, so it yields something on the interval 0,1.
multiplying by steps and taking floor then yields steps different possibilities (intended)
the division should occur outside of floor to standardize to the 0, 1 interval again.

Copy link
Contributor

@dafeda dafeda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful!
Remember to capitalize your commit message.

@Blunde1 Blunde1 self-assigned this Sep 22, 2023
@andreas-el
Copy link
Contributor

You should know that the failing tests were setup as part of the refactoring process (moving from C++ to python), and to me it looks like the original had this bug to begin with.

C++:
-static double trans_derrf(double x, const std::vector<double> arg) {
-    int steps = arg[0];
-    double min = arg[1];
-    double max = arg[2];
-    double skewness = arg[3];
-    double width = arg[4];
-    double y;
-
-    y = floor(steps * 0.5 * (1 + erf((x + skewness) / (width * sqrt(2.0)))) /
-              (steps - 1));
-    return min + y * (max - min);
-}

Python:
+    def trans_derrf(x: float, arg: List[float]) -> float:
+        '''Observe that the argument of the shift should be \"+\"'''
+        _steps, _min, _max, _skew, _width = int(arg[0]), arg[1], arg[2], arg[3], arg[4]
+        y = math.floor(
+            _steps
+            * 0.5
+            * (1 + math.erf((x + _skew) / (_width * math.sqrt(2.0))))
+            / (_steps - 1)
+        )
+        return _min + y * (_max - _min)

@Blunde1
Copy link
Contributor Author

Blunde1 commented Sep 25, 2023

If _skew is too large relative to variance, we get a probability rounded of to exactly 1.0
The resulted binning is then between min and max+1.
This is a problem in the original TransferFunction.trans_errf as well

  def test_that_derrf_is_within_bounds(x, arg):
        result = TransferFunction.trans_derrf(x, arg)
>       assert arg[1] <= result <= arg[2]
E       assert 2.0 <= 1.0
E       Falsifying example: test_that_derrf_is_within_bounds(
E           x=0.0,
E           arg=(2, 0.0, 1.0, 9.0, 1.0),
E       )

All hail hypothesis for finding this!

@Blunde1
Copy link
Contributor Author

Blunde1 commented Sep 25, 2023

Let's resolve #6166 as well

@Blunde1 Blunde1 linked an issue Sep 25, 2023 that may be closed by this pull request
@codecov-commenter
Copy link

codecov-commenter commented Sep 26, 2023

Codecov Report

Merging #6130 (6d18d7e) into main (e9dfa15) will decrease coverage by 0.05%.
Report is 19 commits behind head on main.
The diff coverage is 86.66%.

@@            Coverage Diff             @@
##             main    #6130      +/-   ##
==========================================
- Coverage   82.69%   82.65%   -0.05%     
==========================================
  Files         351      347       -4     
  Lines       21536    21247     -289     
  Branches      839      834       -5     
==========================================
- Hits        17810    17562     -248     
+ Misses       3428     3387      -41     
  Partials      298      298              
Files Coverage Δ
src/ert/config/gen_kw_config.py 97.45% <86.66%> (-0.65%) ⬇️

... and 23 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Blunde1 Blunde1 force-pushed the discrete-gauss-cdf branch 2 times, most recently from 2efef67 to bcff96d Compare September 26, 2023 12:32
@Blunde1
Copy link
Contributor Author

Blunde1 commented Sep 26, 2023

This is ready for review
Edit: given that tests pass

@Blunde1 Blunde1 requested a review from dafeda September 26, 2023 12:35
@Blunde1
Copy link
Contributor Author

Blunde1 commented Sep 26, 2023

Only coverage tests fail.
Rebase + one more test that makes sure the initial problem #6129 using normal cdf

@Blunde1 Blunde1 force-pushed the discrete-gauss-cdf branch 2 times, most recently from b2c4c02 to e7fa8ce Compare September 26, 2023 15:00
@dafeda
Copy link
Contributor

dafeda commented Sep 26, 2023

I may have found an example that fails:

@reproduce_failure('6.84.3', b'AXicY2BgZIAAEL1nAQOMt/8/BDDgAIwAHtgJGQ==')

Could you try running test_that_derrf_is_within_bounds with that decorator and see if you can reproduce the error.
I may have done something wrong...

I also think that we should be able to get rid of the for-loop using for example np.digitize.
Here's perhaps a way:

    def trans_derrf(x: float, arg: List[float]) -> float:
        """
        Bin the result of `trans_errf` with `min=0` and `max=1` to closest of `nbins`
        linearly spaced values on [0,1]. Finally map [0,1] to [min, max].
        """
        _steps, _min, _max, _skew, _width = (
            int(arg[0]),
            arg[1],
            arg[2],
            arg[3],
            arg[4],
        )
        q_values = np.linspace(start=0, stop=1, num=_steps)
        q_checks = np.linspace(start=0, stop=1, num=_steps + 1)[1:]
        y = TransferFunction.trans_errf(x, [0, 1, _skew, _width])
        bin_index = np.digitize(y, q_checks, right=True)
        y_binned = q_values[bin_index]
        result = _min + y_binned * (_max - _min)
        if result > _max or result < _min:
            warnings.warn(
                "trans_derff suffered from catastrophic loss of precision, clamping to min,max",
                stacklevel=1,
            )
            return np.clip(result, _min, _max)
        return result

@Blunde1
Copy link
Contributor Author

Blunde1 commented Sep 27, 2023

Reproducing the test I see

x = -1.0, arg = (2, -1.7976931348623157e+308, 9.9792015476736e+291, 0.0, 1.0)

    @given(st.floats(allow_nan=False, allow_infinity=False), valid_derrf_parameters())
    @reproduce_failure("6.84.3", b"AXicY2BgZIAAEL1nAQOMt/8/BDDgAIwAHtgJGQ==")
    def test_that_derrf_is_within_bounds(x, arg):
        """The result shold always be between (or equal) min and max"""
        result = TransferFunction.trans_derrf(x, arg)
>       assert arg[1] <= result <= arg[2]
E       assert -1.7976931348623157e+308 <= nan

It seems that min and max are so far apart that simple arithmetic fails, which will likely be a problem for all code (in particular distribution and transfer functions) unless special care is taken.
I think we do not take special care of this in general, so unless I am wrong (which I might of course be), it makes sense to me to bound the values of valid_derrf_parameters()

Edit: should probably also include a raise ValueError if the result is nan

@Blunde1
Copy link
Contributor Author

Blunde1 commented Sep 27, 2023

I also think that we should be able to get rid of the for-loop using for example np.digitize. Here's perhaps a way:

        bin_index = np.digitize(y, q_checks, right=True)
        y_binned = q_values[bin_index]

This is better, thanks.

@Blunde1
Copy link
Contributor Author

Blunde1 commented Sep 27, 2023

Hypothesis found through testing test_that_derrf_corresponds_scaled_binned_normal_cdf (@reproduce_failure("6.84.3", b"AXicY2DADxhxSgAAAFUAAw==")) that to mimic the previously loop implementation we should have apply np.digitize with right=False.
What is actually correct?
The documentation says
image
reproducing the floor() functionality in the original implementation, I think right=False is correct.
Edit: but this gives index error...
Edit 2: we need right=True!

@dafeda
Copy link
Contributor

dafeda commented Sep 27, 2023

Tried running it a few times and test_that_derrf_corresponds_scaled_binned_normal_cdf fails with:

E       assert False
E        +  where False = <function isclose at 0x10fb3b870>(0.2500038146972659, 0.25000762939453125)
E        +    where <function isclose at 0x10fb3b870> = np.isclose
E       Falsifying example: test_that_derrf_corresponds_scaled_binned_normal_cdf(
E           x=1.0,
E           arg=(2, -34359738368.0, 0.2500038146972659, 0.0, 1.0),
E       )
E       
E       You can reproduce this example by temporarily adding @reproduce_failure('6.84.3', b'AXicY2CAAUYgPiAFYjVAeRwMeAAjADnbAWY=') as a decorator on your test case

@Blunde1
Copy link
Contributor Author

Blunde1 commented Sep 27, 2023

Tried running it a few times and test_that_derrf_corresponds_scaled_binned_normal_cdf fails with:

E       assert False
E        +  where False = <function isclose at 0x10fb3b870>(0.2500038146972659, 0.25000762939453125)
E        +    where <function isclose at 0x10fb3b870> = np.isclose
E       Falsifying example: test_that_derrf_corresponds_scaled_binned_normal_cdf(
E           x=1.0,
E           arg=(2, -34359738368.0, 0.2500038146972659, 0.0, 1.0),
E       )
E       
E       You can reproduce this example by temporarily adding @reproduce_failure('6.84.3', b'AXicY2CAAUYgPiAFYjVAeRwMeAAjADnbAWY=') as a decorator on your test case

This is again due to precision. Resolving by restricting valid_derrf_parameters() even further

@Blunde1 Blunde1 merged commit b577c65 into equinor:main Sep 28, 2023
38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
4 participants