
Seemingly incorrect results with int datatype #59

Closed
eberhard-leander opened this issue Dec 5, 2023 · 3 comments · Fixed by #60

eberhard-leander commented Dec 5, 2023

While experimenting with this package, I encountered a strange issue and thought it would be useful to post about it here. In short, it appears that the distance_correlation computation for int dtypes is incorrect when the size of the data is sufficiently large.

Here is a minimal example that can be used to replicate the issue:

import numpy as np
from dcor import distance_correlation

def reproduce_error(n):
    # some simple data, tiled n times
    arr1 = np.array([1, 2, 3] * n)
    arr2 = np.array([10, 20, 5] * n)

    int_int = distance_correlation(arr1, arr2)
    float_int = distance_correlation(arr1.astype(float), arr2)
    int_float = distance_correlation(arr1, arr2.astype(float))
    float_float = distance_correlation(arr1.astype(float), arr2.astype(float))

    print(f"""
    n: {n}
    int vs int: {int_int}
    float vs int: {float_int}
    int vs float: {int_float}
    float vs float: {float_float}
    """)

When we run this code for small values of n, the correlations for all dtype combinations agree and do not change substantially with the sample size.

reproduce_error(1)
    n: 1
    int vs int: 0.7621991222319221
    float vs int: 0.7621991222319221
    int vs float: 0.7621991222319221
    float vs float: 0.7621991222319221

reproduce_error(10)
    n: 10
    int vs int: 0.7621991222319219
    float vs int: 0.7621991222319219
    int vs float: 0.7621991222319219
    float vs float: 0.7621991222319217

reproduce_error(100)
    n: 100
    int vs int: 0.7621991222319221
    float vs int: 0.7621991222319221
    int vs float: 0.7621991222319221
    float vs float: 0.7621991222319215

However, past a certain point, the computations diverge:

reproduce_error(10000)
    n: 10000
    int vs int: 0.890284163962155
    float vs int: 0.890284163962155
    int vs float: 0.7621991222319217
    float vs float: 0.7621991222317823

I've started casting everything to float before computing the correlations to avoid this issue.
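The workaround described above can be sketched as follows. This is a minimal illustration (the array contents mirror the reproduction script; the call to distance_correlation is only indicated in a comment so the sketch stands alone without dcor installed):

```python
import numpy as np

# Arrays mirroring the reproduction above, at a size where int math diverges.
arr1 = np.array([1, 2, 3] * 10000)
arr2 = np.array([10, 20, 5] * 10000)

# Cast both inputs to float64 up front, so every intermediate sum and
# product inside the computation is accumulated in floating point.
x = arr1.astype(np.float64)
y = arr2.astype(np.float64)

# distance_correlation(x, y) would then only ever see float inputs.
```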

vnmabus (Owner) commented Dec 6, 2023

It seems that an overflow occurs in the last case, as I obtain the following warning:

/home/carlos/Programas/Utilidades/Lenguajes/miniconda3/envs/fda311/lib/python3.11/site-packages/dcor/_dcor_internals.py:188: RuntimeWarning: overflow encountered in scalar multiply
  third_term = a_total_sum * b_total_sum / n_samples

I will check if it is possible to avoid it when I have a moment.
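To see how `a_total_sum * b_total_sum` can wrap around, here is a pure-Python sketch that simulates signed 64-bit wraparound. The magnitudes used are illustrative assumptions, not values taken from dcor's internals:

```python
INT64_MAX = 2**63 - 1

def int64_wrap(v):
    """Simulate two's-complement wraparound of a signed 64-bit integer."""
    v &= (1 << 64) - 1
    return v - (1 << 64) if v >= (1 << 63) else v

# Small values are unaffected by the wraparound simulation.
assert int64_wrap(5) == 5

# Illustrative magnitudes: with tens of thousands of samples, each total
# distance sum can reach the 1e9-1e10 range, so their product can exceed
# 2**63 - 1 (about 9.2e18) and wrap to a meaningless negative value,
# matching the "overflow encountered in scalar multiply" warning above.
overflowed = int64_wrap(3_000_000_000 * 4_000_000_000)  # exact product: 1.2e19
print(overflowed)  # negative after wraparound
```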

vnmabus mentioned this issue Dec 7, 2023
vnmabus (Owner) commented Dec 7, 2023

I added a possible fix in #60. Note that converting to floating point arrays is still preferable, because the AVL implementation is compiled in that case.

vnmabus (Owner) commented Jan 26, 2024

I will merge #60. Note that this can still overflow, especially on Windows, where the default integer type (used for integer reductions when the original type is smaller) has only 32 bits. As mentioned before, converting to floating point is preferred.
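To illustrate how much sooner a 32-bit accumulator overflows, here is a small pure-Python sketch (the magnitudes are illustrative assumptions, not values from dcor):

```python
def int32_wrap(v):
    """Simulate two's-complement wraparound of a signed 32-bit integer."""
    v &= (1 << 32) - 1
    return v - (1 << 32) if v >= (1 << 31) else v

# With a 32-bit accumulator the limit is only 2**31 - 1 (about 2.1e9),
# so even modest distance sums overflow on multiplication:
product = 50_000 * 50_000          # exact value: 2.5e9
assert product > 2**31 - 1
print(int32_wrap(product))         # wraps to a negative value
```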
