
Seemingly incorrect results with int datatype #59

Closed
eberhard-leander opened this issue Dec 5, 2023 · 3 comments · Fixed by #60

eberhard-leander commented Dec 5, 2023

While experimenting with this package, I encountered a strange issue and thought it would be useful to post about it here. In short, it appears that the distance_correlation computation for int dtypes is incorrect when the size of the data is sufficiently large.

Here is a minimal example that can be used to replicate the issue:

import numpy as np
from dcor import distance_correlation

def reproduce_error(n):
    # some simple data, tiled n times
    arr1 = np.array([1, 2, 3] * n)
    arr2 = np.array([10, 20, 5] * n)

    int_int = distance_correlation(arr1, arr2)
    float_int = distance_correlation(arr1.astype(float), arr2)
    int_float = distance_correlation(arr1, arr2.astype(float))
    float_float = distance_correlation(arr1.astype(float), arr2.astype(float))

    print(f"""
    n: {n}
    int vs int: {int_int}
    float vs int: {float_int}
    int vs float: {int_float}
    float vs float: {float_float}
    """)

When we run this code for small values of n, the correlations for all dtype combinations agree and do not change substantially with the sample size.

reproduce_error(1)
    n: 1
    int vs int: 0.7621991222319221
    float vs int: 0.7621991222319221
    int vs float: 0.7621991222319221
    float vs float: 0.7621991222319221

reproduce_error(10)
    n: 10
    int vs int: 0.7621991222319219
    float vs int: 0.7621991222319219
    int vs float: 0.7621991222319219
    float vs float: 0.7621991222319217

reproduce_error(100)
    n: 100
    int vs int: 0.7621991222319221
    float vs int: 0.7621991222319221
    int vs float: 0.7621991222319221
    float vs float: 0.7621991222319215

However, past a certain point, the computations diverge:

reproduce_error(10000)
    n: 10000
    int vs int: 0.890284163962155
    float vs int: 0.890284163962155
    int vs float: 0.7621991222319217
    float vs float: 0.7621991222317823

I've started casting everything to float before computing the correlations to avoid this issue.
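The workaround described above can be sketched as follows. This is a minimal illustration (the array contents mirror the reproduction script; the call to distance_correlation is only indicated in a comment so the sketch stands alone without dcor installed):

```python
import numpy as np

# Arrays mirroring the reproduction above, at a size where int math diverges.
arr1 = np.array([1, 2, 3] * 10000)
arr2 = np.array([10, 20, 5] * 10000)

# Cast both inputs to float64 up front, so every intermediate sum and
# product inside the computation is accumulated in floating point.
x = arr1.astype(np.float64)
y = arr2.astype(np.float64)

# distance_correlation(x, y) would then only ever see float inputs.
```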

vnmabus (Owner) commented Dec 6, 2023

It seems that an overflow occurs in the last case, as I obtain the following warning:

/home/carlos/Programas/Utilidades/Lenguajes/miniconda3/envs/fda311/lib/python3.11/site-packages/dcor/_dcor_internals.py:188: RuntimeWarning: overflow encountered in scalar multiply
  third_term = a_total_sum * b_total_sum / n_samples

I will check if it is possible to avoid it when I have a moment.
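To see how `a_total_sum * b_total_sum` can wrap around, here is a pure-Python sketch that simulates signed 64-bit wraparound. The magnitudes used are illustrative assumptions, not values taken from dcor's internals:

```python
INT64_MAX = 2**63 - 1

def int64_wrap(v):
    """Simulate two's-complement wraparound of a signed 64-bit integer."""
    v &= (1 << 64) - 1
    return v - (1 << 64) if v >= (1 << 63) else v

# Small values are unaffected by the wraparound simulation.
assert int64_wrap(5) == 5

# Illustrative magnitudes: with tens of thousands of samples, each total
# distance sum can reach the 1e9-1e10 range, so their product can exceed
# 2**63 - 1 (about 9.2e18) and wrap to a meaningless negative value,
# matching the "overflow encountered in scalar multiply" warning above.
overflowed = int64_wrap(3_000_000_000 * 4_000_000_000)  # exact product: 1.2e19
print(overflowed)  # negative after wraparound
```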

vnmabus mentioned this issue Dec 7, 2023
vnmabus (Owner) commented Dec 7, 2023

I added a possible fix in #60. Note that converting to floating point arrays is still preferable, because the AVL implementation is compiled in that case.

vnmabus (Owner) commented Jan 26, 2024

I will merge #60. Note that this can still overflow, especially on Windows, where the default integer type (used for integer reductions when the original type is smaller) has only 32 bits. As mentioned before, converting to floating point is preferred.
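To illustrate how much sooner a 32-bit accumulator overflows, here is a small pure-Python sketch (the magnitudes are illustrative assumptions, not values from dcor):

```python
def int32_wrap(v):
    """Simulate two's-complement wraparound of a signed 32-bit integer."""
    v &= (1 << 32) - 1
    return v - (1 << 32) if v >= (1 << 31) else v

# With a 32-bit accumulator the limit is only 2**31 - 1 (about 2.1e9),
# so even modest distance sums overflow on multiplication:
product = 50_000 * 50_000          # exact value: 2.5e9
assert product > 2**31 - 1
print(int32_wrap(product))         # wraps to a negative value
```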
