Improve performance of pairwise distances computation #40

vnmabus · 2022-12-09T08:25:11Z

The computation of pairwise distances is the main bottleneck of the naive algorithm for distance covariance. Currently we use scipy's cdist for Numpy arrays, and a broadcasting computation in other case.

Any performance improvement to this function is thus well received.

Palash123-4 · 2024-11-28T14:50:06Z

Using the 'dcor. rowwise' function, I found a way to compute the pairwise distances. The idea is to calculate the distances using the 'dcor. rowwise' function first and later allocate them in the symmetric distance matrix form.

I did implement it on the multivariate normal data. The Python implementation is as follows:

import numpy as np
from dcor import u_distance_correlation_sqr, rowwise
from scipy.spatial.distance import pdist, squareform
from scipy.stats import multivariate_normal

def fast_dcor( u, v):
    value = u_distance_correlation_sqr(u.astype('float'), v.astype('float'), method = 'avl')
    return value

def main():
   np.bool = np.bool_
   matrix_size = 6
   mean_vector =  [2, 3, 5, 3, 2, 1]
    # np.random.seed(123)
    # A = 0.1 * np.random.rand(matrix_size, matrix_size)
   A = np.array([[0.4959073 , 0.1000873 , 0.09526673, 0.32539095, 0.14799651,
            0.14689728],
           [0.08218122, 0.1318435 , 0.38105886, 0.36172172, 0.36502285,
            0.24346033],
           [0.12722501, 0.30467872, 0.34359662, 0.10908805, 0.45987096,
            0.1476644 ],
           [0.38427917, 0.2076916 , 0.15757227, 0.32931607, 0.29685373,
            0.04350785],
           [0.10871352, 0.27666418, 0.25179949, 0.49252174, 0.22908678,
            0.16522994],
           [0.28722805, 0.1650361 , 0.17008386, 0.10228245, 0.01083518,
            0.43123006]])
   B = np.dot(A, A.transpose())
   n_samples = 1000
   mv = multivariate_normal( mean = mean_vector, cov = B)
   a = mv.rvs(size = n_samples, random_state = 123)
   a_transpose = np.array(a.T)
   distance_ = pdist(a_transpose, metric = fast_dcor)
   print(squareform(distance_))
   # optimized pairwise distance using distance correlation rowise function
   print("\n------Optimized pdist for pairwise distance correlation----------")
   pdist_opt = np.zeros((matrix_size,  matrix_size))
   for i in range(matrix_size):       
       if i < matrix_size - 1:
           a1 = a_transpose[0 : (matrix_size - 1 - i)] 
           a2 = a_transpose[ i + 1 : ]  
           r = rowwise(u_distance_correlation_sqr, a1, a2)           
           for j in range(1, matrix_size - i):
               pdist_opt[j - 1, j + i ] = r[j - 1]
               pdist_opt[j + i , j - 1] = r[j - 1]
               pass
           del a1, a2
           pass
       pass
   print(pdist_opt)

if __name__== '__main__' :
    main()

If you see the outcome I print both the result of scipy pdist and the fast one utilizing rowwise function

emoen · 2024-12-17T14:56:13Z

Comment on performance with rowwise: ive benchmarked rowwise on a dataset of ca 23000 datasets and feature creation with dcor.rowwise(dcor.distance_covariance, a, b) using dcor.rowwise. Using distance_covariance took 8 minutes 41 seconds, and using rowwise took 6 minutes 54 seconds. So 20% improvement.

vnmabus · 2024-12-17T15:37:26Z

It does not seem like such a big improvement (although everything helps, of course). To be honest, I am not very happy with rowwise (it is clumsy and feels like a hack), and I would rather remove it in the future in favor of vectorizing directly distance_correlation and related functions, like it is done for generalized ufuncs in NumPy or for correlation functions in SciPy.

Palash123-4 · 2024-12-17T15:46:08Z

Yes, it seems rowwise is indeed helpful only when you have larger number of cores. Vectorized directly will also be a good replacement if it fasts the computation in compare to rowwise. Thanks for the feedback.

vnmabus added the help wanted label Dec 9, 2022

vnmabus added the enhancement label Dec 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance of pairwise distances computation #40

Improve performance of pairwise distances computation #40

vnmabus commented Dec 9, 2022

Palash123-4 commented Nov 28, 2024

emoen commented Dec 17, 2024 •

edited

Loading

vnmabus commented Dec 17, 2024

Palash123-4 commented Dec 17, 2024

Improve performance of pairwise distances computation #40

Improve performance of pairwise distances computation #40

Comments

vnmabus commented Dec 9, 2022

Palash123-4 commented Nov 28, 2024

emoen commented Dec 17, 2024 • edited Loading

vnmabus commented Dec 17, 2024

Palash123-4 commented Dec 17, 2024

emoen commented Dec 17, 2024 •

edited

Loading