Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of pairwise distances computation #40

Open
vnmabus opened this issue Dec 9, 2022 · 4 comments
Open

Improve performance of pairwise distances computation #40

vnmabus opened this issue Dec 9, 2022 · 4 comments

Comments

@vnmabus
Copy link
Owner

vnmabus commented Dec 9, 2022

The computation of pairwise distances is the main bottleneck of the naive algorithm for distance covariance. Currently we use scipy's cdist for Numpy arrays, and a broadcasting computation in other case.

Any performance improvement to this function is thus well received.

@Palash123-4
Copy link

Using the 'dcor. rowwise' function, I found a way to compute the pairwise distances. The idea is to calculate the distances using the 'dcor. rowwise' function first and later allocate them in the symmetric distance matrix form.

I did implement it on the multivariate normal data. The Python implementation is as follows:

import numpy as np
from dcor import u_distance_correlation_sqr, rowwise
from scipy.spatial.distance import pdist, squareform
from scipy.stats import multivariate_normal
def fast_dcor( u, v):
    value = u_distance_correlation_sqr(u.astype('float'), v.astype('float'), method = 'avl')
    return value  
def main():
   np.bool = np.bool_
   matrix_size = 6
   mean_vector =  [2, 3, 5, 3, 2, 1]
    # np.random.seed(123)
    # A = 0.1 * np.random.rand(matrix_size, matrix_size)
   A = np.array([[0.4959073 , 0.1000873 , 0.09526673, 0.32539095, 0.14799651,
            0.14689728],
           [0.08218122, 0.1318435 , 0.38105886, 0.36172172, 0.36502285,
            0.24346033],
           [0.12722501, 0.30467872, 0.34359662, 0.10908805, 0.45987096,
            0.1476644 ],
           [0.38427917, 0.2076916 , 0.15757227, 0.32931607, 0.29685373,
            0.04350785],
           [0.10871352, 0.27666418, 0.25179949, 0.49252174, 0.22908678,
            0.16522994],
           [0.28722805, 0.1650361 , 0.17008386, 0.10228245, 0.01083518,
            0.43123006]])
   B = np.dot(A, A.transpose())
   n_samples = 1000
   mv = multivariate_normal( mean = mean_vector, cov = B)
   a = mv.rvs(size = n_samples, random_state = 123)
   a_transpose = np.array(a.T)
   distance_ = pdist(a_transpose, metric = fast_dcor)
   print(squareform(distance_))
   # optimized pairwise distance using distance correlation rowise function
   print("\n------Optimized pdist for pairwise distance correlation----------")
   pdist_opt = np.zeros((matrix_size,  matrix_size))
   for i in range(matrix_size):       
       if i < matrix_size - 1:
           a1 = a_transpose[0 : (matrix_size - 1 - i)] 
           a2 = a_transpose[ i + 1 : ]  
           r = rowwise(u_distance_correlation_sqr, a1, a2)           
           for j in range(1, matrix_size - i):
               pdist_opt[j - 1, j + i ] = r[j - 1]
               pdist_opt[j + i , j - 1] = r[j - 1]
               pass
           del a1, a2
           pass
       pass
   print(pdist_opt)
if __name__== '__main__' :
    main()

If you see the outcome I print both the result of scipy pdist and the fast one utilizing rowwise function

@emoen
Copy link

emoen commented Dec 17, 2024

Comment on performance with rowwise: ive benchmarked rowwise on a dataset of ca 23000 datasets and feature creation with dcor.rowwise(dcor.distance_covariance, a, b) using dcor.rowwise. Using distance_covariance took 8 minutes 41 seconds, and using rowwise took 6 minutes 54 seconds. So 20% improvement.

@vnmabus
Copy link
Owner Author

vnmabus commented Dec 17, 2024

It does not seem like such a big improvement (although everything helps, of course). To be honest, I am not very happy with rowwise (it is clumsy and feels like a hack), and I would rather remove it in the future in favor of vectorizing directly distance_correlation and related functions, like it is done for generalized ufuncs in NumPy or for correlation functions in SciPy.

@Palash123-4
Copy link

Yes, it seems rowwise is indeed helpful only when you have larger number of cores. Vectorized directly will also be a good replacement if it fasts the computation in compare to rowwise. Thanks for the feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants