
Question: is there a fast method for dcor.independence.distance_covariance_test #30

Open
mycarta opened this issue May 30, 2021 · 2 comments

mycarta commented May 30, 2021

With reference to the example in this notebook, this weekend I compared the performance of the MERGESORT method vs. the NAIVE method on a toy dataset of 8 columns × 21 rows:

%%timeit
dc = np.apply_along_axis(
    lambda col1: np.apply_along_axis(
        lambda col2: dcor.distance_correlation(col1, col2, method='NAIVE'),
        axis=0, arr=data),
    axis=0, arr=data)
>>> 24.3 ms ± 334 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

vs:

%%timeit
dc = np.apply_along_axis(
    lambda col1: np.apply_along_axis(
        lambda col2: dcor.distance_correlation(col1, col2, method='MERGESORT'),
        axis=0, arr=data),
    axis=0, arr=data)
>>> 17.4 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Since I sometimes work with many thousands of rows, and possibly more columns, I wonder if there is a way to similarly improve the speed of the pairwise p-value calculation:

p = np.apply_along_axis(
    lambda col1: np.apply_along_axis(
        lambda col2: dcor.independence.distance_covariance_test(
            col1, col2, exponent=1.0, num_resamples=2000)[0],
        axis=0, arr=data),
    axis=0, arr=data)
>>> 4.38 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vnmabus (Owner) commented May 31, 2021

No, not as of today. The code would need a separate branch to handle that case, but it should be relatively easy to implement (adding a new function in _hypothesis that performs a permutation test using the original array instead of the distance matrix, and using it when the method is not NAIVE). If you want to try a PR, I could review it.
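
A minimal sketch of what that new function could look like (hypothetical code, not what is in _hypothesis today; it assumes dcor.distance_covariance takes the same method argument as dcor.distance_correlation, and the fast methods need univariate data and exponent 1):

import numpy as np
import dcor

# Hypothetical helper, not part of dcor: a permutation test that
# shuffles the original observations, so the statistic can be
# recomputed with any fast method instead of reusing distance matrices.
def permutation_dcov_test(x, y, num_resamples=2000, method='MERGESORT',
                          random_state=None):
    rng = np.random.default_rng(random_state)
    observed = dcor.distance_covariance(x, y, method=method)
    # Count permuted statistics at least as large as the observed one.
    greater = sum(
        dcor.distance_covariance(x, rng.permutation(y), method=method)
        >= observed
        for _ in range(num_resamples)
    )
    # The +1 correction keeps the estimated p-value away from exactly zero.
    return (greater + 1) / (num_resamples + 1)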

BTW, if you have additional CPUs you can use the 'AVL' method in distance_correlation and the rowwise function for an extra boost.
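
For example, something along these lines should work (just a sketch, assuming data is a NumPy array and that rowwise forwards extra keyword arguments such as method to the statistic):

import numpy as np
import dcor

# Build all column pairs so rowwise can evaluate them in one batch:
# rowwise pairs row i of the first array with row i of the second.
cols = np.asarray(data).T                   # one data column per row
n = cols.shape[0]
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing='ij')
dc = dcor.rowwise(
    dcor.distance_correlation,
    cols[i.ravel()],
    cols[j.ravel()],
    method='AVL',  # assumption: extra kwargs are forwarded to the statistic
).reshape(n, n)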

mycarta (Author) commented May 31, 2021

I am at capacity until the fall. After the summer, if I have more time as I hope, I can give it a try.

For the purposes of my current projects, I am going to decimate my array quite heavily for the time being:

decimated_df = data.sample(frac=0.05, random_state=1)  # sample already returns a new DataFrame, so copy() is unnecessary
