Add minhash deduplicator based on RAY. #502

Open
wants to merge 23 commits into
base: main
from

chenyushuo
Collaborator

@chenyushuo chenyushuo commented Nov 28, 2024

Unlike #489, this PR merges equivalence classes with a multi-process union-find (disjoint-set) structure implemented on Ray actors.
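For readers unfamiliar with the technique, here is a minimal single-process sketch of how union-find merges near-duplicate pairs into equivalence classes. It is not the code in this PR: the names `UnionFind` and `dedup_keep_ids` are hypothetical, there is no Ray or multi-process logic, and keeping the smallest id of each class is an assumption made for illustration.

```python
class UnionFind:
    """Disjoint-set over hashable ids, with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Unseen ids start as their own root.
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Assumption: the smaller id becomes the class representative.
            lo, hi = min(ra, rb), max(ra, rb)
            self.parent[hi] = lo


def dedup_keep_ids(num_docs, duplicate_pairs):
    """Return the ids kept after dropping all but one doc per class."""
    uf = UnionFind()
    for a, b in duplicate_pairs:
        uf.union(a, b)
    return [i for i in range(num_docs) if uf.find(i) == i]
```

With six documents and candidate pairs `(0, 1)`, `(1, 2)`, `(4, 5)`, documents 0, 1 and 2 fall into one class and 4 and 5 into another, so only one id per class survives.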

@yxdyc yxdyc requested review from yxdyc, HYLcool and pan-x-c December 11, 2024 08:04
@yxdyc yxdyc added dj:op issues/PRs about some specific OPs dj:dist issues/PRs about distributed data processing labels Dec 11, 2024
@yxdyc yxdyc added the dj:efficiency regarding to efficiency issues and enhancements label Dec 20, 2024
@chenyushuo chenyushuo changed the title [WIP] Add minhash deduplicator based on RAY. Add minhash deduplicator based on RAY. Dec 20, 2024
Collaborator

@pan-x-c pan-x-c left a comment

Please see the inline comments

environments/dist_requires.txt (outdated, resolved)
logger.info(f'union_find_parallel_num = {union_find_parallel_num}')
self.union_find_parallel_num = union_find_parallel_num
self.union_threshold = union_threshold
self.remote_edge_buffers = [
Collaborator

As union_find_parallel_num increases, the initialization cost of BTSUnionFind grows quadratically.
Passing remote_edge_buffers as a remote reference may be better.
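To illustrate the concern, here is a rough cost model in plain Python. It assumes that each of the N union-find instances is constructed with its own serialized copy of an N-element handle list, which is one plausible reading of the snippet above and not confirmed by it. The handles are stand-in strings, `pickle` stands in for Ray's serialization, and both function names are hypothetical.

```python
import pickle


def driver_bytes_inline(handles):
    """Bytes serialized when every instance gets its own copy of the list."""
    # One constructor call per instance, each carrying all N handles.
    return sum(len(pickle.dumps(handles)) for _ in handles)


def driver_bytes_shared(handles):
    """Bytes serialized when the list is stored once and passed by reference."""
    # The list is serialized a single time into a stand-in object store.
    store = {0: pickle.dumps(handles)}
    # Each instance then receives only a small key (modeled as an int).
    return len(store[0]) + sum(len(pickle.dumps(0)) for _ in handles)


def make_handles(n):
    # Stand-in for a list of n actor handles.
    return [f"actor-{i}" for i in range(n)]
```

Under this model, doubling N roughly quadruples the inline cost but only about doubles the shared-reference cost, which is the gap the suggestion targets.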

3 participants