
[FEA] Research Removing Dask Reliance in GNN Packages #4200

Closed · alexbarghi-nv opened this issue on Feb 28, 2024 · 0 comments · Fixed by #4278

@alexbarghi-nv (Member) commented:

Currently, we use Dask to manage multi-node, multi-GPU (MNMG) processing throughout RAPIDS/cuGraph, but this causes some issues when integrating with PyTorch DDP workflows.

The biggest issue is that RMM pools can't be shared across processes: when pools are enabled, each Dask worker ends up with a separate pool from the corresponding DDP worker on the same GPU. This wastes a significant amount of memory, so in most cases we don't use pools at all.
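For context, here is a minimal sketch of how a per-process RMM pool is typically set up (the pool size and device id are illustrative). Because the pool belongs to the process that creates it, a Dask worker and a DDP worker pinned to the same GPU would each reserve their own slice of device memory:

```python
# Minimal sketch: an RMM pool is private to the process that creates it, so a
# Dask worker and a DDP worker sharing a GPU would each reserve a separate pool.
import rmm


def init_rmm_pool(device_id: int, pool_bytes: int = 16 * 2**30) -> None:
    """Create a per-process RMM memory pool on the given device (size is illustrative)."""
    rmm.reinitialize(
        pool_allocator=True,
        initial_pool_size=pool_bytes,  # bytes; two processes doing this on one GPU double the reservation
        devices=device_id,
    )
```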

Another issue is that Dask's semantics make managing loaders awkward. Originally this wasn't supposed to be the case, but due to #4089 we can't have multiple loaders calling uniform_neighbor_sample. Furthermore, even if that did work, issuing a separate call per loader is a poor use of GPU resources; those calls should be combined into a single sampling call.
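A rough sketch of the batching we would prefer is below. The uniform_neighbor_sample signature varies between cuGraph releases, so treat the parameter names as assumptions; the point is one combined call instead of one call per loader:

```python
# Sketch only: combine per-loader seed batches into one sampling call.
# The uniform_neighbor_sample signature shown here is an assumption; check the
# cuGraph release you are using.
import cudf
import cugraph


def combined_sample(G: cugraph.Graph, seed_batches: list, fanout: list):
    # Concatenate the seed vertices requested by every loader.
    seeds = cudf.concat(seed_batches, ignore_index=True)
    # One sampling call on behalf of every loader instead of N separate calls.
    return cugraph.uniform_neighbor_sample(G, start_list=seeds, fanout_vals=fanout)
```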

Finally, managing Dask and DDP in the same workflow is complicated. Other teams have reported that this makes the examples difficult for new users to understand, or even unreadable to someone unfamiliar with Dask. For MNMG workflows, we have to start Dask in a separate process, which clashes with the PyTorch way of doing things.
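For comparison, the PyTorch-native pattern the examples would ideally follow is just torchrun (or torch.multiprocessing.spawn) launching one process per GPU, with each rank joining an NCCL group directly and no separate Dask scheduler or worker processes to stand up:

```python
# Sketch of a plain DDP entry point launched with:
#   torchrun --nproc-per-node=<gpus> train.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    # ... build the graph/model, run sampling and training inside this same process ...
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```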

There are also a number of planned features (e.g., overlapping sampling and loading) that would be simpler to implement without Dask.

This path would not be unique in RAPIDS; WholeGraph already does something similar, relying on DDP as the process manager, and using its own RAFT/NCCL comms. cuGraph should be able to leverage RAFT and PyLibcuGraph to do the same.

@alexbarghi-nv alexbarghi-nv added the feature request New feature or request label Feb 28, 2024
@alexbarghi-nv alexbarghi-nv added this to the 24.06 milestone Feb 28, 2024
@alexbarghi-nv alexbarghi-nv self-assigned this Feb 28, 2024
@alexbarghi-nv alexbarghi-nv changed the title Research Removing Dask Reliance in GNN Packages [FEA] Research Removing Dask Reliance in GNN Packages Feb 28, 2024
rapids-bot bot pushed a commit that referenced this issue Apr 15, 2024
* Adds the ability to run `pylibcugraph` without UCX/dask within PyTorch DDP.
* Adds the new distributed sampler, which uses the new NCCL+DDP path to perform bulk sampling.

Closes #4200 
Closes #4201 
Closes #4246 
Closes #3851

Authors:
  - Alex Barghi (https://github.com/alexbarghi-nv)

Approvers:
  - Seunghwa Kang (https://github.com/seunghwak)
  - Rick Ratzel (https://github.com/rlratzel)
  - Chuck Hastings (https://github.com/ChuckHastings)
  - Jake Awe (https://github.com/AyodeAwe)
  - Joseph Nke (https://github.com/jnke2016)

URL: #4278
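To illustrate the Dask/UCX-free path this PR adds, the sketch below shows how a DDP rank might bootstrap cuGraph's NCCL communicator. The cugraph.gnn function names and their signatures are my best recollection of the API introduced in #4278; treat them as assumptions and verify against the current cugraph.gnn documentation before use:

```python
# Illustrative sketch: each DDP rank joins a cuGraph NCCL communicator directly,
# with torch.distributed used only to broadcast the NCCL unique id.
# The cugraph.gnn names below are assumed from #4278 -- verify before relying on them.
import os

import torch.distributed as dist
from cugraph.gnn import (  # assumed module layout
    cugraph_comms_create_unique_id,
    cugraph_comms_init,
    cugraph_comms_shutdown,
)


def init_cugraph_comms() -> None:
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])
    # Rank 0 creates the NCCL unique id and broadcasts it to the other ranks.
    uid = [cugraph_comms_create_unique_id() if rank == 0 else None]
    dist.broadcast_object_list(uid, src=0)
    # Assumed signature: rank, world size, unique id, local device.
    cugraph_comms_init(rank=rank, world_size=world_size, uid=uid[0], device=local_rank)


def shutdown_cugraph_comms() -> None:
    cugraph_comms_shutdown()
```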