Currently, we use Dask to manage MNMG processing throughout RAPIDS/cuGraph, but this causes some issues when integrating with PyTorch DDP workflows.
The biggest issue is that RMM pools can't be shared across processes, which means that when pools are enabled, each Dask worker ends up with a pool separate from that of the corresponding DDP worker on the same GPU. This wastes a significant amount of memory, so in most cases we don't use pools.
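For context, a minimal sketch of what single-pool-per-GPU setup could look like if sampling and training lived in the same DDP worker process (pool size and the `LOCAL_RANK` convention are assumptions, not an existing cuGraph API):

```python
import os

import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator

# Assumed to be launched one process per GPU (e.g. via torchrun, which sets LOCAL_RANK).
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# One pool per GPU, owned by the single worker process on that device.
rmm.reinitialize(
    devices=[local_rank],
    pool_allocator=True,
    initial_pool_size=2**34,  # example size; tune per workload
)

# Route PyTorch allocations through the same RMM pool, so cuGraph/RMM and
# PyTorch are not each reserving their own slice of the GPU.
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
torch.cuda.set_device(local_rank)
```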
Another issue is that the semantics of Dask make managing loaders awkward. Originally this wasn't supposed to be the case, but due to #4089, we can't have multiple loaders calling uniform_neighbor_sample. Furthermore, even if that did work, it would be a poor use of GPU resources when we should be combining all of those loader calls.
Finally, managing Dask and DDP in the same workflow is complicated. Other teams have complained that this makes the examples difficult to understand for new users, or even unreadable to someone unfamiliar with Dask. For MNMG workflows, we have to start Dask in a separate process, which clashes with the PyTorch way of doing things.
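By "the PyTorch way" I mean the usual launcher-driven pattern, where there is no separate scheduler/cluster process to stand up, roughly:

```python
import os

import torch
import torch.distributed as dist


def main():
    # Launched with `torchrun --nproc_per_node=<num_gpus> train.py`; torchrun
    # sets RANK / LOCAL_RANK / WORLD_SIZE for each worker process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # ... graph construction, loaders, and training all happen here,
    # in the single process that owns this GPU ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```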
There are also a number of new planned features (e.g. overlapping sampling and loading) that would be simpler to implement without Dask.
This path would not be unique in RAPIDS; WholeGraph already does something similar, relying on DDP as the process manager, and using its own RAFT/NCCL comms. cuGraph should be able to leverage RAFT and PyLibcuGraph to do the same.
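A rough sketch of how the bootstrap could look, reusing the DDP ranks that already exist. The `create_unique_id` / `init_comms` helpers below are hypothetical stand-ins for whatever RAFT/pylibcugraph-backed entry points cuGraph would expose, not an existing API:

```python
import torch.distributed as dist


def bootstrap_cugraph_comms():
    # DDP owns the processes; cuGraph joins the same ranks over NCCL.
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Rank 0 generates an NCCL-style unique id and shares it with all ranks
    # through the already-initialized torch.distributed process group.
    uid = create_unique_id() if rank == 0 else None  # hypothetical helper
    obj = [uid]
    dist.broadcast_object_list(obj, src=0)
    uid = obj[0]

    # Every DDP worker joins the same RAFT/NCCL communicator using its DDP rank.
    init_comms(rank=rank, world_size=world_size, unique_id=uid)  # hypothetical helper
```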