[FEA] cuGraph GNN NCCL-only Setup and Distributed Sampling #4278

alexbarghi-nv · 2024-03-25T22:25:46Z

Adds the ability to run pylibcugraph without UCX/Dask within PyTorch DDP.
Adds a new distributed sampler that uses the new NCCL+DDP path to perform bulk sampling.

Closes #4200
Closes #4201
Closes #4246
Closes #3851

ChuckHastings

Seems alright to me. @seunghwak is a bit more familiar with some of the sampling output functions.

seunghwak

Looks good to me in general, I have some minor suggestions and questions.

seunghwak · 2024-04-03T16:25:26Z

python/cugraph-pyg/cugraph_pyg/examples/cugraph_dist_sampling_mg.py

+    src = cudf.Series(np.array_split(edgelist[0], world_size)[rank])
+    dst = cudf.Series(np.array_split(edgelist[1], world_size)[rank])
+
+    seeds = cudf.Series(np.arange(rank * 50, (rank + 1) * 50))


Just minor nitpicking suggestions.

What is 50? # seeds per rank? If this is an example,

    num_seeds_per_rank = 50
    seeds = cudf.Series(np.arange(rank * num_seeds_per_rank, (rank + 1) * num_seeds_per_rank))

will be more readable.

And just for completeness: are we assuming that # ranks * # seeds per rank > # vertices in the input graph? If this code assumes anything about the input edgelist, we could state that in a comment or add a check.

I think you mean < # vertices? But this is really just meant to be a toy example. It's using a dataset that has far more than that number of vertices.

Also I changed the variable to seeds_per_rank like you suggested to make it clearer what I'm doing.
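The assumption discussed in this thread (total seeds must not exceed the number of vertices) could be made explicit with a small check. The following is only a sketch: `seeds_for_rank` and `num_vertices` are hypothetical names, and plain Python ranges stand in for the `cudf.Series` used in the example script.

```python
def seeds_for_rank(rank: int, seeds_per_rank: int, num_vertices: int, world_size: int) -> range:
    """Return the contiguous block of seed vertex IDs assigned to `rank`.

    Hypothetical helper: raises if the combined seed ranges of all ranks
    would run past the last vertex ID.
    """
    if world_size * seeds_per_rank > num_vertices:
        raise ValueError(
            f"{world_size} ranks * {seeds_per_rank} seeds per rank "
            f"exceeds {num_vertices} vertices"
        )
    return range(rank * seeds_per_rank, (rank + 1) * seeds_per_rank)


world_size = 4
seeds_per_rank = 50
num_vertices = 1000  # assumed size for illustration, not the real dataset size

# Each rank gets a disjoint block: rank 0 -> 0..49, rank 1 -> 50..99, ...
all_seeds = [
    list(seeds_for_rank(r, seeds_per_rank, num_vertices, world_size))
    for r in range(world_size)
]
```

For a toy example run on a known-good dataset the check is optional, as noted above; it mainly documents the assumption.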

seunghwak · 2024-04-03T16:26:41Z

python/cugraph-pyg/cugraph_pyg/examples/cugraph_dist_sampling_sg.py

+    src = cudf.Series(edgelist[0])
+    dst = cudf.Series(edgelist[1])
+
+    seeds = cudf.Series(np.arange(0, 50))


Same here. What happens if edgelist is empty?

This is just a toy example; it's not meant to be robust, only to accept the known-good input we give it (in this case, the ogbn-products dataset).

seunghwak · 2024-04-03T16:29:07Z

python/cugraph/cugraph/gnn/comms/cugraph_nccl_comms.py

+    rank, world_size, nccl_comms, n_streams_per_handle=0, verbose=False
+):
+    handle = Handle(n_streams=n_streams_per_handle)
+    inject_comms_on_handle_coll_only(handle, nccl_comms, world_size, rank, verbose)


So, we don't need p2p?

I don't think so. As far as I know, we can run all the GNN algorithms without UCX p2p comms.

I thought we initialized p2p comms in our benchmarks.

https://github.com/rapidsai/cugraph/blob/branch-24.04/benchmarks/cugraph/standalone/bulk_sampling/cugraph_bulk_sampling.py#L838

> without UCX p2p comms.

Are we configuring UCX here? I thought this was for NCCL

We're only using NCCL here, no UCX. I think that's sufficient for what we're doing.

At least that's what @ChuckHastings told me.

OK, yeah, so p2p here means UCX p2p. NCCL has P2P as well. I think the naming here is confusing. NCCL-only might be more appropriate.

seunghwak · 2024-04-03T16:37:22Z

python/cugraph/cugraph/gnn/comms/cugraph_nccl_comms.py

+    prows = int(math.sqrt(ngpus))
+    while ngpus % prows != 0:
+        prows = prows - 1
+    return prows, int(ngpus / prows)


Just FYI, maybe in the future, we may use a common C++ utility function to compute a desirable 2D partition (with a user option to override the default). Setting prows as close as possible to math.sqrt(ngpus) is just one possibility.

I copied this from the existing dask config. I would be happy to replace this with a better function if that becomes available. We can make an issue for it.
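As a standalone illustration of the heuristic in the diff above, here is a minimal sketch that picks the largest divisor of `ngpus` not exceeding its square root as the row count. The function name `get_2d_partition` is an assumption for illustration, and `math.isqrt` is swapped in for `int(math.sqrt(...))` to stay in integer arithmetic.

```python
import math
from typing import Tuple


def get_2d_partition(ngpus: int) -> Tuple[int, int]:
    """Return (prows, pcols) with prows * pcols == ngpus and prows <= pcols.

    prows is the largest divisor of ngpus that does not exceed sqrt(ngpus),
    which keeps the grid as close to square as the divisors allow.
    """
    if ngpus < 1:
        raise ValueError("ngpus must be a positive integer")
    prows = math.isqrt(ngpus)
    # Walk down from floor(sqrt(ngpus)) until we hit a divisor; 1 always divides.
    while ngpus % prows != 0:
        prows -= 1
    return prows, ngpus // prows


grid_8 = get_2d_partition(8)    # (2, 4)
grid_7 = get_2d_partition(7)    # (1, 7): a prime count degenerates to one row
grid_16 = get_2d_partition(16)  # (4, 4)
```

A prime GPU count always yields a 1 x N grid, which is one reason a user-supplied override, as suggested above, could be useful.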

seunghwak · 2024-04-03T16:39:15Z

python/cugraph/cugraph/gnn/data_loading/dist_sampler.py

+            edge_id_array_p = (
+                minibatch_dict["edge_id"][start_ix:end_ix]
+                if has_edge_ids
+                else cupy.array([], dtype="int64")
+            )
+            edge_type_array_p = (
+                minibatch_dict["edge_type"][start_ix:end_ix]
+                if has_edge_types
+                else cupy.array([], dtype="int32")
+            )
+            weight_array_p = (
+                minibatch_dict["weight"][start_ix:end_ix]
+                if has_weights
+                else cupy.array([], dtype="float32")
+            )


So, are we assuming that edge IDs are always int64, edge types are always int32 (this may make sense), and edge weights are always float32?

I think C++ only supports int32 edge types. For edge IDs, the frameworks always use int64. In any case, this does allow int32 edge IDs; we just have to set a dtype for the default empty array, and I went with int64.

Same for edge weights, I'm not technically restricting them to float32 here.
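The point made in this reply, that the hard-coded dtype only applies to the empty placeholder, can be shown with a small sketch. NumPy is used here as a stand-in for CuPy (an assumption made so the snippet runs without a GPU), and `slice_or_empty` is a hypothetical helper that checks key presence instead of the `has_*` flags in the diff.

```python
import numpy as np


def slice_or_empty(minibatch_dict, key, start_ix, end_ix, default_dtype):
    """Slice minibatch_dict[key] if present, else return an empty array.

    The slice keeps whatever dtype the input array already has;
    default_dtype is only used for the empty placeholder.
    """
    if key in minibatch_dict:
        return minibatch_dict[key][start_ix:end_ix]
    return np.array([], dtype=default_dtype)


# Edge IDs supplied as int32; no weights supplied at all.
minibatch_dict = {"edge_id": np.arange(10, dtype="int32")}

edge_ids = slice_or_empty(minibatch_dict, "edge_id", 2, 5, "int64")
weights = slice_or_empty(minibatch_dict, "weight", 2, 5, "float32")

print(edge_ids.dtype)  # int32: the input dtype is preserved
print(weights.dtype)   # float32: only the empty default uses the fallback
```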

seunghwak · 2024-04-03T16:40:59Z

python/cugraph/cugraph/gnn/data_loading/dist_sampler.py

+        raise NotImplementedError("Must be implemented by subclass")
+
+    def sample_from_nodes(
+        self, nodes: TensorType, *, batch_size: int = 16, random_state: int = 62


Is random state a seed for random number generation? If yes, make sure that you are providing different seeds to different ranks. Otherwise, with a deterministic random number generator, you are generating the same random number sequence on every GPU.

In C++, we take base_seed and use base_seed + rank in random number generator initialization.

We're doing that too. It's taken care of here:

cugraph/python/cugraph/cugraph/gnn/data_loading/dist_sampler.py

Line 319 in e7af045

random_state=random_state + rank,
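To illustrate why the per-rank offset matters, here is a minimal sketch using Python's standard `random` module as a stand-in for the actual GPU random number generator (an assumption; cuGraph does not use this module). `draws_for_rank` is a hypothetical helper.

```python
import random


def draws_for_rank(base_seed: int, rank: int, n: int = 5):
    """Draw n pseudo-random integers using the seed base_seed + rank."""
    rng = random.Random(base_seed + rank)
    return [rng.randrange(1_000_000) for _ in range(n)]


base_seed = 62
world_size = 4

# Without the offset, every rank seeds identically and draws the same sequence.
shared = [draws_for_rank(base_seed, 0) for _ in range(world_size)]

# With the offset, each rank gets its own reproducible sequence.
offset = [draws_for_rank(base_seed, rank) for rank in range(world_size)]
```

Each rank's sequence stays reproducible for a given `base_seed`, so runs remain deterministic while ranks no longer sample identically.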

rlratzel

LGTM, thanks for the new tests and examples.

python/cugraph/cugraph/gnn/data_loading/dist_sampler.py

Co-authored-by: Rick Ratzel <[email protected]>

rlratzel

Looks like CI got further, but more failures that might mean we have to skip tests?

python/cugraph/cugraph/gnn/data_loading/dist_sampler.py

alexbarghi-nv · 2024-04-15T17:52:29Z

> Looks like CI got further, but more failures that might mean we have to skip tests?

Yeah, we're skipping for now but I think long-term we may migrate some or all of this code to WholeGraph. But that's a discussion for another time.

alexbarghi-nv · 2024-04-15T22:27:01Z

/merge

alexbarghi-nv added 3 commits March 25, 2024 09:57

dist sampler

46a8b0b

Merge branch 'branch-24.06' into dist-sampler

2d18d56

working prototype of nccl cugraph

39d139d

github-actions bot added python conda labels Mar 25, 2024

revert debug changes

ca732a9

alexbarghi-nv changed the base branch from branch-24.04 to branch-24.06 March 25, 2024 22:26

github-actions bot removed the conda label Mar 25, 2024

alexbarghi-nv self-assigned this Mar 25, 2024

alexbarghi-nv added this to the 24.06 milestone Mar 25, 2024

alexbarghi-nv added feature request New feature or request non-breaking Non-breaking change labels Mar 25, 2024

alexbarghi-nv added 12 commits March 26, 2024 11:26

clean up examples, add sg

2669596

wrap up sampler calls

8e17ee7

more cleanup

4e36719

writing

2fe5084

pull in change

50c5a80

dist sampler io

9e393a0

dist sampling io

b35ad1f

cleanup

9bc944b

style

5e9135b

sg example

0593278

cleanup:

b44ea3b

cleanup imports

e7af045

alexbarghi-nv requested review from VibhuJawa, BradReesWork, ChuckHastings and seunghwak April 2, 2024 22:13

ChuckHastings reviewed Apr 3, 2024

seunghwak reviewed Apr 3, 2024

alexbarghi-nv added 4 commits April 12, 2024 11:11

performance improvements, debug, cleanup

6bf4f03

fix bugs, cleanup

eb94b72

Merge branch 'branch-24.06' of https://github.com/rapidsai/cugraph in…

d0d3d4d

…to dist-sampler

add yes prompt

6814a65

github-actions bot added the ci label Apr 12, 2024

alexbarghi-nv added 2 commits April 12, 2024 14:25

Merge branch 'branch-24.06' of https://github.com/rapidsai/cugraph in…

92ca361

…to dist-sampler

style

f255b61

alexbarghi-nv requested a review from jnke2016 April 12, 2024 21:29

alexbarghi-nv marked this pull request as ready for review April 12, 2024 21:30

alexbarghi-nv requested review from a team as code owners April 12, 2024 21:30

rlratzel approved these changes Apr 13, 2024

rlratzel requested changes Apr 13, 2024

forwardref

9f985b5

Co-authored-by: Rick Ratzel <[email protected]>

rlratzel requested changes Apr 15, 2024

alexbarghi-nv added 3 commits April 15, 2024 07:41

add skip if torch not installed

414e30d

Merge branch 'dist-sampler' of https://github.com/alexbarghi-nv/cugraph…

da3b0fb

… into dist-sampler

style

d747dff

ChuckHastings approved these changes Apr 15, 2024

alexbarghi-nv and others added 4 commits April 15, 2024 10:41

switch to yes || true

7f5abff

remove parquet files

71ea76e

yes || true in other script

dd9ed2d

Merge branch 'branch-24.06' into dist-sampler

1163a81

rlratzel approved these changes Apr 15, 2024

AyodeAwe approved these changes Apr 15, 2024

jnke2016 approved these changes Apr 15, 2024

rapids-bot bot merged commit 5c7cb2b into rapidsai:branch-24.06 Apr 15, 2024
131 checks passed

alexbarghi-nv deleted the dist-sampler branch April 15, 2024 22:27
