Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] cuGraph GNN NCCL-only Setup and Distributed Sampling #4278

Merged
merged 38 commits into from
Apr 15, 2024

Conversation

alexbarghi-nv
Copy link
Member

@alexbarghi-nv alexbarghi-nv commented Mar 25, 2024

  • Adds the ability to run pylibcugraph without UCX/dask within PyTorch DDP.
  • Adds the new distributed sampler which uses the new nccl+ddp path to perform bulk sampling.

Closes #4200
Closes #4201
Closes #4246
Closes #3851

@alexbarghi-nv alexbarghi-nv changed the base branch from branch-24.04 to branch-24.06 March 25, 2024 22:26
@github-actions github-actions bot removed the conda label Mar 25, 2024
@alexbarghi-nv alexbarghi-nv self-assigned this Mar 25, 2024
@alexbarghi-nv alexbarghi-nv added this to the 24.06 milestone Mar 25, 2024
@alexbarghi-nv alexbarghi-nv added feature request New feature or request non-breaking Non-breaking change labels Mar 25, 2024
Copy link
Collaborator

@ChuckHastings ChuckHastings left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems alright to me. @seunghwak is a bit more familiar with some of the sampling output functions.

Copy link
Contributor

@seunghwak seunghwak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me in general, I have some minor suggestions and questions.

src = cudf.Series(np.array_split(edgelist[0], world_size)[rank])
dst = cudf.Series(np.array_split(edgelist[1], world_size)[rank])

seeds = cudf.Series(np.arange(rank * 50, (rank + 1) * 50))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just minor nitpicking suggestions.

What is 50? # seeds per rank? If this is an example,

num_seeds_per_rank = 50
seeds = cudf.Series(np.arange(rank * num_seeds_per_rank, (rank + 1) * num_seeds_per_rank))

will be more readable.

And just for completeness. Are we assuming that # ranks * # seeds per rank > # vertices in the input graph? If this code assumes anything about the input edgelist, we may specify that in comments or we may give a check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean < # vertices? But this is really just meant to be a toy example. It's using a dataset that has far more than that number of vertices.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I changed the variable to seeds_per_rank like you suggested to make it clearer what I'm doing.

src = cudf.Series(edgelist[0])
dst = cudf.Series(edgelist[1])

seeds = cudf.Series(np.arange(0, 50))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. What happens if edgelist is empty?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a toy example, it's not meant to be robust, just to accept the known good input we give it (in this case the ogbn-products dataset).

rank, world_size, nccl_comms, n_streams_per_handle=0, verbose=False
):
handle = Handle(n_streams=n_streams_per_handle)
inject_comms_on_handle_coll_only(handle, nccl_comms, world_size, rank, verbose)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we don't need p2p?

Copy link
Member Author

@alexbarghi-nv alexbarghi-nv Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. As far as I can know we can run all the GNN algorithms without UCX p2p comms.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we initialized p2p comms in our benchmarks .

https://github.com/rapidsai/cugraph/blob/branch-24.04/benchmarks/cugraph/standalone/bulk_sampling/cugraph_bulk_sampling.py#L838

without UCX p2p comms.

Are we configuring UCX here? I thought this was for NCCL

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're only using nccl here. No UCX. I think that's sufficient for what we're doing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least that's what @ChuckHastings told me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, yeah, so p2p here means UCX p2p. NCCL has P2P as well. I think the naming here is confusing. NCCL-only might be more appropriate.

prows = int(math.sqrt(ngpus))
while ngpus % prows != 0:
prows = prows - 1
return prows, int(ngpus / prows)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI, maybe in the future, we may use a common C++ utility function to compute a desirable 2D partition (with an user option to override the default). Setting prows as close as possible to math.sqrt(ngpus) is just one possibility.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this from the existing dask config. I would be happy to replace this with a better function if that becomes available. We can make an issue for it.

Comment on lines +93 to +107
edge_id_array_p = (
minibatch_dict["edge_id"][start_ix:end_ix]
if has_edge_ids
else cupy.array([], dtype="int64")
)
edge_type_array_p = (
minibatch_dict["edge_type"][start_ix:end_ix]
if has_edge_types
else cupy.array([], dtype="int32")
)
weight_array_p = (
minibatch_dict["weight"][start_ix:end_ix]
if has_weights
else cupy.array([], dtype="float32")
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, are we assuming that edge IDs are always int64, edge types are always int32 (this may make sense), and edge weights are always float32?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think C++ only supports int32 edge types. For edge id, the frameworks always use int64. In any case, this does actually allow for int32 edge ids, we just have to set a default empty dtype, and I just went with int64.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for edge weights, I'm not technically restricting them to float32 here.

raise NotImplementedError("Must be implemented by subclass")

def sample_from_nodes(
self, nodes: TensorType, *, batch_size: int = 16, random_state: int = 62
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is random state a seed for random number generation? If yes, make sure that you are providing different seeds for different ranks in random number generation. If not and for deterministic random number generators, you are generating same random number sequences in every GPU.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In C++, we take base_seed and use base_seed + rank in random number generator initialization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're doing that too. It's taken care of here:

random_state=random_state + rank,

@github-actions github-actions bot added the ci label Apr 12, 2024
@alexbarghi-nv alexbarghi-nv requested a review from jnke2016 April 12, 2024 21:29
@alexbarghi-nv alexbarghi-nv marked this pull request as ready for review April 12, 2024 21:30
@alexbarghi-nv alexbarghi-nv requested review from a team as code owners April 12, 2024 21:30
Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the new tests and examples.

Co-authored-by: Rick Ratzel <[email protected]>
Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like CI got further, but more failures that might mean we have to skip tests?

@alexbarghi-nv
Copy link
Member Author

Looks like CI got further, but more failures that might mean we have to skip tests?

Yeah, we're skipping for now but I think long-term we may migrate some or all of this code to WholeGraph. But that's a discussion for another time.

@alexbarghi-nv
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 5c7cb2b into rapidsai:branch-24.06 Apr 15, 2024
131 checks passed
@alexbarghi-nv alexbarghi-nv deleted the dist-sampler branch April 15, 2024 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci feature request New feature or request non-breaking Non-breaking change python
Projects
None yet
7 participants