Define and Implement C++ API for negative sampling #4523
Conversation
👍
raft::device_span<value_t> output,
raft::device_span<bias_t const> biases);
Something very minor, but should we place input vectors (e.g. biases) before output vectors (e.g. output)? AFAIK, that is a convention in the C++ API.
I may delete this function. I implemented a different way. After I've refactored the implementation I'll revisit whether we need this function or not. If we keep it I'll make that change.
* Sampling occurs by creating a list of source vertex ids from biased sampling
* of the source vertex space, and destination vertex ids from biased sampling of the
* destination vertex space, and using this as the putative list of edges. We
* then can optionally remove duplicates and remove false negatives to generate
Is false negative the right terminology here? AFAIK, a false negative is something that should be reported but is missing. Here, isn't it the opposite? (edges that shouldn't appear actually appear).
Changed to remove_existing_edges
Did you commit the change? I still see false negatives.
Missed the change in the documentation... I'll search the documentation and push a fix once I resolve the testing issue on one of the build configurations.
* vertices will be selected uniformly
* @param remove_duplicates If true, remove duplicate samples
* @param remove_false_negatives If true, remove false negatives (samples that are actually edges in
* the graph)
Are these false negatives? Should we better say something like @param remove_positive_samples If true, remove positive samples (edges that exist in the input graph)?
False negatives can be misinterpreted as something that should be reported but is missing. I guess here false negatives means something that is reported as a negative sample but should not be (as it is a positive sample, this is really a false positive among the negative samples).
Changed to remove_existing_edges
std::optional<raft::device_span<weight_t const>> src_bias,
std::optional<raft::device_span<weight_t const>> dst_bias,
Sorry for nitpicking, but better be src_biases and dst_biases to be consistent with the rest of the C++ API? (use plural forms for vectors with multiple elements?)
Changed to be plural.
* @param src_bias Optional bias for randomly selecting source vertices. If std::nullopt vertices
* will be selected uniformly
* @param dst_bias Optional bias for randomly selecting destination vertices. If std::nullopt
* vertices will be selected uniformly
What are the ranges for multi-GPU?
Added comment on that.
graph_view.has_edge(handle,
raft::device_span<vertex_t const>{batch_src.data(), batch_src.size()},
raft::device_span<vertex_t const>{batch_dst.data(), batch_dst.size()},
// do_expensive_check);
Disable expensive_check once validated.
auto new_end = thrust::remove_if(handle.get_thrust_policy(),
begin_iter,
begin_iter + batch_src.size(),
[] __device__(auto tuple) { return thrust::get<2>(tuple); });
You can use stencil.
auto edge_first = thrust::make_zip_iterator(batch_src.begin(), batch_dst.begin());
auto new_end = thrust::remove_if(handle.get_thrust_policy(), edge_first, edge_first + batch_src.size(), has_edge_flags.begin(), []__device__(auto flag) { return !flag; });
Change made
thrust::copy(handle.get_thrust_policy(),
thrust::make_zip_iterator(batch_src.begin(), batch_dst.begin()),
thrust::make_zip_iterator(batch_src.end(), batch_dst.end()),
thrust::make_zip_iterator(src.begin(), dst.begin()) + current_end);

auto begin_iter = thrust::make_zip_iterator(src.begin(), dst.begin());
thrust::sort(handle.get_thrust_policy(), begin_iter, begin_iter + src.size());
We can use thrust::merge if remove_duplicates is true. If I am not mistaken, both (src, dst) and (batch_src, batch_dst) are sorted.
If remove_duplicates is false, we don't need to sort, right?
Switched to merge.
if (!remove_duplicates) {
  auto begin_iter = thrust::make_zip_iterator(src.begin(), dst.begin());
  thrust::sort(handle.get_thrust_policy(), begin_iter, begin_iter + src.size());
}
Why should we sort here?
Deleted
handle.get_thrust_policy(),
dst_bias_cache_->view().value_first(),
dst_bias_cache_->view().value_first() + graph_view.local_edge_partition_dst_range_size(),
weight_t{0});
This won't work.
edge_src|dst_property_t stores edge src|dst property values in either a linear array or (key, value) pairs.
This works only if property values are stored in a linear array, but with this approach, memory footprint scales as O(V/sqrt(P)), so this won't scale to large graphs with a relatively low average vertex degree. Using (key, value) pairs ensures that memory footprint stays at min(O(V/sqrt(P)), O(E/P)).
And I am uncomfortable exploiting these implementation details outside the primitives. If we are only visiting the edges that exist in the input graph, we can store in (key, value) pairs, but if we need to cover the entire src/dst range, edge_property_t is not the right data structure.
One robust (but somewhat communication-wasteful) approach is this.
We locally sum src biases and dst biases in each rank, then call host_scalar_allgather.
We first generate edge sources in two steps.
- In each rank, find how many sources to generate in each V/P segment (host_scalar_allgather output of local src bias sum). Then call host_scalar_all_to_all (host_scalar_multicast). Based on this, each rank generates sources for its local vertex partition range. Then, send the source values back.
- Do the same for destinations.
- Shuffle edges.
A key drawback is that we need to send sources & destinations back and shuffle edges as well. More communication than necessary, but no memory footprint issue, and it minimally exploits any cugraph implementation details.
A more efficient approach is to create a P by sqrt(P) (or minor_comm_size to be exact) array (this requires understanding how the adjacency matrix is partitioned). By just collecting local bias sum values, we can compute how many edges should be generated in each box. We can send these numbers to the ranks owning the boxes.
When we generate sources, we can generate them in minor_comm_size iterations; in each iteration we hold source biases only for the V/P vertices. When we generate destination vertices, we first just compute how many to generate per each V/P range in each box, sum it, send to the owning GPUs (or in loops, we can bring V/P dst bias values, based on which incurs less communication), and get the destination vertices back. This will incur less communication but is more involved.
I am inclined to implement the first approach, and if that turns out to be a performance bottleneck, we may consider implementing the second approach, maybe in the primitive space... Let me know what you think.
I'll revisit this.
My first attempt at refactoring did something akin to your first suggestion. I ran into some problems in debugging and then the current approach occurred to me. My second attempt was something akin to your second approach, but then I ran into the issue that the source edge partitions are separate. This was my third attempt and works [as long as the internal implementation is an array and not a k-v implementation... I guess I didn't have a test that reached that point].
I agree that doing this properly and efficiently might need some work in primitive space.
Refactored in latest push to use option 1. Note that's only required for biased... if either src or dst is uniform that can be done directly.
I suspect the original problem I had with that approach was actually the raft bug that I later worked around. Code is much simpler this way, and I think the performance should be fine - at least for now.
std::optional<rmm::device_uvector<weight_t>>>
normalize_biases(raft::handle_t const& handle,
graph_view_t<vertex_t, edge_t, store_transposed, multi_gpu> const& graph_view,
std::optional<raft::device_span<weight_t const>> biases)
Something minor, but should we better call this function only when biases is valid? (instead of taking an optional biases variable?)
I will adjust. This is residual from when we were doing the 2D partitioning... in that case we needed to do some of the setup (what became this function) even if it was uniform biases. That's no longer required... so I will fix.
gpu_biases = cugraph::device_allgatherv(
handle, handle.get_comms(), raft::device_span<weight_t const>{d_sum.data(), d_sum.size()});
I assume we don't need both the above host_scalar_allreduce and device_allgatherv here.
If we just call allgatherv on every GPU's local sum, we can compute aggregate_sum from there. Once we have every GPU's local sum and the aggregate sum, we can compute everything without calling an additional host_scalar_allreduce.
Cleaned up in next push.
thrust::inclusive_scan(
handle.get_thrust_policy(), gpu_biases->begin(), gpu_biases->end(), gpu_biases->begin());

weight_t force_to_one{1.1};
1.1 => 1?
weight_t force_to_one{1.1};
raft::update_device(
gpu_biases->data() + gpu_biases->size() - 1, &force_to_one, 1, handle.get_stream());
gpu_biases->set_element_async(gpu_biases->size() - 1, force_to_one, handle.get_stream());
handle.sync_stream(); <= this is necessary as force_to_one can become out-of-scope before this update finishes.
Or if we really want to ensure that no value is larger than 1, we may call thrust::transform on the entire gpu_biases and set each value to cuda::std::min(org_value, 1.0).
I am not sure that floating point arithmetic guarantees that running inclusive scans on local_sum / aggregate_sum values produces no value exceeding 1 (and if the local_sum on the last GPU is zero, a non-last element may have a value larger than 1 as well).
The problem is the bug in raft's float random number generator which allows for the generation of a random number with value 1.0. When I do thrust::upper_bound below, if the random number is 1.0 then the upper_bound call goes out of bounds on the gpu_counts array. Forcing this to a value larger than 1.0 lets us handle this condition.
But you are correct, what I want to do is set all values at the end of the array (where counts are 0) to this value. I will ponder a bit.
Refactored to support the case where some GPUs might have no edges.
size_t comm_size = handle.get_comms().get_size();
size_t comm_rank = handle.get_comms().get_rank();
size_t => auto const (comm_size & comm_rank are actually int values following MPI convention; I guess we won't need to consider more than 2^31-1 GPUs in our lifetime).
Fixed
auto& major_comm = handle.get_subcomm(cugraph::partition_manager::major_comm_name());
auto const major_comm_size = major_comm.get_size();
auto& minor_comm = handle.get_subcomm(cugraph::partition_manager::minor_comm_name());
auto const minor_comm_size = minor_comm.get_size();
Unused?
Removed
auto all_gpu_counts = cugraph::device_allgatherv(
handle,
handle.get_comms(),
raft::device_span<size_t const>{gpu_counts.data(), gpu_counts.size()});
You need to just collect sample counts for comm_size GPUs (no need to collect for the entire comm_size * comm_size matrix). Use device_multicast_sendrecv to send to comm_size GPUs and receive from comm_size GPUs.
Then, the code below will also become simpler.
Refactored to use shuffle_values, which is essentially this.
I re-reviewed the refactored PR and approve it.
Latest push addresses the latest batch of requests. Please review again.
LGTM besides minor cosmetic issues and the need to test with edge masked graphs (to make sure that every algorithm works properly with edge masked graphs).
raft::handle_t const& handle,
raft::random::RngState& rng_state,
graph_view_t<vertex_t, edge_t, store_transposed, multi_gpu> const& graph_view,
size_t num_samples,
Sort of our convention is to list scalar parameters at the end. We may better move num_samples after dst_biases.
Moved in next push
normalized_biases->begin());

if constexpr (multi_gpu) {
// rmm::device_scalar<weight_t> d_sum((sum / aggregate_sum), handle.get_stream());
Delete commented-out code.
Deleted in next push
bool use_dst_bias{false};
bool remove_duplicates{false};
bool remove_existing_edges{false};
bool exact_number_of_samples{false};
We should better test with edge masked graphs as well.
What would the behavior be in an edge masked graph? That if an edge is masked out it would be allowed to appear in the negative sampled edges even if remove_existing_edges is set to true? I can add a test for that.
Yes. If you call has_edge() on masked out edges, you will get false. So, those edges can be included in the negative sampling output
Done in next push
bool remove_duplicates{false};
bool remove_existing_edges{false};
bool exact_number_of_samples{false};
bool check_correctness{true};
We should better test with edge masked graphs as well.
Done in next push
/merge
Defines and implements the PLC/C/C++ APIs for negative sampling.
Closes #4497