Fix MG similarity issues #4741

Merged
8 commits merged into rapidsai:branch-24.12 on Nov 18, 2024

Conversation

@ChuckHastings ChuckHastings (Collaborator) commented Oct 31, 2024

This PR adds C++ tests for the all-pairs variation of similarity algorithms. Previously the all-pairs variation was only tested in SG mode.

This also addresses an issue where the all-pairs implementation would crash when there was a load imbalance across the GPUs and one of the GPUs ran out of work before the others.

Closes #4704

    @@ -368,187 +368,196 @@ all_pairs_similarity(raft::handle_t const& handle,
        sum_two_hop_degrees,
        MAX_PAIRS_PER_BATCH);

Contributor

In the lines above,

    top_v1.reserve(*topk, handle.get_stream());
    top_v2.reserve(*topk, handle.get_stream());
    top_score.reserve(*topk, handle.get_stream());

Shouldn't reserve here be resize?
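For context, assuming these are rmm::device_uvector buffers, reserve only allocates capacity: size() stays 0 and begin()/end() still span an empty range until resize is called. A minimal, hypothetical demo of the distinction (not the PR code):

    #include <rmm/cuda_stream_view.hpp>
    #include <rmm/device_uvector.hpp>

    #include <cstddef>

    // Hypothetical demo: reserve() grows capacity only, resize() grows the size.
    void reserve_vs_resize_demo(rmm::cuda_stream_view stream, std::size_t topk)
    {
      rmm::device_uvector<float> top_score(0, stream);

      top_score.reserve(topk, stream);  // capacity() >= topk, but size() is still 0
      // Iterating [top_score.begin(), top_score.end()) here visits zero elements.

      top_score.resize(topk, stream);   // now size() == topk and the range is usable
    }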

Contributor

    raft::update_host(&sum_two_hop_degrees,
                      two_hop_degree_offsets.data() + two_hop_degree_offsets.size() - 1,
                      1,
                      handle.get_stream());

We are missing handle.sync_stream() after this to ensure that sum_two_hop_degrees is ready to use in the following compute_offset_aligned_element_chunks.
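The pattern being asked for: raft::update_host enqueues an asynchronous device-to-host copy on the stream, so the host variable must not be read until the stream is synchronized. A small, hypothetical sketch of that pattern (not the PR code; header paths assumed):

    #include <raft/core/handle.hpp>
    #include <raft/util/cudart_utils.hpp>  // raft::update_host (path assumed)

    #include <rmm/device_uvector.hpp>

    #include <cstddef>

    // Read the last element of a device offsets array onto the host.
    std::size_t read_last_offset(raft::handle_t const& handle,
                                 rmm::device_uvector<std::size_t> const& offsets)
    {
      std::size_t last{0};
      raft::update_host(&last, offsets.data() + offsets.size() - 1, 1, handle.get_stream());
      handle.sync_stream();  // without this, `last` may be read before the copy lands
      return last;
    }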

Collaborator Author

I think this is the big thing that I was tripping over. Added the sync (along with fixing the problem you note below) and the hang appears to be resolved.

    raft::device_span<vertex_t const> batch_seeds{tmp_vertices.data(), size_t{0}};

    if (((batch_number + 1) < batch_offsets.size()) &&
        (batch_offsets[batch_number + 1] > batch_offsets[batch_number])) {
Contributor

(batch_number + 1) < batch_offsets.size() should always be true here, right? batch_number < num_batches and batch_offsets.size() is num_batches + 1.

Collaborator Author

This is not the case, and it was in fact the bug that triggered my PR.

The vertices can be specified as a parameter. The batches are constructed by looking at the size of the 2-hop neighborhood of the selected vertices. The test case that prompted this investigation was a situation where the number of batches on rank 0 was smaller than the number of batches on rank 1.

The code above, by computing the MAX of the number of batches across all GPUs, ensures that every GPU uses the same value for num_batches; but that means that if a particular GPU has no batches left, its batch_offsets will not be long enough to do the second half of this computation.

I considered extending batch_offsets and filling it with the last value, but this seemed better since it's the only use of that array.
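For illustration, a hypothetical host-only mock-up of the situation described above (made-up numbers, plain std::vector in place of the real structures): the loop runs to the global maximum batch count on every rank so collective calls stay in lock-step, and the guard turns the missing local batches into empty work instead of an out-of-range access.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main()
    {
      // Pretend this is the rank that built only 2 local batches, while the MAX of
      // the per-GPU batch counts is 5, so the loop still runs 5 times here.
      std::size_t num_batches = 5;
      std::vector<std::size_t> batch_offsets{0, 40, 75};  // size 3, not num_batches + 1

      for (std::size_t batch_number = 0; batch_number < num_batches; ++batch_number) {
        std::size_t batch_size = 0;
        if (((batch_number + 1) < batch_offsets.size()) &&
            (batch_offsets[batch_number + 1] > batch_offsets[batch_number])) {
          batch_size = batch_offsets[batch_number + 1] - batch_offsets[batch_number];
        }
        // Batches 2..4 are empty on this rank, but the rank still reaches the
        // per-batch collective calls, so the other GPUs are not left waiting.
        std::printf("batch %zu: %zu seeds\n", batch_number, batch_size);
      }
      return 0;
    }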

    if (top_score.size() == *topk) {
      raft::update_host(
        &similarity_threshold, top_score.data() + *topk - 1, 1, handle.get_stream());
    if (top_score.size() == *topk) {
Contributor

Print top_score.size(): it is 10 on rank 0 and 0 on rank 1, so only rank 0 participates in the host_scalar_bcast. This is causing the hang you see.

Collaborator Author

I restructured the if statements. The code is structured to keep the top-k results only on rank 0 (which makes much of the computation easier). I moved the host_scalar_bcast call outside of this if statement in the next push. Between that and the sync mentioned above, this got me unblocked. I'll push an update to the PR later today.

Thanks for the diagnosis @seunghwak!
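A sketch of the restructuring described above, under two assumptions: cugraph's host_scalar_bcast helper takes roughly (comms, value, root, stream), and rank 0 is the rank that keeps the full top-k list. The essential point is that every rank reaches the collective, whether or not it holds a threshold.

    #include <cugraph/utilities/host_scalar_comm.hpp>  // host_scalar_bcast (path assumed)

    #include <raft/core/handle.hpp>
    #include <raft/util/cudart_utils.hpp>  // raft::update_host (path assumed)

    #include <rmm/device_uvector.hpp>

    #include <cstddef>
    #include <optional>

    // Hypothetical sketch, not the PR code.
    template <typename weight_t>
    weight_t broadcast_threshold(raft::handle_t const& handle,
                                 rmm::device_uvector<weight_t> const& top_score,
                                 std::optional<std::size_t> topk)
    {
      weight_t similarity_threshold{0};

      if (topk && (top_score.size() == *topk)) {
        // Only the rank holding a full top-k list has a meaningful cutoff to read.
        raft::update_host(
          &similarity_threshold, top_score.data() + *topk - 1, 1, handle.get_stream());
        handle.sync_stream();
      }

      // Unconditional: if any rank skips a collective, the participating ranks wait
      // forever; this is the hang diagnosed above.
      return cugraph::host_scalar_bcast(
        handle.get_comms(), similarity_threshold, int{0} /* root */, handle.get_stream());
    }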

Comment on lines +493 to +500
    thrust::copy(
      handle.get_thrust_policy(), v1.begin(), v1.begin() + top_v1.size(), top_v1.begin());
    thrust::copy(
      handle.get_thrust_policy(), v2.begin(), v2.begin() + top_v1.size(), top_v2.begin());
    thrust::copy(handle.get_thrust_policy(),
                 score.begin(),
                 score.begin() + top_v1.size(),
                 top_score.begin());
Contributor

Make sure top_v1 and top_v2 are properly re-sized here (not just reserved).

Collaborator Author

I resize them above on line 487.

I did the reserve to make sure we don't have to move anything as the array grows. I am doing the resizing as necessary so I can use the .size(), .begin(), and .end() methods to reflect only what's actually used.
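A compact sketch of that pattern, again assuming rmm::device_uvector and using hypothetical names: capacity is reserved once up front so per-batch resize calls never reallocate or move data, while size() tracks only the portion actually filled.

    #include <rmm/cuda_stream_view.hpp>
    #include <rmm/device_uvector.hpp>

    #include <algorithm>
    #include <cstddef>

    // Hypothetical sketch (not the PR code): caller did top_score.reserve(topk, stream)
    // once, so this per-batch resize only grows the logical size within that capacity.
    void append_batch(rmm::device_uvector<float>& top_score,
                      std::size_t batch_results,
                      std::size_t topk,
                      rmm::cuda_stream_view stream)
    {
      std::size_t new_size = std::min(top_score.size() + batch_results, topk);
      top_score.resize(new_size, stream);
      // size()/begin()/end() now reflect exactly what has been filled so far.
    }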

@ChuckHastings ChuckHastings (Collaborator Author) left a comment

Will push an update to fix things and clean up the debugging code a bit.


@ChuckHastings ChuckHastings self-assigned this Nov 8, 2024
@ChuckHastings ChuckHastings added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Nov 8, 2024
@ChuckHastings ChuckHastings added this to the 24.12 milestone Nov 8, 2024
@ChuckHastings ChuckHastings marked this pull request as ready for review November 8, 2024 23:08
@ChuckHastings ChuckHastings requested a review from a team as a code owner November 8, 2024 23:08
@ChuckHastings ChuckHastings requested a review from a team as a code owner November 12, 2024 20:43
@seunghwak seunghwak (Contributor) left a comment

LGTM

    // MAX_PAIRS_PER_BATCH{static_cast<size_t>(handle.get_device_properties().multiProcessorCount) *
    //   (1 << 15)};
    size_t const MAX_PAIRS_PER_BATCH{100};
    // size_t const MAX_PAIRS_PER_BATCH{100};
Contributor

Better to delete the commented-out code?

@jnke2016 jnke2016 (Contributor) left a comment

Looks good to me

@rlratzel rlratzel (Contributor) left a comment

I just reviewed the updated Python code and had a few comments.

    #
    join = df1.merge(df2, left_on=["src1", "dst1"], right_on=["src2", "dst2"])

    if len(df1) != len(join):
Contributor

I'm assuming this if block is just for debugging?

    join2 = df1.merge(
        df2, how="left", left_on=["src1", "dst1"], right_on=["src2", "dst2"]
    )
    pd.set_option("display.max_rows", 500)
Contributor

It might be good to restore the option afterwards by saving the original value with pd.get_option.

Comment on lines +178 to +179
    # Check to see if all pairs in the original data frame
    # still exist in the new data frame. If we join (merge)
Contributor

is "original" df1 and "new" df2?

    worst_coeff = all_pairs_jaccard_results["jaccard_coeff"].min()
    better_than_k = jaccard_results[jaccard_results["jaccard_coeff"] > worst_coeff]

    compare(
Contributor

Since compare will raise an AssertionError, I think it would be better to name it to indicate that: assert_df_results_equal or something like that.

    @@ -153,6 +154,54 @@ def networkx_call(M, benchmark_callable=None):
        return src, dst, coeff


    # FIXME: This compare is shared across several tests... it should be
    # a general utility
    def compare(src1, dst1, val1, src2, dst2, val2):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like compare is always being passed values as 3 individual series from a dataframe. Since compare will just re-create these as a dataframe, can compare be written to be called like this: compare(all_pairs_jaccard_results, jaccard_results, ['first', 'second', 'jaccard_coeff']) ?

def compare(a, b, names):
    df1 = a[names]
    df2 = b[names]
    join = df1.merge(...)
    ...

Contributor

I chatted with @ChuckHastings offline - I'll take care of refactoring this test utility in a separate PR since I'm looking at updates like this anyway.

Contributor

I'm addressing much of this feedback in #4776. I'm not rewriting as much of the compare utility as I proposed, since I can't tell how general purpose it was intended to be (e.g., it currently supports passing in arrays; should that still be allowed?).

Collaborator Author

I would suggest making it as general purpose as it makes sense to. I didn't write the original function... I was merely speculating about its potential reuse, at least within the similarity algorithms.

I'm not sure compare is really the right name of the function here either. It's validating that the similarity results are correct by determining that the results are a valid subset of the entire result set.

I'm also not 100% certain this was the best way to modify the function to make that computation.

@ChuckHastings ChuckHastings (Collaborator Author)

/merge

@rapids-bot rapids-bot bot merged commit 906ea6c into rapidsai:branch-24.12 Nov 18, 2024
121 checks passed
BradReesWork pushed a commit to BradReesWork/cugraph that referenced this pull request Nov 18, 2024
This PR adds C++ tests for the all-pairs variation of similarity algorithms.  Previously the all-pairs variation was only tested in SG mode.

This also addresses an issue where the all-pairs implementation would crash when there was a load imbalance across the GPUs and one of the GPUs ran out of work before the others.

Closes rapidsai#4704

Authors:
  - Chuck Hastings (https://github.com/ChuckHastings)

Approvers:
  - Seunghwa Kang (https://github.com/seunghwak)
  - Joseph Nke (https://github.com/jnke2016)
  - Rick Ratzel (https://github.com/rlratzel)

URL: rapidsai#4741
Labels
bug (Something isn't working), cuGraph, non-breaking (Non-breaking change), python
Development

Successfully merging this pull request may close these issues.

[BUG]: MG Link_Prediction Tests Hanging
4 participants