diff --git a/notebooks/algorithms/link_prediction/Jaccard-Similarity.ipynb b/notebooks/algorithms/link_prediction/Jaccard-Similarity.ipynb index b5f09c0c145..86bb4d17c22 100755 --- a/notebooks/algorithms/link_prediction/Jaccard-Similarity.ipynb +++ b/notebooks/algorithms/link_prediction/Jaccard-Similarity.ipynb @@ -8,7 +8,12 @@ "# Jaccard Similarity\n", "----\n", "\n", - "In this notebook we will explore the Jaccard vertex similarity metrics available in cuGraph." + "In this notebook we will explore the Jaccard vertex similarity metrics available in cuGraph.\n", + "\n", + "cuGraph supports Jaccard similarity for both unweighted and weighted graphs, but this notebook \n", + "will demonstrate Jaccard similarity only on unweighted graphs. A future update will include an \n", + "example using a graph with edge weights, where the weights are used to influence the Jaccard \n", + "similarity coefficients." ] }, { @@ -18,30 +23,48 @@ "source": [ "## Introduction\n", "\n", - "The Jaccard similarity between two sets is defined as the ratio of the volume of their intersection divided by the volume of their union. \n", + "The Jaccard similarity between two sets is defined as the size of their intersection \n", + "divided by the size of their union, where the sets used are the sets of neighboring vertices for each \n", + "vertex.\n", + "\n", + "The neighbors of a vertex, _v_, are defined as the set, _U_, of vertices connected to _v_ by an edge, or _N(v) = U = {u ∈ V : (v, u) ∈ E}_.\n", "\n", - "The Jaccard Similarity can then be expressed as\n", + "If we then let set __A__ be the set of neighbors for vertex _a_, and set __B__ be the set of neighbors for vertex _b_, then the Jaccard Similarity for the vertex pair _(a, b)_ can be expressed as\n", "\n", "$\\text{Jaccard similarity} = \\frac{|A \\cap B|}{|A \\cup B|}$\n", "\n", "\n", - "To compute the Jaccard similarity between all pairs of vertices connected by an edge in cuGraph use:
\n", - "__df = cugraph.jaccard(G)__\n", + "cuGraph's Jaccard function will, by default, compute the Jaccard similarity coefficient for every pair of \n", + "vertices in the two-hop neighborhood of each vertex.\n", + "\n", + "```df = cugraph.jaccard(G, vertex_pair=None)```\n", + "\n", + "Parameters:\n", "\n", " G: A cugraph.Graph object\n", "\n", + " vertex_pair: cudf.DataFrame, optional (default=None)\n", + " A GPU dataframe consisting of two columns representing pairs of\n", + " vertices. If provided, the Jaccard coefficient is computed for the\n", + " given vertex pairs. If vertex_pair is not provided, the\n", + " Jaccard coefficient is computed for every pair of vertices in\n", + " the same two-hop neighborhood.\n", + "\n", "Returns:\n", "\n", " df: cudf.DataFrame with three columns:\n", " df[\"first\"]: The first vertex id of each pair.\n", " df[\"second\"]: The second vertex id of each pair.\n", " df[\"jaccard_coeff\"]: The jaccard coefficient computed between the vertex pairs.\n", - "
\n", + "\n", + "To limit the computation to specific vertex pairs, including those not in the same two-hop \n", + "neighborhood, pass a `vertex_pair` value (see example below).\n", "\n", "__References__ \n", "- https://research.nvidia.com/publication/2017-11_Parallel-Jaccard-and \n", "\n", "__Additional Reading__ \n", + "- [Intro to Graph Analysis using cuGraph: Similarity Algorithms](https://medium.com/rapids-ai/intro-to-graph-analysis-using-cugraph-similarity-algorithms-64fa923791ac)\n", "- [Wikipedia: Jaccard](https://en.wikipedia.org/wiki/Jaccard_index)\n" ] }, @@ -71,7 +94,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": { "scrolled": true }, @@ -96,7 +119,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -115,7 +138,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ @@ -134,7 +157,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -147,9 +170,189 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " first second jaccard_coeff\n", + "541 14 15 1.000000\n", + "542 14 18 1.000000\n", + "543 14 20 1.000000\n", + "544 14 22 1.000000\n", + "561 15 18 1.000000\n", + "562 15 20 1.000000\n", + "563 15 22 1.000000\n", + "587 17 21 1.000000\n", + "605 18 20 1.000000\n", + "606 18 22 1.000000\n", + "625 20 22 1.000000\n", + "299 7 13 0.800000\n", + "285 6 10 0.750000\n", + "388 4 5 0.750000\n", + "443 19 21 0.666667\n", + "502 9 28 0.666667\n", + "584 17 19 0.666667\n", + "223 13 19 0.600000\n", + "45 32 33 0.526316\n", + "310 7 12 0.500000" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "# Show the top-20 most similar vertices.\n", "jaccard_coeffs.head(20)" @@ -169,15 +372,63 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We have to specify vertices in a DataFrame to see their similarity if they\n", - "are not part of the same two-hop neighborhood." + "If we want to see the similarity of a pair of vertices that are not part of \n", + "the same two-hop neighborhood, we have to specify them in a `cudf.DataFrame` \n", + "to pass to the `jaccard` call." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " first second jaccard_coeff\n", + "0 16 33 0.0" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "cugraph.jaccard(G, cudf.DataFrame([(16, 33)]))" ] @@ -191,6 +442,88 @@ "neighbors." ] }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use the `cudf.DataFrame` argument to pass in any number of specific vertex pairs \n", + "to compute the similarity for, regardless of whether or not they're included by default. \n", + "This is useful to limit the computation and result size when only specific vertex \n", + "similarities are needed." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " first second jaccard_coeff\n", + "0 16 33 0.000000\n", + "1 32 33 0.526316\n", + "2 0 23 0.000000" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pairs = cudf.DataFrame([(16, 33), (32, 33), (0, 23)])\n", + "cugraph.jaccard(G, pairs)" + ] + }, { "attachments": {}, "cell_type": "markdown", @@ -206,6 +539,21 @@ "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n", "___" ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Revision History\n", + "\n", + "| Author | Date | Update | cuGraph Version | Test Hardware |\n", + "| --------------|------------|------------------|-----------------|---------------------------|\n", + "| Brad Rees | 10/14/2019 | created | 0.14 | GV100 32 GB, CUDA 10.2 |\n", + "| Don Acosta | 07/20/2022 | tested/updated | 22.08 nightly | DGX Tesla V100, CUDA 11.5 |\n", + "| Ralph Liu | 06/29/2023 | updated | 23.08 nightly | DGX Tesla V100, CUDA 12.0 |\n", + "| Rick Ratzel | 02/23/2024 | tested/updated | 24.04 nightly | DGX Tesla V100, CUDA 12.0 |" + ] } ], "metadata": {