From f8b9ac375b21a1c2787a2a148b34eab756b01e3f Mon Sep 17 00:00:00 2001 From: Ralph Liu <137829296+nv-rliu@users.noreply.github.com> Date: Thu, 3 Oct 2024 09:51:54 -0400 Subject: [PATCH] Add `nx-cugraph` introduction notebook to repo (#4677) ## Proposed Changes This PR adds an introduction notebook to the `notebooks/demo` directory of the repository. Click the link to view the files directory in the dev branch. * [The introduction to `nx-cugraph` notebook](https://github.com/rapidsai/cugraph/blob/1a70b7ad588bcc1dc6f65af8fe07149ed8083ef0/notebooks/demo/accelerating_networkx.ipynb) --------- Co-authored-by: Erik Welch --- notebooks/demo/accelerating_networkx.ipynb | 614 +++++++++++++++++++++ 1 file changed, 614 insertions(+) create mode 100644 notebooks/demo/accelerating_networkx.ipynb diff --git a/notebooks/demo/accelerating_networkx.ipynb b/notebooks/demo/accelerating_networkx.ipynb new file mode 100644 index 00000000000..1a6c6cfb3f6 --- /dev/null +++ b/notebooks/demo/accelerating_networkx.ipynb @@ -0,0 +1,614 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "R2cpVp2WdOsp" + }, + "source": [ + "# NetworkX - Easy Graph Analytics\n", + "\n", + "NetworkX is the most popular library for graph analytics available in Python, or quite possibly any language. To illustrate this, NetworkX was downloaded more than 71 million times in September of 2024 alone, which is roughly 71 times more than the next most popular graph analytics library! [*](https://en.wikipedia.org/wiki/NetworkX) NetworkX has earned this popularity from its very easy-to-use API, the wealth of documentation and examples available, the large (and friendly) community behind it, and its easy installation which requires nothing more than Python.\n", + "\n", + "However, NetworkX users are familiar with the tradeoff that comes with those benefits. The pure-Python implementation often results in poor performance when graph data starts to reach larger scales, limiting the usefulness of the library for many real-world problems.\n", + "\n", + "# Accelerated NetworkX - Easy (and fast!) Graph Analytics\n", + "\n", + "To address the performance problem, NetworkX 3.0 introduced a mechanism to dispatch algorithm calls to alternate implementations. The NetworkX Python API remains the same but NetworkX will use more capable algorithm implementations provided by one or more backends. This approach means users don't have to give up NetworkX -or even change their code- in order to take advantage of GPU performance." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xkg10FrNThrK" + }, + "source": [ + "# Let's Get the Environment Setup\n", + "This notebook will demonstrate NetworkX both with and without GPU acceleration provided by the `nx-cugraph` backend.\n", + "\n", + "`nx-cugraph` is available as a package installable using `pip`, `conda`, and [from source](https://github.com/rapidsai/nx-cugraph). Before importing `networkx`, lets install `nx-cugraph` so it can be registered as an available backend by NetworkX when needed. We'll use `pip` to install.\n", + "\n", + "NOTES:\n", + "* `nx-cugraph` requires a compatible NVIDIA GPU, NVIDIA CUDA and associated drivers, and a supported OS. Details about these and other installation prerequisites can be seen [here](https://docs.rapids.ai/install#system-req).\n", + "* The `nx-cugraph` package is currently hosted by NVIDIA and therefore the `--extra-index-url` option must be used.\n", + "* `nx-cugraph` is supported on specific 11.x and 12.x CUDA versions, and the major version number must be known in order to install the correct build (this is determined automatically when using `conda`).\n", + "\n", + "To find the CUDA major version on your system, run the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NMFwzc1I95BS" + }, + "outputs": [], + "source": [ + "!nvcc --version" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i91Yj-yZ-nGS" + }, + "source": [ + "From the above output we can see we're using CUDA 12.x so we'll be installing `nx-cugraph-cu12`. If we were using CUDA 11.x, the package name would be `nx-cugraph-cu11`. We'll also be adding `https://pypi.nvidia.com` as an `--extra-index-url`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mYYN9EpnWphu" + }, + "outputs": [], + "source": [ + "!pip install nx-cugraph-cu12 --extra-index-url=https://pypi.nvidia.com" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0h1K-7tI_AZH" + }, + "source": [ + "Of course, we'll also be using `networkx`, which is already provided in the Colab environment. This notebook will be using features added in version 3.3, so we'll import it here to verify we have a compatible version." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YTV0ZTME2tV6" + }, + "outputs": [], + "source": [ + "import networkx as nx\n", + "nx.__version__" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UiZKOa3WC7be" + }, + "source": [ + "# Let's Start with Something Simple\n", + "\n", + "To begin, we'll compare NetworkX results without a backend to results of the same algorithm using the `nx-cugraph` backend on a small graph. `nx.karate_club_graph()` returns an instance of the famous example graph consisting of 34 nodes and 78 edges from Zachary's paper, described [here](https://en.wikipedia.org/wiki/Zachary%27s_karate_club)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3atL3tI0frYm" + }, + "source": [ + "## Betweenness Centrality\n", + "[Betweenness Centrality](https://en.wikipedia.org/wiki/Betweenness_centrality) is a graph algorithm that computes a centrality score for each node (`v`) based on how many of the shortest paths between pairs of nodes in the graph pass through `v`. A higher centrality score represents a node that \"connects\" other nodes in a network more than that of a node with a lower score.\n", + "\n", + "First, let's create a NetworkX Graph instance of the the Karate Club graph and inspect it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JSw7EZ46-kRu" + }, + "outputs": [], + "source": [ + "G = nx.karate_club_graph()\n", + "G.number_of_nodes(), G.number_of_edges()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_-E17u2gKgbC" + }, + "source": [ + "Next, let's run betweenness centrality and save the results. Because the Karate Club graph is so small, this should not take long." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qjxXXKJhKQ4s" + }, + "outputs": [], + "source": [ + "%%time\n", + "nx_bc_results = nx.betweenness_centrality(G)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ClrR3z9XMfLr" + }, + "source": [ + "Now, let's run the same algorithm on the same data using the `nx-cugraph` backend.\n", + "\n", + "There are several ways to instruct NetworkX to use a particular backend instead of the default implementation. Here, we will use the `config` API, which was added in NetworkX version 3.3.\n", + "\n", + "The following two lines set the backend to \"cugraph\" and enable graph conversion caching.\n", + "\n", + "Some notes:\n", + "* The standard convention for NetworkX backends is to name the package with a `nx-` prefix to denote that these are packages intended to be used with NetworkX, but the `nx-` prefix is not included when referring to them in NetworkX API calls. Here, `nx-cugraph` is the name of the backend package, and `\"cugraph\"` is the name NetworkX will use to refer to it.\n", + "* NetworkX can use multiple backends! `nx.config.backend_priority` is a list that can contain several backends, ordered based on priority. If a backend in the list cannot run a particular algorithm (either because it isn't supported in the backend, the algorithm doesn't support a particular option, or some other reason), NetworkX will try the next backend in the list. If no specified backend is able to run the algorithm, NetworkX will fall back to the default implementation.\n", + "* Many backends have their own data structures for representing an input graph, often optimized for that backend's implementation. Prior to running a backend algorithm, NetworkX will have the backend convert the standard NetworkX Graph instance to the backend-specific type. This conversion can be expensive, and rather than repeat it as part of each algorithm call, NetworkX can cache the conversion so it can be skipped on future calls if the graph doesn't change. This caching can save significant time and improve overall performance." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oFHwNqqsNsqS" + }, + "outputs": [], + "source": [ + "nx.config.backend_priority=[\"cugraph\"] # NETWORKX_BACKEND_PRIORITY=cugraph\n", + "nx.config.cache_converted_graphs=True # NETWORKX_CACHE_CONVERTED_GRAPHS=True" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "HrUeWRRQRzFP" + }, + "outputs": [], + "source": [ + "%%time\n", + "nxcg_bc_results = nx.betweenness_centrality(G)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z1hxut3GTj5A" + }, + "source": [ + "You may have noticed that using the `nx-cugraph` backend resulted in a slightly slower execution time. This is not surprising when working with a graph this small, since the overhead of converting the graph for the first time and launching the algorithm kernel on the GPU is actually significantly more than the computation time itself. We'll see later that this overhead is negligible when compared to the time saved when running on a GPU for larger graphs.\n", + "\n", + "Since we've enabled graph conversion caching, we can see that if we re-run the same call the execution time is noticeably shorter." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7a0XvpUOr9Ju" + }, + "outputs": [], + "source": [ + "%%time\n", + "nxcg_bc_results = nx.betweenness_centrality(G)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ppjE5J5RscOe" + }, + "source": [ + "Notice the warning above about using the cache. This will only be raised **once** per graph instance (it can also be easily disabled), but its purpose is to point out that the cache should not be used if the Graph object will have its attribute dictionary modified directly. In this case and many others, we won't be modifying the dictionaries directly. Instead, we will use APIs such as `nx.set_node_attributes` which properly clear the cache, so it's safe for us to use the cache. Because of that, we'll disable the warning so we don't see it on other graphs in this session." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Namb5JLvwS-q" + }, + "outputs": [], + "source": [ + "import warnings\n", + "warnings.filterwarnings(\"ignore\", message=\"Using cached graph for 'cugraph' backend\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BzGAphcILFsT" + }, + "source": [ + "Smaller graphs are also easy to visualize with NetworkX's plotting utilities. The flexibility of NetworkX's `Graph` instances make it trivial to add the betweenness centrality scores back to the graph object as node attributes. This will allow us to use those values for the visualization.\n", + "\n", + "In this case, we'll create new attributes for each node called \"nx_bc\" for the default NetworkX results, and \"nxcg_bc\" for the nx-cugraph results. We'll use those values to assign the color for each node and plot two graphs side-by-side. This will make it easy to visually validate that the nodes with the higher centrality scores for both implementations match and do indeed appear to be more \"central\" to other nodes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "1coV6ZfcUoqI" + }, + "outputs": [], + "source": [ + "nx.set_node_attributes(G, nx_bc_results, \"nx_bc\")\n", + "nx.set_node_attributes(G, nxcg_bc_results, \"nxcg_bc\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Sba2iYJgLoN2" + }, + "outputs": [], + "source": [ + "# Configure plot size and layout/position for each node\n", + "import matplotlib.pyplot as plt\n", + "plt.rcParams['figure.figsize'] = [12, 8]\n", + "pos = nx.spring_layout(G)\n", + "\n", + "# Assign colors for each set of betweenness centrality results\n", + "nx_colors = [G.nodes[n][\"nx_bc\"] for n in G.nodes()]\n", + "nxcg_colors = [G.nodes[n][\"nxcg_bc\"] for n in G.nodes()]\n", + "\n", + "# Plot the graph and color each node corresponding to NetworkX betweenness centrality values\n", + "plt.subplot(1, 2, 1)\n", + "nx.draw(G, pos=pos, with_labels=True, node_color=nx_colors)\n", + "\n", + "# Plot the graph and color each node corresponding to nx-cugraph betweenness centrality values\n", + "plt.subplot(1, 2, 2)\n", + "nx.draw(G, pos=pos, with_labels=True, node_color=nxcg_colors)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dJXH4Zn5VNSg" + }, + "source": [ + "As we can see, the same two nodes (`0` and `33`) are the two most central in both graphs, followed by `2`, `31`, and `32`.\n", + "\n", + "## PageRank\n", + "Another popular algorithm is [PageRank](https://en.wikipedia.org/wiki/PageRank). PageRank also assigns scores to each node, but these scores are based on analyzing links to each node to determine relative \"importance\" within the graph.\n", + "\n", + "Let's update the config to use the default NetworkX implementation and run `nx.pagerank`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9CdYNk62E1v_" + }, + "outputs": [], + "source": [ + "nx.config.backend_priority=[]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Jo39YxVmYolq" + }, + "outputs": [], + "source": [ + "%%time\n", + "nx_pr_results = nx.pagerank(G)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sV6dM8ToZDiC" + }, + "source": [ + "We could set `nx.config.backend_priority` again to list `\"cugraph\"` as the backend, but let's instead show how the `backend` kwarg can be used to override the priority list and force a specific backend to be used." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oMSvQVGKY0rn" + }, + "outputs": [], + "source": [ + "%%time\n", + "nxcg_pr_results = nx.pagerank(G, backend=\"cugraph\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZGux_8xFZneI" + }, + "source": [ + "In this example, instead of plotting the graph to show that the results are identical, we can compare them directly using the saved values from both runs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "RcmtdFy4Zw7p" + }, + "outputs": [], + "source": [ + "sorted(nx_pr_results) == sorted(nxcg_pr_results)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mefjUEAnZ4pq" + }, + "source": [ + "# Working with Bigger Data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yLY-yl6PuNYo" + }, + "source": [ + "Now we'll look at a larger dataset from https://snap.stanford.edu/data/cit-Patents.html which contains citations across different U.S. patents granted from January 1, 1963 to December 30, 1999. The dataset represents 16.5M citations (edges) between 3.77M patents (nodes).\n", + "\n", + "This will demonstrate that data of this size starts to push the limits of the default pure-Python NetworkX implementation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lyYF0LbtFwjh" + }, + "outputs": [], + "source": [ + "# The locale encoding may have been modified from the plots above, reset here to run shell commands\n", + "import locale\n", + "locale.getpreferredencoding = lambda: \"UTF-8\"\n", + "!wget https://data.rapids.ai/cugraph/datasets/cit-Patents.csv # Skip if cit-Patents.csv already exists.\n", + "# !wget https://snap.stanford.edu/data/cit-Patents.txt.gz # Skip if cit-Patents.txt.gz already exists." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kjGINYphQSQ2" + }, + "outputs": [], + "source": [ + "%load_ext cudf.pandas\n", + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "iV4DieGZOalc" + }, + "outputs": [], + "source": [ + "%%time\n", + "df = pd.read_csv(\"cit-Patents.csv\",\n", + " sep=\" \",\n", + " names=[\"src\", \"dst\"],\n", + " dtype=\"int32\",\n", + ")\n", + "# df = pd.read_csv(\"cit-Patents.txt.gz\",\n", + "# compression=\"gzip\",\n", + "# skiprows=4,\n", + "# sep=\"\\t\",\n", + "# names=[\"src\", \"dst\"],\n", + "# dtype=\"int32\",\n", + "# )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PREA67u4eKat" + }, + "outputs": [], + "source": [ + "%%time\n", + "G = nx.from_pandas_edgelist(df, source=\"src\", target=\"dst\")\n", + "G.number_of_nodes(), G.number_of_edges()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NcsUxBqpu4zY" + }, + "source": [ + "By default, `nx.betweenness_centrality` will perform an all-pairs shortest path analysis when determining the centrality scores for each node. However, due to the much larger size of this graph, determining the shortest path for all pairs of nodes in the graph is not feasible. Instead, we'll use the parameter `k` to limit the number of shortest path computations used for determining the centrality scores, at the expense of accuracy. As we'll see when using a dataset this size with `nx.betweenness_centrality`, we have to limit `k` to `1` which is not practical but is sufficient here for demonstration purposes (since anything larger than `1` will result in many minutes of execution time)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gNDWbj3kAk3j" + }, + "outputs": [], + "source": [ + "%%time\n", + "bc_results = nx.betweenness_centrality(G, k=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NB8xmxMd1PlX" + }, + "source": [ + "Now we'll configure NetworkX to use the `nx-cugraph` backend (again, using the name convention that drops the package name's `nx-` prefix) and run the same call. Because this is a Graph that `nx-cugraph` hasn't seen before, the runtime will include the time to convert and cache a GPU-based graph." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xUYNG1xhvbWc" + }, + "outputs": [], + "source": [ + "nx.config.backend_priority = [\"cugraph\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cmK8ZuQGvfPo" + }, + "outputs": [], + "source": [ + "%%time\n", + "bc_results = nx.betweenness_centrality(G, k=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vdHb1YXP15TZ" + }, + "source": [ + "Let's run betweenness centrality again, now with a more useful number of samples by setting `k=100`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fKjIrzL-vrGS" + }, + "outputs": [], + "source": [ + "%%time\n", + "bc_results = nx.betweenness_centrality(G, k=100)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QeMcrAX2HZSM" + }, + "source": [ + "Let's also run pagerank on the same dataset to compare." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gR8ID6ekHgHt" + }, + "outputs": [], + "source": [ + "nx.config.backend_priority = []" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rTFuvX5wb_c1" + }, + "outputs": [], + "source": [ + "%%time\n", + "nx_pr_results = nx.pagerank(G)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8sJx9aeJV9hv" + }, + "outputs": [], + "source": [ + "%%time\n", + "nxcg_pr_results = nx.pagerank(G, backend=\"cugraph\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wGOVQ6ZyY4Ih" + }, + "outputs": [], + "source": [ + "sorted(nx_pr_results) == sorted(nxcg_pr_results)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k2DfAaZaDIBj" + }, + "source": [ + "---\n", + "\n", + "Information on the U.S. Patent Citation Network dataset used in this notebook is as follows:\n", + "
Authors: Jure Leskovec and Andrej Krevl\n", + "
Title: SNAP Datasets, Stanford Large Network Dataset Collection\n", + "
URL: http://snap.stanford.edu/data\n", + "
Date: June 2014\n", + "
\n" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}