From 1405036fcbf3ec6347f52e7b69ca5bc9d77bc6f0 Mon Sep 17 00:00:00 2001
From: Ralph Liu
Date: Thu, 26 Sep 2024 17:21:40 -0700
Subject: [PATCH 01/11] Initial commit

---
 notebooks/demo/accelerating_networkx.ipynb | 614 +++++++++++++++++++++
 notebooks/demo/nxcg_wikipedia.ipynb        | 170 ++++++
 2 files changed, 784 insertions(+)
 create mode 100644 notebooks/demo/accelerating_networkx.ipynb
 create mode 100644 notebooks/demo/nxcg_wikipedia.ipynb

diff --git a/notebooks/demo/accelerating_networkx.ipynb b/notebooks/demo/accelerating_networkx.ipynb
new file mode 100644
index 00000000000..19a62261a3b
--- /dev/null
+++ b/notebooks/demo/accelerating_networkx.ipynb
@@ -0,0 +1,614 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "R2cpVp2WdOsp"
+   },
+   "source": [
+    "# NetworkX - Easy Graph Analytics\n",
+    "\n",
+    "NetworkX is the most popular library for graph analytics available in python, or quite possibly any language. To illustrate this, NetworkX was downloaded more than 50 million times in April of 2024 alone, which is roughly 50 times more than the next most popular graph analytics library! [*](https://en.wikipedia.org/wiki/NetworkX) NetworkX has earned this popularity from its very easy-to-use API, the wealth of documentation and examples available, the large (and friendly) community behind it, and its easy installation which requires nothing more than python.\n",
+    "\n",
+    "However, NetworkX users are familiar with the tradeoff that comes with those benefits. The pure-python implementation often results in poor performance when graph data starts to reach larger scales, limiting the usefulness of the library for many real-world problems.\n",
+    "\n",
+    "# Accelerated NetworkX - Easy (and fast!) Graph Analytics\n",
+    "\n",
+    "To address the performance problem, NetworkX 3.0 introduced a mechanism to dispatch algorithm calls to alternate implementations. The NetworkX python API remains the same but NetworkX will use more capable algorithm implementations provided by one or more backends. This approach means users don't have to give up NetworkX -or even change their code- in order to take advantage of GPU performance."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "xkg10FrNThrK"
+   },
+   "source": [
+    "# Let's Get the Environment Set Up\n",
+    "This notebook will demonstrate NetworkX both with and without GPU acceleration provided by the `nx-cugraph` backend.\n",
+    "\n",
+    "`nx-cugraph` is available as a package installable using `pip`, `conda`, and [from source](https://github.com/rapidsai/nx-cugraph). Before importing `networkx`, let's install `nx-cugraph` so it can be registered as an available backend by NetworkX when needed. We'll use `pip` to install.\n",
+    "\n",
+    "NOTES:\n",
+    "* `nx-cugraph` requires a compatible NVIDIA GPU, NVIDIA CUDA and associated drivers, and a supported OS. Details about these and other installation prerequisites can be seen [here](https://docs.rapids.ai/install#system-req).\n",
+    "* The `nx-cugraph` package is currently hosted by NVIDIA and therefore the `--extra-index-url` option must be used.\n",
+    "* `nx-cugraph` is supported on specific 11.x and 12.x CUDA versions, and the major version number must be known in order to install the correct build (this is determined automatically when using `conda`).\n",
+    "\n",
+    "To find the CUDA major version on your system, run the following command:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "NMFwzc1I95BS"
+   },
+   "outputs": [],
+   "source": [
+    "!nvcc --version"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "i91Yj-yZ-nGS"
+   },
+   "source": [
+    "From the above output we can see we're using CUDA 12.x so we'll be installing `nx-cugraph-cu12`. If we were using CUDA 11.x, the package name would be `nx-cugraph-cu11`. We'll also be adding `https://pypi.nvidia.com` as an `--extra-index-url`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "mYYN9EpnWphu"
+   },
+   "outputs": [],
+   "source": [
+    "!pip install nx-cugraph-cu12 --extra-index-url=https://pypi.nvidia.com"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "0h1K-7tI_AZH"
+   },
+   "source": [
+    "Of course, we'll also be using `networkx`, which is already provided in the Colab environment. This notebook will be using features added in version 3.3, so we'll import it here to verify we have a compatible version."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "YTV0ZTME2tV6"
+   },
+   "outputs": [],
+   "source": [
+    "import networkx as nx\n",
+    "nx.__version__"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "UiZKOa3WC7be"
+   },
+   "source": [
+    "# Let's Start with Something Simple\n",
+    "\n",
+    "To begin, we'll compare NetworkX results without a backend to results of the same algorithm using the `nx-cugraph` backend on a small graph. `nx.karate_club_graph()` returns an instance of the famous example graph consisting of 34 nodes and 78 edges from Zachary's paper, described [here](https://en.wikipedia.org/wiki/Zachary%27s_karate_club)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "3atL3tI0frYm"
+   },
+   "source": [
+    "## Betweenness Centrality\n",
+    "[Betweenness Centrality](https://en.wikipedia.org/wiki/Betweenness_centrality) is a graph algorithm that computes a centrality score for each node (`v`) based on how many of the shortest paths between pairs of nodes in the graph pass through `v`. A higher centrality score represents a node that \"connects\" other nodes in a network more than a node with a lower score does.\n",
+    "\n",
+    "First, let's create a NetworkX Graph instance of the Karate Club graph and inspect it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "JSw7EZ46-kRu"
+   },
+   "outputs": [],
+   "source": [
+    "G = nx.karate_club_graph()\n",
+    "G.number_of_nodes(), G.number_of_edges()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "_-E17u2gKgbC"
+   },
+   "source": [
+    "Next, let's run betweenness centrality and save the results. Because the Karate Club graph is so small, this should not take long.\n",
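+    "\n",
+    "As a reminder, the standard definition being computed here (following the Wikipedia description linked above) is\n",
+    "\n",
+    "$$c_B(v) = \\sum_{s \\neq v \\neq t} \\frac{\\sigma_{st}(v)}{\\sigma_{st}}$$\n",
+    "\n",
+    "where $\\sigma_{st}$ is the number of shortest paths between nodes $s$ and $t$, and $\\sigma_{st}(v)$ is the number of those paths that pass through $v$. NetworkX additionally normalizes these scores by default."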
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qjxXXKJhKQ4s" + }, + "outputs": [], + "source": [ + "%%time\n", + "nx_bc_results = nx.betweenness_centrality(G)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ClrR3z9XMfLr" + }, + "source": [ + "Now, let's run the same algorithm on the same data using the `nx-cugraph` backend.\n", + "\n", + "There are several ways to instruct NetworkX to use a particular backend instead of the default implementation. Here, we will use the `config` API, which was added in NetworkX version 3.3.\n", + "\n", + "The following two lines set the backend to \"cugraph\" and enable graph conversion caching.\n", + "\n", + "Some notes:\n", + "* The standard convention for NetworkX backends is to name the package with a `nx-` prefix to denote that these are packages intended to be used with NetworkX, but the `nx-` prefix is not included when referring to them in NetworkX API calls. Here, `nx-cugraph` is the name of the backend package, and `\"cugraph\"` is the name NetworkX will use to refer to it.\n", + "* NetworkX can use multiple backends! `nx.config.backend_priority` is a list that can contain several backends, ordered based on priority. If a backend in the list cannot run a particular algorithm (either because it isn't supported in the backend, the algorithm doesn't support a particular option, or some other reason), NetworkX will try the next backend in the list. If no specified backend is able to run the algorithm, NetworkX will fall back to the default implementation.\n", + "* Many backends have their own data structures for representing an input graph, often optimized for that backend's implementation. Prior to running a backend algorithm, NetworkX will have the backend convert the standard NetworkX Graph instance to the backend-specific type. This conversion can be expensive, and rather than repeat it as part of each algorithm call, NetworkX can cache the conversion so it can be skipped on future calls if the graph doesn't change. This caching can save significant time and improve overall performance." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oFHwNqqsNsqS" + }, + "outputs": [], + "source": [ + "nx.config.backend_priority=[\"cugraph\"] # NETWORKX_BACKEND_PRIORITY=cugraph\n", + "nx.config.cache_converted_graphs=True # NETWORKX_CACHE_CONVERTED_GRAPHS=True" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "HrUeWRRQRzFP" + }, + "outputs": [], + "source": [ + "%%time\n", + "nxcg_bc_results = nx.betweenness_centrality(G)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z1hxut3GTj5A" + }, + "source": [ + "You may have noticed that using the `nx-cugraph` backend resulted in a slightly slower execution time. This is not suprising when working with a graph this small, since the overhead of converting the graph for the first time and launching the algorithm kernel on the GPU is actually significantly more than the computation time itself. We'll see later that this overhead is negligable when compared to the time saved when running on a GPU for larger graphs.\n", + "\n", + "Since we've enabled graph conversion caching, we can see that if we re-run the same call the execution time is noticeably shorter." 
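+    "\n",
+    "If conversion caching ever needs to be turned off again (for example, in a workflow that mutates a graph's attribute dictionaries directly, a caveat discussed below), a sketch using the same config API would be:\n",
+    "\n",
+    "```python\n",
+    "# Disable conversion caching; later backend calls will re-convert the graph.\n",
+    "nx.config.cache_converted_graphs = False  # NETWORKX_CACHE_CONVERTED_GRAPHS=False\n",
+    "```"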
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "7a0XvpUOr9Ju"
+   },
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "nxcg_bc_results = nx.betweenness_centrality(G)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "ppjE5J5RscOe"
+   },
+   "source": [
+    "Notice the warning above about using the cache. This will only be raised **once** per graph instance (it can also be easily disabled), but its purpose is to point out that the cache should not be used if the Graph object will have its attribute dictionary modified directly. In this case and many others, we won't be modifying the dictionaries directly. Instead, we will use APIs such as `nx.set_node_attributes` which properly clear the cache, so it's safe for us to use the cache. Because of that, we'll disable the warning so we don't see it on other graphs in this session."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "Namb5JLvwS-q"
+   },
+   "outputs": [],
+   "source": [
+    "import warnings\n",
+    "warnings.filterwarnings(\"ignore\", message=\"Using cached graph for 'cugraph' backend\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "BzGAphcILFsT"
+   },
+   "source": [
+    "Smaller graphs are also easy to visualize with NetworkX's plotting utilities. The flexibility of NetworkX's `Graph` instances makes it trivial to add the betweenness centrality scores back to the graph object as node attributes. This will allow us to use those values for the visualization.\n",
+    "\n",
+    "In this case, we'll create new attributes for each node called \"nx_bc\" for the default NetworkX results, and \"nxcg_bc\" for the nx-cugraph results. We'll use those values to assign the color for each node and plot two graphs side-by-side. This will make it easy to visually validate that the nodes with the higher centrality scores for both implementations match and do indeed appear to be more \"central\" to other nodes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "1coV6ZfcUoqI"
+   },
+   "outputs": [],
+   "source": [
+    "nx.set_node_attributes(G, nx_bc_results, \"nx_bc\")\n",
+    "nx.set_node_attributes(G, nxcg_bc_results, \"nxcg_bc\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "Sba2iYJgLoN2"
+   },
+   "outputs": [],
+   "source": [
+    "# Configure plot size and layout/position for each node\n",
+    "import matplotlib.pyplot as plt\n",
+    "plt.rcParams['figure.figsize'] = [12, 8]\n",
+    "pos = nx.spring_layout(G)\n",
+    "\n",
+    "# Assign colors for each set of betweenness centrality results\n",
+    "nx_colors = [G.nodes[n][\"nx_bc\"] for n in G.nodes()]\n",
+    "nxcg_colors = [G.nodes[n][\"nxcg_bc\"] for n in G.nodes()]\n",
+    "\n",
+    "# Plot the graph and color each node corresponding to NetworkX betweenness centrality values\n",
+    "plt.subplot(1, 2, 1)\n",
+    "nx.draw(G, pos=pos, with_labels=True, node_color=nx_colors)\n",
+    "\n",
+    "# Plot the graph and color each node corresponding to nx-cugraph betweenness centrality values\n",
+    "plt.subplot(1, 2, 2)\n",
+    "nx.draw(G, pos=pos, with_labels=True, node_color=nxcg_colors)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "dJXH4Zn5VNSg"
+   },
+   "source": [
+    "As we can see, the same two nodes (`0` and `33`) are the two most central in both graphs, followed by `2`, `31`, and `32`.\n",
+    "\n",
+    "## PageRank\n",
+    "Another popular algorithm is [PageRank](https://en.wikipedia.org/wiki/PageRank). 
PageRank also assigns scores to each node, but these scores are based on analyzing links to each node to determine relative \"importance\" within the graph.\n", + "\n", + "Let's update the config to use the default NetworkX implementation and run `nx.pagerank`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9CdYNk62E1v_" + }, + "outputs": [], + "source": [ + "nx.config.backend_priority=[]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Jo39YxVmYolq" + }, + "outputs": [], + "source": [ + "%%time\n", + "nx_pr_results = nx.pagerank(G)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sV6dM8ToZDiC" + }, + "source": [ + "We could set `nx.config.backend_priority` again to list `\"cugraph\"` as the backend, but let's instead show how the `backend` kwarg can be used to override the priority list and force a specific backend to be used." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oMSvQVGKY0rn" + }, + "outputs": [], + "source": [ + "%%time\n", + "nxcg_pr_results = nx.pagerank(G, backend=\"cugraph\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZGux_8xFZneI" + }, + "source": [ + "In this example, instead of plotting the graph to show that the results are identical, we can compare them directly using the saved values from both runs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "RcmtdFy4Zw7p" + }, + "outputs": [], + "source": [ + "sorted(nx_pr_results) == sorted(nxcg_pr_results)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mefjUEAnZ4pq" + }, + "source": [ + "# Working with Bigger Data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yLY-yl6PuNYo" + }, + "source": [ + "Now we'll look at a larger dataset from https://snap.stanford.edu/data/cit-Patents.html which contains citations across different U.S. patents granted from January 1, 1963 to December 30, 1999. The dataset represents 16.5M citations (edges) between 3.77M patents (nodes).\n", + "\n", + "This will demonstrate that data of this size starts to push the limits of the default pure-python NetworkX implementation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lyYF0LbtFwjh" + }, + "outputs": [], + "source": [ + "# The locale encoding may have been modified from the plots above, reset here to run shell commands\n", + "import locale\n", + "locale.getpreferredencoding = lambda: \"UTF-8\"\n", + "!wget https://data.rapids.ai/cugraph/datasets/cit-Patents.csv # Skip if cit-Patents.csv already exists.\n", + "# !wget https://snap.stanford.edu/data/cit-Patents.txt.gz # Skip if cit-Patents.txt.gz already exists." 
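+    "\n",
+    "# Alternative sketch: fetch the dataset with Python instead of wget\n",
+    "# (assumes the `requests` package is available in this environment):\n",
+    "# import pathlib, requests\n",
+    "# path = pathlib.Path(\"cit-Patents.csv\")\n",
+    "# if not path.exists():\n",
+    "#     path.write_bytes(requests.get(\"https://data.rapids.ai/cugraph/datasets/cit-Patents.csv\").content)"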
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kjGINYphQSQ2" + }, + "outputs": [], + "source": [ + "%load_ext cudf.pandas\n", + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "iV4DieGZOalc" + }, + "outputs": [], + "source": [ + "%%time\n", + "df = pd.read_csv(\"cit-Patents.csv\",\n", + " sep=\" \",\n", + " names=[\"src\", \"dst\"],\n", + " dtype=\"int32\",\n", + ")\n", + "# df = pd.read_csv(\"cit-Patents.txt.gz\",\n", + "# compression=\"gzip\",\n", + "# skiprows=4,\n", + "# sep=\"\\t\",\n", + "# names=[\"src\", \"dst\"],\n", + "# dtype=\"int32\",\n", + "# )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PREA67u4eKat" + }, + "outputs": [], + "source": [ + "%%time\n", + "G = nx.from_pandas_edgelist(df, source=\"src\", target=\"dst\")\n", + "G.number_of_nodes(), G.number_of_edges()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NcsUxBqpu4zY" + }, + "source": [ + "By default, `nx.betweenness_centrality` will perform an all-pairs shortest path analysis when determining the centrality scores for each node. However, due to the much larger size of this graph, determining the shortest path for all pairs of nodes in the graph is not feasible. Instead, we'll use the parameter `k` to limit the number of shortest path computations used for determining the centrality scores, at the expense of accuracy. As we'll see when using a dataset this size with `nx.betweenness_centrality`, we have to limit `k` to `1` which is not practical but is sufficient here for demonstration purposes (since anything larger than `1` will result in many minutes of execution time)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gNDWbj3kAk3j" + }, + "outputs": [], + "source": [ + "%%time\n", + "bc_results = nx.betweenness_centrality(G, k=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NB8xmxMd1PlX" + }, + "source": [ + "Now we'll configure NetworkX to use the `nx-cugraph` backend (again, using the name convention that drops the package name's `nx-` prefix) and run the same call. Because this is a Graph that `nx-cugraph` hasn't seen before, the runtime will include the time to convert and cache a GPU-based graph." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xUYNG1xhvbWc" + }, + "outputs": [], + "source": [ + "nx.config.backend_priority = [\"cugraph\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cmK8ZuQGvfPo" + }, + "outputs": [], + "source": [ + "%%time\n", + "bc_results = nx.betweenness_centrality(G, k=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vdHb1YXP15TZ" + }, + "source": [ + "Let's run betweenness centrality again, now with a more useful number of samples by setting `k=100`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fKjIrzL-vrGS" + }, + "outputs": [], + "source": [ + "%%time\n", + "bc_results = nx.betweenness_centrality(G, k=100)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QeMcrAX2HZSM" + }, + "source": [ + "Let's also run pagerank on the same dataset to compare." 
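+    "\n",
+    "One note on the comparison below: `sorted(nx_pr_results) == sorted(nxcg_pr_results)` compares the sorted node ids (the dictionary keys). A sketch of a stricter check that also compares the scores, tolerating tiny floating-point differences between implementations (`same_pagerank` is our own helper name, not a NetworkX API), is:\n",
+    "\n",
+    "```python\n",
+    "import math\n",
+    "\n",
+    "def same_pagerank(a, b, tol=1e-6):\n",
+    "    # Same node ids, and scores that agree within `tol` for every node.\n",
+    "    return a.keys() == b.keys() and all(\n",
+    "        math.isclose(a[n], b[n], abs_tol=tol) for n in a\n",
+    "    )\n",
+    "```"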
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gR8ID6ekHgHt" + }, + "outputs": [], + "source": [ + "nx.config.backend_priority = []" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rTFuvX5wb_c1" + }, + "outputs": [], + "source": [ + "%%time\n", + "nx_pr_results = nx.pagerank(G)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8sJx9aeJV9hv" + }, + "outputs": [], + "source": [ + "%%time\n", + "nxcg_pr_results = nx.pagerank(G, backend=\"cugraph\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wGOVQ6ZyY4Ih" + }, + "outputs": [], + "source": [ + "sorted(nx_pr_results) == sorted(nxcg_pr_results)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k2DfAaZaDIBj" + }, + "source": [ + "---\n", + "\n", + "Information on the U.S. Patent Citation Network dataset used in this notebook is as follows:\n", + "
Authors: Jure Leskovec and Andrej Krevl\n", + "
Title: SNAP Datasets, Stanford Large Network Dataset Collection\n", + "
URL: http://snap.stanford.edu/data\n", + "
Date: June 2014\n", + "
\n" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/notebooks/demo/nxcg_wikipedia.ipynb b/notebooks/demo/nxcg_wikipedia.ipynb new file mode 100644 index 00000000000..53b39b51fb9 --- /dev/null +++ b/notebooks/demo/nxcg_wikipedia.ipynb @@ -0,0 +1,170 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# `nx-cugraph` Demo - Wikipedia Pagerank" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import networkx as nx" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!nvcc --version" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install nx-cugraph-cu12 --extra-index-url=https://pypi.nvidia.com" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!wget \"https://data.rapids.ai/cugraph/datasets/cit-Patents.csv\" # Skip if cit-Patents.csv already exists." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: dgx machine independent\n", + "dataset_folder = \"~/nvrliu/notebooks/demo/data/wikipedia\"\n", + "\n", + "edgelist_csv = f\"{dataset_folder}/enwiki-20240620-edges.csv\"\n", + "nodedata_csv = f\"{dataset_folder}/enwiki-20240620-nodeids.csv\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Read the Wikipedia Connectivity data from `edgelist_csv`" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 49 s, sys: 7.13 s, total: 56.1 s\n", + "Wall time: 56.1 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "edgelist_df = pd.read_csv(\n", + " edgelist_csv,\n", + " sep=\" \",\n", + " names=[\"src\", \"dst\"],\n", + " dtype=\"int32\",\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Read the Wikipedia Page metadata from `nodedata_csv`" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [], + "source": [ + "nodedata_df = pd.read_csv(\n", + " nodedata_csv,\n", + " sep=\"\\t\",\n", + " names=[\"nodeid\", \"title\"],\n", + " dtype={\"nodeid\": \"int32\", \"title\": \"str\"},\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a NetworkX graph from the connectivity info" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "G = nx.from_pandas_edgelist(\n", + " edgelist_df,\n", + " source=\"src\",\n", + " target=\"dst\",\n", + " create_using=nx.DiGraph,\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Run NetworkX pagerank" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "devenv", + "language": 
"python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From 66155bef2df857af916afa73efe4efb5376d9169 Mon Sep 17 00:00:00 2001 From: Ralph Liu Date: Thu, 26 Sep 2024 23:15:33 -0700 Subject: [PATCH 02/11] Changes --- notebooks/demo/nxcg_wikipedia.ipynb | 109 ++++++++++++++++++++-------- 1 file changed, 78 insertions(+), 31 deletions(-) diff --git a/notebooks/demo/nxcg_wikipedia.ipynb b/notebooks/demo/nxcg_wikipedia.ipynb index 53b39b51fb9..4e413621db3 100644 --- a/notebooks/demo/nxcg_wikipedia.ipynb +++ b/notebooks/demo/nxcg_wikipedia.ipynb @@ -4,35 +4,31 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# `nx-cugraph` Demo - Wikipedia Pagerank" + "# `nx-cugraph` Demo - Wikipedia Pagerank\n", + "\n", + "This notebook demonstrates a zero code change, end-to-end workflow using `cudf.pandas` and `nx-cugraph`." ] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ + "# Uncomment these two lines to enable cudf.pandas and nx-cugraph\n", + "\n", + "%load_ext cudf.pandas\n", + "!NETWORKX_BACKEND_PRIORITY=cugraph\n", + "\n", "import pandas as pd\n", "import networkx as nx" ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!nvcc --version" - ] - }, - { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "!pip install nx-cugraph-cu12 --extra-index-url=https://pypi.nvidia.com" + "Downloading the data" ] }, { @@ -46,7 +42,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -66,18 +62,9 @@ }, { "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 49 s, sys: 7.13 s, total: 56.1 s\n", - "Wall time: 56.1 s\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "%%time\n", "\n", @@ -98,7 +85,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -119,7 +106,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -135,7 +122,67 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Run NetworkX pagerank" + "Run pagerank on NetworkX" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "nx_pr_vals = nx.pagerank(G)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a DataFrame containing the results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pagerank_df = pd.DataFrame({\n", + " \"nodeid\": nx_pr_vals.keys(),\n", + " \"pagerank\": nx_pr_vals.values()\n", + " })" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Add NetworkX results to `nodedata` as new columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "nodedata_df = nodedata_df.merge(pagerank_df, how=\"left\", on=\"nodeid\")" + ] + }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "Here the top 25 pages based on pagerank value" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "nodedata_df.sort_values(by=\"pagerank\", ascending=False).head(25)" ] }, { From e0f2415b3017e6c6d2d86f8dd9657df64b70c7db Mon Sep 17 00:00:00 2001 From: Ralph Liu Date: Thu, 26 Sep 2024 23:33:27 -0700 Subject: [PATCH 03/11] Changes --- notebooks/demo/nxcg_wikipedia.ipynb | 217 ------------------------ notebooks/demo/nxcg_wikipedia_e2e.ipynb | 151 +++++++++++++++++ 2 files changed, 151 insertions(+), 217 deletions(-) delete mode 100644 notebooks/demo/nxcg_wikipedia.ipynb create mode 100644 notebooks/demo/nxcg_wikipedia_e2e.ipynb diff --git a/notebooks/demo/nxcg_wikipedia.ipynb b/notebooks/demo/nxcg_wikipedia.ipynb deleted file mode 100644 index 4e413621db3..00000000000 --- a/notebooks/demo/nxcg_wikipedia.ipynb +++ /dev/null @@ -1,217 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# `nx-cugraph` Demo - Wikipedia Pagerank\n", - "\n", - "This notebook demonstrates a zero code change, end-to-end workflow using `cudf.pandas` and `nx-cugraph`." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "# Uncomment these two lines to enable cudf.pandas and nx-cugraph\n", - "\n", - "%load_ext cudf.pandas\n", - "!NETWORKX_BACKEND_PRIORITY=cugraph\n", - "\n", - "import pandas as pd\n", - "import networkx as nx" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Downloading the data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!wget \"https://data.rapids.ai/cugraph/datasets/cit-Patents.csv\" # Skip if cit-Patents.csv already exists." 
- ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "# TODO: dgx machine independent\n", - "dataset_folder = \"~/nvrliu/notebooks/demo/data/wikipedia\"\n", - "\n", - "edgelist_csv = f\"{dataset_folder}/enwiki-20240620-edges.csv\"\n", - "nodedata_csv = f\"{dataset_folder}/enwiki-20240620-nodeids.csv\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Read the Wikipedia Connectivity data from `edgelist_csv`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "\n", - "edgelist_df = pd.read_csv(\n", - " edgelist_csv,\n", - " sep=\" \",\n", - " names=[\"src\", \"dst\"],\n", - " dtype=\"int32\",\n", - " )" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Read the Wikipedia Page metadata from `nodedata_csv`" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "nodedata_df = pd.read_csv(\n", - " nodedata_csv,\n", - " sep=\"\\t\",\n", - " names=[\"nodeid\", \"title\"],\n", - " dtype={\"nodeid\": \"int32\", \"title\": \"str\"},\n", - " )" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a NetworkX graph from the connectivity info" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "G = nx.from_pandas_edgelist(\n", - " edgelist_df,\n", - " source=\"src\",\n", - " target=\"dst\",\n", - " create_using=nx.DiGraph,\n", - " )" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Run pagerank on NetworkX" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "nx_pr_vals = nx.pagerank(G)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a DataFrame containing the results" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pagerank_df = pd.DataFrame({\n", - " \"nodeid\": nx_pr_vals.keys(),\n", - " \"pagerank\": nx_pr_vals.values()\n", - " })" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Add NetworkX results to `nodedata` as new columns" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "nodedata_df = nodedata_df.merge(pagerank_df, how=\"left\", on=\"nodeid\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here the top 25 pages based on pagerank value" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "nodedata_df.sort_values(by=\"pagerank\", ascending=False).head(25)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "devenv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/notebooks/demo/nxcg_wikipedia_e2e.ipynb b/notebooks/demo/nxcg_wikipedia_e2e.ipynb new file mode 100644 index 00000000000..cfa06960ce7 --- /dev/null +++ b/notebooks/demo/nxcg_wikipedia_e2e.ipynb @@ -0,0 +1,151 @@ +{ 
+ "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# `nx-cugraph` Demo - Wikipedia Pagerank\n", + "\n", + "This notebook demonstrates a zero code change, end-to-end workflow using `cudf.pandas` and `nx-cugraph`." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Uncomment these two lines to enable cudf.pandas and nx-cugraph\n", + "\n", + "%load_ext cudf.pandas\n", + "!NETWORKX_BACKEND_PRIORITY=cugraph\n", + "\n", + "import pandas as pd\n", + "import networkx as nx" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Downloading the data" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--2024-09-26 23:32:06-- https://data.rapids.ai/cugraph/datasets/cit-Patents.csv\n", + "Resolving data.rapids.ai (data.rapids.ai)... 108.139.10.83, 108.139.10.10, 108.139.10.39, ...\n", + "Connecting to data.rapids.ai (data.rapids.ai)|108.139.10.83|:443... connected.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 261652279 (250M) [text/csv]\n", + "Saving to: ‘cit-Patents.csv’\n", + "\n", + "cit-Patents.csv 100%[===================>] 249.53M 74.1MB/s in 3.7s \n", + "\n", + "2024-09-26 23:32:10 (67.6 MB/s) - ‘cit-Patents.csv’ saved [261652279/261652279]\n", + "\n" + ] + } + ], + "source": [ + "!wget \"https://downloadlink\" # download datasets from s3" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: remove this\n", + "dataset_folder = \"~/nvrliu/notebooks/demo/data/wikipedia\"\n", + "\n", + "edgelist_csv = f\"{dataset_folder}/enwiki-20240620-edges.csv\"\n", + "nodedata_csv = f\"{dataset_folder}/enwiki-20240620-nodeids.csv\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Timed end-to-end code" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "\n", + "# Read the Wikipedia Connectivity data from `edgelist_csv`\n", + "edgelist_df = pd.read_csv(\n", + " edgelist_csv,\n", + " sep=\" \",\n", + " names=[\"src\", \"dst\"],\n", + " dtype=\"int32\",\n", + ")\n", + "\n", + "# Read the Wikipedia Page metadata from `nodedata_csv`\n", + "nodedata_df = pd.read_csv(\n", + " nodedata_csv,\n", + " sep=\"\\t\",\n", + " names=[\"nodeid\", \"title\"],\n", + " dtype={\"nodeid\": \"int32\", \"title\": \"str\"},\n", + ")\n", + "\n", + "# Create a NetworkX graph from the connectivity info\n", + "G = nx.from_pandas_edgelist(\n", + " edgelist_df,\n", + " source=\"src\",\n", + " target=\"dst\",\n", + " create_using=nx.DiGraph,\n", + ")\n", + "\n", + "# Run pagerank on NetworkX\n", + "nx_pr_vals = nx.pagerank(G)\n", + "\n", + "# Create a DataFrame containing the results\n", + "pagerank_df = pd.DataFrame({\n", + " \"nodeid\": nx_pr_vals.keys(),\n", + " \"pagerank\": nx_pr_vals.values()\n", + "})\n", + "\n", + "# Add NetworkX results to `nodedata` as new columns\n", + "nodedata_df = nodedata_df.merge(pagerank_df, how=\"left\", on=\"nodeid\")\n", + "\n", + "# Here the top 25 pages based on pagerank value\n", + "nodedata_df.sort_values(by=\"pagerank\", ascending=False).head(25)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "devenv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": 
"text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From cfd517758f815f71d198f0d8ff0c591b4dde9c08 Mon Sep 17 00:00:00 2001 From: Ralph Liu Date: Fri, 27 Sep 2024 06:50:54 -0700 Subject: [PATCH 04/11] Updates --- notebooks/demo/nxcg_wikipedia_e2e.ipynb | 29 +++++-------------------- 1 file changed, 6 insertions(+), 23 deletions(-) diff --git a/notebooks/demo/nxcg_wikipedia_e2e.ipynb b/notebooks/demo/nxcg_wikipedia_e2e.ipynb index cfa06960ce7..f80b14614d0 100644 --- a/notebooks/demo/nxcg_wikipedia_e2e.ipynb +++ b/notebooks/demo/nxcg_wikipedia_e2e.ipynb @@ -15,10 +15,11 @@ "metadata": {}, "outputs": [], "source": [ - "# Uncomment these two lines to enable cudf.pandas and nx-cugraph\n", + "# Uncomment these two lines to enable GPU acceleration\n", + "# The rest of the code stays the same!\n", "\n", - "%load_ext cudf.pandas\n", - "!NETWORKX_BACKEND_PRIORITY=cugraph\n", + "# %load_ext cudf.pandas\n", + "# !NETWORKX_BACKEND_PRIORITY=cugraph\n", "\n", "import pandas as pd\n", "import networkx as nx" @@ -33,27 +34,9 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "--2024-09-26 23:32:06-- https://data.rapids.ai/cugraph/datasets/cit-Patents.csv\n", - "Resolving data.rapids.ai (data.rapids.ai)... 108.139.10.83, 108.139.10.10, 108.139.10.39, ...\n", - "Connecting to data.rapids.ai (data.rapids.ai)|108.139.10.83|:443... connected.\n", - "HTTP request sent, awaiting response... 200 OK\n", - "Length: 261652279 (250M) [text/csv]\n", - "Saving to: ‘cit-Patents.csv’\n", - "\n", - "cit-Patents.csv 100%[===================>] 249.53M 74.1MB/s in 3.7s \n", - "\n", - "2024-09-26 23:32:10 (67.6 MB/s) - ‘cit-Patents.csv’ saved [261652279/261652279]\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "!wget \"https://downloadlink\" # download datasets from s3" ] From 1a70b7ad588bcc1dc6f65af8fe07149ed8083ef0 Mon Sep 17 00:00:00 2001 From: Ralph Liu Date: Fri, 27 Sep 2024 06:55:39 -0700 Subject: [PATCH 05/11] Improved comment --- notebooks/demo/nxcg_wikipedia_e2e.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/demo/nxcg_wikipedia_e2e.ipynb b/notebooks/demo/nxcg_wikipedia_e2e.ipynb index f80b14614d0..d8c0656627d 100644 --- a/notebooks/demo/nxcg_wikipedia_e2e.ipynb +++ b/notebooks/demo/nxcg_wikipedia_e2e.ipynb @@ -38,7 +38,7 @@ "metadata": {}, "outputs": [], "source": [ - "!wget \"https://downloadlink\" # download datasets from s3" + "# wget \"https://data.rapids.ai/cugraph/datasets/\" # Use this command to download datasets from the web" ] }, { From 74da95fe354e6a15661cd22c27791db5f5a48d8f Mon Sep 17 00:00:00 2001 From: Ralph Liu Date: Mon, 30 Sep 2024 09:21:21 -0700 Subject: [PATCH 06/11] Space out cells --- notebooks/demo/nxcg_wikipedia_e2e.ipynb | 55 ++++++++++++++++++++++--- 1 file changed, 50 insertions(+), 5 deletions(-) diff --git a/notebooks/demo/nxcg_wikipedia_e2e.ipynb b/notebooks/demo/nxcg_wikipedia_e2e.ipynb index d8c0656627d..ca1c9d3ef5b 100644 --- a/notebooks/demo/nxcg_wikipedia_e2e.ipynb +++ b/notebooks/demo/nxcg_wikipedia_e2e.ipynb @@ -75,7 +75,16 @@ " sep=\" \",\n", " names=[\"src\", \"dst\"],\n", " dtype=\"int32\",\n", - ")\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", "\n", "# Read the Wikipedia Page metadata from 
`nodedata_csv`\n", "nodedata_df = pd.read_csv(\n", @@ -83,7 +92,16 @@ " sep=\"\\t\",\n", " names=[\"nodeid\", \"title\"],\n", " dtype={\"nodeid\": \"int32\", \"title\": \"str\"},\n", - ")\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", "\n", "# Create a NetworkX graph from the connectivity info\n", "G = nx.from_pandas_edgelist(\n", @@ -91,16 +109,43 @@ " source=\"src\",\n", " target=\"dst\",\n", " create_using=nx.DiGraph,\n", - ")\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", "\n", "# Run pagerank on NetworkX\n", - "nx_pr_vals = nx.pagerank(G)\n", + "nx_pr_vals = nx.pagerank(G)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", "\n", "# Create a DataFrame containing the results\n", "pagerank_df = pd.DataFrame({\n", " \"nodeid\": nx_pr_vals.keys(),\n", " \"pagerank\": nx_pr_vals.values()\n", - "})\n", + "})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", "\n", "# Add NetworkX results to `nodedata` as new columns\n", "nodedata_df = nodedata_df.merge(pagerank_df, how=\"left\", on=\"nodeid\")\n", From 818ce2dfdd877f13e8d23235fa8797a1ab80d8bc Mon Sep 17 00:00:00 2001 From: Ralph Liu Date: Mon, 30 Sep 2024 09:52:29 -0700 Subject: [PATCH 07/11] Separate comments --- notebooks/demo/nxcg_wikipedia_e2e.ipynb | 75 +++++++++++++++++++------ 1 file changed, 59 insertions(+), 16 deletions(-) diff --git a/notebooks/demo/nxcg_wikipedia_e2e.ipynb b/notebooks/demo/nxcg_wikipedia_e2e.ipynb index ca1c9d3ef5b..82547c1ab37 100644 --- a/notebooks/demo/nxcg_wikipedia_e2e.ipynb +++ b/notebooks/demo/nxcg_wikipedia_e2e.ipynb @@ -61,15 +61,20 @@ "Timed end-to-end code" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Read in the Wikipedia Connectivity data from `edgelist_csv`" + ] + }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ - "%%time\n", - "\n", - "# Read the Wikipedia Connectivity data from `edgelist_csv`\n", + "%%time \n", "edgelist_df = pd.read_csv(\n", " edgelist_csv,\n", " sep=\" \",\n", @@ -78,6 +83,13 @@ ")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Read in the Wikipedia pages metadata from `nodedata_csv`" + ] + }, { "cell_type": "code", "execution_count": null, @@ -85,8 +97,6 @@ "outputs": [], "source": [ "%%time\n", - "\n", - "# Read the Wikipedia Page metadata from `nodedata_csv`\n", "nodedata_df = pd.read_csv(\n", " nodedata_csv,\n", " sep=\"\\t\",\n", @@ -95,6 +105,13 @@ ")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a NetworkX graph from the connectivity info we just loaded" + ] + }, { "cell_type": "code", "execution_count": null, @@ -102,8 +119,6 @@ "outputs": [], "source": [ "%%time\n", - "\n", - "# Create a NetworkX graph from the connectivity info\n", "G = nx.from_pandas_edgelist(\n", " edgelist_df,\n", " source=\"src\",\n", @@ -112,6 +127,13 @@ ")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Run the Pagerank algorithm on the NetworkX graph" + ] + }, { "cell_type": "code", "execution_count": null, @@ -119,11 +141,16 @@ "outputs": [], "source": [ "%%time\n", - "\n", - "# Run pagerank on NetworkX\n", "nx_pr_vals = nx.pagerank(G)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a 
DataFrame containing the resulting pagerank values for each nodeid" + ] + }, { "cell_type": "code", "execution_count": null, @@ -131,14 +158,19 @@ "outputs": [], "source": [ "%%time\n", - "\n", - "# Create a DataFrame containing the results\n", "pagerank_df = pd.DataFrame({\n", " \"nodeid\": nx_pr_vals.keys(),\n", " \"pagerank\": nx_pr_vals.values()\n", "})" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, add the NetworkX results to `nodedata` as a new column." + ] + }, { "cell_type": "code", "execution_count": null, @@ -146,11 +178,22 @@ "outputs": [], "source": [ "%%time\n", - "\n", - "# Add NetworkX results to `nodedata` as new columns\n", - "nodedata_df = nodedata_df.merge(pagerank_df, how=\"left\", on=\"nodeid\")\n", - "\n", - "# Here the top 25 pages based on pagerank value\n", + "nodedata_df = nodedata_df.merge(pagerank_df, how=\"left\", on=\"nodeid\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Showing the top 25 pages based on pagerank value" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "nodedata_df.sort_values(by=\"pagerank\", ascending=False).head(25)" ] } From 023b09642d9e986fcfcc970dff5b2f69a0bf6f2d Mon Sep 17 00:00:00 2001 From: Ralph Liu Date: Tue, 1 Oct 2024 08:43:10 -0700 Subject: [PATCH 08/11] Respond to feedback --- notebooks/demo/accelerating_networkx.ipynb | 10 +++++----- notebooks/demo/nxcg_wikipedia_e2e.ipynb | 7 ++----- 2 files changed, 7 insertions(+), 10 deletions(-) diff --git a/notebooks/demo/accelerating_networkx.ipynb b/notebooks/demo/accelerating_networkx.ipynb index 19a62261a3b..8277666c0ff 100644 --- a/notebooks/demo/accelerating_networkx.ipynb +++ b/notebooks/demo/accelerating_networkx.ipynb @@ -8,13 +8,13 @@ "source": [ "# NetworkX - Easy Graph Analytics\n", "\n", - "NetworkX is the most popular library for graph analytics available in python, or quite possibly any language. To illustrate this, NetworkX was downloaded more than 50 million times in April of 2024 alone, which is roughly 50 times more than the next most popular graph analytics library! [*](https://en.wikipedia.org/wiki/NetworkX) NetworkX has earned this popularity from its very easy-to-use API, the wealth of documentation and examples available, the large (and friendly) community behind it, and its easy installation which requires nothing more than python.\n", + "NetworkX is the most popular library for graph analytics available in Python, or quite possibly any language. To illustrate this, NetworkX was downloaded more than 50 million times in April of 2024 alone, which is roughly 50 times more than the next most popular graph analytics library! [*](https://en.wikipedia.org/wiki/NetworkX) NetworkX has earned this popularity from its very easy-to-use API, the wealth of documentation and examples available, the large (and friendly) community behind it, and its easy installation which requires nothing more than Python.\n", "\n", - "However, NetworkX users are familiar with the tradeoff that comes with those benefits. The pure-python implementation often results in poor performance when graph data starts to reach larger scales, limiting the usefulness of the library for many real-world problems.\n", + "However, NetworkX users are familiar with the tradeoff that comes with those benefits. 
The pure-python implementation often results in poor performance when graph data starts to reach larger scales, limiting the usefulness of the library for many real-world problems.\n",
+    "However, NetworkX users are familiar with the tradeoff that comes with those benefits. The pure-Python implementation often results in poor performance when graph data starts to reach larger scales, limiting the usefulness of the library for many real-world problems.\n",
     "\n",
     "# Accelerated NetworkX - Easy (and fast!) Graph Analytics\n",
     "\n",
-    "To address the performance problem, NetworkX 3.0 introduced a mechanism to dispatch algorithm calls to alternate implementations. The NetworkX python API remains the same but NetworkX will use more capable algorithm implementations provided by one or more backends. This approach means users don't have to give up NetworkX -or even change their code- in order to take advantage of GPU performance."
+    "To address the performance problem, NetworkX 3.0 introduced a mechanism to dispatch algorithm calls to alternate implementations. The NetworkX Python API remains the same but NetworkX will use more capable algorithm implementations provided by one or more backends. This approach means users don't have to give up NetworkX -or even change their code- in order to take advantage of GPU performance."
   ]
  },
@@ -192,7 +192,7 @@
    "id": "z1hxut3GTj5A"
   },
   "source": [
-    "You may have noticed that using the `nx-cugraph` backend resulted in a slightly slower execution time. This is not suprising when working with a graph this small, since the overhead of converting the graph for the first time and launching the algorithm kernel on the GPU is actually significantly more than the computation time itself. We'll see later that this overhead is negligable when compared to the time saved when running on a GPU for larger graphs.\n",
+    "You may have noticed that using the `nx-cugraph` backend resulted in a slightly slower execution time. This is not surprising when working with a graph this small, since the overhead of converting the graph for the first time and launching the algorithm kernel on the GPU is actually significantly more than the computation time itself. We'll see later that this overhead is negligeble when compared to the time saved when running on a GPU for larger graphs.\n",
     "\n",
     "Since we've enabled graph conversion caching, we can see that if we re-run the same call the execution time is noticeably shorter."
   ]
  },
@@ -374,7 +374,7 @@
    "source": [
     "Now we'll look at a larger dataset from https://snap.stanford.edu/data/cit-Patents.html which contains citations across different U.S. patents granted from January 1, 1963 to December 30, 1999. The dataset represents 16.5M citations (edges) between 3.77M patents (nodes).\n",
     "\n",
-    "This will demonstrate that data of this size starts to push the limits of the default pure-python NetworkX implementation."
+    "This will demonstrate that data of this size starts to push the limits of the default pure-Python NetworkX implementation."
   ]
  },

diff --git a/notebooks/demo/nxcg_wikipedia_e2e.ipynb b/notebooks/demo/nxcg_wikipedia_e2e.ipynb
index 82547c1ab37..872c8860b91 100644
--- a/notebooks/demo/nxcg_wikipedia_e2e.ipynb
+++ b/notebooks/demo/nxcg_wikipedia_e2e.ipynb
@@ -47,11 +47,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# TODO: remove this\n",
-    "dataset_folder = \"~/nvrliu/notebooks/demo/data/wikipedia\"\n",
-    "\n",
-    "edgelist_csv = f\"{dataset_folder}/enwiki-20240620-edges.csv\"\n",
-    "nodedata_csv = f\"{dataset_folder}/enwiki-20240620-nodeids.csv\""
+    "edgelist_csv = \"enwiki-20240620-edges.csv\"\n",
+    "nodedata_csv = \"enwiki-20240620-nodeids.csv\""
   ]
  },

From 5cbb65584f551f89b71ecf393e564ccc11621d0a Mon Sep 17 00:00:00 2001
From: Ralph Liu
Date: Tue, 1 Oct 2024 09:07:06 -0700
Subject: [PATCH 09/11] Changes

---
 notebooks/demo/accelerating_networkx.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/notebooks/demo/accelerating_networkx.ipynb b/notebooks/demo/accelerating_networkx.ipynb
index 8277666c0ff..db15e738c1a 100644
--- a/notebooks/demo/accelerating_networkx.ipynb
+++ b/notebooks/demo/accelerating_networkx.ipynb
@@ -8,7 +8,7 @@
    "source": [
     "# NetworkX - Easy Graph Analytics\n",
     "\n",
-    "NetworkX is the most popular library for graph analytics available in Python, or quite possibly any language. To illustrate this, NetworkX was downloaded more than 50 million times in April of 2024 alone, which is roughly 50 times more than the next most popular graph analytics library! [*](https://en.wikipedia.org/wiki/NetworkX) NetworkX has earned this popularity from its very easy-to-use API, the wealth of documentation and examples available, the large (and friendly) community behind it, and its easy installation which requires nothing more than Python.\n",
+    "NetworkX is the most popular library for graph analytics available in Python, or quite possibly any language. To illustrate this, NetworkX was downloaded more than 71 million times in September of 2024 alone, which is roughly 71 times more than the next most popular graph analytics library! [*](https://en.wikipedia.org/wiki/NetworkX) NetworkX has earned this popularity from its very easy-to-use API, the wealth of documentation and examples available, the large (and friendly) community behind it, and its easy installation which requires nothing more than Python.\n",
     "\n",
     "However, NetworkX users are familiar with the tradeoff that comes with those benefits. The pure-Python implementation often results in poor performance when graph data starts to reach larger scales, limiting the usefulness of the library for many real-world problems.\n",

From e3c6b8b179798fc0761a7fb5ddbe5c19d377aa8b Mon Sep 17 00:00:00 2001
From: Ralph Liu <137829296+nv-rliu@users.noreply.github.com>
Date: Tue, 1 Oct 2024 18:48:25 -0400
Subject: [PATCH 10/11] Update notebooks/demo/accelerating_networkx.ipynb

Co-authored-by: Erik Welch

---
 notebooks/demo/accelerating_networkx.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/notebooks/demo/accelerating_networkx.ipynb b/notebooks/demo/accelerating_networkx.ipynb
index db15e738c1a..1a6c6cfb3f6 100644
--- a/notebooks/demo/accelerating_networkx.ipynb
+++ b/notebooks/demo/accelerating_networkx.ipynb
@@ -192,7 +192,7 @@
    "id": "z1hxut3GTj5A"
   },
   "source": [
-    "You may have noticed that using the `nx-cugraph` backend resulted in a slightly slower execution time. This is not surprising when working with a graph this small, since the overhead of converting the graph for the first time and launching the algorithm kernel on the GPU is actually significantly more than the computation time itself. We'll see later that this overhead is negligeble when compared to the time saved when running on a GPU for larger graphs.\n",
+    "You may have noticed that using the `nx-cugraph` backend resulted in a slightly slower execution time. This is not surprising when working with a graph this small, since the overhead of converting the graph for the first time and launching the algorithm kernel on the GPU is actually significantly more than the computation time itself. We'll see later that this overhead is negligible when compared to the time saved when running on a GPU for larger graphs.\n",
     "\n",
     "Since we've enabled graph conversion caching, we can see that if we re-run the same call the execution time is noticeably shorter."
   ]
  },

From 6058f524dc56394469425de0c6b4fb56e1f08de0 Mon Sep 17 00:00:00 2001
From: Ralph Liu
Date: Wed, 2 Oct 2024 09:17:39 -0700
Subject: [PATCH 11/11] Removed notebook that will go to showcase

---
 notebooks/demo/nxcg_wikipedia_e2e.ipynb | 219 ------------------------
 1 file changed, 219 deletions(-)
 delete mode 100644 notebooks/demo/nxcg_wikipedia_e2e.ipynb

diff --git a/notebooks/demo/nxcg_wikipedia_e2e.ipynb b/notebooks/demo/nxcg_wikipedia_e2e.ipynb
deleted file mode 100644
index 872c8860b91..00000000000
--- a/notebooks/demo/nxcg_wikipedia_e2e.ipynb
+++ /dev/null
@@ -1,219 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# `nx-cugraph` Demo - Wikipedia Pagerank\n",
-    "\n",
-    "This notebook demonstrates a zero code change, end-to-end workflow using `cudf.pandas` and `nx-cugraph`."
- ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "# Uncomment these two lines to enable GPU acceleration\n", - "# The rest of the code stays the same!\n", - "\n", - "# %load_ext cudf.pandas\n", - "# !NETWORKX_BACKEND_PRIORITY=cugraph\n", - "\n", - "import pandas as pd\n", - "import networkx as nx" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Downloading the data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# wget \"https://data.rapids.ai/cugraph/datasets/\" # Use this command to download datasets from the web" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "edgelist_csv = \"enwiki-20240620-edges.csv\"\n", - "nodedata_csv = \"enwiki-20240620-nodeids.csv\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Timed end-to-end code" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Read in the Wikipedia Connectivity data from `edgelist_csv`" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "%%time \n", - "edgelist_df = pd.read_csv(\n", - " edgelist_csv,\n", - " sep=\" \",\n", - " names=[\"src\", \"dst\"],\n", - " dtype=\"int32\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Read in the Wikipedia pages metadata from `nodedata_csv`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "nodedata_df = pd.read_csv(\n", - " nodedata_csv,\n", - " sep=\"\\t\",\n", - " names=[\"nodeid\", \"title\"],\n", - " dtype={\"nodeid\": \"int32\", \"title\": \"str\"},\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a NetworkX graph from the connectivity info we just loaded" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "G = nx.from_pandas_edgelist(\n", - " edgelist_df,\n", - " source=\"src\",\n", - " target=\"dst\",\n", - " create_using=nx.DiGraph,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Run the Pagerank algorithm on the NetworkX graph" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "nx_pr_vals = nx.pagerank(G)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create a DataFrame containing the resulting pagerank values for each nodeid" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "pagerank_df = pd.DataFrame({\n", - " \"nodeid\": nx_pr_vals.keys(),\n", - " \"pagerank\": nx_pr_vals.values()\n", - "})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, add the NetworkX results to `nodedata` as a new column." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "nodedata_df = nodedata_df.merge(pagerank_df, how=\"left\", on=\"nodeid\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Showing the top 25 pages based on pagerank value" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "nodedata_df.sort_values(by=\"pagerank\", ascending=False).head(25)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "devenv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -}