descriptions added per review comments

rapidsai · Nov 20, 2024 · b814c44
1 parent 18ea124
Showing 1 changed file with 74 additions and 18 deletions.
diff --git a/notebooks/demo/centrality_patentsview.ipynb b/notebooks/demo/centrality_patentsview.ipynb
@@ -17,6 +17,21 @@
    },
    "source": [
     "# Downloading the data\n",
+    "\n",
+    "Citation: U.S. Patent and Trademark Office. “Data Download Tables.” PatentsView. Accessed [10/06/2024]. https://patentsview.org/download/data-download-tables.\n",
+    "\n",
+    "Both files are used under the Creative Commons Attribution 4.0 license (https://creativecommons.org/licenses/by/4.0/).\n",
+    "\n",
+    "\n",
+    "The first file, g_patent.tsv.zip, contains summary data for each patent such as id, title and the location of the original patent document. The table description is available on the [PatentsView site](https://patentsview.org/download/data-download-dictionary).\n",
+    "\n",
+    "The second file, g_us_patent_citation.tsv.zip, contains a record for every citation between U.S. patents. The description of this table is also available on the [PatentsView site](https://patentsview.org/download/data-download-dictionary)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
     "Removing the comment character \"#\" and running the below lines will download and expand the data into the directory the notebook expects it to be in."
    ]
   },
@@ -29,24 +44,18 @@
    "outputs": [],
    "source": [
     "#!wget https://s3.amazonaws.com/data.patentsview.org/download/g_patent.tsv.zip\n",
-    "#!unzip ./_patent.tsv.zip\n",
+    "#!unzip ./g_patent.tsv.zip\n",
     "#!wget https://s3.amazonaws.com/data.patentsview.org/download/g_us_patent_citation.tsv.zip\n",
     "#!unzip ./g_us_patent_citation.tsv.zip"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We will create the dataframes using cudf and create the graphs with cuGraph."
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "# We will create the dataframes using cudf and create the graphs with cuGraph\n",
     "import cudf\n",
     "import cugraph"
    ]
@@ -273,6 +282,13 @@
     "first_hop_df, first_set = next_hop(seed_series)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Show how many patents cite or are cited by the starting patent(s)."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -282,6 +298,13 @@
     "len(first_hop_df)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this case we will use only the second-hop edge/patent list. For demonstration purposes, however, the next_hop function can go out as many hops as necessary to build a relevant graph for different data sets. Here is how we could go out four levels of separation."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -290,7 +313,7 @@
    "source": [
     "second_hop_df, second_hop_seeds = next_hop(first_set)\n",
     "third_hop_df, third_hop_seeds = next_hop(second_hop_seeds)\n",
-    "fourth_hop_df, fourth_hop_seeds = next_hop(third_hop_seeds)\n"
+    "fourth_hop_df, fourth_hop_seeds = next_hop(third_hop_seeds)"
    ]
   },
   {
@@ -329,7 +352,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The contents of the dataframe at 2 hops"
+    "The contents of the dataframe we will use, which contains 2 hops."
    ]
   },
   {
@@ -341,6 +364,13 @@
     "second_hop_df"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we will build a directed graph in cuGraph from the second-hop dataframe created above."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -351,6 +381,13 @@
     "G = cugraph.from_cudf_edgelist(second_hop_df,create_using=cugraph.Graph(directed=True),source='source', destination='target')"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We use the compute_centrality function defined above to calculate the centrality measures and note the execution time."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -361,6 +398,13 @@
     "dc, bc, kc, pr, ev = compute_centrality(G)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We import the formatting package and print out the top 10 patents for each centrality measure."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -376,7 +420,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Calls the function that draws the graph with the specified number of the most central nodes labeled"
+    "Now call the function that draws the graph with the specified number of the most central nodes labeled.\n",
+    "The final parameter, pr (PageRank) in this case, passes in the results of the particular algorithm to graph."
    ]
   },
   {
@@ -388,6 +433,24 @@
     "draw_centrality_graph(second_hop_df,12, pr)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's run edge betweenness centrality to find the central edges in the graph."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "G_2_hops = cugraph.from_cudf_edgelist(second_hop_df,create_using=cugraph.Graph(directed=True),source='source', destination='target')\n",
+    "results=cugraph.edge_betweenness_centrality(G_2_hops).sort_values(ascending=False,by=['betweenness_centrality'])\n",
+    "results.head(10)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -411,13 +474,6 @@
     "len(title_df)\n"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Lets run edge betweenness centrality to find the central edges in the graph."
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,