From b814c4412e36cff43036afa9b1fe6e0f9e1fb740 Mon Sep 17 00:00:00 2001
From: acostadon
Date: Wed, 20 Nov 2024 11:31:57 -0500
Subject: [PATCH] descriptions added per review comments

---
 notebooks/demo/centrality_patentsview.ipynb | 92 +++++++++++++++++----
 1 file changed, 74 insertions(+), 18 deletions(-)

diff --git a/notebooks/demo/centrality_patentsview.ipynb b/notebooks/demo/centrality_patentsview.ipynb
index aece3dda4c..387d88ef87 100644
--- a/notebooks/demo/centrality_patentsview.ipynb
+++ b/notebooks/demo/centrality_patentsview.ipynb
@@ -17,6 +17,21 @@
    },
    "source": [
     "# Downloading the data\n",
+    "\n",
+    "Citation: U.S. Patent and Trademark Office. “Data Download Tables.” PatentsView. Accessed [10/06/2024]. https://patentsview.org/download/data-download-tables.\n",
+    "\n",
+    "Both files are used under the Creative Commons license https://creativecommons.org/licenses/by/4.0/\n",
+    "\n",
+    "\n",
+    "The first file, g_patent.tsv.zip, contains summary data for each patent, such as its id, title, and the location of the original patent document. The table description is available on the [PatentsView site](https://patentsview.org/download/data-download-dictionary).\n",
+    "\n",
+    "The second file, g_us_patent_citation.tsv.zip, contains a record for every citation between US patents. The description of this table is also available on the [PatentsView site](https://patentsview.org/download/data-download-dictionary)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
     "Removing the comment character \"#\" and running the below lines will download and expand the data into the directory the notebook expects it to be in."
    ]
   },
@@ -29,24 +44,18 @@
    "outputs": [],
    "source": [
     "#!wget https://s3.amazonaws.com/data.patentsview.org/download/g_patent.tsv.zip\n",
-    "#!unzip ./_patent.tsv.zip\n",
+    "#!unzip ./g_patent.tsv.zip\n",
     "#!wget https://s3.amazonaws.com/data.patentsview.org/download/g_us_patent_citation.tsv.zip\n",
     "#!unzip ./g_us_patent_citation.tsv.zip"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We will create the dataframes using cudf and create the graphs with cuGraph."
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "# We will create the dataframes using cudf and create the graphs with cuGraph\n",
     "import cudf\n",
     "import cugraph"
    ]
@@ -273,6 +282,13 @@
     "first_hop_df, first_set = next_hop(seed_series)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Show how many patents cite or are cited by the starting one(s)."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -282,6 +298,13 @@
     "len(first_hop_df)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For demonstration purposes we will just use the second-hop edge/patent list. However, the next_hop function can go out as many hops as necessary to build a relevant graph for different data sets. Here is how we could go out four levels of separation."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -290,7 +313,7 @@
    "source": [
     "second_hop_df, second_hop_seeds = next_hop(first_set)\n",
     "third_hop_df, third_hop_seeds = next_hop(second_hop_seeds)\n",
-    "fourth_hop_df, fourth_hop_seeds = next_hop(third_hop_seeds)\n"
+    "fourth_hop_df, fourth_hop_seeds = next_hop(third_hop_seeds)"
    ]
   },
   {
@@ -329,7 +352,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The contents of the dataframe at 2 hops"
+    "The contents of the dataframe we will use, which contains 2 hops."
    ]
   },
   {
@@ -341,6 +364,13 @@
     "second_hop_df"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we will build a directed graph in cuGraph from the second-hop dataframe created above."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -351,6 +381,13 @@
     "G = cugraph.from_cudf_edgelist(second_hop_df,create_using=cugraph.Graph(directed=True),source='source', destination='target')"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We use the compute_centrality function defined above to calculate the centrality measures, noting the execution time."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -361,6 +398,13 @@
     "dc, bc, kc, pr, ev = compute_centrality(G)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We import the formatting package and print out the top 10 patents for each centrality measure."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -376,7 +420,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Calls the function that draws the graph with the specified number of the most central nodes labeled"
+    "Now call the function that draws the graph with the specified number of the most central nodes labeled.\n",
+    "The final parameter, pr in this case, passes in the PageRank results as the centrality scores to plot."
    ]
   },
   {
@@ -388,6 +433,24 @@
     "draw_centrality_graph(second_hop_df,12, pr)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's run edge betweenness centrality to find the central edges in the graph."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "G_2_hops = cugraph.from_cudf_edgelist(second_hop_df,create_using=cugraph.Graph(directed=True),source='source', destination='target')\n",
+    "results=cugraph.edge_betweenness_centrality(G_2_hops).sort_values(ascending=False,by=['betweenness_centrality'])\n",
+    "results.head(10)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -411,13 +474,6 @@
     "len(title_df)\n"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Lets run edge betweenness centrality to find the central edges in the graph."
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},