diff --git a/src/content/notebooks/impresso-py-maps.mdx b/src/content/notebooks/impresso-py-maps.mdx index af3332b..5523c11 100644 --- a/src/content/notebooks/impresso-py-maps.mdx +++ b/src/content/notebooks/impresso-py-maps.mdx @@ -2,6 +2,7 @@ title: Exploring impresso with maps githubUrl: https://github.com/impresso/impresso-datalab-notebooks/blob/main/4-impresso-py/maps_explore.ipynb authors: + - impresso-team - RomanKalyakin sha: 168c669246385a2ec6c3e088b0081364f129d11c date: 2024-09-27T12:54:12Z @@ -9,22 +10,26 @@ googleColabUrl: https://colab.research.google.com/github/impresso/impresso-datal --- {/* cell:0 cell_type:markdown */} + ## Install dependencies We need the following packages: - * [impresso-py](https://impresso-project.ch/) - * [ipyleaflet](https://ipyleaflet.readthedocs.io/en/latest/index.html) +- [impresso-py](https://impresso-project.ch/) +- [ipyleaflet](https://ipyleaflet.readthedocs.io/en/latest/index.html) {/* cell:1 cell_type:code */} + ```python %pip install git+https://github.com/impresso/impresso-py.git ipyleaflet ``` {/* cell:2 cell_type:markdown */} + ## Connect to Impresso {/* cell:3 cell_type:code */} + ```python from impresso import connect, OR, DateRange @@ -32,11 +37,13 @@ impresso = connect(public_api_url="https://dev.impresso-project.ch/public-api") ``` {/* cell:4 cell_type:markdown */} + ## Search and collect entities Find top 100 location entities mentioned in articles that talk about nuclear power plants in the first three decades following the second world war. {/* cell:5 cell_type:code */} + ```python locations = impresso.search.facet( "location", @@ -52,6 +59,7 @@ locations Get entities details, including wikidata details {/* cell:7 cell_type:code */} + ```python entities_ids = locations.df.index.tolist() entities = impresso.entities.find(entity_id=OR(*entities_ids), load_wikidata=True, limit=len(entities_ids)) @@ -62,6 +70,7 @@ entities Filter out entities that have no coordinates and add a country tag. 
{/* cell:9 cell_type:code */} + ```python df = entities.df entities_with_coordinates = df[df['wikidata.coordinates.latitude'].notna() & df['wikidata.coordinates.longitude'].notna()] @@ -74,6 +83,7 @@ entities_with_coordinates Add counts of mentions to the entities dataframe. {/* cell:11 cell_type:code */} + ```python entities_with_coordinates['mentions_count'] = entities_with_coordinates.index.map(locations.df['count']) ``` @@ -82,6 +92,7 @@ entities_with_coordinates['mentions_count'] = entities_with_coordinates.index.ma Plot entities on the map. {/* cell:13 cell_type:markdown */} + ### Utility methods Functions used to calculate extra details needed to plot data on a map. @@ -90,6 +101,7 @@ Functions used to calculate extra details needed to plot data on a map. Find geo bounds of a group of items. {/* cell:15 cell_type:code */} + ```python def find_bounds(coordinates): """ @@ -124,6 +136,7 @@ def find_bounds(coordinates): Create an HTML used for rendering the hover pop-up. {/* cell:17 cell_type:code */} + ```python from ipywidgets import HTML from ipyleaflet import Popup @@ -150,9 +163,11 @@ def build_hover_popup(title: str, subtitle: str, mentions: int) -> Popup: ``` {/* cell:18 cell_type:markdown */} + ### Plot {/* cell:19 cell_type:code */} + ```python from ipyleaflet import Map, Marker, AwesomeIcon, CircleMarker diff --git a/src/content/notebooks/impresso-py-network.mdx b/src/content/notebooks/impresso-py-network.mdx index 2b9696c..b56ce8f 100644 --- a/src/content/notebooks/impresso-py-network.mdx +++ b/src/content/notebooks/impresso-py-network.mdx @@ -2,6 +2,7 @@ title: Network graph with Impresso Py githubUrl: https://github.com/impresso/impresso-datalab-notebooks/blob/main/4-impresso-py/network_graph.ipynb authors: + - impresso-team - RomanKalyakin sha: 168c669246385a2ec6c3e088b0081364f129d11c date: 2024-09-27T12:54:12Z @@ -9,17 +10,21 @@ googleColabUrl: https://colab.research.google.com/github/impresso/impresso-datal --- {/* cell:0 cell_type:markdown */} + 
## Install dependencies {/* cell:1 cell_type:code */} + ```python %pip install git+https://github.com/impresso/impresso-py.git ipysigma ``` {/* cell:2 cell_type:markdown */} + ## Connect to Impresso {/* cell:3 cell_type:code */} + ```python from impresso import connect, OR, AND @@ -27,16 +32,19 @@ impresso = connect(public_api_url="https://dev.impresso-project.ch/public-api") ``` {/* cell:4 cell_type:markdown */} + ## Part 1: Get entities and their co-occurrences Find all persons mentioned in all articles that talk about the [Prague Spring](https://en.wikipedia.org/wiki/Prague_Spring). {/* cell:5 cell_type:code */} + ```python query = OR("Prague Spring", "Prager Frühling", "Printemps de Prague") ``` {/* cell:6 cell_type:code */} + ```python persons = impresso.search.facet( facet="person", @@ -51,6 +59,7 @@ persons Get all combinations of all entities with a mention count higher than `N`. {/* cell:8 cell_type:code */} + ```python import itertools @@ -66,6 +75,7 @@ print(f"Total combinations: {len(persons_ids_combinations)}") ``` {/* cell:9 cell_type:code */} + ```python if len(persons_ids_combinations) > 500: msg = ( @@ -81,6 +91,7 @@ if len(persons_ids_combinations) > 500: Get timestamps and counts of all articles where persons pairs appear. 
{/* cell:11 cell_type:code */} + ```python from impresso.util.error import ImpressoError from time import sleep @@ -115,6 +126,7 @@ for idx, combo in enumerate(persons_ids_combinations): Put them all into a dataframe {/* cell:13 cell_type:code */} + ```python import pandas as pd @@ -132,14 +144,17 @@ connections_df ``` {/* cell:14 cell_type:code */} + ```python connections_df.to_csv("connections.csv") ``` {/* cell:15 cell_type:markdown */} + ## Part 2: visualise {/* cell:16 cell_type:code */} + ```python import pandas as pd @@ -148,6 +163,7 @@ connections_df ``` {/* cell:17 cell_type:code */} + ```python grouped_connections_df = connections_df.groupby(['node_a', 'node_b']) \ .agg({'timestamp': lambda x: ', '.join(list(x)), 'count': 'sum', 'url': lambda x: list(set(x))[0]}) \ @@ -156,6 +172,7 @@ grouped_connections_df ``` {/* cell:18 cell_type:code */} + ```python import networkx as nx @@ -172,12 +189,14 @@ G.nodes ``` {/* cell:19 cell_type:code */} + ```python filename = input("Enter the filename: ") filename = f"{filename.replace(' ', '_')}.gexf" ``` {/* cell:20 cell_type:code */} + ```python nx.write_gexf(G, filename) ``` @@ -186,6 +205,7 @@ nx.write_gexf(G, filename) If running in Colab - activate custom widgets to allow Sigma to render the graph. {/* cell:22 cell_type:code */} + ```python try: from google.colab import output @@ -198,6 +218,7 @@ except: Render the graph. 
{/* cell:24 cell_type:code */} + ```python import networkx as nx from ipysigma import Sigma diff --git a/src/content/notebooks/impresso-py-search.mdx b/src/content/notebooks/impresso-py-search.mdx index e4fe012..6229d52 100644 --- a/src/content/notebooks/impresso-py-search.mdx +++ b/src/content/notebooks/impresso-py-search.mdx @@ -1,6 +1,7 @@ --- githubUrl: https://github.com/impresso/impresso-py/blob/main/examples/notebooks/search.ipynb authors: + - impresso-team - RomanKalyakin seealso: - impresso-py-collections @@ -11,6 +12,7 @@ googleColabUrl: https://colab.research.google.com/github/impresso/impresso-py/bl --- {/* cell:0 cell_type:code */} + ```python from impresso import connect @@ -18,39 +20,49 @@ impresso = connect() ``` {/* cell:1 cell_type:markdown */} + ## Term Find all items containing "impresso" keyword. {/* cell:2 cell_type:code */} + ```python impresso.search.find(q="impresso") ``` {/* cell:3 cell_type:markdown */} + ## With text content only Limit to articles that have text. {/* cell:4 cell_type:code */} + ```python impresso.search.find(q="impresso", with_text_contents=True) ``` {/* cell:5 cell_type:markdown */} + ## Title + Find items that have the keyword "impresso" in their title. {/* cell:6 cell_type:code */} + ```python impresso.search.find(title="impresso") ``` {/* cell:7 cell_type:markdown */} + ### Complex term requests + Find items that have both terms. {/* cell:8 cell_type:code */} + ```python from impresso import AND @@ -63,6 +75,7 @@ Find items that have either one term or the other. Here we find all articles that contain either "homme" or "femme" in the title. {/* cell:10 cell_type:code */} + ```python from impresso import OR @@ -70,11 +83,13 @@ impresso.search.find(title=OR("homme", "femme")) ``` {/* cell:11 cell_type:markdown */} -## Inverted search (everything excluding term A __OR__ term B). + +## Inverted search (everything excluding term A **OR** term B). 
We want to find all articles with the word "luddite" in the title that mention neither "textile" nor "machine". {/* cell:12 cell_type:code */} + ```python from impresso import OR @@ -82,15 +97,17 @@ impresso.search.find(title="luddite", q=~OR("textile", "machine")) ``` {/* cell:13 cell_type:markdown */} + ### Complex combination of terms The following cell searches for all articles that meet all of the following conditions: -* mentioning "hitler" and "stalin" -* also mentioning one of: "molotow" or "ribbentrop" -* and not mentioning "churchill" +- mentioning "hitler" and "stalin" +- also mentioning one of: "molotow" or "ribbentrop" +- and not mentioning "churchill" {/* cell:14 cell_type:code */} + ```python from impresso import AND, OR @@ -98,21 +115,25 @@ impresso.search.find(q=AND("hitler", "stalin") & OR("molotow", "ribbentrop") & ~ ``` {/* cell:15 cell_type:markdown */} + ## Front page Find articles published on the front page only {/* cell:16 cell_type:code */} + ```python impresso.search.find(q="impresso", front_page=True) ``` {/* cell:17 cell_type:markdown */} + ## Entity ID Search by entity ID {/* cell:18 cell_type:code */} + ```python impresso.search.find(entity_id="aida-0001-54-Switzerland") ``` @@ -121,6 +142,7 @@ impresso.search.find(entity_id="aida-0001-54-Switzerland") Find all articles that mention Switzerland and Albert Einstein. {/* cell:20 cell_type:code */} + ```python impresso.search.find(entity_id=AND("aida-0001-54-Switzerland", "aida-0001-50-Albert_Einstein")) ``` @@ -129,26 +151,31 @@ impresso.search.find(entity_id=AND("aida-0001-54-Switzerland", "aida-0001-50-Alb Find all articles that mention either Switzerland or Albert Einstein. 
{/* cell:22 cell_type:code */} + ```python impresso.search.find(entity_id=OR("aida-0001-54-Switzerland", "aida-0001-50-Albert_Einstein")) ``` {/* cell:23 cell_type:markdown */} + ## Newspaper Limit search to two newspapers {/* cell:24 cell_type:code */} + ```python impresso.search.find(q="independence", newspaper_id=OR("EXP", "GDL")) ``` {/* cell:25 cell_type:markdown */} + ## Date range Items published between dates {/* cell:26 cell_type:code */} + ```python from impresso import DateRange @@ -159,6 +186,7 @@ impresso.search.find(q="independence", date_range=DateRange("1921-05-21", "2001- Articles published at any time excluding the range (note the `~` that negates the range). {/* cell:28 cell_type:code */} + ```python from impresso import DateRange @@ -166,11 +194,13 @@ impresso.search.find(q="independence", date_range=~DateRange("1921-05-21", "2001 ``` {/* cell:29 cell_type:markdown */} + ## Language Search for the term "banana" in English or Italian. {/* cell:30 cell_type:code */} + ```python impresso.search.find(q="banana", language=OR("it", "en")) ``` @@ -179,235 +209,281 @@ impresso.search.find(q="banana", language=OR("it", "en")) And now search for the word "banana" in any language _except_ English or Italian. {/* cell:32 cell_type:code */} + ```python impresso.search.find(q="banana", language=~OR("it", "en")) ``` {/* cell:33 cell_type:markdown */} + ## Entity mention Find articles that mention two entities. {/* cell:34 cell_type:code */} + ```python impresso.search.find(mention=AND("Charlie Chaplin", "Switzerland")) ``` {/* cell:35 cell_type:markdown */} + ## Topic Find articles that match either of the two topics. {/* cell:36 cell_type:code */} + ```python impresso.search.find(topic_id=OR("tm-fr-all-v2.0_tp07_fr", "tm-fr-all-v2.0_tp48_fr")) ``` {/* cell:37 cell_type:markdown */} + ## Collection Find all articles in a collection. 
{/* cell:38 cell_type:code */} + ```python impresso.search.find(collection_id="REPLACEME") ``` {/* cell:39 cell_type:markdown */} + ## Country Find all articles published in either of the two specified countries. {/* cell:40 cell_type:code */} + ```python impresso.search.find(q="Schengen", country=OR("FR", "CH")) ``` {/* cell:41 cell_type:markdown */} + ## Access rights Limit search to articles with specific access rights. {/* cell:42 cell_type:code */} + ```python impresso.search.find(q="Schengen", access_rights="Closed") ``` {/* cell:43 cell_type:markdown */} + ## Partner Limit search to articles provided by a specific partner of the Impresso project. {/* cell:44 cell_type:code */} + ```python impresso.search.find(q="Schengen", partner_id="Migros") ``` {/* cell:45 cell_type:markdown */} + ## Text reuse cluster Find all articles that are part of a specific text reuse cluster. {/* cell:46 cell_type:code */} + ```python from impresso import OR impresso.search.find(text_reuse_cluster_id=OR("tr-nobp-all-v01-c29")) ``` {/* cell:47 cell_type:markdown */} + # Facets -Facets are a way to get a summary of the search results from the perspective of a specific field. In a facet search result the field values are grouped together and the number of items in each group is displayed. +Facets are a way to get a summary of the search results from the perspective of a specific field. In a facet search result the field values are grouped together and the number of items in each group is displayed. The facet search method has the same attributes as the search method. {/* cell:48 cell_type:markdown */} + ## Date range Get the number of articles that mention "impresso", published on every particular date. {/* cell:49 cell_type:code */} + ```python impresso.search.facet("daterange", q="impresso") ``` {/* cell:50 cell_type:markdown */} + ## Year Get the number of articles that mention "impresso", published during every particular year. 
{/* cell:51 cell_type:code */} + ```python impresso.search.facet("year", q="impresso") ``` {/* cell:52 cell_type:markdown */} + ## Content length Get the number of articles that mention "impresso", grouped by content length. {/* cell:53 cell_type:code */} + ```python impresso.search.facet("contentLength", q="impresso") ``` {/* cell:54 cell_type:markdown */} + ## Month Get the number of articles that mention "impresso", published during every particular month. {/* cell:55 cell_type:code */} + ```python impresso.search.facet("month", q="impresso") ``` {/* cell:56 cell_type:markdown */} + ## Country Get the number of articles that mention "impresso", grouped by the country they were published in. {/* cell:57 cell_type:code */} + ```python impresso.search.facet("country", q="impresso") ``` {/* cell:58 cell_type:markdown */} + ## Type Get the number of all items, grouped by type of item. {/* cell:59 cell_type:code */} + ```python impresso.search.facet("type") ``` {/* cell:60 cell_type:markdown */} + ## Topic Find topics that the articles mentioning "pomme" are related to. {/* cell:61 cell_type:code */} + ```python impresso.search.facet("topic", q="pomme") ``` {/* cell:62 cell_type:markdown */} + ## Collection Find collections the articles mentioning "pomme" are part of. {/* cell:63 cell_type:code */} + ```python impresso.search.facet("collection", q="pomme") ``` {/* cell:64 cell_type:markdown */} + ## Newspaper Find newspapers that the articles mentioning "Schengen" were published in. {/* cell:65 cell_type:code */} + ```python impresso.search.facet("newspaper", q="Schengen") ``` {/* cell:66 cell_type:markdown */} + ## Language Find all languages the articles mentioning "Schengen" were published in. {/* cell:67 cell_type:code */} + ```python impresso.search.facet("language", q="Schengen") ``` {/* cell:68 cell_type:markdown */} + ## Person Find all persons mentioned in articles that mention "Schengen". Get only the last page. 
{/* cell:69 cell_type:code */} + ```python impresso.search.facet("person", q="Schengen", offset=7140) ``` {/* cell:70 cell_type:markdown */} + ## Location Find all locations mentioned in articles that mention "Schengen". Get only the last page. {/* cell:71 cell_type:code */} + ```python impresso.search.facet("location", q="Schengen", offset=3310) ``` {/* cell:72 cell_type:markdown */} + ## NAG Find all entities without a known type mentioned in articles that mention "homme" and "femme". {/* cell:73 cell_type:code */} + ```python from impresso import AND impresso.search.facet("nag", title=AND("homme", "femme")) ``` {/* cell:74 cell_type:markdown */} + ## Access rights Get access rights of articles mentioning "pomme". {/* cell:75 cell_type:code */} + ```python impresso.search.facet("accessRight", q="pomme") ``` {/* cell:76 cell_type:markdown */} + ## Partner Get Impresso partners that provided articles mentioning "pomme". {/* cell:77 cell_type:code */} + ```python impresso.search.facet("partner", q="pomme") ```
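
Reviewer note on the Facets section of `impresso-py-search.mdx`: the notebook describes a facet result as field values grouped together with the number of items in each group. The following is a minimal, library-independent sketch of that idea, which may help readers interpret the facet outputs. The sample records and the `facet_counts` helper are hypothetical. They are not part of impresso-py and say nothing about how `impresso.search.facet` is implemented.

```python
from collections import Counter

# Hypothetical sample records standing in for search results.
articles = [
    {"language": "fr", "year": 1968},
    {"language": "de", "year": 1968},
    {"language": "fr", "year": 1969},
    {"language": "fr", "year": 1968},
]


def facet_counts(items, field):
    """Group items by the value of `field` and count the items in each group."""
    return Counter(item[field] for item in items)


# One facet per field: each maps a field value to the number of matching items.
language_facet = facet_counts(articles, "language")
year_facet = facet_counts(articles, "year")

# most_common() lists the largest groups first.
print(language_facet.most_common())
print(year_facet.most_common())
```

Each facet in this sketch is a mapping from a field value to an item count, which is the same shape of information the notebook reads from a facet result (for example via the `count` column of `locations.df` in the maps notebook).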