Documentation #20

Merged
merged 4 commits into from
Dec 8, 2024
2 changes: 2 additions & 0 deletions .github/workflows/python-publish.yml
Expand Up @@ -43,3 +43,5 @@ jobs:
with:
user: __token__
password: ${{ secrets.PYPI_TOKEN }}
- name: Publish docs
run: mkdocs gh-deploy
3 changes: 2 additions & 1 deletion .gitignore
Expand Up @@ -16,4 +16,5 @@ __pycache__/

impresso.egg-info/*
tmp/
dist/
dist/
site/
5 changes: 4 additions & 1 deletion README.md
Expand Up @@ -3,7 +3,10 @@
[![PyPI version](https://badge.fury.io/py/impresso.svg)](https://badge.fury.io/py/impresso)
![PyPI - License](https://img.shields.io/pypi/l/impresso)

Impresso is a library to interact with the [Impresso](https://impresso-project.ch/app) dataset. It provides a set of classes to interact with the API and a set of tools that make working with the data easier.

Impresso is a library designed to facilitate interaction with the [Impresso](https://impresso-project.ch/app) dataset. It offers a comprehensive set of classes for API interaction and a variety of tools to streamline data manipulation and analysis.

You can find the full documentation at [https://impresso.github.io/impresso-py/](https://impresso.github.io/impresso-py/).

## Installation

Expand Down
36 changes: 36 additions & 0 deletions docs/index.md
@@ -0,0 +1,36 @@
# Impresso Python

<p align="center">
<img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>

Impresso is a library designed to facilitate interaction with the [Impresso](https://impresso-project.ch/app) dataset. It offers a comprehensive set of classes for API interaction and a variety of tools to streamline data manipulation and analysis.

## Installation and prerequisites

The Impresso Python library can be installed using `pip`:

```shell
pip install impresso
```

The library requires Python version `3.10` or higher. It also depends on several packages commonly found in Jupyter environments, such as `matplotlib` and `pandas`.

## Create a session

::: impresso.connect


## About Impresso

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.

### Copyright

Copyright (C) 2024 The Impresso team.

### License

This program is provided as open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.
30 changes: 30 additions & 0 deletions docs/preparing_queries.md
@@ -0,0 +1,30 @@
# Preparing queries

Some filter parameters accept a combination of modifiers to create complex queries. For example, to search for content that mentions both `Titanic` and `ship`, you can use the `AND` modifier to combine these conditions:

```python
from impresso import AND, connect

impresso = connect()  # create a session first
impresso.search.find(term=AND("Titanic", "ship"))
```

We can refine this condition and search for all content items that mention `Titanic` and `ship` together **OR** mention `Titanic` and `iceberg` together **AND** do not mention `Di Caprio`.


```python
from impresso import AND, OR

impresso.search.find(
    term=(
        AND("Titanic", "ship") |
        AND("Titanic", "iceberg")
    ) & ~OR("Di Caprio")
)
```
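
To make the semantics of the combined condition concrete, here is a small, self-contained sketch of how `|`, `&` and `~` can compose conditions into an expression tree. It is not the Impresso implementation: the classes `Expr`, `Term`, `AllOf`, `AnyOf` and `Not` and the helpers `all_terms` and `any_terms` are hypothetical stand-ins, and case-insensitive substring matching is an assumption made only for illustration.

```python
from dataclasses import dataclass


class Expr:
    """Hypothetical base class that gives every condition the |, & and ~ operators."""

    def __or__(self, other: "Expr") -> "Expr":
        return AnyOf((self, other))

    def __and__(self, other: "Expr") -> "Expr":
        return AllOf((self, other))

    def __invert__(self) -> "Expr":
        return Not(self)

    def matches(self, text: str) -> bool:
        raise NotImplementedError


@dataclass(frozen=True)
class Term(Expr):
    word: str

    def matches(self, text: str) -> bool:
        # Assumption: a term matches when it occurs anywhere in the text.
        return self.word.lower() in text.lower()


@dataclass(frozen=True)
class AllOf(Expr):
    parts: tuple

    def matches(self, text: str) -> bool:
        return all(part.matches(text) for part in self.parts)


@dataclass(frozen=True)
class AnyOf(Expr):
    parts: tuple

    def matches(self, text: str) -> bool:
        return any(part.matches(text) for part in self.parts)


@dataclass(frozen=True)
class Not(Expr):
    inner: Expr

    def matches(self, text: str) -> bool:
        return not self.inner.matches(text)


def all_terms(*words: str) -> Expr:
    """Stand-in for an AND-style modifier: every word must be present."""
    return AllOf(tuple(Term(word) for word in words))


def any_terms(*words: str) -> Expr:
    """Stand-in for an OR-style modifier: at least one word must be present."""
    return AnyOf(tuple(Term(word) for word in words))


# Same shape as the query above, built from the stand-in helpers.
query = (
    all_terms("Titanic", "ship") | all_terms("Titanic", "iceberg")
) & ~any_terms("Di Caprio")
```

In this toy model, `query.matches(text)` is true for a text that mentions the Titanic together with a ship or an iceberg, and false as soon as the text also mentions Di Caprio.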

## Modifiers

::: impresso.structures.OR
::: impresso.structures.AND
::: impresso.structures.DateRange
::: impresso.structures.NumericRange
74 changes: 74 additions & 0 deletions docs/resources.md
@@ -0,0 +1,74 @@
# Impresso API resources

## Search

Search content items in the Impresso corpus.

```python
impresso.search.find(term='Titanic', limit=10)
```
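
A result container generally holds a single page of results, so collecting more than `limit` items means issuing several calls with a growing `offset`. The sketch below shows one way to do that. `iter_pages` is a hypothetical helper, not part of the library. It only assumes that the `find` callable accepts `limit` and `offset` and that each returned page exposes a `total` property.

```python
from typing import Any, Callable, Iterator


def iter_pages(
    find: Callable[..., Any], page_size: int = 10, **filters: Any
) -> Iterator[Any]:
    """Yield successive result pages until `total` items have been covered."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        page = find(limit=page_size, offset=offset, **filters)
        yield page
        offset += page_size
        # Stop once the next offset would start past the last result.
        if offset >= page.total:
            break
```

With a connected client this could be used as `for page in iter_pages(impresso.search.find, page_size=50, term="Titanic"): ...`, processing each page as it arrives.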

::: impresso.resources.search.SearchResource

::: impresso.api_client.models.search_order_by.SearchOrderByLiteral
::: impresso.resources.search.SearchDataContainer

## Entities

Search entities in the Impresso corpus.

```python
impresso.entities.find(term="Douglas Adams")
```

::: impresso.resources.entities.EntitiesResource

::: impresso.resources.entities.EntityType
::: impresso.api_client.models.find_entities_order_by.FindEntitiesOrderByLiteral

## Newspapers

Search newspapers available in the Impresso corpus.

```python
impresso.newspapers.find(
    term="wort",
    order_by="lastIssue",
)
```

::: impresso.resources.newspapers.NewspapersResource

::: impresso.api_client.models.find_newspapers_order_by.FindNewspapersOrderByLiteral
::: impresso.resources.newspapers.FindNewspapersContainer

## Content Items

Get a single content item by ID.

```python
impresso.content_items.get("NZZ-1794-08-09-a-i0002")
```

## Collections

Search collections, list their content items, and add or remove items.

::: impresso.resources.collections.CollectionsResource

::: impresso.api_client.models.find_collections_order_by.FindCollectionsOrderByLiteral
::: impresso.resources.collections.FindCollectionsContainer
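
As an illustration of how the collection methods fit together, the sketch below moves content items from one collection to another. `move_items` is a hypothetical helper, and reaching the resource as `client.collections` is an assumption based on the resource name. `add_items` and `remove_items` are the `CollectionsResource` methods, each called with a collection ID and a list of content item IDs.

```python
from typing import Any


def move_items(
    client: Any, source_id: str, target_id: str, item_ids: list[str]
) -> None:
    """Add the items to the target collection, then remove them from the source."""
    # Add first, so the items are never absent from both collections.
    client.collections.add_items(target_id, item_ids)
    client.collections.remove_items(source_id, item_ids)
```

Neither operation takes effect immediately, so listing either collection right after the call may still show the previous state for a few minutes.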

## Named entity recognition

The Python library includes a set of named entity recognition (NER) methods. They use the same NER model that adds entities to the Impresso database.

::: impresso.resources.tools.ToolsResource
::: impresso.resources.tools.NerContainer

## Text reuse

Two resources can be used to search text reuse clusters and passages.

::: impresso.resources.text_reuse.clusters.TextReuseClustersResource
::: impresso.resources.text_reuse.passages.TextReusePassagesResource
7 changes: 7 additions & 0 deletions docs/result.md
@@ -0,0 +1,7 @@
# Result object

When you execute a query, a `DataContainer` object is returned. This object encapsulates the query results along with metadata about the query. Additionally, it provides a suite of utility methods for accessing the results in various ways.

In a Python notebook environment, the `DataContainer` object can render a preview of its data, facilitating quick inspection of the query results.

::: impresso.data_container.DataContainer
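
The pagination metadata can be read straight from the container's properties. The sketch below gathers `total`, `limit`, `offset` and `url` into a dictionary and derives whether another page exists. `describe_page` is a hypothetical helper, and the `SimpleNamespace` object with made-up values only stands in for a real container so that the example runs without an API connection.

```python
from types import SimpleNamespace
from typing import Any


def describe_page(result: Any) -> dict[str, Any]:
    """Collect the pagination metadata of a result container into a dict."""
    return {
        "total": result.total,
        "limit": result.limit,
        "offset": result.offset,
        "url": result.url,
        # True while at least one more page lies beyond the current one.
        "has_more": result.offset + result.limit < result.total,
    }


# Stand-in object with made-up values; a real container comes from a query.
fake_result = SimpleNamespace(total=25, limit=10, offset=10, url=None)
info = describe_page(fake_result)
```

Here `info["has_more"]` is true because the second page of ten items ends at item 20 of 25.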
8 changes: 7 additions & 1 deletion impresso/client.py
Expand Up @@ -79,9 +79,15 @@ def connect(
public_api_url: str | None = None,
persisted_token: bool = True,
) -> ImpressoClient:
f"""
"""
Connect to the Impresso API and return a client object.

```python
from impresso import connect

impresso = connect()
```

Args:
public_api_url (str | None): The URL of the Impresso API to connect to. Defaults to the URL set
in the config file (~/.impresso_py.yml) or the Impresso default URL ({DEFAULT_API_URL}).
Expand Down
21 changes: 14 additions & 7 deletions impresso/data_container.py
Expand Up @@ -7,7 +7,11 @@


class DataContainer(Generic[IT, T]):
"""Response of a resource call"""
"""
Generic container for responses from the Impresso API
returned by resource methods (`get`, `find`).
Generally represents a single page of the result.
"""

def __init__(
self,
Expand Down Expand Up @@ -72,17 +76,17 @@ def _get_preview_image_(self) -> str | None:

@property
def raw(self) -> dict[str, Any]:
"""Return the data as a python dictionary."""
"""Returns the response data as a Python dictionary."""
return getattr(self._data, "to_dict")()

@property
def pydantic(self) -> T:
"""Return the data as a pydantic model."""
"""Returns the response data as a pydantic model."""
return self._pydantic_model.model_validate(self.raw)

@property
def df(self) -> DataFrame:
"""Return the data as a pandas dataframe."""
"""Returns the response data as a pandas DataFrame."""
return DataFrame.from_dict(self._data) # type: ignore

@property
Expand All @@ -92,12 +96,12 @@ def total(self) -> int:

@property
def limit(self) -> int:
"""Page size."""
"""Current page size."""
return self.raw.get("pagination", {}).get("limit", 0)

@property
def offset(self) -> int:
"""Page offset."""
"""Current page offset."""
return self.raw.get("pagination", {}).get("offset", 0)

@property
Expand All @@ -107,5 +111,8 @@ def size(self) -> int:

@property
def url(self) -> str | None:
"""A URL of the result set in the Impresso web app."""
"""
URL of an Impresso web application page
representing the result set from this container.
"""
return self._web_app_search_result_url
42 changes: 37 additions & 5 deletions impresso/resources/collections.py
Expand Up @@ -64,7 +64,9 @@ def total(self) -> int:


class CollectionsResource(Resource):
"""Work with collections"""
"""
Work with collections.
"""

name = "collections"

Expand All @@ -75,7 +77,18 @@ def find(
limit: int | None = None,
offset: int | None = None,
) -> FindCollectionsContainer:
"""Find collections."""
"""
Search collections in Impresso.

Args:
term: Search term.
order_by: Field to order results by.
limit: Number of results to return.
offset: Number of results to skip.

Returns:
FindCollectionsContainer: Data container with a page of results of the search.
"""

result = find_collections.sync(
client=self._api_client,
Expand Down Expand Up @@ -122,7 +135,18 @@ def items(
limit: int | None = None,
offset: int | None = None,
) -> SearchDataContainer:
"""Return all items in a collection."""
"""
Return a page of content items from a collection.

Args:
collection_id: ID of the collection.
limit: Number of results to return.
offset: Number of results to skip.

Returns:
SearchDataContainer: Data container with a page of results of the search.
"""

search_resource = SearchResource(self._api_client)
return search_resource.find(
collection_id=collection_id, limit=limit, offset=offset
Expand All @@ -135,6 +159,10 @@ def add_items(self, collection_id: str, item_ids: list[str]) -> None:
**NOTE**: Items are not added immediately.
This operation may take up to a few minutes
to complete and reflect in the collection.

Args:
collection_id: ID of the collection.
item_ids: IDs of the content items to add.
"""
result = patch_collections_collection_id_items.sync(
client=self._api_client,
Expand All @@ -148,11 +176,15 @@ def add_items(self, collection_id: str, item_ids: list[str]) -> None:

def remove_items(self, collection_id: str, item_ids: list[str]) -> None:
"""
Remove items from a collection by their IDs.

**NOTE**: Items are not removed immediately.
This operation may take up to a few minutes
to complete and reflect in the collection.

Args:
collection_id: ID of the collection.
item_ids: IDs of the content items to remove.
"""
result = patch_collections_collection_id_items.sync(
client=self._api_client,
Expand Down
19 changes: 17 additions & 2 deletions impresso/resources/entities.py
Expand Up @@ -51,7 +51,7 @@ def df(self) -> DataFrame:


class EntitiesResource(Resource):
"""Work with entities"""
"""Search entities in the Impresso database."""

name = "entities"

Expand All @@ -66,7 +66,22 @@ def find(
limit: int | None = None,
offset: int | None = None,
) -> FindEntitiesContainer:
"""Find entities."""
"""
Search entities in Impresso.

Args:
term: Search term.
wikidata_id: Return only entities resolved to this Wikidata ID.
entity_id: Return only the entity with this ID.
entity_type: Return only entities of this type.
order_by: Field to order results by.
resolve: Return Wikidata details for entities that are linked to a Wikidata entry.
limit: Number of results to return.
offset: Number of results to skip.

Returns:
FindEntitiesContainer: Data container with a page of results of the search.
"""

filters: list[Filter] = []
if entity_type is not None:
Expand Down
13 changes: 12 additions & 1 deletion impresso/resources/newspapers.py
Expand Up @@ -32,7 +32,7 @@ def df(self) -> DataFrame:


class NewspapersResource(Resource):
"""Search newspapers"""
"""Search newspapers in the Impresso database."""

name = "newspapers"

Expand All @@ -43,7 +43,18 @@ def find(
limit: int | None = None,
offset: int | None = None,
) -> FindNewspapersContainer:
"""
Search newspapers in Impresso.

Args:
term: Search term.
order_by: Field to order results by.
limit: Number of results to return.
offset: Number of results to skip.

Returns:
FindNewspapersContainer: Data container with a page of results of the search.
"""
result = find_newspapers.sync(
client=self._api_client,
term=term if term is not None else UNSET,
Expand Down