docs(wrappers/python): add docstrings, check links
Note that the Python docstrings are written using reStructuredText
(see
https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#rst-primer,
https://sphinx-rtd-tutorial.readthedocs.io/en/latest/docstrings.html).
This has some notable differences from markdown:

```rst
   links: `link text <https://example.com>`_
   inline code: ``code``
```
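
As a rough illustration of these conventions, here is a hypothetical helper (`slugify` is made up for this example and is not part of the Pagefind wrapper) documented in reStructuredText style. Note that reStructuredText marks an external link with a trailing underscore after the closing backtick.

```python
def slugify(title: str, *, separator: str = "-") -> str:
    """Turn a page title into a URL slug.

    Inline code uses double backticks, e.g. ``slugify("Hello World")``.
    Links look like `the reST primer <https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html>`_.

    :param title: the human-readable title to convert.
    :param separator: the string placed between words. Defaults to ``-``.
    :returns: the lower-cased title with whitespace replaced by ``separator``.
    """
    # Hypothetical logic: lower-case, split on whitespace, join with the separator.
    return separator.join(title.lower().split())


print(slugify("Hello World"))  # prints "hello-world"
```

The `:param name:` fields are the same style used by the docstrings added in this commit.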

As a drive-by fix, I renamed `PagefindIndex.config` to `_config`, making it
private instead of noting that it should be immutable -- I think this sends a
clearer message.
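
A minimal sketch of the convention this relies on, using a hypothetical `Settings` class rather than the real `PagefindIndex`: a leading underscore marks an attribute as private by convention only (Python does not enforce it), while a read-only property is one way to make accidental reassignment fail loudly. The commit itself only renames the attribute; the property is an assumption added here for illustration.

```python
from typing import Optional


class Settings:
    """Hypothetical stand-in; not the real ``PagefindIndex``."""

    def __init__(self, config: Optional[dict] = None) -> None:
        # The leading underscore signals "internal, do not touch".
        self._config = config

    @property
    def config(self) -> Optional[dict]:
        # Read-only view: there is no setter, so assignment raises AttributeError.
        return self._config


settings = Settings({"output_path": "./public/pagefind"})
try:
    settings.config = {}  # type: ignore[misc]
except AttributeError:
    reassignment_blocked = True
else:
    reassignment_blocked = False
```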

Finally, I checked that all the documentation site links were correct:

```sh
cd docs
npm i
hugo # build the docs
lychee --include-fragments public/ # check the links
```

This validated that the links in ./docs/content/docs/py-api.md work, but
it turned up another interesting finding: there's a broken link to https://github.com/CloudCannon/pagefind/blob/main/pagefind/features/compound_filtering.feature.
SKalt committed Sep 28, 2024
1 parent 8f58c1d commit 89c795c
Showing 2 changed files with 110 additions and 35 deletions.
35 changes: 23 additions & 12 deletions docs/content/docs/py-api.md
@@ -2,20 +2,20 @@
title: "Indexing content using the Python API"
nav_title: "Using the Python API"
nav_section: References
weight: 54
weight: 54 # slightly less weight than the node API
---

Pagefind provides an interface to the indexing binary as a Python package you can install and import.

There are situations where using this Python package is beneficial:
- Integrating Pagefind into an existing Python project, e.g. writing a plugin for a static site generator that can pass in-memory HTML files to Pagefind.
Pagefind can also return the search index in-memory, to be hosted via the dev mode alongside the files.
- Users looking to index their site and augment that index with extra non-HTML pages can run a standard Pagefind crawl with [`add_directory`](#indexadddirectory) and augment it with [`add_custom_record`](#indexaddcustomrecord).
- Users looking to use Pagefind's engine for searching miscellaneous content such as PDFs or subtitles, where [`add_custom_record`](#indexaddcustomrecord) can be used to build the entire index from scratch.
- Users looking to index their site and augment that index with extra non-HTML pages can run a standard Pagefind crawl with [`add_directory`](#indexadd_directory) and augment it with [`add_custom_record`](#indexadd_custom_record).
- Users looking to use Pagefind's engine for searching miscellaneous content such as PDFs or subtitles, where [`add_custom_record`](#indexadd_custom_record) can be used to build the entire index from scratch.

## Example Usage

<!-- this is copied verbatim from wrappers/python/src/tests/integration.py -->
<!-- this example is copied verbatim from wrappers/python/src/tests/integration.py -->

```py
import asyncio
@@ -90,10 +90,21 @@ from pagefind.index import PagefindIndex

async def main():
    async with PagefindIndex() as index: # open the index
        ... # write to the index
        ... # update the index
    # the index is closed here and files are written to disk.
```

Each method of `PagefindIndex` that talks to the backing Pagefind service can raise errors.
If an error is thrown inside `PagefindIndex`'s context, the context closes without writing the index files to disk.

```py
async def main():
    async with PagefindIndex() as index: # open the index
        await index.add_directory("./public")
        raise Exception("not today")
    # the index closes without writing anything to disk
```
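
The behaviour described above can be sketched with a small stand-in class. `FakeIndex` and its `written` attribute are hypothetical and only mimic the documented contract (write on a clean exit, skip the write when an exception escapes the block); this is not the actual `PagefindIndex` implementation.

```python
import asyncio


class FakeIndex:
    """Hypothetical stand-in that mimics the documented open/close contract."""

    def __init__(self) -> None:
        self.written = False

    async def __aenter__(self) -> "FakeIndex":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Only "write" when the block finished without raising.
        if exc_type is None:
            self.written = True
        # Returning None lets any exception propagate to the caller.


async def run(fail: bool) -> bool:
    index = FakeIndex()
    try:
        async with index:
            if fail:
                raise RuntimeError("not today")
    except RuntimeError:
        pass  # the caller decides how to handle the failure
    return index.written


print(asyncio.run(run(fail=False)))  # prints "True"
print(asyncio.run(run(fail=True)))  # prints "False"
```

Because the exception still propagates out of the `async with` block, callers that want to continue after a failed indexing run need their own `try`/`except` around it.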

`PagefindIndex` optionally takes a configuration dictionary that can apply parts of the [Pagefind CLI config](/docs/config-options/). The options available at this level are:

```py
@@ -135,8 +146,6 @@ indexed_dir = await index.add_directory("./public", glob="**.{html}")
Optionally, a custom `glob` can be supplied which controls which files Pagefind will consume within the directory. The default is shown, and the `glob` option can be omitted entirely.
See [Wax patterns documentation](https://github.com/olson-sean-k/wax#patterns) for more details.

<!-- FIXME: don't discard errors list -->

## index.add_html_file

Adds a virtual HTML file to the Pagefind index. Useful for files that don't exist on disk, for example a static site generator that is serving files from memory.
@@ -168,7 +177,6 @@ Instead of `source_path`, a `url` may be supplied to explicitly set the URL of t

The `content` should be the full HTML source, including the outer `<html> </html>` tags. This will be run through Pagefind's standard HTML indexing process, and should contain any required Pagefind attributes to control behaviour.

<!-- FIXME: error array? -->
If successful, the `file` object is returned containing metadata about the completed indexing.

## index.add_custom_record
@@ -208,8 +216,6 @@ See the [Filters documentation](https://pagefind.app/docs/filtering/) for semant
See the [Sort documentation](https://pagefind.app/docs/sorts/) for semantics.
*When Pagefind is processing an index, number-like strings will be sorted numerically rather than alphabetically. As such, the value passed in should be `"20"` and not `20`*

<!-- FIXME: errors? -->

If successful, the `file` object is returned containing metadata about the completed indexing.

## index.get_files
@@ -233,7 +239,12 @@ Closing the `PagefindIndex`'s context automatically calls `index.write_files`.
If you aren't using `PagefindIndex` as a context manager, calling `index.write_files()` writes the index files to disk, as they would be written when running the standard Pagefind binary directly.

```py
await index.write_files("./public/pagefind")
index = PagefindIndex(
    IndexConfig(
        output_path="./public/pagefind",
    ),
)
await index.write_files()
```

The `output_path` option should contain the path to the desired Pagefind bundle directory. If relative, it is relative to the current working directory of your Python process.
@@ -244,7 +255,7 @@ Deletes the data for the given index from its backing Pagefind service.
Doesn't affect any written files or data returned by `get_files()`.

```python
await index.delete_index();
await index.delete_index()
```

Calling `index.get_files()` or `index.write_files()` doesn't consume the index, and further modifications can be made. In situations where many indexes are being created, the `delete_index` call helps clear out memory from a shared Pagefind binary service.
110 changes: 87 additions & 23 deletions wrappers/python/src/pagefind/index/__init__.py
@@ -20,37 +20,79 @@

class IndexConfig(TypedDict, total=False):
root_selector: Optional[str]
"""
The root selector to use for the index.
If not supplied, Pagefind will use the ``<html>`` tag.
"""
exclude_selectors: Optional[Sequence[str]]
"""Extra element selectors that Pagefind should ignore when indexing."""
force_language: Optional[str]
"""
Ignores any detected languages and creates a single index for the entire site as the
provided language. Expects an ISO 639-1 code, such as ``en`` or ``pt``.
"""
verbose: Optional[bool]
"""
Prints extra logging while indexing the site. Only affects the CLI, does not impact
web-facing search.
"""
logfile: Optional[str]
"""
A path to a file to log indexing output to in addition to stdout.
The file will be created if it doesn't exist and overwritten on each run.
"""
keep_index_url: Optional[bool]
"""Whether to keep ``index.html`` at the end of search result paths.
By default, a file at ``animals/cat/index.html`` will be given the URL
``/animals/cat/``. Setting this option to ``true`` will result in the URL
``/animals/cat/index.html``.
"""
output_path: Optional[str]
"""
The folder to output the search bundle into, relative to the processed site.
Defaults to ``pagefind``.
"""


class PagefindIndex:
"""Manages a Pagefind index.
``PagefindIndex`` operates as an async contextmanager.
Entering the context starts a backing Pagefind service and creates an in-memory index in the backing service.
Exiting the context writes the in-memory index to disk and then shuts down the backing Pagefind service.
Each method of ``PagefindIndex`` that talks to the backing Pagefind service can raise errors.
If an exception is raised inside ``PagefindIndex``'s context, the context closes without writing the index files to disk.
``PagefindIndex`` optionally takes a configuration dictionary that can apply parts of the [Pagefind CLI config](/docs/config-options/). The options available at this level are:
See the relevant documentation for these configuration options in the
`Configuring the Pagefind CLI <https://pagefind.app/docs/config-options/>`_ documentation.
"""

_service: Optional["PagefindService"] = None
_index_id: Optional[int] = None
config: Optional[IndexConfig] = None
"""Note that config is immutable after initialization."""
_config: Optional[IndexConfig] = None
"""Note that config should be immutable."""

def __init__(
self,
config: Optional[IndexConfig] = None,
*,
_service: Optional["PagefindService"] = None,
_index_id: Optional[int] = None,
# TODO: cache config
):
self._service = _service
self._index_id = _index_id
self.config = config
self._config = config

async def _start(self) -> "PagefindIndex":
"""Start the backing Pagefind service and create an in-memory index."""
assert self._index_id is None
assert self._service is None
self._service = await PagefindService().launch()
_index = await self._service.create_index(self.config)
_index = await self._service.create_index(self._config)
self._index_id = _index._index_id
return self

@@ -61,14 +103,14 @@ async def add_html_file(
source_path: Optional[str] = None,
url: Optional[str] = None,
) -> InternalIndexedFileResponse:
"""
ARGS:
content: The source HTML content of the file to be parsed.
source_path: The source path of the HTML file if it were to exist on disk. \
"""Add an HTML file to the index.
:param content: The source HTML content of the file to be parsed.
:param source_path: The source path the HTML file would have on disk. \
Must be a relative path, or an absolute path within the current working directory. \
Pagefind will compute the result URL from this path.
url: an explicit URL to use, instead of having Pagefind compute the URL \
based on the source_path. If not supplied, source_path must be supplied.
:param url: an explicit URL to use, instead of having Pagefind compute the \
URL based on the source_path. If not supplied, source_path must be supplied.
"""
assert self._service is not None
assert self._index_id is not None
@@ -87,6 +129,16 @@ async def add_html_file(
async def add_directory(
self, path: str, *, glob: Optional[str] = None
) -> InternalIndexedDirResponse:
"""Indexes a directory from disk using the standard Pagefind indexing behaviour.
This is equivalent to running the Pagefind binary with ``--site <dir>``.
:param path: the path to the directory to index. If the `path` provided is relative, \
it will be relative to the current working directory of your Python process.
:param glob: a glob pattern to filter files in the directory. If not provided, all \
files matching ``**.{html}`` are indexed. For more information on glob patterns, \
see the `Wax patterns documentation <https://github.com/olson-sean-k/wax#patterns>`_.
"""
assert self._service is not None
assert self._index_id is not None
result = await self._service.send(
@@ -101,11 +153,12 @@ async def add_directory(
return cast(InternalIndexedDirResponse, result)

async def get_files(self) -> List[InternalSyntheticFile]:
"""
"""Get raw data of all files in the Pagefind index.
WATCH OUT: this method emits all files. This can be a lot of data, and
this amount of data can cause reading from the subprocess pipes to deadlock.
STRICTLY PREFER calling `self.write_files()`.
STRICTLY PREFER calling ``self.write_files()``.
"""
assert self._service is not None
assert self._index_id is not None
@@ -118,6 +171,10 @@ async def get_files(self) -> List[InternalSyntheticFile]:
return result

async def delete_index(self) -> None:
"""
Deletes the data for the given index from its backing Pagefind service.
Doesn't affect any written files or data returned by ``get_files()``.
"""
assert self._service is not None
assert self._index_id is not None
result = await self._service.send(
@@ -137,14 +194,16 @@ async def add_custom_record(
filters: Optional[Dict[str, List[str]]] = None,
sort: Optional[Dict[str, str]] = None,
) -> InternalIndexedFileResponse:
"""
ARGS:
content: the raw content of this record.
url: the output URL of this record. Pagefind will not alter this.
language: ISO 639-1 code of the language this record is written in.
meta: the metadata to attach to this record. Supplying a `title` is highly recommended.
filters: the filters to attach to this record. Filters are used to group records together.
sort: the sort keys to attach to this record.
"""Add a direct record to the Pagefind index.
This method is useful for adding non-HTML content to the search results.
:param content: the raw content of this record.
:param url: the output URL of this record. Pagefind will not alter this.
:param language: ISO 639-1 code of the language this record is written in.
:param meta: the metadata to attach to this record. Supplying a ``title`` is highly recommended.
:param filters: the filters to attach to this record. Filters are used to group records together.
:param sort: the sort keys to attach to this record.
"""
assert self._service is not None
assert self._index_id is not None
@@ -164,12 +223,17 @@ async def add_custom_record(
return cast(InternalIndexedFileResponse, result)

async def write_files(self) -> None:
"""Write the index files to disk.
If you're using PagefindIndex as a context manager, there's no need to call this method:
if no error occurred, closing the context automatically writes the index files to disk.
"""
assert self._service is not None
assert self._index_id is not None
if not self.config:
if not self._config:
output_path = None
else:
output_path = self.config.get("output_path")
output_path = self._config.get("output_path")

result = await self._service.send(
InternalWriteFilesRequest(