Skip to content

Commit

Permalink
docs(wrappers/python): document python API
Browse files Browse the repository at this point in the history
The structure is more-or-less entirely plagiarized from node-api.md,
but with the Python-specific details filled in.
  • Loading branch information
SKalt committed Sep 25, 2024
1 parent 112f4a8 commit 8109fdd
Showing 1 changed file with 292 additions and 0 deletions.
292 changes: 292 additions & 0 deletions docs/content/docs/py-api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,292 @@
---
title: "Indexing content using the Python API"
nav_title: "Using the Python API"
nav_section: References
weight: 54
---

Pagefind provides an interface to the indexing binary as a Python package you can install and import.

There are situations where using this Python package is beneficial:
- Integrating Pagefind into an existing Python project, e.g. writing a plugin for a static site generator that can pass in-memory HTML files to Pagefind.
Pagefind can also return the search index in-memory, to be hosted via the dev mode alongside the files.
- Users looking to index their site and augment that index with extra non-HTML pages can run a standard Pagefind crawl with [`add_directory`](#indexadddirectory) and augment it with [`add_custom_record`](#indexaddcustomrecord).
- Users looking to use Pagefind's engine for searching miscellaneous content such as PDFs or subtitles, where [`add_custom_record`](#indexaddcustomrecord) can be used to build the entire index from scratch.

## Example Usage

<!-- this is copied verbatim from wrappers/python/src/tests/integration.py -->

```py
import asyncio
import json
import logging
import os
from pagefind.index import PagefindIndex, IndexConfig

logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO"))
log = logging.getLogger(__name__)
html_content = (
"<html>"
" <body>"
" <main>"
" <h1>Example HTML</h1>"
" <p>This is an example HTML page.</p>"
" </main>"
" </body>"
"</html>"
)


def prefix(pre: str, s: str) -> str:
return pre + s.replace("\n", f"\n{pre}")


async def main():
config = IndexConfig(
root_selector="main", logfile="index.log", output_path="./output", verbose=True
)
async with PagefindIndex(config=config) as index:
log.debug("opened index")
new_file, new_record, new_dir = await asyncio.gather(
index.add_html_file(
content=html_content,
url="https://example.com",
source_path="other/example.html",
),
index.add_custom_record(
url="/elephants/",
content="Some testing content regarding elephants",
language="en",
meta={"title": "Elephants"},
),
index.add_directory("./public"),
)
print(prefix("new_file ", json.dumps(new_file, indent=2)))
print(prefix("new_record ", json.dumps(new_record, indent=2)))
print(prefix("new_dir ", json.dumps(new_dir, indent=2)))

files = await index.get_files()
for file in files:
print(prefix("files", f"{len(file['content']):10}B {file['path']}"))


if __name__ == "__main__":
asyncio.run(main())
```

All interactions with Pagefind are asynchronous, as they communicate with the native Pagefind binary in the background.

## PagefindIndex

`pagefind.index.PagefindIndex` manages a pagefind index.

`PagefindIndex` operates as an async contextmanager.
Entering the context starts a backing Pagefind service and creates an in-memory index in the backing service.
Exiting the context writes the in-memory index to disk and then shuts down the backing Pagefind service.

```py
from pagefind.index import PagefindIndex

async def main():
async with PagefindIndex() as index: # open the index
... # write to the index
# the index is closed here and files are written to disk.
```

`PagefindIndex` optionally takes a configuration dictionary that can apply parts of the [Pagefind CLI config](/docs/config-options/). The options available at this level are:

```py
from pagefind.index import PagefindIndex, IndexConfig
config = IndexConfig(
root_selector="main",
exclude_selectors="nav",
force_language="en",
verbose=True,
logfile="index.log",
keep_index_url=True,
output_path="./output",
)

async def main():
async with PagefindIndex(config=config) as index:
...
```

See the relevant documentation for these configuration options in the [Configuring the Pagefind CLI](/docs/config-options/) documentation.

## index.add_directory

Indexes a directory from disk using the standard Pagefind indexing behaviour.
This is equivalent to running the Pagefind binary with `--site <dir>`.

```py
# Index all the HTML files in the public directory
indexed_dir = await index.add_directory("./public")
page_count: int = new_dir["page_count"]
```
If the `path` provided is relative, it will be relative to the current working directory of your Python process.

```py
# Index files in a directory matching a given glob pattern.
indexed_dir = await index.add_directory("./public", glob="**.{html}")
```

Optionally, a custom `glob` can be supplied which controls which files Pagefind will consume within the directory. The default is shown, and the `glob` option can be omitted entirely.
See [Wax patterns documentation](https://github.com/olson-sean-k/wax#patterns) for more details.

<!-- FIXME: don't discard errors list -->

## index.add_html_file

Adds a virtual HTML file to the Pagefind index. Useful for files that don't exist on disk, for example a static site generator that is serving files from memory.

```py
html_content = (
"<html lang='en'><body>"
" <h1>A Full HTML Document</h1>"
" <p> ... </p>"
"</body></html>"
)

# Index a file as if Pagefind was indexing from disk
new_file = await index.add_html_file(
content=html_content,
source_path="other/example.html",
)

# Index HTML content, giving it a specific URL
new_file = await index.add_html_file(
content=html_content,
url="https://example.com",
)
```

The `source_path` should represent the path of this HTML file if it were to exist on disk. Pagefind will use this path to generate the URL. It should be relative, or absolute to a path within the current working directory.

Instead of `source_path`, a `url` may be supplied to explicitly set the URL of this search result.

The `content` should be the full HTML source, including the outer `<html> </html>` tags. This will be run through Pagefind's standard HTML indexing process, and should contain any required Pagefind attributes to control behaviour.

<!-- FIXME: error array? -->
If successful, the `file` object is returned containing metadata about the completed indexing.

## index.add_custom_record
Adds a direct record to the Pagefind index.
Useful for adding non-HTML content to the search results.

```py
custom_record = await index.add_custom_record(
url="/contact/",
content=(
"My raw content to be indexed for search. "
"Will be lightly processed by Pagefind."
),
language="en",
meta={
"title": "Contact",
"category": "Landing Page"
},
filters={"tags": ["landing", "company"]},
sort={"weight": "20"},
)

page_word_count: int = custom_record["page_word_count"]
page_url: str = custom_record["page_url"]
page_meta: dict[str, str] = custom_record["page_meta"]
```

The `url`, `content`, and `language` fields are all required. `language` should be an [ISO 639-1 code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).

`meta` is optional, and is strictly a flat object of keys to string values.
See the [Metadata documentation](https://pagefind.app/docs/metadata/) for semantics.

`filters` is optional, and is strictly a flat object of keys to arrays of string values.
See the [Filters documentation](https://pagefind.app/docs/filtering/) for semantics.

`sort` is optional, and is strictly a flat object of keys to string values.
See the [Sort documentation](https://pagefind.app/docs/sorts/) for semantics.
*When Pagefind is processing an index, number-like strings will be sorted numerically rather than alphabetically. As such, the value passed in should be `"20"` and not `20`*

<!-- FIXME: errors? -->

If successful, the `file` object is returned containing metadata about the completed indexing.

## index.get_files

Get raw data of all files in the Pagefind index.
Useful for integrating a Pagefind index into the development mode of a static site generator and hosting these files yourself.

**WATCH OUT**: these files can be large enough to clog the pipe reading from the `pagefind` binary's subprocess, causing a deadlock.

```py
for file in (await index.get_files()):
path: str = file["path"]
content: str = file["content"]
...
```

## index.write_files

Closing the `PagefindIndex`'s context automatically calls `index.write_files`.

If you aren't using `PagefindIndex` as a context manager, calling `index.write_files()` writes the index files to disk, as they would be written when running the standard Pagefind binary directly.

```py
await index.write_files("./public/pagefind")
```

The `output_path` option should contain the path to the desired Pagefind bundle directory. If relative, is relative to the current working directory of your Python process.

## index.delete_index

Deletes the data for the given index from its backing Pagefind service.
Doesn't affect any written files or data returned by `get_files()`.

```python
await index.delete_index();
```

Calling `index.get_files()` or `index.write_files()` doesn't consume the index, and further modifications can be made. In situations where many indexes are being created, the `delete_index` call helps clear out memory from a shared Pagefind binary service.

Reusing an `PagefindIndex` object after calling `index.delete_index()` will cause errors to be returned.

Not calling this method is fine — these indexes will be cleaned up when your `PagefindIndex`'s context closes, its backing Pagefind service closes, or your Python process exits.

## PagefindService

`PagefindService` manages a pagefind service running in a subprocess.

`PagefindService` operates as an async context manager: when the context is entered, the backing service starts, and when the context exits, the backing service shuts down.

```py
from pagefind.service import PagefindService

async def main():
# or you can write
service = await PagefindService().launch()
...
await service.close()

async with PagefindService() as service: # the service launches
...
# the service closes
```

You should invoke `PagefindService` directly when you want to use the same backing service for many indexes:

```py
async with PagefindService() as service:
default_index = await service.create_index()
other_index = await service.create_index(
config=IndexConfig(output_path="./search/nonstandard"),
)
await asyncio.gather(
default_index.add_directory("./a"),
other_index.add_directory("./b"),
)
await asyncio.gather(
default_index.write_files(),
other_index.write_files(),
)
```

0 comments on commit 8109fdd

Please sign in to comment.