Minor clarifications on transformations docs (run-llama#9044)
seldo authored Nov 21, 2023
1 parent dfea98a commit 6d11718
Showing 4 changed files with 23 additions and 11 deletions.
10 changes: 9 additions & 1 deletion docs/module_guides/loading/ingestion_pipeline/root.md
@@ -4,7 +4,7 @@ An `IngestionPipeline` uses a concept of `Transformations` that are applied to i

## Usage Pattern

-At it's most basic level, you can quickly instantiate an `IngestionPipeline` like so:
+The simplest usage is to instantiate an `IngestionPipeline` like so:

```python
from llama_index import Document
@@ -26,6 +26,8 @@ pipeline = IngestionPipeline(
nodes = pipeline.run(documents=[Document.example()])
```

Note that in a real-world scenario, you would get your documents from `SimpleDirectoryReader` or another reader from Llama Hub.

## Connecting to Vector Databases

When running an ingestion pipeline, you can also choose to automatically insert the resulting nodes into a remote vector store.
@@ -63,6 +65,12 @@ from llama_index import VectorStoreIndex
index = VectorStoreIndex.from_vector_store(vector_store)
```

## Calculating embeddings in a pipeline

Note that in the above example, embeddings are calculated as part of the pipeline. If you are connecting your pipeline to a vector store, embeddings must be a stage of your pipeline or your later instantiation of the index will fail.

You can omit embeddings from your pipeline if you are not connecting to a vector store, i.e. just producing a list of nodes.

## Caching

In an `IngestionPipeline`, each node + transformation combination is hashed and cached. This saves time on subsequent runs that use the same data.
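The exact cache internals live in the library, but the idea can be sketched in plain Python (the `cache_key` helper and its string inputs are illustrative, not the library's real API):

```python
import hashlib


def cache_key(node_text: str, transform_repr: str) -> str:
    # Hash the node's content together with a description of the
    # transformation; identical inputs always map to the same key,
    # so a repeated run can reuse the cached output instead of recomputing.
    payload = (node_text + "::" + transform_repr).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


first = cache_key("some document text", "SentenceSplitter(chunk_size=1024)")
repeat = cache_key("some document text", "SentenceSplitter(chunk_size=1024)")
changed = cache_key("edited document text", "SentenceSplitter(chunk_size=1024)")
```

Because `first == repeat`, a second run over unchanged data is a cache hit, while any change to the node text or the transformation yields a new key.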
10 changes: 5 additions & 5 deletions docs/module_guides/loading/ingestion_pipeline/transformations.md
@@ -2,12 +2,12 @@

A transformation is something that takes a list of nodes as an input, and returns a list of nodes. Each component that implements the `Transformation` base class has both a synchronous `__call__()` definition and an async `acall()` definition.

-Current;y, the following components are `Transformation` objects:
+Currently, the following components are `Transformation` objects:

-- `TextSplitter`
-- `NodeParser`
-- `MetadataExtractor`
-- `Embeddings`model
+- [`TextSplitter`](text_splitters)
+- [`NodeParser`](/module_guides/loading/node_parsers/modules.md)
+- [`MetadataExtractor`](/module_guides/loading/documents_and_nodes/usage_metadata_extractor.md)
+- `Embeddings` model (check our [list of supported embeddings](list_of_embeddings))
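The shared interface these components implement can be illustrated with a toy stand-in (plain strings instead of real `Node` objects; `UpperCaseTransform` is invented for illustration, not part of the library):

```python
import asyncio
from typing import List


class UpperCaseTransform:
    """Toy transformation: a callable that maps a list of 'nodes'
    (here just strings) to a new list, with sync and async entry points."""

    def __call__(self, nodes: List[str]) -> List[str]:
        return [n.upper() for n in nodes]

    async def acall(self, nodes: List[str]) -> List[str]:
        # The async variant delegates to the sync one in this sketch.
        return self(nodes)


transform = UpperCaseTransform()
sync_result = transform(["hello", "world"])
async_result = asyncio.run(transform.acall(["async too"]))
```

Anything with this call shape composes: the output node list of one transformation is the input of the next.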

## Usage Pattern

10 changes: 6 additions & 4 deletions docs/module_guides/loading/node_parsers/modules.md
@@ -57,7 +57,9 @@ parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(markdown_docs)
```

-## Text-Based Node Parsers
+(text_splitters)=
+
+## Text-Splitters

### CodeSplitter

@@ -66,7 +68,7 @@ Splits raw code-text based on the language it is written in.
Check the full list of [supported languages here](https://github.com/grantjenks/py-tree-sitter-languages#license).

```python
-from llama_index.node_parser import CodeSplitter
+from llama_index.text_splitter import CodeSplitter

splitter = CodeSplitter(
language="python",
@@ -94,7 +96,7 @@ nodes = parser.get_nodes_from_documents(documents)
The `SentenceSplitter` attempts to split text while respecting the boundaries of sentences.

```python
-from llama_index.node_parser import SentenceSplitter
+from llama_index.text_splitter import SentenceSplitter

splitter = SentenceSplitter(
chunk_size=1024,
@@ -132,7 +134,7 @@ A full example can be found [here in combination with the `MetadataReplacementNo
The `TokenTextSplitter` attempts to split text to a consistent chunk size according to raw token counts.

```python
-from llama_index.node_parser import TokenTextSplitter
+from llama_index.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(
chunk_size=1024,
4 changes: 3 additions & 1 deletion docs/module_guides/models/embeddings.md
@@ -188,7 +188,9 @@ embeddings = embed_model.get_text_embedding(
)
```

-## Modules
+(list_of_embeddings)=
+
+## List of supported embeddings

We support integrations with OpenAI, Azure, and anything LangChain offers.
