
Add example to text chunking processor documentation #6794

Merged
correct example
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
yuye-aws committed Mar 29, 2024
commit 5b840f0e2e83badeb8af61b278477477eeac886d
62 changes: 13 additions & 49 deletions _search-plugins/text-chunking.md
@@ -4,22 +4,20 @@
nav_order: 65
---

# Text chunking
# Chaining text chunking and embedding processors

Introduced 2.13
{: .label .label-purple }

To split long text into passages, use the `text_chunking` ingest processor before the `text_embedding` or `sparse_encoding` processors. For more information about the processor parameters, see [Text chunking processor]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/).

The following example preprocesses text by splitting it into passages and then produces embeddings using the `sparse_encoding` processor.
To split long text into passages, you can use a `text_chunking` processor as a preprocessing step for a `text_embedding` or `sparse_encoding` processor in order to obtain embeddings for each chunked passage. For more information about the processor parameters, see [Text chunking processor]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/). Before you start, follow the steps outlined in the [pretrained model documentation]({{site.url}}{{site.baseurl}}/ml-commons-plugin/pretrained-models/) to register an embedding model. The following example preprocesses text by splitting it into passages and then produces embeddings using the `text_embedding` processor.
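Registering a pretrained embedding model can be sketched with the ML Commons register API. The model name and version below are illustrative; consult the linked pretrained model documentation for the models your cluster supports:

```json
POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
  "version": "1.0.2",
  "model_format": "TORCH_SCRIPT"
}
```

The response contains a `task_id`; once the task completes, retrieve the resulting `model_id` (for example, with `GET /_plugins/_ml/tasks/<task_id>`) and supply it as the `model_id` of the `text_embedding` processor.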

## Step 1: Create a pipeline

The following request creates an ingest pipeline that converts the text in the `passage_text` field into chunked passages, which will be stored in the `passage_chunk` field:
The following example request creates an ingest pipeline that converts the text in the `passage_text` field into chunked passages, which will be stored in the `passage_chunk` field. The text in the `passage_chunk` field is then converted into text embeddings, and the embeddings are stored in the `passage_embedding` field:

```json
PUT _ingest/pipeline/text-chunking-ingest-pipeline
PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline
{
"description": "A text chunking ingest pipeline",
"description": "A text chunking and embedding ingest pipeline",
"processors": [
{
"text_chunking": {
@@ -34,6 +32,14 @@
"passage_text": "passage_chunk"
}
}
},
{
"text_embedding": {
"model_id": "LMLPWY4BROvhdbtgETaI",
"field_map": {
"passage_chunk": "passage_chunk_embedding"
}
}
}
]
}
@@ -109,48 +115,6 @@
```
{% include copy-curl.html %}

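Before indexing documents, you can verify the chunking and embedding behavior with the Simulate Pipeline API. The following sketch assumes the pipeline name from the example above and uses a hypothetical test document:

```json
POST _ingest/pipeline/text-chunking-embedding-ingest-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
      }
    }
  ]
}
```

The simulated response should show the chunked passages in the `passage_chunk` field and an embedding for each chunk in the `passage_chunk_embedding` field.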

## Chaining text chunking and embedding processors

You can use a `text_chunking` processor as a preprocessing step for a `text_embedding` or `sparse_encoding` processor in order to obtain embeddings for each chunked passage.

Before you start, follow the steps outlined in the [pretrained model documentation]({{site.url}}{{site.baseurl}}/ml-commons-plugin/pretrained-models/) to register an embedding model.

The following example request creates an ingest pipeline that converts the text in the `passage_text` field into chunked passages, which will be stored in the `passage_chunk` field. The text in the `passage_chunk` field is then converted into text embeddings, and the embeddings are stored in the `passage_embedding` field:

```json
PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline
{
"description": "A text chunking and embedding ingest pipeline",
"processors": [
{
"text_chunking": {
"algorithm": {
"fixed_token_length": {
"token_limit": 10,
"overlap_rate": 0.2,
"tokenizer": "standard"
}
},
"field_map": {
"passage_text": "passage_chunk"
}
}
},
{
"text_embedding": {
"model_id": "LMLPWY4BROvhdbtgETaI",
"field_map": {
"passage_chunk": "passage_chunk_embedding"
}
}
}
]
}
```
{% include copy-curl.html %}


## Cascaded text chunking processors

You can chain multiple chunking processors together. For example, to split documents into paragraphs, apply the `delimiter` algorithm and specify the parameter as `\n\n`. To prevent a paragraph from exceeding the token limit, append another chunking processor that uses the `fixed_token_length` algorithm. You can configure the ingest pipeline for this example as follows:
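The cascade described above can be sketched as follows. The pipeline name and the intermediate field names (`passage_chunk1`, `passage_chunk2`) are illustrative, and the `token_limit` value is a placeholder you should tune for your embedding model:

```json
PUT _ingest/pipeline/text-chunking-cascade-ingest-pipeline
{
  "description": "A text chunking pipeline with cascaded chunking processors",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "delimiter": {
            "delimiter": "\n\n"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk1"
        }
      }
    },
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 500,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_chunk1": "passage_chunk2"
        }
      }
    }
  ]
}
```

The first processor splits each document into paragraphs on the `\n\n` delimiter; the second re-chunks any paragraph that exceeds the token limit, so every final passage in `passage_chunk2` stays within bounds.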