
Doc review
Signed-off-by: Fanit Kolchina <[email protected]>
kolchfa-aws committed Dec 5, 2024
1 parent a331379 commit 2daa567
Showing 1 changed file with 8 additions and 8 deletions.
16 changes: 8 additions & 8 deletions _analyzers/tokenizers/ngram.md

# N-gram tokenizer

-The `ngram` tokenizer split text into overlapping n-grams (sequences of characters) of a specified length. This tokenizer is particularly useful when you want to perform partial word matching or autocomplete search functionality, as it generates substrings (character n-grams) of the original input text.
+The `ngram` tokenizer splits text into overlapping n-grams (sequences of characters) of a specified length. This tokenizer is particularly useful when you want to perform partial word matching or autocomplete search functionality because it generates substrings (character n-grams) of the original input text.
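For example, with the default `min_gram` of `1` and `max_gram` of `2`, the word `cat` yields the grams `c`, `ca`, `a`, `at`, and `t`.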

## Example usage

-The following example request creates a new index named `my_index` and configures an analyzer with `ngram` tokenizer:
+The following example request creates a new index named `my_index` and configures an analyzer with an `ngram` tokenizer:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}
```
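In this sketch, `my_ngram_analyzer` and `my_ngram_tokenizer` are illustrative names; `min_gram: 3` with `max_gram: 4` keeps the gram-length spread at `1`, the default allowed by `index.max_ngram_diff`, and `token_chars` limits grams to letters and digits.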

## Generated tokens

-Use the following request to examine the tokens generated using the created analyzer:
+Use the following request to examine the tokens generated using the analyzer:

```json
POST /my_index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "Search"
}
```

The response contains the generated tokens:

```json
{
  "tokens": [
    { "token": "Sea", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "Sear", "start_offset": 0, "end_offset": 4, "type": "word", "position": 1 },
    { "token": "ear", "start_offset": 1, "end_offset": 4, "type": "word", "position": 2 },
    { "token": "earc", "start_offset": 1, "end_offset": 5, "type": "word", "position": 3 },
    { "token": "arc", "start_offset": 2, "end_offset": 5, "type": "word", "position": 4 },
    { "token": "arch", "start_offset": 2, "end_offset": 6, "type": "word", "position": 5 },
    { "token": "rch", "start_offset": 3, "end_offset": 6, "type": "word", "position": 6 }
  ]
}
```
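The sample text `Search` is assumed here for illustration. With `min_gram: 3` and `max_gram: 4`, the tokenizer emits every 3- and 4-character substring of the input, ordered by starting offset, which is what the token list above shows.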

-## Configuration
+## Parameters

The `ngram` tokenizer can be configured with the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`min_gram` | Optional | Integer | Minimum length of n-grams. Default is `1`.
`max_gram` | Optional | Integer | Maximum length of n-grams. Default is `2`.
-`token_chars` | Optional | List of strings | Character classes to be included in tokenization. The following are the possible options:<br>- `letter`<br>- `digit`<br>- `whitespace`<br>- `punctuation`<br>- `symbol`<br>- `custom` (Parameter `custom_token_chars` needs to also be configured in this case)<br>Default is empty list (`[]`) which retains all the characters
-`custom_token_chars` | Optional | String | Custom characters that will be included as part of the tokens.
+`token_chars` | Optional | List of strings | Character classes to be included in tokenization. Valid values are:<br>- `letter`<br>- `digit`<br>- `whitespace`<br>- `punctuation`<br>- `symbol`<br>- `custom` (You must also specify the `custom_token_chars` parameter)<br>Default is an empty list (`[]`), which retains all characters.
+`custom_token_chars` | Optional | String | Custom characters to be included as part of the tokens.
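
For example, the following request (an illustrative sketch; the index and tokenizer names are placeholders) keeps letters, digits, and the hyphen in generated grams by combining the `custom` character class with `custom_token_chars`:

```json
PUT /my_hyphen_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "hyphen_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": ["letter", "digit", "custom"],
          "custom_token_chars": "-"
        }
      }
    }
  }
}
```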

### Maximum difference between `min_gram` and `max_gram`

-The maximum difference between `min_gram` and `max_gram` is configured using index level setting `index.max_ngram_diff` and defaults to `1`.
+The maximum difference between `min_gram` and `max_gram` is configured using the index-level `index.max_ngram_diff` setting and defaults to `1`.

-The following example creates index with custom `index.max_ngram_diff` setting:
+The following example creates an index with a custom `index.max_ngram_diff` setting:

```json
PUT /my-index
{
  "settings": {
    "index.max_ngram_diff": 2
  }
}
```
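The value `2` is an assumed example: any value greater than the default `1` widens the permitted spread, allowing, for instance, a tokenizer with `min_gram: 3` and `max_gram: 5` on this index.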
