From 3645c4974c72e1e699609df8b9e97f856c666fd0 Mon Sep 17 00:00:00 2001
From: "github-actions[bot]"
Date: Mon, 25 Nov 2024 19:12:46 +0000
Subject: [PATCH] Add predicate token filter docs #8272 (#8279)

* adding predicate token filter docs #8272

Signed-off-by: Anton Rubin

* Doc review

Signed-off-by: Fanit Kolchina

* Apply suggestions from code review

Co-authored-by: Nathan Bower
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: Anton Rubin
Signed-off-by: Fanit Kolchina
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Fanit Kolchina
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower
(cherry picked from commit 3f6fe1c3eb956a181dc28699e5d800f725ebf893)
Signed-off-by: github-actions[bot]
---
 _analyzers/token-filters/index.md                  |  2 +-
 .../token-filters/predicate-token-filter.md        | 82 +++++++++++++++++++
 2 files changed, 83 insertions(+), 1 deletion(-)
 create mode 100644 _analyzers/token-filters/predicate-token-filter.md

diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md
index 10861aaf40..3af92168f7 100644
--- a/_analyzers/token-filters/index.md
+++ b/_analyzers/token-filters/index.md
@@ -49,7 +49,7 @@ Normalization | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache
 `pattern_replace` | N/A | Matches a pattern in the provided regular expression and replaces matching substrings. Uses [Java regular expression syntax](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html).
 `phonetic` | N/A | Uses a phonetic encoder to emit a metaphone token for each token in the token stream. Requires installing the `analysis-phonetic` plugin.
 `porter_stem` | [PorterStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/PorterStemFilter.html) | Uses the [Porter stemming algorithm](https://tartarus.org/martin/PorterStemmer/) to perform algorithmic stemming for the English language.
-`predicate_token_filter` | N/A | Removes tokens that don’t match the specified predicate script. Supports inline Painless scripts only.
+[`predicate_token_filter`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/predicate-token-filter/) | N/A | Removes tokens that do not match the specified predicate script. Supports only inline Painless scripts.
 `remove_duplicates` | [RemoveDuplicatesTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html) | Removes duplicate tokens that are in the same position.
 `reverse` | [ReverseStringFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`.
 `shingle` | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but apply to words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`].
diff --git a/_analyzers/token-filters/predicate-token-filter.md b/_analyzers/token-filters/predicate-token-filter.md
new file mode 100644
index 0000000000..24729f0224
--- /dev/null
+++ b/_analyzers/token-filters/predicate-token-filter.md
@@ -0,0 +1,82 @@
+---
+layout: default
+title: Predicate token filter
+parent: Token filters
+nav_order: 340
+---
+
+# Predicate token filter
+
+The `predicate_token_filter` evaluates whether tokens should be kept or discarded, depending on the conditions defined in a custom script. The tokens are evaluated in the analysis predicate context. This filter supports only inline Painless scripts.
+
+## Parameters
+
+The `predicate_token_filter` has one required parameter: `script`. This parameter provides a condition that is used to evaluate whether the token should be kept.
+
+## Example
+
+The following example request creates a new index named `predicate_index` and configures an analyzer with a `predicate_token_filter`. The filter specifies to only output tokens if they are longer than 7 characters:
+
+```json
+PUT /predicate_index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "my_predicate_filter": {
+          "type": "predicate_token_filter",
+          "script": {
+            "source": "token.term.length() > 7"
+          }
+        }
+      },
+      "analyzer": {
+        "predicate_analyzer": {
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "my_predicate_filter"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /predicate_index/_analyze
+{
+  "text": "The OpenSearch community is growing rapidly",
+  "analyzer": "predicate_analyzer"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+  "tokens": [
+    {
+      "token": "opensearch",
+      "start_offset": 4,
+      "end_offset": 14,
+      "type": "<ALPHANUM>",
+      "position": 1
+    },
+    {
+      "token": "community",
+      "start_offset": 15,
+      "end_offset": 24,
+      "type": "<ALPHANUM>",
+      "position": 2
+    }
+  ]
+}
+```
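
Review note (not part of the patch): the expected response in the new page can be sanity-checked without a running cluster by approximating the analyzer chain in plain Python. This is only a rough sketch under stated assumptions: `keep_token` and `analyze` are hypothetical helper names, the regex `\w+` is a crude stand-in for the `standard` tokenizer, and the Painless predicate `token.term.length() > 7` is mimicked by an ordinary Python function. It does not reflect how OpenSearch implements the filter.

```python
import re


def keep_token(term):
    """Hypothetical stand-in for the Painless predicate `token.term.length() > 7`."""
    return len(term) > 7


def analyze(text):
    """Approximate the documented chain: tokenize, lowercase, then apply the predicate."""
    tokens = []
    # Assumption: \w+ roughly imitates the standard tokenizer for plain English text.
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()  # corresponds to the `lowercase` filter
        if keep_token(term):
            tokens.append({
                "token": term,
                "start_offset": match.start(),
                "end_offset": match.end(),
                "position": position,  # positions of dropped tokens are not reused
            })
    return tokens


result = analyze("The OpenSearch community is growing rapidly")
for tok in result:
    print(tok["token"], tok["start_offset"], tok["end_offset"], tok["position"])
# prints:
# opensearch 4 14 1
# community 15 24 2
```

The words `growing` and `rapidly` are exactly 7 characters long, so the strict `> 7` comparison drops them, which agrees with the two tokens shown in the documented response.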