From 2c7d53cc56f175ecb602d4a6528d565d00d77a32 Mon Sep 17 00:00:00 2001
From: Anton Rubin
Date: Tue, 15 Oct 2024 13:00:31 +0100
Subject: [PATCH 1/4] add fingerprint analyzer docs

Signed-off-by: Anton Rubin
---
 _analyzers/fingerprint.md | 112 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)
 create mode 100644 _analyzers/fingerprint.md

diff --git a/_analyzers/fingerprint.md b/_analyzers/fingerprint.md
new file mode 100644
index 0000000000..2e528eb593
--- /dev/null
+++ b/_analyzers/fingerprint.md
@@ -0,0 +1,112 @@
+---
+layout: default
+title: Fingerprint analyzer
+nav_order: 110
+---
+
+# Fingerprint analyzer
+
+The `fingerprint` analyzer is designed to create a fingerprint of text. This analyzer sorts and deduplicates the terms (tokens) generated from the input, and then concatenates them using a separator. It is commonly used for data deduplication, as it produces the same output for similar inputs with the same words, regardless of the order of the words.
+
+The `fingerprint` analyzer is comprised of the following components:
+
+- Standard tokenizer
+- Lowercase token filter
+- ASCII folding token filter
+- Stop token filter
+- Fingerprint token filter
+
+## Configuration
+
+The `fingerprint` analyzer can be configured using the following parameters:
+
+- `separator`: Specifies the character used to concatenate the terms after they have been tokenized, sorted, and deduplicated. Default is empty space (` `). (String, _Optional_)
+- `max_output_size`: Defines the maximum size of the output token. If the concatenated fingerprint exceeds this size, it will be truncated. Default is `255`. (Integer, _Optional_)
+- `stopwords`: A custom list or predefined list of stop words. Default is `_none_`. (String or list of strings, _Optional_)
+- `stopwords_path`: Path (absolute of relative to config directory) to the list of stop words. (String, _Optional_)
+
+
+## Example configuration
+
+You can use the following command to create index `my_custom_fingerprint_index` with `fingerprint` analyzer:
+
+```json
+PUT /my_custom_fingerprint_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_custom_fingerprint_analyzer": {
+          "type": "fingerprint",
+          "separator": "-",
+          "max_output_size": 50,
+          "stopwords": ["to", "the", "over", "and"]
+        }
+      }
+    }
+  },
+  "mappings": {
+    "properties": {
+      "my_field": {
+        "type": "text",
+        "analyzer": "my_custom_fingerprint_analyzer"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the created analyzer:
+
+```json
+POST /my_custom_fingerprint_index/_analyze
+{
+  "analyzer": "my_custom_fingerprint_analyzer",
+  "text": "The slow turtle swims over to the dog"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+  "tokens": [
+    {
+      "token": "dog-slow-swims-turtle",
+      "start_offset": 0,
+      "end_offset": 37,
+      "type": "fingerprint",
+      "position": 0
+    }
+  ]
+}
+```
+
+## Further customization
+
+If further customization is needed, you can define an analyzer with the components that make up the `fingerprint` analyzer using the following example:
+
+```json
+PUT /custom_fingerprint_analyzer
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "custom_fingerprint": {
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "asciifolding",
+            "fingerprint"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}

From eb782196656cb44ffd70d66b4e42b496df9d1f7b Mon Sep 17 00:00:00 2001
From: Anton Rubin
Date: Wed, 16 Oct 2024 16:39:25 +0100
Subject: [PATCH 2/4] updating parameter table

Signed-off-by: Anton Rubin
---
 _analyzers/fingerprint.md | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/_analyzers/fingerprint.md b/_analyzers/fingerprint.md
index 2e528eb593..9f72b9a2ef 100644
--- a/_analyzers/fingerprint.md
+++ b/_analyzers/fingerprint.md
@@ -18,12 +18,14 @@ The `fingerprint` analyzer is comprised of the following components:
 
 ## Configuration
 
-The `fingerprint` analyzer can be configured using the following parameters:
-
-- `separator`: Specifies the character used to concatenate the terms after they have been tokenized, sorted, and deduplicated. Default is empty space (` `). (String, _Optional_)
-- `max_output_size`: Defines the maximum size of the output token. If the concatenated fingerprint exceeds this size, it will be truncated. Default is `255`. (Integer, _Optional_)
-- `stopwords`: A custom list or predefined list of stop words. Default is `_none_`. (String or list of strings, _Optional_)
-- `stopwords_path`: Path (absolute of relative to config directory) to the list of stop words. (String, _Optional_)
+The `fingerprint` analyzer can be configured using the following parameters.
+
+Parameter | Required/Optional | Data type | Description
+:--- | :--- | :--- | :---
+`separator` | Optional | String | Specifies the character used to concatenate the terms after they have been tokenized, sorted, and deduplicated. Default is empty space (` `).
+`max_output_size` | Optional | Integer | Defines the maximum size of the output token. If the concatenated fingerprint exceeds this size, it will be truncated. Default is `255`.
+`stopwords` | Optional | String or list of strings | A custom list or predefined list of stop words. Default is `_none_`.
+`stopwords_path` | Optional | String | Path (absolute of relative to config directory) to the list of stop words.
 
 
 ## Example configuration

From cb8bfda13972f77364b02cdbd8edfb26f6d8129a Mon Sep 17 00:00:00 2001
From: Fanit Kolchina
Date: Fri, 6 Dec 2024 13:16:35 -0500
Subject: [PATCH 3/4] Doc review

Signed-off-by: Fanit Kolchina
---
 _analyzers/fingerprint.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/_analyzers/fingerprint.md b/_analyzers/fingerprint.md
index 9f72b9a2ef..8e2ab236de 100644
--- a/_analyzers/fingerprint.md
+++ b/_analyzers/fingerprint.md
@@ -6,7 +6,7 @@ nav_order: 110
 
 # Fingerprint analyzer
 
-The `fingerprint` analyzer is designed to create a fingerprint of text. This analyzer sorts and deduplicates the terms (tokens) generated from the input, and then concatenates them using a separator. It is commonly used for data deduplication, as it produces the same output for similar inputs with the same words, regardless of the order of the words.
+The `fingerprint` analyzer is designed to create a fingerprint of text. This analyzer sorts and deduplicates the terms (tokens) generated from the input, and then concatenates them using a separator. It is commonly used for data deduplication because it produces the same output for similar inputs containing the same words, regardless of the order of the words.
 
 The `fingerprint` analyzer is comprised of the following components:
 
@@ -16,7 +16,7 @@ The `fingerprint` analyzer is comprised of the following components:
 - Stop token filter
 - Fingerprint token filter
 
-## Configuration
+## Parameters
 
 The `fingerprint` analyzer can be configured using the following parameters.
 
@@ -24,13 +24,13 @@ Parameter | Required/Optional | Data type | Description
 :--- | :--- | :--- | :---
 `separator` | Optional | String | Specifies the character used to concatenate the terms after they have been tokenized, sorted, and deduplicated. Default is empty space (` `).
 `max_output_size` | Optional | Integer | Defines the maximum size of the output token. If the concatenated fingerprint exceeds this size, it will be truncated. Default is `255`.
-`stopwords` | Optional | String or list of strings | A custom list or predefined list of stop words. Default is `_none_`.
-`stopwords_path` | Optional | String | Path (absolute of relative to config directory) to the list of stop words.
+`stopwords` | Optional | String or list of strings | A custom list or predefined list of stopwords. Default is `_none_`.
+`stopwords_path` | Optional | String | The path (absolute of relative to the config directory) to the list of stopwords.
 
 
-## Example configuration
+## Example
 
-You can use the following command to create index `my_custom_fingerprint_index` with `fingerprint` analyzer:
+Use the following command to create an index named `my_custom_fingerprint_index` with a `fingerprint` analyzer:
 
 ```json
 PUT /my_custom_fingerprint_index
@@ -61,7 +61,7 @@ PUT /my_custom_fingerprint_index
 
 ## Generated tokens
 
-Use the following request to examine the tokens generated using the created analyzer:
+Use the following request to examine the tokens generated using the analyzer:
 
 ```json
 POST /my_custom_fingerprint_index/_analyze
@@ -90,7 +90,7 @@ The response contains the generated tokens:
 
 ## Further customization
 
-If further customization is needed, you can define an analyzer with the components that make up the `fingerprint` analyzer using the following example:
+If further customization is needed, you can define an analyzer with additional components that make up the `fingerprint` analyzer:
 
 ```json
 PUT /custom_fingerprint_analyzer

From 188a011324d20993611d24ae2984ed53d1136f01 Mon Sep 17 00:00:00 2001
From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Date: Tue, 10 Dec 2024 10:57:14 -0500
Subject: [PATCH 4/4] Apply suggestions from code review

Co-authored-by: Nathan Bower
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
---
 _analyzers/fingerprint.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/_analyzers/fingerprint.md b/_analyzers/fingerprint.md
index 8e2ab236de..dd8027f037 100644
--- a/_analyzers/fingerprint.md
+++ b/_analyzers/fingerprint.md
@@ -6,9 +6,9 @@ nav_order: 110
 
 # Fingerprint analyzer
 
-The `fingerprint` analyzer is designed to create a fingerprint of text. This analyzer sorts and deduplicates the terms (tokens) generated from the input, and then concatenates them using a separator. It is commonly used for data deduplication because it produces the same output for similar inputs containing the same words, regardless of the order of the words.
+The `fingerprint` analyzer creates a text fingerprint. The analyzer sorts and deduplicates the terms (tokens) generated from the input and then concatenates them using a separator. It is commonly used for data deduplication because it produces the same output for similar inputs containing the same words, regardless of word order.
 
-The `fingerprint` analyzer is comprised of the following components:
+The `fingerprint` analyzer comprises the following components:
 
 - Standard tokenizer
 - Lowercase token filter
@@ -18,14 +18,14 @@ The `fingerprint` analyzer comprises the following components:
 
 ## Parameters
 
-The `fingerprint` analyzer can be configured using the following parameters.
+The `fingerprint` analyzer can be configured with the following parameters.
 
 Parameter | Required/Optional | Data type | Description
 :--- | :--- | :--- | :---
-`separator` | Optional | String | Specifies the character used to concatenate the terms after they have been tokenized, sorted, and deduplicated. Default is empty space (` `).
+`separator` | Optional | String | Specifies the character used to concatenate the terms after they have been tokenized, sorted, and deduplicated. Default is an empty space (` `).
 `max_output_size` | Optional | Integer | Defines the maximum size of the output token. If the concatenated fingerprint exceeds this size, it will be truncated. Default is `255`.
-`stopwords` | Optional | String or list of strings | A custom list or predefined list of stopwords. Default is `_none_`.
-`stopwords_path` | Optional | String | The path (absolute of relative to the config directory) to the list of stopwords.
+`stopwords` | Optional | String or list of strings | A custom or predefined list of stopwords. Default is `_none_`.
+`stopwords_path` | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stopwords.
 
 
 ## Example
@@ -90,7 +90,7 @@ The response contains the generated tokens:
 
 ## Further customization
 
-If further customization is needed, you can define an analyzer with additional components that make up the `fingerprint` analyzer:
+If further customization is needed, you can define an analyzer with additional `fingerprint` analyzer components:
 
 ```json
 PUT /custom_fingerprint_analyzer
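---

Note for reviewers (not part of the patch series): the pipeline documented above can be approximated outside OpenSearch with a small Python sketch. This is a hypothetical illustration, not OpenSearch code: a regex word split stands in for the standard tokenizer, and the defaults hardcode the example's custom settings (`-` separator, the four example stopwords, `max_output_size` of 50) rather than the analyzer's real defaults (space separator, `_none_`, `255`).

```python
import re

def fingerprint(text, separator="-", max_output_size=50,
                stopwords=("to", "the", "over", "and")):
    """Approximate the documented fingerprint analyzer: tokenize and
    lowercase, drop stopwords, sort, deduplicate, join with the
    separator, then truncate to max_output_size."""
    # Rough stand-in for the standard tokenizer + lowercase filter.
    tokens = re.findall(r"\w+", text.lower())
    # Stop filter, then sort + deduplicate (the fingerprint step).
    tokens = sorted(set(t for t in tokens if t not in stopwords))
    return separator.join(tokens)[:max_output_size]

print(fingerprint("The slow turtle swims over to the dog"))
# dog-slow-swims-turtle
```

Running it on the example sentence reproduces the `dog-slow-swims-turtle` token shown in the "Generated tokens" response, which may help when sanity-checking the docs.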