Add delimited term frequency token filter documentation #5043
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
@noCharger Could you please review this PR for technical accuracy?
Signed-off-by: Fanit Kolchina <[email protected]>
…h-project/documentation-website into delimited-token-filter
Signed-off-by: Fanit Kolchina <[email protected]>
has_toc: false
---

# Token filters
This is amazing! I would like to double check this list based on what OpenSearch supports https://github.com/opensearch-project/OpenSearch/tree/2a5b124ee8ef4376d62c484b6cd3ea1d98ca75d1/modules/analysis-common/src/main/java/org/opensearch/analysis/common
I checked against this list and also ran/tested all token filters to verify that they work.
Signed-off-by: Fanit Kolchina <[email protected]>
@russcam Could you review this documentation PR please?
},
"f2": {
  "type": "text",
  "similarity": "BM25",
My bad, this `similarity` setup is not closely related to the context of this example. I am fine with removing it.
Signed-off-by: Fanit Kolchina <[email protected]>
Great work! Minimal comments/changes.

The following table lists all parameters the `delimited_term_freq` supports.

Parameter | Required/Optional | Description
I like how you styled this heading. I'll follow the same format.
_analyzers/token-filters/index.md
Outdated
`trim` | [TrimFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html) | Trims leading and trailing whitespace from each token in a stream.
`truncate` | [TruncateTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens whose length exceeds the specified character limit.
`unique` | N/A | Ensures each token is unique by removing duplicate tokens from a stream.
`uppercase` | [UpperCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to lowercase.
Should "lowercase" be "uppercase"?
Yes, thank you!
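For readers following this thread, here is a rough Python sketch of the behavior that the four table rows above describe once the `uppercase` description is corrected. It is an illustration only, not OpenSearch or Lucene code: the function names and the `length` argument are hypothetical, and real filters operate on token streams with positions and offsets rather than plain lists.

```python
from typing import List


def trim(tokens: List[str]) -> List[str]:
    """Remove leading and trailing whitespace from each token."""
    return [t.strip() for t in tokens]


def truncate(tokens: List[str], length: int) -> List[str]:
    """Cut tokens that are longer than `length` characters (hypothetical parameter name)."""
    return [t[:length] for t in tokens]


def unique(tokens: List[str]) -> List[str]:
    """Drop duplicate tokens, keeping the first occurrence of each."""
    seen = set()
    result = []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            result.append(t)
    return result


def uppercase(tokens: List[str]) -> List[str]:
    """Convert each token to uppercase."""
    return [t.upper() for t in tokens]


# Chain the sketches in the order trim -> truncate -> unique -> uppercase.
stream = [" fox ", "fox", "jumping"]
result = uppercase(unique(truncate(trim(stream), 4)))
```

Chaining the functions this way mirrors how an analyzer applies its token filters one after another, each consuming the previous filter's output.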
Co-authored-by: Melissa Vagi <[email protected]> Signed-off-by: kolchfa-aws <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
@kolchfa-aws Great job on this. Only minimal changes. Thanks!
```
{% include copy-curl.html %}

The `attributes` array specifies that you want to filter the output of the `explain` parameter to return only `termFrequency`. The response contains both the original token and the parsed output of the token filter that includes term frequency:
Should "the" precede "term frequency"?
```
{% include copy-curl.html %}

In the response, document 1 has a score of 30 because the term frequency of the term `v1` in the field `f2` is 30. Document 2 has a score of 0 because the term `v1` does not appear in `f2`:
Should the first instance of "document" be capitalized?
I don't think so because it's not a proper name of the document?
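To make the scoring explanation in this thread concrete, here is a minimal Python sketch. It assumes tokens are written as `term|frequency`, that a token without a delimiter counts once, and that the score is simply the stored term frequency. The `|` delimiter, the function name, and the sample documents are assumptions for illustration and are not taken from the PR.

```python
from typing import Dict, List


def term_frequency(tokens: List[str], term: str, delimiter: str = "|") -> int:
    """Sum the frequencies attached to `term` in a list of 'term<delimiter>freq' tokens."""
    total = 0
    for token in tokens:
        # Split on the first delimiter; `sep` is empty when no delimiter is present.
        name, sep, freq = token.partition(delimiter)
        if name == term:
            total += int(freq) if sep else 1
    return total


# Hypothetical documents: field "f2" holds delimited tokens.
docs: Dict[str, Dict[str, List[str]]] = {
    "1": {"f2": ["v1|30", "v2|1"]},
    "2": {"f2": ["v2|15"]},
}

# Score each document by the term frequency of "v1" in "f2".
scores = {doc_id: term_frequency(fields["f2"], "v1") for doc_id, fields in docs.items()}
```

With these sample documents, document 1 scores 30 and document 2 scores 0, which matches the pattern described in the reviewed sentence.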
_analyzers/token-filters/index.md
Outdated
`stemmer` | N/A | Provides algorithmic stemming for the following languages in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `Lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`.
`stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
`stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream.
`synonym` | N/A | Supplies a synonym list to the analysis process. The synonym list is provided using a configuration file.
"for" instead of "to"?
_analyzers/token-filters/index.md
Outdated
`stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
`stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream.
`synonym` | N/A | Supplies a synonym list to the analysis process. The synonym list is provided using a configuration file.
`synonym_graph` | N/A | Supplies a synonym list, including multiword synonyms, to the analysis process.
"for" instead of "to"?
Co-authored-by: Nathan Bower <[email protected]> Signed-off-by: kolchfa-aws <[email protected]>
…roject#5043)
* Add token filter documentation
Signed-off-by: Fanit Kolchina <[email protected]>
* Add delimited term frequency token filter documentation
Signed-off-by: Fanit Kolchina <[email protected]>
* Add phonetic token filter
Signed-off-by: Fanit Kolchina <[email protected]>
* Table format fix
Signed-off-by: Fanit Kolchina <[email protected]>
* Add script example
Signed-off-by: Fanit Kolchina <[email protected]>
* Remove similarity
Signed-off-by: Fanit Kolchina <[email protected]>
* Apply suggestions from code review
Co-authored-by: Melissa Vagi <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
* Apply suggestions from code review
Signed-off-by: kolchfa-aws <[email protected]>
* Apply suggestions from code review
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
---------
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Fixes #4986
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.