Add delimited term frequency token filter documentation #5043

Merged: 11 commits, Sep 22, 2023

kolchfa-aws
Collaborator

Fixes #4986

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

kolchfa-aws self-assigned this on Sep 18, 2023
kolchfa-aws added the v2.10.0, release-notes, and 3 - Tech review labels on Sep 18, 2023
@kolchfa-aws
Collaborator Author

@noCharger Could you please review this PR for technical accuracy?

Contributor

@noCharger noCharger left a comment

Thank you; this looks fantastic! I was wondering if we could include an example of how this token filter works in conjunction with the `termFreq` method in a Painless script.

Also, I would check with @russcam, who is the author of the original PR.

has_toc: false
---

# Token filters
Contributor

Collaborator Author

@kolchfa-aws kolchfa-aws Sep 19, 2023

I checked against this list and also ran/tested all token filters to verify that they work.

Signed-off-by: Fanit Kolchina <[email protected]>
@kolchfa-aws
Collaborator Author

@russcam Could you review this documentation PR please?

},
"f2": {
"type": "text",
"similarity": "BM25",
Contributor

My bad, this similarity setup is not closely related to this example. I am fine with removing it.

Signed-off-by: Fanit Kolchina <[email protected]>
Contributor

@vagimeli vagimeli left a comment

Great work! Minimal comments/changes.

The following table lists all parameters the `delimited_term_freq` supports.

Parameter | Required/Optional | Description
Contributor

I like how you styled this heading. I'll follow the same format.

`trim` | [TrimFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html) | Trims leading and trailing whitespace from each token in a stream.
`truncate` | [TruncateTokenFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html) | Truncates tokens whose length exceeds the specified character limit.
`unique` | N/A | Ensures each token is unique by removing duplicate tokens from a stream.
`uppercase` | [UpperCaseFilter](https://lucene.apache.org/core/8_7_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to lowercase.
Contributor

Should "lowercase" be "uppercase"?

Collaborator Author

Yes, thank you!

Co-authored-by: Melissa Vagi <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
hdhalter added the 5 - Editorial review label and removed the 3 - Tech review label on Sep 20, 2023
Collaborator

@natebower natebower left a comment

@kolchfa-aws Great job on this. Only minimal changes. Thanks!

```
{% include copy-curl.html %}

The `attributes` array specifies that you want to filter the output of the `explain` parameter to return only `termFrequency`. The response contains both the original token and the parsed output of the token filter that includes term frequency:
Collaborator

Should "the" precede "term frequency"?

```
{% include copy-curl.html %}

In the response, document 1 has a score of 30 because the term frequency of the term `v1` in the field `f2` is 30. Document 2 has a score of 0 because the term `v1` does not appear in `f2`:
Collaborator

Should the first instance of "document" be capitalized?

Collaborator Author

I don't think so because it's not a proper name of the document?
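
To illustrate the scoring behavior discussed in this thread, here is a minimal Python sketch (not OpenSearch or Lucene code) of how a delimited token such as `v1|30` could be split into a term and a term frequency, and how a score equal to that frequency would follow. The function names, the `|` delimiter default, the whitespace tokenization, and the sample field values are assumptions made for illustration; only the `v1`, `f2`, and score-of-30 scenario comes from the documentation text quoted above.

```python
from typing import Tuple


def split_delimited_term_freq(token: str, delimiter: str = "|") -> Tuple[str, int]:
    """Split 'term|freq' at the first delimiter into (term, freq).

    Assumption for this sketch: a token without a delimiter keeps a
    frequency of 1.
    """
    term, sep, freq = token.partition(delimiter)
    if not sep:
        return token, 1
    return term, int(freq)


def term_freq_score(field_value: str, term: str, delimiter: str = "|") -> int:
    """Return the stored frequency of `term` in a whitespace-separated field.

    Simplification: returns the first match, or 0 if the term is absent,
    mirroring a script score that equals the term frequency.
    """
    for raw_token in field_value.split():
        parsed_term, freq = split_delimited_term_freq(raw_token, delimiter)
        if parsed_term == term:
            return freq
    return 0


# Hypothetical field values for two documents (not taken from the PR).
doc1_f2 = "v1|30 v2|5"
doc2_f2 = "v2|7"

print(term_freq_score(doc1_f2, "v1"))  # term present with frequency 30
print(term_freq_score(doc2_f2, "v1"))  # term absent, so the score is 0
```

In this sketch, the first document scores 30 because `v1` carries a frequency of 30 in `f2`, and the second scores 0 because `v1` does not appear there, which matches the explanation in the reviewed text.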

`stemmer` | N/A | Provides algorithmic stemming for the following languages in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`.
`stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
`stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream.
`synonym` | N/A | Supplies a synonym list to the analysis process. The synonym list is provided using a configuration file.
Collaborator

"for" instead of "to"?

`stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
`stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream.
`synonym` | N/A | Supplies a synonym list to the analysis process. The synonym list is provided using a configuration file.
`synonym_graph` | N/A | Supplies a synonym list, including multiword synonyms, to the analysis process.
Collaborator

"for" instead of "to"?

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
kolchfa-aws added the 6 - Done but waiting to merge label and removed the 5 - Editorial review label on Sep 20, 2023
kolchfa-aws merged commit e44a4e7 into main on Sep 22, 2023
5 checks passed
harshavamsi pushed a commit to harshavamsi/documentation-website that referenced this pull request Oct 31, 2023
…roject#5043)

* Add token filter documentation

Signed-off-by: Fanit Kolchina <[email protected]>

* Add delimited term frequency token filter documentation

Signed-off-by: Fanit Kolchina <[email protected]>

* Add phonetic token filter

Signed-off-by: Fanit Kolchina <[email protected]>

* Table format fix

Signed-off-by: Fanit Kolchina <[email protected]>

* Add script example

Signed-off-by: Fanit Kolchina <[email protected]>

* Remove similarity

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

Co-authored-by: Melissa Vagi <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

* Apply suggestions from code review

Signed-off-by: kolchfa-aws <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
vagimeli added a commit that referenced this pull request Dec 21, 2023
Naarcha-AWS deleted the delimited-token-filter branch on March 28, 2024, 23:23
Labels: 6 - Done but waiting to merge, release-notes, v2.10.0
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[DOC] Document delimited term frequency token filter
5 participants