-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support for 33 language analyzers #779
base: main
Are you sure you want to change the base?
Added support for 33 language analyzers #779
Conversation
2.adding methods for other languages in the Analyzer.java Signed-off-by: brentam <[email protected]>
Signed-off-by: brentam <[email protected]>
Signed-off-by: brentam <[email protected]>
Signed-off-by: brentam <[email protected]>
LGTM! Fix spotless & friends? Would love to have a sample/doc for this, https://github.com/opensearch-project/opensearch-java/tree/main/samples and/or https://github.com/opensearch-project/opensearch-java/tree/main/guides. |
Signed-off-by: brentam <[email protected]>
Yes. I ran the spotlessApply command and fixed the formatting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brentam my apologies but this is not how it is already implemented in opensearch-java
: the language specific analyzers are the combination of org.opensearch.client.opensearch._types.analysis.LanguageAnalyzer
and org.opensearch.client.opensearch._types.analysis.Language
enumeration
@reta LanguageAnalyzer is generating a completely different analyzer than DutchAnalyzer ( Remember, DutchAnalyzer already existed, I just added other analyzers for other languages: GermanAnalyzer, ItalianAnalyzer, etc...). A more detailed explanation follows:In The test below we are deserializing to IndexSettings from a valid index json that includes an analyzer of "type":"dutch" . Debugging, one can see it is using the "existing" DutchAnalyzer class to deserialize the analyzer.
if we serialize it back to json we get we are missing the char_filter but this is a discussion to another time/fix.. The point is, this is a valid analyzer definition in opensearch. We wont use the LanguageAnalyzer to build that analyzer. Here is the code we use to build it
Therefore, I think the current changes are valid. The Code using the LanguageAnalyzer is generating a completely different analyzer...
Generates this json is worth noticing this configuration will fail in opensearch since "type": "language" is not valid: you can test in the dashboard console using this command:
|
@reta I already pointed out that the "type" : "language" generated by the LanguageAnalyzer buider is not valid.
Regardless, the design decision to restrict the language to the enum is suspect, since Opensearch/Elastic allows anything as the value of the language parameter.
Note the "language" : "AlienLanguage" is accepted, so this type of analyzer is not actually creating an analyzer that uses the supported language analyzers. My guess is the "language" field in these analyzers serves as a metadata, has no bearing on the text processing. |
My understanding is that
I sadly don't have any insights, but I believe it appeared exactly for the same cause as yours (no way to specify language analyzer). I am not claiming that the
Should roughly be serialized as (limiting to only
I am not saying that your change is not valid. I am pointing out only the fact that we already do have a mechanism to use language specific analyzers which apparently does not work - that to me is the issue that has to be fixed (vs introducing yet another way to achieve the desired goal). |
@reta So the existing DutchAnalyzer was an obsolete and should have been removed with the introduction of the LanguageAnalyzer... But I have some reservations: The Analyzer.Kind enum contains these values, among others..
The string is supposed to be the value for the json field "type". If using the LanguageAnalyzer we can have two approaches.
Regardless of the approach, this new change will not follow the rest of the design/code used for the other enum values... @reta it seems the LanguageAnalyzer+Language approach was not thought with the current desing in mind, I dont want to spend time on it just to have the changes rejected later on. As I said, it seems to me that design using the Per Language Analyzer seems the correct way since it follows the current design, |
Thanks @brentam
So the current client model reflects the builtin analyzer types as they are described in the documentation [1] (sadly I have to refer to the Elasticsearch 7.10 here as OpenSearch documentation is not yet complete). In there, all language specific analyzers falls under "Language analyzers" category. I think if LanguageAnalyzer+Language are not supposed to configure language specific analyzers, than I honestly don't know what the purpose of those really is. [1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-analyzers.html
I was thinking about this approach: "language" is gone (your 1st point) and "type" becomes the value of language (which represents all languages supported by OpenSearch since it is an enumeration, covers your 2nd point). I think it would fix two issues at once: a) allow to specify language analyzer b) fix incorrect (de)serialization. It is also quite a small change from the scope perspective. What do you think about this? |
@reta Regardess, I am working on some quick changes to enable the LanguageAnalyzer to generate the correct json. I will actually create another PR, so we can discuss the changes there... |
We cannot keep the Language("language") value, the "language" is what is expected from the type in the json |
Description
Describe what this change achieves.
When trying to build
IndexSettings
. There is no way to build default analyzers for languages supported in opensearch.The only supported language is Dutch at the moment.
This is also a problem when deserializing
IndexSettings
from json. It will fail for any language other than Dutch.See this issue 684
In this PR, support for creating language Analyzers was added for these languages:
arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, czech, danish, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, and thai
Issues Resolved
List any issues this PR will resolve, e.g. Closes [...].
Will close this issue: 684
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.