Use ICU tokenizer to improve some Asian languages support #498
base: master
Conversation
@missinglink please take a look when you have a chance 🙏🏻
This would be an interesting change, I'd like to continue investigation.
It will need significant testing since this is an absolutely core function of the system; even small regressions here will be hugely impactful.
It's promising that the integration tests all still pass, pointing to a possibility that this can be implemented in a backwards-compatible way.
I've added a few comments in the review.
Mostly I'd like to better understand the differences between the two tokenizers and any potential regressions/changes to the existing method, and to enumerate the benefits of this work.
As it's a major change, it would be great to list at least a few queries which would improve with this change so other developers understand the value.
@@ -49,6 +49,16 @@ module.exports.tests.functional = function(test, common){
  assertAnalysis( 'place', 'Toys "R" Us!', [ 'toys', 'r', 'us' ]);
  assertAnalysis( 'address', '101 mapzen place', [ '101', 'mapzen', 'place' ]);

  // complicated tokenization for some Asian languages
  assertAnalysis('thai_address1', 'ซอยเพชรบุรี๑', ['ซอย', 'เพชรบุรี1'] );
Can you explain a little more about the segmentation here?
Are these token groups from research/knowledge or just a copy->paste of what was generated?
The number 1 on the end catches my eye and I'm curious how it appeared from ซอยเพชรบุรี๑
'ซอย', 'เพชรบุรี1'
> Are these token groups from research/knowledge or just a copy->paste of what was generated?

Well, I used ChatGPT to generate them and then tools like Google Translate to validate that these tokens indeed have separate meanings... Tbh it would be great to have someone native in one of those languages validate them 🤔 I will try to find one 😄
> The number 1 on the end catches my eye and I'm curious how it appeared from ซอยเพชรบุรี๑

The Thai language has its own digits (but they use Arabic ones as well, btw) and ๑ is 1. Btw, we convert them already even without this tokenizer; I'm not sure at which step though. A good question is whether this number should be a separate token, but IIRC that is how it works right now (if you input `foo1` it will generate a single `foo1` token), so I decided to limit the scope of changes and maybe address it in a separate PR. WDYT?
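For illustration, a minimal sketch of the behaviour described above, written in the style of the existing integration tests (the test label is made up, and the expectation simply restates the claim above rather than a verified result):

```js
// trailing digits stay attached to the preceding token rather than being
// emitted as a separate token, e.g. 'foo1' remains a single token
assertAnalysis('alphanumeric', 'foo1', ['foo1']);
```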
> and ๑ is 1.
This makes sense, it's not an issue, normalizing to Arabic numerals is fine.
And yeah, we should get a human to check a few examples; it's probably best to generate a few more in different classes such as address, venue(s), streets, intersections, localities etc. before asking someone.
This feature will go through a bunch of testing, which might take a long time; flagging potential issues now will be much easier than finding them later. I'm fairly confident we can resolve any issues.
@@ -22,16 +22,16 @@ function generate(){
  "analysis": {
    "tokenizer": {
      "peliasTokenizer": {
        "type": "pattern",
        "pattern": "[\\s,/\\\\-]+"
The previous pattern matched on [whitespace, comma, forward/back slashes and hyphen]; is this still the case with the icu_tokenizer?
Good point, I'll check whether we have relevant tests, and if we don't, I'll write them 👍
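A rough sketch of what such tests could look like, reusing the assertAnalysis helper from the existing integration suite (the test names and expected outputs are assumptions about icu_tokenizer behaviour, not verified results):

```js
// separators handled by the old pattern tokenizer: whitespace, comma,
// forward/back slashes and hyphen should still produce a token split
assertAnalysis('whitespace',    'foo bar',  ['foo', 'bar']);
assertAnalysis('comma',         'foo,bar',  ['foo', 'bar']);
assertAnalysis('forward_slash', 'foo/bar',  ['foo', 'bar']);
assertAnalysis('back_slash',    'foo\\bar', ['foo', 'bar']);
assertAnalysis('hyphen',        'foo-bar',  ['foo', 'bar']);
```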
@@ -175,6 +181,12 @@ function generate(){
  },
},
"filter" : {
I wish this AMPERSANDPLACEHOLDER replacement wasn't necessary but understand why it is.
Possibly stupid question, but does Elastic support the regex word boundary character \b, and is it unicode-aware?
@missinglink thanks for the review! I'll come back with more answers to your questions later.
assertAnalysis('thai_address2', 'ซอยเพชรบุรี๑foo', ['ซอย', 'เพชรบุรี1', 'foo'] );
assertAnalysis('thai_address3', 'บ้านเลขที่๑๒๓ถนนสุขุมวิทแขวงคลองตันเหนือเขตวัฒนา กรุงเทพมหานคร๑๐๑๑๐', ["บาน", "เลข", "ที123ถนน", "สุขุมวิท", "แขวง", "คลองตัน", "เหนือ", "เขต", "วัฒนา", "กรุงเทพมหานคร10110"]);
assertAnalysis('chinese_address', '北京市朝阳区东三环中路1号国际大厦A座1001室',
  ['北京市', '朝阳', '区', '东', '三', '环', '中路', '1', '号', '国际', '大厦', 'a', '座', '1001', '室']);
The correct splitting is:
北京市 - Beijing city
朝阳区 - The district
东三环中路 - East 3rd Ring Middle Road
1号 - Road number
国际大厦 - Building name
a座 - Block number
1001室 - Room number
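For comparison, if the analyzer grouped tokens the way suggested above, the assertion would presumably read as follows (illustrative only, this is not what the PR currently asserts):

```js
assertAnalysis('chinese_address', '北京市朝阳区东三环中路1号国际大厦A座1001室',
  ['北京市', '朝阳区', '东三环中路', '1号', '国际大厦', 'a座', '1001室']);
```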
Full Chinese addresses are usually ...省 ...市 ...区 ...路 ...号 building_name ...座 ...楼 ...室
BTW, if it’s by tokens then they make sense. I feel single characters like '东', '三', '环', '中路' may be too small for search, but it’s not wrong.
integration/analyzer_peliasQuery.js (outdated)
assertAnalysis('thai_address3', 'บ้านเลขที่๑๒๓ถนนสุขุมวิทแขวงคลองตันเหนือเขตวัฒนา กรุงเทพมหานคร๑๐๑๑๐', ["บาน", "เลข", "ที123ถนน", "สุขุมวิท", "แขวง", "คลองตัน", "เหนือ", "เขต", "วัฒนา", "กรุงเทพมหานคร10110"]);
assertAnalysis('chinese_address', '北京市朝阳区东三环中路1号国际大厦A座1001室',
  ['北京市', '朝阳', '区', '东', '三', '环', '中路', '1', '号', '国际', '大厦', 'a', '座', '1001', '室']);
assertAnalysis('japanese_address', '東京都渋谷区渋谷2丁目21−1渋谷スクランブルスクエア4階', ["東京", "都", "渋谷", "区", "渋谷", "2", "丁目", "21", "1", "渋谷", "スクランフル", "スクエア", "4", "階"]);
The correct splitting is:
東京都 - Tokyo city
渋谷区 - District
渋谷2 丁目 - Street
21−1 - Street number
渋谷スクランフルスクエア - Building: Shibuya Scramble Square
4 階 - Floor
Thank you @kenil-cheng for your time.
I've tried to simply extend the original regex with it and got an error (smth about it doesn't know what
// complicated tokenization for some Asian languages
assertAnalysis('thai_address1', 'ซอยเพชรบุรี๑', ['ซอย', 'เพชรบุรี1'] );
assertAnalysis('thai_address2', 'ซอยเพชรบุรี๑foo', ['ซอย', 'เพชรบุรี1', 'foo'] );
assertAnalysis('thai_address3', 'บ้านเลขที่๑๒๓ถนนสุขุมวิทแขวงคลองตันเหนือเขตวัฒนา กรุงเทพมหานคร๑๐๑๑๐', ["บาน", "เลข", "ที123ถนน", "สุขุมวิท", "แขวง", "คลองตัน", "เหนือ", "เขต", "วัฒนา", "กรุงเทพมหานคร10110"]);
My Thai mate suggested
["บ้าน", "เลขที่", "123", "ถนน", "สุขุมวิท", "แขวง", "คลองตันเหนือ", "เขต", "วัฒนา", "กรุงเทพมหานคร", "10110"]
Apart from the different split, it's visible that the first letter was changed from บ้ to บ in your case.
Thanks @dzviagin! I will take a look; likely it is a result of the removal of diacritic symbols, which existed even before this PR...
👋
Disclaimer: I am not a native speaker of Thai, Chinese, Japanese or any of the languages I am adding support for here, but this change still seems reasonable.
Here's the reason for this change 🚀
At the moment we use a simple regex-based tokenizer which splits strings using whitespace, commas, slashes and dashes as separators (`[\s,/\\-]+`). While it works perfectly for most Western languages, it fails for some Asian ones, mainly because they don't actually use spaces to separate words (!). This change addresses that by using the icu_tokenizer, which uses more sophisticated techniques to solve it.

Worth saying that this solution is still not perfect; e.g. for Thai there is a separate thai_tokenizer (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-thai-tokenizer.html), but the ICU one still works better than what we are using right now...

Here's what actually got changed 👏
Here's how others can test the changes 👀
I added a couple of integration tests with examples of queries which were tokenized as single token in the past.
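For reviewers who want to poke at the difference directly, here is a rough sketch of comparing the two tokenizers via Elasticsearch's _analyze API, assuming an instance with the analysis-icu plugin is running on localhost:9200 (the script itself is illustrative and not part of this PR):

```js
// compares the previous pattern tokenizer with icu_tokenizer on a given input
// using the _analyze endpoint (Node 18+ for the built-in fetch)
const tokenizers = [
  { type: 'pattern', pattern: '[\\s,/\\\\-]+' }, // previous peliasTokenizer
  'icu_tokenizer'                                // proposed replacement
];

async function compare (text) {
  for (const tokenizer of tokenizers) {
    const res = await fetch('http://localhost:9200/_analyze', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ tokenizer, text })
    });
    const { tokens } = await res.json();
    console.log(JSON.stringify(tokenizer), tokens.map(t => t.token));
  }
}

// with the pattern tokenizer this string has no separators and comes back as a
// single token; icu_tokenizer segments it (digit folding happens later in the
// analyzer's token filters, so the bare tokenizer still emits ๑ rather than 1)
compare('ซอยเพชรบุรี๑');
```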