Use ICU tokenizer to improve some Asian languages support #498

Open · wants to merge 3 commits into base: master (showing changes from 1 commit)
integration/analyzer_peliasIndexOneEdgeGram.js (5 additions, 0 deletions)

@@ -85,6 +85,11 @@ module.exports.tests.analyze = function(test, common){

assertAnalysis( 'british_american_english', 'town theatre', ['0:town', '1:theatre', '1:theater'] );
assertAnalysis( 'british_american_english', 'town theater', ['0:town', '1:theater', '1:theatre'] );

assertAnalysis('thai_address', 'ซอยเพชรบุรี๑foo', [
'0:ซ', '0:ซอ', '0:ซอย',
'1:เพชรบุรี1', '1:เพชรบุรี', '1:เพชรบุร', '1:เพชรบุ', '1:เพชรบ', '1:เพชร', '1:เพช', '1:เพ', '1:เ',
'2:f', '2:fo', '2:foo'] );

suite.run( t.end );
});
integration/analyzer_peliasQuery.js (10 additions, 0 deletions)

@@ -49,6 +49,16 @@ module.exports.tests.functional = function(test, common){
assertAnalysis( 'place', 'Toys "R" Us!', [ 'toys', 'r', 'us' ]);
assertAnalysis( 'address', '101 mapzen place', [ '101', 'mapzen', 'place' ]);

// complicated tokenization for some Asian languages
assertAnalysis('thai_address1', 'ซอยเพชรบุรี๑', ['ซอย', 'เพชรบุรี1'] );
Member commented:
Can you explain a little more about the segmentation here?

Are these token groups from research/knowledge or just a copy->paste of what was generated?

The number 1 on the end catches my eye and I'm curious how it appeared from ซอยเพชรบุรี๑

'ซอย', 'เพชรบุรี1'

Contributor (Author) replied:

> Are these token groups from research/knowledge or just a copy->paste of what was generated?

Well, I used ChatGPT to generate them and then tools like Google Translate to validate that these tokens indeed have separate meanings... Tbh it would be great to have someone native in one of those languages validate them 🤔 I will try to find someone 😄

> The number 1 on the end catches my eye and I'm curious how it appeared from ซอยเพชรบุรี๑

The Thai language has its own digits (though they use Arabic ones as well), and ๑ is 1. Btw, we convert them already even without this tokenizer; I'm not sure at which step, though. A good question is whether this number should be a separate token, but IIRC this is how it works right now (if you input foo1 it will generate a single foo1 token), so I decided to limit the scope of changes and maybe address it in a separate PR. WDYT?

Member replied:

> and ๑ is 1.

This makes sense, it's not an issue, normalizing to Arabic numerals is fine.
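A minimal sketch of how this digit normalization could be pinned down with the suite's assertAnalysis helper; the expected tokens here are assumptions based on the assertions in this diff, not verified analyzer output:

```js
// Hypothetical regression test: Thai digits should normalize to Arabic digits
// regardless of the tokenizer in use. Expected tokens are illustrative.
assertAnalysis('thai_digits', '๑๒๓', ['123']);
assertAnalysis('thai_digit_suffix', 'เพชรบุรี๑', ['เพชรบุรี1']);
```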

missinglink (Member) commented on Nov 27, 2024:

And yeah, we should get a human to check a few examples. It's probably best to generate a few more before asking someone, across different classes such as addresses, venue(s), streets, intersections, localities etc.
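One possible way to generate such samples, sketched under the assumption of a running Elasticsearch instance built with this schema; the host, index name ('pelias'), and sample strings are placeholders:

```js
// Hypothetical helper: print token output for a batch of sample inputs so a
// native speaker can review the segmentation. The endpoint and index name are
// assumptions; the _analyze API itself is standard Elasticsearch.
const samples = ['ซอยเพชรบุรี๑', '北京市朝阳区东三环中路1号国际大厦A座1001室'];

(async () => {
  for (const text of samples) {
    const res = await fetch('http://localhost:9200/pelias/_analyze', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ analyzer: 'peliasQuery', text })
    });
    const { tokens } = await res.json();
    console.log(text, '->', tokens.map(t => t.token).join(' | '));
  }
})();
```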

missinglink (Member) commented on Nov 27, 2024:

This feature will go through a bunch of testing, which might take a long time; flagging potential issues now will be much easier than finding them later. I'm fairly confident we can resolve any issues.

assertAnalysis('thai_address2', 'ซอยเพชรบุรี๑foo', ['ซอย', 'เพชรบุรี1', 'foo'] );
assertAnalysis('thai_address3', 'บ้านเลขที่๑๒๓ถนนสุขุมวิทแขวงคลองตันเหนือเขตวัฒนา กรุงเทพมหานคร๑๐๑๑๐', ["บาน", "เลข", "ที123ถนน", "สุขุมวิท", "แขวง", "คลองตัน", "เหนือ", "เขต", "วัฒนา", "กรุงเทพมหานคร10110"]);
assertAnalysis('chinese_address', '北京市朝阳区东三环中路1号国际大厦A座1001室',
['北京市', '朝阳', '区', '东', '三', '环', '中路', '1', '号', '国际', '大厦', 'a', '座', '1001', '室']);
kenil-cheng commented on Nov 27, 2024:

The correct splitting is:
北京市 - Beijing city
朝阳区 - The district
东三环中路 - East 3rd Ring Middle Road
1号 - Road number
国际大厦 - Building name
a座 - Block number
1001室 - Room number

Follow-up comment:

Full Chinese addresses usually follow the pattern ...省 ...市 ...区 ...路 ...号 building_name ...座 ...楼 ...室.

Follow-up comment:

BTW, if it's judged token by token then they make sense. I feel short tokens like '东', '三', '环', '中路' may be too small for search, but it's not wrong.

assertAnalysis('japanese_address', '東京都渋谷区渋谷2丁目21−1渋谷スクランブルスクエア4階', ["東京", "都", "渋谷", "区", "渋谷", "2", "丁目", "21", "1", "渋谷", "スクランフル", "スクエア", "4", "階"]);

Review comment:

The correct splitting is:
東京都 - Tokyo city
渋谷区 - District
渋谷2丁目 - Street
21−1 - Street number
渋谷スクランブルスクエア - Building: Shibuya Scramble Square
4階 - Floor

assertAnalysis('khmer_address', 'ផ្ទះលេខ១២៣ផ្លូវព្រះសីហនុសង្កាត់ទន្លេបាសាក់ខណ្ឌចំការមនរាជធានីភ្នំពេញ', ["ផទះលេខ123ផលូវ", "ពរះសីហនុ", "សងកាត", "ទនលេបាសាក", "ខណឌចំការមន", "រាជធានី", "ភនំពេញ"]);
assertAnalysis('lao_address', 'ບ້ານເລກທີ່໑໕໕ຖະໜົນທ່ານຊານຂອງເຂດຈັນທະບູລີນະຄອນວຽງຈັນ', ["ບານ", "ເລກ", "ທີ155ຖະຫນົນ", "ທານ", "ຊານ", "ຂອງ", "ເຂດ", "ຈັນທະ", "ບູ", "ລີ", "ນະຄອນ", "ວຽງຈັນ"]);

suite.run( t.end );
});
};
integration/analyzer_peliasStreet.js (11 additions, 0 deletions)

@@ -1,4 +1,5 @@
// validate analyzer is behaving as expected
const { assert } = require('@hapi/joi');
(SiarheiFedartsou marked this conversation as resolved.)
const Suite = require('../test/elastictest/Suite')

module.exports.tests = {};
@@ -22,6 +23,16 @@ module.exports.tests.analyze = function(test, common){
assertAnalysis( 'remove_ordinals', '1st 2nd 3rd 4th 5th', ['1','2','3','4','5'] );
assertAnalysis( 'remove_ordinals', 'Ast th 101st', ['ast','th','101'] );

// complicated tokenization for some Asian languages
assertAnalysis('thai_address1', 'ซอยเพชรบุรี๑', ['ซอย', 'เพชรบุรี1'] );
assertAnalysis('thai_address2', 'ซอยเพชรบุรี๑foo', ['ซอย', 'เพชรบุรี1', 'foo'] );
assertAnalysis('thai_address3', 'บ้านเลขที่๑๒๓ถนนสุขุมวิทแขวงคลองตันเหนือเขตวัฒนา กรุงเทพมหานคร๑๐๑๑๐', ["บาน", "เลข", "ที123ถนน", "สุขุมวิท", "แขวง", "คลองตัน", "เหนือ", "เขต", "วัฒนา", "กรุงเทพมหานคร10110"]);
dzviagin commented on Nov 28, 2024:

My Thai mate suggested:

["บ้าน", "เลขที่", "123", "ถนน", "สุขุมวิท", "แขวง", "คลองตันเหนือ", "เขต", "วัฒนา", "กรุงเทพมหานคร", "10110"]

Apart from the different split, it's visible that the first letter was changed from บ้ to บ in your case.

Contributor (Author) replied:

Thanks @dzviagin! I will take a look; likely it is a result of the removal of diacritic marks, which existed even before this PR...
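For illustration, a sketch isolating the behaviour described; the expected token is taken from the assertion below, and attributing the change to diacritic folding rather than the tokenizer is the author's hypothesis:

```js
// Hypothetical isolated check: the tone mark in บ้าน ("house") is stripped
// during analysis, so it indexes as บาน. Believed to predate this PR.
assertAnalysis('thai_tone_mark_folding', 'บ้าน', ['บาน']);
```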

assertAnalysis('chinese_address', '北京市朝阳区东三环中路1号国际大厦A座1001室',
['北京市', '朝阳', '区', '东', '三', '环', '中路', '1', '号', '国际', '大厦', 'a', '座', '1001', '室']);
assertAnalysis('japanese_address', '東京都渋谷区渋谷2丁目21−1渋谷スクランブルスクエア4階', ["東京", "都", "渋谷", "区", "渋谷", "2", "丁目", "21", "1", "渋谷", "スクランフル", "スクエア", "4", "階"]);
assertAnalysis('khmer_address', 'ផ្ទះលេខ១២៣ផ្លូវព្រះសីហនុសង្កាត់ទន្លេបាសាក់ខណ្ឌចំការមនរាជធានីភ្នំពេញ', ["ផទះលេខ123ផលូវ", "ពរះសីហនុ", "សងកាត", "ទនលេបាសាក", "ខណឌចំការមន", "រាជធានី", "ភនំពេញ"]);
assertAnalysis('lao_address', 'ບ້ານເລກທີ່໑໕໕ຖະໜົນທ່ານຊານຂອງເຂດຈັນທະບູລີນະຄອນວຽງຈັນ', ["ບານ", "ເລກ", "ທີ155ຖະຫນົນ", "ທານ", "ຊານ", "ຂອງ", "ເຂດ", "ຈັນທະ", "ບູ", "ລີ", "ນະຄອນ", "ວຽງຈັນ"]);

suite.run( t.end );
});
};
settings.js (28 additions, 9 deletions)

@@ -22,16 +22,16 @@ function generate(){
"analysis": {
"tokenizer": {
"peliasTokenizer": {
"type": "pattern",
"pattern": "[\\s,/\\\\-]+"
Member commented:

The previous pattern matched on [whitespace, comma, forward/back slashes and hyphen]. Is this still the case with the icu_tokenizer?

Contributor (Author) replied:

Good point. I'll check whether we have relevant tests and, if we don't, I'll write them 👍
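A sketch of what such a test might look like; the expected tokens are assumptions about icu_tokenizer behaviour and would need verification against a running cluster:

```js
// Hypothetical regression test: the ICU tokenizer should still split on the
// old pattern-tokenizer delimiters: whitespace, comma, slashes, and hyphen.
assertAnalysis('delimiters', 'a,b/c\\d-e f', ['a', 'b', 'c', 'd', 'e', 'f']);
```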

"type": "icu_tokenizer"
}
},
"analyzer": {
"peliasAdmin": {
"type": "custom",
"tokenizer": "peliasTokenizer",
"char_filter" : ["punctuation", "nfkc_normalizer"],
"char_filter" : ["ampersand_mapper", "punctuation", "nfkc_normalizer"],
"filter": [
"ampersand_replacer",
"lowercase",
"trim",
"synonyms/custom_admin/multiword",
@@ -46,8 +46,9 @@
"peliasIndexOneEdgeGram" : {
"type": "custom",
"tokenizer" : "peliasTokenizer",
"char_filter" : ["punctuation", "nfkc_normalizer"],
"char_filter" : ["ampersand_mapper", "punctuation", "nfkc_normalizer"],
"filter": [
"ampersand_replacer",
"lowercase",
"trim",
"synonyms/custom_name/multiword",
@@ -66,8 +67,9 @@
"peliasQuery": {
"type": "custom",
"tokenizer": "peliasTokenizer",
"char_filter": ["punctuation", "nfkc_normalizer"],
"char_filter": ["ampersand_mapper", "punctuation", "nfkc_normalizer"],
"filter": [
"ampersand_replacer",
"lowercase",
"trim",
"icu_folding",
@@ -80,8 +82,9 @@
"peliasPhrase": {
"type": "custom",
"tokenizer":"peliasTokenizer",
"char_filter" : ["punctuation", "nfkc_normalizer"],
"char_filter" : ["ampersand_mapper", "punctuation", "nfkc_normalizer"],
"filter": [
"ampersand_replacer",
"lowercase",
"trim",
"remove_duplicate_spaces",
@@ -129,8 +132,9 @@
"peliasStreet": {
"type": "custom",
"tokenizer":"peliasTokenizer",
"char_filter" : ["punctuation", "nfkc_normalizer"],
"char_filter" : ["ampersand_mapper", "punctuation", "nfkc_normalizer"],
"filter": [
"ampersand_replacer",
"lowercase",
"trim",
"remove_duplicate_spaces",
@@ -147,8 +151,9 @@
"peliasIndexCountryAbbreviation": {
"type": "custom",
"tokenizer": "peliasTokenizer",
"char_filter": ["punctuation", "nfkc_normalizer"],
"char_filter": ["ampersand_mapper", "punctuation", "nfkc_normalizer"],
"filter": [
"ampersand_replacer",
"lowercase",
"trim",
"icu_folding",
@@ -161,8 +166,9 @@
"peliasIndexCountryAbbreviationOneEdgeGram": {
"type": "custom",
"tokenizer": "peliasTokenizer",
"char_filter": ["punctuation", "nfkc_normalizer"],
"char_filter": ["ampersand_mapper", "punctuation", "nfkc_normalizer"],
"filter": [
"ampersand_replacer",
"lowercase",
"trim",
"icu_folding",
@@ -175,6 +181,12 @@
},
},
"filter" : {
Member commented:

I wish this AMPERSANDPLACEHOLDER replacement wasn't necessary but understand why it is.
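For context, a sketch of the round trip these two definitions perform; the intermediate stages are inferred from the definitions in this diff, not captured from a real analysis chain:

```js
// Hypothetical walk-through of the ampersand round trip:
//   input:                 'foo & bar'
//   1. ampersand_mapper:   'foo  AMPERSANDPLACEHOLDER  bar'
//      (char_filters run before tokenization, so the icu_tokenizer never
//       sees the '&' it would otherwise treat as a word boundary)
//   2. icu_tokenizer:      ['foo', 'AMPERSANDPLACEHOLDER', 'bar']
//   3. ampersand_replacer: ['foo', '&', 'bar']
// Later filters (e.g. synonyms) may transform the '&' further.
assertAnalysis('ampersand_roundtrip', 'foo & bar', ['foo', '&', 'bar']);
```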

+ // replaces ampersand placeholders back to `&` (see `ampersand_mapper` char_filter)
+ "ampersand_replacer": {
+ "type": "pattern_replace",
+ "pattern": "AMPERSANDPLACEHOLDER",
+ "replacement": "&"
+ },
"street_synonyms_multiplexer": {
"type": "multiplexer",
"preserve_original": false,
@@ -248,6 +260,13 @@
// more generated below
},
"char_filter": {
+ // icu_tokenizer treats ampersands as a word boundary, so we replace them with a placeholder to avoid that;
+ // we want to handle ampersands separately, so the placeholder is mapped back to `&` after tokenization (see `ampersand_replacer` filter)
+ "ampersand_mapper": {
+ "type": "pattern_replace",
+ "pattern": "&",
+ "replacement": " AMPERSANDPLACEHOLDER "
+ },
"punctuation" : {
"type" : "mapping",
"mappings" : punctuation.blacklist.map(function(c){
test/fixtures/expected.json (25 additions, 2 deletions)

@@ -15,19 +15,20 @@
"analysis": {
"tokenizer": {
"peliasTokenizer": {
"type": "pattern",
"pattern": "[\\s,/\\\\-]+"
"type": "icu_tokenizer"
}
},
"analyzer": {
"peliasAdmin": {
"type": "custom",
"tokenizer": "peliasTokenizer",
"char_filter": [
"ampersand_mapper",
"punctuation",
"nfkc_normalizer"
],
"filter": [
"ampersand_replacer",
"lowercase",
"trim",
"synonyms/custom_admin/multiword",
@@ -43,10 +44,12 @@
"type": "custom",
"tokenizer": "peliasTokenizer",
"char_filter": [
"ampersand_mapper",
"punctuation",
"nfkc_normalizer"
],
"filter": [
"ampersand_replacer",
"lowercase",
"trim",
"synonyms/custom_name/multiword",
@@ -66,10 +69,12 @@
"type": "custom",
"tokenizer": "peliasTokenizer",
"char_filter": [
"ampersand_mapper",
"punctuation",
"nfkc_normalizer"
],
"filter": [
"ampersand_replacer",
"lowercase",
"trim",
"icu_folding",
@@ -83,10 +88,12 @@
"type": "custom",
"tokenizer": "peliasTokenizer",
"char_filter": [
"ampersand_mapper",
"punctuation",
"nfkc_normalizer"
],
"filter": [
"ampersand_replacer",
"lowercase",
"trim",
"remove_duplicate_spaces",
@@ -143,10 +150,12 @@
"type": "custom",
"tokenizer": "peliasTokenizer",
"char_filter": [
"ampersand_mapper",
"punctuation",
"nfkc_normalizer"
],
"filter": [
"ampersand_replacer",
"lowercase",
"trim",
"remove_duplicate_spaces",
@@ -164,10 +173,12 @@
"type": "custom",
"tokenizer": "peliasTokenizer",
"char_filter": [
"ampersand_mapper",
"punctuation",
"nfkc_normalizer"
],
"filter": [
"ampersand_replacer",
"lowercase",
"trim",
"icu_folding",
@@ -181,10 +192,12 @@
"type": "custom",
"tokenizer": "peliasTokenizer",
"char_filter": [
"ampersand_mapper",
"punctuation",
"nfkc_normalizer"
],
"filter": [
"ampersand_replacer",
"lowercase",
"trim",
"icu_folding",
@@ -197,6 +210,11 @@
}
},
"filter": {
"ampersand_replacer": {
"type": "pattern_replace",
"pattern": "AMPERSANDPLACEHOLDER",
"replacement": "&"
},
"street_synonyms_multiplexer": {
"type": "multiplexer",
"preserve_original": false,
@@ -2271,6 +2289,11 @@
}
},
"char_filter": {
"ampersand_mapper": {
"type": "pattern_replace",
"pattern": "&",
"replacement": " AMPERSANDPLACEHOLDER "
},
"punctuation": {
"type": "mapping",
"mappings": [