Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"custom_street" synonyms should be applied to name.* and phrase.* indices #449

Merged
merged 1 commit into from
Sep 4, 2020

Conversation

missinglink
Copy link
Member

As mentioned in #446 (comment)

If developers extend the synonyms lists by editing synonyms/custom_street.txt the synonyms are only being applied to the address_parts.street field and not to name.* & phrase.* indices.

The problem with this is that we use the name.* & phrase.* indices as top-level MUST conditions in most of our queries so adding custom street synonyms has very little effect.

The current workaround is for developer to manually add the synonyms filter to the indices as per:
Joxit@514fb02

I think this was an oversight at the time and worth fixing.

@orangejulius
Copy link
Member

Oh nice, yeah, this duplication has always been annoying for us, and has lead to much confusion for Pelias users.

@missinglink
Copy link
Member Author

I'm happy to merge this now, but the thing is that the synonyms/custom_street.txt file is empty by default so it's going to be a no-op for anyone running the canonical synonyms which ship with Pelias.

@Joxit 👍 or 👎 ?

@Joxit
Copy link
Member

Joxit commented Jun 15, 2020

Yep, I'm ok with this 😄 this will help me for future builds !

@orangejulius
Copy link
Member

Do we need this PR after #453? I don't think so, but just wanted to confirm. Feel free to close if it's no longer needed.

@missinglink
Copy link
Member Author

This has been resolved in a cleaner way using the name_synonyms_multiplexer

Screenshot 2020-07-14 at 11 43 59

@missinglink missinglink reopened this Jul 14, 2020
@missinglink
Copy link
Member Author

oh... actually... the multiplexer still doesn't list custom_street.. I think the issue is still valid and unresolved but the code will require some modernisation to include custom_street within name_synonyms_multiplexer.

@missinglink
Copy link
Member Author

I'd actually have preferred to delete custom_* synonym files as part of the recent refactor, from what I've seen of 3rd-party synonyms files they are usually very messy and error-prone.

I didn't drop support for custom_* synonym files because developers are already using them and it would potentially break functionality.

I'm not going to do any further work on this topic but I'd be happy to review and merge a PR from a developer who is actively using custom synonyms and can test it out on their setup.

@missinglink missinglink force-pushed the custom_street_synonyms branch from d4fceaf to f14061f Compare September 3, 2020 07:46
@missinglink
Copy link
Member Author

missinglink commented Sep 3, 2020

This issue came up via a client today, I've modernized the code as per #449 (comment) and rebased it against master.

In addition to including custom_street in the name.* and phrase.* indices, I'm also now including custom_admin for completeness.

I'd like to solicit some feedback before merging this, I suspect it will not have any issues besides the pervasive 'viral synonyms' issue related to multi-word synonyms that we inherited from Lucene and have always had to deal with.

A good example of where this is relevant is "MLK" vs "Martin Luther King" vs "Martin Luther King Jr" which is a common synonym in the United States but also unique enough in a global context to possibly not include in the core Pelias synonyms files.

@missinglink
Copy link
Member Author

@Joxit this got put on hold when I was doing the recent synonyms work but is open for testing and merging now.

It's slightly different what what you tested last time because I've rebased it against master to bring in those changes.

We don't use custom synonyms so any feedback you could provide would be helpful to confirm it's good to go.

I was hoping that it would be able to replace Joxit@514fb02

@Joxit
Copy link
Member

Joxit commented Sep 3, 2020

In fact, since #453 includes all the synonyms I need, I am no longer using my fork (I did a full build yesterday).

I might need the custom street again someday 🤷‍♂️
This PR is still OK and custom_street and custom_admin are definitely useful in peliasPhrase

@missinglink
Copy link
Member Author

I confirmed this works as expected by adding the following line to custom_street.txt:

note: no other custom synonyms are enabled

Martin Luther King,ML King,M L King,MLK

And then running an elyzer query against peliasPhrase (which is used on the name.* and phrase.* fields).

elyzer --index pelias --analyzer peliasPhrase 'MLK Boulevard'
CHAR_FILTER: punctuation
{0:MLK Boulevard}
CHAR_FILTER: nfkc_normalizer
{0:MLK Boulevard}
TOKENIZER: peliasTokenizer
{0:MLK}	{1:Boulevard}
TOKEN_FILTER: lowercase
{0:mlk}	{1:boulevard}
TOKEN_FILTER: trim
{0:mlk}	{1:boulevard}
TOKEN_FILTER: remove_duplicate_spaces
{0:mlk}	{1:boulevard}
TOKEN_FILTER: synonyms/custom_name/multiword
{0:mlk}	{1:boulevard}
TOKEN_FILTER: synonyms/custom_street/multiword
{0:mlk,martin,ml,m}	{1:boulevard,luther,king,l}	{2:king,king}
TOKEN_FILTER: synonyms/custom_admin/multiword
{0:mlk,martin,ml,m}	{1:boulevard,luther,king,l}	{2:king,king}
TOKEN_FILTER: name_synonyms_multiplexer
{0:mlk,martin,ml,mall,mill,m}	{1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l}	{2:king,kg}
TOKEN_FILTER: icu_folding
{0:mlk,martin,ml,mall,mill,m}	{1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l}	{2:king,kg}
TOKEN_FILTER: remove_ordinals
{0:mlk,martin,ml,mall,mill,m}	{1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l}	{2:king,kg}
TOKEN_FILTER: unique_only_same_position
{0:mlk,martin,ml,mall,mill,m}	{1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l}	{2:king,kg}
TOKEN_FILTER: notnull
{0:mlk,martin,ml,mall,mill,m}	{1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l}	{2:king,kg}
TOKEN_FILTER: flatten_graph
{0:mlk,martin,ml,mall,mill,m}	{1:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld,luther,king,kg,l}	{2:king,kg}

@missinglink
Copy link
Member Author

The above also highlights the issue with multi-word synonyms where boulevard (and variants) are indexed at position:1 for xxx boulevard and not at position:3 as we would prefer for xxx xxx xxx boulevard.

@missinglink
Copy link
Member Author

The inverse is true for when the data provider gives us Martin Luther King Boulevard:

elyzer --index pelias --analyzer peliasPhrase 'Martin Luther King Boulevard'
CHAR_FILTER: punctuation
{0:Martin Luther King Boulevard}
CHAR_FILTER: nfkc_normalizer
{0:Martin Luther King Boulevard}
TOKENIZER: peliasTokenizer
{0:Martin}	{1:Luther}	{2:King}	{3:Boulevard}
TOKEN_FILTER: lowercase
{0:martin}	{1:luther}	{2:king}	{3:boulevard}
TOKEN_FILTER: trim
{0:martin}	{1:luther}	{2:king}	{3:boulevard}
TOKEN_FILTER: remove_duplicate_spaces
{0:martin}	{1:luther}	{2:king}	{3:boulevard}
TOKEN_FILTER: synonyms/custom_name/multiword
{0:martin}	{1:luther}	{2:king}	{3:boulevard}
TOKEN_FILTER: synonyms/custom_street/multiword
{0:martin,ml,m,mlk}	{1:luther,king,l}	{2:king,king}	{3:boulevard}
TOKEN_FILTER: synonyms/custom_admin/multiword
{0:martin,ml,m,mlk}	{1:luther,king,l}	{2:king,king}	{3:boulevard}
TOKEN_FILTER: name_synonyms_multiplexer
{0:martin,ml,mall,mill,m,mlk}	{1:luther,king,kg,l}	{2:king,kg}	{3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
TOKEN_FILTER: icu_folding
{0:martin,ml,mall,mill,m,mlk}	{1:luther,king,kg,l}	{2:king,kg}	{3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
TOKEN_FILTER: remove_ordinals
{0:martin,ml,mall,mill,m,mlk}	{1:luther,king,kg,l}	{2:king,kg}	{3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
TOKEN_FILTER: unique_only_same_position
{0:martin,ml,mall,mill,m,mlk}	{1:luther,king,kg,l}	{2:king,kg}	{3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
TOKEN_FILTER: notnull
{0:martin,ml,mall,mill,m,mlk}	{1:luther,king,kg,l}	{2:king,kg}	{3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}
TOKEN_FILTER: flatten_graph
{0:martin,ml,mall,mill,m,mlk}	{1:luther,king,kg,l}	{2:king,kg}	{3:boulevard,bd,blvd,bde,blv,bl,blvde,blvrd,boulavard,boul,boulv,bvd,boulevarde,bld}

In this case MLK is position:0 and Boulevard is position:3 🤷‍♂️

@missinglink missinglink merged commit 687dbfb into master Sep 4, 2020
@missinglink missinglink deleted the custom_street_synonyms branch September 4, 2020 11:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants