-
-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update street synonyms #446
Conversation
I'm still not sure how comfortable I am with some of these synonyms which come from libpostal: eg:
|
d3e26a5
to
eb55831
Compare
rebased on top of #447 |
synonyms/street_synonyms_en.txt
Outdated
rise, ri | ||
riverway, rvwy | ||
riviera, rvra | ||
road, rd, ro, r, roa, raod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, maybe we leave out any single character synonyms for now?
And definitely typos should not be synonyms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added another PR to remove a bunch of single char synonyms (although I'm not sure if they are such a big issue?)
Also removed some synonyms from the libpostal
list which appeared to only be there for the purpose of spelling correction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a recent issue opened specifically suggesting adding a synonym for typos #443
Let's chat about this some more, for now I'm going to remove them.
@@ -68,7 +72,9 @@ function generate(){ | |||
"icu_folding", | |||
"trim", | |||
"custom_name", | |||
"street_suffix", | |||
"street_synonyms_en", | |||
"street_synonyms_usps", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about having something with a regex like street_synonyms_*
and will fetch all the files ?
This may help for #393 or if we need to add custom street_synonyms by lang.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good point, we should rethink these a bit and make them easier to customise.
Right now we have fairly limited synonyms coverage (ie. 'just the basics').
I'd like to explore what would happen if we added significantly more synonyms, I suspect that the impact would be very positive for search quality and hopefully doesn't affect performance much.
@Joxit could you please tell me a bit about how you're adding custom synonyms for your builds?
I'm guessing you're adding a bunch of French-specific and some other ones which you include in the custom_street.txt
file?
Pending successful testing of this I would next like to expand on French, German and Spanish synonyms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit annoying but better since #375 and #407
I have a fork joxit/pelias-schema, when I start a new build, I fetch the upstream and add this commit.
For the record, I added cutom_street
in peliasPhrase
and peliasQuery
in build-2019-09
(before pelias/parser
release). peliasQuery
is used for autocomplete where we can find streets names... And peliasPhrase
because street_suffix
is also present.
I add my synonyms in synonyms/custom_street.txt
bd, bld, blvd, boul, bvd, boulevard #only bld is missing in this PR
ave, av, avenue # OK with this PR
saint, st, street # missing
sainte, ste, suite # missing
This list was created from our common uses (when we test the pelias). I have always been afraid of slowing down queries with too many synonyms 😅.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you possibly know of an official source which publishes a French equivalent of this?
https://pe.usps.com/text/pub28/28apc_002.htm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened #449 to hopefully clean up your build process a little.
Ideally you shouldn't have to do any synonyms
work in peliasQuery
, I think you can remove that and you'll get a small perf benefit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this one : https://blog.bureaudeposte.net/rediger-une-adresse-aux-normes-postales/ for abbreviations
Name | abbreviation |
---|---|
Allée | ALL |
Avenue | AV |
Boulevard | BD |
Carrefour | CARR |
Chaussée | CHS |
Chemin | CHEM |
Cours | CRS |
Faubourg | FG |
Immeuble | IMM |
Impasse | IMP |
Lieu-dit | LD |
Lotissement | LOT |
Montée | MTE |
Passage | PAS |
Place | PL |
Résidence | RES |
Route | RTE |
Ruelle | RLE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next time I will check differences between synonyms in peliasQuery
and #449
Oh, and saint/sainte aren't for street suffix/prefix but a workaround for street names... 😅
block, blk, blck | ||
bluff, blf, bluf, bluffs, blfs | ||
boardwalk, bwk, bwlk, board walk | ||
boulevard, blvd, bd, bde, blv, bl, blvde, blvrd, boulavard, boul, boulv, bvd, boulevarde |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one will help #301 and the example bd saint-martin
eb55831
to
f0aa06b
Compare
…ynonyms which seem to only be present for the purpose of spelling correction
Okay, I'm going to run this as a full-planet build and see what effect it has on the acceptance tests. I'm particularly interested to see if adding loads of synonyms is feasible from quality/perf perspective 🤷 |
Thanks for working on this, Peter! |
Looks pretty good to me 🎉 |
This is an experimental change, but one that has good impact when paired with pelias/schema#446. Essentially, this change continues the trend we have started for the `name.*` and `phrase.*` fields to generate synonyms _only_ at index time. It may be the case that variations on input vs data text (for example `crt` vs `ct` vs `court`) may cause different synonyms to be generated by the same analyzer. Many queries, especially `match_phrase` queries, will require that _all_ of those generated synonym tokens must match. This is often not desirable.
I've removed all the multi-word synonyms in 5852086, these are potentially dangerous, especially when used at query time (such as is the case right now for |
One additional synonym we might want to add here is We can also do it as a followup PR if that's easier, but let's make sure not to forget :) |
This is an experimental change, but one that has good impact when paired with pelias/schema#446. Essentially, this change continues the trend we have started for the `name.*` and `phrase.*` fields to generate synonyms _only_ at index time. It may be the case that variations on input vs data text (for example `crt` vs `ct` vs `court`) may cause different synonyms to be generated by the same analyzer. Many queries, especially `match_phrase` queries, will require that _all_ of those generated synonym tokens must match. This is often not desirable.
This is an experimental change, but one that has good impact when paired with pelias/schema#446. Essentially, this change continues the trend we have started for the `name.*` and `phrase.*` fields to generate synonyms _only_ at index time. It may be the case that variations on input vs data text (for example `crt` vs `ct` vs `court`) may cause different synonyms to be generated by the same analyzer. Many queries, especially `match_phrase` queries, will require that _all_ of those generated synonym tokens must match. This is often not desirable.
This is an experimental change, but one that has good impact when paired with pelias/schema#446. Essentially, this change continues the trend we have started for the `name.*` and `phrase.*` fields to generate synonyms _only_ at index time. It may be the case that variations on input vs data text (for example `crt` vs `ct` vs `court`) may cause different synonyms to be generated by the same analyzer. Many queries, especially `match_phrase` queries, will require that _all_ of those generated synonym tokens must match. This is often not desirable.
This is an experimental change, but one that has good impact when paired with pelias/schema#446. Essentially, this change continues the trend we have started for the `name.*` and `phrase.*` fields to generate synonyms _only_ at index time. It may be the case that variations on input vs data text (for example `crt` vs `ct` vs `court`) may cause different synonyms to be generated by the same analyzer. Many queries, especially `match_phrase` queries, will require that _all_ of those generated synonym tokens must match. This is often not desirable.
This is an experimental change, but one that has good impact when paired with pelias/schema#446. Essentially, this change continues the trend we have started for the `name.*` and `phrase.*` fields to generate synonyms _only_ at index time. It may be the case that variations on input vs data text (for example `crt` vs `ct` vs `court`) may cause different synonyms to be generated by the same analyzer. Many queries, especially `match_phrase` queries, will require that _all_ of those generated synonym tokens must match. This is often not desirable.
This PR has been replaced by #453 |
This is an experimental change, but one that has good impact when paired with pelias/schema#446. Essentially, this change continues the trend we have started for the `name.*` and `phrase.*` fields to generate synonyms _only_ at index time. It may be the case that variations on input vs data text (for example `crt` vs `ct` vs `court`) may cause different synonyms to be generated by the same analyzer. Many queries, especially `match_phrase` queries, will require that _all_ of those generated synonym tokens must match. This is often not desirable.
This WIP PR is to improve the synonyms specifically for USA streets.
Motivated by the following matching failures:
I've added the USPS synonyms list which should cover a large amount of USA street names.
I'd like to see the effect of this on a full-planet build before deciding if this is something we might consider doing more of, for other languages and classes of synonyms (such as Gray<>Grey and 1st<>First).