too many results in search #213

mellybelly · 2018-01-24T01:15:36Z

for example, EHLERS-DANLOS SYNDROME, KYPHOSCOLIOTIC TYPE, 1 has 91491 matches

kshefchek · 2018-01-29T19:17:51Z

For reference, here is the page on production:
https://monarchinitiative.org/search/EHLERS-DANLOS SYNDROME, KYPHOSCOLIOTIC TYPE, 1

The issue here is that we're matching on the terms "syndrome", "type", and "1". The Solr relevancy score does factor in the IDF (inverse document frequency) of terms in order to decrease the weight of common terms; however, it doesn't filter these terms out of the results entirely. If anyone is interested, here are the wiki pages for the algorithms Solr uses for relevancy scoring:
Current: https://en.wikipedia.org/wiki/Okapi_BM25
TF-IDF (Classic): https://en.wikipedia.org/wiki/Tf%E2%80%93idf
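
To illustrate why IDF down-weights common terms without removing them, here is a minimal sketch of the IDF component in the form used by Lucene's BM25 similarity. The document frequencies and index size below are made-up numbers for illustration, not values from the Monarch index.

```python
import math


def bm25_idf(doc_freq: int, num_docs: int) -> float:
    """IDF component of BM25: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical index size and document frequencies (assumptions, not real data).
NUM_DOCS = 1_000_000
doc_freqs = {"kyphoscoliotic": 12, "syndrome": 40_000, "type": 90_000}

for term, df in doc_freqs.items():
    # Common terms get a small weight, but the weight never reaches zero,
    # so documents matching only "type" still appear in the result set.
    print(f"{term:15s} df={df:>6d} idf={bm25_idf(df, NUM_DOCS):.2f}")
```

The rare term dominates the ranking, but every document containing any of the terms is still counted as a match, which is why the total stays large.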

Some quick ideas:

Add stopwords, such as "syndrome" and "disorder", so these terms are not indexed. As a test I did this on beta, but the results are still overwhelmed by matches on "type" and "1": https://beta.monarchinitiative.org/search/EHLERS-DANLOS SYNDROME, KYPHOSCOLIOTIC TYPE, 1
Extend and modify the similarity class. Solr allows third-party Java packages to be configured, so we could theoretically extend any similarity class and adjust the algorithm to weight IDF more heavily.
Test out other relevancy algorithms (I tried TF-IDF and the results were the same).
Set a global result limit.
Filter out documents whose score falls more than X below the max relevancy score (this is discouraged, since the max relevancy score and the score distribution change for each query).
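
As a rough illustration of three of these ideas (stopwords, a global limit, and a cutoff relative to the top score), here is a client-side sketch. The function names, the `min_ratio` and `limit` parameters, and the sample hits are all hypothetical; in practice the stopword filter would live in the Solr analysis chain rather than in application code.

```python
from typing import List, Tuple

# Hypothetical stopword list; the thread only names "syndrome" and "disorder".
STOPWORDS = {"syndrome", "disorder"}


def strip_stopwords(tokens: List[str]) -> List[str]:
    """Drop stopwords (case-insensitively) before the tokens are queried."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def cut_by_relative_score(
    hits: List[Tuple[str, float]], min_ratio: float = 0.5, limit: int = 100
) -> List[Tuple[str, float]]:
    """Keep hits scoring at least min_ratio of the top score, capped at limit."""
    if not hits:
        return []
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    top_score = ranked[0][1]
    kept = [hit for hit in ranked if hit[1] >= top_score * min_ratio]
    return kept[:limit]


print(strip_stopwords(["EHLERS-DANLOS", "SYNDROME", "TYPE", "1"]))
print(cut_by_relative_score([("a", 10.0), ("b", 6.0), ("c", 1.0)]))
```

Because the cutoff is relative to each query's top score, it sidesteps part of the concern about absolute scores varying per query, though the score distribution can still differ enough to make any single ratio a poor fit.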

@kltm have you dealt with this in amigo?

cc @pnrobinson @DoctorBud I believe this was brought up on the last weekly call as well (for searches on Marfan Syndrome), which looks much better with the stopword filter:
https://beta.monarchinitiative.org/search/marfan%20syndrome
vs
https://monarchinitiative.org/search/marfan%20syndrome

kltm · 2018-01-29T21:11:40Z

Looking at the results list, and keeping in mind that your Solr installation is set to a default "OR" search and uses score to bring up results, the large number of results with general terms like "type" and "1" would be completely expected. It might be better to view this as a UI issue, possibly limiting results to a certain score threshold when the returned numbers are large.

That said, there could be some tweaks in there to move the 4th result up to 3rd; some playing with field boosts could help, but Solr's preference for larger (more informative) words and match counts seems to be about right.
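
To make the effect of the default "OR" operator concrete, here is a toy sketch with a hand-built inverted index. The index contents and the `match` helper are hypothetical; in Solr itself the default operator can be changed with the `q.op` parameter.

```python
from functools import reduce
from typing import Dict, List, Set

# Hypothetical toy index: term -> ids of the documents containing it.
INDEX: Dict[str, Set[int]] = {
    "marfan": {1, 2},
    "syndrome": {1, 2, 3, 4, 5, 6},
    "type": {2, 4, 7, 8},
}


def match(terms: List[str], operator: str = "OR") -> Set[int]:
    """Return matching document ids under OR (union) or AND (intersection)."""
    postings = [INDEX.get(term, set()) for term in terms]
    if not postings:
        return set()
    combine = set.union if operator == "OR" else set.intersection
    return reduce(combine, postings)


print(len(match(["marfan", "syndrome"], "OR")))   # every doc with either term
print(len(match(["marfan", "syndrome"], "AND")))  # only docs with both terms
```

Under OR, a single common term is enough to pull a document into the result set, so the match count is driven by the most frequent token in the query.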

kshefchek · 2018-09-05T21:22:23Z

bump, from @realmarcin:

I’m searching with a specific MONDO CURIE https://monarchinitiative.org/search/MONDO%3A0000554
The top hit looks correct; however, the top-left page section seems to suggest over 20k diseases? Ah, looks like it's matching ‘MONDO’ …

jmcmurry · 2018-09-10T20:03:24Z

Please move future CURIE-search conversation over to https://github.com/monarch-initiative/monarch-app/issues/1625

kshefchek · 2019-09-26T14:48:24Z

Transferring to the new app as this was never addressed.

@pnrobinson recently discovered another case when querying an HGVS variant label:
https://beta.monarchinitiative.org/search/NM_144997.5(FLCN):c.1429C%3ET%20(p.Arg477Ter)

Setting debugQuery=true, I can see the tokenizer is being aggressive with the punctuation in the HGVS format. The tokens queried are:

nm
144997.5
flcn
1429c
t
c
nm_144997.5(flcn):c.1429c>t
nm_144997.5(flcn):c.1429c>t (p.arg477ter)

This results in many false positives (from a domain perspective). Some quick ideas:

Quote all input strings, effectively overriding the tokenizer (could result in a drop in true positives)
Test out another tokenizer, e.g. the classic tokenizer instead of the standard one
Use the Solr relevancy score and adjust at the UI level, as @kltm suggested above in #213 (comment)
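
A minimal sketch of the first idea, assuming the input is wrapped in double quotes on the client before it is sent as the query string. The `quote_as_phrase` helper is hypothetical, and how the quoted phrase is then handled still depends on the analysis chain of each field being searched.

```python
def quote_as_phrase(user_input: str) -> str:
    """Wrap raw user input in double quotes so it is parsed as one phrase.

    Backslashes and double quotes inside the input are escaped so they
    cannot terminate the phrase early.
    """
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


print(quote_as_phrase("NM_144997.5(FLCN):c.1429C>T (p.Arg477Ter)"))
```

The trade-off noted above still applies: a phrase match is stricter, so labels that differ slightly from what the user typed may no longer be returned.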

pnrobinson · 2019-09-26T14:57:06Z

@kshefchek @mellybelly I wonder if we can have one search for the initial autocomplete, but once the user presses go, we switch strategies and show only exact or very near matches?

kshefchek · 2019-09-26T16:08:54Z

The keyword tokenizer is the closest we have to an exact match, https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-KeywordTokenizer. This is one of three we are matching on (standard/classic tokenizer, edge ngram, keyword). We could go this route but we may lose valid hits, keeping in mind our test cases outlined in https://github.com/monarch-initiative/monarch-app/issues/1383
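
To show how the keyword and edge n-gram strategies differ, here is a toy approximation in plain Python. These functions only mimic the general behaviour (whole input as one token versus leading prefixes); they are not the Solr implementations, and the lowercasing and the `min_gram`/`max_gram` defaults are assumptions.

```python
from typing import List


def keyword_tokens(text: str) -> List[str]:
    """Keyword-style: the entire input becomes a single token."""
    return [text.lower()]


def edge_ngrams(text: str, min_gram: int = 1, max_gram: int = 25) -> List[str]:
    """Edge n-gram-style: emit the leading prefixes of the input."""
    lowered = text.lower()
    upper = min(max_gram, len(lowered))
    return [lowered[:n] for n in range(min_gram, upper + 1)]


print(keyword_tokens("Marfan Syndrome"))
print(edge_ngrams("Marfan", min_gram=2, max_gram=4))
```

A keyword-style match only fires when the whole label matches, which is why relying on it alone could drop valid partial hits that the prefix and word-level fields currently catch.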

I tested out the classic tokenizer, but this did not help much with the HGVS label: matches went from 297106 down to 248200. The only difference is that instead of the two tokens "1429c" and "c", we get "1429c.c".

kshefchek · 2019-09-30T22:22:41Z

fixed with #214

kshefchek transferred this issue from monarch-initiative/monarch-legacy Sep 26, 2019

monicacecilia assigned kshefchek Sep 28, 2019

monicacecilia added this to the Monarch-UI v1.0 milestone Sep 28, 2019

kshefchek closed this as completed Sep 30, 2019
