This repository has been archived by the owner on Oct 31, 2024. It is now read-only.

too many results in search #213

Closed
mellybelly opened this issue Jan 24, 2018 · 8 comments

@mellybelly

For example, EHLERS-DANLOS SYNDROME, KYPHOSCOLIOTIC TYPE, 1 has 91491 matches.
[screenshot, 2018-01-24]

@kshefchek
Contributor

kshefchek commented Jan 29, 2018

As a reference here is the page on production:
https://monarchinitiative.org/search/EHLERS-DANLOS SYNDROME, KYPHOSCOLIOTIC TYPE, 1

The issue here is that we're matching on the terms "syndrome", "type", and "1". The Solr relevancy score does factor in the IDF (inverse document frequency) of terms to decrease the weight of common terms; however, it doesn't completely filter these terms out of the results. If anyone is interested, here are the wiki pages for the algorithms Solr uses for relevancy scoring:
Current: https://en.wikipedia.org/wiki/Okapi_BM25
TF-IDF (Classic): https://en.wikipedia.org/wiki/Tf%E2%80%93idf
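
To make the IDF point concrete, here is a minimal sketch of the IDF component of BM25 in the form Lucene's BM25 similarity uses. The index size and document frequencies are made-up numbers for illustration, not values from the Monarch index.

```python
import math

def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF term of BM25: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical index statistics (assumptions, not real counts).
N = 1_000_000
doc_freqs = {"kyphoscoliotic": 5, "syndrome": 50_000, "type": 90_000, "1": 200_000}

weights = {term: bm25_idf(N, df) for term, df in doc_freqs.items()}
for term, weight in weights.items():
    print(f"{term:15s} idf={weight:.2f}")
```

Common terms get a small weight, but never zero, so under an OR search any document containing only "type" or "1" still counts as a match.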

Some quick ideas:

  • Add stopwords, such as "syndrome" and "disorder", so these terms are not indexed. As a test I did this on beta, but the results are still overwhelmed with matches on "type" and "1": https://beta.monarchinitiative.org/search/EHLERS-DANLOS SYNDROME, KYPHOSCOLIOTIC TYPE, 1
  • Extend and modify the similarity class. Solr allows third-party Java packages to be configured, so we could theoretically extend any similarity class and adjust the algorithm to weight IDF more heavily.
  • Test out other relevancy algorithms (I tried TF-IDF and the results were the same).
  • Set a global result limit.
  • Filter out documents whose score falls more than X below the max relevancy score (this is discouraged, since the max relevancy score and the score distribution change for each query).
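
As a rough illustration of why the stopword test still left too many matches, the sketch below strips the two stopwords named above from the example query. The regex tokenizer is a simplification assumed for illustration; it is not Solr's analyzer.

```python
import re

# The two stopword examples mentioned above; a real list would be longer.
STOPWORDS = {"syndrome", "disorder"}

def content_tokens(text, stopwords=STOPWORDS):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stopwords]

remaining = content_tokens("EHLERS-DANLOS SYNDROME, KYPHOSCOLIOTIC TYPE, 1")
print(remaining)
```

"type" and "1" survive the filter, which is consistent with the beta results still being dominated by those two terms.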

@kltm, have you dealt with this in AmiGO?

cc @pnrobinson @DoctorBud I believe this was brought up on the last weekly call as well (for searches on Marfan Syndrome), which looks much better with the stopword filter:
https://beta.monarchinitiative.org/search/marfan%20syndrome
vs
https://monarchinitiative.org/search/marfan%20syndrome

@kltm
Member

kltm commented Jan 29, 2018

Looking at the results list, and keeping in mind that your Solr installation is set to a default "OR" search and uses score to rank results, the large number of results for general terms like "type" and "1" would be completely expected. It might be better to view this as a UI issue, possibly limiting results to a certain score threshold when the number returned is large.

That said, there could be some tweaks in there to get the 4 result to 3; some playing with field boosts could help, but Solr's preference for larger (more informative) words and higher match counts seems to be about right.
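
A minimal sketch of the UI-level score threshold idea, assuming each result is a dict with a "score" key (as Solr returns when score is requested in the field list). The function name, the 0.5 fraction, and the 1000-result trigger are hypothetical choices, not values anyone in this thread proposed.

```python
def filter_by_relative_score(docs, min_fraction=0.5, apply_above=1000):
    """Keep only docs scoring at least min_fraction of the top score,
    but only when the result set is larger than apply_above."""
    if len(docs) <= apply_above:
        return list(docs)
    top = max(d["score"] for d in docs)
    return [d for d in docs if d["score"] >= min_fraction * top]

# Hypothetical results; a tiny trigger is used so the filter kicks in.
results = [
    {"id": "A", "score": 10.0},
    {"id": "B", "score": 6.0},
    {"id": "C", "score": 1.0},
]
kept = filter_by_relative_score(results, min_fraction=0.5, apply_above=2)
print([d["id"] for d in kept])
```

Because the cutoff is relative to the top score of each query, it sidesteps comparing raw scores across queries, though the right fraction would still need tuning against real searches.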

@kshefchek
Contributor

bump, from @realmarcin:

I’m searching with a specific MONDO curie https://monarchinitiative.org/search/MONDO%3A0000554
the top hit looks correct; however, the top-left page section seems to suggest over 20k diseases? Ah, looks like it's matching ‘MONDO’ …

@jmcmurry
Member

jmcmurry commented Sep 10, 2018

please move future curie-search issue conversation over to https://github.com/monarch-initiative/monarch-app/issues/1625

kshefchek transferred this issue from monarch-initiative/monarch-legacy on Sep 26, 2019
@kshefchek
Contributor

kshefchek commented Sep 26, 2019

Transferring to the new app as this was never addressed.

@pnrobinson recently discovered another case when querying an HGVS variant label:
https://beta.monarchinitiative.org/search/NM_144997.5(FLCN):c.1429C%3ET%20(p.Arg477Ter)

Setting debugQuery=true, I can see the tokenizer is being aggressive with the punctuation in the HGVS format. The tokens queried are:

  1. nm
  2. 144997.5
  3. flcn
  4. 1429c
  5. t
  6. c
  7. nm_144997.5(flcn):c.1429c>t
  8. nm_144997.5(flcn):c.1429c>t (p.arg477ter)

This results in many false positives (from a domain perspective). Some quick ideas:

  1. Quote all input strings, effectively overriding the tokenizer (could result in a drop in true positives)
  2. Test out another tokenizer, e.g., the classic tokenizer instead of the standard one
  3. Use the Solr relevancy score and adjust at the UI level, as @kltm suggested above (#213 (comment))
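
For the first idea, a minimal sketch of quoting user input so it is sent as a single phrase query. The helper name is hypothetical. Inside a quoted phrase, only the backslash and the double quote need escaping.

```python
def as_phrase_query(user_input: str) -> str:
    """Wrap raw input in double quotes, escaping backslashes and quotes first."""
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

q = as_phrase_query("NM_144997.5(FLCN):c.1429C>T (p.Arg477Ter)")
print(q)
```

The resulting string would be passed as the q parameter. The phrase is still analyzed by each field's analyzer, so how strict the match is depends on the field being queried.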

@pnrobinson
Member

@kshefchek @mellybelly I wonder if we can have one search for the initial autocomplete, but once the user presses go, we switch strategies and show only exact or very near matches?

@kshefchek
Contributor

The keyword tokenizer is the closest we have to an exact match: https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-KeywordTokenizer. This is one of three we are matching on (standard/classic tokenizer, edge ngram, keyword). We could go this route, but we may lose valid hits, keeping in mind our test cases outlined in https://github.com/monarch-initiative/monarch-app/issues/1383

I tested out the classic tokenizer, but this did not help much with the HGVS label: 297106 down to 248200 matches. The only difference is that instead of the two tokens "1429c" and "c", we get "1429c.c", which doesn't help much.

@kshefchek
Contributor

fixed with #214
