too many results in search #213

For example, EHLERS-DANLOS SYNDROME, KYPHOSCOLIOTIC TYPE, 1 has 91491 matches.
As a reference, here is the page on production:

The issue here is that we're matching on the terms "syndrome", "type", and "1". The Solr relevancy score does factor in the IDF (inverse document frequency) of terms in order to decrease the weight of common terms; however, it doesn't completely filter these terms out of the results. If anyone is interested, here are the wiki pages for the algorithm Solr uses for its relevancy score:

Some quick ideas:
@kltm have you dealt with this in AmiGO? cc @pnrobinson @DoctorBud. I believe this was brought up on the last weekly call as well (for searches on Marfan Syndrome), which looks much better with the stopword filter:
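Since IDF only down-weights common terms and never removes them (Lucene's classic TF-IDF similarity, for instance, uses idf(t) = 1 + log(numDocs / (docFreq + 1)), which never reaches zero), an explicit stop filter is the usual way to drop a term entirely. As a rough sketch only, here is what a stopword-filtered analyzer chain could look like in a Solr schema.xml; the field type name and the contents of stopwords.txt are assumptions, not the project's actual configuration:

```xml
<!-- Hypothetical field type: common clinical filler words ("syndrome", "type",
     single digits, etc.) would be listed in stopwords.txt and dropped at both
     index and query time, so they never contribute to a match at all. -->
<fieldType name="text_stop" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```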
Looking at the results list, and keeping in mind that your Solr installation is set to a default "OR" search and uses score to bring up results, the large number of results for general terms like these is expected. That said, there could be some tweaks in there to get the #4 result up to #3; some playing with field boosts could help, but Solr's preference for larger (more informative) words and higher match counts seems to be about right.
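To illustrate the field-boost idea, here is a sketch of an edismax handler in solrconfig.xml; the handler name, field names, and boost values are assumptions chosen for the example, not the project's actual settings:

```xml
<!-- Hypothetical search handler: boost label matches over synonyms and
     definitions, and use minimum-should-match (mm) so a single common term
     like "syndrome" cannot carry a document on its own. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">label^10 synonym^3 definition^1</str>
    <!-- up to 2 terms: all must match; 3 or more: ~75% must match -->
    <str name="mm">2&lt;75%</str>
  </lst>
</requestHandler>
```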
bump, from @realmarcin:
Please move future curie-search issue conversation over to https://github.com/monarch-initiative/monarch-app/issues/1625
Transferring to the new app as this was never addressed. @pnrobinson recently discovered another case when querying an HGVS variant label. Setting debugQuery=true, I can see the tokenizer is being aggressive with the punctuation in the HGVS format; the tokens queried are:

This results in many false positives (from a domain perspective); the analyzer chain involved is sketched below. Some quick ideas:
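To make the tokenization issue concrete, here is a sketch of the sort of standard-tokenizer chain that produces this behavior; the field type name and the HGVS string in the comment are invented for illustration, not taken from the actual schema or query:

```xml
<!-- Hypothetical text field using the standard tokenizer. For an HGVS-style
     string such as "NM_000138.4:c.1234G>A" (an invented example), the
     standard tokenizer breaks the label at punctuation like ":" and ">",
     leaving short, generic tokens; under a default OR query, each of those
     small tokens matches a very large number of unrelated documents. -->
<fieldType name="text_std" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```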
@kshefchek @mellybelly I wonder if we can have one search for the initial autocomplete, but once the user presses go, we switch strategies and show only exact or very near matches?
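One way to picture that two-strategy idea, purely as a sketch (the handler names, field names, and parameters are assumptions, not existing configuration): a loose handler for as-you-type suggestions and a strict one for the final search.

```xml
<!-- Hypothetical pair of handlers: "/suggest" matches loosely against
     edge-ngram fields for autocomplete, while "/search" (used when the user
     presses go) matches only exact/keyword-analyzed fields and requires
     every query term to be present. -->
<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">label_ngram^2 synonym_ngram</str>
    <str name="mm">1</str>
  </lst>
</requestHandler>

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">label_exact^10 synonym_exact^5</str>
    <str name="mm">100%</str>
  </lst>
</requestHandler>
```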
The keyword tokenizer is the closest we have to an exact match: https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-KeywordTokenizer. This is one of three analyzers we are matching on (standard/classic tokenizer, edge ngram, keyword). We could go this route, but we may lose valid hits, keeping in mind our test cases outlined in https://github.com/monarch-initiative/monarch-app/issues/1383. I tested the classic tokenizer, but it did not help much with the HGVS label (297106 matches down to 248200); the only difference is that instead of the two tokens
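For reference, a minimal sketch of what a keyword-tokenized "exact match" field type could look like; the name and the lowercase filter are assumptions about how it might be set up, not the project's actual schema:

```xml
<!-- Hypothetical exact-match field: the keyword tokenizer emits the whole
     input as a single token, so an HGVS label (punctuation and all) only
     matches documents containing essentially the same string, at the cost of
     missing partial but still valid hits. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```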
Fixed with #214.