General search results rankings (p53 as example) #102

Open
cmungall opened this issue May 2, 2014 · 17 comments

Comments

@cmungall
Member

cmungall commented May 2, 2014

Typing "p53" (no space) in the search box has this gene showing up first:

http://amigo2.berkeleybop.org/cgi-bin/amigo2/amigo/gene_product/MGI:MGI:2146005

This is because its full name reflects its function (p53 binding).

Adding the space gives better results (we have to fix this, I'm afraid), but the mouse p53 is nowhere to be found.

Even with the MGI filter turned on with this search
http://amigo2.berkeleybop.org/cgi-bin/amigo2/amigo/search/bioentity?q=p53

It's hard to find: I had to go via the PANTHER family, and eventually I got it:
http://amigo2.berkeleybop.org/cgi-bin/amigo2/amigo/gene_product/MGI:MGI:98834

It does have p53 as the synonym...

Let's work with others on fixing this.

@kltm changed the title from "p53 search result ranking" to "Search result rankings (p53 as example)" May 2, 2014
@kltm changed the title from "Search result rankings (p53 as example)" to "Search results rankings (p53 as example)" May 2, 2014
@kltm
Member

kltm commented May 2, 2014

For the middle (bioentity live search) case, considering that synonyms are apparently not in there at all, it's doing a pretty good job. I'll add synonyms to the boost config at 1.0 for starters.

Current bio-config.yaml:
boost_weights: bioentity^2.0 bioentity_label^2.0 bioentity_name^1.0 bioentity_internal_id^1.0 isa_partof_closure_label^1.0 regulates_closure^1.0 regulates_closure_label^1.0 panther_family^1.0 panther_family_label^1.0 taxon_closure_label^1.0
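As a rough illustration of the "add synonyms at 1.0" step, the sketch below parses a boost string in the `field^weight` shape shown above, adds one more field, and serializes it again. The field name `synonym` is a placeholder, not a confirmed schema field, and only a truncated copy of the list is used. A string in this shape is also what Solr's edismax `qf` parameter accepts, though how the config value reaches Solr is not shown here.

```python
def parse_boosts(spec: str) -> dict[str, float]:
    """Parse a 'field^weight field^weight ...' string into a dict."""
    boosts = {}
    for token in spec.split():
        field, _, weight = token.partition("^")
        # A field listed without an explicit weight defaults to 1.0.
        boosts[field] = float(weight) if weight else 1.0
    return boosts


def format_boosts(boosts: dict[str, float]) -> str:
    """Serialize back to the 'field^weight' form."""
    return " ".join(f"{field}^{weight:.1f}" for field, weight in boosts.items())


# Truncated copy of the boost list above, for brevity.
current = "bioentity^2.0 bioentity_label^2.0 bioentity_name^1.0"
boosts = parse_boosts(current)

# 'synonym' is a hypothetical field name; the real schema may differ.
boosts.setdefault("synonym", 1.0)

print(format_boosts(boosts))
# bioentity^2.0 bioentity_label^2.0 bioentity_name^1.0 synonym^1.0
```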

For the other two, it's a matter of structuring the general search better. Currently, there are three categories: entity (id), entity_label, and a big ball of "stuff"--synonyms are in "stuff", along with everything else. This is done to allow a unioned search across all of the various doc types. We can boost the synonym results either by making another top-level like "important_stuff" that gets weighted higher, or by making synonyms more prevalent in the stuff, by repeating them or something.

@kltm
Member

kltm commented May 2, 2014

Either way, will need to actually change fields or field types in the loader, so we can explore it when we climb into the Java again for 2.2.

@kltm kltm added this to the 2.2 milestone May 2, 2014
@monicacecilia

Oh wow, yes that looks like a mess of search results. -- There is also similarly "funny" behavior when annotators are using the "GO ID" tool on the "Information Editor" in Web Apollo. As I understand, that tool connects to AmiGO -- but I don't know the details of that connection. "This is another story and shall be told another time" (M. Ende). What is relevant to say is that fixing this search will also improve experiences outside AmiGO.

From Seth: "We can boost the synonym results either by making another top-level like "important_stuff" that gets weighted higher, or by making synonyms more prevalent in the stuff, by repeating them or something."

Likely better to create the bag of "important stuff", than repeating the synonyms.

@kltm
Member

kltm commented May 2, 2014

Hm. A new "important stuff" bag gets one to consider how important the stuff is; maybe we need an "importanter stuff" bag too? That could get very silly pretty fast. OTOH, gaming the schema at too low a level can get fiddly.

Also to consider are issues like #24 and how they would relate to a general schema. We're going to need to extend it a little no matter what, it seems. Perhaps I'll change the item to something like: re-engineer the general schema, with a list of things we want out of it.

@kltm kltm modified the milestones: 2.3, 2.2 May 5, 2014
@kltm changed the title from "Search results rankings (p53 as example)" to "General search results rankings (p53 as example)" Sep 30, 2014
@kltm
Member

kltm commented Sep 30, 2014

We've had a similar discussion with @rbalakri about the results with "proximal" and the GO--currently when searching for "proximal", many non-GO terms take priority, which may confound some users' expectations. (E.g. "proximal rib", etc.)

After a little discussion with @cmungall about things that might be done to improve that, one possibility that we might look at is adding a field to the general search schema (maybe document_relevance_category) that would be strings like "core ontology", "peripheral bioentity"; we could then tweak the search to give greater preference to "core" entities or add a collapsible radio button set under the box that allowed you to goose the search for ontology terms, etc.

Essentially, the search would give preference based on relevance tagging done during the load stage. While this would require some playing with the loader, I feel that this happens in a transparent enough way that it might be the way forward.
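A minimal sketch of the tagging idea, under stated assumptions: `document_relevance_category` and the two category strings come from the comment above, while the tagging rule, the `document_category` input field, the boost value, and the use of Solr's edismax `bq` (boost query) parameter are illustrative choices, not the loader's actual behavior.

```python
from urllib.parse import urlencode


def relevance_category(doc: dict) -> str:
    """Tag a document at load time. The rule here is illustrative only."""
    category = doc.get("document_category")
    if category == "ontology_class" and doc.get("id", "").startswith("GO:"):
        return "core ontology"
    if category == "bioentity":
        return "peripheral bioentity"
    return "other"


def search_params(query: str, core_boost: float = 3.0) -> str:
    """Build a query string that prefers 'core' documents via a boost query."""
    params = {
        "q": query,
        "defType": "edismax",
        # bq adds score to matching documents without filtering the others out.
        "bq": f'document_relevance_category:"core ontology"^{core_boost}',
    }
    return urlencode(params)


# Placeholder IDs, not real term identifiers.
go_doc = {"id": "GO:0000000", "document_category": "ontology_class"}
other_doc = {"id": "UBERON:0000000", "document_category": "ontology_class"}

print(relevance_category(go_doc))     # core ontology
print(relevance_category(other_doc))  # other
print(search_params("proximal"))
```

A radio button set under the search box, as suggested above, could then simply vary `core_boost` or switch which category the boost query names.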

@rbalakri

I like this idea.
Can we talk about this at Barcelona?

Rama

@kltm
Member

kltm commented Sep 30, 2014

We can, but this is already scheduled for 2.3, so we'll likely be getting to it post-meeting at some point anyways.

@kltm
Member

kltm commented Nov 3, 2014

This is related to berkeleybop/bbop-js#16.

@kltm
Member

kltm commented Sep 1, 2015

Answering @cmungall on #239.

Ideally human would be first followed by MODs. This could be a configuration, or alternatively scoring each gene by number of experimental annotations would be a nice generic way to do it. This would be an easy field for @hdietze to add when loading. 

What this would boil down to would be two new fields, say: search_bin_priority_one and search_bin_priority_two. Human genes would populate the first one, MODs get the second, everybody else gets none.

The search would then be boosted on those two fields, say: search_bin_priority_one^4.0 search_bin_priority_two^2.0.
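The two-bin proposal could be sketched at load time roughly as follows. The field names and the 4.0/2.0 boosts are from the comment above; the input field names (`taxon`, `bioentity_label`), the choice to copy the label into the bin field, and the short list of MOD taxa are assumptions made for illustration.

```python
HUMAN = "NCBITaxon:9606"
# Illustrative MOD taxa only (mouse, fly, worm, zebrafish);
# the real list would presumably be configuration.
MOD_TAXA = {"NCBITaxon:10090", "NCBITaxon:7227", "NCBITaxon:6239", "NCBITaxon:7955"}


def add_priority_bins(doc: dict) -> dict:
    """Return a copy of doc with at most one priority-bin field populated."""
    out = dict(doc)
    label = doc.get("bioentity_label", "")
    taxon = doc.get("taxon")
    if taxon == HUMAN:
        out["search_bin_priority_one"] = label
    elif taxon in MOD_TAXA:
        out["search_bin_priority_two"] = label
    # Everybody else gets neither field.
    return out


# Extra boost terms to append to the existing boost string.
QF_EXTRA = "search_bin_priority_one^4.0 search_bin_priority_two^2.0"

human = add_priority_bins({"bioentity_label": "TP53", "taxon": HUMAN})
mouse = add_priority_bins({"bioentity_label": "Trp53", "taxon": "NCBITaxon:10090"})
print("search_bin_priority_one" in human)  # True
print("search_bin_priority_two" in mouse)  # True
```

Because a query term that matches the label also matches the copied bin field, human and MOD entries pick up the extra weight while other organisms score as before.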

@kltm
Member

kltm commented Oct 25, 2015

As another case, from http://jira.geneontology.org/browse/GO-1007, it would be nice to have tokenizing more sensitive to common use cases like let-23, where a user might be surprised by the fact that the tokenizer defaults to breaking on the hyphen.
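To make the hyphen problem concrete, here is a small sketch contrasting two tokenization rules. These regular expressions only approximate the behaviors being discussed; they are not the tokenizer the index actually uses.

```python
import re


def tokenize_breaking(text: str) -> list[str]:
    """Split on anything that is not a letter or digit (hyphens break tokens)."""
    return re.findall(r"[A-Za-z0-9]+", text.lower())


def tokenize_keep_hyphen(text: str) -> list[str]:
    """Keep internal hyphens so symbols like let-23 survive as one token."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text.lower())


print(tokenize_breaking("let-23"))     # ['let', '23']
print(tokenize_keep_hyphen("let-23"))  # ['let-23']
```

With the first rule, a search for let-23 becomes a search for "let" and "23" separately, which is why unrelated entries containing either piece can outrank the intended gene.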

@kltm kltm modified the milestones: 2.4, 2.5 Mar 2, 2016
@doughowe

From the Noctua session at the Geneva GO meeting. Seth suggested I post this here:

At ZFIN, for autocompletion in term entry boxes, we use a model that allows "starts with" searching for multiple words. This saves many keystrokes.
Example:
Entering "trans fac pol"
would find all terms containing words that match all three prefixes:
"trans*"
"fac*"
"pol*"

like "transcription bla bla factor bla bla bla polymerase bla bla bla"

We really like that mechanism for term searching in ZFIN...food for thought.
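The "starts with, multiple words" matching described above can be sketched in a few lines. This is not ZFIN's implementation: it assumes word order is ignored and that every typed fragment must prefix some word in the term. The term strings are illustrative, not necessarily current ontology labels.

```python
def matches_prefixes(query: str, term: str) -> bool:
    """True if every query fragment is a prefix of some word in the term."""
    words = term.lower().split()
    return all(
        any(word.startswith(prefix) for word in words)
        for prefix in query.lower().split()
    )


terms = [
    "transcription factor binding",
    "RNA polymerase II transcription factor activity",
    "protein transport",
]

hits = [t for t in terms if matches_prefixes("trans fac pol", t)]
print(hits)  # ['RNA polymerase II transcription factor activity']
```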

@cmungall
Member Author

ooh, I like this. @doughowe Is this on user-facing autocompletes as well as curation? I can see this being massively useful for biocurators (although with lego you tend to go for the subset of classes with fewer words, but not always). I don't have a strong sense of whether the average non-power user would do this much.

@kltm
Member

kltm commented Oct 13, 2017

See @ValWood's transport example on #447.

@ValWood

ValWood commented Oct 13, 2017

If you are using Lucene, we have fine-tuned our search over many iterations. We always find what we type, pretty much. @kimrutherford can point you to our weighting.

It might do what Doug describes above too. I'm not sure but it seems to work well for us. I think it even handles typos....

@kltm
Member

kltm commented Oct 13, 2017

Thank you--more input is always appreciated. That said, we already understand why we have this problem and have implemented an experimental tokenizing/parser fix that solves it (berkeleybop/bbop-manager-golr#4). The issue that we currently have is to roll out the solution and update the software to make use of it.

@kimrutherford

> have implemented an experimental tokenizing/parser fix that solves it (berkeleybop/bbop-manager-golr#4).

That issue mentions EdgeNGramTokenizer, which is what we're using at PomBase.

> It might do what Doug describes above too.

We're doing more or less as Doug describes as well as allowing minor typos. We currently index only the names and synonyms. The synonyms get a lower weighting when we query.
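For readers unfamiliar with edge n-grams, the sketch below shows the idea in plain Python: each word is indexed under its leading substrings, so a partial entry matches as the user types. It also shows names outweighing synonyms at query time. The length limits and the 2.0/1.0 weights are made-up values, typo tolerance is left out, and none of this is PomBase's actual configuration.

```python
def edge_ngrams(token: str, min_len: int = 2, max_len: int = 10) -> list[str]:
    """Leading substrings of a token, e.g. 'p53' -> ['p5', 'p53']."""
    token = token.lower()
    return [token[:n] for n in range(min_len, min(len(token), max_len) + 1)]


def score(query: str, name: str, synonyms: list[str],
          name_weight: float = 2.0, synonym_weight: float = 1.0) -> float:
    """Score one entry: a name hit outweighs a synonym hit (weights are hypothetical)."""
    q = query.lower()
    best = 0.0
    if any(q in edge_ngrams(word) for word in name.split()):
        best = name_weight
    if any(q in edge_ngrams(word) for syn in synonyms for word in syn.split()):
        best = max(best, synonym_weight)
    return best


print(edge_ngrams("p53"))              # ['p5', 'p53']
print(score("p53", "Trp53", ["p53"]))  # 1.0 (synonym hit only)
print(score("trp", "Trp53", ["p53"]))  # 2.0 (name hit)
```

Under this scheme an entry that only has p53 as a synonym is still found, just below entries whose primary name matches.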

@doughowe

doughowe commented Oct 18, 2017

Loooooooong ago @cmungall asked if we use our "multi-word begins with" search mechanism for curators only, or if it is also public facing. I believe it is only for curators. I'm not sure how intuitive or natural it would be for general database users. If you know about it, it whittles down long autosuggest lists quickly, particularly for those pesky long terms you know the name of...sort of.

Actually, I just tried it in our single box search at ZFIN.org and it seems to work there, so that is public facing. It's not hurting anything, and is helpful if you know about it.


7 participants