General search results rankings (p53 as example) #102
Comments
For the middle (bioentity live search) case, considering that synonyms are apparently not in there at all, it's doing a pretty good job. I'll add synonyms to the boost config at 1.0 for starters. Current bio-config.yaml: For the other two, it's a matter of structuring the general search better. Currently, there are three categories: entity (id), entity_label, and a big ball of "stuff"--synonyms are in "stuff", along with everything else. This is done to allow a unioned search across all of the various doc types. We can boost the synonym results either by making another top-level like "important_stuff" that gets weighted higher, or by making synonyms more prevalent in the stuff, by repeating them or something.
Either way, we will need to actually change fields or field types in the loader, so we can explore it when we climb into the Java again for 2.2.
Oh wow, yes, that looks like a mess of search results. There is similarly "funny" behavior when annotators use the "GO ID" tool on the "Information Editor" in Web Apollo. As I understand it, that tool connects to AmiGO, but I don't know the details of that connection. "This is another story and shall be told another time" (M. Ende). What is relevant to say is that fixing this search will also improve experiences outside AmiGO. From Seth: "We can boost the synonym results either by making another top-level like "important_stuff" that gets weighed higher, or by making synonyms more prevalent in the stuff, by repeating them or something." It is likely better to create the "important stuff" bag than to repeat the synonyms.
Hm. A new "important stuff" bag gets one to consider how important the stuff is; maybe we need an "importanter stuff" bag too? That could get very silly pretty fast. OTOH, gaming the schema at too low a level can get fiddly. Also to consider are issues like #24 and how they would relate to a general schema. We're going to need to extend it a little no matter what, it seems. Perhaps I'll change the item to something like: re-engineer the general schema, with a list of things we want out of it.
We've had a similar discussion with @rbalakri about the results with "proximal" and the GO: currently, when searching for "proximal", many non-GO terms take priority (e.g. "proximal rib"), which may confuse some users' expectations. After a little discussion with @cmungall about things that might be done to improve that, one possibility that we might look at is adding a field to the general search schema (maybe document_relevance_category) that would hold strings like "core ontology" or "peripheral bioentity"; we could then tweak the search to give greater preference to "core" entities, or add a collapsible radio button set under the box that allowed you to goose the search for ontology terms, etc. Essentially, this means searching and giving preference according to relevance tagging done during the load stage. While this would require some playing with the loader, I feel that this happens in a transparent enough way that it might be the way forward.
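As a rough illustration of the relevance-tagging idea above, here is a minimal sketch that re-ranks already-scored results by a per-category multiplier. Only the field name document_relevance_category and the values "core ontology" and "peripheral bioentity" come from the comment; the "peripheral ontology" value, the weights, the document shape, and the function name are assumptions made up for the example, and a real setup would more likely apply the boost inside the search engine at query time.

```python
# Hypothetical per-category multipliers; the weights are illustrative only.
CATEGORY_BOOST = {
    "core ontology": 2.0,
    "peripheral ontology": 1.0,   # assumed extra category, not from the thread
    "peripheral bioentity": 1.0,
}

def rerank(docs, category_boost, default_boost=1.0):
    """Sort docs by raw score multiplied by the boost for their category tag."""
    def boosted(doc):
        category = doc.get("document_relevance_category")
        return doc["score"] * category_boost.get(category, default_boost)
    return sorted(docs, key=boosted, reverse=True)

# Made-up results for a query like "proximal".
docs = [
    {"label": "proximal rib", "score": 1.2,
     "document_relevance_category": "peripheral ontology"},
    {"label": "some GO term containing proximal", "score": 1.0,
     "document_relevance_category": "core ontology"},
]

ranked = rerank(docs, CATEGORY_BOOST)
print([d["label"] for d in ranked])
```

With these made-up numbers the "core ontology" document moves ahead of the higher-scoring peripheral one, which is the behavior the radio button or default preference would be aiming for.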
I like this idea. Rama
We can, but this is already scheduled for 2.3, so we'll likely be getting to it post-meeting at some point anyways.
This is related to berkeleybop/bbop-js#16.
What this would boil down to would be two new fields, say: search_bin_priority_one and search_bin_priority_two. Human genes would populate the first one, MODs get the second, everybody else gets none. The search would then be boosted on those two fields, say: search_bin_priority_one^4.0 search_bin_priority_two^2.0.
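To make the boost string concrete, here is a small sketch that renders a field-to-weight table into a Solr-style qf value. The two field names and weights are the ones proposed above; the use of the edismax query parser, the helper names, and the example query are assumptions for illustration, not the actual AmiGO/GOlr configuration.

```python
from urllib.parse import urlencode

# Field names and weights as proposed in the comment above.
BOOSTS = {
    "search_bin_priority_one": 4.0,
    "search_bin_priority_two": 2.0,
}

def build_qf(boosts):
    """Render a field -> weight mapping as a space-separated 'field^weight' string."""
    return " ".join(f"{field}^{weight}" for field, weight in boosts.items())

def build_query_params(user_query, boosts):
    """Assemble query parameters, assuming the edismax parser is in use."""
    return {"defType": "edismax", "q": user_query, "qf": build_qf(boosts)}

params = build_query_params("p53", BOOSTS)
print(params["qf"])
print(urlencode(params))
```

Keeping the weights in one table means they can be tuned without touching the code that builds the query.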
As another case, from http://jira.geneontology.org/browse/GO-1007, it would be nice to have tokenizing more sensitive to common use cases like let-23, where a user might be surprised by the fact that the tokenizer defaults to breaking on the hyphen.
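The hyphen problem can be shown with two toy tokenizers. This is only a sketch in plain Python regular expressions, not the analyzer the search engine actually uses; both function names are made up for the example.

```python
import re

def tokenize_breaking_on_hyphen(text):
    """Split on every non-alphanumeric character, so 'let-23' becomes two tokens."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

def tokenize_keeping_hyphen(text):
    """Keep hyphen-joined runs such as 'let-23' together as a single token."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

print(tokenize_breaking_on_hyphen("let-23"))  # ['let', '23']
print(tokenize_keeping_hyphen("let-23"))      # ['let-23']
```

With the first tokenizer, a search for let-23 turns into a search for "let" and "23" separately, which is why unrelated documents can outrank the expected gene.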
From the Noctua session at the Geneva GO meeting. Seth suggested I post this here: At ZFIN, for autocompletion in term entry boxes, we use a model that allows "starts with" searching for multiple words. This saves many keystrokes for terms like "transcription bla bla factor bla bla bla polymerase bla bla bla". We really like that mechanism for term searching in ZFIN... food for thought.
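A minimal sketch of what a multi-word "starts with" match might look like follows. It assumes each typed fragment must be a prefix of some word in the term and that the fragments must appear in the same order as the words; whether ZFIN enforces ordering is not stated above, and the function name and sample term are made up for the example.

```python
def matches_prefixes(query, term):
    """Return True if every query fragment is a prefix of a later word in term.

    Fragments are matched left to right, each against the first remaining
    word that starts with it (an assumed, order-preserving interpretation).
    """
    prefixes = query.lower().split()
    words = term.lower().split()
    pos = 0
    for prefix in prefixes:
        # Advance to the next word that starts with this fragment.
        while pos < len(words) and not words[pos].startswith(prefix):
            pos += 1
        if pos == len(words):
            return False
        pos += 1  # consume the matched word
    return True

term = "transcription initiation factor binding to polymerase complex"  # illustrative
print(matches_prefixes("tran fac pol", term))  # True
print(matches_prefixes("pol tran", term))      # False: fragments out of order
```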
Ooh, I like this. @doughow Is this on user-facing autocompletes as well as curation? I can see this being massively useful for biocurators (although with lego you tend to go for the subset of classes with fewer words, but not always). I don't have a strong sense of whether the average non-power user would do this much.
… to cause a failure in searching similar to geneontology/amigo#102
If you are using Lucene: we have fine-tuned our search over many iterations. We always find what we type, pretty much. @kimrutherford can point you to our weighting. It might do what Doug describes above too. I'm not sure, but it seems to work well for us. I think it even handles typos...
Thank you--more input is always appreciated. That said, we already understand why we have this problem and have implemented an experimental tokenizing/parser fix that solves it (berkeleybop/bbop-manager-golr#4). The issue that we currently have is to roll out the solution and update the software to make use of it.
That issue mentions EdgeNGramTokenizer, which is what we're using at PomBase.
We're doing more or less as Doug describes, as well as allowing minor typos. We currently index only the names and synonyms. The synonyms get a lower weighting when we query.
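For readers unfamiliar with edge n-grams, here is a toy sketch of the idea: each word is indexed under all of its leading substrings, and a name hit is weighted above a synonym hit. It is plain Python, not the Lucene EdgeNGramTokenizer itself, and the weights, function names, and sample record are assumptions for illustration rather than PomBase's actual settings.

```python
def edge_ngrams(token, min_len=1, max_len=10):
    """Return the leading substrings of token, from min_len up to max_len chars."""
    token = token.lower()
    upper = min(len(token), max_len)
    return [token[:n] for n in range(min_len, upper + 1)]

def score(query, name, synonyms, name_weight=2.0, synonym_weight=1.0):
    """Score a record: name prefix hits outrank synonym prefix hits."""
    q = query.lower()

    def hit(text):
        # True if the query equals an edge n-gram of any word in the text.
        return any(q in edge_ngrams(word) for word in text.split())

    if hit(name):
        return name_weight
    if any(hit(s) for s in synonyms):
        return synonym_weight
    return 0.0

print(edge_ngrams("p53"))                          # ['p', 'p5', 'p53']
print(score("p53", "example gene one", ["p53"]))   # 1.0, synonym hit only
print(score("exa", "example gene one", ["p53"]))   # 2.0, name hit
```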
Loooooooong ago @cmungall asked if we use our "multi-word begins with" search mechanism for curators only, or if it is also public facing. I believed it was only for curators, and I'm not sure how intuitive or natural it would be for general database users. If you know about it, it whittles down long autosuggest lists quickly, particularly for those pesky long terms you know the name of... sort of. Actually, I just tried it in our single box search at ZFIN.org and it seems to work there, so that is public facing. It's not hurting anything, and is helpful if you know about it.
Typing "p53" (no space) in the search box has this gene showing up first:
http://amigo2.berkeleybop.org/cgi-bin/amigo2/amigo/gene_product/MGI:MGI:2146005
This is because its full name reflects the function of p53 binding.
Adding the space gives better results (we have to fix this, I'm afraid), but the mouse p53 is nowhere to be found.
Even with the MGI filter turned on with this search
http://amigo2.berkeleybop.org/cgi-bin/amigo2/amigo/search/bioentity?q=p53
It's hard to find - I had to go via the panther family, eventually I got it:
http://amigo2.berkeleybop.org/cgi-bin/amigo2/amigo/gene_product/MGI:MGI:98834
It does have p53 as the synonym...
Let's work with others on fixing this.