-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
consolidate results based on is_set query parameter #341
Comments
Does this change #331, such that we should only combine results if the |
hmmm, are you suggesting that the specific case of #331 could be solved by a generic solution to this issue if we require that |
Yes |
In that case, should I make a pull request to just solve #331, or should I wait for you to confirm with the rest of the consortium? The way you phrased it in your comment is very well stated. If the consortium says yes, we could solve both this issue and #331 with a simpler solution than the one I came up with for #331 alone. |
I'm crafting the question to the consortium now. So let's give them a day or two to see if there is clear consensus there... |
Just got confirmation that we should indeed handle #331 as a special case of the |
@marcodarko, the query graph @andrewsu provided in issue #331 appears to surface a possible issue with What's confusing is that Update: this may well be expected behavior. If true, I'll update some code in query_results.js to handle this case.
|
@ariutta that sounds right to me? The edge manager queries edges based on which node has less IDs. So the first edge to execute will be reversed e01: 3 IDs -> Gene and it'll likely return > 6 Gene IDs. That means the second edge will be e00 not reversed: 6 IDs -> Gene But if n02 doesn't have any IDs on it, then BTE only has 1 way to execute it....e00 not reversed for 6 IDs -> Gene and then e01 not reversed for Gene IDs -> Disease |
@colleenXu, there still appears to be a discrepancy of some sort. I created a test to demonstrate: In both of these tests, I would expect In the second test, it appears the directionality is reversed for Is there anything wrong how I formatted these records? Edit: yes, it appeared I had to change record2 by switching If |
This issue may be addressed in my branch This code block is where consolidation happens. Note that consolidation is needed for two different cases. This subsection of that block is specific to the case of the There are several things to verify before we can merge this branch. Specific to this issue about
Unrelated to this issue but still needs to be verified (more related to issue 359):
|
@ariutta Looking at the
|
Agreed, I don't think we can assume that only one node has |
Latest update handles 0, 1 or 2 QNodes having an |
Note from today's lab meeting:
|
One caution regarding cycles -- the tree traversal code gives this error if it encounters a cycle:
|
This work is in the |
Update: @marcodarko and @colleenXu found some issues, and @marcodarko and I have been working on this code to address them. I just pushed some code that appears to correctly handle I haven't had a chance to test this new code fully, but I wanted to let folks know the current status. Note: this new code may incorrectly handle a query graph like the one below, with two QEdges between the same two QNodes. Is the query graph invalid and/or can we ignore this case? If no, what should the results look like? {
"message": {
"query_graph": {
"nodes": {
"n0": {
"ids": [
"NCBIGene:3630"
]
},
"n1": {
"ids": [
"MONDO:0005068"
],
"categories": [
"biolink:Disease"
]
}
},
"edges": {
"e0": {
"subject": "n0",
"object": "n1",
"predicates": [
"biolink:gene_has_variant_that_contributes_to_disease_association"
]
},
"e1": {
"subject": "n0",
"object": "n1",
"predicates": [
"biolink:negatively_regulates"
]
}
}
}
}
} I think the query graph above should be considered invalid and be rewritten like this: {
"message": {
"query_graph": {
"nodes": {
"n0": {
"ids": [
"NCBIGene:3630"
]
},
"n1": {
"ids": [
"MONDO:0005068"
],
"categories": [
"biolink:Disease"
]
}
},
"edges": {
"e0": {
"subject": "n0",
"object": "n1",
"predicates": [
"biolink:negatively_regulates",
"biolink:gene_has_variant_that_contributes_to_disease_association"
]
}
}
}
}
} |
Unique identifiers for results I think the key question for this issue (and possibly #386 and #390) is, "What combination of data items uniquely defines a result?" I think it is the following:
Here's what a unique result ID could look like:
Are there any cases this would miss, where two valid results could have the same resultID? |
@ariutta sorry for the late reply. For the first post, as we discussed on Slack, I think we can assume 1 edge between two QNodes when a user wants to specify multiple predicates (they'll just use the predicates array on the QEdge) - which is your second case. However, I'm not sure of the effect of future TRAPI changes (particularly this one, and this one might also be related). For the second post: To confirm: are the query graphs in the first post #341 (comment) meant to show a 1-hop Explain structure (where both n0 and n1 have IDs)? In a 1-hop Explain, I think it's not possible to talk about the |
Just to chime in that I'm in agreement with @colleenXu replies above. I've never seen a query with two separate edges between a pair of nodes, so I'm fine with calling that invalid for now. And |
These query graphs were just for the multiple predicates question. I agree that adding The resultID question is intended to be applicable for any query graph. The example was a 1-hop query for simplicity, but we could show a sample resultID for a 2-hop query:
In the example above, I used Restated another way, can you think of any cases where two separate results could both have the same resultID if the resultID is defined as for every record in the result, we concatenate the following items:
to finally produce a single string like this: |
The purpose of the result ID is to do consolidation. I think the simplest way of doing consolidation is to take one pass to group by result ID and another pass to consolidate groups. So at a high level, query_results.js would do this:
I don't especially like the names |
Queries that use query
|
note the simpler version of C1: it should have 10 results, since each chem is attached to both diseases
|
UPDATE: SOLVED @tokebe updated the logging on the It appears some queries get discordant values for |
UPDATE: SOLVED Pulling the content from Slack over to here... For the following query,
If you log this:
Here's an example of what's expected:
Here's an example of what you actually see at the moment for the
|
@colleenXu, for the query in your comment re: a simpler C1, I'm getting 352 results with the |
@ariutta sorry, it looks like my edit of that comment didn't go through. I've edited it now so that it should have two IDs in the QNode. This should give only 10 results |
@colleenXu, I just tried the updated query. Prod gave 20 results. |
@ariutta yep I got 10 results from |
@colleenXu, when I post the query to https://api.bte.ncats.io/v1/query, I get 20 results. When I post it to http://localhost:3000/v1/query, I get 0 results, regardless of whether I use the I did To help debug, here's the command I use to start the server: |
@ariutta It looks like all records come from Connections Hypothesis Provider API (server url http://chp.thayer.dartmouth.edu). Have you run the cron job, and successfully retrieved the meta_knowledge_graph from that server url? And your config file contains Connections Hypothesis Provider API (not commented out)? https://github.com/biothings/BioThings_Explorer_TRAPI/blob/b0a53722b8abb526474ea7013d6af498a2eb7902/src/routes/v1/config.js#L227 EDIT: if that's not it, it would help to get your TRAPI response / console logs, and we can review those... |
I'd assume the cron job would have run, but I'm not sure how to check it. "Connections Hypothesis Provider API" is unchecked (just left it as default on latest |
@ariutta I use this command to run the cron job when I want to (it exits when done, runs from the bte-trapi-workspace directory): npm run smartapi_sync --workspace='@biothings-explorer/bte-trapi' So it sounds like Connections...should be queried by your local machine if it's available.... |
That command fixed it, thanks! Now I'm also getting 10 results on the |
The Reasoner API defines an
is_set
parameter (link) that should be used to help define our results merging behavior. One example that uses thisis_set
parameter is in NCATSTranslator/testing#134 -- this query is visually depicted below:Currently in the results (https://arax.ncats.io/?r=604132c0-b073-49ad-be84-902231915061), BTE has three separate records where pazopanib is in
n3
-- each record has a different gene inn0
(HDAC2, TP53, and ALK).However, the intent of the
is_set
parameter in the query is to indicate that those three results be combined into a single results object. Although it's shown for a different SmallMolecule (Rosmarinic Acid), the ARAX results at the same link above shows the desired behavior:related to our other issues on combining results: #331 and #164
The text was updated successfully, but these errors were encountered: