Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consolidate results based on is_set query parameter #341

Closed
andrewsu opened this issue Oct 29, 2021 · 36 comments
Closed

consolidate results based on is_set query parameter #341

andrewsu opened this issue Oct 29, 2021 · 36 comments

Comments

@andrewsu
Copy link
Member

The Reasoner API defines an is_set parameter (link) that should be used to help define our results merging behavior. One example that uses this is_set parameter is in NCATSTranslator/testing#134 -- this query is visually depicted below:

image

Currently in the results (https://arax.ncats.io/?r=604132c0-b073-49ad-be84-902231915061), BTE has three separate records where pazopanib is in n3 -- each record has a different gene in n0 (HDAC2, TP53, and ALK).

image
image
image

However, the intent of the is_set parameter in the query is to indicate that those three results be combined into a single results object. Although it's shown for a different SmallMolecule (Rosmarinic Acid), the ARAX results at the same link above shows the desired behavior:

image

related to our other issues on combining results: #331 and #164

@ariutta
Copy link
Collaborator

ariutta commented Nov 3, 2021

Does this change #331, such that we should only combine results if the is_set parameter is specified?

@andrewsu
Copy link
Member Author

andrewsu commented Nov 3, 2021

hmmm, are you suggesting that the specific case of #331 could be solved by a generic solution to this issue if we require that is_set is true for multi-ID nodes? Would that greatly simplify the implementation? If so, we could discuss this with the rest of the consortium...

@ariutta
Copy link
Collaborator

ariutta commented Nov 3, 2021

Yes

@ariutta
Copy link
Collaborator

ariutta commented Nov 3, 2021

In that case, should I make a pull request to just solve #331, or should I wait for you to confirm with the rest of the consortium? The way you phrased it in your comment is very well stated. If the consortium says yes, we could solve both this issue and #331 with a simpler solution than the one I came up with for #331 alone.

@andrewsu
Copy link
Member Author

andrewsu commented Nov 3, 2021

I'm crafting the question to the consortium now. So let's give them a day or two to see if there is clear consensus there...

@andrewsu
Copy link
Member Author

andrewsu commented Nov 4, 2021

Just got confirmation that we should indeed handle #331 as a special case of the is_set usage described in this issue. So, I will close #331, and please implement the generic solution based on the assumption that is_set will be True if multiple IDs in the query are to be collapsed into a single result. Good catch, @ariutta!

@ariutta
Copy link
Collaborator

ariutta commented Nov 10, 2021

@marcodarko, the query graph @andrewsu provided in issue #331 appears to surface a possible issue with QEdge.isReversed. In my tests, e01 has isReversed() == true when the query graph node n02 has the property ids. Without the property ids, isReversed() == false. Is this expected behavior?

What's confusing is that e00 always has isReversed() == false, regardless of whether the query graph node n00 has the property ids.

Update: this may well be expected behavior. If true, I'll update some code in query_results.js to handle this case.

{
  "message": {
    "query_graph": {
      "edges": {
        "e00": {
          "subject": "n00",
          "object": "n01"
        },
        "e01": {
          "subject": "n01",
          "object": "n02"
        }
      },
      "nodes": {
        "n00": {
          "ids": [
            "DRUGBANK:DB00215",
            "DRUGBANK:DB01175",
            "DRUGBANK:DB00472",
            "DRUGBANK:DB00176",
            "DRUGBANK:DB00715",
            "DRUGBANK:DB01104"
          ],
          "categories": ["biolink:SmallMolecule"]
        },
        "n01": {
          "categories": ["biolink:Gene"]
        },
        "n02": {
          "ids": ["DOID:5844", "DOID:1936", "DOID:3393"],
          "categories": ["biolink:Disease"]
        }
      }
    }
  }
}

@colleenXu
Copy link
Collaborator

@ariutta that sounds right to me? The edge manager queries edges based on which node has less IDs. So the first edge to execute will be reversed e01: 3 IDs -> Gene and it'll likely return > 6 Gene IDs. That means the second edge will be e00 not reversed: 6 IDs -> Gene

But if n02 doesn't have any IDs on it, then BTE only has 1 way to execute it....e00 not reversed for 6 IDs -> Gene and then e01 not reversed for Gene IDs -> Disease

@ariutta
Copy link
Collaborator

ariutta commented Nov 11, 2021

@colleenXu, there still appears to be a discrepancy of some sort. I created a test to demonstrate:
https://github.com/biothings/bte_trapi_query_graph_handler/blob/ba086e4d7dc1553cba5007f629f34473112157de/__test__/integration/helper.test.js#L196

In both of these tests, I would expect n1 to be associated with NCBIGene:3778, n2 with MONDO:0011122 & n3 with a second instance of NCBIGene:3778. That's what we get for the first test, but not for the second test.

In the second test, it appears the directionality is reversed for helper._getInputID, but not for helper._getInputQueryNodeID. If the directionality were to be consistently reversed for edge2, we'd still expect n2 to be associated with MONDO:0011122. But in this case, n2 is associated with NCBIGene:3778.

Is there anything wrong how I formatted these records?

Edit: yes, it appeared I had to change record2 by switching $input and $output. Compare this record:
https://github.com/biothings/bte_trapi_query_graph_handler/blob/300657d788be3e8de045aaba63d63a315947f694/__test__/integration/QueryResult.test.js#L353-L381
vs. this record:
https://github.com/biothings/bte_trapi_query_graph_handler/blob/300657d788be3e8de045aaba63d63a315947f694/__test__/integration/QueryResult.test.js#L544-L572

If ids is specified for the "last" query node, the query_edge.js code will automatically switch $input and $output for the relevant records. For the test, we need to do this manually.

@ariutta
Copy link
Collaborator

ariutta commented Nov 16, 2021

This issue may be addressed in my branch results_assembly_tree_traversal_is_set. Note that the code changes in this branch are primarily intended for improved performance (10x in some cases!) in results assembly, taking advantage of the tree constraint (issue 359). The handling of this is_set issue is a smaller part tacked on.

This code block is where consolidation happens. Note that consolidation is needed for two different cases. This subsection of that block is specific to the case of the is_set param. If we aren't ready to use all the changes from this branch, we could potentially port just the is_set consolidation code to the current production codebase, but it wouldn't be 100% straightforward.

There are several things to verify before we can merge this branch.

Specific to this issue about is_set params:

  • Is the way I specified is_set in the test records correct? (@tokebe may be the best positioned to answer the questions about the implementation of is_set in the code?)
  • Is this method correct for determining whether an is_set param has been specified? If yes, it should probably get moved to another file like helper.js.
  • related to the point above, my code assumes only one QNode of a QEdge can have an is_set param. And because the current codebase automatically reverses an edge if the subject QNode doesn't have a curie but the object QNode does (relevant code), my code further assumes if a QNode has an is_set param, that QNode is an input (related code). update: now only making these assumptions: a QEdge can have 0, 1 or 2 QNodes having is_set params. If the QEdge has just 1 QNode with an is_set param, that QNode is the input (or will be reversed by the QEdge code to be treated as the input). See this code section.
  • My tests were only for cases where QNodes with is_set params were either the root node or a leaf node. (update: removed this assumption)

Unrelated to this issue but still needs to be verified (more related to issue 359):

  • This code assumes either the records passed in have been de-duplicated or it doesn't matter if we return duplicate identical results. @marcodarko, does that sound OK? (update: now handles duplicates)
  • This code also assumes the query graph doesn't have cycles. Should we add a validation step to the query graph parser to be run before the query is executed? Note that it's probably possible to change the _getPreresults method to handle cycles, but it would add complexity.
  • I created a new test case that fails, and I'm not sure why. Is it because I'm creating the records incorrectly, or is it a problem with the algorithm? Does it matter?

@tokebe
Copy link
Member

tokebe commented Nov 16, 2021

@ariutta Looking at the is_set params questions:

  • Given your assumptions, I'm pretty sure you have the node parameters set/checked correctly. input.obj[0] is definitely the right place, and I don't think there's a case where there will ever be an obj[1] to worry about?
  • Based on the description for is_set, it does look like it's possible for both nodes to have is_set, which does change the behavior (if both have is_set=true, then then every subject must connect to at least one object, if I'm reading it right). I'm not sure that it's necessarily safe to assume only one node will have is_set=true, but that goes a bit outside my current sphere of knowledge...

@andrewsu
Copy link
Member Author

Agreed, I don't think we can assume that only one node has is_set=true, and I think we also cannot assume that it will only be on a root or leaf node.

@ariutta
Copy link
Collaborator

ariutta commented Nov 17, 2021

@colleenXu
Copy link
Collaborator

Note from today's lab meeting:

  • adding the cycles-prevalidation may be less essential since we won't be getting that from demo queries / won't be expecting these queries for now...

@ariutta
Copy link
Collaborator

ariutta commented Nov 17, 2021

One caution regarding cycles -- the tree traversal code gives this error if it encounters a cycle:

RangeError: Maximum call stack size exceeded

@ariutta
Copy link
Collaborator

ariutta commented Nov 17, 2021

I discussed results assembly and consolidation with some of our group and created a graphic to explain it. In case it's helpful for anyone else, I'll share it here too:

image

@ariutta
Copy link
Collaborator

ariutta commented Dec 7, 2021

This work is in the is-set branch. Current status: appears to work, but @marcodarko is verifying and making any fixes required.

@ariutta
Copy link
Collaborator

ariutta commented Jan 14, 2022

Update: @marcodarko and @colleenXu found some issues, and @marcodarko and I have been working on this code to address them. I just pushed some code that appears to correctly handle is_set:
https://github.com/biothings/bte_trapi_query_graph_handler/blob/is_set2/src/query_results.js#L312-L385

I haven't had a chance to test this new code fully, but I wanted to let folks know the current status.

Note: this new code may incorrectly handle a query graph like the one below, with two QEdges between the same two QNodes. Is the query graph invalid and/or can we ignore this case? If no, what should the results look like?

image

{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": [
            "NCBIGene:3630"
          ]
        },
        "n1": {
          "ids": [
            "MONDO:0005068"
          ],
          "categories": [
            "biolink:Disease"
          ]
        }
      },
      "edges": {
        "e0": {
          "subject": "n0",
          "object": "n1",
          "predicates": [
            "biolink:gene_has_variant_that_contributes_to_disease_association"
          ]
        },
        "e1": {
          "subject": "n0",
          "object": "n1",
          "predicates": [
            "biolink:negatively_regulates"
          ]
        }
      }
    }
  }
}

I think the query graph above should be considered invalid and be rewritten like this:
image

{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": [
            "NCBIGene:3630"
          ]
        },
        "n1": {
          "ids": [
            "MONDO:0005068"
          ],
          "categories": [
            "biolink:Disease"
          ]
        }
      },
      "edges": {
        "e0": {
          "subject": "n0",
          "object": "n1",
          "predicates": [
            "biolink:negatively_regulates",
            "biolink:gene_has_variant_that_contributes_to_disease_association"
          ]
        }
      }
    }
  }
}

@ariutta
Copy link
Collaborator

ariutta commented Jan 18, 2022

Unique identifiers for results

I think the key question for this issue (and possibly #386 and #390) is, "What combination of data items uniquely defines a result?" I think it is the following:

  1. QNodeID (e.g., n0) for every node
  2. node primaryID, (e.g., NCBIGene:3630) for every QNode where "is_set" = false

Here's what a unique result ID could look like:

  • when "is_set" = true for QNode n1:
    n0-NCBIGene:3630-n1
  • when "is_set" = false for QNode n1:
    n0-NCBIGene:3630-n1-MONDO:001

Are there any cases this would miss, where two valid results could have the same resultID?

@colleenXu
Copy link
Collaborator

colleenXu commented Jan 20, 2022

@ariutta sorry for the late reply.

For the first post, as we discussed on Slack, I think we can assume 1 edge between two QNodes when a user wants to specify multiple predicates (they'll just use the predicates array on the QEdge) - which is your second case. However, I'm not sure of the effect of future TRAPI changes (particularly this one, and this one might also be related).


For the second post:

To confirm: are the query graphs in the first post #341 (comment) meant to show a 1-hop Explain structure (where both n0 and n1 have IDs)?

In a 1-hop Explain, I think it's not possible to talk about the is-set behavior. Since there is a different form of "set-related logic" operating: that n0 must map to a specific ID and n1 must map to another specific ID. There aren't multiple nodes to "handle together" with the is-set behavior.

@andrewsu
Copy link
Member Author

Just to chime in that I'm in agreement with @colleenXu replies above. I've never seen a query with two separate edges between a pair of nodes, so I'm fine with calling that invalid for now. And is-set is not really meaningful on query nodes that are constrained to a single identifier. (I may not be understanding your most recent comment completely, and if that's the case, let's set up a meeting to chat...)

@ariutta
Copy link
Collaborator

ariutta commented Jan 20, 2022

are the query graphs in the first post #341 (comment) meant to show a 1-hop Explain structure (where both n0 and n1 have IDs)?

These query graphs were just for the multiple predicates question. I agree that adding is_set to either node wouldn't make sense.


The resultID question is intended to be applicable for any query graph. The example was a 1-hop query for simplicity, but we could show a sample resultID for a 2-hop query:

  • when "is_set" = true for n1:
    n0-NCBIGene:3630-n1_&_n1-n2-PUBCHEM.COMPOUND:43815
  • when "is_set" = false for n1:
    n0-NCBIGene:3630-n1-MONDO:0005068_&_n1-MONDO:0005068-n2-PUBCHEM.COMPOUND:43815
    n0-NCBIGene:3630-n1-MONDO:0005010_&_n1-MONDO:0005010-n2-PUBCHEM.COMPOUND:43815

In the example above, I used _&_ for concatenating between records. I specified every input/output pair of nodes because I thought that's more clear, but if we don't like the repeated nodes (e.g., n1_&_n1), we could include just the first input and after that only include outputs, e.g.: n0-NCBIGene:3630_&_n1_&_n2-PUBCHEM.COMPOUND:43815.

Restated another way, can you think of any cases where two separate results could both have the same resultID if the resultID is defined as for every record in the result, we concatenate the following items:

  • input QNodeID
  • input primaryID (if is_set false)
  • output QNodeID
  • output primaryID (if is_set false)

to finally produce a single string like this:
record0 input QNodeID + record0 input primaryID (if is_set false) + - + record0 output QNodeID + record0 output primaryID (if is_set false) + _&_ + record1 input QNodeID + record1 input primaryID (if is_set false) + - + record1 output QNodeID + record1 output primaryID (if is_set false) ...

@ariutta
Copy link
Collaborator

ariutta commented Jan 20, 2022

The purpose of the result ID is to do consolidation. I think the simplest way of doing consolidation is to take one pass to group by result ID and another pass to consolidate groups. So at a high level, query_results.js would do this:

  1. create sets of records where each set has one record per QEdge and each record in a set has the same primaryID as its neighbor(s) at the same QNode (currently calling each set a preresult; there can be one or multiple preresults per result)
  2. group those sets by result ID
  3. consolidate each group (currently calling each of these a consolidatedPreresult; consolidatedPreresults are 1:1 with results, where each consolidatedPreresult becomes a result)
  4. format to match the translator standard (calling each of these a result)

I don't especially like the names preresult and consolidatedPreresult. I was thinking of unconsolidatedResult and consolidatedResult or atomicResult and mergedResult, but I'm not sure those are any better. If anyone has suggestions, I'd love to rename these to something more intuitive.

@colleenXu
Copy link
Collaborator

colleenXu commented Jan 21, 2022

Queries that use is-set:

  • the opening post from Andrew from Translator standup: this query
  • is-set version of D2
query
{
  "message": {
    "query_graph": {
      "edges": {
        "e00": {
          "subject": "n00",
          "object": "n01"
        },
        "e01": {
          "subject": "n01",
          "object": "n02"
        }
      },
      "nodes": {
        "n00": {
          "ids": [
            "DRUGBANK:DB00215",
            "DRUGBANK:DB01175",
            "DRUGBANK:DB00472",
            "DRUGBANK:DB00176",
            "DRUGBANK:DB00715",
            "DRUGBANK:DB01104"
          ],
          "categories": ["biolink:SmallMolecule"],
          "is_set": true
        },
        "n01": {
          "categories": ["biolink:Gene"]
        },
        "n02": {
          "ids": [
              "DOID:5844", 
              "DOID:1936", 
              "DOID:3393"
            ],
          "categories": ["biolink:Disease"],
          "is_set": true
        }
      }
    }
  }
}

@colleenXu
Copy link
Collaborator

colleenXu commented Jan 21, 2022

note the simpler version of C1: it should have 10 results, since each chem is attached to both diseases

{
    "message": {
        "query_graph": {
            "nodes": {
                "n00": {
                    "categories": ["biolink:Disease"],
                    "ids": [
                        "MONDO:0013433",
                        "MONDO:0019340"
                    ],
                    "is_set": true
                },
                "n01": {
                    "categories": ["biolink:SmallMolecule"],
                    "is_set": false
                }
            },
            "edges": {
                "e00": {
                    "subject": "n00",
                    "object": "n01",
                    "predicates": [
                        "biolink:has_real_world_evidence_of_association_with"
                    ]
                }
            }
        }
    }
}

@ariutta
Copy link
Collaborator

ariutta commented Jan 26, 2022

UPDATE: SOLVED

@tokebe updated the logging on the is_set2 branch, so we were able to try it out.

It appears some queries get discordant values for helper._getInputQueryNodeID, helper._getInputID, helper._getOutputQueryNodeID and helper._getOutputID:
https://suwulab.slack.com/archives/CC218TEKC/p1643236225129800?thread_ts=1643220256.112100&cid=CC218TEKC

@ariutta
Copy link
Collaborator

ariutta commented Jan 27, 2022

UPDATE: SOLVED

Pulling the content from Slack over to here...

For the following query, n0 should only get the primaryID NCBIGene:23221:

{
  "message": {
    "query_graph": {
      "edges": {
        "e01": {
          "object": "n0",
          "subject": "n1",
          "predicates": [
            "biolink:entity_negatively_regulates_entity"
          ]   
        },  
        "e02": {
          "object": "n1",
          "subject": "n2",
          "predicates": [
            "biolink:increases_abundance_of",
            "biolink:increases_expression_of",
            "biolink:increases_stability_of",
            "biolink:increases_uptake_of",
            "biolink:decreases_degradation_of",
            "biolink:increases_secretion_of",
            "biolink:increases_metabolic_processing_of",
            "biolink:increases_folding_of",
            "biolink:increases_localization_of",
            "biolink:increases_synthesis_of",
            "biolink:increases_response_to",
            "biolink:increases_splicing_of",
            "biolink:increases_mutation_rate_of",
            "biolink:increases_transport_of",
            "biolink:increases_activity_of",
            "biolink:increases_molecular_modification_of",
            "biolink:increases_molecular_interaction"
          ]   
        }   
      },  
      "nodes": {
        "n0": {
          "ids": [
            "NCBIGene:23221"
          ],  
          "categories": [
            "biolink:Gene"
          ]   
        },  
        "n1": {
          "categories": [
            "biolink:Gene"
          ]   
        },  
        "n2": {
          "categories": [
            "biolink:SmallMolecule"
          ]   
        }   
      }   
    }   
  }
}

If you log this:

records.forEach(record => {
  debug(`  inputQueryNodeID: ${helper._getInputQueryNodeID(record)}, inputPrimaryID: ${helper._getInputID(record)}, outputQueryNodeID: ${helper._getOutputQueryNodeID(record)}, outputPrimaryID: ${helper._getOutputID(record)}`)
});

Here's an example of what's expected:

inputQueryNodeID: n1, inputPrimaryID: NCBIGene:6794, outputQueryNodeID: n0, outputPrimaryID: NCBIGene:23221

Here's an example of what you actually see at the moment for the is_set2 branch:

inputQueryNodeID: n0, inputPrimaryID: NCBIGene:6794, outputQueryNodeID: n1, outputPrimaryID: NCBIGene:23221

@ariutta
Copy link
Collaborator

ariutta commented Jan 28, 2022

@colleenXu, for the query in your comment re: a simpler C1, I'm getting 352 results with the is_set2_pr86 branch vs. 903 with the production code. But I should be getting just 10 results?

@colleenXu
Copy link
Collaborator

@ariutta sorry, it looks like my edit of that comment didn't go through. I've edited it now so that it should have two IDs in the QNode. This should give only 10 results

@ariutta
Copy link
Collaborator

ariutta commented Jan 31, 2022

@colleenXu, I just tried the updated query. Prod gave 20 results. is_set2_pr86 gave 0 results. Did you get 10 results from is_set2_pr86?

@colleenXu
Copy link
Collaborator

@ariutta yep I got 10 results from is_set2_pr86....I'm not sure why you aren't when the prod query looks like it's working correctly. Both queries are post to http://localhost:3000/v1/query....

@ariutta
Copy link
Collaborator

ariutta commented Jan 31, 2022

@colleenXu, when I post the query to https://api.bte.ncats.io/v1/query, I get 20 results. When I post it to http://localhost:3000/v1/query, I get 0 results, regardless of whether I use the main or the is_set2_pr86 branch of https://github.com/biothings/bte_trapi_query_graph_handler.

I did npm run pull to get the latest code from all the branches. Any ideas why I'm getting 0 results?

To help debug, here's the command I use to start the server:
npm run debug USE_THREADING=false --workspace='@biothings-explorer/bte-trapi'

@colleenXu
Copy link
Collaborator

colleenXu commented Feb 1, 2022

@ariutta It looks like all records come from Connections Hypothesis Provider API (server url http://chp.thayer.dartmouth.edu).

Have you run the cron job, and successfully retrieved the meta_knowledge_graph from that server url? And your config file contains Connections Hypothesis Provider API (not commented out)? https://github.com/biothings/BioThings_Explorer_TRAPI/blob/b0a53722b8abb526474ea7013d6af498a2eb7902/src/routes/v1/config.js#L227

EDIT: if that's not it, it would help to get your TRAPI response / console logs, and we can review those...

@ariutta
Copy link
Collaborator

ariutta commented Feb 1, 2022

I'd assume the cron job would have run, but I'm not sure how to check it.

"Connections Hypothesis Provider API" is unchecked (just left it as default on latest main).

@colleenXu
Copy link
Collaborator

@ariutta I use this command to run the cron job when I want to (it exits when done, runs from the bte-trapi-workspace directory): npm run smartapi_sync --workspace='@biothings-explorer/bte-trapi'

So it sounds like Connections...should be queried by your local machine if it's available....

@ariutta
Copy link
Collaborator

ariutta commented Feb 1, 2022

That command fixed it, thanks!
npm run smartapi_sync --workspace='@biothings-explorer/bte-trapi'

Now I'm also getting 10 results on the is_set2_pr86 branch, so as far as I can tell, this issue #341 is solved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants