add support of collections #373

Open
vdancik opened this issue Sep 13, 2022 · 16 comments

@vdancik
Collaborator

vdancik commented Sep 13, 2022

We should add support for collections in TRAPI by adding a Boolean property is_set to KnowledgeGraph.Node and QueryGraph.QNode to indicate that a node represents a collection of entities rather than a single entity.

Since QueryGraph.QNode already has an is_set property with a somewhat confusing meaning, we should also add collate to QueryGraph.QNode to indicate that nodes in results should be grouped.

@vdancik vdancik added this to the v1.4 milestone Sep 13, 2022
@edeutsch
Collaborator

Following up on today's discussion of "use case 3" (collections and enrichment), I maintain that this problem was solved long ago: ARAX implements exactly this with existing TRAPI 1.3, and no change is needed. Here's my example query:

{
  "edges": {
    "e0": {
      "subject": "n0",
      "object": "n1"
    }
  },
  "nodes": {
    "n0": {
      "ids": [
        "UniProtKB:Q9BXW9",
        "UniProtKB:Q9NW38",
        "UniProtKB:Q9NPD8",
        "UniProtKB:Q9NVI1",
        "UniProtKB:Q9UI95",
        "UniProtKB:O15360"
      ],
      "is_set": true,
      "categories": [
        "biolink:Protein"
      ]
    },
    "n1": {
      "is_set": false,
      "categories": [
        "biolink:Disease"
      ]
    }
  }
}

Notably, is_set = true indicates that the list of ids should be treated as a group.

And here's the ARAX result for this query:
https://arax.ncats.io/?r=64606

Each result is a disease that is highly connected to that list of proteins (not necessarily all of them). A higher fraction of that set causes results to bubble to the top, and more edges also cause higher ranking.

The set/collection for the query is defined by the QNode.ids list and QNode.is_set=true.
The set/collection for the results is defined by the bindings in each Result between KG Nodes and the relevant QNode.

I think this is simple and logical and does everything we need.
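
The ranking described above (higher fraction of the set first, then more edges) can be sketched roughly as follows. This is only an illustration, not the ARAX implementation; the function name `rank_by_set_coverage` and the `(set_member_id, candidate_id)` edge format are assumptions made for the example.

```python
from collections import defaultdict


def rank_by_set_coverage(query_ids, edges):
    """Rank candidate nodes by how much of the query set they connect to.

    query_ids: the QNode.ids list of the is_set=true node.
    edges: iterable of (set_member_id, candidate_id) pairs (hypothetical format).
    Returns (candidate, fraction_covered, edge_count) tuples, best first.
    """
    query = set(query_ids)
    members = defaultdict(set)      # candidate -> distinct set members it touches
    edge_counts = defaultdict(int)  # candidate -> number of supporting edges
    for member, candidate in edges:
        if member in query:
            members[candidate].add(member)
            edge_counts[candidate] += 1
    ranked = [
        (cand, len(found) / len(query), edge_counts[cand])
        for cand, found in members.items()
    ]
    # Higher coverage first, then more edges, then id for a stable order.
    ranked.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return ranked
```

Candidates that touch only part of the set are kept and simply rank lower, which matches the "not necessarily all" behavior described above.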

@andrewsu

After further thought, I think I agree with @edeutsch here. Originally I was thinking there were two use cases that should be handled separately -- for results merging and for enrichment-based associations. But the query behavior for both is the same, and the enrichment score can be reflected in the results scoring. So I'm on board with is_set already handling the use cases as I see them...

@cbizon
Contributor

cbizon commented Sep 16, 2022

is_set might be the answer, I agree. But I'm a little unsure how it works. I understand the example that @edeutsch posted above, but I don't really understand the behavior for something like this (@andrewsu is this what you meant by result merging?):

{
  "edges": {
    "e0": {
      "subject": "n0",
      "object": "n1"
    },
    "e1": { ...}
  },
  "nodes": {
    "n0": {
      "is_set": false,
      "categories": [
        "biolink:Disease"
      ],
      "ids": ["MONDO:1234"]
    },
    "n1": {
      "is_set": true,
      "categories": [
        "biolink:Protein"
      ]
    },
    "n2": {
      "is_set": false,
      "categories": [
        "biolink:Disease"
      ]
    }
  }
}

@edeutsch
Collaborator

This is also valid, but a different use case than we were discussing. In this case, each Result is MONDO:1234 at one end and a disease at the other end, and then a set/collection of proteins that they share in common between them in the middle. Ranking should be something like the results with the most shared proteins appear highest, although there is plenty of room for improvements on the ranking that could take things like the quality of the edges, NGD between the two diseases, etc. into account as well.
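
A rough sketch of that grouping, assuming the protein neighbors of each disease have already been looked up. The `neighbors` mapping and the function name are hypothetical, and ranking here uses only the shared-protein count, none of the other signals mentioned.

```python
def shared_intermediates(pinned_id, neighbors):
    """Group shared proteins per candidate disease.

    neighbors: dict mapping each disease id to the set of protein ids linked
    to it (hypothetical input format).
    Returns (disease, shared_proteins) pairs, most shared proteins first.
    """
    pinned = neighbors.get(pinned_id, set())
    results = []
    for disease, proteins in neighbors.items():
        if disease == pinned_id:
            continue
        shared = pinned & proteins
        if shared:  # every bound protein touches both n0 and n2
            results.append((disease, sorted(shared)))
    results.sort(key=lambda r: (-len(r[1]), r[0]))
    return results
```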

@cbizon
Contributor

cbizon commented Sep 16, 2022

So in the case that I put above every element of n1 in an answer must be attached to both n0 and n2?

@edeutsch
Collaborator

In the ARAX implementation currently, yes. I suppose there might be an opportunity for different implementations to include only partially connected nodes, although I wouldn't recommend it. Seems related to the whole "can you return partial paths" discussion, which I'm not certain we ever really resolved.

@cbizon
Contributor

cbizon commented Sep 16, 2022

So it seems like there is different behavior for the same construct? If it's a bound node then I do enrichment, but if it's an unbound node then I don't?

@edeutsch
Collaborator

I don't think the behavior needs to be any different whether it is bound or unbound. I suppose it might be, as a refinement decided by the implementer, but I think it would normally be the same.

@cbizon
Contributor

cbizon commented Sep 20, 2022

Sorry, I might be missing something, but is it up to the server to decide how to implement is_set? It might mean fully connected, or it might mean partially connected, and partially connected might mean enrichment, or max connectedness, or other versions?

@edeutsch
Collaborator

Until we decide that everyone has to do things the same way, I suppose we're all free to do things a bit differently. Aragorn is doing a whole lot of things differently than ARAX. Our current definition for is_set is this:

        is_set:
          type: boolean
          description: >-
            Boolean that if set to true, indicates that this QNode MAY have
            multiple KnowledgeGraph Nodes bound to it within each Result.
            The nodes in a set should be considered as a set of independent
            nodes, rather than a set of dependent nodes, i.e., the answer
            would still be valid if the nodes in the set were instead returned
            individually. Multiple QNodes may have is_set=True. If a QNode
            (n1) with is_set=True is connected to a QNode (n2) with
            is_set=False, each n1 must be connected to n2. If a QNode (n1)
            with is_set=True is connected to a QNode (n2) with is_set=True,
            each n1 must be connected to at least one n2.

So a strict reading means to me that partial connectedness is not permitted (contrary to what I supposed above).

It stipulates nothing about how ranking should be done, and I'm sure there is diversity in ideas on how ranking is best done in cases like this, from enrichment to max connectedness. So until we stipulate how it must be done, there can be diversity.
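
To make the strict reading concrete, here is a minimal sketch of a check for one QEdge within a single Result. It is one interpretation of the description quoted above, not code from any Translator service; the function name and argument shapes are assumptions, and edge direction is ignored.

```python
def set_is_fully_connected(n1_nodes, n2_nodes, n2_is_set, kg_edges):
    """Check the is_set connectivity rule for n1 (is_set=True) against n2.

    n1_nodes, n2_nodes: KG node ids bound to each QNode in one Result.
    n2_is_set: the is_set flag of n2.
    kg_edges: iterable of (subject, object) pairs bound to the QEdge.
    """
    linked = set()
    for subject, obj in kg_edges:
        linked.add((subject, obj))
        linked.add((obj, subject))  # direction is ignored in this sketch
    required = set(n2_nodes)
    for a in n1_nodes:
        hits = {b for b in required if (a, b) in linked}
        if n2_is_set:
            ok = bool(hits)  # at least one n2
        else:
            ok = bool(required) and hits == required  # connected to n2 itself
        if not ok:
            return False
    return True
```

Under this reading, a Result containing a protein with no edge to the disease would be rejected rather than ranked lower.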

@vdancik vdancik modified the milestones: v1.4, v1.5 Jun 22, 2023
@vdancik
Collaborator Author

vdancik commented Dec 7, 2023

Example of a KG with a gene set:

{
    "knowledge_graph": {
        "nodes": {
            "NCBIGene:10000": {
                "categories": [
                    "biolink:Gene"
                ],
                "name": "AKT3"
            },
            "NCBIGene:10097": {
                "categories": [
                    "biolink:Gene"
                ],
                "name": "ACTR2"
            },
            "NCBIGene:10111": {
                "categories": [
                    "biolink:Gene"
                ],
                "name": "RAD50"
            },
            "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f": {
                "categories": [
                    "biolink:Gene"
                ],
                "name": "AKT3,ACTR2,RAD50",
                "is_set": true
            },
            "MSigDB:HALLMARK_GLYCOLYSIS": {
                "categories": [
                    "biolink:Pathway"
                ],
                "name": "HALLMARK_GLYCOLYSIS"
            }
        },
        "edges": {
            "e0-fBufztAzDx": {
                "subject": "NCBIGene:10000",
                "predicate": "biolink:member_of",
                "object": "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f",
                "sources": [
                    {
                        "resource_id": "infores:gelinea",
                        "resource_role": "primary_knowledge_source"
                    },
                    {
                        "resource_id": "infores:molepro",
                        "resource_role": "aggregator_knowledge_source",
                        "upstream_resource_ids": [
                            "infores:gelinea"
                        ]
                    }
                ]
            },
            "e1-fBufztAzDx": {
                "subject": "NCBIGene:10097",
                "predicate": "biolink:member_of",
                "object": "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f",
                "sources": [
                    {
                        "resource_id": "infores:gelinea",
                        "resource_role": "primary_knowledge_source"
                    },
                    {
                        "resource_id": "infores:molepro",
                        "resource_role": "aggregator_knowledge_source",
                        "upstream_resource_ids": [
                            "infores:gelinea"
                        ]
                    }
                ]
            },
            "e2-fBufztAzDx": {
                "subject": "NCBIGene:10111",
                "predicate": "biolink:member_of",
                "object": "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f",
                "sources": [
                    {
                        "resource_id": "infores:gelinea",
                        "resource_role": "primary_knowledge_source"
                    },
                    {
                        "resource_id": "infores:molepro",
                        "resource_role": "aggregator_knowledge_source",
                        "upstream_resource_ids": [
                            "infores:gelinea"
                        ]
                    }
                ]
            },
            "e3-fBufztAzDx": {
                "subject": "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f",
                "predicate": "biolink:enriched_in",
                "object": "MSigDB:HALLMARK_GLYCOLYSIS",
                "sources": [
                    {
                        "resource_id": "infores:gelinea",
                        "resource_role": "primary_knowledge_source"
                    },
                    {
                        "resource_id": "infores:molepro",
                        "resource_role": "aggregator_knowledge_source",
                        "upstream_resource_ids": [
                            "infores:gelinea"
                        ]
                    }
                ]
            }
        }
    }
}
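
For readers who want to see how a client might recover the members of the set node from a KG shaped like the one above, here is a small sketch. It assumes the proposed `is_set` flag on KG nodes and `biolink:member_of` edges pointing from member to set, as in this example; the helper name `set_members` is made up, and it takes the inner `knowledge_graph` object.

```python
def set_members(knowledge_graph):
    """Map each is_set node id to the ids of nodes linked to it by member_of."""
    nodes = knowledge_graph["nodes"]
    # Start with every node flagged as a set, even if it has no members yet.
    members = {nid: [] for nid, node in nodes.items() if node.get("is_set")}
    for edge in knowledge_graph["edges"].values():
        if edge["predicate"] == "biolink:member_of" and edge["object"] in members:
            members[edge["object"]].append(edge["subject"])
    return members
```

Applied to the KG above, this would map the UUID node to the three NCBIGene ids, while the enriched_in edge is left out.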

@edeutsch
Collaborator

Here's my graphical representation of what I think the proposal is. Is this right @vdancik ?

image

@vdancik
Collaborator Author

vdancik commented Jan 3, 2024

Example query with an is_set flag:

{
    "message": {
        "query_graph": {
            "nodes": {
                "pathway": {
                    "categories": [
                        "biolink:Pathway"
                    ]
                },
                "gene": {
                    "ids": [
                        "NCBIGene:10000",
                        "NCBIGene:10097",
                        "NCBIGene:10111"
                    ],
                    "is_set": true
                }
            },
            "edges": {
                "t_edge": {
                    "object": "gene",
                    "subject": "pathway",
                    "predicates": [
                        "biolink:related_to"
                    ],
                    "knowledge_type":"inferred"
                }
            }
        }
    }
}

would produce the following result

{
    "results": [
        {
            "analyses": [
                {
                    "edge_bindings": {
                        "t_edge": [
                            {
                                "id": "e3-fBufztAzDx"
                            }
                        ]
                    },
                    "resource_id": "infores:gelinea",
                    "support_graphs": [
                        "gene_set_aux_graph"
                    ]
                }
            ],
            "node_bindings": {
                "gene": [
                    {
                        "id": "UUID:c5d67629-ce16-41e9-8b35-e4acee04ed1f",
                        "is_set": true
                    },
                    {
                        "id": "NCBIGene:10000",
                        "query_id": "NCBIGene:10000"
                    },
                    {
                        "id": "NCBIGene:10097",
                        "query_id": "NCBIGene:10097"
                    },
                    {
                        "id": "NCBIGene:10111",
                        "query_id": "NCBIGene:10111"
                    }
                ],
                "pathway": [
                    {
                        "id": "MSigDB:HALLMARK_GLYCOLYSIS"
                    }
                ]
            }
        }
    ]
}

where the auxiliary graph is

{
    "auxiliary_graphs": {
        "gene_set_aux_graph": {
            "edges": [
                "e0-fBufztAzDx",
                "e1-fBufztAzDx",
                "e2-fBufztAzDx"
            ]
        }
    }
}

and the KG is in my previous comment

@edeutsch
Collaborator

edeutsch commented Jan 4, 2024

So here is a slight update to the picture based on today's discussion. The query predicate is updated. And I depicted Result #1 as one that contains all 5 input genes, but Result #2 is the next best match where 3 of the 5 match.

image

There was some discussion of whether this means AND, OR, or a "soft AND", i.e. "as many as possible". I am thinking that the is_set=true construction is interpreted to mean "as many of the set as possible". More members would mean a higher rank, and sets that don't contain all members are not automatically discarded. But maybe this is not the desired outcome.

Additional note: In this scenario, the Query must have knowledge_type: inferred (i.e. "creative mode")

How is this different from the sort of thing that COHD already does?

@edeutsch
Collaborator

edeutsch commented Jan 18, 2024

We should probably document why this isn't good enough:

image

Can we capture all the enrichment statistical metrics in each Result.Analysis.attributes[]?

The query predicate "related_to" is tripping us up here. Better to consider a query predicate like "enriched_in" (*does not actually exist yet). Or "participates_in"?
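
On the attributes question, one possible shape is sketched below: compute a hypergeometric tail p-value for the overlap and emit it as attribute objects with `attribute_type_id` and `value`. The `example:` type ids are placeholders rather than real Biolink terms, the function names are made up, and no service in this thread is known to do exactly this.

```python
from math import comb


def hypergeom_pvalue(overlap, set_size, pathway_size, universe_size):
    """P(X >= overlap) when drawing set_size genes from universe_size genes,
    of which pathway_size belong to the pathway."""
    upper = min(set_size, pathway_size)
    total = comb(universe_size, set_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, set_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def enrichment_attributes(overlap, set_size, pathway_size, universe_size):
    """Build attribute dicts for one Analysis (type ids are placeholders)."""
    p_value = hypergeom_pvalue(overlap, set_size, pathway_size, universe_size)
    return [
        {"attribute_type_id": "example:overlap_count", "value": overlap},
        {"attribute_type_id": "example:query_set_size", "value": set_size},
        {"attribute_type_id": "example:p_value", "value": p_value},
    ]
```

Whether the universe size and other inputs to the statistic also need to be reported is a separate question this sketch does not settle.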

@TereseCamp
Collaborator

TereseCamp commented Jan 18, 2024 via email


5 participants