Improve synonymization of Technetium Tc-99m Albumin Aggregated #1046

Open
amykglen opened this issue Aug 31, 2020 · 5 comments

Comments

@amykglen
Member

amykglen commented Aug 31, 2020

as @edeutsch requested, I traced a couple instances where the local fastNGD database (#729) 'misses' a concept but eUtils doesn't - this is the write-up for one example: NCIT:C87398 (Technetium Tc-99m Albumin Aggregated).

  - 2020-08-25 15:36:43.838157 DEBUG: Had to use eUtils to compute NGD between renal cell carcinoma (MONDO:0005086) and Technetium Tc-99m Albumin Aggregated (NCIT:C87398)(value is: 0.7492777150971099)

from kg2canonicalized:

n.id n.name n.equivalent_curies
"NCIT:C87398" "Technetium Tc-99m Albumin Aggregated" ["NCIT:C87398", "CHEMBL.COMPOUND:CHEMBL1201522"]

so in this case, we can see there's no MESH curie in the equivalent_curies, and I confirmed that neither of the equivalent nodes nor their attached edges have publications listed in KG2, so it's not surprising fastNGD isn't aware of any PMIDs for this node.
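
For readers less familiar with why missing PMIDs cause a 'miss': below is a minimal sketch of the standard Normalized Google Distance formula computed from two PMID sets. It is not the fastNGD implementation; the function name `ngd`, the `ASSUMED_TOTAL_PMIDS` constant, and the choice to return `None` on a miss are all assumptions made for illustration.

```python
import math
from typing import Optional, Set

# Assumption: placeholder corpus size; the value fastNGD actually uses is not stated here.
ASSUMED_TOTAL_PMIDS = 30_000_000


def ngd(pmids_a: Set[int], pmids_b: Set[int],
        total: int = ASSUMED_TOTAL_PMIDS) -> Optional[float]:
    """Normalized Google Distance from two PMID sets.

    Returns None when either concept has no PMIDs (a local 'miss'),
    and math.inf when the two concepts never co-occur.
    """
    if not pmids_a or not pmids_b:
        return None  # nothing known locally for at least one concept
    both = len(pmids_a & pmids_b)
    if both == 0:
        return math.inf
    log_a = math.log(len(pmids_a))
    log_b = math.log(len(pmids_b))
    return (max(log_a, log_b) - math.log(both)) / (math.log(total) - min(log_a, log_b))
```

With an empty PMID set for NCIT:C87398, a function like this has nothing to compute, which is the situation where a fallback to eUtils would be needed.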

but what is interesting is that there is a MESH node in KG2 named "Technetium Tc 99m Aggregated Albumin" (word order is slightly different):

n.id n.name n.equivalent_curies
"MESH:D013668" "Technetium Tc 99m Aggregated Albumin" ["UMLS:C0740185", "UMLS:C0087067", "UMLS:C0087068", "MESH:D013668"]

and there are definitely PubMed articles associated with MESH term D013668, so if NodeSynonymizer synonymized these two concepts, then the fastNGD system would no longer 'miss' NCIT:C87398/CHEMBL.COMPOUND:CHEMBL1201522.
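
to illustrate how small the naming difference is, here is a rough sketch of an order-insensitive name key under which the NCIT and MESH names collide. This is not how NodeSynonymizer works; `name_key` is a hypothetical helper, and matching on names alone can wrongly merge distinct concepts, so it would only be a candidate signal.

```python
import re
from typing import Tuple


def name_key(name: str) -> Tuple[str, ...]:
    """Hypothetical key that ignores case, punctuation, and word order."""
    # Drop parenthetical qualifiers (an assumption; this may be too aggressive in general).
    name = re.sub(r"\([^)]*\)", " ", name)
    # Keep alphanumeric tokens only, so "Tc-99m" and "Tc 99m" tokenize the same way.
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return tuple(sorted(tokens))


ncit_name = "Technetium Tc-99m Albumin Aggregated"
mesh_name = "Technetium Tc 99m Aggregated Albumin"
print(name_key(ncit_name) == name_key(mesh_name))
```

both names reduce to the same five tokens, so the comparison prints `True`.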

@amykglen
Member Author

amykglen commented May 3, 2023

well, this is merged slightly better in the new synonymizer (#2003) - a CHEMBL node has now been added, though the MESH term is still on its own:

Cluster for NCIT:C87398 (CHEMBL.COMPOUND:CHEMBL1201522) has 2 nodes:

id category name in_SRI in_KG2pre is_cluster_rep
CHEMBL.COMPOUND:CHEMBL1201522 ChemicalEntity TECHNETIUM TC 99M ALBUMIN AGGREGATED X X X
NCIT:C87398 Drug Technetium Tc-99m Albumin Aggregated X

Cluster for MESH:D013668 has 1 nodes:

id category name in_SRI in_KG2pre is_cluster_rep
MESH:D013668 NamedThing Technetium Tc 99m Aggregated Albumin X X

kinda funny the SRI doesn't recognize this MESH term, since they do recognize most MESH terms

@edeutsch
Collaborator

@amykglen can this be closed?

edeutsch assigned amykglen and unassigned edeutsch Jun 26, 2024

@dkoslicki
Member

And @amykglen , it looks like the NGD API endpoint handles this: https://arax.ci.transltr.io/api/arax/v1.4/ui/#/PubmedMeshNgd/pubmed_mesh_ngd
Though I can't get this to work with DSL or a TRAPI query, since NGD-expansion isn't a "thing".

@amykglen
Member Author

so I'm seeing three clusters in our latest synonymizer (KG2.9.2c) for Technetium Tc 99m albumin aggregated:
https://arax.ncats.io/beta/?term=MESH:D013668 (Technetium Tc 99m Aggregated Albumin)
https://arax.ncats.io/beta/?term=CHEMBL.COMPOUND:CHEMBL1201522 (TECHNETIUM TC 99M ALBUMIN AGGREGATED)
https://arax.ncats.io/beta/?term=KEGG.DRUG:D06023 (Technetium Tc 99m albumin aggregated (USP))

the first two appear to arise from the fact that the SRI node normalizer assigns those identifiers to two separate clusters:
https://nodenormalization-sri.renci.org/get_normalized_nodes?curie=MESH:D013668
https://nodenormalization-sri.renci.org/get_normalized_nodes?curie=CHEMBL.COMPOUND:CHEMBL1201522

but the third is for a KEGG node, which the SRI node normalizer doesn't currently support - so with that one, I think we could do better in our synonymization and assign it to one of the first two clusters.
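
for anyone who wants to repeat the check above programmatically, here's a rough sketch that queries the same `get_normalized_nodes` endpoint and compares the preferred identifier returned for each curie. The helper names (`fetch_normalized`, `preferred_ids`, `same_cluster`) are hypothetical, and the response layout (each curie mapping to an object with an `id.identifier` field, or to null when unrecognized) is an assumption that should be verified against a live response.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List, Optional

SRI_NN_URL = "https://nodenormalization-sri.renci.org/get_normalized_nodes"


def fetch_normalized(curies: List[str]) -> dict:
    """Query the SRI node normalizer (assumes repeated `curie` parameters are accepted)."""
    query = urllib.parse.urlencode([("curie", c) for c in curies])
    with urllib.request.urlopen(f"{SRI_NN_URL}?{query}", timeout=30) as resp:
        return json.load(resp)


def preferred_ids(response: dict) -> Dict[str, Optional[str]]:
    """Map each queried curie to its preferred identifier, or None if unrecognized."""
    return {
        curie: (entry["id"]["identifier"] if entry else None)
        for curie, entry in response.items()
    }


def same_cluster(response: dict, a: str, b: str) -> bool:
    """True only if both curies are recognized and share a preferred identifier."""
    ids = preferred_ids(response)
    return ids.get(a) is not None and ids.get(a) == ids.get(b)


if __name__ == "__main__":
    curies = ["MESH:D013668", "CHEMBL.COMPOUND:CHEMBL1201522", "KEGG.DRUG:D06023"]
    response = fetch_normalized(curies)
    print(preferred_ids(response))
    print(same_cluster(response, curies[0], curies[1]))
```

an unsupported identifier such as the KEGG one would be expected to come back as unrecognized, which `same_cluster` treats as "not in the same cluster".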

@dkoslicki - that's really interesting that the NGD endpoint does appear to map the name for the CHEMBL node to the main cluster for Technetium Tc-99m Albumin Aggregated, even though that CHEMBL node is still in a separate cluster in the synonymizer. I'm a bit perplexed as to how that would be happening. I'm not familiar with that endpoint, but I thought our NGD only uses the synonymizer to map concepts? but I suppose I'll take it!

so in summary - I wrote up an issue in the SRI NN repo about there being two clusters for Technetium Tc 99m albumin aggregated (TranslatorSRI/NodeNormalization#280), but I think it's worth keeping this issue open for now due to the poor synonymization of the KEGG node (which is in our hands, since the SRI NN doesn't appear to support KEGG identifiers currently)

@dkoslicki
Member

Thanks for the sleuthing @amykglen ! I'll leave this open and tag as technical debt
