GTDB for large number of genomes (1.3M) #611
Comments
Hi Intikhab,
Cheers, |
Hi Both, Perhaps try this? https://github.com/jean-pierreBoth/gsearch, for any number of database genomes and query genomes. |
Hi Jianshu,
gsearch looks very powerful, but it also only provides ANI and the closest reference genomes. Finding the closest lineage still requires something like an accession-to-taxon-id mapping followed by a taxon-id-to-lineage mapping, using a cutoff of, say, 95% ANI.
I have now dereplicated 1.3 million MAGs using skder/skani, leaving 786,807 non-redundant MAGs.
Now, using ani_rep from gtdb-tk, there is some progress, as below:
[2024-11-05 16:46:40] INFO: Creating Mash sketch file: ...
[2024-11-05 18:11:44] INFO: Completed 786,807 genomes in 85.06 minutes (9,249.88 genomes/minute).
[2024-11-05 18:11:44] INFO: Loading data from existing Mash sketch file: ../gtdb220_mashdb.msh
[2024-11-05 18:11:50] INFO: Calculating Mash distances.
Once the ANI and the closest genome accession are available, is there a straightforward way to obtain the taxonomic lineage, using a cutoff of, say, >=95% ANI?
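A minimal sketch of the lookup asked about here, assuming a two-column, tab-separated taxonomy table (reference accession, then a semicolon-separated lineage). The accession `REF_0001`, the function names, and the 95% default are illustrative assumptions, not part of gtdb-tk, skani, or gsearch:

```python
import csv
import io

def load_taxonomy(handle):
    """Read a two-column TSV (accession, lineage) into a dict."""
    reader = csv.reader(handle, delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}

def lineage_for_hit(ref_accession, ani, taxonomy, species_ani=95.0):
    """Return (lineage, species_level) for the closest reference.

    species_level is True when ANI meets the assumed cutoff.
    Returns None when the accession is not in the table.
    """
    lineage = taxonomy.get(ref_accession)
    if lineage is None:
        return None
    return lineage, ani >= species_ani

# Made-up example row; a real table would be read from a file instead.
demo_tsv = io.StringIO(
    "REF_0001\td__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;"
    "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli\n"
)
taxonomy = load_taxonomy(demo_tsv)
print(lineage_for_hit("REF_0001", 97.2, taxonomy))
```

The dictionary lookup is constant time per query, so the table only needs to be loaded once even for hundreds of thousands of query genomes.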
I am trying another round of skder/skani dereplication at 95% ANI so that I can obtain monophyletic representative genomes, which I may be able to pass through gtdb-tk comparatively fast.
Best Wishes,
Intikhab
|
Each genome is attached to a taxonomy; it is simply there for any database that was built with taxonomy. In any case, the taxa were named based on ANI and AAI. So what's the real problem then? Jianshu |
Dear Jianshu,
Thanks for your point. Yes, I agree that each NCBI genome is attached to a taxonomy.
However, for each query genome you need to decide which taxonomic level is an appropriate assignment. For example, we might assign the strain level if ANI is 100% and the alignment fraction is also 100%. For the species level, >=95% identity is recommended. If the top neighbour reference genome shows <95% ANI, maybe the genus level can be assigned.
gsearch and skANI provide the closest neighbour reference genomes, but the next step, assignment to a taxonomic level, is not available. This step helps us to evaluate novel vs known taxa: if we can only assign genus-level taxonomy to a query genome, this perhaps shows that you have found a novel species.
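The rule of thumb described above can be sketched as follows. The thresholds are the ones stated in this comment (100% ANI and 100% alignment fraction for strain, >=95% ANI for species, genus otherwise); they are a heuristic from this thread, not GTDB-Tk's actual placement logic, and the function names and the `d__` to `s__` rank prefixes in the lineage string are assumptions for illustration:

```python
# Rank prefixes in a semicolon-separated lineage string (assumed format).
RANK_PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]

def suggest_rank(ani, af):
    """Suggest the deepest rank supported by the top hit (heuristic only)."""
    if ani >= 100.0 and af >= 100.0:
        return "strain"
    if ani >= 95.0:
        return "species"
    # Below the species cutoff: fall back to genus, i.e. a candidate novel species.
    return "genus"

def trim_lineage(lineage, rank):
    """Cut the reference lineage at the suggested rank."""
    deepest = "s__" if rank in ("strain", "species") else "g__"
    keep = RANK_PREFIXES[: RANK_PREFIXES.index(deepest) + 1]
    return ";".join(t for t in lineage.split(";") if t[:3] in keep)

# Made-up lineage and values for illustration.
ref_lineage = "d__A;p__B;c__C;o__D;f__E;g__F;s__F example"
rank = suggest_rank(ani=91.3, af=62.0)
print(rank, trim_lineage(ref_lineage, rank))
```

A query that falls back to genus here is only a candidate novel species; a low-ANI top hit does not by itself guarantee that the query belongs to the same genus as the reference.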
Are you able to add a step in gsearch for taxonomic lineage assignment to identify known vs novel species?
It would be a good addition.
Best Wishes,
Intikhab
|
Hi Donovan, Regarding GTDB taxonomic lineage assignment at the genus, family, order, class, phylum or kingdom level, do you use ANI values below 85%? E.g. if the closest genome has AF >=50% and ANI below 90%, does GTDB assign family-level taxonomy? A fast approach for a large number of genomes could be to take skANI/gsearch or gtdb-tk ani_rep results with ANI and AF and process them further to assign taxonomic lineages to the query genomes. Do we have such a feature internally in gtdb-tk that could be provided as an option? Intikhab |
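As a rough sketch of the post-processing step proposed in this comment, the snippet below keeps the highest-ANI reference per query after discarding hits below an alignment-fraction cutoff. The row layout (query, reference, ANI, AF), the function name, and the 50% default are assumptions taken from the wording above, not the output format of skani, gsearch, or gtdb-tk ani_rep:

```python
def best_hits(rows, min_af=50.0):
    """Return {query: (reference, ani, af)} for the top-ANI hit per query.

    rows: iterable of (query, reference, ani, af) tuples, percentages as floats.
    Hits with af below min_af are ignored.
    """
    best = {}
    for query, reference, ani, af in rows:
        if af < min_af:
            continue
        if query not in best or ani > best[query][1]:
            best[query] = (reference, ani, af)
    return best

# Made-up rows for illustration.
rows = [
    ("mag_1", "REF_0001", 96.8, 71.0),
    ("mag_1", "REF_0002", 98.1, 32.0),  # higher ANI but AF below the cutoff
    ("mag_2", "REF_0003", 88.4, 55.0),
]
for query, hit in best_hits(rows).items():
    print(query, hit)
```

The resulting top hit per query could then be passed to whatever ANI-based rank rule is chosen; queries with no hit above the AF cutoff are simply absent from the result and would need a different placement method.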
Dear GTDB team,
I have a few questions while I process a large number of MAGs using GTDB-Tk version 2.4.
Any suggestions on the above would be great to move forward.
Many Thanks,
Intikhab