
GTDB for large number of genomes (1.3M) #611

Open
intikhab opened this issue Nov 5, 2024 · 6 comments


@intikhab

intikhab commented Nov 5, 2024

Dear GTDB team,

I have a few questions while processing a large number of MAGs with GTDB-Tk version 2.4.

  1. If I already have GTDB r220-based results from skANI, is there a way to establish the associated taxonomic lineage, e.g., to map accessions to lineages?
  2. A full run of GTDB-Tk on 1.3 million MAGs gets stuck even though I use 3 TB of RAM and 40 CPUs. If I calculate the Mash sketches for the query genomes separately, is there a way to process this data faster?
  3. I am also running ani_rep for these MAGs, which is likewise slow at the start. If I sketch the query genomes with Mash and calculate the Mash distances myself, does the GTDB-Tk repository provide scripts to run each of these steps separately and to resume the workflow once the Mash distances are complete?
  4. I used the -f (full tree) option in one of the runs, and it has been very slow so far.

Any suggestions on the above would be greatly appreciated.

Many Thanks,
Intikhab

@donovan-h-parks
Collaborator

Hi Intikhab,

  1. Yes, but not using GTDB-Tk. You'd need to write your own script to determine which of your MAGs are similar enough to existing GTDB species clusters to be assigned to those species. GTDB species clusters are generally defined at >=95% ANI, but the threshold can be as high as 97% ANI for some clusters. The following file indicates the ANI threshold for each GTDB species cluster: https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/sp_clusters_r220.tsv

  2. I don't see an easy way to improve speed. I would recommend that you run your MAGs in batches of 5,000 or 10,000. This is how we typically run large numbers of MAGs. This lets you better monitor progress and ensures that one "bad" genome that might crash GTDB-Tk doesn't ruin the entire run.

  3. The GTDB-Tk workflow is divided into individual steps, but it isn't designed to accept user-provided Mash calculations as input. This is probably possible, but you'd need to modify the GTDB-Tk code.

  4. The -f flag is not recommended if you have a large number of MAGs.
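
The lookup described in point 1 can be sketched roughly as below. The column names (`Representative genome`, `GTDB species`, `ANI circumscription radius`) are assumed from the sp_clusters_r220.tsv header and should be verified against the actual file; note that the aligned fraction (AF, typically >=50%) should be checked alongside ANI, which this sketch omits.

```python
import csv

def load_cluster_radii(sp_clusters_tsv):
    """Map each GTDB representative accession to its (species, ANI radius).

    Column names are assumed from sp_clusters_r220.tsv and may need
    adjusting for other releases.
    """
    radii = {}
    with open(sp_clusters_tsv, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            radii[row["Representative genome"]] = (
                row["GTDB species"],
                float(row["ANI circumscription radius"]),
            )
    return radii

def assign_species(rep, ani, radii):
    """Return the GTDB species if a skANI hit against representative
    `rep` clears that cluster's ANI radius (>=95%, up to 97% for some
    clusters), else None. AF should also be checked in practice."""
    if rep not in radii:
        return None
    species, radius = radii[rep]
    return species if ani >= radius else None
```

Queries whose best hit returns None would still need full GTDB-Tk classification (tree placement) rather than a species-level assignment.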
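
The batching suggested in point 2 can be scripted by splitting the genome set into GTDB-Tk batchfiles (one `path<TAB>genome_id` line per genome); each file can then be passed to `gtdbtk classify_wf --batchfile ...`. A minimal sketch (file naming is illustrative):

```python
from pathlib import Path

def write_batchfiles(genome_paths, out_dir, chunk_size=5000):
    """Write GTDB-Tk batchfiles ("<path>\t<genome_id>" per line),
    one file per chunk_size genomes, and return their paths.
    Genome IDs are derived from the file stem, which assumes
    unique filenames across the input set."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    batchfiles = []
    for i in range(0, len(genome_paths), chunk_size):
        chunk = genome_paths[i:i + chunk_size]
        bf = out_dir / f"batch_{i // chunk_size:04d}.tsv"
        bf.write_text("".join(f"{p}\t{Path(p).stem}\n" for p in chunk))
        batchfiles.append(bf)
    return batchfiles
```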

Cheers,
Donovan

@jianshu93

Hi both, perhaps try this: https://github.com/jean-pierreBoth/gsearch. It works for any number of database genomes and query genomes.

@intikhab
Author

intikhab commented Nov 5, 2024 via email

@jianshu93

Each genome is attached to a taxonomy; it is included in any database that was built with taxonomy. In any case, taxa are named based on ANI and AAI. So what is the real problem then?

Jianshu

@intikhab
Author

intikhab commented Nov 5, 2024 via email

@intikhab
Author

intikhab commented Nov 6, 2024

Hi Donovan,

Regarding GTDB taxonomic lineage assignment at the genus, family, order, class, phylum, or kingdom level, do you use ANI values below 85%? For example, if the closest genome has AF >= 50% but ANI < 90%, does GTDB assign taxonomy at the family level?

A fast approach for a large number of genomes could be to take skANI, GSearch, or GTDB-Tk ani_rep results (ANI and AF) and process them further to assign taxonomic lineages to the query genomes.

Is there such a feature internally in GTDB-Tk that could be exposed as an option?
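
As a rough sketch of such a pre-screen, the ANI/AF results could be filtered against the species-level thresholds Donovan mentioned earlier (>=95% ANI; AF >= 50% is an assumed cutoff here, and hit tuples are illustrative). Genomes failing the filter would still need tree placement for higher-rank assignment:

```python
def prescreen(hits, ani_min=95.0, af_min=0.5):
    """Keep only (query, reference, ani, af) hits that plausibly fall
    within a GTDB species cluster; everything else still requires
    placement in the reference tree for classification."""
    return [h for h in hits if h[2] >= ani_min and h[3] >= af_min]
```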

Intikhab
