Where are you Sulfobacillus? #255
It's very strange. You should see a "Selecting two genomes to merge" message fairly soon after this. Are you running it on a cluster server or a local machine? Thank you.
Hi, thanks for the reply. I'm running it on a local machine (WSL: 14 threads, 24 GB RAM). I also tried running the non-compressed version, which failed with a memory limitation error. Perhaps I need to try limiting the memory usage?
I think 24 GB is not enough for index building. It is quite a memory-intensive task, so we usually create the index on a computing server with a large amount of memory. Would the pre-built index work?
Ah, I wondered about that. I ran it again and got the following error:
real 780m2.875s
How old is the pre-built index? For the full background: I used the WIMP classifier in EPI2ME to analyse some MinION data. I have a very simple community in one of the samples, comprising Leptospirillum ferriphilum, Acidithiobacillus caldus and Sulfobacillus thermosulfidooxidans. When I did de novo genome assembly on the data, I got all three organisms. WIMP just picks up the first two and doesn't hit Sulfobacillus at all. My only thought is that the WIMP Centrifuge index is out of date?
Sorry for the delayed reply. Somehow Sulfobacillus is not in the database, even though NCBI RefSeq says it has a complete genome. Maybe it is an issue on my system, and I'll check that later. There is a more recent database at https://zenodo.org/record/3732127 ; would you please give it a try? It probably takes about 21 GB of memory, so your local machine should handle it.
No worries, and thanks again for the replies. I am re-basecalling the fast5 files, and will then test with your database and let you know. Yes, weird about Sulfobacillus... it's a really important genus for us as well! :)
Hmmm....
Reading ftab (1048577): 09:17:34
Maybe I can't give it enough memory? Maybe it's not possible to run on a laptop?
Ah - I hadn't allocated enough RAM to WSL. :) |
Right, so I re-ran the analysis using the database you suggested, and still no Sulfobacillus... |
PS - is it normal that the calculated abundance in the tsv file is always 0.0??
I just checked the assembly_summary file downloaded from NCBI RefSeq, and it seems there is no complete genome for Sulfobacillus. They are all at Scaffold level or Contig level, so they are ignored in the default index building. For example:

GCF_001280565.1 PRJNA224116 SAMN03894429 LGRO00000000.1 na 28034 28034 Sulfobacillus thermosulfidooxidans strain=CBAR-13 latest Scaffold Major Full 2015/09/09 ASS.CBAR13.1 Fundacion Ciencia & Vida GCA_001280565.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/280/565/GCF_001280565.1_ASS.CBAR13.1 na

I think one way is to build an index for only the three species, which your local machine should be able to handle.
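To illustrate why the Scaffold-level Sulfobacillus assemblies get skipped, here is a minimal sketch of the filtering the default index build effectively does. The accessions and three-column layout below are made up for the demo; the real assembly_summary.txt is tab-separated with 20+ columns and the assembly level in column 12.

```shell
# Toy stand-in for NCBI's assembly_summary.txt (hypothetical accessions;
# real files carry the assembly level in column 12, not column 3).
printf 'GCF_000000001.1\tSulfobacillus thermosulfidooxidans\tScaffold\n' >  toy_summary.tsv
printf 'GCF_000000002.1\tAcidithiobacillus caldus\tComplete Genome\n'    >> toy_summary.tsv
printf 'GCF_000000003.1\tLeptospirillum ferriphilum\tContig\n'           >> toy_summary.tsv

# Default index building keeps only "Complete Genome" assemblies,
# so the Scaffold- and Contig-level rows are dropped.
awk -F'\t' '$3 == "Complete Genome"' toy_summary.tsv
```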
Ah-ha. Thanks so much for all this help. I can't build a database for just those species, as others are important as well, but I should almost always detect Sulfobacillus spp. in my samples. Is it possible to build a database of just 16S genes? (i.e. maybe use the NCBI curated 16S db; I can't remember the name just now.) Should I forget local classification? Thanks again for all your time.
Yes, I built a database for SILVA several years ago at ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data . It may have Sulfobacillus. The SILVA database is not very large, so you may be able to build it locally, though it is not directly supported by the "make" database script in Centrifuge. Once you use centrifuge-download to get the taxonomy tree and taxonomy names, you can create your own sequence id to taxonomy id conversion table, then run centrifuge-build to create the index.
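The steps above can be sketched as follows. The sequence ids and the second taxid are made up for illustration (28034 is the Sulfobacillus thermosulfidooxidans taxid from the assembly_summary row quoted earlier); the conversion table is just two tab-separated columns mapping sequence id to taxonomy id.

```shell
# Fetch taxonomy tree and names (needs Centrifuge installed; not run here):
# centrifuge-download -o taxonomy taxonomy

# Hand-made sequence id -> taxonomy id conversion table
# (hypothetical sequence ids; 12345 is a placeholder taxid).
printf 'seq1\t28034\n' >  seqid2taxid.map
printf 'seq2\t12345\n' >> seqid2taxid.map

# Then build the index (not run here; needs Centrifuge and a real FASTA file):
# centrifuge-build -p 8 --conversion-table seqid2taxid.map \
#   --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
#   sequences.fa my_index
```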
Thanks again. I downloaded your SILVA db and re-ran the sample... To be fair, it has Sulfobacillus, which is great, but the overall results of the run don't make much sense; it should be the three species (Acidithiobacillus caldus, Leptospirillum ferriphilum and Sulfobacillus thermosulfidooxidans) alone (or at least hugely dominant), but ranking by read count gives the following top ten:

name taxID taxRank genomeSize numReads numUniqueReads abundance

I'm wondering if there is actually a solution for our needs... :-/
You can inspect the classification result file to check the assignment score for each read. I think you can use a threshold of 128 or 256 to filter out many false positives, then run "centrifuge-kreport" on the filtered classification file to obtain a summary of abundances and see whether the composition becomes cleaner.
Thanks, I gave this a try (for either 128 or 256):

awk 'NR==1 || $4 >= 128' sil_biomx_c_result.out > filter_result_128.out

But the results are still crazy. It at least picks up my species, but loads of others too, including things like salmon, Chinese mustard and Ethiopian banana as the dominant organisms (!) and Pseudomonas geniculata as the dominant bacterium. (I just used your SILVA db.) As a temporary work-around, is it possible to append the Sulfobacillus genomes to the index, even if they're not "complete"? (i.e. add the scaffold-level assemblies?)
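For reference, here is a self-contained sketch of that score filter on a toy classification file. Centrifuge's per-read output is tab-separated with columns readID, seqID, taxID, score, 2ndBestScore, hitLength, queryLength, numMatches, so the score is column 4; the two reads below are made up.

```shell
# Toy Centrifuge classification output (hypothetical reads and scores).
printf 'readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches\n' > result.out
printf 'read1\tseqA\t28034\t900\t0\t120\t150\t1\n' >> result.out
printf 'read2\tseqB\t12345\t64\t0\t30\t150\t4\n'   >> result.out

# Keep the header line plus any assignment scoring at least 128.
awk -F'\t' 'NR==1 || $4 >= 128' result.out > filtered_128.out

# Summarize the filtered file (not run here; needs Centrifuge installed):
# centrifuge-kreport -x <index prefix> filtered_128.out > filtered_128.kreport
```

The low-scoring read2 is dropped; only read1 survives the filter.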
Centrifuge does not support adding sequences to the index on the fly :( . I think another approach is to run the classification against the original full index first, and then run the SILVA db on the unclassified reads.
So, I tried a different approach, with some trial and error! I pulled the taxids for the organisms I'm likely to see (bacteria and archaea) and used these to limit centrifuge-download, which I asked to pull Complete Genomes and Scaffolds. I think this was around 550 sequences, which my laptop managed to process... Using this index gave me far more reasonable results: it allowed me to detect my Sulfobacillus species... The report.tsv still has just zeros for the abundance; is this normal? Thanks again for all the help!
Glad you figured out a way to resolve this! For the abundance, do all the species in the report have 0 abundance, or are there still some taxonomy ids with an abundance value?
They are all zero. I don't think it's linked to my custom index, as it was the same when I tried your pre-compiled database at the start.
That is a known issue in Centrifuge, but somehow I could not reproduce the error. The only time I got close was when there was a very short genome in the index, and the abundance estimation algorithm gave it a high value because of the length normalization factor. I think you can run "centrifuge-kreport" or other summaries that use the unique assignments to approximate the abundance.
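One way to approximate abundance from unique assignments, sketched on a toy report.tsv with the column layout quoted earlier in the thread (name, taxID, taxRank, genomeSize, numReads, numUniqueReads, abundance). All the numbers and the second taxid are made up; the idea is simply each species' share of the unique reads.

```shell
# Toy centrifuge report.tsv with zeroed abundances (hypothetical counts).
printf 'name\ttaxID\ttaxRank\tgenomeSize\tnumReads\tnumUniqueReads\tabundance\n' > report.tsv
printf 'Sulfobacillus thermosulfidooxidans\t28034\tspecies\t3900000\t800\t600\t0.0\n' >> report.tsv
printf 'Acidithiobacillus caldus\t12345\tspecies\t2900000\t400\t300\t0.0\n' >> report.tsv

# Relative abundance approximated as numUniqueReads (column 6)
# divided by the total unique reads across all rows.
awk -F'\t' 'NR>1 {u[$1]=$6; tot+=$6}
            END  {for (n in u) printf "%s\t%.3f\n", n, u[n]/tot}' report.tsv
```

With 600 and 300 unique reads, this prints shares of 0.667 and 0.333.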
Ah, thank you. If you like, I can share my data and index with you? A stupid question: when using Pavian to view the kreport, are the read counts shown on the Sankey diagram for each OTU unique reads? i.e. an indicator of relative abundance?
Yes, it would be very helpful to have your data! I'm not sure which column Pavian loads for the visualization. Maybe you can ask its author, Florian?
Sure, no problem – where should I send them?
I've left a note on the Pavian GitHub ☺
My email is [email protected] ; a Google Drive link, a Dropbox link, or any other way you prefer works. Thank you!
Just shared a Google Docs folder with you; it contains my sample data, as well as my custom db.
Thank you! I will check it. |
Hello @mourisl @DntBScrdDv !
Command:
I am running on an AWS c6a.2xlarge instance (8 vCPU, 16 GB memory). It's been 2 days since the command started running, and now I think it is stuck (screenshot attached). Can anyone guide me on what I should do? Stop this and rerun on a larger instance, or just wait? Thank you!
Index building is very memory-consuming, and you may need around 400 GB of memory to build the index for a recent prokaryotic database. Furthermore, the compression method might be too slow for a recent database as well; I think you may want to just use "make p".
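To make the mismatch concrete: against the maintainer's ~400 GB estimate, a 16 GB instance falls far short, which is why the build appears stuck. A trivial sanity check, with the two figures taken from this thread:

```shell
# Compare available RAM (16 GB on c6a.2xlarge) against the ~400 GB
# the maintainer estimates for a full recent prokaryotic index.
awk 'BEGIN { need = 400; have = 16
             print ((have >= need) ? "enough RAM" : "not enough RAM for a full index") }'

# The suggested alternative: skip the compression step entirely
# (run in the Centrifuge indices directory; not run here):
# make p
```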
@mourisl Thank you ! |
I think one reason is that there might be some super small genomes in the database, such as vectors. Since the abundance estimation takes the genome length into account, the small genomes will get a lot of abundance in the end. In your output, do you see all the abundances as 0, or is there one with >99 abundance?
@mourisl All 0.0. Here I am attaching the output from Centrifuge; let me know if you need more info from my end. Thank you so much for your response!
Hi,
I'm running make p_compressed, but I think it may be stuck? I modified the Makefile to use 12 threads instead of one (which was taking forever), but it's been stuck here for nearly two hours... is this normal, or should I kill it and try again?
SYSTEM CALL: jellyfish count --if tmp_tmp_0_562.testingKmer -o tmp_tmp_0_562.jf -m 53 -s 5000000 -C -t 8 tmp_0_562_2460.fa
finished
SYSTEM CALL: jellyfish dump tmp_tmp_0_562.jf > tmp_tmp_0_562.jf_dump
finished
Thanks!