Discussion: How to assemble complicated metagenome e.g. soil #418

SilasK (Maintainer) · 2021-03-29T08:27:34Z

I'm collecting here the discussion on how to assemble complex metagenomes, e.g. soil.

Spades is probably still the best assembler, even for complex metagenomes (https://twitter.com/ryneches/status/1352732023262089216).

If the assembly really doesn't give good results, running bbnorm on the QC reads may be a solution.

@botellaflotante @slambrechts did you find a better way to assemble complex metagenomes?

ChristianFurbo · 2021-03-29T09:20:45Z

Hi SilasK

Here are some observations I made.

I have 16 metagenomic samples from the water column. Overall, these had a "percent assembled reads" between 26% and 42%.

I did a preliminary test run on one of the samples, assembling it with either Megahit or MetaSpades. Here, Megahit had a "percent of assembled reads" that was 4% higher than with MetaSpades.

I tried splitting my samples into three coverage bins using bbnorm on the QC_error_corrected reads. Overall, it did not improve the assembly. I did get a "percent of assembled reads" of up to 60% in my second bin, which held the reads with coverage between 10x and 80x. However, I only got 2% assembled reads in my third bin, which held the reads with coverage above 80x.

So, I thought that the low percent assembled could be due to overly strict pre-filtering? With that said, I did try to decrease "minimum_percent_covered_bases" to 5. It did not change anything.
But I can say that, based on the file "filter_by_coverage.log", I had ~904000 reads in and ~530000 reads out.

At this point, I am stuck :)

Cheers

SilasK (Maintainer, Author) · 2021-03-31T08:52:27Z

Thank you very much for your comments.

It seems that, at least for some genomes, you have 'too many reads', which then creates problems for the assembly (example). In this case, normalization might improve the assembly.

@ChristianFurbo Can I suggest that you run bbnorm on all the reads (no splitting) with the parameters target=50 min=2 or target=100 min=2 and see what you get with metaspades? This way you remove the redundancy of the overly abundant kmers and still keep the difference in abundance between target and min, which is useful for the assembly.
I think this is what most people suggested to me for complicated metagenomes.

You could also try megahit with the preset meta-large and min_count: 6 or so.
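
As a rough sketch of the suggestion above: normalize all reads of one sample with bbnorm, then assemble the normalized reads. The file names and output directories are placeholders, and the `run` helper with its `DRY_RUN` switch is a hypothetical convenience that only prints the commands by default. The megahit line assumes that the preset and min_count settings mentioned above correspond to megahit's `--presets` and `--min-count` command-line options.

```shell
# Hypothetical helper: print the command when DRY_RUN=1 (default), otherwise run it.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Placeholder file names for one paired-end sample.
r1=sample_R1.fastq.gz
r2=sample_R2.fastq.gz
n1=sample_norm_R1.fastq.gz
n2=sample_norm_R2.fastq.gz

# Normalize all reads together (no splitting) with target=50 min=2.
run bbnorm.sh in="$r1" in2="$r2" out="$n1" out2="$n2" target=50 min=2

# Option 1: assemble the normalized reads with metaSPAdes.
run metaspades.py -1 "$n1" -2 "$n2" -o spades_norm

# Option 2: MEGAHIT with the meta-large preset and a minimum kmer count of 6.
run megahit -1 "$n1" -2 "$n2" --presets meta-large --min-count 6 -o megahit_norm
```

Setting `DRY_RUN=0` executes the commands instead of printing them, provided BBTools and the assemblers are on the PATH.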

slambrechts · 2021-03-31T13:43:03Z

Interesting! Thank you very much for the discussion!

I have 58 soil samples, and so far I have only tried a coassembly of everything (using megahit with only the meta-large preset). N50 was 912 bp before subsetting, and 2752 bp after subsetting for contigs > 1000 bp. Indeed, I also encounter the problem of a few dominant organisms that are sequenced at very high depth, and a lot of rare biosphere members sequenced at low depth.

Definitely want to try normalization before assembly, thanks for the suggestion!

At the moment I'm aiming to run atlas and use metaspades to assemble each sample separately. But I also don't want to lose the less dominant members of my microbial communities, so I would like to add bins generated from the coassembly, or the ones generated when using the BinGroup parameter in atlas. After that, dereplication should do its work? I'm not yet sure what the best way of doing this in atlas would be, though: should I somehow add the coassembly to the workflow with a hack, or forget about the coassembly and use cross-sample binning by setting the BinGroup parameter?

ChristianFurbo · 2021-04-01T08:43:01Z

Hi

I tried your suggestion, SilasK, and it helped.
I ran bbnorm with target=50 and min=2; afterwards, I used megahit with meta-large and min_count 6. From this, I got ~50% assembled reads. Without bbnorm, it was ~26%.

Using MetaSpades with default settings, I got ~40% assembled reads. Without normalization, I got roughly 21%.

So it looks like normalization worked :)

SilasK (Maintainer, Author) · 2021-04-01T08:48:27Z

Thank you for the update.
The 50% is of the normalized reads, isn't it? So if the normalization throws away 50% or more of the reads and you can now map twice as large a fraction onto the contigs, then this is not really an improvement. Am I right?

Ideally, one should normalize the reads after error correction but then use the unnormalized QC reads for the mapping and downstream steps. I think I should finalize #289.

I meant to use normalization + spades (which is arguably still the better assembler) or megahit with min_count 6.
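
To make the point about percentages concrete, here is a small sketch that converts a "percent assembled" measured on normalized reads into a percentage of the original QC reads. The function name and the 50% retention figure are illustrative assumptions. The calculation ignores QC reads that were removed by normalization but would still map to the contigs, so mapping all QC reads would likely give a higher number.

```shell
# percent_of_original PCT_ASSEMBLED KEPT_FRACTION
# PCT_ASSEMBLED: percent of the normalized reads that map to the contigs
# KEPT_FRACTION: fraction of the QC reads that survived normalization (assumed known)
percent_of_original() {
  awk -v pct="$1" -v kept="$2" 'BEGIN { printf "%.1f\n", pct * kept }'
}

# Illustrative numbers: 50% assembled after normalization, half of the reads kept.
percent_of_original 50 0.5   # prints 25.0
```

With these assumed numbers, 50% of the normalized reads corresponds to 25% of the original reads, which is close to the ~26% reported without bbnorm.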

slambrechts · 2021-04-01T09:34:55Z

This might also be relevant here: bbcms.sh vs bbnorm

Recently Brian Bushnell, who developed BBMap, told me that using bbcms with bbcms.sh mincount=2 highcountfraction=0.6 might be better for metagenomes.

and:

Our current metagenome assembly pipeline gets rid of most of the unassemble-able reads with BBCMS:

bbcms.sh mincount=2 highcountfraction=0.6

This gets rid of all of the reads in which fewer than 60% of the kmers have depth of at least 2, which are generally junk. It will also reduce the overall kmer count somewhat with error correction, where the depth is sufficient. link

They seem to use it at JGI: Loxahatchee wildlife refuge study & giant virus study

But bbcms appears to also include error correction:

The error-correction algorithm is taken from Tadpole. But if there is sufficient memory to use Tadpole, then Tadpole is more desirable.

Will test bbcms without the error-correction by using ecc=f, but does this also remove the redundancy of the too abundant kmers?
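
The filtering rule quoted above can be illustrated with a toy example. This is only a sketch of the rule as described in the quote, not bbcms's actual implementation: the `keep_read` function is hypothetical and takes the kmer depths of one read as a space-separated list, whereas bbcms estimates those depths internally.

```shell
# keep_read "DEPTHS" [MINCOUNT] [HIGHCOUNTFRACTION]
# Prints "keep" if at least HIGHCOUNTFRACTION of the kmers have depth >= MINCOUNT,
# otherwise "toss". Defaults mirror mincount=2 highcountfraction=0.6.
keep_read() {
  echo "$1" | awk -v mincount="${2:-2}" -v hcf="${3:-0.6}" '{
    high = 0
    for (i = 1; i <= NF; i++) if ($i >= mincount) high++
    if (NF > 0 && high / NF >= hcf) print "keep"; else print "toss"
  }'
}

keep_read "3 4 1 5 2"   # 4 of 5 kmers have depth >= 2 -> keep
keep_read "1 1 1 2 5"   # 2 of 5 kmers have depth >= 2 -> toss
```

As described in the quote, the rule only discards reads made up mostly of low-depth kmers; nothing in the quote indicates that it also caps highly abundant kmers the way bbnorm's target does.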

ChristianFurbo · 2021-04-05T06:46:19Z

Hi

I ran the assembly again using metaspades. I attach the contigs stats below.

I normalized the error_corrected reads using bbnorm with target=50, min=2 and with target=100, min=2. Both gave more or less the same result, ~41%.
I also tried using bbcms as slambrechts suggested. It gave a higher contig length and assembly percent, ~50%.
For comparison, I also ran metaspades without normalizing the reads. It gave an assembly percent of 21%.

What I observed was:
bbnorm increased the assembly_percent compared to not normalizing, though at the cost of "number_of_contigs". However, this may be due to the increased contig length (by roughly 200 bp).
bbcms had a higher assembly percent. However, it seems I lost ~80000 contigs compared to the non-normalized reads, and the assembly length decreased by ~100M bp (from ~200M to ~100M).

Question:
You mention that ideally we should use the unnormalized QC reads for the mapping. I did not do that here. How can I tell Atlas to map a different set of reads to an assembly? :)

SilasK (Maintainer, Author) · 2021-04-06T16:01:05Z

The percentage of aligned reads is misleading if it is based on different numbers of input reads, isn't it?
However, we can more or less compare the number of Assembled_Reads.

It seems that the unnormalized assembly produces the largest contigs, followed by the normalized one (Assembled_Reads, N50, number of bp, and genes). Normalization doesn't seem to improve the assembly, and bbcms seems to be even worse.

If I understand it correctly, bbcms is an alternative to tadpole for the error correction. I don't see a reason to use bbcms instead of tadpole unless your dataset is too big. However, maybe I should adapt the filtering parameters to achieve the same.
Brian Bushnell doesn't recommend normalization for metagenomes, but in very complicated cases, e.g. soil, it might help.

@ChristianFurbo Now the question is do you want to use normalization for your assembly?
Can you run the binning for these datasets to see what is the binning output?

I don't know, could normalization to target=10 be worth trying?

My idea in the atlas workflow is that we start with QC reads, then they are error corrected and merged before the assembly. But then we map the QC reads to the assembly.
I will add the option for normalization after error correction #289

ChristianFurbo · 2021-04-07T12:07:10Z

Hi

Yes, you are right. They would be misleading. Looking at my "read_stats_length", I can see that I have roughly 50% more reads in my non-normalized samples.

I would assume loosening the filtering parameters will help a little. However, on a previous run, I did try setting "minimum_percent_covered_bases" to 5. It did not change a lot.

I ran metaspades after normalizing to target=10. It did not change much (photo attached).

For the binning output, I ran atlas with default settings. Is there anything specific you want to see? Otherwise, I attached a photo with the number of bins and their completeness and contamination.
It seems I actually got more bins with both bbnorm and bbcms compared to non-normalized.
8 bins -> non-normalized; one bin was above threshold (>90% completeness, <5% contamination)
11 bins -> bbnorm_100
10 bins -> bbnorm_50
9 bins -> bbcms; one bin was above threshold (>90% completeness, <5% contamination)

However, based on the last figure I attached, it seems, just by eye, that the non-normalized bins are of "higher" quality: 7 out of 8 bins are below 5% contamination, and the remaining one is at 8%.
In comparison, the normalized samples have bins with contamination up to 19%.

I can also see a difference in the taxonomy: e.g. 3 bins in the non-normalized sample could not be resolved, while only 1 bin with bbnorm was unresolved.

Lastly, your question.
Based on the binning, both bbcms and bbnorm gave more bins. Yet the quality decreased, if I am correct? In that case, I would not do the normalization.

slambrechts · 2021-04-07T13:11:31Z

Interesting! @ChristianFurbo did you run bbcms with or without error correction (i.e. ecc=t or ecc=f)? And did you pass both R1 and R2 files using in2=<R2 file>, or run bbcms on the forward and reverse reads separately?

ChristianFurbo · 2021-04-07T13:22:37Z

Hi slambrechts

I ran bbcms with defaults, so it must have been ecc=t. I ran it with both R1 and R2 using in=R1... in2=R2.. out=R1... out2=R2...

However, I think I made a mistake. I ran bbcms on my error_corrected_reads, as I did with bbnorm, which I now realize may be wrong? :)
So ideally, it should be run on the quality checked reads, using ecc=f?

slambrechts · 2021-04-07T13:38:13Z

Indeed, bbcms does error correction by default, so if you want to test the effect of only the bbcms depth filter, you need to set ecc=f.

I'm also not sure whether error correction should be done before or after filtering. The bbcms description states:

Because accuracy declines with an increasing number of unique kmers, it can
be useful with very large datasets to run this in 2 passes, with the first
pass for filtering only using a 2-bit filter with the flags tossjunk=t and
ecc=f (and possibly mincount=2 and hcf=0.4), and the second pass using a
4-bit filter for the actual error correction.

So I was thinking of using bbcms with ecc=f before running atlas run assembly. The actual error correction would then be performed during atlas run assembly, and thus after the bbcms filter?
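
The order of steps proposed here could look roughly like this. The file names are placeholders, and the `run` helper with its `DRY_RUN` switch is a hypothetical convenience that only prints the commands by default. How the filtered reads are registered as atlas input is not covered in this thread and is left out.

```shell
# Hypothetical helper: print the command when DRY_RUN=1 (default), otherwise run it.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Placeholder file names for the QC reads and the filtered output.
qc1=sample_QC_R1.fastq.gz
qc2=sample_QC_R2.fastq.gz
f1=sample_filtered_R1.fastq.gz
f2=sample_filtered_R2.fastq.gz

# Step 1: depth filter only; ecc=f switches off bbcms's own error correction.
run bbcms.sh in="$qc1" in2="$qc2" out="$f1" out2="$f2" ecc=f mincount=2 highcountfraction=0.6

# Step 2: atlas assembly, which then performs the error correction.
# Assumes the atlas project has already been set up to use the filtered reads.
run atlas run assembly
```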

ChristianFurbo · 2021-04-07T13:52:17Z

I am running it again now, this time using bbcms on the QC_reads. I am also doing the error correction with atlas, as you suggest.

A question: is there a limit to "how much" you can error correct? E.g. in this case, we are error-correcting with bbcms, then error-correcting with atlas, and lastly again after the filtering? So three error corrections?
Would it be an idea to first assemble the reads without normalizing with bbcms or bbnorm, then take all the unassembled reads, normalize those, and assemble a second time using only the normalized reads?

slambrechts · 2021-04-07T14:05:29Z

If you run bbcms with ecc=f on QC reads, and then run atlas run assembly, there should be only one error-correction step (tadpole during atlas run assembly), or am I wrong?

If you run bbcms in default mode on QC reads before atlas run assembly, I have no idea what the effect of doing error correction twice is, to be honest.

ChristianFurbo · 2021-04-07T14:09:02Z

Sorry, yes you are right :) ecc=f would leave one error-correction step in the assembly step of atlas.

ChristianFurbo · 2021-04-07T17:57:46Z

Hi
So I ran bbcms again on the QC reads using ecc=f mincount=2 highcountfraction=0.6. The results seem to be more or less the same.

SilasK (Maintainer, Author) · 2021-04-29T18:33:28Z

Hello @makrez, you also have a difficult metagenome.

For everybody here:
The idea behind my implementation (#289) is to normalize reads after QC but before assembly. This way, you would still have all the QC reads available for the binning.

Here is how to run atlas on the dev branch:

branch_name=normalizeagain

git clone https://github.com/metagenome-atlas/atlas.git
cd atlas
git checkout ${branch_name} # change to new branch
mamba env create -n atlas-dev --file atlasenv.yml
conda activate atlas-dev
pip install --editable .
