You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Skesa is awesome assembler. But, I have stumbled on an issue and I was hoping you may be able to help. I detail what I have done so far and where I think the problem is below. Please let me know if you need anything else.
I am performing limit of detection experiments for detecting Salmonella serotype using the tool sistr. In this experiment, I randomly downsample reads to various depths (e.g., 10, 20, etc), assemble them, and run the assembly through sistr. I am experimenting with spades.py 3.12.0, shovill 1.0.1 (using spades) and skesa 2.2.0. Each combination of sample/depth was repeated 10X.
sistr attempts three different methods of serotype inference: (1) by matching the assembly to a DB of antigen genes; (2) Based on mash against a curated set of genomes; and (3) using their own 330 cgMLST.
In our pipeline, a successful inference is attributed to assemblies that result in concordant calls across cgMLST and antigen methods. We also require the correct inference of O antigen, the H1 and/or H2 (Enteritidis only has H1 – Salmonella antigenic formula wiki), and the Serogroup.
At the moment, I have run the pipeline with two isolates from SRA (both serovar Enteritidis): SRR7284860, SRR7284877. The preliminary results suggest that skesa assemblies performs significantly worse than spades or shovill:
Digging a bit deeper, this results is in part due to isolate SRR7284860 performing poorly at practically every depth (although the other isolate only really started to perform relatively well at depths 70 and larger):
When we examine why skesa assemblies are failing at such a high rate, it seems the Serogroup inference is missing. Serogroup is determined by wzx or wzyThis isolate seems to have relatively uneven and poor coverage over one of the crucial genes determining serogroup (wzx --- figure below is mapping of the whole read set to the gene):
Which, I guess, once we start to subsample the reads causes issues.
To investigate a little further, I used minimap2 to bait reads mapping to the wzx gene in isolate SRR7284860, and used samtools to create a pair of FASTQ files (only kept reads mapping to the region and that were properly paired). This is one of the central genes associated with determination of serogroups. When I try to assemble these reads with spades.py I get 2 contigs that roughly span the whole gene (with a break around that region of zero coverage). With skesa, I also get two contigs, but they are far far smaller. I am attaching the contigs generated from the assembly of the reads in the gene, as well as the reads mapping to the gene.
This allele of the gene has about ~30% GC content, and that seems to be the average for this gene.
Attached files
r1.fq, f2.fq --- paired end reads obtained by baiting the whole read set of SRR7284860 for the wzx gene
SKESA was designed to avoid questionable connections and extensions. On the other hand, SPAdes indeed uses every read to extend the assembly. This may result in less than perfect assembly if this last read is bogus. You can relax SKESA slightly using --allow_snps options (see issue #5 for explanation about additional output). Even with this option SKESA will not assemble the whole gene because of the coverage gap.
I understand that you do the assembly just to find some specific genes. For this purpose we have a better tool which we call guided assembler. The idea behind it is to align target sequences directly to the de Bruijn graph and find all paths with reasonably good alignments. If you think that this application, which is not public yet, might be useful for your project send me an email to [email protected] for further discussion.
Hello.
Skesa is awesome assembler. But, I have stumbled on an issue and I was hoping you may be able to help. I detail what I have done so far and where I think the problem is below. Please let me know if you need anything else.
The tool
sistr
can be found here: https://github.com/peterk87/sistr_cmdBackground
I am performing limit of detection experiments for detecting Salmonella serotype using the tool
sistr
. In this experiment, I randomly downsample reads to various depths (e.g., 10, 20, etc), assemble them, and run the assembly throughsistr
. I am experimenting withspades.py 3.12.0
,shovill 1.0.1
(using spades) andskesa 2.2.0
. Each combination of sample/depth was repeated 10X.sistr
attempts three different methods of serotype inference: (1) by matching the assembly to a DB of antigen genes; (2) Based onmash
against a curated set of genomes; and (3) using their own 330 cgMLST.In our pipeline, a successful inference is attributed to assemblies that result in concordant calls across cgMLST and antigen methods. We also require the correct inference of O antigen, the H1 and/or H2 (Enteritidis only has H1 – Salmonella antigenic formula wiki), and the Serogroup.
We use NextSeq 500 data.
Commands
Skesa assembly:
Spades assembly:
Shovill assembly:
Results
At the moment, I have run the pipeline with two isolates from SRA (both serovar Enteritidis): SRR7284860, SRR7284877. The preliminary results suggest that
skesa
assemblies performs significantly worse thanspades
orshovill
:Digging a bit deeper, this results is in part due to isolate SRR7284860 performing poorly at practically every depth (although the other isolate only really started to perform relatively well at depths 70 and larger):
When we examine why
skesa
assemblies are failing at such a high rate, it seems the Serogroup inference is missing. Serogroup is determined bywzx
orwzy
This isolate seems to have relatively uneven and poor coverage over one of the crucial genes determining serogroup (wzx
--- figure below is mapping of the whole read set to the gene):Which, I guess, once we start to subsample the reads causes issues.
To investigate a little further, I used
minimap2
to bait reads mapping to thewzx
gene in isolate SRR7284860, and usedsamtools
to create a pair of FASTQ files (only kept reads mapping to the region and that were properly paired). This is one of the central genes associated with determination of serogroups. When I try to assemble these reads withspades.py
I get 2 contigs that roughly span the whole gene (with a break around that region of zero coverage). Withskesa
, I also get two contigs, but they are far far smaller. I am attaching the contigs generated from the assembly of the reads in the gene, as well as the reads mapping to the gene.This allele of the gene has about ~30% GC content, and that seems to be the average for this gene.
Attached files
SRR7284860_wzx.zip
The text was updated successfully, but these errors were encountered: