-
Notifications
You must be signed in to change notification settings - Fork 3
Home
During bioinformatic analysis, we often have to deal with multiple large data sets. Good scientific practice requires that all the steps of the analysis can be reproduced, which is mainly possible if all steps have been properly documented. In addition, the same analyses might have to be applied to different data sets to compare results from one data set from another. Structured work-flows facilitate assuring reproducibility and comparability of analyses.
The pipeline was designed for short read high-throughput sequence data. It covers quality control on the fastq files, assembling paired-end reads to fragments, splitting the libraries into samples, OTU/ASV picking and taxonomy assignment. Besides other output files, it will return an OTU/ASV table.
CASCABEL can be customized to work on different marker genes, use different reference databases and applying a range of different methods for e.g. picking OTUs. Many steps are based on different public tools, among them, some QIIME scripts and the full functionality of these scripts is available to the user.
Cascabel is designed to run with either single-end or paired-end data; for the latter, forward and reverse reads must be supplied. Cascabel can also perform library demultiplexing, for such case, a mapping file between samples and barcodes must be provided.
For one library, input files can be supplied directly into the configuration file, or you can specify more than one library using a tab-separated file with the following columns:
- Library: Name of the library (this, has to match with the values entered in the LIBRARY variable).
- Forward reads: Full path to the forward reads.
- Reverse reads: Full path to the reverse reads (only for paired-end data; omit this column for single-end).
- metadata: Full path to the barcode mapping file with the information for demultiplexing the samples (only if needed). For details, see section 2.2.
For multiple libraries, if you want to avoid the creation of this file a third solution is also available by creating the directory structure and linking the raw data as it is expected for running the pipeline. For this, you can execute the init_sample.sh
or init_sample_SE.sh
scripts for paired-end data or single end-data respectively. The init_sample.sh
script needs 5 arguments, in the exact same order as follows:
- The name of the project
- The name of the library
- Full path to barcode data file
- Full path to forward raw reads
- Full path to reverse raw reads
Example:
$ Scripts/init_sample.sh Project_Name Library_name /full/path/to/barcode/file.tsv /full/path/to/raw/fw_reads.fq /full/path/to/raw/rv_reads.fq
For single-end data, you should omit the last column full path to reverse raw reads and use the init_sample_SE.sh
script.
This step is executed internally while analyzing a single amplicon sequencing library, however, for multi library analysis, it is needed to do it manually for all the libraries the user wants to run. In order to do so, run the same script, with the same project name but different library names and of course, different absolute paths pointing at the proper barcode data file and raw reads.
After running the script, you should have a directory structure similar to the following:
<PROJECT_NAME>
├── metadata
│ ├── sampleList_mergedBarcodes_<LIBRARY_NAME_A>.txt -> /path/to/barcode/file_A.tsv
│ └── sampleList_mergedBarcodes_<LIBRARY_NAME_B>.txt -> /path/to/barcode/file_B.tsv
└── samples
├── <LIBRARY_NAME_A>
│ └── rawdata
│ ├── fw.fastq -> /path/to/raw/data/fw_reads_A.fq
│ └── rv.fastq* -> /path/ro/raw/data/rv_reads_A.fq
└── <LIBRARY_NAME_B>
└── rawdata
├── fw.fastq -> /path/to/raw/data/fw_reads_B.fq
└── rv.fastq* -> /path/to/raw/data/rv_reads_B.fq
*The rv.fastq file is only created for paired-end data. The name fw.fastq is maintained for single-end data.
If your data is already demultiplexed, you need to do a similar step, but instead of using the init_sample.sh
you need to use the init_sample_dmx.sh
script and provide the following fields:
- The name of the project
- The name of the library
- Full path to forward reads
- Full path to reverse reads*
- For single-end data use
init_sample_dmx_SE.sh
script and omit the last column full path to reverse raw reads
Please, notice that while filling up the configuration file, the PROJECT parameter has to be the same as the one entered while using this script, as well as for the LIBRARY parameter, all the libraries should be included surrounded by quotes and comma-separated (see the example on the general parameter section).
If you are going to perform demultiplexing of your libraries, it is very important to supply a mapping file with the information between the barcodes and their samples. The mapping file is in compliance with Qiime's mapping file, and therefore needs the following format.
The mapping file is validated during the pipeline execution. To see the complete validation criteria, please refer to the following link.
These are some of the most important details about the mapping file:
General
- Tab-delimited file
- avoid spaces
- avoid empty fields
Header
- Name of first 3 columns must be:
#SampleID BarcodeSequence LinkerPrimerSequence
- No spaces
- Starts with #
- Use only alphanumeric (a-z, A-Z and 1-9) and/or underscore (“_”)
- Name of the last column must be: "Description"
- Additional optional headers(columns) can follow the “LinkerPrimerSequence” header
- names and number of these additional columns are optional
Fields
-
#SampleID will be the name of the sample in downstream analysis.
- Must be unique
- only alphanumeric and period (.) characters are allowed
- don’t start with # (unless you want to comment such line)
-
BarcodeSequence sequence to link a read with its sample
- For paired-end reads, is the concatenation of forward barcode and the reverse complement of the reverse barcode (in case of dual labeled barcoding).
- Must be unique.
- All the barcodes need to have the same length
- Only A, C, T, G allowed Please find a valid mapping file in the test repository: barcode mapping file example
-
LinkerPrimerSequence forward primer used to amplify the sample.
- Only IUPAC characters allowed
-
ReversePrimer reverse primer used to amplify the sequence.
- Only IUPAC characters allowed
- Used for removing primers from the samples
-
ReversePrimerRevCom* reverse complement of the reverse primer.
- Only IUPAC characters allowed
- Used for removing primers from the samples
*For primer removal it is only needed one of the columns: ReversePrimer or ReversePrimerRevCom
Once you have all your files in place and directory structure (only for multiple libraries), you need to edit the configuration file config.yaml. This configuration is one of the most important steps since it encapsulates the workflow behavior and all the parameters to be applied to each rule.
The configuration file is written in YAML language. YAML™ is a human-friendly, cross-language, Unicode based data serialization language designed around the common native data types of agile programming languages.
In this file, you only need to overwrite settings according to your needs, most values are already pre-configured. It is very important to keep the indentation of the file (don’t change the tabs and spaces), as well as the name of the parameters. But you can change the values of the parameters to deviate from the default settings. Any text after a hashtag is considered a comment and will be ignored by snakemake.
The general parameters section defines parameters that are global or general for the complete workflow execution, as well as extra information for the report generation.
In this section, the project name and the library name(s) are the only mandatory values for both execution types (multiple libraries or single library) that you need to change. In order to do this, open the file with any text editor and set the values for the PROJECT and LIBRARY parameters:
#------------------------------------------------------------------------------#
# Project Name #
#------------------------------------------------------------------------------#
# The name of the project for which the pipeline will be executed. This should #
# be the same name used as the first parameter with the init_sample.sh script #
# (if used for multiple libraries). #
#------------------------------------------------------------------------------#
PROJECT: "PROJECT_NAME"
#------------------------------------------------------------------------------#
# LIBRARIES/SAMPLES #
#------------------------------------------------------------------------------#
# SAMPLES/LIBRARIES you want to include in the analysis. #
# Use the same library names as with the init_sample.sh script. #
# Include each library name surrounded by quotes, and comma separated. #
# i.e LIBRARY: ["LIB_1","LIB_2",..."LIB_N"] #
# LIBRARY_LAYOUT: Configuration of the library; all the libraries/samples #
# must have the same configuration; use: #
# "PE" for paired-end reads [Default]. #
# "SE" for single-end reads. #
#------------------------------------------------------------------------------#
LIBRARY: ["SAMP_XX"]
LIBRARY_LAYOUT: "PE"
In the case that you want to run the pipeline for multiple library analysis, make sure to make the correct match between the project name and the libraries used with the init_sample.sh script and the ones used in the configuration file (view: section 2.2).
If you are planning to run Cascabel for a single library analysis, you can provide the complete path for the raw reads and the barcode metadata file directly on the configuration file (recommended), instead of using the init_sample.sh script:
#------------------------------------------------------------------------------#
# INPUT FILES #
#------------------------------------------------------------------------------#
# There are two ways to initialize the required project structure in order to #
# run the pipeline. The first one is to use the script init_sample.sh before #
# running the pipeline; more info at: #
# https://github.com/AlejandroAb/CASCABEL/wiki#initialize-structure-for-multiple-librarie
s
# If you are planning to run the pipeline for one single library, you can #
# supply all the necessary input files directly into this section and avoid #
# running the previously mentioned script. #
# Here you have to enter the FULL PATH for both: the raw reads and the metadata#
# file (barcode mapping file). The metadata file is only needed if you want to #
# perform demultiplexing. #
# - fw_reads: Full path to the raw reads in forward direction (R1) #
# - rw_reads: Full path to the raw reads in reverse direction (R2) #
# - metadata: Full path to the metadata file with barcodes for each sample #
# to perform library demultiplexing #
# ** Please supply only one of the following: #
# - fw_reads, rv_reads and metadata #
# - input_files #
# - or use init_sample.sh script directly #
#------------------------------------------------------------------------------#
fw_reads: ""
rv_reads: ""
metadata: ""
#OR
input_files: ""
You can also pass those arguments with the --config option within the command line. e.g.:
snakemake --configfile config.yaml --config RUN="MyRUN" fw_reads="/path/to/raw/data/fw_reads.fq" RV_reads="/path/to/raw/data/rv_reads.fq" metadata="/path/to/barcode/file.tsv"
The RUN parameter can be set here or remain empty, in the latter case, the user must assign this value via the command line. This parameter should be an alphanumeric string without spaces, try to use a name that will help you identify this specific run in the future, i.e., include the date or some particular configuration you are using or testing.
The description parameter is designed to store information that is included in the final report, so you can use this variable if you want to have some specific description of your analysis on the final report.
#------------------------------------------------------------------------------#
# RUN #
#------------------------------------------------------------------------------#
# Name of the RUN - Only use alphanumeric characters and don't use spaces. #
# This parameter helps the user to execute different runs (pipeline executions)#
# with the same input data but with different parameters (ideally). #
# The RUN parameter can be set here or remain empty, in the latter case, the #
# user must assign this value via the command line. #
# e.g: --config RUN=run_name #
#------------------------------------------------------------------------------#
RUN: ""
#------------------------------------------------------------------------------#
# Description #
#------------------------------------------------------------------------------#
# Brief description of the run. Any description written here will be included #
# in the final report. This field is not mandatory so it can remain empty. #
#------------------------------------------------------------------------------#
description: ""
By default, CASCABEL operates with uncompressed fastq files, in order to change this and accept compressed files you can modify the input type under this section.
#------------------------------------------------------------------------------#
# INPUT TYPE #
#------------------------------------------------------------------------------#
# CASCABEL supports two types of input files, fastq and gzipped fastq files. #
# This parameter can take the values "T" if the input files are gziped #
# (only the reads!, the metadata file always needs to be uncompressed) or "F" #
# if the input files are regular fastq files #
#------------------------------------------------------------------------------#
gzip_input: "T"
This section defines parameters that will influence the type of report to be generated at the end of the workflow. By default, CASCABEL only generates an HTML report, however, you can choose to create a pdf file or a compressed file with all the html resources in order to make the HTML report "full portable".
#------------------------------------------------------------------------------#
# PDF Report #
#------------------------------------------------------------------------------#
# By default, CASCABEL creates the final report in HTML format. In order to #
# create the report also as pdf file, set this flag to "T". #
# Important! in order to convert the file to pdf format, it is necessary to #
# execute the pipeline within a Xserver session i.e MobaXterm or ssh -X. #
# One way to validate if your active session is using an Xserver, execute #
# 'echo $DISPLAY' on a command line terminal. If this returns empty, you do #
# not have an Xserver session. #
# - pdfReport "T" for generate pdf report or "F" to skip it. #
# - wkhtmltopdf_command name w/wo path to execute the html to pdf translation #
#------------------------------------------------------------------------------#
pdfReport: "F"
wkhtmltopdf_command: "wkhtmltopdf"
#------------------------------------------------------------------------------#
# Portable Report #
#------------------------------------------------------------------------------#
# CASCABEL creates the final report in HTML format, containing references to #
# other images or links. Therefore just copying the HTML files for sharing or #
# inspecting the results will break these links. #
# By setting 'portableReport' to true "T", CASCABEL will generate a zip file #
# with all the resources necessary to share and distribute CASCABEL's report. #
#------------------------------------------------------------------------------#
portableReport: "T"
Cascabel supports two main types of analysis:
- Analysis based on traditional OTUs (Operational Taxonomic Units) which are mainly generated by clustering sequences based on a sheared similarity threshold.
- Analysis based on ASV (Amplicon sequence variant). This kind of analysis deal also with the errors on the sequence reads such that true sequence variants can be resolved, down to the level of single-nucleotide differences.
In the later case, the ASV analysis is performed with dada2 (High-resolution sample inference) algorithm. For the OTUs, different clustering algorithms can be chosen.
#------------------------------------------------------------------------------#
# ANALYSIS TYPE #
#------------------------------------------------------------------------------#
# #
# - ANALYSIS_TYPE "OTU" or "ASV". Defines the type analysis #
#------------------------------------------------------------------------------#
ANALYSIS_TYPE: "ASV"
In this section of the configuration file, you can find all the parameters used to run the different rules during the execution of the pipeline.
All the parameters contain some default value and they are group together according to the rule they belong to. Each set of parameters, starts with a block comment, indicating to which rule the parameters belong to, following we have a brief explanation of the rule and following the parameter usage and if applicable, the possible values this parameter can take. It is recommended to go trough ALL the configuration file and change the default values according to your specific needs or characteristics of your dataset.
The parameters on the configuration file are transferred to the command line when the rule is executed, so, if the user wants to include any parameter that is not been taking into account for some specific tool via its rule in the config file, there is a parameter called “extra_params” for almost all the rules. This parameter can be used in order to include any extra arguments to the specific command line executed by this rule.
Example of a rule with all the comments and parameters in the configuration file:
# Assemble fragments (merge forward with reverse reads) #
# rule: pear #
#------------------------------------------------------------------------------#
# This step is performed to merge paired reads. #
# - t Minimum length of reads after trimming the low. #
# - v Minimum overlap size. The minimum overlap may be set to 1 #
# when the statistical test is used. (default in pear: 10) #
# - j Number of threads to use. #
# - p The p-value cutoff used in the statistical test. Valid #
# options are: 0.0001, 0.001, 0.01, 0.05 and 1.0. Setting 1.0 #
# disables the test. (default: 0.01) #
# - extra_params Extra parameters. See pear --help for more info. #
# - prcpear The minimun percentage of expected peared reads, if the #
# actual percentage is lower and the 'interactive' parameter #
# is set to "T"; a warning message will be shown. #
#------------------------------------------------------------------------------#
pear:
command: "pear"
t: 100
v: 10
j: 6
p: 0.05
extra_params: ""
prcpear: 90
Cascabel can perform the library demultiplexing by supplying the following information:
demultiplexing:
demultiplex: "T"
create_fastq_files: "T"
remove_bc: 12
order_by_strand: "T"
add_unpair: "T"
dmx_params: "--remove-header"
bc_mismatch: 0
bcc_params: ""
create_tag_pairs_heatmap: "T"
- demultiplex. Set this flag to true ("T") in case you want to demultiplex your data.
- create_fastq_files. If true ("T"), individual fastq files per sample will be created
- remove_bc. Truncate first N bases corresponding to the barcode sequence.
- add_unpair. If true ("T"), non-overlapping reads are included in the individual generated fastq files.
- bc_mismatch. Cascabel can tolerate error correction for barcode assignation. For a 12bp barcode we advice a maximum 2 bases for error correction. Distance between barcodes and template is computed with an internal implementation of the Levenshtein distance algorithm, same that is available outside the pipeline with the following command:
java -jar BarcodeCorrector.jar distance SEQ1 SEQ2
- create_tag_pairs_heatmap. It will create a heatmap and a matrix with all the fw and reverse combinations found in the demultiplexed library. Useful to identify tag jumps on Illumina libraries index-hopping.
Cascabel is capable of removing primers at different stages of the workflow.
- OTU workflow. Cascabel removes primers after demultiplexing and assembling samples, so it is important to notice that at this stage of the workflow, the reverse primer is reverse complemented in the assembled reads (paired-end input). The user can configure this step according to the following options:
primers:
remove: "F"
fw_primer: ""
rv_primer: ""
min_overlap: "5"
min_length: "100"
min_prc: 50
threads: 10
extra_params: "--discard-untrimmed --match-read-wildcards "
For this end, Cascabel use Cutadapt and the option remove will tell Cascabel whether or not to remove the primers and where to find the primer information:
- "F" Do not remove primers [default].
- "CFG" Remove primers using values from the configuration file.
- "METADATA" Primers are taken from the metadata/mapping file.
For the last option (METADATA), primers are taken from the mapping file. Cascabel will look for the columns LinkerPrimerSequence and ReversePrimer or ReversePrimerRevCom. For a paired-end input, at this stage the reads are already assembled, so Cascabel will reverse complement the sequences from the ReversePrimer unless the reverse complement of the reverse primer is supplied using the column ReversePrimerRevCom.
If the primers are supplied via the configuration file (option CFG), values for fw_primer and rv_primer variables must be supplied for a paired-end workflow and only fw_primer for a single-end workflow. Both primers should be in 5'-> 3' orientation.
The min_prc is used to verify that the minimum number of reads with primers is above a user-defined threshold (50% by default). If the number of reads is below this threshold, an interactive option would be triggered.
The extra_params option can hold any valid argument used within Cutadapt, so always refer to the Cutadapt manual to see which options fit better your needs. By default, Cascabel suggests two options: "--discard-untrimmed" for removing sequences without primers (as these can be sequencing artifacts) and "--match-read-wildcards" for allowing the usage of IUPAC wildcards within the linker primer sequences.
- ASV workflow. In the case of an ASV workflow, the same options for primer removal should be used. The main difference with the OTU workflow is that primers on the ASV workflow, are removed from individual sample fastq files, while for the OTU workflow, primers are removed from assembled reads, as explained above.
CASCABEL supports multiple taxonomy assignation methods or tools:
- VSEARCH (global pairwise alignment).
- BLAST+.
- QIIME. Runs the assign_taxonomy.py script which can use any of the following methods:
- BLAST. This uses the old version of BLAST (not BLAST+).
- UCLUST. A method based on sequence clustering. To use this method type
- RDP classifier
In order to select the desired tool, you have to enter one of the following values in the tool section:
assignTaxonomy:
tool: "VSEARCH|BLAST|QIIME"
blast:
fasta_db: ""
mapFile: ""
taxo_separator: "';'"
...
vsearch:
db_file: ""
mapFile: ""
...
qiime:
method: ""
mapFile: ""
dbFile: ""
...
After selecting the tool, you need to enter the values only for the selected one. As you can see above, each tool has their own options.
The instructions on how to complete the variables for any specific method are clearly detailed within the configuration file.
Any of these methods need to be supplied with a fasta file (or blast database for BLAST methods) and its corresponding mapping file, which have to map the sequence accessions (according to the supplied fasta file) and their taxonomy. This file should be a two-column file, tab-separated with the accessions in the first column and the taxonomy in the second column with the taxonomic levels separated by a semicolon (;), however, this can be modified with the taxo_separator variable. e.g. AY190.44 Bacteria;Firmicutes;Bacilli;...
In this sense, CASCABEL can work with almost any reference database or custom databases made by the users. With the aim to help in the process of mapping file creation for custom databases, we also suggest the usage of our tool TaxoMapper. For example, if you have a custom set of nucleotide sequences downloaded from NCBI and you need to create a mapping file for CASCABEL (this suppose your fasta file have the NCBI's accession numbers in the header) you only need to follow these steps:
- Download the nucleotide to tax id mapping file (More mapping files here)
- Download and load database for TaxoMapper
- Run TaxoMapper with the following command:
java -jar TaxoMapper.jar -M QIIME -o nucl_gb.accession2path -tc 3 -ac 2 -m nucl_gb.accession2taxid -r CUSTOM superkingdom,phylum,class,order,family,genus,species -u <db_user> -p <db_pass>
This will create an output file nucl_gb.accession2path with the expected format.
The previous command translate/map the entries:
A00036 A00036.1 1423 38633
A04646 A04646.1 1717 21694064
to:
A00036.1 k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Bacillaceae; g__Bacillus; s__Bacillus subtilis
A04646.1 k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Corynebacteriales; f__Corynebacteriaceae; g__Corynebacterium; s__Corynebacterium diphtheriae
By default CASCABEL will remove most of the generated intermediate files. Set this flag to true ("T") in order to keep all the intermediate files. Default: "F" remove the files!
KEEP_TMP: "F"
Once that the configuration file has been filled according to your needs, you are ready to run the pipeline. To do this, you should be located in the “CASCABEL” directory and then use the following command:
snakemake --configfile config.yaml
If you haven't specified the RUN parameter in the config file, you have to provide it with the --config flag. e.g.
snakemake --config RUN="test" --configfile config.yaml
Snakemake applies the rules given in the Snakefile in a top-down way. The application of a rule to generate a set of output files is called job. Snakemake only re-runs jobs if one of the input files is newer than one of the output files or one of the input files will be updated by another job. CASCABEL is designed in such a way that the RUN parameter creates a new directory (with the name of the RUN) where all the results are stored, this way, the pipeline can re-run no matter if the inputs didn’t change, and allowing the user to have several runs with different configurations and benchmarks of the different results. Be sure to change the RUN parameter to take advantage of that!.
Please notice that you can overwrite almost any value from the configuration file using the --config flag, the name of the parameter to update, and its value.
Before running the complete workflow, it is good practice to first make a dry run. The idea of this is to check if all the steps are working in principle, as well as to see the rules and commands that are going to be executed. The dry run is performed by including the flags -n (print the execution plan instead of actually performing the steps). The -p flag instructs Snakemake to also print the resulting shell command for illustration. e.g.:
snakemake --configfile config.yaml -np
When the previous command is performed you should be able to see all the jobs that are going to be executed:
Job counts:
count jobs
1 align_rep_seqs
1 all
1 assign_taxonomy
1 bc_mapping_validation
1 cluster_OTUs
1 combine_accepted_reads
... ...
... ...
... ...
1 report_all
1 split_libraries_rc
1 summarize_taxa
1 translate_pdf_final_report
1 translate_to_pdf
1 tune_report
1 validatePear
1 validateQC
49
And above the job list, all the inputs, outputs and command lines that are needed by each job on the previous list:
...
rule filter_alignment:
input: PROJECT_NAME/runs/TT/otu/uclust/aligned/representative_seq_set_noSingletons_aligned.fasta
output: PROJECT_NAME/runs/TT/otu/uclust/aligned/filtered/representative_seq_set_noSingletons_aligned_pfiltered.fasta
jobid: 16
benchmark: PROJECT_NAME/runs/TT/otu/uclust/aligned/filtered/align_rep_seqs.benchmark
wildcards: PROJECT=PROJECT_NAME, run=TT
filter_alignment.py -i PROJECT_NAME/runs/TT/otu/uclust/aligned/representative_seq_set_noSingletons_aligned.fasta -o PROJECT_NAME/runs/TT/otu/uclust/aligned/filtered/
rule convert_table:
input: PROJECT_NAME/runs/TT/otu/uclust/otuTable.biom
output: PROJECT_NAME/runs/TT/otu/uclust/otuTable.txt
jobid: 9
benchmark: PROJECT_NAME/runs/TT/otu/uclust/otuTable.txt.benchmark
wildcards: PROJECT=PROJECT_NAME, run=TT
biom convert -i PROJECT_NAME/runs/TT/otu/uclust/otuTable.biom -o PROJECT_NAME/runs/TT/otu/uclust/otuTable.txt --table-type 'OTU table' --header-key taxonomy --to-tsv
Whenever you want to run CASCABEL from some specific point, you can do this by passing the --forcerun argument with the name of the rule you will like to restart the analysis.
This option is very useful when it is needed to fine-tune some arguments on the configuration file in order to re-run the pipeline, but you want to preserve results/outputfiles from previous rules (relative to the rule's arguments changed) but overwrite outputs from downstream rules. e.g., Restart a flow from the cutadapt rule:
snakemake --configfile config.yaml --forcerun cutadapt
Sometimes, universal primers like the forward 15F (GTGCCAGCMGCCGCGGTAA) and the reverse primer 926R (CCGYCAATTYMTTTRAGTTT) can potentially amplify both: bacterial 16S ribosomal RNA and eukaryotic 18S ribosomal RNA, having this last one a fragment size typically around 575–595 bp, thus, in an Illumina run 2x250, forward and reverse reads won't overlap, therefore these reads are generally discarded.
CASCABEL allows an option to merge unpaired data and continue with downstream analysis using such reads, instead of working with the paired reads.
#------------------------------------------------------------------------------#
# UNPAIRED FLOW #
#------------------------------------------------------------------------------#
# A regular workflow for marker gene analysis using paired-end data, #
# comprehends the merging of forward and reverse reads, prior to continuing #
# with downstream analysis implemented within this pipeline. #
# However, primers can intentially amplify fragments which are so large that #
# forward and reverse reads do not overlap. A regular analysis would discard #
# all those "unpaired" reads during the FW and RV read assembly. For this #
# scenario, Cascabel implements an alternative flow, where instead of #
# continuing with assembled reads, un-assembled reads are concatenated together#
# with a degenerated base 'N' between the FW read and the reverse complemented #
# RV read (which does not significantly influence k-mer based classification #
# methods such as RDP). #
# #
# - UNPAIRED_DATA_PIPELINE "T" or "F". True to work with the "un-assembled" #
# reads. If ANALYSIS_TYPE = "ASV" this option use #
# the "justConcatenate" option from the #
# mergePairs() function from the dada2 package. #
# - CHAR_TO_PAIR In-silico base used to pair both fragments. #
# Valid values A|T|G|C|N. You can use more than one#
# base, e.g. "GGGGG" pair with 5 Gs. #
# This parameter only applies for the OTU-WF. For #
# the ASV-WF, dada2 merge both reads with 10 Ns. #
# - QUALITY_CHAR The forward and reverse reads are paired #
# in-silico in a fastq file, thus the quality for #
# the pairing bases must be provided. If more than #
# one CHAR_TO_PAIR is used, you only need to #
# specify one and only one QUALITY_CHAR. #
# This parameter only applies for the OTU-WF. For #
# the ASV-WF, dada2 merge both reads with 10 Ns. #
#------------------------------------------------------------------------------#
UNPAIRED_DATA_PIPELINE: "T"
CHAR_TO_PAIR: "N"
QUALITY_CHAR: "#"
One of the goals of CASCABEL is not to only run all the steps for the analysis, but rather, to interpret some of the output files and according to this interpretation, warn the user of possible failures or things to take into account in the analysis. Therefore, during the execution of the pipeline you will be prompted to enter or confirm some data when running on this interactive mode
#------------------------------------------------------------------------------#
# Execution mode #
#------------------------------------------------------------------------------#
# This parameter allows the user to inspect intermediate files in order to #
# finetune some downstream analises, re-do previous steps or exit the workflow.#
# -interactive Set this flag to "T" (default) in order to interact at some #
# specific steps with the pipeline. "F" will try to run all the #
# pipeline without communicating intermediate results until the #
# report. #
# For a list for all the interactive checkpoints take a look at the following #
# link: TBD #
#------------------------------------------------------------------------------#
interactive : "T"
Having this flag set to "F" will skip all the checking points. Hereundera table with all the interactive points.
Checkpoint | Rule | Description |
---|---|---|
Quality on raw reads | validateQC | Validates FastQC results after quality check on raw reads. If the report contains more than a certain number of errors or warnings, Cascabel will suggest the user to verify the complete FastQC report before continuing. |
Assembled reads | validatePear | After assembly, Cascabel verifies the percentage of assembled reads, if this percentage is under a pre-defined threshold (config.yaml - [pear][prcpear] - ) The pipeline will warn the user about this fact, so the user is able to fine-tune the assembly parameters or continue with subsequent analyzes. |
Quality on assembled reads | validateFastQCPear | Same as validateQC rule but on assembled reads. |
Barcode metadata validation | validateBCV | Cascabel runs QIIME's validate_mapping_file.py in order to asses the format of the file with metadata and barcode information. If this report returns any error or warning, Cascabel will make a pause to elt the user knows about these possible errors. |
Library demultiplexing | validateDemultiplex | If demultiplexing is performed, Cascabel will always make a stop in this checkpoint, in order to let the user know the percentage of successfully assigned reads. Then, the user will have the opportunity to either continue or fine-tune some parameters and re-run the library splitting. |
Length filtering | remove_short_long_reads | Cascabel computes a histogram based on the length of the assembled reads. Next, such information is shown to the user so he/she can decide based on the data, where to apply the cut values. For this, several options are available. |
Remove Chimeras | remove_chimera | After chimera identification is performed, Cascabel will show the result to the user in order to decide to keep or discard chimeric sequences. After performing the chimera identification, Cascabel will show the result to the user to decide if the chimeric sequences are kept or discarded. |
Please notice that although that all the checking points are executed, not necessarily Cascabel will stop at all these points to require user input. When the results are good and above some thresholds, Cascabel will continue with downstream analyzes.
Once that the pipeline is finished, you will have a structure very similar to the following:
├── metadata
│ ├── bc_validation
│ │ └── <LIB_NAME>
│ │ └── sampleList_mergedBarcodes_<LIB_NAME>.html // barcode HTML report
│ └── sampleList_mergedBarcodes_<LIB_NAME>.txt -> /path/to/original/bacode/file.tsv
├── runs
│ └── <RUN_NAME> //directory with all the data per RUN
│ ├── <LIB_NAME>_data //directory with the processed data per library: <LIB_NAME>
│ │ ├── barcodes //directory with extracted barcodes
│ │ ├── barcodes_unassigned //directory with extracted barcodes (reverse complement unassigned reads)
│ │ ├── chimera //directory with qiimeras
│ │ ├── demultiplexed //directory with demultiplexed fastq files
│ │ ├── peared
│ │ │ ├── seqs.assembled.fastq // assembled reads
│ │ │ └── qc //QC on assembled reads (barcodes/)
│ │ ├── splitLibs //Split library based on FW barcodes (barcodes_unassigned/)
│ │ ├── splitLibsRC //Split library based on Reverse complement barcodes.
│ │ ├── seqs_fw_rev_accepted.fna //Fasta file with demultiplexed reads (splitLibs+splitLibsRC)
│ │ └── seqs_fw_rev_filtered.fasta //Fasta file with demultiplexed reads after length filtering
│ ├── otu //OTU analysis (ALL LIBRARIES)
│ │ ├── representative_seq_set.fasta //Fasta file with one rep. seq. per OTU
│ │ └── taxonomy_<TOOL> //directory with the taxonomy assignation by <tool>: vsearch|blast|qiime
│ │ ├── aligned //directory with the phylogenetics data
│ │ │ ├── filtered //aligment with the p_filtered singleton reads
│ │ │ │ ├── representative_seq_set_noSingletons_aligned_pfiltered.fasta //Sequence alignment
│ │ │ │ └── representative_seq_set_noSingletons_aligned_pfiltered.tre //Phylogenetic tree
│ │ │ └── representative_seq_set_noSingletons_aligned.fasta //alignment with the singleton reads (no filter)
│ │ ├── otuTable.biom //biom table
│ │ ├── otuTable_noSingletons.biom //biom table without singletons
│ │ ├── otuTable_noSingletons.txt //biom table (txt format) without singletons
│ │ ├── otuTable.txt //biom table (txt format)
│ │ ├── representative_seq_set_noSingletons.fasta //Fasta file with one rep. seq. per OTU without singletons
│ │ ├── representative_seq_set_tax_<TOOL>.out //Output file with the taxonomy assignation <tool>: vsearch|blast|qiime
│ │ └── summary //directory with OTU tables in biom and txt format summarized at different taxonomic levels.
│ ├── otu_report_<TOOL>.html //OTU report for ALL Libraries by <tool>: vsearch|blast|qiime
│ ├── otu_report_<TOOL>.pdf //Same as previous in pdf format
│ ├── report_<LIB_NAME>_<TOOL>.html //Individual report by <LIB_NAME>
│ ├── report_<LIB_NAME>_<TOOL>.pdf //Same as previous in pdf format
│ ├── report_vsearch.zip //zip file with the html reports and all their resources - "portable report"
│ ├── report_files //directory with all the resources for the html reports
│ │ └── krona_report.vsearch.html //KRONA report for ALL Libraries by Sample
│ └── seqs_fw_rev_combined.fasta //All the concatenated filtered fasta files per library.
└── samples //directory with library raw data and QC reports
└── <LIB_NAME>
├── qc
│ ├── fw_fastqc.html
│ └── rv_fastqc.html
└── rawdata
├── fw.fastq -> /path/to/raw/fw-reads.fq
└── rv.fastq -> /path/to/raw/rv-reads.fq
Please notice that we are listing only the most important files along with the previous file structure. In practice, you will notice some other files, most of them are just intermediate files used mainly for track and trace the progress, performance information or log some results.
Snakemakes already provides all the necessary capabilities to deploy any workflow in cluster environments, therefore we advice you to be aware of the latest Snakemake documentation to take the most advantage of this utility. An example of how to execute Cascabel in a Sun Grid Engine queuing system:
snakemake --cluster " qsub -q all.q@n0065 -l h_vmem=50G -pe threaded 16" --configfile config.yaml
In this example, please note that the --cluster argument (the string within quotes) must start with a whitespace. Otherwise, it can be interpreted as normal Snakemake arguments, and the execution will fail. Within this string, any cluster argument can be provided, for instance, in the example above -q represents the name of the node where the job is going to run (it must be a valid name according to your system), -l- specifies the maximum RAM memory and -pe threaded will denote the parallel environment and the number of cores. Please be aware that these arguments, need to be supplied according to your queuing system specifications.
Snakemake creates a job script for each rule to execute within the workflow, therefore all the arguments passed to the --cluster option are inherited to each one of these rules. However, you can provide specific arguments per rule by providing a cluster-configuration file, although it is also recommended to create a snakemake profile.
The tools and methods for bioinformatic analysis are very diverse and they are constantly changing and improving, therefore CASCABEL may not meet the current needs of some users, or maybe the user wants to perform some of the steps with a different or newer tool. In that case, we recommend the user to try to adapt this current work-flow according to her/his needs.
Snakemake is a very easy to understand work-flow language. It extends the Python syntax, adding syntactic structures for rule definition and additional controls. Just to give the user a very general idea, Snakemake rules have a name and a number of directives, the most important are: input, output and shell. The input and output directives are followed by lists of files that are expected to be used or created by the rule. The shell directive is followed by a Python string containing the shell command to execute. For more information on how to use Snakemake we recommend the following tutorial
Wiki
Other Resources