- Atria
- R and R packages argparse, plotly, and ggsci. It is required by
atria statplot
.Make sure
Rscript
is in PATH.
Useinstall.packages("XXX")
in R session to install a package. (Replace XXX with a package name)
- When benchmarking adapter-trimming with other adapter trimmers, please use big fastq files and add
--no-tail-n-trim --max-n=-1 --no-quality-trim --no-length-filtration
to prevent non-adapter trimming in Atria. It is recommended to use GNUtime
command to evaluate time and CPU usage (/usr/bin/time -v
). User + system time is a better indicator than elapsed time. Atria also recordsprogram-initializing-time
,file-initializing-time
,read-processing-time
, andpost-processing-time
in*.atria.log.json
. - Different trimming options may influence adapter trimming accuracy and run time. It is not recommended to change the k-mer related arguments.
Atria integrates benchmark programs. See available programs, run atria prog
.
Available programs:
atria Pair-end trimming software (default)
simulate Generate artificial pair-end reads
randtrim Randomly trim R1 or R2 at a random position
readstat Collect trimming statistics
(reads should be generated by `atria simulate`)
statplot Plot trimming statistics
(`Rscript` in PATH required & `plotly` package installed in R)
test Test Atria program
p | prog Show this program list
atria simulate
is designed to simulate paired-end reads. You can customize adapter sequences of R1 and R2, sequence length, repeat times, as well as different insert sizes and error rates.
Print the help page, using atria simulate -h
usage: <PROGRAM> [-o PREF] [-x REPEAT] [-a SEQ] [-A SEQ]
[-s SEQ-LENGTH]
[-i INSERT-SIZE-RANGE [INSERT-SIZE-RANGE...]]
[-S SUBSITUTION-RATE [SUBSITUTION-RATE...]]
[-I INSERTION-RATE [INSERTION-RATE...]]
[-D DELETION-RATE [DELETION-RATE...]] [-h]
optional arguments:
-h, --help show this help message and exit
output:
-o, --prefix PREF prefix of output fastq files (default:
"read_simulation")
simulation:
-x, --repeat REPEAT repeat times for each case (type: Int64,
default: 30000)
-a, --adapter1 SEQ read 1 adapter (default:
"AGATCGGAAGAGCACACGTCTGAACTCCAGTCA")
-A, --adapter2 SEQ read 2 adapter (default:
"AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT")
-s, --seq-length SEQ-LENGTH
a given sequence length; simulated sequence
length might be 1 base more than the value
because of simulated phasing error (type:
Int64, default: 100)
-i, --insert-size-range INSERT-SIZE-RANGE [INSERT-SIZE-RANGE...]
range of insert size (type: Int64, default:
[80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100,
102, 104, 106, 108, 110, 112, 114, 116, 118,
120])
-S, --subsitution-rate SUBSITUTION-RATE [SUBSITUTION-RATE...]
subsitution rate per base. it is random for
each base. error type includs mismatch (type:
Float64, default: [0.001, 0.002, 0.003, 0.004,
0.005])
-I, --insertion-rate INSERTION-RATE [INSERTION-RATE...]
insertion rate; number of arg should be the
same as --subsitution-rate (type: Float64,
default: [1.0e-5, 2.0e-5, 3.0e-5, 4.0e-5,
5.0e-5])
-D, --deletion-rate DELETION-RATE [DELETION-RATE...]
deletion rate; number of arg should be the
same as --subsitution-rate (type: Float64,
default: [1.0e-5, 2.0e-5, 3.0e-5, 4.0e-5,
5.0e-5])
For example, create paired-end fastq files with the default setting:
atria simulate -o read_simulation
# ┌ Info: read simulation: output files
# │ r1 = "read_simulation.R1.fastq"
# └ r2 = "read_simulation.R2.fastq"
# ┌ Info: read simulation: all done
# └ elapsed = 2.0790791511535645
The first line of every read in the FastQ file records the read simulation attributes. Here is an example:
@PeReadSimulator2:69:69 TRUE=220 INSERT_SIZE=220 ERROR_RATE=0.00254 SEQ_LENGTH=251 ERROR_INSERT=1 ERROR_ADAPTER=0 SUB=0.0025 INS=2.0e-5 DEL=2.0e-5
It has seven fields:
Field | Description |
---|---|
@PeReadSimulator:6268:868 | The unique read ID in the FastQ file. |
TRUE=220 | The length of the real inserted DNA. |
INSERT_SIZE=220 | The length of the original inserted DNA before error simulation. (Indel might be introduced in error simulation, so the real insert size might vary.) |
ERROR_RATE=0.00254 | The total error rate for base simulation. |
SEQ_LENGTH=251 | The raw sequence length. |
ERROR_INSERT=1 | The number of errors in the inserted DNA. |
ERROR_ADAPTER=0 | The number of errors in the adapter. |
SUB=0.0025 | The error rate of substitution |
INS=2.0e-5 | The error rate of insert |
DEL=2.0e-5 | The error rate of deletion |
In some cases, R1 and R2 are not in the same length because of the poly N tails or other problems. If the sequence lengths of R1 and R2 are different, trimmers might not work as well as a uniform sequence length. Random-trim test is designed to examine the performance of trimmers by randomly trimming one of the paired reads at a random position.
It should be used after randtrim
and before adapter trimming.
For example, input read_simulation.R1.fastq and read_simulation.R2.fastq.
atria randtrim read_simulation.R1.fastq read_simulation.R2.fastq
The output files are read_simulation.R1.randtrim.fastq and read_simulation.R2.randtrim.fastq.
Running Atria or other trimmers. Please make sure each folder contains results from a single trimmer.
Please only allow adapter trimming and disable other trimming methods. For Atria: use
--no-tail-n-trim --max-n=-1 --no-quality-trim --no-length-filtration
atria -r read_simulation.R1.fastq -R read_simulation.R2.fastq --no-tail-n-trim --max-n=-1 --no-quality-trim --no-length-filtration -o Atria-result
Please make sure trimmers do not shuffle reads, which is quite common in multi-threading implementations in some trimmers.
Atria readstat
collects adapter-trimming metrics. Please make sure trimmers do not shuffle reads, which is quite common in multi-threading implementations in some trimmers.
atria readstat trimed.R1.fastq trimmed.R2.fastq
For each FastQ file, a detail stats for every read and a summary stats are saved.
The detail stats is a tab-delimited matrix:
Column name | Description |
---|---|
seq_id | The read ID. (Letters after a blank space is removed.) |
seq_length | The raw sequence length. |
insert_size | The length of the original inserted DNA before error simulation. (Indel might be introduced in error simulation, so the real insert size might vary.) |
error_rate | The error rate for base simulation. |
error_insert | The number of errors in the insert. |
error_adapter | The number of errors in the adapter. |
true_length | The real insert size after error simulation. |
trimmed_length | The read length after adapter trimming. |
delta_length | true_length - trimmed_length. |
is_trim_successful | If true_length = trimmed_length, true; otherwise, false. |
The summary stats is a tab-delimited matrix:
Column name | Description |
---|---|
seq_length | The read ID. (Letters after a blank space is removed.) |
insert_size | The length of the original insert DNA before error simulation. (Indel might be introduced in error simulation, so the real insert size might vary.) |
error_rate | The error rate for base simulation. |
repeat | The number of read pairs found with the same seq_length, insert_size and error_rate. |
precision | The rate of reads trimmed at the true position. |
rate_overtrim | The rate of reads trimmed all adapter letters and some insert DNA letters. |
rate_undertrim | The rate of reads with adapter letters untrimmed. |
deviation | The median trimming deviation of imprecisely-trimmed reads. |
deviation_overtrim | The median trimming deviation of over-trimmed reads. |
deviation_undertrim | The median trimming deviation of under-trimmed reads. |
rate_precision_in1 | The rate of reads trimmed at the true position and adjacent 1 bp. |
rate_overtrim_gt1 | The rate of reads trimmed all adapter letters and more than 1 bp insert DNA letters. |
rate_undertrim_gt1 | The rate of reads with more than 1 bp adapter letters untrimmed. |
atria statplot
is an interactive plotter for summary stats.
Print the help message, use atria statplot
:
usage: atria statplot [-h] [-i STAT.TSV [STAT.TSV ...]] [-o PREF]
[-l DIR|DIR2|BASE|.] [-a INT] [-F]
Trimming performance plots with stat.tsv files
optional arguments:
-h, --help show this help message and exit
-i STAT.TSV [STAT.TSV ...], --input STAT.TSV [STAT.TSV ...]
input stat.tsv files; if auto, search files ended with
r12.stat.tsv
-o PREF, --outpref PREF
Prefix of output files (default: trimmer_accuracy)
-l DIR|DIR2|BASE|., --legend DIR|DIR2|BASE|.
legend name: DIR = dir name, BASE = base name, DIR2 = second dir name (default: DIR)
-a INT, --adapter-length INT
Main adapter length for stats
-F, --no-format Stop formatting legend names (keep ^[A-Za-z]*, first
uppercase)
Examples:
- Input is TRIMMER/xxx.stat.tsv
ls -ahl */*stat.tsv
# AdapterRemoval-3/adapterremoval.pair1.fq.r12.stat.tsv
# Atria/reads_diff_indel.R1.atria.fastq.r12.stat.tsv
# Atropos/reads_diff_indel.R1.fastq.atropos.fq.r12.stat.tsv
# fastp/out.fastp.r1.fq.r12.stat.tsv
# SeqPurge/reads_diff_indel.R1.fastq.seqpurge.fq.r12.stat.tsv
# Skewer/Skewer-trimmed-pair1.fastq.r12.stat.tsv
# TrimGalore/reads_diff_indel.R1_val_1.fq.r12.stat.tsv
# Trimmomatic/out-pair1.paired.fq.r12.stat.tsv
# Trimmomatic/out-pair2.paired.fq.r12.stat.tsv
atria statplot -i */*stat.tsv -l DIR
- Include adapter length: adapter_length_INT has to be in folder names.
ls adapter_length_*/Atria/*stat.tsv
# adapter_length_16/Atria/reads_diff_indel.R1.atria.fastq.r12.stat.tsv
# adapter_length_20/Atria/reads_diff_indel.R1.atria.fastq.r12.stat.tsv
# adapter_length_24/Atria/reads_diff_indel.R1.atria.fastq.r12.stat.tsv
# adapter_length_28/Atria/reads_diff_indel.R1.atria.fastq.r12.stat.tsv
# adapter_length_33/Atria/reads_diff_indel.R1.atria.fastq.r12.stat.tsv
atria statplot -i adapter_length_*/Atria/*stat.tsv -l DIR2
The output HTML file contains six interactive plots.The layout is:
A1 A2
B1 B2
C1 C2
A1, B1 and C1 are statistics for reads with adapter contamination, while A2, B2, C2 for reads without adapters.
A1 and A2 show the accumulated rates of accurate trim, 1 bp over trim, 1 bp under trim, multiple bp over trim and multiple bp under trim.
C1 and C2 show the trimming accuracy on different adapter length.