Skip to content

Latest commit

 

History

History
270 lines (203 loc) · 12.7 KB

3.Benchmark_toolkit.md

File metadata and controls

270 lines (203 loc) · 12.7 KB

Atria Benchmark Toolkit

Dependency

  1. Atria
  2. R and R packages argparse, plotly, and ggsci. It is required by atria statplot.

    Make sure Rscript is in PATH.
    Use install.packages("XXX") in R session to install a package. (Replace XXX with a package name)

Notes for benchmark

  1. When benchmarking adapter-trimming with other adapter trimmers, please use big fastq files and add --no-tail-n-trim --max-n=-1 --no-quality-trim --no-length-filtration to prevent non-adapter trimming in Atria. It is recommended to use GNU time command to evaluate time and CPU usage (/usr/bin/time -v). User + system time is a better indicator than elapsed time. Atria also records program-initializing-time, file-initializing-time, read-processing-time, and post-processing-time in *.atria.log.json.
  2. Different trimming options may influence adapter trimming accuracy and run time. It is not recommended to change the k-mer related arguments.

Benchmark

Atria integrates benchmark programs. See available programs, run atria prog.

Available programs:
    atria       Pair-end trimming software (default)
    simulate    Generate artificial pair-end reads
    randtrim    Randomly trim R1 or R2 at a random position
    readstat    Collect trimming statistics
                    (reads should be generated by `atria simulate`)
    statplot    Plot trimming statistics
                    (`Rscript` in PATH required & `plotly` package installed in R)
    test        Test Atria program
    p | prog    Show this program list

Data simulation

atria simulate is designed to simulate paired-end reads. You can customize adapter sequences of R1 and R2, sequence length, repeat times, as well as different insert sizes and error rates.

Print the help page, using atria simulate -h

usage: <PROGRAM> [-o PREF] [-x REPEAT] [-a SEQ] [-A SEQ]
                 [-s SEQ-LENGTH]
                 [-i INSERT-SIZE-RANGE [INSERT-SIZE-RANGE...]]
                 [-S SUBSITUTION-RATE [SUBSITUTION-RATE...]]
                 [-I INSERTION-RATE [INSERTION-RATE...]]
                 [-D DELETION-RATE [DELETION-RATE...]] [-h]

optional arguments:
  -h, --help            show this help message and exit

output:
  -o, --prefix PREF     prefix of output fastq files (default:
                        "read_simulation")

simulation:
  -x, --repeat REPEAT   repeat times for each case (type: Int64,
                        default: 30000)
  -a, --adapter1 SEQ    read 1 adapter (default:
                        "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA")
  -A, --adapter2 SEQ    read 2 adapter (default:
                        "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT")
  -s, --seq-length SEQ-LENGTH
                        a given sequence length; simulated sequence
                        length might be 1 base more than the value
                        because of simulated phasing error (type:
                        Int64, default: 100)
  -i, --insert-size-range INSERT-SIZE-RANGE [INSERT-SIZE-RANGE...]
                        range of insert size (type: Int64, default:
                        [80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100,
                        102, 104, 106, 108, 110, 112, 114, 116, 118,
                        120])
  -S, --subsitution-rate SUBSITUTION-RATE [SUBSITUTION-RATE...]
                        subsitution rate per base. it is random for
                        each base. error type includs mismatch (type:
                        Float64, default: [0.001, 0.002, 0.003, 0.004,
                        0.005])
  -I, --insertion-rate INSERTION-RATE [INSERTION-RATE...]
                        insertion rate; number of arg should be the
                        same as --subsitution-rate (type: Float64,
                        default: [1.0e-5, 2.0e-5, 3.0e-5, 4.0e-5,
                        5.0e-5])
  -D, --deletion-rate DELETION-RATE [DELETION-RATE...]
                        deletion rate; number of arg should be the
                        same as --subsitution-rate (type: Float64,
                        default: [1.0e-5, 2.0e-5, 3.0e-5, 4.0e-5,
                        5.0e-5])

For example, create paired-end fastq files with the default setting:

atria simulate -o read_simulation
# ┌ Info: read simulation: output files
# │   r1 = "read_simulation.R1.fastq"
# └   r2 = "read_simulation.R2.fastq"
# ┌ Info: read simulation: all done
# └   elapsed = 2.0790791511535645

Format of the first line of each read

The first line of every read in the FastQ file records the read simulation attributes. Here is an example:

@PeReadSimulator2:69:69 TRUE=220 INSERT_SIZE=220 ERROR_RATE=0.00254 SEQ_LENGTH=251 ERROR_INSERT=1 ERROR_ADAPTER=0 SUB=0.0025 INS=2.0e-5 DEL=2.0e-5

It has seven fields:

Field Description
@PeReadSimulator:6268:868 The unique read ID in the FastQ file.
TRUE=220 The length of the real inserted DNA.
INSERT_SIZE=220 The length of the original inserted DNA before error simulation. (Indel might be introduced in error simulation, so the real insert size might vary.)
ERROR_RATE=0.00254 The total error rate for base simulation.
SEQ_LENGTH=251 The raw sequence length.
ERROR_INSERT=1 The number of errors in the inserted DNA.
ERROR_ADAPTER=0 The number of errors in the adapter.
SUB=0.0025 The error rate of substitution
INS=2.0e-5 The error rate of insert
DEL=2.0e-5 The error rate of deletion

(Optional) Random-trim test

In some cases, R1 and R2 are not in the same length because of the poly N tails or other problems. If the sequence lengths of R1 and R2 are different, trimmers might not work as well as a uniform sequence length. Random-trim test is designed to examine the performance of trimmers by randomly trimming one of the paired reads at a random position.

It should be used after randtrim and before adapter trimming.

For example, input read_simulation.R1.fastq and read_simulation.R2.fastq.

atria randtrim read_simulation.R1.fastq read_simulation.R2.fastq

The output files are read_simulation.R1.randtrim.fastq and read_simulation.R2.randtrim.fastq.

Trimming with different software

Running Atria or other trimmers. Please make sure each folder contains results from a single trimmer.

Please only allow adapter trimming and disable other trimming methods. For Atria: use --no-tail-n-trim --max-n=-1 --no-quality-trim --no-length-filtration

atria -r read_simulation.R1.fastq -R read_simulation.R2.fastq --no-tail-n-trim --max-n=-1 --no-quality-trim --no-length-filtration -o Atria-result

Please make sure trimmers do not shuffle reads, which is quite common in multi-threading implementations in some trimmers.

Collecting reads metrics

Atria readstat collects adapter-trimming metrics. Please make sure trimmers do not shuffle reads, which is quite common in multi-threading implementations in some trimmers.

atria readstat trimed.R1.fastq trimmed.R2.fastq

For each FastQ file, a detail stats for every read and a summary stats are saved.

Format of detail stats file

The detail stats is a tab-delimited matrix:

Column name Description
seq_id The read ID. (Letters after a blank space is removed.)
seq_length The raw sequence length.
insert_size The length of the original inserted DNA before error simulation. (Indel might be introduced in error simulation, so the real insert size might vary.)
error_rate The error rate for base simulation.
error_insert The number of errors in the insert.
error_adapter The number of errors in the adapter.
true_length The real insert size after error simulation.
trimmed_length The read length after adapter trimming.
delta_length true_length - trimmed_length.
is_trim_successful If true_length = trimmed_length, true; otherwise, false.

Format of summary stats file

The summary stats is a tab-delimited matrix:

Column name Description
seq_length The read ID. (Letters after a blank space is removed.)
insert_size The length of the original insert DNA before error simulation. (Indel might be introduced in error simulation, so the real insert size might vary.)
error_rate The error rate for base simulation.
repeat The number of read pairs found with the same seq_length, insert_size and error_rate.
precision The rate of reads trimmed at the true position.
rate_overtrim The rate of reads trimmed all adapter letters and some insert DNA letters.
rate_undertrim The rate of reads with adapter letters untrimmed.
deviation The median trimming deviation of imprecisely-trimmed reads.
deviation_overtrim The median trimming deviation of over-trimmed reads.
deviation_undertrim The median trimming deviation of under-trimmed reads.
rate_precision_in1 The rate of reads trimmed at the true position and adjacent 1 bp.
rate_overtrim_gt1 The rate of reads trimmed all adapter letters and more than 1 bp insert DNA letters.
rate_undertrim_gt1 The rate of reads with more than 1 bp adapter letters untrimmed.

Benchmark plots

atria statplot is an interactive plotter for summary stats.

Print the help message, use atria statplot:

usage: atria statplot [-h] [-i STAT.TSV [STAT.TSV ...]] [-o PREF]
                      [-l DIR|DIR2|BASE|.] [-a INT] [-F]

Trimming performance plots with stat.tsv files

optional arguments:
  -h, --help            show this help message and exit
  -i STAT.TSV [STAT.TSV ...], --input STAT.TSV [STAT.TSV ...]
                        input stat.tsv files; if auto, search files ended with
                        r12.stat.tsv
  -o PREF, --outpref PREF
                        Prefix of output files (default: trimmer_accuracy)
  -l DIR|DIR2|BASE|., --legend DIR|DIR2|BASE|.
                        legend name: DIR = dir name, BASE = base name, DIR2 = second dir name (default: DIR)
  -a INT, --adapter-length INT
                        Main adapter length for stats
  -F, --no-format       Stop formatting legend names (keep ^[A-Za-z]*, first
                        uppercase)

Examples:

  • Input is TRIMMER/xxx.stat.tsv
ls -ahl */*stat.tsv
# AdapterRemoval-3/adapterremoval.pair1.fq.r12.stat.tsv
# Atria/reads_diff_indel.R1.atria.fastq.r12.stat.tsv
# Atropos/reads_diff_indel.R1.fastq.atropos.fq.r12.stat.tsv
# fastp/out.fastp.r1.fq.r12.stat.tsv
# SeqPurge/reads_diff_indel.R1.fastq.seqpurge.fq.r12.stat.tsv
# Skewer/Skewer-trimmed-pair1.fastq.r12.stat.tsv
# TrimGalore/reads_diff_indel.R1_val_1.fq.r12.stat.tsv
# Trimmomatic/out-pair1.paired.fq.r12.stat.tsv
# Trimmomatic/out-pair2.paired.fq.r12.stat.tsv

atria statplot -i */*stat.tsv -l DIR
  • Include adapter length: adapter_length_INT has to be in folder names.
ls adapter_length_*/Atria/*stat.tsv
# adapter_length_16/Atria/reads_diff_indel.R1.atria.fastq.r12.stat.tsv
# adapter_length_20/Atria/reads_diff_indel.R1.atria.fastq.r12.stat.tsv
# adapter_length_24/Atria/reads_diff_indel.R1.atria.fastq.r12.stat.tsv
# adapter_length_28/Atria/reads_diff_indel.R1.atria.fastq.r12.stat.tsv
# adapter_length_33/Atria/reads_diff_indel.R1.atria.fastq.r12.stat.tsv

atria statplot -i adapter_length_*/Atria/*stat.tsv -l DIR2

The output HTML file contains six interactive plots.The layout is:

A1   A2
B1   B2
C1   C2

A1, B1 and C1 are statistics for reads with adapter contamination, while A2, B2, C2 for reads without adapters.

A1 and A2 show the accumulated rates of accurate trim, 1 bp over trim, 1 bp under trim, multiple bp over trim and multiple bp under trim.

C1 and C2 show the trimming accuracy on different adapter length.