Commit
* update libraries
* merge from master
* paper
* updated rand module
* update benchmark
* individual benchmark scripts
* benchmarking and some paper revision
* fixed figure syntax
* updated benchmarking figure
* updated benchmarking figure
* m
* version bump
* paper pdf
* m
Showing 27 changed files with 602 additions and 393 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,4 +10,3 @@ Cargo.lock | |
**/*.rs.bk | ||
|
||
tests/hyperfine | ||
paper/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,6 @@ | ||
[package] | ||
name = "fasten" | ||
version = "0.6.0" | ||
version = "0.7.1" | ||
authors = ["Lee Katz <[email protected]>"] | ||
#license-file = "LICENSE" | ||
license = "MIT" | ||
|
@@ -96,12 +96,11 @@ name = "fasten_normalize" | |
path = "src/bin/fasten_normalize.rs" | ||
|
||
[dependencies] | ||
regex = "0.2.10" | ||
regex = "1.10" | ||
getopts = "0.2.21" | ||
statistical = "0.1.1" | ||
statistical = "1.0" | ||
multiqueue = "0.3.2" | ||
rand = "0.4" | ||
rand = "0.8" | ||
fastq = "0.6" | ||
threadpool = "1.8.1" | ||
bam = "0.1.4" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
@software{Peter_hyperfine_2023, | ||
author = {Peter, David}, | ||
license = {MIT}, | ||
month = mar, | ||
title = {{hyperfine}}, | ||
url = {https://github.com/sharkdp/hyperfine}, | ||
version = {1.16.1}, | ||
year = {2023} | ||
} | ||
|
||
@article{seqkit, | ||
doi = {10.1371/journal.pone.0163962}, | ||
author = {Shen, Wei AND Le, Shuai AND Li, Yan AND Hu, Fuquan}, | ||
journal = {PLOS ONE}, | ||
publisher = {Public Library of Science}, | ||
title = {SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation}, | ||
year = {2016}, | ||
month = {10}, | ||
volume = {11}, | ||
url = {https://doi.org/10.1371/journal.pone.0163962}, | ||
pages = {1-10}, | ||
abstract = {FASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and protein sequences. Common manipulations of FASTA/Q file include converting, searching, filtering, deduplication, splitting, shuffling, and sampling. Existing tools only implement some of these manipulations, and not particularly efficiently, and some are only available for certain operating systems. Furthermore, the complicated installation process of required packages and running environments can render these programs less user friendly. This paper describes a cross-platform ultrafast comprehensive toolkit for FASTA/Q processing. SeqKit provides executable binary files for all major operating systems, including Windows, Linux, and Mac OSX, and can be directly used without any dependencies or pre-configurations. SeqKit demonstrates competitive performance in execution time and memory usage compared to similar tools. The efficiency and usability of SeqKit enable researchers to rapidly accomplish common FASTA/Q file manipulations. SeqKit is open source and available on Github at https://github.com/shenwei356/seqkit.}, | ||
number = {10}, | ||
|
||
} | ||
|
||
@software{seqtk, | ||
author = {Li, Heng}, | ||
license = {MIT}, | ||
title = {{seqtk}}, | ||
url = {https://github.com/lh3/seqtk}, | ||
year = {2023} | ||
} | ||
|
||
@software{fastx, | ||
author = {Gordon}, | ||
license = {AGPL}, | ||
title = {{fastx toolkit}}, | ||
url = {http://hannonlab.cshl.edu/fastx_toolkit/index.html}, | ||
year = {2014} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
--- | ||
title: 'Fasten with Pipes' | ||
tags: | ||
- command line | ||
- fastq manipulation | ||
- interleaved fastq | ||
authors: | ||
- name: Lee S. Katz | ||
affiliation: "1, 2" | ||
orcid: 0000-0002-2533-9161 | ||
- name: Henk C. den Bakker | ||
orcid: 0000-0002-4086-1580 | ||
affiliation: 1 | ||
affiliations: | ||
- name: Enteric Diseases Laboratory Branch (EDLB), Centers for Disease Control and Prevention, Atlanta, GA, USA | ||
index: 1 | ||
- name: Center for Food Safety, University of Georgia, Griffin, GA, USA | ||
index: 2 | ||
bibliography: paper.bib | ||
--- | ||
|
||
## Background | ||
|
||
There are still many gaps in basic command line bioinformatics for standard file formats. | ||
Bioinformaticians have been able to use many tools to manipulate sequence data files in the fastq format, such as `seqkit` [@seqkit], `seqtk` [@seqtk] or FASTX-Toolkit [@fastx]. | ||
These tools only accept paired end (PE) sequence data when split into multiple files per sample. | ||
Additionally, these tools do not always support Unix-style pipes; some require explicit input/output options instead of reading from `stdin` and writing to `stdout`. | ||
However, some bioinformaticians prefer to combine PE data from a single sample into one file using the interleaved fastq file format, but this format is not always well supported in mainstream tools. | ||
Here, we provide Fasten to the community to address these needs. | ||
|
||
## Materials | ||
|
||
We leveraged the Cargo packaging system in Rust to create a basic framework for interleaved fastq file manipulation. | ||
Each executable reads from `stdin` and prints reads to `stdout` and only performs one function at a time. | ||
The core executables perform these fundamental functions: 1) converting to and from interleaved format, 2) converting to and from other sequence file formats, 3) ‘straightening’ fastq files to a more standard 4-line-per-entry format. | ||
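
To make the interleaved format concrete, the following is a minimal sketch of interleaving, written only with the Rust standard library. It is not Fasten's implementation: the names `read_record` and `interleave` are hypothetical, and it assumes that both inputs are already in the 4-line-per-entry format, hold the same number of reads in the same order, and contain no blank lines.

```rust
use std::io::{self, BufRead, Write};

/// Read one 4-line FASTQ record (header, sequence, plus line, qualities).
/// Returns Ok(None) at a clean end of input.
/// Hypothetical helper for illustration, not part of Fasten.
fn read_record<R: BufRead>(reader: &mut R) -> io::Result<Option<[String; 4]>> {
    let mut record: [String; 4] = Default::default();
    for (i, line) in record.iter_mut().enumerate() {
        if reader.read_line(line)? == 0 {
            if i == 0 {
                return Ok(None); // no more records
            }
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "truncated FASTQ record",
            ));
        }
        // Drop the trailing line ending.
        while line.ends_with('\n') || line.ends_with('\r') {
            line.pop();
        }
    }
    Ok(Some(record))
}

/// Write R1 and R2 records alternately, so each pair occupies
/// eight consecutive lines. Returns the number of pairs written.
fn interleave<A: BufRead, B: BufRead, W: Write>(
    mut r1: A,
    mut r2: B,
    mut out: W,
) -> io::Result<u64> {
    let mut pairs = 0;
    loop {
        match (read_record(&mut r1)?, read_record(&mut r2)?) {
            (Some(fwd), Some(rev)) => {
                for line in fwd.iter().chain(rev.iter()) {
                    writeln!(out, "{}", line)?;
                }
                pairs += 1;
            }
            (None, None) => return Ok(pairs),
            _ => {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "R1 and R2 contain different numbers of reads",
                ))
            }
        }
    }
}

fn main() -> io::Result<()> {
    // In-memory stand-ins for the R1 and R2 files.
    let r1 = "@read1/1\nACGT\n+\nIIII\n";
    let r2 = "@read1/2\nTTGA\n+\nFFFF\n";
    let stdout = io::stdout();
    let pairs = interleave(r1.as_bytes(), r2.as_bytes(), stdout.lock())?;
    eprintln!("interleaved {} pair(s)", pairs);
    Ok(())
}
```

Because the functions are generic over `BufRead` and `Write`, the same code could read from files or `stdin` and write to `stdout`, which is the property that lets such a step sit in the middle of a pipeline.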
|
||
There are 20 executables including but not limited to read metric generation, read cleaning, kmer counting, read validation, and regular expressions for interleaved fastq files. | ||
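
As an illustration of one of the listed functions, kmer counting can be sketched in a few lines of standard-library Rust. This is an assumption-laden example rather than Fasten's own code: `count_kmers` is a hypothetical name, the sequence is assumed to be ASCII, and reverse complements are not merged.

```rust
use std::collections::HashMap;

/// Count every overlapping substring of length `k` in `seq`.
/// Hypothetical helper for illustration; assumes an ASCII sequence.
fn count_kmers(seq: &str, k: usize) -> HashMap<&str, u32> {
    let mut counts = HashMap::new();
    if k == 0 || k > seq.len() {
        return counts; // no kmers of this size exist
    }
    for start in 0..=seq.len() - k {
        *counts.entry(&seq[start..start + k]).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = count_kmers("ACGTACG", 3);
    // Sort for stable output, since HashMap iteration order is unspecified.
    let mut sorted: Vec<_> = counts.into_iter().collect();
    sorted.sort();
    for (kmer, n) in sorted {
        println!("{}\t{}", kmer, n);
    }
}
```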
|
||
We have also taken advantage of Rust to make comprehensive and standardized documentation. | ||
Continuous integration was implemented in GitHub Actions for unit testing, containerizing, and benchmarking. | ||
Benchmarking was performed against other mainstream packages with `hyperfine`, using 20 replicates and 2 burn-ins [@Peter_hyperfine_2023]. | ||
|
||
## Results | ||
|
||
Documentation, the container, and code are available on GitHub. Benchmarking results are plotted in \autoref{fig:benchmarks}. | ||
|
||
![Benchmarks comparing Fasten with other analogous tools. From left to right, then top to bottom: trimming with a minimum quality score; converting fastq to fasta; interleaving R1 and R2 reads; kmer counting; normalizing read depth using kmer coverage; searching for a sequence in a fastq file; downsampling reads; sorting fastq entries by either sequence or ID; and converting nonstandard fastq files to a format whose entries are four lines each, and selecting the first 100.\label{fig:benchmarks}](benchmarks.png) | ||
|
||
## Conclusions | ||
|
||
Fasten is a powerful manipulation suite for interleaved fastq files, written in Rust. | ||
We benchmarked Fasten on several categories. | ||
It has strengths, as shown in Figure 1, but it is not the fastest tool in every case. | ||
Its major strengths include its competitive speeds, | ||
Unix-style pipes, | ||
paired-end handling, | ||
and the advantages afforded by the Rust language including documentation and stability. | ||
|
||
Fasten offers a comprehensive manual, continuous integration, and integration into the command line with Unix pipes. | ||
It is well poised to be a crucial module for daily work on the command line. | ||
|
||
## Acknowledgements | ||
|
||
Thank you, John Phan, for creating the Docker container. | ||
|
||
## References |