Merge branch 'master' of github.com:lskatz/fasten
lskatz committed Oct 28, 2023
2 parents 0921cac + 235f6eb commit d8af1ec
Showing 27 changed files with 600 additions and 391 deletions.
46 changes: 25 additions & 21 deletions .github/workflows/draft-pdf.yml
Original file line number Diff line number Diff line change
@@ -4,7 +4,7 @@ name: paper formatting
 jobs:
   immem:
     runs-on: ubuntu-latest
-    name: Paper Draft
+    name: IMMEM Draft
     steps:
       - name: Checkout
         uses: actions/checkout@v2
@@ -26,23 +26,27 @@ jobs:
           # PDF. Note, this should be the same directory as the input
           # paper.md
           path: paper/paper.pdf
-# paper:
-# runs-on: ubuntu-latest
-# name: Paper Draft
-# steps:
-# - name: Checkout
-# uses: actions/checkout@v2
-# - name: Build draft PDF
-# uses: openjournals/openjournals-draft-action@master
-# with:
-# journal: joss
-# # This should be the path to the paper within your repo.
-# paper-path: paper/paper.md
-# - name: Upload
-# uses: actions/upload-artifact@v1
-# with:
-# name: paper
-# # This is the output path where Pandoc will write the compiled
-# # PDF. Note, this should be the same directory as the input
-# # paper.md
-# path: paper/paper.pdf
+  joss:
+    runs-on: ubuntu-latest
+    name: JOSS Draft
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v2
+      - name: Build draft PDF
+        uses: openjournals/openjournals-draft-action@master
+        with:
+          journal: joss
+          # This should be the path to the paper within your repo.
+          paper-path: paper/paper.md
+      - name: inspect directory
+        run: |
+          tree
+          find . -type f -name '*paper*'
+      - name: Upload
+        uses: actions/upload-artifact@v1
+        with:
+          name: paper-abstract
+          # This is the output path where Pandoc will write the compiled
+          # PDF. Note, this should be the same directory as the input
+          # paper.md
+          path: paper/paper.pdf
1 change: 0 additions & 1 deletion .gitignore
@@ -10,4 +10,3 @@ Cargo.lock
 **/*.rs.bk
 
 tests/hyperfine
-paper/
5 changes: 2 additions & 3 deletions Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "fasten"
-version = "0.6.0"
+version = "0.7.1"
 authors = ["Lee Katz <[email protected]>"]
 #license-file = "LICENSE"
 license = "MIT"
@@ -100,8 +100,7 @@ regex = "1.10"
 getopts = "0.2.21"
 statistical = "1.0"
 multiqueue = "0.3.2"
-rand = "0.4"
+rand = "0.8"
 fastq = "0.6"
 threadpool = "1.8.1"
 bam = "0.1.4"

3 changes: 1 addition & 2 deletions README.md
@@ -56,8 +56,7 @@ This documentation was built with `cargo docs --no-deps`

 ## Other documentation
 
-* Some workflows are shown in the [one-liners](./docs/one-liners.md) page.
-* Some wrapper scripts are noted in the [scripts](./docs/scripts.md) page.
+* Some wrapper scripts are noted in the [scripts](./scripts.md) page.
 
 ## Fasten script descriptions

Binary file modified paper/benchmarks.png
41 changes: 41 additions & 0 deletions paper/paper.bib
@@ -0,0 +1,41 @@
@software{Peter_hyperfine_2023,
author = {Peter, David},
license = {MIT},
month = mar,
title = {{hyperfine}},
url = {https://github.com/sharkdp/hyperfine},
version = {1.16.1},
year = {2023}
}

@article{seqkit,
doi = {10.1371/journal.pone.0163962},
author = {Shen, Wei AND Le, Shuai AND Li, Yan AND Hu, Fuquan},
journal = {PLOS ONE},
publisher = {Public Library of Science},
title = {SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation},
year = {2016},
month = {10},
volume = {11},
url = {https://doi.org/10.1371/journal.pone.0163962},
pages = {1-10},
abstract = {FASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and protein sequences. Common manipulations of FASTA/Q file include converting, searching, filtering, deduplication, splitting, shuffling, and sampling. Existing tools only implement some of these manipulations, and not particularly efficiently, and some are only available for certain operating systems. Furthermore, the complicated installation process of required packages and running environments can render these programs less user friendly. This paper describes a cross-platform ultrafast comprehensive toolkit for FASTA/Q processing. SeqKit provides executable binary files for all major operating systems, including Windows, Linux, and Mac OSX, and can be directly used without any dependencies or pre-configurations. SeqKit demonstrates competitive performance in execution time and memory usage compared to similar tools. The efficiency and usability of SeqKit enable researchers to rapidly accomplish common FASTA/Q file manipulations. SeqKit is open source and available on Github at https://github.com/shenwei356/seqkit.},
number = {10},

}

@software{seqtk,
author = {Li, Heng},
license = {MIT},
title = {{seqtk}},
url = {https://github.com/lh3/seqtk},
year = {2023}
}

@software{fastx,
author = {Gordon},
license = {AGPL},
title = {{fastx toolkit}},
url = {http://hannonlab.cshl.edu/fastx_toolkit/index.html},
year = {2014}
}
66 changes: 66 additions & 0 deletions paper/paper.md
@@ -0,0 +1,66 @@
---
title: 'Fasten with Pipes'
tags:
  - command line
  - fastq manipulation
  - interleaved fastq
authors:
  - name: Lee S. Katz
    affiliation: "1, 2"
    orcid: 0000-0002-2533-9161
  - name: Henk C. den Bakker
    orcid: 0000-0002-4086-1580
    affiliation: 1
affiliations:
  - name: Enteric Diseases Laboratory Branch (EDLB), Centers for Disease Control and Prevention, Atlanta, GA, USA
    index: 1
  - name: Center for Food Safety, University of Georgia, Griffin, GA, USA
    index: 2
bibliography: paper.bib
---

## Background

There are still many gaps in basic command line bioinformatics for standard file formats.
Bioinformaticians have been able to use many tools to manipulate sequence data files in the fastq format, such as `seqkit` [@seqkit], `seqtk` [@seqtk] or FASTX-Toolkit [@fastx].
These tools accept paired-end (PE) sequence data only when it is split into multiple files per sample.
Additionally, they do not always support Unix-style pipes; some require explicit input/output options instead of using `stdin` and `stdout`.
Some bioinformaticians prefer to combine PE data from a single sample into one file using the interleaved fastq format, but this format is not always well supported in mainstream tools.
Here, we provide Fasten to the community to address these needs.

## Materials

We leveraged the Cargo packaging system in Rust to create a basic framework for interleaved fastq file manipulation.
Each executable reads from `stdin`, prints reads to `stdout`, and performs only one function at a time.
The core executables perform these fundamental functions: 1) converting to and from interleaved format, 2) converting to and from other sequence file formats, and 3) ‘straightening’ fastq files to a more standard 4-line-per-entry format.
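
To illustrate the first of these functions, interleaving can be sketched in a few lines of Rust. This is a minimal sketch using only the standard library, not Fasten's actual implementation: the function name `interleave` is hypothetical, and it assumes both inputs are already in the 4-line-per-entry format with mates in the same order.

```rust
use std::io::{self, BufRead, Write};

/// Hypothetical helper, not part of Fasten: writes each 4-line R1 record
/// followed by its 4-line R2 mate, and returns the number of pairs written.
/// Assumes well-formed 4-line records; stops at the first incomplete record
/// in either input, so a truncated trailing record is dropped.
fn interleave<A: BufRead, B: BufRead, W: Write>(r1: A, r2: B, mut out: W) -> io::Result<usize> {
    let mut lines1 = r1.lines();
    let mut lines2 = r2.lines();
    let mut pairs = 0;
    loop {
        // Pull up to four lines (one fastq entry) from each input.
        let rec1 = lines1.by_ref().take(4).collect::<io::Result<Vec<String>>>()?;
        let rec2 = lines2.by_ref().take(4).collect::<io::Result<Vec<String>>>()?;
        if rec1.len() < 4 || rec2.len() < 4 {
            break;
        }
        // Emit the forward read, then the reverse read.
        for line in rec1.iter().chain(rec2.iter()) {
            writeln!(out, "{}", line)?;
        }
        pairs += 1;
    }
    Ok(pairs)
}
```

Because the function is generic over `BufRead` and `Write`, the same code could read from files or in-memory byte slices and write to a locked `stdout`, which is what lets a tool of this shape sit in the middle of a Unix pipe.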

The 20 executables cover functions including read metric generation, read cleaning, kmer counting, read validation, and regular expression searches on interleaved fastq files.

We have also taken advantage of Rust to make comprehensive and standardized documentation.
Continuous integration was implemented in GitHub Actions for unit testing, containerizing, and benchmarking.
Benchmarking was performed against other mainstream packages with `hyperfine`, using 20 replicates and 2 burn-ins [@Peter_hyperfine_2023].

## Results

Documentation, the container, and code are available on GitHub. Benchmarking results are graphed in \autoref{fig:benchmarks}.

![Benchmarks comparing fasten with other analogous tools. From left to right, then to bottom: trimming with a minimum quality score; converting fastq to fasta; interleaving R1 and R2 reads; kmer counting; normalizing read depth using kmer coverage; searching for a sequence in a fastq file; downsampling reads; sorting fastq entries by either sequence or ID; and converting nonstandard fastq files to a format whose entries are four lines each, and selecting the first 100.\label{fig:benchmarks}](benchmarks.png)

## Conclusions

Fasten is a powerful manipulation suite for interleaved fastq files, written in Rust.
We benchmarked Fasten in several categories.
Its strengths are shown in Figure 1, although it is not the fastest tool in every case.
Its major strengths include its competitive speeds,
Unix-style pipes,
paired-end handling,
and the advantages afforded by the Rust language including documentation and stability.

Fasten offers a comprehensive manual, continuous integration, and integration into the command line with Unix pipes.
It is well poised to be a crucial module for daily work on the command line.

## Acknowledgements

Thank you, John Phan, for creating the Docker container.

## References