Merge branch 'master' of github.com:lskatz/fasten

lskatz · Oct 28, 2023 · d8af1ec
2 parents 0921cac + 235f6eb
Showing 27 changed files with 600 additions and 391 deletions.
diff --git a/.github/workflows/draft-pdf.yml b/.github/workflows/draft-pdf.yml
@@ -4,7 +4,7 @@ name: paper formatting
 jobs:
   immem:
     runs-on: ubuntu-latest
-    name: Paper Draft
+    name: IMMEM Draft
     steps:
       - name: Checkout
         uses: actions/checkout@v2
@@ -26,23 +26,27 @@ jobs:
           # PDF. Note, this should be the same directory as the input
           # paper.md
           path: paper/paper.pdf
-#  paper:
-#    runs-on: ubuntu-latest
-#    name: Paper Draft
-#    steps:
-#      - name: Checkout
-#        uses: actions/checkout@v2
-#      - name: Build draft PDF
-#        uses: openjournals/openjournals-draft-action@master
-#        with:
-#          journal: joss
-#          # This should be the path to the paper within your repo.
-#          paper-path: paper/paper.md
-#      - name: Upload
-#        uses: actions/upload-artifact@v1
-#        with:
-#          name: paper
-#          # This is the output path where Pandoc will write the compiled
-#          # PDF. Note, this should be the same directory as the input
-#          # paper.md
-#          path: paper/paper.pdf
+  joss:
+    runs-on: ubuntu-latest
+    name: JOSS Draft
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v2
+      - name: Build draft PDF
+        uses: openjournals/openjournals-draft-action@master
+        with:
+          journal: joss
+          # This should be the path to the paper within your repo.
+          paper-path: paper/paper.md
+      - name: inspect directory
+        run: |
+          tree
+          find . -type f -name '*paper*'
+      - name: Upload
+        uses: actions/upload-artifact@v1
+        with:
+          name: paper-abstract
+          # This is the output path where Pandoc will write the compiled
+          # PDF. Note, this should be the same directory as the input
+          # paper.md
+          path: paper/paper.pdf
diff --git a/.gitignore b/.gitignore
@@ -10,4 +10,3 @@ Cargo.lock
 **/*.rs.bk
 
 tests/hyperfine
-paper/
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name    = "fasten"
-version = "0.6.0"
+version = "0.7.1"
 authors = ["Lee Katz <[email protected]>"]
 #license-file  = "LICENSE"
 license       = "MIT"
@@ -100,8 +100,7 @@ regex        = "1.10"
 getopts      = "0.2.21"
 statistical  = "1.0"
 multiqueue   = "0.3.2"
-rand         = "0.4"
+rand         = "0.8"
 fastq        = "0.6"
 threadpool   = "1.8.1"
 bam          = "0.1.4"
-
diff --git a/README.md b/README.md
@@ -56,8 +56,7 @@ This documentation was built with `cargo docs --no-deps`
 
 ## Other documentation
 
-* Some workflows are shown in the [one-liners](./docs/one-liners.md) page.
-* Some wrapper scripts are noted in the [scripts](./docs/scripts.md) page.
+* Some wrapper scripts are noted in the [scripts](./scripts.md) page.
 
 ## Fasten script descriptions
 

diff --git a/paper/benchmarks.png b/paper/benchmarks.png
diff --git a/paper/paper.bib b/paper/paper.bib
@@ -0,0 +1,41 @@
+@software{Peter_hyperfine_2023,
+    author = {Peter, David},
+    license = {MIT},
+    month = mar,
+    title = {{hyperfine}},
+    url = {https://github.com/sharkdp/hyperfine},
+    version = {1.16.1},
+    year = {2023}
+}
+
+@article{seqkit,
+    doi = {10.1371/journal.pone.0163962},
+    author = {Shen, Wei AND Le, Shuai AND Li, Yan AND Hu, Fuquan},
+    journal = {PLOS ONE},
+    publisher = {Public Library of Science},
+    title = {SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation},
+    year = {2016},
+    month = {10},
+    volume = {11},
+    url = {https://doi.org/10.1371/journal.pone.0163962},
+    pages = {1-10},
+    abstract = {FASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and protein sequences. Common manipulations of FASTA/Q file include converting, searching, filtering, deduplication, splitting, shuffling, and sampling. Existing tools only implement some of these manipulations, and not particularly efficiently, and some are only available for certain operating systems. Furthermore, the complicated installation process of required packages and running environments can render these programs less user friendly. This paper describes a cross-platform ultrafast comprehensive toolkit for FASTA/Q processing. SeqKit provides executable binary files for all major operating systems, including Windows, Linux, and Mac OSX, and can be directly used without any dependencies or pre-configurations. SeqKit demonstrates competitive performance in execution time and memory usage compared to similar tools. The efficiency and usability of SeqKit enable researchers to rapidly accomplish common FASTA/Q file manipulations. SeqKit is open source and available on Github at https://github.com/shenwei356/seqkit.},
+    number = {10},
+
+}
+
+@software{seqtk,
+    author = {Li, Heng},
+    license = {MIT},
+    title = {{seqtk}},
+    url = {https://github.com/lh3/seqtk},
+    year = {2023}
+}
+
+@software{fastx,
+    author = {Gordon},
+    license = {AGPL},
+    title = {{fastx toolkit}},
+    url = {http://hannonlab.cshl.edu/fastx_toolkit/index.html},
+    year = {2014}
+}
diff --git a/paper/paper.md b/paper/paper.md
@@ -0,0 +1,66 @@
+---
+title: 'Fasten with Pipes'
+tags:
+  - command line
+  - fastq manipulation
+  - interleaved fastq
+authors:
+  - name: Lee S. Katz
+    affiliation: "1, 2"
+    orcid: 0000-0002-2533-9161
+  - name: Henk C. den Bakker
+    orcid: 0000-0002-4086-1580
+    affiliation: 1
+affiliations:
+ - name: Enteric Diseases Laboratory Branch (EDLB), Centers for Disease Control and Prevention, Atlanta, GA, USA
+   index: 1
+ - name: Center for Food Safety, University of Georgia, Griffin, GA, USA
+   index: 2
+bibliography: paper.bib
+---
+
+## Background
+
+There are still many gaps in basic command line bioinformatics for standard file formats.
+Bioinformaticians have been able to use many tools to manipulate sequence data files in the fastq format, such as `seqkit` [@seqkit], `seqtk` [@seqtk] or FASTX-Toolkit [@fastx].
+These tools only accept paired end (PE) sequence data when split into multiple files per sample.
+Additionally, these tools do not always allow for Unix-style pipe file control. Sometimes they require explicit input/output options instead of using `stdin` and `stdout`.
+Moreover, some bioinformaticians prefer to combine PE data from a single sample into one file using the interleaved fastq file format, but this format is not always well supported in mainstream tools.
+Here, we provide Fasten to the community to address these needs.
+
+## Materials
+
+We leveraged the Cargo packaging system in Rust to create a basic framework for interleaved fastq file manipulation.
+Each executable reads from `stdin` and prints reads to `stdout` and only performs one function at a time.
+The core executables perform these fundamental functions: 1) converting to and from interleaved format, 2) converting to and from other sequence file formats, 3) ‘straightening’ fastq files to a more standard 4-line-per-entry format.
+
+There are 20 executables including but not limited to read metric generation, read cleaning, kmer counting, read validation, and regular expressions for interleaved fastq files.
+
+We have also taken advantage of Rust to make comprehensive and standardized documentation.
+Continuous integration was implemented in GitHub Actions for unit testing, containerizing, and benchmarking.
+Benchmarking was performed against other mainstream packages with `hyperfine`, using 20 replicates and 2 burn-ins [@Peter_hyperfine_2023].
+
+## Results
+
+Documentation, the container, and code are available at GitHub. Benchmarking results are graphed in \autoref{fig:benchmarks}.
+
+![Benchmarks comparing fasten with other analogous tools. From left to right, then to bottom: trimming with a minimum quality score; converting fastq to fasta; interleaving R1 and R2 reads; kmer counting; normalizing read depth using kmer coverage; searching for a sequence in a fastq file; downsampling reads; sorting fastq entries by either sequence or ID; and converting nonstandard fastq files to a format whose entries are four lines each, and selecting the first 100.\label{fig:benchmarks}](benchmarks.png)
+
+## Conclusions
+
+Fasten is a powerful manipulation suite for interleaved fastq files, written in Rust.
+We benchmarked Fasten on several categories.
+It has strengths as shown in Figure 1 but it does not occupy the fastest position in all cases.
+Its major strengths include its competitive speeds,
+Unix-style pipes,
+paired-end handling,
+and the advantages afforded by the Rust language including documentation and stability.
+
+Fasten offers a comprehensive manual, continuous integration, and integration into the command line with Unix pipes.
+It is well poised to be a crucial module for daily work on the command line.
+
+## Acknowledgements
+
+Thank you, John Phan, for creating the Docker container.
+
+## References
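
Note on the interleaved format described in the paper's Materials section: an interleaved fastq file stores each R1 record immediately followed by its R2 mate. The following is a minimal sketch of that conversion in Rust, assuming standard 4-line-per-entry records. The `interleave` helper is hypothetical and written only for illustration; it is not taken from fasten's source, and it works on in-memory strings rather than on `stdin`/`stdout` as the fasten executables do.

```rust
/// Interleave two FASTQ texts that use exactly four lines per record.
/// Hypothetical helper for illustration only; not fasten's implementation.
fn interleave(r1: &str, r2: &str) -> Result<String, String> {
    let a: Vec<&str> = r1.lines().collect();
    let b: Vec<&str> = r2.lines().collect();

    // Both inputs must hold the same number of complete 4-line records.
    if a.len() != b.len() || a.len() % 4 != 0 {
        return Err(format!(
            "expected equal numbers of 4-line records, got {} and {} lines",
            a.len(),
            b.len()
        ));
    }

    let mut out = String::new();
    // Emit record i of R1, then record i of R2, and so on.
    for (rec1, rec2) in a.chunks(4).zip(b.chunks(4)) {
        for line in rec1.iter().chain(rec2.iter()) {
            out.push_str(line);
            out.push('\n');
        }
    }
    Ok(out)
}

fn main() {
    // Made-up example reads.
    let r1 = "@read1/1\nACGT\n+\nIIII\n@read2/1\nTTTT\n+\nIIII\n";
    let r2 = "@read1/2\nTGCA\n+\nIIII\n@read2/2\nAAAA\n+\nIIII\n";
    match interleave(r1, r2) {
        Ok(s) => print!("{}", s),
        Err(e) => eprintln!("error: {}", e),
    }
}
```

The sketch rejects inputs whose line counts differ or are not a multiple of four, which is why the paper's 'straightening' step (normalizing to four lines per entry) would need to come first for nonstandard files.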