-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
69 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,71 @@ | ||
# mxdx | ||
Generalized multiplexing / demultiplexing for FASTQ/FASTA/SAM | ||
|
||
`mxdx` allows for multiplexing paired or unpaired data at the sequence level. | ||
The specific problem `mxdx` solves running the need to run sets of samples | ||
against tools with high start up costs. By multiplexing, the start up cost | ||
(e.g., `bowtie2` with a large database) can be amortized. Multiplexing is | ||
performed at the per-record basis which balances the cost of processing | ||
an individual work unit, and avoids intermediate (write) IO. | ||
|
||
`mxdx` supports multiplexing and demultiplexing FASTQ/FASTA/SAM data. The user | ||
can control how paired data are emitted: | ||
|
||
- interleaved on a per record basis | ||
- sequentially, first R1 then R2 reads are multiplexed | ||
- R1 only where only R1 reads are multiplexed | ||
- R2 only where only R2 reads are multiplexed | ||
|
||
Because `mxdx` operates on a per-record basis, it is necessary to know the | ||
number of total records up front in order to determine which specific records | ||
to pull from a particular file. | ||
|
||
# Installation | ||
|
||
To install from pypi: | ||
|
||
``` | ||
$ pip install mxdx | ||
``` | ||
|
||
To install from github: | ||
|
||
``` | ||
$ git clone https://github.com/wasade/mxdx.git | ||
$ cd mxdx | ||
$ pip install -e . | ||
``` | ||
|
||
# Design considerations | ||
|
||
IO is handled in separate processes for all actions. Specifically, we read | ||
and decompress in one process, and write in a separate process. We do not use | ||
threads to avoid the GIL -- while generally, the GIL is fine for IO bound | ||
tasks, some compression schemes like `lzma` are costly. The use of multiple | ||
processes allows, if necessary in the future, an easier trajectory to read | ||
or write many files at once which may be viable on high performance | ||
file systems. | ||
|
||
The specific type of file being processed, and its compression, is inferred. | ||
As a result, the user does not need to provide these details. However, `mxdx` | ||
cannot mix and match compression schemes or data types. | ||
|
||
On demultiplexing, samples which are completely represented within a batch | ||
are written entirely. Samples which are partially represented by a batch, | ||
such that some records are processed in batch N and some in batch N+1, | ||
are written as partial files. Partials are an unfortunate necessity without | ||
a mechanism to orchestrate IO across independent jobs in a high performance | ||
compute environment. To resolve partials, we provide a command with `mxdx` | ||
called `consolidate-partials` to concatenate them in batch order. | ||
**NOTE**: `mxdx` does not `rm` the partial files, that is the responsibility | ||
of the user. | ||
|
||
# Usage example bash | ||
|
||
A round trip usage example can be found in `usage-test.sh`. This test is | ||
executed by CI. | ||
|
||
# Usage example SLURM | ||
|
||
A usage example using SLURM, which is not executed by CI, can be found in | ||
`examples/`. |