Commit

DOC: readme
wasade committed May 29, 2024
1 parent 27f6894 commit 37057d8
Showing 1 changed file with 69 additions and 0 deletions.
69 changes: 69 additions & 0 deletions README.md
@@ -1,2 +1,71 @@
# mxdx
Generalized multiplexing / demultiplexing for FASTQ/FASTA/SAM

`mxdx` allows for multiplexing paired or unpaired data at the sequence level.
The specific problem `mxdx` solves is the need to run sets of samples
against tools with high startup costs. By multiplexing, the startup cost
(e.g., `bowtie2` with a large database) can be amortized. Multiplexing is
performed on a per-record basis, which balances the cost of processing
an individual work unit and avoids intermediate (write) IO.

`mxdx` supports multiplexing and demultiplexing FASTQ/FASTA/SAM data. The user
can control how paired data are emitted:

- interleaved on a per record basis
- sequentially, first R1 then R2 reads are multiplexed
- R1 only where only R1 reads are multiplexed
- R2 only where only R2 reads are multiplexed
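
As a rough illustration of the four emission orders above, the following sketch orders already-loaded records. The function name and mode strings are hypothetical and are not taken from the `mxdx` API.

```python
from itertools import chain


def emit_paired(r1, r2, mode):
    """Return records from paired inputs in the requested order.

    `r1` and `r2` are lists of records; `mode` is one of the
    illustrative strings below (not mxdx's actual option names).
    """
    if mode == "interleaved":
        if len(r1) != len(r2):
            raise ValueError("interleaving requires equal R1/R2 counts")
        # R1 and R2 of the same pair end up adjacent
        return list(chain.from_iterable(zip(r1, r2)))
    if mode == "sequential":
        # all R1 records first, then all R2 records
        return list(r1) + list(r2)
    if mode == "r1-only":
        return list(r1)
    if mode == "r2-only":
        return list(r2)
    raise ValueError(f"unknown mode: {mode}")
```

Interleaving keeps mates adjacent, which some downstream tools expect, while the sequential order lets a tool treat R1 and R2 as two unpaired streams.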

Because `mxdx` operates on a per-record basis, it is necessary to know the
number of total records up front in order to determine which specific records
to pull from a particular file.
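
To see why the record counts are needed, consider one possible way to map a batch onto per-file record ranges. This is a minimal sketch under the assumption that batches are equal-sized slices of the concatenated record stream; `batch_slices` and its arguments are hypothetical and do not describe how `mxdx` actually partitions work.

```python
def batch_slices(record_counts, batch_index, n_batches):
    """Return [(name, start, stop)] record ranges belonging to one batch.

    `record_counts` is an ordered list of (name, n_records) pairs.
    Ranges are half-open and relative to the start of each file.
    """
    total = sum(n for _, n in record_counts)
    # global half-open range [lo, hi) covered by this batch
    lo = total * batch_index // n_batches
    hi = total * (batch_index + 1) // n_batches

    slices = []
    offset = 0  # global index of the first record in the current file
    for name, n in record_counts:
        start = max(lo, offset)
        stop = min(hi, offset + n)
        if start < stop:
            slices.append((name, start - offset, stop - offset))
        offset += n
    return slices
```

With counts of 5, 3, and 4 records split into three batches, the first file contributes records to both batch 0 and batch 1. Without the totals, a job could not tell where its share of the records begins or ends.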

# Installation

To install from PyPI:

```
$ pip install mxdx
```

To install from GitHub:

```
$ git clone https://github.com/wasade/mxdx.git
$ cd mxdx
$ pip install -e .
```

# Design considerations

IO is handled in separate processes for all actions. Specifically, we read
and decompress in one process, and write in a separate process. We use
processes rather than threads to avoid the GIL: while the GIL is generally
fine for IO-bound tasks, some compression schemes like `lzma` are costly.
Using multiple processes also leaves an easier path, if needed in the
future, to reading or writing many files at once, which may be viable on
high-performance file systems.

The specific type of file being processed, and its compression, is inferred.
As a result, the user does not need to provide these details. However, `mxdx`
cannot mix and match compression schemes or data types.
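
One common way to infer compression is to inspect the leading magic bytes of a file. The sketch below uses only the Python standard library and covers gzip, bzip2, and xz. It is an assumption about how such inference could work, not a description of the `mxdx` internals, and the helper names are hypothetical.

```python
import bz2
import gzip
import lzma

# Leading magic bytes for each supported container format
_MAGIC = (
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bz2"),
    (b"\xfd7zXZ\x00", "xz"),
)

_OPENERS = {"gzip": gzip.open, "bz2": bz2.open, "xz": lzma.open, "none": open}


def sniff_compression(head: bytes) -> str:
    """Classify a file by its first few bytes; 'none' if unrecognized."""
    for magic, name in _MAGIC:
        if head.startswith(magic):
            return name
    return "none"


def open_text(path):
    """Open `path` for text reading with the inferred decompressor."""
    with open(path, "rb") as fh:
        kind = sniff_compression(fh.read(6))
    return _OPENERS[kind](path, "rt")
```

Sniffing the content rather than trusting the file extension means a misnamed file is still read correctly.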

On demultiplexing, samples which are completely represented within a batch
are written entirely. Samples which are partially represented by a batch,
such that some records are processed in batch N and some in batch N+1,
are written as partial files. Partials are an unfortunate necessity without
a mechanism to orchestrate IO across independent jobs in a high performance
compute environment. To resolve partials, we provide a command with `mxdx`
called `consolidate-partials` to concatenate them in batch order.
**NOTE**: `mxdx` does not `rm` the partial files; removing them is the
responsibility of the user.
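
Concatenating in batch order can be sketched as follows. The `.partial.<batch>` suffix is a hypothetical naming scheme chosen for illustration; the real naming used by `mxdx` and the behavior of `consolidate-partials` may differ.

```python
import re
import shutil

# Hypothetical naming scheme: "<sample>.partial.<batch number>"
_PARTIAL = re.compile(r"\.partial\.(\d+)$")


def batch_order(paths):
    """Sort partial files by their numeric batch suffix."""
    def key(path):
        match = _PARTIAL.search(path)
        if match is None:
            raise ValueError(f"not a partial file: {path}")
        return int(match.group(1))
    return sorted(paths, key=key)


def concatenate(paths, dest):
    """Append the partials to `dest` in batch order, byte for byte."""
    with open(dest, "wb") as out:
        for path in batch_order(paths):
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

Sorting on the parsed integer matters: a plain lexicographic sort would place batch 10 before batch 2 and scramble the record order.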

# Usage example bash

A round trip usage example can be found in `usage-test.sh`. This test is
executed by CI.

# Usage example SLURM

A usage example using SLURM, which is not executed by CI, can be found in
`examples/`.
