DOC: readme

wasade · May 29, 2024 · 37057d8
1 parent 27f6894
commit 37057d8
Showing 1 changed file with 69 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -1,2 +1,71 @@
 # mxdx
 Generalized multiplexing / demultiplexing for FASTQ/FASTA/SAM
+
+`mxdx` multiplexes paired or unpaired data at the sequence level.
+The specific problem `mxdx` solves is the need to run sets of samples
+against tools with high startup costs. By multiplexing, the startup cost
+(e.g., loading a large database in `bowtie2`) can be amortized. Multiplexing is
+performed on a per-record basis, which balances the cost of processing
+an individual work unit and avoids intermediate (write) IO.
+
+`mxdx` supports multiplexing and demultiplexing FASTQ/FASTA/SAM data. The user
+can control how paired data are emitted:
+
+- interleaved on a per-record basis
+- sequentially, where R1 reads are multiplexed first, then R2 reads
+- R1 only, where only R1 reads are multiplexed
+- R2 only, where only R2 reads are multiplexed
+
+Because `mxdx` operates on a per-record basis, it is necessary to know the
+total number of records up front in order to determine which specific records
+to pull from a particular file.
+
+# Installation
+
+To install from PyPI:
+
+```
+$ pip install mxdx
+```
+
+To install from GitHub:
+
+```
+$ git clone https://github.com/wasade/mxdx.git
+$ cd mxdx
+$ pip install -e .
+```
+
+# Design considerations
+
+IO is handled in separate processes for all actions. Specifically, we read
+and decompress in one process, and write in a separate process. We use
+processes rather than threads to avoid the GIL: while the GIL is generally
+fine for IO-bound tasks, some compression schemes like `lzma` are costly.
+Multiple processes also leave an easier path, if needed in the future, to
+reading or writing many files at once, which may be viable on
+high-performance file systems.
+
+The specific type of file being processed and its compression are inferred.
+As a result, the user does not need to provide these details. However, `mxdx`
+cannot mix and match compression schemes or data types.
+
+On demultiplexing, samples which are completely represented within a batch
+are written entirely. Samples which are only partially represented in a batch,
+such that some records are processed in batch N and some in batch N+1,
+are written as partial files. Partials are an unfortunate necessity without
+a mechanism to orchestrate IO across independent jobs in a high-performance
+compute environment. To resolve partials, `mxdx` provides a command
+called `consolidate-partials` that concatenates them in batch order.
+**NOTE**: `mxdx` does not `rm` the partial files; removing them is the
+responsibility of the user.
+
+# Usage example bash
+
+A round trip usage example can be found in `usage-test.sh`. This test is 
+executed by CI.
+
+# Usage example SLURM
+
+A usage example using SLURM, which is not executed by CI, can be found in
+`examples/`.
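
As a rough illustration of why the README says the total record count must be known up front, the sketch below splits a set of files into batches of near-equal record counts and reports which record range each batch would pull from each file. It is not taken from `mxdx`: the function name `plan_batches`, the file names, and the ceiling-division split are all assumptions made for this example.

```python
def plan_batches(record_counts, n_batches):
    """Assign contiguous record ranges to batches (hypothetical helper).

    record_counts: list of (filename, number_of_records), in multiplexing order.
    Returns one list per batch of (filename, start, stop) half-open ranges.
    """
    total = sum(count for _, count in record_counts)
    # Ceiling division so that every record lands in some batch.
    batch_size = -(-total // n_batches)
    batches = [[] for _ in range(n_batches)]
    offset = 0  # global index of the first record of the current file
    for name, count in record_counts:
        start = 0
        while start < count:
            batch = (offset + start) // batch_size
            # Number of records this batch can still accept.
            room = (batch + 1) * batch_size - (offset + start)
            stop = min(count, start + room)
            batches[batch].append((name, start, stop))
            start = stop
        offset += count
    return batches


# Assumed example input: three files with 4, 5, and 3 records, two batches.
plan = plan_batches([("a.fastq", 4), ("b.fastq", 5), ("c.fastq", 3)], 2)
for number, ranges in enumerate(plan):
    print(number, ranges)
```

In this example `b.fastq` is split across batches 0 and 1, which is the situation the README describes as producing partial files on demultiplexing.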