rearrange README
jeff-k committed Jul 12, 2024
1 parent de8b0ef commit bfbb2aa
Showing 4 changed files with 21 additions and 40 deletions.
55 changes: 17 additions & 38 deletions README.md
@@ -8,14 +8,20 @@
### Bit-packed and well-typed biological sequences
</div>

This crate provides types for representing and operating on sequences of genomic data. Efficient encodings are provided for nucleotides and amino acids and can be extended with the `Codec` trait.

Short sequences of fixed length (kmers) are given special attention.

Add [bio-seq](https://crates.io/crates/bio-seq) to `Cargo.toml`:

### Quick start

```toml
[dependencies]
bio-seq = "0.13"
```

Example: Iterating over the [kmer](https://docs.rs/bio-seq/latest/bio_seq/kmer)s for a [sequence](https://docs.rs/bio-seq/latest/bio_seq/seq):
Iterating over the [kmer](https://docs.rs/bio-seq/latest/bio_seq/kmer)s for a [sequence](https://docs.rs/bio-seq/latest/bio_seq/seq):

```rust
use bio_seq::prelude::*;
@@ -36,16 +42,19 @@ for kmer in seq.revcomp().kmers::<8>() {
// ...
```

Sequences are analogous to rust's string types and follow similar dereferencing conventions.
Sequences are analogous to rust's string types and follow similar dereferencing conventions:

```rust
use bio_seq::prelude::*;
// The `dna!` macro packs a static sequence with 2-bits per symbol at compile time.
// This is analogous to rust's string literals:
let s: &'static str = "hello!";
let seq: &'static SeqSlice<Dna> = dna!("CGCTAGCTACGATCGCAT");

// Static sequences can also be copied as kmers
let kmer: Kmer<Dna, 18> = dna!("CGCTAGCTACGATCGCAT").into();
// or with the kmer! macro:
let kmer = kmer!("CGCTAGCTACGATCGCAT");

// `Seq`s can be allocated on the heap like `String`s are:
let s: String = "hello!".into();
@@ -59,37 +68,6 @@ let slice: &str = &s[1..3];
let seqslice: &SeqSlice<Dna> = &seq[2..4];
```

Example: The [4-bit encoding of IUPAC](https://docs.rs/bio-seq/latest/bio_seq/codec/iupac) nucleotide ambiguity codes naturally represent a set of bases for each position (`0001`: `A`, `1111`: `N`, `0000`: `*`, ...):

```rust
use bio_seq::prelude::*;

let seq = iupac!("AGCTNNCAGTCGACGTATGTA");
let pattern = iupac!("AYG");

for slice in seq.windows(pattern.len()) {
if pattern.contains(slice) {
println!("{} matches pattern", slice);
}
}

// ACG matches pattern
// ATG matches pattern
```
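
To make the "set of bases per position" idea concrete, here is a minimal std-only sketch that does not use the crate. Only `0001` for `A` and `1111` for `N` come from the text above; the bit assignments for `C`, `G` and `T`, the helper names `iupac_bits`, `pattern_contains` and `matching_windows`, and the subset semantics of "contains" are assumptions made for illustration.

```rust
// Assumed one-hot codes: only A = 0001 is taken from the README text.
const A: u8 = 0b0001;
const C: u8 = 0b0010;
const G: u8 = 0b0100;
const T: u8 = 0b1000;

/// Map an IUPAC nucleotide code to the set of bases it stands for.
fn iupac_bits(symbol: char) -> Option<u8> {
    Some(match symbol {
        'A' => A,
        'C' => C,
        'G' => G,
        'T' => T,
        'R' => A | G,
        'Y' => C | T,
        'S' => C | G,
        'W' => A | T,
        'K' => G | T,
        'M' => A | C,
        'B' => C | G | T,
        'D' => A | G | T,
        'H' => A | C | T,
        'V' => A | C | G,
        'N' => A | C | G | T,
        _ => return None,
    })
}

/// True if every position of `candidate` is a subset of the same position in `pattern`.
/// Returns `None` if either string holds a symbol this sketch does not know.
fn pattern_contains(pattern: &str, candidate: &str) -> Option<bool> {
    if pattern.chars().count() != candidate.chars().count() {
        return Some(false);
    }
    for (p, c) in pattern.chars().zip(candidate.chars()) {
        let (p, c) = (iupac_bits(p)?, iupac_bits(c)?);
        // A subset test is a single AND on the bit sets.
        if (p & c) != c {
            return Some(false);
        }
    }
    Some(true)
}

/// Collect every window of `seq` that the pattern contains.
fn matching_windows(seq: &str, pattern: &str) -> Vec<String> {
    if pattern.is_empty() {
        return Vec::new();
    }
    let symbols: Vec<char> = seq.chars().collect();
    symbols
        .windows(pattern.chars().count())
        .map(|w| w.iter().collect::<String>())
        .filter(|w| pattern_contains(pattern, w) == Some(true))
        .collect()
}
```

Calling `matching_windows("AGCTNNCAGTCGACGTATGTA", "AYG")` returns `ACG` and `ATG`, the same two windows the example above reports. Windows containing `N` are rejected because `N` is a superset of `A`, not a subset.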

The goal of this crate is to make handling biological sequence data safe and convenient. The [`TranslationTable`](https://docs.rs/bio-seq/latest/bio_seq/translation/trait.TranslationTable.html) trait implements genetic coding:

```rust
// This is a debruijn sequence of all possible 3-mers:
let seq: Seq<Dna> =
dna!("AATTTGTGGGTTCGTCTGCGGCTCCGCCCTTAGTACTATGAGGACGATCAGCACCATAAGAACAAA");
let aminos: Seq<Amino> = Seq::from_iter(seq.windows(3).map(|codon| translation::STANDARD.to_amino(codon)));
assert_eq!(
aminos,
Seq::<Amino>::try_from("NIFLCVWGGVFSRVSLCARGALSPRAPPLL*SVYTLYM*ERGDTRDISQSAHTPHI*KRENTQK").unwrap()
);
```

## Philosophy

Many bioinformatics crates implement their own kmer packing logic. This project began as a way to reuse kmer construction code and make it compatible between projects. It quickly became apparent that a kmer type doesn't make sense without being tightly coupled to a general data type for sequences. The scope of the crate will be restricted to representing sequences.
@@ -109,7 +87,12 @@ Contributions are very welcome. There's lots of low hanging fruit for optimisati
## [Sequences](https://docs.rs/bio-seq/latest/bio_seq/seq)

Strings of encoded symbols are packed into [`Seq`](https://docs.rs/bio-seq/latest/bio_seq/seq/struct.Seq.html). Slicing, chunking, and windowing return [`SeqSlice`](https://docs.rs/bio-seq/latest/bio_seq/seq/struct.SeqSlice.html). `Seq<A: Codec>` and `&SeqSlice<A: Codec>` are analogous to `String` and `&str`. As with the standard string types, these are stored on the heap. [`Kmer`](https://docs.rs/bio-seq/latest/bio_seq/kmer)s are generally stored on the stack, implementing `Copy`.
All data is stored little-endian. This affects the order that sequences map to the integers ("colexicographic" order):

## [Kmers](https://docs.rs/bio-seq/latest/bio_seq/kmer)

kmers are short sequences of length `k` that can fit into a register (e.g. `usize`, or SIMD vector) and generally implement `Copy`. These are implemented with const generics and `k` is fixed at compile time.

All data is stored little-endian. This affects the order that sequences map to the integers ("colexicographic" order).

```rust
for i in 0..=15 {
@@ -136,10 +119,6 @@ for i in 0..=15 {
15: TTAAA
```
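
The mapping between integers and sequences can be reproduced without the crate. The sketch below assumes the 2-bit codes A=00, C=01, G=10, T=11 with the first base in the least significant bits, which is consistent with the `15: TTAAA` row above. The function names `unpack_kmer` and `pack_kmer` are hypothetical and are not bio-seq's API.

```rust
/// Decode the low `2 * k` bits of `bits` as a DNA string.
/// Assumed encoding: A=00, C=01, G=10, T=11, first base in the lowest bits.
fn unpack_kmer(bits: usize, k: usize) -> String {
    const BASES: [char; 4] = ['A', 'C', 'G', 'T'];
    // Each base takes two bits, so at most BITS / 2 bases fit in a usize.
    assert!(k <= usize::BITS as usize / 2, "k does not fit in a usize");
    (0..k).map(|i| BASES[(bits >> (2 * i)) & 0b11]).collect()
}

/// Inverse of `unpack_kmer`. Returns `None` for non-ACGT input
/// or for a string too long to fit in a usize.
fn pack_kmer(s: &str) -> Option<usize> {
    if s.len() > usize::BITS as usize / 2 {
        return None;
    }
    let mut bits = 0usize;
    for (i, c) in s.chars().enumerate() {
        let code: usize = match c {
            'A' => 0,
            'C' => 1,
            'G' => 2,
            'T' => 3,
            _ => return None,
        };
        // Base i occupies bits 2i and 2i + 1.
        bits |= code << (2 * i);
    }
    Some(bits)
}
```

Under these assumptions `unpack_kmer(15, 5)` yields `TTAAA`, and incrementing the integer changes the leftmost base first, which is what makes the ordering colexicographic rather than lexicographic.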

## [Kmers](https://docs.rs/bio-seq/latest/bio_seq/kmer)

kmers are short sequences of length `k` that can fit into a register (e.g. `usize`, or SIMD vector) and generally implement `Copy`. these are implemented with const generics and `k` is fixed at compile time.

### Succinct encodings

A lookup table can be indexed in constant time by treating kmers directly as `usize`:
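
As a rough illustration of this idea without the crate's types, the following std-only sketch packs each k-mer of a plain `&str` into a `usize` and uses it directly as an index into a dense table of 4^k counters. The function name `count_kmers`, the 2-bit codes (A=00, C=01, G=10, T=11, first base in the lowest bits) and the cap on `k` are assumptions for illustration, not part of bio-seq's API.

```rust
/// Count every k-mer of `seq` in a dense table with 4^k slots.
/// Returns `None` for non-ACGT input or for a `k` outside 1..=10.
fn count_kmers(seq: &str, k: usize) -> Option<Vec<u32>> {
    // Cap k so the table stays small (4^10 = 1,048,576 counters).
    if k == 0 || k > 10 {
        return None;
    }

    // Translate each base to its assumed 2-bit code.
    let mut codes: Vec<usize> = Vec::with_capacity(seq.len());
    for b in seq.bytes() {
        codes.push(match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        });
    }

    let mut table = vec![0u32; 1 << (2 * k)];
    for window in codes.windows(k) {
        // Pack the window little-endian: base i goes into bits 2i and 2i + 1.
        let index = window
            .iter()
            .enumerate()
            .fold(0usize, |acc, (i, &code)| acc | (code << (2 * i)));
        // The packed k-mer is the table index, so the lookup is O(1).
        table[index] += 1;
    }
    Some(table)
}
```

The table trades memory for speed: no hashing is needed, but its size grows as 4^k, so this layout only suits small `k`.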
2 changes: 1 addition & 1 deletion bio-seq/src/kmer.rs
@@ -3,7 +3,7 @@
// This file may not be copied, modified, or distributed
// except according to those terms.

//! Short sequences of fixed length.
//! Short sequences of fixed length
//!
//! Encoded sequences of length `k`, fixed at compile time. Generally, the underlying storage type of `Kmer` should lend itself to optimisation. For example, the default `Kmer` instance is packed into a `usize`, which can be efficiently `Copy`ed on the stack.
//!
2 changes: 2 additions & 0 deletions bio-seq/src/lib.rs
@@ -55,6 +55,8 @@
//!
//! // Static sequences can also be copied as kmers
//! let kmer: Kmer<Dna, 18> = dna!("CGCTAGCTACGATCGCAT").into();
//! // or with the kmer! macro:
//! let kmer = kmer!("CGCTAGCTACGATCGCAT");
//!
//! // `Seq`s can be allocated on the heap like `String`s are:
//! let s: String = "hello!".into();
2 changes: 1 addition & 1 deletion bio-seq/src/seq/mod.rs
@@ -3,7 +3,7 @@
// This file may not be copied, modified, or distributed
// except according to those terms.

//! Arbitrary length sequences of bit-packed genomic data, stored on the heap.
//! Arbitrary length sequences of bit-packed genomic data, stored on the heap
//!
//! `Seq` and `SeqSlice` are analogous to `String` and `str`. A `Seq` owns its data and a `SeqSlice` is a read-only window into a `Seq`.
//!
