refactor: store sequences as byte arrays #301

ivan-aksamentov · 2024-12-11T20:35:25Z

`Vec<char>`, although more readable and easier to debug, has significant memory and runtime overhead: `char` is a 4-byte integer (it holds a Unicode scalar value).

Replacing `Vec<char>` with `Vec<u8>` should significantly reduce memory consumption and speed things up. While at it, why not wrap all sequence-related functionality into a `Seq` struct.
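
To make the size difference concrete, here is a minimal sketch comparing the payload of a `Vec<char>` with a byte-backed wrapper. The `Seq` shown here is a hypothetical illustration, not the type introduced in this PR; its method names (`from_ascii`, `payload_bytes`) are assumptions made for the example.

```rust
use std::mem::size_of;

/// Hypothetical byte-backed sequence wrapper (illustration only,
/// not the actual `Seq` from this PR).
#[derive(Clone, PartialEq, Eq, Default)]
pub struct Seq {
    data: Vec<u8>,
}

impl Seq {
    /// Builds a sequence from ASCII text; returns `None` for non-ASCII input.
    pub fn from_ascii(s: &str) -> Option<Self> {
        s.is_ascii().then(|| Self { data: s.as_bytes().to_vec() })
    }

    /// Number of characters in the sequence.
    pub fn len(&self) -> usize {
        self.data.len()
    }

    pub fn is_empty(&self) -> bool {
        self.data.is_empty()
    }

    /// Size of the stored characters in bytes (excluding `Vec` bookkeeping).
    pub fn payload_bytes(&self) -> usize {
        self.data.len() * size_of::<u8>()
    }
}

fn main() {
    // `char` is always 4 bytes in Rust, `u8` is 1 byte.
    assert_eq!(size_of::<char>(), 4);
    assert_eq!(size_of::<u8>(), 1);

    let text = "ACGTN-ACGT";
    let as_chars: Vec<char> = text.chars().collect();
    let seq = Seq::from_ascii(text).expect("input is ASCII");

    println!("Vec<char> payload: {} bytes", as_chars.len() * size_of::<char>());
    println!("Seq payload:       {} bytes", seq.payload_bytes());
}
```

For sequence alphabets that fit in ASCII, the byte-backed form stores the same characters in a quarter of the space, which also means less memory traffic when scanning long alignments.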

Before:

  Time (mean ± σ):     954.9 ms ±  40.9 ms    [User: 2652.8 ms, System: 433.4 ms]
  Range (min … max):   915.9 ms … 1044.1 ms    10 runs

After:

  Time (mean ± σ):     655.4 ms ±  25.7 ms    [User: 2258.9 ms, System: 210.1 ms]
  Range (min … max):   635.3 ms … 717.6 ms    10 runs

Followup of #301

The `AsciiChar` wrapper type allows:

* better debugging, as we can control how `AsciiChar` is printed, as opposed to the integer-like `u8` type - this is mostly a dev convenience thing - for debug-printing sequences, various maps etc., but it boosts development significantly
* provide valid outputs wherever we need to display or write to file any characters - this is actually needed for correctness

I measured no significant slowdown:

Command:

```
$ cargo -q build --release --target-dir=/workdir/.build/docker --bin=treetime
$ hyperfine --warmup 1 --show-output '/workdir/.build/docker/release/treetime ancestral --method-anc=parsimony --dense=false --tree=data/mpox/clade-ii/500/tree.nwk --outdir=tmp/smoke-tests/ancestral/marginal/mpox/clade-ii/500 data/mpox/clade-ii/500/aln.fasta.xz'
```

Before (branch: rust, commit 67ba54a):

```
  Time (mean ± σ):     655.4 ms ±  25.7 ms    [User: 2258.9 ms, System: 210.1 ms]
  Range (min … max):   635.3 ms … 717.6 ms    10 runs
```

After:

```
  Time (mean ± σ):     650.3 ms ±  26.9 ms    [User: 2231.2 ms, System: 218.7 ms]
  Range (min … max):   630.9 ms … 716.2 ms    10 runs
```
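
The debugging and output benefits of an `AsciiChar` wrapper can be sketched roughly as follows. This is an assumed minimal shape for such a newtype, not the implementation from #302; the constructor `new` and accessor `as_u8` are hypothetical names chosen for the example.

```rust
use std::fmt;

/// Hypothetical newtype over `u8` (illustration only).
/// `repr(transparent)` keeps it the same size and layout as a plain byte.
#[derive(Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
#[repr(transparent)]
pub struct AsciiChar(u8);

impl AsciiChar {
    /// Accepts only bytes in the ASCII range (0..=127).
    pub fn new(byte: u8) -> Option<Self> {
        byte.is_ascii().then_some(Self(byte))
    }

    pub fn as_u8(self) -> u8 {
        self.0
    }
}

/// Display prints the character itself, which is what file output needs.
impl fmt::Display for AsciiChar {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", char::from(self.0))
    }
}

/// Debug prints a quoted character instead of the raw integer.
impl fmt::Debug for AsciiChar {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "'{}'", char::from(self.0))
    }
}

fn main() {
    let raw: Vec<u8> = b"ACG".to_vec();
    let wrapped: Vec<AsciiChar> = raw.iter().filter_map(|&b| AsciiChar::new(b)).collect();

    println!("{:?}", raw);     // prints [65, 67, 71]
    println!("{:?}", wrapped); // prints ['A', 'C', 'G']
}
```

Because the wrapper is a single-field `repr(transparent)` struct, it occupies one byte like `u8`, which is consistent with the benchmark above showing no measurable slowdown.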

ivan-aksamentov merged commit 67ba54a into rust Dec 11, 2024
14 checks passed

ivan-aksamentov deleted the refactor/seq-as-bytes branch December 11, 2024 20:42

ivan-aksamentov mentioned this pull request Dec 16, 2024

refactor: introduce AsciiChar wrapper over u8 #302

Merged
