Releases: ksahlin/strobealign

v0.15.0

13 Dec 09:13

Changelog

  • #388 and #426: Increase accuracy and mapping rate for reads shorter than
    about 200 bp by introducing multi-context seeds.
    Previously, seeds always consisted of two k-mers and would only be found if
    both occur in query and reference.
    With this change, strobealign falls back to looking up just one of the k-mers
    when appropriate.
    This feature is currently experimental and only enabled when using the
    --mcs command-line option.
    Contributed by Ivan Tolstoganov (@Itolstoganov).
  • #421: Allow references with up to 2^32 contigs (instead of 2^23
    previously) by changing the way randstrobes are stored in the index.
  • #467: Reading a reference that consists of many small contigs was sped up.
    Contributed by @luispedro.
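
To illustrate the multi-context seed fallback from #388 and #426, here is a minimal sketch. It is not strobealign's implementation: the toy constants `K` and `W`, the fixed spacing between the two k-mers, the choice of the first k-mer as the fallback key, and the function names are all assumptions made for this example. A real index would store hashes rather than the k-mer strings themselves.

```python
from collections import defaultdict

K = 4  # toy k-mer length (hypothetical)
W = 2  # toy gap between the two k-mers (hypothetical)

def seeds(seq):
    """Yield (position, first k-mer, second k-mer) for every toy seed."""
    for i in range(len(seq) - 2 * K - W + 1):
        yield i, seq[i:i + K], seq[i + K + W:i + 2 * K + W]

def build_index(ref):
    """Index each seed under its full key and under its first k-mer alone."""
    full = defaultdict(list)
    partial = defaultdict(list)
    for pos, first, second in seeds(ref):
        full[(first, second)].append(pos)
        partial[first].append(pos)
    return full, partial

def lookup(full, partial, first, second):
    """Try the two-k-mer seed first; fall back to the first k-mer only."""
    hits = full.get((first, second))
    if hits:
        return "full", hits
    hits = partial.get(first)
    if hits:
        return "partial", hits
    return "none", []

full, partial = build_index("ACGTTGCAAGCTTAGC")
print(lookup(full, partial, "ACGT", "CAAG"))  # both k-mers match
print(lookup(full, partial, "ACGT", "CATG"))  # second k-mer differs: fallback
```

The fallback matters most for short reads, where a single sequencing error is more likely to destroy one of the two k-mers of every seed that covers it.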

v0.14.0

03 Oct 09:12
b3b8f48

Changes

  • #401: The default number of threads is now 1 instead of 3.
  • #409: Ensure reference names are unique and conform to the SAM specification.
    Contributed by @drtconway in PR #411.
  • #269, #418: Strobealign now scales much better to systems with many cores. Previously, decompressing gzip-compressed input files was a bottleneck starting at about 30 threads.
    We now use ISA-L for decompression, which is about three times as fast as zlib, and decompression is also done in a separate thread. We tested up to 128 cores, and strobealign was still able to use all cores.
    Contributed by @telmin.
  • #447: Switched to a new way of hashing randstrobes in preparation for the introduction of multi-context seeds. Pre-generated index files (.sti files) therefore need to be re-generated. (Strobealign will complain if you try to use an outdated index file.)
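
The idea behind moving decompression into its own thread (#269, #418) can be sketched as a producer/consumer pair. This is only an illustration of the pattern: it uses Python's standard `gzip` module (which is based on zlib, not ISA-L), and the function name, chunk size, and queue size are assumptions for this example.

```python
import gzip
import io
import queue
import threading

def decompress_in_background(fileobj, chunk_size=1 << 16, max_chunks=8):
    """Yield decompressed chunks that a separate thread produces."""
    chunks = queue.Queue(maxsize=max_chunks)  # bounded, so memory stays limited

    def worker():
        try:
            with gzip.open(fileobj, "rb") as gz:
                while chunk := gz.read(chunk_size):
                    chunks.put(chunk)
        finally:
            chunks.put(None)  # sentinel: always tell the consumer we are done

    threading.Thread(target=worker, daemon=True).start()
    while (chunk := chunks.get()) is not None:
        yield chunk

# Usage: the consumer processes chunks while the worker keeps decompressing.
compressed = gzip.compress(b"@read1\nACGT\n+\nIIII\n" * 1000)
total = sum(len(c) for c in decompress_in_background(io.BytesIO(compressed)))
print(total)
```

With this structure, the threads that align reads no longer wait for decompression unless the queue runs empty.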

v0.13.0

04 Mar 12:46
11aaa5c

Changes

  • #394: Added option --aemb (abundance estimation for metagenomic binning),
    which makes strobealign output a table with estimated abundance values for
    each contig (instead of SAM or PAF). This was contributed by Shaojun Pan
    (@psj1997).
  • #386: Parallelize indexing even more by using @alugowski’s
    poolSTL pluggable_sort.
    Indexing a human reference (measured on CHM13) now takes only ~45 s on a
    recent machine (using 8 threads).
  • #376: Improve accuracy for read length 50 by optimizing the default
    indexing parameters. Paired-end accuracy increases by 0.3 percentage
    points on average. Single-end accuracy increases by 1 percentage point.
  • #395: Previously, read length 75 used the same indexing parameters as length
    50, but the improved settings for length 50 are not the best for length 75.
    To avoid a decrease in accuracy, we introduced a new set of pre-defined
    indexing parameters for read length 75 (a new canonical read length).
  • If --details is used, output the X0:i SAM tag with the number of
    identically-scored best alignments.
  • #378: Added -C option for appending the FASTA or FASTQ comment to SAM
    output. (Idea and name of the option taken from BWA-MEM.)
  • #371: Added --no-PG option for not outputting the PG SAM header
  • Include ZStr in our own repository
    instead of downloading it at build time. This should make it possible to
    build strobealign without internet access.
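
As a small illustration of what the X0:i tag reports, the following sketch counts how many candidate alignments share the best score. The function names and the list-of-scores input are assumptions for this example, not strobealign's internals.

```python
def count_best(scores):
    """Return how many alignments share the highest score."""
    if not scores:
        return 0
    best = max(scores)
    return sum(1 for score in scores if score == best)

def x0_tag(scores):
    """Format the count as an optional SAM field of integer type."""
    return f"X0:i:{count_best(scores)}"

print(x0_tag([50, 42, 50, 17]))  # two alignments share the best score of 50
```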

v0.12.0

23 Nov 22:02

Changes

  • #293: Fix: When mapping single-end reads, many multimappers were previously assigned a high mapping quality. They now get assigned mapping quality zero as intended.
  • #321: Fix: For paired-end reads that cannot be placed as proper pairs, we now prefer placing them onto the same chromosome instead of on different ones if there is a choice.
  • #328: Adjust MAPQ computation for single-end reads.
  • #318: Added a --details option mainly intended for debugging. When used, some strobealign-specific tags are added to the SAM output that report things like the number of seeds found and whether mate rescue was performed.
  • #333: Fix matches ending too early in PAF output.
  • #359, #367: Assign (single-end and paired-end) multimappers randomly to one of the candidate mapping locations to reduce biases.
  • #347: Reduce memory usage by avoiding an unnecessary copy of reference contigs.
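
The multimapper handling described in #293 and #359/#367 can be sketched as follows: if several candidate locations share the best score, pick one of them at random and report mapping quality zero. The tuple layout, the function name, and the placeholder MAPQ of 60 for unique hits are assumptions for this example; strobealign's actual MAPQ computation is more involved.

```python
import random

def pick_location(candidates, rng):
    """candidates: list of (reference name, position, alignment score)."""
    best = max(score for _, _, score in candidates)
    ties = [c for c in candidates if c[2] == best]
    chosen = rng.choice(ties)  # a random choice avoids always favoring the first hit
    mapq = 0 if len(ties) > 1 else 60  # 60 is a placeholder, not the real formula
    return chosen, mapq

rng = random.Random(0)  # seeded only to make this example reproducible
candidates = [("chr1", 100, 50), ("chr2", 200, 50), ("chr3", 300, 40)]
location, mapq = pick_location(candidates, rng)
print(location, mapq)
```

Always choosing the first of several equally good locations would pile up reads on one copy of a repeat, which is the bias the random assignment avoids.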

v0.11.0

22 Jun 09:16

Changes

  • #278: Memory usage was reduced drastically due to a redesigned strobemer index memory layout. For the human genome, for example, RAM usage was reduced from 23 to 13 GiB. (Other changes increased RAM usage again slightly, see below.)

    Idea and implementation for this substantial improvement were contributed by Shaojun Pan (@psj1997) (supervised by Luis Pedro @luispedro) and originate in his work on a "strobealign-lm" (low memory) branch of strobealign. Thanks!

  • #277, #285, PR #306: Support for very large references (exceeding ~20 Gbp) was added by switching from 32 bit to 64 bit strobemer indices. This was also enabled and made simpler by the memory layout changes. This increases RAM usage by 1 GiB for human-sized genomes.

  • #313: Increased accuracy (especially on short single-end reads) due to "more random" syncmers. This increases memory usage again slightly so that we are at 14.7 GiB RAM usage for the human genome for this version of strobealign.

  • #307: Indexing was further parallelized, cutting the time for index generation in about half for many cases.

  • #289: Fixed missing CIGAR for secondary alignments.

  • #212: SEQ and QUAL are set to * for secondary alignments as recommended by the SAM specification.

  • #294: Updated the alignment library (SSW), which fixes some incorrect
    alignments.
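
The change in #212 can be illustrated with a short sketch that formats the eleven mandatory SAM fields and replaces SEQ and QUAL with * when the secondary-alignment flag (0x100) is set. The function name and the fixed mate fields are assumptions made for this example.

```python
SECONDARY = 0x100  # SAM flag bit marking a secondary alignment

def sam_line(qname, flag, rname, pos, mapq, cigar, seq, qual):
    """Build a SAM record for an unpaired read (mate fields left unset)."""
    if flag & SECONDARY:
        seq, qual = "*", "*"  # omit sequence and qualities for secondary records
    fields = [qname, flag, rname, pos, mapq, cigar, "*", 0, 0, seq, qual]
    return "\t".join(str(field) for field in fields)

print(sam_line("read1", 0, "chr1", 100, 60, "4M", "ACGT", "IIII"))
print(sam_line("read1", SECONDARY, "chr2", 500, 0, "4M", "ACGT", "IIII"))
```

Because the primary record already carries the sequence and qualities, dropping them from secondary records shrinks the output without losing information.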

v0.10.0

07 Jun 12:28

Changes

  • #258: Fixed compilation on MinGW. Thanks @teepean.
  • #260: Include full command line in the SAM PG header. Thanks @telmin.
  • #20: By default, emit M CIGAR operations instead of = and X.
    Added option --eqx to use = and X as before.
  • #265: Fixed overflowing read count statistics when processing 2^31 reads
    or more. Thanks @telmin.
  • #273: Fix handling of interleaved files using /1 or /2 suffixes

v0.9.0

16 Mar 09:27

Changes

  • #225: Add progress report (only shown if output is not a terminal; can be disabled with --no-progress)
  • #250: Avoid overeager soft clipping by adding an “end bonus” to the alignment score if the alignment reaches the 5' or 3' end of the read. This is equivalent to penalizing soft-clipping and improves mapping accuracy, in particular for short reads, as candidate mapping sites with and without soft clipping are compared more fairly. Use -L to change the end bonus. (This emulates a feature found in BWA-MEM.)
  • #238: Fix occasionally incorrect soft clipping.
  • #239: Fix an uninitialized variable that could lead to nondeterministic results.
  • #137: Compute TLEN (in SAM output) correctly
  • #255: Add support for reading gzip-compressed reference FASTA files.
  • #222: Make it possible again to build strobealign from the release tarball (not only from the Git repository).

v0.8.0

01 Feb 10:31

Changes

This is a large release with more than 600 commits since the previous one (0.7.1). Much of the work that went into it was enabled by a Bioinformatics Long-Term Support grant through National Bioinformatics Infrastructure Sweden (NBIS), which is the SciLifeLab Bioinformatics platform.

Most of the commits focused on reorganizing the code to make it easier to maintain, read, test and change. This has already paid off in the form of external contributions that would have been more difficult without those changes.

Another focus was on usability and standards compliance: SAM output follows the SAM specification more closely, error messages are better, there is now a --help command-line option, some irrelevant logging output was hidden, and the documentation was updated.

Mapping speed and accuracy remain mostly unaffected in this release, except for one bugfix that increases mapping rate and accuracy at short read lengths (<100 bp). Also, some unintended coverage spikes no longer occur.

Memory usage was decreased due to switching to a modified in-memory representation of the index. For the human genome, for example, RAM usage went from 28 GiB to 21 GiB. The reduction is smaller for more repetitive genomes.

Strobealign also gained the ability to pre-generate an index and save it to disk. As indexing is quite fast, this is not as relevant as it is for other read mappers, but important when processing many small libraries.

Please see the full Changelog.

v0.7.1

17 Apr 08:10
fcce85c

Improvements mainly for large repetitive genomes.

  • Introduces a maximum limit on repetitive seeds before calling the optimized merged match finder (optimized for repetitive reads). This significantly reduces the computation time if the genome is large and repetitive, e.g., maize (2.4 Gb) or rye (7.8 Gb).
  • Fixes SAM header issue #22.
  • Removes dependency on ksw2.

v0.7

01 Apr 13:27
3647690

Major update to the implemented parallelization. The new parallel implementation allows a much more efficient interplay between reading input, aligning, and writing output. This results in much better CPU usage as the number of threads increases. For example, I observed an almost 2x speedup (30-50% reduced runtime) across four larger datasets when using 16 cores (SIM and GIAB 150bp and 250bp reads, see README benchmarks).

For reference, the previous naive parallelization ran in sequential order:

  1. Read a batch of reads with one thread.
  2. Align the batch in parallel with OpenMP.
  3. Write output with one thread.

The new parallelization performs steps 1-3 across threads, with a mutex on input and output. This type of parallelization is commonly applied in other tools.
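
The new scheme can be sketched as worker threads that each read a batch, align it, and write the results, with one mutex guarding the input and another guarding the output. This is a structural illustration only: the function names and batch size are assumptions, and Python threads do not give the CPU parallelism that the C++ implementation gets.

```python
import threading

def run_pipeline(reads, align, n_threads=4, batch_size=2):
    """Each worker loops over: read a batch, align it, write the results."""
    source = iter(reads)
    input_lock = threading.Lock()
    output_lock = threading.Lock()
    output = []

    def next_batch():
        with input_lock:  # only one thread reads input at a time
            batch = []
            for read in source:
                batch.append(read)
                if len(batch) == batch_size:
                    break
            return batch

    def worker():
        while batch := next_batch():
            results = [align(read) for read in batch]  # runs without any lock
            with output_lock:  # only one thread writes output at a time
                output.extend(results)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return output

# Batches may finish in any order, so the output order is not guaranteed.
print(sorted(run_pipeline(["ACGT", "GG", "TTT"], align=len)))
```

Since no thread waits for a global read or write phase to finish, the workers stay busy as long as input is available.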

This release also includes:

  • Implemented automatic inference of read length, which removes the need to specify -r (as reported in #19)
  • Some minor bugfixes. For example, this bug is fixed.

This release produces alignments that are identical or near-identical to those of the previous version, v0.6.1 (same accuracy and SV calling statistics across tested datasets).
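
One way the automatic read-length inference mentioned above could work is to average the lengths of the first reads in the input. The sketch below assumes exactly that; the function name and the sample size of 500 are made up for this example and may differ from what strobealign does.

```python
from itertools import islice

def estimate_read_length(sequences, sample_size=500):
    """Estimate the read length as the mean length of the first reads."""
    lengths = [len(seq) for seq in islice(sequences, sample_size)]
    if not lengths:
        raise ValueError("no reads available to estimate the read length")
    return sum(lengths) // len(lengths)

reads = ["A" * 150, "A" * 148, "A" * 151]
print(estimate_read_length(reads))  # 149
```

Sampling only the start of the file keeps the estimate cheap, at the cost of being misled if the first reads are not representative of the rest.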