Version History

Jouni Siren edited this page Sep 18, 2024 · 95 revisions

Current version

  • GBWTBuilder (and related tools) will automatically increase buffer size if a sequence is too large for the buffer.
  • Metadata improvements:
    • FullPathName: A standalone version of PathName that stores sample/contig names/ids as strings without requiring Metadata.
    • Metadata::findFragment(): Returns the path identifier of the haplotype fragment possibly covering the (sample, contig, haplotype, offset) represented by a path name.
  • New functionality:
    • FastLocate::decompressSA() and FastLocate::decompressDA() for decompressing the part of the suffix array / document array corresponding to a node.

Releases

v1.3.1 (2022-02-17)

  • Empty paths are fully supported (but still discouraged).
  • Text input format for build_gbwt (mostly for testing).
  • The broken CMake support has been removed.

v1.3 (2021-11-15)

  • Supports 64-bit ARM.
  • File format version 5:
    • Optional serialization using simple-sds structures.
    • Tags structure storing arbitrary key-value pairs.
    • Compatible with versions 1-4.
    • Uses Metadata version 2 (compatible with versions 0-1).
  • inverseLF(): Follow the sequence backward in a bidirectional index.
  • Serialization and loading use exceptions to handle failures.
  • Requires the vgteam fork of SDSL.

v1.2 (2021-01-22)

  • Uses C++14 and the vgteam fork of SDSL.
  • Direct GBWT to DynamicGBWT conversion.
  • Temporary files are now thread-safe.
  • An option to use persistent phasing files for haplotype generation. These files persist when the associated object is deleted, but they are still deleted when the program exits.
  • The fast GBWT merging algorithm now works with overlapping node id ranges as long as the non-empty records do not overlap.
  • metadata_tool now prints metadata or removes it completely.

v1.1 (2020-09-14)

  • FastLocate: Optional fast locate() structure based on the r-index.
  • Ignore metadata from empty GBWTs during merging.
  • Construction from paths with many different starting nodes is faster.

v1.0 (2019-09-06)

  • Option to force the phasing of homozygous variants (default on).
  • CachedGBWT: A caching layer for workloads that repeatedly access the same subset of nodes.
  • Direct DynamicGBWT to GBWT conversion.
  • Install script.

v0.9 (2019-04-12)

  • Extended metadata with path, sample, and contig names.
  • Sample names and the contig name are stored in the VCF parse.
  • Create full metadata when building GBWT from a VCF parse using build_gbwt.
  • Renamed metadata to metadata_tool.
  • Remove sequences by sample / contig name in remove_seq.
  • New functionality: GBWT::firstNode(), GBWT::empty(node).

v0.8 (2019-01-11)

  • An algorithm for removing sequences from DynamicGBWT.
  • Multiple parallel merge jobs in BWT-merge.
  • build_gbwt improvements: Accept file lists, write metadata when building from VCF parse.

v0.7 (2018-11-21)

  • Parallel merging algorithm for quickly merging multiple GBWTs over the same chromosome. It can reduce the index construction time for large datasets by a factor of 2 to 3.
  • Optional metadata in the GBWT index.
  • New functionality: GBWT::extract(position), GBWT::extract(position, max_length), DynamicGBWT::fullLF().

v0.6 (2018-09-24)

  • Option to change the path identifier sampling interval.
  • Save the temporary structures from haplotype generation and use them as input for build_gbwt.
  • Decompress the endmarker of compressed GBWT for faster extract() queries in indexes with millions of paths.
  • Bug fix: Initialize incoming edges correctly when loading DynamicGBWT if alphabet offset is non-zero.
  • Support for Clang.

v0.5 (2018-07-20)

  • Support for bidirectional search.
  • Bug fixes for empty indexes.
  • Use vector_type (32-bit integers) instead of std::vector<node_type> (64-bit integers).
  • Support structures for generating haplotypes from a phased VCF file.

v0.4 (2018-05-10)

  • New functionality: GBWT::hasEdge(), GBWT::edges(), GBWT::find(node).
  • Read and write data in smaller blocks to avoid the issue with >2 GB reads in GCC on macOS.
  • Faster GBWT::LF(from, i), GBWT::prefix(), GBWT::locate(), and GBWT::extract() queries.
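
The block-wise I/O item above can be illustrated with a minimal sketch. This is not the library's actual code: the helper names (readInBlocks, writeInBlocks, copyThroughStream) and the block size parameter are hypothetical, and the sketch only shows the general idea of splitting one large transfer into several smaller read() / write() calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helpers for illustration; not part of the GBWT API.

// Reads up to `bytes` bytes into `buffer`, requesting at most `block_size`
// bytes per read() call. Returns the number of bytes actually read.
inline std::size_t readInBlocks(std::istream& in, char* buffer, std::size_t bytes, std::size_t block_size)
{
  assert(block_size > 0);
  std::size_t total = 0;
  while(total < bytes)
  {
    std::size_t chunk = std::min(block_size, bytes - total);
    in.read(buffer + total, static_cast<std::streamsize>(chunk));
    std::size_t got = static_cast<std::size_t>(in.gcount());
    total += got;
    if(got < chunk) { break; } // End of input or a read error.
  }
  return total;
}

// Writes `bytes` bytes from `buffer`, at most `block_size` bytes per write()
// call. Returns true if every block was written successfully.
inline bool writeInBlocks(std::ostream& out, const char* buffer, std::size_t bytes, std::size_t block_size)
{
  assert(block_size > 0);
  std::size_t total = 0;
  while(total < bytes)
  {
    std::size_t chunk = std::min(block_size, bytes - total);
    out.write(buffer + total, static_cast<std::streamsize>(chunk));
    if(!out) { return false; }
    total += chunk;
  }
  return true;
}

// Round trip through an in-memory stream, using the same block size both ways.
inline std::string copyThroughStream(const std::string& data, std::size_t block_size)
{
  std::stringstream stream;
  if(!writeInBlocks(stream, data.data(), data.size(), block_size)) { return std::string(); }
  std::string result(data.size(), '\0');
  std::size_t got = readInBlocks(stream, &result[0], result.size(), block_size);
  result.resize(got);
  return result;
}
```

In real use the block size would be chosen well below 2 GB; the tiny values used with copyThroughStream() are only convenient for checking that block boundaries are handled correctly.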

v0.3 (2017-11-26)

  • New construction option: GBWTBuilder collects inserted sequences and builds GBWT in a background thread.
  • Support for node and path orientations.
  • Fast merging when the node ids do not overlap.

v0.2 (2017-10-20)

  • The second pre-release.
  • High-level interface (find(), extend(), locate(), extract()) shared between GBWT and DynamicGBWT.
  • Construction from std::vector<node_type>, which is also the type of extracted sequences.
  • More versatile construction program supporting multiple inputs and inserting sequences into an existing index.
  • Tools display version information.

v0.1 (2017-09-18)

  • The first pre-release.
  • Incremental index construction and GBWT merging.
  • LF-mapping and locate() queries for determining path identifiers.

Future work

Other ideas

  • Use binary search in DynamicGBWT::tryLocate().
  • Inverse suffix array functionality.
    • Get offset for a path in a given node.
  • Encode the destination of the first outgoing edge relative to the current node.
  • Memory-mapped compressed GBWT.
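
The binary search idea for DynamicGBWT::tryLocate() can be sketched as follows. This is a sketch under stated assumptions, not the library's implementation: it assumes that the sampled path identifiers of a record are available as (offset, path id) pairs sorted by offset, and the type aliases, the NO_SAMPLE constant, and the function name tryLocateSketch are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not the library's actual definitions.
using offset_type = std::uint64_t;
using path_id_type = std::uint64_t;
using sample_pair = std::pair<offset_type, path_id_type>; // (record offset, path id)

// Hypothetical sentinel returned when the offset has no sample.
constexpr path_id_type NO_SAMPLE = std::numeric_limits<path_id_type>::max();

// Returns the path id sampled at `offset`, or NO_SAMPLE if there is none.
// Assumes that `samples` is sorted by offset.
inline path_id_type tryLocateSketch(const std::vector<sample_pair>& samples, offset_type offset)
{
  assert(std::is_sorted(samples.begin(), samples.end(),
    [](const sample_pair& a, const sample_pair& b) { return a.first < b.first; }));

  // Find the first sample with an offset that is not smaller than the query.
  auto iter = std::lower_bound(samples.begin(), samples.end(), offset,
    [](const sample_pair& sample, offset_type value) { return sample.first < value; });
  if(iter != samples.end() && iter->first == offset) { return iter->second; }
  return NO_SAMPLE;
}
```

With s samples in a record, the lookup takes O(log s) comparisons. Whether this helps in practice depends on how many samples a typical record stores.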