Skip to content

File Formats

Jouni Siren edited this page May 8, 2021 · 18 revisions

General

This describes the current version of the GBWT file format based on the in-memory SDSL data structures. The file format using simple-sds structures is documented on a separate page. All tools output the simple-sds format by default, with an option for using the SDSL format.

The default extension is .gbwt.

File format version GBWT version Changes
5 v1.3 Support for serialization using Simple-SDS structures. Metadata version 2. Compatible with versions 1 to 4.
4 v0.9 Metadata version 1 with path/sample/contig names. Compatible with versions 1 to 3.
3 v0.7 Optional metadata in the index. Compatible with versions 1 and 2.
2 v0.5 A flag for bidirectional search in the header. Compatible with version 1.
1 v0.2 The first proper version, effectively the same as version 0.
0 v0.1

Records

We split the BWT into records. Each record contains the part of the BWT corresponding to the reverse prefixes starting with a node of the graph. We do not have records for the maximal range [1, offset] of node identifiers not occurring in the collection. The record contains the following information:

  • Incoming edges as pairs (source node, path count) in sorted order.
  • Outgoing edges as pairs (destination node, path rank) in sorted order.
    • Path rank is the number of occurrences of the destination node in the records with identifier smaller than this record.
  • Body contains the run-length encoded BWT as pairs (edge rank, count).
    • Edge rank is the rank of the destination node in the list of outgoing edges.
  • Sampled identifiers as pairs (offset, identifier).
    • Offset is a lexicographic rank relative to the start of this record.
    • We currently sample one out of 1024 positions on each path.

Dynamic FM-index

The dynamic FM-index is used for construction.

class DynamicGBWT
{
  GBWTHeader                 header;
  std::vector<DynamicRecord> bwt;
  Metadata                   metadata;
};

Header

struct GBWTHeader
{
  std::uint32_t tag;
  std::uint32_t version;
  std::uint64_t sequences;
  std::uint64_t size;
  std::uint64_t offset;
  std::uint64_t alphabet_size;
  std::uint64_t flags;
};
  • Currently tag must be 0x6B376B37 and version must be 4.
  • The following flags are stored in the bitfield flags:
    • FLAG_BIDIRECTIONAL = 0x0001: The index supports bidirectional search.
    • FLAG_METADATA = 0x0002: The index contains metadata.
    • FLAG_SIMPLE_SDS = 0x0004: The index is serialized using simple-sds structures.
  • The index contains sequences sequences of total length size.
  • offset defines the maximal range of node identifiers with no records.
  • alphabet_size is the largest node identifier + 1.

Records

struct DynamicRecord
{
  size_type                body_size;
  std::vector<edge_type>   incoming, outgoing;
  std::vector<run_type>    body;
  std::vector<sample_type> ids;
};
  • body_size is the length of the BWT in this record.
  • edge_type, run_type, and sample_type are all pairs of 32-bit integers.

Compressed index

The compressed index is used for queries and serialization. It is up to 10x smaller and typically 2x to 4x slower than the dynamic index.

Note that the serialized index can be loaded as either GBWT or DynamicGBWT.

class GBWT
{
  GBWTHeader  header;
  RecordArray bwt;
  DASamples   da_samples;
  Metadata    metadata;
};

Records

Each record is encoded as follows:

  • Incoming edges are not stored. The information is recomputed when the dynamic index is loaded.
  • Outgoing edges are encoded using ByteCode. The encoding starts with the outdegree, followed by the pairs of integers for each edge.
    • The destination nodes are differentially encoded.
    • In ByteCode, each byte contains the lowest remaining 7 bits of the integer, with the highest bit marking whether the encoding continues in the next node.
  • Body is encoded using Run, which in turn uses ByteCode.
    • The first byte contains edge rank and an integer between 0 and x = floor(256 / outdegree).
    • If the run length is less than x, it is encoded in the first byte as run length - 1.
    • Otherwise the first byte contains x and the following bytes encode run length - x using ByteCode.
    • If the local alphabet size (outdegree) is too large, we encode pair (edge rank, run length - 1) using ByteCode.
  • Sampled identifiers are stored in a global structure.
    • Some nodes have many samples, making linear search through them expensive.
    • Many nodes have samples, making the overhead from individual structures high.

Record array

struct RecordArray
{
  size_type                        records;
  sdsl::sd_vector<>                index;
  sdsl::sd_vector<>::select_1_type select;
  std::vector<byte_type>           data;
};
  • The array contains records records.
  • index marks the start of each record with 1-bits, while select is used for finding the 1-bits.
  • data is a byte array containing the encoding.

Stored path identifiers

struct DASamples
{
  sdsl::bit_vector                 sampled_records;
  sdsl::bit_vector::rank_1_type    record_rank;

  sdsl::sd_vector<>                bwt_ranges;
  sdsl::sd_vector<>::select_1_type bwt_select;

  sdsl::sd_vector<>                sampled_offsets;
  sdsl::sd_vector<>::rank_1_type   sample_rank;

  sdsl::int_vector<0>              array;
};
  • sampled_records tells whether a record has samples, while record_rank gives a rank among the records with samples.
    • As most nodes do not have samples, this offers a quick way of skipping them when searching for the next sample.
  • bwt_ranges is a bitvector over the concatenated offset space of the records with samples. 1-bits mark the beginning of each record, and bwt_select is used for finding the 1-bits.
  • sampled_offsets is another bitvector over the concatenated offset spaces, marking the sampled offsets with 1-bits. sample_rank gives a rank among the samples.
  • array contains the sampled path identifiers.

Metadata

See also: Metadata.

Version history

Metadata format version GBWT version Changes
2 v1.3 Serialization using simple-sds data structures. Compatible with versions 0 to 1.
1 v0.9 Path/sample/contig names. Compatible with version 0.
0 v0.7
struct Metadata
{
  std::uint32_t tag;
  std::uint32_t version;
  std::uint64_t sample_count;
  std::uint64_t haplotype_count;
  std::uint64_t contig_count;
  std::uint64_t flags;

  std::vector<PathName> path_names;
  Dictionary            sample_names;
  Dictionary            contig_names;
};
  • Currently tag must be 0x6B375E7A and version must be 1.
  • The following flags are stored in the bitfield flags:
    • FLAG_PATH_NAMES = 0x0001: The metadata contains path names.
    • FLAG_SAMPLE_NAMES = 0x0002: The metadata contains sample names.
    • FLAG_CONTIG_NAMES = 0x0004: The metadata contains contig names.
  • The index contains haplotype_count haplotypes of sample_count samples over contig_count contigs.
  • See Metadata for further details.
Clone this wiki locally