Skip to content

File Formats

Jouni Siren edited this page May 8, 2021 · 18 revisions

General

This describes the current version of the GBWT file format based on the in-memory SDSL data structures. The file format using simple-sds structures is documented on a separate page. All tools output the simple-sds format by default, with an option for using the SDSL format.

The default extension is .gbwt.

File format version GBWT version Changes
5 v1.3 Support for serialization using Simple-SDS structures. Metadata version 2. Compatible with versions 1 to 4.
4 v0.9 Metadata version 1 with path/sample/contig names. Compatible with versions 1 to 3.
3 v0.7 Optional metadata in the index. Compatible with versions 1 and 2.
2 v0.5 A flag for bidirectional search in the header. Compatible with version 1.
1 v0.2 The first proper version, effectively the same as version 0.
0 v0.1

Records

We split the BWT into records. Each record contains the part of the BWT corresponding to the reverse prefixes starting with a node of the graph. We do not have records for the maximal range [1, offset] of node identifiers not occurring in the collection. The record contains the following information:

  • Incoming edges as pairs (source node, path count) in sorted order.
  • Outgoing edges as pairs (destination node, path rank) in sorted order.
    • Path rank is the number of occurrences of the destination node in the records with identifier smaller than this record.
  • Body contains the run-length encoded BWT as pairs (edge rank, count).
    • Edge rank is the rank of the destination node in the list of outgoing edges.
  • Sampled identifiers as pairs (offset, identifier).
    • Offset is a lexicographic rank relative to the start of this record.
    • We currently sample one out of 1024 positions on each path.

Dynamic FM-index

The dynamic FM-index is used for construction.

class DynamicGBWT
{
  GBWTHeader                         header;
  std::map<std::string, std::string> tags;
  std::vector<DynamicRecord>         bwt;
  Metadata                           metadata;
};

Header

struct GBWTHeader
{
  std::uint32_t tag;
  std::uint32_t version;
  std::uint64_t sequences;
  std::uint64_t size;
  std::uint64_t offset;
  std::uint64_t alphabet_size;
  std::uint64_t flags;
};
  • Currently tag must be 0x6B376B37 and version must be 5.
  • The following flags are stored in the bitfield flags:
    • FLAG_BIDIRECTIONAL = 0x0001: The index supports bidirectional search.
    • FLAG_METADATA = 0x0002: The index contains metadata.
    • FLAG_SIMPLE_SDS = 0x0004: The index is serialized using simple-sds structures.
  • The index contains sequences sequences of total length size.
  • offset defines the maximal range of node identifiers with no records.
  • alphabet_size is the largest node identifier + 1.

Records

struct DynamicRecord
{
  size_type                body_size;
  std::vector<edge_type>   incoming, outgoing;
  std::vector<run_type>    body;
  std::vector<sample_type> ids;
};
  • body_size is the length of the BWT in this record.
  • edge_type, run_type, and sample_type are all pairs of 32-bit integers.

Compressed index

The compressed index is used for queries and serialization. It is up to 10x smaller and typically 2x to 4x slower than the dynamic index.

Note that the serialized index can be loaded as either GBWT or DynamicGBWT.

class GBWT
{
  GBWTHeader                         header;
  std::map<std::string, std::string> tags;
  RecordArray                        bwt;
  DASamples                          da_samples;
  Metadata                           metadata;
};

Records

Each record is encoded as follows:

  • Incoming edges are not stored. The information is recomputed when the dynamic index is loaded.
  • Outgoing edges are encoded using ByteCode. The encoding starts with the outdegree, followed by the pairs of integers for each edge.
    • The destination nodes are differentially encoded.
    • In ByteCode, each byte contains the lowest remaining 7 bits of the integer, with the highest bit marking whether the encoding continues in the next node.
  • Body is encoded using Run, which in turn uses ByteCode.
    • The first byte contains edge rank and an integer between 0 and x = floor(256 / outdegree).
    • If the run length is less than x, it is encoded in the first byte as run length - 1.
    • Otherwise the first byte contains x and the following bytes encode run length - x using ByteCode.
    • If the local alphabet size (outdegree) is too large, we encode pair (edge rank, run length - 1) using ByteCode.
  • Sampled identifiers are stored in a global structure.
    • Some nodes have many samples, making linear search through them expensive.
    • Many nodes have samples, making the overhead from individual structures high.

Record array

struct RecordArray
{
  size_type                        records;
  sdsl::sd_vector<>                index;
  sdsl::sd_vector<>::select_1_type select;
  std::vector<byte_type>           data;
};
  • The array contains records records.
  • index marks the start of each record with 1-bits, while select is used for finding the 1-bits.
  • data is a byte array containing the encoding.

Stored path identifiers

struct DASamples
{
  sdsl::bit_vector                 sampled_records;
  sdsl::bit_vector::rank_1_type    record_rank;

  sdsl::sd_vector<>                bwt_ranges;
  sdsl::sd_vector<>::select_1_type bwt_select;

  sdsl::sd_vector<>                sampled_offsets;
  sdsl::sd_vector<>::rank_1_type   sample_rank;

  sdsl::int_vector<0>              array;
};
  • sampled_records tells whether a record has samples, while record_rank gives a rank among the records with samples.
    • As most nodes do not have samples, this offers a quick way of skipping them when searching for the next sample.
  • bwt_ranges is a bitvector over the concatenated offset space of the records with samples. 1-bits mark the beginning of each record, and bwt_select is used for finding the 1-bits.
  • sampled_offsets is another bitvector over the concatenated offset spaces, marking the sampled offsets with 1-bits. sample_rank gives a rank among the samples.
  • array contains the sampled path identifiers.

Metadata

class Metadata
{
  MetadataHeader        header;
  std::vector<PathName> path_names;
  Dictionary            sample_names;
  Dictionary            contig_names;
};
  • path_names is either absent or contains a unique name for each path.
  • sample_names is either absent or contains a unique name for each sample.
  • contig_names is either absent or contains a unique name for each contig.

The name structures are present if and only if the corresponding flag is set in the header.

See also: Metadata.

Version history

Metadata format version GBWT version Changes
2 v1.3 Serialization using simple-sds data structures. Compatible with versions 0 to 1.
1 v0.9 Path/sample/contig names. Compatible with version 0.
0 v0.7

Metadata header

struct MetadataHeader
{
  std::uint32_t tag;
  std::uint32_t version;
  std::uint64_t sample_count;
  std::uint64_t haplotype_count;
  std::uint64_t contig_count;
  std::uint64_t flags;
};
  • Currently tag must be 0x6B375E7A and version must be 2.
  • The following flags are stored in the bitfield flags:
    • FLAG_PATH_NAMES = 0x0001: The metadata contains path names.
    • FLAG_SAMPLE_NAMES = 0x0002: The metadata contains sample names.
    • FLAG_CONTIG_NAMES = 0x0004: The metadata contains contig names.
  • There are there are sample_count samples with haplotype_count haplotypes over contig_count contigs.

Path names

struct PathName
{
  path_name_type sample;
  path_name_type contig;
  path_name_type phase;
  path_name_type count;
};

A path name is a combination of sample identifier, contig identifier, phase number, and running count. Type path_name_type is either short_type or size_type, depending on whether GBWT_SAVE_MEMORY has been defined (see Data Types). Each sample/phase combination represents a haplotype.

Support structures

String array

class StringArray
{
  sdsl::int_vector<0> index;
  std::vector<char>   strings;
};

A string array stores multiple concatenated strings in a single array. This reduces the overhead from memory allocations and enables fast serialization and loading. The string with identifier i is stored between offsets index[i] and index[i + 1] of vector strings.

Tags

Tags are key-value pairs can be used for arbitrary annotations. They are serialized as StringArray, where strings with an even identifier are keys and the following strings are the corresponding values. The keys are case-insensitive.

Key source indicates the library used for serializing the GBWT. This library corresponds to value jltsiren/gbwt.

Dictionary

class Dictionary
{
  StringArray         strings;
  sdsl::int_vector<0> sorted_ids;
};

A dictionary stores the mapping between names and identifiers. The names are stored in strings. Array sorted_ids stores the identifiers in the lexicographic order of names.

Clone this wiki locally