-
Notifications
You must be signed in to change notification settings - Fork 13
File Formats
This describes the current version of the GBWT file format based on the in-memory SDSL data structures. The file format using simple-sds structures is documented on a separate page. All tools output the simple-sds format by default, with an option for using the SDSL format.
The default extension is .gbwt
.
File format version | GBWT version | Changes |
---|---|---|
5 | v1.3 | Support for serialization using Simple-SDS structures. Tags structure. Metadata version 2. Compatible with versions 1 to 4. |
4 | v0.9 | Metadata version 1 with path/sample/contig names. Compatible with versions 1 to 3. |
3 | v0.7 | Optional metadata in the index. Compatible with versions 1 and 2. |
2 | v0.5 | A flag for bidirectional search in the header. Compatible with version 1. |
1 | v0.2 | The first proper version, effectively the same as version 0. |
0 | v0.1 |
We split the BWT into records. Each record contains the part of the BWT corresponding to the reverse prefixes starting with a node of the graph. We do not have records for the maximal range [1, offset] of node identifiers not occurring in the collection. The record contains the following information:
- Incoming edges as pairs (source node, path count) in sorted order.
-
Outgoing edges as pairs (destination node, path rank) in sorted order.
- Path rank is the number of occurrences of the destination node in the records with identifier smaller than this record.
-
Body contains the run-length encoded BWT as pairs (edge rank, count).
- Edge rank is the rank of the destination node in the list of outgoing edges.
-
Sampled identifiers as pairs (offset, identifier).
- Offset is a lexicographic rank relative to the start of this record.
- We currently sample one out of 1024 positions on each path.
The dynamic FM-index is used for construction.
class DynamicGBWT
{
GBWTHeader header;
Tags tags;
std::vector<DynamicRecord> bwt;
Metadata metadata;
};
struct GBWTHeader
{
std::uint32_t tag;
std::uint32_t version;
std::uint64_t sequences;
std::uint64_t size;
std::uint64_t offset;
std::uint64_t alphabet_size;
std::uint64_t flags;
};
- Currently
tag
must be0x6B376B37
andversion
must be5
. - The following flags are stored in the bitfield
flags
:-
FLAG_BIDIRECTIONAL = 0x0001
: The index supports bidirectional search. -
FLAG_METADATA = 0x0002
: The index contains metadata. -
FLAG_SIMPLE_SDS = 0x0004
: The index is serialized using simple-sds structures.
-
- The index contains
sequences
sequences of total lengthsize
. -
offset
defines the maximal range of node identifiers with no records. -
alphabet_size
is the largest node identifier + 1.
struct DynamicRecord
{
size_type body_size;
std::vector<edge_type> incoming, outgoing;
std::vector<run_type> body;
std::vector<sample_type> ids;
};
-
body_size
is the length of the BWT in this record. -
edge_type
,run_type
, andsample_type
are all pairs of 32-bit integers.
The compressed index is used for queries and serialization. It is up to 10x smaller and typically 2x to 4x slower than the dynamic index.
Note that the serialized index can be loaded as either GBWT
or DynamicGBWT
.
class GBWT
{
GBWTHeader header;
Tags tags;
RecordArray bwt;
DASamples da_samples;
Metadata metadata;
};
Each record is encoded as follows:
- Incoming edges are not stored. The information is recomputed when the dynamic index is loaded.
-
Outgoing edges are encoded using
ByteCode
. The encoding starts with the outdegree, followed by the pairs of integers for each edge.- The destination nodes are differentially encoded.
- In
ByteCode
, each byte contains the lowest remaining 7 bits of the integer, with the highest bit marking whether the encoding continues in the next node.
-
Body is encoded using
Run
, which in turn usesByteCode
.- The first byte contains edge rank and an integer between 0 and x = floor(256 / outdegree).
- If the run length is less than x, it is encoded in the first byte as run length - 1.
- Otherwise the first byte contains x and the following bytes encode run length - x using
ByteCode
. - If the local alphabet size (outdegree) is too large, we encode pair (edge rank, run length - 1) using
ByteCode
.
-
Sampled identifiers are stored in a global structure.
- Some nodes have many samples, making linear search through them expensive.
- Many nodes have samples, making the overhead from individual structures high.
struct RecordArray
{
size_type records;
sdsl::sd_vector<> index;
sdsl::sd_vector<>::select_1_type select;
std::vector<byte_type> data;
};
- The array contains
records
records. -
index
marks the start of each record with 1-bits, whileselect
is used for finding the 1-bits. -
data
is a byte array containing the encoding.
struct DASamples
{
sdsl::bit_vector sampled_records;
sdsl::bit_vector::rank_1_type record_rank;
sdsl::sd_vector<> bwt_ranges;
sdsl::sd_vector<>::select_1_type bwt_select;
sdsl::sd_vector<> sampled_offsets;
sdsl::sd_vector<>::rank_1_type sample_rank;
sdsl::int_vector<0> array;
};
-
sampled_records
tells whether a record has samples, whilerecord_rank
gives a rank among the records with samples.- As most nodes do not have samples, this offers a quick way of skipping them when searching for the next sample.
-
bwt_ranges
is a bitvector over the concatenated offset space of the records with samples. 1-bits mark the beginning of each record, andbwt_select
is used for finding the 1-bits. -
sampled_offsets
is another bitvector over the concatenated offset spaces, marking the sampled offsets with 1-bits.sample_rank
gives a rank among the samples. -
array
contains the sampled path identifiers.
class Metadata
{
MetadataHeader header;
std::vector<PathName> path_names;
Dictionary sample_names;
Dictionary contig_names;
};
-
path_names
is either absent or contains a unique name for each path. -
sample_names
is either absent or contains a unique name for each sample. -
contig_names
is either absent or contains a unique name for each contig.
The name structures are present if and only if the corresponding flag is set in the header.
See also: Metadata.
Metadata format version | GBWT version | Changes |
---|---|---|
2 | v1.3 | Serialization using simple-sds data structures. Compatible with versions 0 to 1. |
1 | v0.9 | Path/sample/contig names. Compatible with version 0. |
0 | v0.7 |
struct MetadataHeader
{
std::uint32_t tag;
std::uint32_t version;
std::uint64_t sample_count;
std::uint64_t haplotype_count;
std::uint64_t contig_count;
std::uint64_t flags;
};
- Currently
tag
must be0x6B375E7A
andversion
must be2
. - The following flags are stored in the bitfield
flags
:-
FLAG_PATH_NAMES = 0x0001
: The metadata contains path names. -
FLAG_SAMPLE_NAMES = 0x0002
: The metadata contains sample names. -
FLAG_CONTIG_NAMES = 0x0004
: The metadata contains contig names.
-
- There are there are
sample_count
samples withhaplotype_count
haplotypes overcontig_count
contigs.
struct PathName
{
path_name_type sample;
path_name_type contig;
path_name_type phase;
path_name_type count;
};
A path name is a combination of sample identifier, contig identifier, phase number, and running count. Type path_name_type
is either short_type
or size_type
, depending on whether GBWT_SAVE_MEMORY
has been defined (see Data Types). Each sample/phase combination represents a haplotype.
class StringArray
{
sdsl::int_vector<0> index;
std::vector<char> strings;
};
A string array stores multiple concatenated strings in a single array. This reduces the overhead from memory allocations and enables fast serialization and loading. The string with identifier i
is stored between offsets index[i]
and index[i + 1]
of vector strings
.
Tags are key-value pairs can be used for arbitrary annotations. They are serialized as StringArray
, where strings with an even identifier are keys and the following strings are the corresponding values. The keys are case-insensitive.
Key source
indicates the library used for serializing the GBWT. This library corresponds to value jltsiren/gbwt
.
class Dictionary
{
StringArray strings;
sdsl::int_vector<0> sorted_ids;
};
A dictionary stores the mapping between names and identifiers. The names are stored in strings
. Array sorted_ids
stores the identifiers in the lexicographic order of names.