-
Notifications
You must be signed in to change notification settings - Fork 13
File Formats
Jouni Siren edited this page Aug 7, 2019
·
18 revisions
This describes the current version of the GBWT file format. The default extension is .gbwt
.
File format version | GBWT version | Changes |
---|---|---|
4 | v0.9 | Metadata version 1 with path/sample/contig names. Compatible with versions 1 to 3. |
3 | v0.7 | Optional metadata in the index. Compatible with versions 1 and 2. |
2 | v0.5 | A flag for bidirectional search in the header. Compatible with version 1. |
1 | v0.2 | The first proper version, effectively the same as version 0. |
0 | v0.1 |
We split the BWT into records. Each record contains the part of the BWT corresponding to the reverse prefixes starting with a node of the graph. We do not have records for the maximal range [1, offset] of node identifiers not occurring in the collection. The record contains the following information:
- Incoming edges as pairs (source node, path count) in sorted order.
-
Outgoing edges as pairs (destination node, path rank) in sorted order.
- Path rank is the number of occurrences of the destination node in the records with identifier smaller than this record.
-
Body contains the run-length encoded BWT as pairs (edge rank, count).
- Edge rank is the rank of the destination node in the list of outgoing edges.
-
Sampled identifiers as pairs (offset, identifier).
- Offset is a lexicographic rank relative to the start of this record.
- We currently sample one out of 1024 positions on each path.
The dynamic FM-index is used for construction.
class DynamicGBWT
{
GBWTHeader header;
std::vector<DynamicRecord> bwt;
Metadata metadata;
};
struct GBWTHeader
{
std::uint32_t tag;
std::uint32_t version;
std::uint64_t sequences;
std::uint64_t size;
std::uint64_t offset;
std::uint64_t alphabet_size;
std::uint64_t flags;
};
- Currently
tag
must be0x6B376B37
andversion
must be4
. - The following flags are stored in the bitfield
flags
:-
FLAG_BIDIRECTIONAL = 0x0001
: The index supports bidirectional search. -
FLAG_METADATA = 0x0002
: The index contains metadata.
-
- The index contains
sequences
sequences of total lengthsize
. -
offset
defines the maximal range of node identifiers with no records. -
alphabet_size
is the largest node identifier + 1.
struct DynamicRecord
{
size_type body_size;
std::vector<edge_type> incoming, outgoing;
std::vector<run_type> body;
std::vector<sample_type> ids;
};
-
body_size
is the length of the BWT in this record. -
edge_type
,run_type
, andsample_type
are all pairs of 32-bit integers.
struct Metadata
{
std::uint32_t tag;
std::uint32_t version;
std::uint64_t sample_count;
std::uint64_t haplotype_count;
std::uint64_t contig_count;
std::uint64_t flags;
std::vector<PathName> path_names;
Dictionary sample_names;
Dictionary contig_names;
};
- Currently
tag
must be0x6B375E7A
andversion
must be1
. - The following flags are stored in the bitfield
flags
:-
FLAG_PATH_NAMES = 0x0001
: The metadata contains path names. -
FLAG_SAMPLE_NAMES = 0x0002
: The metadata contains sample names. -
FLAG_CONTIG_NAMES = 0x0004
: The metadata contains contig names.
-
- The index contains
haplotype_count
haplotypes ofsample_count
samples overcontig_count
contigs. - See Metadata for further details.
The compressed index is used for queries and serialization. It is up to 10x smaller and typically 2x to 4x slower than the dynamic index.
Note that the serialized index can be loaded as either GBWT
or DynamicGBWT
.
class GBWT
{
GBWTHeader header;
RecordArray bwt;
DASamples da_samples;
Metadata metadata;
};
Each record is encoded as follows:
- Incoming edges are not stored. The information is recomputed when the dynamic index is loaded.
-
Outgoing edges are encoded using
ByteCode
. The encoding starts with the outdegree, followed by the pairs of integers for each edge.- The destination nodes are differentially encoded.
- In
ByteCode
, each byte contains the lowest remaining 7 bits of the integer, with the highest bit marking whether the encoding continues in the next node.
-
Body is encoded using
Run
, which in turn usesByteCode
.- The first byte contains edge rank and an integer between 0 and x = floor(256 / outdegree).
- If the run length is less than x, it is encoded in the first byte as run length - 1.
- Otherwise the first byte contains x and the following bytes encode run length - x using
ByteCode
. - If the local alphabet size (outdegree) is too large, we encode pair (edge rank, run length - 1) using
ByteCode
.
-
Sampled identifiers are stored in a global structure.
- Some nodes have many samples, making linear search through them expensive.
- Many nodes have samples, making the overhead from individual structures high.
struct RecordArray
{
size_type records;
sdsl::sd_vector<> index;
sdsl::sd_vector<>::select_1_type select;
std::vector<byte_type> data;
};
- The array contains
records
records. -
index
marks the start of each record with 1-bits, whileselect
is used for finding the 1-bits. -
data
is a byte array containing the encoding.
struct DASamples
{
sdsl::bit_vector sampled_records;
sdsl::bit_vector::rank_1_type record_rank;
sdsl::sd_vector<> bwt_ranges;
sdsl::sd_vector<>::select_1_type bwt_select;
sdsl::sd_vector<> sampled_offsets;
sdsl::sd_vector<>::rank_1_type sample_rank;
sdsl::int_vector<0> array;
};
-
sampled_records
tells whether a record has samples, whilerecord_rank
gives a rank among the records with samples.- As most nodes do not have samples, this offers a quick way of skipping them when searching for the next sample.
-
bwt_ranges
is a bitvector over the concatenated offset space of the records with samples. 1-bits mark the beginning of each record, andbwt_select
is used for finding the 1-bits. -
sampled_offsets
is another bitvector over the concatenated offset spaces, marking the sampled offsets with 1-bits.sample_rank
gives a rank among the samples. -
array
contains the sampled path identifiers.