Construction Interface

Jouni Siren edited this page Nov 19, 2023 · 36 revisions

General

GBWT is built by inserting sequences in batches into a dynamic FM-index (DynamicGBWT). The construction algorithm is based on BCR and RopeBWT2. A single step of the algorithm consists of extending each sequence in the current batch by one node.

Batch size provides a time/space trade-off. Large batches require more memory, as the sequences must be loaded into memory. On the other hand, if the sequences are aligned (to some extent), the number of records modified in each step remains small, and the algorithm is faster.

Each batch must consist of entire sequences. If batch size is specified in number of nodes and the last sequence is not fully contained in the batch, the actual batch size will be smaller than specified. For best performance, each batch should consist of at least DynamicGBWT::MIN_SEQUENCES_PER_BATCH similar sequences.
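The rule that a batch consists of entire sequences can be illustrated with a minimal sketch. The function batch_sizes() below is a hypothetical helper, not part of the GBWT API. It assumes that sequence lengths already include endmarkers and that a sequence longer than the batch size forms a batch of its own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the GBWT API: greedily groups whole
// sequences into batches of at most `batch_size` nodes and returns the
// actual size of each batch. Lengths are assumed to include endmarkers.
std::vector<std::size_t>
batch_sizes(const std::vector<std::size_t>& lengths, std::size_t batch_size)
{
  assert(batch_size > 0);
  std::vector<std::size_t> result;
  std::size_t current = 0;
  for(std::size_t length : lengths)
  {
    // Close the batch if the next sequence would not fit entirely.
    if(current > 0 && current + length > batch_size)
    {
      result.push_back(current);
      current = 0;
    }
    // A sequence longer than the batch size ends up in a batch of its own.
    current += length;
  }
  if(current > 0) { result.push_back(current); }
  return result;
}
```

With sequence lengths 40, 40, 40, and 30 and a batch size of 100 nodes, the first batch closes at 80 nodes, because the third sequence would not fit entirely.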

Some construction options support inserting sequences in both orientations. When using this option, the node identifiers are assumed to be already encoded with Node::encode(), so that the identifier of the reverse node can be determined with Node::reverse(). See Identifiers for further details.
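The following sketch shows how the reverse orientation of a path could be derived from encoded identifiers. The functions encode() and reverse() are stand-ins for Node::encode() and Node::reverse(). They assume that the orientation is stored in the lowest bit of the encoded identifier; see Identifiers for the actual definition. reverse_path() is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using node_type = std::uint64_t;

// Stand-in for Node::encode(), assuming the orientation is stored
// in the lowest bit of the encoded identifier.
inline node_type encode(node_type id, bool is_reverse)
{
  assert(id <= (UINT64_MAX >> 1)); // The encoded identifier must fit in 64 bits.
  return 2 * id + (is_reverse ? 1 : 0);
}

// Stand-in for Node::reverse() under the same assumption.
inline node_type reverse(node_type node) { return node ^ 1; }

// Hypothetical helper: the reverse orientation of a path visits the
// nodes in reverse order, with the orientation of each node flipped.
std::vector<node_type> reverse_path(const std::vector<node_type>& path)
{
  std::vector<node_type> result;
  result.reserve(path.size());
  for(auto iter = path.rbegin(); iter != path.rend(); ++iter)
  {
    result.push_back(reverse(*iter));
  }
  return result;
}
```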

Bidirectional search requires that the index contains both orientations of each sequence. This information is maintained in the header of the index.

Sample interval is a time/space trade-off for locate() queries. Given sample interval N, we store the path identifiers at one out of N positions in each sequence. If the record does not contain a stored identifier for a given offset, we iterate LF-mapping until we find a stored identifier. Sample interval 0 means that we store the identifiers only at the end of each sequence.
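The trade-off can be sketched with a toy model of a single sequence. The model assumes that the last position is always sampled and that, with a nonzero interval N, every position divisible by N is sampled as well. The actual placement of samples in the library may differ; steps_to_sample() is a hypothetical function that counts how many LF-mapping steps a locate() query would need in this model.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy model, not the GBWT implementation: number of forward steps from
// position `pos` to the nearest sampled position in a sequence of `length`
// positions. Assumes samples at multiples of `interval` and at the end.
std::size_t steps_to_sample(std::size_t pos, std::size_t length, std::size_t interval)
{
  assert(pos < length);
  std::size_t last = length - 1;
  // Interval 0: only the end of the sequence is sampled.
  if(interval == 0) { return last - pos; }
  // Round the position up to the next multiple of the interval.
  std::size_t next = ((pos + interval - 1) / interval) * interval;
  return std::min(next, last) - pos;
}
```

In this model a query never takes more than N - 1 steps with a nonzero interval, while interval 0 may require traversing the rest of the sequence.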

The construction is single-threaded, though some construction methods do the actual construction in a background worker thread. The memory usage of construction is low enough that it is feasible to run multiple construction jobs (e.g. for multiple chromosomes) in parallel.

If the construction encounters an unrecoverable error, it terminates the program with std::exit(EXIT_FAILURE).

See also: Data model | Merging

Construction tool

The main construction tool is build_gbwt. It is effectively single-threaded, though it uses separate threads for reading the input and for building the index. Unless otherwise specified, each input file is assumed to be a serialized sdsl::int_vector<0>.

build_gbwt [options] input1 [input2 ...]

This builds a single index from all input files and writes it to a file with the .gbwt extension. If no output is specified, there is only one input file, and no existing index is loaded, input1 is used as the base name for output.

  • -b N: Insert the sequences in batches of N million nodes. Use batch size 0 to insert all sequences in a single batch. Default: 100.
  • -c: Check for overlapping variants in generated haplotypes. Writes the output to stderr. Use with -p.
  • -f: Insert the sequences only in forward orientation. This is the default behavior.
  • -F X: Read a list of file names from file X, one file per line. Use these as input files. May repeat. Specifying input files is unnecessary with this option.
  • -i X: Insert the sequences to an existing index X.gbwt.
  • -L X: Read a list of file names from file X, one file per line. Use these as phasing files. May repeat. Use with -p.
  • -o X: Use X as the base name for output.
  • -O: Output SDSL format instead of simple-sds format.
  • -p: The inputs are parsed VCF files. See Haplotype Generation.
  • -P X: Only use phasing information from file X. May repeat. Use with -p.
  • -r: Insert the sequences in both orientations. This is required for bidirectional search.
  • -R: Resample the sequence ids in the loaded index (implies -l).
  • -s N: Use sample interval N. Default: 1024. Use 0 to store samples only at the end of each sequence.
  • -S: Skip overlapping variants. Use with -p.
  • -t: The inputs are text files containing one path per line. Each path is a comma-separated list of node identifiers.
  • -v: Verify the correctness of the index with various queries based on the input.

Example: build_gbwt -b 200 input reads the sequences from input, builds the GBWT in batches of 200 million nodes, and writes the index to input.gbwt.

Example: build_gbwt -r -i index -o output input reads the sequences from input, inserts them in both orientations into index.gbwt, and writes the result to output.gbwt.
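The text input accepted with -t can be illustrated with a small parser. This is a minimal sketch of reading one comma-separated path, not the parser used by build_gbwt. The function parse_path() is hypothetical, and details such as whitespace handling are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parses one line of text input into a path.
// The line is a comma-separated list of node identifiers.
std::vector<std::uint64_t> parse_path(const std::string& line)
{
  assert(line.find('\n') == std::string::npos); // One path per line.
  std::vector<std::uint64_t> path;
  std::istringstream in(line);
  std::string token;
  while(std::getline(in, token, ','))
  {
    if(token.empty()) { continue; } // Skip empty fields.
    path.push_back(std::stoull(token));
  }
  return path;
}
```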

GBWT construction

DynamicGBWT is defined in dynamic_gbwt.h.

GBWT construction is based on creating an empty index with the default constructor DynamicGBWT() and then inserting sequences into it with one or more DynamicGBWT::insert() calls. There are also ways to merge existing GBWTs.

After the GBWT index has been built, the DynamicGBWT object can be serialized or converted to GBWT. The conversion can be done with a constructor or an assignment:

GBWT(const DynamicGBWT& source)
GBWT& operator=(const DynamicGBWT& source)

Single batch

If you can generate the sequences incrementally, this option allows building the GBWT without storing the sequences explicitly. These functions are single-threaded. They are also space-efficient, as they use the text directly as an input without additional buffering.

void insert(const text_type& text, bool has_both_orientations = false, size_type sample_interval = SAMPLE_INTERVAL)
void insert(const text_type& text, size_type text_length, bool has_both_orientations = false, size_type sample_interval = SAMPLE_INTERVAL)
void insert(const vector_type& text, bool has_both_orientations = false, size_type sample_interval = SAMPLE_INTERVAL)
  • text: Batch of sequences.
  • text_length: Total length of the sequences including endmarkers.
  • has_both_orientations: Set to indicate that the batch contains both orientations of each sequence.
  • sample_interval: Sample interval for the inserted sequences.

In a common use case, we have a single text_type as a buffer. When the next sequence no longer fits into the buffer, the batch is inserted into the index and the buffer is cleared. Because resizing text_type always causes reallocation, it is more efficient to specify the total length of the sequences instead of resizing the buffer.
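The buffer reuse pattern can be sketched as follows. BatchBuffer is a hypothetical class that uses std::vector as a stand-in for text_type. It assumes that each sequence in a batch is terminated by an endmarker with value 0. The callback receives the buffer and the number of used entries, which correspond to the text and text_length arguments of insert().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using node_type = std::uint64_t;
const node_type ENDMARKER = 0; // Assumed endmarker value.

// Hypothetical fixed-capacity buffer that is never resized.
class BatchBuffer
{
public:
  using flush_type = std::function<void(const std::vector<node_type>&, std::size_t)>;

  BatchBuffer(std::size_t capacity, flush_type flush) :
    data(capacity, ENDMARKER), used(0), on_flush(std::move(flush))
  {
  }

  // Appends the sequence and an endmarker, flushing first if it does not fit.
  void add(const std::vector<node_type>& sequence)
  {
    std::size_t required = sequence.size() + 1;
    assert(required <= data.size());
    if(used + required > data.size()) { flush(); }
    for(node_type node : sequence) { data[used++] = node; }
    data[used++] = ENDMARKER;
  }

  // Passes the used prefix of the buffer to the callback and starts a new batch.
  void flush()
  {
    if(used == 0) { return; }
    on_flush(data, used);
    used = 0; // Reuse the buffer instead of resizing it.
  }

private:
  std::vector<node_type> data;
  std::size_t used;
  flush_type on_flush;
};
```

In actual use, the callback would call insert(text, text_length) on the index.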

Example:

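DynamicGBWT gbwt;
// Insert the sequences 1000 at a time. The last batch may be smaller.
for(size_type i = 0; i < source.size(); i += 1000)
{
  // get(first, last) is assumed to return sequences first to last (inclusive) as a single batch.
  text_type batch = source.get(i, std::min(i + 1000, source.size()) - 1);
  gbwt.insert(batch);
}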

This builds GBWT for source, generating batches of 1000 sequences and inserting them into an initially empty index.

Construction from disk

If the sequences are stored on disk, this option inserts them in multiple batches. The function is effectively single-threaded, though it uses separate threads for reading the input and for building the index.

void insert(text_buffer_type& text, size_type batch_size = INSERT_BATCH_SIZE, bool both_orientations = false, size_type sample_interval = SAMPLE_INTERVAL)
  • text: Sequences on disk.
  • batch_size: Batch size in number of nodes. Use 0 to insert all sequences as a single batch. Default: 100 million.
  • both_orientations: Set to true to index the sequences in both orientations.
  • sample_interval: Sample interval for the inserted sequences.

Example:

DynamicGBWT gbwt;
text_buffer_type text(input_name);
gbwt.insert(text, 200 * MILLION);

This inserts the sequences from file input_name into an empty GBWT in batches of 200 million nodes.

Buffered construction

class GBWTBuilder (defined in dynamic_gbwt.h) provides a simple interface for incremental construction. Sequences are inserted into the input buffer one-by-one. When the buffer gets full, it is swapped with the construction buffer. A worker thread is then launched to insert the sequences in the construction buffer into the index.
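The buffering scheme can be sketched with a toy class. ToyBuilder is hypothetical and is not the GBWTBuilder implementation: instead of building an index, its worker thread only counts the nodes in the construction buffer. The sketch shows the order of operations: wait for the previous batch, swap the buffers, and launch a new worker.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Toy stand-in for the double-buffering scheme.
class ToyBuilder
{
public:
  explicit ToyBuilder(std::size_t buffer_size) : capacity(buffer_size), indexed_nodes(0)
  {
    assert(buffer_size > 0);
  }

  ~ToyBuilder() { if(worker.joinable()) { worker.join(); } }

  // Appends the sequence to the input buffer, flushing first if it would not fit.
  void insert(const std::vector<unsigned>& sequence)
  {
    if(!input.empty() && input.size() + sequence.size() > capacity) { flush(); }
    input.insert(input.end(), sequence.begin(), sequence.end());
  }

  // Processes the remaining sequences and waits for the worker.
  void finish()
  {
    flush();
    if(worker.joinable()) { worker.join(); }
  }

  // Number of nodes processed so far. Only valid after finish().
  std::size_t size() const { return indexed_nodes; }

private:
  void flush()
  {
    // Wait until the previous batch has been processed.
    if(worker.joinable()) { worker.join(); }
    std::swap(input, construction);
    input.clear();
    // The worker only touches the construction buffer and the counter.
    worker = std::thread([this]() { indexed_nodes += construction.size(); });
  }

  std::size_t capacity;
  std::size_t indexed_nodes;
  std::vector<unsigned> input, construction;
  std::thread worker;
};
```

While the worker processes the construction buffer, the caller can keep filling the input buffer, which is why the scheme needs two buffers of the same size.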

2023-11-19: GBWTBuilder will now automatically increase buffer size if a sequence is too large for the buffer. For best performance, buffer size should still be specified in advance.

The build_gbwt tool and the insert(text_buffer_type& text, ...) function are based on GBWTBuilder.

The public interface consists of the following member functions. The index itself can be accessed through the index member variable.

GBWTBuilder(size_type node_width, size_type batch_size = DynamicGBWT::INSERT_BATCH_SIZE, size_type sample_interval = DynamicGBWT::SAMPLE_INTERVAL)

The constructor creates a builder containing an empty index.

  • size_type node_width: Number of bits used to represent a node identifier.
  • size_type batch_size: Size of the input buffer and the construction buffer. Default: 100 million.
  • size_type sample_interval: Sample interval for the inserted sequences.

void swapIndex(DynamicGBWT& another_index)

This function can be used to swap the index stored in the GBWTBuilder with another index. With it, the builder can insert the sequences into an existing index. If sequences have been inserted into the buffer, finish() should be called before calling swapIndex().

  • DynamicGBWT& another_index: The index to swap the contents of index with.

void insert(const vector_type& sequence, bool both_orientations = false)

Insert a single sequence into the buffer. The sequence must not contain endmarkers.

  • const vector_type& sequence: The sequence to be inserted.
  • bool both_orientations: Set to true to insert the sequence in both orientations.

void finish()

Finish the construction. Flushes the buffers and recodes the index. Recoding sorts the outgoing edges in each record, making it possible to serialize the index (see File Formats).

Example:

GBWTBuilder builder(bit_length(Node::encode(max_id, true)));
builder.swapIndex(existing_index);
for(auto& sequence : sequences)
{
  builder.insert(sequence, true);
}
builder.finish();
builder.swapIndex(existing_index);

This example first creates a GBWTBuilder, using the width of the largest node identifier max_id in reverse orientation as node_width. It then swaps the index with an existing index and inserts all sequences from sequences in both orientations. Finally it finishes the construction and swaps the contents of the index back to existing_index.

Resampling

Both compressed and dynamic GBWT indexes can be resampled with a new sample interval. Resampling traverses all indexed paths using multiple threads and replaces the existing document array samples with new ones.

void resample(size_type sample_interval)
  • sample_interval: New sample interval. 0 means that only the final position on each path should be sampled.

Serialization

GBWT can use both SDSL and simple-sds serialization interfaces. When loading serialized indexes, both interfaces automatically detect and handle indexes serialized using the other interface.

Serialization depends on iostream exceptions for handling I/O errors. sdsl::simple_sds::InvalidData is thrown if the loaded data fails sanity checks.
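The error handling model can be illustrated with the standard library alone. In this sketch, can_read() is a hypothetical function that enables iostream exceptions before opening a file, so that a failed open or read is reported as std::ios_base::failure. The same kind of exception may surface from the serialization functions when exceptions are enabled on the stream.

```cpp
#include <cassert>
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper: returns true if the file can be opened and
// one byte can be read from it. I/O errors are reported as exceptions.
bool can_read(const std::string& filename)
{
  assert(!filename.empty());
  std::ifstream in;
  // Enable exceptions before opening, so that a failed open also throws.
  in.exceptions(std::ios::failbit | std::ios::badbit);
  try
  {
    in.open(filename, std::ios::binary);
    char c;
    in.read(&c, 1);
    return true;
  }
  catch(const std::ios_base::failure&)
  {
    return false;
  }
}
```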

Simple-SDS format

The simple-sds format is an interchange format between GBWT implementations. The serialized file is usually slightly smaller than the compressed in-memory representation. If the loaded index does not contain document array samples, it is automatically resampled with the default sample interval.

void simple_sds_serialize(std::ostream& out) const
void simple_sds_load(std::istream& in)
  • out: Any output stream.
  • in: Any input stream.

Example: sdsl::simple_sds::serialize_to(index, filename); writes index to filename.

Example: GBWT index; sdsl::simple_sds::load_from(index, filename); loads index from filename.

The serialize_to() / load_from() functions enable iostream exceptions. They throw sdsl::simple_sds::CannotOpenFile if the file cannot be opened.

SDSL format

The SDSL format matches the in-memory representation closely.

size_type serialize(std::ostream& out, sdsl::structure_tree_node* v = nullptr, std::string name = "") const
void load(std::istream& in)
  • out: Any output stream.
  • v, name: These can be ignored by the user.
  • in: Any input stream.

Example: sdsl::store_to_file(index, filename); writes index to filename.

Example: GBWT index; sdsl::load_from_file(index, filename); loads index from filename.

The store_to_file() / load_from_file() functions do not enable iostream exceptions.

References

Markus J. Bauer, Anthony J. Cox, and Giovanna Rosone: Lightweight algorithms for constructing and inverting the BWT of string collections. Theoretical Computer Science 483:134–148, 2013. DOI: 10.1016/j.tcs.2012.02.002

Heng Li: Fast construction of FM-index for long sequence reads. Bioinformatics 30(22):3274–3275, 2014. DOI: 10.1093/bioinformatics/btu541
