Skip to content

Construction Interface

Jouni Siren edited this page Nov 9, 2017 · 36 revisions

General

GBWT is built by inserting sequences in batches to a dynamic FM-index (DynamicGBWT). The construction algorithm is based on BCR and RopeBWT2. A single step of the algorithm consists of extending each sequence in the current batch by one node.

Batch size provides a time/space trade-off. Large batches require more memory, as the sequences must be loaded into memory. On the other hand, if the sequences are aligned (to some extent), the number of records modified in each step remains small, and the algorithm is faster.

Each batch must consist of entire sequences. If batch size is specified in number of nodes and the last sequence is not fully contained in the batch, the actual batch size will be smaller than specified.

Some construction options support inserting sequences in both orientations. See Identifiers for further details.

The construction is single-threaded, though some construction methods do the actual construction in a background worker thread. The memory usage of construction is low enough that it is feasible to run multiple construction jobs (e.g. for multiple chromosomes) in parallel.

See the data model for input specifications.

Construction tools

Construction

The main construction tool is build_gbwt. It is effectively single-threaded, though it uses separate threads for reading the input and for building the index.

build_gbwt [options] input1 [input2 ...]

This builds an index for all input files output.gbwt. If no output is specified, there is only one input file, and no existing index is loaded, input1 is used as the base name for output.

  • -b N: Insert the sequences in batches of N million nodes. Use batch size 0 to insert all sequences in a single batch. Default: 100.
  • -f: Insert the sequences only in forward orientation. This is the default behavior.
  • -i X: Insert the sequences to an existing index X.gbwt.
  • -o X: Use X as the base name for output.
  • -r: Insert the sequences in both orientations.
  • -v: Verify the correctness of the index with various queries based on the input.

Example: build_gbwt -b 200 input reads the sequences from input, builds the GBWT in batches of 200 million nodes, and writes the index to input.gbwt.

Example: build_gbwt -r -i index -o output input reads the sequences from input, inserts them in both orientations into index.gbwt, and writes the result to output.gbwt.

Merging

Existing indexes can be merged with merge_gbwt. This tool is single-threaded.

merge_gbwt [options] -o output input1 input2 [input3 ...]

The program reads input1.gbwt, inserts the sequences from input2.gbwt (and further inputs) to it, and writes the merged index to output.gbwt. For best performance, input1 should be the largest input.

  • -b N: Use batch size of N sequences. If the batch size is 0, all sequences are inserted as a single batch. Default: 2000.
  • -o X: Write the output to X.gbwt. Required.

Example: merge_gbwt -o merged large small1 small2 inserts the sequences from small1.gbwt and small2.gbwt to large.gbwt and writes the merged index to merged.gbwt.

Global settings

The construction writes status information to stderr. There are four possible verbosity levels, which can be set using Verbosity::set(size_type new_level):

Level Numerical Description
Verbosity::SILENT 0 No status information
Verbosity::BASIC 1 Basic progress information and statistics on the input and the final index
Verbosity::EXTENDED 2 Adds intermediate statistics for each batch
Verbosity::FULL 3 Adds detailed information for each batch (default)

GBWT construction

DynamicGBWT is defined in dynamic_gbwt.h.

GBWT construction is based on creating an empty index with the default constructor DynamicGBWT() and then inserting sequences into it with one or more DynamicGBWT::insert() calls.

Single batch

If you can generate the sequences incrementally, this options allows building the GBWT without storing the sequences explicitly. These functions are single-threaded. They are also space-efficient, as they use the text directly as an input without additional buffering.

void insert(const text_type& text)
void insert(const text_type& text, size_type text_length)
void insert(const std::vector<node_type>& text)
  • text: Batch of sequences.
  • text_length: Total length of the sequences including endmarkers.

In a common use case, we have a single text_type as a buffer. When the next sequence no longer fits into the buffer, the batch is inserted into the index and the buffer is cleared. Because resizing text_type always causes reallocation, it is more efficient to specify the total length of the sequences instead of resizing the buffer.

Example:

DynamicGBWT gbwt;
for(size_type i = 0; i < source.size(); i += 1000)
{
  text_type batch = source.get(i, std::min(i + 1000, source.size()) - 1);
  gbwt.insert(batch);
}

This builds GBWT for source, generating batches of 1000 sequences and inserting them into an initially empty index.

Construction from disk

If the sequences are stored on disk, this option inserts them in multiple batches. The function is effectively single-threaded, though a it uses separate threads for reading the input and for building the index.

void insert(text_buffer_type& text, size_type batch_size = INSERT_BATCH_SIZE, bool both_orientations = false)
  • text: Sequences on disk.
  • batch_size: Batch size in number of nodes. Use 0 to insert all sequences as a single batch. Default: 100 million.
  • both_orientations: Set to true to index the sequences in both orientations.

Example:

DynamicGBWT gbwt;
text_buffer_type text(input_name);
gbwt.insert(text, 200 * MILLION);

This inserts the sequences from file input_name into an empty GBWT in batches of 200 million nodes.

Merging

This option extracts the sequences from a compressed GBWT and inserts them into the index. It is around 2 times slower than the other construction options. This function is single-threaded.

void merge(const GBWT& source, size_type batch_size = MERGE_BATCH_SIZE)
  • source: Input GBWT.
  • batch_size: Batch size in number of sequences. Use 0 to insert all sequences as a single batch. Default: 2000.

Example:

DynamicGBWT input1;
sdsl::load_from_file(input1, input1_name);
GBWT input2;
sdsl::load_from_file(input2, input2_name);
input1.insert(input2);

This reads the index from file input1_name as a dynamic GBWT input1 and the index from file input2_name as a compressed GBWT input2. Then it inserts the sequences from input2 into input1.

Buffered construction

class GBWTBuilder (defined in dynamic_gbwt.h) provides a simple interface for incremental construction. Sequences are inserted into the input buffer one-by-one. When the buffer gets full, it is swapped with the construction buffer. A worker thread is then launched to insert the sequences in the construction buffer into the index.

The build_gbwt tool and the insert(text_buffer_type& text, ...) function are based on GBWTBuilder.

The public interface consists of the following member functions. The index itself can be accessed through the index member variable.

GBWTBuilder(size_type node_width, size_type batch_size = DynamicGBWT::INSERT_BATCH_SIZE)

The constructor creates a builder containing an empty index.

  • size_type node_width: Number of bits used to represent a node identifier.
  • size_type batch_size: Size of the input buffer and the construction buffer. Default: 100 million.

Serialization

GBWT uses the SDSL interface for serializing and loading structures.

size_type serialize(std::ostream& out, sdsl::structure_tree_node* v = nullptr, std::string name = "") const
void load(std::istream& in)
  • out: Any output stream.
  • v, name: These can be ignored by the user.
  • in: Any input stream.

Example: sdsl::store_to_file(index, filename); writes index to filename.

Example: GBWT index; sdsl::load_from_file(index, filename); loads index from filename.

References

Markus J. Bauer, Anthony J. Cox, and Giovanna Rosone: Lightweight algorithms for constructing and inverting the BWT of string collections. Theoretical Computer Science 483:134–148, 2013. DOI: 10.1016/j.tcs.2012.02.002

Heng Li: Fast construction of FM-index for long sequence reads. Bioinformatics 30(22):3274–3275, 2014. DOI: 10.1093/bioinformatics/btu541

Clone this wiki locally