-
Notifications
You must be signed in to change notification settings - Fork 13
Construction Interface
GBWT is built by inserting sequences in batches to a dynamic FM-index (DynamicGBWT
). The construction algorithm is based on BCR and RopeBWT2. A single step of the algorithm consists of extending each sequence in the current batch by one node.
Batch size provides a time/space trade-off. Large batches require more memory, as the sequences must be loaded into memory. On the other hand, if the sequences are aligned (to some extent), the number of records modified in each step remains small, and the algorithm is faster.
Each batch must consist of entire sequences. If batch size is specified in number of nodes and the last sequence is not fully contained in the batch, the actual batch size will be smaller than specified.
The construction is partially multi-threaded. There is little benefit in using more than 2 or 4 threads, however. Index verification can take advantage of tens of threads.
See the data model for input specifications.
The main construction program is build_gbwt
.
build_gbwt [options] base_name
- The program reads input from
base_name
and writes the index tobase_name.gbwt
. -
-b N
: Insert the sequences in batches ofN
million nodes. Use batch size 0 to insert all sequences in a single batch. Default: 100. -
-v
: Verify the correctness of the index with various queries based on the input.
Example: build_gbwt -b 200 input
reads the sequences from input
, builds the GBWT in batches of 200 million nodes, and writes the index to input.gbwt
.
Existing indexes can be merged with merge_gbwt
.
merge_gbwt [options] input1 [input2 ...] output
- The program reads
input1.gbwt
, inserts the sequences frominput2.gbwt
(and further inputs) to it, and writes the merged index tooutput.gbwt
. - For best performance,
input1
should be the largest input. -
-b N
: Use batch size ofN
sequences. If the batch size is 0, all sequences are inserted as a single batch. Default: 2000.
Example: merge_gbwt large small1 small2 merged
inserts the sequences from small1.gbwt
and small2.gbwt
to large.gbwt
and writes the merged index to merged.gbwt
.
The construction writes status information to stderr
. There are four possible verbosity levels, which can be set using Verbosity::set(size_type new_level)
:
Level | Numerical | Description |
---|---|---|
Verbosity::SILENT |
0 | No status information |
Verbosity::BASIC |
1 | Basic progress information and statistics on the input and the final index |
Verbosity::EXTENDED |
2 | Adds intermediate statistics for each batch |
Verbosity::FULL |
3 | Adds detailed information for each batch (default) |
DynamicGBWT
is defined in dynamic_gbwt.h.
GBWT construction is based on creating an empty index with the default constructor DynamicGBWT()
and then inserting sequences into it with one or more DynamicGBWT::insert()
calls.
If you can generate the sequences incrementally, this options allows building the GBWT without storing the sequences explicitly.
void insert(const text_type& text)
-
text
: Batch of sequences.
Example:
DynamicGBWT gbwt;
for(size_type i = 0; i < source.size(); i += 1000)
{
text_type batch = source.get(i, std::min(i + 1000, source.size()) - 1);
gbwt.insert(batch);
}
This builds GBWT for source
, generating batches of 1000 sequences and inserting them into an initially empty index.
If the sequences are stored on disk, this option inserts them in multiple batches.
void insert(text_buffer_type& text, size_type batch_size = INSERT_BATCH_SIZE)
-
text
: Sequences on disk. -
batch_size
: Batch size in number of nodes. Use 0 to insert all sequences as a single batch. Default: 100 million.
Example:
DynamicGBWT gbwt;
text_buffer_type text(input_name);
gbwt.insert(text, 200 * MILLION);
This inserts the sequences from file input_name
into an empty GBWT in batches of 200 million nodes.
This option extracts the sequences from a compressed GBWT and inserts them into the index. It is around 2 times slower than the other construction options.
void merge(const GBWT& source, size_type batch_size = MERGE_BATCH_SIZE)
-
source
: Input GBWT. -
batch_size
: Batch size in number of sequences. Use 0 to insert all sequences as a single batch. Default: 2000.
Example:
DynamicGBWT input1;
sdsl::load_from_file(input1, input1_name);
GBWT input2;
sdsl::load_from_file(input2, input2_name);
input1.insert(input2);
This reads the index from file input1_name
as a dynamic GBWT input1
and the index from file input2_name
as a compressed GBWT input2
. Then it inserts the sequences from input2
into input1
.
GBWT uses the SDSL interface for serializing and loading structures.
size_type serialize(std::ostream& out, sdsl::structure_tree_node* v = nullptr, std::string name = "") const
void load(std::istream& in)
-
out
: Any output stream. -
v
,name
: These can be ignored by the user. -
in
: Any input stream.
Example: sdsl::store_to_file(index, filename);
writes index
to filename
.
Example: GBWT index; sdsl::load_from_file(index, filename);
loads index
from filename
.
Markus J. Bauer, Anthony J. Cox, and Giovanna Rosone: Lightweight algorithms for constructing and inverting the BWT of string collections. Theoretical Computer Science 483:134–148, 2013. DOI: 10.1016/j.tcs.2012.02.002
Heng Li: Fast construction of FM-index for long sequence reads. Bioinformatics 30(22):3274–3275, 2014. DOI: 10.1093/bioinformatics/btu541