Removing Sequences
Removing sequences from a GBWT index is similar to inserting them. The implemented algorithm is an in-memory variant of the parallel merging algorithm. Multiple search threads search for the sequences to be removed, building the rank array in memory. The positions specified by the rank array are then removed from the index. Because the uncompressed rank array is stored in memory, the algorithm temporarily requires up to tens of bytes for each position in the removed sequences. It is therefore best suited for removing a small number of sequences.
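The final step, removing the positions listed in the rank array, can be illustrated with a toy model. This is only a sketch: the flat array and the function name remove_positions are assumptions made for the example, whereas the real index stores compressed records for each node.

```cpp
#include <cstddef>
#include <vector>

// Toy model: copy `data` while skipping the elements at the given positions.
// `positions` must be sorted and free of duplicates, like a sorted rank array.
template<class T>
std::vector<T> remove_positions(const std::vector<T>& data, const std::vector<std::size_t>& positions)
{
  std::vector<T> result;
  std::size_t next = 0; // Index of the next position to skip.
  for(std::size_t i = 0; i < data.size(); i++)
  {
    if(next < positions.size() && positions[next] == i) { next++; continue; }
    result.push_back(data[i]);
  }
  return result;
}
```

Because the positions are sorted, a single pass over the array is enough, and the memory needed for the position list grows with the total length of the removed sequences.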
If the index is bidirectional, any request to remove sequence N will actually remove sequences Path::encode(N, false) and Path::encode(N, true). Otherwise sequence N will be removed instead. The set of sequence identifiers may contain duplicates, as they are removed during preprocessing. If at least one of the specified sequence identifiers is invalid, no sequences are removed.
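The preprocessing rules above can be sketched as follows. This is a minimal illustration rather than the library's code: the helper names encode_path and expand_ids are hypothetical, and the sketch assumes that Path::encode(N, reversed) maps to 2N + reversed, which this page does not state.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using size_type = std::size_t;

// Assumed encoding: the orientation is stored in the lowest bit.
size_type encode_path(size_type id, bool reversed) { return 2 * id + (reversed ? 1 : 0); }

// Hypothetical helper: maps the requested identifiers to the sequences stored
// in the index. `total_sequences` is the number of sequences in the index.
// Returns an empty vector if any identifier is invalid.
std::vector<size_type> expand_ids(const std::vector<size_type>& seq_ids, size_type total_sequences, bool bidirectional)
{
  std::vector<size_type> result;
  for(size_type id : seq_ids)
  {
    if(bidirectional)
    {
      // Remove both orientations of the sequence.
      result.push_back(encode_path(id, false));
      result.push_back(encode_path(id, true));
    }
    else { result.push_back(id); }
  }

  // Duplicates are harmless: sort and drop them.
  std::sort(result.begin(), result.end());
  result.erase(std::unique(result.begin(), result.end()), result.end());

  // One invalid identifier cancels the entire request.
  for(size_type id : result)
  {
    if(id >= total_sequences) { return {}; }
  }
  return result;
}
```

Under these assumptions, a request for sequences 3, 3, and 1 in a bidirectional index with 10 stored sequences expands to sequences 2, 3, 6, and 7.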
Sequences can be removed with remove_seq.
remove_seq [options] base_name seq1 [seq2 ...]
The program reads base_name.gbwt and removes the sequences with identifiers seq1, seq2, and so on. The output is written back to base_name.gbwt unless option -o specifies another file.
- -c N: Use chunks of N sequences per search thread.
- -o X: Write the output to X.gbwt.
- -r: Remove the range of sequences seq1 to seq2 (inclusive). Requires exactly two sequence arguments.
Example: remove_seq -r -o output input 11 20
This reads input.gbwt, removes sequences 11 to 20, and writes the result to output.gbwt.
The following member functions of DynamicGBWT remove sequences from the index. The return value is the total length of the removed sequences, or 0 if no sequences were removed.
size_type remove(size_type seq_id, size_type chunk_size = REMOVE_CHUNK_SIZE);
size_type remove(const std::vector<size_type>& seq_ids, size_type chunk_size = REMOVE_CHUNK_SIZE);
- seq_id: Identifier of the sequence.
- seq_ids: Set of sequence identifiers to be removed.
- chunk_size: Use chunks of this many sequences per search thread.
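To make the role of chunk_size concrete, the sketch below splits a list of n identifiers into chunks of at most chunk_size sequences each. The helper make_chunks is hypothetical and not part of the library, and how the search threads actually pick up chunks is not specified here.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using size_type = std::size_t;

// Hypothetical helper: returns half-open ranges [begin, end) over a list of
// n identifiers, each covering at most chunk_size of them.
std::vector<std::pair<size_type, size_type>> make_chunks(size_type n, size_type chunk_size)
{
  std::vector<std::pair<size_type, size_type>> chunks;
  if(chunk_size == 0) { chunk_size = 1; } // Guard against an infinite loop.
  for(size_type begin = 0; begin < n; begin += chunk_size)
  {
    chunks.emplace_back(begin, std::min(n, begin + chunk_size));
  }
  return chunks;
}
```

With 10 identifiers and a chunk size of 4, this yields the ranges [0, 4), [4, 8), and [8, 10). In a scheme like this, smaller chunks would balance the work more evenly across threads, while larger chunks would reduce scheduling overhead.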
Example:
DynamicGBWT index;
sdsl::load_from_file(index, input_name);
std::vector<size_type> to_remove;
for(size_type i = 11; i <= 20; i++) { to_remove.push_back(i); }
index.remove(to_remove);
sdsl::store_to_file(index, output_name);
This reads the index from file input_name, removes sequences 11 to 20, and writes the resulting index to file output_name.