
Removing Sequences

Jouni Siren edited this page Nov 28, 2018 · 6 revisions

General

Removing sequences from a GBWT index is similar to inserting them. The implemented algorithm is an in-memory variant of the parallel merging algorithm. Multiple search threads search for the sequences to be removed and build the rank array in memory. The positions specified by the rank array are then removed from the index. Because the uncompressed rank array is held in memory, the algorithm temporarily needs up to tens of bytes for each position in the sequences being removed. It is therefore best suited for removing a small number of sequences.
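The two-phase structure described above (first collect the positions, then remove them in one batch) can be illustrated with a toy sketch. This is not GBWT code: `Record`, `Position`, and `remove_positions` are hypothetical names, the records are plain integer vectors rather than compressed GBWT records, and the multithreaded search phase is omitted, so the marked positions are passed in directly.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical toy types; these are not GBWT data structures.
using Record = std::vector<int>;
// (record index, offset within the record); stands in for a rank array entry.
using Position = std::pair<std::size_t, std::size_t>;

// Batch removal: delete every marked position from the records.
// The positions are sorted first so that each record is rebuilt only once.
// Marked positions that do not exist in the records are ignored.
void remove_positions(std::vector<Record>& records, std::vector<Position> marked)
{
  std::sort(marked.begin(), marked.end());
  marked.erase(std::unique(marked.begin(), marked.end()), marked.end());

  std::size_t next = 0; // Index of the next unprocessed marked position.
  for(std::size_t r = 0; r < records.size(); r++)
  {
    Record kept;
    for(std::size_t i = 0; i < records[r].size(); i++)
    {
      if(next < marked.size() && marked[next] == Position(r, i)) { next++; }
      else { kept.push_back(records[r][i]); }
    }
    // Skip marked positions past the end of this record.
    while(next < marked.size() && marked[next].first == r) { next++; }
    records[r] = std::move(kept);
  }
}
```

For example, removing positions (1, 0) and (0, 1) from the records {1, 2, 3} and {4, 5} leaves {1, 3} and {5}. In this toy, every marked position is stored uncompressed as a pair of integers, which is why the memory cost grows with the total length of the removed sequences.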

If the index is bidirectional, a request to remove sequence N actually removes both orientations of the path: sequences Path::encode(N, false) and Path::encode(N, true). Otherwise only sequence N itself is removed. If at least one of the specified sequence identifiers is invalid, no sequences are removed.
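The identifier handling above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `encode_path` assumes that Path::encode(N, reversed) maps to 2N + reversed, and `sequences_to_remove` and its parameters are hypothetical names. An empty result stands for "nothing is removed".

```cpp
#include <cstddef>
#include <vector>

// Assumption: mirrors Path::encode(id, reversed) as 2 * id + reversed.
std::size_t encode_path(std::size_t id, bool reversed)
{
  return 2 * id + (reversed ? 1 : 0);
}

// Expands the requested identifiers into the sequences that would be removed.
// `sequences` is the number of sequences stored in the index (hypothetical parameter).
// All-or-nothing: if any request is invalid, the result is empty.
std::vector<std::size_t> sequences_to_remove(const std::vector<std::size_t>& requested,
                                             std::size_t sequences, bool bidirectional)
{
  std::vector<std::size_t> result;
  for(std::size_t id : requested)
  {
    if(bidirectional)
    {
      // Both orientations must exist in the index.
      if(encode_path(id, true) >= sequences) { return {}; }
      result.push_back(encode_path(id, false));
      result.push_back(encode_path(id, true));
    }
    else
    {
      if(id >= sequences) { return {}; }
      result.push_back(id);
    }
  }
  return result;
}
```

Under these assumptions, a request for sequence 3 in a bidirectional index with 10 sequences expands to sequences 6 and 7, while the same request in a unidirectional index yields only sequence 3.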

Remove tool

Sequences can be removed with remove_seq.

remove_seq [options] base_name seq1 [seq2 ...]

The program reads base_name.gbwt and removes the sequences with identifiers seq1, seq2, and so on. The output is written back to base_name.gbwt unless specified otherwise.

  • -c N: Use chunks of N sequences per search thread.
  • -o X: Write the output to X.gbwt.
  • -r: Remove the range of sequences seq1 to seq2 (inclusive). Requires exactly two sequence arguments.

Example:

remove_seq -r -o output input 11 20

Reads input.gbwt, removes sequences 11 to 20, and writes the result to output.gbwt.

Interface
