Skip to content

Search Interface

Jouni Siren edited this page Nov 26, 2021 · 9 revisions

General

All queries use a single thread. They check that the query parameters are valid. The following functions are member functions of GBWT, CachedGBW, and DynamicGBWT. The structures are defined in gbwt.h, cached_gbwt.h, and dynamic_gbwt.h, respectively.

See Query Interface for the definitions of search states.

Searching

These queries return a search state, which is empty if the parameters are invalid. A pattern matches a search state, if the prefixes in the range of record offsets end with the pattern. The iterators are assumed to be input iterators.

Paths containing a pattern

SearchState find(node_type node) const
template<class Iterator> SearchState find(Iterator begin, Iterator end) const

This query returns the search state matching the pattern.

  • node: Node identifier.
  • begin: Start of the pattern, inclusive.
  • end: End of the pattern, exclusive.

Paths starting with a pattern

SearchState prefix(node_type node) const
template<class Iterator> SearchState prefix(Iterator begin, Iterator end) const

These queries return the search state for the prefixes of the indexed paths that are equal to the pattern.

  • node: Node identifier.
  • begin: Start of the pattern, inclusive.
  • end: End of the pattern, exclusive.

Extending the search

SearchState extend(SearchState state, node_type node) const
template<class Iterator> SearchState extend(SearchState state, Iterator begin, Iterator end) const

These queries extend the search from the given search state. extend(find(A), B) is equivalent to searching for the concatenation with find(A + B).

  • state: Search state where the search is started.
  • node: Node identifier.
  • begin: Start of the pattern, inclusive.
  • end: End of the pattern, exclusive.

Bidirectional search

Bidirectional search maintains the search states for both orientations of the pattern. It requires that both orientations of the paths have been indexed. This can be checked with GBWT::bidirectional(). The queries do not check whether the index supports bidirectional search.

These queries return a bidirectional search state, which is empty if the parameters are invalid.

Starting the search

BidirectionalState bdFind(node_type node) const

Returns the bidirectional search state for a pattern consisting of a single node node. This is similar to the unidirectional query find(node).

Extending the search forward

BidirectionalState bdExtendForward(BidirectionalState state, node_type node) const

Assume that search state state corresponds to pattern P. This returns the bidirectional search state for the concatenation of P and node. This is similar to the unidirectional query extend(state, node).

Extending the search backward

BidirectionalState bdExtendBackward(BidirectionalState state, node_type node) const

Assume that search state state corresponds to pattern P. This returns the bidirectional search state for the concatenation of node and P.

Determining the matching paths

These convert positions to path identifiers. On the average, it takes d/2 steps of LF() to find a sampled identifier, where d is the sample interval.

Single positions

size_type locate(node_type node, size_type i) const
size_type locate(edge_type position) const

This query returns the path identifier for the given position. It iterates LF() until it finds a sample. If the position is invalid, the return value is invalid_sequence().

  • node: Node identifier.
  • i: Record offset.
  • position: A combination of the above.

Ranges of positions

std::vector<size_type> locate(node_type node, range_type range) const
std::vector<size_type> locate(SearchState state) const

This query returns the set of path identifiers for the given range of positions. It is much faster than calling locate() separately for each record offset. If the range of positions is invalid, the returned set will be empty.

  • node: Node identifier.
  • range: Range of record offsets.
  • state: A combination of the above.

Extracting paths

vector_type extract(size_type sequence) const
vector_type extract(edge_type position) const
vector_type extract(edge_type position, size_type max_length) const

These queries return the path, a suffix, and a substring, respectively. If the query parameters are invalid, the returned path is empty.

  • sequence: Path identifier.
  • position: Starting position of the extaction as a combination of a node indentifier and a record offset.
  • max_length: Maximum length of the returned substring.