It provides the various analytics on very large sequence of unsigned integers in constant time.
After adding to Cargo.toml, add this line to lib.rs or main.rs.
extern crate wavelet_matrix;
See crate document top for further examples.
-
Overall Performance
- https://github.com/sekineh/wavelet-matrix-rs/blob/master/BENCH.md
- Bench is performed on Intel Xeon 2800 MHz
-
Error evaluation of O(1) SUM
Given an unsigned integer sequence T, it provides the following queries.
.len()
:- Returns the length of
T
. - It's almost no overhead, just returning the stored value.
- Returns the length of
.lookup(pos)
:- Returns the value at the position of T,
T[pos]
. - It's slower than array lookup but still is O(1). You might want to save the original Vector for faster lookup.
- Returns the value at the position of T,
Counting is performed in O(1).
.count(start..end, value)
:- Returns the number of the element
e
which satisfiese == value
included inT[start..end]
- Returns the number of the element
.count_prefix(start..end, value, ignore_bit)
:- Returns the number of the element
e
which satisfiese >> ignore_bit == value >> ignore_bit
included inT[start..end]
- This will be useful for counting the number of IPv4 address that satisfies IPv4 prefix such as
192.168.10.0/24
. In this case, the ignore_bit will be 8.
- Returns the number of the element
.count_lt(start..end, value)
:- Returns the number of the element
e
which satisfiese < value
included inT[start..end]
- Returns the number of the element
.count_gt(start..end, value)
:- Returns the number of the element
e
which satisfiese > value
included inT[start..end]
- Returns the number of the element
.count_range(start..end, val_start..val_end)
:- Returns the number of the element
e
which satisfiesval_start <= e < val_end
included inT[start..end]
- Returns the number of the element
Searching is performed in O(1) per a next index.
.search(start..end, value)
:- Returns the iterator that find indexes of the element
e
which satisfiese == value
included inT[start..end]
- Returns the iterator that find indexes of the element
.search_prefix(start..end, value, ignore_bit)
:- Returns the iterator that find indexes of the element
e
which satisfiese >> ignore_bit == value >> ignore_bit
included inT[start..end]
- Returns the iterator that find indexes of the element
- [TODO] implement various conditions other than equal.
Ranking is performed in roughly O(1) with regard to the number of elements n
.
.max_k(start..end, val_start..val_end, k)
:- list the
(value, count)
pairs in descending order. - values are constrained to the range
val_start..val_end
.
- list the
.min_k(start..end, val_start..val_end, k)
:- list the
(value, count)
pairs in ascending order. - values are constrained to the range
val_start..val_end
.
- list the
.top_k()
is also performed in O(1) in best case, but may take O(n) in the worst case where every value occurs only once!
.top_k(start..end, val_start..val_end, k)
:- list the
(value, count)
pairs in most-frequent-one-first order. - values are constrained to the range
val_start..val_end
. - [TODO] clarify the order of same count.
- list the
To achieve O(1) performance regardless of the number of unique values, use .top_k_ranges()
instead:
- [EXPERIMENTAL]
.top_k_ranges(start..end, val_start..val_end, k)
:- list the
(v_start..v_end, count)
pairs in most-frequent-one-first order. - unlike
.top_k()
,.top_k_ranges()
returns the exhaustive range set that covers all of the values. - values are constrained to the range
val_start..val_end
. - [TODO] clarify the order of same count.
- list the
.median(start..end)
:- Returns the median value from
T[start..end]
. - same as
.quantile(start..end, start + (end - start) / 2)
.
- Returns the median value from
.quantile(start..end, k)
:- Returns the (k+1)th smallest value from
T[start..end]
.
- Returns the (k+1)th smallest value from
These methods use .top_k_ranges()
to enumerate the most relevant value ranges.
They are not as accurate as the method used in Experiment 2.
- [EXPERIMENTAL]
.sum_experiment1(start..end, val_start..val_end, k)
:- Approximately calculate the sum of
T[start..end]
using up tok
wavelet tree nodes. - Only values included in the range
val_start..val_end
are processed. - To get the exact result, specify
k = m + 1
wherem
is the number of values which are unique.
- Approximately calculate the sum of
- [EXPERIMENTAL]
.mean_experiment1(start..end, val_start..val_end, k)
:- Approximately calculate the average of
T[start..end]
using up tok
wavelet tree nodes. - Only values included in the range
val_start..val_end
are processed. - To get the exact result, specify
k = m + 1
wherem
is the number of values which are unique.
- Approximately calculate the average of
- [EXPERIMENTAL]
.variance_experiment1(start..end, val_start..val_end, k)
:- Approximately calculate the variance of
T[start..end]
using up tok
wavelet tree nodes. - Only values included in the range
val_start..val_end
are processed. - To get the exact result, specify
k = m + 1
wherem
is the number of values which are unique.
- Approximately calculate the variance of
Improvement over Experiment 1. They use custom node enumerator to minimize the error.
- [EXPERIMENTAL]
.sum_experiment2(start..end, val_start..val_end, k)
:
Improvement over Experiment 2. They use Range<u64>
to tell how accurate the computed value is.
- [EXPERIMENTAL]
.sum_experiment3(start..end, val_start..val_end, k)
:
.rank(pos, value)
: counts value included inT[0..pos]
.- Note: pos is exclusive. When pos is 0,
.rank()
always returns 0.
- Note: pos is exclusive. When pos is 0,
.select(rank, value)
: return the position of the (rank+1)-th value- Note: When found nothing, it returns
.len()
instead of None.
- Note: When found nothing, it returns
-
Add Benchmark.
- Implement same queries using trivial algorithm
- Compare wm's queries against trivial one.
- Make a nice plot.
-
Profiling
-
Optimize underlying rsdic structure.
-
Add travis CI.
-
Add u128 support or arbitrary-length integer support
-
The fastest implementation on the planet
-
A Go package for myriad array operations using wavelet trees
- https://github.com/hillbig/waveletTree
- Basically, the algorithm is deeply derived from the above Go implementation.
-
excellent slides in Japanese
-
The original inventor's pdf: