-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First draft of index module #292
Conversation
1d28372
to
336e051
Compare
336e051
to
65f041e
Compare
Feel free to start reviewing the first draft :) Documentation will follow. Bidirectional index will follow after the review :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good so far!
You may want to rebase on master, because the range concept moved (that's why the tests aren't compiling).
Otherwise some format and docs(where already existing) stuff and mainly questions for understanding.
include/seqan3/index/concept.hpp
Outdated
|
||
#include <seqan3/alphabet/nucleotide/dna4.hpp> | ||
#include <seqan3/core/metafunction/range.hpp> | ||
#include <seqan3/range/concept.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is now <seqan3/std/concept/range.hpp>
include/seqan3/index/concept.hpp
Outdated
// SeqAn - The Library for Sequence Analysis | ||
// ============================================================================ | ||
// | ||
// Copyright (c) 2006-2017, Knut Reinert & Freie Universitaet Berlin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2018
Also for the other files
include/seqan3/index/concept.hpp
Outdated
// DAMAGE. | ||
// | ||
// ============================================================================ | ||
// Author: Christopher Pockrandt <christopher.pockrandt AT fu-berlin.de> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use doxygen for author information:
/*!\file
- \brief ....
- \author ...
*/
Also for the other files
include/seqan3/index/concept.hpp
Outdated
requires requires (t index, std::vector<dna4> const & text) { { t(t(text)) } }; | ||
requires requires (t index, std::vector<dna4> const & text) { { index = t(text) } }; | ||
|
||
requires requires (t index, const std::vector<dna4> & text) { { t(text) } }; |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
include/seqan3/index/all.hpp
Outdated
@@ -0,0 +1,4 @@ | |||
#pragma once |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
typename index_type::comp_char_type c = node.last_char + 1; | ||
typename index_type::size_type _lb, _rb; | ||
|
||
while (c < index->m_index.sigma && !sdsl::backward_search(index->m_index, parent_lb, parent_rb, index->m_index.comp2char[c], _lb, _rb)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line length
EXPECT_EQ(it, iterator_type(sa)); | ||
} | ||
|
||
// search a char that does not occur in the text (some rank that is neither the smallest nor the highest smallest occurring in text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line length
text_type text {"ACGACG"_dna4}; | ||
index_type sa{text}; | ||
|
||
// custom constructor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would split all the tests into parts that test one functionality.
I.e. one test for each constructor, one test for each function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disagree 😄
I don't see any need to create a separate test case for each constructor. We could replace testing the default constructors using the metafunctions.
edit: i'll split the tests in basic_search
|
||
#include <seqan3/index/fm_index.hpp> | ||
|
||
#include <seqan3/alphabet/nucleotide/dna4.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, we should test for all the alphabets we support?
The to_rank() functionality is already tested in the alphabet tests, but the fm_index still has stuff that is otherwise untested, i.e. sdsl::int_vector<8> tmp_text(text.size());
, which should produce errors once we use UTF alphabets and have a character with rank > 255.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as mentioned before, we shouldn't support larger alphabets than for example amino acid alphabets. in that case the user should use indices for integer alphabets (where I currently don't see any application for in sequence analysis)
but we might enforce using byte-alphabets. any idea how to do that? we can't check for the underlying type (::char_type) since indices should work with types such as char, uint8_t as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can use std::is_same_v<underlying_rank<t>::type, uint8_t>
, which should be implemented for all types t that satisfy the alphabet_concept (hence also uint8_t and char).
It's here and included with #include <seqan3/alphabet/concept.hpp>
.
test/unit/index/fm_index_test.cpp
Outdated
// TODO: EXPECT_EQ is not supported by sdsl | ||
// TODO: test bitcompressed vector for construction | ||
|
||
TEST(fm_index, ctr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above (alphabets, minimal test cases)
65f041e
to
9287193
Compare
@eseiler Thanks! Tests won't compile anyway because it's blocked by #45 (sdsl-lite) |
df17567
to
db651eb
Compare
Pretty much fixed all TODOs now that do not depend on external libraries. Most important question for @h-2: should we make the constructors of the fm_index_iterator noexcept? Since they are often passed by value, I understand it might be faster avoiding exception handling at all. |
6e9748c
to
8e8099b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the docs so far 👍
include/seqan3/index/all.hpp
Outdated
// OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH | ||
// DAMAGE. | ||
// | ||
|
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
include/seqan3/index/all.hpp
Outdated
* mappers, assemblers or protein search tools. There are currently two major kind of indices: FM indices and | ||
* k-mer indices (also known as q-gram indices). | ||
* | ||
* Besides searching the index yourself using the iterator interfaces, SeqAn3 also provides a very powerful |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[...] searching by yourself [...]
(I think both versions are fine, but with 'by' I find it easier to read.)
include/seqan3/index/all.hpp
Outdated
* | ||
* You can choose between unidirectional and bidirectional FM indices (which can be thought of suffix trees | ||
* and affix trees, i.e. a combination of suffix and prefix trees being able to search a pattern from left to | ||
* right, right to left and character by character in any arbitrary order). Rougly speaking bidirectional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Roughly
include/seqan3/index/all.hpp
Outdated
* You can choose between unidirectional and bidirectional FM indices (which can be thought of suffix trees | ||
* and affix trees, i.e. a combination of suffix and prefix trees being able to search a pattern from left to | ||
* right, right to left and character by character in any arbitrary order). Rougly speaking bidirectional | ||
* FM indices are more powerful for approximate string matching for the cost of a higher space consumption |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[...] string matching at the cost of [...]
include/seqan3/index/all.hpp
Outdated
* and affix trees, i.e. a combination of suffix and prefix trees being able to search a pattern from left to | ||
* right, right to left and character by character in any arbitrary order). Rougly speaking bidirectional | ||
* FM indices are more powerful for approximate string matching for the cost of a higher space consumption | ||
* (depending on the configuration between a factor of TODO and TODO). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would switch the parts and say:
between a factor of TODO and TODO depending on the configuration
(I read at first that the configuration is between a factor of...; so it kinda breaks the read flow)
* | ||
* The iterator traverses the implicit suffix tree beginning at the root node. All methods | ||
* modifying the iterator (e.g. going down an edge) return a `bool` value whether the operation was successful or not. | ||
* In case of an unsuccessful operation the iterator remains unmodified, i.e. there an iterator can never be in an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[...] i.e. there an iterator [...]
* The iterator traverses the implicit suffix tree beginning at the root node. All methods | ||
* modifying the iterator (e.g. going down an edge) return a `bool` value whether the operation was successful or not. | ||
* In case of an unsuccessful operation the iterator remains unmodified, i.e. there an iterator can never be in an | ||
* invalid state except default constructed iterators that are always invalid. E.g. going down an edge while the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[...] E.g., going down [...]
* iterator is in a leaf node, will return `false` and will remain in that leaf node. | ||
* | ||
* The asymptotic running times for using the iterator depend on the SDSL index configuration. To determine the exact | ||
* running times, you have to look up the running times of the used traits (configuration) additionally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you have to additionally look up the running times of the used traits (configuration) additionally.
* sequence does not occur in the indexed text). | ||
* | ||
* If the going down the edges fails in the middle of the sequence, all previous computations are rewound to restore | ||
* the iterators state before calling this method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[...] the iterator's state [...]
return depth() == 0; | ||
} | ||
|
||
// TODO: naming? maybe dereference-operator instead? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of using a dereference operator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should offer a "named" method as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, for me it's clear what I expect from dereferencing the iterator, but a named method might be more verbose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a suggestion? I'm not really happy with path_label()
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_path
, path_representation
(which I find quite long) ?
I actually find path_label
catchy
eb6ca59
to
f9a7ab3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Please make the test cases minimal as suggested by enrico
- Concerning the module structure I would propose to have index in the search module as both things cannot be used without the other, lets discuss the strucutre together!
include/seqan3/index/all.hpp
Outdated
* and affix trees, i.e. a combination of suffix and prefix trees being able to search a pattern from left to | ||
* right, right to left and character by character in any arbitrary order). Roughly speaking bidirectional | ||
* FM indices are more powerful for approximate string matching at the cost of a higher space consumption | ||
* (between a factor of TODO and TODO depending on the configuration). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the documentation use \TODO
for TODOs as that makes them appear extra on:
http://docs.seqan.de/seqan/3.0.0-master-dev/todo.html
include/seqan3/index/all.hpp
Outdated
* FM indices are more powerful for approximate string matching at the cost of a higher space consumption | ||
* (between a factor of TODO and TODO depending on the configuration). | ||
* | ||
* The FM indices are based on the SDSL (succinct data structure library). You are able to specify the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add link
include/seqan3/index/concept.hpp
Outdated
//!\cond | ||
template <typename t> | ||
concept bool fm_index_traits_concept = requires (t v, | ||
typename t::sdsl_index_type::size_type lb, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indention
include/seqan3/index/concept.hpp
Outdated
requires requires (t index, std::vector<dna4> const & text) { { t(t(text)) } }; | ||
requires requires (t index, std::vector<dna4> const & text) { { index = t(text) } }; | ||
|
||
requires requires (t index, const std::vector<dna4> & text) { { t(text) } }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
semiregular concept likely includes everything you want
include/seqan3/index/fm_index.hpp
Outdated
* | ||
* ### Complexity | ||
* | ||
* At least linear. |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
public: | ||
|
||
// TODO(h-2): | ||
// * shouldn't all these constructors be noexcept? construction will happen a lot in recursions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
assert(index != nullptr); | ||
|
||
// parent_lb/parent_rb might be uninitialized in a root node | ||
return node == rhs.node && ((parent_lb == rhs.parent_lb && parent_rb == rhs.parent_rb) || depth() == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check depth == 0
before the other condition to prevent read from unitialised memory
* | ||
* No-throw guarantee. | ||
*/ | ||
template <range_concept pattern_t> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check actual contraints; at least input_range, but also sized if you use size()
for (typename index_t::size_type i = 0; i < occ.size(); ++i) { | ||
occ[i] = offset() - index->m_index[node.lb + i]; | ||
} | ||
return occ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indention
assert(index != nullptr); | ||
|
||
std::vector<size_type> occ(count()); | ||
for (typename index_t::size_type i = 0; i < occ.size(); ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
braces
f9a7ab3
to
341bed4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor stuff
{ | ||
//! [cycle] | ||
std::vector<dna4> genome{"GAATTAATGAAC"_dna4}; | ||
bi_fm_index<std::vector<dna4>> index{genome}; // build the index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should be able to omit the template parameter
std::cout << it.query() << '\n'; // outputs "AAC" | ||
std::cout << it.last_char() << '\n'; // outputs 'C' | ||
|
||
// it.cycle_front(); // undefined behaviour! only cycle_back() is allowed after extend_right() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to throw? Does one cycle often in algorithms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we already discussed this and decided to make it undefined behaviour. Yes, pretty much all the time when you do approximate string matching
it.cycle_front(); // search the sequence "TAAT" | ||
std::cout << it.query() << '\n'; // outputs "TAAT" | ||
std::cout << it.last_char() << '\n'; // outputs 'T' | ||
//! [cycle] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice examples!
|
||
int main() | ||
{ | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in general, don't indent top-level(s) in the snippets, because it will also indent in the documentation.
bi_fm_index<std::vector<dna4>> index{genome}; // build the index | ||
|
||
auto it = index.begin(); // create an iterator | ||
it.extend_left("CAA"_dna4); // search the sequence "AAC" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought you said one doesn't have to reverse the sequences mentally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an open discussion see
// TODO(h-2, eseiler and cpockrandt): should we iterate from left to right or right to left? |
//! [cycle] | ||
} | ||
|
||
std::cout << "-------------------------\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you printing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case a user wants to compile the snippet it's hard to see where the output of the different blocks start and end. But I'll make a more meaningful output instead of ------ 😃
// Undefined behaviour! Cannot be called on the forward iterator if the last extension on the bidirectional | ||
// iterator was to the left: | ||
// it.cycle_back(); | ||
// std::cout << it.last_char() << '\n'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
last_char() also causes UB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. If you do an extension to the left, than you can only do cycle_front() and last_char() will return the leftmost character. When you now construct a unidirectional fwd iterator (that goes to the right on the original text), last_char() will give you the leftmost char even though last_char() on a unidirectional iterator is supposed to give you always the rightmost character. That's why it's undefined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could put a check in debug mode into the unidirectional iterator (when it is constructed from a bidirectional one) as we did it in the bidirectional iterator. I'll put this on my TODO list to work on after GCB.
uni_it.extend_right(dna4::G); // search the sequence "AACG" | ||
std::cout << uni_it.query() << '\n'; // outputs "AACG" | ||
std::cout << uni_it.last_char() << '\n'; // outputs 'G' | ||
uni_it.cycle_back(); // returns false since there is not sequence "AACT" in the text. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not
e212324
to
6943a45
Compare
|
||
#pragma once | ||
|
||
#include <filesystem> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you replace this with
#include <seqan3/io/filesystem.hpp>
It resolves to std::experimental::filesystem
or std::filesystem
- depending on what is available and puts it into the seqan3::filesystem
namespace
And then replace
std::filesystem::
with
filesystem::
|
||
#pragma once | ||
|
||
#include <filesystem> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, replace with
#include <seqan3/io/filesystem.hpp>
ae530c0
to
be2d41a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor stuff
include/seqan3/index/bi_fm_index.hpp
Outdated
//!\cond | ||
requires alphabet_concept<innermost_value_type_t<text_t>> | ||
&& std::is_same_v<typename underlying_rank<innermost_value_type_t<text_t>>::type, uint8_t> | ||
//!\endcond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💅 underlying_rank_t<innermost_value_type_t<text_t>
* | ||
* The asymptotic running times for using the iterator depend on the SDSL index configuration. To determine the exact | ||
* running times, you have to additionally look up the running times of the used traits (configuration). | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please implement my previous suggestion regarding the "iterator" disclaimer.
edit: I am starting to be more and more convinced that we actually should not call them iterators, but maybe cursors or something like that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll think about it during the weekend 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given, that the iterator concept requires dereferencing and incrementing operations, and that in fact an index is not a range (is it?), I would also argue for something different. Cursor
sounds like a perfect candidate. In fact the ranges library offers a cursor interface and also a function called cursor_begin()
, cursor_end()
etc.. I would really prefer this interface.
by the way, can you index collections now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some first look. I will continue tomorrow.
include/seqan3/search/all.hpp
Outdated
* Searching is a key component in many sequence analysis tools. The search module is a powerful and easy way to search | ||
* sequences in a large text or an arbitrary nested collection of texts. When it comes to searching, indices are a core | ||
* component for searching large amounts of data and are used for tools such as read mappers, assemblers or protein | ||
* search tools. There are currently two major kind of indices: FM indices and k-mer indices (also known as q-gram |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, if FM-index is more generic than Suffix Array, if you talk about two major kinds?
|
||
public: | ||
|
||
/*!\name Member types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this group is superfluous
* | ||
* The asymptotic running times for using the iterator depend on the SDSL index configuration. To determine the exact | ||
* running times, you have to additionally look up the running times of the used traits (configuration). | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given, that the iterator concept requires dereferencing and incrementing operations, and that in fact an index is not a range (is it?), I would also argue for something different. Cursor
sounds like a perfect candidate. In fact the ranges library offers a cursor interface and also a function called cursor_begin()
, cursor_end()
etc.. I would really prefer this interface.
d33774a
to
cb72be4
Compare
cb72be4
to
0514ee3
Compare
This is work in progress, don't merge
or reviewyet :)