Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First draft of index module #292

Merged
merged 3 commits into from
Sep 18, 2018
Merged

First draft of index module #292

merged 3 commits into from
Sep 18, 2018

Conversation

cpockrandt
Copy link
Contributor

@cpockrandt cpockrandt commented Jul 6, 2018

This is work in progress, don't merge or review yet :)

@cpockrandt cpockrandt force-pushed the feature/index branch 6 times, most recently from 1d28372 to 336e051 Compare August 6, 2018 18:09
@cpockrandt cpockrandt requested review from h-2 and eseiler August 6, 2018 18:10
@cpockrandt
Copy link
Contributor Author

Feel free to start reviewing the first draft :) Documentation will follow. Bidirectional index will follow after the review :)

Copy link
Member

@eseiler eseiler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far!

You may want to rebase on master, because the range concept moved (that's why the tests aren't compiling).

Otherwise some format and docs(where already existing) stuff and mainly questions for understanding.


#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/metafunction/range.hpp>
#include <seqan3/range/concept.hpp>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is now <seqan3/std/concept/range.hpp>

// SeqAn - The Library for Sequence Analysis
// ============================================================================
//
// Copyright (c) 2006-2017, Knut Reinert & Freie Universitaet Berlin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2018

Also for the other files

// DAMAGE.
//
// ============================================================================
// Author: Christopher Pockrandt <christopher.pockrandt AT fu-berlin.de>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use doxygen for author information:

/*!\file

  • \brief ....
  • \author ...
    */

Also for the other files

requires requires (t index, std::vector<dna4> const & text) { { t(t(text)) } };
requires requires (t index, std::vector<dna4> const & text) { { index = t(text) } };

requires requires (t index, const std::vector<dna4> & text) { { t(text) } };

This comment was marked as resolved.

This comment was marked as resolved.

@@ -0,0 +1,4 @@
#pragma once

This comment was marked as resolved.

typename index_type::comp_char_type c = node.last_char + 1;
typename index_type::size_type _lb, _rb;

while (c < index->m_index.sigma && !sdsl::backward_search(index->m_index, parent_lb, parent_rb, index->m_index.comp2char[c], _lb, _rb))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line length

EXPECT_EQ(it, iterator_type(sa));
}

// search a char that does not occur in the text (some rank that is neither the smallest nor the highest smallest occurring in text)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line length

text_type text {"ACGACG"_dna4};
index_type sa{text};

// custom constructor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would split all the tests into parts that test one functionality.
I.e. one test for each constructor, one test for each function.

Copy link
Contributor Author

@cpockrandt cpockrandt Aug 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disagree 😄

I don't see any need to create a separate test case for each constructor. We could replace testing the default constructors using the metafunctions.

edit: i'll split the tests in basic_search


#include <seqan3/index/fm_index.hpp>

#include <seqan3/alphabet/nucleotide/dna4.hpp>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we should test for all the alphabets we support?
The to_rank() functionality is already tested in the alphabet tests, but the fm_index still has stuff that is otherwise untested, i.e. sdsl::int_vector<8> tmp_text(text.size());, which should produce errors once we use UTF alphabets and have a character with rank > 255.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned before, we shouldn't support larger alphabets than for example amino acid alphabets. in that case the user should use indices for integer alphabets (where I currently don't see any application for in sequence analysis)

but we might enforce using byte-alphabets. any idea how to do that? we can't check for the underlying type (::char_type) since indices should work with types such as char, uint8_t as well

Copy link
Member

@eseiler eseiler Aug 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use std::is_same_v<underlying_rank<t>::type, uint8_t>, which should be implemented for all types t that satisfy the alphabet_concept (hence also uint8_t and char).
It's here and included with #include <seqan3/alphabet/concept.hpp>.

// TODO: EXPECT_EQ is not supported by sdsl
// TODO: test bitcompressed vector for construction

TEST(fm_index, ctr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above (alphabets, minimal test cases)

@cpockrandt
Copy link
Contributor Author

@eseiler Thanks!

Tests won't compile anyway because it's blocked by #45 (sdsl-lite)

@cpockrandt cpockrandt force-pushed the feature/index branch 3 times, most recently from df17567 to db651eb Compare August 7, 2018 15:24
@cpockrandt
Copy link
Contributor Author

cpockrandt commented Aug 7, 2018

Pretty much fixed all TODOs now that do not depend on external libraries.

Most important question for @h-2: should we make the constructors of the fm_index_iterator noexcept? Since they are often passed by value, I understand it might be faster avoiding exception handling at all.

@cpockrandt cpockrandt force-pushed the feature/index branch 5 times, most recently from 6e9748c to 8e8099b Compare August 8, 2018 00:14
Copy link
Member

@eseiler eseiler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the docs so far 👍

// OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
// DAMAGE.
//

This comment was marked as resolved.

* mappers, assemblers or protein search tools. There are currently two major kind of indices: FM indices and
* k-mer indices (also known as q-gram indices).
*
* Besides searching the index yourself using the iterator interfaces, SeqAn3 also provides a very powerful
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] searching by yourself [...]

(I think both versions are fine, but with 'by' I find it easier to read.)

*
* You can choose between unidirectional and bidirectional FM indices (which can be thought of suffix trees
* and affix trees, i.e. a combination of suffix and prefix trees being able to search a pattern from left to
* right, right to left and character by character in any arbitrary order). Rougly speaking bidirectional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roughly

* You can choose between unidirectional and bidirectional FM indices (which can be thought of suffix trees
* and affix trees, i.e. a combination of suffix and prefix trees being able to search a pattern from left to
* right, right to left and character by character in any arbitrary order). Rougly speaking bidirectional
* FM indices are more powerful for approximate string matching for the cost of a higher space consumption
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] string matching at the cost of [...]

* and affix trees, i.e. a combination of suffix and prefix trees being able to search a pattern from left to
* right, right to left and character by character in any arbitrary order). Rougly speaking bidirectional
* FM indices are more powerful for approximate string matching for the cost of a higher space consumption
* (depending on the configuration between a factor of TODO and TODO).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would switch the parts and say:

between a factor of TODO and TODO depending on the configuration

(I read at first that the configuration is between a factor of...; so it kinda breaks the read flow)

*
* The iterator traverses the implicit suffix tree beginning at the root node. All methods
* modifying the iterator (e.g. going down an edge) return a `bool` value whether the operation was successful or not.
* In case of an unsuccessful operation the iterator remains unmodified, i.e. there an iterator can never be in an
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] i.e. there an iterator [...]

* The iterator traverses the implicit suffix tree beginning at the root node. All methods
* modifying the iterator (e.g. going down an edge) return a `bool` value whether the operation was successful or not.
* In case of an unsuccessful operation the iterator remains unmodified, i.e. there an iterator can never be in an
* invalid state except default constructed iterators that are always invalid. E.g. going down an edge while the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] E.g., going down [...]

* iterator is in a leaf node, will return `false` and will remain in that leaf node.
*
* The asymptotic running times for using the iterator depend on the SDSL index configuration. To determine the exact
* running times, you have to look up the running times of the used traits (configuration) additionally.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have to additionally look up the running times of the used traits (configuration) additionally.

* sequence does not occur in the indexed text).
*
* If the going down the edges fails in the middle of the sequence, all previous computations are rewound to restore
* the iterators state before calling this method.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] the iterator's state [...]

return depth() == 0;
}

// TODO: naming? maybe dereference-operator instead?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of using a dereference operator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should offer a "named" method as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, for me it's clear what I expect from dereferencing the iterator, but a named method might be more verbose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a suggestion? I'm not really happy with path_label() 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_path, path_representation (which I find quite long) ?
I actually find path_label catchy

@cpockrandt cpockrandt force-pushed the feature/index branch 4 times, most recently from eb6ca59 to f9a7ab3 Compare August 8, 2018 14:42
Copy link
Member

@h-2 h-2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please make the test cases minimal as suggested by enrico
  2. Concerning the module structure I would propose to have index in the search module as both things cannot be used without the other, lets discuss the strucutre together!

* and affix trees, i.e. a combination of suffix and prefix trees being able to search a pattern from left to
* right, right to left and character by character in any arbitrary order). Roughly speaking bidirectional
* FM indices are more powerful for approximate string matching at the cost of a higher space consumption
* (between a factor of TODO and TODO depending on the configuration).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the documentation use \TODO for TODOs as that makes them appear extra on:
http://docs.seqan.de/seqan/3.0.0-master-dev/todo.html

* FM indices are more powerful for approximate string matching at the cost of a higher space consumption
* (between a factor of TODO and TODO depending on the configuration).
*
* The FM indices are based on the SDSL (succinct data structure library). You are able to specify the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add link

//!\cond
template <typename t>
concept bool fm_index_traits_concept = requires (t v,
typename t::sdsl_index_type::size_type lb,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indention

requires requires (t index, std::vector<dna4> const & text) { { t(t(text)) } };
requires requires (t index, std::vector<dna4> const & text) { { index = t(text) } };

requires requires (t index, const std::vector<dna4> & text) { { t(text) } };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

semiregular concept likely includes everything you want

*
* ### Complexity
*
* At least linear.

This comment was marked as resolved.

public:

// TODO(h-2):
// * shouldn't all these constructors be noexcept? construction will happen a lot in recursions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

assert(index != nullptr);

// parent_lb/parent_rb might be uninitialized in a root node
return node == rhs.node && ((parent_lb == rhs.parent_lb && parent_rb == rhs.parent_rb) || depth() == 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check depth == 0 before the other condition to prevent read from unitialised memory

*
* No-throw guarantee.
*/
template <range_concept pattern_t>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check actual contraints; at least input_range, but also sized if you use size()

for (typename index_t::size_type i = 0; i < occ.size(); ++i) {
occ[i] = offset() - index->m_index[node.lb + i];
}
return occ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indention

assert(index != nullptr);

std::vector<size_type> occ(count());
for (typename index_t::size_type i = 0; i < occ.size(); ++i) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

braces

Copy link
Member

@h-2 h-2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor stuff

{
//! [cycle]
std::vector<dna4> genome{"GAATTAATGAAC"_dna4};
bi_fm_index<std::vector<dna4>> index{genome}; // build the index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should be able to omit the template parameter

std::cout << it.query() << '\n'; // outputs "AAC"
std::cout << it.last_char() << '\n'; // outputs 'C'

// it.cycle_front(); // undefined behaviour! only cycle_back() is allowed after extend_right()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to throw? Does one cycle often in algorithms?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we already discussed this and decided to make it undefined behaviour. Yes, pretty much all the time when you do approximate string matching

it.cycle_front(); // search the sequence "TAAT"
std::cout << it.query() << '\n'; // outputs "TAAT"
std::cout << it.last_char() << '\n'; // outputs 'T'
//! [cycle]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice examples!


int main()
{
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general, don't indent top-level(s) in the snippets, because it will also indent in the documentation.

bi_fm_index<std::vector<dna4>> index{genome}; // build the index

auto it = index.begin(); // create an iterator
it.extend_left("CAA"_dna4); // search the sequence "AAC"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought you said one doesn't have to reverse the sequences mentally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an open discussion see

// TODO(h-2, eseiler and cpockrandt): should we iterate from left to right or right to left?

//! [cycle]
}

std::cout << "-------------------------\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you printing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case a user wants to compile the snippet it's hard to see where the output of the different blocks start and end. But I'll make a more meaningful output instead of ------ 😃

// Undefined behaviour! Cannot be called on the forward iterator if the last extension on the bidirectional
// iterator was to the left:
// it.cycle_back();
// std::cout << it.last_char() << '\n';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last_char() also causes UB?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. If you do an extension to the left, than you can only do cycle_front() and last_char() will return the leftmost character. When you now construct a unidirectional fwd iterator (that goes to the right on the original text), last_char() will give you the leftmost char even though last_char() on a unidirectional iterator is supposed to give you always the rightmost character. That's why it's undefined

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could put a check in debug mode into the unidirectional iterator (when it is constructed from a bidirectional one) as we did it in the bidirectional iterator. I'll put this on my TODO list to work on after GCB.

uni_it.extend_right(dna4::G); // search the sequence "AACG"
std::cout << uni_it.query() << '\n'; // outputs "AACG"
std::cout << uni_it.last_char() << '\n'; // outputs 'G'
uni_it.cycle_back(); // returns false since there is not sequence "AACT" in the text.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not

test/unit/index/fm_index_iterator_test.cpp Outdated Show resolved Hide resolved
@cpockrandt cpockrandt force-pushed the feature/index branch 4 times, most recently from e212324 to 6943a45 Compare September 13, 2018 00:33

#pragma once

#include <filesystem>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you replace this with

#include <seqan3/io/filesystem.hpp>

It resolves to std::experimental::filesystem or std::filesystem - depending on what is available and puts it into the seqan3::filesystem namespace

And then replace

std::filesystem::

with

filesystem::


#pragma once

#include <filesystem>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, replace with

#include <seqan3/io/filesystem.hpp>

@cpockrandt cpockrandt force-pushed the feature/index branch 3 times, most recently from ae530c0 to be2d41a Compare September 13, 2018 12:53
Copy link
Member

@h-2 h-2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor stuff

//!\cond
requires alphabet_concept<innermost_value_type_t<text_t>>
&& std::is_same_v<typename underlying_rank<innermost_value_type_t<text_t>>::type, uint8_t>
//!\endcond
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💅 underlying_rank_t<innermost_value_type_t<text_t>

include/seqan3/search/fm_index/bi_fm_index.hpp Outdated Show resolved Hide resolved
include/seqan3/search/fm_index/bi_fm_index.hpp Outdated Show resolved Hide resolved
include/seqan3/search/fm_index/bi_fm_index.hpp Outdated Show resolved Hide resolved
*
* The asymptotic running times for using the iterator depend on the SDSL index configuration. To determine the exact
* running times, you have to additionally look up the running times of the used traits (configuration).
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please implement my previous suggestion regarding the "iterator" disclaimer.

edit: I am starting to be more and more convinced that we actually should not call them iterators, but maybe cursors or something like that...

Copy link
Contributor Author

@cpockrandt cpockrandt Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll think about it during the weekend 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given, that the iterator concept requires dereferencing and incrementing operations, and that in fact an index is not a range (is it?), I would also argue for something different. Cursor sounds like a perfect candidate. In fact the ranges library offers a cursor interface and also a function called cursor_begin(), cursor_end() etc.. I would really prefer this interface.

include/seqan3/search/fm_index/fm_index_iterator.hpp Outdated Show resolved Hide resolved
include/seqan3/search/fm_index/fm_index_iterator.hpp Outdated Show resolved Hide resolved
test/include/seqan3/test/comparison.hpp Outdated Show resolved Hide resolved
test/snippet/index/bi_fm_index_iterator.cpp Outdated Show resolved Hide resolved
test/unit/index/bi_fm_index_iterator_test.cpp Outdated Show resolved Hide resolved
@h-2
Copy link
Member

h-2 commented Sep 13, 2018

by the way, can you index collections now?

Copy link
Contributor

@rrahn rrahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some first look. I will continue tomorrow.

* Searching is a key component in many sequence analysis tools. The search module is a powerful and easy way to search
* sequences in a large text or an arbitrary nested collection of texts. When it comes to searching, indices are a core
* component for searching large amounts of data and are used for tools such as read mappers, assemblers or protein
* search tools. There are currently two major kind of indices: FM indices and k-mer indices (also known as q-gram
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, if FM-index is more generic than Suffix Array, if you talk about two major kinds?

include/seqan3/search/all.hpp Outdated Show resolved Hide resolved

public:

/*!\name Member types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this group is superfluous

include/seqan3/search/fm_index/bi_fm_index.hpp Outdated Show resolved Hide resolved
include/seqan3/search/fm_index/bi_fm_index_iterator.hpp Outdated Show resolved Hide resolved
include/seqan3/search/fm_index/bi_fm_index_iterator.hpp Outdated Show resolved Hide resolved
*
* The asymptotic running times for using the iterator depend on the SDSL index configuration. To determine the exact
* running times, you have to additionally look up the running times of the used traits (configuration).
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given, that the iterator concept requires dereferencing and incrementing operations, and that in fact an index is not a range (is it?), I would also argue for something different. Cursor sounds like a perfect candidate. In fact the ranges library offers a cursor interface and also a function called cursor_begin(), cursor_end() etc.. I would really prefer this interface.

@cpockrandt cpockrandt force-pushed the feature/index branch 12 times, most recently from d33774a to cb72be4 Compare September 17, 2018 12:34
@h-2 h-2 merged commit 19f4ed0 into seqan:master Sep 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants