Remove raptor data simulation submodule #114

Merged: 2 commits, Sep 25, 2024
9 changes: 0 additions & 9 deletions .gitmodules
@@ -2,15 +2,6 @@
path = lib/seqan3
url = https://github.com/seqan/seqan3.git
branch = master
[submodule "lib/robin-hood-hashing"]
path = lib/robin-hood-hashing
url = https://github.com/martinus/robin-hood-hashing.git
[submodule "lib/raptor"]
path = lib/raptor
url = https://github.com/seqan/raptor.git
[submodule "lib/raptor_data_simulation"]
path = lib/raptor_data_simulation
url = git@github.com:eaasna/raptor_data_simulation.git
[submodule "lib/seqan"]
path = lib/seqan
url = git@github.com:seqan/seqan.git
30 changes: 30 additions & 0 deletions include/raptor/LICENSE.md
@@ -0,0 +1,30 @@
BSD 3-Clause License

Copyright (c) 2006-2023, Knut Reinert & Freie Universität Berlin
Copyright (c) 2016-2023, Knut Reinert & MPI für molekulare Genetik
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
42 changes: 42 additions & 0 deletions include/raptor/adjust_seed.hpp
@@ -0,0 +1,42 @@
// --------------------------------------------------------------------------------------------------
// Copyright (c) 2006-2023, Knut Reinert & Freie Universität Berlin
// Copyright (c) 2016-2023, Knut Reinert & MPI für molekulare Genetik
// This file may be used, modified and/or redistributed under the terms of the 3-clause BSD-License
// shipped with this file and also available at: https://github.com/seqan/raptor/blob/main/LICENSE.md
// --------------------------------------------------------------------------------------------------

/*!\file
* \brief Provides raptor::adjust_seed.
* \author Enrico Seiler <enrico.seiler AT fu-berlin.de>
*/

#pragma once

#include <cstdint>

namespace raptor
{

/*!\brief Adjust the default seed such that it does not interfere with the IBF's hashing.
*\param kmer_size The used k-mer size. For gapped shapes, this corresponds to the number of set bits (count()).
*\details
*
* The hashing used with the IBF assumes that the input values are uniformly distributed.
* However, we use a 64 bit seed, and unless the `kmer_size` is 32, not all 64 bits of the k-mers change.
* Hence, we need to shift the seed to the right.
*
* For example, using 2-mers and a seed of length 8 bit, the values for the k-mers will only change for the last 4 bits:
*
* ```
* seed = 1111'1011
* kmer = 0000'XXXX
* ```
*
* `seed XOR kmer` will then always have 4 leading ones.
*/
static inline constexpr uint64_t adjust_seed(uint8_t const kmer_size) noexcept
{
    return 0x8F3F73B5CF1C9ADEULL >> (64u - 2u * kmer_size);
}

} // namespace raptor
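
Review note: the shift in `adjust_seed` may be easier to follow with the 8-bit example from the doc comment written out. The sketch below is an illustration only; `adjust_seed_8bit` and the `sketch` namespace are hypothetical names that are not part of this PR, and the 64-bit function is copied from the header above. It assumes `kmer_size` is between 1 and 32 (1 to 4 for the 8-bit toy), since a shift by the full bit width is undefined behaviour in C++.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch
{

// Hypothetical 8-bit analogue of raptor::adjust_seed, mirroring the doc comment's example.
// Each DNA base takes 2 bits, so a k-mer occupies the lowest 2 * kmer_size bits.
constexpr uint8_t adjust_seed_8bit(uint8_t const seed, uint8_t const kmer_size) noexcept
{
    return static_cast<uint8_t>(seed >> (8u - 2u * kmer_size));
}

// Copy of the function from adjust_seed.hpp, without the `static inline` specifiers.
constexpr uint64_t adjust_seed(uint8_t const kmer_size) noexcept
{
    return 0x8F3F73B5CF1C9ADEULL >> (64u - 2u * kmer_size);
}

// seed = 1111'1011 and 2-mers: the seed is shifted right by 8 - 4 = 4 bits,
// so only the 4 bits that a 2-mer can actually occupy are XORed.
static_assert(adjust_seed_8bit(0b1111'1011, 2) == 0b0000'1111, "shifted seed keeps the top 4 bits");

// With 32-mers all 64 bits are in use and the seed is left unchanged.
static_assert(adjust_seed(32) == 0x8F3F73B5CF1C9ADEULL, "no shift for k = 32");

} // namespace sketch
```

After the shift, `seed XOR kmer` no longer carries constant leading ones: for 4-mers the adjusted seed is `0x8F`, so every XOR result stays within the 8 bits that the k-mer itself can vary.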
25 changes: 25 additions & 0 deletions include/raptor/dna4_traits.hpp
@@ -0,0 +1,25 @@
// --------------------------------------------------------------------------------------------------
// Copyright (c) 2006-2023, Knut Reinert & Freie Universität Berlin
// Copyright (c) 2016-2023, Knut Reinert & MPI für molekulare Genetik
// This file may be used, modified and/or redistributed under the terms of the 3-clause BSD-License
// shipped with this file and also available at: https://github.com/seqan/raptor/blob/main/LICENSE.md
// --------------------------------------------------------------------------------------------------

/*!\file
* \brief Provides raptor::dna4_traits.
* \author Enrico Seiler <enrico.seiler AT fu-berlin.de>
*/

#pragma once

#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/io/sequence_file/input.hpp>

namespace raptor
{

struct dna4_traits : seqan3::sequence_file_input_default_traits_dna
{
    using sequence_alphabet = seqan3::dna4;
};

} // namespace raptor
181 changes: 181 additions & 0 deletions include/raptor/file_reader.hpp
@@ -0,0 +1,181 @@
// --------------------------------------------------------------------------------------------------
// Copyright (c) 2006-2023, Knut Reinert & Freie Universität Berlin
// Copyright (c) 2016-2023, Knut Reinert & MPI für molekulare Genetik
// This file may be used, modified and/or redistributed under the terms of the 3-clause BSD-License
// shipped with this file and also available at: https://github.com/seqan/raptor/blob/main/LICENSE.md
// --------------------------------------------------------------------------------------------------

/*!\file
* \brief Provides raptor::file_reader.
* \author Enrico Seiler <enrico.seiler AT fu-berlin.de>
*/

#pragma once

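#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

#include <seqan3/io/sequence_file/input.hpp>
#include <seqan3/search/views/minimiser_hash.hpp>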

#include <raptor/adjust_seed.hpp>
#include <raptor/dna4_traits.hpp>

namespace raptor
{

enum class file_types
{
    sequence,
    minimiser
};

template <file_types file_type>
class file_reader
{};

template <>
class file_reader<file_types::sequence>
{
public:
    file_reader() = default;
    file_reader(file_reader const &) = default;
    file_reader(file_reader &&) = default; // GCOVR_EXCL_LINE
    file_reader & operator=(file_reader const &) = default;
    file_reader & operator=(file_reader &&) = default;
    ~file_reader() = default;

    explicit file_reader(seqan3::shape const shape, uint32_t const window_size) :
        minimiser_view{seqan3::views::minimiser_hash(shape,
                                                     seqan3::window_size{window_size},
                                                     seqan3::seed{adjust_seed(shape.count())})}
    {}

    template <std::output_iterator<uint64_t> it_t>
    void hash_into(std::vector<std::string> const & filenames, it_t target) const
    {
        for (auto && filename : filenames)
            hash_into(filename, target);
    }

    template <std::output_iterator<uint64_t> it_t>
    void hash_into(std::string const & filename, it_t target) const
    {
        sequence_file_t fin{filename};
        for (auto && record : fin)
            std::ranges::copy(record.sequence() | minimiser_view, target);
    }

    template <std::output_iterator<uint64_t> it_t>
    void hash_into_if(std::vector<std::string> const & filenames, it_t target, auto && pred) const
    {
        for (auto && filename : filenames)
            hash_into_if(filename, target, pred);
    }

    template <std::output_iterator<uint64_t> it_t>
    void hash_into_if(std::string const & filename, it_t target, auto && pred) const
    {
        sequence_file_t fin{filename};
        for (auto && record : fin)
            std::ranges::copy_if(record.sequence() | minimiser_view, target, pred);
    }

    void on_hash(std::vector<std::string> const & filenames, auto && callback) const
    {
        for (auto && filename : filenames)
            on_hash(filename, callback);
    }

    void on_hash(std::string const & filename, auto && callback) const
    {
        sequence_file_t fin{filename};
        for (auto && record : fin)
            callback(record.sequence() | minimiser_view);
    }

    void for_each_hash(std::vector<std::string> const & filenames, auto && callback) const
    {
        for (auto && filename : filenames)
            for_each_hash(filename, callback);
    }

    void for_each_hash(std::string const & filename, auto && callback) const
    {
        sequence_file_t fin{filename};
        for (auto && record : fin)
            std::ranges::for_each(record.sequence() | minimiser_view, callback);
    }

private:
    using sequence_file_t = seqan3::sequence_file_input<dna4_traits, seqan3::fields<seqan3::field::seq>>;
    using view_t = decltype(seqan3::views::minimiser_hash(seqan3::shape{}, seqan3::window_size{}, seqan3::seed{}));
    view_t minimiser_view = seqan3::views::minimiser_hash(seqan3::shape{}, seqan3::window_size{}, seqan3::seed{});
};

template <>
class file_reader<file_types::minimiser>
{
public:
    file_reader() = default;
    file_reader(file_reader const &) = default;
    file_reader(file_reader &&) = default;
    file_reader & operator=(file_reader const &) = default;
    file_reader & operator=(file_reader &&) = default;
    ~file_reader() = default;

    explicit file_reader(seqan3::shape const, uint32_t const)
    {}

    template <std::output_iterator<uint64_t> it_t>
    void hash_into(std::vector<std::string> const & filenames, it_t target) const
    {
        for (auto && filename : filenames)
            hash_into(filename, target);
    }

    template <std::output_iterator<uint64_t> it_t>
    void hash_into(std::string const & filename, it_t target) const
    {
        std::ifstream fin{filename, std::ios::binary};
        uint64_t value;
        while (fin.read(reinterpret_cast<char *>(&value), sizeof(value)))
        {
            *target = value;
            ++target;
        }
    }

    template <std::output_iterator<uint64_t> it_t>
    void hash_into_if(std::vector<std::string> const & filenames, it_t target, auto && pred) const
    {
        for (auto && filename : filenames)
            hash_into_if(filename, target, pred);
    }

    template <std::output_iterator<uint64_t> it_t>
    void hash_into_if(std::string const & filename, it_t target, auto && pred) const
    {
        std::ifstream fin{filename, std::ios::binary};
        uint64_t value;
        while (fin.read(reinterpret_cast<char *>(&value), sizeof(value)))
            if (pred(value))
            {
                *target = value;
                ++target;
            }
    }

    void for_each_hash(std::vector<std::string> const & filenames, auto && callback) const
    {
        for (auto && filename : filenames)
            for_each_hash(filename, callback);
    }

    void for_each_hash(std::string const & filename, auto && callback) const
    {
        std::ifstream fin{filename, std::ios::binary};
        uint64_t value;
        while (fin.read(reinterpret_cast<char *>(&value), sizeof(value)))
            callback(value);
    }
};

} // namespace raptor
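
Review note: the `file_types::minimiser` specialisation reads its input as a flat stream of raw 64-bit values in host byte order, with no header. The following is a minimal sketch of that round trip using only the standard library. The names `write_minimiser_file`, `read_minimiser_file`, and `read_minimiser_file_if` are hypothetical and not part of raptor; the read loops follow the pattern of `hash_into` and `hash_into_if` above, but without the C++20 `std::output_iterator` constraint.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: dump each value as 8 raw bytes (host byte order, no header).
inline void write_minimiser_file(std::string const & filename, std::vector<uint64_t> const & values)
{
    std::ofstream fout{filename, std::ios::binary};
    for (uint64_t const value : values)
        fout.write(reinterpret_cast<char const *>(&value), sizeof(value));
}

// Same read loop as file_reader<file_types::minimiser>::hash_into.
template <typename it_t>
void read_minimiser_file(std::string const & filename, it_t target)
{
    std::ifstream fin{filename, std::ios::binary};
    uint64_t value{};
    while (fin.read(reinterpret_cast<char *>(&value), sizeof(value)))
    {
        *target = value;
        ++target;
    }
}

// Same read loop as hash_into_if: only values accepted by `pred` are written to `target`.
template <typename it_t, typename pred_t>
void read_minimiser_file_if(std::string const & filename, it_t target, pred_t && pred)
{
    std::ifstream fin{filename, std::ios::binary};
    uint64_t value{};
    while (fin.read(reinterpret_cast<char *>(&value), sizeof(value)))
    {
        if (pred(value))
        {
            *target = value;
            ++target;
        }
    }
}
```

A caller would typically pass `std::back_inserter(some_vector)` as `target`. Because the bytes are written in host order, a file produced this way is only portable between machines with the same endianness.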
26 changes: 26 additions & 0 deletions include/raptor/strong_types.hpp
@@ -0,0 +1,26 @@
// --------------------------------------------------------------------------------------------------
// Copyright (c) 2006-2023, Knut Reinert & Freie Universität Berlin
// Copyright (c) 2016-2023, Knut Reinert & MPI für molekulare Genetik
// This file may be used, modified and/or redistributed under the terms of the 3-clause BSD-License
// shipped with this file and also available at: https://github.com/seqan/raptor/blob/main/LICENSE.md
// --------------------------------------------------------------------------------------------------

/*!\file
* \brief Provides raptor::window.
* \author Enrico Seiler <enrico.seiler AT fu-berlin.de>
*/

#pragma once

#include <cstdint>

namespace raptor
{

//!\brief Strong type for passing the window size.
struct window
{
    uint32_t v{};
};

} // namespace raptor
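
Review note: a short sketch of why a strong type such as `window` is useful at call sites. Everything here lives in a hypothetical `sketch` namespace; the `kmer` struct and the `kmers_per_window` function are assumptions made for illustration and do not appear in this PR. Only the shape of `window` is taken from the header above.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch
{

// Same shape as raptor::window.
struct window
{
    uint32_t v{};
};

// Hypothetical companion strong type for the k-mer size.
struct kmer
{
    uint8_t v{};
};

// A window of w bases contains w - k + 1 overlapping k-mers (assumes w >= k).
// Because the parameters are distinct types, swapping the arguments does not compile.
constexpr uint32_t kmers_per_window(kmer const k, window const w) noexcept
{
    return w.v - k.v + 1u;
}

} // namespace sketch
```

With plain integers, `kmers_per_window(24, 20)` and `kmers_per_window(20, 24)` would both compile; with the strong types, the call has to read `kmers_per_window(kmer{20}, window{24})`.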