Problem: Mapping metagenomes to large databases, such as the GMGCv1, takes too much memory. Partitioning the database is a common solution (and supported by NGLess), but it has the drawback of being slow.
This repository explores the possibility of prefiltering the database by removing sequences that are extremely unlikely to be matches.
1. Parse all the reads and collect all randstrobes (or rather, their hashes)
2. Parse the database and select only unigenes that are expected to be present in the reads
3. Map as usual to the pre-filtered database
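The read-scanning step can be sketched in pure Python. Here, `randstrobe_hashes` is a hypothetical stand-in (it just hashes overlapping k-mers so the sketch is self-contained); the real pipeline uses the strobealign Python bindings to compute actual randstrobe hashes:

```python
import gzip

def randstrobe_hashes(seq, k=20):
    """Placeholder for strobealign's randstrobe hashing: for
    illustration we hash overlapping k-mers instead."""
    return {hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}

def collect_read_hashes(fastq_path):
    """Collect the set of all hashes seen in a (gzipped) FASTQ file."""
    seen = set()
    with gzip.open(fastq_path, 'rt') as f:
        for i, line in enumerate(f):
            if i % 4 == 1:  # the sequence line of each FASTQ record
                seen.update(randstrobe_hashes(line.strip()))
    return seen
```

The key point is that only the set of hashes needs to be kept in memory, not the reads themselves.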
For step 2, different strategies are possible. The simplest is to keep any unigene that shares any hash with the set of hashes from the reads. Currently being considered:

- `min1`: keep all references that match at least one hash
- `min2`: keep all references that match at least two hashes
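The `min1`/`min2` selection can be sketched as follows (hash values here are plain integers for illustration; in the real pipeline they are randstrobe hashes, and this is a sketch rather than the repository's exact code):

```python
def select_unigenes(unigene_hashes, read_hashes, min_shared=1):
    """Keep unigenes sharing at least `min_shared` hashes with the reads.

    unigene_hashes: dict mapping unigene name -> set of its hashes
    read_hashes: set of hashes collected from the metagenome reads
    """
    selected = []
    for name, hashes in unigene_hashes.items():
        shared = 0
        for h in hashes:
            if h in read_hashes:
                shared += 1
                if shared >= min_shared:  # early exit once threshold met
                    selected.append(name)
                    break
    return selected

unigenes = {'u1': {1, 2, 3}, 'u2': {3, 100}, 'u3': {7, 8}}
reads = {1, 2, 3, 9}
# min_shared=1 (min1) also keeps 'u2'; min_shared=2 (min2) keeps only 'u1'
```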
We also tested counting the exact number of shared hashes, as well as a hacky Bloom-filter-like structure that uses a single fixed-size array, but the hacky version gave bad estimates.
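A minimal sketch of a single-array structure (our illustration, not the repository's exact code) shows why its estimates degrade: distinct hashes that collide into the same slot become indistinguishable, so shared-hash counts are over-estimated:

```python
class OneArrayFilter:
    """Bloom-filter-like presence test backed by one fixed-size array."""
    def __init__(self, size):
        self.size = size
        self.bits = bytearray(size)

    def add(self, h):
        self.bits[h % self.size] = 1

    def count_shared(self, hashes):
        # Counts slots that are set, NOT true shared hashes: any hash
        # colliding with a read hash's slot is a false positive.
        return sum(self.bits[h % self.size] for h in hashes)

bf = OneArrayFilter(16)
for h in range(0, 160, 16):  # 10 read hashes, all landing in slot 0
    bf.add(h)
# A unigene with hashes {16, 32} shares nothing with the reads, yet
# both hashes collide into slot 0, so count_shared reports 2.
```

With a realistic number of hashes, a single fixed-size array saturates quickly and these false positives dominate, which matches the bad estimates observed.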
- Python, including NumPy and Pandas
- Jug
- NGLess
- Strobealign (Sahlin, 2022), including the Python bindings
- tabulate (used to print the final table)
To install most dependencies (assuming you have conda-forge & bioconda set up):

```sh
conda install python=3.11 numpy pandas requests tabulate jug ngless
```
To install strobealign's Python bindings (which are not installed by default with conda):
```sh
# To ensure you have a recent C++ compiler (not always needed)
conda install gxx_linux-64 gcc_linux-64
export CC CXX  # make conda's compiler settings visible to pip's build

git clone https://github.com/ksahlin/strobealign
cd strobealign
pip install .
```
- Database: GMGCv1 (from Coelho et al., 2022). This is downloaded by `jugfile.py`
- Metagenomes: dog dataset (from Coelho et al., 2018) and human gut dataset (from Zeller et al., 2014). These can be downloaded with ena-mirror. More guidance on how to do this will be provided soon, but get in touch if you have questions.
Note that running this benchmark will use a lot of disk storage!
- Luis Pedro Coelho (Queensland University of Technology). [email protected]