PHONI

(Practical Heuristic ON Incremental matching statistics computation)

This framework supports the currently memory-friendliest way to compute the matching statistics of a pattern on highly-repetitive texts, given that the input text is precomputed with the MONI index (more precisely, we need all ingredients of MONI except the thresholds).

We require the pattern and the text to be available in form of sequences stored in the .fa (FASTA) format. To use our solution, you need to have recent cmake, g++, zsh, and python 3 installed.

Preparations

We need the following python 3 packages for extracting and concatenating .fa files:

	pip3 install biopython
	pip3 install fastaparser
	pip3 install psutil

git clone --branch phoni https://github.com/koeppl/phoni

Compile

mkdir build
cd build; cmake ..
make

Building the index

To build the index we use the command phoni build from the build directory.

phoni build \
-r <filename of the reference> \
-t <number of threads> \
-g <grammar format> \
-f <input file is a fasta file> \

For example, to build the phoni index for the file yeast.fasta using 4 threads and the plain grammar we run from the build folder:

python3 phoni build -r ../data/yeast.fasta -f -t 4 -g plain

This command will produce yeast.fasta.phoni and yeast.fasta.plain.slp in the data folder, which represent the phoni index.

Querying the index

To query the index we use the command phoni query from the build directory.

phoni ms \
-r <filename of the reference> \
-p <number of threads> \
-g <grammar format> \

For example, to query the phoni index for the file yeast.fasta using the plain grammar with the pattern samples.fastq we run from the build folder:

python3 phoni build -r ../data/yeast.fasta -p ../data/samples.fa -g plain

This command will produce samples.fa.positions and samples.fa.lengths in the data folder, which represent the matching staistics positions and lengths of samples.fa against yeast.fasta, respectively.

Compatibility queries

To perform the queries using the old query command we first have to split the samples.fa file using the python tool splitpattern.py:

mkdir data/samples.fa.dir
python3 splitpattern.py data/samples.fa data/samples.fa.dir

Then we can query samples.fa against the yeast.fasta index with the plain grammar using the following command:

./build/test/src/phoni_compatibility data/yeast.fasta -p data/samples.fa -g plain

python Tools

prefixpattern.py: takes the x% prefix of each pattern stored in a .fa file and outputs a new .fa file to stdout
splitpattern.py: splits a .fa file into individual sequences stored in a directory given as program parameter (we need this for msfast as it does not support reading .fa files)

Benchmarks

We provide a script and benchmark files to evaluate PHONI in the setting as described in the paper

Christina Boucher, Travis Gagie, Tomohiro I, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, Massimiliano Rossi: PHONI: Streamed Matching Statistics with Multi-Genome References, In proceedings of 2021 Data Compression Conference, pp. 193-202, 2021.

Christina Boucher, Travis Gagie, Tomohiro I, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, Massimiliano Rossi: PHONI: Streamed Matching Statistics with Multi-Genome References, arXiv:2011.05610, 11 Nov 2020.

In our experiments we used the file

chr19.1000.fa.xz as our text dataset, and
chr19.10.fa.xz as our pattern dataset.

We have a shell script benchmark.sh for an automatic benchmark. For this to work, some variables in it has to be set, as this project does not ship with the other matching statistic algorithms, namely

meaning it is necessary to download and compile those projects individually, and the set the corresponding variables in benchmark.sh manually (more precisely: in the switch-case statement for the hostname in the beginning). Finally, the output of benchmark.sh can be processed by sqlplots to generate the plots shown in the paper.

To compute the naive PHONI variant evaluated in the paper, simple exchange lceToRBounded with lceToR_NaiveBounded in the file include/ms/phoni.hpp.

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
CMakeModules		CMakeModules
include		include
pipeline		pipeline
test/src		test/src
thirdparty		thirdparty
.gitignore		.gitignore
CITATION.cff		CITATION.cff
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
benchmark.sh		benchmark.sh
catfasta.py		catfasta.py
prefixpattern.py		prefixpattern.py
readlog.sh		readlog.sh
splitpattern.py		splitpattern.py
testsuite.sh		testsuite.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PHONI

Preparations

Compile

Building the index

Querying the index

Compatibility queries

python Tools

Benchmarks

About

Contributors 2

Languages

License

koeppl/phoni

Folders and files

Latest commit

History

Repository files navigation

PHONI

Preparations

Compile

Building the index

Querying the index

Compatibility queries

python Tools

Benchmarks

About

Topics

Resources

License

Stars

Watchers

Forks

Contributors 2

Languages