`LCP` (Locally Consistent Parsing) Algorithm Implementation

This repository contains an implementation of the Locally Consistent Parsing (LCP) algorithm, applied to strings using a specific binary alphabet encoding. The implementation is in C++ and is designed for efficient computation of LCP on large datasets.

Features

Efficient LCP Computation: Implemented in C++ for efficient and scalable computation on large datasets.
High Accuracy: Achieves highly accurate comparisons by leveraging the unique LCP approach.
Designed for Genomics: Specifically caters to the needs of genomic researchers and bioinformaticians.

Installation

You can install lcptools either system-wide (requires sudo privileges) or in a user-specific directory (no sudo required).

System-wide Installation

To install lcptools system-wide, you need sudo privileges. This will install the library in /usr/local.

Install the repository:

git clone https://github.com/BilkentCompGen/lcptools.git
cd lcptools

# Install the library
sudo make install

Uninstall the library (if needed):
```
sudo make uninstall
```

User-specific Installation

To install lcptools in your home directory (or another custom directory), you don't need sudo privileges.

Install the repository:

git clone https://github.com/BilkentCompGen/lcptools.git
cd lcptools

# Install the library to a custom directory (e.g., `~/.local`):**
make install PREFIX=$(HOME)/.local

Uninstall the library from the custom directory (if needed):
```
make uninstall PREFIX=$(HOME)/.local
```

Usage

To compile your program with your program, you need to specify the include and library paths based on your installation method.

Compile with System-wide installed library

If you want to link static library, please use as follows:

g++ your_program.cpp -static -llcptools -o your_program

If you want to link dynamic library, please use as follows:

g++ your_program.cpp -llcptools -o your_program

Compile with User-specific installed library

If you want to link static library, please use as follows:

g++ your_program.cpp -static -I$(HOME)/.local/include/lcptools -L$(HOME)/.local/lib -llcptools -o your_program

If you want to link dynamic library, please use as follows:

g++ your_program.cpp -I$(HOME)/.local/include/lcptools -L$(HOME)/.local/lib -llcptools -Wl,-rpath,$(HOME)/.local/lib -o your_program

Note: Make sure that paths are correct.

Character Encoding

The binary encoding of the alphabet is defined as follows. This default encoding is used unless a custom encoding is provided:

Character	Binary Encoding
A, a	00
T, t	11
G, g	10
C, c	01

Initialization

To initialize the encodings, use the following function call at the beginning of your program. A boolean parameter verbose can be provided, which, when set to true, prints a summary of the encoding:

lcp::encoding::init(verbose);

To display the encoding summary separately, use:

lcp::encoding::summary();

For custom character encoding, initialize the encoding by passing a std::map<char, int> object to the initialization function:

std::map<char, int> customEncoding = { {'A', 3}, {'T', 0}, {'C', 2}, {'G', 1} };
lcp::encoding::init(customEncoding, verbose);

In the above code, verbose is an optional boolean parameter, defaulting to false.

Usage Example

Below is an example demonstrating the usage of the LCP algorithm implementation:

#include <iostream>
#include <string>
#include "lps.h"

int main() {

    // Initialize alphabet coefficients
    lcp::encoding::init();

    // Example string
    std::string str = "GGGACCTGGTGACCCCAGCCCACGACAGCCAAGCGCCAGCTGAGCTCAGGTGTGAGGAGATCACAGTCCT";

    // Create LCP string object
    lcp::lps *lcp_str = new lcp::lps(str);

    // Deepen the LCP analysis
    lcp_str->deepen();

    // Output LCP string
    std::cout << *lcp_str << std::endl;

    // Clean up to prevent memory leaks
    delete lcp_str;

    return 0;
}

LCP Algorithm Description

The LCP algorithm operates as follows:

Constructor:

Processes the input string and identifies cores that adhere to specific rules:

(LMIN) The subsequent characters should not be the same, and the middle character is local minima.

Ex: $w = xyz$ where $x \neq y$ and $y \neq z$, and $x \gt y$ and $y \lt z$
(LMAX) The subsequent characters should not be the same, and the middle character local maxima, and its neighbors are not local minima.

Ex: $w = sxyzt$ where $s \neq x$ and $x \neq y$ and $y \neq z$ and $z \neq t$, and $s \leq x$ and $x \lt y$ and $y \gt z$ and $z \geq t$.
(RINT) The characters, except the front and back, are the same.

Ex: $w=xy^iz$ where $i > 1$ and $|w| \gt 3$, , and $x\neq y$, $y\neq z$.
(SSEQ) The subsequent characters are either strictly increasing or decreasing with respect to the lexicographic order, and only the first and last characters are part of either a LMIN, LMAX, or a RINT.

Ex: $w = xyza_0 . . . a_nklm$, where $n \geq 1$ and $xyz$ and $klm$ are identified as cores, and $z \lt a_0 \lt \dots \lt a_n \lt k$ or $z \gt a_0 \gt \dots \gt a_n \gt k$.

Compress Function:

The compress function in the LCP algorithm is crucial for processing binary sequences. It starts by pinpointing the initial point of difference between two binary strings, beginning from the right-end. The function then assesses the difference based on the position and value of the divergent bit. This detail is transformed into a new binary sequence, which establishes the foundation of a newly generated 'core'. This core is a clear representation of the differences between the original sequences, integral to the algorithm's deepening process. Essentially, the compress function effectively consolidates and encapsulates the information, ensuring efficient further analysis within the LCP framework.

PROGRAM compress(bits1, bits2):
	
	position <- FIND FIRST DIFFERENCE BETWEEN bits1 AND bits2 STARTING FROM RIGHT
	difference <- GET BIT FROM THE bits2 AT POSITION (bits2.length - position)
	new_bits <- position * 2 + difference

	new_core = CREATE NEW CORE FROM new_bits

	return new_core

Ex: 11101000 vs 00010100 -> 1000 as the position is 2 (100) and the bit is 0. Position index start from 0.

Deepen Function:

The deepen function in the LCP algorithm primarily focuses on the compression of 'cores' alongside their left neighbors. The purpose of this repeated compression is to manage the length of the cores, preventing them from becoming large. After a few rounds of compression, the LCP algorithm is re-applied. This re-application aims to identify new cores within the compressed data. In this context, each compressed core is treated as a discrete value, represented in binary form. This representation facilitates efficient processing and analysis within the algorithm.

This function iteratively compresses and processes cores to find new cores in compliance with the rules stated above.

PROGRAM deepen(core_length=5, verbose=false):
	
	iteration_index <- 0

	FOR iteration_index < DCT_ITERATION_COUNT:
		iteration_index++
	 	temp_prev <- cores[0]
	 
		FOR index in range(1, cores.length):
			temp_curr <- cores[index]
			cores[index] <- compress(temp_prev, temp_curr)
			temp_prev <- temp_curr
		ENDFOR

		cores.pop_front() // it is not compressed since it has no left neighbour
	ENDFOR
	
	RUN LCP ALGORITHM FOR COMPRESSED CORES TO FIND NEW CORES
	CONCATENATE NEW CORES THAT COMPLY WITH EITHER ONE OF THE RULE DISCUSSED IN THE PREVIOUS SECTION

	level += 1

The deepen function in the LCP algorithm primarily focuses on the compression of 'cores' alongside their left neighbors. The purpose of this repeated compression is to manage the length of the cores, preventing them from becoming unwieldy with each iteration of deepening. After a few rounds of compression, the LCP algorithm is reapplied. This reapplication aims to identify new cores within the compressed data. In this context, each compressed core is treated as a discrete value, represented in binary form. This representation facilitates efficient processing and analysis within the algorithm.

Default Variables

The default iteration count for compression in each deepening is set to 1.

The default value for verbose is set to false.

The default value for use_map is set to false.

Name		Name	Last commit message	Last commit date
Latest commit History 248 Commits
data		data
experiment		experiment
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
constant.h		constant.h
core.cpp		core.cpp
core.h		core.h
encoding.cpp		encoding.cpp
encoding.h		encoding.h
hash.cpp		hash.cpp
hash.h		hash.h
lcptools.cpp		lcptools.cpp
lps.cpp		lps.cpp
lps.h		lps.h
rules.h		rules.h

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

`LCP` (Locally Consistent Parsing) Algorithm Implementation

Features

Installation

System-wide Installation

User-specific Installation

Usage

Compile with System-wide installed library

Compile with User-specific installed library

Character Encoding

Initialization

Usage Example

LCP Algorithm Description

Constructor:

Compress Function:

Deepen Function:

Default Variables

About

Releases 1

Packages

Contributors 2

Languages

License

BilkentCompGen/lcptools

Folders and files

Latest commit

History

Repository files navigation

LCP (Locally Consistent Parsing) Algorithm Implementation

Features

Installation

System-wide Installation

User-specific Installation

Usage

Compile with System-wide installed library

Compile with User-specific installed library

Character Encoding

Initialization

Usage Example

LCP Algorithm Description

Constructor:

Compress Function:

Deepen Function:

Default Variables

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Languages

`LCP` (Locally Consistent Parsing) Algorithm Implementation

Packages