-
Notifications
You must be signed in to change notification settings - Fork 3
K mer counter
To get started run the command kmertools ctr --help
and you will see the following help output.
Count k-mers
Usage: kmertools ctr [OPTIONS] --input <INPUT> --output <OUTPUT> --k-size <K_SIZE>
Options:
-i, --input <INPUT>
Input file path
-o, --output <OUTPUT>
Output directory path
-k, --k-size <K_SIZE>
k size for counting
-m, --memory <MEMORY>
Max memory in GB
[default: 6]
-a, --acgt
Output ACGT instead of numeric values
This requires a larger space for the final result
compared to the compact numeric representation
-t, --threads <THREADS>
Thread count for computations 0=auto
[default: 0]
-h, --help
Print help (see a summary with '-h')
- Uses an algorithm similar to DSK
Size of the k-mer for counting. Only values 10 to 32 are allowed as of now. This may change in future.
This is the maximum memory used for k-mer counting. This value is in GB and it it used to decide on the number of chunks for k-mer counting. The algorithm is similar to that of DSK, but not exactly. But this gets the required input for coverage histogram creation in one-shot and in correct formatting.
In this mode, counts are reported in text format as follows.
AAAAAAAAAAAAAAC 4
AAAAAAAAAAAAAAG 6
AAAAAAAAAAAAAGT 22
AAAAAAAAAAAACCG 20
AAAAAAAAAAAACCT 24
Without this flag, the reported results are as follows.
1 4
2 6
11 22
22 20
23 24
Note that, former is readable at a much higher storage and memory cost. The default numeric mode is more suitable for loading into other programs and analytics in general. Conversion of numeric value to k-mer can be done using the following algorithm (in python and rust).
def numeric_to_kmer(kmer: int, k: int) -> str:
"""Converts a numeric k-mer representation back into its DNA sequence.
Args:
kmer (int): The numeric representation of the k-mer.
k (int): The length of the k-mer.
Returns:
str: The DNA sequence corresponding to the numeric k-mer.
"""
sequence = []
for _ in range(k):
bits = kmer & 0b11 # Get the last two bits
if bits == 0b00:
nucleotide = 'A'
elif bits == 0b01:
nucleotide = 'C'
elif bits == 0b10:
nucleotide = 'G'
elif bits == 0b11:
nucleotide = 'T'
else:
raise ValueError("Unexpected bit pattern.")
sequence.append(nucleotide)
kmer >>= 2 # Shift right by two bits to process the next nucleotide
return ''.join(reversed(sequence)) # Reverse the list to match the original order
# Example usage
kmer = 26 # Example numeric k-mer
k = 4 # Length of the k-mer
print(numeric_to_kmer(kmer, k)) # Outputs the DNA sequence
pub fn numeric_to_kmer(kmer: u64, k: usize) -> String {
let mut s = String::new();
let mut kmer = kmer;
for _ in 0..k {
let c = match kmer & 0b11 {
0b00 => 'A',
0b01 => 'C',
0b10 => 'G',
0b11 => 'T',
_ => panic!("Impossible!"),
};
s.push(c);
kmer >>= 2;
}
s.chars().rev().collect()
}
Note that, the current implementation is limited to 64 bits (i.e., 32-mers). Future plans are there to extend this to accommodate arbitrarily large k-mer sizes.
kmertools - k-mer driven genomics analytics toolkit