Skip to content

K mer counter

Anuradha Wickramarachchi edited this page May 19, 2024 · 1 revision

K-mer counter

Overview

To get started run the command kmertools ctr --help and you will see the following help output.

Count k-mers

Usage: kmertools ctr [OPTIONS] --input <INPUT> --output <OUTPUT> --k-size <K_SIZE>

Options:
  -i, --input <INPUT>
          Input file path

  -o, --output <OUTPUT>
          Output directory path

  -k, --k-size <K_SIZE>
          k size for counting

  -m, --memory <MEMORY>
          Max memory in GB
          
          [default: 6]

  -a, --acgt
          Output ACGT instead of numeric values
          
          This requires a larger space for the final result
          compared to the compact numeric representation

  -t, --threads <THREADS>
          Thread count for computations 0=auto
          
          [default: 0]

  -h, --help
          Print help (see a summary with '-h')

Notes

  • Uses an algorithm similar to DSK

Options

K size

Size of the k-mer for counting. Only values 10 to 32 are allowed as of now. This may change in future.

Memory

This is the maximum memory used for k-mer counting. This value is in GB and it it used to decide on the number of chunks for k-mer counting. The algorithm is similar to that of DSK, but not exactly. But this gets the required input for coverage histogram creation in one-shot and in correct formatting.

ACGT mode

In this mode, counts are reported in text format as follows.

AAAAAAAAAAAAAAC	4
AAAAAAAAAAAAAAG	6
AAAAAAAAAAAAAGT	22
AAAAAAAAAAAACCG	20
AAAAAAAAAAAACCT	24

Without this flag, the reported results are as follows.

1	4
2	6
11	22
22	20
23	24

Note that, former is readable at a much higher storage and memory cost. The default numeric mode is more suitable for loading into other programs and analytics in general. Conversion of numeric value to k-mer can be done using the following algorithm (in python and rust).

def numeric_to_kmer(kmer: int, k: int) -> str:
    """Converts a numeric k-mer representation back into its DNA sequence.
    
    Args:
        kmer (int): The numeric representation of the k-mer.
        k (int): The length of the k-mer.
    
    Returns:
        str: The DNA sequence corresponding to the numeric k-mer.
    """
    sequence = []
    for _ in range(k):
        bits = kmer & 0b11  # Get the last two bits
        if bits == 0b00:
            nucleotide = 'A'
        elif bits == 0b01:
            nucleotide = 'C'
        elif bits == 0b10:
            nucleotide = 'G'
        elif bits == 0b11:
            nucleotide = 'T'
        else:
            raise ValueError("Unexpected bit pattern.")
        sequence.append(nucleotide)
        kmer >>= 2  # Shift right by two bits to process the next nucleotide
    return ''.join(reversed(sequence))  # Reverse the list to match the original order

# Example usage
kmer = 26  # Example numeric k-mer
k = 4  # Length of the k-mer
print(numeric_to_kmer(kmer, k))  # Outputs the DNA sequence
pub fn numeric_to_kmer(kmer: u64, k: usize) -> String {
    let mut s = String::new();
    let mut kmer = kmer;
    for _ in 0..k {
        let c = match kmer & 0b11 {
            0b00 => 'A',
            0b01 => 'C',
            0b10 => 'G',
            0b11 => 'T',
            _ => panic!("Impossible!"),
        };
        s.push(c);
        kmer >>= 2;
    }
    s.chars().rev().collect()
}

Note that, the current implementation is limited to 64 bits (i.e., 32-mers). Future plans are there to extend this to accommodate arbitrarily large k-mer sizes.