Minimiser computations

Anuradha Wickramarachchi edited this page May 19, 2024 · 1 revision

To get started, run the command kmertools min --help to see the following help output.

Bin reads using minimisers

Usage: kmertools min [OPTIONS] --input <INPUT> --output <OUTPUT>

Options:
  -i, --input <INPUT>
          Input file path

  -o, --output <OUTPUT>
          Output vectors path

  -m, --m-size <M_SIZE>
          Minimiser size
          
          [default: 10]

  -w, --w-size <W_SIZE>
          Window size
          
          0 - emits one minimiser per sequence (useful for sequencing reads)
          w_size must be longer than m_size
          
          [default: 0]

  -p, --preset <PRESET>
          Output type to write
          
          [default: s2m]

          Possible values:
          - s2m: Conver sequences into minimiser representation
          - m2s: Group sequences by minimiser

  -t, --threads <THREADS>
          Thread count for computations 0=auto
          
          [default: 0]

  -h, --help
          Print help (see a summary with '-h')

Notes

This command creates minimisers for sequences and outputs them in the specified format, s2m or m2s.

Options

Preset

This determines the output format.

s2m - report minimisers per sequence in the following format. All fields are separated by tabs.

  • First item - SEQ_ID
  • Next items - MINIMISER:START-END

START - start index of the minimiser window. END - end index of the minimiser window (exclusive).

For example, in ACGCCAT:0-32, ACGCCAT is the minimiser; its window starts at index 0 and ends at index 32 (exclusive), so the window size is 32.

Read_1	ACGCCAT:0-32	AAATCCC:2-57	AACAACT:27-62	AAACCCT:32-63	AAAACCC:33-72
Read_2	AAAATAC:0-50	AAGAATC:20-57	AAGCAGA:27-64	AACGACG:34-65	AAACGAC:35-66	AAAACGA:36-72
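
The s2m layout described above can be read back with a few lines of Python. This is a minimal sketch rather than part of kmertools: the function name parse_s2m_line is hypothetical, and it assumes each line follows the tab-separated SEQ_ID, MINIMISER:START-END layout exactly as shown in the sample.

```python
def parse_s2m_line(line):
    """Split one s2m line into (seq_id, [(minimiser, start, end), ...])."""
    fields = line.rstrip("\n").split("\t")
    seq_id = fields[0]
    entries = []
    for item in fields[1:]:
        # Each item is assumed to look like MINIMISER:START-END (END exclusive).
        minimiser, _, span = item.partition(":")
        start, _, end = span.partition("-")
        entries.append((minimiser, int(start), int(end)))
    return seq_id, entries


# Usage with the start of the first sample line from this page.
sample = "Read_1\tACGCCAT:0-32\tAAATCCC:2-57\tAACAACT:27-62"
seq_id, entries = parse_s2m_line(sample)
print(seq_id, entries)
```

Each entry keeps START and END as integers, so the covered region of a read can be sliced directly as read[start:end].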

m2s - report sequences grouped by their minimisers. This preset requires more memory. The output format is as follows. The first and second items are tab-separated; the second item is an array of tuples.

  • First item - MINIMISER
  • Second item - [tuple_1, tuple_2, ...]
  • Tuple format - (SEQ_ID, START, END)

AAAACCCTTA	[("Read_1", 0, 72)]
AAAACGACGC	[("Read_2", 0, 72)]
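
The m2s sample lines above happen to look like Python literals, so one way to read them is with ast.literal_eval. This is a sketch under that assumption only: the function name parse_m2s_line is hypothetical, and sequence IDs containing quotes or other unusual characters may not parse this way.

```python
import ast


def parse_m2s_line(line):
    """Split one m2s line into (minimiser, [(seq_id, start, end), ...])."""
    minimiser, raw_tuples = line.rstrip("\n").split("\t", 1)
    # Assumption: the second field is a valid Python list-of-tuples literal,
    # as in the sample output on this page.
    records = ast.literal_eval(raw_tuples)
    return minimiser, [(sid, int(start), int(end)) for sid, start, end in records]


# Usage with the first sample line from this page.
sample = 'AAAACCCTTA\t[("Read_1", 0, 72)]'
print(parse_m2s_line(sample))
```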

M and W size

M size determines the size of the minimiser. W size determines the window size. -w 0 sets the window size to the full length of the sequence; otherwise, the -w value must be larger than the -m value.
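
To illustrate how the two sizes interact, here is a small Python sketch of windowed minimiser selection. It is not the kmertools implementation: it assumes the minimiser is the lexicographically smallest m-mer on the forward strand, it reports every window separately instead of merging neighbouring windows that share a minimiser, and the function name minimisers is hypothetical.

```python
def minimisers(seq, m=10, w=0):
    """Return (minimiser, window_start, window_end) for each window of seq."""
    if w != 0 and w <= m:
        raise ValueError("w must be 0 or larger than m")
    if len(seq) < m:
        return []  # sequence too short to contain a single m-mer
    if w == 0 or w > len(seq):
        w = len(seq)  # w = 0: treat the whole sequence as one window
    result = []
    for start in range(len(seq) - w + 1):
        window = seq[start:start + w]
        # Assumption: smallest m-mer in plain lexicographic order wins.
        best = min(window[i:i + m] for i in range(w - m + 1))
        result.append((best, start, start + w))
    return result


# One minimiser for the whole sequence (w = 0), then one per 5-base window.
print(minimisers("ACGTAAAC", m=3, w=0))
print(minimisers("ACGTAAAC", m=3, w=5))
```

With w=0 the function yields a single tuple covering the full sequence, which mirrors the one-minimiser-per-sequence behaviour described for sequencing reads.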

Example application

TBD