Minimiser computations
Anuradha Wickramarachchi edited this page May 19, 2024 · 1 revision
To get started, run the command kmertools min --help to see the following help output.
Bin reads using minimisers

Usage: kmertools min [OPTIONS] --input <INPUT> --output <OUTPUT>

Options:
  -i, --input <INPUT>
          Input file path

  -o, --output <OUTPUT>
          Output vectors path

  -m, --m-size <M_SIZE>
          Minimiser size
          [default: 10]

  -w, --w-size <W_SIZE>
          Window size
          0 - emits one minimiser per sequence (useful for sequencing reads)
          w_size must be longer than m_size
          [default: 0]

  -p, --preset <PRESET>
          Output type to write
          [default: s2m]

          Possible values:
          - s2m: Convert sequences into minimiser representation
          - m2s: Group sequences by minimiser

  -t, --threads <THREADS>
          Thread count for computations 0=auto
          [default: 0]

  -h, --help
          Print help (see a summary with '-h')
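The options above can be combined into a full invocation. The Python sketch below builds the argument list from the flags shown in the help output and checks the documented w_size constraint before anything is run. The helper name build_min_command and the file names reads.fasta and minimisers.txt are hypothetical, chosen only for illustration.

```python
def build_min_command(input_path, output_path, m_size=10, w_size=0,
                      preset="s2m", threads=0):
    """Build an argument list for `kmertools min` (hypothetical helper).

    Only flags listed in the help output are used; the defaults mirror
    the defaults shown there.
    """
    if preset not in ("s2m", "m2s"):
        raise ValueError("preset must be 's2m' or 'm2s'")
    # The help text states that w_size must be longer than m_size,
    # unless it is 0 (one minimiser per sequence).
    if w_size != 0 and w_size <= m_size:
        raise ValueError("w_size must be 0 or larger than m_size")
    return [
        "kmertools", "min",
        "--input", str(input_path),
        "--output", str(output_path),
        "--m-size", str(m_size),
        "--w-size", str(w_size),
        "--preset", preset,
        "--threads", str(threads),
    ]


if __name__ == "__main__":
    # reads.fasta and minimisers.txt are placeholder file names.
    print(" ".join(build_min_command("reads.fasta", "minimisers.txt")))
```

The returned list can be passed to subprocess.run(cmd, check=True) once kmertools is available on the PATH.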
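This command creates minimisers for sequences and outputs them in the format selected with --preset, either s2m or m2s.
The preset determines the output format.
s2m - converts each sequence into its minimiser representation. The output format is as follows.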
- First item - SEQ_ID
- Next items - MINIMISER:START-END
START is the start position of the minimiser window and END is the end index of the minimiser window.
For example, in ACGCCAT:0-32, ACGCCAT is the minimiser, and its window starts at index 0 and ends at index 32 (exclusive). The window size is 32.
Read_1 ACGCCAT:0-32 AAATCCC:2-57 AACAACT:27-62 AAACCCT:32-63 AAAACCC:33-72
Read_2 AAAATAC:0-50 AAGAATC:20-57 AAGCAGA:27-64 AACGACG:34-65 AAACGAC:35-66 AAAACGA:36-72
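A line in this format can be read back with a few lines of Python. This is a minimal sketch, assuming items are separated by whitespace as in the example above (the actual file may use tabs, which this also handles). The function name parse_s2m_line is hypothetical.

```python
def parse_s2m_line(line):
    """Parse one s2m line into (seq_id, [(minimiser, start, end), ...]).

    Assumes whitespace-separated items of the form MINIMISER:START-END,
    as shown in the example output.
    """
    seq_id, *items = line.split()
    records = []
    for item in items:
        minimiser, _, span = item.rpartition(":")
        start, end = span.split("-")
        # END is exclusive, so the window covers sequence[start:end].
        records.append((minimiser, int(start), int(end)))
    return seq_id, records


if __name__ == "__main__":
    seq_id, records = parse_s2m_line("Read_1 ACGCCAT:0-32 AAATCCC:2-57")
    for minimiser, start, end in records:
        print(seq_id, minimiser, start, end)
```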
m2s - reports sequences grouped by their minimisers. This preset requires more memory than s2m. The output format is as follows. The first and second items are tab-separated, and the second item is an array of tuples.
- First item - MINIMISER
- Second item - [tuple_1, tuple_2, ...]
- Tuple format - (SEQ_ID, START, END)
AAAACCCTTA [("Read_1", 0, 72)]
AAAACGACGC [("Read_2", 0, 72)]
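Because the tuple array in the example happens to look like a Python literal, it can be parsed with the standard library. This is a sketch under that assumption only: if the real output differs from the example (different quoting, for instance), ast.literal_eval will raise an error. The function name parse_m2s_line is hypothetical.

```python
import ast


def parse_m2s_line(line):
    """Parse one m2s line into (minimiser, [(seq_id, start, end), ...]).

    Assumes the minimiser and the tuple array are separated by whitespace
    (a tab in the real file) and that the array is written exactly as in
    the example, i.e. as a Python-style list of tuples.
    """
    minimiser, array_text = line.strip().split(maxsplit=1)
    tuples = ast.literal_eval(array_text)
    return minimiser, [(seq_id, int(s), int(e)) for seq_id, s, e in tuples]


if __name__ == "__main__":
    minimiser, hits = parse_m2s_line('AAAACCCTTA\t[("Read_1", 0, 72)]')
    print(minimiser, hits)
```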
M size determines the size of the minimiser. W size determines the window size. -w 0 sets the window size to the full length of the sequence. Otherwise, it must be larger than the -m value.
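The relationship between the two sizes can be illustrated with a small Python sketch. It is not the kmertools implementation: it assumes plain lexicographic ordering of m-mers and a window measured in bases, and the tool may use a different ordering or report windows differently. The function names are hypothetical.

```python
def window_minimiser(window, m):
    """Return the lexicographically smallest m-mer in the window
    (assumed ordering, for illustration only)."""
    return min(window[i:i + m] for i in range(len(window) - m + 1))


def minimisers(seq, m=10, w=0):
    """Return (minimiser, start, end) for every window of size w.

    w == 0 mimics -w 0: a single window spanning the whole sequence,
    hence one minimiser per sequence. END is exclusive.
    """
    if w == 0:
        w = len(seq)
    elif w <= m:
        raise ValueError("w must be 0 or larger than m")
    if len(seq) < m:
        return []  # sequence too short to contain any m-mer
    w = min(w, len(seq))  # a short sequence yields a single window
    return [
        (window_minimiser(seq[start:start + w], m), start, start + w)
        for start in range(len(seq) - w + 1)
    ]


if __name__ == "__main__":
    print(minimisers("ACGTACGA", m=3, w=0))
    print(minimisers("ACGTACGA", m=3, w=5))
```

With w=0 the sketch returns exactly one entry per sequence, which matches the behaviour described for sequencing reads.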
TBD
kmertools - k-mer driven genomics analytics toolkit