Skip to content

Commit

Permalink
Implement --combine
Browse files Browse the repository at this point in the history
  • Loading branch information
cjw85 committed Feb 7, 2023
1 parent c802ebf commit ecc943d
Show file tree
Hide file tree
Showing 8 changed files with 115 additions and 30 deletions.
40 changes: 37 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -83,14 +83,47 @@ modbam2bed -- summarise one or more BAM with modified base tags to bedMethyl.

The htslib pileup API is used to create a matrix of per-strand base counts
including modified bases and deletions. Inserted bases are not counted. Bases
of an abiguous nature, as defined by the two threshold probabilities are
of an abiguous nature (refered to as "filtered" below), as defined by the two threshold probabilities are
masked and used (along with substitutions and deletions) in the definition
of the "score" (column 5) and "coverage" (column 10) entries of the bedMethyl file.
In the case of `?`-style `MM` subtags, where a lack of a recorded call should
not be taken as implying a canonical-base call, the "no call" count is incremented.
The "no call" count is used in the calculation of "coverage" and also the denominator
of "score".

**An aside on alternative modified bases (`--combine` option)**

Oxford Nanopore Technogies' sequencing chemistries and basecallers can detect
any number of modified bases. Compared to traditional methods which force a
false dichoctomy between say cytosine and 5-methylcytosine, this rich biology
needs to be remembered when interpreting modified base calls.

In its default mode `modbam2bed` considers only a single modified-base tag in
the source BAM file and makes a choice between the categories "modified" and
"canonical" based on quality score stored within the BAM file. **Only when a
single modified base tag is present these categorical names are directly
intepretable.** However, in the case where multiple tags are present the quality
score does not directly measure the choice "modified" or "canonical". The score
that should be associated with a "canonical" call is a function of all the
modified-base qualities from all listed tags.

To intepret more correctly the case of multiple modifications being listed in
the BAM, `modbam2bed` includes an option `--combine`. When used, all
modification tags corresponding to the same canonical base are examined. The
choice "modified" vs. "canonical" for a particular base in a read is then driven
by any one of the family of modification tags containing a score which indicates
a modification being present. For example if both 5hmC and 5mC tags are
present, if either the score for 5hmC or 5mC is greater than the threshold
score, the site will be deemed to be modifed. If all tags when considered
independently would indicate the "canonical" category then the output of
`modbam2bed` will include a count for this category. In all other situations,
i.e. a mix of "canonical" and "filtered", the output will contain a "filtered"
count.

***A particular case where `--combine` is useful is when comparing to the result of bisulfite sequencing.***

**Output format**

> The description of the [bedMethyl](https://www.encodeproject.org/data-standards/wgbs/)
> format on the ENCODE project website is rather loose. The definitions below are chosen pragmatically.
Expand All @@ -100,7 +133,7 @@ file specification, the values written are fixed and no meaning should be derive
from them. Columns 5, 10, and 11 are defined in terms of counts of observed
bases to agree with reasonable interpretations of the bedMethyl specifications:

* N<sub>canon</sub> - canonical (unmodified) base count.
* N<sub>canon</sub> - canonical (unmodified) base count, (contigent on the use of `--combine`, see above.)
* N<sub>mod</sub> - modified base count.
* N<sub>filt</sub> - count of bases where read does not contain a substitution or deletion
with respect to the reference, but the modification status is ambiguous: these bases
Expand Down Expand Up @@ -198,7 +231,7 @@ The result is two [numpy](https://numpy.org/) arrays. The first indicates the re
positions associated with the counts in the second array. Each row of the second array
(`counts` above) enumerates the observed counts of bases in the order:

a c g t A C G T d D m M f F
a c g t A C G T d D m M f F n N

where uppercase letters refer to bases on the forward strand, lowercase letters
relate to the reverse strand:
Expand All @@ -208,6 +241,7 @@ relate to the reverse strand:
* M modified base counts,
* F filtered counts - bases in reads with a modified-base record but which were filtered
according to the thresholds provided.
* N no call base counts.

**Extras**

Expand Down
22 changes: 19 additions & 3 deletions modbampy/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@
import libmodbampy

# remember to bump version in src/version.h too
__version__ = "0.7.0"
__version__ = "0.8.0"
ffi = libmodbampy.ffi
libbam = libmodbampy.lib

Expand Down Expand Up @@ -39,6 +39,11 @@ def _tidy_args(read_group, tag_name, tag_value):
class ModBase:
"""Helper to create a mod_base instance.
:param code: modified base ChEBI code (e.g. "h" or 104)
:param base: one of {A, C, G, T}
:param name: long name of modified base (e.g. "5-methylcytosine")
:param abbrev: short name of modified base (e.g. "5mC")
Actually just a compatible list is created. Reuses the predefined
instances from header where possible.
"""
Expand Down Expand Up @@ -139,7 +144,7 @@ def pileup(
self, chrom, start, end,
read_group=None, tag_name=None, tag_value=None,
low_threshold=0.33, high_threshold=0.66, mod_base="m",
max_depth=None, canon_base=None):
max_depth=None, canon_base=None, combine=False):
"""Create a base count matrix.
:param chrom: reference sequence from BAM.
Expand All @@ -148,6 +153,17 @@ def pileup(
:param read group: read group of read to return.
:param tag_name: read tag to check during read filtering.
:param tag_value: tag value for reads to keep.
:param low_threshold: threshold below which a base is determined as
not the modified base.
:param high_threshold: threshold above which a base is determined
to be the modified base.
:param mod_base: ChEBI code of modified base to examine.
:param max_depth: maximum read depth to examine.
:param canon_base: canonical base corresponding to `mod_base`.
Required only if `mod_base is not a modification known to
the code.
:param combine: combine (include) all alternative modifications
with the same parent canonical base.
"""
for thresh in (low_threshold, high_threshold):
if thresh < 0.0 or thresh > 1.0:
Expand All @@ -167,7 +183,7 @@ def pileup(
fsets, chrom.encode(), start, end,
read_group, tag_name, tag_value,
low_threshold, high_threshold, mod_base.struct,
max_depth)
combine, max_depth)
# TODO: check for NULL

# copy data to numpy, we could be more clever here an wrap
Expand Down
19 changes: 17 additions & 2 deletions src/args.c
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,10 @@ static struct argp_option options[] = {
"Output extended bedMethyl including counts of canonical, modified, and filtered bases (in that order)."},
{"mod_base", 'm', "BASE", 0,
"Modified base of interest, one of: 5mC, 5hmC, 5fC, 5caC, 5hmU, 5fU, 5caU, 6mA, 5oxoG, Xao. (Or modA, modC, modG, modT, modU, modN for generic modified base)."},
{"combine", 0x800, 0, 0,
"Create output with combined modified counts: i.e. alternative modified bases within the same family (same canonical base) are included."},
{"aggregate", 0x600, 0, 0,
"Output additional aggregated (across strand) counts, requires --cpg or --chg."},
{"threads", 't', "THREADS", 0,
"Number of threads for BAM processing."},
{"prefix", 'p', "PREFIX", 0,
Expand All @@ -55,8 +59,6 @@ static struct argp_option options[] = {
"Output records filtered to CHH sites.", 2},
{"chg", 0x500, 0, 0,
"Output records filtered to CHG sites.", 2},
{"aggregate", 0x600, 0, 0,
"Output additional aggregated (across strand) counts, requires --cpg or --chg.", 2},
{"mask", 'k', 0, 0,
"Respect soft-masking in reference file.", 2},
{0, 0, 0, 0,
Expand Down Expand Up @@ -115,6 +117,9 @@ static error_t parse_opt (int key, char *arg, struct argp_state *state) {
"Unrecognised modified base type: %s. ChEBI codes are not supported", arg);
}
break;
case 0x800:
arguments->combine = true;
break;
case 'r':
arguments->region = arg;
break;
Expand Down Expand Up @@ -206,6 +211,7 @@ static struct argp argp = {options, parse_opt, args_doc, doc};
arguments_t parse_arguments(int argc, char** argv) {
arguments_t args;
args.mod_base = default_mod_base;
args.combine = false;
args.lowthreshold = (int)(0.33 * 255);
args.highthreshold = (int)(0.66 * 255);
args.bam = NULL;
Expand Down Expand Up @@ -247,5 +253,14 @@ arguments_t parse_arguments(int argc, char** argv) {
fprintf(stderr, "ERROR: --highthreshold must be larger than --lowthreshold\n");
exit(1);
}
if (strncmp("5mC", args.mod_base.abbrev, 3) == 0 || strncmp("5hmC", args.mod_base.abbrev, 4)) {
fprintf(stderr,
"WARNING: You have specified either 5mC or 5hmC as a modified base.\n\
Oxford Nanopore Basecallers jointly call C, 5mC, and 5hmC. If you\n\
wish to combine calls of these bases into a single 'modified'\n\
count, please use the `--combine` option. The default behaviour\n\
is that calls of alternative modified bases are added to the\n\
non-modified (the then misnamed 'canonical' count).");
}
return args;
}
1 change: 1 addition & 0 deletions src/args.h
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ typedef struct arguments {
char tag_name[2];
int tag_value;
mod_base mod_base;
bool combine;
bool mask;
bool cpg;
bool chh;
Expand Down
54 changes: 36 additions & 18 deletions src/counts.c
Original file line number Diff line number Diff line change
Expand Up @@ -455,7 +455,8 @@ bool query_mod_subtag(hts_base_mod_state *state, int qtype, int qcanonical, char
* @param tag_value associated with tag_name
* @param lowthreshold highest probability to call base as canonical.
* @param highthreshold lowest probablity to call base as modified.
* @param mod_base BAM code for modified base to report. (e.g. h for 5hmC), or a ChEBI code.
* @param mb BAM code for modified base to report. (e.g. h for 5hmC), or a ChEBI code.
* @param combine combine all modified bases corresponding to same canonical base as mb
* @returns a pileup data pointer.
*
* The return value can be freed with destroy_plp_data.
Expand All @@ -464,7 +465,7 @@ bool query_mod_subtag(hts_base_mod_state *state, int qtype, int qcanonical, char
plp_data calculate_pileup(
const set_fsets *fsets, const char *chr, int start, int end,
const char *read_group, const char tag_name[2], const int tag_value,
int lowthreshold, int highthreshold, mod_base mb, int max_depth) {
int lowthreshold, int highthreshold, mod_base mb, bool combine, int max_depth) {

static bool shown_second_strand_warning = false;

Expand Down Expand Up @@ -568,12 +569,13 @@ plp_data calculate_pileup(
size_t n_mods = 256;
hts_base_mod_state *mod_state = p->cd.p;
hts_base_mod allmod[n_mods];
int valid[n_mods]; size_t n_valid = 1;
bool found = false;
int nm = bam_mods_at_qpos(p->b, p->qpos, mod_state, allmod, n_mods);
if (nm < 0 ) continue; // ignore reads which give error
bool found = false;
hts_base_mod mod;
if (nm > 0) {
hts_base_mod mod;
// TODO: deal with multiple overlapping tags e.g. C+m, C+h, ...
// first find set of valid tags - those with correct canonical base
for (int k = 0; k < nm && k < n_mods; ++k) {
mod = allmod[k];
if (mod.strand == 1) { // second strand tag
Expand All @@ -583,27 +585,43 @@ plp_data calculate_pileup(
}
continue;
}
if (mb.code == mod.modified_base
|| mb.code == -mod.modified_base) {
if (mb.code == mod.modified_base || mb.code == -mod.modified_base) {
valid[0] = k;
found = true;
if (mod.qual > highthreshold) {
// modified
base_i = bam_is_rev(p->b) ? rev_mod : fwd_mod;
} else if (mod.qual < lowthreshold) {
// canonical
base_i = num2countbase[bam_is_rev(p->b) ? base_j + 16: base_j];
} else {
// filter out
base_i = bam_is_rev(p->b) ? rev_filt : fwd_filt;
}
break;
}
else if (mod.canonical_base == mb.base) {
valid[n_valid] = k;
n_valid++;
found = found || combine;
}
}
size_t kmin = found ? 0 : 1;
size_t kmax = !combine ? 1 : n_valid;

size_t modified = 0;
size_t canonical = 0;
size_t filtered = 0;
for (int k = kmin; k < kmax; ++k) {
mod = allmod[valid[k]];
if (mod.qual > highthreshold) { modified++; }
else if (mod.qual < lowthreshold) { canonical++; }
else (filtered++);
}
if (modified > 0) { // modified trumps all
base_i = bam_is_rev(p->b) ? rev_mod : fwd_mod;
}
else if (canonical == (kmax - kmin)) { // all were canonical
base_i = num2countbase[bam_is_rev(p->b) ? base_j + 16: base_j];
}
else { // ambiguous
base_i = bam_is_rev(p->b) ? rev_filt : fwd_filt;
}
}
if (found == false) {
// No mod base subtag entry was found. In the case of explicit `?`
// tag we should not assume canonical, otherwise we can.
// NOTE: we don't look for second strand `-` tags.
// or a mess of `?` and `.` for alternative mods
if (query_mod_subtag(mod_state, mb.code, mb.base_i, '+', 0)) {
// we had an explicit tag, but no call for this position
base_i = bam_is_rev(p->b) ? rev_nocall : fwd_nocall;
Expand Down
3 changes: 2 additions & 1 deletion src/counts.h
Original file line number Diff line number Diff line change
Expand Up @@ -157,6 +157,7 @@ void print_bedmethyl(
* @param lowthreshold highest probability to call base as canonical.
* @param highthreshold lowest probablity to call base as modified.
* @param mod_base a mod_base instance
* @param combine combine all modified bases corresponding to same canonical base as mb
* @param max_depth maximum depth of pileup.
* @returns a pileup data pointer.
*
Expand All @@ -166,6 +167,6 @@ void print_bedmethyl(
plp_data calculate_pileup(
const set_fsets *fsets, const char *chr, int start, int end,
const char *read_group, const char tag_name[2], const int tag_value,
int lowthreshold, int highthreshold, mod_base mb, int max_depth);
int lowthreshold, int highthreshold, mod_base mb, bool combine, int max_depth);

#endif
4 changes: 2 additions & 2 deletions src/modbam2bed.c
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ void *pileup_worker(void *arg) {
plp_data pileup = calculate_pileup(
files, j.chr, j.start, j.end,
j.args.read_group, j.args.tag_name, j.args.tag_value,
j.args.lowthreshold, j.args.highthreshold, j.args.mod_base,
j.args.lowthreshold, j.args.highthreshold, j.args.mod_base, j.args.combine,
j.args.hts_maxcnt);
destroy_filesets(files);
free(arg);
Expand All @@ -57,7 +57,7 @@ void process_region(arguments_t args, const char *chr, int start, int end, char
plp_data pileup = calculate_pileup(
args.bam, chr, start, end,
args.read_group, args.tag_name, args.tag_value,
args.lowthreshold, args.highthreshold, args.mod_base,
args.lowthreshold, args.highthreshold, args.mod_base, args.combine,
args.hts_maxcnt);
if (pileup == NULL) return;

Expand Down
2 changes: 1 addition & 1 deletion src/version.h
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
// remember to bump version in modbampy/__init__.py too
const char *argp_program_version = "0.7.0";
const char *argp_program_version = "0.8.0";

0 comments on commit ecc943d

Please sign in to comment.