-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix variant region for ins and dup on intron-exon boundary #709
Fix variant region for ins and dup on intron-exon boundary #709
Conversation
@reece I am wondering who can help review this pull request? I believe the core pdot calculation logic is currently giving incorrect results, and we are under some pressure to fix it. I suppose the two tasks are:
The basic problem is-- when material is inserted exactly at the intron/exon boundary, should the inserted material be treated as part of the coding region, or as part of the intron? We believe it is more correct to treat it as part of the coding sequence, because the entire intron and splicing region is intact. So in principle, splicing can still occur using the original splice site. The current behavior is inconsistent. Insertions at the exon-intron boundary are treated as coding (because the cdot nomenclature doesn't include any intronic positions). Insertions at the intron-exon boundary are treated as non-coding (because the cdot nomenclature does include intronic positions). But I can't think of any reason these situations would be different biologically. This PR both (1) makes the behavior consistent (always treats the inserted material as coding), and (2) makes the behavior correct (if you're convinced by the argument above). This PR also makes hgvs consistent with Mutalyzer. Examples:
|
Hi @gostachowiak: thanks for pinging me. I'll take a look this week and comment here. |
@reece reminder about this PR. If it would be easier to have a meeting to review together, we would be happy to do that. We will need to consider forking soon (maintaining a separate private version) if fixing this isn't an option. Our users consider this to be a serious flaw. |
624f045
to
ed861bd
Compare
@cassiemk @gostachowiak @b0d0nne11 : I'm very sorry that I let this issue fall off my radar. I appreciate the thoughtfulness of the original issue (#655) and the discussion here, and the completeness of the implementation. I have one question, which I'll pick up in a review in the next few minutes. And I agree with you that the implementation is inconsistent, or at least not well-defined. Thank you for addressing it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is very nearly ready. Please see the comments in altseqbuilder.py.
src/hgvs/utils/altseqbuilder.py
Outdated
elif self._var_c.posedit.edit.type == "ins" and self._var_c.posedit.pos.start.offset == 0 and self._var_c.posedit.pos.end.offset == 1: | ||
# ins at exon-intron boundary | ||
result = self.EXON | ||
elif self._var_c.posedit.edit.type == "dup" and self._var_c.posedit.pos.start.offset <= -1 and self._var_c.posedit.pos.end.offset >= -1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What cases of boundary-spanning dups are intended in this conditional?
I believe that the intended case is a dup that spans an intron-exon boundary (i.e., on the 5' end of an exon). However, I believe that misses a set of these cases and may match unintended cases. I've summarized these here:
Please comment on the following:
self._var_c.posedit.pos.end.offset >= -1
: Why not== 0
, which is more specific? I believe that end is intended to be in the CDS.- Which of the cases do you intend to catch in the figure?
I'm happy to talk more about this interactively if useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reece Thanks for the comment. The intention in this PR is to handle insertions that are directly at the boundary (not spanning the boundary). So self._var_c.posedit.pos.end.offset >= -1
should be updated to self._var_c.posedit.pos.end.offset = -1
(because a duplication in the left intron will insert material directly at the boundary if the end position is -1).
So in your image:
- Not intended to be handled (fixed with the change above)
- Not intended to be handled. And, no special logic is needed for the right end of the exon, because a duplication that inserts material at the boundary has no offsets, so is already considered exonic.
2, 3, 4. Good questions, we are thinking about these.
src/hgvs/utils/altseqbuilder.py
Outdated
@@ -326,7 +340,7 @@ def _setup_incorporate(self): | |||
result = cds_start - 1 | |||
else: # cds/intron | |||
if pos.offset <= 0: | |||
result = (cds_start - 1) + pos.base - 1 | |||
result = (cds_start - 1) + pos.base + pos.offset - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know if this line is covered by tests prior to this PR? It's suspicious to me that a bug fix of this magnitude doesn't have broader impacts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was a little surprised too but if I remove it the following tests fail:
def test_map_of_dup_intron_exon_boundary(self):
hgvs_c = "NM_024529.4:c.132-2_132-1dup"
hgvs_p = "NP_078805.3:p.(Thr45GlyfsTer65)"
var_c = self.hp.parse_hgvs_variant(hgvs_c)
var_p = self.am.c_to_p(var_c)
> self.assertEqual(str(var_p), hgvs_p)
E AssertionError: 'NP_078805.3:p.(Thr45ArgfsTer65)' != 'NP_078805.3:p.(Thr45GlyfsTer65)'
E - NP_078805.3:p.(Thr45ArgfsTer65)
E ? ^^^
E + NP_078805.3:p.(Thr45GlyfsTer65)
E ? ^^^
tests/test_hgvs_assemblymapper.py:210: AssertionError
def test_map_of_dup_intron_exon_boundary_rc(self):
hgvs_c = "NM_004380.2:c.3251-1dup"
hgvs_p = "NP_004371.2:p.(Ile1084SerfsTer3)"
var_c = self.hp.parse_hgvs_variant(hgvs_c)
var_p = self.am.c_to_p(var_c)
> self.assertEqual(str(var_p), hgvs_p)
E AssertionError: 'NP_004371.2:p.(Ile1084MetfsTer3)' != 'NP_004371.2:p.(Ile1084SerfsTer3)'
E - NP_004371.2:p.(Ile1084MetfsTer3)
E ? ^ ^
E + NP_004371.2:p.(Ile1084SerfsTer3)
E ? ^ ^
tests/test_hgvs_assemblymapper.py:227: AssertionError
def test_map_of_ins_intron_exon_boundary(self):
hgvs_c = "NM_004380.2:c.3251-1_3251insA"
var_c = self.hp.parse_hgvs_variant(hgvs_c)
var_p = self.vm.c_to_p(var_c)
> self.assertEqual(str(var_p), "NP_004371.2:p.(Ile1084AsnfsTer3)")
E AssertionError: 'NP_004371.2:p.(Phe1085LeufsTer2)' != 'NP_004371.2:p.(Ile1084AsnfsTer3)'
E - NP_004371.2:p.(Phe1085LeufsTer2)
E ? ^^ ^^^^ ^
E + NP_004371.2:p.(Ile1084AsnfsTer3)
E ? ^^ ^^^^ ^
tests/test_hgvs_variantmapper.py:125: AssertionError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reece While working on my latest commit I found a way to pull the offset out of this calculation. My understanding in that we had to backup the start position for negative offsets by 1 so that the inserted section precedes the exon rather than ending up right after the first base. You're correct that using offsets here is weird since the offset positions aren't even present in the cdot sequence. Using offset worked but only because for this change only the -1 offset is critical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With + pos.offset
in line 329 in main, tests still pass. The only way that can be true, I think, is that pos.offset is always 0 for all tests. This might be a gap in tests. (Nothing for you to do here.)
tests/data/sanity_cp.tsv
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
tests/test_hgvs_assemblymapper.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great tests!
c113fa2
to
43b08ec
Compare
Thanks for the thorough feedback @reece. I've updated the boundary conditions and tests after reviewing it with @gostachowiak. I believe he'll have a longer update soon. I've also rebased on main to pull in the build fix. |
We discovered issues with 2 of the tests I added yesterday. I just pushed a fix for those. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the improvements. Please check my comments below. This is some thorny code and I appreciate your diving into it.
src/hgvs/assemblymapper.py
Outdated
var_g = super(AssemblyMapper, self).c_to_g(var, alt_ac, alt_aln_method=self.alt_aln_method) | ||
if var.type == 'n': | ||
var_g = super(AssemblyMapper, self).n_to_g(var, alt_ac, alt_aln_method=self.alt_aln_method) | ||
if var.type == 't': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t is a shorthand in functions for var.type in "cnr", but var.type is not ever set to "t"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed that case.
src/hgvs/assemblymapper.py
Outdated
@@ -241,6 +243,26 @@ def _alt_ac_for_tx_ac(self, tx_ac): | |||
assert len(alt_acs) == 1, "Should have exactly one alignment at this point" | |||
return alt_acs[0] | |||
|
|||
def _replace_reference(self, var): | |||
if (getattr(var.posedit.pos.start, 'offset', 0) != 0 or getattr(var.posedit.pos.end, 'offset', 0) != 0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the value of this check here?
I think that you could obviate the getattr with structure like this:
def _replace_reference(self, var):
if var.type in "cnr":
# var.posedit.pos.start and .end will be BaseOffsetPositions, so no getattr needed
# be on your merry way...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the new references to getattr
. This is cleaner. Thanks for the suggestion!
@@ -241,6 +243,26 @@ def _alt_ac_for_tx_ac(self, tx_ac): | |||
assert len(alt_acs) == 1, "Should have exactly one alignment at this point" | |||
return alt_acs[0] | |||
|
|||
def _replace_reference(self, var): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that we should have only one _replace_reference
. It's currently in variantmapper. Is there a reason that one doesn't work / can't be extended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. In order to get the reference value for these variants we need to first map them back to the var_g
form which requires an assembly mapper afaik. We don't actually call the *_to_g
function from the AM because that includes additional normalization. Instead we use the AM to get the alt_ac
and then call *_to_g
from the VM with the alt_ac
and alt_aln_method
from the AM. In all other cases this method simply calls _replace_reference
from the variant mapper instead. This is all so that we can get a portion of the reference sequence from the intronic section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The AssemblyMapper
has VariantMapper
as a parent class. In that light, do we need to have a method of the same name? I think this is confusing, since this one does something different. Also, the AM has a normalize
property. Could that get used to prevent what seem to be an undesired side-effects? I assume the intent is not to call _maybe_normalize
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @andreasprlic,
I don't think I understand your first question. This _replace_reference
method overrides the method of the same name from the parent VariantMapper
class. It has the same purpose and type signature. The only difference is that I've added logic to handle c
and n
type variants with non-zero offsets.
I thought about disabling self.normalize
temporarily in AssemblyMapper
to avoid calling the _maybe_normalize
method but I would also have to make sure the original value of self.normalize
is replaced afterward, probably with a try...finally
block. I thought it was simpler to call the underlying functions directly but I can rework this if you'd prefer.
Thanks!
src/hgvs/variantmapper.py
Outdated
@@ -483,8 +483,8 @@ def _replace_reference(self, var): | |||
else: | |||
pos = var.posedit.pos | |||
|
|||
seq_start = pos.start.base - 1 | |||
seq_end = pos.end.base | |||
seq_start = pos.start.base + getattr(pos.start, 'offset', 0) - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about something like:
if var.type in "cnr":
# assume pos have offset
else:
# don't have offset
The subtext here is that I find getattr hard to read because it obfuscates the cases that we're trying to handle. I try mightily to avoid getattr with defaults.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
src/hgvs/utils/altseqbuilder.py
Outdated
@@ -326,7 +340,7 @@ def _setup_incorporate(self): | |||
result = cds_start - 1 | |||
else: # cds/intron | |||
if pos.offset <= 0: | |||
result = (cds_start - 1) + pos.base - 1 | |||
result = (cds_start - 1) + pos.base + pos.offset - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With + pos.offset
in line 329 in main, tests still pass. The only way that can be true, I think, is that pos.offset is always 0 for all tests. This might be a gap in tests. (Nothing for you to do here.)
@reece I made some changes based on your feedback but I'm still doing some more testing on my end. I'll comment here once it passes. Thanks! |
Hey @reece, my testing all checked out fine with the latest changes. Thanks again for your feedback. |
444bd03
to
22705e2
Compare
22705e2
to
dc8d9e6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gostachowiak @b0d0nne11 : I need your help to think this change through. What I think you're doing is filling out (missing) intronic sequence using a genomic reference, and then using that to make a prediction about p.
If that's right, then, in principle, the result depends on the intronic sequence. Since the intronic sequence will depend on which assembly is used, the exact intronic sequence that's fetched might vary. This is why hgvs is so stubborn about boundaries and missing sequence and warnings when using intronic positions.
Do I understand your proposed change correctly? If so, aren't we building in an implicit dependency on an arbitrary choice of assembly?
Yes, I think you understand correctly-- the result depends on the intronic sequence.
This is inserting material after the splice site, in the coding region. (even though both positions in the nomenclature have offsets). Currently an empty pdot is returned, which I believe is wrong. The inserted material is part of the coding region and will produce a change to the amino acid sequence.
To calculate the pdot, we need to know what the duplicated bases are. The only way we could come up with is to look back into the reference sequence. Our typical use case is take something from a VCF file (chr/start/ref/alt) --> calculate cdot --> calculate pdot. We need this to work. In this situation it seems safe to go back to GRCh37 or GRCh38 to figure out what the duplicated bases actually are, since the VCF file said it explicitly. I think I may be naive about the "arbitrary choice of assembly" problem, possibly because we're just trying to do straightforward GRCh37/38 --> cdot --> pdot. In what situation would this procedure fail? Tangentially, this appears to be a clear drawback of not listing out the duplicated bases in HGVS nomenclature. It makes |
Also, this might be semantics but we're not using an arbitrary assembly in the sense that we're picking GRCh37 or GRCh38 randomly. Rather, we're using the assembly tied to the Thinking about this now I might be able to avoid requiring AM if I change the function signature for |
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
@andreasprlic agreed to investigate here. |
dc8d9e6
to
33c28c9
Compare
Rebased on main |
33c28c9
to
58a780e
Compare
58a780e
to
9c1539f
Compare
Rebased on main |
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
We found a few variants that have issues with cigar mapping where the intermediate mapping logic introduced in this PR cased new exceptions to be thrown. I've fixes those issues and added tests for them in #748. If it's needed I can reopen this PR and also fix those issues here. |
Fixes #655
Certain ins or dup variants that occur at the intron/exon boundaries have non-zero offsets but are still coding. This PR ensures that projections return accurate var_p representations for these variants.
Fixing ins variants are pretty straightforward because we know what to insert. Fixing dup variants is more involved because we need to calculate the reference sequence. This reference sequence must be derived from the var_g representation so that we get the correct bases from the intron rather than the previous or following exon. As a result there are some mappings that require
AssemblyMapper
. In these cases trying to perform the mapping withVariantMapper
will throw anHGVSError
. In addition, when computing the reference sequence we need to handle strand orientation and avoid normalization.