-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
No variants are simulated in zero-recombination region of chr22 #1422
Comments
Looks like that's because this region is masked ("blank" trees) in the output tree sequence for both engines. The hapmap files for both genetic maps start at ~15 Mb, so I assume that to fix this we'd need to add a line at the beginning specifying a 0 rate (so this region is not treated as "missing data")? |
related to #1387 |
To get to the bottom of this I decided to look at branch stats instead. With the SLiM engine, we get non-zero segsites for the first few windows when
|
So it looks like |
wouldn't it be where the rec rate is nan, not 0? |
oops, yes, that is what I meant! |
Missing data in |
Trimming out the trees with missing data on the flanks is definitely by design, as we've had a number of cases where these large zero-recombination rate bits flanking a chromosome have really messed up statistics. Do we really know that the start of chr22 has a recombination rate of 0, or is it just that there's not much data there because of sequence weirdness? |
Weirdness is probably the answer @jeromekelleher -- looks like the short arm of chr22 (0 to ~15Mb) contains many repetitive regions that are homologous to (and may recombine with) parts of other chromosomes. One thing that confuses me with these maps is that they invariably contain missing data at the start, but 0 recombination rate regions at the end. For example, import stdpopsim
homsap = stdpopsim.get_species("HomSap")
chr22 = homsap.get_contig("chr22", genetic_map="HapMapII_GRCh38")
engine = stdpopsim.get_engine("msprime")
model = stdpopsim.PiecewiseConstantSize(1000)
sim = engine.simulate(contig=chr22, demographic_model=model, samples={"pop_0":2})
print(chr22.length)
# 50818468.0
print(chr22.recombination_map.asdict())
# {'position': array([ 0., 15287921., 15295878., ..., 50785209., 50791377., 50818468.]),
# 'rate': array([ nan, 6.74364e-09, 4.31139e-09, ..., 6.48707e-09, 6.55239e-09, 0.00000e+00])}
print([sim.first().num_edges, sim.last().num_edges])
# [0, 6]
print(list(sim.breakpoints())[-5:])
# [50714355.0, 50756801.0, 50760002.0, 50779378.0, 50818468.0] so, the last bin (from 50791377 to 50818468) is not masked but instead simulated with a zero recombination rate, which results in a single tree that spans from 50779378 to the chromosome end in this simulation. In contrast the first bin is missing so contains "empty" trees in the output. All the chromosomes are like this-- just seems odd to me that the chromosome start/end aren't treated the same? |
That does sound like we're not treating the ends consistently, and should mask the ends as well. A topic for the next stdpopsim meeting next week? |
Agreed, sounds like the ends aren't being handled properly. |
Here is where the problem lies: stdpopsim/stdpopsim/genetic_maps.py Lines 120 to 123 in a229273
|
That would be it - should just be np.nan I guess. |
A small complication-- if the SLiM engine is used, then mutations w/ fitness effects may be placed on masked intervals (which seem to be simulated as having 0-recombination rate). Though these mutations will ultimately be removed by import stdpopsim
homsap = stdpopsim.get_species("HomSap")
chr1_left_edge = homsap.get_contig("chr1", genetic_map="HapMapII_GRCh38", right=100000)
# deleterious DFE over entire contig
mt = stdpopsim.MutationType(
distribution_type="f",
dominance_coeff=1.0,
distribution_args=[-0.01],
)
dfe = stdpopsim.DFE(
id="new_mutation",
mutation_types=[mt],
proportions=[1.0],
description="",
long_description="",
)
chr1_left_edge.add_dfe(intervals=[[0, chr1_left_edge.length]], DFE=dfe)
engine = stdpopsim.get_engine("slim")
model = stdpopsim.PiecewiseConstantSize(1000)
sim = engine.simulate(
contig=chr1_left_edge,
demographic_model=model,
samples={"pop_0":2},
slim_scaling_factor=4,
_recap_and_rescale=False,
seed=1024,
)
print(chr1_left_edge.recombination_map.asdict()) # First 55Kb masked in recmap
# {'position': array([ 0., 55550., 82571., 88169., 100000.]),
# 'rate': array([ nan, 2.981822e-08, 2.082414e-08, 2.484956e-08])}
print(sim.first().num_edges) # Without recap/rescale, there's a tree in masked region
# 913
print([i for i in sim.breakpoints()][:5]) # Looks like masked region is simulated as nonrecombining
# [0.0, 55826.0, 55850.0, 56041.0, 58092.0]
print(sim.first().num_mutations) # ... and there are deleterious mutations on that tree
# 5
print([v.position for v in sim.variants()])
# [19521.0, 19742.0, 22695.0, 36215.0, 50202.0, 77121.0, 78131.0, 83261.0] The thing to do might be to mask DFE intervals in |
yes this is a problem @nspope, and one that is pretty context specific. That said, in empirical data masked regions are generally highly repetitive and gene poor (e.g. heterochromatic regions of genomes). If we are masking low recombination regions, those are indeed likely to be gene poor, so perhaps this really is a small issue? |
We should really be consistent here - is "recombination rate nan" how we are communicating "this region is no good, don't simulate anything here"? If so, we should propagate that choice through to DFEs as well (so, perhaps |
Yeah I agree. Calling it "masked regions" or "missing data" was putting it the wrong way, as these terms could be conflated with post-simulation masking. Instead, we don't know the parameter values over an interval, and so can't simulate it. My inclination is to follow
Fixing #1404 should resolve this issue, I think -- when we extract a chunk of a chromosome, we'll use |
For example:
gives
The text was updated successfully, but these errors were encountered: