Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No variants are simulated in zero-recombination region of chr22 #1422

Closed
nspope opened this issue Nov 15, 2022 · 18 comments · Fixed by #1617
Closed

No variants are simulated in zero-recombination region of chr22 #1422

nspope opened this issue Nov 15, 2022 · 18 comments · Fixed by #1617
Assignees
Milestone

Comments

@nspope
Copy link
Collaborator

nspope commented Nov 15, 2022

For example:

import stdpopsim
import numpy as np

# simulate first 20Mb of chr22

for mapname in ["HapMapII_GRCh37", "HapMapII_GRCh38"]:
    sp = stdpopsim.get_species("HomSap")
    chr22 = sp.get_contig("chr22", genetic_map=mapname, right=20e6)
    constant_popsize = stdpopsim.PiecewiseConstantSize(1000)
    breaks = np.linspace(0, chr22.length, 7)
    breaks_mb = np.round(breaks/1e6, 1)
    
    slim = stdpopsim.get_engine("slim")
    ts_slim = slim.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10}, slim_scaling_factor=4)
    div_slim = ts_slim.segregating_sites(windows=breaks)
    for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_slim):
        print(f"segsites slim/{mapname} [{left}-{right} Mb]: {div}")
    
    msp = stdpopsim.get_engine("msprime")
    ts_msp = msp.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10})
    div_msp = ts_msp.segregating_sites(windows=breaks)
    for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_msp):
        print(f"segsites msp/{mapname} [{left}-{right} Mb]: {div}")

gives

segsites slim/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [13.3-16.7 Mb]: 3.059999999999999e-05
segsites slim/HapMapII_GRCh37 [16.7-20.0 Mb]: 0.00018810000000000007
segsites msp/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [13.3-16.7 Mb]: 3.36e-05
segsites msp/HapMapII_GRCh37 [16.7-20.0 Mb]: 0.00020310000000000008
segsites slim/HapMapII_GRCh38 [0.0-3.3 Mb]: 0.0
segsites slim/HapMapII_GRCh38 [3.3-6.7 Mb]: 0.0
segsites slim/HapMapII_GRCh38 [6.7-10.0 Mb]: 0.0
segsites slim/HapMapII_GRCh38 [10.0-13.3 Mb]: 0.0
segsites slim/HapMapII_GRCh38 [13.3-16.7 Mb]: 8.189999999999998e-05
segsites slim/HapMapII_GRCh38 [16.7-20.0 Mb]: 0.00017460000000000007
segsites msp/HapMapII_GRCh38 [0.0-3.3 Mb]: 0.0
segsites msp/HapMapII_GRCh38 [3.3-6.7 Mb]: 0.0
segsites msp/HapMapII_GRCh38 [6.7-10.0 Mb]: 0.0
segsites msp/HapMapII_GRCh38 [10.0-13.3 Mb]: 0.0
segsites msp/HapMapII_GRCh38 [13.3-16.7 Mb]: 8.789999999999998e-05
segsites msp/HapMapII_GRCh38 [16.7-20.0 Mb]: 0.00019890000000000006
@nspope
Copy link
Collaborator Author

nspope commented Nov 15, 2022

Looks like that's because this region is masked ("blank" trees) in the output tree sequence for both engines.

The hapmap files for both genetic maps start at ~15 Mb, so I assume that to fix this we'd need to add a line at the beginning specifying a 0 rate (so this region is not treated as "missing data")?

@nspope
Copy link
Collaborator Author

nspope commented Nov 15, 2022

related to #1387

@mufernando
Copy link
Member

To get to the bottom of this I decided to look at branch stats instead.

With the SLiM engine, we get non-zero segsites for the first few windows when _recap_and_recale=False, but these we go to 0 when _recap_and_rescale=True.
With the msprime engine, they are always zero.

>>> mapname="HapMapII_GRCh37"
>>> sp = stdpopsim.get_species("HomSap")
>>> chr22 = sp.get_contig("chr22", genetic_map=mapname, right=20e6)
>>> constant_popsize = stdpopsim.PiecewiseConstantSize(1000)
>>> breaks = np.linspace(0, chr22.length, 7)
>>> breaks_mb = np.round(breaks/1e6, 1)
>>>
>>> eng = stdpopsim.get_engine("slim")
>>> ts_slim = eng.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10}, slim_scaling_factor=4, _recap_and_rescale=False)
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/stdpopsim/slim_engine.py:1523: SLiMScalingFactorWarning: You're using a scaling factor (4). This should give similar results for many situations, but is not equivalent, especially in the presence of selection. When using rescaling, you should be careful---do checks and compare results across different values of the scaling factor.
  warnings.warn(

>>> div_slim = ts_slim.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_slim):
...     print(f"segsites slim/{mapname} [{left}-{right} Mb]: {div}")
...
segsites slim/HapMapII_GRCh37 [0.0-3.3 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [3.3-6.7 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [6.7-10.0 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [10.0-13.3 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [13.3-16.7 Mb]: 7944.9128727
segsites slim/HapMapII_GRCh37 [16.7-20.0 Mb]: 7531.1644341
>>> msp = stdpopsim.get_engine("msprime")
>>> ts_msp = msp.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10})
>>> div_msp = ts_msp.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_msp):
...     print(f"segsites msp/{mapname} [{left}-{right} Mb]: {div}")
...
segsites msp/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [13.3-16.7 Mb]: 3081.1006886320965
segsites msp/HapMapII_GRCh37 [16.7-20.0 Mb]: 12736.400180715054
>>> mapname="HapMapII_GRCh37"
>>> sp = stdpopsim.get_species("HomSap")
>>> chr22 = sp.get_contig("chr22", genetic_map=mapname, right=20e6)
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/stdpopsim/genetic_maps.py:127: UserWarning: Recombination map has length 51229805.0, which is longer than chromosome length 50818468. The latter will be used.
  warnings.warn(
>>> constant_popsize = stdpopsim.PiecewiseConstantSize(1000)
>>> breaks = np.linspace(0, chr22.length, 7)
>>> breaks_mb = np.round(breaks/1e6, 1)
>>>
>>> eng = stdpopsim.get_engine("slim")
>>> ts_slim = eng.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10}, slim_scaling_factor=4)
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/stdpopsim/slim_engine.py:1523: SLiMScalingFactorWarning: You're using a scaling factor (4). This should give similar results for many situations, but is not equivalent, especially in the presence of selection. When using rescaling, you should be careful---do checks and compare results across different values of the scaling factor.
  warnings.warn(
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/msprime/ancestry.py:831: TimeUnitsMismatchWarning: The initial_state has time_units=ticks but time is measured in generations in msprime. This may lead to significant discrepancies between the timescales. If you wish to suppress this warning, you can use, e.g., warnings.simplefilter('ignore', msprime.TimeUnitsMismatchWarning)
  warnings.warn(message, TimeUnitsMismatchWarning)
>>> div_slim = ts_slim.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_slim):
...     print(f"segsites slim/{mapname} [{left}-{right} Mb]: {div}")
...
segsites slim/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [13.3-16.7 Mb]: 2195.502231990335
segsites slim/HapMapII_GRCh37 [16.7-20.0 Mb]: 13879.990809453386
>>> msp = stdpopsim.get_engine("msprime")
>>> ts_msp = msp.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10})
>>> div_msp = ts_msp.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_msp):
...     print(f"segsites msp/{mapname} [{left}-{right} Mb]: {div}")
...
segsites msp/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [13.3-16.7 Mb]: 2172.3004433208343
segsites msp/HapMapII_GRCh37 [16.7-20.0 Mb]: 14758.390178399126

@mufernando
Copy link
Member

So it looks like msprime is at fault here --- it is removing the trees where rec rate is 0.

@nspope
Copy link
Collaborator Author

nspope commented Nov 15, 2022

wouldn't it be where the rec rate is nan, not 0?

@mufernando
Copy link
Member

oops, yes, that is what I meant!

@nspope
Copy link
Collaborator Author

nspope commented Nov 16, 2022

Missing data in RateMap is documented here ... I think the issue is with our chr22 maps (which should explicitly mark the recombination rate as 0).

@jeromekelleher
Copy link
Member

Trimming out the trees with missing data on the flanks is definitely by design, as we've had a number of cases where these large zero-recombination rate bits flanking a chromosome have really messed up statistics.

Do we really know that the start of chr22 has a recombination rate of 0, or is it just that there's not much data there because of sequence weirdness?

@nspope
Copy link
Collaborator Author

nspope commented Nov 21, 2022

Weirdness is probably the answer @jeromekelleher -- looks like the short arm of chr22 (0 to ~15Mb) contains many repetitive regions that are homologous to (and may recombine with) parts of other chromosomes.

One thing that confuses me with these maps is that they invariably contain missing data at the start, but 0 recombination rate regions at the end. For example,

import stdpopsim

homsap = stdpopsim.get_species("HomSap")
chr22 = homsap.get_contig("chr22", genetic_map="HapMapII_GRCh38")
engine = stdpopsim.get_engine("msprime")
model = stdpopsim.PiecewiseConstantSize(1000)
sim = engine.simulate(contig=chr22, demographic_model=model, samples={"pop_0":2})

print(chr22.length)
# 50818468.0

print(chr22.recombination_map.asdict())
# {'position': array([       0., 15287921., 15295878., ..., 50785209., 50791377., 50818468.]), 
#  'rate': array([        nan, 6.74364e-09, 4.31139e-09, ..., 6.48707e-09, 6.55239e-09, 0.00000e+00])}

print([sim.first().num_edges, sim.last().num_edges])
# [0, 6]

print(list(sim.breakpoints())[-5:])
# [50714355.0, 50756801.0, 50760002.0, 50779378.0, 50818468.0]

so, the last bin (from 50791377 to 50818468) is not masked but instead simulated with a zero recombination rate, which results in a single tree that spans from 50779378 to the chromosome end in this simulation. In contrast the first bin is missing so contains "empty" trees in the output.

All the chromosomes are like this-- just seems odd to me that the chromosome start/end aren't treated the same?

@petrelharp
Copy link
Contributor

That does sound like we're not treating the ends consistently, and should mask the ends as well. A topic for the next stdpopsim meeting next week?

@jeromekelleher
Copy link
Member

Agreed, sounds like the ends aren't being handled properly.

@nspope
Copy link
Collaborator Author

nspope commented Nov 21, 2022

Here is where the problem lies:

# Extend map to the end of the chromosome.
positions = np.append(recomb_map.position, chrom.length)
rates = np.append(recomb_map.rate, 0)
recomb_map = msprime.RateMap(position=positions, rate=rates)

@jeromekelleher
Copy link
Member

That would be it - should just be np.nan I guess.

@nspope
Copy link
Collaborator Author

nspope commented Nov 29, 2022

A small complication-- if the SLiM engine is used, then mutations w/ fitness effects may be placed on masked intervals (which seem to be simulated as having 0-recombination rate). Though these mutations will ultimately be removed by _recap_and_rescale, they could influence the genealogies in unmasked regions.

import stdpopsim                                                                                                                                                                              

homsap = stdpopsim.get_species("HomSap")
chr1_left_edge = homsap.get_contig("chr1", genetic_map="HapMapII_GRCh38", right=100000)

# deleterious DFE over entire contig
mt = stdpopsim.MutationType(
    distribution_type="f",
    dominance_coeff=1.0,
    distribution_args=[-0.01],
)
dfe = stdpopsim.DFE(
    id="new_mutation",
    mutation_types=[mt],
    proportions=[1.0],
    description="",
    long_description="",
)
chr1_left_edge.add_dfe(intervals=[[0, chr1_left_edge.length]], DFE=dfe)

engine = stdpopsim.get_engine("slim")
model = stdpopsim.PiecewiseConstantSize(1000)
sim = engine.simulate(
    contig=chr1_left_edge, 
    demographic_model=model, 
    samples={"pop_0":2}, 
    slim_scaling_factor=4, 
    _recap_and_rescale=False, 
    seed=1024,
)

print(chr1_left_edge.recombination_map.asdict()) # First 55Kb masked in recmap
# {'position': array([     0.,  55550.,  82571.,  88169., 100000.]), 
#  'rate': array([         nan, 2.981822e-08, 2.082414e-08, 2.484956e-08])}

print(sim.first().num_edges) # Without recap/rescale, there's a tree in masked region
# 913
                                                                                                                              
print([i for i in sim.breakpoints()][:5]) # Looks like masked region is simulated as nonrecombining                                                                                                                                
# [0.0, 55826.0, 55850.0, 56041.0, 58092.0]                     

print(sim.first().num_mutations) # ... and there are deleterious mutations on that tree
# 5
print([v.position for v in sim.variants()])
# [19521.0, 19742.0, 22695.0, 36215.0, 50202.0, 77121.0, 78131.0, 83261.0] 

The thing to do might be to mask DFE intervals in add_dfe, according to where there's missing data in the recombination map?

@andrewkern
Copy link
Member

yes this is a problem @nspope, and one that is pretty context specific. That said, in empirical data masked regions are generally highly repetitive and gene poor (e.g. heterochromatic regions of genomes). If we are masking low recombination regions, those are indeed likely to be gene poor, so perhaps this really is a small issue?

@petrelharp
Copy link
Contributor

We should really be consistent here - is "recombination rate nan" how we are communicating "this region is no good, don't simulate anything here"? If so, we should propagate that choice through to DFEs as well (so, perhaps simulate would have a step where it masks any DFEs by the bad regions).

@nspope
Copy link
Collaborator Author

nspope commented Nov 29, 2022

Yeah I agree. Calling it "masked regions" or "missing data" was putting it the wrong way, as these terms could be conflated with post-simulation masking.

Instead, we don't know the parameter values over an interval, and so can't simulate it. My inclination is to follow msprime:

  • If the "undefined" regions are flanks, then just ignore them and simulate over the valid part of the contig.
  • If the undefined regions are internal to the contig (sandwiched between valid intervals), then throw an error.
  • Clip DFEs to to the valid portion of the contig, so that the undefined flanks are always simulated in SLiM as 0-recombination and neutral. These will get masked during recapitation, and won't have any impact on genealogies in the valid portion of the contig.

Fixing #1404 should resolve this issue, I think -- when we extract a chunk of a chromosome, we'll use RateMap.slice to set the flanking part of the rate map to NaN, so that simulation isn't performed over the flanks and the coordinate system matches the original chromosome. The DFE intervals will be trimmed in Contig.add_dfe to exclude the flanks. I don't think there'd be any performance cost to simulating these in SLiM, because no mutations or recombination events would ever occur on the flanks. So we'd just need to add a test or two to ensure this is the case.

@nspope
Copy link
Collaborator Author

nspope commented Dec 16, 2024

The underlying issue was fixed by #1570, a test is added in #1617 that checks that SLiM doesn't add DFE-associated mutations in the "missing" flanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants