No variants are simulated in zero-recombination region of chr22 #1422

nspope · 2022-11-15T20:01:36Z

For example:

import stdpopsim
import numpy as np

# simulate first 20Mb of chr22

for mapname in ["HapMapII_GRCh37", "HapMapII_GRCh38"]:
    sp = stdpopsim.get_species("HomSap")
    chr22 = sp.get_contig("chr22", genetic_map=mapname, right=20e6)
    constant_popsize = stdpopsim.PiecewiseConstantSize(1000)
    breaks = np.linspace(0, chr22.length, 7)
    breaks_mb = np.round(breaks/1e6, 1)
    
    slim = stdpopsim.get_engine("slim")
    ts_slim = slim.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10}, slim_scaling_factor=4)
    div_slim = ts_slim.segregating_sites(windows=breaks)
    for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_slim):
        print(f"segsites slim/{mapname} [{left}-{right} Mb]: {div}")
    
    msp = stdpopsim.get_engine("msprime")
    ts_msp = msp.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10})
    div_msp = ts_msp.segregating_sites(windows=breaks)
    for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_msp):
        print(f"segsites msp/{mapname} [{left}-{right} Mb]: {div}")

gives

segsites slim/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [13.3-16.7 Mb]: 3.059999999999999e-05
segsites slim/HapMapII_GRCh37 [16.7-20.0 Mb]: 0.00018810000000000007
segsites msp/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [13.3-16.7 Mb]: 3.36e-05
segsites msp/HapMapII_GRCh37 [16.7-20.0 Mb]: 0.00020310000000000008
segsites slim/HapMapII_GRCh38 [0.0-3.3 Mb]: 0.0
segsites slim/HapMapII_GRCh38 [3.3-6.7 Mb]: 0.0
segsites slim/HapMapII_GRCh38 [6.7-10.0 Mb]: 0.0
segsites slim/HapMapII_GRCh38 [10.0-13.3 Mb]: 0.0
segsites slim/HapMapII_GRCh38 [13.3-16.7 Mb]: 8.189999999999998e-05
segsites slim/HapMapII_GRCh38 [16.7-20.0 Mb]: 0.00017460000000000007
segsites msp/HapMapII_GRCh38 [0.0-3.3 Mb]: 0.0
segsites msp/HapMapII_GRCh38 [3.3-6.7 Mb]: 0.0
segsites msp/HapMapII_GRCh38 [6.7-10.0 Mb]: 0.0
segsites msp/HapMapII_GRCh38 [10.0-13.3 Mb]: 0.0
segsites msp/HapMapII_GRCh38 [13.3-16.7 Mb]: 8.789999999999998e-05
segsites msp/HapMapII_GRCh38 [16.7-20.0 Mb]: 0.00019890000000000006

The text was updated successfully, but these errors were encountered:

nspope · 2022-11-15T20:07:14Z

Looks like that's because this region is masked ("blank" trees) in the output tree sequence for both engines.

The hapmap files for both genetic maps start at ~15 Mb, so I assume that to fix this we'd need to add a line at the beginning specifying a 0 rate (so this region is not treated as "missing data")?

nspope · 2022-11-15T20:17:18Z

related to #1387

mufernando · 2022-11-15T21:34:05Z

To get to the bottom of this I decided to look at branch stats instead.

With the SLiM engine, we get non-zero segsites for the first few windows when _recap_and_recale=False, but these we go to 0 when _recap_and_rescale=True.
With the msprime engine, they are always zero.

>>> mapname="HapMapII_GRCh37"
>>> sp = stdpopsim.get_species("HomSap")
>>> chr22 = sp.get_contig("chr22", genetic_map=mapname, right=20e6)
>>> constant_popsize = stdpopsim.PiecewiseConstantSize(1000)
>>> breaks = np.linspace(0, chr22.length, 7)
>>> breaks_mb = np.round(breaks/1e6, 1)
>>>
>>> eng = stdpopsim.get_engine("slim")
>>> ts_slim = eng.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10}, slim_scaling_factor=4, _recap_and_rescale=False)
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/stdpopsim/slim_engine.py:1523: SLiMScalingFactorWarning: You're using a scaling factor (4). This should give similar results for many situations, but is not equivalent, especially in the presence of selection. When using rescaling, you should be careful---do checks and compare results across different values of the scaling factor.
  warnings.warn(

>>> div_slim = ts_slim.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_slim):
...     print(f"segsites slim/{mapname} [{left}-{right} Mb]: {div}")
...
segsites slim/HapMapII_GRCh37 [0.0-3.3 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [3.3-6.7 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [6.7-10.0 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [10.0-13.3 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [13.3-16.7 Mb]: 7944.9128727
segsites slim/HapMapII_GRCh37 [16.7-20.0 Mb]: 7531.1644341
>>> msp = stdpopsim.get_engine("msprime")
>>> ts_msp = msp.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10})
>>> div_msp = ts_msp.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_msp):
...     print(f"segsites msp/{mapname} [{left}-{right} Mb]: {div}")
...
segsites msp/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [13.3-16.7 Mb]: 3081.1006886320965
segsites msp/HapMapII_GRCh37 [16.7-20.0 Mb]: 12736.400180715054

>>> mapname="HapMapII_GRCh37"
>>> sp = stdpopsim.get_species("HomSap")
>>> chr22 = sp.get_contig("chr22", genetic_map=mapname, right=20e6)
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/stdpopsim/genetic_maps.py:127: UserWarning: Recombination map has length 51229805.0, which is longer than chromosome length 50818468. The latter will be used.
  warnings.warn(
>>> constant_popsize = stdpopsim.PiecewiseConstantSize(1000)
>>> breaks = np.linspace(0, chr22.length, 7)
>>> breaks_mb = np.round(breaks/1e6, 1)
>>>
>>> eng = stdpopsim.get_engine("slim")
>>> ts_slim = eng.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10}, slim_scaling_factor=4)
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/stdpopsim/slim_engine.py:1523: SLiMScalingFactorWarning: You're using a scaling factor (4). This should give similar results for many situations, but is not equivalent, especially in the presence of selection. When using rescaling, you should be careful---do checks and compare results across different values of the scaling factor.
  warnings.warn(
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/msprime/ancestry.py:831: TimeUnitsMismatchWarning: The initial_state has time_units=ticks but time is measured in generations in msprime. This may lead to significant discrepancies between the timescales. If you wish to suppress this warning, you can use, e.g., warnings.simplefilter('ignore', msprime.TimeUnitsMismatchWarning)
  warnings.warn(message, TimeUnitsMismatchWarning)
>>> div_slim = ts_slim.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_slim):
...     print(f"segsites slim/{mapname} [{left}-{right} Mb]: {div}")
...
segsites slim/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [13.3-16.7 Mb]: 2195.502231990335
segsites slim/HapMapII_GRCh37 [16.7-20.0 Mb]: 13879.990809453386
>>> msp = stdpopsim.get_engine("msprime")
>>> ts_msp = msp.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10})
>>> div_msp = ts_msp.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_msp):
...     print(f"segsites msp/{mapname} [{left}-{right} Mb]: {div}")
...
segsites msp/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [13.3-16.7 Mb]: 2172.3004433208343
segsites msp/HapMapII_GRCh37 [16.7-20.0 Mb]: 14758.390178399126

mufernando · 2022-11-15T21:35:20Z

So it looks like msprime is at fault here --- it is removing the trees where rec rate is 0.

nspope · 2022-11-15T21:40:12Z

wouldn't it be where the rec rate is nan, not 0?

mufernando · 2022-11-15T21:42:36Z

oops, yes, that is what I meant!

nspope · 2022-11-16T02:54:47Z

Missing data in RateMap is documented here ... I think the issue is with our chr22 maps (which should explicitly mark the recombination rate as 0).

jeromekelleher · 2022-11-21T13:20:44Z

Trimming out the trees with missing data on the flanks is definitely by design, as we've had a number of cases where these large zero-recombination rate bits flanking a chromosome have really messed up statistics.

Do we really know that the start of chr22 has a recombination rate of 0, or is it just that there's not much data there because of sequence weirdness?

nspope · 2022-11-21T20:23:54Z

Weirdness is probably the answer @jeromekelleher -- looks like the short arm of chr22 (0 to ~15Mb) contains many repetitive regions that are homologous to (and may recombine with) parts of other chromosomes.

One thing that confuses me with these maps is that they invariably contain missing data at the start, but 0 recombination rate regions at the end. For example,

import stdpopsim

homsap = stdpopsim.get_species("HomSap")
chr22 = homsap.get_contig("chr22", genetic_map="HapMapII_GRCh38")
engine = stdpopsim.get_engine("msprime")
model = stdpopsim.PiecewiseConstantSize(1000)
sim = engine.simulate(contig=chr22, demographic_model=model, samples={"pop_0":2})

print(chr22.length)
# 50818468.0

print(chr22.recombination_map.asdict())
# {'position': array([       0., 15287921., 15295878., ..., 50785209., 50791377., 50818468.]), 
#  'rate': array([        nan, 6.74364e-09, 4.31139e-09, ..., 6.48707e-09, 6.55239e-09, 0.00000e+00])}

print([sim.first().num_edges, sim.last().num_edges])
# [0, 6]

print(list(sim.breakpoints())[-5:])
# [50714355.0, 50756801.0, 50760002.0, 50779378.0, 50818468.0]

so, the last bin (from 50791377 to 50818468) is not masked but instead simulated with a zero recombination rate, which results in a single tree that spans from 50779378 to the chromosome end in this simulation. In contrast the first bin is missing so contains "empty" trees in the output.

All the chromosomes are like this-- just seems odd to me that the chromosome start/end aren't treated the same?

petrelharp · 2022-11-21T20:35:52Z

That does sound like we're not treating the ends consistently, and should mask the ends as well. A topic for the next stdpopsim meeting next week?

jeromekelleher · 2022-11-21T20:44:27Z

Agreed, sounds like the ends aren't being handled properly.

nspope · 2022-11-21T21:55:26Z

Here is where the problem lies:

stdpopsim/stdpopsim/genetic_maps.py

Lines 120 to 123 in a229273

    
           # Extend map to the end of the chromosome. 
        
           positions = np.append(recomb_map.position, chrom.length) 
        
           rates = np.append(recomb_map.rate, 0) 
        
           recomb_map = msprime.RateMap(position=positions, rate=rates)

jeromekelleher · 2022-11-22T09:25:05Z

That would be it - should just be np.nan I guess.

nspope · 2022-11-29T17:45:22Z

A small complication-- if the SLiM engine is used, then mutations w/ fitness effects may be placed on masked intervals (which seem to be simulated as having 0-recombination rate). Though these mutations will ultimately be removed by _recap_and_rescale, they could influence the genealogies in unmasked regions.

import stdpopsim                                                                                                                                                                              

homsap = stdpopsim.get_species("HomSap")
chr1_left_edge = homsap.get_contig("chr1", genetic_map="HapMapII_GRCh38", right=100000)

# deleterious DFE over entire contig
mt = stdpopsim.MutationType(
    distribution_type="f",
    dominance_coeff=1.0,
    distribution_args=[-0.01],
)
dfe = stdpopsim.DFE(
    id="new_mutation",
    mutation_types=[mt],
    proportions=[1.0],
    description="",
    long_description="",
)
chr1_left_edge.add_dfe(intervals=[[0, chr1_left_edge.length]], DFE=dfe)

engine = stdpopsim.get_engine("slim")
model = stdpopsim.PiecewiseConstantSize(1000)
sim = engine.simulate(
    contig=chr1_left_edge, 
    demographic_model=model, 
    samples={"pop_0":2}, 
    slim_scaling_factor=4, 
    _recap_and_rescale=False, 
    seed=1024,
)

print(chr1_left_edge.recombination_map.asdict()) # First 55Kb masked in recmap
# {'position': array([     0.,  55550.,  82571.,  88169., 100000.]), 
#  'rate': array([         nan, 2.981822e-08, 2.082414e-08, 2.484956e-08])}

print(sim.first().num_edges) # Without recap/rescale, there's a tree in masked region
# 913
                                                                                                                              
print([i for i in sim.breakpoints()][:5]) # Looks like masked region is simulated as nonrecombining                                                                                                                                
# [0.0, 55826.0, 55850.0, 56041.0, 58092.0]                     

print(sim.first().num_mutations) # ... and there are deleterious mutations on that tree
# 5
print([v.position for v in sim.variants()])
# [19521.0, 19742.0, 22695.0, 36215.0, 50202.0, 77121.0, 78131.0, 83261.0]

The thing to do might be to mask DFE intervals in add_dfe, according to where there's missing data in the recombination map?

andrewkern · 2022-11-29T17:54:58Z

yes this is a problem @nspope, and one that is pretty context specific. That said, in empirical data masked regions are generally highly repetitive and gene poor (e.g. heterochromatic regions of genomes). If we are masking low recombination regions, those are indeed likely to be gene poor, so perhaps this really is a small issue?

petrelharp · 2022-11-29T20:32:22Z

We should really be consistent here - is "recombination rate nan" how we are communicating "this region is no good, don't simulate anything here"? If so, we should propagate that choice through to DFEs as well (so, perhaps simulate would have a step where it masks any DFEs by the bad regions).

nspope · 2022-11-29T22:36:37Z

Yeah I agree. Calling it "masked regions" or "missing data" was putting it the wrong way, as these terms could be conflated with post-simulation masking.

Instead, we don't know the parameter values over an interval, and so can't simulate it. My inclination is to follow msprime:

If the "undefined" regions are flanks, then just ignore them and simulate over the valid part of the contig.
If the undefined regions are internal to the contig (sandwiched between valid intervals), then throw an error.
Clip DFEs to to the valid portion of the contig, so that the undefined flanks are always simulated in SLiM as 0-recombination and neutral. These will get masked during recapitation, and won't have any impact on genealogies in the valid portion of the contig.

Fixing #1404 should resolve this issue, I think -- when we extract a chunk of a chromosome, we'll use RateMap.slice to set the flanking part of the rate map to NaN, so that simulation isn't performed over the flanks and the coordinate system matches the original chromosome. The DFE intervals will be trimmed in Contig.add_dfe to exclude the flanks. I don't think there'd be any performance cost to simulating these in SLiM, because no mutations or recombination events would ever occur on the flanks. So we'd just need to add a test or two to ensure this is the case.

nspope · 2024-12-16T23:13:02Z

The underlying issue was fixed by #1570, a test is added in #1617 that checks that SLiM doesn't add DFE-associated mutations in the "missing" flanks

nspope self-assigned this Apr 23, 2023

petrelharp mentioned this issue Jul 2, 2024

Release 0.2.1 checklist #1565

Open

11 tasks

petrelharp added this to the Version 0.2.1 milestone Jul 5, 2024

nspope mentioned this issue Dec 16, 2024

Add test checking how a clipped DFE is handled by SLiM #1617

Merged

petrelharp closed this as completed in #1617 Dec 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

No variants are simulated in zero-recombination region of chr22 #1422

No variants are simulated in zero-recombination region of chr22 #1422

nspope commented Nov 15, 2022

nspope commented Nov 15, 2022 •

edited

Loading

nspope commented Nov 15, 2022

mufernando commented Nov 15, 2022

mufernando commented Nov 15, 2022

nspope commented Nov 15, 2022

mufernando commented Nov 15, 2022

nspope commented Nov 16, 2022

jeromekelleher commented Nov 21, 2022

nspope commented Nov 21, 2022 •

edited

Loading

petrelharp commented Nov 21, 2022

jeromekelleher commented Nov 21, 2022

nspope commented Nov 21, 2022

jeromekelleher commented Nov 22, 2022

nspope commented Nov 29, 2022 •

edited

Loading

andrewkern commented Nov 29, 2022

petrelharp commented Nov 29, 2022

nspope commented Nov 29, 2022 •

edited

Loading

nspope commented Dec 16, 2024

No variants are simulated in zero-recombination region of chr22 #1422

No variants are simulated in zero-recombination region of chr22 #1422

Comments

nspope commented Nov 15, 2022

nspope commented Nov 15, 2022 • edited Loading

nspope commented Nov 15, 2022

mufernando commented Nov 15, 2022

mufernando commented Nov 15, 2022

nspope commented Nov 15, 2022

mufernando commented Nov 15, 2022

nspope commented Nov 16, 2022

jeromekelleher commented Nov 21, 2022

nspope commented Nov 21, 2022 • edited Loading

petrelharp commented Nov 21, 2022

jeromekelleher commented Nov 21, 2022

nspope commented Nov 21, 2022

jeromekelleher commented Nov 22, 2022

nspope commented Nov 29, 2022 • edited Loading

andrewkern commented Nov 29, 2022

petrelharp commented Nov 29, 2022

nspope commented Nov 29, 2022 • edited Loading

nspope commented Dec 16, 2024

nspope commented Nov 15, 2022 •

edited

Loading

nspope commented Nov 21, 2022 •

edited

Loading

nspope commented Nov 29, 2022 •

edited

Loading

nspope commented Nov 29, 2022 •

edited

Loading