ecoPeakDescription.html

<h3>Please see <a href="https://doi.org/10.1126/sciadv.abg5285">Roberts Kingman et al., Sci Adv, 2021</a> for complete Materials and Methods information.</h3>

<h3>ECOTYPE GROUPING OF GEOGRAPHIC POPULATIONS</h3>

<li>Northeast Pacific selection: the primary analysis group discussed in <a href="https://doi.org/10.1126/sciadv.abg5285">Roberts Kingman et al. 2021</a>, involving 57 freshwater and 12 marine populations</li>

<li>Global selection: 56 freshwater and 28 marine populations</li>

<p>See <a href="https://doi.org/10.1126/sciadv.abg5285">Table</a> S2 (a separate excel file) in <a href="https://doi.org/10.1126/sciadv.abg5285">Roberts Kingman et al. 2021</a> for detailed population information. </p>

<h3>ECOPEAK IDENTIFICATION</h3>

<p>Two complementary methods were used to identify EcoPeaks in the older, established populations: a genetic distance-based approach over small windows, and a test of random allelic distribution at each individual variant base. </p>

<p>First, we utilized the same genetic distance-based approach as <a href="https://doi.org/10.1038/nature10944">Jones et al. 2012</a>. This involves building N×N pairwise nucleotide divergence (π) matrices for small overlapping windows tiled across the genome (2500 bp windows with 500bp step size), where N is the total number of samples in each analysis group. The distance matrix was then used to calculate a marine-freshwater cluster separation score (CSS), quantifying the average marine-freshwater genetic distance after subtracting the genetic distance found within each ecotype. All details are the same as in <a href="https://doi.org/10.1038/nature10944">Jones et al. 2012</a> except for the following increases in stringency: at least 4 variable sites were required in the window; at least 2/3 of samples had to have a call at any particular base for the base to be used; and the reference distribution to estimate p-value was generated through Monte Carlo sampling. Due to the larger number of samples in this study, an exhaustive examination of all possible combinations was infeasible, and so we continued Monte Carlo sampling until either 10 samplings as extreme as the actual data were observed, or 1,000,000 combinations were generated (compared to 352,716 combinations in <a href="https://doi.org/10.1038/nature10944">Jones et al. 2012</a>). If after one million samplings none were as extreme as the observed data, the window was conservatively treated as 0.5 successes/1,000,000 draws and given a p-value of 5e-7 (<a href="https://doi.org/10.1126/sciadv.abg5285">Figure</a> S2). As many windows were this extreme, we also computed a separate z-score for each window by fitting the observed CSS value to a normal distribution (<a href="https://doi.org/10.1126/sciadv.abg5285">Figure</a> S3).</p>

<p>Second, we analyzed the distribution of allele counts between marine and freshwater populations at every base in the genome with two alleles present at &gt;10% frequency in the combined analysis metapopulation. At each such base, after conditioning on the observed number of homozygous reference, heterozygous, and homozygous non-reference calls, we used a multivariate hypergeometric generalization of Fisher’s Exact Test to compute a two-sided p-value for the probability of an imbalance in allele counts at least as extreme as observed occurring by chance (<a href="https://doi.org/10.1126/sciadv.abg5285">Figure</a> S4) </p>

<p>The two analysis methods, while yielding overall similar results (<a href="https://doi.org/10.1126/sciadv.abg5285">Figure</a> S2-S4), have different relative advantages. The window-based method, which operates on a genetic distance metric, has lower sensitivity to particularly young alleles, which may create large phenotypic changes through minimal genetic changes. However, the single base method operates on a single nucleotide at a time and so scores a single but consistently differentiated nucleotide just as highly as a dense cluster of consistently differentiated nucleotides. The single base method also offers single base resolution, while resolution of the window-based method is limited by window size (2500 bp). However, the window-based approach is more robust to missing data, sequencing errors, and read mismapping by integrating information from positions throughout the window. This integration across multiple bases also results in greater maximum stratification and fewer sites requiring multiple hypothesis correction. </p>

<p>For both the SNP-based and 2,500 bp window-based analysis, nearby significant values were grouped into peaks that appear to behave as a single unit. On each chromosome, starting from the most significant unassigned p-value, points were greedily added to the peak window if above significance s and within distance d of another point within the peak. For the sensitive call set, s was set to two LOD below the significance threshold, while for the specific call set s was set to halfway between two LOD below threshold and the maximum peak value (and a minimum of 2 LOD below maximum peak value). Changes in d result in only minor differences in peak boundaries; for all call sets d was set to 50,000 bp. Peak extension into a more significant peak results in disqualification of the weaker peak, as it is not clearly independent of the primary peak. </p>

<p>Peaks were filtered at either a 1% false discovery rate for the specific calls or at 5% for the sensitive calls. The single base and window peaks were then intersected for the final specific calls, or unioned for the final sensitive calls. These final sets of peaks were subsequently termed EcoPeaks. </p> 

<p>In addition to the northeastern Pacific basin and global analyses described in the main text, this analysis was also performed on additional geographic subsets of the sequencing data, as described above in Section 11, to generate additional EcoPeaks call sets for populations from Northern Europe, a subset of global regions covered by glaciers in the last ice age, and freshwater California in addition to the northeastern Pacific basin and global analyses described in the main text (<a href="https://doi.org/10.1126/sciadv.abg5285">Figure</a> S5). </p>

<p>We also tested for overlap between the northeast Pacific and global EcoPeaks and previous findings in <a href="https://doi.org/10.1038/nature10944">Jones et al. 2012</a> (<a href="https://doi.org/10.1126/sciadv.abg5285">Table</a> S3, S4). 174/174 CSS (5% FDR) regions, 212/215 SOM (A-D) regions, 81/81 CSS 2% FDR intersected with SOM A-D regions, and 240/242 5% FDR intersected with SOM A-D regions lifted successfully from the original gasAcu1 to gasAcu1-4 for comparison to EcoPeaks. EcoPeaks and Jones regions were considered to overlap if there was at least 1bp of overlap. The most stringent call set from <a href="https://doi.org/10.1038/nature10944">Jones et al. 2012</a>, the regions called by both CSS (2% FDR) and SOM (A-D), all overlapped Pacific EcoPeaks (81/81 for both specific and sensitive) and nearly all overlapped global EcoPeaks (73/81 and 79/81 for specific and sensitive, respectively). In contrast, less than a quarter of specific Pacific EcoPeaks (48/209) was recovered by even the most lenient <a href="https://doi.org/10.1038/nature10944">Jones et al. 2012</a> call set (CSS union SOM). To test statistical significance, we performed a permutation test for each comparison, with the <a href="https://doi.org/10.1038/nature10944">Jones et al. 2012</a> regions held fixed, with the EcoPeaks randomly rearranged through bedtools shuffle, with 10,000 permutations per comparison. As 31/32 permutation tests returned no randomized draws with as much overlap as observed, we also approximated the p-values through a z-score. All overlaps are highly statistically significant by both permutation test and z-score (<a href="https://doi.org/10.1126/sciadv.abg5285">Table</a> S3, S4). </p> 

<p>We also assessed the relative contribution of coding and non-coding mutations. Following <a href="https://doi.org/10.1038/nature10944">Jones et al. 2012</a>, EcoPeaks and TempoPeaks were classified as "regulatory" if they did not overlap any known genes, "coding" if they contain at least one mutation consistently differentiated between ecotypes (p &lt; 1e-6) that alters an amino acid, and otherwise as "probably regulatory". This approach is conservative for calling regulatory regions and overcounts coding regions, especially for larger EcoPeaks and TempoPeaks, as even single amino acid changes that may have inconsequential functional effects will result in classifying the entire region as "coding". Consistent with the previous study, we classified a large majority of regions as either "regulatory" or "probably regulatory" in all call sets (<a href="https://doi.org/10.1126/sciadv.abg5285">Figure</a> S6), and attribute the slight differences in proportion of "coding" regions between call sets primarily to differences in region size. </p>

<h3>Contact</h3>
  
<p>Please email <a href="mailto:kingsley@stanford.edu">David Kingsley</a>, <a href="mailto:kingsley_lab@mailman.stanford.edu">the Kingsley Lab</a>, or <a href="mailto:krishna.Veeramah@stonybrook.edu">Krishna Veeramah</a> with questions about the hub or the associated manuscript (<a href="https://doi.org/10.1126/sciadv.abg5285">Roberts Kingman et al. 2021</a>).</p>