Final class project for UCSD's CSE284 Personal Genomics, Spring 2021
Authored by: Lauryn Bruce, Hannah Mummey & Sarah Wright
ChroMonet is a Hidden Markov Model (HMM) implementation for local ancestry inference. This implementation is designed to work with two ancestral populations and with SNP positions that are single-base substitutions only (no indels). The model makes the simplifying assumption that recombination is equally likely at all positions in the genome.
This repository contains all functions needed to implement, run, and evaluate ChroMonet, as well as methods for creating simulated datasets, genomes, and admixtures. Note that the repository contains only toy simulated datasets; users must create or download additional data to run ChroMonet on longer datasets.
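To make the model description above concrete, here is a minimal sketch of a two-state ancestry HMM solved with the Viterbi algorithm in log10 space. It is not the code in this repository: it handles a single haplotype rather than a full genome, and the function name `viterbi_haplotype`, its arguments, and the uniform starting prior are all assumptions made for illustration. The only modelling choices taken from the text are the two ancestral populations and a single switch probability `r` shared by every position.

```python
import math


def viterbi_haplotype(alleles, freqs, r=0.01):
    """Most likely ancestry path for one haplotype (hypothetical helper).

    alleles: list of 0/1 observed alleles (1 = alternate allele).
    freqs:   list of (f0, f1) alternate-allele frequencies per position,
             one value for each of the two ancestral populations.
    r:       probability of switching ancestry between adjacent positions,
             assumed identical everywhere.
    """
    def log_emit(allele, f):
        p = f if allele == 1 else 1.0 - f
        return math.log10(max(p, 1e-12))  # clamp so log10(0) never occurs

    log_stay, log_switch = math.log10(1.0 - r), math.log10(r)

    # Assumed uniform prior over the two ancestries at the first position.
    score = [math.log10(0.5) + log_emit(alleles[0], freqs[0][k]) for k in (0, 1)]
    back = []  # back[i][k] = best previous state when position i+1 is in state k

    for allele, f in zip(alleles[1:], freqs[1:]):
        new_score, pointers = [], []
        for k in (0, 1):
            stay = score[k] + log_stay
            switch = score[1 - k] + log_switch
            if stay >= switch:
                best, prev = stay, k
            else:
                best, prev = switch, 1 - k
            new_score.append(best + log_emit(allele, f[k]))
            pointers.append(prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Population 0 carries the alternate allele at frequency 0.9, population 1 at 0.1.
example_freqs = [(0.9, 0.1)] * 10
print(viterbi_haplotype([1] * 5 + [0] * 5, example_freqs, r=0.01))
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

With a small `r`, a single discordant allele is cheaper to explain as a mismatch than as two ancestry switches, so isolated mismatches do not fragment the inferred path.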
simulated_data.py
contains methods to generate simulated allele frequencies for a specified number of positions, produce synthetic genomes from these simulated datasets, and perform admixture between genomes. In all functions, the individual positions are considered independent, as are the two chromosomes within a genome.
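The three steps described above (simulate frequencies, draw a chromosome, admix two chromosomes) can be sketched as follows. This is an illustration under stated assumptions, not the API of simulated_data.py: the function names, the uniform draw for allele frequencies, and the way breakpoints are chosen are all hypothetical. Positions are drawn independently, as the description states.

```python
import random


def simulate_frequencies(n_positions, n_pops=2, seed=518):
    """Alternate-allele frequency per position and population (uniform draw assumed)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_pops)] for _ in range(n_positions)]


def simulate_chromosome(freqs, pop, rng):
    """Draw each allele independently from the given population's frequencies."""
    return [1 if rng.random() < f[pop] else 0 for f in freqs]


def admix(chrom_a, chrom_b, n_recomb, rng):
    """Build a mosaic that switches source at n_recomb random breakpoints.

    Returns (alleles, labels), where labels[i] is 0 if position i was
    copied from chrom_a and 1 if it was copied from chrom_b.
    Requires n_recomb <= len(chrom_a) - 1.
    """
    n = len(chrom_a)
    breakpoints = set(rng.sample(range(1, n), n_recomb))
    alleles, labels = [], []
    source = 0
    for i in range(n):
        if i in breakpoints:
            source = 1 - source  # switch the chromosome being copied
        alleles.append((chrom_a, chrom_b)[source][i])
        labels.append(source)
    return alleles, labels


rng = random.Random(0)
freqs = simulate_frequencies(100)
chrom_pop0 = simulate_chromosome(freqs, 0, rng)
chrom_pop1 = simulate_chromosome(freqs, 1, rng)
mosaic, truth = admix(chrom_pop0, chrom_pop1, n_recomb=4, rng=rng)
```

Returning the source labels alongside the alleles gives a truth set against which predicted ancestry can later be scored.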
parse_genotype_mafs.py
Extraction and formatting of allele frequencies for the AFR and EUR super-populations from 1000 Genomes VCF files.
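As a rough illustration of this kind of extraction, the sketch below pulls per-super-population frequencies out of one VCF record. It assumes the record's INFO column carries `AFR_AF` and `EUR_AF` keys, as the 1000 Genomes Phase 3 site files do; the function name `parse_superpop_freqs` and the example record are made up for illustration and do not reflect the interface of parse_genotype_mafs.py.

```python
def parse_superpop_freqs(vcf_line, pops=("AFR", "EUR")):
    """Return (chrom, pos, ref, alt, freqs) for a biallelic SNP, else None.

    Assumes tab-separated VCF columns and INFO keys named '<POP>_AF'.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]

    # Keep single-base substitutions only (no indels, no multi-allelic sites).
    if len(ref) != 1 or len(alt) != 1 or ref not in "ACGT" or alt not in "ACGT":
        return None

    # INFO is a ';'-separated list of KEY=VALUE pairs and bare flags.
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    try:
        freqs = tuple(float(info[f"{p}_AF"]) for p in pops)
    except (KeyError, ValueError):
        return None
    return chrom, int(pos), ref, alt, freqs


# Made-up record for illustration only.
record = "21\t1000\t.\tG\tA\t100\tPASS\tAC=3;AFR_AF=0.25;EUR_AF=0.5;VT=SNP"
print(parse_superpop_freqs(record))
```

Records that fail the SNP filter or lack the expected keys return `None`, so a caller can skip them without special handling.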
Data/
contains example simulated datasets for N=100 positions.
Larger simulated datasets and the 1000 Genomes data used in this project can be found on the class JupyterHub in the Team3 directory under simulated_files and chromosome_21_files/chromosome_14_files, respectively. Alternatively, simulated datasets can be generated as described below, and 1000 Genomes data can be downloaded from The 1000 Genomes Project.
LAI_hmm_script.py
All methods for implementing the local ancestry HMM and solving for the optimal path.
run_LAI_hmm.py
Example script-based implementation of the HMM.
accuracy_metrics.py
contains methods for evaluating the predicted results from the HMM by comparing the results to corresponding truth sets. The wrapper function accuracy_metrics.compare_genome_files() takes file paths to the predicted and truth genotypes and outputs:
- Positional accuracy: Proportion of all alleles assigned the correct ancestry.
- Skew: The ratio of the number of predicted recombination events to the true number of recombination events.
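The two metrics above can be sketched directly from their definitions. The helpers below are illustrative only and are not the functions in accuracy_metrics.py: they operate on in-memory lists of ancestry labels for one chromosome rather than on file paths, they count a recombination event as any change of label between adjacent positions, and returning NaN when the truth has no recombination events is an arbitrary choice made here.

```python
import math


def positional_accuracy(predicted, truth):
    """Proportion of positions whose predicted ancestry matches the truth."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def count_switches(labels):
    """Number of ancestry changes between adjacent positions."""
    return sum(a != b for a, b in zip(labels, labels[1:]))


def skew(predicted, truth):
    """Predicted recombination count divided by the true recombination count."""
    true_switches = count_switches(truth)
    if true_switches == 0:
        return math.nan  # ratio undefined; assumption made for this sketch
    return count_switches(predicted) / true_switches


truth = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 0]
print(positional_accuracy(predicted, truth))  # 4 of 6 positions match
print(skew(predicted, truth))                 # 2 predicted switches vs 1 true
```

A skew above 1 indicates the prediction is more fragmented than the truth, and a skew below 1 indicates that true recombination events were missed.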
plot_karyogram_LB.py
Implementation of chromosome painting adapted from a script originally written by Alicia Martin (https://github.com/armartin/ancestry_pipeline). Modified to work with single chromosomes and simulated genomes.
plot_chromosome_painting.py
Wrapper script for plot_karyogram_LB.py that formats HMM outputs for chromosome paintings and calls the chromosome painting functions.
These Jupyter notebooks provide example implementations of the HMM and associated tasks. They are for illustration purposes and will not run correctly outside of the CSE284 JupyterHub.
Simulated Data Examples.ipynb
Example usage of functions to create synthetic allele frequency datasets and synthetic genomes, and to perform admixture of synthetic and real genomes.
File Reformatting.ipynb
Formalization of input data formats.
Streamlined HMM.ipynb
Implementation and testing of the HMM on simulated data.
Log10 HMM.ipynb
Implementation and testing of the HMM with a model that adds logs of probabilities rather than multiplying probabilities.
HMM Performance Testing.ipynb
HMM implementation that outputs runtime estimates for runtime testing.
Accuracy calculations.ipynb
Calculation and visualization of accuracy metrics for all HMM tests.
Create Chromosome Paintings.ipynb
Visualization of HMM outputs using chromosome paintings.
Generate 1000G AFR-EUR Inputs.ipynb
Calculation of necessary HMM inputs from 1000 Genomes data.
CHR14 AF Calculations.ipynb
Allele frequency calculations for Chromosome 14 in 1000 Genomes data.
CHR21 AF Calculations.ipynb
Allele frequency calculations for Chromosome 21 in 1000 Genomes data.
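The Log10 HMM notebook listed above sums logs of probabilities instead of multiplying the probabilities themselves. A small illustration of why that matters on long sequences is sketched below; the function names and the example probabilities are hypothetical and chosen only to trigger floating-point underflow.

```python
import math


def path_probability(step_probs):
    """Multiply per-position probabilities directly."""
    product = 1.0
    for p in step_probs:
        product *= p
    return product


def path_log10_probability(step_probs):
    """Add log10 of per-position probabilities instead of multiplying."""
    return sum(math.log10(p) for p in step_probs)


# 100 positions, each contributing a probability of 1e-5 (made-up values).
steps = [1e-5] * 100
print(path_probability(steps))        # underflows to 0.0 in double precision
print(path_log10_probability(steps))  # stays finite, approximately -500
```

Once a product underflows to zero, every competing path scores the same and the optimal path can no longer be identified, whereas the log10 sums remain comparable.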
All simulated data files start with the prefix sim.
- Simulated allele frequencies: simData_N[number of populations]_P[number of positions]_seed[random seed].tsv. E.g. the file simData_N2_P100_seed518.tsv is a two-population dataset with 100 positions, generated with random seed 518.
- Simulated genomes: simGenome_[number of positions]_[pop1]_[pop2].tsv. E.g. simGenome_100_1_0.tsv is a 100-position genome with one chromosome from population 1 and one chromosome from population 0.
- Simulated admixed genomes: simAdmixedGenome_[number of positions]_[pop1]_[pop2]_Rx[number of recombinations]_[repeat].tsv. E.g. simAdmixedGenome_100_0_1_Rx4_1.tsv specifies the first repeat of admixture between population 0 and population 1 with 4 recombination events in each chromosome.
HMM output files are named [genome file prefix]_recomb[recombination rate]_HMMoutput.tsv. E.g. simAdmixedGenome_100_0_1_Rx1_0_recomb0.01_HMMoutput.tsv specifies the HMM prediction results for genome simAdmixedGenome_100_0_1_Rx1_0.tsv with recombination rate r=0.01.
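The naming conventions above are regular enough to parse mechanically. The sketch below encodes them as regular expressions and recovers the fields from a file name. It is a hypothetical helper, not part of the repository: the pattern labels and the function name `describe` are assumptions, and all captured values are returned as strings.

```python
import re

# One pattern per naming convention described above (labels are hypothetical).
PATTERNS = {
    "frequencies": re.compile(
        r"simData_N(?P<populations>\d+)_P(?P<positions>\d+)_seed(?P<seed>\d+)\.tsv"
    ),
    "genome": re.compile(
        r"simGenome_(?P<positions>\d+)_(?P<pop1>\d+)_(?P<pop2>\d+)\.tsv"
    ),
    "admixed": re.compile(
        r"simAdmixedGenome_(?P<positions>\d+)_(?P<pop1>\d+)_(?P<pop2>\d+)"
        r"_Rx(?P<recombinations>\d+)_(?P<repeat>\d+)\.tsv"
    ),
    "hmm_output": re.compile(
        r"(?P<genome_prefix>.+)_recomb(?P<rate>[0-9.eE+-]+)_HMMoutput\.tsv"
    ),
}


def describe(filename):
    """Return (kind, fields) for a recognised file name, else None."""
    for kind, pattern in PATTERNS.items():
        match = pattern.fullmatch(filename)
        if match:
            return kind, match.groupdict()
    return None


for name in (
    "simData_N2_P100_seed518.tsv",
    "simGenome_100_1_0.tsv",
    "simAdmixedGenome_100_0_1_Rx4_1.tsv",
    "simAdmixedGenome_100_0_1_Rx1_0_recomb0.01_HMMoutput.tsv",
):
    print(name, "->", describe(name))
```

Using `fullmatch` keeps an HMM output file from being mistaken for the admixed genome it was derived from, since the admixed pattern must consume the entire name.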