PopInf is a method to infer the major population (or populations) ancestry of a sample or set of samples.
Below are steps for running PopInf. PopInf is incorporated into the workflow system snakemake. All necessary files and scripts are in this directory. Instructions for preparing the reference panel are in the folder called "Reference_Panel", and instructions for preparing the unknown samples are in the folder called "Unknown_Samples".
We have provided sample data sets to run PopInf. They are subsets of the 1000 Genomes phase 3 data. The reference panel VCFs can be found in the folder called Reference_Panel/ and the unknown samples (in this example, a set of admixed individuals from 1000 Genomes) can be found in the folder called Unknown_Samples/.
- Variants for a reference panel in VCF file format, separated by chromosome. See Reference_Panel/
- Variants for sample(s) of individuals with unknown or self-reported ancestry in VCF file format, separated by chromosome. See Unknown_Samples/
- Sample information file for the reference panel. This file must contain 3 tab-delimited columns: 1) the individual's sample name, 2) sex information (i.e. male, female, unknown), and 3) population information for the corresponding individual. Our example for this file is provided here: Sample_Information/ThousandGenomesSamples_AdmxRm.txt
- Sample information file for the unknown samples. This file must contain 3 tab-delimited columns: 1) the individual's sample name, 2) sex information (i.e. male, female, unknown), and 3) population information for the corresponding individual (this column can be labeled "unknown" for this file). Our example for this file is provided here: Sample_Information/ThousandGenomesSamples_Admx_samples.txt
- Reference genome file (.fa) used for mapping variants. Make sure there are accompanying index (.fai) and dictionary (.dict) files. See the folder Reference_Genome/ for more information.
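Before running the workflow, it can help to confirm that a sample information file really has the three tab-delimited columns described above. The following is a minimal sketch, not part of PopInf: the function name `check_sample_info` is hypothetical, the example rows are made up, and the set of accepted sex values is an assumption taken from the list above (PopInf itself may accept other spellings, and a header line, if present, would be flagged here).

```python
import csv
import io

# Assumption: these are the only sex labels, as named in the list above.
ALLOWED_SEX = {"male", "female", "unknown"}

def check_sample_info(handle):
    """Return a list of problems found in a 3-column, tab-delimited sample file."""
    problems = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            problems.append(f"line {line_no}: expected 3 columns, found {len(row)}")
            continue
        sample, sex, population = (field.strip() for field in row)
        if not sample or not population:
            problems.append(f"line {line_no}: empty sample name or population")
        if sex.lower() not in ALLOWED_SEX:
            problems.append(f"line {line_no}: unexpected sex value {sex!r}")
    return problems

# Made-up rows for illustration only.
example = io.StringIO("sampleA\tfemale\tPOP1\nsampleB\tmale\nsampleC\tM\tPOP2\n")
for problem in check_sample_info(example):
    print(problem)
```

To check a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("Sample_Information/ThousandGenomesSamples_AdmxRm.txt", newline="")`.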
PopInf uses a variety of programs. We will set up a conda environment to manage all necessary packages and programs.
First, you will have to install Anaconda or Miniconda. Please refer to Conda's documentation for steps on how to install conda. See: https://conda.io/docs/index.html
You can name your environment whatever you would like. We named this environment 'PopInf' and we will use this environment for all analyses.
Create a conda environment called PopInf:
conda env create --name PopInf --file PopInfConda.txt
The PopInfConda.txt file is located in this folder and contains the programs needed to run PopInf.
If the above does not work (e.g. because of differences between platforms), try the following:
conda env create --name PopInf --file PopInf.yml
PopInf.yml is also located in this folder.
You will need to activate the environment when running scripts or commands and deactivate the environment when you are done.
To activate the PopInf environment:
source activate PopInf
To deactivate the PopInf environment:
source deactivate PopInf
To use GATK in the conda environment, you must download it from the Broad Institute and register it. After downloading the GATK v3.7 jar file, activate the PopInf environment, and type the following into the command line:
gatk-register <path and name of gatk jar file>
Please note that "<path and name of gatk jar file>" is the path and file name for the GenomeAnalysisTK.jar file. The jar file must be downloaded independently. See: https://bioconda.github.io/recipes/gatk/README.html
To download the GATK 3.7 jar file go to: https://software.broadinstitute.org/gatk/download/archive
There, click the "GATK 2-3" tab. The available versions of GATK will be listed for download. Click GenomeAnalysisTK-3.7-0-gcfedb67.tar.bz2 and unpack this file:
tar xvfj GenomeAnalysisTK-3.7-0-gcfedb67.tar.bz2
The jar file is called GenomeAnalysisTK.jar
See the readme file in the folder called Reference_Panel/ for more information.
Please make sure the reference panel VCF is separated by chromosome and gzipped. The sample information text file we use as an example is located in Sample_Information/ and the file name is ThousandGenomesSamples_AdmxRm.txt.
If you are running PopInf with the test data in this repository, the reference panel VCFs and sample information file are already prepared and specified in the configuration file.
The 1000 Genomes data was mapped to GRCh37. If you do not have this reference genome already, please follow the steps outlined in the folder called Reference_Genome/. If you are using a different reference genome, specify the full path and file name in the configuration file (Step 5).
See the readme file in the folder called Unknown_Samples/ for more information.
Please make sure the unknown samples VCF is separated by chromosome and gzipped. The sample information text file we use as an example is located in Sample_Information/ and the file name is ThousandGenomesSamples_Admx_samples.txt.
If you are running PopInf with the test data in this repository, the unknown samples VCFs and sample information file are already prepared and specified in the configuration file (Step 5).
Associated with the Snakefile is a configuration file in json format. This file has 17 pieces of information needed to run the Snakefile. To run PopInf, go through all lines in the configuration file and make sure to change the content as specified.
The config file is named popInf.config.json and is located in this folder. See below for details. We also provide our example configuration file below:
popInf.config.json:
{
"_comment_sample_info": "This section of the .json file asks for sample information",
"ref_panel_pop_info_path": "Sample_Information/ThousandGenomesSamples_AdmxRm.txt",
"unkn_panel_pop_info_path": "Sample_Information/ThousandGenomesSamples_Admx_samples.txt",
"_comment_general_options": "This section of the .json file asks for information needed to run popInf regardless of what chromosomes you choose to analyze",
"Autosomes_Yes_or_No": "Y",
"ref_path": "Reference_Genome/hs37d5.fa",
"genotype_call_rate_threshold": "0.98",
"_comment_autosomes": "This section of the .json file asks for information needed for the autosomes if they are to be analyzed",
"vcf_ref_panel_path": "Reference_Panel/",
"vcf_ref_panel_prefix": "chr",
"vcf_ref_panel_suffix": "_1000genomes_selected_individuals_SNPs_nomissing.dupsRemoved.thinned.vcf.gz",
"vcf_unknown_set_path": "Unknown_Samples/",
"vcf_unknown_set_prefix": "chr",
"vcf_unknown_set_suffix": "_1000genomes_admixed_samples.dupsRemoved.thinned.vcf.gz",
"chromosome": ["1", "2", "3", "4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14",
"15", "16", "17", "18", "19", "20", "21", "22"],
"_comment_chrX": "This section of the .json file asks for information needed for the analysis of the X chromosome",
"vcf_ref_panel_path_X": "Reference_Panel/",
"vcf_ref_panel_file": "chrX_1000genomes_selected_individuals.dupsRemoved.thinned.vcf.gz",
"vcf_unknown_set_path_X": "/scratch/amtarave/test_set_POPINF/1000genomes/",
"vcf_unknown_set_file": "chrX_1000genomes_admixed_samples.dupsRemoved.thinned.vcf.gz",
"X_chr_coordinates": "X_chromosome_regions_XTR_hg19.bed"
}
After editing popInf.config.json, make sure that this file has maintained proper json format. You can use The JSON Validator, for example (https://jsonlint.com/).
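The same format check can also be done locally with Python's standard `json` module. The sketch below is only an illustration: the helper name `load_config` is hypothetical, and the two config strings are shortened stand-ins rather than the full configuration file. The second string contains a trailing comma, a common mistake after hand-editing.

```python
import json

def load_config(text):
    """Parse config text; raise ValueError with line and column on malformed JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(
            f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
        ) from err

# Shortened stand-ins for the real configuration file.
good = '{"Autosomes_Yes_or_No": "Y", "chromosome": ["1", "2"]}'
bad = '{"Autosomes_Yes_or_No": "Y", "chromosome": ["1", "2"],}'  # trailing comma

print(load_config(good)["chromosome"])
try:
    load_config(bad)
except ValueError as err:
    print(err)
```

To check the real file, read it first and pass its contents to the helper, or run `python -m json.tool` on the file from the command line.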
If you are running PopInf with the test data in this repository, you should not have to change anything in PopInf.config.json. However, we suggest double checking prior to running PopInf.
Below are the details on what to add or change in PopInf.config.json.
"ref_panel_pop_info_path":
Add the full path and file name of the sample information text file for the reference panel.
"unkn_panel_pop_info_path":
Add the full path and file name of the sample information text file for the unknown samples.
"Autosomes_Yes_or_No":
Specify whether you are analyzing the autosomes or the X chromosome. If analyzing the autosomes, type "Y". If analyzing the X chromosome, type "N".
"ref_path":
Add the full path to and name of the reference genome file.
"genotype_call_rate_threshold":
Sets the minimum genotype call rate a site must have to be kept; sites below this threshold are removed. For example, if you want to remove sites with any missing data (i.e. require a call rate of 100%), set "genotype_call_rate_threshold": to "1.0". We suggest leaving the call rate at 0.98 or higher, so that sites found in both the reference panel and unknown set overlap.
"vcf_ref_panel_path":
Add the full path to the reference panel VCF files that are separated by chromosome. Make sure the path has "/" at the end.
"vcf_ref_panel_prefix":
Add the part of the name of the reference VCF files that comes before the chromosome number. For example, if the reference VCF file for chromosome 1 is named chr1_reference_panel.vcf.gz then you would add "chr" to this part of the config file.
"vcf_ref_panel_suffix":
Add the part of the name of the reference VCF files that comes after the chromosome number. For example, if the reference VCF file for chromosome 1 is named chr1_reference_panel.vcf.gz then you would add "_reference_panel.vcf.gz" to this part of the config file.
"vcf_unknown_set_path":
Add the full path to the unzipped unknown sample(s) VCF files that are separated by chromosome. Make sure the path has "/" at the end.
"vcf_unknown_set_prefix":
Add the part of the name of the unknown VCF files that comes before the chromosome number. For example, if the unknown VCF file for chromosome 1 is named chr1_unknown_panel.vcf.gz then you would add "chr" to this part of the config file.
"vcf_unknown_set_suffix":
Add the part of the name of the unknown VCF files that comes after the chromosome number. For example, if the unknown VCF file for chromosome 1 is named chr1_unknown_panel.vcf.gz then you would add "_unknown_panel.vcf.gz" to this part of the config file.
"chromosome":
You may leave it as is, unless you do not want to analyze chromosomes 1-22. PopInf has an option to analyze the X chromosome (separately from the autosomes) so the X chromosome is not added here. If you are interested in analyzing the X chromosome, see below.
"vcf_ref_panel_path_X":
Add the full path to the reference panel VCF file for the X chromosome. Make sure the path has "/" at the end.
"vcf_ref_panel_file":
Add the full name of the reference panel VCF file for the X chromosome.
"vcf_unknown_set_path_X":
Add the full path to the unzipped unknown sample(s) VCF file for the X chromosome. Make sure the path has "/" at the end.
"vcf_unknown_set_file":
Add the full name of the unzipped unknown sample(s) VCF file for the X chromosome.
"X_chr_coordinates":
Add the full path to and name of the file containing the X chromosome PAR and XTR coordinates. The coordinates are provided in the file named X_chromosome_regions_XTR_hg19.bed and this file is located in this folder.
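The path, prefix, and suffix entries described above combine with the chromosome list to give one VCF file name per chromosome. The sketch below shows that combination under the assumption that the pieces are simply concatenated as path + prefix + chromosome + suffix; the helper name `expected_vcfs` is hypothetical and is not a PopInf function. It also enforces the trailing "/" on the path.

```python
def expected_vcfs(config, panel):
    """List per-chromosome VCF paths for panel 'ref_panel' or 'unknown_set'."""
    path = config[f"vcf_{panel}_path"]
    prefix = config[f"vcf_{panel}_prefix"]
    suffix = config[f"vcf_{panel}_suffix"]
    if not path.endswith("/"):
        raise ValueError(f"vcf_{panel}_path must end with '/': {path!r}")
    # Assumption: file name = path + prefix + chromosome + suffix.
    return [f"{path}{prefix}{chrom}{suffix}" for chrom in config["chromosome"]]

# Values taken from the naming examples above, with a shortened chromosome list.
config = {
    "vcf_ref_panel_path": "Reference_Panel/",
    "vcf_ref_panel_prefix": "chr",
    "vcf_ref_panel_suffix": "_reference_panel.vcf.gz",
    "chromosome": ["1", "2"],
}
for name in expected_vcfs(config, "ref_panel"):
    print(name)
# Reference_Panel/chr1_reference_panel.vcf.gz
# Reference_Panel/chr2_reference_panel.vcf.gz
```

Comparing this list against the files actually present in the folder is a quick way to catch a wrong prefix or suffix before submitting a job.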
This step provides instructions on how to run PopInf. On our server, we chose to use an sbatch script to run PopInf. This script is provided in this folder if you wish to use it. However, depending on your server, you might need to run PopInf differently. All the necessary scripts are provided in this folder.
Before running the sbatch script, some necessary edits are needed. These edits are specified both at the top of the script and here:
SPATH=/full/path/to/PopInf/directory/
ENV=PopInf
3. Email address that notifications should be sent to, if running on a cluster. This is the address that Slurm logs are sent to (line 30).
POPFILEREF=/full/path/to/reference_panel/Sample_Information/file.txt
POPFILEUNK=/full/path/to/unknown_sets/Sample_Information/file.txt
CHRLST=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
If you are only analyzing chromosomes 1,2,5,7 your chromosome list variable would look like this:
CHRLST=1,2,5,7
IMPORTANT NOTE: If you run PopInf on all the autosomes (chromosomes 1-22 for humans) and one or more of the chromosomes turn out to have no variants, this needs to be reflected in the chromosome list in snakemake_PopInf_slurm.sh. For example, if you initially specified running PopInf on chromosomes 1-22 and there are no variants on chromosome 8, the merge step of PopInf will fail. You will then have to re-run PopInf, specifying the chromosomes to merge in snakemake_PopInf_slurm.sh and in PopInf.config.json.
Alternatively, if you run PopInf independently on all autosomes and then wish to merge only a subset of the chromosomes for PCA visualization and to generate the inferred ancestry report, please specify those chromosome numbers in line 37 of snakemake_PopInf_slurm.sh only.
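Because the chromosome list in the sbatch script and the chromosome list in the config file are edited separately, a quick consistency check can catch a mismatch before a job is submitted. This is a minimal sketch under the assumption that CHRLST is a plain comma-separated string, as in the examples above; the function names `parse_chrlst` and `not_in_config` are hypothetical.

```python
def parse_chrlst(value):
    """Split a CHRLST-style string such as '1,2,5,7' into chromosome names."""
    return [item.strip() for item in value.split(",") if item.strip()]

def not_in_config(chrlst, config_chromosomes):
    """Return chromosomes requested for merging that the config does not list."""
    listed = set(config_chromosomes)
    return [chrom for chrom in parse_chrlst(chrlst) if chrom not in listed]

config_chromosomes = ["1", "2", "5", "7"]  # stand-in for the "chromosome" entry
print(parse_chrlst("1,2,5,7"))                      # ['1', '2', '5', '7']
print(not_in_config("1,2,8", config_chromosomes))   # ['8']
```

An empty list from `not_in_config` means every chromosome named for merging is also listed in the config.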
Additional Note: If you are not running this shell script on a cluster, remove lines 2-7 and replace the snakemake command on line 69 with just snakemake.
The following section discusses how to run the sbatch script to run PopInf. The script can be run differently depending on whether the autosomes or the X chromosome are to be analyzed.
NOTE: Make sure you edit snakemake_PopInf_slurm.sh before running PopInf.
sbatch snakemake_PopInf_slurm.sh A
sbatch snakemake_PopInf_slurm.sh X
After submitting snakemake_PopInf_slurm.sh, PopInf will run. PopInf will output PCA plots as well as an inferred population report for each specified chromosome separately, for all autosomes merged, and for the X chromosome. The PCA plots provide a visual representation of how the unknown sample(s) compare(s) to the reference panel. For each unknown sample, the inferred population reports provide the distance to each reference population's centroid and an inferred ancestry based on how close the sample is to each population.
The results for each specified autosome can be found in autosomes/per_chr_results/
The results for the autosomes merged together can be found in this directory with the file names autosomes_inferred_pop_report.pdf and autosomes_inferred_pop_report.txt.
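To make the idea of "distance to each reference population's centroid" concrete, here is a small sketch of a nearest-centroid calculation in principal component space. It is an illustration only and not PopInf's implementation: the function names and the PC1/PC2 values are made up, and the actual decision rule used in the report (for example, any distance cutoff) may differ.

```python
import math
from collections import defaultdict

def centroids(ref_points):
    """Average the PC coordinates of reference samples within each population."""
    grouped = defaultdict(list)
    for population, coords in ref_points:
        grouped[population].append(coords)
    return {
        pop: tuple(sum(axis) / len(axis) for axis in zip(*pts))
        for pop, pts in grouped.items()
    }

def distances_to_centroids(sample, cents):
    """Euclidean distance from one sample to every population centroid."""
    return {pop: math.dist(sample, centre) for pop, centre in cents.items()}

# Made-up (PC1, PC2) values for two hypothetical reference populations.
ref = [
    ("POP1", (0.0, 0.0)), ("POP1", (2.0, 0.0)),
    ("POP2", (10.0, 10.0)), ("POP2", (12.0, 10.0)),
]
cents = centroids(ref)
dists = distances_to_centroids((1.0, 1.0), cents)
closest = min(dists, key=dists.get)
print(closest, dists["POP1"])  # POP1 1.0
```

Here the unknown sample at (1.0, 1.0) lies one unit from the POP1 centroid at (1.0, 0.0) and much farther from the POP2 centroid, so POP1 is the closest population in this toy example.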