This repository contains three workflows for performing genotype data QC for the eQTL Catalogue project.
Parts of this workflow have been merged into the eQTL-Catalogue/genimpute workflow. This workflow is no longer maintained independently.
Most of the software dependencies for the pipelines are listed in the conda environment file. A Docker container with all of these dependencies can be obtained from DockerHub.
The pipelines also require GenotypeHarmonizer and LDAK5, which need to be downloaded separately. A script for downloading them can be found here.
Prepares genotype data for imputation to the 1000 Genomes Phase 3 reference panel with the Michigan Imputation Server. We run a locally installed instance of the imputation server.
QC steps:
- Align raw genotypes to the reference panel with GenotypeHarmonizer.
- Convert the genotypes to the VCF format with PLINK.
- Exclude variants with Hardy-Weinberg p-value < 1e-6, missingness > 0.05 or minor allele frequency < 0.01 using bcftools.
- Calculate individual-level missingness using vcftools.
- Create separate VCF files for each chromosome.
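The variant-level thresholds and the per-individual missingness check listed above can be illustrated with a small Python sketch. This is not the pipeline's implementation: the workflow uses bcftools and vcftools, whereas the function names, the toy genotype encoding (alternate-allele counts with `None` for a missing call) and the chi-square approximation to the Hardy-Weinberg test are assumptions made here for illustration only.

```python
import math

# Thresholds quoted in the QC steps above.
HWE_P_MIN = 1e-6
MISSING_MAX = 0.05
MAF_MIN = 0.01


def variant_stats(genotypes):
    """Return (missingness, maf, hwe_p) for one variant.

    `genotypes` holds one alternate-allele count per sample (0, 1 or 2),
    with None marking a missing call. This encoding is hypothetical.
    """
    called = [g for g in genotypes if g is not None]
    missingness = 1 - len(called) / len(genotypes)
    n = len(called)
    if n == 0:
        return missingness, 0.0, 1.0
    n_het = called.count(1)
    n_alt_hom = called.count(2)
    alt_freq = (n_het + 2 * n_alt_hom) / (2 * n)
    maf = min(alt_freq, 1 - alt_freq)
    if maf == 0:
        # Monomorphic site: Hardy-Weinberg equilibrium cannot be tested.
        return missingness, maf, 1.0
    # Chi-square goodness-of-fit test with 1 degree of freedom.
    # This is an approximation chosen for brevity, not the test used by bcftools.
    p, q = 1 - alt_freq, alt_freq
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n - n_het - n_alt_hom, n_het, n_alt_hom]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    hwe_p = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi2 with 1 df
    return missingness, maf, hwe_p


def passes_qc(genotypes):
    """Keep a variant only if it violates none of the three thresholds."""
    missingness, maf, hwe_p = variant_stats(genotypes)
    return hwe_p >= HWE_P_MIN and missingness <= MISSING_MAX and maf >= MAF_MIN


def sample_missingness(matrix):
    """Fraction of missing calls per sample; `matrix` is variants x samples."""
    n_variants = len(matrix)
    n_samples = len(matrix[0])
    return [
        sum(row[s] is None for row in matrix) / n_variants
        for s in range(n_samples)
    ]
```

For example, `passes_qc([0] * 25 + [1] * 50 + [2] * 25)` evaluates a common variant whose genotype counts match Hardy-Weinberg proportions, while a variant where every sample is heterozygous is rejected by the Hardy-Weinberg threshold.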
Execution:
nextflow run pre-imputation_qc.nf -profile eqtl_catalogue -resume \
--bfile /gpfs/hpc/projects/genomic_references/CEDAR/genotypes/PLINK_100718_1018/CEDAR \
--output_name CEDAR_GRCh37_genotyped \
--outdir CEDAR
Genotype data imputed to the 1000 Genomes Phase 3 reference panel.
- Perform LD pruning on the reference dataset with PLINK.
- Perform PCA and project new samples to the reference principal components with LDAK.
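The idea behind the projection step can be sketched in Python with NumPy: principal components are fitted on the reference samples only, and new samples are then centred with the reference means and multiplied by the reference loadings. This is a simplified stand-in and not LDAK's algorithm. The function names and the toy genotype matrices are hypothetical, and LD pruning and any genotype scaling are omitted.

```python
import numpy as np


def fit_reference_pcs(ref_genotypes, n_components=2):
    """Fit PCA on a reference matrix of shape (samples, variants).

    Returns the per-variant means and the loadings (variants x n_components).
    """
    ref = np.asarray(ref_genotypes, dtype=float)
    means = ref.mean(axis=0)
    centered = ref - means
    # SVD of the centred matrix: the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = vt[:n_components].T
    return means, loadings


def project(genotypes, means, loadings):
    """Project samples onto the reference PCs using the reference means."""
    return (np.asarray(genotypes, dtype=float) - means) @ loadings


# Toy alternate-allele counts: two reference groups with opposite patterns.
reference = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [2, 2, 0, 0],
    [2, 2, 0, 1],
]
new_samples = [[0, 0, 2, 2], [2, 1, 0, 0]]

means, loadings = fit_reference_pcs(reference, n_components=2)
reference_pcs = project(reference, means, loadings)
new_pcs = project(new_samples, means, loadings)
```

Because the new samples never influence the means or the loadings, their coordinates are directly comparable to those of the reference samples, which is what makes a subsequent population assignment in the reference PC space possible.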
nextflow run pop_assign.nf -profile pop_assign --vcf <path_to_vcf.vcf.gz> --data_name <study_name>
The initial version of the population assignment pipeline was implemented by Katerina Peikova and Marija Samoviča and later modified by Nurlan Kerimov and Kaur Alasoo.