Skip to content

1. Convert VCF to PLINK

Sander W. van der Laan edited this page Apr 20, 2023 · 4 revisions

Background

This script converts VCF to PLINK-format files from 1000G Phase 3, version 5b. Note that 5b contains errors that have been subsequently fixed in 5c. However, 5c does not include rsID for the variants (nor did I find the proper .VCF-files per chromosome). Use of 5c may be problematic in downstream pipelines that require rsID - it is not trivial to match rsIDs from dbSNP. The latter contains tens of millions of rsIDs mapping to the same location. So far I have not found an easy fix for this - but I am open to any suggestions ;-).

Also, checkout this package kgp for R. It can provide easy access to the meta-data from 1000G phase 3.

A couple of things are relevant to know.

  • these files only include bi-allelic variants
  • the allele coding is 'kept', that is to say, they are not converted to A1=minor, and A2=major
  • variants with rsIDs have ID 'rs#'
  • variants with no ID have ID 'chr:bp:ref:alt'
  • this does includ '' (esv) and '' type variants
  • no multi-allelic variants are retained

References used

Conversion

The script used is called vcf_to_plink.convert.sh.

The basic operation is below.

Six PLINK-formatted file-sets were generated:

  • EUR, with all individuals from the EURopean super-population
  • AFR, with all individuals from the AFRican super-population
  • EAS, with all individuals from the East-Asian super-population
  • SAS, with all individuals from the South-East Asian super-population
  • AMR, with all individuals from the Admixed American super-population
  • ALL, with all individuals

To this end a 'remove-samples'-file was created:

cat integrated_call_samples_v3.20130502.ALL.panel | awk '$3!="AFR"' | awk '{ print 0, $1 }' | tail -n +2 > remove.nonAFR.individuals.txt
cat integrated_call_samples_v3.20130502.ALL.panel | awk '$3!="EUR"' | awk '{ print 0, $1 }' | tail -n +2 > remove.nonEUR.individuals.txt
cat integrated_call_samples_v3.20130502.ALL.panel | awk '$3!="EAS"' | awk '{ print 0, $1 }' | tail -n +2 > remove.nonEAS.individuals.txt
cat integrated_call_samples_v3.20130502.ALL.panel | awk '$3!="SAS"' | awk '{ print 0, $1 }' | tail -n +2 > remove.nonSAS.individuals.txt
cat integrated_call_samples_v3.20130502.ALL.panel | awk '$3!="AMR"' | awk '{ print 0, $1 }' | tail -n +2 > remove.nonAMR.individuals.txt

Files used

  • ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
  • ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz
  • ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz
  • ALL.chr[no].phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

PLINK2

Note that we are using PLINK2 - not PLINK 1.9. We used plink2_alpha3_7_final, but download the latest version as bugs are regularly found.

Commands

First, VCF was converted to BCF.

  • Ensure that multi-allelic calls are split and that indels are left-aligned compared to reference genome (1st pipe)
  • Sets the ID field to a unique value: CHROM:POS:REF:ALT (2nd pipe)
  • Removes duplicates (3rd pipe)
  • Note that chromosome 25, i.e. chr XY is retained in chromosome X.

-I +'%CHROM:%POS:%REF:%ALT' means that unset IDs will be set to CHROM:POS:REF:ALT

-x ID -I +'%CHROM:%POS:%REF:%ALT' first erases the current ID and then sets it to CHROM:POS:REF:ALT

norm -Ob --rm-dup both creates binary zipped (Ob) BCF (this is faster than zipped VCF!) and removes duplicate variants

An example with autosomal chromosomes, adapt this to accomodate MT, X, and Y.

for chr in {1..22} ; do
	echo "> fixing, annotating, and normalizing chromosome: ${chr} ..."
	bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
		ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz | \
			bcftools annotate -x ID -I +'chr%CHROM\:%POS\:%REF\_%ALT' | \
				$bcftools norm -Ob --rm-dup both \
					> ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.chrbprefalt.nodups.indel_leftalign.multi_split.bcf ;
	bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.chrbprefalt.nodups.indel_leftalign.multi_split.bcf ;	
done

An example with autosomal chromosomes, adapt this to accomodate MT, X, and Y.

for chr in {1..22}; do
	echo "> converting chromosome: ${chr} ..."
	PLINK2 \
	  --bcf ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.chrbprefalt.nodups.indel_leftalign.multi_split.bcf \
	  --vcf-idspace-to _ \
	  --const-fid \
	  --split-par b37 \
	  --update-name ALL.phase3_shapeit2_mvncall_integrated_v5b.20130502.VARIANTLIST.PLINKupdate.only_biallelic.only_rsIDs.txt \
	  --make-bed \
	  --out ${GEN1000P3}/PLINK_format/1000Gp3v5.20130502.ALL.chr"${chr}" ;
done

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

An alternative, older, command - be aware, this should first be tested. This is available in create.1kGp3v5a.plink.sh.

PLINK2 --vcf VCFFILE.vcf.gz \
--allow-extra-chr 0 \
--const-fid \ 
--keep-allele-order \
--vcf-idspace-to _ \
--chr X \
--split-par b37 \
--set-missing-var-ids "chr"@:#:\$r:\$a \
--var-id-multi "chr"@:#:\$r:\$a \
--var-id-multi-nonsnp "chr"@:#:\$r:\$a \
--new-id-max-allele-len 23 missing \
--max-alleles 2 \
--make-bed --out 1000Gp3v5a.20130502.${POP}.chr23 \
--remove remove.non${POP}.individuals.txt 

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Files generated

We converted the original VCF data, without any changes to the variantIDs.

`ALL.chr5.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.[bed/bim/fam]`	

The data generated from the above steps are of the same format, and include intermediate PLINK-generated files, as well as gzipped frequency-statistics.

`1000Gp3v5.20130502.[POP].chr[NR].[bed/bim/fam/log]`
	POP = EUR, AFR, EAS, SAS, AMR, ALL 
	NR = number
	These are PLINK-format binary files

`1000Gp3v5.20130502.[POP].chr[NR].FREQ.[afreq.gz/log]`
	POP = EUR, AFR, EAS, SAS, AMR, ALL 
	NR = number
	Frequency statistics as generated by PLINK2 - https://www.cog-genomics.org/plink/2.0/formats#afreq
	
`1000Gp3v5.20130502.[POP].chr[NR].[pgen/psam/pvar]`
	PLINK2-format binary files - https://www.cog-genomics.org/plink/2.0/input#pgen
	Don't remove, these are handy in case of further conversions to other formats
Clone this wiki locally