
03 Preparing the Reference Sequence

Neranjan Perera edited this page Dec 3, 2018 · 2 revisions

The GATK needs two files when accessing the reference file:

  • A dictionary of the contig names and sizes
  • An index file to access the reference fasta file bases

Here we prepare these files up front so that the GATK can use the FASTA file as a reference. We also build the BWA index, which bwa needs in order to align reads against the same reference.

Generate the BWA index

First, run the following bwa command to create the index. This assumes the reference file hg19.fa has already been downloaded into a folder called hg19.

hg19="<full_path_to>/hg19.fa"

module load bwa/0.7.17

cd ../hg19/
bwa index ${hg19}

This will create the following files:

hg19/
├── hg19.fa
├── hg19.fa.amb
├── hg19.fa.ann
├── hg19.fa.bwt
├── hg19.fa.pac
└── hg19.fa.sa
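
Before moving on, it can be useful to confirm that all five index files were actually written. The following is a minimal sketch, not part of bwa itself: the function name check_bwa_index is hypothetical, and it only tests that each file exists and is non-empty, not that its contents are valid.

```shell
# Hypothetical helper (not a bwa command): report any of the five
# BWA index files that are missing or empty for the given FASTA path.
check_bwa_index() {
    ref="$1"
    missing=0
    for ext in amb ann bwt pac sa; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "${ref}.${ext}" ]; then
            echo "missing or empty: ${ref}.${ext}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call, assuming ${hg19} is set as above:
#   check_bwa_index "${hg19}" && echo "BWA index looks complete"
```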

Generate Fasta File Index

Using samtools, we will create an index of the reference FASTA file.

module load samtools/1.7
samtools faidx ${hg19}

This will create:

hg19/
└── hg19.fa.fai

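It consists of one tab-separated record per line, one for each contig in the FASTA file. Each record is composed of:

  • contig name
  • size (number of bases in the contig)
  • location (byte offset of the contig's first base in the file)
  • bases per line
  • bytes per line (including the line terminator)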
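
Because the index is plain tab-separated text, the first two columns can be read directly to list the contigs and their sizes. The sketch below assumes the five-column layout described above; the function names fai_contig_sizes and fai_total_bases are hypothetical and are not samtools commands.

```shell
# Hypothetical helper: print "contig<TAB>length" for every record in a .fai file.
fai_contig_sizes() {
    cut -f1,2 "$1"
}

# Hypothetical helper: sum the length column to get the total number of bases.
# printf "%.0f" avoids scientific notation for genome-sized totals in some awks.
fai_total_bases() {
    awk -F'\t' '{ total += $2 } END { printf "%.0f\n", total }' "$1"
}

# Example calls, assuming ${hg19} is set as above:
#   fai_contig_sizes "${hg19}.fai"
#   fai_total_bases  "${hg19}.fai"
```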

Create Sequence Dictionary

Use Picard tools to create the dictionary with the following command:

module load picard/2.9.2
export _JAVA_OPTIONS=-Djava.io.tmpdir=/scratch

java -jar $PICARD CreateSequenceDictionary \
        REFERENCE=${hg19} \
        OUTPUT=hg19.dict \
        CREATE_INDEX=True

This will create:

hg19/
└── hg19.dict

This file is formatted like a SAM file header. When GATK runs, it automatically looks for hg19.fa.fai and hg19.dict alongside hg19.fa.
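
Since the dictionary uses SAM header syntax, each contig appears on an @SQ line with SN: (name) and LN: (length) tags. As a rough sanity check, those two tags can be extracted and compared against the first two columns of the FASTA index. This is a sketch under that assumption; the function name dict_contig_sizes is hypothetical and is not part of Picard or GATK.

```shell
# Hypothetical helper: print "contig<TAB>length" from the @SQ lines of a .dict file.
dict_contig_sizes() {
    awk -F'\t' '$1 == "@SQ" {
        name = ""; len = ""
        # Tags may appear in any order, so scan every field on the line.
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SN:/) name = substr($i, 4)
            if ($i ~ /^LN:/) len  = substr($i, 4)
        }
        print name "\t" len
    }' "$1"
}

# Example comparison, assuming both files sit next to hg19.fa;
# no diff output means the contig names and sizes agree:
#   dict_contig_sizes hg19.dict > dict.sizes
#   cut -f1,2 hg19.fa.fai | diff - dict.sizes
```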