03 Preparing the Reference Sequence
Neranjan Perera edited this page Dec 3, 2018 · 2 revisions
The GATK needs two files when accessing the reference file:
- A dictionary of the contig names and sizes
- An index file to access the reference fasta file bases
Here we prepare these files up front so that the GATK can use the FASTA file as a reference.
First, run the following bwa command to create the BWA index. This assumes the reference file hg19.fa has already been downloaded into a folder called hg19.
    hg19=<full_path_to>hg19.fa
    module load bwa/0.7.17
    cd ../hg19/
    bwa index ${hg19}
This will create the following files:
hg19/
├── hg19.fa
├── hg19.fa.amb
├── hg19.fa.ann
├── hg19.fa.bwt
├── hg19.fa.pac
└── hg19.fa.sa
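Before moving on, it can be worth confirming that all five index files listed above were actually written. The following is a minimal sketch, not part of the original workflow; the function name `check_bwa_index` is hypothetical, and the check only tests that each file exists and is non-empty.

```shell
# Hypothetical helper: report any BWA index file that is missing or empty
# next to the given reference FASTA. Returns 0 only if all five are present.
check_bwa_index() {
    ref="$1"
    missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -s "${ref}.${ext}" ]; then
            echo "missing or empty: ${ref}.${ext}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, assuming ${hg19} is set as in the bwa command above:
# check_bwa_index "${hg19}" && echo "BWA index looks complete"
```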
Using samtools, we will create an index of the reference FASTA file.
    module load samtools/1.7
    samtools faidx ${hg19}
This will create:
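hg19/
└── hg19.fa.fai
The index contains one record per line, one for each contig in the FASTA file. Each record is composed of:
- contig name
- size (number of bases in the contig)
- location (byte offset of the contig's first base in the file)
- bases per line
- bytes per line (including the line terminator)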
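To make the five fields concrete, the sketch below derives them with awk from a tiny made-up FASTA. This is an illustration of the record layout only, not how samtools is implemented; the toy file and the variable names are assumptions, and the script assumes Unix line endings and single-byte characters.

```shell
# Build a toy FASTA (hypothetical contents): two contigs, 4 bases per line.
toy=$(mktemp -d)/toy.fa
printf '>chrA\nACGT\nAC\n>chrB\nGGCC\nTTAA\n' > "$toy"

# For each contig, compute: name, size, offset of first base,
# bases per line, bytes per line (bases plus the newline).
fai_fields=$(awk -v OFS='\t' '
    function emit() { if (name != "") print name, size, start, bases, bytes }
    /^>/ {
        emit()                          # finish the previous contig, if any
        name = substr($1, 2); size = 0; bases = 0; bytes = 0
        offset += length($0) + 1        # header line plus its newline
        start = offset                  # first base follows the header
        next
    }
    {
        # The first sequence line of a contig defines the line geometry.
        if (bases == 0) { bases = length($0); bytes = length($0) + 1 }
        size += length($0)
        offset += length($0) + 1
    }
    END { emit() }
' "$toy")

printf '%s\n' "$fai_fields"
```

For this toy input the two records are `chrA 6 6 4 5` and `chrB 8 20 4 5` (tab-separated): chrA starts at byte 6 because its header `>chrA` plus newline occupies the first six bytes.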
Use Picard to create the sequence dictionary with the following command:
    module load picard/2.9.2
    export _JAVA_OPTIONS=-Djava.io.tmpdir=/scratch
    java -jar $PICARD CreateSequenceDictionary \
        REFERENCE=${hg19} \
        OUTPUT=hg19.dict \
        CREATE_INDEX=True
This will create:
hg19/
└── hg19.dict
The dictionary is formatted like a SAM file header. When GATK runs, it automatically looks for the .fai and .dict files alongside the reference FASTA file.
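Since both files describe the same contigs, one optional sanity check is to compare the contig names and sizes in the dictionary against those in the samtools index. The sketch below is an assumption-laden illustration: the function name `compare_dict_fai` is hypothetical, and it relies on the dictionary's `@SQ` lines carrying tab-separated `SN:` (name) and `LN:` (length) tags, as in a SAM header.

```shell
# Hypothetical helper: succeed only if the contig names and sizes in the
# .dict file match, in the same order, the first two columns of the .fai file.
compare_dict_fai() {
    dict="$1"
    fai="$2"
    # Pull SN (name) and LN (length) from each @SQ header line.
    dict_contigs=$(awk -F'\t' '
        $1 == "@SQ" {
            sn = ""; ln = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) sn = substr($i, 4)
                if ($i ~ /^LN:/) ln = substr($i, 4)
            }
            print sn "\t" ln
        }' "$dict")
    # Columns 1 and 2 of the .fai are contig name and size.
    fai_contigs=$(cut -f1,2 "$fai")
    [ "$dict_contigs" = "$fai_contigs" ]
}

# Usage, run from inside the hg19 folder:
# compare_dict_fai hg19.dict hg19.fa.fai && echo "dict and fai agree"
```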