<!doctype html>
<html lang="en">
<head>
<!-- Required meta tags -->
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<title> How-to: Edit GWAS and Genomic Prediction File Formats | Computational Approaches to Biological Problems
</title>
<link rel="canonical" href="./how-to-edit-gwas-and-genomic-prediction-file-formats.html">
<link rel="stylesheet" href="./theme/css/bootstrap.min.css">
<link rel="stylesheet" href="./theme/css/fontawesome.min.css">
<link rel="stylesheet" href="./theme/css/pygments/default.min.css">
<link rel="stylesheet" href="./theme/css/theme.css">
<meta name="description" content="Among the many tools used for GWAS and Genomic Prediction, there seem to be just as many file formats to navigate. Running multiple algorithms then requires switching between file formats, which can be a tedious, if not time-consuming, task. For both my future sanity and yours, I've provided …">
</head>
<body>
<header class="header">
<div class="container">
<div class="row">
<div class="col-sm-4">
<a href="./">
<img class="img-fluid rounded" src=./images/LucasBoatwrightHeadShot-161x215.jpeg alt="Computational Approaches to Biological Problems">
</a>
</div>
<div class="col-sm-8">
<h1 class="title"><a href="./">Computational Approaches to Biological Problems</a></h1>
<p class="text-muted">and other biological inklings</p>
<ul class="list-inline">
<li class="list-inline-item"><a href="./index.html">Home</a></li>
<li class="list-inline-item"><a href="./archives.html">posts</a></li>
<li class="list-inline-item"><a href="https://github.com/jlboat">GitHub</a></li>
<li class="list-inline-item"><a href="https://scholar.google.com/citations?hl=en&user=q5G2bJEAAAAJ&view_op=list_works&gmla=AJsN-F4gfKvkjBIT6Nv8Xy97IIUHuhFi9jF8x5ntZveyPGo_xO2ws-_5NjCw-_9z9CD2mWJnlbL1DeMd-0l-IUNYPsQ1NR4wu8s_l6YjZYlsb9mTQ1d8TX8">Pubs</a></li>
<li class="list-inline-item text-muted">|</li>
<li class="list-inline-item"><a href="./pages/about-me.html">About me</a></li>
</ul>
</div>
</div> </div>
</header>
<div class="main">
<div class="container">
<h1> How-to: Edit GWAS and Genomic Prediction File Formats
</h1>
<hr>
<article class="article">
<header>
<ul class="list-inline">
<li class="list-inline-item text-muted" title="2019-10-10T00:00:00-05:00">
<i class="fas fa-clock"></i>
Thu 10 October 2019
</li>
<li class="list-inline-item">
<i class="fas fa-folder-open"></i>
<a href="./category/how-to.html">How-to</a>
</li>
<li class="list-inline-item">
<i class="fas fa-user"></i>
<a href="./author/j-lucas-boatwright.html">J. Lucas Boatwright</a> </li>
<li class="list-inline-item">
<i class="fas fa-tag"></i>
<a href="./tag/bioinformatics.html">#bioinformatics</a>, <a href="./tag/education.html">#education</a>, <a href="./tag/gwas.html">#gwas</a>, <a href="./tag/gp.html">#gp</a> </li>
</ul>
</header>
<div class="content">
<p>Among the many tools used for GWAS and Genomic Prediction, there seem to be just as many file formats to navigate. Running multiple algorithms then requires switching between file formats, which can be a tedious, if not time-consuming, task. For both my future sanity and yours, I've provided a brief overview of some common formats and recommended routes for file conversion.</p>
<p>To start, I highly recommend that all genotype files be originally generated in VCF format. If you can do so following the GATK best practices for variant calling, all the better. I recommend this because VCF files are generally the most descriptive of all the genotype formats, and conversion from VCF to other formats results in a loss of information.</p>
<p>Once you have a VCF file, genotype file conversion is relatively simple. You can use Tassel, which provides options to convert among several genotype file formats:</p>
<p>VCF | HapMap | HapMap Diploid | Plink | HDF5 | Phylip (interleaved or sequential) | Table</p>
<p>This may be done using the Tassel GUI (graphical user interface) or on the command line.</p>
<p>Alternatively, you can use vcftools to convert a VCF or BCF (binary VCF) to the following formats:</p>
<p>BEAGLE | IMPUTE | LDhat | Plink | Plink TPED (transposed) | Numeric Matrix</p>
<p>Conversion among genotype files is <em>typically</em> simple, but it can be more complicated when converting between phased and unphased genotypes, or when files with multi-allelic sites need to be reduced to bi-allelic sites only.</p>
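<p>As a rough illustration of the bi-allelic case, the sketch below keeps only VCF records with a single ALT allele. It is a minimal example, not a replacement for a proper VCF tool: the function names <code>is_biallelic</code> and <code>keep_biallelic</code> and the sample records are made up for illustration, and it assumes plain tab-delimited VCF text with ALT in the fifth column.</p>

```python
def is_biallelic(vcf_line: str) -> bool:
    """Return True when a VCF data line lists exactly one ALT allele."""
    fields = vcf_line.rstrip("\n").split("\t")
    alt = fields[4]  # VCF columns: CHROM, POS, ID, REF, ALT, ...
    # "." means no ALT allele; a comma means more than one ALT allele.
    return alt != "." and "," not in alt


def keep_biallelic(lines):
    """Yield header lines unchanged and only the bi-allelic data lines."""
    for line in lines:
        if line.startswith("#") or is_biallelic(line):
            yield line


# Hypothetical records used only to exercise the functions.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tIndiv1\n",
    "1\t3918\tS01_3918\tA\tG\t.\tPASS\t.\tGT\t0/1\n",
    "1\t12215\tS01_12215\tC\tT,G\t.\tPASS\t.\tGT\t1/2\n",
]
filtered = list(keep_biallelic(example))
```

<p>Here the second record is dropped because its ALT field lists two alleles, while both header lines pass through untouched.</p>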
<p>As for phenotype formats, the main issue is typically determining what format you need for your program of choice and then figuring out how best to convert it.</p>
<p>Plink format for phenotypes has the following tab-delimited structure:</p>
<ul>
<li>FID = Family ID</li>
<li>IID = Individual ID</li>
</ul>
<p>In this example, I don't have FIDs so the IID is just repeated for both.</p>
<table border="1" style="width:100%" align="center">
<thead>
<tr>
<th>FID</th>
<th>IID</th>
<th>Phenotype1</th>
<th>...</th>
<th>PhenotypeN</th>
</tr>
</thead>
<tbody>
<tr>
<td>Indiv1</td>
<td>Indiv1</td>
<td>0</td>
<td>...</td>
<td>0</td>
</tr>
<tr>
<td>Indiv2</td>
<td>Indiv2</td>
<td>3</td>
<td>...</td>
<td>2</td>
</tr>
</tbody>
</table>
<p>This is slightly different from the Tassel tab-delimited format for phenotypes:</p>
<table border="1" style="width:100%" align="center">
<thead>
<tr>
<th>&lt;Trait&gt;</th>
<th>Phenotype1</th>
<th>...</th>
<th>PhenotypeN</th>
</tr>
</thead>
<tbody>
<tr>
<td>Indiv1</td>
<td>0</td>
<td>...</td>
<td>0</td>
</tr>
<tr>
<td>Indiv2</td>
<td>3</td>
<td>...</td>
<td>2</td>
</tr>
</tbody>
</table>
<p>Tassel will allow you to save a phenotype file into either Tassel or Plink format.</p>
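<p>If you would rather script the conversion, the two layouts above differ only in the leading columns, so a small sketch like the following may be enough. It assumes a tab-delimited Plink phenotype file with a header row, as in the table above; the function name <code>plink_to_tassel</code> and the sample data are hypothetical.</p>

```python
import csv
import io


def plink_to_tassel(plink_text: str) -> str:
    """Drop the FID column and rename the IID header to <Trait>."""
    reader = csv.reader(io.StringIO(plink_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    header = next(reader)
    # header[0] is FID and header[1] is IID; the rest are phenotype names.
    writer.writerow(["<Trait>"] + header[2:])
    for row in reader:
        writer.writerow(row[1:])  # keep IID and the phenotype values
    return out.getvalue()


# Hypothetical phenotype table matching the example layout above.
plink_pheno = (
    "FID\tIID\tPhenotype1\tPhenotype2\n"
    "Indiv1\tIndiv1\t0\t0\n"
    "Indiv2\tIndiv2\t3\t2\n"
)
tassel_pheno = plink_to_tassel(plink_pheno)
```

<p>Going the other way would mean repeating the individual ID as the FID, as done in the Plink example above.</p>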
<p>If you're running GAPIT, then the phenotype file is read into an R data frame. That means the delimiter is less important, since files can be read with "read.csv" or "read.table". However, you will generally want the file to look like this:</p>
<table border="1" style="width:100%" align="center">
<thead>
<tr>
<th>Taxa</th>
<th>Phenotype1</th>
<th>...</th>
<th>PhenotypeN</th>
</tr>
</thead>
<tbody>
<tr>
<td>Indiv1</td>
<td>0</td>
<td>...</td>
<td>0</td>
</tr>
<tr>
<td>Indiv2</td>
<td>3</td>
<td>...</td>
<td>2</td>
</tr>
</tbody>
</table>
<p>When you read the phenotype file, specify that it has a header row ("head=TRUE"). Don't do this for the genotype file in HapMap format, though. Even though there <em>is</em> a header, the genotype file should be read with "head=FALSE" so that the header is kept as the first row of the data frame.</p>
<p>It's also worth noting that while the first 11 HapMap columns are required for GAPIT, only three of them are used ("rs" a.k.a. SNP name, "chrom" and "pos"). So, the other eight columns may be filled with "NA".</p>
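<p>To make the "NA" padding concrete, here is a minimal sketch that builds one HapMap-style row with only the three used columns filled in. The column names in <code>HAPMAP_COLUMNS</code> are the conventional HapMap header names as I understand them, the function name <code>hapmap_row</code> is hypothetical, and the genotype codes are illustrative only.</p>

```python
# Conventional names for the 11 leading HapMap columns (assumed here).
HAPMAP_COLUMNS = [
    "rs", "alleles", "chrom", "pos", "strand", "assembly",
    "center", "protLSID", "assayLSID", "panel", "QCcode",
]


def hapmap_row(snp_name, chrom, pos, genotypes):
    """Build a tab-delimited row, filling the unused leading columns with NA."""
    used = {"rs": snp_name, "chrom": str(chrom), "pos": str(pos)}
    leading = [used.get(column, "NA") for column in HAPMAP_COLUMNS]
    return "\t".join(leading + list(genotypes))


# Hypothetical SNP with two individuals.
row = hapmap_row("S01_3918", 1, 3918, ["AA", "AG"])
```

<p>The resulting row has 11 leading columns followed by one genotype column per individual.</p>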
<p>Now, let's assume instead that you want to run Plink or GEMMA. GEMMA can use Plink file formats, so let's use that common format. Conversion from VCF to Plink files is easily achieved using Tassel or vcftools.</p>
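<p>To begin, Plink requires both a PED (pedigree) and a MAP (genetic map) file. The PED format requires that all markers be biallelic, and the file looks like this (header included here for clarity; it is not present in an actual PED file). Note that in an actual PED file each SNP is stored as a pair of alleles (for example, "A A"); the table below shows a single allele per SNP.</p>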
<table border="1" style="width:100%" align="center">
<thead>
<tr>
<th>FamilyID</th>
<th>IndividualID</th>
<th>PaternalID</th>
<th>MaternalID</th>
<th>Sex</th>
<th>Phenotype</th>
<th>Snp1</th>
<th>...</th>
<th>SnpN</th>
</tr>
</thead>
<tbody>
<tr>
<td>Indiv1</td>
<td>Indiv1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>A</td>
<td>...</td>
<td>G</td>
</tr>
<tr>
<td>Indiv2</td>
<td>Indiv2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>A</td>
<td>...</td>
<td>G</td>
</tr>
</tbody>
</table>
<p>And the Plink MAP file looks like this (again, header included for clarity only):</p>
<table border="1" style="width:100%" align="center">
<thead>
<tr>
<th>chromosome</th>
<th>SnpID</th>
<th>GeneticDistance</th>
<th>BasePairPosition</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>S01_3918</td>
<td>0</td>
<td>3918</td>
</tr>
<tr>
<td>1</td>
<td>S01_12215</td>
<td>0</td>
<td>12215</td>
</tr>
<tr>
<td>1</td>
<td>S01_23664</td>
<td>0</td>
<td>23664</td>
</tr>
</tbody>
</table>
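<p>The MAP layout is simple enough to write by hand when needed. The sketch below builds the three example lines shown above; the <code>S01_3918</code>-style SNP ID is copied from the table and is just a naming convention, not a Plink requirement, and the function name <code>map_line</code> is hypothetical.</p>

```python
def map_line(chrom: int, pos: int, genetic_distance: float = 0) -> str:
    """Return one tab-delimited MAP line: chromosome, SNP ID, distance, position."""
    # SNP ID pattern follows the example table: S<two-digit chromosome>_<position>.
    snp_id = f"S{chrom:02d}_{pos}"
    return f"{chrom}\t{snp_id}\t{genetic_distance}\t{pos}"


# Positions taken from the example MAP table above.
lines = [map_line(1, pos) for pos in (3918, 12215, 23664)]
map_text = "\n".join(lines) + "\n"
```

<p>A genetic distance of 0 is used throughout, matching the example table.</p>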
<div class="highlight"><pre><span></span><code><span class="c1"># This will generate the PED and MAP files -- note: Only biallelic loci will be output</span>
<span class="c1"># Tassel can also generate the PED and MAP files, but Tassel retains multiallelic loci</span>
<span class="c1"># which is a problem for GEMMA</span>
vcftools --vcf genotypes.vcf --plink --out genotypes
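<span class="c1"># This will generate BED (binary PED), FAM (PLINK sample information file),</span>
<span class="c1"># BIM (PLINK extended MAP file) and log files, plus a NoSex file listing samples with</span>
<span class="c1"># ambiguous sex; --allow-no-sex keeps the phenotypes of those samples from being set to missing</span>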
plink --file genotypes --pheno phenotypes.txt --allow-no-sex --make-bed --out genotypes
</code></pre></div>
<p>The BED, BIM and FAM files are referred to as the Plink binary fileset. These are the files you need to run GEMMA and Plink.</p>
<p>The last style of genotype matrix typically seen is the numeric genotype matrix. Depending on the software, you may want this matrix in 0, 1, 2 format or in -1, 0, 1. This matrix format is useful in genomic prediction as a design matrix or during the calculation of a genomic relationship matrix.</p>
<p>Conversion to 012 is easily accomplished using vcftools.</p>
<div class="highlight"><pre><span></span><code>vcftools --vcf my.vcf --012 --out geno_matrix
</code></pre></div>
<p>Then, conversion to the -101 format simply requires reading the matrix into either R or Python and subtracting one from every entry. For example:</p>
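<div class="highlight"><pre><span></span><code><span class="c1"># Commented code for a small-scale example</span>
<span class="c1"># myMatrix &lt;- matrix(c(c(0,1,2),c(2,1,0),c(1,2,0)), nrow=3)</span>
<span class="c1"># The .012 file is tab-delimited and its first column is the individual index, not a genotype</span>
<span class="n">myMatrix</span> <span class="o">&lt;-</span> <span class="nf">read.table</span><span class="p">(</span><span class="s">"geno_matrix.012"</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">"\t"</span><span class="p">,</span> <span class="n">row.names</span><span class="o">=</span><span class="m">1</span><span class="p">)</span>
<span class="n">myMatrix</span> <span class="o">&lt;-</span> <span class="n">myMatrix</span> <span class="o">-</span> <span class="m">1</span>
</code></pre></div>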
<p>Do note, however, that different programs expect missing data to be encoded in particular ways. Here, the 012 matrix encodes missing genotypes as -1, which becomes -2 in the -101 matrix. Recode these values as NA where necessary.</p>
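<p>Since the same step can be done in Python, here is a minimal sketch using only the standard library. It assumes the vcftools .012 layout of one tab-delimited row per individual with the individual index in the first column, and it recodes missing calls as <code>None</code> rather than letting them become -2. The function names and the two-row example are hypothetical.</p>

```python
def read_012(text: str):
    """Parse .012 content into rows of ints, skipping the index column."""
    matrix = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        matrix.append([int(value) for value in fields[1:]])
    return matrix


def to_minus101(matrix):
    """Subtract one from each genotype; missing calls (-1) become None."""
    return [[None if g == -1 else g - 1 for g in row] for row in matrix]


# Two hypothetical individuals, three SNPs, one missing call.
example = "0\t0\t1\t2\n1\t2\t-1\t0\n"
recoded = to_minus101(read_012(example))
```

<p>For a real file, the text would come from reading <code>geno_matrix.012</code>, and a numeric library would likely be a better fit for large matrices.</p>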
<p>That should cover the majority of file conversions needed for performing GWAS or Genomic Prediction. Good luck!</p>
</div>
</article>
</div>
</div>
<footer class="footer">
<div class="container">
<div class="row">
<ul class="col-sm-6 list-inline">
<li class="list-inline-item"><a href="./authors.html">Authors</a></li>
<li class="list-inline-item"><a href="./archives.html">Archives</a></li>
<li class="list-inline-item"><a href="./categories.html">Categories</a></li>
<li class="list-inline-item"><a href="./tags.html">Tags</a></li>
</ul>
<p class="col-sm-6 text-sm-right text-muted">
Generated by <a href="https://github.com/getpelican/pelican" target="_blank">Pelican</a>
/ <a href="https://github.com/nairobilug/pelican-alchemy" target="_blank">✨</a>
</p>
</div> </div>
</footer>
</body>
</html>