-
Notifications
You must be signed in to change notification settings - Fork 0
/
workshop.qmd
380 lines (234 loc) · 13.9 KB
/
workshop.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
---
title: "CCV Bootcamp 2024: Phylogenetics"
format: html
editor: visual
author: "August Guang"
---
## Overview
In this brief workshop we will cover:
- How to best make use of Open OnDemand and OSCAR for computationally complex work and reproducibility
- Some good practices for keeping track of this kind of work
- How to run and allocate resources for IQ-TREE2 to infer phylogenies
- How to use ggtree2+tidytree to visualize phylogenetic trees and do some basic tree manipulation
This will *not* cover:
- Phylogenetic theory, models, and statistical tests
- Assembly, alignment, time tree dating, other components in phylogenetic pipelines
- Other flavors of phylogenetic inference programs like raxml, mrbayes, etc
## Environment and repo setup
This workshop requires a fair amount of setup, as we will be toggling between the terminal window and RStudio on Open OnDemand as well as using an Apptainer container that has all the software we need. You **will** need an Oscar account already to be able to follow along, as we will be using Open OnDemand. If you don't have one, it is possible to either:
- Download the Docker image and run the scripts + access RStudio within the container on your local machine
- Download programs and packages on your own, run the scripts + access RStudio on your local machine
But we don't recommend it.
Navigate to:
```
http://ood.ccv.brown.edu
```
Once you are in, open up a terminal window. From there, you will want to clone the `ccv_bootcamp_phylogenetics` repository into your home directory (the default one when you open terminal):
```
git clone https://github.com/compbiocore/ccv_bootcamp_phylogenetics.git
```
Next steps will be to go into the repo, go into the scripts directory, and run `00_env.sh` to pull the Singularity image we will be using and set up the environment:
```
cd ccv_bootcamp_phylogenetics/scripts
./00_env.sh
```
Once that is done, we will be setting up the last piece. Go to the OOD apps, click on RStudio on Singularity, and put in the following options:
![](images/rstudio.png)
Go to the bottom and hit the Launch button, and wait for your instance to spin up!
## Setup
```{r setup, include=FALSE}
library(ggtree)
library(treeio)
library(ape)
library(tidyverse)
PATH<-"~/ccv_bootcamp_phylogenetics"
```
## Placing Turtle Relative to Crocodiles and Birds
Today we will use IQTREE and ggtree to help us understand the question of where turtles should be placed on the phylogeny relative to crocodiles and birds. Data is already in the repository and is a subset of the dataset used to answer this question in [Chiari et al., 2012](https://doi.org/10.1186/1741-7007-10-65).
![](images/turtle.png)
### Quick view of data
- Recommend using a program like Jalviewer or Aliview to view (not covered here)
- Will open file directly to take a look, but will not give you a sense of alignment
- There exists a `read.fasta` function from treeio if you just want a brief summary of the alignment. It relies on a package called `Biostrings`, so you will need to have that installed as well.
```{r}
# function from treeio
fa <- read.fasta(file.path(PATH,"data/turtle.fa"))
fa
```
## Inferring an initial phylogeny
All scripts we will be running today are in the `ccv_bootcamp_phylogenetics/scripts` folder, as you may have already seen. They are organized into
```
00_env.sh
01_init.sh
02_partition.sh
03_topo.sh
```
This represents the order in which the scripts should be run. We've already run `00_env.sh` to set up the environment. The other 3 scripts all are batch scripts containing different `iqtree2` commands, and we'll be going over them with each one.
We are going to start by running `01_init.sh`. On Oscar as you would run the following command:
``` bash
sbatch 01_init.sh
```
This will run
```
singularity exec ${IMG} iqtree2 -s ${WORKDIR}/data/turtle.fa -B 1000 -T AUTO -pre ${WORKDIR}/results/turtle
```
with data inputs, outputs, singularity image, and various other options as described below:
### Lines and options explained:
- `WORKDIR=/users/aguang/ccv_bootcamp_phylogenetics` sets up a global path as our working directory so we can quickly reference it.
- `-s turtle.fa` indicates input alignment is `turtle.fa`
- `-B 1000` specifies 1000 replicates with ultrafast bootstrap \[Minh et al., 2013\](https://doi.org/10.1093/molbev/mst024)
- `-T AUTO` determines best number of CPU cores to speed up analysis
### What will happen:
1. Select best-fit model using ModelFinder (Kalyaanamoorthy et al., 2017)\[https://doi.org/10.1038/nmeth.4285\]
2. Reconstruct ML tree using IQ-TREE search algorithm (Nguyen et al., 2015).
3. Assess branch supports with ultrafast bootstrap (UFBoot) algorithm (Minh et al., 2013)
Original Felsenstein nonparametric bootstrap can be run with `-b` option instead.
## Questions for initial tree
- Look at report file `turtle.fa.iqtree` using `cat turtle.fa.iqtree | less`
- What is the best-fit model name?
- What are the AIC/AICc/BIC scores of this model and tree?
## Looking at `turtle.iqtree`:
Take a look a `turtle.iqtree`. A few different ways to open on OOD:
- Go to Files tab at top of OOD, navigate to relevant directory, and click on file to get it as text file
- In Terminal tab, can look at using `cat $PATH/turtle.fa.iqtree | less` then space to page through
- In our RStudio session, open using File tab at top, clicking `ccv_bootcamp_phylogenetics` folder, `results` folder, and finally `turtle.iqtree`
## Understanding initial iqtree run
```{r, treeinit}
treeinit <- read.iqtree(file.path(PATH,"results/turtle.treefile"))
treeinit
```
Components of `treeinit` object:
- `@phylo`: phylogenetic tree as represented in ape. Has tip labels, node labels
- `@data`: metadata associated with tree
- There are also other components (you can look for yourself with `attribute(treeinit)` or `str(treeinit`) but generally not necessary to know or have
## ggtree, treeio, tidytree
- `ggtree` is an R package in Bioconductor that extends ggplot2 for visualizing and annotating phylogenetic trees with covariates and associated data
- `treeio` is an R package in Bioconductor that loads in common phylogenetic tree formats and associated data
- `read.iqtree` is command specifically for reading in trees generated from IQTREE. There is also `read.raxml`, `read.beast`, `read.phylip`, etc.
- `tidytree` provides tidy interface for exploring tree data
## Visualizing our first tree
Now that we have our initial tree loaded, we can plot it with `ggtree`. Syntax is very similar to in `ggplot2`, it follows the grammar of graphics structure with a `ggplot()` and multiple `geom` functions as well, the most basic of which is `geom_tree()`:
```{r initviz}
ggplot(treeinit) + geom_tree()
```
For convenience there is a function called `ggtree(x)` which will automatically run `ggplot(x) + geom_tree() + theme_tree()`: (`theme_tree()` sets a black&white theme for the tree as opposed to `ggplot2`'s default theme of grey grid)
```{r initviz_ggtree}
ggtree(treeinit)
```
Other `geoms` that are relevant are:
- `geom_treescale()`: Adds a scale bar
- `theme_tree2()`: Adds a scale on x-axis
- `geom_nodepoint()`: Adds node points
- `geom_tippoint():` Adds tip points
- `geom_tiplab():` Adds tip labels
```{r initviz_geoms1}
p <- ggtree(treeinit)
p + theme_tree2()
```
```{r initviz_geoms2}
p + geom_nodepoint()
```
```{r initviz_geom3}
p + geom_tippoint()
```
```{r initviz_geom4}
p + geom_tiplab()
```
Once we add on the tip labels, it seems our plot is a bit short to display them. `ggtree` is not as refined as `ggplot2`, you often have to add manual expansions of size to the plot for everything to fit, it won't automagically figure it out for you:
```{r initviz_large}
p + geom_tiplab() + xlim(0,0.4)
```
You can figure out what `xlim` should be by running `theme_tree2()` and seeing what the bottom scale already is:
```{r scale}
p + theme_tree2()
```
Since the question is, where are turtles placed relative to crocodiles and birds, we want to add some kind of grouping information to show. This can be accomplished by making a dataframe with the relevant groups for each species in the tree, then adding it to the `ggtree` object using the `%<+%` operator:
```{r addgrouping}
tip_df <- data.frame(label=treeinit@phylo$tip.label, group=c("outgroup","lizard","lizard","snake","bird","bird","crocodile","crocodile","turtle","turtle","turtle","turtle","outgroup","outgroup","outgroup","outgroup"))
p %<+% tip_df + geom_tiplab() + geom_tippoint(aes(color=group)) + xlim(0,0.4)
```
More is described in the [ggtree book](https://yulab-smu.top/treedata-book/chapter7.html), but the `%<+%` operator allows for attaching annotation data to a `ggtree` object, either with a dataframe of tip data or a dataframe of internal node data, or both. The main thing is to have labels for tips or nodes match up. You can also directly modify the `treedata` object if desired.
```{r modifytree}
left_join(treeinit, tip_df)
```
So, based on this analysis, where would you place turtles?
## Looking at bootstrap of clade
```{r bootstrap}
plotinit <- ggtree(treeinit) %<+% tip_df + geom_tiplab(aes(color=group), hjust=-.1) + xlim(0,0.4) + geom_tippoint(aes(color=group)) + geom_text(aes(label=UFboot),hjust=-.3, vjust=.9, size=3) + ggtitle("Initial tree")
plotinit
```
You can also use the `MRCA` function to find the most recent common ancestor of a set of tips and then view the subsetted clade starting at the MRCA, which can be useful if your tree is very large. The `MRCA` function also comes in handy for many other kinds of annotations and analyses as well.
```{r mrca}
viewClade(plotinit,MRCA(treeinit,.node1=c("Gallus","alligator","emys_orbicularis")))
```
There are similar other verbs as well:
- `parent()` gives parent of a single node.
- `child()` gives children of a single node.
- `offspring()` gives all offspring (i.e. children and children of children, etc) of a node
- `ancestor()` gives all ancestors of a node
- `sibling()` gives sibling of a node
## Running partitioned analysis
We go back to terminal temporarily to run our 2nd batch script, `02_partition.sh`. While it runs we will go through more useful components of `ggtree`.
## More ggtree things
### Labeling Clades
Often you will want to label clades as well. This is what `geom_cladelabel` is for.
```{r cladelab}
ggtree(treeinit) + geom_cladelabel(node=22, label="Some random clade", color="red")
```
You can fill in the clade color with `geom_hilight`:
```{r hilight}
ggtree(treeinit) + geom_tiplab() + geom_hilight(node=22, fill="gold")
```
If you want the box to cover all of the text, you will want to provide the `extend` argument to `geom_hilight`:
```{r extend}
ggtree(treeinit) + geom_tiplab() + geom_hilight(node=22, fill="gold", extend=5)
```
### Plotting multiple sequence alignment next to tree
`msaplot` is a nice function that will plot your multiple sequence alignment next to your phylogeny. It takes as base arguments the `ggtree` plot and the path to the multiple sequence alignment. You can set which slice of the alignment to display with the `window` argument as well. So if we run it here:
```{r msaplot}
# offset sets an offset of the alignment to the tree. Without it the msa covers the tip labels
# if we do that we need to set a new xlim
msaplot(plotinit, file.path(PATH, "data/turtle.fa"),offset=.3) + xlim(0,1)
```
Using some regions from the `turtle.nex` file:
```{r msaplot_window}
msaplot(plotinit, file.path(PATH, "data/turtle.fa"),offset=.2, window=c(2041,2772)) + xlim(0,0.8)
```
```{r msaplot_window2}
msaplot(plotinit, file.path(PATH, "data/turtle.fa"),offset=.2, window=c(15445,15963)) + xlim(0,0.8)
```
What can we say about our alignment and tree?
## Analyzing our partitioned tree
What would the commands here be to visualize our partitioned tree?
```{r partition}
treepart <- read.iqtree(file.path(PATH,"results/partition.treefile"))
plotpart <- ggtree(treepart) %<+% tip_df + geom_tiplab(aes(color=group), hjust=-.1) + xlim(0,0.4) + geom_tippoint(aes(color=group)) + geom_text(aes(label=UFboot),hjust=-.3, vjust=.9, size=3) + ggtitle("Partition tree")
plotpart
```
Plotting trees side by side is very simple:
```{r sideplot}
plotpart + plotinit
```
### Questions:
- How has the relationship changed?
- Based on the two analyses, which tree would you go with?
## Topology test (and finding best partition)
Run `03_topo.sh` in the same way as we did before:
```
sbatch scripts/03_topo.sh
```
### Questions:
- What has the lowest log-likelihood from the topology tests?
- Would you be able to choose a tree based off of the SH, KH or AU tests? Why or why not?
## Review and modifications
- We covered repository and data structures for reproducibility and ease of workflow and made use of OOD+containerization
- We went through running IQTREE2 in different ways to try to address aspects of our data and ggtree-verse to visualize and annotate.
- Scripts ordered in this numeric scheme can easily slot into a [nextflow](https://docs.ccv.brown.edu/bootcamp-2024/schuedule/day4) workflow for better logging, file provenance and reproducibility.
## For more information:
- **Data Integration, Manipulation, and Visualization of Phlyogenetic Trees**: <https://yulab-smu.top/treedata-book/>
- **IQTree2 reference**: <http://www.iqtree.org/>
- **Example iqtree and raxml scripts are in `scripts/examples`**
- **Workshop on Molecular Evolution**: <https://www.mbl.edu/education/advanced-research-training-courses/course-offerings/workshop-molecular-evolution>
## Acknowledgments
- This workshop is heavily based off of material from the [IQTREE2 workshop in Sydney 2022](http://www.iqtree.org/workshop/sydney2022) and the [JMU2017 ggtree workshop](https://4va.github.io/biodatasci/r-ggtree.html#advanced_tree_annotation).
- Thank you to Joselynn Wallace, Eric Salomaki and other members of CBC for supporting with this workshop through testing and running through ideas, and also to Paul Hall and George Dang for coordinating the CCV Bootcamp.