-
Notifications
You must be signed in to change notification settings - Fork 1
/
Logolas.Rmd
293 lines (197 loc) · 12.2 KB
/
Logolas.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
---
title: "Guided Logolas Tutorial"
author: "Kushal Dey, Dongyue Xie, Matthew Stephens"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: true
vignette: >
%\VignetteIndexEntry{Guided Logolas Tutorial}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
We present an elaborate guided tutorial of how to use the **Logolas**
R package.
```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(dpi = 90,dev = "png")
```
## Features of Logolas
**Logolas** offers two new features that makes logo visualization a more generic tool with
potential applications in a much wider scope of problems.
- **Enrichment Depletion Logo (EDLogo)** : General logo plotting softwares (*seqLogo*)
primarily highlight only enrichment of symbols at each position of the logo, but
**Logolas** allows the user to highlight both enrichment and depletion of symbols
at any position, leading to more parsimonious and visually appealing representation.
- **String symbols** : General logo plot softwares have limited library of symbols, typically
restricted to English alphabets. **Logolas** allows the user to plot symbols for
any alphanumeric string, and also allows for combination of strings and charcaters.
## Installation
The most recent version of Logolas is available from Github using [devtools](http://www.r-pkg.org/pkg/devtools) R package.First, you would
require to install the following Bioconductor packages.
```{r install_logolas_bio, eval=FALSE}
source("https://bioconductor.org/biocLite.R")
biocLite(c("Biostrings","BiocStyle","Biobase","seqLogo","ggseqlogo"))
```
Then install Logolas as follows
```{r install_logolas, eval=FALSE}
library(devtools)
install_github("kkdey/Logolas",build_vignettes = TRUE)
```
Once you have installed the package, load the package in R by entering
```{r load_logolas_bio, eval=TRUE}
library(Logolas)
```
To get an overview of the package, enter
```{r help_logolas, eval=FALSE}
help(package = "Logolas")
```
## Accepted Data Types
**Logolas** accepts two data formats as input
- a vector of aligned character sequences (may be DNA, RNA or amino acid sequences), each of same length (see Example 1 below)
- a positional frequency (weight) matrix, termed PFM (PWM), with the symbols to be plotted along the rows and the positions of aligned sequences, from which the matrix is generated, along the columns. (see Example 2)
## Illustration of Logolas
### String set input
Consider aligned strings of characters
```{r seq, cache=FALSE, eval=TRUE,warning=FALSE}
sequence <- c("CTATTGT", "CTCTTAT", "CTATTAA", "CTATTTA", "CTATTAT", "CTTGAAT",
"CTTAGAT", "CTATTAA", "CTATTTA", "CTATTAT", "CTTTTAT", "CTATAGT",
"CTATTTT", "CTTATAT", "CTATATT", "CTCATTT", "CTTATTT", "CAATAGT",
"CATTTGA", "CTCTTAT", "CTATTAT", "CTTTTAT", "CTATAAT", "CTTAGGT",
"CTATTGT", "CTCATGT", "CTATAGT", "CTCGTTA", "CTAGAAT", "CAATGGT")
```
The standard logo plot, akin to **seqLogo** R package can be plotted using the
\textbf{logomaker()} function with `type=Logo`.
```{r message = FALSE, warning = FALSE, fig.height=4, fig.width= 8, fig.align='center'}
logomaker(sequence, type = "Logo")
```
To plot the EDLogo plot for the same example, use the `type=EDLogo` option instead.
```{r message = FALSE, warning = FALSE, fig.height=4, fig.width= 8, fig.align='center'}
logomaker(sequence, type = "EDLogo")
```
Instead of DNA.RNA sequence as above, one can also use amino acid character sequences.
```{r message = FALSE, warning = FALSE, fig.height=4, fig.width= 8, fig.align='center'}
library(ggseqlogo)
data(ggseqlogo_sample)
sequence <- seqs_aa$AKT1
logomaker(sequence, type = "EDLogo")
```
### Positional Frequency (Weight) Matrix
We now see an example of positional weight matrix (PWM) as input to **logomaker()**.
```{r}
data("seqlogo_example")
seqlogo_example
```
We plot the logo plots for this PWM matrix.
```{r message = FALSE, warning=FALSE, fig.height=4, fig.width= 8, fig.align='center'}
logomaker(seqlogo_example, type = "Logo", return_heights = TRUE)
logomaker(seqlogo_example, type = "EDLogo", return_heights = TRUE)
```
The `r return_heights = TRUE ` outputs the information content at each position for the standard logo plot (type = "Logo") and the heights of the stacks along the positive and negative Y axis, along
with the breakdown of the height due to different characters for the EDLogo plot (type = "EDLogo").
### Coloring schemes
The **logomaker()** function provides three arguments to set the colors for the logos, a **color_type** specifying the scheme of coloring used,
**colors** denoting the cohort of colors used and a **color_seed** argument determining how sampling is done from this cohort.
The **color_type** argument can be of three types, `per_row`, `per_column` and `per_symbol`.
`colors` element is a cohort/library of colors (chosen suitably large) from which distinct colors are chosen based on distinct `color_type`. The number of colors chosen is of same length as number of rows in table for `per_row` (assigning a color to each string), of same length as number of columns in table for `per_column` (assuming a color for each column), or a distinct color for a distinct symbol in `per_symbol`. The length of `colors` must be as large as the number of colors to be chosen in each scenario.
The default `color_type` is `per-row` and default `colors` comprises of a large cohort of nearly 70 distinct colors from which colors are sampled using the `color_seed` argument. The user can ignore the `colors` option unless he is not happy with the set of colors used, and play around with the `color_seed` instead and choose the one that presents a set of colors of his liking.
An example of using default `colors` and just specifying `color_seed`.
```{r message = FALSE, warning=FALSE, fig.height=4, fig.width= 8, fig.align='center'}
logomaker(seqlogo_example, type = "EDLogo", color_seed = 2000)
```
An alternative example of using an user-specified `colors`.
```{r message = FALSE, warning=FALSE, fig.height=4, fig.width= 8, fig.align='center'}
logomaker(seqlogo_example, color_type = "per_row",
colors = c("#7FC97F", "#BEAED4", "#FDC086", "#386CB0"),
type = "EDLogo")
```
### Styles of symbols
Besides the default style with filled symbols for each character, one can also use
characters with border styles. For the standard logo plot, this is accomplished by the
`tofill` control argument.
```{r message = FALSE, warning=FALSE, fig.height=4, fig.width= 8, fig.align='center'}
logomaker(seqlogo_example, type = "Logo",
logo_control = list(control = list(tofill= FALSE)), color_seed = 4000)
```
For an EDLogo plot, the arguments `tofill_pos` and `tofill_neg` represent the coloring scheme for the positive and the negative axes in an EDLogo plot.
```{r message = FALSE, warning=FALSE, fig.height=4, fig.width= 8, fig.align='center'}
logomaker(seqlogo_example, type = "EDLogo",
logo_control = list(control = list(tofill_pos = TRUE,
tofill_neg = FALSE)))
```
### Background Info
**Logolas** allows the user to scale the data based on a specified background information - specified
through the input `bg`. The default value is NULL, in which case equal probability is assigned to each symbol.
Alternatively, the user can specify a vector (equal to in length to the number of symbols) which specifies the background probability for each symbol and assumes this same background probability
for each position (column of PWM matrix). See examplw below.
```{r message = FALSE, warning=FALSE, fig.height=4, fig.width= 8, fig.align='center'}
bg <- c(0.05, 0.90, 0.03, 0.05)
names(bg) <- c("A", "C", "G", "T")
logomaker(seqlogo_example, bg=bg, type = "EDLogo")
```
As a second alternative, the user can specify `bg` as a matrix of same dimension as the positional
freuqncy matrix underlying the input data, which is flexible in having separate background
probabilities for each position.
```{r message = FALSE, warning=FALSE, fig.height=4, fig.width= 8, fig.align='center'}
logomaker(seqlogo_example, bg=(seqlogo_example+1e-02), type = "EDLogo")
```
### String symbols
**Logolas** allows the user to plot symbols not just for characters as in previous examples,
but for any alphanumeric string. We present two such examples below.
In the first example, demonstrating the relative frequencies of different mutation
signatures (mismatch type at the center and bases flanking it), we see how Logolas
handles data with a mix of strings and characters, with flanking positions being purely
characters and the center position being purely string based (data from [Shiraishi et al 2015](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005657)).
```{r message = FALSE, warning=FALSE, fig.height=4, fig.width= 8, fig.align='center'}
data("mutation_sig")
logomaker(mutation_sig, type = "Logo", color_seed = 2000)
```
The user may want to have distinct colors for distinct symbols.characters in this case
than a string. This is where we make use of the `per_symbol` option for **color_type**.
```{r message = FALSE, warning=FALSE, fig.height=4, fig.width= 8, fig.align='center'}
logomaker(mutation_sig, type = "EDLogo", color_type = "per_symbol", color_seed = 2000)
```
In the second example, we use EDLogo plots to highlight enrichment and depletion of
histone marks in different regions of the genome, with respect to a specific
background `bg`(data from [Koch et al 2007](https://www.ncbi.nlm.nih.gov/pubmed/17567990)).
```{r message = FALSE, warning=FALSE, fig.height=4, fig.width= 8, fig.align='center'}
data("histone_marks")
logomaker(histone_marks$mat, bg=histone_marks$bgmat, type = "EDLogo")
```
## Extras
<!-- **Logolas** provides a new nomenclature to geneerate consensus sequence from a positional frequency (weight) matrix or from a vector of aligned sequences. This is performed by the **GetConsensusSeq()** function.
```{r echo=FALSE, eval=FALSE, message = FALSE, warning=FALSE, fig.height=5, fig.width= 6, fig.align='center'}
sequence <- c("CTATTGT", "CTCTTAT", "CTATTAA", "CTATTTA", "CTATTAT", "CTTGAAT",
"CTTAGAT", "CTATTAA", "CTATTTA", "CTATTAT")
GetConsensusSeq(sequence)
```
In the sequence, a position represented by (Ag) would mean enrichment in A and depletion in G at that position.
One can input a PWM or PFM matrix with A, C, G and T as row names in the **GetConsensusSeq()** function as well. -->
### Multiple panels plots
**Logolas** plots can be plotted in multiple panels, as depicted below.
```{r message = FALSE, warning=FALSE, fig.height=4, fig.width= 8, fig.align='center'}
sequence <- c("CTATTGT", "CTCTTAT", "CTATTAA", "CTATTTA", "CTATTAT", "CTTGAAT",
"CTTAGAT", "CTATTAA", "CTATTTA", "CTATTAT")
Logolas::get_viewport_logo(1, 2, heights.val = 20)
library(grid)
seekViewport(paste0("plotlogo", 1))
logomaker(sequence, type = "Logo", logo_control = list(newpage = FALSE))
seekViewport(paste0("plotlogo", 2))
logomaker(sequence, type = "EDLogo", logo_control = list(newpage = FALSE))
```
In the same way, ggplot2 graphics can also be combined with **Logolas** plots.
### PSSM logos
While **logomaker()** takes a PFM, PWM or a set of aligned sequences as input, sometimes, some position specific scores are only available to the user. In this case, one can use the **logo_pssm()** in **Logolas** to plot the scoring matrix.
```{r message = FALSE, warning=FALSE, fig.height=4, fig.width= 8, fig.align='center'}
data(pssm)
logo_pssm(pssm, control = list(round_off = 0))
```
The `round_off` comtrol argument specifies the number of points after decimal allowed in the axes of the plot.
## Acknowledgements
The authors would like to acknowledge Oliver Bembom, the author of `seqLogo` for acting as an inspiration and providing the foundation on which this package is created. We also thank Peter Carbonetto, Edward Wallace and John Blischak for helpful feedback and discussions.
\section{Session Info}
```{r}
sessionInfo()
```
----------------------------------------------------------------------------
**Thank you for using Logolas !**
If you have any questions, you can either open an issue in our [Github page](https://github.com/kkdey/Logolas) or write to Kushal K Dey ([email protected]). Also please feel free to contribute to the package. You can contribute by submitting a pull request or by communicating with the said person.