\section{Swirl Review Questions}
\subsection{Lesson 1}
\begin{enumerate}
\item What is an example of an `unsupervised learning' problem?
\begin{enumerate}
\item Finding topics in a set of newspaper articles
\item Discovering ideological differences among legislators
\item Identifying commonly used words in an author's corpus
\item All of these
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
All of these
@
\fi
\item How did unsupervised learning enable researchers to address the disputed authorship of the Federalist Papers?
\begin{enumerate}
\item Hamilton and Madison preferred to discuss different topics
\item Hamilton and Madison favored different words
\item Madison left coded messages in his prose
\item All of these
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
Hamilton and Madison favored different words
@
\fi
\item What kind of information does a document-term matrix like \rexpr{dtm} contain?
\begin{enumerate}
\item Term frequencies across a set of documents
\item The number of times a word is used in a document
\item The number of documents that contain a word at least once
\item All of these
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
All of these
@
\fi
\item Why does working with a document-term matrix require us to make the `bag-of-words' assumption?
\begin{enumerate}
\item A document-term matrix says nothing about grammar or order of words
\item A document-term matrix contains a lot of words
\item A document-term matrix sorts similar words into bags
\item None of these
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
A document-term matrix says nothing about grammar or order of words
@
\fi
\item Chapter 5 also introduces several important \R{} extensions, or packages. Use the \rexpr{\rfun{install.packages()}} function to install the \rexpr{wordcloud} package.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
install.packages("wordcloud")
@
\fi
\item Now call the \rexpr{\rfun{library()}} function to load the \rexpr{wordcloud} package.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
library(wordcloud)
@
\fi
\item Use the \rexpr{\rfun{inspect()}} function to display the first 5 rows and 8 columns of \rexpr{dtm}.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
inspect(dtm[1:5, 1:8])
@
\fi
\item Currently, \rexpr{dtm} belongs to a special \R{} class called \rexpr{DocumentTermMatrix}. This object class is not easily manipulated in \R. Using the \rfun{as.matrix()} function, coerce \rexpr{dtm} to a matrix object in \R{} called \rexpr{dtm.mat}.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
dtm.mat <- as.matrix(dtm)
@
\fi
\item Finally, use the \rfun{wordcloud()} function to visualize the information contained in the eighth document of \rexpr{dtm.mat} only.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
wordcloud(colnames(dtm.mat), dtm.mat[8, ])
@
\fi
\item Which of these do you think best describes the topic of the eighth Federalist Paper?
\begin{enumerate}
\item The costs and benefits of standing militia
\item Trade between the colonies
\item The universal rights of man
\item The abolition of slavery
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
The costs and benefits of standing militia
@
\fi
\end{enumerate}
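
\noindent As a rough illustration of the bag-of-words idea from this lesson, the sketch below builds a tiny document-term matrix by hand in base \R. The two documents and the object names (\rexpr{docs}, \rexpr{vocab}, \rexpr{toy.dtm}) are made up for this example and are not part of the Federalist Papers data; a real \rexpr{DocumentTermMatrix} would normally be produced by a text-mining package rather than this way.
<<eval=FALSE>>=
## Toy corpus (hypothetical; not the Federalist Papers data)
docs <- c(doc1 = "the army defends the union",
          doc2 = "the union defends the army")
## Split each document into lowercase words
words <- strsplit(tolower(docs), " ", fixed = TRUE)
## Shared vocabulary, sorted alphabetically
vocab <- sort(unique(unlist(words)))
## Count each vocabulary term per document:
## one row per document, one column per term
toy.dtm <- t(sapply(words, function(w) table(factor(w, levels = vocab))))
toy.dtm
## The documents order their words differently,
## but their rows of counts can be compared directly
identical(toy.dtm["doc1", ], toy.dtm["doc2", ])
@
Each row holds only term counts, so any information about word order is lost once the text is in this form. A single row of such a matrix is the kind of frequency vector that \rfun{wordcloud()} takes as its second argument.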
\subsection{Lesson 2}
\begin{enumerate}
\item What are some examples of networks?
\begin{enumerate}
\item marriages between families
\item international trade flows
\item friendships on Facebook
\item all of these
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
all of these
@
\fi
\item A node represents \_\_\_\_\_\_\_\_\_\_
\begin{enumerate}
\item an individual unit
\item a group of units
\item all the units
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
an individual unit
@
\fi
\item An edge represents the \_\_\_\_\_\_\_\_\_\_
\begin{enumerate}
\item existence of a relationship between any pair of nodes
\item lack of a relationship between any pair of nodes
\item nodes that are the same
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
existence of a relationship between any pair of nodes
@
\fi
\item Verify that \rexpr{florence} is a square adjacency matrix using the \rfun{dim()} function.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
dim(florence)
@
\fi
\item Now, use indexing to have \R{} output the adjacency (sub)matrix for the first 5 families only.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
florence[1:5, 1:5]
@
\fi
\item Is \rexpr{florence} an example of directed or undirected network data?
\begin{enumerate}
\item florence is undirected
\item florence is directed
\item we cannot tell yet
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
florence is undirected
@
\fi
\item There are two steps to plotting the network graph of \rexpr{florence}. First, use the \rfun{graph.adjacency()} function to produce an \rexpr{igraph} object called \rexpr{florence.graph}. Be sure to specify that the adjacency matrix is undirected and that there are no marriages within families.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
florence.graph <- graph.adjacency(florence, mode = "undirected", diag = FALSE)
@
\fi
\item Now use the \rexpr{\rfun{plot()}} function to visualize the marriage network described by \rexpr{florence.graph}.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
plot(florence.graph)
@
\fi
\item We can quantify each family's place in the network using a measure of centrality. One common measure of centrality is known as `betweenness'. Which of these statements best describes betweenness?
\begin{enumerate}
\item betweenness is the proportion of shortest paths between two other nodes that contain it
\item A node's betweenness is the number of nodes that are immediately connected to it
\item A node's betweenness is a measure of how close it is to other nodes
\item None of these
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
betweenness is the proportion of shortest paths between two other nodes that contain it
@
\fi
\item Compute the betweenness of each node in \rexpr{florence} and store the result as an object called \rexpr{between}.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
between <- betweenness(florence.graph)
@
\fi
\item Now, use the \rexpr{\rfun{sort()}} function to output a vector that starts with the family with highest betweenness and ends with the family with lowest betweenness.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
sort(between, decreasing = TRUE)
@
\fi
\item Verify this by using \rexpr{\rfun{order()}} and indexing to output the same vector from the previous question.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
between[order(between, decreasing = TRUE)]
@
\fi
\item Based on what you find, which of the elite Florentine families was most central in the marriage network?
\begin{enumerate}
\item Medici
\item Ridolfi
\item Bischeri
\item Strozzi
\item Pucci
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
Medici
@
\fi
\end{enumerate}
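
\noindent The centrality steps in this lesson can be tried on a smaller network first. The sketch below assumes the \rexpr{igraph} package is installed; the four-node adjacency matrix \rexpr{toy}, its node names, and the objects \rexpr{toy.graph} and \rexpr{toy.between} are hypothetical and stand in for \rexpr{florence}.
<<eval=FALSE>>=
library(igraph)
## Hypothetical 4-node network: B is tied to A, C, and D; no other ties
toy <- matrix(0, nrow = 4, ncol = 4,
              dimnames = list(LETTERS[1:4], LETTERS[1:4]))
toy["B", c("A", "C", "D")] <- 1
toy[c("A", "C", "D"), "B"] <- 1
## An undirected network has a symmetric adjacency matrix
isSymmetric(toy)
## Same call as in the lesson; newer igraph releases also offer
## graph_from_adjacency_matrix(), which takes the same arguments
toy.graph <- graph.adjacency(toy, mode = "undirected", diag = FALSE)
toy.between <- betweenness(toy.graph)
## sort() and order() with indexing should give the same ranking
sort(toy.between, decreasing = TRUE)
toy.between[order(toy.between, decreasing = TRUE)]
@
Every shortest path between two of the outer nodes passes through B, so B should come first in both rankings, much as the most central family does in the marriage network.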
\subsection{Lesson 3}
\begin{enumerate}
\item Maps can help us \_\_\_\_\_\_\_ spatial patterns.
\begin{enumerate}
\item visualize
\item disregard
\item invent
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
visualize
@
\fi
\item John Snow used \_\_\_\_\_\_\_\_\_ to identify the cause of the 1854 cholera epidemic.
\begin{enumerate}
\item a natural experiment
\item a randomized control trial
\item observational data
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
a natural experiment
@
\fi
\item The hexadecimal color code is a sequence of six characters preceded by a pound sign. Each pair of characters represents the intensity of one of the colors \_\_\_\_\_\_\_, \_\_\_\_\_\_, and \_\_\_\_\_\_.
\begin{enumerate}
\item red, green and blue
\item red, yellow and blue
\item red, white and blue
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
red, green and blue
@
\fi
\item Look at the map of the United States 2008 presidential election results. Which feature did we use to help visualize the degree of support for Democrats and Republicans?
\begin{enumerate}
\item hue
\item shade
\item transparency
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
transparency
@
\fi
\item Using color, transparency and size helps us to visualize the Walmart expansion. Which of these best describes the location of Walmart Supercenters in the U.S.?
\begin{enumerate}
\item Supercenters equally distributed across the U.S.
\item Supercenters only on the coasts
\item Supercenters mainly in the Midwest and South
\end{enumerate}
\if1\solutions
\noindent{\bf Solution:}
<<eval=FALSE>>=
Supercenters mainly in the Midwest and South
@
\fi
\item Use the \rexpr{\rfun{map()}} function to draw a map of the United States.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
map(database = "usa")
@
\fi
\item Using the \rexpr{\rfun{subset()}} function, create a \rexpr{data.frame} object called \rexpr{lrgcities} which only contains cities with populations greater than 100,000. The variable for population in the \rexpr{us.cities} dataset is called \rexpr{pop}.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
lrgcities <- subset(us.cities, pop > 100000)
@
\fi
\item Next, we want to save the USA database as a list and not a plot. Save it to an object called \rexpr{usa}.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
usa <- map(database = "usa", plot = FALSE)
@
\fi
\item Call the \rexpr{\rfun{map()}} function with the database set to \rexpr{state}, regions to \rexpr{New Jersey} and plot to \rexpr{FALSE}, and save the output to an object called \rexpr{nj}.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
nj <- map(database = "state", regions = "New Jersey", plot = FALSE)
@
\fi
\item Use the \rexpr{\rfun{rgb()}} function to assign the hexadecimal code for the color blue to the object \rexpr{blue}.
\if1\solutions
\newline\newline \noindent{\bf Solution:}
<<eval=FALSE>>=
blue <- rgb(red = 0, green = 0, blue = 1)
@
\fi
\end{enumerate}
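
\noindent To connect the \rfun{rgb()} exercise with the earlier questions on hexadecimal codes and transparency, the following sketch uses only base \R{} color functions. The object names \rexpr{purple} and \rexpr{blue.half} are illustrative and do not come from the lesson.
<<eval=FALSE>>=
## Pure blue: full intensity in the blue channel only
blue <- rgb(red = 0, green = 0, blue = 1)
blue                       # "#0000FF"
## col2rgb() goes the other way, returning 0-255 intensities
col2rgb(blue)
## An even mix of red and blue gives a purple
purple <- rgb(red = 0.5, green = 0, blue = 0.5)
purple
## alpha adds two more hex digits for transparency (1 = opaque)
blue.half <- rgb(red = 0, green = 0, blue = 1, alpha = 0.5)
nchar(blue.half)           # 9: pound sign plus eight hex digits
@
A semi-transparent color such as \rexpr{blue.half} can be passed as the \rexpr{col} argument of a plotting function so that overlapping points on a map remain visible.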