-
Notifications
You must be signed in to change notification settings - Fork 0
/
01-starting-with-data.html
388 lines (388 loc) · 29.3 KB
/
01-starting-with-data.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: Programming with R</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<h1 class="title">Programming with R</h1>
<h2 class="subtitle">Analyzing patient data</h2>
<div id="learning-objectives" class="objectives panel panel-warning">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Read tabular data from a file into a program.</li>
<li>Assign values to variables.</li>
<li>Select individual values and subsections from data.</li>
<li>Perform operations on a data frame of data.</li>
<li>Display simple graphs.</li>
</ul>
</div>
</div>
<p>We are studying inflammation in patients who have been given a new treatment for arthritis, and need to analyze the first dozen data sets. The data sets are stored in <a href="reference.html#comma-separated-values-(csv)">comma-separated values</a> (CSV) format. Each row holds the observations for just one patient. Each column holds the inflammation measured in a day, so we have a set of values in succesive days. The first few rows of our first file look like this:</p>
<pre class="output"><code>0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1
</code></pre>
<p>We want to:</p>
<ul>
<li>Load data into memory,</li>
<li>Calculate the average value of inflammation per day across all patients, and</li>
<li>Plot the results.</li>
</ul>
<p>To do all that, we’ll have to learn a little bit about programming.</p>
<h3 id="loading-data">Loading Data</h3>
<p>To load our inflammation data, first we need to tell our computer where is the file that contains the values. We have been told its name is <code>inflammation-01.csv</code>. This is very important in R, if we forget this step we’ll get an error message when trying to read. We can change the current working directory using the function <code>setwd</code>. For this example, we change the path to the working directory where our scripts are stored that is named <code>swc</code>:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setwd</span>(<span class="st">"~/swc"</span>)</code></pre>
<p>Just like in the Unix Shell, we type the command and then press <code>Enter</code> (or <code>return</code>). Alternatively you can change the working directory using the RStudio GUI using the menu option <code>Session</code> -> <code>Set Working Directory</code> -> <code>Choose Directory...</code></p>
<p>We also know that experimental files are located in the directory <code>data</code> inside the working directory. Now we can load the data into R using <code>read.csv</code>:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">read.csv</span>(<span class="dt">file =</span> <span class="st">"data/inflammation-01.csv"</span>, <span class="dt">header =</span> <span class="ot">FALSE</span>)</code></pre>
<p>The expression <code>read.csv(...)</code> is a <a href="reference.html#function-call">function call</a> that asks R to run the function <code>read.csv</code>.</p>
<p><code>read.csv</code> has two <a href="reference.html#argument">arguments</a>: the name of the file we want to read, and whether the first line of the file contains names for the columns of data. The filename needs to be a character string (or <a href="reference.html#string">string</a> for short), so we put it in quotes. Assigning the second argument, <code>header</code>, to be <code>FALSE</code> indicates that the data file does not have column headers. We’ll talk more about the value <code>FALSE</code>, and its converse <code>TRUE</code>, in lesson 04.</p>
<div id="tip" class="callout panel panel-info">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pushpin"></span>Tip</h2>
</div>
<div class="panel-body">
<p><code>read.csv</code> actually has many more arguments that you may find useful when importing your own data in the future. You can learn more about these options in this supplementary <a href="01-supp-read-write-csv.html">lesson</a>.</p>
</div>
</div>
<p>The utility of a function is that it will perform its given action on whatever value is passed to the named argument(s). For example, in this case if we provided the name of a different file to the argument <code>file</code>, <code>read.csv</code> would read it instead. We’ll learn more of the details about functions and their arguments in the next lesson.</p>
<p>Since we didn’t tell it to do anything else with the function’s output, the console will display the full contents of the file <code>inflammation-01.csv</code>. Try it out.</p>
<p><code>read.csv</code> read the file, but we can’t use data unless we assign it to a variable. A variable is just a name for a value, such as <code>x</code>, <code>current_temperature</code>, or <code>subject_id</code>. We can create a new variable simply by assigning a value to it using <code><-</code></p>
<pre class="sourceCode r"><code class="sourceCode r">weight_kg <-<span class="st"> </span><span class="dv">55</span></code></pre>
<p>Once a variable has a value, we can print it by typing the name of the variable and hitting <code>Enter</code> (or <code>return</code>). In general, R will print to the console any object returned by a function or operation <em>unless</em> we assign it to a variable.</p>
<pre class="sourceCode r"><code class="sourceCode r">weight_kg</code></pre>
<pre class="output"><code>[1] 55
</code></pre>
<p>We can do arithmetic with the variable:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># weight in pounds:</span>
<span class="fl">2.2</span> *<span class="st"> </span>weight_kg</code></pre>
<pre class="output"><code>[1] 121
</code></pre>
<div id="tip-1" class="callout panel panel-info">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pushpin"></span>Tip</h2>
</div>
<div class="panel-body">
<p>We can add comments to our code using the <code>#</code> character. It is useful to document our code in this way so that others (and us the next time we read it) have an easier time following what the code is doing.</p>
</div>
</div>
<p>We can also change an object’s value by assigning it a new value:</p>
<pre class="sourceCode r"><code class="sourceCode r">weight_kg <-<span class="st"> </span><span class="fl">57.5</span>
<span class="co"># weight in kilograms is now</span>
weight_kg</code></pre>
<pre class="output"><code>[1] 57.5
</code></pre>
<p>If we imagine the variable as a sticky note with a name written on it, assignment is like putting the sticky note on a particular value:</p>
<p><img src="fig/python-sticky-note-variables-01.svg" alt="Variables as Sticky Notes" /></p>
<p>This means that assigning a value to one object does not change the values of other variables. For example, let’s store the subject’s weight in pounds in a variable:</p>
<pre class="sourceCode r"><code class="sourceCode r">weight_lb <-<span class="st"> </span><span class="fl">2.2</span> *<span class="st"> </span>weight_kg
<span class="co"># weight in kg...</span>
weight_kg</code></pre>
<pre class="output"><code>[1] 57.5
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># ...and in pounds</span>
weight_lb</code></pre>
<pre class="output"><code>[1] 126.5
</code></pre>
<p><img src="fig/python-sticky-note-variables-02.svg" alt="Creating Another Variable" /></p>
<p>and then change <code>weight_kg</code>:</p>
<pre class="sourceCode r"><code class="sourceCode r">weight_kg <-<span class="st"> </span><span class="fl">100.0</span>
<span class="co"># weight in kg now...</span>
weight_kg</code></pre>
<pre class="output"><code>[1] 100
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># ...and weight in pounds still</span>
weight_lb</code></pre>
<pre class="output"><code>[1] 126.5
</code></pre>
<p><img src="fig/python-sticky-note-variables-03.svg" alt="Updating a Variable" /></p>
<p>Since <code>weight_lb</code> doesn’t “remember” where its value came from, it isn’t automatically updated when <code>weight_kg</code> changes. This is different from the way spreadsheets work.</p>
<div id="tip-2" class="callout panel panel-info">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pushpin"></span>Tip</h2>
</div>
<div class="panel-body">
<p>An alternative way to print the value of a variable is to use () around the assignment statement. As an example: <code>(total_weight <- weight_kg + weight_lb)</code>, adds the values of <code>weight_kg</code> and <code>weight_lb</code>, assigns the result to the <code>total_weight</code>, and finally prints the assigned value of the variable <code>total_weight</code>.</p>
</div>
</div>
<p>Now that we know how to assign things to variables, let’s re-run <code>read.csv</code> and save its result:</p>
<pre class="sourceCode r"><code class="sourceCode r">dat <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="dt">file =</span> <span class="st">"data/inflammation-01.csv"</span>, <span class="dt">header =</span> <span class="ot">FALSE</span>)</code></pre>
<p>This statement doesn’t produce any output because assignment doesn’t display anything. If we want to check that our data has been loaded, we can print the variable’s value. However, for large data sets it is convenient to use the function <code>head</code> to display only the first few rows of data.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(dat)</code></pre>
<pre class="output"><code> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1 0 0 1 3 1 2 4 7 8 3 3 3 10 5 7 4 7 7 12 18
2 0 1 2 1 2 1 3 2 2 6 10 11 5 9 4 4 7 16 8 6
3 0 1 1 3 3 2 6 2 5 9 5 7 4 5 4 15 5 11 9 10
4 0 0 2 0 4 2 2 1 6 7 10 7 9 13 8 8 15 10 10 7
5 0 1 1 3 3 1 3 5 2 4 4 7 6 5 3 10 8 10 6 17
6 0 0 1 2 2 4 2 1 6 4 7 6 6 9 9 15 4 16 18 12
V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38
1 6 13 11 11 7 7 4 6 8 8 4 4 5 7 3 4 2 3
2 18 4 12 5 12 7 11 5 11 3 3 5 4 4 5 5 1 1
3 19 14 12 17 7 12 11 7 4 2 10 5 4 2 2 3 2 2
4 17 4 4 7 6 15 6 4 9 11 3 5 6 3 3 4 2 3
5 9 14 9 7 13 9 12 6 7 7 9 6 3 2 2 4 2 0
6 12 5 18 9 5 3 10 3 12 7 8 4 7 3 5 4 4 3
V39 V40
1 0 0
2 0 1
3 1 1
4 2 1
5 1 1
6 2 1
</code></pre>
<div id="challenge---assigning-values-to-variables" class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Challenge - Assigning values to variables</h2>
</div>
<div class="panel-body">
<p>Draw diagrams showing what variables refer to what values after each statement in the following program:</p>
<pre class="sourceCode r"><code class="sourceCode r">mass <-<span class="st"> </span><span class="fl">47.5</span>
age <-<span class="st"> </span><span class="dv">122</span>
mass <-<span class="st"> </span>mass *<span class="st"> </span><span class="fl">2.0</span>
age <-<span class="st"> </span>age -<span class="st"> </span><span class="dv">20</span></code></pre>
</div>
</div>
<h3 id="manipulating-data">Manipulating Data</h3>
<p>Now that our data is loaded in memory, we can start doing things with it. First, let’s ask what type of thing <code>dat</code> is:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">class</span>(dat)</code></pre>
<pre class="output"><code>[1] "data.frame"
</code></pre>
<p>The output tells us that is a data frame. Think of this structure as a spreadsheet in MS Excel that many of us are familiar with. Data frames are very useful for storing data and you will find them elsewhere when programming in R. A typical data frame of experimental data contains individual observations in rows and variables in columns.</p>
<p>We can see the dimensions, or <a href="reference.html#shape-(of-an-array)">shape</a>, of the data frame with the function <code>dim</code>:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">dim</span>(dat)</code></pre>
<pre class="output"><code>[1] 60 40
</code></pre>
<p>This tells us that our data frame, <code>dat</code>, has 60 rows and 40 columns.</p>
<p>If we want to get a single value from the data frame, we can provide an <a href="reference.html#index">index</a> in square brackets, just as we do in math:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># first value in dat</span>
dat[<span class="dv">1</span>, <span class="dv">1</span>]</code></pre>
<pre class="output"><code>[1] 0
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># middle value in dat</span>
dat[<span class="dv">30</span>, <span class="dv">20</span>]</code></pre>
<pre class="output"><code>[1] 16
</code></pre>
<p>An index like <code>[30, 20]</code> selects a single element of a data frame, but we can select whole sections as well. For example, we can select the first ten days (columns) of values for the first four patients (rows) like this:</p>
<pre class="sourceCode r"><code class="sourceCode r">dat[<span class="dv">1</span>:<span class="dv">4</span>, <span class="dv">1</span>:<span class="dv">10</span>]</code></pre>
<pre class="output"><code> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0 0 1 3 1 2 4 7 8 3
2 0 1 2 1 2 1 3 2 2 6
3 0 1 1 3 3 2 6 2 5 9
4 0 0 2 0 4 2 2 1 6 7
</code></pre>
<p>The <a href="reference.html#slice">slice</a> <code>1:4</code> means, “Start at index 1 and go to index 4.”</p>
<p>The slice does not need to start at 1, e.g. the line below selects rows 5 through 10:</p>
<pre class="sourceCode r"><code class="sourceCode r">dat[<span class="dv">5</span>:<span class="dv">10</span>, <span class="dv">1</span>:<span class="dv">10</span>]</code></pre>
<pre class="output"><code> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
5 0 1 1 3 3 1 3 5 2 4
6 0 0 1 2 2 4 2 1 6 4
7 0 0 2 2 4 2 2 5 5 8
8 0 0 1 2 3 1 2 3 5 3
9 0 0 0 3 1 5 6 5 5 8
10 0 1 1 2 1 3 5 3 5 8
</code></pre>
<p>We can use the function <code>c</code>, which stands for <strong>c</strong>ombine, to select non-contiguous values:</p>
<pre class="sourceCode r"><code class="sourceCode r">dat[<span class="kw">c</span>(<span class="dv">3</span>, <span class="dv">8</span>, <span class="dv">37</span>, <span class="dv">56</span>), <span class="kw">c</span>(<span class="dv">10</span>, <span class="dv">14</span>, <span class="dv">29</span>)]</code></pre>
<pre class="output"><code> V10 V14 V29
3 9 5 4
8 3 5 6
37 6 9 10
56 7 11 9
</code></pre>
<p>We also don’t have to provide a slice for either the rows or the columns. If we don’t include a slice for the rows, R returns all the rows; if we don’t include a slice for the columns, R returns all the columns. If we don’t provide a slice for either rows or columns, e.g. <code>dat[, ]</code>, R returns the full data frame.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># All columns from row 5</span>
dat[<span class="dv">5</span>, ]</code></pre>
<pre class="output"><code> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
5 0 1 1 3 3 1 3 5 2 4 4 7 6 5 3 10 8 10 6 17
V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38
5 9 14 9 7 13 9 12 6 7 7 9 6 3 2 2 4 2 0
V39 V40
5 1 1
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># All rows from column 16</span>
dat[, <span class="dv">16</span>]</code></pre>
<pre class="output"><code> [1] 4 4 15 8 10 15 13 9 11 6 3 8 12 3 5 10 11 4 11 13 15 5 14
[24] 13 4 9 13 6 7 6 14 3 15 4 15 11 7 10 15 6 5 6 15 11 15 6
[47] 11 15 14 4 10 15 11 6 13 8 4 13 12 9
</code></pre>
<p>Now let’s perform some common mathematical operations to learn about our inflammation data. When analyzing data we often want to look at partial statistics, such as the maximum value per patient or the average value per day. One way to do this is to select the data we want to create a new temporary data frame, and then perform the calculation on this subset:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># first row, all of the columns</span>
patient_1 <-<span class="st"> </span>dat[<span class="dv">1</span>, ]
<span class="co"># max inflammation for patient 1</span>
<span class="kw">max</span>(patient_1)</code></pre>
<pre class="output"><code>[1] 18
</code></pre>
<p>We don’t actually need to store the row in a variable of its own. Instead, we can combine the selection and the function call:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># max inflammation for patient 2</span>
<span class="kw">max</span>(dat[<span class="dv">2</span>, ])</code></pre>
<pre class="output"><code>[1] 18
</code></pre>
<p>R also has functions for other common calculations, e.g. finding the minimum, mean, median, and standard deviation of the data:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># minimum inflammation on day 7</span>
<span class="kw">min</span>(dat[, <span class="dv">7</span>])</code></pre>
<pre class="output"><code>[1] 1
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># mean inflammation on day 7</span>
<span class="kw">mean</span>(dat[, <span class="dv">7</span>])</code></pre>
<pre class="output"><code>[1] 3.8
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># median inflammation on day 7</span>
<span class="kw">median</span>(dat[, <span class="dv">7</span>])</code></pre>
<pre class="output"><code>[1] 4
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># standard deviation of inflammation on day 7</span>
<span class="kw">sd</span>(dat[, <span class="dv">7</span>])</code></pre>
<pre class="output"><code>[1] 1.725187
</code></pre>
<p>What if we need the maximum inflammation for all patients, or the average for each day? As the diagram below shows, we want to perform the operation across a margin of the data frame:</p>
<p><img src="fig/r-operations-across-axes.svg" alt="Operations Across Axes" /></p>
<p>To support this, we can use the <code>apply</code> function.</p>
<div id="tip-3" class="callout panel panel-info">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pushpin"></span>Tip</h2>
</div>
<div class="panel-body">
<p>To learn about a function in R, e.g. <code>apply</code>, we can read its help documention by running <code>help(apply)</code> or <code>?apply</code>.</p>
</div>
</div>
<p><code>apply</code> allows us to repeat a function on all of the rows (<code>MARGIN = 1</code>) or columns (<code>MARGIN = 2</code>) of a data frame.</p>
<p>Thus, to obtain the average inflammation of each patient we will need to calculate the mean of all of the rows (<code>MARGIN = 1</code>) of the data frame.</p>
<pre class="sourceCode r"><code class="sourceCode r">avg_patient_inflammation <-<span class="st"> </span><span class="kw">apply</span>(dat, <span class="dv">1</span>, mean)</code></pre>
<p>And to obtain the average inflammation of each day we will need to calculate the mean of all of the columns (<code>MARGIN = 2</code>) of the data frame.</p>
<pre class="sourceCode r"><code class="sourceCode r">avg_day_inflammation <-<span class="st"> </span><span class="kw">apply</span>(dat, <span class="dv">2</span>, mean)</code></pre>
<p>Since the second argument to <code>apply</code> is <code>MARGIN</code>, the above command is equivalent to <code>apply(dat, MARGIN = 2, mean)</code>. We’ll learn why this is so in the next lesson.</p>
<div id="tip-4" class="callout panel panel-info">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pushpin"></span>Tip</h2>
</div>
<div class="panel-body">
<p>Some common operations have more efficient alternatives. For example, you can calculate the row-wise or column-wise means with <code>rowMeans</code> and <code>colMeans</code>, respectively.</p>
</div>
</div>
<div id="challenge---slicing-subsetting-data" class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Challenge - Slicing (subsetting) data</h2>
</div>
<div class="panel-body">
<p>A subsection of a data frame is called a <a href="reference.html#slice">slice</a>. We can take slices of character vectors as well:</p>
<pre class="sourceCode r"><code class="sourceCode r">animal <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"m"</span>, <span class="st">"o"</span>, <span class="st">"n"</span>, <span class="st">"k"</span>, <span class="st">"e"</span>, <span class="st">"y"</span>)
<span class="co"># first three characters</span>
animal[<span class="dv">1</span>:<span class="dv">3</span>]</code></pre>
<pre class="output"><code>[1] "m" "o" "n"
</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># last three characters</span>
animal[<span class="dv">4</span>:<span class="dv">6</span>]</code></pre>
<pre class="output"><code>[1] "k" "e" "y"
</code></pre>
<ol style="list-style-type: decimal">
<li><p>If the first four characters are selected using the slice <code>animal[1:4]</code>, how can we obtain the first four characters in reverse order?</p></li>
<li><p>What is <code>animal[-1]</code>? What is <code>animal[-4]</code>? Given those answers, explain what <code>animal[-1:-4]</code> does.</p></li>
<li><p>Use a slice of <code>animal</code> to create a new character vector that spells the word “eon”, i.e. <code>c("e", "o", "n")</code>.</p></li>
</ol>
</div>
</div>
<div id="challenge---subsetting-data-2" class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Challenge - Subsetting data 2</h2>
</div>
<div class="panel-body">
<p>Suppose you want to determine the maximum inflamation for patient 5 across days three to seven. To do this you would extract the relevant slice from the data frame and calculate the maximum value. Which of the following lines of R code gives the correct answer?</p>
<ol style="list-style-type: lower-alpha">
<li><code>max(dat[5, ])</code></li>
<li><code>max(dat[3:7, 5])</code></li>
<li><code>max(dat[5, 3:7])</code></li>
<li><code>max(dat[5, 3, 7])</code></li>
</ol>
</div>
</div>
<h3 id="plotting">Plotting</h3>
<p>The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers,” and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few of R’s plotting features.</p>
<p>Let’s take a look at the average inflammation over time. Recall that we already calculated these values above using <code>apply(dat, 2, mean)</code> and saved them in the variable <code>avg_day_inflammation</code>. Plotting the values is done with the function <code>plot</code>.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">plot</span>(avg_day_inflammation)</code></pre>
<p><img src="fig/01-starting-with-data-plot-avg-inflammation-1.png" title="plot of chunk plot-avg-inflammation" alt="plot of chunk plot-avg-inflammation" style="display: block; margin: auto;" /></p>
<p>Above, we gave the function <code>plot</code> a vector of numbers corresponding to the average inflammation per day across all patients. <code>plot</code> created a scatter plot where the y-axis is the average inflammation level and the x-axis is the order, or index, of the values in the vector, which in this case correspond to the 40 days of treatment. The result is roughly a linear rise and fall, which is suspicious: based on other studies, we expect a sharper rise and slower fall. Let’s have a look at two other statistics: the maximum and minimum inflammation per day.</p>
<pre class="sourceCode r"><code class="sourceCode r">max_day_inflammation <-<span class="st"> </span><span class="kw">apply</span>(dat, <span class="dv">2</span>, max)
<span class="kw">plot</span>(max_day_inflammation)</code></pre>
<p><img src="fig/01-starting-with-data-plot-max-inflammation-1.png" title="plot of chunk plot-max-inflammation" alt="plot of chunk plot-max-inflammation" style="display: block; margin: auto;" /></p>
<pre class="sourceCode r"><code class="sourceCode r">min_day_inflammation <-<span class="st"> </span><span class="kw">apply</span>(dat, <span class="dv">2</span>, min)
<span class="kw">plot</span>(min_day_inflammation)</code></pre>
<p><img src="fig/01-starting-with-data-plot-min-inflammation-1.png" title="plot of chunk plot-min-inflammation" alt="plot of chunk plot-min-inflammation" style="display: block; margin: auto;" /></p>
<p>The maximum value rises and falls perfectly smoothly, while the minimum seems to be a step function. Neither result seems particularly likely, so either there’s a mistake in our calculations or something is wrong with our data.</p>
<div id="challenge---plotting-data" class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Challenge - Plotting data</h2>
</div>
<div class="panel-body">
<p>Create a plot showing the standard deviation of the inflammation data for each day across all patients.</p>
</div>
</div>
<div id="key-points" class="callout panel panel-info">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pushpin"></span>Key Points</h2>
</div>
<div class="panel-body">
<ul>
<li>Use <code>variable <- value</code> to assign a value to a variable in order to record it in memory.</li>
<li>Objects are created on demand whenever a value is assigned to them.</li>
<li>The function <code>dim</code> gives the dimensions of a data frame.</li>
<li>Use <code>object[x, y]</code> to select a single element from a data frame.</li>
<li>Use <code>from:to</code> to specify a sequence that includes the indices from <code>from</code> to <code>to</code>.</li>
<li>All the indexing and slicing that works on data frames also works on vectors.</li>
<li>Use <code>#</code> to add comments to programs.</li>
<li>Use <code>mean</code>, <code>max</code>, <code>min</code> and <code>sd</code> to calculate simple statistics.</li>
<li>Use <code>apply</code> to calculate statistics across the rows or columns of a data frame.</li>
<li>Use <code>plot</code> to create simple visualizations.</li>
</ul>
</div>
</div>
<div id="next-steps" class="callout panel panel-info">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pushpin"></span>Next Steps</h2>
</div>
<div class="panel-body">
<p>Our work so far has convinced us that something’s wrong with our first data file. We would like to check the other 11 the same way, but typing in the same commands repeatedly is tedious and error-prone. Since computers don’t get bored (that we know of), we should create a way to do a complete analysis with a single command, and then figure out how to repeat that step once for each file. These operations are the subjects of the next two lessons.</p>
</div>
</div>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/r-novice-inflammation">Source</a>
<a class="label swc-blue-bg" href="mailto:[email protected]">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>