-
Notifications
You must be signed in to change notification settings - Fork 72
/
cm002.Rmd
295 lines (192 loc) · 13.1 KB
/
cm002.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
# Introduction to R
Today, we'll get you up to speed with a minimum "need to know" about using R and RStudio. We're going to assume you know nothing, but aren't covering the breadth of the R/RStudio landscape.
The format of today's notes aim to teach R by exploration, so is essentially an activity guide with prompts for exploration. These are mostly all exercises we'll be doing together in class.
To participate in today's lecture, you should have:
- R and RStudio installed
- A [participation repo](https://stat545.stat.ubc.ca/evaluation/participation/#setup) on GitHub to put in-class work.
__Announcements__: (10 min)
- Assignment 1 is launched!
- Class meeting schedule is fixed
## Learning Objectives
By the end of today's class, students are expected to be able to:
- Write an R script to perform simple calculations
- Access the R documentation on an as-needed basis
- Use functions and operators in R
- Subset vectors in R
- Explore a data frame in R
- Load packages in R
## Participation
Start a new R script in RStudio, and add your exploratory code to the script as we work through the exercises. What you write on this script doesn't have to be exactly the same as what I write -- we're just looking for some exploration of coding in R.
## Resources
Here are some useful resources for getting oriented with R.
- Jenny's [stat545.com: hello r](http://stat545.com/block002_hello-r-workspace-wd-project.html) page for exploring R roughly follows today's outline.
- Want to practice programming in R? Check out [R Swirl](https://swirlstats.com/) for interactive lessons.
- For a list of R "vocabulary", see [Advanced R - Vocabulary](http://adv-r.had.co.nz/Vocabulary.html); for a list of R operators, see [Quick-R](https://www.statmethods.net/management/operators.html).
Today, we'll be learning just enough base R so that we can dive in to the tidyverse side of R. If you want to learn even more about base R, take a look at [Mike Marin's R playlist on YouTube](https://www.youtube.com/playlist?list=PLqzoL9-eJTNBlVXxWvJkq0dtVut2sICUW).
## Why R?
Why R? Some points taken from [adv-r: intro](http://adv-r.had.co.nz/Introduction.html):
- Free, platform-wide
- Open source
- Comprehensive set of "add on" packages for analysis
- Huge community
- ...
Alternatives exist for data analysis, python being another excellent tool, especially these days as it seems like more and more R-like functionality is added to it. The good thing about python is that it's faster and has better support for machine learning models. For the sake of streamlining, both STAT 545A and STAT 547M only focus on R.
## Orientation to R
### Using R and RStudio (5 min)
Let's try these exercises as our first steps.
1. Try some arithmetic from a script vs. the console.
- Notice that your commands appear in the "History" tab. Do not rely on this! What do you think is better than relying on the history?
2. Store a number in a variable called `number` using `<-` (read this arrow as "gets").
- Notice that the object appears in the "Environment" tab in the top-right of RStudio.
3. Try some arithmetic on the variable.
4. Try some arithmetic on an undefined variable.
5. Try some arithmetic on the variable on a line of code above the variable definition (do you think we'll get an error?)
### Vectors (3 min)
_Vectors_ store multiple entries of a data type, like numbers. You'll discover that they show up just about everywhere in R.
Let's collect some data and store this in a vector called `times`. How long was your commute this morning, in minutes? Here's starter code:
```
times <- c()
```
Operations happen component-wise. Let's calculate those times in hours. How can we "save" the results?
### Functions, Part I (3 min)
What's the average travel time? Instead of computing this manually, let's use a _function_ called `mean`. Notice the syntax of using a function: the _input_ goes inside brackets, which is followed by the function name to the left.
We _input_ `times`, and got some _output_. Did this function change the input? Aside from some bizarre functions, this is always the case.
Functions don't always return a single value. Try the `range()` function, for example. What's the output? What about the `sqrt()` function?
Much of R is about becoming familiar with R's "vocabulary". A nice list can be found in [Advanced R - Vocabulary](http://adv-r.had.co.nz/Vocabulary.html).
### Comparisons (7 min)
We'll now introduce _logicals_.
Which of our travel times are less than (say) 30 minutes? Use `<`.
Which of our travel times are equal to ... (pick something)? What about _not_ equal to it? Notice the use of `==` as opposed to `=` -- why do you think that is?
Which of our travel times are greater than ...(lower)... _and_ less than ...(upper)...? What about less than ...(lower)... _or_ greater than ...(upper)...?
Some functions expect logical inputs. Try using the `which()` function on one of the above. What about `any()`? `all()`?
Logicals can be explicitly specified in R with `TRUE` and `FALSE`.
### Subsetting (10 min)
Use `[]` to subset the vector of times:
1. Extract the third entry.
2. Extract everything except the third entry.
3. Extract the second and fourth entry. The fourth and second entry.
4. Extract the second through fifth entry -- make use of `:` to construct sequential vectors.
4. Extract all entries that are less than 30 minutes. Why does this work? Logical subsetting!
After all of that, did our `times` object change at all?
We can use `[]` in conjunction with `<-` to change the `times` object:
1. Replace two entries with new travel times.
2. "Cap" entries that are "too large" at some set value. If this is more than one value, why don't we need to match the number of values? Recycling!
3. Remove an entry, by overwriting `times`.
### NA (2 min)
Sometimes we have missing data. Those entries are replaced with `NA` in R. Be careful with these!
1. Add `NA` to the vector of times.
2. What's the mean of this new vector of times?
Let's expand our view of functions in order to solve this problem.
### Functions, Part II (10 min)
Functions often take more than one _arguments_ as input, separated by commas. You can find out what these arguments are by accessing the function's _documentation_:
Access the documentation of the `mean()` function by executing `?mean`.
- There are four arguments.
- All the arguments have names, except for the `...` argument (more on `...` later). This is always the case.
- Under "Usage", some of the arguments are of the form `name = value`.
- These are default values, in case you don't specify these arguments.
- This is a sure sign that these arguments are _optional_.
- `x` is "on its own". This typically means that it has no default, and often (but not always) means that the argument is _required_.
We can specify an argument in one of two ways:
- specifying `argument name = value` in the function parentheses; or
- matching the ordering of the input with the ordering of the arguments.
- For readability, this is not recommended beyond the first or sometimes second argument!
Input `TRUE` for the `na.rm` argument in both ways.
### Data frames (12 min)
Living in a vector-only world would be nice if all data analyses involved one variable. When we have more than one variable, _data frames_ come to the rescue. Basically, a data frame holds data in tabular format.
R has some data frames "built in". For example, motor car data is attached to the variable name `mtcars`.
Print `mtcars` to screen. Notice the tabular format.
__Your turn__ (5 min): Finish the exercises of this section:
1. Use some of these built-in R functions to explore `mtcars`, without printing the whole thing to screen:
- `head()`, `tail()`, `str()`, `nrow()`, `ncol()`, `summary()`, `row.names()` (yuck), `names()`.
Notice that `names` and `row.names()` outputs a _character vector_ (we've already seen numeric and logical vectors). These are useful for characterizing categorical data in R.
2. What's the first column name in the `mtcars` dataset?
3. Which column number is named `"wt"`?
Each column is its own vector that can be extracted using `$`. For example, we can extract the `cyl` column with `mtcars$cyl`.
4. Extract the vector of `mpg` values. What's the mean `mpg` of all cars in the dataset?
### R packages (13 min)
Usually, the suite of functions that "come with" R are just not enough to do an analysis.
Usually, the suite of functions that "come with" R are just not very convenient.
In come R _packages_ to the rescue. These are "add ons", each coming with their own suite of functions and objects, usually designed to do one type of task. [CRAN](https://cran.r-project.org/) stores packages that, for all intents and purposes, can be considered "official" R packages. It's easy to install packages from CRAN! Just use the `install.packages()` function.
Run the following lines of code to install the `tibble` and `gapminder` packages. (But don't include this in your scripts -- it's not very nice to others!)
```
install.packages("tibble")
install.packages("gapminder")
```
- `tibble`: a data frame with some useful "bells and whistles"
- `gapminder`: a package that makes the gapminder dataset available (as a `tibble`!)
Installing a package is not enough! To access its functions, you have to _load_ it. Use the `library()` function to load a package. (Note: ironically, it's not libraries we load with the `library()` function, but a package).
Run the following lines of code to load the packages. (Do put these in your scripts, and near the top)
```
library(tibble)
library(gapminder)
```
Take a look at the packages under the "Global Environment" tab to see the new objects that have just been made available to us. PS: you'll notice `mtcars` is not in our workspace/environment, yet we can still access it -- where does `mtcars` live?
Try the following two approaches to access information about the `tibble` package. Run the lines one-at-a-time. Vignettes are your friend, but do not always exist.
```
?tibble
browseVignettes(package = "tibble")
```
Print out the `gapminder` object to screen. It's a tibble -- how does it differ from a data frame in terms of how it's printed?
Because a tibble is a data frame, our exploration functions still work on it. Try some.
### Two slogans to understand computations in R (6 min)
(We probably won't have time to cover this, and that's OK -- I'm leaving it here for you to peruse if you are interested).
John Chambers eloquently sums up using R:
> To understand computations in R, two slogans are helpful:
>
> - Everything that exists is an object.
> - Everything that happens is a function call.
These are useful to remember to prevent us from getting confused.
1. Everything that exists is an object.
This is not obvious when we look at the output of, say, `str()`:
``` r
str(mtcars)
#> 'data.frame': 32 obs. of 11 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
#> $ disp: num 160 160 108 258 360 ...
#> $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
#> $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ qsec: num 16.5 17 18.6 19.4 17 ...
#> $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
#> $ am : num 1 1 1 0 0 0 0 0 0 0 ...
#> $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
#> $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
```
The stuff you see is simply printed to screen, not an object! The actual object is `NULL`:
``` r
foo <- str(mtcars)
#> 'data.frame': 32 obs. of 11 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
# ...(snip)...
foo
#> NULL
```
The output of `summary()` is actually a "table" object (something not often used in R). Let's coerce it to character data:
``` r
foo <- summary(mtcars)
as.character(foo)
#> [1] "Min. :10.40 " "1st Qu.:15.43 " "Median :19.20 "
#> [4] "Mean :20.09 " "3rd Qu.:22.80 " "Max. :33.90 "
#> [7] "Min. :4.000 " "1st Qu.:4.000 " "Median :6.000 "
#> [10] "Mean :6.188 " "3rd Qu.:8.000 " "Max. :8.000 "
#> [13] "Min. : 71.1 " "1st Qu.:120.8 " "Median :196.3 "
#> [16] "Mean :230.7 " "3rd Qu.:326.0 " "Max. :472.0 "
# ...(snip)...
```
2. Everything that happens is a function call.
Did you know that operators like `+` are actually functions? The "plus" function is literally `` `+`() ``, and accepts two arguments.
Here is what's actually happening when we call `5 + 2`:
``` r
`+`(5, 2)
#> [1] 7
```
Want a challenge? What's the difference between the `` `(`() `` function and the `` `{`() `` function? Hint check the documentation with `` ?`{` ``.
## Finishing up (5 min)
1. Highly recommended: [Don't save your workspace](https://www.r-bloggers.com/using-r-dont-save-your-workspace/) when you quit RStudio. Make this a default:
- Go to "RStudio" -> "Preferences..." -> "General"
- Uncheck "restore .RData into workspace on startup"
- Select: "Save workspace to RData on exit:" Never
2. Push your final script to GitHub (you can do this in a simple way by dragging the file onto your respository homepage).
Don't forget! There's an office hour after every class, held upstairs in ESB 3174.