```{r, echo = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
```
# The R Programming Language
> The material in this chapter is rather dull reading because it basically amounts to a list (although a carefully scaffolded list) of basic commands in R along with illustrative examples. After reading the first few pages and nodding off, you may be tempted to skip ahead, and I wouldn't blame you. But much of the material in this chapter is crucial, and all of it will eventually be useful, so you should at least skim it all so you know where to return when the topics arise later. [@kruschkeDoingBayesianData2015, p. 35]
Most, but not all, of this part of my project will mirror what's in the text. However, I do add **tidyverse**-oriented content, such as a brief walk-through of plotting with the [**ggplot2** package](https://ggplot2.tidyverse.org) [@wickhamGgplot2ElegantGraphics2016; @R-ggplot2].
## Get the software
The first step to following along with this ebook is to install **R** on your computer. Go to [https://cran.r-project.org/](https://cran.r-project.org/) and follow the instructions from there. If you get confused, there are any number of brief video tutorials available to lead you through the steps. Just use a search term like "install R."
If you're new to **R** or just curious about its origins, check out Chapter 2 of @pengProgrammingDataScience2020, [*History and overview of R*](https://bookdown.org/rdpeng/rprogdatascience/history-and-overview-of-r.html). One of the great features of **R**, which might seem odd or confusing if you're more used to working with proprietary software like SPSS, is that a lot of the functionality comes from the add-on packages users develop to make **R** easier to use. Peng briefly discusses these features in Section 2.7, [*Design of the R system*](https://bookdown.org/rdpeng/rprogdatascience/history-and-overview-of-r.html#design-of-the-r-system). I make use of a variety of add-on packages in this project. You can install them all by executing this code block.
```{r, eval = F}
packages <- c("bayesplot", "brms", "coda", "cowplot", "cubelyr", "devtools", "fishualize", "GGally", "ggdist", "ggExtra", "ggforce", "ggmcmc", "ggridges", "ggthemes", "janitor", "lisa", "loo", "palettetown", "patchwork", "psych", "remotes", "rstan", "santoku", "scico", "tidybayes", "tidyverse")
install.packages(packages, dependencies = T)
remotes::install_github("clauswilke/colorblindr")
devtools::install_github("dill/beyonce")
devtools::install_github("ropenscilabs/ochRe")
```
### A look at RStudio.
> The R programming language comes with its own basic user interface that is adequate for modest applications. But larger applications become unwieldy in the basic R user interface, and therefore it helps to install a more sophisticated R-friendly editor. There are a number of useful editors available, many of which are free, and they are constantly evolving. At the time of this writing, I recommend RStudio, which can be obtained from [[https://rstudio.com/]](https://rstudio.com/) (p. 35).
I completely agree. **R** programming is easier with **RStudio**. However, I should point out that there are other user interfaces available. You can find several alternatives listed [here](https://datascience.stackexchange.com/questions/5345/ide-alternatives-for-r-programming-rstudio-intellij-idea-eclipse-visual-stud) or [here](https://intro2r.com/alternatives-to-rstudio.html).
## A simple example of R in action
Basic arithmetic is straightforward in **R**.
```{r}
2 + 3
```
Much like Kruschke did in his text, I denote my programming prompts in typewriter font atop a gray background, like this: `2 + 3`.
Anyway, algebra is simple in **R**, too.
```{r}
x <- 2
x + x
```
I don't tend to save lists of commands in text files. Rather, I almost exclusively work within [R Notebook](https://bookdown.org/yihui/rmarkdown/notebook.html) files, which I discuss more fully in [Section 3.7][Programming in R]. As for *sourcing*, I never use the `source()` approach Kruschke discussed on page 36. I'm not opposed to it. It's just not my style.
Anyway, behold Figure 3.1.
```{r, fig.width = 3.5, fig.height = 3, message = F, warning = F}
library(tidyverse)
d <-
  tibble(x = seq(from = -2, to = 2, by = .1)) %>%
  mutate(y = x^2)

ggplot(data = d,
       aes(x = x, y = y)) +
  geom_line(color = "skyblue") +
  theme(panel.grid = element_blank())
```
If you're new to the **tidyverse** and/or making figures with **ggplot2**, it's worthwhile to walk through that code. With the first line, `library(tidyverse)`, we opened up the [core packages within the tidyverse](https://www.tidyverse.org/packages/), which are:
* **ggplot2** [@wickhamGgplot2ElegantGraphics2016; @R-ggplot2],
* **dplyr** [@R-dplyr],
* **tidyr** [@R-tidyr],
* **readr** [@R-readr],
* **purrr** [@R-purrr],
* **tibble** [@R-tibble],
* **stringr** [@R-stringr], and
* **forcats** [@R-forcats].
With the next few lines,
```{r}
d <-
  tibble(x = seq(from = -2, to = 2, by = .1)) %>%
  mutate(y = x^2)
```
we made our tibble. In **R**, data frames are one of the primary types of data objects (see [Section 3.4.4][List and data frame.], below). We'll make extensive use of data frames in this project. Tibbles are a particular type of data frame, which you might learn more about in the [tibbles section](https://r4ds.had.co.nz/tibbles.html) of Grolemund and Wickham's [-@grolemundDataScience2017] *R4DS*. With those first two lines, we determined what the name of our tibble would be, `d`, and made the first column, `x`.
Note the `%>%` operator at the end of the second line. In prose, we call that the *pipe*. As explained in [Section 5.6.1 of *R4DS*](https://r4ds.had.co.nz/transform.html#combining-multiple-operations-with-the-pipe), "a good way to pronounce `%>%` when reading code is 'then.'" So in words, those first two lines indicate "Make an object, `d`, which is a tibble with a variable, `x`, defined by the `seq()` function, *then*..."
In the portion after *then* (i.e., the `%>%`), we changed `d`. The `dplyr::mutate()` function let us add another variable, `y`, which is a function of our first variable, `x`.
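If the pipe is new to you, it may help to see that it is just another way of writing a nested function call. Here's a minimal sketch of that equivalence. The object names `d_nested` and `d_piped` are hypothetical, chosen so we don't overwrite `d`.

```{r}
library(tibble)
library(dplyr)

# nested form: read from the inside out
d_nested <- mutate(tibble(x = seq(from = -2, to = 2, by = .1)), y = x^2)

# piped form: read from top to bottom
d_piped <-
  tibble(x = seq(from = -2, to = 2, by = .1)) %>%
  mutate(y = x^2)

# check whether both approaches yield the same tibble
identical(d_nested, d_piped)
```

The piped version tends to be easier to read once you chain more than a couple steps together.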
With the next 4 lines of code, we made our plot. When plotting with **ggplot2**, the first line is always with the `ggplot()` function. This is where you typically tell **ggplot2** what data object you're using--which must be a data frame or tibble--and what variables you want on your axes. The interesting thing about **ggplot2** is that the code is modular. So if we only coded the `ggplot()` portion, we'd get:
```{r, fig.width = 3.5, fig.height = 3, message = F, warning = F}
ggplot(data = d,
       aes(x = x, y = y))
```
Although **ggplot2** knows which variables to put on which axes, it has no idea how we'd like to express the data. The result is an empty coordinate system. The next line of code is the main event. With `geom_line()` we told **ggplot2** to connect the data points with a line. With the `color` argument, we made that line `"skyblue"`. [[Here's a great list](http://sape.inf.usi.ch/quick-reference/ggplot2/colour) of the named colors available in **ggplot2**.] Also, notice the `+` operator at the end of the `ggplot()` function. With **ggplot2**, you add functions by placing the `+` operator on the right of the end of one function, which will then append the next function.
```{r, fig.width = 3.5, fig.height = 3, message = F, warning = F}
ggplot(data = d,
       aes(x = x, y = y)) +
  geom_line(color = "skyblue")
```
Personally, I'm not a fan of gridlines. They occasionally have their place and I do use them from time to time. But on the whole, I prefer to omit them from my plots. The final `theme()` function allowed me to do so.
```{r, fig.width = 3.5, fig.height = 3, message = F, warning = F}
ggplot(data = d,
       aes(x = x, y = y)) +
  geom_line(color = "skyblue") +
  theme(panel.grid = element_blank())
```
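Because the code is modular, you can keep appending components with `+`. As a rough sketch of that idea, here we stack a second geom and custom axis labels onto a line plot. The data frame `d_demo`, the plot object `p_demo`, and the color and label choices are all hypothetical.

```{r, fig.width = 3.5, fig.height = 3}
library(ggplot2)

# a small, hypothetical data frame, named to avoid overwriting `d`
d_demo <- data.frame(x = seq(from = -2, to = 2, by = .5))
d_demo$y <- d_demo$x^2

# each `+` appends one more component to the plot
p_demo <-
  ggplot(data = d_demo,
         aes(x = x, y = y)) +
  geom_line(color = "skyblue") +
  geom_point(color = "skyblue4") +
  labs(x = "x",
       y = "x squared")

p_demo
```

Saving the plot as an object first is optional. It just makes it easy to add even more components later.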
[Chapter 3 of *R4DS*](https://r4ds.had.co.nz/data-visualisation.html) is a great introduction to plotting with **ggplot2**. If you want to dive deeper, see the [references at the bottom of this page](https://ggplot2.tidyverse.org). And of course, you might read up in Wickham's [-@wickhamGgplot2ElegantGraphics2016] [*ggplot2: Elegant graphics for data analysis*](https://ggplot2-book.org/).
### Get the programs used with this book.
This subtitle has a double meaning, here. Yes, you should probably get Kruschke's scripts from the book's website, [https://sites.google.com/site/doingbayesiandataanalysis/](https://sites.google.com/site/doingbayesiandataanalysis/). You may have noticed this already, but unlike in Kruschke's text, I will usually show all my code. Indeed, the purpose of my project is to make coding these kinds of models and visualizations easier. But if you're ever curious, you can always find my script files in their naked form at [https://github.com/ASKurz/Doing-Bayesian-Data-Analysis-in-brms-and-the-tidyverse](https://github.com/ASKurz/Doing-Bayesian-Data-Analysis-in-brms-and-the-tidyverse). For example, the raw file for this very chapter is at [https://github.com/ASKurz/Doing-Bayesian-Data-Analysis-in-brms-and-the-tidyverse/blob/master/03.Rmd](https://github.com/ASKurz/Doing-Bayesian-Data-Analysis-in-brms-and-the-tidyverse/blob/master/03.Rmd).
Later in this subsection, Kruschke mentioned working directories. If you don't know what your current working directory is, just execute `getwd()`. I'll have more to say on this topic later on when I make my pitch for [**RStudio** projects](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) in [Section 3.7.2][Running a program.].
## Basic commands and operators in R
In addition to the resource link Kruschke provided in the text, Grolemund and Wickham's [*R4DS*](http://r4ds.had.co.nz) is an excellent general introduction to the kinds of **R** functions you'll want to succeed with your data analysis. Other than that, I've learned the most when I had a specific data problem to solve and then sought out the specific code/techniques required to solve it. If you already have your own data or can get your hands on some sexy data, learn these techniques by playing around with them. This isn't the time to worry about rigor, preregistration, or all of that. This is the time to play.
### Getting help in R.
As with `plot()` you can learn more about the `ggplot()` function with `?`.
```{r, eval = F}
?ggplot
```
`help.start()` can be nice, too.
```{r, eval = F}
help.start()
```
`??geom_line()` can help us learn more about the `geom_line()` function.
```{r, eval = F}
??geom_line()
```
Quite frankly, a big part of becoming a successful **R** user is learning how to get help online. In addition to the methods above, type the name of your function of interest into your favorite web browser. For example, if I wanted to learn more about the `geom_line()` function, I'd literally do a web search for "geom_line()". In my case, the first search result was [https://ggplot2.tidyverse.org/reference/geom_path.html](https://ggplot2.tidyverse.org/reference/geom_path.html), which is a nicely-formatted official reference page put out by the **ggplot2** team.
### Arithmetic and logical operators.
With arithmetic, the order of operations is: power first, then multiplication, then addition.
```{r}
1 + 2 * 3^2
```
With parentheses, you can force addition before multiplication.
```{r}
(1 + 2) * 3^2
```
Operations inside parentheses get done before power operations.
```{r}
(1 + 2 * 3)^2
```
One can nest parentheses.
```{r}
((1 + 2) * 3)^2
```
To see the full set of operator precedence rules, execute `?Syntax`.
```{r}
?Syntax
```
We can use **R** to perform a variety of logical tests, such as negation.
```{r}
!TRUE
```
We can do conjunction.
```{r}
TRUE & FALSE
```
And we can do disjunction.
```{r}
TRUE | FALSE
```
Conjunction has precedence over disjunction.
```{r}
TRUE | TRUE & FALSE
```
However, with parentheses we can force disjunction first.
```{r}
(TRUE | TRUE) & FALSE
```
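To make the precedence rule concrete, here's a small sketch in which we write out the implied parentheses. It also shows that `&` and `|` work element by element on logical vectors. The object names are hypothetical.

```{r}
# `&` is evaluated before `|`, so these two lines are the same test
implicit <- TRUE | TRUE & FALSE
explicit <- TRUE | (TRUE & FALSE)
implicit == explicit

# the operators also work element by element on logical vectors
elementwise <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)
elementwise
```

When in doubt, adding the parentheses costs nothing and makes your intent clear.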
### Assignment, relational operators, and tests of equality.
In contrast to Kruschke's preference, I will use the [arrow operator, `<-`, to assign](https://style.tidyverse.org/syntax.html#assignment) values to named variables[^1].
```{r}
x = 1
x <- 1
```
Yep, this ain't normal math.
```{r}
(x = 1)
(x = x + 1)
```
Here we use `==` to test for equality.
```{r}
(x = 2)
x == 2
```
Using `!=`, we can check whether the value of `x` is NOT equal to 3.
```{r}
x != 3
```
We can use `<` to check whether the value of `x` is less than 3.
```{r}
x < 3
```
Similarly, we can use `>` to check whether the value of `x` is greater than 3.
```{r}
x > 3
```
This normal use of the `<-` operator
```{r}
x <- 3
```
is not the same as
```{r}
x < - 3
```
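The difference is easy to miss, so here's a minimal sketch that spells it out. The object `z_demo` is hypothetical, named so we don't overwrite `x`.

```{r}
z_demo <- 5

# with the space, this is a comparison: is z_demo less than -3?
comparison <- z_demo < - 3
comparison

# the comparison left z_demo unchanged
z_demo

# without the space, this is an assignment
z_demo <- 3
z_demo
```

This is one more reason to be consistent with your spacing around operators.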
The limited precision of a computer's memory can lead to odd results.
```{r}
x <- 0.5 - 0.3
y <- 0.3 - 0.1
```
Although mathematically `TRUE`, this is `FALSE` because of limited precision.
```{r}
x == y
```
However, they are equal up to the precision of a computer.
```{r}
all.equal(x, y)
```
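One thing to keep in mind is that `all.equal()` does not return `FALSE` when the values differ. It returns a description of the difference instead. If you need a single logical value, say within an `if` statement, a common approach is to wrap it in `isTRUE()`. Here's a sketch with the hypothetical objects `u` and `v`.

```{r}
u <- 0.5 - 0.3
v <- 0.3 - 0.1

# when the values differ, we get a character description rather than FALSE
all.equal(u, v + 1)

# wrapping in isTRUE() always returns a single TRUE or FALSE
isTRUE(all.equal(u, v))
isTRUE(all.equal(u, v + 1))
```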
## Variable types
If you'd like to learn more about the differences among vectors, matrices, lists, data frames and so on, you might check out Roger Peng's [-@pengProgrammingDataScience2020] *R Programming for data science*, [Chapter 4](https://bookdown.org/rdpeng/rprogdatascience/r-nuts-and-bolts.html).
### Vector.
"A vector is simply an ordered list of elements of the same type" (p. 42).
#### The combine function.
The *combine* function is `c()`, which makes vectors. Here we'll first make an unnamed vector. Then we'll save that vector as `x`.
```{r}
c(2.718, 3.14, 1.414)
x <- c(2.718, 3.14, 1.414)
```
You'll note the equivalence.
```{r}
x == c(2.718, 3.14, 1.414)
```
This leads to the next subsection.
#### Component-by-component vector operations.
We can multiply two vectors, component by component.
```{r}
c(1, 2, 3) * c(7, 6, 5)
```
If you have a sole number, a *scalar*, you can multiply an entire vector by it like:
```{r}
2 * c(1, 2, 3)
```
which is a more compact way to perform this.
```{r}
c(2, 2, 2) * c(1, 2, 3)
```
The same sensibilities hold for other operations, such as addition.
```{r}
2 + c(1, 2, 3)
```
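The scalar case is a special instance of what **R** calls *recycling*, where the shorter vector gets reused until it matches the length of the longer one. As a small sketch, here we multiply a vector of four elements by a vector of two. The object name `recycled` is hypothetical.

```{r}
# the shorter vector c(10, 100) is reused as c(10, 100, 10, 100)
recycled <- c(1, 2, 3, 4) * c(10, 100)
recycled

# the same operation, with the recycling written out by hand
c(1, 2, 3, 4) * c(10, 100, 10, 100)
```

Recycling is convenient, but it can also hide mistakes, so it's worth checking your vector lengths when the results look odd.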
#### The colon operator and sequence function.
The colon operator, `:`, is a handy way to make integer sequences. Here we use it to serially list the integers from 4 to 7.
```{r}
4:7
```
The colon operator has precedence over addition.
```{r}
2 + 3:6
```
Parentheses override default precedence.
```{r}
(2 + 3):6
```
The power operator has precedence over the colon operator.
```{r}
1:3^2
```
And parentheses override default precedence.
```{r}
(1:3)^2
```
The `seq()` function is quite handy. If you don't specify the length of the output, it will figure that out as the logical consequence of the other arguments.
```{r}
seq(from = 0, to = 3, by = 0.5)
```
This sequence won't exceed `to = 3`.
```{r}
seq(from = 0, to = 3, by = 0.5001)
```
In each of the following examples, we'll omit one of the core `seq()` arguments: `from`, `to`, `by`, and `length.out`. Here we do not define the end point.
```{r}
seq(from = 0, by = 0.5, length.out = 7)
```
This time we fail to define the increment.
```{r}
seq(from = 0, to = 3, length.out = 7)
```
And this time we omit a starting point.
```{r}
seq(to = 3, by = 0.5, length.out = 7)
```
In this ebook, I will always explicitly name my arguments within `seq()`.
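To confirm that the different argument combinations are just different routes to the same place, here's a quick sketch comparing them. The object names are hypothetical.

```{r}
seq_with_by     <- seq(from = 0, to = 3, by = 0.5)
seq_without_to  <- seq(from = 0, by = 0.5, length.out = 7)
seq_without_by  <- seq(from = 0, to = 3, length.out = 7)
seq_without_from <- seq(to = 3, by = 0.5, length.out = 7)

# each comparison checks one alternative against the first version
all.equal(seq_with_by, seq_without_to)
all.equal(seq_with_by, seq_without_by)
all.equal(seq_with_by, seq_without_from)

# when `by` doesn't divide the range evenly, the sequence stops short of `to`
length(seq(from = 0, to = 3, by = 0.5001))
```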
#### The replicate function.
We'll define our pre-replication vector with the `<-` operator.
```{r}
abc <- c("A", "B", "C")
```
With the `times` argument, we repeat the vector as a unit with the `rep()` function.
```{r}
rep(abc, times = 2)
```
But if we mix the `times` argument with `c()`, we can repeat individual components of `abc` differently.
```{r}
rep(abc, times = c(4, 2, 1))
```
With the `each` argument, we repeat the individual components of `abc` one at a time.
```{r}
rep(abc, each = 2)
```
And you can even combine `each` and `length` (short for the full argument name, `length.out`), repeating each element until the `length` requirement has been fulfilled.
```{r}
rep(abc, each = 2, length = 10)
```
You can also combine `each` and `times`.
```{r}
rep(abc, each = 2, times = 3)
```
I tend to do things like the above as two separate steps. One way to do so is by nesting one `rep()` function within another.
```{r}
rep(rep(abc, each = 2),
    times = 3)
```
As Kruschke points out, this can look confusing.
```{r}
rep(abc, each = 2, times = c(1, 2, 3, 1, 2, 3))
```
But breaking the results up into two steps might be easier to understand.
```{r}
rep(rep(abc, each = 2),
    times = c(1, 2, 3, 1, 2, 3))
```
And especially earlier in my **R** career, it helped quite a bit to break operation sequences like this up by saving and assessing the intermediary steps.
```{r}
step_1 <- rep(abc, each = 2)
step_1
rep(step_1, times = c(1, 2, 3, 1, 2, 3))
```
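If you'd like reassurance that the one-step and two-step approaches agree, you can check it directly. This is a minimal sketch using hypothetical object names, with `abc_demo` standing in for `abc`.

```{r}
abc_demo <- c("A", "B", "C")

# `each` is applied first, and then `times` is applied to that result
one_step <- rep(abc_demo, each = 2, times = c(1, 2, 3, 1, 2, 3))
two_step <- rep(rep(abc_demo, each = 2),
                times = c(1, 2, 3, 1, 2, 3))

identical(one_step, two_step)

# the six intermediate elements expand to 1 + 2 + 3 + 1 + 2 + 3 elements
length(two_step)
```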
#### Getting at elements of a vector.
Behold our exemplar vector, `x`.
```{r}
x <- c(2.718, 3.14, 1.414, 47405)
```
The straightforward way to extract the second and fourth elements is with a combination of brackets, `[]`, and `c()`.
```{r}
x[c(2, 4)]
```
Or you might use reverse logic and omit the first and third elements.
```{r}
x[c(-1, -3)]
```
It's handy to know that `T` is a stand in for `TRUE` and `F` is a stand in for `FALSE`. You'll probably notice I use the abbreviations most of the time.
```{r}
x[c(F, T, F, T)]
```
The `names()` function makes it easy to name the components of a vector.
```{r}
names(x) <- c("e", "pi", "sqrt2", "zipcode")
x
```
Now we can call the components with their names.
```{r}
x[c("pi", "zipcode")]
```
Once we start working with summaries from our Bayesian models, we'll use this trick a lot.
Here's Kruschke's review:
```{r}
# define a vector
x <- c(2.718, 3.14, 1.414, 47405)
# name the components
names(x) <- c("e", "pi", "sqrt2", "zipcode")
# you can indicate which elements you'd like to include
x[c(2, 4)]
# you can decide which to exclude
x[c(-1, -3)]
# or you can use logical tests
x[c(F, T, F, T)]
# and you can use the names themselves
x[c("pi", "zipcode")]
```
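In practice, you'll rarely type out the `TRUE` and `FALSE` values by hand. More often, you'll compute them with a relational operator. Here's a sketch of that approach, using the hypothetical objects `x_demo` and `big`.

```{r}
x_demo <- c(e = 2.718, pi = 3.14, sqrt2 = 1.414, zipcode = 47405)

# the test returns one TRUE or FALSE per element
x_demo > 3

# placing the test within the brackets keeps only the TRUE elements
big <- x_demo[x_demo > 3]
big
```

Note how the element names carried over to the subset.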
### Factor.
"Factors are a type of vector in R for which the elements are *categorical* values that could also be ordered. The values are stored internally as integers with labeled levels" (p. 46, *emphasis* in the original).
Here are our five-person socioeconomic status data.
```{r}
x <- c("high", "medium", "low", "high", "medium")
x
```
The `factor()` function turns them into a factor, which will return the levels when called.
```{r}
xf <- factor(x)
xf
```
Here are the factor levels as numerals.
```{r}
as.numeric(xf)
```
With the `levels` and `ordered` arguments, we can order the factor elements.
```{r}
xfo <- factor(x, levels = c("low", "medium", "high"), ordered = T)
xfo
```
Now "high" is a larger integer.
```{r}
as.numeric(xfo)
```
We've already specified `xf`.
```{r}
xf
```
And we know how it's been coded numerically.
```{r}
as.numeric(xf)
```
We can have `levels` and `labels`.
```{r}
xfol <- factor(x,
               levels = c("low", "medium", "high"), ordered = T,
               labels = c("Bottom SES", "Middle SES", "Top SES"))
xfol
```
Factors can come in very handy when modeling with certain kinds of categorical variables, as in [Chapter 22][Nominal Predicted Variable], or when arranging elements within a plot.
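To get a sense of what the ordering buys you, here's a small sketch. With an ordered factor, sorting and relational operators respect the level order rather than the alphabet. The object `ses_demo` is hypothetical and mirrors the SES data from above.

```{r}
ses_demo <- factor(c("high", "medium", "low", "high", "medium"),
                   levels = c("low", "medium", "high"), ordered = TRUE)

# the levels, in their specified order
levels(ses_demo)

# sorting follows the level order, not alphabetical order
sort(ses_demo)

# relational operators work with ordered factors, too
ses_demo < "high"

# count the cases within each level
table(ses_demo)
```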
### Matrix and array.
Kruschke uses these more often than I do. I'm more of a vector and data frame kinda guy. Even so, here's an example of a matrix.
```{r}
matrix(1:6, ncol = 3)
```
We can get the same thing using `nrow`.
```{r}
matrix(1:6, nrow = 2)
```
Note how the numbers got ordered by rows within each column? We can specify them to be ordered across columns, first.
```{r}
matrix(1:6, nrow = 2, byrow = T)
```
We can name the dimensions. I'm not completely consistent, but I generally follow [*The tidyverse style guide*](https://style.tidyverse.org) [@wickhamTidyverseStyleGuide2020] for naming my **R** objects and their elements. From [Section 2.1](https://style.tidyverse.org/syntax.html#object-names), we read
> Variable and function names should use only lowercase letters, numbers, and `_`. Use underscores (`_`) (so called snake case) to separate words within a name.
By those sensibilities, we'll name our rows and columns like this.
```{r}
matrix(1:6,
       nrow = 2,
       dimnames = list(TheRowDimName = c("row_1_name", "row_2_name"),
                       TheColDimName = c("col_1_name", "col_2_name", "col_3_name")))
```
You've also probably noticed that I "[always put a space after a comma, never before, just like in regular English](https://style.tidyverse.org/syntax.html#spacing)," as well as "put a space before and after `=` when naming arguments in function calls." IMO, this makes code easier to read. You do you.
We'll name our matrix `x`.
```{r}
x <-
  matrix(1:6,
         nrow = 2,
         dimnames = list(TheRowDimName = c("row_1_name", "row_2_name"),
                         TheColDimName = c("col_1_name", "col_2_name", "col_3_name")))
```
Since there are 2 dimensions, we'll subset with two dimensions. Numerical indices work.
```{r}
x[2, 3]
```
Row and column names work, too. Just make sure to use quotation marks, `""`, for those.
```{r}
x["row_2_name", "col_3_name"]
```
Here we specify the range of columns to include.
```{r}
x[2, 1:3]
```
Leaving that argument blank returns them all.
```{r}
x[2, ]
```
And leaving the row index blank returns all row values within the specified column(s).
```{r}
x[, 3]
```
Mind your commas! This produces the second row, returned as a vector.
```{r}
x[2, ]
```
This returns both rows of the 2^nd^ column.
```{r}
x[, 2]
```
Leaving out the comma will return the numbered element.
```{r}
x[2]
```
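The single-index result makes sense once you know that **R** stores matrix elements column by column. The sketch below illustrates this with a hypothetical matrix, `m_demo`. It also shows the `drop = FALSE` argument, which is a way to keep the matrix structure when you subset a single row or column.

```{r}
m_demo <- matrix(1:6, nrow = 2)

# elements are numbered down the first column, then down the second, and so on
m_demo[2] == m_demo[2, 1]
m_demo[3] == m_demo[1, 2]

# by default, a single row is returned as a plain vector
dim(m_demo[2, ])

# with drop = FALSE, the result is still a matrix, here with 1 row and 3 columns
dim(m_demo[2, , drop = FALSE])
```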
It'll be important in your **brms** career to have a sense of 3-dimensional arrays. Several **brms** convenience functions often return them (e.g., `ranef()` in multilevel models).
```{r}
a <- array(1:24, dim = c(3, 4, 2), # 3 rows, 4 columns, 2 layers
           dimnames = list(RowDimName = c("r1", "r2", "r3"),
                           ColDimName = c("c1", "c2", "c3", "c4"),
                           LayDimName = c("l1", "l2")))
a
```
Since these have 3 dimensions, you have to use 3-dimensional indexing. As with 2-dimensional objects, leaving the indices for a dimension blank will return all elements within that dimension. For example, this code returns all columns of `r3` and `l2`, as a vector.
```{r}
a["r3", , "l2"]
```
And this code returns all layers of `r3` and `c4`, as a vector.
```{r}
a["r3", "c4", ]
```
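As a further sketch, you can leave two of the three indices blank, in which case the result is a matrix rather than a vector. The array `a_demo` is hypothetical and has the same structure as `a`.

```{r}
a_demo <- array(1:24, dim = c(3, 4, 2),
                dimnames = list(RowDimName = c("r1", "r2", "r3"),
                                ColDimName = c("c1", "c2", "c3", "c4"),
                                LayDimName = c("l1", "l2")))

# all columns and all layers of `r3`, returned as a matrix
r3_slice <- a_demo["r3", , ]
r3_slice

# the remaining dimensions are the 4 columns and the 2 layers
dim(r3_slice)
```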
This whole topic of subsetting--whether from matrices and arrays, or from data frames and tibbles--can be confusing. For more practice and clarification, you might check out Peng's Chapter 9, [*Subsetting R objects*](https://bookdown.org/rdpeng/rprogdatascience/subsetting-r-objects.html).
### List and data frame.
"The `list` structure is a generic vector in which components can be of different types, and named" (p. 51). Here's `my_list`.
```{r}
my_list <-
  list("a" = 1:3,
       "b" = matrix(1:6, nrow = 2),
       "c" = "Hello, world.")
my_list
```
To return the contents of the `a` portion of `my_list`, just execute this.
```{r}
my_list$a
```
We can index further within `a`.
```{r}
my_list$a[2]
```
To return the contents of the first item in our list with the double bracket, `[[]]`, do:
```{r}
my_list[[1]]
```
You can index further to return only the second element of the first list item.
```{r}
my_list[[1]][2]
```
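But two consecutive single brackets, `[][]`, are no good here. The first single bracket returns a list containing one item, so there is no second item for the next bracket to find.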
```{r}
my_list[1][2]
```
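The key difference is in what each kind of bracket returns. Here's a minimal sketch, using a hypothetical `list_demo` with the same contents as `my_list`.

```{r}
list_demo <-
  list("a" = 1:3,
       "b" = matrix(1:6, nrow = 2),
       "c" = "Hello, world.")

# the single bracket returns a smaller list
one_bracket <- list_demo[1]
class(one_bracket)
length(one_bracket)

# the double bracket returns the contents of the list item
two_brackets <- list_demo[[1]]
class(two_brackets)
length(two_brackets)
```

Since `one_bracket` is a list with only one item, asking for its second item comes up empty.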
To learn more, Jenny Bryan has a [great talk](https://www.youtube.com/watch?v=4MfUCX_KpdE&t=615s&frags=pl%2Cwn) discussing the role of lists within data wrangling. There's also [this classic pic](https://twitter.com/hadleywickham/status/643381054758363136?lang=en) from Hadley Wickham:
```{r, echo = F, fig.align = "center", out.width = "100%", fig.cap = "Indexing lists in #rstats. Inspired by the Residence Inn"}
knitr::include_graphics("pics/pepper_list.png")
```
But here's a data frame.
```{r}
d <-
  data.frame(integers     = 1:3,
             number_names = c("one", "two", "three"))
d
```
With data frames, we can continue indexing with the `$` operator.
```{r}
d$number_names
```
We can also use the double bracket.
```{r}
d[[2]]
```
Notice how the single bracket with no comma indexes columns rather than rows.
```{r}
d[2]
```
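But adding the comma returns the column as a vector. In versions of **R** before 4.0.0, where `data.frame()` converted character columns to factors by default, the output also included the factor-level information.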
```{r}
d[, 2]
```
It works a touch differently when indexing by row.
```{r}
d[2, ]
```
Let's try with a tibble, instead.
```{r}
t <-
  tibble(integers     = 1:3,
         number_names = c("one", "two", "three"))
t
```
One difference is that tibbles have always defaulted to keeping text columns as character strings rather than converting them to factors, which `data.frame()` did by default before **R** 4.0.0. Another difference occurs when printing large data frames versus large tibbles. Tibbles yield more compact glimpses. For more, check out [*R4DS* Chapter 10](https://r4ds.had.co.nz/tibbles.html).
It's also worthwhile pointing out that within the **tidyverse**, you can pull out a specific column with the `select()` function. Here we select `number_names`.
```{r}
t %>%
  select(number_names)
```
Go [here](https://r4ds.had.co.nz/transform.html#select) to learn more about `select()`.
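It's worth knowing that `select()` returns a tibble, even when you select a single column. If you want the column as a plain vector, similar to what you'd get with `$`, **dplyr** offers the `pull()` function. Here's a sketch of the difference, using hypothetical object names.

```{r}
library(dplyr)

t_demo <-
  tibble(integers     = 1:3,
         number_names = c("one", "two", "three"))

# select() keeps the data frame structure
selected <- t_demo %>% select(number_names)
class(selected)

# pull() extracts the column as a vector
pulled <- t_demo %>% pull(number_names)
class(pulled)
```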
## Loading and saving data
### The ~~read.csv~~ `read_csv()` and ~~read.table~~ `read_table()` functions.
Although `read.csv()` is the default CSV reader in **R**, the [`read_csv()` function](https://readr.tidyverse.org/reference/read_delim.html) from the [**readr** package](https://readr.tidyverse.org) (i.e., one of the core **tidyverse** packages) is a new alternative. In [comparison to base **R**](https://r4ds.had.co.nz/data-import.html#compared-to-base-r)'s `read.csv()`, `readr::read_csv()` is faster and returns tibbles (as opposed to data frames with `read.csv()`). The same general points hold for base **R**'s `read.table()` versus `readr::read_table()`.
Using Kruschke's `HGN.csv` example, we'd load the CSV with `read_csv()` like this.
```{r, message = F}
hgn <- read_csv("data.R/HGN.csv")
```
Note again that `read_csv()` defaults to returning columns with character information as characters, not factors.
```{r}
hgn$Hair
```
See? As a character variable, `Hair` no longer has factor level information. But if you knew you wanted to treat `Hair` as a factor, you could easily convert it with `dplyr::mutate()`.
```{r}
hgn <-
  hgn %>%
  mutate(Hair = factor(Hair))
hgn$Hair
```
And here's a **tidyverse** way to reorder the levels for the `Hair` factor.
```{r}
hgn <-
  hgn %>%
  mutate(Hair = factor(Hair, levels = c("red", "blond", "brown", "black")))
hgn$Hair
as.numeric(hgn$Hair)
```
Since we imported `hgn` with `read_csv()`, the `Name` column is already a character vector, which we can verify with the `str()` function.
```{r}
hgn$Name %>% str()
```
Note how using `as.vector()` did nothing in our case. `Name` was already a character vector.
```{r}
hgn$Name %>%
  as.vector() %>%
  str()
```
The `Group` column was imported as composed of integers.
```{r}
hgn$Group %>% str()
```
Switching `Group` to a factor is easy enough.
```{r}
hgn <-
  hgn %>%
  mutate(Group = factor(Group))
hgn$Group
```
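If you know ahead of time which columns should be factors, another option is to say so when you import the data. This is a sketch using the `col_types` argument within `read_csv()`. Because it needs to stand on its own, it first writes a tiny CSV to a temporary file. The file contents and the object names `toy_path` and `toy` are hypothetical and are not Kruschke's `HGN.csv` data.

```{r}
library(readr)

# write a small, made-up CSV to a temporary file
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Hair,Group",
             "black,1",
             "red,2",
             "blond,1"),
           toy_path)

# declare the factor columns, and the level order for Hair, at import
toy <-
  read_csv(toy_path,
           col_types = cols(Hair  = col_factor(levels = c("red", "blond", "black")),
                            Group = col_factor()))

levels(toy$Hair)
levels(toy$Group)
```

When you leave the `levels` argument out of `col_factor()`, the levels are taken from the data, as with `Group`.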
### Saving data from R.
The **readr** package has a `write_csv()` function, too. At the time of this writing, the arguments were as follows: `write_csv(x, file, na = "NA", append = FALSE, col_names = !append, quote_escape = "double", eol = "\n")`. The exact list depends on your version of **readr**, so learn more by executing `?write_csv`. Saving `hgn` in your working directory is as easy as:
```{r}
write_csv(hgn, "hgn.csv")
```
You could also use `save()`.
```{r}
save(hgn, file = "hgn.Rdata")
```
Once we start fitting Bayesian models, this method will be an important way to save the results of those models.
The `load()` function is simple.
```{r}
load("hgn.Rdata")
```
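To see the full round trip in one place, here's a minimal sketch that saves an object, removes it from the environment, and then brings it back with `load()`. The object `toy_fit` is a hypothetical stand-in for a model fit, and we save to a temporary file so nothing lands in your working directory.

```{r}
# a made-up stand-in for the results of a model
toy_fit <- list(estimate = 0.5, note = "hypothetical results")
toy_copy <- toy_fit

# save to a temporary file, then remove the original
toy_file <- tempfile(fileext = ".Rdata")
save(toy_fit, file = toy_file)
rm(toy_fit)

# load() restores the object under its original name
# and returns the names of what it loaded
restored_names <- load(toy_file)
restored_names

identical(toy_fit, toy_copy)
```

Note that `load()` puts the object back under the name it had when saved, which will overwrite any object with the same name in your environment.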
The `ls()` function works very much the same way as the more verbosely-named `objects()` function.
```{r}
ls()
```
## Some utility functions
"A function is a process that takes some input, called the *arguments*, and produces some output, called the *value*" (p. 56, *emphasis* in the original).
```{r}
# this is a more compact way to replicate 100 1's, 200 2's, and 300 3's
x <- rep(1:3, times = c(100, 200, 300))
summary(x)
```
We can use the pipe to convert and then summarize `x`.
```{r}
x %>%
  factor() %>%
  summary()
```
The `head()` and `tail()` functions are quite useful.
```{r}
head(x)
tail(x)
```
I use `head()` a lot.
Within the **tidyverse**, the `slice()` function serves a similar role. In order to use `slice()`, we'll want to convert `x`, which is just a vector of integers, into a data frame. Then we'll use `slice()` to return a subset of the rows.
```{r}
x <-
  x %>%
  data.frame()

x %>%
  slice(1:6)
```
So that was analogous to what we accomplished with `head()`. Here's the analogue to `tail()`.
```{r}
x %>%
  slice(595:600)
```
The downside of that code was we had to do the math to determine that $600 - 5 = 595$ in order to get the last six rows, as returned by `tail()`. A more general approach is to use `n()`, which will return the total number of rows in the data.
```{r}
x %>%
  slice((n() - 5):n())
```
To unpack `(n() - 5):n()`, because `n()` = 600, `(n() - 5)` = 600 - 5 = 595. Therefore `(n() - 5):n()` was equivalent to having coded `595:600`. Instead of having to do the math ourselves, `n()` did it for us. It's often easier to just go with `head()` or `tail()`. But the advantage of this more general approach is that it allows one to take more complicated slices of the data, such as returning the first three and last three rows.
```{r}
x %>%
  slice(c(1:3, (n() - 2):n()))
```
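If the index arithmetic feels error prone, recent versions of **dplyr** (1.0.0 and later) also offer the `slice_head()` and `slice_tail()` helpers, which take the number of rows you want as the `n` argument. Here's a sketch with a hypothetical 600-row tibble, `toy_rows`.

```{r}
library(dplyr)

toy_rows <- tibble(row = 1:600)

# the last six rows, no arithmetic required
last_six <- toy_rows %>% slice_tail(n = 6)
last_six

# the first three and last three rows, stacked together
first_last <-
  bind_rows(slice_head(toy_rows, n = 3),
            slice_tail(toy_rows, n = 3))
first_last
```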
We've already used the handy `str()` function a bit. It's also nice to know that `dplyr::glimpse()` performs a similar function.
```{r}
x %>% str()
x %>% glimpse()
```
Within the **tidyverse**, we'd use `group_by()` and then `summarize()` as alternatives to the base **R** `aggregate()` function. With `group_by()` we group the observations first by `Hair` and then by `Gender` within `Hair`. After that, we summarize the groups by taking the `median()` values of their `Number`.
```{r, message = F}
hgn %>%
  group_by(Hair, Gender) %>%
  summarize(median = median(Number))
```
One of the nice things about this workflow is that the code reads somewhat like how we'd explain what we were doing. We, in effect, told **R** to *Take `hgn`, then group the data by `Hair` and `Gender` within `Hair`, and then `summarize()` those groups by their `median()` `Number` values.* There's also the nice quality that we don't have to continually tell **R** where the data are coming from, the way the `aggregate()` function required Kruschke to prefix each of his variables with `HGNdf$`. We also didn't have to explicitly rename the output columns the way Kruschke had to.
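Since the workflow above depends on Kruschke's `HGN.csv` file, here's a self-contained sketch of the same pattern you can run anywhere. The `toy_hgn` data are made up for illustration and only borrow the column names. We also compute a second summary, the group size via `n()`, and use the `.groups = "drop"` argument, which returns an ungrouped tibble.

```{r}
library(dplyr)

# made-up data with the same column names as in `hgn`
toy_hgn <-
  tibble(Hair   = c("black", "black", "red", "red", "red"),
         Gender = c("F", "M", "F", "F", "M"),
         Number = c(4, 8, 2, 6, 10))

toy_summary <-
  toy_hgn %>%
  group_by(Hair, Gender) %>%
  summarize(median = median(Number),
            n      = n(),
            .groups = "drop")

toy_summary
```

Each row in the output corresponds to one combination of `Hair` and `Gender` found in the data.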