---
title: "Introduction to R Syntax Part 1"
pagetitle: "Introduction to R Syntax, Part 1"
output:
  learnr::tutorial:
    theme: spacelab
    ace_theme: cobalt
    progressive: true
    allow_skip: true
    highlight: pygments
    toc: yes
    toc_depth: 2
    code_folding: hide
runtime: shiny_prerendered
bibliography: bib.bib
description: "Basic R Syntax and data structures"
---
```{r setup, include=FALSE}
library(learnr)
knitr::opts_chunk$set(echo = FALSE)
```
<!-- ======================== -->
## About this tutorial
<!-- ======================== -->
This introductory tutorial on R syntax is designed to be guided by an instructor. It contains explanations, code exercises, and quizzes that make it very interactive. If time allows during the guided session, you can execute some of the code and try variations of your own. The instructor may also ask short questions about the material. Should you fall behind or need more time to go over code or concepts, please do so after the live session; this is very important for cementing the concepts and testing your understanding.
The emphasis will be on why R works the way it does and less on being a reference manual or a source of recipes. In doing so we hope to get you excited about this useful and expressive language. R was designed from the ground up to be vectorized and object oriented. Its aim is to be useful for both experts and beginners with the need to do computational statistics, data analysis and data visualization tasks.
Dr. Amelia McNamara describes three types of R syntax [@RSyntax-Cheatsheet]: the dollar sign, the formula, and the Tidyverse. These syntaxes are interchangeable; the important thing is to be consistent when using them. The formula syntax tends to be more compact but less readable, while the Tidyverse syntax is more readable but also more verbose. We will focus here on the dollar sign syntax in order to meet our time goal for these two sessions. However, by the end of this tutorial you will have enough fundamental concepts to tackle the other two syntaxes on your own.
This tutorial was built using an R package called `learnr` and was deployed using an RStudio solution on a Shiny server. You can also do these things as you move forward in your R journey.
Finally, if you are following this material in a browser, every time you reopen this page the tutorial will be in the state you left it in the last time you worked on it. If you want to reset your answers and the code you have run in the exercises, erasing all of your previous history, press the `Start Over` option at the bottom left of the main panel.
Let's get started.
<!-- ======================== -->
## Learning outcomes
<!-- ======================== -->
The main goal of this tutorial is to teach the basic syntax patterns of the R language. We believe that this knowledge will set you on a positive trajectory at any point in your R language journey. R has specific properties that set it apart from other languages and can make it easier to learn for a beginner. However, those same differences can make it confusing at first for people with experience in other languages.
By the end of this tutorial, you will know what data structures to use to store your data in memory. You will select the data type according to the data you have for your specific problem. You will know the syntax to use R as a powerful calculation tool to generate statistical and data insights by subsetting, filtering, and computing in R native data structures.
You will know how to build complex expressions that compute using R data structures with simplicity, whether they are vectors, lists, or data frames.
You will know how to search for built-in R functions to help you process the data and achieve your goals. You will also create your own functions when none of the built-ins or publicly available R packages meets your needs.
<!-- ======================== -->
## Operators in R
<!-- ======================== -->
R has a console to interact with it. Let's use it as a calculator to add two numbers.
Here is an example of a line of code to add two plus two and the resulting output:
```{r two-plus-two, echo=TRUE}
2 + 2
```
Note the output: there is a `[1]` followed by the result of the calculation. We will address the meaning of the number between square brackets when discussing `Vectors`. For now pretend it tells you the position of the single answer provided: a `[1]` indicates that the number `4` is the first value in the answer.
### Mathematical operators in R
Please try running the following lines of code, each representing a different mathematical operation.
```{r operators-1, exercise=TRUE, exercise.lines=5}
# white space between operands and operator does not matter below, try it!
450 - 100 # subtraction
3 * 10 # multiplication
35 / 7 # division
5^2 #exponentiation
```
Each result appears on its own line after running the code. The `#` symbol precedes a comment that is ignored by the interpreter of the code. This is the same behaviour you would see on a command line if you entered the four lines of code. Go back and prove to yourself that white space between the numbers and the operators has no effect on the result.
There are many more operators. We will now go over some of the most important and show how to search for all of the ones that R offers so you can find them when you need them.
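Two of those additional operators, modulo `%%` and integer division `%/%`, appear later in the operator precedence table. As a minimal sketch with arbitrary numbers, they split a division into its whole-number quotient and its remainder:

```r
# integer division: how many whole times does 5 fit into 17?
17 %/% 5
# modulo: what is left over after that division?
17 %% 5
# the two pieces reconstruct the original number
(17 %/% 5) * 5 + 17 %% 5 == 17
```

The last line should print `TRUE`, since the quotient times the divisor plus the remainder gives back the dividend.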
### Relational operators
These operators have R compute numerical equality or inequality between objects. Don't get intimidated by the word objects: in R everything is an object, although some appear to be just numeric values.
The basic relational operators compute equality or the lack of it: "is equal" `==`, "is not equal" `!=`; and inequality in either direction with "is less than" `<` and "is greater than" `>`.
```{r relational-operators, exercise = TRUE}
18 == 3 * 6
6 < 10
12 > 5
9 != 3 * 3
```
What about text comparisons? Text is written between double quotes and can contain letters, numbers, white space, and other special characters like punctuation. Execute the following lines of code.
```{r relational-operators-with-text, exercise=TRUE}
"Calgary Flames" == "Edmonton Oilers"
"Susan" > "Anne"
"a" < "b"
"Airport" != "Bus station"
```
The previous comparisons work because R compares text character by character, using the collating (sorting) order of the current locale. The `a` comes before the `b` in that order, so it counts as lower, and thus the result observed above.
Can you go back and transform the arguments, not the operators, in the code above to reverse the results?
For example `"d" < "b"` for line 3 to produce `FALSE`.
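R's documentation describes string comparison as following the collating order of the locale in use, so results for mixed-case text can vary between machines. The sketch below, with arbitrary example strings, contrasts the raw character codes with a sort that is documented to use C-locale (character code) order.

```r
# character codes: uppercase letters have lower codes than lowercase ones
utf8ToInt("B")
utf8ToInt("a")
# sorting with method = "radix" always uses C-locale (character code) order
sort(c("a", "B"), method = "radix")
# this comparison uses the current locale, so its result can differ by system
"a" < "B"
# the collation locale in effect for this session
Sys.getlocale("LC_COLLATE")
```

If your session uses a language-aware locale, the `"a" < "B"` line will typically disagree with the code-based order shown by the radix sort.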
<!-- roundoff error using the equality operator deferred for later when we have covered functions -->
### Logical operators
These operators are used to combine the results of two or more relational operators into a single result. Here is the AND operator represented in R by the ampersand symbol `&`:
```{r logical-AND-exercise, exercise=TRUE, collapse=TRUE}
5 > 3 & 5 < 8
5 > 3 & 5 != 5
5 < 4 & 5 > 1
5 < 4 & 5 > 7
```
Can you make a table with the truth values of the relational operators before they were combined with the logical operator AND?
```{r logical-table-AND, exercise=TRUE, exercise.lines=1}
# Run Code to see the table
library(knitr)
and = c("&", "&", "&", "&")
equals = c("=", "=", "=" , "=")
relation1 <- c("5 > 3", "5 > 3", "5 < 4", "5 < 4")
value1 <- c(paste0("(", 5 > 3, ")"),
paste0("(", 5 > 3, ")"),
paste0("(", 5 < 4, ")"),
paste0("(", 5 < 4, ")"))
relation2 <- c("5 < 8", "5 != 5", "5 > 1", "5 > 7")
value2 <- c(paste0("(", 5 < 8, ")"),
paste0("(", 5 != 5, ")"),
paste0("(", 5 > 1, ")"),
paste0("(", 5 > 7, ")"))
result = c(5 > 3 & 5 < 8,
5 > 3 & 5 != 5,
5 < 4 & 5 > 1,
5 < 4 & 5 > 7)
kable( data.frame(relation1, value1, and, relation2, value2, equals, result,
stringsAsFactors = FALSE),
col.names = c("rel1", "(value)", "", "rel2", "(value)", "", "result"),
align = 'rlcrlcl',
caption = 'Truth table for AND')
```
For the logical OR operator, represented in R by the pipe symbol `|`, the table would look as follows:
```{r logical-table-OR, exercise=TRUE, exercise.lines=1}
# Run Code to see the table
library(knitr)
and = c("|", "|", "|", "|")
equals = c("=", "=", "=" , "=")
relation1 <- c("5 > 3", "5 > 3", "5 < 4", "5 < 4")
value1 <- c(paste0("(", 5 > 3, ")"),
paste0("(", 5 > 3, ")"),
paste0("(", 5 < 4, ")"),
paste0("(", 5 < 4, ")"))
relation2 <- c("5 < 8", "5 != 5", "5 > 1", "5 > 7")
value2 <- c(paste0("(", 5 < 8, ")"),
paste0("(", 5 != 5, ")"),
paste0("(", 5 > 1, ")"),
paste0("(", 5 > 7, ")"))
result = c(5 > 3 | 5 < 8,
5 > 3 | 5 != 5,
5 < 4 | 5 > 1,
5 < 4 | 5 > 7)
kable( data.frame(relation1, value1, and, relation2, value2, equals, result,
stringsAsFactors = FALSE),
col.names = c("rel1", "(value)", "", "rel2", "(value)", "", "result"),
align = 'rlcrlcl',
caption = 'Truth table for OR')
```
Finally the negation or NOT logical operator is represented in R by the exclamation mark `!`. It negates or turns the logical value into its opposite.
```{r negation-logical-operator, exercise=TRUE, exercise.lines=2}
!(4 > 1)
!TRUE
```
Notice the parentheses around the first expression. What do you think R would do with an expression such as `!4 < 1`, where there are none?
Try it.
```{r negation-precedence, exercise=TRUE}
!4 < 1
# furthermore try just the first element
!4
```
As it turns out, R coerces `TRUE` to 1 and `FALSE` to 0 when a number is needed and, by the same token, interprets any non-zero number as `TRUE` when a logical value is needed; thus `!4` translates into `!TRUE`, which is `FALSE`. The full expression `!4 < 1` is not evaluated as `(!4) < 1`, however: comparison operators are applied before `!`, so R reads it as `!(4 < 1)`, that is `!FALSE`, which is `TRUE`.
The way R groups the parts of an expression before evaluating them brings us to the subject of operator precedence.
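To see that the grouping matters, here is a small sketch with an arbitrary value, 0.5, for which the two possible groupings disagree. It relies only on the rule that comparison operators are applied before `!`.

```r
# R applies < before !, so this is read as !(0.5 < 1)
!0.5 < 1
# the same grouping written out explicitly
!(0.5 < 1)
# forcing the negation first: !0.5 is FALSE, which is coerced to 0, and 0 < 1
(!0.5) < 1
```

The first two lines give `FALSE` and the third gives `TRUE`, so adding parentheses is the safest way to state which grouping you intend.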
### Operator precedence
Operators that act on a single value are _unary_ while those that receive two arguments are _binary_. The execution order of an expression containing several operators follows the rules of operator precedence. These rules determine the priority of execution given to some operators over others. Let's look at an example:
```{r op-precendence-A, exercise=TRUE, exercise.lines=2}
# predict the result computed by R before pressing the Run Code button
1 + 5 * 5
```
Did you predict the result correctly? For R, multiplication takes precedence over addition.
You can force the addition to be executed before the multiplication by using parentheses, as illustrated below:
```{r op-precendence-B, exercise=TRUE, exercise.lines=2}
# predict the new result using parenthesis around the addition operator
(1 + 5) * 5
```
To learn about the rules of precedence built into R you can use its help system.
Here is a list of all the _unary_ and _binary_ operators and their precedence order from highest to lowest from top to bottom and from left to right within groups:
```{r op-precedence-2, exercise=TRUE, exercise.lines=1}
# Run Code to see the table
library(knitr)
kable(data.frame(operator_groups = c(":: :::",
"$ @",
"[ [[",
"^",
"- +",
":",
"%any%",
"* /",
"+ -",
"< > <= >= == !=",
"!",
"& &&",
"| ||",
"~",
"-> ->>",
"<- <<-",
"=",
"?"),
description = c("access variables in a namespace",
"component / slot extraction",
"indexing",
"exponentiation (right to left)",
"unary minus and plus",
"sequence operator",
"special operators (including %% and %/%)",
"multiply, divide",
"(binary) add, subtract",
"ordering and comparison",
"negation",
"and",
"or",
"as in formulae",
"rightwards assignment",
"assignment (right to left)",
"assignment (right to left)",
"help (unary and binary)")))
```
You can get documentation like this from the R help at the command line by typing `?Syntax` or `help("Syntax")`. A fuller set of help functions appears in the table below.
```{r help-commands, exercise=TRUE, exercise.lines=1}
# Run Code to see the table
library(knitr)
long_form <- c("help.start()", "help(\"funABC\")", "help.search(\"funABC\")", "example(\"funABC\")", "RSiteSearch(\"funABC\")", "apropos(\"funABC\", mode = \"function\")", "data()", "vignette()", "vignette(\"ABC\")")
short_form <- c("", "?funABC", "??funABC", "", "", "", "", "", "")
description <- c("General help system",
                 "Help on function \"funABC\" (quotations optional)",
                 "Searches help for string \"funABC\"",
                 "Runs the examples for \"funABC\"",
                 "Opens a browser search for \"funABC\" on R online manuals and archived mailing lists",
                 "Lists all available functions with \"funABC\" in their name",
                 "Lists all data sets in loaded packages",
                 "Lists vignettes for currently installed packages",
                 "Displays the vignette named \"ABC\"")
kable( data.frame(long_form, short_form, description,
stringsAsFactors = FALSE),
caption = 'Help commands explained [@KabacoffRobert2015Ria, p.11]')
```
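Two rows of the operator precedence table shown earlier tend to surprise newcomers: the unary minus sits below exponentiation but above the sequence operator `:`. The following sketch, using arbitrary numbers, shows the effect and how parentheses change it.

```r
# ^ is applied before the unary minus, so this is -(2^2)
-2^2
# parentheses make the base negative instead
(-2)^2
# unary minus is applied before the sequence operator, so this runs from -1 to 3
-1:3
# parentheses build the sequence first and then negate it
-(1:3)
# ^ groups from right to left: 2^(3^2), not (2^3)^2
2^3^2
```

When in doubt, `?Syntax` settles the question, and explicit parentheses make the intent visible to the next reader.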
<!-- ======================== -->
### Test your understanding
<!-- ======================== -->
```{r quiz-operator-precedence, echo=FALSE, cache=FALSE}
quiz(
question_radio(
"The NOT operator is represented in R by the symbol ~",
answer("yes", correct = FALSE, message = 'This is a good question though!'),
answer("no", correct= TRUE),
correct = 'Correct, it is actually the exclamation mark so !TRUE will give FALSE',
random_answer_order = TRUE,
allow_retry = TRUE
),
question_checkbox(
" !(5 > 2) == (5 < 2)
Select all that apply.",
answer("Evaluates to TRUE", correct = TRUE),
answer("Evaluates to TRUE because == has higher precedence than all other operators in the expression", correct = FALSE),
answer("Evaluates to FALSE", correct = FALSE),
answer("Evaluates to FALSE because ! has higher precedence than ==", correct = FALSE),
answer("Evaluates to TRUE because negating the first inequality flips its order, making it identical to the inequality on the right", correct = TRUE),
random_answer_order = TRUE,
allow_retry = TRUE,
correct = "Well done, you have achieved a good understanding of relational and logical operators.",
incorrect = "Not all true answers selected or maybe you have chosen good and bad answers."
),
question("What is the result of executing
10 - 6 / 2 ?",
answer("2", message = "division has precedence over subtraction"),
answer("7", correct = TRUE, message = "division has higher precedence than subtraction"),
answer("8", message = "Good try, redo the calculation"),
answer("-4"),
allow_retry = TRUE,
random_answer_order = TRUE
),
# https://rdrr.io/github/timelyportfolio/sortableR/man/question_rank.html
sortable::question_rank(text = "Sort the following operators by precedence in descending order",
correct = "You must have memorized that table from the documentation, you are a star!",
incorrect = "Not quite right, try again!",
allow_retry = TRUE,
random_answer_order = TRUE,
options = sortable::sortable_options(),
learnr::answer(c("$", "[", "+", "&", "->"), correct = TRUE)
)
)
```
<!-- ======================== -->
## Variables in R
<!-- ======================== -->
Just like in mathematics and statistics, the concept of variables is that of a placeholder for values. In R or any other programming language it is a good idea to store results in variables so we can keep them around to use them in other expressions. Here is an exercise to declare and assign values to variables.
```{r storing-values, exercise.lines=7, exercise=TRUE}
# store a value
a <- 10
# store the result of an operation
b <- a + 5
# print the contents of 'a' and 'b' on separate lines
a
b
```
The `<-` binary operator indicates assignment. The value on the right is assigned to the variable on the left. After the first assignment is executed by R, the variable `a` stores the value `10`. When the second assignment is executed, `b` is assigned the result of applying the operator `+` to the value of the variable `a`, `10`, and the value `5`. As a result, `b` stores the value `15`.
The assignment operator has subtleties in R that we will cover when addressing the subject of functions and their arguments.
<!-- ======================== -->
### Test your understanding
<!-- ======================== -->
```{r quiz-simple-assignment, echo=FALSE, cache=FALSE}
quiz(
question("What is the value of b after executing the following R code?
a <- 5
b <- a * (34 - 22) / 1 + 1",
answer("12"),
answer("30"),
answer("61", correct = TRUE),
answer("149"),
allow_retry = TRUE,
random_answer_order = TRUE,
try_again = "Check the operator precedence table and try again."
)
)
```
### Variables and objects
In R the name used to label a variable is itself an object [@RLangDef.objects]. In fact everything is an object in R, and in this tutorial we will refer to objects and variables interchangeably.
Variables have a type and a mode. To find out the type and the mode of an R variable, use the built-in functions `typeof()` and `mode()`. Functions are the subject of the next section.
## Built-in functions
R offers predefined functions. When you use an R package you also use functions to get many complex tasks done with ease. It is important to familiarize yourself with their syntax. This requires that we start simple and build up the concepts gradually. The goal is to learn how to pass information to them and how to obtain results in two forms: as direct returned objects or as side effects. A side effect would be something like a file being written to the local disk or to a file service in the cloud.
### Finding the type of a variable
Functions in general may take arguments and return computed values when executed. Let's revisit the type and mode of variables.
```{r type-of-variables, exercise=TRUE, exercise.lines=8}
i <- 54
typeof(i)
j <- 54L
typeof(j)
x <- 3.1416
typeof(x)
first_name <- "Rob"
typeof(first_name)
```
R stores variables using other, related, categories. These can be queried via the `mode` function:
```{r mode-of-variables, exercise=TRUE, exercise.lines=8}
i <- 54
mode(i)
j <- 54L
mode(j)
x <- 3.1416
mode(x)
first_name <- "Rob"
mode(first_name)
```
Let's investigate the type and mode of the R function `mode` itself.
```{r mode-of-mode, exercise=TRUE, exercise.lines=2}
typeof(mode)
mode(mode)
```
The function `mode` has type `closure` and mode `function`. In computer language _lingo_ a closure is a function together with the environment in which it is evaluated; let's leave it at that for now. R is a unique language in more ways than one, so let's move on with more basic concepts.
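The relationship between the two functions can be sketched as follows: `typeof()` reports how R stores an object internally, while `mode()` groups several storage types under a broader label. The values and functions below are arbitrary examples.

```r
# integers and doubles are stored differently...
typeof(54L)
typeof(54)
# ...but share the same mode
mode(54L)
mode(54)
# functions written in R are closures, while some core functions
# are built into the interpreter itself
typeof(mean)
typeof(sum)
# both kinds have the mode "function"
mode(mean)
mode(sum)
```

Here `typeof()` distinguishes `"integer"` from `"double"` and `"closure"` from `"builtin"`, whereas `mode()` reports only `"numeric"` and `"function"`.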
### Assignment of values returned from built-ins
Some expressions may be made up of a function with its arguments and the assignment operator to store the returned value from the function in a new variable. Let's look at the result of running the following lines:
```{r executing-function, exercise.lines=4, exercise=TRUE}
# assign the result of the function call to the variable 'e'
e <- exp(1)
# print the contents of 'e' to the output
e
```
The first statement evaluates the expression on the right of the assignment operator, a call to the function `exp()` with the argument `1`, and assigns the returned value to the variable `e`. The second statement prints the value stored in `e` as `2.718282`.
Try running the assignment by itself below.
```{r executing-function-oneline, exercise.lines=1, exercise=TRUE}
e <- exp(1)
```
There should be no output. The value of the function `exp` with the argument `1` was computed and stored in the variable `e`, but the assignment leaves no trace on the output because R returns the assigned value invisibly. Another way of saying this is _the assignment operator produces no side effects in the console_. Its only effect is to create a name-value pair, (e, 2.718282), in the global environment, so the value can be recalled later by its assigned name.
*Note:* As is usually the case in R, there is more than one way of getting things done. You can get a two-for-one effect, printing and assigning in a single expression on the console, by surrounding the assignment with parentheses. Try it!
```{r two-for-one, exercise = TRUE, exercise.lines=2}
# this accomplishes the assignment and prints the value of the variable in one line
(e <- exp(1))
```
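One way to check this behaviour for yourself is the base function `withVisible()`, which reports whether the value of an expression would be printed at the console. This is only a sketch to illustrate the point; it reuses the variable name `e` from the exercises above, and `a` and `b` are arbitrary.

```r
# a bare assignment returns its value invisibly, so nothing is printed
withVisible(e <- exp(1))$visible
# wrapping the assignment in parentheses makes the value visible again
withVisible((e <- exp(1)))$visible
# because an assignment still returns a value, assignments can be chained
a <- b <- 3
a + b
```

The first check gives `FALSE` and the second `TRUE`, which is why the parenthesised form prints while the bare assignment stays silent.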
### Other built-in functions
To compute logarithms, R offers the following predefined functions:
```
log(x, base = exp(1))
logb(x, base = exp(1))
log10(x)
log2(x)
log1p(x)
```
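As a quick sketch of how these functions relate to one another (the inputs are arbitrary), the specialised versions are shortcuts for `log()` with a particular `base`, and `log1p(x)` computes `log(1 + x)` in a way that stays accurate for very small `x`:

```r
# the same logarithm written two ways
log(1000, base = 10)
log10(1000)
log(8, base = 2)
log2(8)
# with no base given, log() is the natural logarithm
log(exp(2))
# log1p(x) is log(1 + x); for tiny x the direct form loses the information,
# because 1 + 1e-20 is rounded to exactly 1 before the logarithm is taken
log1p(1e-20)
log(1 + 1e-20)
```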
The most fundamental R built-in functions come bundled in the `base` package.
To read about an R package use the function call `library(help = "base")` at the command line. Packages are loaded by calling the function `library()`, as in `library("base")`, but the `base` package is loaded by default when an R session is started. Try it below.
```{r functions-base-package, exercise=TRUE, exercise.lines=1}
library(help = "base")
```
The R language provides a standard workflow to build packages that contain functions, variables, and data targeting a specific problem. Contributors write R packages and share them mainly via the CRAN repository. To find out what comes built in with the `base` package we can call the `builtins()` function to produce the list of `r length(builtins())` objects that are available in the base environment when you start R.
```{r number-functions-base-package, exercise=TRUE, exercise.lines=5}
# get the number of objects loaded from the base R package
length(builtins())
# test if an object (a function) is part of the built-ins in the base environment
"exp" %in% builtins()
```
<!-- ======================== -->
### Test your understanding
<!-- ======================== -->
```{r quiz-complex-expressions, echo=FALSE, cache=FALSE}
quiz(
question("Consider the expression:
a <- 1 + log10(10)
What is the result of executing the line above in the R console?",
answer("the value 2 gets assigned to _a_ and printed to the console", message = "the operator _assignment_ has no side effects so nothing should be printed to the console"),
answer("error, _log10_ of 10 is undefined", message = "That's not quite right, check the definition of a logarithm and try again"),
answer("nothing gets printed", correct = TRUE, message = "The operator _assignment_ has no side effects so nothing gets printed after the value 2 gets associated with the variable _a_"),
answer("the value of _a_ gets printed", message = "The operator _assignment_ has no other effect than to associate a value with a variable name"),
allow_retry = TRUE,
random_answer_order = TRUE
)
)
```
### The assignment operators and their uses
Experiment now to compute logarithms in base 10 and natural logarithms (base _e_). Try to answer the following:
- What is the logarithm of 10 in base 10?
- What is the natural logarithm of _e_?
- What is the logarithm of 512 in base 2?
Try using the two variables already predefined in the first two lines. Add as many lines of code as you need to experiment.
```{r other-operators, exercise=TRUE, exercise.lines=12}
a <- 10
e = exp(1)
print(paste0("a = ", a, "; e = ", round( x = e, digits = 4)))
# logarithm of 10 in base 10
# natural logarithm of e
# binary logarithm of 512
```
```{r other-operators-hint}
log10(a)
log(e, base = e)
log2(512)
```
From the previous exercise you might have noticed that the operators `<-` and `=` behave identically in stand-alone expressions. Technically speaking, their effect is to create name-value pairs for each variable, (_a_, 10) and (_e_, 2.7183), in the global environment.
So, you might ask: why are there two operators to do the same in R?
Read on for the answer.
### Function arguments: positional and named
R functions may be built-in or user created, and they may have zero, one, or more arguments. The arguments are given between parentheses, separated by commas, and they may be named or not. An example of a named argument is `base` in the `log` function: `log(x, base = exp(1))`. Examine the code below and guess the output before running it. Did you expect the result?
```{r named-param-1, exercise = TRUE, exercise.lines=3}
(three <- log(1000, base = 10))
# un-comment to check if base exists after executing the line above
# base
```
<!-- ```{r named-param-1, exercise = TRUE, exercise.lines=9}
three <- log(1000, base = 10)
# Does 'base' exist in the global environment after log gets valuated?
# let's check it out and print a nice message accordingly
if (exists("base")) {
print(paste0("base exists outside of log, base = ", base))
} else {
print("base does not exist outside of log")
}
```--->
In the call `log(1000, base = 10)`, the first argument is positional: it is matched to the parameter `x` because it occupies position one. The second argument is named and receives its value via the `=` operator.
If the function is called without a second argument, and `base` is used for computation inside the function, R evaluates the default expression `exp(1)` from the function definition and associates the result with `base`. This is how a default value for a named argument can be given.
If the user prefers to pass a value different from the default, the named argument can be given as in `log(10, base = 10)` or just positionally as in `log(10, 10)`.
```{r named-param-2, exercise = TRUE}
# use the name for the second argument
log(1000, base = 10)
# use only a value for the second argument, still ok!
log(1000, 10)
```
Now compare the flexibility of using the explicit name assignment.
```{r named-param-3, exercise = TRUE}
# the first argument is now a named argument so the second gets position one instead
log(base = 10, 1000)
# in absence of names x = 10 (position one) and base = 1000 (position two)
log(10, 1000)
```
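The matching rules just described can be sketched with a few more calls. `args()` prints the parameter names and defaults a function declares; the `substr()` call is included only as a second, arbitrary example of a function whose arguments can be named in any order.

```r
# show the parameter names and defaults that log() declares
args(log)
# with no base supplied, the default exp(1) is used: a natural logarithm
log(100)
# naming every argument lets you write them in any order
log(base = 10, x = 100)
# the same rules apply to other functions, e.g. substr(x, start, stop)
substr("Calgary Flames", start = 1, stop = 7)
substr(stop = 7, x = "Calgary Flames", start = 1)
```

Both `substr()` calls return the same seven characters, because names take priority over position when R matches arguments.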
### More on variable assignment
We already saw that the `=` operator for a named argument does not affect the global environment from which the function is called. Variable assignment is like creating a pair (variable-name, value) that lives in a scope where it can be reached for further evaluation.
We could use the `<-` operator inside the call instead, as in `log(1000, base <- 10)`. R then treats `base <- 10` as an ordinary assignment: it creates the pair (base, 10) in the global environment and passes the resulting value `10` to `log` by position. That implies that there will be a global variable `base` with the value `10` outside of the function `log` after `log` returns. Before testing that, let's confirm that naming the argument with `=` leaves an existing global variable `base` untouched.
```{r named-param-4, exercise = TRUE}
base = 100
log(1000, base = 10)
# Is the global 'base' still 100 after log gets evaluated?
base
```
<!-- ```{r named-param-4, exercise = TRUE}
log(1000, base <- 10)
# Does 'base' exist in the global environment after log gest evaluated?
if (exists("base")) {
print(paste0("base exists outside of log, base = ", base))
} else {
print("base does not exist outside of log")
}
``` -->
If we use the assignment operator `<-` for the first argument of the call, the result may be unexpected compared to naming the argument with `=`. Check for yourself with the code below.
```{r named-param-5, exercise = TRUE}
log(base <- 10, 1000)
# try now using the local assignment = for the named parameter
log(base = 10, 1000)
```
Did you get the same result as with `log(base = 10, 1000)`? You should not have. In `log(base <- 10, 1000)` the assignment is evaluated first and its value `10` is passed by position as `x`, so `1000` becomes the base and the call is equivalent to `log(10, 1000)`, roughly `0.333` instead of `3`. In addition, there is now a variable assignment represented by the pair `(base, 10)` that outlives the call to `log`.
In summary, to avoid ambiguities and unplanned side effects, use `<-` to assign values to variables in stand-alone expressions and `=` to name function arguments. In the next section there are a few exercises to solidify these concepts.
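The difference between the two operators inside a call can be checked with a short sketch. The variable names `b` and `result` are arbitrary and were chosen only so the example does not clash with the `base` variable used earlier.

```r
# = inside a call names an argument and creates no variable
log(1000, base = 10)
# <- inside a call is an ordinary assignment: b is created in the calling
# environment and its value, 10, is passed by position as x
result <- log(b <- 10, 1000)
# so this was log(10, base = 1000), about 0.333
result
# and b is still around after the call
b
```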
## On number representation
Computers store and operate on numbers in the binary system, meaning with only two states that we will call "off" and "on", or zeros and ones. Because of this, many decimal fractions cannot be stored exactly, and round-off errors are intrinsic to converting between binary and decimal, the counting system we humans are more familiar with. Let's investigate the problem and find a solution in R [@CRAN.FAQ.RDoesNotThinkNumbersAreEqual].
```{r round-off-error, exercise=TRUE, exercise.lines=1}
0.1 + 0.1 + 0.1 == 0.3
```
Wait a second, no one saw that one coming! Let's explore what is happening and why. First let's try to see the decimal representation of these numeric types.
```{r printing-double-at-max-resolution, exercise=TRUE, exercise.lines=2}
print(0.1, digits = 17)
print(0.3, digits = 17)
```
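To make the size of the discrepancy concrete, the following sketch subtracts the two sides of the comparison. The variable name `gap` is arbitrary. The difference is tiny but not zero, which is all `==` needs to return `FALSE`.

```r
# the sum is stored as a value slightly above 0.3
print(0.1 + 0.1 + 0.1, digits = 17)
# the gap between the two sides of the comparison
gap <- (0.1 + 0.1 + 0.1) - 0.3
gap
# it is not zero, but it is smaller than the spacing between 1 and the next double
gap != 0
gap < .Machine$double.eps
```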
Let's now investigate the machine's precision for representing a double-precision floating point number. The R documentation explains the meaning of `double.eps` and the other components of the constant `.Machine` (you can summon the documentation with `help(.Machine)`):
```{r floating-point-precision, exercise=TRUE, exercise.lines=2}
# using the R constant .Machine, find out more with help(".Machine")
(.Machine$double.base ^ .Machine$double.ulp.digits) / 2
```
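On typical hardware this value, about 1.11e-16, is half of `.Machine$double.eps`. It marks the threshold for the expression `1 + x == 1`: any `x` at or below it is rounded away and the expression is `TRUE`, while an `x` just above it makes the expression `FALSE`: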
```{r precision-double-test, exercise=TRUE, exercise.lines=3}
1 + 1.00e-16 == 1
1 + 1.11e-16 == 1
1 + 1.12e-16 == 1
```
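For comparison, R stores the spacing between 1 and the next larger double in `.Machine$double.eps`. The sketch below assumes the usual IEEE 754 double arithmetic with round-to-nearest; the names `eps` and `half_eps` are made up for this illustration.

```r
eps <- .Machine$double.eps   # spacing between 1 and the next double
half_eps <- (.Machine$double.base ^ .Machine$double.ulp.digits) / 2

half_eps == eps / 2          # TRUE
1 + eps == 1                 # FALSE: eps is large enough to register
1 + half_eps == 1            # TRUE: the tie is rounded back to 1
1 + 1.01 * half_eps == 1     # FALSE: just above the tie rounds up
```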
To avoid the round-off error when making this kind of comparison, it is recommended to use a built-in function that compares doubles within a small tolerance: `all.equal()`.
```{r how-to-avoid-round-off-error, exercise=TRUE}
sum_calc <- 0.1 + 0.1 + 0.1
sum_expected <- 0.3
all.equal(sum_calc, sum_expected)
```
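`all.equal()` returns `TRUE` when the values agree within its tolerance, and otherwise a character string describing the difference. Inside a condition it is therefore usually wrapped in `isTRUE()`. A minimal sketch, with variable names chosen only for illustration:

```r
sum_calc <- 0.1 + 0.1 + 0.1

# exact comparison fails because of the round-off error
exact_match <- sum_calc == 0.3

# comparison within the default tolerance succeeds
close_match <- isTRUE(all.equal(sum_calc, 0.3))

# values that really differ are still reported as different
different <- isTRUE(all.equal(sum_calc, 0.31))

print(c(exact = exact_match, close = close_match, different = different))
```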
## Data Representation in R
Every computing language uses a model to represent information in memory and R is no exception. Everything in R is an object with a default constructor. We are interested in the objects that adopt certain shapes in the computer memory to hold values. The values have types that can be queried with the built-in function `typeof()` that we have seen before. We will cover these definitions and work with them in this section.
Let's consider these cases:
* We need to store the grades of the mid-term exam of a class with 300 students to compute stats on them and make some visualizations.
* Then we need to store the student IDs, the course section, year of studies, and program of study the students belong to.
* Finally we want to find the two most relevant variables that influence grade and then visualize the clusters of students that got A or better as a function of those two variables.
A simple spreadsheet would have been enough if all we needed was to store the data. However, automating repetitive tasks, reporting, visualization, and further data processing with algorithms make a language like R more attractive for these tasks.
Although solving this problem is beyond the scope of this tutorial, the idea of using a programming language forces you to think of the type of data and the structures that you need to store and manipulate it in order to accomplish your goal. That is exactly why we now need to address the syntax of those data types and structures in the R language.
### Data Types
The fundamental values that R can represent and manipulate in the computer memory are:
- integer
- double (also called numeric)
- character
- logical
There are two less commonly used: _complex_ and _raw_ that we will leave for another time.
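As a quick check, `typeof()` reports which of these types a value has. In this sketch the `L` suffix, which marks an integer literal, is the only notation that may be new at this point.

```r
typeof(42L)    # "integer": the L suffix requests an integer
typeof(42)     # "double": numbers without the suffix are doubles
typeof("42")   # "character"
typeof(TRUE)   # "logical"
```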
### Data Structures
These are the shapes of the data in the computer memory, literally. There are two types of data structures according to the type of values they can store: homogeneous and heterogeneous.
They can also be categorized according to the dimensions they can store: 1d, 2d or nd. This produces the following double entry table:
```{r data structures, echo=FALSE, results='asis'}
library(knitr)
Dimensions <- c("One", "Two", "Three or more")
Homogeneous <- c("Vector", "Matrix", "Array")
Heterogeneous <- c("List", "Data frame", "")
kable( data.frame(Dimensions, Homogeneous, Heterogeneous,
stringsAsFactors = FALSE),
caption = 'Native data structures in R according to the data type they can store and the number of dimensions they use [@WickhamHadley2015AR, p.13].')
```
R was designed to manipulate data using these structures; they are built into the language rather than coming from libraries or add-ons. This gives R a certain expressive power to work with data.
This is important because a computer language has to allow the manipulation of values in memory using a certain recipe called an _algorithm_. These algorithms rely on the properties of the data structures and the data types themselves. They are intimately related. A computer language will allow a human to write a solution to a problem in terms of the data structures and types that it provides. R has the fundamental data structures and types that we just discussed. Let's see how to start using them to represent information.
**Note:** To reveal the data structure of R objects the built-in function `str()` may be handy although the output for complex objects may be difficult to interpret.
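A minimal sketch of one object per cell of the table, with a few of them inspected through `str()`. The object names and values are made up for illustration.

```r
grades_vec <- c(71, 85, 90, 64)                        # vector: 1d, one type
grades_mat <- matrix(grades_vec, nrow = 2)             # matrix: 2d, one type
grades_arr <- array(1:24, dim = c(2, 3, 4))            # array: nd, one type
student    <- list(id = "s01", year = 2L, grade = 85)  # list: 1d, mixed types
roster     <- data.frame(id = c("s01", "s02"),         # data frame: 2d, mixed types
                         grade = c(85, 64))

str(grades_vec)
str(student)
str(roster)
```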
## Vectors
From the table we just saw one could read: if you need a one-dimensional data structure to store objects of the same type, then use a vector. An important characteristic of vectors is that their contents can be stored in contiguous memory because all the elements require the same space thanks to being of the same type.
### Constructing a vector
A vector of six integer values would be represented graphically as a long structure of six boxes of equal size:
```{r vector, echo=FALSE, results='asis'}
knitr::include_graphics("images/vector.png", dpi = 86)
```
And as code you would use the function `c()`:
```{r vector-construction, exercise = TRUE, exercise.lines=1}
c(5, -1, 3, 0, -4, 1)
```
The function `c()` takes a variable number of arguments with or without names. Once a vector has been constructed and assigned to a variable `x`, its elements can be extracted with the subsetting operator `[]`. `x[1]` subsets the vector represented by the name `x` returning another vector with the element from the first position.
Try to answer the questions with your code, use the hint if necessary.
```{r vector-example, exercise = TRUE, exercise.lines=10}
x <- c(5, -1, 3, 0, -4, 1)
# extract the third element of x
# subtract the first element from the last and print the result
# compute the length of x
# compute the difference between the last and first elements using the length
```
```{r vector-example-hint, echo=TRUE}
x[3]
x[6] - x[1]
length(x)
x[length(x)] - x[1]
```
### A vector in disguise
Did you notice the language used to explain subsetting? `x[1]` returns a vector with the first element of `x`. In R the subsetting function returns another vector. Let's verify these statements with R itself:
```{r verify-data-structure-subsetting-a-vector, exercise=TRUE, exercise.lines=6}
x <- c(5, -1, 3, 0, -4, 1)
first_element <- x[1]
typeof(x)
is.vector(x)
typeof(first_element)
is.vector(first_element)
```
### Naming the elements of a vector
A vector can also have named elements. These may help future you or another reader of the code to understand the intended meaning of the data. These names can be assigned when creating the vector or later with the _unary_ function `names`. Here are examples for you to try.
```{r examples-named-vectors, exercise=T, exercise.lines=9}
x <- c(5, -1, 10)
temperature_labels <- c("current_temp", "low_forecast", "high_forecast")
# assign the names after creation of the vector
names(x) <- temperature_labels
# assign names while creating the vector
y <- c(current_temp = 5, low_forecast = -1, high_forecast = 10)
# now print out the x and y vectors, do you see any differences?
```
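Once you have tried the exercise, the sketch below shows one way to compare the two vectors and to put the names to use. Selecting an element by its name with a character string is standard R, although it has not been introduced yet at this point of the tutorial.

```r
x <- c(5, -1, 10)
names(x) <- c("current_temp", "low_forecast", "high_forecast")
y <- c(current_temp = 5, low_forecast = -1, high_forecast = 10)

identical(x, y)      # TRUE: same values and same names
names(y)             # the labels as a character vector
y["low_forecast"]    # select an element by its name
```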
### Constructing vectors with patterns
Creating vectors can become a tedious task so there are a number of techniques to lighten this burden. The functions `rep()` and `seq()` can be pretty helpful to achieve this. Study the output of these functions.
```{r seq-rep, exercise=TRUE, exercise.lines=2}
rep(-1, times = 10)
seq(1, 100, by = 5)
```
`rep` and `seq` can be combined to compose colorful patterns.
```{r rep-seq-mix, exercise=TRUE, exercise.lines=6}
# repeat the sequence of digits twice
rep(seq(1, 9), times = 2)
# repeat the sequence of digits two at a time
rep(seq(1, 9), each = 2)
# generate 5 numbers between 0 and 5
seq(0, 5, length.out = 5)
```
As you might have guessed from the previous exercises, `F` is a shorthand for `FALSE` and `T` is a shorthand for `TRUE`. Logical values are also called the boolean data type and can only take two values, each defined as the negation of the other. In arithmetic, `FALSE` counts as `0` and `TRUE` counts as `1`, so you can add logical values as if they were zeros and ones. This is useful to count the number of positive or TRUE values in a vector:
```{r logical-sum, exercise= TRUE, exercise.lines=3}
is_positive <- rep(rep(c(F, F, T, F, F, F, F, T, T, F, F, F, F, T), each = 2), times = 7)
# how many TRUE values are there in is_positive?
```
```{r logical-sum-hint}
sum(is_positive)
```
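The same coercion works with other arithmetic functions. A small sketch on a made-up vector `flags`:

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

TRUE + TRUE    # 2: each TRUE counts as 1
sum(flags)     # 3: number of TRUE values
mean(flags)    # 0.75: proportion of TRUE values
sum(!flags)    # 1: number of FALSE values
```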
The binary operator `:` can be used with integer values as a short version of `seq(from, to)`.
```{r short-version-of-seq, exercise=TRUE, exercise.lines=1}
1:50
```
*Note:* the execution of the previous expression creates a vector of integers from 1 to 50. At this point in our tutorial, with the output of this vector spreading over more than one line on the console, we can interpret the meaning of the numbers in square brackets on the left of the output. They indicate the position within the vector of the first value presented in that line of the output. For example, if the second line of output begins with `[23]`, its first value is the 23rd element of the vector, which happens to be the number 23 in the sequence.
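The type produced by `:` can be checked with `typeof()`. A short sketch, which also shows that the operator can count down:

```r
typeof(1:50)        # "integer"
typeof(c(1, 50))    # "double": plain number literals are doubles
length(1:50)        # 50
10:7                # counts down: 10 9 8 7
```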
### More operations on vectors
How about arithmetic operations? R applies the arithmetic operators element-wise by default. Have a look below.
```{r multiplication-vectors-different-length-1, exercise=T, exercise.lines=4}
# weekly consumption of ingredients at a bakery (in undisclosed units)
weekly_values <- c(flour = 12, eggs = 450, yeast = 5, salt = 35)
# estimate the annual consumption given 52.18 weeks per year
weekly_values * 52.18
```
```{r setup-element-wise-vectors}
weekly_values <- c(flour = 12, eggs = 450, yeast = 5, salt = 35)
```
Operations between two vectors are applied element-wise as well. The bakery needs to plan raw materials for a special-event week where more cakes than bread will be needed. The kitchen produces a multiplier vector to adjust the weekly estimates.
```{r element-wise-vector-multiplication, exercise=T, exercise.lines=3, exercise.setup="setup-element-wise-vectors"}
special_cake_contract_week <- c(1.2, 1.3, 1.2, 1)
special_week <- weekly_values * special_cake_contract_week
weekly_values == special_week
```
### Vector recycling
Continuing on from the previous examples, what would happen if the multiplier vector provided by the kitchen had had only estimates for two materials and we had multiplied it by the weekly estimates?
```{r multiplication-vectors-different-length-2, exercise=T, exercise.lines=1, exercise.setup="setup-element-wise-vectors"}
weekly_values * c(1.2, 1.3)
```
This was a multiplication of a vector of length 4 with another vector of length 2. What did R do? The answer lies in a _useful_ but _risky_ operation R does in these cases, called <span style="color:red">recycling</span>. When doing the element-wise multiplication R reuses the shorter vector until all elements of the longer one are multiplied. It is useful if this is your intention and you are sure of the correctness of the results.
In other words, the previous operation is the equivalent of having multiplied `weekly_values` by `c(1.2, 1.3, 1.2, 1.3)`. Note that if the length of the longer vector is not an exact multiple of the length of the shorter one, the recycled values get clipped to fit the length of the longer vector and R issues a warning.
```{r clipping-recycled-vector-to-fit, exercise=T, exercise.lines=3, warning=F, exercise.setup="setup-element-wise-vectors"}
new_weekly_estimates <- weekly_values * c(1.2, 1.3, 1.1)
# test if the recycled value was as expected
new_weekly_estimates[4] == weekly_values[4] * 1.2
```
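Because recycling happens silently when the lengths divide evenly, one defensive option is to check the lengths before operating. The helper `multiply_strict()` below is hypothetical, written only for this illustration; it is not part of R.

```r
# hypothetical helper: multiply only when recycling cannot surprise us
multiply_strict <- function(a, b) {
  # allow equal lengths, or a single value applied to every element
  stopifnot(length(a) == length(b) || length(b) == 1)
  a * b
}

weekly_values <- c(flour = 12, eggs = 450, yeast = 5, salt = 35)

annual  <- multiply_strict(weekly_values, 52.18)                # single value: allowed
special <- multiply_strict(weekly_values, c(1.2, 1.3, 1.2, 1))  # same length: allowed

# a multiplier of length 2 is rejected instead of being recycled
result <- try(multiply_strict(weekly_values, c(1.2, 1.3)), silent = TRUE)
inherits(result, "try-error")                                   # TRUE
```

Whether such a check is worth the extra code depends on how much you trust the inputs; for a single number applied to a whole vector, recycling is exactly what you want.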
<!-- ======================== -->
### Time to practice
<!-- ======================== -->
```{r quiz-constructing-vectors, echo=FALSE, cache=FALSE}
quiz(
question_radio(
"rep(1:5, times = 2) will print out
[1] 1 1 2 2 3 3 4 4 5 5",
answer("yes", correct = FALSE, message = 'Perhaps you are thinking of rep(1:5, each = 2)?'),
answer("no", correct= TRUE),
correct = 'Correct, the named argument \'times\' repeats the whole vector',
random_answer_order = TRUE,
allow_retry = TRUE
),
question("What is the result of executing the following expression:
seq(5, 20, 4)",
answer("[1] 4 9 14 19", correct = FALSE, message = "Maybe review the order of the arguments to 'seq'?"),
answer("[1] 5 9 13 17", correct = TRUE),
answer("[1] 4 5 20", correct = FALSE, message = "Check the meaning of the arguments to 'seq'."),
answer("[1] 5 20 5 20 5 20 5 20", correct = FALSE, message = "Good try, you might be confused with 'rep'"),
answer("[1] 5 9 13 14 19 20", correct = FALSE, message = "That last 20 doesn't quite match the pattern!"),
allow_retry = TRUE,
random_answer_order = TRUE,
correct = "It would have been easier if the arguments had been named but you still figured it out, nicely done!"),
question("The function 'sample' takes a vector and returns a random sample of its elements. What would this expression return?
sample(c(letters, LETTERS), size = 4, replace = F)
The named argument 'size' gives the number of elements in the sample and the 'replace' is a logical value to indicate whether the items from the vector can appear more than once in the sample.",
answer("[1] \"w\" \"i\" \"D\" \"F\"", correct = T),
answer("[1] \"a\" \"L\" \"a\" \"D\"", correct = F, message = "The sample was requested without replacement so this answer is not the one!"),
answer("[1] \"E\" \"A\" \"q\" \"K\" \"e\"", correct = F, message = "This sample is too big, the requested sample size was 4!"),
answer("[1] \"!\" \"B\" \"Q\" \"V\"", correct = F, message = "The vector we are sampling from does not contain punctuation symbols."),
answer("[1] \"s\" \"1\" \"M\" \"i\" ", correct = F, message = "The original vector to sample from does not contain digits."),
allow_retry = T,
random_answer_order = T,
correct = "Wow, that was impressive, you are really beginning to get a kick out of R!"),
question("What would the result of executing the following expression be?
times_each <- c(9,2,7)
x <- rep(x = 0:2, times = times_each)
length(x)
Hint: remember that R vectorized operations act element-wise.",
answer("[1] 18", correct = T),
answer("[1] 3", correct = F, message = "Maybe you are not counting on the effect of the function rep?"),
answer("[1] 27", correct = F, message = "Something is amiss!"),
answer("[1] 2", correct = F, message = "Too short... perhaps go over the material again?"),
answer("[1] Error", correct = F, message = "No error here: 'times' can be a vector with one count per element of x."),
answer("[1] 0", correct = F, message = "I did not see this one coming!"),
allow_retry = T,
random_answer_order = T,
correct = "This is x:
[1] 0 0 0 0 0 0 0 0 0 1 1 2 2 2 2 2 2 2
That was awesome, did you use pen and paper or the R command line?",
incorrect = "Oops")
)
```
### Subsetting vectors
Subsetting can be done by position or by value, see @BaseR-Cheatsheet. Subsetting is akin to filtering: we select the elements of a vector that we want and leave the rest out.
The following are ways of subsetting a vector by selecting elements according to their **position**.
Combining the binary operator `:` and the subsetting operator one can extract ranges of elements of a vector.
```{r subsetting-a-vector, exercise=T, exercise.lines=3}
x <- c(5, -1, 3, 0, -4, 1)
# subset the range of elements from the third to the fifth one
x[3:5]
```
There are many interesting ways of subsetting a vector using ranges. For example, a negative value indicates that the element at that index is to be left out of the subset of the vector.
```{r, "setup-vector-to-subset-examples"}
x <- c(5, -1, 3, 0, -4, 1)
```
```{r subset-from-a-point-to-the-end, exercise=TRUE, exercise.lines=4, exercise.setup="setup-vector-to-subset-examples"}
# subset all but the first element of the vector 'x'
# extract the last element using the function 'length()' to give the desired position
```
```{r subset-from-a-point-to-the-end-hint}
x[-1]
x[length(x)]
```
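Negative positions can also be given as a vector, which removes several elements at once. A short sketch on the same vector `x`:

```r
x <- c(5, -1, 3, 0, -4, 1)

x[-c(1, 2)]      # drop the first two elements: 3 0 -4 1
x[-length(x)]    # drop the last element: 5 -1 3 0 -4
x[-(2:4)]        # drop a range, the parentheses are required: 5 -4 1
```

The parentheses in the last line matter: `-2:4` would be read as the sequence from -2 to 4, which mixes negative and positive positions and is rejected by R.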
You can subset the elements at specific locations using a vector of positions.
```{r, subsetting-vectors-with-vectos-of-positions, exercise=T, exercise.lines=4, exercise.setup="setup-vector-to-subset-examples"}
# create a vector of positions
pos <- c(1, 5)
# subset (filter)
x[pos]
```
Elements can be selected not only by their position but also by their values, wherever they sit in the vector. This is appropriately called subsetting by **value**.
The first example uses logical masks to extract a subset of the vector. A logical mask is created by applying a comparison operator to the target vector, for example to flag the elements less than or equal to a given value. R is vectorized so the syntax is straightforward:
```{r, subsetting-vectors-with-logical-masks, exercise=T, exercise.lines=10, exercise.setup="setup-vector-to-subset-examples"}
# create a vector of logical types
mask <- x > 2
# subset (filter)
x[mask]
# another mask
mask2 <- x == 0
# compute how many zeros were found in 'x'
length(x[mask2])
# another way is to use the mask by itself
sum(mask2)
```
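Masks can be combined, and a mask can be turned back into positions. The sketch below uses the element-wise operators `&` (and) and `|` (or) and the function `which()`; all three are part of base R but have not been introduced earlier in this section.

```r
x <- c(5, -1, 3, 0, -4, 1)

x[x > 0 & x < 5]     # both conditions must hold: 3 1
x[x < 0 | x == 0]    # either condition may hold: -1 0 -4
which(x > 2)         # positions of the TRUE values: 1 3
x[which(x > 2)]      # back to subsetting by position: 5 3
```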