diff --git a/source/acknowledgments.Rmd b/source/acknowledgments.Rmd
index 19ef5248c..2c70815ed 100644
--- a/source/acknowledgments.Rmd
+++ b/source/acknowledgments.Rmd
@@ -1,13 +1,13 @@
# Acknowledgments {-}
-We'd like to thank everyone that has contributed to the development of
+We'd like to thank everyone who has contributed to the development of
[*Data Science: A First Introduction*](https://datasciencebook.ca).
This is an open source textbook that began as a collection of course readings
-for DSCI 100, a new introductory data science course
+for DSCI 100, a new introductory data science course
at the University of British Columbia (UBC).
-Several faculty members in the UBC Department of Statistics
-were pivotal in shaping the direction of that course,
-and as such, contributed greatly to the broad structure and
+Several faculty members in the UBC Department of Statistics
+were pivotal in shaping the direction of that course,
+and as such, contributed greatly to the broad structure and
list of topics in this book. We would especially like to thank Matías
Salibián-Barrera for his mentorship during the initial development and roll-out
of both DSCI 100 and this book. His door was always open when
@@ -15,7 +15,7 @@ we needed to chat about how to best introduce and teach data science to our firs
We would also like to thank Gabriela Cohen Freue for her DSCI 561 (Regression I) teaching materials
from the UBC Master of Data Science program, as some of our linear regression figures were inspired by these.
-We would also like to thank all those who contributed to the process of
+We would also like to thank all those who contributed to the process of
publishing this book. In particular, we would like to thank all of our reviewers for their feedback and suggestions:
Rohan Alexander, Isabella Ghement, Virgilio Gómez Rubio, Albert Kim, Adam Loy, Maria Prokofieva, Emily Riederer, and Greg Wilson.
The book was improved substantially by their insights.
@@ -24,9 +24,9 @@ for his support and encouragement throughout the process, and to
Roger Peng for graciously offering to write the Foreword.
Finally, we owe a debt of gratitude to all of the students of DSCI 100 over the past
-few years. They provided invaluable feedback on the book and worksheets;
-they found bugs for us (and stood by very patiently in class while
+few years. They provided invaluable feedback on the book and worksheets;
+they found bugs for us (and stood by very patiently in class while
we frantically fixed those bugs); and they brought a level of enthusiasm to the class
that sustained us during the hard work of creating a new course and writing a textbook.
Our interactions with them taught us how to teach data science, and that learning
-is reflected in the content of this book.
+is reflected in the content of this book.
diff --git a/source/classification1.Rmd b/source/classification1.Rmd
index f56ac521d..b214e8d16 100644
--- a/source/classification1.Rmd
+++ b/source/classification1.Rmd
@@ -8,7 +8,7 @@ library(stringr)
library(ggpubr)
library(ggplot2)
-knitr::opts_chunk$set(echo = TRUE,
+knitr::opts_chunk$set(echo = TRUE,
fig.align = "center")
options(knitr.table.format = ifelse(knitr::is_latex_output(), 'latex', 'html'))
@@ -40,16 +40,16 @@ hidden_print <- function(x){
hidden_print_cli <- function(x){
cleanup_and_print(cli::cli_fmt(capture.output(x)))
}
-
-theme_update(axis.title = element_text(size = 12)) # modify axis label size in plots
+
+theme_update(axis.title = element_text(size = 12)) # modify axis label size in plots
```
-## Overview
+## Overview
In previous chapters, we focused solely on descriptive and exploratory
-data analysis questions.
+data analysis questions.
This chapter and the next together serve as our first
foray into answering *predictive* questions about data. In particular, we will
-focus on *classification*, i.e., using one or more
+focus on *classification*, i.e., using one or more
variables to predict the value of a categorical variable of interest. This chapter
will cover the basics of classification, how to preprocess data to make it
suitable for use in a classifier, and how to use our observed data to make
@@ -57,7 +57,7 @@ predictions. The next chapter will focus on how to evaluate how accurate the
predictions from our classifier are, as well as how to improve our classifier
(where possible) to maximize its accuracy.
-## Chapter learning objectives
+## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:
@@ -65,9 +65,9 @@ By the end of the chapter, readers will be able to do the following:
- Describe what a training data set is and how it is used in classification.
- Interpret the output of a classifier.
- Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables.
-- Explain the $K$-nearest neighbor classification algorithm.
-- Perform $K$-nearest neighbor classification in R using `tidymodels`.
-- Use a `recipe` to preprocess data to be centered, scaled, and balanced.
+- Explain the K-nearest neighbors classification algorithm.
+- Perform K-nearest neighbors classification in R using `tidymodels`.
+- Use a `recipe` to center, scale, balance, and impute data as a preprocessing step.
- Combine preprocessing and model training using a `workflow`.
@@ -76,14 +76,14 @@ In many situations, we want to make predictions \index{predictive question} base
as well as past experiences. For instance, a doctor may want to diagnose a
patient as either diseased or healthy based on their symptoms and the doctor's
past experience with patients; an email provider might want to tag a given
-email as "spam" or "not spam" based on the email's text and past email text data;
+email as "spam" or "not spam" based on the email's text and past email text data;
or a credit card company may want to predict whether a purchase is fraudulent based
on the current purchase item, amount, and location as well as past purchases.
These tasks are all examples of \index{classification} **classification**, i.e., predicting a
categorical class (sometimes called a *label*) \index{class}\index{categorical variable} for an observation given its
other variables (sometimes called *features*). \index{feature|see{predictor}}
-Generally, a classifier assigns an observation without a known class (e.g., a new patient)
+Generally, a classifier assigns an observation without a known class (e.g., a new patient)
to a class (e.g., diseased or healthy) on the basis of how similar it is to other observations
for which we do know the class (e.g., previous patients with known diseases and
symptoms). These observations with known classes that we use as a basis for
@@ -93,29 +93,29 @@ the classifier to make predictions on new data for which we do not know the clas
There are many possible methods that we could use to predict
a categorical class/label for an observation. In this book, we will
-focus on the widely used **$K$-nearest neighbors** \index{K-nearest neighbors} algorithm [@knnfix; @knncover].
+focus on the widely used **K-nearest neighbors** \index{K-nearest neighbors} algorithm [@knnfix; @knncover].
In your future studies, you might encounter decision trees, support vector machines (SVMs),
logistic regression, neural networks, and more; see the additional resources
section at the end of the next chapter for where to begin learning more about
these other methods. It is also worth mentioning that there are many
-variations on the basic classification problem. For example,
+variations on the basic classification problem. For example,
we focus on the setting of **binary classification** \index{classification!binary} where only two
-classes are involved (e.g., a diagnosis of either healthy or diseased), but you may
+classes are involved (e.g., a diagnosis of either healthy or diseased), but you may
also run into multiclass classification problems with more than two
categories (e.g., a diagnosis of healthy, bronchitis, pneumonia, or a common cold).
## Exploring a data set
-In this chapter and the next, we will study a data set of
+In this chapter and the next, we will study a data set of
[digitized breast cancer image features](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29),
-created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian [@streetbreastcancer]. \index{breast cancer}
+created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian [@streetbreastcancer]. \index{breast cancer}
Each row in the data set represents an
image of a tumor sample, including the diagnosis (benign or malignant) and
several other measurements (nucleus texture, perimeter, area, and more).
-Diagnosis for each image was conducted by physicians.
+The diagnosis for each image was made by physicians.
As with all data analyses, we first need to formulate a precise question that
-we want to answer. Here, the question is *predictive*: \index{question!classification} can
+we want to answer. Here, the question is *predictive*: \index{question!classification} can
we use the tumor
image measurements available to us to predict whether a future tumor image
(with unknown diagnosis) shows a benign or malignant tumor? Answering this
@@ -133,7 +133,7 @@ guide patient treatment.
Our first step is to load, wrangle, and explore the data using visualizations
in order to better understand the data we are working with. We start by
-loading the `tidyverse` package needed for our analysis.
+loading the `tidyverse` package needed for our analysis.
```{r 05-load-libraries, warning = FALSE, message = FALSE}
library(tidyverse)
@@ -158,31 +158,31 @@ Traditionally these procedures were quite invasive; modern methods such as fine
needle aspiration, used to collect the present data set, extract only a small
amount of tissue and are less invasive. Based on a digital image of each breast
tissue sample collected for this data set, ten different variables were measured
-for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean
+for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean
for each variable across the nuclei was recorded. As part of the
data preparation, these values have been *standardized (centered and scaled)*; we will discuss what this
means and why we do it later in this chapter. Each image additionally was given
a unique ID and a diagnosis by a physician. Therefore, the
total set of variables per image in this data set is:
-1. ID: identification number
+1. ID: identification number
2. Class: the diagnosis (M = malignant or B = benign)
3. Radius: the mean of distances from center to points on the perimeter
4. Texture: the standard deviation of gray-scale values
-5. Perimeter: the length of the surrounding contour
+5. Perimeter: the length of the surrounding contour
6. Area: the area inside the contour
7. Smoothness: the local variation in radius lengths
8. Compactness: the ratio of squared perimeter and area
-9. Concavity: severity of concave portions of the contour
+9. Concavity: severity of concave portions of the contour
10. Concave Points: the number of concave portions of the contour
-11. Symmetry: how similar the nucleus is when mirrored
-12. Fractal Dimension: a measurement of how "rough" the perimeter is
+11. Symmetry: how similar the nucleus is when mirrored
+12. Fractal Dimension: a measurement of how "rough" the perimeter is
\pagebreak
-Below we use `glimpse` \index{glimpse} to preview the data frame. This function can
-make it easier to inspect the data when we have a lot of columns,
-as it prints the data such that the columns go down
+Below we use `glimpse` \index{glimpse} to preview the data frame. This function can
+make it easier to inspect the data when we have a lot of columns,
+as it prints the data such that the columns go down
the page (instead of across).
```{r 05-glimpse}
@@ -203,7 +203,7 @@ We will also improve the readability of our analysis by renaming "M" to
"Malignant" and "B" to "Benign" using the `fct_recode` method. The `fct_recode` method \index{factor!fct\_recode}
is used to replace the names of factor values with other names. The arguments of `fct_recode` are the column that you
want to modify, followed by any number of arguments of the form `"new name" = "old name"` to specify the renaming scheme.
-
+
```{r 05-class}
cancer <- cancer |>
mutate(Class = as_factor(Class)) |>
@@ -222,9 +222,9 @@ cancer |>
### Exploring the cancer data
Before we start doing any modeling, let's explore our data set. Below we use
-the `group_by`, `summarize` and `n` \index{group\_by}\index{summarize} functions to find the number and percentage
+the `group_by`, `summarize`, and `n` \index{group\_by}\index{summarize} functions to find the number and percentage
of benign and malignant tumor observations in our data set. The `n` function within
-`summarize`, when paired with `group_by`, counts the number of observations in each `Class` group.
+`summarize`, when paired with `group_by`, counts the number of observations in each `Class` group.
Then we calculate the percentage in each group by dividing by the total number of observations
and multiplying by 100. We have 357 (63\%) benign and 212 (37\%) malignant tumor observations.
```{r 05-tally}
@@ -239,15 +239,15 @@ cancer |>
Next, let's draw a scatter plot \index{visualization!scatter} to visualize the relationship between the
perimeter and concavity variables. Rather than use `ggplot`'s default palette,
-we select our own colorblind-friendly colors—`"orange2"`
+we select our own colorblind-friendly colors—`"orange2"`
for light orange and `"steelblue2"` for light blue—and
- pass them as the `values` argument to the `scale_color_manual` function.
+ pass them as the `values` argument to the `scale_color_manual` function.
```{r 05-scatter, fig.height = 3.5, fig.width = 4.5, fig.cap= "Scatter plot of concavity versus perimeter colored by diagnosis label."}
perim_concav <- cancer |>
ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
geom_point(alpha = 0.6) +
- labs(x = "Perimeter (standardized)",
+ labs(x = "Perimeter (standardized)",
y = "Concavity (standardized)",
color = "Diagnosis") +
scale_color_manual(values = c("orange2", "steelblue2")) +
@@ -264,7 +264,7 @@ obtain a new observation not in the current data set that has all the variables
measured *except* the label (i.e., an image without the physician's diagnosis
for the tumor class). We could compute the standardized perimeter and concavity values,
resulting in values of, say, 1 and 1. Could we use this information to classify
-that observation as benign or malignant? Based on the scatter plot, how might
+that observation as benign or malignant? Based on the scatter plot, how might
you classify that new observation? If the standardized concavity and perimeter
values are 1 and 1 respectively, the point would lie in the middle of the
orange cloud of malignant points and thus we could probably classify it as
@@ -272,7 +272,7 @@ malignant. Based on our visualization, it seems like it may be possible
to make accurate predictions of the `Class` variable (i.e., a diagnosis) for
tumor images with unknown diagnoses.
-## Classification with $K$-nearest neighbors
+## Classification with K-nearest neighbors
```{r 05-knn-0, echo = FALSE}
## Find the distance between new point and all others in data set
@@ -305,39 +305,39 @@ neighbors <- cancer[order(my_distances$Distance), ]
```
In order to actually make predictions for new observations in practice, we
-will need a classification algorithm.
-In this book, we will use the $K$-nearest neighbors \index{K-nearest neighbors!classification} classification algorithm.
+will need a classification algorithm.
+In this book, we will use the K-nearest neighbors \index{K-nearest neighbors!classification} classification algorithm.
To predict the label of a new observation (here, classify it as either benign
-or malignant), the $K$-nearest neighbors classifier generally finds the $K$
+or malignant), the K-nearest neighbors classifier generally finds the $K$
"nearest" or "most similar" observations in our training set, and then uses
-their diagnoses to make a prediction for the new observation's diagnosis. $K$
+their diagnoses to make a prediction for the new observation's diagnosis. $K$
is a number that we must choose in advance; for now, we will assume that someone has chosen
-$K$ for us. We will cover how to choose $K$ ourselves in the next chapter.
+$K$ for us. We will cover how to choose $K$ ourselves in the next chapter.
-To illustrate the concept of $K$-nearest neighbors classification, we
+To illustrate the concept of K-nearest neighbors classification, we
will walk through an example. Suppose we have a
-new observation, with standardized perimeter of `r new_point[1]` and standardized concavity of `r new_point[2]`, whose
+new observation, with standardized perimeter of `r new_point[1]` and standardized concavity of `r new_point[2]`, whose
diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in
Figure \@ref(fig:05-knn-1).
```{r 05-knn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
-perim_concav_with_new_point <- bind_rows(cancer,
- tibble(Perimeter = new_point[1],
- Concavity = new_point[2],
+perim_concav_with_new_point <- bind_rows(cancer,
+ tibble(Perimeter = new_point[1],
+ Concavity = new_point[2],
Class = "unknown")) |>
- ggplot(aes(x = Perimeter,
- y = Concavity,
- color = Class,
- shape = Class,
+ ggplot(aes(x = Perimeter,
+ y = Concavity,
+ color = Class,
+ shape = Class,
size = Class)) +
geom_point(alpha = 0.6) +
- labs(color = "Diagnosis", x = "Perimeter (standardized)",
+ labs(color = "Diagnosis", x = "Perimeter (standardized)",
y = "Concavity (standardized)") +
- scale_color_manual(name = "Diagnosis",
+ scale_color_manual(name = "Diagnosis",
values = c("steelblue2", "orange2", "red")) +
- scale_shape_manual(name = "Diagnosis",
- values= c(16, 16, 18))+
- scale_size_manual(name = "Diagnosis",
+ scale_shape_manual(name = "Diagnosis",
+ values= c(16, 16, 18))+
+ scale_size_manual(name = "Diagnosis",
values= c(2, 2, 2.5))
perim_concav_with_new_point
```
@@ -346,7 +346,7 @@ Figure \@ref(fig:05-knn-2) shows that the nearest point to this new observation
located at the coordinates (`r round(neighbors[1, c(attrs[1], attrs[2])],
1)`). The idea here is that if a point is close to another in the scatter plot,
then the perimeter and concavity values are similar, and so we may expect that
-they would have the same diagnosis.
+they would have the same diagnosis.
```{r 05-knn-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label."}
@@ -370,31 +370,31 @@ Suppose we have another new observation with standardized perimeter `r new_point
concavity of `r new_point[2]`. Looking at the scatter plot in Figure \@ref(fig:05-knn-4), how would you
classify this red, diamond observation? The nearest neighbor to this new point is a
**benign** observation at (`r round(neighbors[1, c(attrs[1], attrs[2])], 1)`).
-Does this seem like the right prediction to make for this observation? Probably
+Does this seem like the right prediction to make for this observation? Probably
not, if you consider the other nearby points.
```{r 05-knn-4, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}
-perim_concav_with_new_point2 <- bind_rows(cancer,
- tibble(Perimeter = new_point[1],
- Concavity = new_point[2],
+perim_concav_with_new_point2 <- bind_rows(cancer,
+ tibble(Perimeter = new_point[1],
+ Concavity = new_point[2],
Class = "unknown")) |>
- ggplot(aes(x = Perimeter,
- y = Concavity,
- color = Class,
+ ggplot(aes(x = Perimeter,
+ y = Concavity,
+ color = Class,
shape = Class, size = Class)) +
geom_point(alpha = 0.6) +
- labs(color = "Diagnosis",
- x = "Perimeter (standardized)",
+ labs(color = "Diagnosis",
+ x = "Perimeter (standardized)",
y = "Concavity (standardized)") +
- scale_color_manual(name = "Diagnosis",
+ scale_color_manual(name = "Diagnosis",
values = c("steelblue2", "orange2", "red")) +
- scale_shape_manual(name = "Diagnosis",
- values= c(16, 16, 18))+
- scale_size_manual(name = "Diagnosis",
+ scale_shape_manual(name = "Diagnosis",
+ values= c(16, 16, 18))+
+ scale_size_manual(name = "Diagnosis",
values= c(2, 2, 2.5))
-perim_concav_with_new_point2 +
+perim_concav_with_new_point2 +
geom_segment(aes(
x = new_point[1],
y = new_point[2],
@@ -409,10 +409,10 @@ to predict its diagnosis class. Among those 3 closest points, we use the
*majority class* as our prediction for the new observation. As shown in Figure \@ref(fig:05-knn-5), we
see that the diagnoses of 2 of the 3 nearest neighbors to our new observation
are malignant. Therefore we take majority vote and classify our new red, diamond
-observation as malignant.
+observation as malignant.
```{r 05-knn-5, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with three nearest neighbors."}
-perim_concav_with_new_point2 +
+perim_concav_with_new_point2 +
geom_segment(aes(
x = new_point[1], y = new_point[2],
xend = pull(neighbors[1, attrs[1]]),
@@ -433,7 +433,7 @@ perim_concav_with_new_point2 +
Here we chose the $K=3$ nearest observations, but there is nothing special
about $K=3$. We could have used $K=4, 5$ or more (though we may want to choose
an odd number to avoid ties). We will discuss more about choosing $K$ in the
-next chapter.
+next chapter.
### Distance between points
@@ -442,8 +442,8 @@ using the *straight-line distance* (we will often just refer to this as *distanc
Suppose we have two observations $a$ and $b$, each having two predictor variables, $x$ and $y$.
Let $a_x$ and $a_y$ denote the values of variables $x$ and $y$ for observation $a$;
$b_x$ and $b_y$ have similar definitions for observation $b$.
-Then the straight-line distance between observation $a$ and $b$ on the x-y plane can
-be computed using the following formula:
+Then the straight-line distance between observation $a$ and $b$ on the x-y plane can
+be computed using the following formula:
$$\mathrm{Distance} = \sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}$$
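For example, for two made-up points $a = (1, 2)$ and $b = (4, 6)$, this gives $\sqrt{(1-4)^2 + (2-6)^2} = \sqrt{9 + 16} = 5$, which we can check directly in R:

```{r}
# straight-line distance between the made-up points a = (1, 2) and b = (4, 6)
sqrt((1 - 4)^2 + (2 - 6)^2)
```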
```{r 05-multiknn-0, echo = FALSE}
@@ -452,42 +452,42 @@ new_point <- c(0, 3.5)
To find the $K$ nearest neighbors to our new observation, we compute the distance
from that new observation to each observation in our training data, and select the $K$ observations corresponding to the
-$K$ *smallest* distance values. For example, suppose we want to use $K=5$ neighbors to classify a new
-observation with perimeter of `r new_point[1]` and
+$K$ *smallest* distance values. For example, suppose we want to use $K=5$ neighbors to classify a new
+observation with perimeter of `r new_point[1]` and
concavity of `r new_point[2]`, shown as a red diamond in Figure \@ref(fig:05-multiknn-1). Let's calculate the distances
between our new point and each of the observations in the training set to find
-the $K=5$ neighbors that are nearest to our new point.
+the $K=5$ neighbors that are nearest to our new point.
You will see in the `mutate` \index{mutate} step below that we compute the straight-line
-distance using the formula above: we square the differences between the two observations' perimeter
+distance using the formula above: we square the differences between the two observations' perimeter
and concavity coordinates, add the squared differences, and then take the square root.
In order to find the $K=5$ nearest neighbors, we will use the `slice_min` function. \index{slice\_min}
```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
-perim_concav <- bind_rows(cancer,
- tibble(Perimeter = new_point[1],
- Concavity = new_point[2],
+perim_concav <- bind_rows(cancer,
+ tibble(Perimeter = new_point[1],
+ Concavity = new_point[2],
Class = "unknown")) |>
- ggplot(aes(x = Perimeter,
- y = Concavity,
- color = Class,
- shape = Class,
+ ggplot(aes(x = Perimeter,
+ y = Concavity,
+ color = Class,
+ shape = Class,
size = Class)) +
- geom_point(aes(x = new_point[1],
- y = new_point[2]),
- color = "red",
- size = 2.5,
- pch = 18) +
+ geom_point(aes(x = new_point[1],
+ y = new_point[2]),
+ color = "red",
+ size = 2.5,
+ pch = 18) +
geom_point(alpha = 0.5) +
- scale_x_continuous(name = "Perimeter (standardized)",
+ scale_x_continuous(name = "Perimeter (standardized)",
breaks = seq(-2, 4, 1)) +
- scale_y_continuous(name = "Concavity (standardized)",
+ scale_y_continuous(name = "Concavity (standardized)",
breaks = seq(-2, 4, 1)) +
- labs(color = "Diagnosis") +
- scale_color_manual(name = "Diagnosis",
+ labs(color = "Diagnosis") +
+ scale_color_manual(name = "Diagnosis",
values = c("steelblue2", "orange2", "red")) +
- scale_shape_manual(name = "Diagnosis",
- values= c(16, 16, 18))+
- scale_size_manual(name = "Diagnosis",
+ scale_shape_manual(name = "Diagnosis",
+ values= c(16, 16, 18))+
+ scale_size_manual(name = "Diagnosis",
values= c(2, 2, 2.5))
perim_concav
@@ -498,7 +498,7 @@ new_obs_Perimeter <- 0
new_obs_Concavity <- 3.5
cancer |>
select(ID, Perimeter, Concavity, Class) |>
- mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
+ mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
(Concavity - new_obs_Concavity)^2)) |>
slice_min(dist_from_new, n = 5) # take the 5 rows of minimum distance
```
@@ -513,15 +513,15 @@ training data.
my_distances <- table_with_distances(cancer[, attrs], new_point)
neighbors <- my_distances[order(my_distances$Distance), ]
k <- 5
-tab <- data.frame(neighbors[1:k, ],
+tab <- data.frame(neighbors[1:k, ],
cancer[order(my_distances$Distance), ][1:k, c("ID", "Class")])
-math_table <- tibble(Perimeter = round(tab[1:5,1],2),
- Concavity = round(tab[1:5,2],2),
+math_table <- tibble(Perimeter = round(tab[1:5,1],2),
+ Concavity = round(tab[1:5,2],2),
dist = round(neighbors[1:5, "Distance"], 2)
)
-math_table <- math_table |>
+math_table <- math_table |>
mutate(Distance = paste0("$\\sqrt{(", new_obs_Perimeter, " - ", ifelse(Perimeter < 0, "(", ""), Perimeter, ifelse(Perimeter < 0,")",""), ")^2",
" + ",
"(", new_obs_Concavity, " - ", ifelse(Concavity < 0,"(",""), Concavity, ifelse(Concavity < 0,")",""), ")^2} = ", dist, "$")) |>
@@ -530,14 +530,14 @@ math_table <- math_table |>
```
```{r 05-multiknn-mathtable, echo = FALSE}
-kable(math_table, booktabs = TRUE,
- caption = "Evaluating the distances from the new observation to each of its 5 nearest neighbors",
+kable(math_table, booktabs = TRUE,
+ caption = "Evaluating the distances from the new observation to each of its 5 nearest neighbors",
escape = FALSE) |>
kable_styling(latex_options = "hold_position")
```
The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are
-malignant; since this is the majority, we classify our new observation as malignant.
+malignant; since this is the majority, we classify our new observation as malignant.
These 5 neighbors are circled in Figure \@ref(fig:05-multiknn-3).
```{r 05-multiknn-3, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with 5 nearest neighbors circled."}
@@ -551,14 +551,14 @@ perim_concav + annotate("path",
)
```
-### More than two explanatory variables
+### More than two explanatory variables
-Although the above description is directed toward two predictor variables,
-exactly the same $K$-nearest neighbors algorithm applies when you
+Although the above description is directed toward two predictor variables,
+exactly the same K-nearest neighbors algorithm applies when you
have a higher number of predictor variables. Each predictor variable may give us new
information to help create our classifier. The only difference is the formula
for the distance between points. Suppose we have $m$ predictor
-variables for two observations $a$ and $b$, i.e.,
+variables for two observations $a$ and $b$, i.e.,
$a = (a_{1}, a_{2}, \dots, a_{m})$ and
$b = (b_{1}, b_{2}, \dots, b_{m})$.
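In that case, the straight-line distance adds up the squared differences across all $m$ predictors:

$$\mathrm{Distance} = \sqrt{(a_{1} - b_{1})^2 + (a_{2} - b_{2})^2 + \dots + (a_{m} - b_{m})^2}$$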
@@ -580,33 +580,33 @@ $$\mathrm{Distance} =\sqrt{(0 - 0.417)^2 + (3.5 - 2.31)^2 + (1 - 0.837)^2} = 1.2
Let's calculate the distances between our new observation and each of the
observations in the training set to find the $K=5$ neighbors when we have these
-three predictors.
+three predictors.
```{r}
new_obs_Perimeter <- 0
new_obs_Concavity <- 3.5
new_obs_Symmetry <- 1
cancer |>
select(ID, Perimeter, Concavity, Symmetry, Class) |>
- mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
+ mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
(Concavity - new_obs_Concavity)^2 +
(Symmetry - new_obs_Symmetry)^2)) |>
slice_min(dist_from_new, n = 5) # take the 5 rows of minimum distance
```
-Based on $K=5$ nearest neighbors with these three predictors, we would classify
-the new observation as malignant since 4 out of 5 of the nearest neighbors are from the malignant class.
-Figure \@ref(fig:05-more) shows what the data look like when we visualize them
+Based on $K=5$ nearest neighbors with these three predictors, we would classify
+the new observation as malignant since 4 out of 5 of the nearest neighbors are from the malignant class.
+Figure \@ref(fig:05-more) shows what the data look like when we visualize them
as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors.
```{r 05-more, echo = FALSE, message = FALSE, fig.cap = "3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes.", fig.retina=2, out.width="100%"}
attrs <- c("Perimeter", "Concavity", "Symmetry")
# create new scaled obs and get NNs
-new_obs_3 <- tibble(Perimeter = 0,
- Concavity = 3.5,
- Symmetry = 1,
+new_obs_3 <- tibble(Perimeter = 0,
+ Concavity = 3.5,
+ Symmetry = 1,
Class = "Unknown")
-my_distances_3 <- table_with_distances(cancer[, attrs],
+my_distances_3 <- table_with_distances(cancer[, attrs],
new_obs_3[, attrs])
neighbors_3 <- cancer[order(my_distances_3$Distance), ]
@@ -621,14 +621,14 @@ plot_3d <- scaled_cancer_3 |>
xaxis = list(title = "Perimeter", titlefont = list(size = 14)),
yaxis = list(title = "Concavity", titlefont = list(size = 14)),
zaxis = list(title = "Symmetry", titlefont = list(size = 14))
- )) |>
+ )) |>
add_trace(x = ~Perimeter,
y = ~Concavity,
z = ~Symmetry,
color = ~Class,
opacity = 0.4,
size = 2,
- colors = c("steelblue2", "orange2", "red"),
+ colors = c("steelblue2", "orange2", "red"),
symbol = ~Class, symbols = c('circle','circle','diamond'))
x1 <- c(pull(new_obs_3[1]), data$Perimeter[1])
@@ -652,32 +652,32 @@ y5 <- c(pull(new_obs_3[2]), data$Concavity[5])
z5 <- c(pull(new_obs_3[3]), data$Symmetry[5])
plot_3d <- plot_3d |>
- add_trace(x = x1, y = y1, z = z1, type = "scatter3d", mode = "lines",
+ add_trace(x = x1, y = y1, z = z1, type = "scatter3d", mode = "lines",
name = "lines", showlegend = FALSE, color = I("orange2")) |>
- add_trace(x = x2, y = y2, z = z2, type = "scatter3d", mode = "lines",
+ add_trace(x = x2, y = y2, z = z2, type = "scatter3d", mode = "lines",
name = "lines", showlegend = FALSE, color = I("orange2")) |>
- add_trace(x = x3, y = y3, z = z3, type = "scatter3d", mode = "lines",
+ add_trace(x = x3, y = y3, z = z3, type = "scatter3d", mode = "lines",
name = "lines", showlegend = FALSE, color = I("orange2")) |>
- add_trace(x = x4, y = y4, z = z4, type = "scatter3d", mode = "lines",
+ add_trace(x = x4, y = y4, z = z4, type = "scatter3d", mode = "lines",
name = "lines", showlegend = FALSE, color = I("steelblue2")) |>
- add_trace(x = x5, y = y5, z = z5, type = "scatter3d", mode = "lines",
+ add_trace(x = x5, y = y5, z = z5, type = "scatter3d", mode = "lines",
name = "lines", showlegend = FALSE, color = I("orange2"))
-if(!is_latex_output()){
+if(!is_latex_output()){
plot_3d
} else {
# scene = list(camera = list(eye = list(x=2, y=2, z = 1.5)))
# plot_3d <- plot_3d |> layout(scene = scene)
# save_image(plot_3d, "img/classification1/plot3d_knn_classification.png", scale = 10)
- # cannot adjust size of points in this plot for pdf
+ # cannot adjust size of points in this plot for pdf
# so using a screenshot for now instead
knitr::include_graphics("img/classification1/plot3d_knn_classification.png")
}
```
-### Summary of $K$-nearest neighbors algorithm
+### Summary of K-nearest neighbors algorithm
-In order to classify a new observation using a $K$-nearest neighbor classifier, we have to do the following:
+In order to classify a new observation using a K-nearest neighbors classifier, we have to do the following:
1. Compute the distance between the new observation and each observation in the training set.
2. Sort the data table in ascending order according to the distances.
@@ -685,26 +685,26 @@ In order to classify a new observation using a $K$-nearest neighbor classifier,
4. Classify the new observation based on a majority vote of the neighbor classes.
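As a rough illustration only (in practice we will use `tidymodels`, introduced in the next section), these four steps can be sketched with `dplyr` for the earlier example point at standardized perimeter 0 and concavity 3.5, using $K = 5$:

```{r}
# rough sketch of the four steps for the example point (0, 3.5) with K = 5;
# in practice we will use tidymodels rather than coding this by hand
cancer |>
  mutate(dist_from_new = sqrt((Perimeter - 0)^2 + (Concavity - 3.5)^2)) |> # step 1
  slice_min(dist_from_new, n = 5) |>                                       # steps 2 and 3
  count(Class) |>                                                          # tally the neighbor classes
  slice_max(n, n = 1)                                                      # step 4: majority vote
```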
-## $K$-nearest neighbors with `tidymodels`
+## K-nearest neighbors with `tidymodels`
-Coding the $K$-nearest neighbors algorithm in R ourselves can get complicated,
+Coding the K-nearest neighbors algorithm in R ourselves can get complicated,
especially if we want to handle multiple classes, more than two variables,
or predict the class for multiple new observations. Thankfully, in R,
-the $K$-nearest neighbors algorithm is
-implemented in [the `parsnip` R package](https://parsnip.tidymodels.org/) [@parsnip]
-included in `tidymodels`, along with
+the K-nearest neighbors algorithm is
+implemented in [the `parsnip` R package](https://parsnip.tidymodels.org/) [@parsnip]
+included in `tidymodels`, along with
many [other models](https://www.tidymodels.org/find/parsnip/) \index{tidymodels}\index{parsnip}
that you will encounter in this and future chapters of the book. The `tidymodels` collection
provides tools to help make and use models, such as classifiers. Using the packages
-in this collection will help keep our code simple, readable and accurate; the
-less we have to code ourselves, the fewer mistakes we will likely make. We
+in this collection will help keep our code simple, readable, and accurate; the
+less we have to code ourselves, the fewer mistakes we will likely make. We
start by loading `tidymodels`.
```{r 05-tidymodels, warning = FALSE, message = FALSE}
library(tidymodels)
```
-Let's walk through how to use `tidymodels` to perform $K$-nearest neighbors classification.
+Let's walk through how to use `tidymodels` to perform K-nearest neighbors classification.
We will use the `cancer` data set from above, with
perimeter and concavity as predictors and $K = 5$ neighbors to build our classifier. Then
we will use the classifier to predict the diagnosis label for a new observation with
@@ -717,16 +717,16 @@ cancer_train <- cancer |>
cancer_train
```
-Next, we create a *model specification* for \index{tidymodels!model specification} $K$-nearest neighbors classification
+Next, we create a *model specification* for \index{tidymodels!model specification} K-nearest neighbors classification
by calling the `nearest_neighbor` function, specifying that we want to use $K = 5$ neighbors
(we will discuss how to choose $K$ in the next chapter) and that each neighboring point should have the same weight when voting
(`weight_func = "rectangular"`). The `weight_func` argument controls
how neighbors vote when classifying a new observation; by setting it to `"rectangular"`,
-each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices,
-which weigh each neighbor's vote differently, can be found on
+each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices,
+which weigh each neighbor's vote differently, can be found on
[the `parsnip` website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html).
With the `set_engine` \index{tidymodels!engine} function, we specify which package or system will be used for training
-the model. Here `kknn` is the R package we will use for performing $K$-nearest neighbors classification.
+the model. Here `kknn` is the R package we will use for performing K-nearest neighbors classification.
Finally, we specify that this is a classification problem with the `set_mode` function.
```{r 05-tidymodels-3}
@@ -738,7 +738,7 @@ knn_spec
In order to fit the model on the breast cancer data, we need to pass the model specification
and the data set to the `fit` function. We also need to specify what variables to use as predictors
-and what variable to use as the response. Below, the `Class ~ Perimeter + Concavity` argument specifies
+and what variable to use as the response. Below, the `Class ~ Perimeter + Concavity` argument specifies
that `Class` is the response variable (the one we want to predict),
and both `Perimeter` and `Concavity` are to be used as the predictors.
@@ -766,18 +766,18 @@ hidden_print(knn_fit)
Here you can see the final trained model summary. It confirms that the computational engine used
to train the model was `kknn::train.kknn`. It also shows the fraction of errors made by
-the nearest neighbor model, but we will ignore this for now and discuss it in more detail
+the K-nearest neighbors model, but we will ignore this for now and discuss it in more detail
in the next chapter.
-Finally, it shows (somewhat confusingly) that the "best" weight function
+It additionally reports (somewhat confusingly) that the "best" weight function
was "rectangular" and the "best" setting of $K$ was 5; but since we specified these earlier,
R is just repeating those settings to us here. In the next chapter, we will actually
-let R find the value of $K$ for us.
+let R find the value of $K$ for us.
Finally, we make the prediction on the new observation by calling the `predict` \index{tidymodels!predict} function,
-passing both the fit object we just created and the new observation itself. As above,
-when we ran the $K$-nearest neighbors
-classification algorithm manually, the `knn_fit` object classifies the new observation as
-malignant. Note that the `predict` function outputs a data frame with a single
+passing both the fit object we just created and the new observation itself. Just as
+when we ran the K-nearest neighbors
+classification algorithm manually earlier, the `knn_fit` object classifies the new observation as
+malignant. Note that the `predict` function outputs a data frame with a single
variable named `.pred_class`.
```{r 05-predict}
@@ -787,17 +787,17 @@ predict(knn_fit, new_obs)
Is this predicted malignant label the actual class for this observation?
Well, we don't know because we do not have this
-observation's diagnosis— that is what we were trying to predict! The
-classifier's prediction is not necessarily correct, but in the next chapter, we will
+observation's diagnosis—that is what we were trying to predict! The
+classifier's prediction is not necessarily correct, but in the next chapter, we will
learn ways to quantify how accurate we think our predictions are.
## Data preprocessing with `tidymodels`
### Centering and scaling
-When using $K$-nearest neighbor classification, the *scale* \index{scaling} of each variable
+When using K-nearest neighbors classification, the *scale* \index{scaling} of each variable
(i.e., its size and range of values) matters. Since the classifier predicts
-classes by identifying observations nearest to it, any variables with
+classes by identifying the observations nearest to the new observation being classified, any variables with
a large scale will have a much larger effect than variables with a small
scale. But just because a variable has a large scale *doesn't mean* that it is
more important for making accurate predictions. For example, suppose you have a
@@ -816,20 +816,20 @@ degrees Celsius, the two variables would differ by a constant shift of 273
hypothetical job classification example, we would likely see that the center of
the salary variable is in the tens of thousands, while the center of the years
of education variable is in the single digits. Although this doesn't affect the
-$K$-nearest neighbor classification algorithm, this large shift can change the
+K-nearest neighbors classification algorithm, this large shift can change the
outcome of using many other predictive models. \index{centering}
To scale and center our data, we need to find
-our variables' *mean* (the average, which quantifies the "central" value of a
-set of numbers) and *standard deviation* (a number quantifying how spread out values are).
-For each observed value of the variable, we subtract the mean (i.e., center the variable)
-and divide by the standard deviation (i.e., scale the variable). When we do this, the data
-is said to be *standardized*, \index{standardization!K-nearest neighbors} and all variables in a data set will have a mean of 0
-and a standard deviation of 1. To illustrate the effect that standardization can have on the $K$-nearest
-neighbor algorithm, we will read in the original, unstandardized Wisconsin breast
+our variables' *mean* (the average, which quantifies the "central" value of a
+set of numbers) and *standard deviation* (a number quantifying how spread out values are).
+For each observed value of the variable, we subtract the mean (i.e., center the variable)
+and divide by the standard deviation (i.e., scale the variable). When we do this, the data
+is said to be *standardized*, \index{standardization!K-nearest neighbors} and all variables in a data set will have a mean of 0
+and a standard deviation of 1. To illustrate the effect that standardization can have on the K-nearest
+neighbors algorithm, we will read in the original, unstandardized Wisconsin breast
cancer data set; we have been using a standardized version of the data set up
until now. As before, we will convert the `Class` variable to the factor type
-and rename the values to "Malignant" and "Benign."
+and rename the values to "Malignant" and "Benign."
To keep things simple, we will just use the `Area`, `Smoothness`, and `Class`
variables:
@@ -849,9 +849,9 @@ predictors (colored by diagnosis) for both the unstandardized data we just
loaded, and the standardized version of that same data. But first, we need to
standardize the `unscaled_cancer` data set with `tidymodels`.
-In the `tidymodels` framework, all data preprocessing happens
+In the `tidymodels` framework, all data preprocessing happens
using a `recipe` from [the `recipes` R package](https://recipes.tidymodels.org/) [@recipes].
-Here we will initialize a recipe\index{recipe} \index{tidymodels!recipe|see{recipe}} for
+Here we will initialize a recipe\index{recipe} \index{tidymodels!recipe|see{recipe}} for
the `unscaled_cancer` data above, specifying
that the `Class` variable is the response, and all other variables are predictors:
@@ -865,13 +865,13 @@ hidden_print_cli(uc_recipe)
```
So far, there is not much in the recipe; just a statement about the number of response variables
-and predictors. Let's add
-scaling (`step_scale`) \index{recipe!step\_scale} and
-centering (`step_center`) \index{recipe!step\_center} steps for
+and predictors. Let's add
+scaling (`step_scale`) \index{recipe!step\_scale} and
+centering (`step_center`) \index{recipe!step\_center} steps for
all of the predictors so that they each have a mean of 0 and standard deviation of 1.
Note that the `recipes` package actually provides `step_normalize`, which does both centering and scaling in
a single recipe step; in this book we will keep `step_scale` and `step_center` separate
-to emphasize conceptually that there are two steps happening.
+to emphasize conceptually that there are two steps happening.
The `prep` function finalizes the recipe by using the data (here, `unscaled_cancer`) \index{tidymodels!prep}\index{prep|see{tidymodels}}
to compute anything necessary to run the recipe (in this case, the column means and standard
deviations):
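A minimal sketch of such a pipeline, continuing from the `uc_recipe` object created above:

```{r}
# add scaling and centering steps for every predictor, then finalize the
# recipe with prep so that the column means and standard deviations are computed
uc_recipe <- uc_recipe |>
  step_scale(all_predictors()) |>
  step_center(all_predictors()) |>
  prep()
```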
@@ -890,9 +890,9 @@ hidden_print_cli(uc_recipe)
You can now see that the recipe includes a scaling and centering step for all predictor variables.
Note that when you add a step to a recipe, you must specify what columns to apply the step to.
-Here we used the `all_predictors()` \index{recipe!all\_predictors} function to specify that each step should be applied to
+Here we used the `all_predictors()` \index{recipe!all\_predictors} function to specify that each step should be applied to
all predictor variables. However, there are a number of different arguments one could use here,
-as well as naming particular columns with the same syntax as the `select` function.
+as well as naming particular columns with the same syntax as the `select` function.
For example:
- `all_nominal()` and `all_numeric()`: specify all categorical or all numeric variables
@@ -903,8 +903,8 @@ For example:
You can find a full set of all the steps and variable selection functions
on the [`recipes` reference page](https://recipes.tidymodels.org/reference/index.html).
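For instance, a recipe that names the two predictor columns directly, rather than using a selector function, could be sketched as:

```{r}
# name particular columns directly instead of using all_predictors()
recipe(Class ~ ., data = unscaled_cancer) |>
  step_scale(Area, Smoothness) |>
  step_center(Area, Smoothness)
```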
-At this point, we have calculated the required statistics based on the data input into the
-recipe, but the data are not yet scaled and centered. To actually scale and center
+At this point, we have calculated the required statistics based on the data input into the
+recipe, but the data are not yet scaled and centered. To actually scale and center
the data, we need to apply the `bake` \index{tidymodels!bake} \index{bake|see{tidymodels}} function to the unscaled data.
```{r 05-scaling-4}
@@ -913,12 +913,12 @@ scaled_cancer
```
It may seem redundant that we had to both `bake` *and* `prep` to scale and center the data.
- However, we do this in two steps so we can specify a different data set in the `bake` step if we want.
- For example, we may want to specify new data that were not part of the training set.
+ However, we do this in two steps so we can specify a different data set in the `bake` step if we want.
+ For example, we may want to specify new data that were not part of the training set.
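For instance, we can apply the trained recipe to a different data frame; here, as a stand-in, we pretend the first three rows of `unscaled_cancer` are new observations:

```{r}
# pretend the first three rows of unscaled_cancer are brand-new observations
new_observations <- slice(unscaled_cancer, 1:3)

# standardize them using the means and standard deviations that prep
# computed from the full unscaled_cancer data
bake(uc_recipe, new_observations)
```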
You may wonder why we are doing so much work just to center and
scale our variables. Can't we just manually scale and center the `Area` and
-`Smoothness` variables ourselves before building our $K$-nearest neighbor model? Well,
+`Smoothness` variables ourselves before building our K-nearest neighbors model? Well,
technically *yes*; but doing so is error-prone. In particular, we might
accidentally forget to apply the same centering / scaling when making
predictions, or accidentally apply a *different* centering / scaling than what
@@ -937,13 +937,13 @@ well within the cloud of benign observations, and the neighbors are all nearly
vertically aligned with the new observation (which is why it looks like there
is only one black line on this plot). Figure \@ref(fig:05-scaling-plt-zoomed)
shows a close-up of that region on the unstandardized plot. Here the computation of nearest
-neighbors is dominated by the much larger-scale area variable. The plot for standardized data
+neighbors is dominated by the much larger-scale `Area` variable. The plot for standardized data
on the right in Figure \@ref(fig:05-scaling-plt) shows a much more intuitively reasonable
selection of nearest neighbors. Thus, standardizing the data can change things
-in an important way when we are using predictive algorithms.
+in an important way when we are using predictive algorithms.
Standardizing your data should be a part of the preprocessing you do
before predictive modeling and you should always think carefully about your problem domain and
-whether you need to standardize your data.
+whether you need to standardize your data.
```{r 05-scaling-plt, echo = FALSE, fig.height = 4, fig.cap = "Comparison of K = 3 nearest neighbors with standardized and unstandardized data."}
@@ -951,7 +951,7 @@ attrs <- c("Area", "Smoothness")
# create a new obs and get its NNs
new_obs <- tibble(Area = 400, Smoothness = 0.135, Class = "unknown")
-my_distances <- table_with_distances(unscaled_cancer[, attrs],
+my_distances <- table_with_distances(unscaled_cancer[, attrs],
new_obs[, attrs])
neighbors <- unscaled_cancer[order(my_distances$Distance), ]
@@ -959,18 +959,18 @@ neighbors <- unscaled_cancer[order(my_distances$Distance), ]
unscaled_cancer <- bind_rows(unscaled_cancer, new_obs)
# plot the scatter
-unscaled <- ggplot(unscaled_cancer, aes(x = Area,
- y = Smoothness,
- group = Class,
- color = Class,
+unscaled <- ggplot(unscaled_cancer, aes(x = Area,
+ y = Smoothness,
+ group = Class,
+ color = Class,
shape = Class, size = Class)) +
- geom_point(alpha = 0.6) +
- scale_color_manual(name = "Diagnosis",
+ geom_point(alpha = 0.6) +
+ scale_color_manual(name = "Diagnosis",
values = c("steelblue2", "orange2", "red")) +
- scale_shape_manual(name = "Diagnosis",
+ scale_shape_manual(name = "Diagnosis",
values= c(16, 16, 18)) +
- scale_size_manual(name = "Diagnosis",
- values=c(2,2,2.5)) +
+ scale_size_manual(name = "Diagnosis",
+ values=c(2,2,2.5)) +
ggtitle("Unstandardized Data") +
geom_segment(aes(
x = unlist(new_obs[1]), y = unlist(new_obs[2]),
@@ -990,7 +990,7 @@ unscaled <- ggplot(unscaled_cancer, aes(x = Area,
# create new scaled obs and get NNs
new_obs_scaled <- tibble(Area = -0.72, Smoothness = 2.8, Class = "unknown")
-my_distances_scaled <- table_with_distances(scaled_cancer[, attrs],
+my_distances_scaled <- table_with_distances(scaled_cancer[, attrs],
new_obs_scaled[, attrs])
neighbors_scaled <- scaled_cancer[order(my_distances_scaled$Distance), ]
@@ -998,21 +998,21 @@ neighbors_scaled <- scaled_cancer[order(my_distances_scaled$Distance), ]
scaled_cancer <- bind_rows(scaled_cancer, new_obs_scaled)
# plot the scatter
-scaled <- ggplot(scaled_cancer, aes(x = Area,
- y = Smoothness,
- group = Class,
- color = Class,
- shape = Class,
+scaled <- ggplot(scaled_cancer, aes(x = Area,
+ y = Smoothness,
+ group = Class,
+ color = Class,
+ shape = Class,
size = Class)) +
- geom_point(alpha = 0.6) +
- scale_color_manual(name = "Diagnosis",
+ geom_point(alpha = 0.6) +
+ scale_color_manual(name = "Diagnosis",
values = c("steelblue2", "orange2", "red")) +
- scale_shape_manual(name = "Diagnosis",
+ scale_shape_manual(name = "Diagnosis",
values= c(16, 16, 18)) +
- scale_size_manual(name = "Diagnosis",
- values=c(2,2,2.5)) +
+ scale_size_manual(name = "Diagnosis",
+ values=c(2,2,2.5)) +
ggtitle("Standardized Data") +
- labs(x = "Area (standardized)", y = "Smoothness (standardized)") +
+ labs(x = "Area (standardized)", y = "Smoothness (standardized)") +
# coord_equal(ratio = 1) +
geom_segment(aes(
x = unlist(new_obs_scaled[1]), y = unlist(new_obs_scaled[2]),
@@ -1036,18 +1036,18 @@ ggarrange(unscaled, scaled, ncol = 2, common.legend = TRUE, legend = "bottom")
```{r 05-scaling-plt-zoomed, fig.height = 4.5, fig.width = 9, echo = FALSE, fig.cap = "Close-up of three nearest neighbors for unstandardized data."}
library(ggforce)
-ggplot(unscaled_cancer, aes(x = Area,
- y = Smoothness,
- group = Class,
- color = Class,
+ggplot(unscaled_cancer, aes(x = Area,
+ y = Smoothness,
+ group = Class,
+ color = Class,
shape = Class)) +
- geom_point(size = 2.5, alpha = 0.6) +
- scale_color_manual(name = "Diagnosis",
+ geom_point(size = 2.5, alpha = 0.6) +
+ scale_color_manual(name = "Diagnosis",
values = c("steelblue2", "orange2", "red")) +
- scale_shape_manual(name = "Diagnosis",
+ scale_shape_manual(name = "Diagnosis",
values= c(16, 16, 18)) +
- scale_size_manual(name = "Diagnosis",
- values = c(1, 1, 2.5)) +
+ scale_size_manual(name = "Diagnosis",
+ values = c(1, 1, 2.5)) +
ggtitle("Unstandardized Data") +
geom_segment(aes(
x = unlist(new_obs[1]), y = unlist(new_obs[2]),
@@ -1063,10 +1063,10 @@ ggplot(unscaled_cancer, aes(x = Area,
x = unlist(new_obs[1]), y = unlist(new_obs[2]),
xend = unlist(neighbors[3, attrs[1]]),
yend = unlist(neighbors[3, attrs[2]])
- ), color = "black", show.legend = FALSE) +
- facet_zoom(x = ( Area > 380 & Area < 420) ,
- y = (Smoothness > 0.08 & Smoothness < 0.14), zoom.size = 2) +
- theme_bw() +
+ ), color = "black", show.legend = FALSE) +
+ facet_zoom(x = ( Area > 380 & Area < 420) ,
+ y = (Smoothness > 0.08 & Smoothness < 0.14), zoom.size = 2) +
+ theme_bw() +
theme(text = element_text(size = 18), axis.title=element_text(size=18), legend.position="bottom")
```
@@ -1074,7 +1074,7 @@ ggplot(unscaled_cancer, aes(x = Area,
Another potential issue in a data set for a classifier is *class imbalance*, \index{balance}\index{imbalance}
i.e., when one label is much more common than another. Since classifiers like
-the $K$-nearest neighbor algorithm use the labels of nearby points to predict
+the K-nearest neighbors algorithm use the labels of nearby points to predict
the label of a new point, if there are many more data points with one label
overall, the algorithm is more likely to pick that label in general (even if
the "pattern" of data suggests otherwise). Class imbalance is actually quite a
@@ -1083,11 +1083,11 @@ detection, there are many cases in which the "important" class to identify
(presence of disease, malicious email) is much rarer than the "unimportant"
class (no disease, normal email).
-To better illustrate the problem, let's revisit the scaled breast cancer data,
+To better illustrate the problem, let's revisit the scaled breast cancer data,
`cancer`; except now we will remove many of the observations of malignant tumors, simulating
what the data would look like if the cancer was rare. We will do this by
picking only 3 observations from the malignant group, and keeping all
-of the benign observations.
+of the benign observations.
We choose these 3 observations using the `slice_head`
function, which takes two arguments: a data frame-like object,
and the number of rows to select from the top (`n`).
@@ -1096,7 +1096,7 @@ data frames back together, and name the result `rare_cancer`.
The new imbalanced data is shown in Figure \@ref(fig:05-unbalanced).
```{r 05-unbalanced-seed, echo = FALSE, fig.height = 3.5, fig.width = 4.5, warning = FALSE, message = FALSE}
-# hidden seed here for reproducibility
+# hidden seed here for reproducibility
# randomness shouldn't affect much in this use of step_upsample,
# but just in case...
set.seed(3)
@@ -1112,7 +1112,7 @@ rare_cancer <- bind_rows(
rare_plot <- rare_cancer |>
ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
geom_point(alpha = 0.5) +
- labs(x = "Perimeter (standardized)",
+ labs(x = "Perimeter (standardized)",
y = "Concavity (standardized)",
color = "Diagnosis") +
scale_color_manual(values = c("orange2", "steelblue2")) +
@@ -1121,8 +1121,8 @@ rare_plot <- rare_cancer |>
rare_plot
```
-Suppose we now decided to use $K = 7$ in $K$-nearest neighbor classification.
-With only 3 observations of malignant tumors, the classifier
+Suppose we now decided to use $K = 7$ in K-nearest neighbors classification.
+With only 3 observations of malignant tumors, the classifier
will *always predict that the tumor is benign, no matter what its concavity and perimeter
are!* This is because in a majority vote of 7 observations, at most 3 will be
malignant (we only have 3 total malignant observations), so at least 4 must be
@@ -1138,20 +1138,20 @@ my_distances <- bind_cols(my_distances, select(rare_cancer, Class))
neighbors <- rare_cancer[order(my_distances$Distance), ]
-rare_plot <- bind_rows(rare_cancer,
- tibble(Perimeter = new_point[1],
- Concavity = new_point[2],
+rare_plot <- bind_rows(rare_cancer,
+ tibble(Perimeter = new_point[1],
+ Concavity = new_point[2],
Class = "unknown")) |>
ggplot(aes(x = Perimeter, y = Concavity, color = Class, shape = Class)) +
geom_point(alpha = 0.5) +
- labs(color = "Diagnosis",
- x = "Perimeter (standardized)",
- y = "Concavity (standardized)") +
- scale_color_manual(name = "Diagnosis",
+ labs(color = "Diagnosis",
+ x = "Perimeter (standardized)",
+ y = "Concavity (standardized)") +
+ scale_color_manual(name = "Diagnosis",
values = c("steelblue2", "orange2", "red")) +
- scale_shape_manual(name = "Diagnosis",
- values= c(16, 16, 18))+
- scale_size_manual(name = "Diagnosis",
+ scale_shape_manual(name = "Diagnosis",
+ values= c(16, 16, 18))+
+ scale_size_manual(name = "Diagnosis",
values= c(2, 2, 2.5))
for (i in 1:7) {
@@ -1174,9 +1174,9 @@ rare_plot + geom_point(aes(x = new_point[1], y = new_point[2]),
)
```
-Figure \@ref(fig:05-upsample-2) shows what happens if we set the background color of
-each area of the plot to the predictions the $K$-nearest neighbor
-classifier would make. We can see that the decision is
+Figure \@ref(fig:05-upsample-2) shows what happens if we set the background color of
+each area of the plot to the predictions the K-nearest neighbors
+classifier would make. We can see that the decision is
always "benign," corresponding to the blue color.
```{r 05-upsample-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data with background color indicating the decision of the classifier; the points represent the labeled data."}
@@ -1189,33 +1189,33 @@ knn_fit <- knn_spec |>
fit(Class ~ ., data = rare_cancer)
# create a prediction pt grid
-per_grid <- seq(min(rare_cancer$Perimeter),
- max(rare_cancer$Perimeter),
+per_grid <- seq(min(rare_cancer$Perimeter),
+ max(rare_cancer$Perimeter),
length.out = 100)
-con_grid <- seq(min(rare_cancer$Concavity),
- max(rare_cancer$Concavity),
+con_grid <- seq(min(rare_cancer$Concavity),
+ max(rare_cancer$Concavity),
length.out = 100)
pcgrid <- as_tibble(expand.grid(Perimeter = per_grid, Concavity = con_grid))
knnPredGrid <- predict(knn_fit, pcgrid)
-prediction_table <- bind_cols(knnPredGrid, pcgrid) |>
+prediction_table <- bind_cols(knnPredGrid, pcgrid) |>
rename(Class = .pred_class)
# create the basic plt
rare_plot <-
ggplot() +
- geom_point(data = rare_cancer,
- mapping = aes(x = Perimeter,
- y = Concavity,
- color = Class),
+ geom_point(data = rare_cancer,
+ mapping = aes(x = Perimeter,
+ y = Concavity,
+ color = Class),
alpha = 0.75) +
- geom_point(data = prediction_table,
- mapping = aes(x = Perimeter,
- y = Concavity,
- color = Class),
- alpha = 0.02,
+ geom_point(data = prediction_table,
+ mapping = aes(x = Perimeter,
+ y = Concavity,
+ color = Class),
+ alpha = 0.02,
size = 5.) +
- labs(color = "Diagnosis",
- x = "Perimeter (standardized)",
+ labs(color = "Diagnosis",
+ x = "Perimeter (standardized)",
y = "Concavity (standardized)") +
scale_color_manual(values = c("orange2", "steelblue2"))
@@ -1226,7 +1226,7 @@ Despite the simplicity of the problem, solving it in a statistically sound manne
fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook.
For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class. \index{oversampling}
In other words, we will replicate rare observations multiple times in our data set to give them more
-voting power in the $K$-nearest neighbor algorithm. In order to do this, we will add an oversampling
+voting power in the K-nearest neighbors algorithm. In order to do this, we will add an oversampling
step to the earlier `uc_recipe` recipe with the `step_upsample` function from the `themis` R package. \index{recipe!step\_upsample}
We show below how to do this, and also
use the `group_by` and `summarize` functions to see that our classes are now balanced:
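A rough sketch of the upsampling step looks like this; the `over_ratio = 1` and `skip = FALSE` arguments, and the use of `prep` and `bake` to apply the recipe, are plausible choices here rather than necessarily the chapter's exact code:

```r
library(themis)

# add an upsampling step for the Class variable, then estimate and apply the recipe
ups_recipe <- recipe(Class ~ ., data = rare_cancer) |>
  step_upsample(Class, over_ratio = 1, skip = FALSE) |>
  prep()

upsampled_cancer <- bake(ups_recipe, rare_cancer)
```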
@@ -1252,11 +1252,11 @@ upsampled_cancer |>
summarize(n = n())
```
-Now suppose we train our $K$-nearest neighbor classifier with $K=7$ on this *balanced* data.
-Figure \@ref(fig:05-upsample-plot) shows what happens now when we set the background color
-of each area of our scatter plot to the decision the $K$-nearest neighbor
+Now suppose we train our K-nearest neighbors classifier with $K=7$ on this *balanced* data.
+Figure \@ref(fig:05-upsample-plot) shows what happens now when we set the background color
+of each area of our scatter plot to the decision the K-nearest neighbors
classifier would make. We can see that the decision is more reasonable; when the points are close
-to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
+to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
closer to the benign tumor observations.
```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
@@ -1269,24 +1269,24 @@ knn_fit <- knn_spec |>
# create a prediction pt grid
knnPredGrid <- predict(knn_fit, pcgrid)
-prediction_table <- bind_cols(knnPredGrid, pcgrid) |>
+prediction_table <- bind_cols(knnPredGrid, pcgrid) |>
rename(Class = .pred_class)
# create the basic plt
upsampled_plot <-
ggplot() +
- geom_point(data = prediction_table,
- mapping = aes(x = Perimeter,
- y = Concavity,
- color = Class),
+ geom_point(data = prediction_table,
+ mapping = aes(x = Perimeter,
+ y = Concavity,
+ color = Class),
alpha = 0.02, size = 5.) +
- geom_point(data = rare_cancer,
- mapping = aes(x = Perimeter,
- y = Concavity,
- color = Class),
+ geom_point(data = rare_cancer,
+ mapping = aes(x = Perimeter,
+ y = Concavity,
+ color = Class),
alpha = 0.75) +
- labs(color = "Diagnosis",
- x = "Perimeter (standardized)",
+ labs(color = "Diagnosis",
+ x = "Perimeter (standardized)",
y = "Concavity (standardized)") +
scale_color_manual(values = c("orange2", "steelblue2"))
@@ -1322,13 +1322,13 @@ missing_cancer <- read_csv("data/wdbc_missing.csv") |>
mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
missing_cancer
```
-Recall that K-nearest neighbor classification makes predictions by computing
+Recall that K-nearest neighbors classification makes predictions by computing
the straight-line distance to nearby training observations, and hence requires
access to the values of *all* variables for *all* observations in the training
-data. So how can we perform K-nearest neighbor classification in the presence
+data. So how can we perform K-nearest neighbors classification in the presence
of missing data? Well, since there are not too many observations with missing
entries, one option is to simply remove those observations prior to building
-the K-nearest neighbor classifier. We can accomplish this by using the
+the K-nearest neighbors classifier. We can accomplish this by using the
`drop_na` function from `tidyverse` prior to working with the data.
```{r 05-naomit}
@@ -1363,7 +1363,7 @@ imputed_cancer <- bake(impute_missing_recipe, missing_cancer)
imputed_cancer
```
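For reference, the `impute_missing_recipe` used above could be constructed along these lines; treating mean imputation of all predictors as the specific imputation step is an assumption in this sketch:

```r
# build a recipe that fills in missing predictor values with column means
impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |>
  step_impute_mean(all_predictors()) |>
  prep()
```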
-Many other options for missing data imputation can be found in
+Many other options for missing data imputation can be found in
[the `recipes` documentation](https://recipes.tidymodels.org/reference/index.html). However
you decide to handle missing data in your data analysis, it is always crucial
to think critically about the setting, how the data were collected, and the
@@ -1380,13 +1380,13 @@ with the `wdbc_unscaled.csv` data. First we will load the data, create a
model, and specify a recipe for how the data should be preprocessed:
```{r 05-workflow, message = FALSE, warning = FALSE}
-# load the unscaled cancer data
+# load the unscaled cancer data
# and make sure the response variable, Class, is a factor
unscaled_cancer <- read_csv("data/wdbc_unscaled.csv") |>
mutate(Class = as_factor(Class)) |>
mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
-# create the KNN model
+# create the K-NN model
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
set_engine("kknn") |>
set_mode("classification")
@@ -1399,7 +1399,7 @@ uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) |>
Note that each of these steps is exactly the same as earlier, except for one major difference:
we did not use the `select` function to extract the relevant variables from the data frame,
-and instead simply specified the relevant variables to use via the
+and instead simply specified the relevant variables to use via the
formula `Class ~ Area + Smoothness` (instead of `Class ~ .`) in the recipe.
You will also notice that we did not call `prep()` on the recipe; this is unnecessary when it is
placed in a workflow.
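Bundling the recipe and model specification, and fitting them as a single unit, might then look roughly like this sketch (the object name `knn_fit` simply matches how the fitted workflow is referred to afterwards):

```r
# combine the recipe and model specification in a workflow, then fit on the training data
knn_fit <- workflow() |>
  add_recipe(uc_recipe) |>
  add_model(knn_spec) |>
  fit(data = unscaled_cancer)

knn_fit
```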
@@ -1427,7 +1427,7 @@ for the number of neighbors and weight function (for now, these are just the val
manually when we created `knn_spec` above). But now the fit object also includes information about
the overall workflow, including the centering and scaling preprocessing steps.
In other words, when we use the `predict` function with the `knn_fit` object to make a prediction for a new
-observation, it will first apply the same recipe steps to the new observation.
+observation, it will first apply the same recipe steps to the new observation.
As an example, we will predict the class label of two new observations:
one with `Area = 500` and `Smoothness = 0.075`, and one with `Area = 1500` and `Smoothness = 0.1`.
@@ -1439,37 +1439,37 @@ prediction
```
The classifier predicts that the first observation is benign, while the second is
-malignant. Figure \@ref(fig:05-workflow-plot-show) visualizes the predictions that this
-trained $K$-nearest neighbor model will make on a large range of new observations.
+malignant. Figure \@ref(fig:05-workflow-plot-show) visualizes the predictions that this
+trained K-nearest neighbors model will make on a large range of new observations.
Although you have seen colored prediction map visualizations like this a few times now,
we have not included the code to generate them, as it is a little bit complicated.
-For the interested reader who wants a learning challenge, we now include it below.
-The basic idea is to create a grid of synthetic new observations using the `expand.grid` function,
-predict the label of each, and visualize the predictions with a colored scatter having a very high transparency
+For the interested reader who wants a learning challenge, we now include it below.
+The basic idea is to create a grid of synthetic new observations using the `expand.grid` function,
+predict the label of each, and visualize the predictions with a colored scatter plot that has very high transparency
(low `alpha` value) and large point radius. See if you can figure out what each line is doing!
-\pagebreak
+\pagebreak
> **Note:** Understanding this code is not required for the remainder of the
> textbook. It is included for those readers who would like to use similar
-> visualizations in their own data analyses.
+> visualizations in their own data analyses.
```{r 05-workflow-plot-show, fig.height = 3.5, fig.width = 4.6, fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier."}
# create the grid of area/smoothness vals, and arrange in a data frame
-are_grid <- seq(min(unscaled_cancer$Area),
- max(unscaled_cancer$Area),
+are_grid <- seq(min(unscaled_cancer$Area),
+ max(unscaled_cancer$Area),
length.out = 100)
-smo_grid <- seq(min(unscaled_cancer$Smoothness),
- max(unscaled_cancer$Smoothness),
+smo_grid <- seq(min(unscaled_cancer$Smoothness),
+ max(unscaled_cancer$Smoothness),
length.out = 100)
-asgrid <- as_tibble(expand.grid(Area = are_grid,
+asgrid <- as_tibble(expand.grid(Area = are_grid,
Smoothness = smo_grid))
# use the fit workflow to make predictions at the grid points
knnPredGrid <- predict(knn_fit, asgrid)
# bind the predictions as a new column with the grid points
-prediction_table <- bind_cols(knnPredGrid, asgrid) |>
+prediction_table <- bind_cols(knnPredGrid, asgrid) |>
rename(Class = .pred_class)
# plot:
@@ -1477,19 +1477,19 @@ prediction_table <- bind_cols(knnPredGrid, asgrid) |>
# 2. the faded colored scatter for the grid points
wkflw_plot <-
ggplot() +
- geom_point(data = unscaled_cancer,
- mapping = aes(x = Area,
- y = Smoothness,
- color = Class),
+ geom_point(data = unscaled_cancer,
+ mapping = aes(x = Area,
+ y = Smoothness,
+ color = Class),
alpha = 0.75) +
- geom_point(data = prediction_table,
- mapping = aes(x = Area,
- y = Smoothness,
- color = Class),
- alpha = 0.02,
+ geom_point(data = prediction_table,
+ mapping = aes(x = Area,
+ y = Smoothness,
+ color = Class),
+ alpha = 0.02,
size = 5) +
- labs(color = "Diagnosis",
- x = "Area",
+ labs(color = "Diagnosis",
+ x = "Area",
y = "Smoothness") +
scale_color_manual(values = c("orange2", "steelblue2")) +
theme(text = element_text(size = 12))
@@ -1499,8 +1499,8 @@ wkflw_plot
## Exercises
-Practice exercises for the material covered in this chapter
-can be found in the accompanying
+Practice exercises for the material covered in this chapter
+can be found in the accompanying
[worksheets repository](https://worksheets.datasciencebook.ca)
in the "Classification I: training and predicting" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
diff --git a/source/classification2.Rmd b/source/classification2.Rmd
index f5ae6af3e..ac649c2ad 100644
--- a/source/classification2.Rmd
+++ b/source/classification2.Rmd
@@ -38,18 +38,18 @@ hidden_print_cli <- function(x){
cleanup_and_print(cli::cli_fmt(capture.output(x)))
}
-theme_update(axis.title = element_text(size = 12)) # modify axis label size in plots
+theme_update(axis.title = element_text(size = 12)) # modify axis label size in plots
```
-## Overview
+## Overview
This chapter continues the introduction to predictive modeling through
classification. While the previous chapter covered training and data
preprocessing, this chapter focuses on how to evaluate the performance of
a classifier, as well as how to improve the classifier (where possible)
to maximize its accuracy.
-## Chapter learning objectives
+## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:
- Describe what training, validation, and test data sets are and how they are used in classification.
@@ -57,10 +57,11 @@ By the end of the chapter, readers will be able to do the following:
- Describe what a random seed is and its importance in reproducible data analysis.
- Set the random seed in R using the `set.seed` function.
- Describe and interpret accuracy, precision, recall, and confusion matrices.
-- Evaluate classification accuracy in R using a validation data set.
+- Evaluate classification accuracy, precision, and recall in R using a test set, a single validation set, and cross-validation.
- Produce a confusion matrix in R.
-- Execute cross-validation in R to choose the number of neighbors in a $K$-nearest neighbors classifier.
-- Describe the advantages and disadvantages of the $K$-nearest neighbors classification algorithm.
+- Choose the number of neighbors in a K-nearest neighbors classifier by maximizing estimated cross-validation accuracy.
+- Describe underfitting and overfitting, and relate it to the number of neighbors in K-nearest neighbors classification.
+- Describe the advantages and disadvantages of the K-nearest neighbors classification algorithm.
## Evaluating performance
@@ -75,9 +76,9 @@ and the classifier will be asked to decide whether the tumor is benign or
malignant. The key word here is *new*: our classifier is "good" if it provides
accurate predictions on data *not seen during training*, as this implies that
it has actually learned about the relationship between the predictor variables and response variable,
-as opposed to simply memorizing the labels of individual training data examples.
+as opposed to simply memorizing the labels of individual training data examples.
But then, how can we evaluate our classifier without visiting the hospital to collect more
-tumor images?
+tumor images?
The trick is to split the data into a **training set** \index{training set} and **test set** \index{test set} (Figure \@ref(fig:06-training-test))
and use only the **training set** when building the classifier.
@@ -87,7 +88,7 @@ labels for the observations in the **test set**, then we have some
confidence that our classifier might also accurately predict the class
labels for new observations without known class labels.
-> **Note:** If there were a golden rule of machine learning, \index{golden rule of machine learning} it might be this:
+> **Note:** If there were a golden rule of machine learning, \index{golden rule of machine learning} it might be this:
> *you cannot use the test data to build the model!* If you do, the model gets to
> "see" the test data in advance, making it look more accurate than it really
> is. Imagine how bad it would be to overestimate your classifier's accuracy
@@ -101,7 +102,7 @@ How exactly can we assess how well our predictions match the actual labels for
the observations in the test set? One way we can do this is to calculate the
prediction **accuracy**. \index{prediction accuracy|see{accuracy}}\index{accuracy} This is the fraction of examples for which the
classifier made the correct prediction. To calculate this, we divide the number
-of correct predictions by the number of predictions made.
+of correct predictions by the number of predictions made.
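Written as a formula, this is:
$$\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}}$$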
The process for assessing if our predictions match the actual labels in the
test set is illustrated in Figure \@ref(fig:06-ML-paradigm-test).
@@ -123,7 +124,7 @@ classifier tends to make. Table \@ref(tab:confusion-matrix) shows an example
of what a confusion matrix might look like for the tumor image data with
a test set of 65 observations.
-Table: (\#tab:confusion-matrix) An example confusion matrix for the tumor image data.
+Table: (\#tab:confusion-matrix) An example confusion matrix for the tumor image data.
| | Actually Malignant | Actually Benign |
| ---------------------- | --------------- | -------------- |
@@ -144,7 +145,7 @@ But we can also see that the classifier only identified 1 out of 4 total maligna
tumors; in other words, it misclassified 75% of the malignant cases present in the
data set! In this example, misclassifying a malignant tumor is a potentially
disastrous error, since it may lead to a patient who requires treatment not receiving it.
-Since we are particularly interested in identifying malignant cases, this
+Since we are particularly interested in identifying malignant cases, this
classifier would likely be unacceptable even with an accuracy of 89%.
Focusing more on one label than the other is
@@ -214,31 +215,31 @@ Beginning in this chapter, our data analyses will often involve the use
of *randomness*. \index{random} We use randomness any time we need to make a decision in our
analysis that needs to be fair, unbiased, and not influenced by human input.
For example, in this chapter, we need to split
-a data set into a training set and test set to evaluate our classifier. We
+a data set into a training set and test set to evaluate our classifier. We
certainly do not want to choose how to split
the data ourselves by hand, as we want to avoid accidentally influencing the result
of the evaluation. So instead, we let R *randomly* split the data.
In future chapters we will use randomness
-in many other ways, e.g., to help us select a small subset of data from a larger data set,
+in many other ways, e.g., to help us select a small subset of data from a larger data set,
to pick groupings of data, and more.
-However, the use of randomness runs counter to one of the main
+However, the use of randomness runs counter to one of the main
tenets of good data analysis practice: \index{reproducible} *reproducibility*. Recall that a reproducible
analysis produces the same result each time it is run; if we include randomness
in the analysis, would we not get a different result each time?
-The trick is that in R—and other programming languages—randomness
+The trick is that in R—and other programming languages—randomness
is not actually random! Instead, R uses a *random number generator* that
produces a sequence of numbers that
are completely determined by a\index{seed} \index{random seed|see{seed}}
- *seed value*. Once you set the seed value
+ *seed value*. Once you set the seed value
using the \index{seed!set.seed} `set.seed` function, everything after that point may *look* random,
but is actually totally reproducible. As long as you pick the same seed
value, you get the same result!
-Let's use an example to investigate how seeds work in R. Say we want
+Let's use an example to investigate how seeds work in R. Say we want
to randomly pick 10 numbers from 0 to 9 in R using the `sample` \index{sample!function} function,
but we want it to be reproducible. Before using the sample function,
-we call `set.seed`, and pass it any integer as an argument.
+we call `set.seed`, and pass it any integer as an argument.
Here, we pass in the number `1`.
```{r}
@@ -248,8 +249,8 @@ random_numbers1
```
You can see that `random_numbers1` is a vector of 10 numbers
-from 0 to 9 that, from all appearances, looks random. If
-we run the `sample` function again, we will
+from 0 to 9 that, from all appearances, looks random. If
+we run the `sample` function again, we will
get a fresh batch of 10 numbers that also look random.
```{r}
@@ -259,7 +260,7 @@ random_numbers2
If we want to force R to produce the same sequences of random numbers,
we can simply call the `set.seed` function again with the same argument
-value.
+value.
```{r}
set.seed(1)
@@ -270,7 +271,7 @@ random_numbers2_again <- sample(0:9, 10, replace=TRUE)
random_numbers2_again
```
-Notice that after setting the seed, we get the same two sequences of numbers in the same order. `random_numbers1` and `random_numbers1_again` produce the same sequence of numbers, and the same can be said about `random_numbers2` and `random_numbers2_again`. And if we choose
+Notice that after setting the seed, we get the same two sequences of numbers in the same order. `random_numbers1` and `random_numbers1_again` produce the same sequence of numbers, and the same can be said about `random_numbers2` and `random_numbers2_again`. And if we choose
a different value for the seed—say, 4235—we
obtain a different sequence of random numbers.
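Following the same pattern as before, that would look something like this sketch:

```r
set.seed(4235)
random_numbers <- sample(0:9, 10, replace = TRUE)
random_numbers
```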
@@ -284,33 +285,33 @@ random_numbers
```
In other words, even though the sequences of numbers that R is generating *look*
-random, they are totally determined when we set a seed value!
+random, they are totally determined when we set a seed value!
So what does this mean for data analysis? Well, `sample` is certainly
-not the only function that uses randomness in R. Many of the functions
+not the only function that uses randomness in R. Many of the functions
that we use in `tidymodels`, `tidyverse`, and beyond use randomness—many of them
without even telling you about it. So at the beginning of every data analysis you
do, right after loading packages, you should call the `set.seed` function and
pass it an integer that you pick.
Also note that when R starts up, it creates its own seed to use. So if you do not
-explicitly call the `set.seed` function in your code, your results will
+explicitly call the `set.seed` function in your code, your results will
likely not be reproducible.
And finally, be careful to set the seed *only once* at the beginning of a data
analysis. Each time you set the seed, you are inserting your own human input,
thereby influencing the analysis. If you use `set.seed` many times
-throughout your analysis, the randomness that R uses will not look
+throughout your analysis, the randomness that R uses will not look
as random as it should.
In summary: if you want your analysis to be reproducible, i.e., produce *the same result* each time you
run it, make sure to use `set.seed` exactly once at the beginning of the analysis.
-Different argument values in `set.seed` lead to different patterns of randomness, but as long as
-you pick the same argument value your result will be the same.
+Different argument values in `set.seed` lead to different patterns of randomness, but as long as
+you pick the same argument value your result will be the same.
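In practice, the start of an analysis might therefore look like the following sketch; the packages loaded and the particular seed value are placeholders, not a prescription:

```r
library(tidyverse)
library(tidymodels)

# set the seed exactly once, right after loading packages
set.seed(1234)
```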
In the remainder of the textbook, we will set the seed once at the beginning of each chapter.
## Evaluating performance with `tidymodels`
Back to evaluating classifiers now!
-In R, we can use the `tidymodels` package \index{tidymodels} not only to perform $K$-nearest neighbors
-classification, but also to assess how well our classification worked.
+In R, we can use the `tidymodels` package \index{tidymodels} not only to perform K-nearest neighbors
+classification, but also to assess how well our classification worked.
Let's work through an example of how to use tools from `tidymodels` to evaluate a classifier
using the breast cancer data set from the previous chapter.
We begin the analysis by loading the packages we require,
@@ -341,7 +342,7 @@ perim_concav <- cancer |>
ggplot(aes(x = Smoothness, y = Concavity, color = Class)) +
geom_point(alpha = 0.5) +
labs(color = "Diagnosis") +
- scale_color_manual(values = c("orange2", "steelblue2")) +
+ scale_color_manual(values = c("orange2", "steelblue2")) +
theme(text = element_text(size = 12))
perim_concav
@@ -367,7 +368,7 @@ Second, it **stratifies** the \index{stratification} data by the class label, to
the same proportion of each class ends up in both the training and testing sets. For example,
in our data set, roughly 63% of the
observations are from the benign class, and 37% are from the malignant class,
-so `initial_split` ensures that roughly 63% of the training data are benign,
+so `initial_split` ensures that roughly 63% of the training data are benign,
37% of the training data are malignant,
and the same proportions exist in the testing data.
@@ -378,7 +379,7 @@ in the training set. We will also set the `strata` argument to the categorical l
right proportions of each category of observation.
The `training` and `testing` functions then extract the training and testing
data sets into two separate data frames.
-Note that the `initial_split` function uses randomness, but since we set the
+Note that the `initial_split` function uses randomness, but since we set the
seed earlier in the chapter, the split will be reproducible.
```{r 06-initial-split-seed, echo = FALSE, message = FALSE, warning = FALSE}
@@ -389,7 +390,7 @@ set.seed(2)
```{r 06-initial-split}
cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
cancer_train <- training(cancer_split)
-cancer_test <- testing(cancer_split)
+cancer_test <- testing(cancer_split)
```
```{r 06-glimpse-training-and-test-sets}
@@ -404,14 +405,14 @@ that we use the `glimpse` function to view data with a large number of columns,
as it prints the data such that the columns go down the page (instead of across).
```{r 06-train-prop, echo = FALSE}
-train_prop <- cancer_train |>
+train_prop <- cancer_train |>
group_by(Class) |>
summarize(proportion = n()/nrow(cancer_train))
```
-We can use `group_by` and `summarize` to \index{group\_by}\index{summarize} find the percentage of malignant and benign classes
+We can use `group_by` and `summarize` to \index{group\_by}\index{summarize} find the percentage of malignant and benign classes
in `cancer_train` and we see about `r round(filter(train_prop, Class == "Benign")$proportion, 2)*100`% of the training
-data are benign and `r round(filter(train_prop, Class == "Malignant")$proportion, 2)*100`%
+data are benign and `r round(filter(train_prop, Class == "Malignant")$proportion, 2)*100`%
are malignant, indicating that our class proportions were roughly preserved when we split the data.
```{r 06-train-proportion}
@@ -425,7 +426,7 @@ cancer_proportions
### Preprocess the data
-As we mentioned in the last chapter, $K$-nearest neighbors is sensitive to the scale of the predictors,
+As we mentioned in the last chapter, K-nearest neighbors is sensitive to the scale of the predictors,
so we should perform some preprocessing to standardize them. An
additional consideration we need to take when doing this is that we should
create the standardization preprocessor using **only the training data**. This ensures that
@@ -446,7 +447,7 @@ cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |>
### Train the classifier
Now that we have split our original data set into training and test sets, we
-can create our $K$-nearest neighbors classifier with only the training set using
+can create our K-nearest neighbors classifier with only the training set using
the technique we learned in the previous chapter. For now, we will just choose
the number $K$ of neighbors to be 3, and use concavity and smoothness as the
predictors. As before we need to create a model specification, combine
@@ -477,7 +478,7 @@ hidden_print(knn_fit)
### Predict the labels in the test set
-Now that we have a $K$-nearest neighbors classifier object, we can use it to
+Now that we have a K-nearest neighbors classifier object, we can use it to
predict the class labels for our test set. We use the `bind_cols` \index{bind\_cols} to add the
column of predictions to the original test data, creating the
`cancer_test_predictions` data frame. The `Class` variable contains the actual
@@ -505,8 +506,8 @@ cancer_test_predictions |>
```
```{r 06-accuracy-2, echo = FALSE, warning = FALSE}
-cancer_acc_1 <- cancer_test_predictions |>
- metrics(truth = Class, estimate = .pred_class) |>
+cancer_acc_1 <- cancer_test_predictions |>
+ metrics(truth = Class, estimate = .pred_class) |>
filter(.metric == 'accuracy')
cancer_prec_1 <- cancer_test_predictions |>
@@ -519,7 +520,7 @@ cancer_rec_1 <- cancer_test_predictions |>
In the metrics data frame, we filtered the `.metric` column since we are
interested in the `accuracy` row. Other entries involve other metrics that
are beyond the scope of this book. Looking at the value of the `.estimate` variable
- shows that the estimated accuracy of the classifier on the test data
+shows that the estimated accuracy of the classifier on the test data
was `r round(100*cancer_acc_1$.estimate, 0)`%.
To compute the precision and recall, we can use the `precision` and `recall` functions
from `tidymodels`. We first check the order of the
@@ -545,8 +546,7 @@ cancer_test_predictions |>
The output shows that the estimated precision and recall of the classifier on the test data were
`r round(100*cancer_prec_1$.estimate, 0)`% and `r round(100*cancer_rec_1$.estimate, 0)`%, respectively.
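A sketch of how such precision and recall values can be computed with these functions is shown below; the `event_level = "first"` argument, which treats the first factor level as the positive class, is an assumption and should match the level order checked above:

```r
cancer_test_predictions |>
  precision(truth = Class, estimate = .pred_class, event_level = "first")

cancer_test_predictions |>
  recall(truth = Class, estimate = .pred_class, event_level = "first")
```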
-Finally, we can look at the *confusion matrix* for
-the classifier using the `conf_mat` function.
+Finally, we can look at the *confusion matrix* for the classifier using the `conf_mat` function.
```{r 06-confusionmat}
confusion <- cancer_test_predictions |>
@@ -562,11 +562,11 @@ confu21 <- (confusionmt |> filter(name == "cell_2_1"))$value
confu22 <- (confusionmt |> filter(name == "cell_2_2"))$value
```
-The confusion matrix shows `r confu11` observations were correctly predicted
-as malignant, and `r confu22` were correctly predicted as benign.
+The confusion matrix shows `r confu11` observations were correctly predicted
+as malignant, and `r confu22` were correctly predicted as benign.
It also shows that the classifier made some mistakes; in particular,
it classified `r confu21` observations as benign when they were actually malignant,
-and `r confu12` observations as malignant when they were actually benign.
+and `r confu12` observations as malignant when they were actually benign.
Using our formulas from earlier, we see that the accuracy, precision, and recall agree with what R reported.
$$\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{`r confu11`+`r confu22`}{`r confu11`+`r confu22`+`r confu12`+`r confu21`} = `r round((confu11+confu22)/(confu11+confu22+confu12+confu21),3)`$$
@@ -603,12 +603,12 @@ classification problem: the *majority classifier*. The majority classifier \inde
*always* guesses the majority class label from the training data, regardless of
the predictor variables' values. It helps to give you a sense of
scale when considering accuracies. If the majority classifier obtains a 90%
-accuracy on a problem, then you might hope for your $K$-nearest neighbors
+accuracy on a problem, then you might hope for your K-nearest neighbors
classifier to do better than that. If your classifier provides a significant
improvement upon the majority classifier, this means that at least your method
is extracting some useful information from your predictor variables. Be
careful though: improving on the majority classifier does not *necessarily*
-mean the classifier is working well enough for your application.
+mean the classifier is working well enough for your application.
As an example, in the breast cancer data, recall the proportions of benign and malignant
observations in the training data are as follows:
@@ -627,13 +627,13 @@ Since the benign class represents the majority of the training data,
the majority classifier would *always* predict that a new observation
is benign. The estimated accuracy of the majority classifier is usually
fairly close to the majority class proportion in the training data.
-In this case, we would suspect that the majority classifier will have
+In this case, we would suspect that the majority classifier will have
an accuracy of around `r round(cancer_propn_1[1,1], 0)`%.
-The $K$-nearest neighbors classifier we built does quite a bit better than this,
-with an accuracy of `r round(100*cancer_acc_1$.estimate, 0)`%.
+The K-nearest neighbors classifier we built does quite a bit better than this,
+with an accuracy of `r round(100*cancer_acc_1$.estimate, 0)`%.
This means that from the perspective of accuracy,
-the $K$-nearest neighbors classifier improved quite a bit on the basic
-majority classifier. Hooray! But we still need to be cautious; in
+the K-nearest neighbors classifier improved quite a bit on the basic
+majority classifier. Hooray! But we still need to be cautious; in
this application, it is likely very important not to misdiagnose any malignant tumors to avoid missing
patients who actually need medical care. The confusion matrix above shows
that the classifier does, indeed, misdiagnose a significant number of malignant tumors as benign (`r confu21`
@@ -647,30 +647,30 @@ for the application.
The vast majority of predictive models in statistics and machine learning have
*parameters*. A *parameter*\index{parameter}\index{tuning parameter|see{parameter}}
is a number you have to pick in advance that determines
-some aspect of how the model behaves. For example, in the $K$-nearest neighbors
+some aspect of how the model behaves. For example, in the K-nearest neighbors
classification algorithm, $K$ is a parameter that we have to pick
-that determines how many neighbors participate in the class vote.
-By picking different values of $K$, we create different classifiers
+that determines how many neighbors participate in the class vote.
+By picking different values of $K$, we create different classifiers
that make different predictions.
-So then, how do we pick the *best* value of $K$, i.e., *tune* the model?
+So then, how do we pick the *best* value of $K$, i.e., *tune* the model?
And is it possible to make this selection in a principled way? In this book,
-we will focus on maximizing the accuracy of the classifier. Ideally,
+we will focus on maximizing the accuracy of the classifier. Ideally,
we want somehow to maximize the accuracy of our classifier on data *it
hasn't seen yet*. But we cannot use our test data set in the process of building
our model. So we will play the same trick we did before when evaluating
our classifier: we'll split our *training data itself* into two subsets,
use one to train the model, and then use the other to evaluate it.
-In this section, we will cover the details of this procedure, as well as
+In this section, we will cover the details of this procedure, as well as
how to use it to help you pick a good parameter value for your classifier.
**And remember:** don't touch the test set during the tuning process. Tuning is a part of model training!
### Cross-validation
-The first step in choosing the parameter $K$ is to be able to evaluate the
+The first step in choosing the parameter $K$ is to be able to evaluate the
classifier using only the training data. If this is possible, then we can compare
-the classifier's performance for different values of $K$—and pick the best—using
+the classifier's performance for different values of $K$—and pick the best—using
only the training data. As suggested at the beginning of this section, we will
accomplish this by splitting the training data, training on one subset, and evaluating
on the other. The subset of training data used for evaluation is often called the **validation set**. \index{validation set}
@@ -687,12 +687,12 @@ data *once*, our best parameter choice will depend strongly on whatever data
was lucky enough to end up in the validation set. Perhaps using multiple
different train/validation splits, we'll get a better estimate of accuracy,
which will lead to a better choice of the number of neighbors $K$ for the
-overall set of training data.
+overall set of training data.
Let's investigate this idea in R! In particular, we will generate five different train/validation
-splits of our overall training data, train five different $K$-nearest neighbors
+splits of our overall training data, train five different K-nearest neighbors
models, and evaluate their accuracy. We will start with just a single
-split.
+split.
```{r 06-five-splits-seed, echo = FALSE, warning = FALSE, message = FALSE}
# hidden seed
@@ -705,9 +705,9 @@ cancer_split <- initial_split(cancer_train, prop = 0.75, strata = Class)
cancer_subtrain <- training(cancer_split)
cancer_validation <- testing(cancer_split)
-# recreate the standardization recipe from before
+# recreate the standardization recipe from before
# (since it must be based on the training data)
-cancer_recipe <- recipe(Class ~ Smoothness + Concavity,
+cancer_recipe <- recipe(Class ~ Smoothness + Concavity,
data = cancer_subtrain) |>
step_scale(all_predictors()) |>
step_center(all_predictors())
@@ -742,9 +742,9 @@ for (i in 1:5) {
cancer_subtrain <- training(cancer_split)
cancer_validation <- testing(cancer_split)
- # recreate the standardization recipe from before
+ # recreate the standardization recipe from before
# (since it must be based on the training data)
- cancer_recipe <- recipe(Class ~ Smoothness + Concavity,
+ cancer_recipe <- recipe(Class ~ Smoothness + Concavity,
data = cancer_subtrain) |>
step_scale(all_predictors()) |>
step_center(all_predictors())
@@ -778,19 +778,19 @@ just five estimates of the true, underlying accuracy of our classifier built
using our overall training data. We can combine the estimates by taking their
average (here `r round(100*mean(accuracies),0)`%) to try to get a single assessment of our
classifier's accuracy; this has the effect of reducing the influence of any one
-(un)lucky validation set on the estimate.
+(un)lucky validation set on the estimate.
In practice, we don't use random splits, but rather use a more structured
splitting procedure so that each observation in the data set is used in a
-validation set only a single time. The name for this strategy is
+validation set only a single time. The name for this strategy is
**cross-validation**. In **cross-validation**, \index{cross-validation} we split our **overall training
data** into $C$ evenly sized chunks. Then, we iteratively use $1$ chunk as the
-**validation set** and combine the remaining $C-1$ chunks
-as the **training set**.
+**validation set** and combine the remaining $C-1$ chunks
+as the **training set**.
This procedure is shown in Figure \@ref(fig:06-cv-image).
Here, $C=5$ different chunks of the data set are used,
resulting in 5 different choices for the **validation set**; we call this
-*5-fold* cross-validation.
+*5-fold* cross-validation.
```{r 06-cv-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "5-fold cross-validation.", fig.pos = "H", out.extra="", fig.retina = 2, out.width = "100%"}
knitr::include_graphics("img/classification2/cv.png")
@@ -804,7 +804,7 @@ right proportions of each category of observation.
```{r 06-vfold-seed, echo = FALSE, warning = FALSE, message = FALSE}
# hidden seed
-set.seed(14)
+set.seed(14)
```
```{r 06-vfold}
@@ -814,7 +814,7 @@ cancer_vfold
Then, when we create our data analysis workflow, we use the `fit_resamples` function\index{cross-validation!fit\_resamples}\index{tidymodels!fit\_resamples}
instead of the `fit` function for training. This runs cross-validation on each
-train/validation split.
+train/validation split.
```{r 06-vfold-workflow-seed, echo = FALSE, warning = FALSE, message = FALSE}
# hidden seed
@@ -822,9 +822,9 @@ set.seed(1)
```
```{r 06-vfold-workflow, echo=TRUE, results=FALSE, warning=FALSE, message=FALSE}
-# recreate the standardization recipe from before
+# recreate the standardization recipe from before
# (since it must be based on the training data)
-cancer_recipe <- recipe(Class ~ Smoothness + Concavity,
+cancer_recipe <- recipe(Class ~ Smoothness + Concavity,
data = cancer_train) |>
step_scale(all_predictors()) |>
step_center(all_predictors())
@@ -843,11 +843,11 @@ hidden_print(knn_fit)
The `collect_metrics`\index{tidymodels!collect\_metrics}\index{cross-validation!collect\_metrics} function is used to aggregate the *mean* and *standard error*
of the classifier's validation accuracy across the folds. You will find results
-related to the accuracy in the row with `accuracy` listed under the `.metric` column.
-You should consider the mean (`mean`) to be the estimated accuracy, while the standard
+related to the accuracy in the row with `accuracy` listed under the `.metric` column.
+You should consider the mean (`mean`) to be the estimated accuracy, while the standard
error (`std_err`) is a measure of how uncertain we are in the mean value. A detailed treatment of this
is beyond the scope of this chapter; but roughly, if your estimated mean is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2)` and standard
-error is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2)`, you can expect the *true* average accuracy of the
+error is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2)`, you can expect the *true* average accuracy of the
classifier to be somewhere roughly between `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) - round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% and `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) + round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% (although it may
fall outside this range). You may ignore the other columns in the metrics data frame,
as they do not provide any additional insight.
@@ -855,18 +855,18 @@ You can also ignore the entire second row with `roc_auc` in the `.metric` column
as it is beyond the scope of this book.
```{r 06-vfold-metrics}
-knn_fit |>
- collect_metrics()
+knn_fit |>
+ collect_metrics()
```
We can choose any number of folds, and typically the more we use the better our
-accuracy estimate will be (lower standard error). However, we are limited
+accuracy estimate will be (lower standard error). However, we are limited
by computational power: the
more folds we choose, the more computation it takes, and hence the more time
it takes to run the analysis. So when you do cross-validation, you need to
-consider the size of the data, the speed of the algorithm (e.g., $K$-nearest
-neighbors), and the speed of your computer. In practice, this is a
-trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here
+consider the size of the data, the speed of the algorithm (e.g., K-nearest
+neighbors), and the speed of your computer. In practice, this is a
+trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here
we will try 10-fold cross-validation to see if we get a lower standard error:
```r
@@ -934,12 +934,12 @@ vfold_metrics_50 |>
### Parameter value selection
Using 5- and 10-fold cross-validation, we have estimated that the prediction
-accuracy of our classifier is somewhere around `r round(100*(vfold_metrics |> filter(.metric == "accuracy"))$mean,0)`%.
+accuracy of our classifier is somewhere around `r round(100*(vfold_metrics |> filter(.metric == "accuracy"))$mean,0)`%.
Whether that is good or not
depends entirely on the downstream application of the data analysis. In the
present situation, we are trying to predict a tumor diagnosis, with expensive,
damaging chemo/radiation therapy or patient death as potential consequences of
-misprediction. Hence, we might like to
+misprediction. Hence, we might like to
do better than `r round(100*(vfold_metrics |> filter(.metric == "accuracy"))$mean,0)`% for this application.
In order to improve our classifier, we have one choice of parameter: the number of
@@ -951,17 +951,17 @@ syntax for tuning models: each parameter in the model to be tuned should be spec
as `tune()` in the model specification rather than given a particular value.
```{r 06-range-cross-val}
-knn_spec <- nearest_neighbor(weight_func = "rectangular",
+knn_spec <- nearest_neighbor(weight_func = "rectangular",
neighbors = tune()) |>
set_engine("kknn") |>
set_mode("classification")
```
Then instead of using `fit` or `fit_resamples`, we will use the `tune_grid` function \index{cross-validation!tune\_grid}\index{tidymodels!tune\_grid}
-to fit the model for each value in a range of parameter values.
+to fit the model for each value in a range of parameter values.
In particular, we first create a data frame with a `neighbors`
variable that contains the sequence of values of $K$ to try; below we create the `k_vals`
-data frame with the `neighbors` variable containing values from 1 to 100 (stepping by 5) using
+data frame with the `neighbors` variable containing values from 1 to 100 (stepping by 5) using
the `seq` function.
Then we pass that data frame to the `grid` argument of `tune_grid`.
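For example, a grid matching that description could be created like this (a sketch; the column must be named `neighbors` so that it matches the parameter being tuned):

```r
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))
```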
@@ -977,7 +977,7 @@ knn_results <- workflow() |>
add_recipe(cancer_recipe) |>
add_model(knn_spec) |>
tune_grid(resamples = cancer_vfold, grid = k_vals) |>
- collect_metrics()
+ collect_metrics()
accuracies <- knn_results |>
filter(.metric == "accuracy")
@@ -992,7 +992,7 @@ as shown in Figure \@ref(fig:06-find-k).
accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
geom_point() +
geom_line() +
- labs(x = "Neighbors", y = "Accuracy Estimate") +
+ labs(x = "Neighbors", y = "Accuracy Estimate") +
theme(text = element_text(size = 12))
accuracy_vs_k
@@ -1019,9 +1019,7 @@ provides the highest cross-validation accuracy estimate (`r (accuracies |> arran
any selection from $K = 30$ to $60$ would be reasonably justified, as all
of these differ in classifier accuracy by a small amount. Remember: the
values you see on this plot are *estimates* of the true accuracy of our
-classifier. Although the
-$K =$ `r best_k` value is
-higher than the others on this plot,
+classifier. Although the $K =$ `r best_k` value is higher than the others on this plot,
that doesn't mean the classifier is actually more accurate with this parameter
value! Generally, when selecting $K$ (and other parameters for other predictive
models), we are looking for a value where:
@@ -1031,7 +1029,7 @@ models), we are looking for a value where:
- the cost of training the model is not prohibitive (e.g., in our situation, if $K$ is too large, predicting becomes expensive!).
We know that $K =$ `r best_k`
-provides the highest estimated accuracy. Further, Figure \@ref(fig:06-find-k) shows that the estimated accuracy
+provides the highest estimated accuracy. Further, Figure \@ref(fig:06-find-k) shows that the estimated accuracy
changes by only a small amount if we increase or decrease $K$ near $K =$ `r best_k`.
And finally, $K =$ `r best_k` does not create a prohibitively expensive
computational cost of training. Considering these three points, we would indeed select
@@ -1040,9 +1038,9 @@ $K =$ `r best_k` for the classifier.
### Under/Overfitting
To build a bit more intuition, what happens if we keep increasing the number of
-neighbors $K$? In fact, the accuracy actually starts to decrease!
-Let's specify a much larger range of values of $K$ to try in the `grid`
-argument of `tune_grid`. Figure \@ref(fig:06-lots-of-ks) shows a plot of estimated accuracy as
+neighbors $K$? In fact, the accuracy actually starts to decrease!
+Let's specify a much larger range of values of $K$ to try in the `grid`
+argument of `tune_grid`. Figure \@ref(fig:06-lots-of-ks) shows a plot of estimated accuracy as
we vary $K$ from 1 to almost the number of observations in the training set.
```{r 06-lots-of-ks-seed, message = FALSE, echo = FALSE, warning = FALSE}
@@ -1065,7 +1063,7 @@ accuracies_lots <- knn_results |>
accuracy_vs_k_lots <- ggplot(accuracies_lots, aes(x = neighbors, y = mean)) +
geom_point() +
geom_line() +
- labs(x = "Neighbors", y = "Accuracy Estimate") +
+ labs(x = "Neighbors", y = "Accuracy Estimate") +
theme(text = element_text(size = 12))
accuracy_vs_k_lots
@@ -1099,7 +1097,7 @@ ks <- c(1, 7, 20, 300)
plots <- list()
for (i in 1:length(ks)) {
- knn_spec <- nearest_neighbor(weight_func = "rectangular",
+ knn_spec <- nearest_neighbor(weight_func = "rectangular",
neighbors = ks[[i]]) |>
set_engine("kknn") |>
set_mode("classification")
@@ -1110,36 +1108,36 @@ for (i in 1:length(ks)) {
fit(data = cancer_train)
# create a prediction pt grid
- smo_grid <- seq(min(cancer_train$Smoothness),
- max(cancer_train$Smoothness),
+ smo_grid <- seq(min(cancer_train$Smoothness),
+ max(cancer_train$Smoothness),
length.out = 100)
- con_grid <- seq(min(cancer_train$Concavity),
- max(cancer_train$Concavity),
+ con_grid <- seq(min(cancer_train$Concavity),
+ max(cancer_train$Concavity),
length.out = 100)
- scgrid <- as_tibble(expand.grid(Smoothness = smo_grid,
+ scgrid <- as_tibble(expand.grid(Smoothness = smo_grid,
Concavity = con_grid))
knnPredGrid <- predict(knn_fit, scgrid)
- prediction_table <- bind_cols(knnPredGrid, scgrid) |>
+ prediction_table <- bind_cols(knnPredGrid, scgrid) |>
rename(Class = .pred_class)
# plot
plots[[i]] <-
ggplot() +
- geom_point(data = cancer_train,
- mapping = aes(x = Smoothness,
- y = Concavity,
- color = Class),
+ geom_point(data = cancer_train,
+ mapping = aes(x = Smoothness,
+ y = Concavity,
+ color = Class),
alpha = 0.75) +
- geom_point(data = prediction_table,
- mapping = aes(x = Smoothness,
- y = Concavity,
- color = Class),
- alpha = 0.02,
+ geom_point(data = prediction_table,
+ mapping = aes(x = Smoothness,
+ y = Concavity,
+ color = Class),
+ alpha = 0.02,
size = 5.) +
labs(color = "Diagnosis") +
ggtitle(paste("K = ", ks[[i]])) +
scale_color_manual(values = c("orange2", "steelblue2")) +
- theme(text = element_text(size = 18), axis.title=element_text(size=18))
+ theme(text = element_text(size = 18), axis.title=element_text(size=18))
}
p_no_legend <- lapply(plots, function(x) x + theme(legend.position = "none"))
@@ -1148,18 +1146,18 @@ p_grid <- plot_grid(plotlist = p_no_legend, ncol = 2)
plot_grid(p_grid, legend, ncol = 1, rel_heights = c(1, 0.2))
```
-Both overfitting and underfitting are problematic and will lead to a model
+Both overfitting and underfitting are problematic and will lead to a model
that does not generalize well to new data. When fitting a model, we need to strike
-a balance between the two. You can see these two effects in Figure
-\@ref(fig:06-decision-grid-K), which shows how the classifier changes as
+a balance between the two. You can see these two effects in Figure
+\@ref(fig:06-decision-grid-K), which shows how the classifier changes as
we set the number of neighbors $K$ to 1, 7, 20, and 300.
### Evaluating on the test set
-Now that we have tuned the KNN classifier and set $K =$ `r best_k`,
+Now that we have tuned the K-NN classifier and set $K =$ `r best_k`,
we are done building the model and it is time to evaluate the quality of its predictions on the held out
test data, as we did earlier in Section \@ref(eval-performance-cls2).
-We first need to retrain the KNN classifier
+We first need to retrain the K-NN classifier
on the entire training data set using the selected number of neighbors.
```{r 06-eval-on-test-set-after-tuning, message = FALSE, warning = FALSE}
@@ -1246,21 +1244,21 @@ maximize accuracy are not necessarily better for a given application.
## Summary
Classification algorithms use one or more quantitative variables to predict the
-value of another categorical variable. In particular, the $K$-nearest neighbors algorithm
+value of another categorical variable. In particular, the K-nearest neighbors algorithm
does this by first finding the $K$ points in the training data nearest
to the new observation, and then returning the majority class vote from those
training observations. We can tune and evaluate a classifier by splitting the data randomly into a
training and test data set. The training set is used to build the classifier,
-and we can tune the classifier (e.g., select the number of neighbors in $K$-NN)
+and we can tune the classifier (e.g., select the number of neighbors in K-NN)
by maximizing estimated accuracy via cross-validation. After we have tuned the
model we can use the test set to estimate its accuracy.
The overall process is summarized in Figure \@ref(fig:06-overview).
-```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Overview of KNN classification.", fig.retina = 2, out.width = "100%"}
+```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Overview of K-NN classification.", fig.retina = 2, out.width = "100%"}
knitr::include_graphics("img/classification2/train-test-overview.jpeg")
```
-The overall workflow for performing $K$-nearest neighbors classification using `tidymodels` is as follows:
+The overall workflow for performing K-nearest neighbors classification using `tidymodels` is as follows:
\index{tidymodels}\index{recipe}\index{cross-validation}\index{K-nearest neighbors!classification}\index{classification}
1. Use the `initial_split` function to split the data into a training and test set. Set the `strata` argument to the class label variable. Put the test set aside for now.
@@ -1272,18 +1270,18 @@ The overall workflow for performing $K$-nearest neighbors classification using `
7. Make a new model specification for the best parameter value (i.e., $K$), and retrain the classifier using the `fit` function.
8. Evaluate the estimated accuracy of the classifier on the test set using the `predict` function.
-In these last two chapters, we focused on the $K$-nearest neighbor algorithm,
-but there are many other methods we could have used to predict a categorical label.
-All algorithms have their strengths and weaknesses, and we summarize these for
-the $K$-NN here.
+In these last two chapters, we focused on the K-nearest neighbors algorithm,
+but there are many other methods we could have used to predict a categorical label.
+All algorithms have their strengths and weaknesses, and we summarize these for
+the K-NN here.
-**Strengths:** $K$-nearest neighbors classification
+**Strengths:** K-nearest neighbors classification
1. is a simple, intuitive algorithm,
2. requires few assumptions about what the data must look like, and
3. works for binary (two-class) and multi-class (more than 2 classes) classification problems.
-**Weaknesses:** $K$-nearest neighbors classification
+**Weaknesses:** K-nearest neighbors classification
1. becomes very slow as the training data gets larger,
2. may not perform well with a large number of predictors, and
@@ -1291,27 +1289,27 @@ the $K$-NN here.
## Predictor variable selection
-> **Note:** This section is not required reading for the remainder of the textbook. It is included for those readers
+> **Note:** This section is not required reading for the remainder of the textbook. It is included for those readers
> interested in learning how irrelevant variables can influence the performance of a classifier, and how to
> pick a subset of useful variables to include as predictors.
Another potentially important part of tuning your classifier is to choose which
variables from your data will be treated as predictor variables. Technically, you can choose
anything from using a single predictor variable to using every variable in your
-data; the $K$-nearest neighbors algorithm accepts any number of
+data; the K-nearest neighbors algorithm accepts any number of
predictors. However, it is **not** the case that using more predictors always
yields better predictions! In fact, sometimes including irrelevant predictors \index{irrelevant predictors} can
actually negatively affect classifier performance.
### The effect of irrelevant predictors
-Let's take a look at an example where $K$-nearest neighbors performs
+Let's take a look at an example where K-nearest neighbors performs
worse when given more predictors to work with. In this example, we modified
the breast cancer data to have only the `Smoothness`, `Concavity`, and
`Perimeter` variables from the original data. Then, we added irrelevant
variables that we created ourselves using a random number generator.
The irrelevant variables each take a value of 0 or 1 with equal probability for each observation, regardless
-of what the value `Class` variable takes. In other words, the irrelevant variables have
+of what value the `Class` variable takes. In other words, the irrelevant variables have
no meaningful relationship with the `Class` variable.
```{r 06-irrelevant-gendata, echo = FALSE, warning = FALSE}
@@ -1321,17 +1319,17 @@ cancer_irrelevant <- cancer |> select(Class, Smoothness, Concavity, Perimeter)
for (i in 1:500) {
# create column
col = (sample(2, size=nrow(cancer_irrelevant), replace=TRUE)-1)
- cancer_irrelevant <- cancer_irrelevant |>
+ cancer_irrelevant <- cancer_irrelevant |>
add_column( !!paste("Irrelevant", i, sep="") := col)
}
```
```{r 06-irrelevant-printdata, warning = FALSE}
-cancer_irrelevant |>
+cancer_irrelevant |>
select(Class, Smoothness, Concavity, Perimeter, Irrelevant1, Irrelevant2)
```
-Next, we build a sequence of $K$-NN classifiers that include `Smoothness`,
+Next, we build a sequence of K-NN classifiers that include `Smoothness`,
`Concavity`, and `Perimeter` as predictor variables, but also increasingly many irrelevant
variables. In particular, we create 6 data sets with 0, 5, 10, 15, 20, and 40 irrelevant predictors.
Then we build a model, tuned via 5-fold cross-validation, for each data set.
@@ -1352,7 +1350,7 @@ accs <- list()
nghbrs <- list()
for (i in 1:length(ks)) {
- knn_spec <- nearest_neighbor(weight_func = "rectangular",
+ knn_spec <- nearest_neighbor(weight_func = "rectangular",
neighbors = tune()) |>
set_engine("kknn") |>
set_mode("classification")
@@ -1364,7 +1362,7 @@ for (i in 1:length(ks)) {
cancer_recipe <- recipe(Class ~ ., data = cancer_irrelevant_subset) |>
step_scale(all_predictors()) |>
step_center(all_predictors())
-
+
res <- workflow() |>
add_recipe(cancer_recipe) |>
add_model(knn_spec) |>
@@ -1376,7 +1374,7 @@ for (i in 1:length(ks)) {
accs[[i]] <- res$mean
nghbrs[[i]] <- res$neighbors
- knn_spec_fixed <- nearest_neighbor(weight_func = "rectangular",
+ knn_spec_fixed <- nearest_neighbor(weight_func = "rectangular",
neighbors = 3) |>
set_engine("kknn") |>
set_mode("classification")
@@ -1409,26 +1407,26 @@ res <- tibble(ks = ks, accs = accs, fixedaccs = fixedaccs, nghbrs = nghbrs)
#res <- res |> mutate(base_acc = base_acc)
#plt_irrelevant_accuracies <- res |>
# ggplot() +
-# geom_line(mapping = aes(x=ks, y=accs, linetype="Tuned KNN")) +
+# geom_line(mapping = aes(x=ks, y=accs, linetype="Tuned K-NN")) +
# geom_hline(data=res, mapping=aes(yintercept=base_acc, linetype="Always Predict Benign")) +
-# labs(x = "Number of Irrelevant Predictors", y = "Model Accuracy Estimate") +
+# labs(x = "Number of Irrelevant Predictors", y = "Model Accuracy Estimate") +
# scale_linetype_manual(name="Method", values = c("dashed", "solid"))
plt_irrelevant_accuracies <- ggplot(res) +
geom_line(mapping = aes(x=ks, y=accs)) +
- labs(x = "Number of Irrelevant Predictors",
- y = "Model Accuracy Estimate") +
- theme(text = element_text(size = 18), axis.title=element_text(size=18))
+ labs(x = "Number of Irrelevant Predictors",
+ y = "Model Accuracy Estimate") +
+ theme(text = element_text(size = 18), axis.title=element_text(size=18))
plt_irrelevant_accuracies
```
-Although the accuracy decreases as expected, one surprising thing about
+Although the accuracy decreases as expected, one surprising thing about
Figure \@ref(fig:06-performance-irrelevant-features) is that it shows that the method
-still outperforms the baseline majority classifier (with about `r round(cancer_propn_1[1,1], 0)`% accuracy)
+still outperforms the baseline majority classifier (with about `r round(cancer_propn_1[1,1], 0)`% accuracy)
even with 40 irrelevant variables.
How could that be? Figure \@ref(fig:06-neighbors-irrelevant-features) provides the answer:
-the tuning procedure for the $K$-nearest neighbors classifier combats the extra randomness from the irrelevant variables
+the tuning procedure for the K-nearest neighbors classifier combats the extra randomness from the irrelevant variables
by increasing the number of neighbors. Of course, because of all the extra noise in the data from the irrelevant
variables, the number of neighbors does not increase smoothly; but the general trend is increasing.
Figure \@ref(fig:06-fixed-irrelevant-features) corroborates
@@ -1437,40 +1435,40 @@ this evidence; if we fix the number of neighbors to $K=3$, the accuracy falls of
```{r 06-neighbors-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.cap = "Tuned number of neighbors for varying number of irrelevant predictors."}
plt_irrelevant_nghbrs <- ggplot(res) +
geom_line(mapping = aes(x=ks, y=nghbrs)) +
- labs(x = "Number of Irrelevant Predictors",
- y = "Number of neighbors") +
- theme(text = element_text(size = 18), axis.title=element_text(size=18))
+ labs(x = "Number of Irrelevant Predictors",
+ y = "Number of neighbors") +
+ theme(text = element_text(size = 18), axis.title=element_text(size=18))
plt_irrelevant_nghbrs
```
```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "75%", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors."}
-res_tmp <- res %>% pivot_longer(cols=c("accs", "fixedaccs"),
- names_to="Type",
+res_tmp <- res %>% pivot_longer(cols=c("accs", "fixedaccs"),
+ names_to="Type",
values_to="accuracy")
plt_irrelevant_nghbrs <- ggplot(res_tmp) +
geom_line(mapping = aes(x=ks, y=accuracy, color=Type)) +
- labs(x = "Number of Irrelevant Predictors", y = "Accuracy") +
- scale_color_discrete(labels= c("Tuned K", "K = 3")) +
- theme(text = element_text(size = 17), axis.title=element_text(size=17))
+ labs(x = "Number of Irrelevant Predictors", y = "Accuracy") +
+ scale_color_discrete(labels= c("Tuned K", "K = 3")) +
+ theme(text = element_text(size = 17), axis.title=element_text(size=17))
plt_irrelevant_nghbrs
```
### Finding a good subset of predictors
-So then, if it is not ideal to use all of our variables as predictors without consideration, how
+So then, if it is not ideal to use all of our variables as predictors without consideration, how
do we choose which variables we *should* use? A simple method is to rely on your scientific understanding
of the data to tell you which variables are not likely to be useful predictors. For example, in the cancer
data that we have been studying, the `ID` variable is just a unique identifier for the observation.
As it is not related to any measured property of the cells, the `ID` variable should therefore not be used
-as a predictor. That is, of course, a very clear-cut case. But the decision for the remaining variables
-is less obvious, as all seem like reasonable candidates. It
+as a predictor. That is, of course, a very clear-cut case. But the decision for the remaining variables
+is less obvious, as all seem like reasonable candidates. It
is not clear which subset of them will create the best classifier. One could use visualizations and
other exploratory analyses to try to help understand which variables are potentially relevant, but
this process is both time-consuming and error-prone when there are many variables to consider.
-Therefore we need a more systematic and programmatic way of choosing variables.
+Therefore we need a more systematic and programmatic way of choosing variables.
This is a very difficult problem to solve in
general, and there are a number of methods that have been developed that apply
in particular cases of interest. Here we will discuss two basic
@@ -1479,15 +1477,15 @@ this chapter to find out where you can learn more about variable selection, incl
The first idea you might think of for a systematic way to select predictors
is to try all possible subsets of predictors and then pick the set that results in the "best" classifier.
-This procedure is indeed a well-known variable selection method referred to
+This procedure is indeed a well-known variable selection method referred to
as *best subset selection* [@bealesubset; @hockingsubset]. \index{variable selection!best subset}\index{predictor selection|see{variable selection}}
In particular, you
1. create a separate model for every possible subset of predictors,
2. tune each one using cross-validation, and
-3. pick the subset of predictors that gives you the highest cross-validation accuracy.
+3. pick the subset of predictors that gives you the highest cross-validation accuracy.
-Best subset selection is applicable to any classification method ($K$-NN or otherwise).
+Best subset selection is applicable to any classification method (K-NN or otherwise).
However, it becomes very slow when you have even a moderate
number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets
grows very quickly with the number of predictors, and you have to train the model (itself
@@ -1495,13 +1493,13 @@ a slow process!) for each one. For example, if we have 2 predictors—let's
them A and B—then we have 3 variable sets to try: A alone, B alone, and finally A
and B together. If we have 3 predictors—A, B, and C—then we have 7
to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models
-we have to train for $m$ predictors is $2^m-1$; in other words, when we
-get to 10 predictors we have over *one thousand* models to train, and
-at 20 predictors we have over *one million* models to train!
-So although it is a simple method, best subset selection is usually too computationally
+we have to train for $m$ predictors is $2^m-1$; in other words, when we
+get to 10 predictors we have over *one thousand* models to train, and
+at 20 predictors we have over *one million* models to train!
+So although it is a simple method, best subset selection is usually too computationally
expensive to use in practice.
-Another idea is to iteratively build up a model by adding one predictor variable
+Another idea is to iteratively build up a model by adding one predictor variable
at a time. This method—known as *forward selection* [@forwardefroymson; @forwarddraper]—is also widely \index{variable selection!forward}
applicable and fairly straightforward. It involves the following steps:
@@ -1522,19 +1520,19 @@ models that best subset selection requires you to train! For example, while best
training over 1000 candidate models with 10 predictors, forward selection requires training only 55 candidate models.
Therefore we will continue the rest of this section using forward selection.
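
To make these counts concrete, here is a small sketch that tallies how many candidate models each procedure trains for $m$ predictors. The helper functions below are purely illustrative: best subset selection tries every non-empty subset of predictors, while forward selection tries $m$ candidates in the first round, $m - 1$ in the second, and so on.

```{r 06-candidate-model-counts}
# illustrative helpers: number of candidate models trained for m predictors
n_best_subset <- function(m) 2^m - 1          # every non-empty subset
n_forward_sel <- function(m) m * (m + 1) / 2  # m + (m - 1) + ... + 1

n_best_subset(10)  # 1023
n_forward_sel(10)  # 55
n_best_subset(20)  # 1048575
n_forward_sel(20)  # 210
```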
-> **Note:** One word of caution before we move on. Every additional model that you train
-> increases the likelihood that you will get unlucky and stumble
+> **Note:** One word of caution before we move on. Every additional model that you train
+> increases the likelihood that you will get unlucky and stumble
> on a model that has a high cross-validation accuracy estimate, but a low true
> accuracy on the test data and other future observations.
> Since forward selection involves training a lot of models, you run a fairly
> high risk of this happening. To keep this risk low, only use forward selection
-> when you have a large amount of data and a relatively small total number of
+> when you have a large amount of data and a relatively small total number of
> predictors. More advanced methods do not suffer from this
> problem as much; see the additional resources at the end of this chapter for
-> where to learn more about advanced predictor selection methods.
+> where to learn more about advanced predictor selection methods.
### Forward selection in R
-
+
We now turn to implementing forward selection in R.
Unfortunately there is no built-in way to do this using the `tidymodels` framework,
so we will have to code it ourselves. First we will use the `select` function to extract a smaller set of predictors
@@ -1547,13 +1545,13 @@ set.seed(1)
```
```{r 06-fwdsel, warning = FALSE}
-cancer_subset <- cancer_irrelevant |>
- select(Class,
- Smoothness,
- Concavity,
- Perimeter,
- Irrelevant1,
- Irrelevant2,
+cancer_subset <- cancer_irrelevant |>
+ select(Class,
+ Smoothness,
+ Concavity,
+ Perimeter,
+ Irrelevant1,
+ Irrelevant2,
Irrelevant3)
names <- colnames(cancer_subset |> select(-Class))
@@ -1577,16 +1575,16 @@ example_formula
Finally, we need to write some code that performs the task of sequentially
finding the best predictor to add to the model.
If you recall the end of the wrangling chapter, we mentioned
-that sometimes one needs more flexible forms of iteration than what
+that sometimes one needs more flexible forms of iteration than what
we have used earlier, and in these cases one typically resorts to
a *for loop*; see [the chapter on iteration](https://r4ds.had.co.nz/iteration.html) in *R for Data Science* [@wickham2016r].
Here we will use two for loops:
-one over increasing predictor set sizes
+one over increasing predictor set sizes
(where you see `for (i in 1:length(names))` below),
and another to check which predictor to add in each round (where you see `for (j in 1:length(names))` below).
For each set of predictors to try, we construct a model formula,
pass it into a `recipe`, build a `workflow` that tunes
-a $K$-NN classifier using 5-fold cross-validation,
+a K-NN classifier using 5-fold cross-validation,
and finally record the estimated accuracy.
```{r 06-fwdsel-2-seed, warning = FALSE, echo = FALSE, message = FALSE}
@@ -1596,12 +1594,12 @@ set.seed(1)
```{r 06-fwdsel-2, warning = FALSE}
# create an empty tibble to store the results
-accuracies <- tibble(size = integer(),
- model_string = character(),
+accuracies <- tibble(size = integer(),
+ model_string = character(),
accuracy = numeric())
# create a model specification
-knn_spec <- nearest_neighbor(weight_func = "rectangular",
+knn_spec <- nearest_neighbor(weight_func = "rectangular",
neighbors = tune()) |>
set_engine("kknn") |>
set_mode("classification")
@@ -1626,12 +1624,12 @@ for (i in 1:n_total) {
model_string <- paste("Class", "~", paste(preds_new, collapse="+"))
# create a recipe from the model string
- cancer_recipe <- recipe(as.formula(model_string),
+ cancer_recipe <- recipe(as.formula(model_string),
data = cancer_subset) |>
step_scale(all_predictors()) |>
step_center(all_predictors())
- # tune the KNN classifier with these predictors,
+ # tune the K-NN classifier with these predictors,
# and collect the accuracy for the best K
acc <- workflow() |>
add_recipe(cancer_recipe) |>
@@ -1647,9 +1645,9 @@ for (i in 1:n_total) {
models[[j]] <- model_string
}
jstar <- which.max(unlist(accs))
- accuracies <- accuracies |>
- add_row(size = i,
- model_string = models[[jstar]],
+ accuracies <- accuracies |>
+ add_row(size = i,
+ model_string = models[[jstar]],
accuracy = accs[[jstar]])
selected <- c(selected, names[[jstar]])
names <- names[-jstar]
@@ -1662,14 +1660,14 @@ Interesting! The forward selection procedure first added the three meaningful va
visualizes the accuracy versus the number of predictors in the model. You can see that
as meaningful predictors are added, the estimated accuracy increases substantially; and as you add irrelevant
variables, the accuracy either exhibits small fluctuations or decreases as the model attempts to tune the number
-of neighbors to account for the extra noise. In order to pick the right model from the sequence, you have
-to balance high accuracy and model simplicity (i.e., having fewer predictors and a lower chance of overfitting). The
+of neighbors to account for the extra noise. In order to pick the right model from the sequence, you have
+to balance high accuracy and model simplicity (i.e., having fewer predictors and a lower chance of overfitting). The
way to find that balance is to look for the *elbow* \index{variable selection!elbow method}
in Figure \@ref(fig:06-fwdsel-3), i.e., the place on the plot where the accuracy stops increasing dramatically and
-levels off or begins to decrease. The elbow in Figure \@ref(fig:06-fwdsel-3) appears to occur at the model with
+levels off or begins to decrease. The elbow in Figure \@ref(fig:06-fwdsel-3) appears to occur at the model with
3 predictors; after that point the accuracy levels off. So here the right trade-off of accuracy and number of predictors
occurs with 3 variables: `Class ~ Perimeter + Concavity + Smoothness`. In other words, we have successfully removed irrelevant
-predictors from the model! It is always worth remembering, however, that what cross-validation gives you
+predictors from the model! It is always worth remembering, however, that what cross-validation gives you
is an *estimate* of the true accuracy; you have to use your judgement when looking at this plot to decide
where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy.
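
For example, once you have decided that the elbow occurs at three predictors, you can filter the `accuracies` data frame built above to confirm which model formula was selected and what its estimated accuracy was.

```{r 06-fwdsel-chosen-model}
accuracies |>
  filter(size == 3)
```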
@@ -1679,19 +1677,19 @@ fwd_sel_accuracies_plot <- accuracies |>
ggplot(aes(x = size, y = accuracy)) +
geom_line() +
labs(x = "Number of Predictors", y = "Estimated Accuracy") +
- theme(text = element_text(size = 20), axis.title=element_text(size=20))
+ theme(text = element_text(size = 20), axis.title=element_text(size=20))
fwd_sel_accuracies_plot
```
> **Note:** Since the choice of which variables to include as predictors is
> part of tuning your classifier, you *cannot use your test data* for this
-> process!
+> process!
## Exercises
-Practice exercises for the material covered in this chapter
-can be found in the accompanying
+Practice exercises for the material covered in this chapter
+can be found in the accompanying
[worksheets repository](https://worksheets.datasciencebook.ca)
in the "Classification II: evaluation and tuning" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
@@ -1715,7 +1713,7 @@ and guidance that the worksheets provide will function as intended.
two chapters, you'll learn about another kind of predictive modeling setting,
so it might be worth visiting the website only after reading through those
chapters.
-- *An Introduction to Statistical Learning* [@james2013introduction] provides
+- *An Introduction to Statistical Learning* [@james2013introduction] provides
a great next stop in the process of
learning about classification. Chapter 4 discusses additional basic techniques
for classification that we do not cover, such as logistic regression, linear
diff --git a/source/clustering.Rmd b/source/clustering.Rmd
index 3e63d0580..77a04341d 100644
--- a/source/clustering.Rmd
+++ b/source/clustering.Rmd
@@ -10,7 +10,7 @@ library(egg)
#center breaks latex here
-knitr::opts_chunk$set(warning = FALSE, fig.align = "default")
+knitr::opts_chunk$set(warning = FALSE, fig.align = "default")
cleanup_and_print <- function(output){
for (i in seq_along(output)) {
@@ -41,42 +41,40 @@ hidden_print_cli <- function(x){
cleanup_and_print(cli::cli_fmt(capture.output(x)))
}
-# set the colors in the graphs,
-# some graphs with the code shown to students are hard coded
+# set the colors in the graphs,
+# some graphs with the code shown to students are hard coded
cbbPalette <- c(brewer.pal(9, "Paired"))
cbpalette <- c("darkorange3", "dodgerblue3", "goldenrod1")
-theme_update(axis.title = element_text(size = 12)) # modify axis label size in plots
+theme_update(axis.title = element_text(size = 12)) # modify axis label size in plots
```
-## Overview
+## Overview
As part of exploratory data analysis, it is often helpful to see if there are
-meaningful subgroups (or *clusters*) in the data.
-This grouping can be used for many purposes,
-such as generating new questions or improving predictive analyses.
-This chapter provides an introduction to clustering
+meaningful subgroups (or *clusters*) in the data.
+This grouping can be used for many purposes,
+such as generating new questions or improving predictive analyses.
+This chapter provides an introduction to clustering
using the K-means algorithm,
including techniques to choose the number of clusters.
-## Chapter learning objectives
+## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:
-* Describe a situation in which clustering is an appropriate technique to use,
+- Describe a situation in which clustering is an appropriate technique to use,
and what insight it might extract from the data.
-* Explain the K-means clustering algorithm.
-* Interpret the output of a K-means analysis.
-* Differentiate between clustering and classification.
-* Identify when it is necessary to scale variables before clustering,
-and do this using R.
-* Perform K-means clustering in R using `tidymodels` workflows.
-* Use the elbow method to choose the number of clusters for K-means.
-* Visualize the output of K-means clustering in R using colored scatter plots.
-* Describe the advantages,
-limitations and assumptions of the K-means clustering algorithm.
+- Explain the K-means clustering algorithm.
+- Interpret the output of a K-means analysis.
+- Differentiate between clustering, classification, and regression.
+- Identify when it is necessary to scale variables before clustering, and do this using R.
+- Perform K-means clustering in R using `tidymodels` workflows.
+- Use the elbow method to choose the number of clusters for K-means.
+- Visualize the output of K-means clustering in R using colored scatter plots.
+- Describe the advantages, limitations and assumptions of the K-means clustering algorithm.
## Clustering
-Clustering \index{clustering} is a data analysis technique
-involving separating a data set into subgroups of related data.
+Clustering \index{clustering} is a data analysis technique
+involving separating a data set into subgroups of related data.
For example, we might use clustering to separate a
data set of documents into groups that correspond to topics, a data set of
human genetic information into groups that correspond to ancestral
@@ -86,23 +84,23 @@ use the subgroups to generate new questions about the data and follow up with a
predictive modeling exercise. In this course, clustering will be used only for
exploratory analysis, i.e., uncovering patterns in the data.
-Note that clustering is a fundamentally different kind of task
-than classification or regression.
-In particular, both classification and regression are *supervised tasks*
-\index{classification}\index{regression}\index{supervised}
-where there is a *response variable* (a category label or value),
-and we have examples of past data with labels/values
-that help us predict those of future data.
-By contrast, clustering is an *unsupervised task*,
-\index{unsupervised} as we are trying to understand
-and examine the structure of data without any response variable labels
-or values to help us.
-This approach has both advantages and disadvantages.
-Clustering requires no additional annotation or input on the data.
-For example, while it would be nearly impossible to annotate
-all the articles on Wikipedia with human-made topic labels,
-we can cluster the articles without this information
-to find groupings corresponding to topics automatically.
+Note that clustering is a fundamentally different kind of task
+than classification or regression.
+In particular, both classification and regression are *supervised tasks*
+\index{classification}\index{regression}\index{supervised}
+where there is a *response variable* (a category label or value),
+and we have examples of past data with labels/values
+that help us predict those of future data.
+By contrast, clustering is an *unsupervised task*,
+\index{unsupervised} as we are trying to understand
+and examine the structure of data without any response variable labels
+or values to help us.
+This approach has both advantages and disadvantages.
+Clustering requires no additional annotation or input on the data.
+For example, while it would be nearly impossible to annotate
+all the articles on Wikipedia with human-made topic labels,
+we can cluster the articles without this information
+to find groupings corresponding to topics automatically.
However, given that there is no response variable, it is not as easy to evaluate
the "quality" of a clustering. With classification, we can use a test data set
to assess prediction performance. In clustering, there is not a single good
@@ -110,31 +108,31 @@ choice for evaluation. In this book, we will use visualization to ascertain the
quality of a clustering, and leave rigorous evaluation for more advanced
courses.
-As in the case of classification,
-there are many possible methods that we could use to cluster our observations
-to look for subgroups.
-In this book, we will focus on the widely used K-means \index{K-means} algorithm [@kmeans].
+As in the case of classification,
+there are many possible methods that we could use to cluster our observations
+to look for subgroups.
+In this book, we will focus on the widely used K-means \index{K-means} algorithm [@kmeans].
In your future studies, you might encounter hierarchical clustering,
-principal component analysis, multidimensional scaling, and more;
-see the additional resources section at the end of this chapter
+principal component analysis, multidimensional scaling, and more;
+see the additional resources section at the end of this chapter
for where to begin learning more about these other methods.
\newpage
-> **Note:** There are also so-called *semisupervised* tasks, \index{semisupervised}
-> where only some of the data come with response variable labels/values,
-> but the vast majority don't.
-> The goal is to try to uncover underlying structure in the data
-> that allows one to guess the missing labels.
-> This sort of task is beneficial, for example,
-> when one has an unlabeled data set that is too large to manually label,
-> but one is willing to provide a few informative example labels as a "seed"
+> **Note:** There are also so-called *semisupervised* tasks, \index{semisupervised}
+> where only some of the data come with response variable labels/values,
+> but the vast majority don't.
+> The goal is to try to uncover underlying structure in the data
+> that allows one to guess the missing labels.
+> This sort of task is beneficial, for example,
+> when one has an unlabeled data set that is too large to manually label,
+> but one is willing to provide a few informative example labels as a "seed"
> to guess the labels for all the data.
## An illustrative example
In this chapter we will focus on a data set \index{Palmer penguins} from
-[the `palmerpenguins` R package](https://allisonhorst.github.io/palmerpenguins/) [@palmerpenguins]. This
+[the `palmerpenguins` R package](https://allisonhorst.github.io/palmerpenguins/) [@palmerpenguins]. This
data set was collected by Dr. Kristen Gorman and
the Palmer Station, Antarctica Long Term Ecological Research Site, and includes
measurements for adult penguins (Figure \@ref(fig:09-penguins)) found near there [@penguinpaper].
@@ -150,12 +148,12 @@ this will help us make clear visualizations that illustrate how clustering works
knitr::include_graphics("img/clustering/gentoo.jpg")
```
-Before we get started, we will load the `tidyverse` metapackage
+Before we get started, we will load the `tidyverse` metapackage
as well as set a random seed.
-This will ensure we have access to the functions we need
+This will ensure we have access to the functions we need
and that our analysis will be reproducible.
-As we will learn in more detail later in the chapter,
-setting the seed here is important
+As we will learn in more detail later in the chapter,
+setting the seed here is important
because the K-means clustering algorithm uses randomness
when choosing a starting position for each cluster.
@@ -188,23 +186,23 @@ penguins_standardized <- penguins |>
penguins_standardized
```
-Next, we can create a scatter plot using this data set
+Next, we can create a scatter plot using this data set
to see if we can detect subtypes or groups in our data set.
\newpage
```{r 10-toy-example-plot, warning = FALSE, fig.height = 3.25, fig.width = 3.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
-ggplot(penguins_standardized,
- aes(x = flipper_length_standardized,
+ggplot(penguins_standardized,
+ aes(x = flipper_length_standardized,
y = bill_length_standardized)) +
geom_point() +
xlab("Flipper Length (standardized)") +
- ylab("Bill Length (standardized)") +
+ ylab("Bill Length (standardized)") +
theme(text = element_text(size = 12))
```
-Based \index{ggplot}\index{ggplot!geom\_point} on the visualization
-in Figure \@ref(fig:10-toy-example-plot),
+Based \index{ggplot}\index{ggplot!geom\_point} on the visualization
+in Figure \@ref(fig:10-toy-example-plot),
we might suspect there are a few subtypes of penguins within our data set.
We can see roughly 3 groups of observations in Figure \@ref(fig:10-toy-example-plot),
including:
@@ -214,17 +212,17 @@ including:
3. a large flipper and bill length group.
Data visualization is a great tool to give us a rough sense of such patterns
-when we have a small number of variables.
-But if we are to group data—and select the number of groups—as part of
+when we have a small number of variables.
+But if we are to group data—and select the number of groups—as part of
a reproducible analysis, we need something a bit more automated.
-Additionally, finding groups via visualization becomes more difficult
+Additionally, finding groups via visualization becomes more difficult
as we increase the number of variables we consider when clustering.
-The way to rigorously separate the data into groups
+The way to rigorously separate the data into groups
is to use a clustering algorithm.
-In this chapter, we will focus on the *K-means* algorithm,
-\index{K-means} a widely used and often very effective clustering method,
-combined with the *elbow method* \index{elbow method}
-for selecting the number of clusters.
+In this chapter, we will focus on the *K-means* algorithm,
+\index{K-means} a widely used and often very effective clustering method,
+combined with the *elbow method* \index{elbow method}
+for selecting the number of clusters.
This procedure will separate the data into groups;
Figure \@ref(fig:10-toy-example-clustering) shows these groups
denoted by colored scatter points.
@@ -255,11 +253,11 @@ penguins_clustered <- kmeans_fit |>
mutate(cluster = replace(cluster, cluster == 4, 2)) |>
mutate(cluster = as_factor(cluster))
-ggplot(penguins_clustered, aes(y = bill_length_standardized,
+ggplot(penguins_clustered, aes(y = bill_length_standardized,
x = flipper_length_standardized, color = cluster)) +
geom_point() +
xlab("Flipper Length (standardized)") +
- ylab("Bill Length (standardized)") +
+ ylab("Bill Length (standardized)") +
scale_color_manual(values= c("darkorange3", "dodgerblue3", "goldenrod1"))
```
@@ -270,7 +268,7 @@ where we can easily visualize the clusters on a scatter plot, we can give
human-made labels to the groups using their positions on
the plot:
-- small flipper length and small bill length (orange cluster),
+- small flipper length and small bill length (orange cluster),
- small flipper length and large bill length (blue cluster), and
- large flipper length and large bill length (yellow cluster).
@@ -278,9 +276,9 @@ Once we have made these determinations, we can use them to inform our species
classifications or ask further questions about our data. For example, we might
be interested in understanding the relationship between flipper length and bill
length, and that relationship may differ depending on the type of penguin we
-have.
+have.
-## K-means
+## K-means
### Measuring cluster quality
@@ -295,19 +293,19 @@ The K-means algorithm is a procedure that groups data into K clusters.
It starts with an initial clustering of the data, and then iteratively
improves it by making adjustments to the assignment of data
to clusters until it cannot improve any further. But how do we measure
-the "quality" of a clustering, and what does it mean to improve it?
-In K-means clustering, we measure the quality of a cluster
-by its\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
+the "quality" of a clustering, and what does it mean to improve it?
+In K-means clustering, we measure the quality of a cluster
+by its\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
Computing this involves two steps.
-First, we find the cluster centers by computing the mean of each variable
-over data points in the cluster. For example, suppose we have a
+First, we find the cluster centers by computing the mean of each variable
+over data points in the cluster. For example, suppose we have a
cluster containing four observations, and we are using two variables, $x$ and $y$, to cluster the data.
Then we would compute the coordinates, $\mu_x$ and $\mu_y$, of the cluster center via
$$\mu_x = \frac{1}{4}(x_1+x_2+x_3+x_4) \quad \mu_y = \frac{1}{4}(y_1+y_2+y_3+y_4).$$
-In the first cluster from the example, there are `r nrow(clus1)` data points. These are shown with their cluster center
-(standardized flipper length `r round(mean(clus1$flipper_length_standardized),2)`, standardized bill length `r round(mean(clus1$bill_length_standardized),2)`) highlighted
+In the first cluster from the example, there are `r nrow(clus1)` data points. These are shown with their cluster center
+(standardized flipper length `r round(mean(clus1$flipper_length_standardized),2)`, standardized bill length `r round(mean(clus1$bill_length_standardized),2)`) highlighted
in Figure \@ref(fig:10-toy-example-clus1-center).
(ref:10-toy-example-clus1-center) Cluster 1 from the `penguins_standardized` data set example. Observations are in blue, with the cluster center highlighted in red.
@@ -319,31 +317,31 @@ base <- ggplot(penguins_clustered, aes(x = flipper_length_standardized, y = bill
ylab("Bill Length (standardized)")
base <- ggplot(clus1) +
- geom_point(aes(y = bill_length_standardized, x = flipper_length_standardized),
+ geom_point(aes(y = bill_length_standardized, x = flipper_length_standardized),
col = "dodgerblue3") +
labs(x = "Flipper Length (standardized)", y = "Bill Length (standardized)") +
xlim(c(
- min(clus1$flipper_length_standardized) - 0.25 *
+ min(clus1$flipper_length_standardized) - 0.25 *
sd(clus1$flipper_length_standardized),
- max(clus1$flipper_length_standardized) + 0.25 *
+ max(clus1$flipper_length_standardized) + 0.25 *
sd(clus1$flipper_length_standardized)
)) +
ylim(c(
- min(clus1$bill_length_standardized) - 0.25 *
+ min(clus1$bill_length_standardized) - 0.25 *
sd(clus1$bill_length_standardized),
- max(clus1$bill_length_standardized) + 0.25 *
+ max(clus1$bill_length_standardized) + 0.25 *
sd(clus1$bill_length_standardized)
)) +
- geom_point(aes(y = mean(bill_length_standardized),
- x = mean(flipper_length_standardized)),
- color = "#F8766D",
+ geom_point(aes(y = mean(bill_length_standardized),
+ x = mean(flipper_length_standardized)),
+ color = "#F8766D",
size = 5) +
theme(legend.position = "none")
base
```
-The second step in computing the WSSD is to add up the squared distance
+The second step in computing the WSSD is to add up the squared distance
\index{distance!K-means} between each point in the cluster and the cluster center.
We use the straight-line / Euclidean distance formula
that we learned about in Chapter \@ref(classification1).
@@ -354,36 +352,36 @@ we would compute the WSSD $S^2$ via
S^2 = \left((x_1 - \mu_x)^2 + (y_1 - \mu_y)^2\right) + \left((x_2 - \mu_x)^2 + (y_2 - \mu_y)^2\right) + \\ \left((x_3 - \mu_x)^2 + (y_3 - \mu_y)^2\right) + \left((x_4 - \mu_x)^2 + (y_4 - \mu_y)^2\right).
\end{align*}
-These distances are denoted by lines in Figure \@ref(fig:10-toy-example-clus1-dists) for the first cluster of the penguin data example.
+These distances are denoted by lines in Figure \@ref(fig:10-toy-example-clus1-dists) for the first cluster of the penguin data example.
(ref:10-toy-example-clus1-dists) Cluster 1 from the `penguins_standardized` data set example. Observations are in blue, with the cluster center highlighted in red. The distances from the observations to the cluster center are represented as black lines.
```{r 10-toy-example-clus1-dists, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 3.5, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-dists)"}
-base <- ggplot(clus1)
+base <- ggplot(clus1)
-mn <- clus1 |>
- summarize(flipper_length_standardized = mean(flipper_length_standardized),
+mn <- clus1 |>
+ summarize(flipper_length_standardized = mean(flipper_length_standardized),
bill_length_standardized = mean(bill_length_standardized))
for (i in 1:nrow(clus1)) {
base <- base + geom_segment(
- x = unlist(mn[1, "flipper_length_standardized"]),
+ x = unlist(mn[1, "flipper_length_standardized"]),
y = unlist(mn[1, "bill_length_standardized"]),
- xend = unlist(clus1[i, "flipper_length_standardized"]),
+ xend = unlist(clus1[i, "flipper_length_standardized"]),
yend = unlist(clus1[i, "bill_length_standardized"])
)
}
-base <- base +
- geom_point(aes(y = mean(bill_length_standardized),
- x = mean(flipper_length_standardized)),
- color = "#F8766D",
+base <- base +
+ geom_point(aes(y = mean(bill_length_standardized),
+ x = mean(flipper_length_standardized)),
+ color = "#F8766D",
size = 5)
base <- base +
- geom_point(aes(y = bill_length_standardized,
+ geom_point(aes(y = bill_length_standardized,
x = flipper_length_standardized),
col = "dodgerblue3") +
labs(x = "Flipper Length (standardized)", y = "Bill Length (standardized)") +
- theme(legend.position = "none")
+ theme(legend.position = "none")
base
```
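
To make this two-step computation concrete, the short sketch below computes the center and the WSSD of this cluster directly, assuming the `clus1` data frame of cluster 1 observations that was used to draw the figure above.

```{r 10-wssd-by-hand-sketch}
# step 1: the cluster center is the mean of each variable
center <- clus1 |>
  summarize(flipper = mean(flipper_length_standardized),
            bill = mean(bill_length_standardized))

# step 2: sum the squared straight-line distances to the center
clus1 |>
  mutate(sq_dist = (flipper_length_standardized - center$flipper)^2 +
                   (bill_length_standardized - center$bill)^2) |>
  summarize(wssd = sum(sq_dist))
```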
@@ -408,24 +406,24 @@ cluster_centers <- tibble(x = c(0, 0, 0),
y = c(0, 0, 0))
for (cluster_number in seq_along(1:3)) {
-
+
clus <- filter(penguins_clustered, cluster == cluster_number) |>
select(bill_length_standardized, flipper_length_standardized)
-
- mn <- clus |>
+
+ mn <- clus |>
summarize(flipper_length_standardized = mean(flipper_length_standardized),
bill_length_standardized = mean(bill_length_standardized))
-
+
for (i in 1:nrow(clus)) {
- all_clusters_base <- all_clusters_base +
- geom_segment(x = unlist(mn[1, "flipper_length_standardized"]),
+ all_clusters_base <- all_clusters_base +
+ geom_segment(x = unlist(mn[1, "flipper_length_standardized"]),
y = unlist(mn[1, "bill_length_standardized"]),
- xend = unlist(clus[i, "flipper_length_standardized"]),
+ xend = unlist(clus[i, "flipper_length_standardized"]),
yend = unlist(clus[i, "bill_length_standardized"]),
color = "black"
)
}
-
+
cluster_centers[cluster_number, 1] <- mean(clus$flipper_length_standardized)
cluster_centers[cluster_number, 2] <- mean(clus$bill_length_standardized)
}
@@ -435,20 +433,20 @@ all_clusters_base <- all_clusters_base +
x = flipper_length_standardized,
color = cluster)) +
xlab("Flipper Length (standardized)") +
- ylab("Bill Length (standardized)") +
- scale_color_manual(values= c("darkorange3",
- "dodgerblue3",
+ ylab("Bill Length (standardized)") +
+ scale_color_manual(values= c("darkorange3",
+ "dodgerblue3",
"goldenrod1"))
-all_clusters_base <- all_clusters_base +
- geom_point(aes(y = cluster_centers$y[1],
- x = cluster_centers$x[1]),
+all_clusters_base <- all_clusters_base +
+ geom_point(aes(y = cluster_centers$y[1],
+ x = cluster_centers$x[1]),
color = "#F8766D", size = 3) +
- geom_point(aes(y = cluster_centers$y[2],
- x = cluster_centers$x[2]),
+ geom_point(aes(y = cluster_centers$y[2],
+ x = cluster_centers$x[2]),
color = "#F8766D", size = 3) +
- geom_point(aes(y = cluster_centers$y[3],
- x = cluster_centers$x[3]),
+ geom_point(aes(y = cluster_centers$y[3],
+ x = cluster_centers$x[3]),
color = "#F8766D", size = 3)
all_clusters_base
@@ -466,8 +464,8 @@ These are beyond the scope of this book.
### The clustering algorithm
-We begin the K-means \index{K-means!algorithm} algorithm by picking K,
-and randomly assigning a roughly equal number of observations
+We begin the K-means \index{K-means!algorithm} algorithm by picking K,
+and randomly assigning a roughly equal number of observations
to each of the K clusters.
An example random initialization is shown in Figure \@ref(fig:10-toy-kmeans-init).
@@ -475,13 +473,13 @@ An example random initialization is shown in Figure \@ref(fig:10-toy-kmeans-init
set.seed(14)
penguins_standardized["label"] <- factor(sample(1:3, nrow(penguins_standardized), replace = TRUE))
-plt_lbl <- ggplot(penguins_standardized, aes(y = bill_length_standardized,
- x = flipper_length_standardized,
+plt_lbl <- ggplot(penguins_standardized, aes(y = bill_length_standardized,
+ x = flipper_length_standardized,
color = label)) +
geom_point(size = 2) +
xlab("Flipper Length (standardized)") +
ylab("Bill Length (standardized)") +
- theme(legend.position = "none") +
+ theme(legend.position = "none") +
scale_color_manual(values= cbpalette)
plt_lbl
@@ -513,65 +511,65 @@ for (i in 1:4) {
summarize_all(funs(mean))
nclus <- nrow(centers)
# replot with centers
- plt_ctr <- ggplot(penguins_standardized, aes(y = bill_length_standardized,
- x = flipper_length_standardized,
+ plt_ctr <- ggplot(penguins_standardized, aes(y = bill_length_standardized,
+ x = flipper_length_standardized,
color = label)) +
geom_point(size = 2) +
xlab("Flipper Length\n(standardized)") +
ylab("Bill Length\n(standardized)") +
theme(legend.position = "none") +
- scale_color_manual(values= cbpalette) +
- geom_point(data = centers,
- aes(y = bill_length_standardized,
- x = flipper_length_standardized,
- fill = label),
- size = 4,
- shape = 21,
- stroke = 1,
- color = "black",
+ scale_color_manual(values= cbpalette) +
+ geom_point(data = centers,
+ aes(y = bill_length_standardized,
+ x = flipper_length_standardized,
+ fill = label),
+ size = 4,
+ shape = 21,
+ stroke = 1,
+ color = "black",
fill = cbpalette) +
- annotate("text", x = -0.5, y = 1.5, label = paste0("Iteration ", i), size = 5)+
- theme(text = element_text(size = 14), axis.title=element_text(size=14))
-
+ annotate("text", x = -0.5, y = 1.5, label = paste0("Iteration ", i), size = 5)+
+ theme(text = element_text(size = 14), axis.title=element_text(size=14))
+
if (i == 1 | i == 2) {
plt_ctr <- plt_ctr +
ggtitle("Center Update")
}
-
+
# reassign labels
dists <- rbind(centers, penguins_standardized) |>
select("flipper_length_standardized", "bill_length_standardized") |>
dist() |>
as.matrix()
dists <- as_tibble(dists[-(1:nclus), 1:nclus])
- penguins_standardized <- penguins_standardized |>
+ penguins_standardized <- penguins_standardized |>
mutate(label = apply(dists, 1, function(x) names(x)[which.min(x)]))
- plt_lbl <- ggplot(penguins_standardized,
- aes(y = bill_length_standardized,
- x = flipper_length_standardized,
+ plt_lbl <- ggplot(penguins_standardized,
+ aes(y = bill_length_standardized,
+ x = flipper_length_standardized,
color = label)) +
geom_point(size = 2) +
xlab("Flipper Length\n(standardized)") +
ylab("Bill Length\n(standardized)") +
theme(legend.position = "none") +
scale_color_manual(values= cbpalette) +
- geom_point(data = centers,
- aes(y = bill_length_standardized,
- x = flipper_length_standardized, fill = label),
- size = 4,
- shape = 21,
- stroke = 1,
- color = "black",
+ geom_point(data = centers,
+ aes(y = bill_length_standardized,
+ x = flipper_length_standardized, fill = label),
+ size = 4,
+ shape = 21,
+ stroke = 1,
+ color = "black",
fill = cbpalette) +
- annotate("text", x = -0.5, y = 1.5, label = paste0("Iteration ", i), size = 5) +
- theme(text = element_text(size = 14), axis.title=element_text(size=14))
+ annotate("text", x = -0.5, y = 1.5, label = paste0("Iteration ", i), size = 5) +
+ theme(text = element_text(size = 14), axis.title=element_text(size=14))
if (i == 1 | i ==2) {
plt_lbl <- plt_lbl +
ggtitle("Label Update")
}
-
+
list_plot_cntrs[[i]] <- plt_ctr
list_plot_lbls[[i]] <- plt_lbl
}
@@ -584,17 +582,17 @@ iter_plot_list <- c(list_plot_cntrs[1], list_plot_lbls[1],
ggarrange(iter_plot_list[[1]] +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
- axis.title.x = element_blank(),
- plot.margin = margin(r = 2, b = 2)),
- iter_plot_list[[2]] +
+ axis.title.x = element_blank(),
+ plot.margin = margin(r = 2, b = 2)),
+ iter_plot_list[[2]] +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
- plot.margin = margin(r = 2, l = 2, b = 2) ),
- iter_plot_list[[3]] +
+ plot.margin = margin(r = 2, l = 2, b = 2) ),
+ iter_plot_list[[3]] +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
@@ -602,29 +600,29 @@ ggarrange(iter_plot_list[[1]] +
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
plot.margin = margin(r = 2, l = 2, b = 2) ),
- iter_plot_list[[4]] +
+ iter_plot_list[[4]] +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
- axis.title.y = element_blank(),
+ axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
plot.margin = margin(l = 2, b = 2) ),
iter_plot_list[[5]] +
- theme(plot.margin = margin(r = 2, t = 2)),
- iter_plot_list[[6]] +
+ theme(plot.margin = margin(r = 2, t = 2)),
+ iter_plot_list[[6]] +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
- plot.margin = margin(r = 2, l = 2, t = 2) ),
- iter_plot_list[[7]] +
+ plot.margin = margin(r = 2, l = 2, t = 2) ),
+ iter_plot_list[[7]] +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
plot.margin = margin(r = 2, l = 2, t = 2) ),
iter_plot_list[[8]] + theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
- axis.title.y = element_blank(),
+ axis.title.y = element_blank(),
plot.margin = margin(l = 2, t = 2) ),
nrow = 2)
```
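
To summarize the procedure in code, here is a small standalone sketch of the alternating center and label updates, assuming the `penguins_standardized` data frame and the `tidyverse`. It is for illustration only; later in the chapter we will instead perform K-means using a `tidymodels` workflow.

```{r 10-kmeans-sketch, eval = FALSE}
set.seed(1)
K <- 3
dat <- penguins_standardized |>
  select(flipper_length_standardized, bill_length_standardized)

# start with a random assignment of observations to the K clusters
labels <- sample(rep(1:K, length.out = nrow(dat)))

# in practice, we would iterate until the labels stop changing;
# here we simply run a fixed number of iterations, and (for simplicity)
# do not handle the case where a cluster becomes empty
for (iter in 1:10) {
  # center update: compute the mean of each variable within each cluster
  centers <- dat |>
    mutate(label = labels) |>
    group_by(label) |>
    summarize(across(everything(), mean))

  # label update: reassign each observation to its closest center
  dists <- as.matrix(dist(bind_rows(select(centers, -label), dat)))
  labels <- apply(dists[-(1:K), 1:K], 1, which.min)
}
```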
@@ -646,11 +644,11 @@ For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky ran
```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3.25, fig.width = 3.75, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Random initialization of labels."}
penguins_standardized <- penguins_standardized |>
- mutate(label = as_factor(c(3L, 3L, 1L, 1L, 2L, 1L, 2L, 1L, 1L,
+ mutate(label = as_factor(c(3L, 3L, 1L, 1L, 2L, 1L, 2L, 1L, 1L,
1L, 3L, 1L, 2L, 2L, 2L, 3L, 3L, 3L)))
-plt_lbl <- ggplot(penguins_standardized, aes(y = bill_length_standardized,
- x = flipper_length_standardized,
+plt_lbl <- ggplot(penguins_standardized, aes(y = bill_length_standardized,
+ x = flipper_length_standardized,
color = label)) +
geom_point(size = 2) +
xlab("Flipper Length (standardized)") +
@@ -676,63 +674,63 @@ for (i in 1:5) {
summarize_all(funs(mean))
nclus <- nrow(centers)
# replot with centers
- plt_ctr <- ggplot(penguins_standardized, aes(y = bill_length_standardized,
- x = flipper_length_standardized,
+ plt_ctr <- ggplot(penguins_standardized, aes(y = bill_length_standardized,
+ x = flipper_length_standardized,
color = label)) +
geom_point(size = 2) +
xlab("Flipper Length\n(standardized)") +
ylab("Bill Length\n(standardized)") +
theme(legend.position = "none") +
- scale_color_manual(values= cbpalette) +
- geom_point(data = centers, aes(y = bill_length_standardized,
- x = flipper_length_standardized,
- fill = label),
- size = 4,
- shape = 21,
- stroke = 1,
- color = "black",
+ scale_color_manual(values= cbpalette) +
+ geom_point(data = centers, aes(y = bill_length_standardized,
+ x = flipper_length_standardized,
+ fill = label),
+ size = 4,
+ shape = 21,
+ stroke = 1,
+ color = "black",
fill = cbpalette) +
- annotate("text", x = -0.5, y = 1.5, label = paste0("Iteration ", i), size = 5) +
- theme(text = element_text(size = 14), axis.title=element_text(size=14))
+ annotate("text", x = -0.5, y = 1.5, label = paste0("Iteration ", i), size = 5) +
+ theme(text = element_text(size = 14), axis.title=element_text(size=14))
if (i == 1 | i == 2) {
plt_ctr <- plt_ctr +
ggtitle("Center Update")
}
-
+
# reassign labels
dists <- rbind(centers, penguins_standardized) |>
select("flipper_length_standardized", "bill_length_standardized") |>
dist() |>
as.matrix()
dists <- as_tibble(dists[-(1:nclus), 1:nclus])
- penguins_standardized <- penguins_standardized |>
+ penguins_standardized <- penguins_standardized |>
mutate(label = apply(dists, 1, function(x) names(x)[which.min(x)]))
- plt_lbl <- ggplot(penguins_standardized, aes(y = bill_length_standardized,
- x = flipper_length_standardized,
+ plt_lbl <- ggplot(penguins_standardized, aes(y = bill_length_standardized,
+ x = flipper_length_standardized,
color = label)) +
geom_point(size = 2) +
xlab("Flipper Length\n(standardized)") +
ylab("Bill Length\n(standardized)") +
theme(legend.position = "none") +
scale_color_manual(values= cbpalette) +
- geom_point(data = centers, aes(y = bill_length_standardized,
- x = flipper_length_standardized,
- fill = label),
- size = 4,
- shape = 21,
- stroke = 1,
- color = "black",
+ geom_point(data = centers, aes(y = bill_length_standardized,
+ x = flipper_length_standardized,
+ fill = label),
+ size = 4,
+ shape = 21,
+ stroke = 1,
+ color = "black",
fill = cbpalette) +
- annotate("text", x = -0.5, y = 1.5, label = paste0("Iteration ", i), size = 5) +
- theme(text = element_text(size = 14), axis.title=element_text(size=14))
+ annotate("text", x = -0.5, y = 1.5, label = paste0("Iteration ", i), size = 5) +
+ theme(text = element_text(size = 14), axis.title=element_text(size=14))
if (i == 1 | i == 2) {
plt_lbl <- plt_lbl +
ggtitle("Label Update")
}
-
+
list_plot_cntrs[[i]] <- plt_ctr
list_plot_lbls[[i]] <- plt_lbl
}
@@ -746,17 +744,17 @@ iter_plot_list <- c(list_plot_cntrs[1], list_plot_lbls[1],
ggarrange(iter_plot_list[[1]] +
theme(axis.text.x = element_blank(), #remove x axis
axis.ticks.x = element_blank(),
- axis.title.x = element_blank(),
+ axis.title.x = element_blank(),
plot.margin = margin(r = 2, b = 2)), # change margins
- iter_plot_list[[2]] +
+ iter_plot_list[[2]] +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
- plot.margin = margin(r = 2, l = 2, b = 2) ),
- iter_plot_list[[3]] +
+ plot.margin = margin(r = 2, l = 2, b = 2) ),
+ iter_plot_list[[3]] +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
@@ -764,10 +762,10 @@ ggarrange(iter_plot_list[[1]] +
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
plot.margin = margin(r = 2, l = 2, b = 2)),
- iter_plot_list[[4]] +
+ iter_plot_list[[4]] +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
- axis.title.y = element_blank(),
+ axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
@@ -777,30 +775,30 @@ ggarrange(iter_plot_list[[1]] +
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
plot.margin = margin(r = 2, t = 2, b = 2)),
- iter_plot_list[[6]] +
+ iter_plot_list[[6]] +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
- plot.margin = margin(r = 2, l = 2, t = 2, b = 2) ),
- iter_plot_list[[7]] +
+ plot.margin = margin(r = 2, l = 2, t = 2, b = 2) ),
+ iter_plot_list[[7]] +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
plot.margin = margin(r = 2, l = 2, t = 2, b = 2) ),
iter_plot_list[[8]] + theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
- axis.title.y = element_blank(),
- plot.margin = margin(l = 2, t = 2, b = 2)),
+ axis.title.y = element_blank(),
+ plot.margin = margin(l = 2, t = 2, b = 2)),
ggplot() + theme_void(), ggplot() + theme_void(), ggplot() + theme_void(), ggplot() + theme_void(), # adding third row of empty plots to change space between third and fourth row
- iter_plot_list[[9]] +
+ iter_plot_list[[9]] +
theme(plot.margin = margin(r = 2)),
- iter_plot_list[[10]] +
+ iter_plot_list[[10]] +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
- axis.title.y = element_blank(),
+ axis.title.y = element_blank(),
plot.margin = margin(l = 2) ),
heights = c(3, 3, -1, 3),
ncol = 4)
@@ -812,15 +810,15 @@ and pick the clustering that has the lowest final total WSSD.
### Choosing K
-In order to cluster data using K-means,
+In order to cluster data using K-means,
we also have to pick the number of clusters, K.
-But unlike in classification, we have no response variable
+But unlike in classification, we have no response variable
and cannot perform cross-validation with some measure of model prediction error.
Further, if K is chosen too small, then multiple clusters get grouped together;
-if K is too large, then clusters get subdivided.
-In both cases, we will potentially miss interesting structure in the data.
-Figure \@ref(fig:10-toy-kmeans-vary-k) illustrates the impact of K
-on K-means clustering of our penguin flipper and bill length data
+if K is too large, then clusters get subdivided.
+In both cases, we will potentially miss interesting structure in the data.
+Figure \@ref(fig:10-toy-kmeans-vary-k) illustrates the impact of K
+on K-means clustering of our penguin flipper and bill length data
by showing the different clusterings for K's ranging from 1 to 9.
```{r 10-toy-kmeans-vary-k, echo = FALSE, warning = FALSE, fig.height = 6.25, fig.width = 6, fig.pos = "H", out.extra="", fig.cap = "Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black."}
@@ -843,14 +841,14 @@ assignments <- kclusts |>
clusterings <- kclusts |>
unnest(glanced, .drop = TRUE)
-clusters_levels <- c("1 Cluster",
- "2 Clusters",
- "3 Clusters",
- "4 Clusters",
- "5 Clusters",
- "6 Clusters",
- "7 Clusters",
- "8 Clusters",
+clusters_levels <- c("1 Cluster",
+ "2 Clusters",
+ "3 Clusters",
+ "4 Clusters",
+ "5 Clusters",
+ "6 Clusters",
+ "7 Clusters",
+ "8 Clusters",
"9 Clusters")
assignments$k <- factor(assignments$k)
@@ -859,32 +857,32 @@ levels(assignments$k) <- clusters_levels
clusters$k <- factor(clusters$k)
levels(clusters$k) <- clusters_levels
-p1 <- ggplot(assignments, aes(flipper_length_standardized,
+p1 <- ggplot(assignments, aes(flipper_length_standardized,
bill_length_standardized)) +
geom_point(aes(color = .cluster, size = I(2))) +
facet_wrap(~k) + scale_color_manual(values = cbbPalette) +
- labs(x = "Flipper Length (standardized)",
- y = "Bill Length (standardized)",
+ labs(x = "Flipper Length (standardized)",
+ y = "Bill Length (standardized)",
color = "Cluster") +
theme(legend.position = "none") +
- geom_point(data = clusters,
- aes(fill = cluster),
- color = "black",
- size = 4,
- shape = 21,
- stroke = 1) +
- scale_fill_manual(values = cbbPalette) +
- theme(text = element_text(size = 12), axis.title=element_text(size=12))
+ geom_point(data = clusters,
+ aes(fill = cluster),
+ color = "black",
+ size = 4,
+ shape = 21,
+ stroke = 1) +
+ scale_fill_manual(values = cbbPalette) +
+ theme(text = element_text(size = 12), axis.title=element_text(size=12))
p1
```
-If we set K less than 3, then the clustering merges separate groups of data; this causes a large
-total WSSD, since the cluster center is not close to any of the data in the cluster. On
-the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
-decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
-clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly
+If we set K less than 3, then the clustering merges separate groups of data; this causes a large
+total WSSD, since the cluster center is not close to any of the data in the cluster. On
+the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
+decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
+clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly
the right number of clusters (Figure \@ref(fig:10-toy-kmeans-elbow)).
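
As a rough sketch of the computation behind an elbow plot like this, the code below uses base R's `kmeans` function to record the total WSSD for each value of K from 1 to 9, using several random restarts for each K. It assumes the `penguins_standardized` data frame and the `tidyverse`; the `tidymodels` workflow we will actually use for clustering is introduced later in the chapter.

```{r 10-elbow-sketch, eval = FALSE}
# total WSSD for one choice of K, keeping the best of several random restarts
total_wssd_for_k <- function(k) {
  kmeans(select(penguins_standardized,
                flipper_length_standardized,
                bill_length_standardized),
         centers = k, nstart = 10)$tot.withinss
}

elbow_stats <- tibble(k = 1:9) |>
  mutate(total_wssd = map_dbl(k, total_wssd_for_k))

ggplot(elbow_stats, aes(x = k, y = total_wssd)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of Clusters", y = "Total WSSD")
```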
```{r 10-toy-kmeans-elbow, echo = FALSE, warning = FALSE, fig.align = 'center', fig.height = 3.25, fig.width = 4.25, fig.pos = "H", out.extra="", fig.cap = "Total WSSD for K clusters ranging from 1 to 9."}
@@ -892,9 +890,9 @@ p2 <- ggplot(clusterings, aes(x = k, y = tot.withinss)) +
geom_point(size = 2) +
geom_line() +
# annotate(geom = "line", x = 4, y = 35, xend = 2.65, yend = 27, arrow = arrow(length = unit(2, "mm"))) +
- geom_segment(aes(x = 4, y = 17,
- xend = 3.1,
- yend = 6),
+ geom_segment(aes(x = 4, y = 17,
+ xend = 3.1,
+ yend = 6),
arrow = arrow(length = unit(0.2, "cm"))) +
annotate("text", x = 4.4, y = 19, label = "Elbow", size = 7, color = "blue") +
labs(x = "Number of Clusters", y = "Total WSSD") +
@@ -910,8 +908,8 @@ p2
set.seed(1)
```
-We can perform K-means clustering in R using a `tidymodels` workflow similar
-to those in the earlier classification and regression chapters.
+We can perform K-means clustering in R using a `tidymodels` workflow similar
+to those in the earlier classification and regression chapters.
We will begin by loading the `tidyclust`\index{tidyclust} library, which contains the necessary
functionality.
```{r, echo = TRUE, warning = FALSE, message = FALSE}
@@ -919,25 +917,25 @@ library(tidyclust)
```
Returning to the original (unstandardized) `penguins` data,
-recall that K-means clustering uses straight-line
-distance to decide which points are similar to
+recall that K-means clustering uses straight-line
+distance to decide which points are similar to
each other. Therefore, the *scale* of each of the variables in the data
will influence which cluster data points end up being assigned to.
-Variables with a large scale will have a much larger
-effect on deciding cluster assignment than variables with a small scale.
-To address this problem, we need to create a recipe that
-standardizes\index{standardization!K-means}\index{K-means!standardization} our data
+Variables with a large scale will have a much larger
+effect on deciding cluster assignment than variables with a small scale.
+To address this problem, we need to create a recipe that
+standardizes\index{standardization!K-means}\index{K-means!standardization} our data
before clustering using the `step_scale` and `step_center` preprocessing steps.\index{recipe!step\_scale}\index{recipe!step\_center}
-Standardization will ensure that each variable has a mean
+Standardization will ensure that each variable has a mean
of 0 and standard deviation of 1 prior to clustering.
We will designate that all variables are to be used in clustering via
the model formula `~ .`.
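A minimal sketch of how these pieces fit together, assuming `penguins` contains only the numeric measurement columns being clustered, might look like the following. The name `kmeans_recipe` is illustrative; `kmeans_spec` and `kmeans_fit` match names that appear later in this chapter, and K is fixed at 3 here only for this first fit.

```r
library(tidymodels)
library(tidyclust)

# Recipe: use all variables for clustering, and standardize them
kmeans_recipe <- recipe(~ ., data = penguins) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-means model specification with K = 3 clusters, using the default "stats" engine
kmeans_spec <- k_means(num_clusters = 3) |>
  set_engine("stats")

# Combine the recipe and model specification in a workflow, then fit to the data
kmeans_fit <- workflow() |>
  add_recipe(kmeans_recipe) |>
  add_model(kmeans_spec) |>
  fit(data = penguins)
```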
-> **Note:** Recipes were originally designed specifically for *predictive* data
+> **Note:** Recipes were originally designed specifically for *predictive* data
> analysis problems—like classification and regression—not clustering
> problems. So the functions in R that we use to construct recipes are a little bit
> awkward in the setting of clustering. In particular, we will have to treat
-> "predictors" here as if it meant "variables to be used in clustering". So the
+> "predictors" here as if it meant "variables to be used in clustering". So the
> model formula `~ .` specifies that all variables are "predictors", i.e., all variables
> should be used for clustering. Similarly, when we use the `all_predictors()` function
> in the preprocessing steps, we really mean "apply this step to all variables used for
@@ -985,9 +983,9 @@ hidden_print(kmeans_fit)
As you can see above, the fit object has a lot of information
that can be used to visualize the clusters, pick K, and evaluate the total WSSD.
-Let's start by visualizing the clusters as a colored scatter plot! In
-order to do that, we first need to augment our
-original data frame with the cluster assignments. We can
+Let's start by visualizing the clusters as a colored scatter plot! In
+order to do that, we first need to augment our
+original data frame with the cluster assignments. We can
achieve this using the `augment` function from `tidyclust`.\index{tidyclust!augment}\index{augment}
```{r 10-kmeans-extract-augment}
clustered_data <- kmeans_fit |>
@@ -997,7 +995,7 @@ clustered_data
-Now that we have the cluster assignments included in the `clustered_data` tidy data frame, we can
+Now that we have the cluster assignments included in the `clustered_data` tidy data frame, we can
visualize them as shown in Figure \@ref(fig:10-plot-clusters-2).
-Note that we are plotting the *un-standardized* data here; if we for some reason wanted to
+Note that we are plotting the *un-standardized* data here; if we for some reason wanted to
visualize the *standardized* data from the recipe, we would need to use the `bake` function
to obtain that first.
```{r 10-plot-clusters-2, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "The data colored by the cluster assignments returned by K-means."}
cluster_plot <- ggplot(clustered_data,
- aes(x = flipper_length_mm,
- y = bill_length_mm,
- color = .pred_cluster),
+ aes(x = flipper_length_mm,
+ y = bill_length_mm,
+ color = .pred_cluster),
size = 2) +
geom_point() +
- labs(x = "Flipper Length",
- y = "Bill Length",
- color = "Cluster") +
+ labs(x = "Flipper Length",
+ y = "Bill Length",
+ color = "Cluster") +
scale_color_manual(values = c("dodgerblue3",
- "darkorange3",
- "goldenrod1")) +
+ "darkorange3",
+ "goldenrod1")) +
theme(text = element_text(size = 12))
cluster_plot
```
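For reference, if we instead wanted the *standardized* values for a plot like this, a small sketch using `prep` and `bake` (assuming the recipe object is named `kmeans_recipe`, as in the sketch above) could be:

```r
# prep() estimates the centering and scaling parameters from the data;
# bake(new_data = NULL) then returns the standardized version of that data
standardized_penguins <- kmeans_recipe |>
  prep() |>
  bake(new_data = NULL)
```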
As mentioned above, we also need to select K by finding
-where the "elbow" occurs in the plot of total WSSD versus the number of clusters.
+where the "elbow" occurs in the plot of total WSSD versus the number of clusters.
We can obtain the total WSSD (`tot.withinss`) \index{WSSD!total} from our
-clustering with 3 clusters using the `glance` function.
+clustering with 3 clusters using the `glance` function.
```{r 10-glance}
glance(kmeans_fit)
@@ -1042,14 +1040,14 @@ glance(kmeans_fit)
To calculate the total WSSD for a variety of Ks, we will
create a data frame with a column named `num_clusters` with rows containing
-each value of K we want to run K-means with (here, 1 to 9).
+each value of K we want to run K-means with (here, 1 to 9).
```{r 10-choose-k-part1}
penguin_clust_ks <- tibble(num_clusters = 1:9)
penguin_clust_ks
```
-Then we construct our model specification again, this time
+Then we construct our model specification again, this time
specifying that we want to tune the `num_clusters` parameter.
```{r 10-kmeans-spec-tune, message=FALSE, echo=TRUE, results=FALSE}
@@ -1063,7 +1061,7 @@ hidden_print(kmeans_spec)
We combine the recipe and specification in a workflow, and then
use the `tune_cluster` function\index{tidyclust!tune\_cluster} to run K-means on each of the different
settings of `num_clusters`. The `grid` argument controls which values of
-K we want to try—in this case, the values from 1 to 9 that are
+K we want to try—in this case, the values from 1 to 9 that are
stored in the `penguin_clust_ks` data frame. We set the `resamples`
argument to `apparent(penguins)` to\index{tidymodels!apparent} tell K-means to run on the whole
data set for each value of `num_clusters`. Finally, we collect the results
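A condensed sketch of this tuning pipeline, reusing `kmeans_recipe` from the sketch above (with `tidymodels` and `tidyclust` loaded) and the `penguin_clust_ks` data frame, might be:

```r
# Model specification with num_clusters left to be tuned
kmeans_spec <- k_means(num_clusters = tune()) |>
  set_engine("stats")

# Run K-means for every K in penguin_clust_ks on the whole data set,
# then gather the metrics (including the total WSSD) for each K
kmeans_results <- workflow() |>
  add_recipe(kmeans_recipe) |>
  add_model(kmeans_spec) |>
  tune_cluster(resamples = apparent(penguins), grid = penguin_clust_ks) |>
  collect_metrics()
```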
@@ -1085,14 +1083,14 @@ The total WSSD results correspond to the `mean` column when the `.metric` variab
We can obtain a tidy data frame with this information using `filter` and `mutate`.\index{filter}\index{mutate}
```{r 10-kmeans-next}
-kmeans_results <- kmeans_results |>
+kmeans_results <- kmeans_results |>
filter(.metric == "sse_within_total") |>
mutate(total_WSSD = mean) |>
select(num_clusters, total_WSSD)
kmeans_results
```
-Now that we have `total_WSSD` and `num_clusters` as columns in a data frame, we can make a line plot
+Now that we have `total_WSSD` and `num_clusters` as columns in a data frame, we can make a line plot
(Figure \@ref(fig:10-plot-choose-k)) and search for the "elbow" to find which value of K to use. \index{elbow method}
```{r 10-plot-choose-k, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "A plot showing the total WSSD versus the number of clusters."}
@@ -1101,20 +1099,20 @@ elbow_plot <- ggplot(kmeans_results, aes(x = num_clusters, y = total_WSSD)) +
geom_line() +
xlab("K") +
ylab("Total within-cluster sum of squares") +
- scale_x_continuous(breaks = 1:9) +
+ scale_x_continuous(breaks = 1:9) +
theme(text = element_text(size = 12))
elbow_plot
```
It looks like 3 clusters is the right choice for this data.
-But why is there a "bump" in the total WSSD plot here?
-Shouldn't total WSSD always decrease as we add more clusters?
-Technically yes, but remember: K-means can get "stuck" in a bad solution.
+But why is there a "bump" in the total WSSD plot here?
+Shouldn't total WSSD always decrease as we add more clusters?
+Technically yes, but remember: K-means can get "stuck" in a bad solution.
Unfortunately, for K = 8 we had an unlucky initialization
-and found a bad clustering! \index{K-means!restart, nstart}
-We can help prevent finding a bad clustering
-by trying a few different random initializations
+and found a bad clustering! \index{K-means!restart, nstart}
+We can help prevent finding a bad clustering
+by trying a few different random initializations
via the `nstart` argument in the model specification.
Here we will try using 10 restarts.
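A sketch of that change is shown below; `nstart` is passed through `set_engine` to the underlying `stats::kmeans` implementation, which performs the restarts and keeps the best result.

```r
# Same specification as before, but with 10 random restarts per fit
kmeans_spec <- k_means(num_clusters = tune()) |>
  set_engine("stats", nstart = 10)
```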
@@ -1127,7 +1125,7 @@ kmeans_spec
hidden_print(kmeans_spec)
```
-Now if we rerun the same workflow with the new model specification,
+Now if we rerun the same workflow with the new model specification,
K-means clustering will be performed `nstart = 10` times for each value of K.
The `collect_metrics` function will then pick the best clustering of the 10 runs for each value of K,
and report the results for that best clustering.
@@ -1138,8 +1136,8 @@ the more likely we are to find a good clustering (if one exists).
What value should you choose for `nstart`? The answer is that it depends
on many factors: the size and characteristics of your data set,
as well as how powerful your computer is.
-The larger the `nstart` value the better from an analysis perspective,
-but there is a trade-off that doing many clusterings
+The larger the `nstart` value, the better from an analysis perspective,
+but the trade-off is that doing many clusterings
could take a long time.
So this is something that needs to be balanced.
@@ -1160,7 +1158,7 @@ elbow_plot <- ggplot(kmeans_results, aes(x = num_clusters, y = total_WSSD)) +
geom_line() +
xlab("K") +
ylab("Total within-cluster sum of squares") +
- scale_x_continuous(breaks = 1:9) +
+ scale_x_continuous(breaks = 1:9) +
theme(text = element_text(size = 12))
elbow_plot
@@ -1169,8 +1167,8 @@ elbow_plot
## Exercises
-Practice exercises for the material covered in this chapter
-can be found in the accompanying
+Practice exercises for the material covered in this chapter
+can be found in the accompanying
[worksheets repository](https://worksheets.datasciencebook.ca)
in the "Clustering" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
diff --git a/source/inference.Rmd b/source/inference.Rmd
index 0d3015a64..c3ce541c8 100644
--- a/source/inference.Rmd
+++ b/source/inference.Rmd
@@ -21,7 +21,7 @@ min_x <- function(dist) {
min(ggp_data$data[[1]]$xmin)
}
-theme_update(axis.title = element_text(size = 12)) # modify axis label size in plots
+theme_update(axis.title = element_text(size = 12)) # modify axis label size in plots
```
## Overview
@@ -32,23 +32,23 @@ analysis questions regarding how summaries, patterns, trends, or relationships
in a data set extend to the wider population are called *inferential
questions*. This chapter will start with the fundamental ideas of sampling from
populations and then introduce two common techniques in statistical inference:
-*point estimation* and *interval estimation*.
+*point estimation* and *interval estimation*.
-## Chapter learning objectives
+## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:
-* Describe real-world examples of questions that can be answered with statistical inference.
-* Define common population parameters (e.g., mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample.
-* Define the following statistical sampling terms: population, sample, population parameter, point estimate, and sampling distribution.
-* Explain the difference between a population parameter and a sample point estimate.
-* Use R to draw random samples from a finite population.
-* Use R to create a sampling distribution from a finite population.
-* Describe how sample size influences the sampling distribution.
-* Define bootstrapping.
-* Use R to create a bootstrap distribution to approximate a sampling distribution.
-* Contrast the bootstrap and sampling distributions.
-
-## Why do we need sampling?
+- Describe real-world examples of questions that can be answered with statistical inference.
+- Define common population parameters (e.g., mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample.
+- Define the following statistical sampling terms: population, sample, population parameter, point estimate, and sampling distribution.
+- Explain the difference between a population parameter and a sample point estimate.
+- Use R to draw random samples from a finite population.
+- Use R to create a sampling distribution from a finite population.
+- Describe how sample size influences the sampling distribution.
+- Define bootstrapping.
+- Use R to create a bootstrap distribution to approximate a sampling distribution.
+- Contrast the bootstrap and sampling distributions.
+
+## Why do we need sampling?
We often need to understand how quantities we observe in a subset
of data relate to the same quantities in the broader population. For example, suppose a
retailer is considering selling iPhone accessories, and they want to estimate
@@ -68,7 +68,7 @@ general, a population parameter is a numerical characteristic of the entire
population. To compute this number in the example above, we would need to ask
every single undergraduate in North America whether they own an iPhone. In
practice, directly computing population parameters is often time-consuming and
-costly, and sometimes impossible.
+costly, and sometimes impossible.
A more practical approach would be to make measurements for a **sample**, i.e., a \index{sample}
subset of individuals collected from the population. We can then compute a
@@ -80,7 +80,7 @@ case, we might suspect that proportion is a reasonable estimate of the
proportion of students who own an iPhone in the entire population. Figure
\@ref(fig:11-population-vs-sample) illustrates this process.
In general, the process of using a sample to make a conclusion about the
-broader population from which it is taken is referred to as **statistical inference**.
+broader population from which it is taken is referred to as **statistical inference**.
\index{inference}\index{statistical inference|see{inference}}
```{r 11-population-vs-sample, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Population versus sample.", out.width="100%"}
@@ -91,7 +91,7 @@ Note that proportions are not the *only* kind of population parameter we might
be interested in. For example, suppose an undergraduate student studying at the University
of British Columbia in Canada is looking for an apartment
to rent. They need to create a budget, so they want to know about
-studio apartment rental prices in Vancouver. This student might
+studio apartment rental prices in Vancouver. This student might
formulate the question:
*What is the average price per month of studio apartment rentals in Vancouver?*
@@ -119,13 +119,13 @@ focus on two settings:
## Sampling distributions
### Sampling distributions for proportions
-We will look at an example using data from
+We will look at an example using data from
[Inside Airbnb](http://insideairbnb.com/) [@insideairbnb]. Airbnb \index{Airbnb} is an online
marketplace for arranging vacation rentals and places to stay. The data set
contains listings for Vancouver, Canada, in September 2020. Our data
includes an ID number, neighborhood, type of room, the number of people the
rental accommodates, number of bathrooms, bedrooms, beds, and the price per
-night.
+night.
```{r 11-example-means5, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 5.5, fig.width = 4, fig.cap = "Comparison of population distribution, sample distribution, and sampling distribution."}
@@ -515,7 +515,7 @@ sample_estimates_500 <- rep_sample_n(airbnb, size = 500, reps = 20000) |>
sampling_distribution_20 <- ggplot(sample_estimates_20, aes(x = mean_price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
labs(x = "Sample mean price per night (dollars)", y = "Count") +
- ggtitle("n = 20")
+ ggtitle("n = 20")
## Sampling distribution n = 50
sampling_distribution_50 <- ggplot(sample_estimates_50, aes(x = mean_price)) +
@@ -548,12 +548,12 @@ annotated_sampling_dist_20 <- sampling_distribution_20 +
xlim(min_x(sampling_distribution_20), max_x(sampling_distribution_20)) +
ggtitle("n = 20") +
annotate("text",
- x = max_x(sampling_distribution_20),
- y = max_count(sampling_distribution_20),
- hjust = 1,
+ x = max_x(sampling_distribution_20),
+ y = max_count(sampling_distribution_20),
+ hjust = 1,
vjust = 1,
label = paste("mean = ", round(mean(sample_estimates$mean_price), 1))
- )+ theme(text = element_text(size = 12), axis.title=element_text(size=12))
+ )+ theme(text = element_text(size = 12), axis.title=element_text(size=12))
#+
# annotate("text", x = max_x(sampling_distribution_20), y = max_count(sampling_distribution_20), hjust = 1, vjust = 3,
# label = paste("sd = ", round(sd(sample_estimates$mean_price), 1)))
@@ -562,9 +562,9 @@ annotated_sampling_dist_50 <- sampling_distribution_50 +
geom_vline(xintercept = mean(sample_estimates_50$mean_price), col = "red") +
## x limits set the same as n = 20 graph, y is this graph
annotate("text",
- x = max_x(sampling_distribution_20),
- y = max_count(sampling_distribution_50),
- hjust = 1,
+ x = max_x(sampling_distribution_20),
+ y = max_count(sampling_distribution_50),
+ hjust = 1,
vjust = 1,
label = paste("mean = ", round(mean(sample_estimates_50$mean_price), 1))
)+ theme(text = element_text(size = 12), axis.title=element_text(size=12)) #+
@@ -574,9 +574,9 @@ annotated_sampling_dist_50 <- sampling_distribution_50 +
annotated_sampling_dist_100 <- sampling_distribution_100 +
geom_vline(xintercept = mean(sample_estimates_100$mean_price), col = "red") +
annotate("text",
- x = max_x(sampling_distribution_20),
- y = max_count(sampling_distribution_100),
- hjust = 1,
+ x = max_x(sampling_distribution_20),
+ y = max_count(sampling_distribution_100),
+ hjust = 1,
vjust = 1,
label = paste("mean = ", round(mean(sample_estimates_100$mean_price), 1))
) + theme(text = element_text(size = 12), axis.title=element_text(size=12)) #+
@@ -586,12 +586,12 @@ annotated_sampling_dist_100 <- sampling_distribution_100 +
annotated_sampling_dist_500 <- sampling_distribution_500 +
geom_vline(xintercept = mean(sample_estimates_500$mean_price), col = "red") +
annotate("text",
- x = max_x(sampling_distribution_20),
- y = max_count(sampling_distribution_500),
- hjust = 1,
+ x = max_x(sampling_distribution_20),
+ y = max_count(sampling_distribution_500),
+ hjust = 1,
vjust = 1,
label = paste("mean = ", round(mean(sample_estimates_500$mean_price), 1))
- ) + theme(text = element_text(size = 12), axis.title=element_text(size=12))
+ ) + theme(text = element_text(size = 12), axis.title=element_text(size=12))
#+
# annotate("text", x = max_x(sampling_distribution_20), y = max_count(sampling_distribution_500), hjust = 1, vjust = 3,
# label = paste("sd = ", round(sd(sample_estimates_500$mean_price), 1)))
@@ -616,23 +616,23 @@ mean is roughly bell-shaped. \index{sampling distribution!effect of sample size}
> **Note:** You might notice that in the `n = 20` case in Figure \@ref(fig:11-example-means7),
> the distribution is not *quite* bell-shaped. There is a bit of skew towards the right!
> You might also notice that in the `n = 50` case and larger, that skew seems to disappear.
-> In general, the sampling distribution—for both means and proportions—only
+> In general, the sampling distribution—for both means and proportions—only
> becomes bell-shaped *once the sample size is large enough*.
-> How large is "large enough?" Unfortunately, it depends entirely on the problem at hand. But
+> How large is "large enough?" Unfortunately, it depends entirely on the problem at hand. But
> as a rule of thumb, often a sample size of at least 20 will suffice.
-
+
### Summary
1. A point estimate is a single value computed using a sample from a population (e.g., a mean or proportion).
2. The sampling distribution of an estimate is the distribution of the estimate for all possible samples of a fixed size from the same population.
3. The shape of the sampling distribution is usually bell-shaped with one peak and centered at the population mean or proportion.
-4. The spread of the sampling distribution is related to the sample size. As the sample size increases, the spread of the sampling distribution decreases.
+4. The spread of the sampling distribution is related to the sample size. As the sample size increases, the spread of the sampling distribution decreases.
-## Bootstrapping
-### Overview
+## Bootstrapping
+### Overview
*Why all this emphasis on sampling distributions?*
@@ -650,15 +650,15 @@ We also need to report some notion of *uncertainty* in the value of the point
estimate.
Unfortunately, we cannot construct the exact sampling distribution without
-full access to the population. However, if we could somehow *approximate* what
-the sampling distribution would look like for a sample, we could
+full access to the population. However, if we could somehow *approximate* what
+the sampling distribution would look like for a sample, we could
use that approximation to then report how uncertain our sample
point estimate is (as we did above with the *exact* sampling
-distribution). There are several methods to accomplish this; in this book, we
-will use the \index{bootstrap} *bootstrap*. We will discuss **interval estimation** and
-construct \index{confidence interval}\index{interval|see{confidence interval}}
-**confidence intervals** using just a single sample from a population. A
-confidence interval is a range of plausible values for our population parameter.
+distribution). There are several methods to accomplish this; in this book, we
+will use the \index{bootstrap} *bootstrap*. We will discuss **interval estimation** and
+construct \index{confidence interval}\index{interval|see{confidence interval}}
+**confidence intervals** using just a single sample from a population. A
+confidence interval is a range of plausible values for our population parameter.
Here is the key idea. First, if you take a big enough sample, it *looks like*
the population. Notice the histograms' shapes for samples of different sizes
@@ -673,7 +673,7 @@ sample_10 <- airbnb |>
sample_distribution_10 <- ggplot(sample_10, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
xlab("Price per night (dollars)") +
- ylab("Count") +
+ ylab("Count") +
ggtitle("n = 10")
sample_20 <- airbnb |>
@@ -682,7 +682,7 @@ sample_20 <- airbnb |>
sample_distribution_20 <- ggplot(sample_20, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
xlab("Price per night (dollars)") +
- ylab("Count") +
+ ylab("Count") +
ggtitle("n = 20")
sample_50 <- airbnb |>
@@ -691,7 +691,7 @@ sample_50 <- airbnb |>
sample_distribution_50 <- ggplot(sample_50, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
xlab("Price per night (dollars)") +
- ylab("Count") +
+ ylab("Count") +
ggtitle("n = 50")
sample_100 <- airbnb |>
@@ -700,7 +700,7 @@ sample_100 <- airbnb |>
sample_distribution_100 <- ggplot(sample_100, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
xlab("Price per night (dollars)") +
- ylab("Count") +
+ ylab("Count") +
ggtitle("n = 100")
sample_200 <- airbnb |>
@@ -709,20 +709,20 @@ sample_200 <- airbnb |>
sample_distribution_200 <- ggplot(sample_200, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
xlab("Price per night (dollars)") +
- ylab("Count") +
+ ylab("Count") +
ggtitle("n = 200")
grid.arrange(sample_distribution_10 + xlim(min(airbnb$price), 600),
- sample_distribution_20 +
+ sample_distribution_20 +
xlim(min(airbnb$price), 600),
- sample_distribution_50 +
+ sample_distribution_50 +
xlim(min(airbnb$price), 600),
- sample_distribution_100 +
+ sample_distribution_100 +
xlim(min(airbnb$price), 600),
- sample_distribution_200 +
+ sample_distribution_200 +
xlim(min(airbnb$price), 600),
- population_distribution +
- ggtitle("Population distribution") +
+ population_distribution +
+ ggtitle("Population distribution") +
xlim(min(airbnb$price), 600),
ncol = 2
)
@@ -737,7 +737,7 @@ In the previous section, we took many samples of the same size *from our
population* to get a sense of the variability of a sample estimate. But if our
sample is big enough that it looks like our population, we can pretend that our
sample *is* the population, and take more samples (with replacement) of the
-same size from it instead! This very clever technique is
+same size from it instead! This very clever technique is
called **the bootstrap**. Note that by taking many samples from our single, observed
sample, we do not obtain the true sampling distribution, but rather an
approximation that we call **the bootstrap distribution**. \index{bootstrap!distribution}
@@ -749,7 +749,7 @@ approximation that we call **the bootstrap distribution**. \index{bootstrap!dist
> size $n$ *without* replacement, it would just return our original sample!
This section will explore how to create a bootstrap distribution from a single
-sample using R. The process is visualized in Figure \@ref(fig:11-intro-bootstrap-image).
+sample using R. The process is visualized in Figure \@ref(fig:11-intro-bootstrap-image).
For a sample of size $n$, you would do the following:
1. Randomly select an observation from the original sample, which was drawn from the population.
@@ -764,10 +764,10 @@ For a sample of size $n$, you would do the following:
knitr::include_graphics("img/inference/intro-bootstrap.jpeg")
```
-### Bootstrapping in R
+### Bootstrapping in R
Let’s continue working with our Airbnb example to illustrate how we might create
-and use a bootstrap distribution using just a single sample from the population.
+and use a bootstrap distribution using just a single sample from the population.
Once again, suppose we are
interested in estimating the population mean price per night of all Airbnb
listings in Vancouver, Canada, using a single sample size of 40.
@@ -775,7 +775,7 @@ Recall our point estimate was \$`r format(round(estimates$mean_price, 2), nsmall
histogram of prices in the sample is displayed in Figure \@ref(fig:11-bootstrapping1).
```{r, echo = F, message = F, warning = F}
-one_sample <- one_sample |>
+one_sample <- one_sample |>
ungroup() |> select(-replicate)
```
@@ -796,7 +796,7 @@ Remember, in practice, we usually only have this one sample from the population.
this sample and estimate are the only data we can work with.
We now perform steps 1–5 listed above to generate a single bootstrap
-sample in R and calculate a point estimate from that bootstrap sample. We will
+sample in R and calculate a point estimate from that bootstrap sample. We will
use the `rep_sample_n` function as we did when we were
creating our sampling distribution. But critically, note that we now
pass `one_sample`—our single sample of size 40—as the first argument.
@@ -810,7 +810,7 @@ boot1 <- one_sample |>
rep_sample_n(size = 40, replace = TRUE, reps = 1)
boot1_dist <- ggplot(boot1, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
- labs(x = "Price per night (dollars)", y = "Count") +
+ labs(x = "Price per night (dollars)", y = "Count") +
theme(text = element_text(size = 12))
boot1_dist
@@ -857,7 +857,7 @@ ggplot(six_bootstrap_samples, aes(price)) +
We see in Figure \@ref(fig:11-bootstrapping-six-bootstrap-samples) how the
bootstrap samples differ. We can also calculate the sample mean for each of
-these six replicates.
+these six replicates.
```{r 11-bootstrapping-six-bootstrap-samples-means, echo = TRUE, message = FALSE, warning = FALSE}
six_bootstrap_samples |>
group_by(replicate) |>
@@ -887,7 +887,7 @@ boot_est_dist <- ggplot(boot20000_means, aes(x = mean_price)) +
boot_est_dist
```
-Let's compare the bootstrap distribution—which we construct by taking many samples from our original sample of size 40—with
+Let's compare the bootstrap distribution—which we construct by taking many samples from our original sample of size 40—with
the true sampling distribution—which corresponds to taking many samples from the population.
```{r 11-bootstrapping6, echo = F, message = FALSE, warning = FALSE, fig.cap = "Comparison of the distribution of the bootstrap sample means and sampling distribution.", fig.height = 3.5}
@@ -900,27 +900,27 @@ sample_estimates <- samples |>
sampling_dist <- ggplot(sample_estimates, aes(x = mean_price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
ylab("Count") +
- xlab("Sample mean price per night (dollars)")
+ xlab("Sample mean price per night (dollars)")
-annotated_sampling_dist <- sampling_dist +
- xlim(min_x(sampling_dist), max_x(sampling_dist)) +
+annotated_sampling_dist <- sampling_dist +
+ xlim(min_x(sampling_dist), max_x(sampling_dist)) +
geom_vline(xintercept = mean(sample_estimates$mean_price), col = "red") +
annotate("text",
- x = max_x(sampling_dist), y = max_count(sampling_dist),
- hjust = 1,
+ x = max_x(sampling_dist), y = max_count(sampling_dist),
+ hjust = 1,
vjust = 1,
label = paste("mean = ", round(mean(sample_estimates$mean_price), 1)))
-boot_est_dist_limits <- boot_est_dist +
- xlim(min_x(sampling_dist), max_x(sampling_dist))
+boot_est_dist_limits <- boot_est_dist +
+ xlim(min_x(sampling_dist), max_x(sampling_dist))
-annotated_boot_est_dist <- boot_est_dist_limits +
+annotated_boot_est_dist <- boot_est_dist_limits +
geom_vline(xintercept = mean(boot20000_means$mean_price), col = "red") +
annotate("text",
- x = max_x(sampling_dist), y = max_count(boot_est_dist_limits),
- vjust = 1,
- hjust = 1,
- label = paste("mean = ", round(mean(boot20000_means$mean_price), 1)))
+ x = max_x(sampling_dist), y = max_count(boot_est_dist_limits),
+ vjust = 1,
+ hjust = 1,
+ label = paste("mean = ", round(mean(boot20000_means$mean_price), 1)))
grid.arrange(annotated_sampling_dist + ggtitle("Sampling distribution"),
annotated_boot_est_dist + ggtitle("Bootstrap distribution"),
ncol = 2
@@ -933,13 +933,13 @@ There are two essential points that we can take away from Figure
distribution and the bootstrap distribution are similar; the bootstrap
distribution lets us get a sense of the point estimate's variability. The
second important point is that the means of these two distributions are
-different. The sampling distribution is centered at
+different. The sampling distribution is centered at
\$`r round(mean(airbnb$price),2)`, the population mean value. However, the bootstrap
-distribution is centered at the original sample's mean price per night,
+distribution is centered at the original sample's mean price per night,
\$`r round(mean(boot20000_means$mean_price), 2)`. Because we are resampling from the
original sample repeatedly, we see that the bootstrap distribution is centered
at the original sample's mean value (unlike the sampling distribution of the
-sample mean, which is centered at the population parameter value).
+sample mean, which is centered at the population parameter value).
Figure
\@ref(fig:11-bootstrapping7) summarizes the bootstrapping process.
@@ -1085,10 +1085,10 @@ grid.text("many means...",
)
```
-### Using the bootstrap to calculate a plausible range
+### Using the bootstrap to calculate a plausible range
Now that we have constructed our bootstrap distribution, let's use it to create
-an approximate 95\% percentile bootstrap confidence interval.
+an approximate 95\% percentile bootstrap confidence interval.
A **confidence interval** \index{confidence interval} is a range of plausible values for the population parameter. We will
find the range of values covering the middle 95\% of the bootstrap
distribution, giving us a 95\% confidence interval. You may be wondering, what
@@ -1108,13 +1108,13 @@ confidence level.
To calculate a 95\% percentile bootstrap confidence interval, we will do the following:
-1. Arrange the observations in the bootstrap distribution in ascending order.
+1. Arrange the observations in the bootstrap distribution in ascending order.
2. Find the value such that 2.5\% of observations fall below it (the 2.5\% percentile). Use that value as the lower bound of the interval.
3. Find the value such that 97.5\% of observations fall below it (the 97.5\% percentile). Use that value as the upper bound of the interval.
\newpage
-To do this in R, we can use the `quantile()` function. Quantiles are expressed in proportions rather than
+To do this in R, we can use the `quantile()` function. Quantiles are expressed in proportions rather than
percentages, so the 2.5th and 97.5th percentiles would be the 0.025 and 0.975 quantiles, respectively.
\index{quantile}
\index{pull}
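A minimal sketch of this computation, assuming the bootstrap sample means are stored in the `mean_price` column of the `boot20000_means` data frame as above, is:

```r
# Extract the bootstrap sample means and take the 2.5% and 97.5% quantiles
bounds <- boot20000_means |>
  select(mean_price) |>
  pull() |>
  quantile(c(0.025, 0.975))
bounds
```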
@@ -1132,7 +1132,7 @@ bounds
Our interval, \$`r round(bounds[1],2) ` to \$`r round(bounds[2],2)`, captures
the middle 95\% of the sample mean prices in the bootstrap distribution. We can
visualize the interval on our distribution in Figure
-\@ref(fig:11-bootstrapping9).
+\@ref(fig:11-bootstrapping9).
```{r 11-bootstrapping9, echo = F, message = FALSE, warning = FALSE, fig.cap = "Distribution of the bootstrap sample means with percentile lower and upper bounds.", fig.height=4, fig.width = 6.5}
boot_est_dist +
@@ -1149,7 +1149,7 @@ boot_est_dist +
To finish our estimation of the population parameter, we would report the point
estimate and our confidence interval's lower and upper bounds. Here the sample
-mean price per night of 40 Airbnb listings was
+mean price per night of 40 Airbnb listings was
\$`r format(round(mean(one_sample$price),2), nsmall=2)`, and we are 95\% "confident" that the true
population mean price per night for all Airbnb listings in Vancouver is between
\$`r round(bounds[1],2)` and \$`r round(bounds[2],2)`.
@@ -1169,8 +1169,8 @@ statistical techniques you may learn about in the future!
## Exercises
-Practice exercises for the material covered in this chapter
-can be found in the accompanying
+Practice exercises for the material covered in this chapter
+can be found in the accompanying
[worksheets repository](https://worksheets.datasciencebook.ca)
in the two "Statistical inference" rows.
You can launch an interactive version of each worksheet in your browser by clicking the "launch binder" button.
diff --git a/source/intro.Rmd b/source/intro.Rmd
index f707bccb1..4d2e5bb78 100644
--- a/source/intro.Rmd
+++ b/source/intro.Rmd
@@ -13,9 +13,9 @@ knitr::opts_chunk$set(fig.align = "default")
This chapter provides an introduction to data science and the R programming language.
The goal here is to get your hands dirty right from the start! We will walk through an entire data analysis,
-and along the way introduce different types of data analysis question, some fundamental programming
+and along the way introduce different types of data analysis question, some fundamental programming
concepts in R, and the basics of loading, cleaning, and visualizing data. In the following chapters, we will
-dig into each of these steps in much more detail; but for now, let's jump in to see how much we can do
+dig into each of these steps in much more detail; but for now, let's jump in to see how much we can do
with data science!
## Chapter learning objectives
@@ -27,13 +27,14 @@ By the end of the chapter, readers will be able to do the following:
- Read tabular data with `read_csv`.
- Create new variables and objects in R using the assignment symbol.
- Create and organize subsets of tabular data using `filter`, `select`, `arrange`, and `slice`.
+- Add and modify columns in tabular data using `mutate`.
- Visualize data with a `ggplot` bar plot.
- Use `?` to access help and documentation tools in R.
## Canadian languages data set
In this chapter, \index{Canadian languages} we will walk through a full analysis of a data set relating to
-languages spoken at home by Canadian residents. Many Indigenous peoples exist in Canada
+languages spoken at home by Canadian residents. Many Indigenous peoples exist in Canada
with their own cultures and languages; these languages are often unique to Canada and not spoken
anywhere else in the world [@statcan2018mothertongue]. Sadly, colonization has
led to the loss of many of these languages. For instance, generations of
@@ -41,18 +42,18 @@ children were not allowed to speak their mother tongue (the first language an
individual learns in childhood) in Canadian residential schools. Colonizers
also renamed places they had "discovered" [@wilson2018]. Acts such as these
have significantly harmed the continuity of Indigenous languages in Canada, and
-some languages are considered "endangered" as few people report speaking them.
-To learn more, please see *Canadian Geographic*'s article, "Mapping Indigenous Languages in
-Canada" [@walker2017],
-*They Came for the Children: Canada, Aboriginal
-peoples, and Residential Schools* [@children2012]
-and the *Truth and Reconciliation Commission of Canada's*
+some languages are considered "endangered" as few people report speaking them.
+To learn more, please see *Canadian Geographic*'s article, "Mapping Indigenous Languages in
+Canada" [@walker2017],
+*They Came for the Children: Canada, Aboriginal
+peoples, and Residential Schools* [@children2012]
+and the *Truth and Reconciliation Commission of Canada's*
*Calls to Action* [@calls2015].
-The data set we will study in this chapter is taken from
-[the `canlang` R data package](https://ttimbers.github.io/canlang/)
+The data set we will study in this chapter is taken from
+[the `canlang` R data package](https://ttimbers.github.io/canlang/)
[@timbers2020canlang], which has
-population language data collected during the 2016 Canadian census [@cancensus2016].
+population language data collected during the 2016 Canadian census [@cancensus2016].
In this data, there are 214 languages recorded, each having six different properties:
1. `category`: Higher-level language category, describing whether the language is an Official Canadian language, an Aboriginal (i.e., Indigenous) language, or a Non-Official and Non-Aboriginal language.
@@ -64,12 +65,12 @@ In this data, there are 214 languages recorded, each having six different proper
According to the census, more than 60 Aboriginal languages were reported
as being spoken in Canada. Suppose we want to know which are the most common;
-then we might ask the following question, which we wish to answer using our data:
+then we might ask the following question, which we wish to answer using our data:
*Which ten Aboriginal languages were most often reported in 2016 as mother
-tongues in Canada, and how many people speak each of them?*
+tongues in Canada, and how many people speak each of them?*
-> **Note:** Data science\index{data science!good practices} cannot be done without
+> **Note:** Data science\index{data science!good practices} cannot be done without
> a deep understanding of the data and
> problem domain. In this book, we have simplified the data sets used in our
> examples to concentrate on methods and fundamental concepts. But in real
@@ -79,15 +80,15 @@ tongues in Canada, and how many people speak each of them?*
> about *how* the data were collected, which affects the conclusions you can
> draw. If your data are biased, then your results will be biased!
-## Asking a question
+## Asking a question
Every good data analysis begins with a *question*—like the
above—that you aim to answer using data. As it turns out, there
are actually a number of different *types* of question regarding data:
descriptive, exploratory, inferential, predictive, causal, and mechanistic,
all of which are defined in Table \@ref(tab:questions-table).
-Carefully formulating a question as early as possible in your analysis—and
-correctly identifying which type of question it is—will guide your overall approach to
+Carefully formulating a question as early as possible in your analysis—and
+correctly identifying which type of question it is—will guide your overall approach to
the analysis as well as the selection of appropriate tools.\index{question!data analysis}
\index{descriptive question!definition}
\index{exploratory question!definition}
@@ -107,30 +108,30 @@ Table: (\#tab:questions-table) Types of data analysis question [@leek2015questio
| Causal | A question that asks about whether changing one factor will lead to a change in another factor, on average, in the wider population. | Does wealth lead to voting for a certain political party in Canadian elections? |
| Mechanistic | A question that asks about the underlying mechanism of the observed patterns, trends, or relationships (i.e., how does it happen?) | How does wealth lead to voting for a certain political party in Canadian elections? |
-In this book, you will learn techniques to answer the
-first four types of question: descriptive, exploratory, predictive, and inferential;
+In this book, you will learn techniques to answer the
+first four types of question: descriptive, exploratory, predictive, and inferential;
causal and mechanistic questions are beyond the scope of this book.
In particular, you will learn how to apply the following analysis tools:
-1. **Summarization:** \index{summarization!overview} computing and reporting aggregated values pertaining to a data set.
+1. **Summarization:** \index{summarization!overview} computing and reporting aggregated values pertaining to a data set.
Summarization is most often used to answer descriptive questions,
and can occasionally help with answering exploratory questions.
-For example, you might use summarization to answer the following question:
+For example, you might use summarization to answer the following question:
*What is the average race time for runners in this data set?*
Tools for summarization are covered in detail in Chapters \@ref(reading)
and \@ref(wrangling), but appear regularly throughout the text.
-2. **Visualization:** \index{visualization!overview} plotting data graphically.
+2. **Visualization:** \index{visualization!overview} plotting data graphically.
Visualization is typically used to answer descriptive and exploratory questions,
but plays a critical supporting role in answering all of the types of question in Table \@ref(tab:questions-table).
For example, you might use visualization to answer the following question:
-*Is there any relationship between race time and age for runners in this data set?*
+*Is there any relationship between race time and age for runners in this data set?*
This is covered in detail in Chapter \@ref(viz), but again appears regularly throughout the book.
3. **Classification:** \index{classification!overview} predicting a class or category for a new observation.
Classification is used to answer predictive questions.
For example, you might use classification to answer the following question:
*Given measurements of a tumor's average cell area and perimeter, is the tumor benign or malignant?*
Classification is covered in Chapters \@ref(classification1) and \@ref(classification2).
-4. **Regression:** \index{regression!overview} predicting a quantitative value for a new observation.
+4. **Regression:** \index{regression!overview} predicting a quantitative value for a new observation.
Regression is also used to answer predictive questions.
For example, you might use regression to answer the following question:
*What will be the race time for a 20-year-old runner who weighs 50kg?*
@@ -140,26 +141,26 @@ data set. Clustering is often used to answer exploratory questions.
For example, you might use clustering to answer the following question:
*What products are commonly bought together on Amazon?*
Clustering is covered in Chapter \@ref(clustering).
-6. **Estimation:** \index{estimation!overview} taking measurements for a small number of items from a large group
- and making a good guess for the average or proportion for the large group. Estimation
+6. **Estimation:** \index{estimation!overview} taking measurements for a small number of items from a large group
+ and making a good guess for the average or proportion for the large group. Estimation
is used to answer inferential questions.
For example, you might use estimation to answer the following question:
*Given a survey of cellphone ownership of 100 Canadians, what proportion
-of the entire Canadian population own Android phones?*
+of the entire Canadian population own Android phones?*
Estimation is covered in Chapter \@ref(inference).
-Referring to Table \@ref(tab:questions-table), our question about
+Referring to Table \@ref(tab:questions-table), our question about
Aboriginal languages is an example of a *descriptive question*: we are
summarizing the characteristics of a data set without further interpretation.
And referring to the list above, it looks like we should use visualization
and perhaps some summarization to answer the question. So in the remainder
-of this chapter, we will work towards making a visualization that shows
+of this chapter, we will work towards making a visualization that shows
us the ten most common Aboriginal languages in Canada and their associated counts,
-according to the 2016 census.
+according to the 2016 census.
## Loading a tabular data set
A data set is, at its core, a structured collection of numbers and characters.
-Aside from that, there are really no strict rules; data sets can come in
+Aside from that, there are really no strict rules; data sets can come in
many different forms! Perhaps the most common form of data set that you will
find in the wild, however, is *tabular data*\index{tabular data}. Think spreadsheets in Microsoft Excel: tabular data are
rectangular-shaped and spreadsheet-like, as shown in Figure
@@ -172,7 +173,7 @@ R, it is represented as a *data frame* object\index{data frame!overview}. Figure
to a spreadsheet. We refer to the rows as \index{observation} **observations**; these are the things that we
collect the data on, e.g., voters, cities, etc. We refer to the columns as \index{variable}
**variables**; these are the characteristics of those observations, e.g., voters' political
-affiliations, cities' populations, etc.
+affiliations, cities' populations, etc.
```{r img-spreadsheet-vs-dataframe, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "A spreadsheet versus a data frame in R.", out.width="100%", fig.retina = 2}
knitr::include_graphics("img/intro/spreadsheet_vs_dataframe.png")
@@ -182,7 +183,7 @@ The first kind of data file that we will learn how to load into R as a data
frame is the *comma-separated values* format (`.csv` for short)\index{comma-separated values|see{csv}}\index{csv}. These files
have names ending in `.csv`, and can be opened and saved using common
spreadsheet programs like Microsoft Excel and Google Sheets. For example, the
-`.csv` file named `can_lang.csv`
+`.csv` file named `can_lang.csv`
is included with [the code for this book](https://github.com/UBC-DSCI/introduction-to-datascience/tree/master/data).
If we were to open this data in a plain text editor (a program like Notepad that just shows
text with no formatting), we would see each row on its own line, and each entry in the table separated by a comma:
@@ -204,7 +205,7 @@ To load this data into R so that we can do things with it (e.g., perform
analyses or create data visualizations), we will need to use a *function.* \index{function} A
function is a special word in R that takes instructions (we call these
*arguments*) \index{argument} and does something. The function we will use to load a `.csv` file
-into R is called `read_csv`. \index{read function!read\_csv} In its most basic
+into R is called `read_csv`. \index{read function!read\_csv} In its most basic
use-case, `read_csv` expects that the data file:
- has column names (or *headers*),
@@ -216,14 +217,14 @@ Below you'll see the code used to load the data into R using the `read_csv`
function. Note that the `read_csv` function is not included in the base
installation of R, meaning that it is not one of the primary functions ready to
use when you install R. Therefore, you need to load it from somewhere else
-before you can use it. The place from which we will load it is called an R *package*.
+before you can use it. The place from which we will load it is called an R *package*.
An R package \index{package} is a collection of functions that can be used in addition to the
built-in R package functions once loaded. The `read_csv` function, in
-particular, can be made accessible by loading
+particular, can be made accessible by loading
[the `tidyverse` R package](https://tidyverse.tidyverse.org/) [@tidyverse; @wickham2019tidverse]
using the `library` function. \index{library} The `tidyverse` \index{tidyverse} package contains many
-functions that we will use throughout this book to load, clean, wrangle,
-and visualize data.
+functions that we will use throughout this book to load, clean, wrangle,
+and visualize data.
```{r load-tidyverse, message = TRUE, warning = TRUE}
library(tidyverse)
@@ -256,7 +257,7 @@ code to distinguish it from the special words (like functions!) that make up the
language. The file's name is the only argument we need to provide because our
file satisfies everything else that the `read_csv` function expects in the default
use-case. Figure \@ref(fig:img-read-csv) describes how we use the `read_csv`
-to read data into R.
+function to read data into R.
(ref:img-read-csv) Syntax for the `read_csv` function.
@@ -270,8 +271,8 @@ read_csv("data/can_lang.csv")
```
> **Note:** There is another function
-> that also loads csv files named `read.csv`. We will *always* use
-> `read_csv` in this book, as it is designed to play nicely with all of the
+> that also loads csv files named `read.csv`. We will *always* use
+> `read_csv` in this book, as it is designed to play nicely with all of the
> other `tidyverse` functions, which we will use extensively. Be
> careful not to accidentally use `read.csv`, as it can cause some tricky
> errors to occur in your code that are hard to track down!
@@ -279,14 +280,14 @@ read_csv("data/can_lang.csv")
## Naming things in R
When we loaded the 2016 Canadian census language data
-using `read_csv`, we did not give this data frame a name.
-Therefore the data was just printed on the screen,
-and we cannot do anything else with it. That isn't very useful.
-What would be more useful would be to give a name
-to the data frame that `read_csv` outputs,
+using `read_csv`, we did not give this data frame a name.
+Therefore the data was just printed on the screen,
+and we cannot do anything else with it. That isn't very useful.
+What would be more useful would be to give a name
+to the data frame that `read_csv` outputs,
so that we can refer to it later for analysis and visualization.
-The way to assign a name to a value in R is via the *assignment symbol* `<-`.
+The way to assign a name to a value in R is via the *assignment symbol* `<-`.
\index{aaaassignsymb@\texttt{<-}|see{assignment symbol}}\index{assignment symbol}
On the left side of the assignment symbol you put the name that you want
to use, and on the right side of the assignment symbol
@@ -300,17 +301,17 @@ and we set `name` to the string `"Alice"`. \index{string}
my_number <- 1 + 2
name <- "Alice"
```
-Note that when
-we name something in R using the assignment symbol, `<-`,
-we do not need to surround the name we are creating with quotes. This is
+Note that when
+we name something in R using the assignment symbol, `<-`,
+we do not need to surround the name we are creating with quotes. This is
because we are formally telling R that this special word denotes
the value of whatever is on the right-hand side.
Only characters and words that act as *values* on the right-hand side of the assignment
-symbol—e.g., the file name `"data/can_lang.csv"` that we specified before, or `"Alice"` above—need
+symbol—e.g., the file name `"data/can_lang.csv"` that we specified before, or `"Alice"` above—need
to be surrounded by quotes.
After making the assignment, we can use the special name words we have created in
-place of their values. For example, if we want to do something with the value `3` later on,
+place of their values. For example, if we want to do something with the value `3` later on,
we can just use `my_number` instead. Let's try adding 2 to `my_number`; you will see that
R just interprets this as adding 3 and 2:
```{r naming-things2}
@@ -329,7 +330,7 @@ my-number <- 1
Error in my - number <- 1 : object 'my' not found
```
-There are certain conventions for naming objects in R.
+There are certain conventions for naming objects in R.
When naming \index{object!naming convention} an object we
suggest using only lowercase letters, numbers and underscores `_` to separate
the words in a name. R is case sensitive, which means that `Letter` and
@@ -340,19 +341,19 @@ remember what each name in your code represents. We recommend following the
Tidyverse naming conventions outlined in the *Tidyverse Style Guide* [@tidyversestyleguide]. Let's
now use the assignment symbol to give the name
`can_lang` to the 2016 Canadian census language data frame that we get from
-`read_csv`.
+`read_csv`.
```{r load-data-with-name, message=FALSE}
can_lang <- read_csv("data/can_lang.csv")
```
Wait a minute, nothing happened this time! Where's our data?
-Actually, something did happen: the data was loaded in
-and now has the name `can_lang` associated with it.
-And we can use that name to access the data frame and do things with it.
-For example, we can type the name of the data frame to print the first few rows
-on the screen. You will also see at the top that the number of observations (i.e., rows) and
-variables (i.e., columns) are printed. Printing the first few rows of a data frame
+Actually, something did happen: the data was loaded in
+and now has the name `can_lang` associated with it.
+And we can use that name to access the data frame and do things with it.
+For example, we can type the name of the data frame to print the first few rows
+on the screen. You will also see at the top that the number of observations (i.e., rows) and
+variables (i.e., columns) are printed. Printing the first few rows of a data frame
like this is a handy way to get a quick sense for what is contained in a data frame.
```{r print}
@@ -363,12 +364,12 @@ can_lang
Now that we've loaded our data into R, we can start wrangling the data to
find the ten Aboriginal languages that were most often reported
-in 2016 as mother tongues in Canada. In particular, we will construct
-a table with the ten Aboriginal languages that have the largest
-counts in the `mother_tongue` column.
+in 2016 as mother tongues in Canada. In particular, we will construct
+a table with the ten Aboriginal languages that have the largest
+counts in the `mother_tongue` column.
The `filter` and `select` functions from the `tidyverse` package will help us
here. The `filter` \index{filter} function allows you to obtain a subset of the
-rows with specific values, while the `select` \index{select} function allows you
+rows with specific values, while the `select` \index{select} function allows you
to obtain a subset of the columns. Therefore, we can `filter` the rows to extract the
Aboriginal languages in the data set, and then use `select` to obtain only the
columns we want to include in our table.
@@ -377,8 +378,8 @@ columns we want to include in our table.
Looking at the `can_lang` data above, we see the `category` column contains different
high-level categories of languages, which include "Aboriginal languages",
"Non-Official & Non-Aboriginal languages" and "Official languages". To answer
-our question we want to filter our data set so we restrict our attention
-to only those languages in the "Aboriginal languages" category.
+our question we want to filter our data set so we restrict our attention
+to only those languages in the "Aboriginal languages" category.
We can use the `filter` \index{filter} function to obtain the subset of rows with desired
values from a data frame. Figure \@ref(fig:img-filter) outlines what arguments we need to specify to use `filter`. The first argument to `filter` is the name of the data frame
@@ -386,7 +387,7 @@ object, `can_lang`. The second argument is a *logical statement* \index{logical
filtering the rows. A logical statement evaluates to either `TRUE` or `FALSE`;
`filter` keeps only those rows for which the logical statement evaluates to `TRUE`.
For example, in our analysis, we are interested in keeping only languages in the
-"Aboriginal languages" higher-level category. We can use
+"Aboriginal languages" higher-level category. We can use
the *equivalency operator* `==` \index{logical statement!equivalency operator} to compare the values
of the `category` column with the value `"Aboriginal languages"`; you will learn about
many other kinds of logical statements in Chapter \@ref(wrangling). Similar to
@@ -394,7 +395,7 @@ when we loaded the data file and put quotes around the file name, here we need
to put quotes around `"Aboriginal languages"`. Using quotes tells R that this
is a string *value* \index{string} and not one of the special words that make up the R
programming language, or one of the names we have given to data frames in the
-code we have already written.
+code we have already written.
(ref:img-filter) Syntax for the `filter` function.
@@ -405,7 +406,7 @@ image_read("img/intro/filter_function.jpeg") |>
With these arguments, `filter` returns a data frame that has all the columns of
the input data frame, but only those rows we asked for in our logical filter
-statement.
+statement.
```{r}
aboriginal_lang <- filter(can_lang, category == "Aboriginal languages")
@@ -415,7 +416,7 @@ It's good practice to check the output after using a
function in R. We can see the original `can_lang` data set contained 214 rows
with multiple kinds of `category`. The data frame
`aboriginal_lang` contains only 67 rows, and looks like it only contains languages in
-the "Aboriginal languages" in the `category` column. So it looks like the function
+the "Aboriginal languages" in the `category` column. So it looks like the function
gave us the result we wanted!
### Using `select` to extract columns
@@ -429,7 +430,7 @@ arguments are the column names that we want to select: `language` and
returns two columns (the `language` and `mother_tongue` columns that we asked
for) as a data frame. This code is also a great example of why being
able to name things in R is useful: you can see that we are using the
-result of our earlier `filter` step (which we named `aboriginal_lang`) here
+result of our earlier `filter` step (which we named `aboriginal_lang`) here
in the next step of the analysis!
(ref:img-select) Syntax for the `select` function.
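As a minimal sketch of the `select` step described above (the data frame `aboriginal_lang` and the column names `language` and `mother_tongue` all come from the surrounding text), the call looks something like this:

```{r}
# keep only the language name and the mother tongue count
selected_lang <- select(aboriginal_lang, language, mother_tongue)
selected_lang
```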
@@ -457,12 +458,12 @@ rescue! \index{arrange}\index{slice}
The `arrange` function allows us to order the rows of a data frame by the
values of a particular column. Figure \@ref(fig:img-arrange) details what arguments we need to specify to
use the `arrange` function. We need to pass the data frame as the first
-argument to this function, and the variable to order by as the second argument.
+argument to this function, and the variable to order by as the second argument.
Since we want to choose the ten Aboriginal languages most often reported as a mother tongue
language, we will use the `arrange` function to order the rows in our
`selected_lang` data frame by the `mother_tongue` column. We want to
arrange the rows in descending order (from largest to smallest),
-so we pass the column to the `desc` function before using it as an argument.
+so we pass the column to the `desc` function before using it as an argument.
(ref:img-arrange) Syntax for the `arrange` function.
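As a sketch of the `arrange` and `slice` steps described here (the intermediate name `arranged_lang` is just a placeholder; `ten_lang` is the name used later in the chapter), the code looks roughly like this:

```{r}
# placeholder name for the intermediate result of ordering the rows
arranged_lang <- arrange(selected_lang, desc(mother_tongue))

# keep only the first ten rows, i.e., the ten most often reported languages
ten_lang <- slice(arranged_lang, 1:10)
ten_lang
```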
@@ -501,16 +502,16 @@ data frame as its first argument, then specify the equation that computes the pe
in the second argument. By using a new variable name on the left hand side of the equation,
we will create a new column in the data frame; and if we use an existing name, we will
modify that variable. In this case, we will opt to
-create a new column called `mother_tongue_percent`.
+create a new column called `mother_tongue_percent`.
-```{r}
+```{r}
canadian_population <- 35151728
ten_lang_percent <- mutate(ten_lang, mother_tongue_percent = 100 * mother_tongue / canadian_population)
ten_lang_percent
```
The `ten_lang_percent` data frame shows that
-the ten Aboriginal languages in the `ten_lang` data frame were spoken
+the ten Aboriginal languages in the `ten_lang` data frame were spoken
as a mother tongue by between 0.008% and 0.18% of the Canadian population.
@@ -520,14 +521,14 @@ We have now answered our initial question by generating the `ten_lang` table!
Are we done? Well, not quite; tables are almost never the best way to present
the result of your analysis to your audience. Even the `ten_lang` table with
only two columns presents some difficulty: for example, you have to scrutinize
-the table quite closely to get a sense for the relative numbers of speakers of
-each language. When you move on to more complicated analyses, this issue only
-gets worse. In contrast, a *visualization* would convey this information in a much
-more easily understood format.
+the table quite closely to get a sense for the relative numbers of speakers of
+each language. When you move on to more complicated analyses, this issue only
+gets worse. In contrast, a *visualization* would convey this information in a much
+more easily understood format.
Visualizations are a great tool for summarizing information to help you
effectively communicate with your audience, and
creating effective data visualizations \index{visualization} is an essential component of any data
-analysis. In this section we will develop a visualization of the
+analysis. In this section we will develop a visualization of the
ten Aboriginal languages that were most often reported in 2016 as mother tongues in
Canada, as well as the number of people that speak each of them.
@@ -544,7 +545,7 @@ formally introduce tidy data in Chapter \@ref(wrangling).
We will make a bar plot to visualize our data. A bar plot \index{plot|see{visualization}}\index{visualization|see{ggplot}}\index{visualization!bar} is a chart where the
lengths of the bars represent certain values, like counts or proportions. We
will make a bar plot using the `mother_tongue` and `language` columns from our
-`ten_lang` data frame. To create a bar plot of these two variables using the
+`ten_lang` data frame. To create a bar plot of these two variables using the
`ggplot` function, we must specify the data frame, which variables
to put on the x and y axes, and what kind of plot to create. The `ggplot`
function and its common usage is illustrated in Figure \@ref(fig:img-ggplot).
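As a rough sketch of that call (before any axis labels or other refinements are added), the basic bar plot looks like this:

```{r}
# map language to the x axis, mother_tongue to the y axis, and draw bars
ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
  geom_bar(stat = "identity")
```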
@@ -589,7 +590,7 @@ Canadian Residents)" would be much more informative.
Adding additional layers \index{plot!layers} to our visualizations that we create in `ggplot` is
one common and easy way to improve and refine our data visualizations. New
layers are added to `ggplot` objects using the `+` symbol. For example, we can
-use the `xlab` (short for x axis label) and `ylab` (short for y axis label) functions
+use the `xlab` (short for x axis label) and `ylab` (short for y axis label) functions
to add layers where we specify meaningful
and informative labels for the x and y axes. \index{plot!axis labels} Again, since we are specifying
words (e.g. `"Mother Tongue (Number of Canadian Residents)"`) as arguments to
@@ -606,7 +607,7 @@ ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
ylab("Mother Tongue (Number of Canadian Residents)")
```
-The result is shown in Figure \@ref(fig:barplot-mother-tongue-labs).
+The result is shown in Figure \@ref(fig:barplot-mother-tongue-labs).
This is already quite an improvement! Let's tackle the next major issue with the visualization
in Figure \@ref(fig:barplot-mother-tongue-labs): the overlapping x axis labels, which are
currently making it difficult to read the different language names.
@@ -620,14 +621,14 @@ ggplot(ten_lang, aes(x = mother_tongue, y = language)) +
ylab("Language")
```
-Another big step forward, as shown in Figure \@ref(fig:barplot-mother-tongue-flipped)! There
+Another big step forward, as shown in Figure \@ref(fig:barplot-mother-tongue-flipped)! There
are no more serious issues with the visualization. Now it is time to refine
the visualization to make it even better suited to answering the question
we asked earlier in this chapter. For example, the visualization could be made more transparent by
organizing the bars according to the number of Canadian residents reporting
each language, rather than in alphabetical order. We can reorder the bars using
the `reorder` \index{reorder} function, which orders a variable (here `language`) based on the
-values of the second variable (`mother_tongue`).
+values of the second variable (`mother_tongue`).
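A sketch of the reordered plot described here is shown below; it mirrors the full code block given at the end of the chapter.

```{r}
# order the bars by the number of Canadian residents reporting each language
ggplot(ten_lang, aes(x = mother_tongue,
                     y = reorder(language, mother_tongue))) +
  geom_bar(stat = "identity") +
  xlab("Mother Tongue (Number of Canadian Residents)") +
  ylab("Language")
```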
\newpage
@@ -650,7 +651,7 @@ n.o.s. with over 60,000 Canadian residents reporting it as their mother tongue.
> Cree languages include the following categories: Cree n.o.s., Swampy Cree,
> Plains Cree, Woods Cree, and a 'Cree not included elsewhere' category (which
> includes Moose Cree, Northern East Cree and Southern East Cree)
-> [@language2016].
+> [@language2016].
### Putting it all together
@@ -658,11 +659,11 @@ In the block of code below, we put everything from this chapter together, with a
modifications. In particular, we have actually skipped the
`select` step that we did above; since you specify the variable names to plot
in the `ggplot` function, you don't actually need to `select` the columns in advance
-when creating a visualization. We have also provided *comments* next to
+when creating a visualization. We have also provided *comments* next to
many of the lines of code below using the
-hash symbol `#`. When R sees a `#` sign, \index{comment} \index{aaacommentsymb@\#|see{comment}} it
+hash symbol `#`. When R sees a `#` sign, \index{comment} \index{aaacommentsymb@\#|see{comment}} it
will ignore all of the text that
-comes after the symbol on that line. So you can use comments to explain lines
+comes after the symbol on that line. So you can use comments to explain lines
of code for others, and perhaps more importantly, your future self!
It's good practice to get in the habit of
commenting your code to improve its readability.
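As a tiny illustration (the comment text here is ours, not part of the original analysis):

```{r}
# R ignores everything after the # symbol, so this entire line is skipped
canadian_population <- 35151728  # a comment can also follow code on the same line
```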
@@ -672,7 +673,7 @@ performed an entire data science workflow with a highly effective data
visualization! We asked a question, loaded the data into R, wrangled the data
(using `filter`, `arrange` and `slice`) and created a data visualization to
help answer our question. In this chapter, you got a quick taste of the data
-science workflow; continue on with the next few chapters to learn each of
+science workflow; continue on with the next few chapters to learn each of
these steps in much more detail!
```{r nachos-to-cheesecake, fig.width=5, fig.height=3, warning=FALSE, message=FALSE, fig.cap = "Putting it all together: bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue."}
@@ -691,35 +692,35 @@ ggplot(ten_lang, aes(x = mother_tongue,
y = reorder(language, mother_tongue))) +
geom_bar(stat = "identity") +
xlab("Mother Tongue (Number of Canadian Residents)") +
- ylab("Language")
+ ylab("Language")
```
## Accessing documentation
-There are many R functions in the `tidyverse` package (and beyond!), and
+There are many R functions in the `tidyverse` package (and beyond!), and
nobody can be expected to remember what every one of them does
-or all of the arguments we have to give them. Fortunately, R provides
-the `?` symbol, which
+or all of the arguments we have to give them. Fortunately, R provides
+the `?` symbol, which
\index{aaaquestionmark@?|see{documentation}}
\index{help|see{documentation}}
-\index{documentation} provides an easy way to pull up the documentation for
-most functions quickly. To use the `?` symbol to access documentation, you
+\index{documentation} provides an easy way to pull up the documentation for
+most functions quickly. To use the `?` symbol to access documentation, you
just put the name of the function you are curious about after the `?` symbol.
For example, if you had forgotten what the `filter` function
did or exactly what arguments to pass in, you could run the following
-code:
+code:
```
?filter
```
Figure \@ref(fig:01-help) shows the documentation that will pop up,
-including a high-level description of the function, its arguments,
+including a high-level description of the function, its arguments,
a description of each, and more. Note that you may find some of the
-text in the documentation a bit too technical right now
+text in the documentation a bit too technical right now
(for example, what is `dbplyr`, and what is a lazy data frame?).
Fear not: as you work through this book, many of these terms will be introduced
-to you, and slowly but surely you will become more adept at understanding and navigating
+to you, and slowly but surely you will become more adept at understanding and navigating
documentation like that shown in Figure \@ref(fig:01-help). But do keep in mind that the documentation
is not written to *teach* you about a function; it is just there as a reference to *remind*
you about the different arguments and usage of functions that you have already learned about elsewhere.
@@ -733,8 +734,8 @@ knitr::include_graphics("img/intro/help-filter.png")
## Exercises
-Practice exercises for the material covered in this chapter
-can be found in the accompanying
+Practice exercises for the material covered in this chapter
+can be found in the accompanying
[worksheets repository](https://worksheets.datasciencebook.ca)
in the "R and the tidyverse" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
diff --git a/source/jupyter.Rmd b/source/jupyter.Rmd
index e8d77b149..77bbea022 100644
--- a/source/jupyter.Rmd
+++ b/source/jupyter.Rmd
@@ -6,7 +6,7 @@ library(magrittr)
library(knitr)
library(fontawesome)
-knitr::opts_chunk$set(message = FALSE,
+knitr::opts_chunk$set(message = FALSE,
fig.align = "center")
play <- function(){
@@ -47,14 +47,14 @@ circle <- function(){
A typical data analysis involves not only writing and executing code, but also writing text and displaying images
that help tell the story of the analysis. In fact, ideally, we would like to *interleave* these three media,
with the text and images serving as narration for the code and its output.
-In this chapter we will show you how to accomplish this using Jupyter notebooks, a common coding platform in
+In this chapter we will show you how to accomplish this using Jupyter notebooks, a common coding platform in
data science. Jupyter notebooks do precisely what we need: they let you combine text, images, and (executable!) code in a single
document. In this chapter, we will focus on the *use* of Jupyter notebooks to program in R and write
-text via a web interface.
+text via a web interface.
These skills are essential to getting your analysis running; think of it like getting dressed in the morning!
Note that we assume that you already have Jupyter set up and ready to use. If that is not the case, please first read
Chapter \@ref(setup) to learn how to install and configure Jupyter on your own
-computer.
+computer.
## Chapter learning objectives
@@ -68,17 +68,17 @@ By the end of the chapter, readers will be able to do the following:
## Jupyter
-Jupyter is a web-based interactive development environment for creating, editing,
-and executing documents called Jupyter notebooks. Jupyter notebooks \index{Jupyter notebook} are
-documents that contain a mix of computer code (and its output) and formattable
-text. Given that they combine these two analysis artifacts in a single
-document—code is not separate from the output or written report—notebooks are
-one of the leading tools to create reproducible data analyses. Reproducible data
-analysis \index{reproducible} is one where you can reliably and easily re-create the same results when
-analyzing the same data. Although this sounds like something that should always
-be true of any data analysis, in reality, this is not often the case; one needs
+Jupyter is a web-based interactive development environment for creating, editing,
+and executing documents called Jupyter notebooks. Jupyter notebooks \index{Jupyter notebook} are
+documents that contain a mix of computer code (and its output) and formattable
+text. Given that they combine these two analysis artifacts in a single
+document—code is not separate from the output or written report—notebooks are
+one of the leading tools to create reproducible data analyses. Reproducible data
+analysis \index{reproducible} is one where you can reliably and easily re-create the same results when
+analyzing the same data. Although this sounds like something that should always
+be true of any data analysis, in reality, this is not often the case; one needs
to make a conscious effort to perform data analysis in a reproducible manner.
-An example of what a Jupyter notebook looks like is shown in
+An example of what a Jupyter notebook looks like is shown in
Figure \@ref(fig:img-jupyter).
```{r img-jupyter, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "A screenshot of a Jupyter Notebook.", fig.retina = 2, out.width="100%"}
@@ -87,60 +87,60 @@ knitr::include_graphics("img/jupyter/jupyter.png")
### Accessing Jupyter
-One of the easiest ways to start working with Jupyter is to use a
-web-based platform called \index{JupyterHub} JupyterHub. JupyterHubs often have Jupyter, R, a number of R
-packages, and collaboration tools installed, configured and ready to use.
+One of the easiest ways to start working with Jupyter is to use a
+web-based platform called \index{JupyterHub} JupyterHub. JupyterHubs often have Jupyter, R, a number of R
+packages, and collaboration tools installed, configured and ready to use.
JupyterHubs are usually created and provisioned by organizations,
and require authentication to gain access. For example, if you are reading
this book as part of a course, your instructor may have a JupyterHub
-already set up for you to use! Jupyter can also be installed on your
+already set up for you to use! Jupyter can also be installed on your
own computer; see Chapter \@ref(setup) for instructions.
## Code cells
-The sections of a Jupyter notebook that contain code are referred to as code cells.
-A code cell \index{Jupyter notebook!code cell} that has not yet been
-executed has no number inside the square brackets to the left of the cell
-(Figure \@ref(fig:code-cell-not-run)). Running a code cell will execute all of
+The sections of a Jupyter notebook that contain code are referred to as code cells.
+A code cell \index{Jupyter notebook!code cell} that has not yet been
+executed has no number inside the square brackets to the left of the cell
+(Figure \@ref(fig:code-cell-not-run)). Running a code cell will execute all of
the code it contains, and the output (if any exists) will be displayed directly
-underneath the code that generated it. Outputs may include printed text or
-numbers, data frames and data visualizations. Cells that have been executed
-also have a number inside the square brackets to the left of the cell.
-This number indicates the order in which the cells were run
+underneath the code that generated it. Outputs may include printed text or
+numbers, data frames and data visualizations. Cells that have been executed
+also have a number inside the square brackets to the left of the cell.
+This number indicates the order in which the cells were run
(Figure \@ref(fig:code-cell-run)).
```{r code-cell-not-run, echo = FALSE, fig.cap = "A code cell in Jupyter that has not yet been executed.", fig.retina = 2, out.width="100%"}
-image_read("img/jupyter/code-cell-not-run.png") |>
+image_read("img/jupyter/code-cell-not-run.png") |>
image_crop("3632x1000")
```
```{r code-cell-run, echo = FALSE, fig.cap = "A code cell in Jupyter that has been executed.", fig.retina = 2, out.width="100%"}
-image_read("img/jupyter/code-cell-run.png") |>
+image_read("img/jupyter/code-cell-run.png") |>
image_crop("3632x2000")
```
### Executing code cells
-Code cells \index{Jupyter notebook!cell execution} can be run independently or as part of executing the entire notebook
-using one of the "**Run all**" commands found in the **Run** or **Kernel** menus
-in Jupyter. Running a single code cell independently is a workflow typically
-used when editing or writing your own R code. Executing an entire notebook is a
-workflow typically used to ensure that your analysis runs in its entirety before
-sharing it with others, and when using a notebook as part of an automated
+Code cells \index{Jupyter notebook!cell execution} can be run independently or as part of executing the entire notebook
+using one of the "**Run all**" commands found in the **Run** or **Kernel** menus
+in Jupyter. Running a single code cell independently is a workflow typically
+used when editing or writing your own R code. Executing an entire notebook is a
+workflow typically used to ensure that your analysis runs in its entirety before
+sharing it with others, and when using a notebook as part of an automated
process.
To run a code cell independently, the cell needs to first be activated. This
is done by clicking on it with the cursor. Jupyter will indicate a cell has been
activated by highlighting it with a blue rectangle to its left. After the cell
-has been activated (Figure \@ref(fig:activate-and-run-button)), the cell can be run
+has been activated (Figure \@ref(fig:activate-and-run-button)), the cell can be run
by either pressing the **Run** (`r play()`).
button in the toolbar, or by using the keyboard shortcut `Shift + Enter`.
```{r activate-and-run-button, echo = FALSE, fig.cap = "An activated cell that is ready to be run. The blue rectangle to the cell's left (annotated by a red arrow) indicates that it is ready to be run. The cell can be run by clicking the run button (circled in red).", fig.retina = 2, out.width="100%"}
-image_read("img/jupyter/activate-and-run-button-annotated.png") |>
+image_read("img/jupyter/activate-and-run-button-annotated.png") |>
image_crop("3632x900")
```
@@ -162,21 +162,21 @@ then running all cells (options 2 or 3) emulates how your notebook code would
run if you completely restarted Jupyter before executing your entire notebook.
```{r restart-kernel-run-all, echo = FALSE, fig.cap = "Restarting the R session can be accomplished by clicking Restart Kernel and Run All Cells...", fig.retina = 2, out.width="100%"}
-image_read("img/jupyter/restart-kernel-run-all.png") |>
+image_read("img/jupyter/restart-kernel-run-all.png") |>
image_crop("3632x900")
```
### The Kernel
-The kernel\index{kernel}\index{Jupyter notebook!kernel|see{kernel}} is a program that executes the code inside your notebook and
-outputs the results. Kernels for many different programming languages have
-been created for Jupyter, which means that Jupyter can interpret and execute
-the code of many different programming languages. To run R code, your notebook
-will need an R kernel. In the top right of your window, you can see a circle
+The kernel\index{kernel}\index{Jupyter notebook!kernel|see{kernel}} is a program that executes the code inside your notebook and
+outputs the results. Kernels for many different programming languages have
+been created for Jupyter, which means that Jupyter can interpret and execute
+the code of many different programming languages. To run R code, your notebook
+will need an R kernel. In the top right of your window, you can see a circle
that indicates the status of your kernel. If the circle is empty (`r circle()`)
the kernel is idle and ready to execute code. If the circle is filled in (`r filled_circle()`)
the kernel is busy running some code.
-You may run into problems where your kernel \index{kernel!interrupt, restart} is stuck for an excessive amount
+You may run into problems where your kernel \index{kernel!interrupt, restart} is stuck for an excessive amount
of time, your notebook is very slow and unresponsive, or your kernel loses its
connection. If this happens, try the following steps:
@@ -186,25 +186,25 @@ connection. If this happens, try the following steps:
### Creating new code cells
-To create a new code cell in Jupyter (Figure \@ref(fig:create-new-code-cell)), click the `+` button in the
-toolbar. By default, all new cells in Jupyter start out as code cells,
-so after this, all you have to do is write R code within the new cell you just
+To create a new code cell in Jupyter (Figure \@ref(fig:create-new-code-cell)), click the `+` button in the
+toolbar. By default, all new cells in Jupyter start out as code cells,
+so after this, all you have to do is write R code within the new cell you just
created!
```{r create-new-code-cell, echo = FALSE, fig.cap = "New cells can be created by clicking the + button, and are by default code cells.", fig.retina = 2, out.width="100%"}
-image_read("img/jupyter/create-new-code-cell.png") |>
+image_read("img/jupyter/create-new-code-cell.png") |>
image_crop("3632x900")
```
## Markdown cells
-Text cells inside a Jupyter notebook are\index{markdown}\index{Jupyter notebook!markdown cell} called Markdown cells. Markdown cells
-are rich formatted text cells, which means you can **bold** and *italicize*
-text, create subject headers, create bullet and numbered lists, and more. These cells are
+Text cells inside a Jupyter notebook are\index{markdown}\index{Jupyter notebook!markdown cell} called Markdown cells. Markdown cells
+are richly formatted text cells, which means you can **bold** and *italicize*
+text, create subject headers, create bullet and numbered lists, and more. These cells are
given the name "Markdown" because they use *Markdown language* to specify the rich text formatting.
-You do not need to learn Markdown to write text in the Markdown cells in
-Jupyter; plain text will work just fine. However, you might want to learn a bit
-about Markdown eventually to enable you to create nicely formatted analyses.
+You do not need to learn Markdown to write text in the Markdown cells in
+Jupyter; plain text will work just fine. However, you might want to learn a bit
+about Markdown eventually to enable you to create nicely formatted analyses.
See the additional resources at the end of this chapter to find out
where you can start learning Markdown.
@@ -212,85 +212,85 @@ where you can start learning Markdown.
To edit a Markdown cell in Jupyter, you need to double click on the cell. Once
you do this, the unformatted (or *unrendered*) version of the text will be
-shown (Figure \@ref(fig:markdown-cell-not-run)). You
+shown (Figure \@ref(fig:markdown-cell-not-run)). You
can then use your keyboard to edit the text. To view the formatted
-(or *rendered*) text (Figure \@ref(fig:markdown-cell-run)), click the **Run** (`r play()`) button in the toolbar,
+(or *rendered*) text (Figure \@ref(fig:markdown-cell-run)), click the **Run** (`r play()`) button in the toolbar,
or use the `Shift + Enter` keyboard shortcut.
```{r markdown-cell-not-run, echo = FALSE, fig.cap = "A Markdown cell in Jupyter that has not yet been rendered and can be edited.", fig.retina = 2, out.width="100%"}
-image_read("img/jupyter/markdown-cell-not-run.png") |>
+image_read("img/jupyter/markdown-cell-not-run.png") |>
image_crop("3632x900")
```
```{r markdown-cell-run, echo = FALSE, fig.cap = "A Markdown cell in Jupyter that has been rendered and exhibits rich text formatting. ", fig.retina = 2, out.width="100%"}
-image_read("img/jupyter/markdown-cell-run.png") |>
+image_read("img/jupyter/markdown-cell-run.png") |>
image_crop("3632x900")
```
### Creating new Markdown cells
-To create a new Markdown cell in Jupyter, click the `+` button in the toolbar.
-By default, all new cells in Jupyter start as code cells, so
-the cell format needs to be changed to be recognized and rendered as a Markdown
-cell. To do this, click on the cell with your cursor to
+To create a new Markdown cell in Jupyter, click the `+` button in the toolbar.
+By default, all new cells in Jupyter start as code cells, so
+the cell format needs to be changed to be recognized and rendered as a Markdown
+cell. To do this, click on the cell with your cursor to
ensure it is activated. Then click on the drop-down box on the toolbar that says "Code" (it
is next to the `r fast_forward()` button), and change it from "**Code**" to "**Markdown**" (Figure \@ref(fig:convert-to-markdown-cell)).
```{r convert-to-markdown-cell, echo = FALSE, fig.cap = "New cells are by default code cells. To create Markdown cells, the cell format must be changed.", fig.retina = 2, out.width="100%"}
-image_read("img/jupyter/convert-to-markdown-cell.png") |>
+image_read("img/jupyter/convert-to-markdown-cell.png") |>
image_crop("3632x900")
```
## Saving your work
-As with any file you work on, it is critical to save your work often so you
-don't lose your progress! Jupyter has an autosave feature, where open files are
-saved periodically. The default for this is every two minutes. You can also
-manually save a Jupyter notebook by selecting **Save Notebook** from the
+As with any file you work on, it is critical to save your work often so you
+don't lose your progress! Jupyter has an autosave feature, where open files are
+saved periodically. The default for this is every two minutes. You can also
+manually save a Jupyter notebook by selecting **Save Notebook** from the
**File** menu, by clicking the disk icon on the toolbar,
or by using a keyboard shortcut (`Control + S` for Windows, or `Command + S` for
-Mac OS).
+Mac OS).
## Best practices for running a notebook
### Best practices for executing code cells
As you might know (or at least imagine) by now, Jupyter notebooks are great for
-interactively editing, writing and running R code; this is what they were
-designed for! Consequently, Jupyter notebooks are flexible in regards to code
-cell execution order. This flexibility means that code cells can be run in any
-arbitrary order using the **Run** (`r play()`) button. But this flexibility has a downside:
-it can lead to Jupyter notebooks whose code cannot be executed in a linear
-order (from top to bottom of the notebook). A nonlinear notebook is problematic
-because a linear order is the conventional way code documents are run, and
-others will have this expectation when running your notebook. Finally, if the
-code is used in some automated process, it will need to run in a linear order,
+interactively editing, writing and running R code; this is what they were
+designed for! Consequently, Jupyter notebooks are flexible with regard to code
+cell execution order. This flexibility means that code cells can be run in any
+arbitrary order using the **Run** (`r play()`) button. But this flexibility has a downside:
+it can lead to Jupyter notebooks whose code cannot be executed in a linear
+order (from top to bottom of the notebook). A nonlinear notebook is problematic
+because a linear order is the conventional way code documents are run, and
+others will have this expectation when running your notebook. Finally, if the
+code is used in some automated process, it will need to run in a linear order,
from top to bottom of the notebook. \index{Jupyter notebook!best practices}
-The most common way to inadvertently create a nonlinear notebook is to rely solely
-on using the `r play()` button to execute cells. For example,
-suppose you write some R code that creates an R object, say a variable named
-`y`. When you execute that cell and create `y`, it will continue
-to exist until it is deliberately deleted with R code, or when the Jupyter
-notebook R session (*i.e.*, kernel) is stopped or restarted. It can also be
-referenced in another distinct code cell (Figure \@ref(fig:out-of-order-1)).
+The most common way to inadvertently create a nonlinear notebook is to rely solely
+on using the `r play()` button to execute cells. For example,
+suppose you write some R code that creates an R object, say a variable named
+`y`. When you execute that cell and create `y`, it will continue
+to exist until it is deliberately deleted with R code, or when the Jupyter
+notebook R session (*i.e.*, kernel) is stopped or restarted. It can also be
+referenced in another distinct code cell (Figure \@ref(fig:out-of-order-1)).
Together, this means that you could then write a code cell further above in the
-notebook that references `y` and execute it without error in the current session
-(Figure \@ref(fig:out-of-order-2)). This could also be done successfully in
-future sessions if, and only if, you run the cells in the same unconventional
-order. However, it is difficult to remember this unconventional order, and it
-is not the order that others would expect your code to be executed in. Thus, in
-the future, this would lead
-to errors when the notebook is run in the conventional
+notebook that references `y` and execute it without error in the current session
+(Figure \@ref(fig:out-of-order-2)). This could also be done successfully in
+future sessions if, and only if, you run the cells in the same unconventional
+order. However, it is difficult to remember this unconventional order, and it
+is not the order that others would expect your code to be executed in. Thus, in
+the future, this would lead
+to errors when the notebook is run in the conventional
linear order (Figure \@ref(fig:out-of-order-3)).
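To make this concrete, here is a small hypothetical example (the variable `y` is from the scenario above, but its value and the arithmetic are purely illustrative):

```
# Suppose this assignment lives in a cell near the *bottom* of the notebook
# (the value 10 is illustrative):
y <- 10

# ...and a cell near the *top* of the notebook references it:
y * 2

# Run interactively in bottom-then-top order, both cells succeed. But under
# "Restart Kernel and Run All Cells...", the top cell runs before y exists,
# and R stops with an "object 'y' not found" error.
```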
```{r out-of-order-1, echo = FALSE, fig.cap = "Code that was written out of order, but not yet executed.", fig.retina = 2, out.width="100%"}
-image_read("img/jupyter/out-of-order-1.png") |>
+image_read("img/jupyter/out-of-order-1.png") |>
image_crop("3632x800")
```
```{r out-of-order-2, echo = FALSE, fig.cap = "Code that was written out of order, and was executed using the run button in a nonlinear order without error. The order of execution can be traced by following the numbers to the left of the code cells; their order indicates the order in which the cells were executed.", fig.retina = 2, out.width="100%"}
-image_read("img/jupyter/out-of-order-2.png") |>
+image_read("img/jupyter/out-of-order-2.png") |>
image_crop("3632x800")
```
@@ -298,156 +298,156 @@ image_read("img/jupyter/out-of-order-2.png") |>
(ref:out-of-order-3) Code that was written out of order, and was executed in a linear order using "Restart Kernel and Run All Cells..." This resulted in an error at the execution of the second code cell and it failed to run all code cells in the notebook.
```{r out-of-order-3, echo = FALSE, fig.cap = '(ref:out-of-order-3)', fig.retina = 2, out.width="100%"}
-image_read("img/jupyter/out-of-order-3.png") |>
+image_read("img/jupyter/out-of-order-3.png") |>
image_crop("3632x800")
```
You can also accidentally create a nonfunctioning notebook by
-creating an object in a cell that later gets deleted. In such a
-scenario, that object only exists for that one particular R session and will
-not exist once the notebook is restarted and run again. If that
-object was referenced in another cell in that notebook, an error
+creating an object in a cell that later gets deleted. In such a
+scenario, that object only exists for that one particular R session and will
+not exist once the notebook is restarted and run again. If that
+object was referenced in another cell in that notebook, an error
would occur when the notebook was run again in a new session.
-These events may not negatively affect the current R session when
-the code is being written; but as you might now see, they will likely lead to
-errors when that notebook is run in a future session. Regularly executing
-the entire notebook in a fresh R session will help guard
+These events may not negatively affect the current R session when
+the code is being written; but as you might now see, they will likely lead to
+errors when that notebook is run in a future session. Regularly executing
+the entire notebook in a fresh R session will help guard
against this. If you restart your session and new errors seem to pop up when
you run all of your cells in linear order, you can at least be aware that there
-is an issue. Knowing this sooner rather than later will allow you to
+is an issue. Knowing this sooner rather than later will allow you to
fix the issue and ensure your notebook can be run linearly from start to finish.
We recommend as a best practice to run the entire notebook in a fresh R session
at least 2–3 times within any period of work. Note that,
critically, you *must do this in a fresh R session* by restarting your kernel.
-We recommend using either the **Kernel** >>
+We recommend using either the **Kernel** >>
**Restart Kernel and Run All Cells...** command from the menu or the `r fast_forward()`
-button in the toolbar. Note that the **Run** >> **Run All Cells**
-menu item will not restart the kernel, and so it is not sufficient
+button in the toolbar. Note that the **Run** >> **Run All Cells**
+menu item will not restart the kernel, and so it is not sufficient
to guard against these errors.
### Best practices for including R packages in notebooks
-Most data analyses these days depend on functions from external R packages that
-are not built into R. One example is the `tidyverse` metapackage that we
-heavily rely on in this book. This package provides us access to functions like
-`read_csv` for reading data, `select` for subsetting columns, and `ggplot` for
-creating high-quality graphics.
-
-As mentioned earlier in the book, external R packages need to be loaded before
-the functions they contain can be used. Our recommended way to do this is via
-`library(package_name)`. But where should this line of code be written in a
-Jupyter notebook? One idea could be to load the library right before the
-function is used in the notebook. However, although this technically works, this
-causes hidden, or at least non-obvious, R package dependencies when others view
-or try to run the notebook. These hidden dependencies can lead to errors when
-the notebook is executed on another computer if the needed R packages are not
-installed. Additionally, if the data analysis code takes a long time to run,
-uncovering the hidden dependencies that need to be installed so that the
+Most data analyses these days depend on functions from external R packages that
+are not built into R. One example is the `tidyverse` metapackage that we
+heavily rely on in this book. This package provides us access to functions like
+`read_csv` for reading data, `select` for subsetting columns, and `ggplot` for
+creating high-quality graphics.
+
+As mentioned earlier in the book, external R packages need to be loaded before
+the functions they contain can be used. Our recommended way to do this is via
+`library(package_name)`. But where should this line of code be written in a
+Jupyter notebook? One idea could be to load the library right before the
+function is used in the notebook. Although this technically works, it
+causes hidden, or at least non-obvious, R package dependencies when others view
+or try to run the notebook. These hidden dependencies can lead to errors when
+the notebook is executed on another computer if the needed R packages are not
+installed. Additionally, if the data analysis code takes a long time to run,
+the hidden dependencies that need to be installed so that the
analysis can run without error can take a great deal of time to uncover.
-Therefore, we recommend you load all R packages in a code cell near the top of
-the Jupyter notebook. Loading all your packages at the start ensures that all
-packages are loaded before their functions are called, assuming the notebook is
-run in a linear order from top to bottom as recommended above. It also makes it
-easy for others viewing or running the notebook to see what external R packages
-are used in the analysis, and hence, what packages they should install on
+Therefore, we recommend you load all R packages in a code cell near the top of
+the Jupyter notebook. Loading all your packages at the start ensures that all
+packages are loaded before their functions are called, assuming the notebook is
+run in a linear order from top to bottom as recommended above. It also makes it
+easy for others viewing or running the notebook to see what external R packages
+are used in the analysis, and hence, what packages they should install on
their computer to run the analysis successfully.
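For instance, a first code cell following this recommendation might look something like the sketch below (the exact set of packages will of course depend on the analysis):

```
# load all packages used in the analysis near the top of the notebook;
# adjust this list to whatever packages your analysis actually uses
library(tidyverse)  # provides read_csv, select, ggplot, and more
```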
### Summary of best practices for running a notebook
1. Write code so that it can be executed in a linear order.
-2. As you write code in a Jupyter notebook, run the notebook in a linear order
-and in its entirety often (2–3 times every work session) via the **Kernel** >>
+2. As you write code in a Jupyter notebook, run the notebook in a linear order
+and in its entirety often (2–3 times every work session) via the **Kernel** >>
**Restart Kernel and Run All Cells...** command from the Jupyter menu or the `r fast_forward()`
button in the toolbar.
-3. Write the code that loads external R packages near the top of the Jupyter
+3. Write the code that loads external R packages near the top of the Jupyter
notebook.
## Exploring data files
It is essential to preview data files before you try to read them into R to see
-whether or not there are column names, what the delimiters are, and if there are
-lines you need to skip. In Jupyter, you preview data files stored as plain text
-files (e.g., comma- and tab-separated files) in their plain text format (Figure \@ref(fig:open-data-w-editor-2)) by
-right-clicking on the file's name in the Jupyter file explorer, selecting
-**Open with**, and then selecting **Editor** (Figure \@ref(fig:open-data-w-editor-1)).
-Suppose you do not specify to open
-the data file with an editor. In that case, Jupyter will render a nice table
-for you, and you will not be able to see the column delimiters, and therefore
-you will not know which function to use, nor which arguments to use and values
+whether or not there are column names, what the delimiters are, and if there are
+lines you need to skip. In Jupyter, you preview data files stored as plain text
+files (e.g., comma- and tab-separated files) in their plain text format (Figure \@ref(fig:open-data-w-editor-2)) by
+right-clicking on the file's name in the Jupyter file explorer, selecting
+**Open with**, and then selecting **Editor** (Figure \@ref(fig:open-data-w-editor-1)).
+If you do not open
+the data file with an editor, Jupyter will render a nice table
+for you, but you will not be able to see the column delimiters, and therefore
+you will not know which function to use, nor which arguments and values
to specify for them.
```{r open-data-w-editor-1, echo = FALSE, fig.cap = "Opening data files with an editor in Jupyter.", fig.retina = 2, out.width="100%"}
-image_read("img/jupyter/open_data_w_editor_01.png") |>
+image_read("img/jupyter/open_data_w_editor_01.png") |>
image_crop("3632x2000")
```
```{r open-data-w-editor-2, echo = FALSE, fig.cap = "A data file as viewed in an editor in Jupyter.", fig.retina = 2, out.width="100%"}
-image_read("img/jupyter/open_data_w_editor_02.png") |>
+image_read("img/jupyter/open_data_w_editor_02.png") |>
image_crop("3632x2000")
```
-## Exporting to a different file format
+## Exporting to a different file format
-In Jupyter, viewing, editing and running R code is done in the Jupyter notebook
-file format with \index{Jupyter notebook!export} file extension `.ipynb`. This file format is not easy to open and
-view outside of Jupyter. Thus, to share your analysis with people who do not
-commonly use Jupyter, it is recommended that you export your executed analysis
-as a more common file type, such as an `.html` file, or a `.pdf`. We recommend
-exporting the Jupyter notebook after executing the analysis so that you can
+In Jupyter, viewing, editing and running R code is done in the Jupyter notebook
+file format with \index{Jupyter notebook!export} file extension `.ipynb`. This file format is not easy to open and
+view outside of Jupyter. Thus, to share your analysis with people who do not
+commonly use Jupyter, it is recommended that you export your executed analysis
+as a more common file type, such as an `.html` file, or a `.pdf`. We recommend
+exporting the Jupyter notebook after executing the analysis so that you can
also share the outputs of your code. Note, however, that your audience will not be
able to *run* your analysis using a `.html` or `.pdf` file. If you want your audience
to be able to reproduce the analysis, you must provide them with the `.ipynb` Jupyter notebook file.
### Exporting to HTML
-Exporting to `.html` will result in a shareable file that anyone can open
+Exporting to `.html` will result in a shareable file that anyone can open
using a web browser (e.g., Firefox, Safari, Chrome, or Edge). The `.html`
-output will produce a document that is visually similar to what the Jupyter notebook
-looked like inside Jupyter. One point of caution here is that if there are
-images in your Jupyter notebook, you will need to share the image files and the
+output will produce a document that is visually similar to what the Jupyter notebook
+looked like inside Jupyter. One point of caution here is that if there are
+images in your Jupyter notebook, you will need to share the image files and the
`.html` file to see them.
### Exporting to PDF
-Exporting to `.pdf` will result in a shareable file that anyone can open
-using many programs, including Adobe Acrobat, Preview, web browsers and many
-more. The benefit of exporting to PDF is that it is a standalone document,
-even if the Jupyter notebook included references to image files.
-Unfortunately, the default settings will result in a document
-that visually looks quite different from what the Jupyter notebook looked
-like. The font, page margins, and other details will appear different in the `.pdf` output.
+Exporting to `.pdf` will result in a shareable file that anyone can open
+using many programs, including Adobe Acrobat, Preview, web browsers and many
+more. The benefit of exporting to PDF is that it is a standalone document,
+even if the Jupyter notebook included references to image files.
+Unfortunately, the default settings will result in a document
+that visually looks quite different from what the Jupyter notebook looked
+like. The font, page margins, and other details will appear different in the `.pdf` output.
## Creating a new Jupyter notebook
-At some point, you will want to create a new, fresh Jupyter notebook for your
-own project instead of viewing, running or editing a notebook that was started
-by someone else. To do this, navigate to the **Launcher** tab, and click on
-the R icon under the **Notebook** heading. If no **Launcher** tab is visible,
-you can get a new one via clicking the **+** button at the top of the Jupyter
-file explorer (Figure \@ref(fig:launcher)).
+At some point, you will want to create a new, fresh Jupyter notebook for your
+own project instead of viewing, running or editing a notebook that was started
+by someone else. To do this, navigate to the **Launcher** tab, and click on
+the R icon under the **Notebook** heading. If no **Launcher** tab is visible,
+you can get a new one by clicking the **+** button at the top of the Jupyter
+file explorer (Figure \@ref(fig:launcher)).
```{r launcher, echo = FALSE, fig.cap = "Clicking on the R icon under the Notebook heading will create a new Jupyter notebook with an R kernel.", fig.retina = 2, out.width="100%"}
-image_read("img/jupyter/launcher-annotated.png") |>
+image_read("img/jupyter/launcher-annotated.png") |>
image_crop("3632x2000")
```
-Once you have created a new Jupyter notebook, be sure to give it a descriptive
-name, as the default file name is `Untitled.ipynb`. You can rename files by
-first right-clicking on the file name of the notebook you just created, and
-then clicking **Rename**. This will make
-the file name editable. Use your keyboard to
-change the name. Pressing `Enter` or clicking anywhere else in the Jupyter
+Once you have created a new Jupyter notebook, be sure to give it a descriptive
+name, as the default file name is `Untitled.ipynb`. You can rename files by
+first right-clicking on the file name of the notebook you just created, and
+then clicking **Rename**. This will make
+the file name editable. Use your keyboard to
+change the name. Pressing `Enter` or clicking anywhere else in the Jupyter
interface will save the changed file name.
-We recommend not using white space or non-standard characters in file names.
-Doing so will not prevent you from using that file in Jupyter. However, these
-sorts of things become troublesome as you start to do more advanced data
-science projects that involve repetition and automation. We recommend naming
-files using lower case characters and separating words by a dash (`-`) or an
+We recommend not using white space or non-standard characters in file names.
+Doing so will not prevent you from using that file in Jupyter. However, these
+sorts of things become troublesome as you start to do more advanced data
+science projects that involve repetition and automation. We recommend naming
+files using lower case characters and separating words by a dash (`-`) or an
underscore (`_`).
## Additional resources
diff --git a/source/reading.Rmd b/source/reading.Rmd
index ff3d47f12..2fdeae677 100644
--- a/source/reading.Rmd
+++ b/source/reading.Rmd
@@ -12,7 +12,7 @@ print_html_nodes <- function(html_nodes_object) {
html_nodes_object
} else {
output <- capture.output(html_nodes_object)
-
+
for (i in seq_along(output)) {
if (nchar(output[i]) <= 79) {
cat(output[i], sep = "\n")
@@ -24,7 +24,7 @@ print_html_nodes <- function(html_nodes_object) {
}
```
-## Overview
+## Overview
In this chapter, you’ll learn to read tabular data of various formats into R
from your local device (e.g., your laptop) and the web. "Reading" (or "loading")
@@ -42,31 +42,31 @@ tied well before going for a run so that you don’t trip later on!
## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:
-- Define the following:
+- Define the following types of paths and use them to locate files:
- absolute file path
- relative file path
- - **U**niform **R**esource **L**ocator (URL)
-- Read data into R using a relative path and a URL.
-- Compare and contrast the following functions:
- - `read_csv`
+ - Uniform Resource Locator (URL)
+- Read data into R from various types of paths using:
+ - `read_csv`
- `read_tsv`
- `read_csv2`
- `read_delim`
- `read_excel`
-- Match the following `tidyverse` `read_*` function arguments to their descriptions:
- - `file`
+- Compare and contrast the `read_*` functions.
+- Describe when to use the following `read_*` function arguments:
+ - `skip`
- `delim`
- `col_names`
- - `skip`
- Choose the appropriate `tidyverse` `read_*` function and function arguments to load a given plain text tabular data set into R.
- Use the `rename` function to rename columns in a data frame.
-- Use `readxl` package's `read_excel` function and arguments to load a sheet from an excel file into R.
-- Connect to a database using the `DBI` package's `dbConnect` function.
-- List the tables in a database using the `DBI` package's `dbListTables` function.
-- Create a reference to a database table that is queriable using the `tbl` from the `dbplyr` package.
-- Retrieve data from a database query and bring it into R using the `collect` function from the `dbplyr` package.
+- Use the `read_excel` function and arguments to load a sheet from an Excel file into R.
+- Work with databases using functions from `dbplyr` and `DBI`:
+ - Connect to a database with `dbConnect`.
+ - List tables in the database with `dbListTables`.
+ - Create a reference to a database table with `tbl`.
+ - Bring data from a database into R using `collect`.
- Use `write_csv` to save a data frame to a `.csv` file.
-- (*Optional*) Obtain data using **a**pplication **p**rogramming **i**nterfaces (APIs) and web scraping.
+- (*Optional*) Obtain data from the web using scraping and application programming interfaces (APIs):
- Read HTML source code from a URL using the `rvest` package.
- Read data from the NASA "Astronomy Picture of the Day" API using the `httr2` package.
- Compare downloading tabular data from a plain text file (e.g., `.csv`), accessing data from an API, and scraping the HTML source code from a website.
@@ -77,14 +77,14 @@ This chapter will discuss the different functions we can use to import data
into R, but before we can talk about *how* we read the data into R with these
functions, we first need to talk about *where* the data lives. When you load a
data set into R, you first need to tell R where those files live. The file
-could live on your computer (*local*)
-\index{location|see{path}} \index{path!local, remote, relative, absolute}
-or somewhere on the internet (*remote*).
+could live on your computer (*local*)
+\index{location|see{path}} \index{path!local, remote, relative, absolute}
+or somewhere on the internet (*remote*).
The place where the file lives on your computer is referred to as its "path". You can
think of the path as directions to the file. There are two kinds of paths:
*relative* paths and *absolute* paths. A relative path indicates where the file is
-with respect to your *working directory* (i.e., "where you are currently") on the computer.
+with respect to your *working directory* (i.e., "where you are currently") on the computer.
On the other hand, an absolute path indicates where the file is
with respect to the computer's filesystem base (or *root*) folder, regardless of where you are working.
@@ -120,38 +120,38 @@ Note that there is no forward slash at the beginning of a relative path; if we a
R would look for a folder named `data` in the root folder of the computer—but that doesn't exist!
Aside from specifying places to go in a path using folder names (like `data` and `worksheet_02`), we can also specify two additional
-special places: the *current directory* and the *previous directory*. We indicate the current working directory with a single dot `.`, and
+special places: the *current directory* and the *previous directory*. We indicate the current working directory with a single dot `.`, and
the previous directory with two dots `..`. So for instance, if we wanted to reach the `bike_share.csv` file from the `worksheet_02` folder, we could
use the relative path `../tutorial_01/bike_share.csv`. We can even combine these two; for example, we could reach the `bike_share.csv` file using
-the (very silly) path `../tutorial_01/../tutorial_01/./bike_share.csv` with quite a few redundant directions: it says to go back a folder, then open `tutorial_01`,
+the (very silly) path `../tutorial_01/../tutorial_01/./bike_share.csv` with quite a few redundant directions: it says to go back a folder, then open `tutorial_01`,
then go back a folder again, then open `tutorial_01` again, then stay in the current directory, then finally get to `bike_share.csv`. Whew, what a long trip!
-So which kind of path should you use: relative, or absolute? Generally speaking, you should use relative paths.
-Using a relative path helps ensure that your code can be run
+So which kind of path should you use: relative, or absolute? Generally speaking, you should use relative paths.
+Using a relative path helps ensure that your code can be run
on a different computer (and as an added bonus, relative paths are often shorter—easier to type!).
This is because a file's relative path is often the same across different computers, while a
-file's absolute path (the names of
-all of the folders between the computer's root, represented by `/`, and the file) isn't usually the same
-across different computers. For example, suppose Fatima and Jayden are working on a
-project together on the `happiness_report.csv` data. Fatima's file is stored at
+file's absolute path (the names of
+all of the folders between the computer's root, represented by `/`, and the file) isn't usually the same
+across different computers. For example, suppose Fatima and Jayden are working on a
+project together on the `happiness_report.csv` data. Fatima's file is stored at
-`/home/Fatima/project/data/happiness_report.csv`,
+`/home/Fatima/project/data/happiness_report.csv`,
-while Jayden's is stored at
+while Jayden's is stored at
`/home/Jayden/project/data/happiness_report.csv`.
-
+
Even though Fatima and Jayden stored their files in the same place on their
computers (in their home folders), the absolute paths are different due to
their different usernames. If Jayden has code that loads the
`happiness_report.csv` data using an absolute path, the code won't work on
Fatima's computer. But the relative path from inside the `project` folder
(`data/happiness_report.csv`) is the same on both computers; any code that uses
-relative paths will work on both! In the additional resources section,
+relative paths will work on both! In the additional resources section,
we include a link to a short video on the
difference between absolute and relative paths. You can also check out the
`here` package, which provides methods for finding and constructing file paths
-in R.
+in R.
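To make the contrast concrete, here is a sketch using the paths from the example above (the object name `happiness_data` is just illustrative):

```
# the object name happiness_data is illustrative
# relative to the project folder: works on both Fatima's and Jayden's computers
happiness_data <- read_csv("data/happiness_report.csv")

# absolute path: works only on Jayden's computer, since it contains his username
happiness_data <- read_csv("/home/Jayden/project/data/happiness_report.csv")
```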
Beyond files stored on your computer (i.e., locally), we also need a way to locate resources
stored elsewhere on the internet (i.e., remotely). For this purpose we use a
@@ -165,23 +165,23 @@ to where the resource is located on the remote machine. \index{URL}
### `read_csv` to read in comma-separated values files {#readcsv}
Now that we have learned about *where* data could be, we will learn about *how*
-to import data into R using various functions. Specifically, we will learn how
+to import data into R using various functions. Specifically, we will learn how
to *read* tabular data from a plain text file (a document containing only text)
*into* R and *write* tabular data to a file *out of* R. The function we use to do this
depends on the file's format. For example, in the last chapter, we learned about using
the `tidyverse` `read_csv` function when reading `.csv` (**c**omma-**s**eparated **v**alues)
files. \index{csv} In that case, the separator or *delimiter* \index{reading!delimiter} that divided our columns was a
-comma (`,`). We only learned the case where the data matched the expected defaults
+comma (`,`). We only learned the case where the data matched the expected defaults
of the `read_csv` function \index{read function!read\_csv}
-(column names are present, and commas are used as the delimiter between columns).
-In this section, we will learn how to read
+(column names are present, and commas are used as the delimiter between columns).
+In this section, we will learn how to read
files that do not satisfy the default expectations of `read_csv`.
-Before we jump into the cases where the data aren't in the expected default format
+Before we jump into the cases where the data aren't in the expected default format
for `tidyverse` and `read_csv`, let's revisit the more straightforward
case where the defaults hold, and the only argument we need to give to the function
-is the path to the file, `data/can_lang.csv`. The `can_lang` data set contains
-language data from the 2016 Canadian census. \index{Canadian languages!canlang data}
+is the path to the file, `data/can_lang.csv`. The `can_lang` data set contains
+language data from the 2016 Canadian census. \index{Canadian languages!canlang data}
We put `data/` before the file's
name when we are loading the data set because this data set is located in a
sub-folder, named `data`, relative to where we are running our R code.
@@ -200,9 +200,9 @@ Non-Official & Non-Aboriginal languages,American Sign Language,2685,3020,1145,21
Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670
```
-And here is a review of how we can use `read_csv` to load it into R. First we
+And here is a review of how we can use `read_csv` to load it into R. First we
load the `tidyverse` \index{tidyverse} package to gain access to useful
-functions for reading the data.
+functions for reading the data.
```{r, message = FALSE}
library(tidyverse)
@@ -258,7 +258,7 @@ Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670
With this extra information being present at the top of the file, using
`read_csv` as we did previously does not allow us to correctly load the data
into R. In the case of this file we end up only reading in one column of the
-data set. In contrast to the normal and expected messages above, this time R
+data set. In contrast to the normal and expected messages above, this time R
prints out a warning for us indicating that there might be a problem with how
our data is being read in. \index{warning}
@@ -267,20 +267,20 @@ canlang_data <- read_csv("data/can_lang_meta-data.csv")
canlang_data
```
-To successfully read data like this into R, the `skip`
-argument \index{read function!skip argument} can be useful to tell R
+To successfully read data like this into R, the `skip`
+argument \index{read function!skip argument} can be useful to tell R
how many lines to skip before
it should start reading in the data. In the example above, we would set this
value to 3.
```{r}
-canlang_data <- read_csv("data/can_lang_meta-data.csv",
+canlang_data <- read_csv("data/can_lang_meta-data.csv",
skip = 3)
canlang_data
```
How did we know to skip three lines? We looked at the data! The first three lines
-of the data had information we didn't need to import:
+of the data had information we didn't need to import:
```code
Data source: https://ttimbers.github.io/canlang/
@@ -288,30 +288,30 @@ Data originally published in: Statistics Canada Census of Population 2016.
Reproduced and distributed on an as-is basis with their permission.
```
-The column names began at line 4, so we skipped the first three lines.
+The column names began at line 4, so we skipped the first three lines.
### `read_tsv` to read in tab-separated values files
Another common way data is stored is with tabs as the delimiter. Notice the
data file, `can_lang.tsv`, has tabs in between the columns instead of
-commas.
+commas.
```code
category language mother_tongue most_at_home most_at_work lang_kno
Aboriginal languages Aboriginal languages, n.o.s. 590 235 30 665
Non-Official & Non-Aboriginal languages Afrikaans 10260 4785 85 23415
-Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. 1150
+Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. 1150
Non-Official & Non-Aboriginal languages Akan (Twi) 13460 5985 25 22150
Non-Official & Non-Aboriginal languages Albanian 26895 13135 345 31930
Aboriginal languages Algonquian languages, n.i.e. 45 10 0 120
Aboriginal languages Algonquin 1260 370 40 2480
-Non-Official & Non-Aboriginal languages American Sign Language 2685 3020
+Non-Official & Non-Aboriginal languages American Sign Language 2685 3020
Non-Official & Non-Aboriginal languages Amharic 22465 12785 200 33670
```
We can use the `read_tsv` function
\index{tab-separated values|see{tsv}}\index{tsv}\index{read function!read\_tsv}
-to read in `.tsv` (**t**ab **s**eparated **v**alues) files.
+to read in `.tsv` (**t**ab **s**eparated **v**alues) files.
```{r 01-read-tab}
canlang_data <- read_tsv("data/can_lang.tsv")
@@ -320,57 +320,57 @@ canlang_data
If you compare the data frame here to the data frame we obtained in Section
\@ref(readcsv) using `read_csv`, you'll notice that they look identical:
-they have the same number of columns and rows, the same column names, and the same entries! So
+they have the same number of columns and rows, the same column names, and the same entries! So
even though we needed to use a different
function depending on the file format, our resulting data frame
-(`canlang_data`) in both cases was the same.
+(`canlang_data`) in both cases was the same.
### `read_delim` as a more flexible method to get tabular data into R
The `read_csv` and `read_tsv` functions are actually just special cases of the more general
`read_delim` \index{read function!read\_delim} function. We can use
`read_delim` to import both comma and tab-separated values files, and more; we just
-have to specify the delimiter.
+have to specify the delimiter.
For example, the `can_lang_no_names.tsv` file contains a different version of
this same data set with no column names and uses tabs as the delimiter
-\index{reading!delimiter} instead of commas.
+\index{reading!delimiter} instead of commas.
Here is how the file would look in a plain text editor:
```code
Aboriginal languages Aboriginal languages, n.o.s. 590 235 30 665
Non-Official & Non-Aboriginal languages Afrikaans 10260 4785 85 23415
-Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. 1150
+Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. 1150
Non-Official & Non-Aboriginal languages Akan (Twi) 13460 5985 25 22150
Non-Official & Non-Aboriginal languages Albanian 26895 13135 345 31930
Aboriginal languages Algonquian languages, n.i.e. 45 10 0 120
Aboriginal languages Algonquin 1260 370 40 2480
-Non-Official & Non-Aboriginal languages American Sign Language 2685 3020
+Non-Official & Non-Aboriginal languages American Sign Language 2685 3020
Non-Official & Non-Aboriginal languages Amharic 22465 12785 200 33670
Non-Official & Non-Aboriginal languages Arabic 419890 223535 5585 629055
```
-To read this into R using the `read_delim` function, we specify the path
+To read this into R using the `read_delim` function, we specify the path
to the file as the first argument, provide
the tab character `"\t"` as the `delim` argument \index{read function!delim argument},
and set the `col_names` argument to `FALSE` to denote that there are no column names
provided in the data. Note that the `read_csv`, `read_tsv`, and `read_delim` functions
-all have a `col_names` argument \index{read function!col\_names argument} with
-the default value `TRUE`.
+all have a `col_names` argument \index{read function!col\_names argument} with
+the default value `TRUE`.
-> **Note:** `\t` is an example of an *escaped character*,
+> **Note:** `\t` is an example of an *escaped character*,
> which always starts with a backslash (`\`). \index{escape character}
-> Escaped characters are used to represent non-printing characters
-> (like the tab) or those with special meanings (such as quotation marks).
+> Escaped characters are used to represent non-printing characters
+> (like the tab) or those with special meanings (such as quotation marks).
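
If you are curious what escaped characters look like when printed, here is a tiny illustration using base R (not specific to `read_delim`):

```r
# \t prints as a tab, and \" prints as a literal quotation mark
writeLines("name\tage")
writeLines("she said \"hello\"")
```
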
```{r}
-canlang_data <- read_delim("data/can_lang_no_names.tsv",
- delim = "\t",
+canlang_data <- read_delim("data/can_lang_no_names.tsv",
+ delim = "\t",
col_names = FALSE)
canlang_data
```
-Data frames in R need to have column names. Thus if you read in data
+Data frames in R need to have column names. Thus if you read in data
without column names, R will assign names automatically. In this example,
R assigns the column names `X1, X2, X3, X4, X5, X6`.
It is best to rename your columns manually in this scenario. The current
@@ -379,18 +379,18 @@ To rename your columns, you can use the `rename` function
\index{rename} from [the `dplyr` R package](https://dplyr.tidyverse.org/) [@dplyr]
\index{dplyr} (one of the packages
loaded with `tidyverse`, so we don't need to load it separately). The first
-argument is the data set, and in the subsequent arguments you
-write `new_name = old_name` for the selected variables to
+argument is the data set, and in the subsequent arguments you
+write `new_name = old_name` for the selected variables to
rename. We rename the `X1, X2, ..., X6`
-columns in the `canlang_data` data frame to more descriptive names below.
+columns in the `canlang_data` data frame to more descriptive names below.
```{r 01-rename-columns}
canlang_data <- rename(canlang_data,
- category = X1,
- language = X2,
+ category = X1,
+ language = X2,
mother_tongue = X3,
- most_at_home = X4,
- most_at_work = X5,
+ most_at_home = X4,
+ most_at_work = X5,
lang_known = X6)
canlang_data
```
@@ -416,7 +416,7 @@ Occasionally the data available at a URL is not formatted nicely enough to use
`read_csv`, `read_tsv`, `read_delim`, or other related functions to read the data
directly into R. In situations where it is necessary to download a file
to our local computer prior to working with it in R, we can use the `download.file`
-function. The first argument is the URL, and the second is a path where we would
+function. The first argument is the URL, and the second is a path where we would
like to store the downloaded file.
```r
@@ -433,7 +433,7 @@ canlang_data
In many of the examples above, we gave you previews of the data file before we read
it into R. Previewing data is essential to see whether or not there are column
-names, what the delimiters are, and if there are lines you need to skip.
+names, what the delimiters are, and if there are lines you need to skip.
You should do this yourself when trying to read in data files: open the file in
whichever text editor you prefer to inspect its contents prior to reading it into R.
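
If you prefer to stay within R, one lightweight way to preview a plain text file is the `readLines` function; for example, the sketch below prints the first five lines of the file we read earlier.

```r
# print the first 5 lines of the file to inspect column names and delimiters
readLines("data/can_lang.csv", n = 5)
```
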
@@ -449,7 +449,7 @@ though `.csv` and `.xlsx` files look almost identical when loaded into Excel,
the data themselves are stored completely differently. While `.csv` files are
plain text files, where the characters you see when you open the file in a text
editor are exactly the data they represent, this is not the case for `.xlsx`
-files. Take a look at a snippet of what a `.xlsx` file would look like in a text editor:
+files. Take a look at a snippet of what a `.xlsx` file would look like in a text editor:
```
@@ -462,8 +462,8 @@ t 8f??3wn
?Pd(??J-?E???7?'t(?-GZ?????y???c~N?g[^_r?4
yG?O
?K??G?
-
-
+
+
]TUEe??O??c[???????6q??s??d?m???\???H?^????3} ?rZY? ?:L60?^?????XTP+?|?
X?a??4VT?,D?Jq
```
@@ -472,7 +472,7 @@ This type of file representation allows Excel files to store additional things
that you cannot store in a `.csv` file, such as fonts, text formatting,
graphics, multiple sheets and more. And despite looking odd in a plain text
editor, we can read Excel spreadsheets into R using the `readxl` package
-developed specifically for this
+developed specifically for this
purpose. \index{readxl}\index{read function!read\_excel}
```{r}
@@ -486,13 +486,13 @@ If the `.xlsx` file has multiple sheets, you have to use the `sheet` argument
to specify the sheet number or name. You can also specify cell ranges using the
`range` argument. This functionality is useful when a single sheet contains
multiple tables (a sad thing that happens to many Excel spreadsheets since this
-makes reading in data more difficult).
+makes reading in data more difficult).
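
As a hedged sketch of how those arguments look in practice (the file name, sheet name, and cell range below are made up for illustration):

```r
# read only cells A1:F10 from a sheet named "2016 census" in a
# hypothetical multi-sheet spreadsheet
canlang_data <- read_excel("data/can_lang_sheets.xlsx",
                           sheet = "2016 census",
                           range = "A1:F10")
```
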
As with plain text files, you should always explore the data file before
importing it into R. Exploring the data beforehand helps you decide which
arguments you need to load the data into R successfully. If you do not have
the Excel program on your computer, you can use other programs to preview the
-file. Examples include Google Sheets and Libre Office.
+file. Examples include Google Sheets and LibreOffice.
In Table \@ref(tab:read-table) we summarize the `read_*` functions we covered
in this chapter. We also include the `read_csv2` function for data separated by
@@ -524,7 +524,7 @@ different relational database management systems each have their own advantages
and limitations. Almost all employ SQL (*structured query language*) to obtain
data from the database. But you don't need to know SQL to analyze data from
a database; several packages have been written that allow you to connect to
-relational databases and use the R programming language
+relational databases and use the R programming language
to obtain data. In this book, we will give examples of how to do this
using R with SQLite and PostgreSQL databases.
@@ -533,9 +533,9 @@ using R with SQLite and PostgreSQL databases.
SQLite \index{database!SQLite} is probably the simplest relational database system
that one can use in combination with R. SQLite databases are self-contained, and are
usually stored and accessed locally on one computer from
-a file with a `.db` extension (or sometimes an `.sqlite` extension).
+a file with a `.db` extension (or sometimes an `.sqlite` extension).
Similar to Excel files, these are not plain text
-files and cannot be read in a plain text editor.
+files and cannot be read in a plain text editor.
The first thing you need to do to read data into R from a database is to
connect to the database. We do that using the `dbConnect` function from the
@@ -550,7 +550,7 @@ conn_lang_data <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")
```
Often relational databases have many tables; thus, in order to retrieve
-data from a database, you need to know the name of the table
+data from a database, you need to know the name of the table
in which the data is stored. You can get the names of
all the tables in the database using the `dbListTables` \index{database!tables}
function:
@@ -562,7 +562,7 @@ tables
The `dbListTables` function returned only one name, which tells us
that there is only one table in this database. To reference a table in the
-database (so that we can perform operations like selecting columns and filtering rows), we
+database (so that we can perform operations like selecting columns and filtering rows), we
use the `tbl` function \index{database!tbl} from the `dbplyr` package. The object returned
by the `tbl` function \index{dbplyr|see{database}}\index{database!dbplyr} allows us to work with data
stored in databases as if they were just regular data frames; but secretly, behind
@@ -573,7 +573,7 @@ into SQL queries!
library(dbplyr)
lang_db <- tbl(conn_lang_data, "lang")
-lang_db
+lang_db
```
Although it looks like we just got a data frame from the database, we didn't!
@@ -582,19 +582,19 @@ It's a *reference*; the data is still stored only in the SQLite database. The
and joining large data sets than R. And typically the database will not even
be stored on your computer, but rather a more powerful machine somewhere on the
web. So R is lazy and waits to bring this data into memory until you explicitly
-tell it to using the `collect` \index{database!collect} function.
+tell it to using the `collect` \index{database!collect} function.
Figure \@ref(fig:01-ref-vs-tibble) highlights the difference
between a `tibble` object in R and the output we just created. Notice in the table
on the right, the first two lines of the output indicate the source is SQL. The
last line doesn't show how many rows there are (R is trying to avoid performing
-expensive query operations), whereas the output for the `tibble` object does.
+expensive query operations), whereas the output for the `tibble` object does.
```{r 01-ref-vs-tibble, echo = FALSE, message = FALSE, warning = FALSE, fig.align = "center", fig.cap = "Comparison of a reference to data in a database and a tibble in R.", fig.retina = 2, out.width="80%"}
image_read("img/reading/ref_vs_tibble.001.jpeg") |>
image_crop("3632x1600")
```
-We can look at the SQL commands that are sent to the database when we write
+We can look at the SQL commands that are sent to the database when we write
`tbl(conn_lang_data, "lang")` in R with the `show_query` function from the
`dbplyr` package. \index{database!show\_query}
@@ -605,10 +605,10 @@ show_query(tbl(conn_lang_data, "lang"))
The output above shows the SQL code that is sent to the database. When we
write `tbl(conn_lang_data, "lang")` in R, in the background, the function is
translating the R code into SQL, sending that SQL to the database, and then translating the
-response for us. So `dbplyr` does all the hard work of translating from R to SQL and back for us;
-we can just stick with R!
+response for us. So `dbplyr` does all the hard work of translating from R to SQL and back for us;
+we can just stick with R!
-With our `lang_db` table reference for the 2016 Canadian Census data in hand, we
+With our `lang_db` table reference for the 2016 Canadian Census data in hand, we
can mostly continue onward as if it were a regular data frame. For example, let's do the same exercise
from Chapter \@ref(intro): we will obtain only those rows corresponding to Aboriginal languages, and keep only
the `language` and `mother_tongue` columns.
@@ -616,7 +616,7 @@ We can use the `filter` function to obtain only certain rows. Below we filter th
```{r}
aboriginal_lang_db <- filter(lang_db, category == "Aboriginal languages")
-aboriginal_lang_db
+aboriginal_lang_db
```
Above you can again see the hints that this data is not actually stored in R yet:
@@ -645,9 +645,9 @@ aboriginal_lang_data
```
Aside from knowing the number of rows, the data looks pretty similar in both
-outputs shown above. And `dbplyr` provides many more functions (not just `filter`)
-that you can use to directly feed the database reference (`lang_db`) into
-downstream analysis functions (e.g., `ggplot2` for data visualization).
+outputs shown above. And `dbplyr` provides many more functions (not just `filter`)
+that you can use to directly feed the database reference (`lang_db`) into
+downstream analysis functions (e.g., `ggplot2` for data visualization).
But `dbplyr` does not provide *every* function that we need for analysis;
we do eventually need to call `collect`.
For example, look what happens when we try to use `nrow` to count rows
@@ -656,7 +656,7 @@ in a data frame: \index{nrow}
```{r}
nrow(aboriginal_lang_selected_db)
```
-
+
or `tail` to preview the last six rows of a data frame:
\index{tail}
@@ -677,11 +677,11 @@ But be very careful using `collect`: databases are often *very* big,
and reading an entire table into R might take a long time to run or even possibly
crash your machine. So make sure you use `filter` and `select` on the database table
to reduce the data to a reasonable size before using `collect` to read it into R!
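
To make that advice concrete, here is a sketch of the recommended pattern using the objects from earlier in this section: filter and select on the database side first, and only then `collect` the much smaller result.

```r
# reduce the data inside the database first, then bring the small result into R
aboriginal_lang_data <- lang_db |>
  filter(category == "Aboriginal languages") |>
  select(language, mother_tongue) |>
  collect()
```
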
-
-### Reading data from a PostgreSQL database
+
+### Reading data from a PostgreSQL database
PostgreSQL (also called Postgres) \index{database!PostgreSQL} is a very popular
-and open-source option for relational database software.
+and open-source option for relational database software.
Unlike SQLite,
PostgreSQL uses a client–server database engine, as it was designed to be used
and accessed on a network. This means that you have to provide more information
@@ -697,8 +697,8 @@ need to include when you call the `dbConnect` function is listed below:
Additionally, we must use the `RPostgres` package instead of `RSQLite` in the
`dbConnect` function call. Below we demonstrate how to connect to a version of
the `can_mov_db` database, which contains information about Canadian movies.
-Note that the `host` (`fakeserver.stat.ubc.ca`), `user` (`user0001`), and
-`password` (`abc123`) below are *not real*; you will not actually
+Note that the `host` (`fakeserver.stat.ubc.ca`), `user` (`user0001`), and
+`password` (`abc123`) below are *not real*; you will not actually
be able to connect to a database using this information.
```{r, eval = FALSE}
@@ -717,8 +717,8 @@ dbListTables(conn_mov_data)
```
```
- [1] "themes" "medium" "titles" "title_aliases" "forms"
- [6] "episodes" "names" "names_occupations" "occupation" "ratings"
+ [1] "themes" "medium" "titles" "title_aliases" "forms"
+ [6] "episodes" "names" "names_occupations" "occupation" "ratings"
```
We see that there are 10 tables in this database. Let's first look at the
@@ -804,27 +804,27 @@ been a really bad movie...
### Why should we bother with databases at all?
-Opening a database \index{database!reasons to use}
+Opening a database \index{database!reasons to use}
involved a lot more effort than just opening a `.csv`, `.tsv`, or any of the
-other plain text or Excel formats. We had to open a connection to the database,
+other plain text or Excel formats. We had to open a connection to the database,
then use `dbplyr` to translate `tidyverse`-like
commands (`filter`, `select` etc.) into SQL commands that the database
-understands, and then finally `collect` the results. And not
+understands, and then finally `collect` the results. And not
all `tidyverse` commands can currently be translated to work with
databases. For example, we can compute a mean with a database
but can't easily compute a median. So you might be wondering: why should we use
-databases at all?
+databases at all?
Databases are beneficial in a large-scale setting:
- They enable storing large data sets across multiple computers with backups.
- They provide mechanisms for ensuring data integrity and validating input.
- They provide security and data access control.
-- They allow multiple users to access data simultaneously
+- They allow multiple users to access data simultaneously
and remotely without conflicts and errors.
- For example, there are billions of Google searches conducted daily in 2021 [@googlesearches].
- Can you imagine if Google stored all of the data
- from those searches in a single `.csv` file!? Chaos would ensue!
+ For example, there are billions of Google searches conducted daily in 2021 [@googlesearches].
+ Can you imagine if Google stored all of the data
+ from those searches in a single `.csv` file!? Chaos would ensue!
## Writing data from R to a `.csv` file
@@ -843,7 +843,7 @@ no_official_lang_data <- filter(can_lang, category != "Official languages")
write_csv(no_official_lang_data, "data/no_official_languages.csv")
```
-## Obtaining data from the web
+## Obtaining data from the web
> **Note:** This section is not required reading for the remainder of the textbook. It
> is included for those readers interested in learning a little bit more about
@@ -870,20 +870,20 @@ website) to give it the website's data, and then your browser translates that
data into something you can see. If the website shows you some information that
you're interested in, you could *create* a data set for yourself by copying and
pasting that information into a file. This process of taking information
-directly from what a website displays is called \index{web scraping}
+directly from what a website displays is called \index{web scraping}
*web scraping* (or sometimes *screen scraping*). Now, of course, copying and pasting
information manually is a painstaking and error-prone process, especially when
there is a lot of information to gather. So instead of asking your browser to
translate the information that the web server provides into something you can
see, you can collect that data programmatically—in the form of
-**h**yper**t**ext **m**arkup **l**anguage
-(HTML) \index{hypertext markup language|see{HTML}}\index{cascading style sheet|see{CSS}}\index{CSS}\index{HTML}
-and **c**ascading **s**tyle **s**heet (CSS) code—and process it
+**h**yper**t**ext **m**arkup **l**anguage
+(HTML) \index{hypertext markup language|see{HTML}}\index{cascading style sheet|see{CSS}}\index{CSS}\index{HTML}
+and **c**ascading **s**tyle **s**heet (CSS) code—and process it
to extract useful information. HTML provides the
basic structure of a site and tells the webpage how to display the content
(e.g., titles, paragraphs, bullet lists etc.), whereas CSS helps style the
-content and tells the webpage how the HTML elements should
-be presented (e.g., colors, layouts, fonts etc.).
+content and tells the webpage how the HTML elements should
+be presented (e.g., colors, layouts, fonts etc.).
This subsection will show you the basics of both web scraping
with the [`rvest` R package](https://rvest.tidyverse.org/) [@rvest]
@@ -896,15 +896,15 @@ using the [`httr2` R package](https://httr2.r-lib.org/) [@httr2].
When you enter a URL into your browser, your browser connects to the
web server at that URL and asks for the *source code* for the website.
-This is the data that the browser translates
+This is the data that the browser translates
\index{web scraping}\index{HTML!selector}\index{CSS!selector}
into something you can see; so if we
are going to create our own data by scraping a website, we have to first understand
what that data looks like! For example, let's say we are interested
in knowing the average rental price (per square foot) of the most recently
-available one-bedroom apartments in Vancouver
+available one-bedroom apartments in Vancouver
on [Craigslist](https://vancouver.craigslist.org). When we visit the Vancouver Craigslist
-website \index{Craigslist} and search for one-bedroom apartments,
+website \index{Craigslist} and search for one-bedroom apartments,
we should see something similar to Figure \@ref(fig:craigslist-human).
```{r craigslist-human, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Craigslist webpage of advertisements for one-bedroom apartments.", fig.retina = 2, out.width="100%"}
@@ -915,8 +915,8 @@ Based on what our browser shows us, it's pretty easy to find the size and price
for each apartment listed. But we would like to be able to obtain that information
using R, without any manual human effort or copying and pasting. We do this by
examining the *source code* that the web server actually sent our browser to
-display for us. We show a snippet of it below; the
-entire source
+display for us. We show a snippet of it below; the
+entire source
is [included with the code for this book](https://github.com/UBC-DSCI/introduction-to-datascience/blob/main/img/reading/website_source.txt):
```html
@@ -976,22 +976,22 @@ take a look at another line of the source snippet above:
It's yet another price for an apartment listing, and the tags surrounding it
have the `"result-price"` class. Wonderful! Now that we know what pattern we
are looking for—a dollar amount between opening and closing tags that have the
-`"result-price"` class—we should be able to use code to pull out all of the
+`"result-price"` class—we should be able to use code to pull out all of the
matching patterns from the source code to obtain our data. This sort of "pattern"
is known as a *CSS selector* (where CSS stands for **c**ascading **s**tyle **s**heet).
-The above was a simple example of "finding the pattern to look for"; many
+The above was a simple example of "finding the pattern to look for"; many
websites are quite a bit larger and more complex, and so is their website
source code. Fortunately, there are tools available to make this process
-easier. For example,
-[SelectorGadget](https://selectorgadget.com/) is
-an open-source tool that simplifies identifying the generating
-and finding of CSS selectors.
+easier. For example,
+[SelectorGadget](https://selectorgadget.com/) is
+an open-source tool that simplifies generating
+and finding CSS selectors.
At the end of the chapter in the additional resources section, we include a link to
-a short video on how to install and use the SelectorGadget tool to
-obtain CSS selectors for use in web scraping.
-After installing and enabling the tool, you can click the
-website element for which you want an appropriate selector. For
+a short video on how to install and use the SelectorGadget tool to
+obtain CSS selectors for use in web scraping.
+After installing and enabling the tool, you can click the
+website element for which you want an appropriate selector. For
example, if we click the price of an apartment listing, we
find that SelectorGadget shows us the selector `.result-price`
in its toolbar, and highlights all the other apartment
@@ -1003,14 +1003,14 @@ knitr::include_graphics("img/reading/sg1.png")
If we then click the size of an apartment listing, SelectorGadget shows us
the `span` selector, and highlights many of the lines on the page; this indicates that the
-`span` selector is not specific enough to capture only apartment sizes (Figure \@ref(fig:sg3)).
+`span` selector is not specific enough to capture only apartment sizes (Figure \@ref(fig:sg3)).
```{r sg3, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Using the SelectorGadget on a Craigslist webpage to obtain a CSS selector useful for obtaining apartment sizes.", fig.retina = 2, out.width="100%"}
knitr::include_graphics("img/reading/sg3.png")
```
To narrow the selector, we can click one of the highlighted elements that
-we *do not* want. For example, we can deselect the "pic/map" links,
+we *do not* want. For example, we can deselect the "pic/map" links,
resulting in only the data we want highlighted using the `.housing` selector (Figure \@ref(fig:sg2)).
```{r sg2, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Using the SelectorGadget on a Craigslist webpage to refine the CSS selector to one that is most useful for obtaining apartment sizes.", fig.retina = 2, out.width="100%"}
@@ -1030,7 +1030,7 @@ R if we are using more than one CSS selector.
you are *allowed* to scrape it! There are two documents that are important
for this: the `robots.txt` file and the Terms of Service
document. If we take a look at [Craigslist's Terms of Service document](https://www.craigslist.org/about/terms.of.use),
-we find the following text: *"You agree not to copy/collect CL content
+we find the following text: *"You agree not to copy/collect CL content
via robots, spiders, scripts, scrapers, crawlers, or any automated or manual equivalent (e.g., by hand)."*
So unfortunately, without explicit permission, we are not allowed to scrape the website.
@@ -1041,9 +1041,9 @@ to find data about rental prices in Vancouver, we must go elsewhere.
To continue learning how to scrape data from the web, let's instead
scrape data on the population of Canadian cities from Wikipedia. \index{Wikipedia}
We have checked the [Terms of Service document](https://foundation.wikimedia.org/wiki/Terms_of_Use/en),
-and it does not mention that web scraping is disallowed.
+and it does not mention that web scraping is disallowed.
We will use the SelectorGadget tool to pick elements that we are interested in
-(city names and population counts) and deselect others to indicate that we are not
+(city names and population counts) and deselect others to indicate that we are not
interested in them (province names), as shown in Figure \@ref(fig:sg4).
```{r sg4, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Using the SelectorGadget on a Wikipedia webpage.", fig.retina = 2, out.width="100%"}
@@ -1080,22 +1080,22 @@ page <- read_html("https://en.wikipedia.org/wiki/Canada")
```{r echo=FALSE, warning = FALSE}
# the above cell doesn't actually run; this one does run
-# and loads the html data from a local, static file
+# and loads the html data from a local, static file
page <- read_html("data/canada_wiki.html")
```
-The `read_html` function \index{read function!read\_html} directly downloads the source code for the page at
-the URL you specify, just like your browser would if you navigated to that site. But
-instead of displaying the website to you, the `read_html` function just returns
+The `read_html` function \index{read function!read\_html} directly downloads the source code for the page at
+the URL you specify, just like your browser would if you navigated to that site. But
+instead of displaying the website to you, the `read_html` function just returns
the HTML source code itself, which we have
stored in the `page` variable. Next, we send the page object to the `html_nodes`
function, along with the CSS selectors we obtained from
the SelectorGadget tool. Make sure to surround the selectors with quotation marks; the `html_nodes` function expects that
argument to be a string. We store the result of the `html_nodes` function in the `population_nodes` variable.
Note that below we use the `paste` function with a comma separator (`sep=","`)
-to build the list of selectors. The `paste` function converts
-elements to characters and combines the values into a list. We use this function to
+to build the full string of selectors. The `paste` function converts
+its arguments to characters and combines them into a single string, with the values separated by commas. We use this function to
build the selector string while keeping the code readable; this avoids
having a very long line of code.
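
As a tiny illustration of what `paste` produces here (the selector names below are placeholders, not the ones from the Wikipedia page):

```r
# paste glues its arguments into one comma-separated string
paste(".selector-one", ".selector-two", sep = ",")
# returns ".selector-one,.selector-two"
```
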
@@ -1138,7 +1138,7 @@ population_text <- html_text(population_nodes)
head(population_text)
```
-Fantastic! We seem to have extracted the data of interest from the
+Fantastic! We seem to have extracted the data of interest from the
raw HTML source code. But we are not quite done; the data
is not yet in an optimal format for data analysis. Both the city names and
population are encoded as characters in a single vector, instead of being in a
@@ -1213,7 +1213,7 @@ endpoint is `https://api.nasa.gov/planetary/apod`. Second, we write `?`, which d
list of *query parameters* will follow. And finally, we specify a list of
query parameters of the form `parameter=value`, separated by `&` characters. The NASA
"Astronomy Picture of the Day" API accepts the parameters shown in
-Figure \@ref(fig:NASA-API-parameters).
+Figure \@ref(fig:NASA-API-parameters).
```{r NASA-API-parameters, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = 'The set of parameters that you can specify when querying the NASA "Astronomy Picture of the Day" API, along with syntax, default settings, and a description of each.', fig.retina = 2, out.width="100%"}
knitr::include_graphics("img/reading/NASA-API-parameters.png")
@@ -1221,7 +1221,7 @@ knitr::include_graphics("img/reading/NASA-API-parameters.png")
So for example, to obtain the image of the day
from July 13, 2023, the API query would have two parameters: `api_key=YOUR_API_KEY`
-and `date=2023-07-13`. Remember to replace `YOUR_API_KEY` with the API key you
+and `date=2023-07-13`. Remember to replace `YOUR_API_KEY` with the API key you
received from NASA in your email! Putting it all together, the query will look like the following:
```
https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&date=2023-07-13
@@ -1261,7 +1261,7 @@ you will recognize the same query URL that we pasted into the browser earlier.
We will then send the query using the `req_perform` function, and finally
obtain a JSON representation of the response using the `resp_body_json` function.
-
```r
library(httr2)
@@ -1278,7 +1278,7 @@ library(jsonlite)
nasa_data <- read_json("data/nasa.json")
# the last entry in the stored data is July 13, 2023, so print that
-nasa_data[[74]]
+nasa_data[[74]]
```
We can obtain more records at once by using the `start_date` and `end_date` parameters, as
@@ -1286,7 +1286,7 @@ shown in the table of parameters in \@ref(fig:NASA-API-parameters).
Let's obtain all the records between May 1, 2023, and July 13, 2023, and store the result
in an object called `nasa_data`; now the response
will take the form of an R *list* (you'll learn more about these in Chapter \@ref(wrangling)).
-Each item in the list will correspond to a single day's record (just like the `nasa_data_single` object),
+Each item in the list will correspond to a single day's record (just like the `nasa_data_single` object),
and there will be 74 items total, one for each day between the start and end dates:
```r
@@ -1331,8 +1331,8 @@ data you are requesting and how frequently you are making requests.
## Exercises
-Practice exercises for the material covered in this chapter
-can be found in the accompanying
+Practice exercises for the material covered in this chapter
+can be found in the accompanying
[worksheets repository](https://worksheets.datasciencebook.ca)
in the "Reading in data locally and from the web" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
@@ -1343,7 +1343,7 @@ found in Chapter \@ref(setup). This will ensure that the automated feedback
and guidance that the worksheets provide will function as intended.
## Additional resources
-- The [`readr` documentation](https://readr.tidyverse.org/)
+- The [`readr` documentation](https://readr.tidyverse.org/)
provides the documentation for many of the reading functions we cover in this chapter.
It is where you should look if you want to learn more about the functions in this
chapter, the full set of arguments you can use, and other related functions.
@@ -1355,10 +1355,10 @@ and guidance that the worksheets provide will function as intended.
Science* [@wickham2016r], which goes into a lot more detail about how R parses
text from files into data frames.
- The [`here` R package](https://here.r-lib.org/) [@here]
- provides a way for you to construct or find your files' paths.
+ provides a way for you to construct or find your files' paths.
- The [`readxl` documentation](https://readxl.tidyverse.org/) provides more
details on reading data from Excel, such as reading in data with multiple
- sheets, or specifying the cells to read in.
+ sheets, or specifying the cells to read in.
- The [`rio` R package](https://github.com/leeper/rio) [@rio] provides an alternative
set of tools for reading and writing data in R. It aims to be a "Swiss army
knife" for data reading/writing/converting, and supports a wide variety of data
@@ -1372,4 +1372,4 @@ and guidance that the worksheets provide will function as intended.
- [extracting the data for apartment listings on Craigslist](https://www.youtube.com/embed/YdIWI6K64zo), and
- [extracting Canadian city names and populations from Wikipedia](https://www.youtube.com/embed/O9HKbdhqYzk).
- The [`polite` R package](https://dmi3kno.github.io/polite/) [@polite] provides
- a set of tools for responsibly scraping data from websites.
+ a set of tools for responsibly scraping data from websites.
diff --git a/source/regression1.Rmd b/source/regression1.Rmd
index 42a4a1e15..636a799c8 100644
--- a/source/regression1.Rmd
+++ b/source/regression1.Rmd
@@ -35,66 +35,67 @@ hidden_print <- function(x){
hidden_print_cli <- function(x){
cleanup_and_print(cli::cli_fmt(capture.output(x)))
}
-theme_update(axis.title = element_text(size = 12)) # modify axis label size in plots
+theme_update(axis.title = element_text(size = 12)) # modify axis label size in plots
```
-## Overview
+## Overview
This chapter continues our foray into answering predictive questions.
-Here we will focus on predicting *numerical* variables
+Here we will focus on predicting *numerical* variables
and will use *regression* to perform this task.
This is unlike the past two chapters, which focused on predicting categorical
variables via classification. However, regression does have many similarities
to classification: for example, just as in the case of classification,
-we will split our data into training, validation, and test sets, we will
-use `tidymodels` workflows, we will use a K-nearest neighbors (KNN)
+we will split our data into training, validation, and test sets, we will
+use `tidymodels` workflows, we will use a K-nearest neighbors (K-NN)
approach to make predictions, and we will use cross-validation to choose K.
Because of how similar these procedures are, make sure to read Chapters
-\@ref(classification1) and \@ref(classification2) before reading
+\@ref(classification1) and \@ref(classification2) before reading
this one—we will move a little bit faster here with the
concepts that have already been covered.
-This chapter will primarily focus on the case where there is a single predictor,
+This chapter will primarily focus on the case where there is a single predictor,
but the end of the chapter shows how to perform
regression with more than one predictor variable, i.e., *multivariable regression*.
-It is important to note that regression
-can also be used to answer inferential and causal questions,
+It is important to note that regression
+can also be used to answer inferential and causal questions;
however, that is beyond the scope of this book.
-## Chapter learning objectives
+## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:
-* Recognize situations where a simple regression analysis would be appropriate for making predictions.
-* Explain the K-nearest neighbor (KNN) regression algorithm and describe how it differs from KNN classification.
-* Interpret the output of a KNN regression.
-* In a data set with two or more variables, perform K-nearest neighbor regression in R using a `tidymodels` workflow.
-* Execute cross-validation in R to choose the number of neighbors.
-* Evaluate KNN regression prediction accuracy in R using a test data set and the root mean squared prediction error (RMSPE).
-* In the context of KNN regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).
-* Describe the advantages and disadvantages of K-nearest neighbors regression.
+- Recognize situations where a regression analysis would be appropriate for making predictions.
+- Explain the K-nearest neighbors (K-NN) regression algorithm and describe how it differs from K-NN classification.
+- Interpret the output of a K-NN regression.
+- In a data set with two or more variables, perform K-nearest neighbors regression in R.
+- Evaluate K-NN regression prediction quality in R using the root mean squared prediction error (RMSPE).
+- Estimate the RMSPE in R using cross-validation or a test set.
+- Choose the number of neighbors in K-nearest neighbors regression by minimizing estimated cross-validation RMSPE.
+- Describe underfitting and overfitting, and relate them to the number of neighbors in K-nearest neighbors regression.
+- Describe the advantages and disadvantages of K-nearest neighbors regression.
## The regression problem
Regression, like classification, is a predictive \index{predictive question} problem setting where we want
to use past information to predict future observations. But in the case of
-regression, the goal is to predict *numerical* values instead of *categorical* values.
+regression, the goal is to predict *numerical* values instead of *categorical* values.
The variable that you want to predict is often called the *response variable*. \index{response variable}
For example, we could try to use the number of hours a person spends on
-exercise each week to predict their race time in the annual Boston marathon. As
+exercise each week to predict their race time in the annual Boston marathon. As
another example, we could try to use the size of a house to
-predict its sale price. Both of these response variables—race time and sale price—are
+predict its sale price. Both of these response variables—race time and sale price—are
numerical, and so predicting them given past data is considered a regression problem.
-Just like in the \index{classification!comparison to regression}
-classification setting, there are many possible methods that we can use
+Just like in the \index{classification!comparison to regression}
+classification setting, there are many possible methods that we can use
to predict numerical response variables. In this chapter we will
focus on the **K-nearest neighbors** algorithm [@knnfix; @knncover], and in the next chapter
we will study **linear regression**.
In your future studies, you might encounter regression trees, splines,
and general local regression methods; see the additional resources
section at the end of the next chapter for where to begin learning more about
-these other methods.
+these other methods.
-Many of the concepts from classification map over to the setting of regression. For example,
+Many of the concepts from classification map over to the setting of regression. For example,
a regression model predicts a new observation's response variable based on the response variables
for similar observations in the data set of past observations. When building a regression model,
we first split the data into training and test sets, in order to ensure that we assess the performance
@@ -115,19 +116,19 @@ is that we are now predicting numerical variables instead of categorical variabl
> though: sometimes categorical variables will be encoded as numbers in your
> data (e.g., "1" represents "benign", and "0" represents "malignant"). In
> these cases you have to ask the question about the *meaning* of the labels
-> ("benign" and "malignant"), not their values ("1" and "0").
+> ("benign" and "malignant"), not their values ("1" and "0").
## Exploring a data set
-In this chapter and the next, we will study
-a data set \index{Sacramento real estate} of
-[932 real estate transactions in Sacramento, California](https://support.spatialkey.com/spatialkey-sample-csv-data/)
+In this chapter and the next, we will study
+a data set \index{Sacramento real estate} of
+[932 real estate transactions in Sacramento, California](https://support.spatialkey.com/spatialkey-sample-csv-data/)
originally reported in the *Sacramento Bee* newspaper.
We first need to formulate a precise question that
we want to answer. In this example, our question is again predictive:
\index{question!regression} Can we use the size of a house in the Sacramento, CA area to predict
its sale price? A rigorous, quantitative answer to this question might help
-a realtor advise a client as to whether the price of a particular listing
+a realtor advise a client as to whether the price of a particular listing
is fair, or perhaps how to set the price of a new listing.
We begin the analysis by loading and examining the data, and setting the seed value.
@@ -153,10 +154,10 @@ want to predict (sale price) on the y-axis.
\index{ggplot!geom\_point}
\index{visualization!scatter}
-> **Note:** Given that the y-axis unit is dollars in Figure \@ref(fig:07-edaRegr),
-> we format the axis labels to put dollar signs in front of the house prices,
+> **Note:** Given that the y-axis unit is dollars in Figure \@ref(fig:07-edaRegr),
+> we format the axis labels to put dollar signs in front of the house prices,
> as well as commas to increase the readability of the larger numbers.
-> We can do this in R by passing the `dollar_format` function
+> We can do this in R by passing the `dollar_format` function
> (from the `scales` package)
> to the `labels` argument of the `scale_y_continuous` function.
@@ -165,31 +166,31 @@ eda <- ggplot(sacramento, aes(x = sqft, y = price)) +
geom_point(alpha = 0.4) +
xlab("House size (square feet)") +
ylab("Price (USD)") +
- scale_y_continuous(labels = dollar_format()) +
+ scale_y_continuous(labels = dollar_format()) +
theme(text = element_text(size = 12))
eda
```
-
+
The plot is shown in Figure \@ref(fig:07-edaRegr).
We can see that in Sacramento, CA, as the
size of a house increases, so does its sale price. Thus, we can reason that we
may be able to use the size of a not-yet-sold house (for which we don't know
the sale price) to predict its final sale price. Note that we do not suggest here
that a larger house size *causes* a higher sale price; just that house price
-tends to increase with house size, and that we may be able to use the latter to
+tends to increase with house size, and that we may be able to use the latter to
predict the former.
## K-nearest neighbors regression
-Much like in the case of classification,
-we can use a K-nearest neighbors-based \index{K-nearest neighbors!regression}
-approach in regression to make predictions.
-Let's take a small sample of the data in Figure \@ref(fig:07-edaRegr)
-and walk through how K-nearest neighbors (KNN) works
+Much like in the case of classification,
+we can use a K-nearest neighbors-based \index{K-nearest neighbors!regression}
+approach in regression to make predictions.
+Let's take a small sample of the data in Figure \@ref(fig:07-edaRegr)
+and walk through how K-nearest neighbors (K-NN) works
in a regression context before we dive in to creating our model and assessing
how well it predicts house sale price. This subsample is taken to allow us to
-illustrate the mechanics of KNN regression with a few data points; later in
+illustrate the mechanics of K-NN regression with a few data points; later in
this chapter we will use all the data.
To take a small random sample of size 30, we'll use the function
@@ -213,7 +214,7 @@ to this question by using the data we have to predict the sale price given the
sale prices we have already observed. But in Figure \@ref(fig:07-small-eda-regr),
you can see that we have no
observations of a house of size *exactly* 2,000 square feet. How can we predict
-the sale price?
+the sale price?
```{r 07-small-eda-regr, fig.height = 3.5, fig.width = 4.5, fig.cap = "Scatter plot of price (USD) versus house size (square feet) with vertical line indicating 2,000 square feet on x-axis."}
small_plot <- ggplot(small_sacramento, aes(x = sqft, y = price)) +
@@ -221,7 +222,7 @@ small_plot <- ggplot(small_sacramento, aes(x = sqft, y = price)) +
xlab("House size (square feet)") +
ylab("Price (USD)") +
scale_y_continuous(labels = dollar_format()) +
- geom_vline(xintercept = 2000, linetype = "dotted") +
+ geom_vline(xintercept = 2000, linetype = "dotted") +
theme(text = element_text(size = 12))
small_plot
@@ -229,9 +230,9 @@ small_plot
We will employ the same intuition from the classification chapter, and use the
neighboring points to the new point of interest to suggest/predict what its
-sale price might be.
-For the example shown in Figure \@ref(fig:07-small-eda-regr),
-we find and label the 5 nearest neighbors to our observation
+sale price might be.
+For the example shown in Figure \@ref(fig:07-small-eda-regr),
+we find and label the 5 nearest neighbors to our observation
of a house that is 2,000 square feet.
\index{mutate}\index{slice\_min}\index{abs}
@@ -257,7 +258,7 @@ nn_plot
Figure \@ref(fig:07-knn3-example) illustrates the difference between the house sizes
of the 5 nearest neighbors (in terms of house size) to our new
-2,000 square-foot house of interest. Now that we have obtained these nearest neighbors,
+2,000 square-foot house of interest. Now that we have obtained these nearest neighbors,
we can use their values to predict the
sale price for the new home. Specifically, we can take the mean (or
average) of these 5 values as our predicted value, as illustrated by
@@ -279,28 +280,28 @@ Our predicted price is \$`r format(round(prediction[[1]]), big.mark=",", nsmall=
(shown as a red point in Figure \@ref(fig:07-predictedViz-knn)), which is much less than \$350,000; perhaps we
might want to offer less than the list price at which the house is advertised.
But this is only the very beginning of the story. We still have all the same
-unanswered questions here with KNN regression that we had with KNN
+unanswered questions here with K-NN regression that we had with K-NN
classification: which $K$ do we choose, and is our model any good at making
predictions? In the next few sections, we will address these questions in the
-context of KNN regression.
+context of K-NN regression.
-One strength of the KNN regression algorithm
+One strength of the K-NN regression algorithm
that we would like to draw attention to at this point
is its ability to work well with non-linear relationships
(i.e., if the relationship is not a straight line).
This stems from the use of nearest neighbors to predict values.
-The algorithm really has very few assumptions
+The algorithm really has very few assumptions
about what the data must look like for it to work.
## Training, evaluating, and tuning the model
-As usual,
-we must start by putting some test data away in a lock box
-that we will come back to only after we choose our final model.
-Let's take care of that now.
-Note that for the remainder of the chapter
-we'll be working with the entire Sacramento data set,
-as opposed to the smaller sample of 30 points
+As usual,
+we must start by putting some test data away in a lock box
+that we will come back to only after we choose our final model.
+Let's take care of that now.
+Note that for the remainder of the chapter
+we'll be working with the entire Sacramento data set,
+as opposed to the smaller sample of 30 points
that we used earlier in the chapter (Figure \@ref(fig:07-small-eda-regr)).
\index{training data}
\index{test data}
@@ -316,12 +317,12 @@ sacramento_train <- training(sacramento_split)
sacramento_test <- testing(sacramento_split)
```
-Next, we'll use cross-validation \index{cross-validation} to choose $K$. In KNN classification, we used
+Next, we'll use cross-validation \index{cross-validation} to choose $K$. In K-NN classification, we used
accuracy to see how well our predictions matched the true labels. We cannot use
the same metric in the regression setting, since our predictions will almost never
*exactly* match the true response variable values. Therefore in the
-context of KNN regression we will use root mean square prediction error \index{root mean square prediction error|see{RMSPE}}\index{RMSPE}
-(RMSPE) instead. The mathematical formula for calculating RMSPE is:
+context of K-NN regression we will use root mean square prediction error \index{root mean square prediction error|see{RMSPE}}\index{RMSPE}
+(RMSPE) instead. The mathematical formula for calculating RMSPE is:
$$\text{RMSPE} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
@@ -331,13 +332,13 @@ where:
- $y_i$ is the observed value for the $i^\text{th}$ observation, and
- $\hat{y}_i$ is the forecasted/predicted value for the $i^\text{th}$ observation.
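
Before moving on, here is a quick numerical check of the formula on a handful of made-up observed and predicted prices (the numbers are purely illustrative):

```r
y     <- c(200000, 350000, 500000)  # made-up observed sale prices
y_hat <- c(210000, 330000, 480000)  # made-up predicted sale prices

# square the differences, average them, then take the square root
sqrt(mean((y - y_hat)^2))
```
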
-In other words, we compute the *squared* difference between the predicted and true response
+In other words, we compute the *squared* difference between the predicted and true response
value for each observation in our test (or validation) set, compute the average, and then finally
take the square root. The reason we use the *squared* difference (and not just the difference)
is that the differences can be positive or negative, i.e., we can overshoot or undershoot the true
response value. Figure \@ref(fig:07-verticalerrors) illustrates both positive and negative differences
between predicted and true response values.
-So if we want to measure error—a notion of distance between our predicted and true response values—we
+So if we want to measure error—a notion of distance between our predicted and true response values—we
want to make sure that we are only adding up positive values, with larger positive values representing larger
mistakes.
If the predictions are very close to the true values, then
@@ -359,7 +360,7 @@ sacr_recipe_hid <- recipe(price ~ sqft, data = small_sacramento) |>
step_scale(all_predictors()) |>
step_center(all_predictors())
-sacr_spec_hid <- nearest_neighbor(weight_func = "rectangular",
+sacr_spec_hid <- nearest_neighbor(weight_func = "rectangular",
neighbors = 4) |>
set_engine("kknn") |>
set_mode("regression")
@@ -369,11 +370,11 @@ sacr_fit_hid <- workflow() |>
add_model(sacr_spec_hid) |>
fit(data = small_sacramento)
-sacr_full_preds_hid <- sacr_fit_hid |>
+sacr_full_preds_hid <- sacr_fit_hid |>
predict(finegrid) |>
bind_cols(finegrid)
-sacr_new_preds_hid <- sacr_fit_hid |>
+sacr_new_preds_hid <- sacr_fit_hid |>
predict(pts) |>
bind_cols(pts)
@@ -383,15 +384,15 @@ errors_plot <- ggplot(small_sacramento, aes(x = sqft, y = price)) +
xlab("House size (square feet)") +
ylab("Price (USD)") +
scale_y_continuous(labels = dollar_format()) +
- geom_line(data = sacr_full_preds_hid,
- aes(x = sqft, y = .pred),
- color = "blue") +
+ geom_line(data = sacr_full_preds_hid,
+ aes(x = sqft, y = .pred),
+ color = "blue") +
geom_segment(
data = sacr_new_preds_hid,
- aes(x = sqft, xend = sqft, y = price, yend = .pred),
- color = "red") +
- geom_point(data = sacr_new_preds_hid,
- aes(x = sqft, y = price),
+ aes(x = sqft, xend = sqft, y = price, yend = .pred),
+ color = "red") +
+ geom_point(data = sacr_new_preds_hid,
+ aes(x = sqft, y = price),
color = "black")
# restore the seed
@@ -400,31 +401,31 @@ errors_plot <- ggplot(small_sacramento, aes(x = sqft, y = price)) +
errors_plot
```
-> **Note:** When using many code packages (`tidymodels` included), the evaluation output
+> **Note:** When using many code packages (`tidymodels` included), the evaluation output
> we will get to assess the prediction quality of
-> our KNN regression models is labeled "RMSE", or "root mean squared
+> our K-NN regression models is labeled "RMSE", or "root mean squared
> error". Why is this so, and why not RMSPE? \index{RMSPE!comparison with RMSE}
> In statistics, we try to be very precise with our
> language to indicate whether we are calculating the prediction error on the
-> training data (*in-sample* prediction) versus on the testing data
-> (*out-of-sample* prediction). When predicting and evaluating prediction quality on the training data, we
+> training data (*in-sample* prediction) versus on the testing data
+> (*out-of-sample* prediction). When predicting and evaluating prediction quality on the training data, we
> say RMSE. By contrast, when predicting and evaluating prediction quality
-> on the testing or validation data, we say RMSPE.
+> on the testing or validation data, we say RMSPE.
> The equation for calculating RMSE and RMSPE is exactly the same; all that changes is whether the $y$s are
-> training or testing data. But many people just use RMSE for both,
+> training or testing data. But many people just use RMSE for both,
> and rely on context to denote which data the root mean squared error is being calculated on.
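+
+To build some intuition for the RMSPE formula above, the following is a minimal
+sketch that computes it by hand for a few invented predicted and true sale prices.
+The `true_price` and `predicted_price` vectors here are hypothetical and are not
+part of the Sacramento analysis; in practice `tidymodels` computes this quantity for us.
+
+```{r}
+# hypothetical observed and predicted sale prices (USD) for five houses
+true_price <- c(350000, 420000, 295000, 510000, 380000)
+predicted_price <- c(340000, 450000, 310000, 480000, 395000)
+
+# RMSPE: square the errors, average them, then take the square root
+sqrt(mean((true_price - predicted_price)^2))
+```
+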
Now that we know how we can assess how well our model predicts a numerical
value, let's use R to perform cross-validation and to choose the optimal $K$.
First, we will create a recipe for preprocessing our data.
Note that we include standardization
-in our preprocessing to build good habits, but since we only have one
+in our preprocessing to build good habits, but since we only have one
predictor, it is technically not necessary; there is no risk of comparing two predictors
of different scales.
-Next we create a model specification for K-nearest neighbors regression. Note
+Next we create a model specification for K-nearest neighbors regression. Note
that we use `set_mode("regression")`
now in the model specification to denote a regression problem, as opposed to the classification
-problems from the previous chapters.
+problems from the previous chapters.
The use of `set_mode("regression")` essentially
tells `tidymodels` that we need to use different metrics (RMSPE, not accuracy)
for tuning and evaluation.
@@ -437,7 +438,7 @@ sacr_recipe <- recipe(price ~ sqft, data = sacramento_train) |>
step_scale(all_predictors()) |>
step_center(all_predictors())
-sacr_spec <- nearest_neighbor(weight_func = "rectangular",
+sacr_spec <- nearest_neighbor(weight_func = "rectangular",
neighbors = tune()) |>
set_engine("kknn") |>
set_mode("regression")
@@ -455,7 +456,7 @@ sacr_wkflw
hidden_print(sacr_wkflw)
```
-Next we run cross-validation for a grid of numbers of neighbors ranging from 1 to 200.
+Next we run cross-validation for a grid of numbers of neighbors ranging from 1 to 200.
The following code tunes
the model and returns the RMSPE for each number of neighbors. In the output of the `sacr_results`
data frame, we see that the `neighbors` variable contains the value of $K$,
@@ -479,7 +480,7 @@ sacr_results <- sacr_wkflw |>
sacr_results
```
-```{r 07-choose-k-knn-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Effect of the number of neighbors on the RMSPE."}
+```{r 07-choose-k-knn-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Effect of the number of neighbors on the RMSPE."}
sacr_tunek_plot <- ggplot(sacr_results, aes(x = neighbors, y = mean)) +
geom_point() +
geom_line() +
@@ -508,22 +509,22 @@ kmin <- sacr_min |> pull(neighbors)
## Underfitting and overfitting
Similar to the setting of classification, by setting the number of neighbors
to be too small or too large, we cause the RMSPE to increase, as shown in
-Figure \@ref(fig:07-choose-k-knn-plot). What is happening here?
+Figure \@ref(fig:07-choose-k-knn-plot). What is happening here?
Figure \@ref(fig:07-howK) visualizes the effect of different settings of $K$ on the
regression model. Each plot shows the predicted values for house sale price from
-our KNN regression model on the training data for 6 different values for $K$: 1, 3, 25, `r kmin`, 250, and 680 (almost the entire training set).
+our K-NN regression model on the training data for 6 different values for $K$: 1, 3, 25, `r kmin`, 250, and 680 (almost the entire training set).
For each model, we predict prices for the range of possible home sizes we
observed in the data set (here 500 to 5,000 square feet) and we plot the
predicted prices as a blue line.
-```{r 07-howK, echo = FALSE, warning = FALSE, fig.height = 13, fig.width = 10,fig.cap = "Predicted values for house price (represented as a blue line) from KNN regression models for six different values for $K$."}
+```{r 07-howK, echo = FALSE, warning = FALSE, fig.height = 13, fig.width = 10,fig.cap = "Predicted values for house price (represented as a blue line) from K-NN regression models for six different values for $K$."}
gridvals <- c(1, 3, 25, kmin, 250, 680)
plots <- list()
for (i in 1:6) {
- sacr_spec <- nearest_neighbor(weight_func = "rectangular",
+ sacr_spec <- nearest_neighbor(weight_func = "rectangular",
neighbors = gridvals[[i]]) |>
set_engine("kknn") |>
set_mode("regression")
@@ -551,9 +552,9 @@ grid.arrange(grobs = plots, ncol = 2)
```
Figure \@ref(fig:07-howK) shows that when $K$ = 1, the blue line runs perfectly
-through (almost) all of our training observations.
+through (almost) all of our training observations.
This happens because our
-predicted values for a given region (typically) depend on just a single observation.
+predicted values for a given region (typically) depend on just a single observation.
In general, when $K$ is too small, the line follows the training data quite
closely, even if it does not match it perfectly.
If we used a different training data set of house prices and sizes
@@ -564,17 +565,17 @@ predictions on new observations which, generally, will not have the same fluctua
as the original training data.
Recall from the classification
chapters that this behavior—where the model is influenced too much
-by the noisy data—is called *overfitting*; we use this same term
+by the noisy data—is called *overfitting*; we use this same term
in the context of regression. \index{overfitting!regression}
-What about the plots in Figure \@ref(fig:07-howK) where $K$ is quite large,
-say, $K$ = 250 or 680?
+What about the plots in Figure \@ref(fig:07-howK) where $K$ is quite large,
+say, $K$ = 250 or 680?
In this case the blue line becomes extremely smooth, and actually becomes flat
-once $K$ is equal to the number of datapoints in the training set.
+once $K$ is equal to the number of data points in the training set.
This happens because our predicted values for a given x value (here, home
-size), depend on many neighboring observations; in the case where $K$ is equal
+size) depend on many neighboring observations; in the case where $K$ is equal
to the size of the training set, the prediction is just the mean of the house prices
-(completely ignoring the house size). In contrast to the $K=1$ example,
+(completely ignoring the house size). In contrast to the $K=1$ example,
the smooth, inflexible blue line does not follow the training observations very closely.
In other words, the model is *not influenced enough* by the training data.
Recall from the classification
@@ -585,20 +586,20 @@ Ideally, what we want is neither of the two situations discussed above. Instead,
we would like a model that (1) follows the overall "trend" in the training data, so the model
actually uses the training data to learn something useful, and (2) does not follow
the noisy fluctuations, so that we can be confident that our model will transfer/generalize
-well to other new data. If we explore
+well to other new data. If we explore
the other values for $K$, in particular $K$ = `r sacr_min |> pull(neighbors)`
(as suggested by cross-validation),
we can see it achieves this goal: it follows the increasing trend of house price
versus house size, but is not influenced too much by the idiosyncratic variations
in price. All of this is similar to how
the choice of $K$ affects K-nearest neighbors classification, as discussed in the previous
-chapter.
+chapter.
## Evaluating on the test set
To assess how well our model might do at predicting on unseen data, we will
assess its RMSPE on the test data. To do this, we will first
-re-train our KNN regression model on the entire training data set,
+re-train our K-NN regression model on the entire training data set,
using $K =$ `r sacr_min |> pull(neighbors)` neighbors. Then we will
use `predict` to make predictions on the test data, and use the `metrics`
function again to compute the summary of regression quality. Because
@@ -626,24 +627,24 @@ sacr_summary <- sacr_fit |>
sacr_summary
```
-Our final model's test error as assessed by RMSPE
-is $`r format(round(sacr_summary |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
+Our final model's test error as assessed by RMSPE
+is $`r format(round(sacr_summary |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
Note that RMSPE is measured in the same units as the response variable.
-In other words, on new observations, we expect the error in our prediction to be
-*roughly* $`r format(round(sacr_summary |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
+In other words, on new observations, we expect the error in our prediction to be
+*roughly* $`r format(round(sacr_summary |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
From one perspective, this is good news: this is about the same as the cross-validation
-RMSPE estimate of our tuned model
-(which was $`r format(round(sacr_min |> pull(mean)), big.mark=",", nsmall=0, scientific=FALSE)`),
+RMSPE estimate of our tuned model
+(which was $`r format(round(sacr_min |> pull(mean)), big.mark=",", nsmall=0, scientific=FALSE)`),
so we can say that the model appears to generalize well
to new data that it has never seen before.
-However, much like in the case of KNN classification, whether this value for RMSPE is *good*—i.e.,
+However, much like in the case of K-NN classification, whether this value for RMSPE is *good*—i.e.,
whether an error of around $`r format(round(sacr_summary |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`
-is acceptable—depends entirely on the application.
+is acceptable—depends entirely on the application.
In this application, this error
-is not prohibitively large, but it is not negligible either;
+is not prohibitively large, but it is not negligible either;
$`r format(round(sacr_summary |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`
might represent a substantial fraction of a home buyer's budget, and
-could make or break whether or not they could afford put an offer on a house.
+could make or break whether or not they could afford to put an offer on a house.
Finally, Figure \@ref(fig:07-predict-all) shows the predictions that our final
model makes across the range of house sizes we might encounter in the
@@ -658,7 +659,7 @@ You have already seen a
few plots like this in this chapter, but here we also provide the code that
generated it as a learning opportunity.
-```{r 07-predict-all, warning = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Predicted values of house price (blue line) for the final KNN regression model."}
+```{r 07-predict-all, warning = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Predicted values of house price (blue line) for the final K-NN regression model."}
sqft_prediction_grid <- tibble(
sqft = seq(
from = sacramento |> select(sqft) |> min(),
@@ -673,22 +674,22 @@ sacr_preds <- sacr_fit |>
plot_final <- ggplot(sacramento, aes(x = sqft, y = price)) +
geom_point(alpha = 0.4) +
- geom_line(data = sacr_preds,
- mapping = aes(x = sqft, y = .pred),
+ geom_line(data = sacr_preds,
+ mapping = aes(x = sqft, y = .pred),
color = "blue") +
xlab("House size (square feet)") +
ylab("Price (USD)") +
scale_y_continuous(labels = dollar_format()) +
- ggtitle(paste0("K = ", kmin)) +
+ ggtitle(paste0("K = ", kmin)) +
theme(text = element_text(size = 12))
plot_final
```
-## Multivariable KNN regression
+## Multivariable K-NN regression
-As in KNN classification, we can use multiple predictors in KNN regression.
-In this setting, we have the same concerns regarding the scale of the predictors. Once again,
+As in K-NN classification, we can use multiple predictors in K-NN regression.
+In this setting, we have the same concerns regarding the scale of the predictors. Once again,
predictions are made by identifying the $K$
observations that are nearest to the new point we want to predict; any
variables that are on a large scale will have a much larger effect than
@@ -696,13 +697,13 @@ variables on a small scale. But since the `recipe` we built above scales and cen
all predictor variables, this is handled for us.
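+
+To see concretely why scaling matters here, consider two hypothetical houses that differ
+by 1,000 square feet but only 1 bedroom. On the raw scales, the Euclidean distance between them is
+$$\sqrt{1{,}000^2 + 1^2} \approx 1{,}000.0005,$$
+so the difference in bedrooms contributes essentially nothing to the distance; standardizing
+both predictors first puts them on comparable footing before the nearest neighbors are identified.
+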
Note that we also have the same concern regarding the selection of predictors
-in KNN regression as in KNN classification: having more predictors is **not** always
+in K-NN regression as in K-NN classification: having more predictors is **not** always
better, and the choice of which predictors to use has a potentially large influence
-on the quality of predictions. Fortunately, we can use the predictor selection
-algorithm from the classification chapter in KNN regression as well.
+on the quality of predictions. Fortunately, we can use the predictor selection
+algorithm from the classification chapter in K-NN regression as well.
As the algorithm is the same, we will not cover it again in this chapter.
-We will now demonstrate a multivariable KNN regression \index{K-nearest neighbors!multivariable regression} analysis of the
+We will now demonstrate a multivariable K-NN regression \index{K-nearest neighbors!multivariable regression} analysis of the
Sacramento real estate \index{Sacramento real estate} data using `tidymodels`. This time we will use
house size (measured in square feet) as well as number of bedrooms as our
predictors, and continue to use house sale price as our response variable
@@ -715,8 +716,8 @@ to help predict the sale price of a house.
```{r 07-bedscatter, fig.height = 3.5, fig.width = 4.5, fig.cap = "Scatter plot of the sale price of houses versus the number of bedrooms."}
plot_beds <- sacramento |>
ggplot(aes(x = beds, y = price)) +
- geom_point(alpha = 0.4) +
- labs(x = 'Number of Bedrooms', y = 'Price (USD)') +
+ geom_point(alpha = 0.4) +
+ labs(x = 'Number of Bedrooms', y = 'Price (USD)') +
theme(text = element_text(size = 12))
plot_beds
```
@@ -725,7 +726,7 @@ Figure \@ref(fig:07-bedscatter) shows that as the number of bedrooms increases,
the house sale price tends to increase as well, but that the relationship
is quite weak. Does adding the number of bedrooms
to our model improve our ability to predict price? To answer that
-question, we will have to create a new KNN regression
+question, we will have to create a new K-NN regression
model using house size and number of bedrooms, and then we can compare it to
the model we previously came up with that only used house
size. Let's do that now!
@@ -738,7 +739,7 @@ and set `neighbors = tune()` to tell `tidymodels` to tune the number of neighbor
sacr_recipe <- recipe(price ~ sqft + beds, data = sacramento_train) |>
step_scale(all_predictors()) |>
step_center(all_predictors())
-sacr_spec <- nearest_neighbor(weight_func = "rectangular",
+sacr_spec <- nearest_neighbor(weight_func = "rectangular",
neighbors = tune()) |>
set_engine("kknn") |>
set_mode("regression")
@@ -763,24 +764,24 @@ sacr_multi
```
Here we see that the smallest estimated RMSPE from cross-validation occurs when $K =$ `r sacr_k`.
-If we want to compare this multivariable KNN regression model to the model with only a single
+If we want to compare this multivariable K-NN regression model to the model with only a single
predictor *as part of the model tuning process* (e.g., if we are running forward selection as described
in the chapter on evaluating and tuning classification models),
then we must compare the RMSPE estimated using only the training data via cross-validation.
-Looking back, the estimated cross-validation RMSPE for the single-predictor
+Looking back, the estimated cross-validation RMSPE for the single-predictor
model was \$`r format(round(sacr_min$mean), big.mark=",", nsmall=0, scientific = FALSE)`.
The estimated cross-validation RMSPE for the multivariable model is
\$`r format(round(sacr_multi$mean), big.mark=",", nsmall=0, scientific = FALSE)`.
-Thus in this case, we did not improve the model
+Thus in this case, we did not improve the model
by a large amount by adding this additional predictor.
-Regardless, let's continue the analysis to see how we can make predictions with a multivariable KNN regression model
+Regardless, let's continue the analysis to see how we can make predictions with a multivariable K-NN regression model
and evaluate its performance on test data. We first need to re-train the model on the entire
training data set with $K =$ `r sacr_k`, and then use that model to make predictions
on the test data.
```{r 07-re-train}
-sacr_spec <- nearest_neighbor(weight_func = "rectangular",
+sacr_spec <- nearest_neighbor(weight_func = "rectangular",
neighbors = sacr_k) |>
set_engine("kknn") |>
set_mode("regression")
@@ -800,23 +801,23 @@ knn_mult_mets <- metrics(knn_mult_preds, truth = price, estimate = .pred) |>
knn_mult_mets
```
-This time, when we performed KNN regression on the same data set, but also
-included number of bedrooms as a predictor, we obtained a RMSPE test error
+This time, when we performed K-NN regression on the same data set, but also
+included number of bedrooms as a predictor, we obtained an RMSPE test error
of \$`r format(round(knn_mult_mets |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
-Figure \@ref(fig:07-knn-mult-viz) visualizes the model's predictions overlaid on top of the data. This
+Figure \@ref(fig:07-knn-mult-viz) visualizes the model's predictions overlaid on top of the data. This
time the predictions are a surface in 3D space, instead of a line in 2D space, as we have 2
-predictors instead of 1.
+predictors instead of 1.
-```{r 07-knn-mult-viz, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "KNN regression model’s predictions represented as a surface in 3D space overlaid on top of the data using three predictors (price, house size, and the number of bedrooms). Note that in general we recommend against using 3D visualizations; here we use a 3D visualization only to illustrate what the surface of predictions looks like for learning purposes.", out.width="100%"}
-xvals <- seq(from = min(sacramento_train$sqft),
- to = max(sacramento_train$sqft),
+```{r 07-knn-mult-viz, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "K-NN regression model’s predictions represented as a surface in 3D space overlaid on top of the data using three predictors (price, house size, and the number of bedrooms). Note that in general we recommend against using 3D visualizations; here we use a 3D visualization only to illustrate what the surface of predictions looks like for learning purposes.", out.width="100%"}
+xvals <- seq(from = min(sacramento_train$sqft),
+ to = max(sacramento_train$sqft),
length = 50)
-yvals <- seq(from = min(sacramento_train$beds),
- to = max(sacramento_train$beds),
+yvals <- seq(from = min(sacramento_train$beds),
+ to = max(sacramento_train$beds),
length = 50)
zvals <- knn_mult_fit |>
- predict(crossing(xvals, yvals) |>
+ predict(crossing(xvals, yvals) |>
mutate(sqft = xvals, beds = yvals)) |>
pull(.pred)
@@ -842,7 +843,7 @@ plot_3d <- plot_ly() |>
colorbar = list(title = "Price (USD)")
)
-if(!is_latex_output()){
+if(!is_latex_output()){
plot_3d
} else {
scene = list(camera = list(eye = list(x = -2.1, y = -2.2, z = 0.75)))
@@ -853,7 +854,7 @@ if(!is_latex_output()){
```
We can see that the predictions in this case, where we have 2 predictors, form
-a surface instead of a line. Because the newly added predictor (number of bedrooms) is
+a surface instead of a line. Because the newly added predictor (number of bedrooms) is
related to price (as price changes, so does number of bedrooms)
and is not totally determined by house size (our other predictor),
we get additional and useful information for making our
@@ -862,15 +863,15 @@ house with a size of 2,500 square feet generally increases slightly as the numbe
of bedrooms increases. Without having the additional predictor of number of
bedrooms, we would predict the same price for these two houses.
-## Strengths and limitations of KNN regression
+## Strengths and limitations of K-NN regression
-As with KNN classification (or any prediction algorithm for that matter), KNN
+As with K-NN classification (or any prediction algorithm for that matter), K-NN
regression has both strengths and weaknesses. Some are listed here:
**Strengths:** K-nearest neighbors regression
1. is a simple, intuitive algorithm,
-2. requires few assumptions about what the data must look like, and
+2. requires few assumptions about what the data must look like, and
3. works well with non-linear relationships (i.e., if the relationship is not a straight line).
**Weaknesses:** K-nearest neighbors regression
@@ -881,8 +882,8 @@ regression has both strengths and weaknesses. Some are listed here:
## Exercises
-Practice exercises for the material covered in this chapter
-can be found in the accompanying
+Practice exercises for the material covered in this chapter
+can be found in the accompanying
[worksheets repository](https://worksheets.datasciencebook.ca)
in the "Regression I: K-nearest neighbors" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
diff --git a/source/regression2.Rmd b/source/regression2.Rmd
index ce08502b8..bcc666127 100644
--- a/source/regression2.Rmd
+++ b/source/regression2.Rmd
@@ -35,60 +35,61 @@ hidden_print <- function(x){
hidden_print_cli <- function(x){
cleanup_and_print(cli::cli_fmt(capture.output(x)))
}
-theme_update(axis.title = element_text(size = 12)) # modify axis label size in plots
+theme_update(axis.title = element_text(size = 12)) # modify axis label size in plots
```
-## Overview
+## Overview
Up to this point, we have solved all of our predictive problems—both classification
-and regression—using K-nearest neighbors (KNN)-based approaches. In the context of regression,
+and regression—using K-nearest neighbors (K-NN)-based approaches. In the context of regression,
there is another commonly used method known as *linear regression*. This chapter provides an introduction
to the basic concept of linear regression, shows how to use `tidymodels` to perform linear regression in R,
-and characterizes its strengths and weaknesses compared to KNN regression. The focus is, as usual,
+and characterizes its strengths and weaknesses compared to K-NN regression. The focus is, as usual,
on the case where there is a single predictor and single response variable of interest; but the chapter
concludes with an example using *multivariable linear regression* when there is more than one
predictor.
-## Chapter learning objectives
+## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:
-* Use R and `tidymodels` to fit a linear regression model on training data.
-* Evaluate the linear regression model on test data.
-* Compare and contrast predictions obtained from K-nearest neighbor regression to those obtained using linear regression from the same data set.
+- Use R to fit simple and multivariable linear regression models on training data.
+- Evaluate the linear regression model on test data.
+- Compare and contrast predictions obtained from K-nearest neighbors regression to those obtained using linear regression from the same data set.
+- Describe how linear regression is affected by outliers and multicollinearity.
## Simple linear regression
-At the end of the previous chapter, we noted some limitations of KNN regression.
-While the method is simple and easy to understand, KNN regression does not
+At the end of the previous chapter, we noted some limitations of K-NN regression.
+While the method is simple and easy to understand, K-NN regression does not
predict well beyond the range of the predictors in the training data, and
the method gets significantly slower as the training data set grows. \index{regression!linear}
-Fortunately, there is an alternative to KNN regression—*linear regression*—that addresses
-both of these limitations. Linear regression is also very commonly
+Fortunately, there is an alternative to K-NN regression—*linear regression*—that addresses
+both of these limitations. Linear regression is also very commonly
used in practice because it provides an interpretable mathematical equation that describes
the relationship between the predictor and response variables. In this first part of the chapter, we will focus on *simple* linear regression,
which involves only one predictor variable and one response variable; later on, we will consider
*multivariable* linear regression, which involves multiple predictor variables.
- Like KNN regression, simple linear regression involves
+ Like K-NN regression, simple linear regression involves
predicting a numerical response variable (like race time, house price, or height);
-but *how* it makes those predictions for a new observation is quite different from KNN regression.
+but *how* it makes those predictions for a new observation is quite different from K-NN regression.
Instead of looking at the K nearest neighbors and averaging
-over their values for a prediction, in simple linear regression, we create a
+over their values for a prediction, in simple linear regression, we create a
straight line of best fit through the training data and then
"look up" the prediction using the line.
-> **Note:** Although we did not cover it in earlier chapters, there
+> **Note:** Although we did not cover it in earlier chapters, there
> is another popular method for classification called *logistic
> regression* (it is used for classification even though the name, somewhat confusingly,
> has the word "regression" in it). In logistic regression—similar to linear regression—you
> "fit" the model to the training data and then "look up" the prediction for each new observation.
-> Logistic regression and KNN classification have an advantage/disadvantage comparison
-> similar to that of linear regression and KNN
+> Logistic regression and K-NN classification have an advantage/disadvantage comparison
+> similar to that of linear regression and K-NN
> regression. It is useful to have a good understanding of linear regression before learning about
> logistic regression. After reading this chapter, see the "Additional Resources" section at the end of the
> classification chapters to learn more about logistic regression. \index{regression!logistic}
Let's return to the Sacramento housing data \index{Sacramento real estate} from Chapter \@ref(regression1) to learn
-how to apply linear regression and compare it to KNN regression. For now, we
-will consider
+how to apply linear regression and compare it to K-NN regression. For now, we
+will consider
a smaller version of the housing data to help make our visualizations clear.
Recall our predictive question: can we use the size of a house in the Sacramento, CA area to predict
its sale price? \index{question!regression} In particular, recall that we have come across a new 2,000 square-foot house we are interested
@@ -129,18 +130,18 @@ where
- $\beta_0$ is the *vertical intercept* of the line (the price when house size is 0)
- $\beta_1$ is the *slope* of the line (how quickly the price increases as you increase house size)
-Therefore using the data to find the line of best fit is equivalent to finding coefficients
+Therefore using the data to find the line of best fit is equivalent to finding coefficients
$\beta_0$ and $\beta_1$ that *parametrize* (correspond to) the line of best fit.
Now of course, in this particular problem, the idea of a 0 square-foot house is a bit silly;
-but you can think of $\beta_0$ here as the "base price," and
+but you can think of $\beta_0$ here as the "base price," and
$\beta_1$ as the increase in price for each square foot of space.
-Let's push this thought even further: what would happen in the equation for the line if you
+Let's push this thought even further: what would happen in the equation for the line if you
tried to evaluate the price of a house with size 6 *million* square feet?
Or what about *negative* 2,000 square feet? As it turns out, nothing in the formula breaks; linear
regression will happily make predictions for nonsensical predictor values if you ask it to. But even though
you *can* make these wild predictions, you shouldn't. You should only make predictions roughly within
the range of your original data, and perhaps a bit beyond it only if it makes sense. For example,
-the data in Figure \@ref(fig:08-lin-reg1) only reaches around 800 square feet on the low end, but
+the data in Figure \@ref(fig:08-lin-reg1) only reaches around 800 square feet on the low end, but
it would probably be reasonable to use the linear regression model to make a prediction at 600 square feet, say.
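+
+To make the absurdity concrete, suppose (hypothetically; these are not the coefficients fitted
+to our data) that $\beta_0 = 15{,}000$ and $\beta_1 = 140$. Plugging in a size of 6 million square feet gives
+$$\text{house sale price} = 15{,}000 + 140 \cdot 6{,}000{,}000 = 840{,}015{,}000,$$
+and plugging in $-2{,}000$ square feet gives $15{,}000 + 140 \cdot (-2{,}000) = -265{,}000$:
+an impossibly expensive house and a negative price, respectively. The formula computes both without complaint.
+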
Back to the example! Once we have the coefficients $\beta_0$ and $\beta_1$, we can use the equation
@@ -154,26 +155,26 @@ prediction <- predict(small_model, data.frame(sqft = 2000))
small_plot +
geom_vline(xintercept = 2000, linetype = "dotted") +
- geom_point(aes(x = 2000,
- y = prediction[[1]],
+ geom_point(aes(x = 2000,
+ y = prediction[[1]],
color = "red", size = 2.5)) +
- annotate("text",
- x = 2350,
- y = prediction[[1]]-30000,
- label=paste("$",
+ annotate("text",
+ x = 2350,
+ y = prediction[[1]]-30000,
+ label=paste("$",
format(round(prediction[[1]]),
- big.mark=",",
- nsmall=0,
+ big.mark=",",
+ nsmall=0,
scientific = FALSE),
sep="")) +
theme(legend.position = "none")
```
By using simple linear regression on this small data set to predict the sale price
-for a 2,000 square-foot house, we get a predicted value of
+for a 2,000 square-foot house, we get a predicted value of
\$`r format(round(prediction[[1]]), big.mark=",", nsmall=0, scientific = FALSE)`. But wait a minute... how
exactly does simple linear regression choose the line of best fit? Many
-different lines could be drawn through the data points.
+different lines could be drawn through the data points.
Some plausible examples are shown in Figure \@ref(fig:08-several-lines).
```{r 08-several-lines, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Scatter plot of sale price versus size with many possible lines that could be drawn through the data points."}
@@ -185,10 +186,10 @@ small_plot +
Simple linear regression chooses the straight line of best fit by choosing
the line that minimizes the **average squared vertical distance** between itself and
-each of the observed data points in the training data. Figure \@ref(fig:08-verticalDistToMin) illustrates
-these vertical distances as red lines. Finally, to assess the predictive
+each of the observed data points in the training data. Figure \@ref(fig:08-verticalDistToMin) illustrates
+these vertical distances as red lines. Finally, to assess the predictive
accuracy of a simple linear regression model,
-we use RMSPE—the same measure of predictive performance we used with KNN regression.
+we use RMSPE—the same measure of predictive performance we used with K-NN regression.
\index{RMSPE}
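+
+In mathematical terms, choosing the line that minimizes the average squared vertical distance
+amounts to solving the standard *least squares* problem
+$$\hat{\beta}_0, \hat{\beta}_1 = \mathop{\mathrm{arg\,min}}_{\beta_0, \beta_1} \; \frac{1}{n}\sum_{i=1}^{n} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2,$$
+where $y_i$ is the observed sale price and $x_i$ is the size of the $i^\text{th}$ house in the training data.
+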
```{r 08-verticalDistToMin, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Scatter plot of sale price versus size with red lines denoting the vertical distances between the predicted values and the observed data points."}
@@ -196,15 +197,15 @@ small_sacramento <- small_sacramento |>
mutate(predicted = predict(small_model))
small_plot +
- geom_segment(data = small_sacramento,
- aes(xend = sqft, yend = predicted),
+ geom_segment(data = small_sacramento,
+ aes(xend = sqft, yend = predicted),
colour = "red")
```
## Linear regression in R
We can perform simple linear regression in R using `tidymodels` \index{tidymodels} in a
-very similar manner to how we performed KNN regression.
+very similar manner to how we performed K-NN regression.
To do this, instead of creating a `nearest_neighbor` model specification with
the `kknn` engine, we use a `linear_reg` model specification
with the `lm` engine. Another difference is that we do not need to choose $K$ in the
@@ -261,17 +262,17 @@ hidden_print(lm_fit)
> hurt anything—but if you leave the predictors in their original form,
> the best fit coefficients are usually easier to interpret afterward.
-Our coefficients are
+Our coefficients are
(intercept) $\beta_0=$ `r format(round(pull(tidy(extract_fit_parsnip(lm_fit)), estimate)[1]), scientific=FALSE)`
and (slope) $\beta_1=$ `r format(round(pull(tidy(extract_fit_parsnip(lm_fit)), estimate)[2]), scientific=FALSE)`.
This means that the equation of the line of best fit is
$$\text{house sale price} = `r format(round(pull(tidy(extract_fit_parsnip(lm_fit)), estimate)[1]), scientific=FALSE)` + `r format(round(pull(tidy(extract_fit_parsnip(lm_fit)), estimate)[2]), scientific=FALSE)`\cdot (\text{house size}).$$
-In other words, the model predicts that houses
+In other words, the model predicts that houses
start at \$`r format(round(pull(tidy(extract_fit_parsnip(lm_fit)), estimate)[1]), big.mark=",", nsmall=0, scientific=FALSE)` for 0 square feet, and that
-every extra square foot increases the cost of
-the house by \$`r format(round(pull(tidy(extract_fit_parsnip(lm_fit)), estimate)[2]), scientific=FALSE)`. Finally,
+every extra square foot increases the cost of
+the house by \$`r format(round(pull(tidy(extract_fit_parsnip(lm_fit)), estimate)[2]), scientific=FALSE)`. Finally,
we predict on the test data set to assess how well our model does:
```{r 08-assessFinal}
@@ -314,8 +315,8 @@ sacr_preds <- lm_fit |>
lm_plot_final <- ggplot(sacramento, aes(x = sqft, y = price)) +
geom_point(alpha = 0.4) +
- geom_line(data = sacr_preds,
- mapping = aes(x = sqft, y = .pred),
+ geom_line(data = sacr_preds,
+ mapping = aes(x = sqft, y = .pred),
color = "blue") +
xlab("House size (square feet)") +
ylab("Price (USD)") +
@@ -338,16 +339,16 @@ coeffs <- lm_fit |>
coeffs
```
-## Comparing simple linear and KNN regression
+## Comparing simple linear and K-NN regression
-Now that we have a general understanding of both simple linear and KNN
+Now that we have a general understanding of both simple linear and K-NN
regression, we \index{regression!comparison of methods} can start to compare and contrast these methods as well as the
predictions made by them. To start, let's look at the visualization of the
simple linear regression model predictions for the Sacramento real estate data
-(predicting price from house size) and the "best" KNN regression model
+(predicting price from house size) and the "best" K-NN regression model
obtained from the same problem, shown in Figure \@ref(fig:08-compareRegression).
-```{r 08-compareRegression, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 4.75, fig.width = 10, fig.cap = "Comparison of simple linear regression and KNN regression."}
+```{r 08-compareRegression, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 4.75, fig.width = 10, fig.cap = "Comparison of simple linear regression and K-NN regression."}
set.seed(1234)
# neighbors = 52 from regression1 chapter
sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 52) |>
@@ -390,12 +391,12 @@ knn_plot_final <- ggplot(sacr_preds, aes(x = sqft, y = price)) +
ylab("Price (USD)") +
scale_y_continuous(labels = dollar_format()) +
geom_line(data = sacr_preds, aes(x = sqft, y = .pred), color = "blue") +
- ggtitle("KNN regression") +
+ ggtitle("K-NN regression") +
annotate("text", x = 3500, y = 100000, label = paste("RMSPE =", sacr_rmspe)) +
- theme(text = element_text(size = 18), axis.title=element_text(size=18))
+ theme(text = element_text(size = 18), axis.title=element_text(size=18))
+
-
lm_rmspe <- lm_test_results |>
filter(.metric == "rmse") |>
pull(.estimate) |>
@@ -404,75 +405,75 @@ lm_rmspe <- lm_test_results |>
lm_plot_final <- lm_plot_final +
annotate("text", x = 3500, y = 100000, label = paste("RMSPE =", lm_rmspe)) +
ggtitle("linear regression") +
- theme(text = element_text(size = 18), axis.title=element_text(size=18))
+ theme(text = element_text(size = 18), axis.title=element_text(size=18))
grid.arrange(lm_plot_final, knn_plot_final, ncol = 2)
```
What differences do we observe in Figure \@ref(fig:08-compareRegression)? One obvious
difference is the shape of the blue lines. In simple linear regression we are
-restricted to a straight line, whereas in KNN regression our line is much more
+restricted to a straight line, whereas in K-NN regression our line is much more
flexible and can be quite wiggly. But there is a major interpretability advantage in limiting the
-model to a straight line. A
+model to a straight line. A
straight line can be defined by two numbers, the
vertical intercept and the slope. The intercept tells us what the prediction is when
all of the predictors are equal to 0; and the slope tells us what unit increase in the response
variable we predict given a unit increase in the predictor
-variable. KNN regression, as simple as it is to implement and understand, has no such
-interpretability from its wiggly line.
+variable. K-NN regression, as simple as it is to implement and understand, has no such
+interpretability from its wiggly line.
There can, however, also be a disadvantage to using a simple linear regression
model in some cases, particularly when the relationship between the response and
-the predictor is not linear, but instead some other shape (e.g., curved or oscillating). In
+the predictor is not linear, but instead some other shape (e.g., curved or oscillating). In
these cases the prediction model from a simple linear regression
will underfit \index{underfitting!regression} (have high bias), meaning that the model's predicted values do not
match the actual observed values very well. Such a model would probably have a
quite high RMSE when assessing model goodness of fit on the training data and
a quite high RMSPE when assessing model prediction quality on a test data
-set. On such a data set, KNN regression may fare better. Additionally, there
+set. On such a data set, K-NN regression may fare better. Additionally, there
are other types of regression you can learn about in future books that may do
even better at predicting with such data.
How do these two models compare on the Sacramento house prices data set? In
-Figure \@ref(fig:08-compareRegression), we also printed the RMSPE as calculated from
+Figure \@ref(fig:08-compareRegression), we also printed the RMSPE as calculated from
predicting on the test data set that was not used to train/fit the models. The RMSPE for the simple linear
-regression model is slightly lower than the RMSPE for the KNN regression model.
+regression model is slightly lower than the RMSPE for the K-NN regression model.
Considering that the simple linear regression model is also more interpretable,
if we were comparing these in practice we would likely choose to use the simple
linear regression model.
-Finally, note that the KNN regression model becomes "flat"
+Finally, note that the K-NN regression model becomes "flat"
at the left and right boundaries of the data, while the linear model
predicts a constant slope. Predicting outside the range of the observed
-data is known as *extrapolation*; \index{extrapolation} KNN and linear models behave quite differently
+data is known as *extrapolation*; \index{extrapolation} K-NN and linear models behave quite differently
when extrapolating. Depending on the application, the flat
or constant slope trend may make more sense. For example, if our housing
-data were slightly different, the linear model may have actually predicted
+data were slightly different, the linear model may have actually predicted
a *negative* price for a small house (if the intercept $\beta_0$ was negative),
which obviously does not match reality. On the other hand, the trend of increasing
-house size corresponding to increasing house price probably continues for large houses,
-so the "flat" extrapolation of KNN likely does not match reality.
+house size corresponding to increasing house price probably continues for large houses,
+so the "flat" extrapolation of K-NN likely does not match reality.
## Multivariable linear regression
-As in KNN classification and KNN regression, we can move beyond the simple
-case of only one predictor to the case with multiple predictors,
+As in K-NN classification and K-NN regression, we can move beyond the simple
+case of only one predictor to the case with multiple predictors,
known as *multivariable linear regression*. \index{regression!multivariable linear}\index{regression!multivariable linear equation|see{plane equation}}
To do this, we follow a very similar approach to what we did for
-KNN regression: we just add more predictors to the model formula in the
+K-NN regression: we just add more predictors to the model formula in the
recipe. But recall that we do not need to use cross-validation to choose any parameters,
-nor do we need to standardize (i.e., center and scale) the data for linear regression.
+nor do we need to standardize (i.e., center and scale) the data for linear regression.
Note once again that we have the same concerns regarding multiple predictors
- as in the settings of multivariable KNN regression and classification: having more predictors is **not** always
-better. But because the same predictor selection
+ as in the settings of multivariable K-NN regression and classification: having more predictors is **not** always
+better. But because the same predictor selection
algorithm from the classification chapter extends to the setting of linear regression,
it will not be covered again in this chapter.
We will demonstrate multivariable linear regression using the Sacramento real estate \index{Sacramento real estate}
data with both house size
(measured in square feet) as well as number of bedrooms as our predictors, and
-continue to use house sale price as our response variable. We will start by
-changing the formula in the recipe to
+continue to use house sale price as our response variable. We will start by
+changing the formula in the recipe to
include both the `sqft` and `beds` variables as predictors:
```{r 08-lm-mult-test-train-split}
@@ -510,15 +511,15 @@ In the case of two predictors, we can plot the predictions made by our linear re
shown in Figure \@ref(fig:08-3DlinReg).
```{r 08-3DlinReg, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Linear regression plane of best fit overlaid on top of the data (using price, house size, and number of bedrooms as predictors). Note that in general we recommend against using 3D visualizations; here we use a 3D visualization only to illustrate what the regression plane looks like for learning purposes.", out.width="100%"}
-xvals <- seq(from = min(sacramento_train$sqft),
- to = max(sacramento_train$sqft),
+xvals <- seq(from = min(sacramento_train$sqft),
+ to = max(sacramento_train$sqft),
length = 50)
-yvals <- seq(from = min(sacramento_train$beds),
- to = max(sacramento_train$beds),
+yvals <- seq(from = min(sacramento_train$beds),
+ to = max(sacramento_train$beds),
length = 50)
zvals <- mlm_fit |>
- predict(crossing(xvals, yvals) |>
+ predict(crossing(xvals, yvals) |>
mutate(sqft = xvals, beds = yvals)) |>
pull(.pred)
@@ -544,7 +545,7 @@ plot_3d <- plot_ly() |>
colorbar = list(title = "Price (USD)")
)
-if(!is_latex_output()){
+if(!is_latex_output()){
plot_3d
} else {
scene = list(camera = list(eye = list(x = -2.1, y = -2.2, z = 0.75)))
@@ -555,8 +556,8 @@ if(!is_latex_output()){
```
We see that the predictions from linear regression with two predictors form a
-flat plane. This is the hallmark of linear regression, and differs from the
-wiggly, flexible surface we get from other methods such as KNN regression.
+flat plane. This is the hallmark of linear regression, and differs from the
+wiggly, flexible surface we get from other methods such as K-NN regression.
As discussed, this can be advantageous in one aspect, which is that for each
predictor, we can get slopes/intercept from linear regression, and thus describe the
plane mathematically. We can extract those slope values from our model object
@@ -580,31 +581,31 @@ where:
- $\beta_2$ is the *slope* for the second predictor (how quickly the price changes as you increase the number of bedrooms, holding house size constant)
Finally, we can fill in the values for $\beta_0$, $\beta_1$ and $\beta_2$ from the model output above
-to create the equation of the plane of best fit to the data:
+to create the equation of the plane of best fit to the data:
```{r 08-lm-multi-get-coeffs-hidden, echo = FALSE}
-icept <- format(round(mcoeffs |>
- filter(term == "(Intercept)") |>
+icept <- format(round(mcoeffs |>
+ filter(term == "(Intercept)") |>
pull(estimate)), scientific = FALSE)
-sqftc <- format(round(mcoeffs |>
- filter(term == "sqft") |>
+sqftc <- format(round(mcoeffs |>
+ filter(term == "sqft") |>
pull(estimate)), scientific = FALSE)
-bedsc <- format(round(mcoeffs |>
- filter(term == "beds") |>
+bedsc <- format(round(mcoeffs |>
+ filter(term == "beds") |>
pull(estimate)), scientific = FALSE)
```
$$\text{house sale price} = `r icept` + `r sqftc`\cdot (\text{house size}) `r bedsc` \cdot (\text{number of bedrooms})$$
-This model is more interpretable than the multivariable KNN
+This model is more interpretable than the multivariable K-NN
regression model; we can write a mathematical equation that explains how
-each predictor is affecting the predictions. But as always, we should
+each predictor is affecting the predictions. But as always, we should
question how well multivariable linear regression is doing compared to
-the other tools we have, such as simple linear regression
-and multivariable KNN regression. If this comparison is part of
+the other tools we have, such as simple linear regression
+and multivariable K-NN regression. If this comparison is part of
the model tuning process—for example, if we are trying
out many different sets of predictors for multivariable linear
-and KNN regression—we must perform this comparison using
+and K-NN regression—we must perform this comparison using
cross-validation on only our training data. But if we have already
decided on a small number (e.g., 2 or 3) of tuned candidate models and
we want to make a final comparison, we can do so by comparing the prediction
@@ -614,29 +615,29 @@ error of the methods on the test data.
lm_mult_test_results
```
-We obtain an RMSPE \index{RMSPE} for the multivariable linear regression model
+We obtain an RMSPE \index{RMSPE} for the multivariable linear regression model
of \$`r format(lm_mult_test_results |> filter(.metric == 'rmse') |> pull(.estimate), big.mark=",", nsmall=0, scientific = FALSE)`. This prediction error
- is less than the prediction error for the multivariable KNN regression model,
+ is less than the prediction error for the multivariable K-NN regression model,
indicating that we should likely choose linear regression for predictions of
house sale price on this data set. Revisiting the simple linear regression model
-with only a single predictor from earlier in this chapter, we see that the RMSPE for that model was
+with only a single predictor from earlier in this chapter, we see that the RMSPE for that model was
\$`r format(lm_test_results |> filter(.metric == 'rmse') |> pull(.estimate), big.mark=",", nsmall=0, scientific = FALSE)`,
-which is almost the same as that of our more complex model.
+which is almost the same as that of our more complex model.
As mentioned earlier, this is not always the case: often including more
-predictors will either positively or negatively impact the prediction performance on unseen
+predictors will either positively or negatively impact the prediction performance on unseen
test data.
## Multicollinearity and outliers
-What can go wrong when performing (possibly multivariable) linear regression?
-This section will introduce two common issues—*outliers* and *collinear predictors*—and
+What can go wrong when performing (possibly multivariable) linear regression?
+This section will introduce two common issues—*outliers* and *collinear predictors*—and
illustrate their impact on predictions.
### Outliers
Outliers \index{outliers} are data points that do not follow the usual pattern of the rest of the data.
In the setting of linear regression, these are points that
- have a vertical distance to the line of best fit that is either much higher or much lower
+ have a vertical distance to the line of best fit that is either much higher or much lower
than you might expect based on the rest of the data. The problem with outliers is that
they can have *too much influence* on the line of best fit. In general, it is very difficult
to judge accurately which data are outliers without advanced techniques that are beyond
@@ -646,7 +647,7 @@ But to illustrate what can happen when you have outliers, Figure \@ref(fig:08-lm
shows a small subset of the Sacramento housing data again, except we have added a *single* data point (highlighted
in red). This house is 5,000 square feet in size, and sold for only \$50,000. Unbeknownst to the
data analyst, this house was sold by a parent to their child for an absurdly low price. Of course,
-this is not representative of the real housing market values that the other data points follow;
+this is not representative of the real housing market values that the other data points follow;
the data point is an *outlier*. In blue we plot the original line of best fit, and in red
we plot the new line of best fit including the outlier. You can see how different the red line
is from the blue line, which is entirely caused by that one extra outlier data point.
@@ -657,14 +658,14 @@ sacramento_outlier <- tibble(sqft = 5000, price = 50000)
lm_plot_outlier <- ggplot(sacramento_train_small, aes(x = sqft, y = price)) +
geom_point(alpha = 0.4) +
- geom_point(data = sacramento_outlier,
+ geom_point(data = sacramento_outlier,
mapping = aes(x = sqft, y = price), color="red", size = 2.5) +
xlab("House size (square feet)") +
ylab("Price (USD)") +
scale_y_continuous(labels = dollar_format()) +
- geom_smooth(method = "lm", se = FALSE) +
- geom_smooth(data = sacramento_train_small |>
- add_row(sqft = 5000, price = 50000),
+ geom_smooth(method = "lm", se = FALSE) +
+ geom_smooth(data = sacramento_train_small |>
+ add_row(sqft = 5000, price = 50000),
method = "lm", se = FALSE, color = "red")
lm_plot_outlier
@@ -685,18 +686,18 @@ sacramento_outlier <- tibble(sqft = 5000, price = 50000)
lm_plot_outlier_large <- ggplot(sacramento_train, aes(x = sqft, y = price)) +
geom_point(alpha = 0.4) +
- geom_point(data = sacramento_outlier,
- mapping = aes(x = sqft, y = price),
- color="red",
+ geom_point(data = sacramento_outlier,
+ mapping = aes(x = sqft, y = price),
+ color="red",
size = 2.5) +
xlab("House size (square feet)") +
ylab("Price (USD)") +
scale_y_continuous(labels = dollar_format()) +
- geom_smooth(method = "lm", se = FALSE) +
- geom_smooth(data = sacramento_train |>
- add_row(sqft = 5000, price = 50000),
- method = "lm",
- se = FALSE,
+ geom_smooth(method = "lm", se = FALSE) +
+ geom_smooth(data = sacramento_train |>
+ add_row(sqft = 5000, price = 50000),
+ method = "lm",
+ se = FALSE,
color = "red")
lm_plot_outlier_large
@@ -719,17 +720,17 @@ sacramento_train <- sacramento_train |>
mutate(sqft1 = sqft + 100 * sample(1000000,
size=nrow(sacramento_train),
replace=TRUE)/1000000) |>
- mutate(sqft2 = sqft + 100 * sample(1000000,
+ mutate(sqft2 = sqft + 100 * sample(1000000,
size=nrow(sacramento_train),
replace=TRUE)/1000000) |>
- mutate(sqft3 = sqft + 100 * sample(1000000,
+ mutate(sqft3 = sqft + 100 * sample(1000000,
size=nrow(sacramento_train),
- replace=TRUE)/1000000)
+ replace=TRUE)/1000000)
lm_plot_multicol_1 <- ggplot(sacramento_train) +
geom_point(aes(x = sqft, y = sqft1), alpha=0.4)+
xlab("House size measurement 1 (square feet)") +
- ylab("House size measurement 2 (square feet)")
+ ylab("House size measurement 2 (square feet)")
lm_plot_multicol_1
```
@@ -743,17 +744,17 @@ lm_fit1 <- workflow() |>
coeffs <- tidy(extract_fit_parsnip(lm_fit1))
-icept1 <- format(round(coeffs |>
- filter(term == "(Intercept)") |>
- pull(estimate)),
+icept1 <- format(round(coeffs |>
+ filter(term == "(Intercept)") |>
+ pull(estimate)),
scientific = FALSE)
-sqft1 <- format(round(coeffs |>
- filter(term == "sqft") |>
- pull(estimate)),
+sqft1 <- format(round(coeffs |>
+ filter(term == "sqft") |>
+ pull(estimate)),
scientific = FALSE)
-sqft11 <- format(round(coeffs |>
- filter(term == "sqft1") |>
- pull(estimate)),
+sqft11 <- format(round(coeffs |>
+ filter(term == "sqft1") |>
+ pull(estimate)),
scientific = FALSE)
lm_recipe2 <- recipe(price ~ sqft + sqft2, data = sacramento_train)
@@ -764,17 +765,17 @@ lm_fit2 <- workflow() |>
fit(data = sacramento_train)
coeffs <- tidy(extract_fit_parsnip(lm_fit2))
-icept2 <- format(round(coeffs |>
- filter(term == "(Intercept)") |>
- pull(estimate)),
+icept2 <- format(round(coeffs |>
+ filter(term == "(Intercept)") |>
+ pull(estimate)),
scientific = FALSE)
-sqft2 <- format(round(coeffs |>
- filter(term == "sqft") |>
- pull(estimate)),
+sqft2 <- format(round(coeffs |>
+ filter(term == "sqft") |>
+ pull(estimate)),
scientific = FALSE)
-sqft22 <- format(round(coeffs |>
- filter(term == "sqft2") |>
- pull(estimate)),
+sqft22 <- format(round(coeffs |>
+ filter(term == "sqft2") |>
+ pull(estimate)),
scientific = FALSE)
lm_recipe3 <- recipe(price ~ sqft + sqft3, data = sacramento_train)
@@ -785,17 +786,17 @@ lm_fit3 <- workflow() |>
fit(data = sacramento_train)
coeffs <- tidy(extract_fit_parsnip(lm_fit3))
-icept3 <- format(round(coeffs |>
- filter(term == "(Intercept)") |>
- pull(estimate)),
+icept3 <- format(round(coeffs |>
+ filter(term == "(Intercept)") |>
+ pull(estimate)),
scientific = FALSE)
-sqft3 <- format(round(coeffs |>
- filter(term == "sqft") |>
- pull(estimate)),
+sqft3 <- format(round(coeffs |>
+ filter(term == "sqft") |>
+ pull(estimate)),
scientific = FALSE)
-sqft33 <- format(round(coeffs |>
- filter(term == "sqft3") |>
- pull(estimate)),
+sqft33 <- format(round(coeffs |>
+ filter(term == "sqft3") |>
+ pull(estimate)),
scientific = FALSE)
```
@@ -820,15 +821,15 @@ book; see the list of additional resources at the end of this chapter to find ou
We were quite fortunate in our initial exploration to find a predictor variable (house size)
that seems to have a meaningful and nearly linear relationship with our response variable (sale price).
But what should we do if we cannot immediately find such a nice variable?
-Well, sometimes it is just a fact that the variables in the data do not have enough of
+Well, sometimes it is just a fact that the variables in the data do not have enough of
a relationship with the response variable to provide useful predictions. For example,
if the only available predictor was "the current house owner's favorite ice cream flavor",
we likely would have little hope of using that variable to predict the house's sale price
-(barring any future remarkable scientific discoveries about the relationship between
-the housing market and homeowner ice cream preferences). In cases like these,
+(barring any future remarkable scientific discoveries about the relationship between
+the housing market and homeowner ice cream preferences). In cases like these,
the only option is to obtain measurements of more useful variables.
-There are, however, a wide variety of cases where the predictor variables do have a
+There are, however, a wide variety of cases where the predictor variables do have a
meaningful relationship with the response variable, but that relationship does not fit
the assumptions of the regression method you have chosen. For example, a data frame `df`
with two variables—`x` and `y`—with a nonlinear relationship between the two variables
@@ -839,7 +840,7 @@ set.seed(1)
df <- tibble(x = sample(10000, 100, replace = TRUE) / 10000)
-df <- df |>
+df <- df |>
mutate(y = x^3 + 0.2 * sample(10000, 100, replace=TRUE)/10000 - 0.1)
```
@@ -868,8 +869,8 @@ df <- df |>
Then we can perform linear regression for `y` using the predictor variable `z`,
as shown in Figure \@ref(fig:08-predictor-design-2).
-Here you can see that the transformed predictor `z` helps the
-linear regression model make more accurate predictions.
+Here you can see that the transformed predictor `z` helps the
+linear regression model make more accurate predictions.
Note that none of the `y` response values have changed between Figures \@ref(fig:08-predictor-design)
and \@ref(fig:08-predictor-design-2); the only change is that the `x` values
have been replaced by `z` values.
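As a minimal, self-contained sketch of this step (assuming the cubic transformation suggested by how `y` was generated above; `lm_fit_z` is an illustrative name that does not appear elsewhere in the book), the engineered predictor and the corresponding linear fit might look like:

```
library(tidyverse)
library(tidymodels)

# re-create the simulated data from above
set.seed(1)
df <- tibble(x = sample(10000, 100, replace = TRUE) / 10000) |>
  mutate(y = x^3 + 0.2 * sample(10000, 100, replace = TRUE) / 10000 - 0.1)

# engineer a new predictor z that has a roughly linear relationship with y
df <- df |>
  mutate(z = x^3)

# fit a simple linear regression of y on the engineered predictor z
lm_fit_z <- workflow() |>
  add_recipe(recipe(y ~ z, data = df)) |>
  add_model(linear_reg() |> set_engine("lm")) |>
  fit(data = df)
```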
@@ -883,12 +884,12 @@ curve_plt2 <- ggplot(df, aes(x = z, y = y)) +
curve_plt2
```
-
+
The process of
transforming predictors (and potentially combining multiple predictors in the process)
is known as *feature engineering*. \index{feature engineering|see{predictor design}} In real data analysis
problems, you will need to rely on
-a deep understanding of the problem—as well as the wrangling tools
+a deep understanding of the problem—as well as the wrangling tools
from previous chapters—to engineer useful new features that improve
predictive performance.
@@ -904,11 +905,11 @@ So far in this textbook we have used regression only in the context of
prediction. However, regression can also be seen as a method to understand and
quantify the effects of individual predictor variables on a response variable of interest.
In the housing example from this chapter, beyond just using past data
-to predict future sale prices,
+to predict future sale prices,
we might also be interested in describing the
individual relationships of house size and the number of bedrooms with house price,
quantifying how strong each of these relationships is, and assessing how accurately we
-can estimate their magnitudes. And even beyond that, we may be interested in
+can estimate their magnitudes. And even beyond that, we may be interested in
understanding whether the predictors *cause* changes in the price.
These sides of regression are well beyond the scope of this book; but
the material you have learned here should give you a foundation of knowledge
@@ -916,8 +917,8 @@ that will serve you well when moving to more advanced books on the topic.
## Exercises
-Practice exercises for the material covered in this chapter
-can be found in the accompanying
+Practice exercises for the material covered in this chapter
+can be found in the accompanying
[worksheets repository](https://worksheets.datasciencebook.ca)
in the "Regression II: linear regression" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
@@ -935,7 +936,7 @@ and guidance that the worksheets provide will function as intended.
packages in the past two chapters. Aside from that, it also has a [nice
beginner's tutorial](https://www.tidymodels.org/start/) and [an extensive list
of more advanced examples](https://www.tidymodels.org/learn/) that you can use
- to continue learning beyond the scope of this book.
+ to continue learning beyond the scope of this book.
- *Modern Dive* [@moderndive] is another textbook that uses the
`tidyverse` / `tidymodels` framework. Chapter 6 complements the material in
the current chapter well; it covers some slightly more advanced concepts than
@@ -944,7 +945,7 @@ and guidance that the worksheets provide will function as intended.
"explanatory" / "inferential" approach to regression in general (in Chapters 5,
6, and 10), which provides a nice complement to the predictive tack we take in
the present book.
-- *An Introduction to Statistical Learning* [@james2013introduction] provides
+- *An Introduction to Statistical Learning* [@james2013introduction] provides
a great next stop in the process of
learning about regression. Chapter 3 covers linear regression at a slightly
more mathematical level than we do here, but it is not too large a leap and so
@@ -952,6 +953,6 @@ and guidance that the worksheets provide will function as intended.
of "informative" predictors when you have a data set with many predictors, and
you expect only a few of them to be relevant. Chapter 7 covers regression
models that are more flexible than linear regression models but still enjoy the
- computational efficiency of linear regression. In contrast, the KNN methods we
+ computational efficiency of linear regression. In contrast, the K-NN methods we
covered earlier are indeed more flexible but become very slow when given lots
of data.
diff --git a/source/setup.Rmd b/source/setup.Rmd
index 9efe6ebb7..b52bb87d7 100644
--- a/source/setup.Rmd
+++ b/source/setup.Rmd
@@ -16,8 +16,8 @@ options(knitr.table.format = ifelse(knitr::is_latex_output(), "latex", "html"))
In this chapter, you'll learn how to set up the software needed to follow along
with this book on your own computer. Given that installation instructions can
vary based on computer setup, we provide instructions for
-multiple operating systems (Ubuntu Linux, MacOS, and Windows).
-Although the instructions in this chapter will likely work on many systems,
+multiple operating systems (Ubuntu Linux, MacOS, and Windows).
+Although the instructions in this chapter will likely work on many systems,
we have specifically verified that they work on a computer that:
- runs Windows 10 Home, MacOS 13 Ventura, or Ubuntu 22.04,
@@ -38,13 +38,13 @@ By the end of the chapter, readers will be able to do the following:
## Obtaining the worksheets for this book
-The worksheets containing exercises for this book
+The worksheets containing exercises for this book
are online at [https://worksheets.datasciencebook.ca](https://worksheets.datasciencebook.ca).
The worksheets can be launched directly from that page using the Binder links in the rightmost
-column of the table. This is the easiest way to access the worksheets, but note that you will not
+column of the table. This is the easiest way to access the worksheets, but note that you will not
be able to save your work and return to it again later.
-In order to save your progress, you will need to download the worksheets to your own computer and
-work on them locally. You can download the worksheets as a compressed zip file
+In order to save your progress, you will need to download the worksheets to your own computer and
+work on them locally. You can download the worksheets as a compressed zip file
using [the link at the top of the page](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/archive/refs/heads/main.zip).
Once you unzip the downloaded file, you will have a folder containing all of the Jupyter notebook worksheets
accompanying this book. See Chapter \@ref(jupyter) for
@@ -59,7 +59,7 @@ software packages, not to mention getting the right versions of
everything—the worksheets and autograder tests may not work unless all the versions are
exactly right! To keep things simple, we instead recommend that you install
[Docker](https://docker.com). Docker lets you run your Jupyter notebooks inside
-a pre-built *container* that comes with precisely the right versions of
+a pre-built *container* that comes with precisely the right versions of
all software packages needed to run the worksheets that come with this book.
\index{Docker}
@@ -73,37 +73,37 @@ all software packages needed run the worksheets that come with this book.
### Windows
-**Installation** To install Docker on Windows,
+**Installation** To install Docker on Windows,
visit [the online Docker documentation](https://docs.docker.com/desktop/install/windows-install/),
and download the `Docker Desktop Installer.exe` file. Double-click the file to open the installer
and follow the instructions on the installation wizard, choosing **WSL-2** instead of **Hyper-V** when prompted.
> **Note:** Occasionally, when you first run Docker on Windows, you will encounter an error message. Some common errors you may see:
->
+>
> - If you need to update WSL, you can enter `cmd.exe` in the Start menu to run the command line. Type `wsl --update` to update WSL.
-> - If the admin account on your computer is different to your user account, you must add the user to the "docker-users" group.
-> Run Computer Management as an administrator and navigate to `Local Users` and `Groups -> Groups -> docker-users`. Right-click to
+> - If the admin account on your computer is different from your user account, you must add the user to the "docker-users" group.
+> Run Computer Management as an administrator and navigate to `Local Users` and `Groups -> Groups -> docker-users`. Right-click to
> add the user to the group. Log out and log back in for the changes to take effect.
> - If you need to enable virtualization, you will need to edit your BIOS. Restart your computer, and enter the BIOS using the hotkey
> (usually Delete, Esc, and/or one of the F# keys). Look for an "Advanced" menu, and under your CPU settings, set the "Virtualization" option
-> to "enabled". Then save the changes and reboot your machine. If you are not familiar with BIOS editing, you may want to find an expert
-> to help you with this, as editing the BIOS can be dangerous. Detailed instructions for doing this are beyond the scope of this book.
+> to "enabled". Then save the changes and reboot your machine. If you are not familiar with BIOS editing, you may want to find an expert
+> to help you with this, as editing the BIOS can be dangerous. Detailed instructions for doing this are beyond the scope of this book.
**Running JupyterLab** Run Docker Desktop. Once it is running, you need to download and run the
-Docker *image* that we have made available for the worksheets (an *image* is like a "snapshot" of a
+Docker *image* that we have made available for the worksheets (an *image* is like a "snapshot" of a
computer with all the right packages pre-installed). You only need to do this step one time; the image will remain available
the next time you run Docker Desktop.
-In the Docker Desktop search bar, enter `ubcdsci/r-dsci-100`, as this is
+In the Docker Desktop search bar, enter `ubcdsci/r-dsci-100`, as this is
the name of the image. You will see the `ubcdsci/r-dsci-100` image in the list (Figure \@ref(fig:docker-desktop-search)),
and "latest" in the Tag drop down menu. We need to change "latest" to the right image version before proceeding.
-To find the right tag, open
+To find the right tag, open
the [`Dockerfile` in the worksheets repository](https://raw.githubusercontent.com/UBC-DSCI/data-science-a-first-intro-worksheets/main/Dockerfile),
and look for the line `FROM ubcdsci/r-dsci-100:` followed by the tag consisting of a sequence of numbers and letters.
+Back in Docker Desktop, in the "Tag" drop-down menu, click that tag to select the correct image version. Then click
-the "Pull" button to download the image.
+Back in Docker Desktop, in the "Tag" drop down menu, click that tag to select the correct image version. Then click
+the "Pull" button to download the image.
```{r docker-desktop-search, echo = FALSE, fig.cap = "The Docker Desktop search window. Make sure to click the Tag drop down menu and find the right version of the image before clicking the Pull button to download it.", fig.retina = 2, out.width="100%"}
-image_read("img/setup/docker-1.png") |>
+image_read("img/setup/docker-1.png") |>
image_crop("3632x2000")
```
@@ -112,12 +112,12 @@ of the Docker Desktop window (Figure \@ref(fig:docker-desktop-images)). You
will see the recently downloaded image listed there under the "Local" tab.
```{r docker-desktop-images, echo = FALSE, fig.cap = "The Docker Desktop images tab.", fig.retina = 2, out.width="100%"}
-image_read("img/setup/docker-2.png") |>
+image_read("img/setup/docker-2.png") |>
image_crop("3632x2000")
```
To start up a *container* using that image, click the play button beside the
-image. This will open the run configuration menu (Figure \@ref(fig:docker-desktop-runconfig)).
+image. This will open the run configuration menu (Figure \@ref(fig:docker-desktop-runconfig)).
Expand the "Optional settings" drop down menu. In the "Host port" textbox, enter
`8888`. In the "Volumes" section, click the "Host path" box and navigate to the
folder where your Jupyter worksheets are stored. In the "Container path" text
@@ -125,26 +125,26 @@ box, enter `/home/jovyan/work`. Then click the "Run" button to start the
container.
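For reference, these settings correspond roughly to the single command used later in the Ubuntu section: port `8888` on the host maps to the container, and your worksheets folder is mounted at `/home/jovyan/work`. From a Unix-style shell opened in your worksheets folder, that command looks like the following (again replacing `TAG` with the tag you found earlier):

```
docker run --rm -v $(pwd):/home/jovyan/work -p 8888:8888 ubcdsci/r-dsci-100:TAG jupyter lab
```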
```{r docker-desktop-runconfig, echo = FALSE, fig.cap = "The Docker Desktop container run configuration menu.", fig.retina = 2, out.width="100%"}
-image_read("img/setup/docker-3.png") |>
+image_read("img/setup/docker-3.png") |>
image_crop("3632x2000")
```
-After clicking the "Run" button, you will see a terminal. The terminal will then print
-some text as the Docker container starts. Once the text stops scrolling, find the
-URL in the terminal that starts
-with `http://127.0.0.1:8888` (highlighted by the red box in Figure \@ref(fig:docker-desktop-url)), and paste it
+After clicking the "Run" button, you will see a terminal. The terminal will then print
+some text as the Docker container starts. Once the text stops scrolling, find the
+URL in the terminal that starts
+with `http://127.0.0.1:8888` (highlighted by the red box in Figure \@ref(fig:docker-desktop-url)), and paste it
into your browser to start JupyterLab.
```{r docker-desktop-url, echo = FALSE, fig.cap = "The terminal text after running the Docker container. The red box indicates the URL that you should paste into your browser to open JupyterLab.", fig.retina = 2, out.width="100%"}
-image_read("img/setup/docker-4.png") |>
+image_read("img/setup/docker-4.png") |>
image_crop("3632x2000")
```
-When you are done working, make sure to shut down and remove the container by
+When you are done working, make sure to shut down and remove the container by
clicking the red trash can symbol (in the top right corner of Figure \@ref(fig:docker-desktop-url)).
You will not be able to start the container again until you do so.
More information on installing and running
-Docker on Windows, as well as troubleshooting tips, can
+Docker on Windows, as well as troubleshooting tips, can
be found in [the online Docker documentation](https://docs.docker.com/desktop/install/windows-install/).
### MacOS
@@ -152,18 +152,18 @@ be found in [the online Docker documentation](https://docs.docker.com/desktop/in
**Installation** To install Docker on MacOS,
visit [the online Docker documentation](https://docs.docker.com/desktop/install/mac-install/), and
download the `Docker.dmg` installation file that is appropriate for your
-computer. To know which installer is right for your machine, you need to know
+computer. To know which installer is right for your machine, you need to know
whether your computer has an Intel processor (older machines) or an
Apple processor (newer machines); the [Apple support page](https://support.apple.com/en-ca/HT211814) has
-information to help you determine which processor you have. Once downloaded, double-click
+information to help you determine which processor you have. Once downloaded, double-click
the file to open the installer, then drag the Docker icon to the Applications folder.
-Double-click the icon in the Applications folder to start Docker. In the installation
+Double-click the icon in the Applications folder to start Docker. In the installation
window, use the recommended settings.
**Running JupyterLab** Run Docker Desktop. Once it is running, follow the
instructions above in the Windows section on *Running JupyterLab* (the user
interface is the same). More information on installing and running Docker on
-MacOS, as well as troubleshooting tips, can be
+MacOS, as well as troubleshooting tips, can be
found in [the online Docker documentation](https://docs.docker.com/desktop/install/mac-install/).
### Ubuntu
@@ -184,15 +184,15 @@ the following command, replacing `TAG` with the *tag* you found earlier.
```
docker run --rm -v $(pwd):/home/jovyan/work -p 8888:8888 ubcdsci/r-dsci-100:TAG jupyter lab
```
-The terminal will then print some text as the Docker container starts. Once the text stops scrolling, find the
-URL in your terminal that starts with `http://127.0.0.1:8888` (highlighted by the
+The terminal will then print some text as the Docker container starts. Once the text stops scrolling, find the
+URL in your terminal that starts with `http://127.0.0.1:8888` (highlighted by the
red box in Figure \@ref(fig:ubuntu-docker-terminal)), and paste it into your browser to start JupyterLab.
More information on installing and running Docker on Ubuntu, as well as troubleshooting tips, can be found in
[the online Docker documentation](https://docs.docker.com/engine/install/ubuntu/).
```{r ubuntu-docker-terminal, echo = FALSE, fig.cap = "The terminal text after running the Docker container in Ubuntu. The red box indicates the URL that you should paste into your browser to open JupyterLab.", fig.retina = 2, out.width="100%"}
-image_read("img/setup/ubuntu-docker.png") |>
+image_read("img/setup/ubuntu-docker.png") |>
image_crop("3632x2000")
```
@@ -201,22 +201,22 @@ image_read("img/setup/ubuntu-docker.png") |>
You can also run the worksheets accompanying this book on your computer
using [JupyterLab Desktop](https://github.com/jupyterlab/jupyterlab-desktop).
The advantage of JupyterLab Desktop over Docker is that it can be easier to install;
-Docker can sometimes run into some fairly technical issues (especially on Windows computers)
-that require expert troubleshooting. The downside of JupyterLab Desktop is that there is a (very) small chance that
+Docker can sometimes run into some fairly technical issues (especially on Windows computers)
+that require expert troubleshooting. The downside of JupyterLab Desktop is that there is a (very) small chance that
you may not end up with the right versions of all the R packages needed for the worksheets. Docker, on the other hand,
-*guarantees* that the worksheets will work exactly as intended.
+*guarantees* that the worksheets will work exactly as intended.
In this section, we will cover how to install JupyterLab Desktop,
-Git and the JupyterLab Git extension (for version control, as discussed in Chapter \@ref(version-control)), and
+Git and the JupyterLab Git extension (for version control, as discussed in Chapter \@ref(version-control)), and
all of the R packages needed to run
the code in this book.
\index{git!installation}\index{JupyterLab Desktop}
### Windows
-**Installation** First, we will install Git for version control.
-Go to [the Git download page](https://git-scm.com/download/win) and
-download the Windows version of Git. Once the download has finished, run the installer and accept
+**Installation** First, we will install Git for version control.
+Go to [the Git download page](https://git-scm.com/download/win) and
+download the Windows version of Git. Once the download has finished, run the installer and accept
the default configuration for all pages.
Next, visit the ["Installation" section of the JupyterLab Desktop homepage](https://github.com/jupyterlab/jupyterlab-desktop#installation).
Download the `JupyterLab-Setup-Windows.exe` installer file for Windows.
@@ -225,7 +225,7 @@ Run JupyterLab Desktop by clicking the icon on your desktop.
**Configuring JupyterLab Desktop**
-Next, in the JupyterLab Desktop graphical interface that appears (Figure \@ref(fig:setup-jlab-gui)),
+Next, in the JupyterLab Desktop graphical interface that appears (Figure \@ref(fig:setup-jlab-gui)),
you will see text at the bottom saying "Python environment not found". Click "Install using the bundled installer"
to set up the environment.
@@ -247,28 +247,28 @@ pip install --upgrade jupyterlab-git
conda env update --file https://raw.githubusercontent.com/UBC-DSCI/data-science-a-first-intro-worksheets/main/environment.yml
```
The second command installs the specific R and package versions specified in
-the `environment.yml` file found in
+the `environment.yml` file found in
[the worksheets repository](https://worksheets.datasciencebook.ca).
We will always keep the versions in the `environment.yml` file updated
so that they are compatible with the exercise worksheets that accompany the book.
-Once all of the software installation is complete, it is a good idea to restart
+Once all of the software installation is complete, it is a good idea to restart
JupyterLab Desktop entirely before you proceed to doing your data analysis.
-This will ensure all the software and settings you put in place are
+This will ensure all the software and settings you put in place are
correctly set up and ready for use.
### MacOS
-**Installation** First, we will install Git for version control.
-Open the terminal ([how-to video](https://youtu.be/5AJbWEWwnbY))
+**Installation** First, we will install Git for version control.
+Open the terminal ([how-to video](https://youtu.be/5AJbWEWwnbY))
and type the following command:
```
xcode-select --install
```
Next, visit the ["Installation" section of the JupyterLab Desktop homepage](https://github.com/jupyterlab/jupyterlab-desktop#installation).
-Download the `JupyterLab-Setup-MacOS-x64.dmg` or `JupyterLab-Setup-MacOS-arm64.dmg` installer file.
-To know which installer is right for your machine, you need to know
+Download the `JupyterLab-Setup-MacOS-x64.dmg` or `JupyterLab-Setup-MacOS-arm64.dmg` installer file.
+To know which installer is right for your machine, you need to know
whether your computer has an Intel processor (older machines) or an
Apple processor (newer machines); the [Apple support page](https://support.apple.com/en-ca/HT211814) has
information to help you determine which processor you have.
@@ -283,7 +283,7 @@ the various R software packages needed for the worksheets.
### Ubuntu
-**Installation** First, we will install Git for version control.
+**Installation** First, we will install Git for version control.
Open the terminal and type the following commands:
```
sudo apt update
diff --git a/source/version-control.Rmd b/source/version-control.Rmd
index c6891d630..77325a0b0 100644
--- a/source/version-control.Rmd
+++ b/source/version-control.Rmd
@@ -5,31 +5,31 @@ library(magick)
library(magrittr)
library(knitr)
-knitr::opts_chunk$set(message = FALSE,
- echo = FALSE,
+knitr::opts_chunk$set(message = FALSE,
+ echo = FALSE,
warning = FALSE,
fig.align = "center")
```
-> *You mostly collaborate with yourself,
+> *You mostly collaborate with yourself,
> and me-from-two-months-ago never responds to email.*
->
+>
> --Mark T. Holder
## Overview
-This chapter will introduce the concept of using version control systems
-to track changes to a project over its lifespan, to share
-and edit code in a collaborative team,
+This chapter will introduce the concept of using version control systems
+to track changes to a project over its lifespan, to share
+and edit code in a collaborative team,
and to distribute the finished project to its intended audience.
-This chapter will also introduce how to use
-the two most common version control tools: Git \index{git} for local version control,
-and GitHub \index{GitHub} for remote version control.
-We will focus on the most common version control operations
-used day-to-day in a standard data science project.
-There are many user interfaces for Git; in this chapter
-we will cover the Jupyter Git interface.
+This chapter will also introduce how to use
+the two most common version control tools: Git \index{git} for local version control,
+and GitHub \index{GitHub} for remote version control.
+We will focus on the most common version control operations
+used day-to-day in a standard data science project.
+There are many user interfaces for Git; in this chapter
+we will cover the Jupyter Git interface.
## Chapter learning objectives
@@ -37,7 +37,7 @@ By the end of the chapter, readers will be able to do the following:
- Describe what version control is and why data analysis projects can benefit from it.
- Create a remote version control repository on GitHub.
-- Use Jupyter's Git version control tools for project versioning and collaboration:
+- Use Jupyter's Git version control tools for project versioning and collaboration:
- Clone a remote version control repository to create a local repository.
- Commit changes to a local version control repository.
- Push local changes to a remote version control repository.
@@ -49,35 +49,35 @@ By the end of the chapter, readers will be able to do the following:
## What is version control, and why should I use it?
-Data analysis projects often require iteration
+Data analysis projects often require iteration
and revision to move from an initial idea to a finished product
-ready for the intended audience.
-Without deliberate and conscious effort towards tracking changes
-made to the analysis, projects tend to become messy.
-This mess can have serious, negative repercussions on an analysis project,
+ready for the intended audience.
+Without deliberate and conscious effort towards tracking changes
+made to the analysis, projects tend to become messy.
+This mess can have serious, negative repercussions on an analysis project,
including interesting results files that your code cannot reproduce,
-temporary files with snippets of ideas that are forgotten or
+temporary files with snippets of ideas that are forgotten or
not easy to find, mind-boggling file names that make it unclear which is
-the current working version of the file (e.g., `document_final_draft_final.txt`,
-`to_hand_in_final_v2.txt`, etc.), and more.
+the current working version of the file (e.g., `document_final_draft_final.txt`,
+`to_hand_in_final_v2.txt`, etc.), and more.
-Additionally, the iterative nature of data analysis projects
+Additionally, the iterative nature of data analysis projects
means that most of the time, the final version of the analysis that is
-shared with the audience is only a fraction of what was explored during
-the development of that analysis.
-Changes in data visualizations and modeling approaches,
-as well as some negative results, are often not observable from
+shared with the audience is only a fraction of what was explored during
+the development of that analysis.
+Changes in data visualizations and modeling approaches,
+as well as some negative results, are often not observable from
reviewing only the final, polished analysis.
The lack of observability of these parts of the analysis development
-can lead to others repeating things that did not work well,
-instead of seeing what did not work well,
+can lead to others repeating things that did not work well,
+instead of seeing what did not work well,
and using that as a springboard to new, more fruitful approaches.
-Finally, data analyses are typically completed by a team of people
-rather than a single person.
-This means that files need to be shared across multiple computers,
-and multiple people often end up editing the project simultaneously.
-In such a situation, determining who has the latest version of the
+Finally, data analyses are typically completed by a team of people
+rather than a single person.
+This means that files need to be shared across multiple computers,
+and multiple people often end up editing the project simultaneously.
+In such a situation, determining who has the latest version of the
project—and how to resolve conflicting edits—can be a real challenge.
*Version control* \index{version control} helps solve these challenges. Version control is the process
@@ -93,49 +93,49 @@ collaboration via tools to share edits with others and resolve conflicting
edits. But even if you're working on a project alone, you should still use
version control. It helps you keep track of what you've done, when you did it,
and what you're planning to do next!
-
-To version control a project, you generally need two things:
+
+To version control a project, you generally need two things:
a *version control system* \index{version control!system} and a *repository hosting service*. \index{version control!repository hosting}
-The version control system is the software responsible
-for tracking changes, sharing changes you make with others,
+The version control system is the software responsible
+for tracking changes, sharing changes you make with others,
obtaining changes from others, and resolving conflicting edits.
-The repository hosting service is responsible for storing a copy
-of the version-controlled project online (a *repository*),
-where you and your collaborators can access it remotely,
-discuss issues and bugs, and distribute your final product.
+The repository hosting service is responsible for storing a copy
+of the version-controlled project online (a *repository*),
+where you and your collaborators can access it remotely,
+discuss issues and bugs, and distribute your final product.
For both of these items, there is a wide variety of choices.
-In this textbook we'll use Git for version control,
-and GitHub for repository hosting,
+In this textbook we'll use Git for version control,
+and GitHub for repository hosting,
because both are currently the most widely used platforms.
-In the
+In the
additional resources section at the end of the chapter,
-we list many of the common version control systems
+we list many of the common version control systems
and repository hosting services in use today.
-> **Note:** Technically you don't *have to* use a repository hosting service.
+> **Note:** Technically you don't *have to* use a repository hosting service.
> You can, for example, version control a project
-> that is stored only in a folder on your computer—never
-> sharing it on a repository hosting service.
-> But using a repository hosting service provides a few big benefits,
+> that is stored only in a folder on your computer—never
+> sharing it on a repository hosting service.
+> But using a repository hosting service provides a few big benefits,
> including managing collaborator access permissions,
-> tools to discuss and track bugs,
-> and the ability to have external collaborators contribute work,
-> not to mention the safety of having your work backed up in the cloud.
-> Since most repository hosting services now offer free accounts,
-> there are not many situations in which you wouldn't
-> want to use one for your project.
+> tools to discuss and track bugs,
+> and the ability to have external collaborators contribute work,
+> not to mention the safety of having your work backed up in the cloud.
+> Since most repository hosting services now offer free accounts,
+> there are not many situations in which you wouldn't
+> want to use one for your project.
## Version control repositories
-Typically, when we put a data analysis project under version control,
-we create two copies of the repository \index{repository} (Figure \@ref(fig:vc1-no-changes)).
+Typically, when we put a data analysis project under version control,
+we create two copies of the repository \index{repository} (Figure \@ref(fig:vc1-no-changes)).
One copy we use as our primary workspace where we create, edit, and delete files.
This copy is commonly referred to as \index{repository!local} the **local repository**. The local
repository most commonly exists on our computer or laptop, but can also exist within
a workspace on a server (e.g., JupyterHub).
The other copy is typically stored in a repository hosting service (e.g., GitHub), where
-we can easily share it with our collaborators.
+we can easily share it with our collaborators.
This copy is commonly referred to as \index{repository!remote} the **remote repository**.
```{r vc1-no-changes, fig.cap = 'Schematic of local and remote version control repositories.', fig.retina = 2, out.width="100%"}
@@ -143,34 +143,34 @@ image_read("img/version-control/vc1-no-changes.png") |>
image_crop("3632x2000")
```
-Both copies of the repository have a **working directory** \index{working directory}
-where you can create, store, edit, and delete
+Both copies of the repository have a **working directory** \index{working directory}
+where you can create, store, edit, and delete
files (e.g., `analysis.ipynb` in Figure \@ref(fig:vc1-no-changes)).
-Both copies of the repository also maintain a full project history
+Both copies of the repository also maintain a full project history
(Figure \@ref(fig:vc1-no-changes)). This history is a record of all versions of the
project files that have been created. The repository history is not
automatically generated; Git must be explicitly told when to record
-a version of the project. These records are \index{git!commit} called **commits**. They
+a version of the project. These records are \index{git!commit} called **commits**. They
are a snapshot of the file contents as well as
metadata about the repository at the time the record was created (who made the
commit, when it was made, etc.). In the local and remote repositories shown in
Figure \@ref(fig:vc1-no-changes), there are two commits represented as gray
-circles. Each commit can be identified by a
+circles. Each commit can be identified by a
human-readable **message**, which you write when you make a commit, and a
**commit hash** that Git automatically adds for you.
-The purpose of the message is to contain a brief, rich description
+The purpose of the message is to contain a brief, rich description
of what work was done since the last commit.
-Messages act as a very useful narrative
-of the changes to a project over its lifespan.
+Messages act as a very useful narrative
+of the changes to a project over its lifespan.
If you ever want to view or revert to an earlier version of the project,
the message can help you identify which commit to view or revert to.
-In Figure \@ref(fig:vc1-no-changes), you can see two such messages,
+In Figure \@ref(fig:vc1-no-changes), you can see two such messages,
one for each commit: `Created README.md` and `Added analysis draft`.
The hash \index{hash} is a string of characters consisting of about 40 letters and numbers.
The purpose of the hash is to serve as a unique identifier for the commit,
-and is used by Git to index project history. Although hashes are quite long—imagine
+and it is used by Git to index project history. Although hashes are quite long—imagine
having to type out 40 precise characters to view an old project version!—Git is able
to work with shorter versions of hashes. In Figure \@ref(fig:vc1-no-changes), you can see
two of these shortened hashes, one for each commit: `Daa29d6` and `884c7ce`.
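Although this chapter works through the Jupyter Git interface, the same history of messages and shortened hashes can be inspected with the `git` command-line client (a sketch, shown only for reference):

```
# print one line per commit: the shortened hash followed by its message,
# with the newest commit first
git log --oneline
```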
@@ -178,7 +178,7 @@ two of these shortened hashes, one for each commit: `Daa29d6` and `884c7ce`.
## Version control workflows
When you work in a local version-controlled repository, there are generally three additional
-steps you must take as part of your regular workflow. In addition to
+steps you must take as part of your regular workflow. In addition to
just working on files—creating,
editing, and deleting files as you normally would—you must:
@@ -190,7 +190,7 @@ In this section we will discuss all three of these steps in detail.
### Committing changes to a local repository {#commit-changes}
-When working on files in your local version control
+When working on files in your local version control
repository (e.g., using Jupyter) and saving your work, these changes will only initially exist in the
working directory of the local repository (Figure \@ref(fig:vc2-changes)).
@@ -199,16 +199,16 @@ image_read("img/version-control/vc2-changes.png") |>
image_crop("3632x2000")
```
-Once you reach a point that you want Git to keep a record
-of the current version of your work, you need to commit
+Once you reach a point that you want Git to keep a record
+of the current version of your work, you need to commit
(i.e., snapshot) your changes. A prerequisite to this is telling Git which
-files should be included in that snapshot. We call this step **adding** the
+files should be included in that snapshot. We call this step **adding** the
files to the **staging area**. \index{git!add, staging area}
-Note that the staging area is not a real physical location on your computer;
+Note that the staging area is not a real physical location on your computer;
it is instead a conceptual placeholder for these files until they are committed.
-The benefit of the Git version control system using a staging area is that you
-can choose to commit changes in only certain files. For example,
-in Figure \@ref(fig:vc-ba2-add), we add only the two files
+The benefit of the Git version control system using a staging area is that you
+can choose to commit changes in only certain files. For example,
+in Figure \@ref(fig:vc-ba2-add), we add only the two files
that are important to the analysis project (`analysis.ipynb` and `README.md`)
and not our personal scratch notes for the project (`notes.txt`).
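For readers curious about what this step looks like outside of Jupyter, the equivalent staging operation with the `git` command-line client is a single command (a sketch; the file names are the ones used in the figure):

```
# stage only the files whose changes we want in the next commit
git add analysis.ipynb README.md
```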
@@ -217,15 +217,15 @@ image_read("img/version-control/vc-ba2-add.png") |>
image_crop("3632x1200")
```
-Once the files we wish to commit have been added
+Once the files we wish to commit have been added
to the staging area, we can then commit those files to the repository history (Figure \@ref(fig:vc-ba3-commit)).
-When we do this, we are required to include a helpful *commit message* to tell
+When we do this, we are required to include a helpful *commit message* to tell
collaborators (which often includes future you!) about the changes that were
-made. In Figure \@ref(fig:vc-ba3-commit), the message is `Message about changes...`; in
+made. In Figure \@ref(fig:vc-ba3-commit), the message is `Message about changes...`; in
your work you should make sure to replace this with an
informative message about what changed. It is also important to note here that
these changes are only being committed to the local repository's history. The
-remote repository on GitHub has not changed, and collaborators are not yet
+remote repository on GitHub has not changed, and collaborators are not yet
able to see your new changes.
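The corresponding command-line step is `git commit` (a sketch; as in the figure, you would replace the placeholder message with an informative description of your changes):

```
# record a snapshot of the staged files in the local repository history
git commit -m "Message about changes..."
```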
```{r vc-ba3-commit, fig.cap = "Committing the modified files in the staging area to the local repository history, with an informative message about what changed.", fig.retina = 2, out.width="100%"}
@@ -235,11 +235,11 @@ image_read("img/version-control/vc-ba3-commit.png") |>
### Pushing changes to a remote repository
-Once you have made one or more commits that you want to share with your collaborators,
-you need \index{git!push} to **push** (i.e., send) those commits back to GitHub (Figure \@ref(fig:vc5-push)). This updates
-the history in the remote repository (i.e., GitHub) to match what you have in your
+Once you have made one or more commits that you want to share with your collaborators,
+you need \index{git!push} to **push** (i.e., send) those commits back to GitHub (Figure \@ref(fig:vc5-push)). This updates
+the history in the remote repository (i.e., GitHub) to match what you have in your
local repository. Now when collaborators interact with the remote repository, they will be able
-to see the changes you made. And you can also take comfort in the fact that your work is now backed
+to see the changes you made. And you can also take comfort in the fact that your work is now backed
up in the cloud!
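On the command line, the equivalent step is `git push` (a sketch; for a repository cloned from GitHub, this sends your new commits to the remote repository it was cloned from):

```
# send local commits to the remote repository on GitHub
git push
```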
```{r vc5-push, fig.cap = 'Pushing the commit to send the changes to the remote repository on GitHub.', fig.retina = 2, out.width="100%"}
@@ -252,7 +252,7 @@ image_read("img/version-control/vc5-push.png") |>
If you are working on a project with collaborators, they will also be making changes to files
(e.g., to the analysis code in a Jupyter notebook and the project's README file),
committing them to their own local repository, and pushing their commits to the remote GitHub repository
-to share them with you. When they push their changes, those changes will only initially exist in
+to share them with you. When they push their changes, those changes will only initially exist in
the remote GitHub repository and not in your local repository (Figure \@ref(fig:vc6-remote-changes)).
```{r vc6-remote-changes, fig.cap = 'Changes pushed by collaborators, or created directly on GitHub, will not be automatically sent to your local repository.', fig.retina = 2, out.width="100%"}
@@ -265,22 +265,22 @@ to **pull** \index{git!pull} those changes to your own local repository. By pul
you synchronize your local repository to what is present on GitHub (Figure \@ref(fig:vc7-pull)).
Additionally, until you pull changes from the remote repository, you will not
be able to push any more changes yourself (though you will still be able to
-work and make commits in your own local repository).
+work and make commits in your own local repository).
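The equivalent command-line step is `git pull` (a sketch; run from anywhere inside your local repository):

```
# fetch your collaborators' commits from GitHub and merge them
# into your local repository
git pull
```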
```{r vc7-pull, fig.cap = 'Pulling changes from the remote GitHub repository to synchronize your local repository.', fig.retina = 2, out.width="100%"}
image_read("img/version-control/vc7-pull.png") |>
image_crop("3632x2000")
```
-## Working with remote repositories using GitHub
+## Working with remote repositories using GitHub
-Now that you have been introduced to some of the key general concepts
-and workflows of Git version control, we will walk through the practical steps.
+Now that you have been introduced to some of the key general concepts
+and workflows of Git version control, we will walk through the practical steps.
There are several different ways to start using version control
-with a new project. For simplicity and ease of setup,
+with a new project. For simplicity and ease of setup,
we recommend creating a remote repository \index{repository!remote} first.
This section covers how to both create and edit a remote repository on \index{GitHub} GitHub.
-Once you have a remote repository set up, we recommend **cloning** (or copying) that \index{git!clone}
+Once you have a remote repository set up, we recommend **cloning** (or copying) that \index{git!clone}
repository to create a local repository in which you primarily work.
You can clone the repository either
on your own computer or in a workspace on a server (e.g., a JupyterHub server).
@@ -288,15 +288,15 @@ Section \@ref(local-repo-jupyter) below will cover this second step in detail.
### Creating a remote repository on GitHub
-Before you can create remote repositories on GitHub,
-you will need a GitHub account; you can sign up for a free account
+Before you can create remote repositories on GitHub,
+you will need a GitHub account; you can sign up for a free account
at [https://github.com/](https://github.com/).
-Once you have logged into your account, you can create a new repository to host
-your project by clicking on the "+" icon in the upper right-hand
-corner, and then on "New Repository," as shown in
+Once you have logged into your account, you can create a new repository to host
+your project by clicking on the "+" icon in the upper right-hand
+corner, and then on "New Repository," as shown in
Figure \@ref(fig:new-repository-01).
-(ref:new-repository-01) New repositories on GitHub can be created by clicking on "New Repository" from the + menu.
+(ref:new-repository-01) New repositories on GitHub can be created by clicking on "New Repository" from the + menu.
```{r new-repository-01, fig.cap = '(ref:new-repository-01)', fig.retina = 2, out.width="100%"}
image_read("img/version-control/new_repository_01.png") |>
@@ -305,15 +305,15 @@ image_read("img/version-control/new_repository_01.png") |>
image_flop()
```
-Repositories can be set up with a variety of configurations, including a name,
-optional description, and the inclusion (or not) of several template files.
+Repositories can be set up with a variety of configurations, including a name,
+optional description, and the inclusion (or not) of several template files.
One of the most important configuration items to choose is the visibility to the outside world,
either public or private. *Public* repositories \index{repository!public} can be viewed by anyone.
*Private* repositories can be viewed by only you. Both public and private repositories
are only editable by you, but you can change that by giving access to other collaborators.
-To get started with a *public* repository having a template `README.md` file, take the
-following steps shown in Figure \@ref(fig:new-repository-02):
+To get started with a *public* repository having a template `README.md` file, take the
+following steps shown in Figure \@ref(fig:new-repository-02):
1. Enter the name of your project repository. In the example below, we use `canadian_languages`. Most repositories follow a similar naming convention involving only lowercase words separated by either underscores or hyphens.
2. Choose an option for the privacy of your repository.
@@ -340,8 +340,8 @@ image_read("img/version-control/new_repository_03.png") |>
### Editing files on GitHub with the pen tool
-The pen tool \index{GitHub!pen tool} can be used to edit existing plain text files. When you click on
-the pen tool, the file will be opened in a text box where you can use your
+The pen tool \index{GitHub!pen tool} can be used to edit existing plain text files. When you click on
+the pen tool, the file will be opened in a text box where you can use your
keyboard to make changes (Figures \@ref(fig:pen-tool-01) and \@ref(fig:pen-tool-02)).
```{r pen-tool-01, fig.cap = 'Clicking on the pen tool opens a text box for editing plain text files.', fig.retina = 2, out.width="100%"}
@@ -358,11 +358,11 @@ image_read("img/version-control/pen-tool_02.png") |>
# image_flop()
```
-After you are done with your edits, they can be "saved" by *committing* your
-changes. When you *commit a file* in a repository, the version control system
-takes a snapshot of what the file looks like. As you continue working on the
-project, over time you will possibly make many commits to a single file; this
-generates a useful version history for that file. On GitHub, if you click the
+After you are done with your edits, they can be "saved" by *committing* your
+changes. When you *commit a file* in a repository, the version control system
+takes a snapshot of what the file looks like. As you continue working on the
+project, over time you will likely make many commits to a single file; this
+generates a useful version history for that file. On GitHub, if you click the
green "Commit changes" button, \index{GitHub!commit} it will save the file and then make a commit
(Figure \@ref(fig:pen-tool-03)).
@@ -370,7 +370,7 @@ Recall from Section \@ref(commit-changes) that you normally have to add files
to the staging area before committing them. Why don't we have to do that when
we work directly on GitHub? Behind the scenes, when you click the green "Commit changes"
button, GitHub *is* adding that one file to the staging area prior to committing it.
-But note that on GitHub you are limited to committing changes to only one file at a time.
+But note that on GitHub you are limited to committing changes to only one file at a time.
When you work in your own local repository, you can commit
changes to multiple files simultaneously. This is especially useful when one
"improvement" to the project involves modifying multiple files.
@@ -384,9 +384,9 @@ image_read("img/version-control/pen-tool_03.png") |>
### Creating files on GitHub with the "Add file" menu
-The "Add file" menu \index{GitHub!add file} can be used to create new plain text files and upload files
-from your computer. To create a new plain text file, click the "Add file"
-drop-down menu and select the "Create new file" option
+The "Add file" menu \index{GitHub!add file} can be used to create new plain text files and upload files
+from your computer. To create a new plain text file, click the "Add file"
+drop-down menu and select the "Create new file" option
(Figure \@ref(fig:create-new-file-01)).
```{r create-new-file-01, fig.cap = 'New plain text files can be created directly on GitHub.', fig.retina = 2, out.width="100%"}
@@ -396,14 +396,14 @@ image_read("img/version-control/create-new-file_01.png") |>
image_flop()
```
-A page will open with a small text box for the file name to be entered, and a
-larger text box where the desired file content text can be entered. Note the two
-tabs, "Edit new file" and "Preview". Toggling between them lets you enter and
+A page will open with a small text box for the file name to be entered, and a
+larger text box where the desired file content can be entered. Note the two
+tabs, "Edit new file" and "Preview". Toggling between them lets you enter and
edit text and view what the text will look like when rendered, respectively
-(Figure \@ref(fig:create-new-file-02)).
-Note that GitHub understands and renders `.md` files \index{markdown} using a
-[markdown syntax](https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf)
-very similar to Jupyter notebooks, so the "Preview" tab is especially helpful
+(Figure \@ref(fig:create-new-file-02)).
+Note that GitHub understands and renders `.md` files \index{markdown} using a
+[markdown syntax](https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf)
+very similar to that used in Jupyter notebooks, so the "Preview" tab is especially helpful
for checking markdown code correctness.
```{r create-new-file-02, fig.cap = 'New plain text files require a file name in the text box circled in red, and file content entered in the larger text box (red arrow).', fig.retina = 2, out.width="100%"}
@@ -413,7 +413,7 @@ image_read("img/version-control/create-new-file_02.png") |>
image_flop()
```
-Save and commit your changes by clicking the green "Commit changes" button at the
+Save and commit your changes by clicking the green "Commit changes" button at the
bottom of the page (Figure \@ref(fig:create-new-file-03)).
```{r create-new-file-03, fig.cap = 'To be saved, newly created files are required to be committed along with an associated commit message.', fig.retina = 2, out.width="100%"}
@@ -421,13 +421,13 @@ image_read("img/version-control/create-new-file_03.png") |>
image_crop("3584x1500+1+500")
```
-You can also upload files that you have created on your local machine by using
+You can also upload files that you have created on your local machine by using
the "Add file" drop-down menu and selecting "Upload files"
(Figure \@ref(fig:upload-files-01)).
-To select the files from your local computer to upload, you can either drag and
-drop them into the gray box area shown below, or click the "choose your files"
-link to access a file browser dialog. Once the files you want to upload have
-been selected, click the green "Commit changes" button at the bottom of the
+To select the files from your local computer to upload, you can either drag and
+drop them into the gray box area shown below, or click the "choose your files"
+link to access a file browser dialog. Once the files you want to upload have
+been selected, click the green "Commit changes" button at the bottom of the
page (Figure \@ref(fig:upload-files-02)).
```{r upload-files-01, fig.cap = 'New files of any type can be uploaded to GitHub.', fig.retina = 2, out.width="100%"}
@@ -446,7 +446,7 @@ image_read("img/version-control/upload-files_02.png") |>
image_flop()
```
-Note that Git and GitHub are designed to track changes in individual files.
+Note that Git and GitHub are designed to track changes in individual files.
**Do not** upload your whole project in an archive file (e.g., `.zip`). If you do,
then Git can only keep track of changes to the entire `.zip` file, which will not
be human-readable. Committing one big archive defeats the whole purpose of using
@@ -464,29 +464,29 @@ remote repository that was created on GitHub to a local coding environment. Thi
can be done by creating and working in a local copy of the repository.
In this chapter, we focus on interacting with Git via Jupyter using
the Jupyter Git extension. The Jupyter Git \index{git!Jupyter extension} extension
-can be run by Jupyter on your local computer, or on a JupyterHub server.
+can be run by Jupyter on your local computer, or on a JupyterHub server.
We recommend reading Chapter \@ref(jupyter) to learn how
to use Jupyter before reading this chapter.
### Generating a GitHub personal access token
-To send and retrieve work between your local repository
+To send and retrieve work between your local repository
and the remote repository on GitHub,
-you will frequently need to authenticate with GitHub
+you will frequently need to authenticate with GitHub
to prove you have the required permission.
-There are several methods to do this,
-but for beginners we recommend using the HTTPS method
+There are several methods to do this,
+but for beginners we recommend using the HTTPS method
because it is easier and requires less setup.
-In order to use the HTTPS method,
-GitHub requires you to provide a *personal access token*. \index{GitHub!personal access token}
+In order to use the HTTPS method,
+GitHub requires you to provide a *personal access token*. \index{GitHub!personal access token}
A personal access token is like a password—so keep it a secret!—but it gives
you more fine-grained control over what parts of your account
the token can be used to access, and lets you set an expiry date for the authentication.
-To generate a personal access token,
+To generate a personal access token,
you must first visit [https://github.com/settings/tokens](https://github.com/settings/tokens),
which will take you to the "Personal access tokens" page in your account settings.
Once there, click "Generate new token" (Figure \@ref(fig:generate-pat-01)).
-Note that you may be asked to re-authenticate with your username
+Note that you may be asked to re-authenticate with your username
and password to proceed.
(ref:generate-pat-01) The "Generate new token" button used to initiate the creation of a new personal access token. It is found in the "Personal access tokens" section of the "Developer settings" page in your account settings.
@@ -495,13 +495,13 @@ and password to proceed.
image_read("img/version-control/generate-pat_01.png")
```
-You will be asked to add a note to describe the purpose for your personal access token.
+You will be asked to add a note to describe the purpose for your personal access token.
Next, you need to select permissions for the token; this is where
you can control what parts of your account the token can be used to access.
Make sure to choose only those permissions that you absolutely require. In
-Figure \@ref(fig:generate-pat-02), we tick only the "repo" box, which gives the
-token access to our repositories (so that we can push and pull) but none of our other GitHub
-account features. Finally, to generate the token, scroll to the bottom of that page
+Figure \@ref(fig:generate-pat-02), we tick only the "repo" box, which gives the
+token access to our repositories (so that we can push and pull) but none of our other GitHub
+account features. Finally, to generate the token, scroll to the bottom of that page
and click the green "Generate token" button (Figure \@ref(fig:generate-pat-02)).
(ref:generate-pat-02) Webpage for creating a new personal access token.
@@ -510,16 +510,16 @@ and click the green "Generate token" button (Figure \@ref(fig:generate-pat-02)).
image_read("img/version-control/generate-pat_02.png")
```
-Finally, you will be taken to a page where you will be able to see
-and copy the personal access token you just generated (Figure \@ref(fig:generate-pat-03)).
+Finally, you will be taken to a page where you will be able to see
+and copy the personal access token you just generated (Figure \@ref(fig:generate-pat-03)).
Since it provides access to certain parts of your account, you should
-treat this token like a password; for example, you should consider
+treat this token like a password; for example, you should consider
securely storing it (and your other passwords and tokens, too!) using a password manager.
Note that this page will only display the token to you once,
so make sure you store it in a safe place right away. If you accidentally forget to
-store it, though, do not fret—you can delete that token by clicking the
-"Delete" button next to your token, and generate a new one from scratch.
-To learn more about GitHub authentication,
+store it, though, do not fret—you can delete that token by clicking the
+"Delete" button next to your token, and generate a new one from scratch.
+To learn more about GitHub authentication,
see the additional resources section at the end of this chapter.
(ref:generate-pat-03) Display of the newly generated personal access token.
@@ -528,15 +528,15 @@ see the additional resources section at the end of this chapter.
image_read("img/version-control/generate-pat_03.png")
```
-### Cloning a repository using Jupyter
+### Cloning a repository using Jupyter
*Cloning* a \index{git!clone} remote repository from GitHub
-to create a local repository results in a
-copy that knows where it was obtained from so that it knows where to send/receive
-new committed edits. In order to do this, first copy the URL from the HTTPS tab
+to create a local repository results in a
+copy that remembers where it was obtained from, so that it knows where to send and receive
+new committed edits. In order to do this, first copy the URL from the HTTPS tab
of the Code drop-down menu on GitHub (Figure \@ref(fig:clone-02)).
(ref:clone-02) The green "Code" drop-down menu contains the remote address (URL) corresponding to the location of the remote GitHub repository.
@@ -546,7 +546,7 @@ image_read("img/version-control/clone_02.png") |>
image_crop("3584x1050")
```
-Open Jupyter, and click the Git+ icon on the file browser tab
+Open Jupyter, and click the Git+ icon on the file browser tab
(Figure \@ref(fig:clone-01)).
```{r clone-01, fig.pos = "H", out.extra="", fig.cap = 'The Jupyter Git Clone icon (red circle).', fig.retina = 2, out.width="100%"}
@@ -554,7 +554,7 @@ image_read("img/version-control/clone_01.png") |>
image_crop("2400x1300+1")
```
-Paste the URL of the GitHub project repository you
+Paste the URL of the GitHub project repository you
created and click the blue "CLONE" button (Figure \@ref(fig:clone-03)).
```{r clone-03, fig.pos = "H", out.extra="", fig.cap = 'Prompt where the remote address (URL) corresponding to the location of the GitHub repository needs to be input in Jupyter.', fig.retina = 2, out.width="100%"}
@@ -572,11 +572,11 @@ image_read("img/version-control/clone_04.png") |>
### Specifying files to commit
Now that you have cloned the remote repository from GitHub to create a local repository,
-you can get to work editing, creating, and deleting files.
-For example, suppose you created and saved a new file (named `eda.ipynb`) that you would
+you can get to work editing, creating, and deleting files.
+For example, suppose you created and saved a new file (named `eda.ipynb`) that you would
like to send back to the project repository on GitHub (Figure \@ref(fig:git-add-01)).
To "add" this modified file to the staging area (i.e., flag that this is a
-file whose changes we would like to commit), click the Jupyter Git extension
+file whose changes we would like to commit), click the Jupyter Git extension
icon on the far left-hand side of Jupyter (Figure \@ref(fig:git-add-01)).
```{r git-add-01, fig.pos = "H", out.extra="", fig.cap = 'Jupyter Git extension icon (circled in red).', fig.retina = 2, out.width="100%"}
@@ -586,8 +586,8 @@ image_read("img/version-control/git_add_01.png") |>
This opens the Jupyter Git graphical user interface pane. Next,
click the plus sign (+) beside the file(s) that you want to "add" \index{git!add}
-(Figure \@ref(fig:git-add-02)). Note that because this is the
-first change for this file, it falls under the "Untracked" heading.
+(Figure \@ref(fig:git-add-02)). Note that because this is the
+first change for this file, it falls under the "Untracked" heading.
However, next time you edit this file and want to add the changes,
you will find it under the "Changed" heading.
@@ -603,12 +603,12 @@ image_read("img/version-control/git_add_02.png") |>
image_crop("3584x1200")
```
-Clicking the plus sign (+) moves the file from the "Untracked" heading to the "Staged" heading,
-so that Git knows you want a snapshot of its current state
+Clicking the plus sign (+) moves the file from the "Untracked" heading to the "Staged" heading,
+so that Git knows you want a snapshot of its current state
as a commit (Figure \@ref(fig:git-add-03)).
-Now you are ready to "commit" the changes.
+Now you are ready to "commit" the changes.
Make sure to include a (clear and helpful!) message about what was changed
-so that your collaborators (and future you) know what happened in this commit.
+so that your collaborators (and future you) know what happened in this commit.
(ref:git-add-03) Adding `eda.ipynb` makes it visible in the staging area.
@@ -619,23 +619,23 @@ image_read("img/version-control/git_add_03.png") |>
### Making the commit
-To snapshot the changes with an associated commit message,
-you must put a message in the text box at the bottom of the Git pane
+To snapshot the changes with an associated commit message,
+you must put a message in the text box at the bottom of the Git pane
and click on the blue "Commit" button (Figure \@ref(fig:git-commit-01)). \index{git!commit}
-It is highly recommended to write useful and meaningful messages about what
-was changed. These commit messages, and the datetime stamp for a given
-commit, are the primary means to navigate through the project's history in the
- event that you need to view or retrieve a past version of a file, or
+It is highly recommended to write useful and meaningful messages about what
+was changed. These commit messages, and the datetime stamp for a given
+commit, are the primary means to navigate through the project's history in the
+event that you need to view or retrieve a past version of a file, or
revert your project to an earlier state.
-When you click the "Commit" button for the first time, you will be prompted to
-enter your name and email. This only needs to be done once for each machine
+When you click the "Commit" button for the first time, you will be prompted to
+enter your name and email. This only needs to be done once for each machine
you use Git on.
```{r git-commit-01, fig.pos = "H", out.extra="", fig.cap = 'A commit message must be added into the Jupyter Git extension commit text box before the blue Commit button can be used to record the commit.', fig.retina = 2, out.width="100%"}
image_read("img/version-control/git_commit_01.png")
```
-After "committing" the file(s), you will see there are 0 "Staged" files.
+After "committing" the file(s), you will see there are 0 "Staged" files.
You are now ready to push your changes
to the remote repository on GitHub (Figure \@ref(fig:git-commit-03)).
@@ -646,9 +646,9 @@ image_read("img/version-control/git_commit_03.png") |>
### Pushing the commits to GitHub
-To send the committed changes back to the remote repository on
-GitHub, you need to *push* them. \index{git!push} To do this,
-click on the cloud icon with the up arrow on the Jupyter Git tab
+To send the committed changes back to the remote repository on
+GitHub, you need to *push* them. \index{git!push} To do this,
+click on the cloud icon with the up arrow on the Jupyter Git tab
(Figure \@ref(fig:git-push-01)).
(ref:git-push-01) The Jupyter Git extension "push" button (circled in red).
@@ -658,9 +658,9 @@ image_read("img/version-control/git_push_01.png") |>
image_crop("3584x1500")
```
-You will then be prompted to enter your GitHub username
+You will then be prompted to enter your GitHub username
and the personal access token that you generated
-earlier (not your account password!). Click
+earlier (not your account password!). Click
the blue "OK" button to initiate the push (Figure \@ref(fig:git-push-02)).
```{r git-push-02, fig.pos = "H", out.extra="", fig.cap = 'Enter your Git credentials to authorize the push to the remote repository.', fig.retina = 2, out.width="100%"}
@@ -668,8 +668,8 @@ image_read("img/version-control/git_push_02.png") |>
image_crop("3584x1900")
```
-If the files were successfully pushed to the project repository on
-GitHub, you will be shown a success message (Figure \@ref(fig:git-push-03)).
+If the files were successfully pushed to the project repository on
+GitHub, you will be shown a success message (Figure \@ref(fig:git-push-03)).
Click "Dismiss" to continue working in Jupyter.
```{r git-push-03, fig.pos = "H", out.extra="", fig.cap = 'The prompt that the push was successful.', fig.retina = 2, out.width="100%"}
@@ -677,7 +677,7 @@ image_read("img/version-control/git_push_03.png") |>
image_crop("3584x1900")
```
-If you visit the remote repository on GitHub,
+If you visit the remote repository on GitHub,
you will see that the changes now exist there too
(Figure \@ref(fig:git-push-04))!
@@ -690,10 +690,10 @@ image_read("img/version-control/git_push_04.png") |>
### Giving collaborators access to your project
-As mentioned earlier, GitHub allows you to control who has access to your
-project. The default of both public and private projects are that only the
-person who created the GitHub \index{GitHub!collaborator access} repository has permissions to create, edit and
-delete files (*write access*). To give your collaborators write access to the
+As mentioned earlier, GitHub allows you to control who has access to your
+project. The default for both public and private projects is that only the
+person who created the GitHub \index{GitHub!collaborator access} repository has permission to create, edit, and
+delete files (*write access*). To give your collaborators write access to the
project, navigate to the "Settings" tab (Figure \@ref(fig:add-collab-01)).
(ref:add-collab-01) The "Settings" tab on the GitHub web interface.
@@ -721,7 +721,7 @@ image_read("img/version-control/add_collab_03.png") |>
image_crop("3584x2200")
```
-Type in the collaborator's GitHub username or email,
+Type in the collaborator's GitHub username or email,
and select their name when it appears (Figure \@ref(fig:add-collab-04)).
```{r add-collab-04, fig.pos = "H", out.extra="", fig.cap = "The text box where a collaborator's GitHub username or email can be entered.", fig.retina = 2, out.width="100%"}
@@ -736,15 +736,15 @@ image_read("img/version-control/add_collab_05.png") |>
image_crop("3584x1250")
```
-After this, you should see your newly added collaborator listed under the
-"Manage access" tab. They should receive an email invitation to join the
-GitHub repository as a collaborator. They need to accept this invitation
+After this, you should see your newly added collaborator listed under the
+"Manage access" tab. They should receive an email invitation to join the
+GitHub repository as a collaborator. They need to accept this invitation
to enable write access.
### Pulling changes from GitHub using Jupyter
-We will now walk through how to use the Jupyter Git extension tool to pull changes
-to our `eda.ipynb` analysis file that were made by a collaborator
+We will now walk through how to use the Jupyter Git extension tool to pull changes
+to our `eda.ipynb` analysis file that were made by a collaborator
(Figure \@ref(fig:git-pull-00)).
```{r git-pull-00, fig.pos = "H", out.extra="", fig.cap = 'The GitHub interface indicates the name of the last person to push a commit to the remote repository, a preview of the associated commit message, the unique commit identifier, and how long ago the commit was snapshotted.', fig.retina = 2, out.width="100%"}
@@ -752,7 +752,7 @@ image_read("img/version-control/git_pull_00.png") |>
image_crop("3584x1600")
```
-You can tell Git to "pull" by \index{git!pull} clicking on the cloud icon with
+You can tell Git to "pull" by \index{git!pull} clicking on the cloud icon with
the down arrow in Jupyter (Figure \@ref(fig:git-pull-01)).
```{r git-pull-01, fig.pos = "H", out.extra="", fig.cap = 'The Jupyter Git extension pull button.', fig.retina = 2, out.width="100%"}
@@ -779,7 +779,7 @@ image_read("img/version-control/git_pull_03.png") |>
```
It can be very useful to review the history of the changes to your project. You
-can do this directly in Jupyter by clicking "History" in the Git tab
+can do this directly in Jupyter by clicking "History" in the Git tab
(Figure \@ref(fig:git-pull-04)).
```{r git-pull-04, fig.pos = "H", out.extra="", fig.cap = 'Version control repository history viewed using the Jupyter Git extension.', fig.retina = 2, out.width="100%"}
@@ -787,12 +787,12 @@ image_read("img/version-control/git_pull_04.png") |>
image_crop("3584x1600")
```
-It is good practice to pull any changes at the start of *every* work session
-before you start working on your local copy.
-If you do not do this,
-and your collaborators have pushed some changes to the project to GitHub,
-then you will be unable to push your changes to GitHub until you pull.
-This situation can be recognized by the error message
+It is good practice to pull any changes at the start of *every* work session
+before you start working on your local copy.
+If you do not do this,
+and your collaborators have pushed some changes to the project to GitHub,
+then you will be unable to push your changes to GitHub until you pull.
+This situation can be recognized by the error message
shown in Figure \@ref(fig:merge-conflict-01).
```{r merge-conflict-01, fig.pos = "H", out.extra="", fig.cap = 'Error message that indicates that there are changes on the remote repository that you do not have locally.', fig.retina = 2, out.width="100%"}
@@ -821,7 +821,7 @@ image_read("img/version-control/merge_conflict_03.png") |>
To fix the merge conflict, \index{git!merge conflict} you need to open the offending file
in a plain text editor and look for special marks that Git puts in the file to
-tell you where the merge conflict occurred (Figure \@ref(fig:merge-conflict-04)).
+tell you where the merge conflict occurred (Figure \@ref(fig:merge-conflict-04)).
```{r merge-conflict-04, fig.cap = 'How to open a Jupyter notebook as a plain text file view in Jupyter.', fig.retina = 2, out.width="100%"}
image_read("img/version-control/merge_conflict_04.png") |>
@@ -829,9 +829,9 @@ image_read("img/version-control/merge_conflict_04.png") |>
```
The beginning of the merge
-conflict is preceded by `<<<<<<< HEAD` and the end of the merge conflict is
-marked by `>>>>>>>`. Between these markings, Git also inserts a separator
-(`=======`). The version of the change before the separator is your change, and
+conflict is preceded by `<<<<<<< HEAD` and the end of the merge conflict is
+marked by `>>>>>>>`. Between these markings, Git also inserts a separator
+(`=======`). The version of the change before the separator is your change, and
the version that follows the separator was the change that existed on GitHub.
In Figure \@ref(fig:merge-conflict-05), you can see that in your local repository
there is a line of code that calls `scale_color_manual` with three color values (`deeppink2`, `cyan4`, and `purple1`).
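For reference, the conflicted region of the file might look roughly like the sketch below. This mock-up is illustrative only: the second `scale_color_manual` call and the commit identifier after `>>>>>>>` are hypothetical placeholders, not the actual contents of `eda.ipynb`.

```
<<<<<<< HEAD
scale_color_manual(values = c("deeppink2", "cyan4", "purple1"))
=======
scale_color_manual(values = c("deeppink2", "cyan4", "purple1", "khaki4"))
>>>>>>> 58b4f86 (collaborator's commit)
```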
@@ -851,31 +851,31 @@ image_read("img/version-control/merge_conflict_06.png") |>
image_crop("3584x1400")
```
-The file must be saved, added to the staging area, and then committed before you will be able to
+The file must be saved, added to the staging area, and then committed before you will be able to
push your changes to GitHub.
### Communicating using GitHub issues
When working on a project in a team, you don't just want a historical record of who changed
-what file and when in the project—you also want a record of decisions that were made,
-ideas that were floated, problems that were identified and addressed, and all other
+what file and when in the project—you also want a record of decisions that were made,
+ideas that were floated, problems that were identified and addressed, and all other
communication surrounding the project. Email and messaging apps are both very popular for general communication, but are not
designed for project-specific communication: they both generally do not have facilities for organizing conversations by project subtopics,
searching for conversations related to particular bugs or software versions, etc.
-GitHub *issues* \index{GitHub!issues} are an alternative written communication medium to email and
-messaging apps, and were designed specifically to facilitate project-specific
+GitHub *issues* \index{GitHub!issues} are an alternative written communication medium to email and
+messaging apps, and were designed specifically to facilitate project-specific
communication. Issues are *opened* from the "Issues" tab on the project's
-GitHub page, and they persist there even after the conversation is over and the issue is *closed* (in
+GitHub page, and they persist there even after the conversation is over and the issue is *closed* (in
contrast to email, issues are not usually deleted). One issue thread is usually created
-per topic, and they are easily searchable using GitHub's search tools. All
-issues are accessible to all project collaborators, so no one is left out of
-the conversation. Finally, issues can be set up so that team members get email
-notifications when a new issue is created or a new post is made in an issue
+per topic, and they are easily searchable using GitHub's search tools. All
+issues are accessible to all project collaborators, so no one is left out of
+the conversation. Finally, issues can be set up so that team members get email
+notifications when a new issue is created or a new post is made in an issue
thread. Replying to issues from email is also possible. Given all of these advantages,
we highly recommend the use of issues for project-related communication.
-To open a GitHub issue,
+To open a GitHub issue,
first click on the "Issues" tab (Figure \@ref(fig:issue-01)).
(ref:issue-01) The "Issues" tab on the GitHub web interface.
@@ -896,7 +896,7 @@ image_read("img/version-control/issue_02.png") |>
image_crop("3584x1250")
```
-Add an issue title (which acts like an email subject line), and then put the
+Add an issue title (which acts like an email subject line), and then put the
body of the message in the larger text box. Finally, click "Submit new issue"
to post the issue to share with others (Figure \@ref(fig:issue-03)).
@@ -913,8 +913,8 @@ image_read("img/version-control/issue_04.png") |>
image_crop("3584x2000")
```
-When a conversation is resolved, you can click "Close issue".
-The closed issue can be later viewed by clicking the "Closed" header link
+When a conversation is resolved, you can click "Close issue".
+The closed issue can later be viewed by clicking the "Closed" header link
in the "Issues" tab (Figure \@ref(fig:issue-06)).
(ref:issue-06) The "Closed" issues tab on the GitHub web interface.
@@ -926,8 +926,8 @@ image_read("img/version-control/issue_06.png") |>
## Exercises
-Practice exercises for the material covered in this chapter
-can be found in the accompanying
+Practice exercises for the material covered in this chapter
+can be found in the accompanying
[worksheets repository](https://worksheets.datasciencebook.ca)
in the "Collaboration with version control" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
@@ -939,11 +939,11 @@ and guidance that the worksheets provide will function as intended.
## Additional resources {#vc-add-res}
-Now that you've picked up the basics of version control with Git and GitHub,
+Now that you've picked up the basics of version control with Git and GitHub,
you can expand your knowledge through the resources listed below:
- GitHub's [guides website](https://guides.github.com/) and [YouTube
- channel](https://www.youtube.com/githubguides), and [*Happy Git and GitHub
+ channel](https://www.youtube.com/githubguides), and [*Happy Git and GitHub
for the useR*](https://happygitwithr.com/) are great resources to take the next steps in
learning about Git and GitHub.
- [Good enough practices in scientific
@@ -957,7 +957,7 @@ you can expand your knowledge through the resources listed below:
perfectly fine to just stick with GitHub. Just be aware that you have options!
- GitHub's [documentation on creating a personal access
token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token)
- and the *Happy Git and GitHub for the useR*
+ and the *Happy Git and GitHub for the useR*
[personal access tokens chapter](https://happygitwithr.com/https-pat.html) are both
excellent additional resources to consult if you need additional help
generating and using personal access tokens.
diff --git a/source/viz.Rmd b/source/viz.Rmd
index 6f481c5aa..a76e0a12d 100644
--- a/source/viz.Rmd
+++ b/source/viz.Rmd
@@ -13,13 +13,13 @@ knitr::opts_chunk$set(fig.align = "center")
options(knitr.table.format = ifelse(knitr::is_latex_output(), "latex", "html"))
```
-## Overview
+## Overview
This chapter will introduce concepts and tools relating to data visualization
beyond what we have seen and practiced so far. We will focus on guiding
principles for effective data visualization and explaining visualizations
independent of any particular tool or programming language. In the process, we
will cover some specifics of creating visualizations (scatter plots, bar
-plots, line plots, and histograms) for data using R.
+plots, line plots, and histograms) for data using R.
## Chapter learning objectives
@@ -28,16 +28,12 @@ By the end of the chapter, readers will be able to do the following:
- Describe when to use the following kinds of visualizations to answer specific questions using a data set:
- scatter plots
- line plots
- - bar plots
+ - bar plots
- histogram plots
- Given a data set and a question, select from the above plot types and use R to create a visualization that best answers the question.
-- Given a visualization and a question, evaluate the effectiveness of the visualization and suggest improvements to better answer the question.
+- Evaluate the effectiveness of a visualization and suggest improvements to better answer a given question.
- Referring to the visualization, communicate the conclusions in non-technical terms.
-- Identify rules of thumb for creating effective visualizations.
-- Define the three key aspects of ggplot objects:
- - aesthetic mappings
- - geometric objects
- - scales
+- Identify rules of thumb for creating effective visualizations.
- Use the `ggplot2` package in R to create and refine the above visualizations using:
- geometric objects: `geom_point`, `geom_line`, `geom_histogram`, `geom_bar`, `geom_vline`, `geom_hline`
- scales: `xlim`, `ylim`
@@ -45,6 +41,10 @@ By the end of the chapter, readers will be able to do the following:
- labeling: `xlab`, `ylab`, `labs`
- font control and legend positioning: `theme`
- subplots: `facet_grid`
+- Define the three key aspects of `ggplot2` objects:
+ - aesthetic mappings
+ - geometric objects
+ - scales
- Describe the difference between raster and vector output formats.
- Use `ggsave` to save visualizations in `.png` and `.svg` format.
@@ -61,22 +61,22 @@ Imagine your visualization as part of a poster presentation for a project; even
if you aren't standing at the poster explaining things, an effective
visualization will convey your message to the audience.
-Recall the different data analysis questions
-from Chapter \@ref(intro).
-With the visualizations we will cover in this chapter,
-we will be able to answer *only descriptive and exploratory* questions.
-Be careful to not answer any *predictive, inferential, causal*
-*or mechanistic* questions with the visualizations presented here,
+Recall the different data analysis questions
+from Chapter \@ref(intro).
+With the visualizations we will cover in this chapter,
+we will be able to answer *only descriptive and exploratory* questions.
+Be careful to not answer any *predictive, inferential, causal*
+*or mechanistic* questions with the visualizations presented here,
as we have not learned the tools necessary to do that properly just yet.
As with most coding tasks, it is totally fine (and quite common) to make
mistakes and iterate a few times before you find the right visualization for
your data and question. There are many different kinds of plotting
-graphics available to use (see Chapter 5 of *Fundamentals of Data Visualization* [@wilkeviz] for a directory).
+graphics available to use (see Chapter 5 of *Fundamentals of Data Visualization* [@wilkeviz] for a directory).
The types of plot that we introduce in this book are shown in Figure \@ref(fig:plot-sketches);
-which one you should select depends on your data
-and the question you want to answer.
-In general, the guiding principles of when to use each type of plot
+which one you should select depends on your data
+and the question you want to answer.
+In general, the guiding principles of when to use each type of plot
are as follows:
\index{visualization!line}
@@ -127,7 +127,7 @@ plot_grid(scatter_plot,
line_plot,
bar_plot,
histogram_plot,
- ncol = 2,
+ ncol = 2,
greedy = FALSE)
```
@@ -149,7 +149,7 @@ Just being able to make a visualization in R (or any other language,
for that matter) doesn't mean that it effectively communicates your message to
others. Once you have selected a broad type of visualization to use, you will
have to refine it to suit your particular need. Some rules of thumb for doing
-this are listed below. They generally fall into two classes: you want to
+this are listed below. They generally fall into two classes: you want to
*make your visualization convey your message*, and you want to *reduce visual noise*
as much as possible. Humans have limited cognitive ability to process
information; both of these types of refinement aim to reduce the mental load on
@@ -163,18 +163,18 @@ understand and remember your message quickly.
- Ensure the text, symbols, lines, etc., on your visualization are big enough to be easily read.
- Ensure the data are clearly visible; don't hide the shape/distribution of the data behind other objects (e.g., a bar).
- Make sure to use color schemes that are understandable by those with
- colorblindness (a surprisingly large fraction of the overall
+ colorblindness (a surprisingly large fraction of the overall
population—from about 1% to 10%, depending on sex and ancestry [@deebblind]).
- For example, [ColorBrewer](https://colorbrewer2.org)
- and [the `RColorBrewer` R package](https://cran.r-project.org/web/packages/RColorBrewer/index.html) [@RColorBrewer] provide the
- ability to pick such color schemes, and you can check your visualizations
+ For example, [ColorBrewer](https://colorbrewer2.org)
+ and [the `RColorBrewer` R package](https://cran.r-project.org/web/packages/RColorBrewer/index.html) [@RColorBrewer] provide the
+ ability to pick such color schemes, and you can check your visualizations
after you have created them by uploading to online tools
such as a [color blindness simulator](https://www.color-blindness.com/coblis-color-blindness-simulator/).
- Redundancy can be helpful; sometimes conveying the same message in multiple ways reinforces it for the audience.
**Minimize noise**
-- Use colors sparingly. Too many different colors can be distracting, create false patterns, and detract from the message.
+- Use colors sparingly. Too many different colors can be distracting, create false patterns, and detract from the message.
- Be wary of overplotting. Overplotting is when marks that represent the data
overlap, and is problematic as it prevents you from seeing how many data
points are represented in areas of the visualization where this occurs. If your
@@ -184,13 +184,13 @@ understand and remember your message quickly.
- Don't adjust the axes to zoom in on small differences. If the difference is small, show that it's small!
-## Creating visualizations with `ggplot2`
+## Creating visualizations with `ggplot2`
#### *Build the visualization iteratively* {-}
This section will cover examples of how to choose and refine a visualization
given a data set and a question that you want to answer, and then how to create
the visualization in R \index{ggplot} using the `ggplot2` R package. Given that
-the `ggplot2` package is loaded by the `tidyverse` metapackage, we still
+the `ggplot2` package is loaded by the `tidyverse` metapackage, we
need to load only the `tidyverse` package:
```{r 03-tidyverse, warning=FALSE, message=FALSE}
@@ -203,18 +203,18 @@ options(warn = -1)
### Scatter plots and line plots: the Mauna Loa CO$_{\text{2}}$ data set
-The [Mauna Loa CO$_{\text{2}}$ data set](https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html),
-curated by Dr. Pieter Tans, NOAA/GML
+The [Mauna Loa CO$_{\text{2}}$ data set](https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html),
+curated by Dr. Pieter Tans, NOAA/GML
and Dr. Ralph Keeling, Scripps Institution of Oceanography,
-records the atmospheric concentration of carbon dioxide
-(CO$_{\text{2}}$, in parts per million)
-at the Mauna Loa research station in \index{Mauna Loa} Hawaii
+records the atmospheric concentration of carbon dioxide
+(CO$_{\text{2}}$, in parts per million)
+at the Mauna Loa research station in \index{Mauna Loa} Hawaii
from 1959 onward [@maunadata].
For this book, we are going to focus on the last 40 years of the data set,
1980-2020.
-**Question:** \index{question!visualization}
-Does the concentration of atmospheric CO$_{\text{2}}$ change over time,
+**Question:** \index{question!visualization}
+Does the concentration of atmospheric CO$_{\text{2}}$ change over time,
and are there any interesting patterns to note?
```{r, echo = FALSE, warning = FALSE, message = FALSE}
@@ -237,38 +237,38 @@ co2_df <- read_csv("data/mauna_loa_data.csv")
co2_df
```
-We see that there are two columns in the `co2_df` data frame; `date_measured` and `ppm`.
-The `date_measured` column holds the date the measurement was taken,
+We see that there are two columns in the `co2_df` data frame: `date_measured` and `ppm`.
+The `date_measured` column holds the date the measurement was taken,
and is of type `date`.
-The `ppm` column holds the value of CO$_{\text{2}}$ in parts per million
+The `ppm` column holds the value of CO$_{\text{2}}$ in parts per million
that was measured on each date, and is of type `double`.
> **Note:** `read_csv` was able to parse the `date_measured` column into the
-> `date` vector type because it was entered
-> in the international standard date format,
+> `date` vector type because it was entered
+> in the international standard date format,
> called ISO 8601, which lists dates as `year-month-day`.
-> `date` vectors are `double` vectors with special properties that allow
+> `date` vectors are `double` vectors with special properties that allow
> them to handle dates correctly.
-> For example, `date` type vectors allow functions like `ggplot`
-> to treat them as numeric dates and not as character vectors,
-> even though they contain non-numeric characters
+> For example, `date` type vectors allow functions like `ggplot`
+> to treat them as numeric dates and not as character vectors,
+> even though they contain non-numeric characters
> (e.g., in the `date_measured` column in the `co2_df` data frame).
-> This means R will not accidentally plot the dates in the wrong order
-> (i.e., not alphanumerically as would happen if it was a character vector).
-> An in-depth study of dates and times is beyond the scope of the book,
-> but interested readers
+> This means R will not accidentally plot the dates in the wrong order
+> (i.e., not alphanumerically as would happen if it was a character vector).
+> An in-depth study of dates and times is beyond the scope of the book,
+> but interested readers
> may consult the Dates and Times chapter of *R for Data Science* [@wickham2016r];
> see the additional resources at the end of this chapter.
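As a quick, illustrative check of the claims in this note (an addition for clarity, not part of the book's original analysis), you can inspect the parsed `date_measured` column of `co2_df` directly:

```{r}
# The date_measured column is a date vector stored internally as a double
class(co2_df$date_measured)   # should print "Date"
typeof(co2_df$date_measured)  # should print "double"
```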
-Since we are investigating a relationship between two variables
-(CO$_{\text{2}}$ concentration and date),
-a scatter plot is a good place to start.
-Scatter plots show the data as individual points with `x` (horizontal axis)
+Since we are investigating a relationship between two variables
+(CO$_{\text{2}}$ concentration and date),
+a scatter plot is a good place to start.
+Scatter plots show the data as individual points with `x` (horizontal axis)
and `y` (vertical axis) coordinates.
-Here, we will use the measurement date as the `x` coordinate
-and the CO$_{\text{2}}$ concentration as the `y` coordinate.
-When using the `ggplot2` package,
-we create a plot object with the `ggplot` function.
+Here, we will use the measurement date as the `x` coordinate
+and the CO$_{\text{2}}$ concentration as the `y` coordinate.
+When using the `ggplot2` package,
+we create a plot object with the `ggplot` function.
There are a few basic aspects of a plot that we need to specify:
\index{ggplot!aesthetic mapping}
\index{ggplot!geometric object}
@@ -283,7 +283,7 @@ There are a few basic aspects of a plot that we need to specify:
- To create a geometric object, we use a `geom_*` function (see the [ggplot reference](https://ggplot2.tidyverse.org/reference/) for a list of geometric objects).
- Here, we use the `geom_point` function to visualize our data as a scatter plot.
-Figure \@ref(fig:03-ggplot-function-scatter)
+Figure \@ref(fig:03-ggplot-function-scatter)
shows how each of these aspects maps to code
for creating a basic scatter plot of the `co2_df` data.
Note that we could pass many other possible arguments to the aesthetic mapping
@@ -292,7 +292,7 @@ testing things out to see what they look like, though, we can just start with th
default settings.
\index{ggplot!aes}
\index{ggplot!geom\_point}
-
+
(ref:03-ggplot-function-scatter) Creating a scatter plot with the `ggplot` function.
```{r 03-ggplot-function-scatter, echo = FALSE, fig.cap = "(ref:03-ggplot-function-scatter)", message = FALSE, out.width = "100%"}
@@ -309,28 +309,28 @@ co2_scatter <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
co2_scatter
```
-Certainly, the visualization in Figure \@ref(fig:03-data-co2-scatter)
-shows a clear upward trend
+Certainly, the visualization in Figure \@ref(fig:03-data-co2-scatter)
+shows a clear upward trend
in the atmospheric concentration of CO$_{\text{2}}$ over time.
-This plot answers the first part of our question in the affirmative,
-but that appears to be the only conclusion one can make
-from the scatter visualization.
+This plot answers the first part of our question in the affirmative,
+but that appears to be the only conclusion one can make
+from the scatter visualization.
One important thing to note about this data is that one of the variables
we are exploring is time.
-Time is a special kind of quantitative variable
-because it forces additional structure on the data—the
-data points have a natural order.
-Specifically, each observation in the data set has a predecessor
-and a successor, and the order of the observations matters; changing their order
+Time is a special kind of quantitative variable
+because it forces additional structure on the data—the
+data points have a natural order.
+Specifically, each observation in the data set has a predecessor
+and a successor, and the order of the observations matters; changing their order
alters their meaning.
In situations like this, we typically use a line plot to visualize
the data. Line plots connect the sequence of `x` and `y` coordinates
of the observations with line segments, thereby emphasizing their order.
-We can create a line plot in `ggplot` using the `geom_line` function.
-Let's now try to visualize the `co2_df` as a line plot
-with just the default arguments:
+We can create a line plot in `ggplot` using the `geom_line` function.
+Let's now try to visualize the `co2_df` as a line plot
+with just the default arguments:
\index{ggplot!geom\_line}
```{r 03-data-co2-line, warning=FALSE, message=FALSE, fig.height = 3.1, fig.width = 4.5, fig.align = "center", fig.cap = "Line plot of atmospheric concentration of CO$_{2}$ over time."}
@@ -369,7 +369,7 @@ co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
co2_line
```
-> **Note:** The `theme` function is quite complex and has many arguments
+> **Note:** The `theme` function is quite complex and has many arguments
> that can be specified to control many non-data aspects of a visualization.
> An in-depth discussion of the `theme` function is beyond the scope of this book.
> Interested readers may consult the `theme` function documentation;
@@ -382,20 +382,20 @@ answer. We will accomplish this by using *scales*, \index{ggplot!scales}
another important feature of `ggplot2` that makes it easy to transform
variables and set axis limits. We scale the horizontal axis using the `xlim` function,
and the vertical axis with the `ylim` function.
-In particular, here, we will use the `xlim` function to zoom in
+In particular, here, we will use the `xlim` function to zoom in
on just five years of data (say, 1990-1994).
-`xlim` takes a vector of length two
-to specify the upper and lower bounds to limit the axis.
+`xlim` takes a vector of length two
+to specify the upper and lower bounds to limit the axis.
We can create that using the `c` function.
Note that the vector given to `xlim` must be of the same
-type as the data that is mapped to that axis.
-Here, we have mapped a date to the x-axis,
-and so we need to use the `date` function
-(from the `tidyverse` [`lubridate` R package](https://lubridate.tidyverse.org/) [@lubridate; @lubridatepaper])
+type as the data that is mapped to that axis.
+Here, we have mapped a date to the x-axis,
+and so we need to use the `date` function
+(from the `tidyverse` [`lubridate` R package](https://lubridate.tidyverse.org/) [@lubridate; @lubridatepaper])
to convert the character strings we provide to `c` to `date` vectors.
> **Note:** `lubridate` is a package that is installed by the `tidyverse` metapackage,
-> but is not loaded by it.
+> but is not loaded by it.
> Hence we need to load it separately in the code below.
```{r 03-data-co2-line-3, warning = FALSE, message = FALSE, fig.height = 3.1, fig.width = 4.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Line plot of atmospheric concentration of CO$_{2}$ from 1990 to 1994."}
@@ -411,44 +411,44 @@ co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
co2_line
```
-Interesting! It seems that each year, the atmospheric CO$_{\text{2}}$ increases until it reaches its peak somewhere around April, decreases until around late September,
+Interesting! It seems that each year, the atmospheric CO$_{\text{2}}$ increases until it reaches its peak somewhere around April, decreases until around late September,
and finally increases again until the end of the year. In Hawaii, there are two seasons: summer from May through October, and winter from November through April.
Therefore, the oscillating pattern in CO$_{\text{2}}$ matches up fairly closely with the two seasons.
As you might have noticed from the code used to create the final visualization
-of the `co2_df` data frame,
+of the `co2_df` data frame,
we construct the visualizations in `ggplot` with layers.
-New layers are added with the `+` operator,
+New layers are added with the `+` operator,
and we can really add as many as we would like!
A useful analogy to constructing a data visualization is painting a picture.
-We start with a blank canvas,
-and the first thing we do is prepare the surface
-for our painting by adding primer.
-In our data visualization this is akin to calling `ggplot`
+We start with a blank canvas,
+and the first thing we do is prepare the surface
+for our painting by adding primer.
+In our data visualization this is akin to calling `ggplot`
and specifying the data set we will be using.
-Next, we sketch out the background of the painting.
-In our data visualization,
+Next, we sketch out the background of the painting.
+In our data visualization,
this would be when we map data to the axes in the `aes` function.
Then we add our key visual subjects to the painting.
-In our data visualization,
+In our data visualization,
this would be the geometric objects (e.g., `geom_point`, `geom_line`, etc.).
And finally, we work on adding details and refinements to the painting.
In our data visualization this would be when we fine tune axis labels,
change the font, adjust the point size, and do other related things.
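To make the analogy concrete, the sketch below builds a line plot of the `co2_df` data layer by layer; the particular labels and theme adjustment here are illustrative choices, not ones prescribed by the book:

```{r}
ggplot(co2_df) +                                 # prepare the canvas: specify the data
  aes(x = date_measured, y = ppm) +              # sketch the background: map data to the axes
  geom_line() +                                  # add the key subjects: geometric objects
  xlab("Year") +                                 # refine the details: labels and fonts
  ylab("Atmospheric CO2 (ppm)") +
  theme(text = element_text(size = 12))
```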
### Scatter plots: the Old Faithful eruption time data set
-The `faithful` data set \index{Old Faithful} contains measurements
-of the waiting time between eruptions
+The `faithful` data set \index{Old Faithful} contains measurements
+of the waiting time between eruptions
and the subsequent eruption duration (in minutes) of the Old Faithful
-geyser in Yellowstone National Park, Wyoming, United States.
+geyser in Yellowstone National Park, Wyoming, United States.
The `faithful` data set is available in base R as a data frame,
so it does not need to be loaded.
-We convert it to a tibble to take advantage of the nicer print output
+We convert it to a tibble to take advantage of the nicer print output
these specialized data frames provide.
-**Question:** \index{question!visualization}
-Is there a relationship between the waiting time before an eruption
-and the duration of the eruption?
+**Question:** \index{question!visualization}
+Is there a relationship between the waiting time before an eruption
+and the duration of the eruption?
```{r 03-data-faithful, warning=FALSE, message=FALSE}
# old faithful eruption time / wait time data
@@ -456,14 +456,14 @@ faithful <- as_tibble(faithful)
faithful
```
-Here again, we investigate the relationship between two quantitative variables
-(waiting time and eruption time).
-But if you look at the output of the data frame,
+Here again, we investigate the relationship between two quantitative variables
+(waiting time and eruption time).
+But if you look at the output of the data frame,
you'll notice that unlike time in the Mauna Loa CO$_{\text{2}}$ data set,
neither of the variables here has a natural order.
So a scatter plot is likely to be the most appropriate
visualization. Let's create a scatter plot using the `ggplot`
-function with the `waiting` variable on the horizontal axis, the `eruptions`
+function with the `waiting` variable on the horizontal axis, the `eruptions`
variable on the vertical axis, and the `geom_point` geometric object.
The result is shown in Figure \@ref(fig:03-data-faithful-scatter).
@@ -484,7 +484,7 @@ labels and make the font more readable:
```{r 03-data-faithful-scatter-2, warning=FALSE, message=FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Scatter plot of waiting time and eruption time with clearer axes and labels."}
faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
geom_point() +
- xlab("Waiting Time (mins)") +
+ xlab("Waiting Time (mins)") +
ylab("Eruption Duration (mins)") +
theme(text = element_text(size = 12))
@@ -501,7 +501,7 @@ Recall the `can_lang` data set [@timbers2020canlang] from Chapters \@ref(intro),
Canadian census.
**Question:** \index{question!visualization} Is there a relationship between
-the percentage of people who speak a language as their mother tongue and
+the percentage of people who speak a language as their mother tongue and
the percentage for whom that is the primary language spoken at home?
And is there a pattern in the strength of this relationship in the
higher-level language categories (Official languages, Aboriginal languages, or
@@ -521,15 +521,15 @@ The resulting plot is shown in Figure \@ref(fig:03-mother-tongue-vs-most-at-home
```{r 03-mother-tongue-vs-most-at-home, fig.height=3.5, fig.width=3.75, fig.align = "center", warning=FALSE, fig.pos = "H", out.extra="", fig.cap = "Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home."}
ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
geom_point()
-```
+```
-To make an initial improvement in the interpretability
-of Figure \@ref(fig:03-mother-tongue-vs-most-at-home), we should
+To make an initial improvement in the interpretability
+of Figure \@ref(fig:03-mother-tongue-vs-most-at-home), we should
replace the default axis
names with more informative labels. We can use `\n` to create a line break in
the axis names so that the words after `\n` are printed on a new line. This will
make the axes labels on the plots more readable.
-\index{escape character} We should also increase the font size to further
+\index{escape character} We should also increase the font size to further
improve readability.
```{r 03-mother-tongue-vs-most-at-home-labs, fig.height=3.5, fig.width=3.75, fig.align = "center", warning=FALSE, fig.pos = "H", out.extra="", fig.cap = "Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home with x and y labels."}
@@ -541,15 +541,15 @@ ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
```
```{r mother-tongue-hidden-summaries, echo = FALSE, warning = FALSE, message = FALSE}
-numlang_speakers <- can_lang |>
- select(mother_tongue) |>
- summarize(maxsp = max(mother_tongue),
+numlang_speakers <- can_lang |>
+ select(mother_tongue) |>
+ summarize(maxsp = max(mother_tongue),
minsp = min(mother_tongue))
-maxlang_speakers <- numlang_speakers |>
+maxlang_speakers <- numlang_speakers |>
pull(maxsp)
-minlang_speakers <- numlang_speakers |>
+minlang_speakers <- numlang_speakers |>
pull(minsp)
```
@@ -558,8 +558,8 @@ much more readable and interpretable now. However, the scatter points themselves
some work; most of the 214 data points are bunched
up in the lower left-hand side of the visualization. The data is clumped because
many more people in Canada speak English or French (the two points in
-the upper right corner) than other languages.
-In particular, the most common mother tongue language
+the upper right corner) than other languages.
+In particular, the most common mother tongue language
has `r format(maxlang_speakers, scientific = FALSE, big.mark = ",")` speakers,
while the least common has only `r format(minlang_speakers, scientific = FALSE, big.mark = ",")`.
That's a `r as.integer(floor(log10(maxlang_speakers/minlang_speakers)))`-decimal-place difference
@@ -574,7 +574,7 @@ can_lang |>
```
Recall that our question about this data pertains to *all* languages;
-so to properly answer our question,
+so to properly answer our question,
we will need to adjust the scale of the axes so that we can clearly
see all of the scatter points.
In particular, we will improve the plot by adjusting the horizontal
@@ -582,13 +582,13 @@ and vertical axes so that they are on a **logarithmic** (or **log**) scale. \ind
Log scaling is useful when your data take both *very large* and *very small* values,
because it helps space out small values and squishes larger values together.
For example, $\log_{10}(1) = 0$, $\log_{10}(10) = 1$, $\log_{10}(100) = 2$, and $\log_{10}(1000) = 3$;
-on the logarithmic scale, \index{ggplot!logarithmic scaling}
+on the logarithmic scale, \index{ggplot!logarithmic scaling}
the values 1, 10, 100, and 1000 are all the same distance apart!
-So we see that applying this function is moving big values closer together
+So we see that applying this function moves big values closer together
and small values farther apart.
-Note that if your data can take the value 0, logarithmic scaling may not
+Note that if your data can take the value 0, logarithmic scaling may not
be appropriate (since `log10(0) = -Inf` in R). There are other ways to transform
-the data in such a case, but these are beyond the scope of the book.
+the data in such a case, but these are beyond the scope of the book.
We can accomplish logarithmic scaling in a `ggplot` visualization
using the `scale_x_log10` and `scale_y_log10` functions.
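Before we do, here is a quick, illustrative check of the claims above (this snippet is an addition for clarity, not code from the book):

```{r}
log10(c(1, 10, 100, 1000))  # 0 1 2 3: equally spaced on the log scale
log10(0)                    # -Inf, which is why zeros are problematic on a log scale
```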
@@ -617,17 +617,17 @@ english_mother_tongue <- can_lang |>
census_popn <- 35151728
```
-Similar to some of the examples in Chapter \@ref(wrangling),
-we can convert the counts to percentages to give them context
+Similar to some of the examples in Chapter \@ref(wrangling),
+we can convert the counts to percentages to give them context
and make them easier to understand.
-We can do this by dividing the number of people reporting a given language
-as their mother tongue or primary language at home
-by the number of people who live in Canada and multiplying by 100\%.
-For example,
-the percentage of people who reported that their mother tongue was English
-in the 2016 Canadian census
-was `r format(english_mother_tongue, scientific = FALSE, big.mark = ",") `
-/ `r format(census_popn, scientific = FALSE, big.mark = ",")` $\times$
+We can do this by dividing the number of people reporting a given language
+as their mother tongue or primary language at home
+by the number of people who live in Canada and multiplying by 100\%.
+For example,
+the percentage of people who reported that their mother tongue was English
+in the 2016 Canadian census
+was `r format(english_mother_tongue, scientific = FALSE, big.mark = ",") `
+/ `r format(census_popn, scientific = FALSE, big.mark = ",")` $\times$
`r 100` \% =
`r format(round(english_mother_tongue/census_popn*100, 2), scientific = FALSE, big.mark = ",")`\%.
@@ -645,12 +645,12 @@ can_lang <- can_lang |>
most_at_home_percent = (most_at_home / 35151728) * 100
)
-can_lang |>
+can_lang |>
select(mother_tongue_percent, most_at_home_percent)
```
Finally, we will edit the visualization to use the percentages we just computed
-(and change our axis labels to reflect this change in
+(and change our axis labels to reflect this change in
units). Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props) displays
the final result.
@@ -666,35 +666,35 @@ ggplot(can_lang, aes(x = most_at_home_percent, y = mother_tongue_percent)) +
Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props) is the appropriate
visualization to use to answer the first question in this section, i.e.,
-whether there is a relationship between the percentage of people who speak
+whether there is a relationship between the percentage of people who speak
a language as their mother tongue and the percentage for whom that
is the primary language spoken at home.
To fully answer the question, we need to use
Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props)
-to assess a few key characteristics of the data:
+to assess a few key characteristics of the data:
-- **Direction:** if the y variable tends to increase when the x variable increases, then y has a **positive** relationship with x. If
- y tends to decrease when x increases, then y has a **negative** relationship with x. If y does not meaningfully increase or decrease
+- **Direction:** if the y variable tends to increase when the x variable increases, then y has a **positive** relationship with x. If
+ y tends to decrease when x increases, then y has a **negative** relationship with x. If y does not meaningfully increase or decrease
as x increases, then y has **little or no** relationship with x. \index{relationship!positive, negative, none}
- **Strength:** if the y variable *reliably* increases, decreases, or stays flat as x increases,
then the relationship is **strong**. Otherwise, the relationship is **weak**. Intuitively, \index{relationship!strong, weak}
the relationship is strong when the scatter points are close together and look more like a "line" or "curve" than a "cloud."
- **Shape:** if you can draw a straight line roughly through the data points, the relationship is **linear**. Otherwise, it is **nonlinear**. \index{relationship!linear, nonlinear}
-In Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props), we see that
-as the percentage of people who have a language as their mother tongue increases,
-so does the percentage of people who speak that language at home.
+In Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props), we see that
+as the percentage of people who have a language as their mother tongue increases,
+so does the percentage of people who speak that language at home.
Therefore, there is a **positive** relationship between these two variables.
-Furthermore, because the points in Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props)
+Furthermore, because the points in Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props)
are fairly close together, and the points look more like a "line" than a "cloud",
-we can say that this is a **strong** relationship.
-And finally, because drawing a straight line through these points in
+we can say that this is a **strong** relationship.
+And finally, because drawing a straight line through these points in
Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props)
would fit the pattern we observe quite well, we say that the relationship is **linear**.
Onto the second part of our exploratory data analysis question!
-Recall that we are interested in knowing whether the strength
-of the relationship we uncovered
+Recall that we are interested in knowing whether the strength
+of the relationship we uncovered
in Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props) depends
on the higher-level language category (Official languages, Aboriginal languages,
and non-official, non-Aboriginal languages).
@@ -702,18 +702,18 @@ One common way to explore this
is to color the data points on the scatter plot we have already created by
group. For example, given that we have the higher-level language category for
each language recorded in the 2016 Canadian census, we can color the points in
-our previous
+our previous
scatter plot to represent each language's higher-level language category.
Here we want to distinguish the values according to the `category` group to
which they belong. We can add an argument to the `aes` function, specifying
that the `category` column should color the points. Adding this argument will
color the points according to their group and add a legend at the side of the
-plot.
+plot.
```{r 03-scatter-color-by-category, warning=FALSE, fig.height=3.5, fig.width=5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category."}
-ggplot(can_lang, aes(x = most_at_home_percent,
- y = mother_tongue_percent,
+ggplot(can_lang, aes(x = most_at_home_percent,
+ y = mother_tongue_percent,
color = category)) +
geom_point() +
xlab("Language spoken most at home \n (percentage of Canadian residents)") +
@@ -723,23 +723,23 @@ ggplot(can_lang, aes(x = most_at_home_percent,
scale_y_log10(labels = comma)
```
-The legend in Figure \@ref(fig:03-scatter-color-by-category)
-takes up valuable plot area.
+The legend in Figure \@ref(fig:03-scatter-color-by-category)
+takes up valuable plot area.
We can improve this by moving the legend using the `legend.position`
and `legend.direction`
-arguments of the `theme` function.
+arguments of the `theme` function.
Here we set `legend.position` to `"top"` to put the legend above the plot
-and `legend.direction` to `"vertical"` so that the legend items remain
+and `legend.direction` to `"vertical"` so that the legend items remain
vertically stacked on top of each other.
-When the `legend.position` is set to either `"top"` or `"bottom"`
+When `legend.position` is set to either `"top"` or `"bottom"`,
the default direction is to stack the legend items horizontally.
-However, that will not work well for this particular visualization
-because the legend labels are quite long
+However, that will not work well for this particular visualization
+because the legend labels are quite long
and would run off the page if displayed this way.
```{r 03-scatter-color-by-category-legend-edit, warning=FALSE, fig.height=4.75, fig.width=3.75, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category with the legend edited."}
-ggplot(can_lang, aes(x = most_at_home_percent,
- y = mother_tongue_percent,
+ggplot(can_lang, aes(x = most_at_home_percent,
+ y = mother_tongue_percent,
color = category)) +
geom_point() +
xlab("Language spoken most at home \n (percentage of Canadian residents)") +
@@ -753,13 +753,13 @@ ggplot(can_lang, aes(x = most_at_home_percent,
In Figure \@ref(fig:03-scatter-color-by-category-legend-edit), the points are colored with
the default `ggplot2` color palette. But what if you want to use different
-colors? In R, two packages that provide alternative color
+colors? In R, two packages that provide alternative color
palettes \index{color palette} are `RColorBrewer` [@RColorBrewer]
and `ggthemes` [@ggthemes]; in this book we will cover how to use `RColorBrewer`.
You can visualize the list of color
palettes that `RColorBrewer` has to offer with the `display.brewer.all`
function. You can also restrict the display to color-blind friendly palettes by adding
-`colorblindFriendly = TRUE` to the function.
+`colorblindFriendly = TRUE` to the function.
(ref:rcolorbrewer) Color palettes available from the `RColorBrewer` R package.
@@ -768,25 +768,25 @@ library(RColorBrewer)
display.brewer.all(colorblindFriendly = TRUE)
```
-From Figure \@ref(fig:rcolorbrewer),
-we can choose the color palette we want to use in our plot.
-To change the color palette,
-we add the `scale_color_brewer` layer indicating the palette we want to use.
-You can use
-this [color blindness simulator](https://www.color-blindness.com/coblis-color-blindness-simulator/) to check
-if your visualizations \index{color palette!color blindness simulator}
+From Figure \@ref(fig:rcolorbrewer),
+we can choose the color palette we want to use in our plot.
+To change the color palette,
+we add the `scale_color_brewer` layer indicating the palette we want to use.
+You can use
+this [color blindness simulator](https://www.color-blindness.com/coblis-color-blindness-simulator/) to check
+if your visualizations \index{color palette!color blindness simulator}
are color-blind friendly.
Below we pick the `"Set2"` palette, with the result shown
in Figure \@ref(fig:scatter-color-by-category-palette).
We also set the `shape` aesthetic mapping to the `category` variable;
-this makes the scatter point shapes different for each category. This kind of
+this makes the scatter point shapes different for each category. This kind of
visual redundancy—i.e., conveying the same information with both scatter point color and shape—can
further improve the clarity and accessibility of your visualization.
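As a minimal sketch of these two ideas in isolation (again using the `mpg` stand-in data rather than the Canadian languages data), both `color` and `shape` are mapped to the same categorical variable, and `scale_color_brewer` selects the palette:

```r
library(ggplot2)

# Map color and shape to the same categorical variable for visual redundancy,
# and switch to the color-blind friendly "Set2" palette.
ggplot(mpg, aes(x = displ, y = hwy, color = drv, shape = drv)) +
  geom_point() +
  scale_color_brewer(palette = "Set2")
```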
```{r scatter-color-by-category-palette, fig.height=4.75, fig.width=3.75, fig.align = "center", warning=FALSE, fig.pos = "H", out.extra="", fig.cap = "Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category with color-blind friendly colors."}
-ggplot(can_lang, aes(x = most_at_home_percent,
- y = mother_tongue_percent,
- color = category,
+ggplot(can_lang, aes(x = most_at_home_percent,
+ y = mother_tongue_percent,
+ color = category,
shape = category)) +
geom_point() +
xlab("Language spoken most at home \n (percentage of Canadian residents)") +
@@ -799,20 +799,20 @@ ggplot(can_lang, aes(x = most_at_home_percent,
scale_color_brewer(palette = "Set2")
```
-From the visualization in Figure \@ref(fig:scatter-color-by-category-palette),
-we can now clearly see that the vast majority of Canadians reported one of the official languages
-as their mother tongue and as the language they speak most often at home.
-What do we see when considering the second part of our exploratory question?
+From the visualization in Figure \@ref(fig:scatter-color-by-category-palette),
+we can now clearly see that the vast majority of Canadians reported one of the official languages
+as their mother tongue and as the language they speak most often at home.
+What do we see when considering the second part of our exploratory question?
Do we see a difference in the relationship
between languages spoken as a mother tongue and as a primary language
-at home across the higher-level language categories?
+at home across the higher-level language categories?
Based on Figure \@ref(fig:scatter-color-by-category-palette), there does not
appear to be much of a difference.
-For each higher-level language category,
-there appears to be a strong, positive, and linear relationship between
-the percentage of people who speak a language as their mother tongue
-and the percentage who speak it as their primary language at home.
-The relationship looks similar regardless of the category.
+For each higher-level language category,
+there appears to be a strong, positive, and linear relationship between
+the percentage of people who speak a language as their mother tongue
+and the percentage who speak it as their primary language at home.
+The relationship looks similar regardless of the category.
Does this mean that this relationship is positive for all languages in the
world? And further, can we use this data visualization on its own to predict how many people
@@ -821,14 +821,14 @@ it as their primary language at home? The answer to both these questions is
"no!" However, with exploratory data analysis, we can create new hypotheses,
ideas, and questions (like the ones at the beginning of this paragraph).
Answering those questions often involves doing more complex analyses, and sometimes
-even gathering additional data. We will see more of such complex analyses later on in
+even gathering additional data. We will see more of such complex analyses later on in
this book.
### Bar plots: the island landmass data set
-The `islands.csv` data set \index{Island landmasses} contains a list of Earth's landmasses as well as their area (in thousands of square miles) [@islandsdata].
+The `islands.csv` data set \index{Island landmasses} contains a list of Earth's landmasses as well as their area (in thousands of square miles) [@islandsdata].
**Question:** \index{question!visualization} Are the continents (North / South America, Africa, Europe, Asia, Australia, Antarctica) Earth's seven largest landmasses? If so, what are the next few largest landmasses after those?
```{r, echo = FALSE, message = FALSE, warning = FALSE}
islands_df <- read_csv("data/islands.csv")
-continents <- c("Africa", "Antarctica", "Asia", "Australia",
+continents <- c("Africa", "Antarctica", "Asia", "Australia",
"Europe", "North America", "South America")
-islands_df <- mutate(islands_df,
- landmass_type = ifelse(landmass %in% continents,
+islands_df <- mutate(islands_df,
+ landmass_type = ifelse(landmass %in% continents,
"Continent", "Other"))
write_csv(islands_df, "data/islands.csv")
@@ -892,22 +892,22 @@ islands_df <- read_csv("data/islands.csv")
islands_df
```
-Here, we have a data frame of Earth's landmasses,
-and are trying to compare their sizes.
-The right type of visualization to answer this question is a bar plot.
+Here, we have a data frame of Earth's landmasses,
+and are trying to compare their sizes.
+The right type of visualization to answer this question is a bar plot.
In a bar plot, the height of each bar represents the value of an *amount*
(a size, count, proportion, percentage, etc).
They are particularly useful for comparing counts or proportions across different
-groups of a categorical variable. Note, however, that bar plots should generally not be
+groups of a categorical variable. Note, however, that bar plots should generally not be
used to display mean or median values, as they hide important information about
-the variation of the data. Instead it's better to show the distribution of
+the variation of the data. Instead it's better to show the distribution of
all the individual data points, e.g., using a histogram, which we will discuss further in Section \@ref(histogramsviz).
We specify that we would like to use a bar plot
-via the `geom_bar` function in `ggplot2`.
+via the `geom_bar` function in `ggplot2`.
However, by default, `geom_bar` sets the heights
of bars to the number of times a value appears in a data frame (its *count*); here, we want to plot exactly the values in the data frame, i.e.,
-the landmass sizes. So we have to pass the `stat = "identity"` argument to `geom_bar`. The result is
+the landmass sizes. So we have to pass the `stat = "identity"` argument to `geom_bar`. The result is
shown in Figure \@ref(fig:03-data-islands-bar).
\index{ggplot!geom\_bar}
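To see the effect of `stat = "identity"` on its own, here is a sketch with a tiny made-up data frame (not the islands data): the bar heights come directly from the `size` column rather than from row counts.

```r
library(ggplot2)
library(tibble)

# Made-up values purely for illustration.
toy <- tibble(landmass = c("A", "B", "C"),
              size     = c(10, 4, 7))

# stat = "identity" plots the size values directly instead of counting rows.
ggplot(toy, aes(x = landmass, y = size)) +
  geom_bar(stat = "identity")
```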
@@ -925,7 +925,7 @@ are hard to distinguish, and the names of the landmasses are obscuring each
other as they have been squished into too little space. But remember that the
question we asked was only about the largest landmasses; let's make the plot a
little bit clearer by keeping only the largest 12 landmasses. We do this using
-the `slice_max` function: the `order_by` argument is the name of the column we
+the `slice_max` function: the `order_by` argument is the name of the column we
want to use to determine which values are largest, and the `n` argument specifies how many
rows to keep. Then to give the labels enough
space, we'll use horizontal bars instead of vertical ones. We do this by
@@ -941,29 +941,29 @@ swapping the `x` and `y` variables.\index{slice\_max}\index{slice\_min}
```{r 03-data-islands-bar-2, warning=FALSE, message=FALSE, fig.width=5, fig.height=2.75, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Bar plot of size for Earth's largest 12 landmasses."}
islands_top12 <- slice_max(islands_df, order_by = size, n = 12)
islands_bar <- ggplot(islands_top12, aes(x = size, y = landmass)) +
- geom_bar(stat = "identity")
+ geom_bar(stat = "identity")
islands_bar
```
-The plot in Figure \@ref(fig:03-data-islands-bar-2) is definitely clearer now,
-and allows us to answer our question
-("Are the top 7 largest landmasses continents?") in the affirmative.
+The plot in Figure \@ref(fig:03-data-islands-bar-2) is definitely clearer now,
+and allows us to answer our question
+("Are the top 7 largest landmasses continents?") in the affirmative.
However, we could still improve this visualization by
coloring the bars based on whether they correspond to a continent,
and by organizing the bars by landmass size rather than by alphabetical order.
The data for coloring the bars is stored in the `landmass_type` column,
so we add the `fill` argument to the aesthetic mapping
and set it to `landmass_type`.
-To organize the landmasses by their `size` variable,
+To organize the landmasses by their `size` variable,
we will use the `tidyverse` `fct_reorder` function
in the aesthetic mapping.
The first argument passed to `fct_reorder` is the name of the factor column
-whose levels we would like to reorder (here, `landmass`).
-The second argument is the column name
+whose levels we would like to reorder (here, `landmass`).
+The second argument is the column name
that holds the values we would like to use to do the ordering (here, `size`).
-The `fct_reorder` function uses ascending order by default,
-but this can be changed to descending order
+The `fct_reorder` function uses ascending order by default,
+but this can be changed to descending order
by setting `.desc = TRUE`.
We do this here so that the largest bar will be closest to the axis line,
which is more visually appealing.
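To see what `fct_reorder` does outside of a plot, here is a small sketch with made-up values; the reordered levels are what `ggplot2` uses to decide the order of the bars.

```r
library(forcats)

landmass <- factor(c("A", "B", "C"))
size <- c(4, 10, 7)

# Reorder the factor levels by size, largest first.
fct_reorder(landmass, size, .desc = TRUE)
# The levels are now B, C, A (ordered by decreasing size).
```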
@@ -976,21 +976,21 @@ But if you decide to include one, a good plot title should provide the take home
that you want readers to focus on, e.g., "Earth's seven largest landmasses are continents,"
or a more general summary of the information displayed, e.g., "Earth's twelve largest landmasses."
-To make these final adjustments we will use the `labs` function rather than the `xlab` and `ylab` functions
+To make these final adjustments we will use the `labs` function rather than the `xlab` and `ylab` functions
we have seen earlier in this chapter, as `labs` lets us modify the legend label and title in addition to axis labels.
We provide a label for each aesthetic mapping in the plot—in this case, `x`, `y`, and `fill`—as well as one for the `title` argument.
-Finally, we again \index{ggplot!reorder} use the `theme` function
+Finally, we again \index{ggplot!reorder} use the `theme` function
to change the font size.
```{r 03-data-islands-bar-4, warning = FALSE, message = FALSE, fig.width=5, fig.height=2.75, fig.align="center", fig.pos = "H", out.extra="", fig.cap = "Bar plot of size for Earth's largest 12 landmasses, colored by landmass type, with clearer axes and labels."}
-islands_bar <- ggplot(islands_top12,
+islands_bar <- ggplot(islands_top12,
aes(x = size,
- y = fct_reorder(landmass, size, .desc = TRUE),
+ y = fct_reorder(landmass, size, .desc = TRUE),
fill = landmass_type)) +
geom_bar(stat = "identity") +
- labs(x = "Size (1000 square mi)",
- y = "Landmass",
- fill = "Type",
+ labs(x = "Size (1000 square mi)",
+ y = "Landmass",
+ fill = "Type",
title = "Earth's twelve largest landmasses") +
theme(text = element_text(size = 10))
@@ -1003,26 +1003,26 @@ their size, and continents are colored differently than other landmasses,
making it quite clear that continents are the largest seven landmasses.
### Histograms: the Michelson speed of light data set {#histogramsviz}
-The `morley` data set \index{Michelson speed of light}
-contains measurements of the speed of light
+The `morley` data set \index{Michelson speed of light}
+contains measurements of the speed of light
collected in experiments performed in 1879.
-Five experiments were performed,
-and in each experiment, 20 runs were performed—meaning that
-20 measurements of the speed of light were collected
+Five experiments were performed,
+each consisting of 20 runs—meaning that
+20 measurements of the speed of light were collected
in each experiment [@lightdata].
The `morley` data set is available in base R as a data frame,
so it does not need to be loaded.
-Because the speed of light is a very large number
+Because the speed of light is a very large number
(the true value is 299,792.458 km/sec), the data is coded
to be the measured speed of light minus 299,000.
This coding allows us to focus on the variations in the measurements, which are generally
much smaller than 299,000.
If we used the full large speed measurements, the variations in the measurements
would not be noticeable, making it difficult to study the differences between the experiments.
-Note that we convert the `morley` data to a tibble to take advantage of the nicer print output
+Note that we convert the `morley` data to a tibble to take advantage of the nicer print output
these specialized data frames provide.
-**Question:** \index{question!visualization} Given what we know now about the speed of
+**Question:** \index{question!visualization} Given what we know now about the speed of
light (299,792.458 kilometres per second), how accurate were each of the experiments?
```{r 03-data-morley, warning=FALSE, message=FALSE}
@@ -1031,22 +1031,22 @@ morley <- as_tibble(morley)
morley
```
-In this experimental data,
-Michelson was trying to measure just a single quantitative number
-(the speed of light).
-The data set contains many measurements of this single quantity.
-\index{distribution}
-To tell how accurate the experiments were,
-we need to visualize the distribution of the measurements
-(i.e., all their possible values and how often each occurs).
-We can do this using a *histogram*.
-A histogram \index{ggplot!histogram}
-helps us visualize how a particular variable is distributed in a data set
-by separating the data into bins,
-and then using vertical bars to show how many data points fell in each bin.
+In this experimental data,
+Michelson was trying to measure just a single quantitative number
+(the speed of light).
+The data set contains many measurements of this single quantity.
+\index{distribution}
+To tell how accurate the experiments were,
+we need to visualize the distribution of the measurements
+(i.e., all their possible values and how often each occurs).
+We can do this using a *histogram*.
+A histogram \index{ggplot!histogram}
+helps us visualize how a particular variable is distributed in a data set
+by separating the data into bins,
+and then using vertical bars to show how many data points fell in each bin.
To create a histogram in `ggplot2` we will use the `geom_histogram` geometric
-object, setting the `x` axis to the `Speed` measurement variable. As usual,
+object, setting the `x` axis to the `Speed` measurement variable. As usual,
let's use the default arguments just to see how things look.
```{r 03-data-morley-hist, warning=FALSE, message=FALSE, fig.height = 2.75, fig.width = 4.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Histogram of Michelson's speed of light data."}
@@ -1056,28 +1056,28 @@ morley_hist <- ggplot(morley, aes(x = Speed)) +
morley_hist
```
-Figure \@ref(fig:03-data-morley-hist) is a great start.
-However,
-we cannot tell how accurate the measurements are using this visualization
+Figure \@ref(fig:03-data-morley-hist) is a great start.
+However,
+we cannot tell how accurate the measurements are using this visualization
unless we can see the true value.
-In order to visualize the true speed of light,
+In order to visualize the true speed of light,
we will add a vertical line with the `geom_vline` function.
To draw a vertical line with `geom_vline`, \index{ggplot!geom\_vline}
-we need to specify where on the x-axis the line should be drawn.
-We can do this by setting the `xintercept` argument.
-Here we set it to 792.458, which is the true value of light speed
-minus 299,000; this ensures it is coded the same way as the
+we need to specify where on the x-axis the line should be drawn.
+We can do this by setting the `xintercept` argument.
+Here we set it to 792.458, which is the true speed of light
+minus 299,000; this ensures it is coded the same way as the
measurements in the `morley` data frame.
-We would also like to fine tune this vertical line,
+We would also like to fine tune this vertical line,
styling it so that it is dashed and 1 point in thickness.
-A point is a measurement unit commonly used with fonts,
-and 1 point is about 0.353 mm.
-We do this by setting `linetype = "dashed"` and `linewidth = 1`, respectively.
-There is a similar function, `geom_hline`,
-that is used for plotting horizontal lines.
-Note that
-*vertical lines* are used to denote quantities on the *horizontal axis*,
-while *horizontal lines* are used to denote quantities on the *vertical axis*.
+A point is a measurement unit commonly used with fonts,
+and 1 point is about 0.353 mm.
+We do this by setting `linetype = "dashed"` and `linewidth = 1`, respectively.
+There is a similar function, `geom_hline`,
+that is used for plotting horizontal lines.
+Note that
+*vertical lines* are used to denote quantities on the *horizontal axis*,
+while *horizontal lines* are used to denote quantities on the *vertical axis*.
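Putting these pieces together, a minimal sketch looks like the following (`morley` ships with base R, so this runs as is); the chapter's own chunk below adds figure sizing and other options.

```r
library(ggplot2)

# Histogram of the coded speed measurements, with a dashed vertical line
# at the true speed of light minus 299,000.
ggplot(morley, aes(x = Speed)) +
  geom_histogram() +
  geom_vline(xintercept = 792.458, linetype = "dashed", linewidth = 1)
```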
```{r 03-data-morley-hist-2, warning=FALSE, fig.height = 2.75, fig.width = 4.5, fig.align = "center", fig.pos = "H", out.extra="", message=FALSE,fig.cap = "Histogram of Michelson's speed of light data with vertical line indicating true speed of light."}
morley_hist <- ggplot(morley, aes(x = Speed)) +
@@ -1087,25 +1087,25 @@ morley_hist <- ggplot(morley, aes(x = Speed)) +
morley_hist
```
-In Figure \@ref(fig:03-data-morley-hist-2),
-we still cannot tell which experiments (denoted in the `Expt` column)
-led to which measurements;
-perhaps some experiments were more accurate than others.
-To fully answer our question,
-we need to separate the measurements from each other visually.
-We can try to do this using a *colored* histogram,
-where counts from different experiments are stacked on top of each other
-in different colors.
-We can create a histogram colored by the `Expt` variable
-by adding it to the `fill` aesthetic mapping.
-We make sure the different colors can be seen
-(despite them all sitting on top of each other)
-by setting the `alpha` argument in `geom_histogram` to `0.5`
-to make the bars slightly translucent.
-We also specify `position = "identity"` in `geom_histogram` to ensure
-the histograms for each experiment will be overlaid side-by-side,
-instead of stacked bars
-(which is the default for bar plots or histograms
+In Figure \@ref(fig:03-data-morley-hist-2),
+we still cannot tell which experiments (denoted in the `Expt` column)
+led to which measurements;
+perhaps some experiments were more accurate than others.
+To fully answer our question,
+we need to separate the measurements from each other visually.
+We can try to do this using a *colored* histogram,
+where counts from different experiments are stacked on top of each other
+in different colors.
+We can create a histogram colored by the `Expt` variable
+by adding it to the `fill` aesthetic mapping.
+We make sure the different colors can be seen
+(despite them all sitting on top of each other)
+by setting the `alpha` argument in `geom_histogram` to `0.5`
+to make the bars slightly translucent.
+We also specify `position = "identity"` in `geom_histogram` to ensure
+the histograms for each experiment will be overlaid on top of one another,
+instead of being stacked
+(which is the default for bar plots or histograms
when they are colored by another categorical variable).
```{r 03-data-morley-hist-3, warning=FALSE, message=FALSE, fig.height = 2.75, fig.width = 4.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Histogram of Michelson's speed of light data where an attempt is made to color the bars by experiment."}
@@ -1117,7 +1117,7 @@ morley_hist
```
Alright great, Figure \@ref(fig:03-data-morley-hist-3) looks...wait a second! The
-histogram is still all the same color! What is going on here? Well, if you
+histogram is still all the same color! What is going on here? Well, if you
recall from Chapter \@ref(wrangling), the *data type* you use for each variable
can influence how R and `tidyverse` treats it. Here, we indeed have an issue
with the data types in the `morley` data frame. In particular, the `Expt` column
@@ -1146,8 +1146,8 @@ morley_hist
> example) and (2) the ordering of levels in a plot. `ggplot` takes into account
> the order of the factor levels as opposed to the order of data in
> your data frame. Learning how to reorder your factor levels will help you with
-> reordering the labels of a factor on a plot.
-
+> reordering the labels of a factor on a plot.
+
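One way to treat the experiment number as categorical (the approach the later chunks in this section use) is to wrap it in `as_factor` when mapping it to `fill`; a sketch:

```r
library(ggplot2)
library(forcats)  # provides as_factor

# Convert the integer Expt column to a factor on the fly so that ggplot2
# assigns one discrete fill color per experiment.
ggplot(morley, aes(x = Speed, fill = as_factor(Expt))) +
  geom_histogram(alpha = 0.5, position = "identity")
```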
Unfortunately, the attempt to separate out the experiment number visually has
created a bit of a mess. All of the colors in Figure
\@ref(fig:03-data-morley-hist-with-factor) are blending together, and although it is
@@ -1157,17 +1157,17 @@ message and answer the question. Let's try a different strategy of creating
a grid of separate histogram plots.
-We use the `facet_grid` function to create a plot
+We use the `facet_grid` function to create a plot
that has multiple subplots arranged in a grid.
-The argument to `facet_grid` specifies the variable(s) used to split the plot
+The argument to `facet_grid` specifies the variable(s) used to split the plot
into subplots, and how to split them (i.e., into rows or columns).
-If the plot is to be split horizontally, into rows,
+If the plot is to be split horizontally, into rows,
then the `rows` argument is used.
-If the plot is to be split vertically, into columns,
+If the plot is to be split vertically, into columns,
then the `cols` argument is used.
-Both the `rows` and `cols` arguments take the column names on which to split the data when creating the subplots.
+Both the `rows` and `cols` arguments take the column names on which to split the data when creating the subplots.
Note that the column names must be surrounded by the `vars` function.
-This function allows the column names to be correctly evaluated
+This function allows the column names to be correctly evaluated
in the context of the data frame.
\index{ggplot!facet\_grid}
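As a minimal sketch of the faceting idea on its own (without the fill mapping and other styling), the following splits the histogram into one row per experiment:

```r
library(ggplot2)

# One subplot per value of Expt, stacked vertically; note that the column
# name must be wrapped in vars().
ggplot(morley, aes(x = Speed)) +
  geom_histogram() +
  facet_grid(rows = vars(Expt))
```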
@@ -1180,42 +1180,42 @@ morley_hist <- ggplot(morley, aes(x = Speed, fill = as_factor(Expt))) +
morley_hist
```
-The visualization in Figure \@ref(fig:03-data-morley-hist-4)
-now makes it quite clear how accurate the different experiments were
-with respect to one another.
-The most variable measurements came from Experiment 1.
+The visualization in Figure \@ref(fig:03-data-morley-hist-4)
+now makes it quite clear how accurate the different experiments were
+with respect to one another.
+The most variable measurements came from Experiment 1.
There, the measurements ranged from about 650–1050 km/sec.
The least variable measurements came from Experiment 2.
There, the measurements ranged from about 750–950 km/sec.
The most different experiments still obtained quite similar results!
There are two finishing touches to make this visualization even clearer. First and foremost, we need to add informative axis labels
-using the `labs` function, and increase the font size to make it readable using the `theme` function. Second, and perhaps more subtly, even though it
-is easy to compare the experiments on this plot to one another, it is hard to get a sense
+using the `labs` function, and increase the font size to make it readable using the `theme` function. Second, and perhaps more subtly, even though it
+is easy to compare the experiments on this plot to one another, it is hard to get a sense
of just how accurate all the experiments were overall. For example, how accurate is the value 800 on the plot, relative to the true speed of light?
To answer this question, we'll use the `mutate` function to transform our data into a relative measure of accuracy rather than absolute measurements:
\index{ggplot!labs}\index{ggplot!theme}
```{r 03-data-morley-hist-5, warning=FALSE, message=FALSE, fig.height = 5.25, fig.width = 4.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Histogram of relative accuracy split vertically by experiment with clearer axes and labels."}
-morley_rel <- mutate(morley,
- relative_accuracy = 100 *
+morley_rel <- mutate(morley,
+ relative_accuracy = 100 *
((299000 + Speed) - 299792.458) / (299792.458))
-morley_hist <- ggplot(morley_rel,
- aes(x = relative_accuracy,
+morley_hist <- ggplot(morley_rel,
+ aes(x = relative_accuracy,
fill = as_factor(Expt))) +
geom_histogram() +
facet_grid(rows = vars(Expt)) +
geom_vline(xintercept = 0, linetype = "dashed", linewidth = 1.0) +
- labs(x = "Relative Accuracy (%)",
- y = "# Measurements",
+ labs(x = "Relative Accuracy (%)",
+ y = "# Measurements",
fill = "Experiment ID") +
theme(text = element_text(size = 12))
morley_hist
```
-Wow, impressive! These measurements of the speed of light from 1879 had errors around *0.05%* of the true speed. Figure \@ref(fig:03-data-morley-hist-5) shows you that even though experiments 2 and 5 were perhaps the most accurate, all of the experiments did quite an
+Wow, impressive! These measurements of the speed of light from 1879 had errors around *0.05%* of the true speed. Figure \@ref(fig:03-data-morley-hist-5) shows you that even though experiments 2 and 5 were perhaps the most accurate, all of the experiments did quite an
admirable job given the technology available at the time.
\newpage
@@ -1228,100 +1228,100 @@ You can set the number of bins yourself by using
the `bins` argument in the `geom_histogram` geometric object.
You can also set the *width* of the bins using the
`binwidth` argument in the `geom_histogram` geometric object.
-But what number of bins, or bin width, is the right one to use?
+But what number of bins, or bin width, is the right one to use?
Unfortunately, there is no hard rule for what the right bin number
-or width is. It depends entirely on your problem; the *right* number of bins
-or bin width is
-the one that *helps you answer the question* you asked.
-Choosing the correct setting for your problem
+or width is. It depends entirely on your problem; the *right* number of bins
+or bin width is
+the one that *helps you answer the question* you asked.
+Choosing the correct setting for your problem
is something that commonly takes iteration.
We recommend setting the *bin width* (not the *number of bins*) because
it often more directly corresponds to values in your problem of interest. For example,
if you are looking at a histogram of human heights,
-a bin width of 1 inch would likely be reasonable, while the number of bins to use is
+a bin width of 1 inch would likely be reasonable, while the number of bins to use is
not immediately clear.
It's usually a good idea to try out several bin widths to see which one
most clearly captures your data in the context of the question
you want to answer.
-To get a sense for how different bin widths affect visualizations,
+To get a sense for how different bin widths affect visualizations,
let's experiment with the histogram that we have been working on in this section.
In Figure \@ref(fig:03-data-morley-hist-binwidth),
-we compare the default setting with three other histograms where we set the
+we compare the default setting with three other histograms where we set the
`binwidth` to 0.001, 0.01 and 0.1.
-In this case, we can see that both the default number of bins
+In this case, we can see that both the default number of bins
and the binwidth of 0.01 are effective for helping answer our question.
On the other hand, the bin widths of 0.001 and 0.1 are too small and too big, respectively.
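For readers who want to experiment themselves, here is a sketch that recreates the relative-accuracy variable (using the formula shown earlier) and sets one candidate bin width; swapping in other values for `binwidth` reproduces the comparison in the figure below.

```r
library(ggplot2)
library(dplyr)

# Relative accuracy (%) of each measurement, as defined earlier in the section.
morley_rel <- mutate(morley,
  relative_accuracy = 100 * ((299000 + Speed) - 299792.458) / 299792.458)

# Set the bin width explicitly; try 0.001, 0.01, and 0.1 to compare.
ggplot(morley_rel, aes(x = relative_accuracy)) +
  geom_histogram(binwidth = 0.01)
```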
```{r 03-data-morley-hist-binwidth, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 10, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Effect of varying bin width on histograms."}
-morley_hist_default <- ggplot(morley_rel,
- aes(x = relative_accuracy,
+morley_hist_default <- ggplot(morley_rel,
+ aes(x = relative_accuracy,
fill = as_factor(Expt))) +
geom_histogram() +
facet_grid(rows = vars(Expt)) +
geom_vline(xintercept = 0, linetype = "dashed", linewidth = 1.0) +
- labs(x = "Relative Accuracy (%)",
- y = "# Measurements",
+ labs(x = "Relative Accuracy (%)",
+ y = "# Measurements",
fill = "Experiment ID") +
theme(legend.position = "none") +
- ggtitle("Default (bins = 30)") +
- theme(text = element_text(size = 14), axis.title=element_text(size=14))
+ ggtitle("Default (bins = 30)") +
+ theme(text = element_text(size = 14), axis.title=element_text(size=14))
-morley_hist_big <- ggplot(morley_rel,
- aes(x = relative_accuracy,
+morley_hist_big <- ggplot(morley_rel,
+ aes(x = relative_accuracy,
fill = as_factor(Expt))) +
geom_histogram(binwidth = 0.1) +
facet_grid(rows = vars(Expt)) +
geom_vline(xintercept = 0, linetype = "dashed", linewidth = 1.0) +
- labs(x = "Relative Accuracy (%)",
- y = "# Measurements",
+ labs(x = "Relative Accuracy (%)",
+ y = "# Measurements",
fill = "Experiment ID") +
theme(legend.position = "none") +
- ggtitle( "binwidth = 0.1") +
- theme(text = element_text(size = 14), axis.title=element_text(size=14))
+ ggtitle( "binwidth = 0.1") +
+ theme(text = element_text(size = 14), axis.title=element_text(size=14))
-morley_hist_med <- ggplot(morley_rel,
- aes(x = relative_accuracy,
+morley_hist_med <- ggplot(morley_rel,
+ aes(x = relative_accuracy,
fill = as_factor(Expt))) +
geom_histogram(binwidth = 0.01) +
facet_grid(rows = vars(Expt)) +
geom_vline(xintercept = 0, linetype = "dashed", linewidth = 1.0) +
- labs(x = "Relative Accuracy (%)",
- y = "# Measurements",
+ labs(x = "Relative Accuracy (%)",
+ y = "# Measurements",
fill = "Experiment ID") +
theme(legend.position = "none") +
- ggtitle("binwidth = 0.01") +
- theme(text = element_text(size = 14), axis.title=element_text(size=14))
+ ggtitle("binwidth = 0.01") +
+ theme(text = element_text(size = 14), axis.title=element_text(size=14))
-morley_hist_small <- ggplot(morley_rel,
- aes(x = relative_accuracy,
+morley_hist_small <- ggplot(morley_rel,
+ aes(x = relative_accuracy,
fill = as_factor(Expt))) +
geom_histogram(binwidth = 0.001) +
facet_grid(rows = vars(Expt)) +
geom_vline(xintercept = 0, linetype = "dashed", linewidth = 1.0) +
- labs(x = "Relative Accuracy (%)",
- y = "# Measurements",
+ labs(x = "Relative Accuracy (%)",
+ y = "# Measurements",
fill = "Experiment ID") +
theme(legend.position = "none") +
- ggtitle("binwidth = 0.001") +
- theme(text = element_text(size = 14), axis.title=element_text(size=14))
+ ggtitle("binwidth = 0.001") +
+ theme(text = element_text(size = 14), axis.title=element_text(size=14))
-plot_grid(morley_hist_default,
- morley_hist_small,
- morley_hist_med,
- morley_hist_big,
+plot_grid(morley_hist_default,
+ morley_hist_small,
+ morley_hist_med,
+ morley_hist_big,
ncol = 2)
```
#### Adding layers to a `ggplot` plot object {-}
-One of the powerful features of `ggplot` is that you
+One of the powerful features of `ggplot` is that you
can continue to iterate on a single plot object, adding and refining
one layer \index{ggplot!add layer} at a time. If you stored your plot as a named object
-using the assignment symbol (`<-`), you can
+using the assignment symbol (`<-`), you can
add to it using the `+` operator.
-For example, if we wanted to add a title to the last plot we created (`morley_hist`),
+For example, if we wanted to add a title to the last plot we created (`morley_hist`),
we can use the `+` operator to add a title layer with the `ggtitle` function.
The result is shown in Figure \@ref(fig:03-data-morley-hist-addlayer).
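The pattern is simply `stored_plot + new_layer`. A sketch, assuming the `morley_hist` object created above is still in your environment (the title text here is illustrative, not necessarily the one used in the chunk below):

```r
# Add a title layer to the stored plot object; the title wording is
# just an example.
morley_hist_title <- morley_hist +
  ggtitle("Speed of light experiments")
```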
@@ -1332,8 +1332,8 @@ morley_hist_title <- morley_hist +
morley_hist_title
```
-> **Note:** Good visualization titles clearly communicate
-> the take home message to the audience. Typically,
+> **Note:** Good visualization titles clearly communicate
+> the take home message to the audience. Typically,
> that is the answer to the question you posed before making the visualization.
## Explaining the visualization
@@ -1346,27 +1346,27 @@ conclusion. For example, you could use an exploratory visualization in the
opening of the presentation to motivate your choice of a more detailed data
analysis / model, a visualization of the results of your analysis to show what
your analysis has uncovered, or even one at the end of a presentation to help
-suggest directions for future work.
+suggest directions for future work.
Regardless of where it appears, a good way to discuss your visualization \index{visualization!explanation} is as
-a story:
+a story:
-1) Establish the setting and scope, and describe why you did what you did.
+1) Establish the setting and scope, and describe why you did what you did.
2) Pose the question that your visualization answers. Justify why the question is important to answer.
-3) Answer the question using your visualization. Make sure you describe *all* aspects of the visualization (including describing the axes). But you
+3) Answer the question using your visualization. Make sure you describe *all* aspects of the visualization (including describing the axes). But you
can emphasize different aspects based on what is important to answer your question:
- **trends (lines):** Does a line describe the trend well? If so, the trend is *linear*, and if not, the trend is *nonlinear*. Is the trend increasing, decreasing, or neither?
Is there a periodic oscillation (wiggle) in the trend? Is the trend noisy (does the line "jump around" a lot) or smooth?
- **distributions (scatters, histograms):** How spread out are the data? Where are they centered, roughly? Are there any obvious "clusters" or "subgroups", which would be visible as multiple bumps in the histogram?
- **distributions of two variables (scatters):** Is there a clear / strong relationship between the variables (points fall in a distinct pattern), a weak one (points fall in a pattern but there is some noise), or no discernible
relationship (the data are too noisy to make any conclusion)?
- - **amounts (bars):** How large are the bars relative to one another? Are there patterns in different groups of bars?
+ - **amounts (bars):** How large are the bars relative to one another? Are there patterns in different groups of bars?
4) Summarize your findings, and use them to motivate whatever you will discuss next.
Below are two examples of how one might take these four steps in describing the example visualizations that appeared earlier in this chapter.
Each of the steps is denoted by its numeral in parentheses, e.g. (3).
-**Mauna Loa Atmospheric CO$_{\text{2}}$ Measurements:** (1) \index{Mauna Loa} Many
+**Mauna Loa Atmospheric CO$_{\text{2}}$ Measurements:** (1) \index{Mauna Loa} Many
current forms of energy generation and conversion—from automotive
engines to natural gas power plants—rely on burning fossil fuels and produce
greenhouse gases, typically primarily carbon dioxide (CO$_{\text{2}}$), as a
@@ -1409,7 +1409,7 @@ such as file size/type limitations (e.g., if you are submitting your
visualization as part of a conference paper or to a poster printing shop) and
where it will be displayed (e.g., online, in a paper, on a poster, on a
billboard, in talk slides). Generally speaking, images come in two flavors:
-*raster* \index{bitmap|see{raster graphics}}\index{raster graphics} formats
+*raster* \index{bitmap|see{raster graphics}}\index{raster graphics} formats
and *vector* \index{vector graphics} formats.
**Raster** images are represented as a 2-D grid of square pixels, each
@@ -1420,20 +1420,20 @@ is not noticeable. *Lossless* formats, on the other hand, allow a perfect
display of the original image.
\index{raster graphics!file types}
-- *Common file types:*
- - [JPEG](https://en.wikipedia.org/wiki/JPEG) (`.jpg`, `.jpeg`): lossy, usually used for photographs
+- *Common file types:*
+ - [JPEG](https://en.wikipedia.org/wiki/JPEG) (`.jpg`, `.jpeg`): lossy, usually used for photographs
- [PNG](https://en.wikipedia.org/wiki/Portable_Network_Graphics) (`.png`): lossless, usually used for plots / line drawings
- [BMP](https://en.wikipedia.org/wiki/BMP_file_format) (`.bmp`): lossless, raw image data, no compression (rarely used)
- [TIFF](https://en.wikipedia.org/wiki/TIFF) (`.tif`, `.tiff`): typically lossless, no compression, used mostly in graphic arts, publishing
- *Open-source software:* [GIMP](https://www.gimp.org/)
-**Vector** images are represented as a collection of mathematical
-objects (lines, surfaces, shapes, curves). When the computer displays the image, it
+**Vector** images are represented as a collection of mathematical
+objects (lines, surfaces, shapes, curves). When the computer displays the image, it
redraws all of the elements using their mathematical formulas.
\index{vector graphics!file types}
-- *Common file types:*
- - [SVG](https://en.wikipedia.org/wiki/Scalable_Vector_Graphics) (`.svg`): general-purpose use
+- *Common file types:*
+ - [SVG](https://en.wikipedia.org/wiki/Scalable_Vector_Graphics) (`.svg`): general-purpose use
- [EPS](https://en.wikipedia.org/wiki/Encapsulated_PostScript) (`.eps`), general-purpose use (rarely used)
- *Open-source software:* [Inkscape](https://inkscape.org/)
@@ -1446,16 +1446,16 @@ computer has to draw all the elements each time it is displayed. For example,
if you have a scatter plot with 1 million points stored as an SVG file, it may
take your computer some time to open the image. On the other hand, you can zoom
into / scale up vector graphics as much as you like without the image looking
-bad, while raster images eventually start to look "pixelated."
+bad, while raster images eventually start to look "pixelated."
> **Note:** The portable document format [PDF](https://en.wikipedia.org/wiki/PDF) (`.pdf`) is commonly used to
> store *both* raster and vector formats. If you try to open a PDF and it's taking a long time
-> to load, it may be because there is a complicated vector graphics image that your computer is rendering.
+> to load, it may be because there is a complicated vector graphics image that your computer is rendering.
\index{PDF}
\index{portable document format|see{PDF}}
-Let's learn how to save plot images to these different file formats using a
-scatter plot of
+Let's learn how to save plot images to these different file formats using a
+scatter plot of
the [Old Faithful data set](https://www.stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat) [@faithfuldata],
shown in Figure \@ref(fig:03-plot-line).
@@ -1464,23 +1464,23 @@ library(svglite) # we need this to save SVG files
faithful_plot <- ggplot(data = faithful, aes(x = waiting, y = eruptions)) +
geom_point() +
labs(x = "Waiting time to next eruption \n (minutes)",
- y = "Eruption time \n (minutes)") +
+ y = "Eruption time \n (minutes)") +
theme(text = element_text(size = 12))
faithful_plot
```
Now that we have a named `ggplot` plot object, we can use the `ggsave` function
-to save a file containing this image.
-`ggsave` works by taking a file name to create for the image
-as its first argument.
-This can include the path to the directory where you would like to save the file
+to save a file containing this image.
+`ggsave` works by taking a file name to create for the image
+as its first argument.
+This can include the path to the directory where you would like to save the file
(e.g., `img/viz/filename.png` to save a file named `filename` to the `img/viz/` directory),
and the name of the plot object to save as its second argument.
The kind of image to save is specified by the file extension.
-For example,
+For example,
to create a PNG image file, we specify that the file extension is `.png`.
-Below we demonstrate how to save PNG, JPG, BMP, TIFF and SVG file types
+Below we demonstrate how to save PNG, JPG, BMP, TIFF and SVG file types
for the `faithful_plot`:
```r
@@ -1492,21 +1492,21 @@ ggsave("img/viz/faithful_plot.svg", faithful_plot)
```
```{r, filesizes, echo = FALSE}
-file_sizes <- tibble(`Image type` = c("Raster",
- "Raster",
- "Raster",
+file_sizes <- tibble(`Image type` = c("Raster",
+ "Raster",
+ "Raster",
"Raster",
"Vector"),
`File type` = c("PNG", "JPG", "BMP", "TIFF", "SVG"),
- `Image size` = c(paste(round(file.info("img/viz/faithful_plot.png")["size"]
+ `Image size` = c(paste(round(file.info("img/viz/faithful_plot.png")["size"]
/ 1000000, 2), "MB"),
- paste(round(file.info("img/viz/faithful_plot.jpg")["size"]
+ paste(round(file.info("img/viz/faithful_plot.jpg")["size"]
/ 1000000, 2), "MB"),
- paste(round(file.info("img/viz/faithful_plot.bmp")["size"]
+ paste(round(file.info("img/viz/faithful_plot.bmp")["size"]
/ 1000000, 2), "MB"),
- paste(round(file.info("img/viz/faithful_plot.tiff")["size"]
+ paste(round(file.info("img/viz/faithful_plot.tiff")["size"]
/ 1000000, 2), "MB"),
- paste(round(file.info("img/viz/faithful_plot.svg")["size"]
+ paste(round(file.info("img/viz/faithful_plot.svg")["size"]
/ 1000000, 2), "MB")))
kable(file_sizes, booktabs = TRUE,
caption = "File sizes of the scatter plot of the Old Faithful data set when saved as different file formats.") |>
@@ -1518,7 +1518,7 @@ Wow, that's quite a difference! Notice that for such a simple plot with few
graphical elements (points), the vector graphics format (SVG) is over 100 times
smaller than the uncompressed raster images (BMP, TIFF). Also, note that the
JPG format is twice as large as the PNG format since the JPG compression
-algorithm is designed for natural images (not plots).
+algorithm is designed for natural images (not plots).
In Figure \@ref(fig:03-raster-image), we also show what
the images look like when we zoom in to a rectangle with only 2 data points.
@@ -1535,8 +1535,8 @@ knitr::include_graphics("img/viz/png-vs-svg.png")
## Exercises
-Practice exercises for the material covered in this chapter
-can be found in the accompanying
+Practice exercises for the material covered in this chapter
+can be found in the accompanying
[worksheets repository](https://worksheets.datasciencebook.ca)
in the "Effective data visualization" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
@@ -1563,7 +1563,7 @@ and guidance that the worksheets provide will function as intended.
look if you want to learn how to make more intricate visualizations in
`ggplot2` than what is included in this chapter.
- The [`theme` function documentation](https://ggplot2.tidyverse.org/reference/theme.html)
- is an excellent reference to see how you can fine tune the non-data aspects
+ is an excellent reference to see how you can fine tune the non-data aspects
of your visualization.
- *R for Data Science* [@wickham2016r] has a chapter on [dates and
times](https://r4ds.had.co.nz/dates-and-times.html). This chapter is where
diff --git a/source/wrangling.Rmd b/source/wrangling.Rmd
index 08cb282ec..d88edebf5 100644
--- a/source/wrangling.Rmd
+++ b/source/wrangling.Rmd
@@ -21,40 +21,35 @@ application, providing more practice working through a whole case study.
By the end of the chapter, readers will be able to do the following:
- - Define the term "tidy data".
- - Discuss the advantages of storing data in a tidy data format.
- - Define what vectors, lists, and data frames are in R, and describe how they relate to
- each other.
- - Describe the common types of data in R and their uses.
- - Recall and use the following functions for their
- intended data wrangling tasks:
- - `across`
- - `c`
- - `filter`
- - `group_by`
- - `select`
- - `map`
- - `mutate`
- - `pull`
- - `pivot_longer`
- - `pivot_wider`
- - `rowwise`
- - `separate`
- - `summarize`
- - Recall and use the following operators for their
- intended data wrangling tasks:
- - `==`
- - `%in%`
- - `!`
- - `&`
- - `|`
- - `|>` and `%>%`
+- Define the term "tidy data".
+- Discuss the advantages of storing data in a tidy data format.
+- Define what vectors, lists, and data frames are in R, and describe how they relate to
+ each other.
+- Describe the common types of data in R and their uses.
+- Use the following functions for their intended data wrangling tasks:
+ - `c`
+ - `pivot_longer`
+ - `pivot_wider`
+ - `separate`
+ - `select`
+ - `filter`
+ - `mutate`
+ - `summarize`
+ - `map`
+ - `group_by`
+ - `across`
+ - `rowwise`
+- Use the following operators for their intended data wrangling tasks:
+ - `==`, `!=`, `<`, `<=`, `>`, and `>=`
+ - `%in%`
+ - `!`, `&`, and `|`
+ - `|>` and `%>%`
## Data frames, vectors, and lists
In Chapters \@ref(intro) and \@ref(reading), *data frames* were the focus:
we learned how to import data into R as a data frame, and perform basic operations on data frames in R.
-In the remainder of this book, this pattern continues. The vast majority of tools we use will require
+In the remainder of this book, this pattern continues. The vast majority of tools we use will require
that data are represented as a data frame in R. Therefore, in this section,
we will dig more deeply into what data frames are and how they are represented in R.
This knowledge will be helpful in effectively utilizing these objects in our data analyses.
@@ -103,7 +98,7 @@ elements are ordered, and they must all be of the same **data type**;
R has several different basic data types, as shown in Table \@ref(tab:datatype-table).
Figure \@ref(fig:02-vector) provides an example of a vector where all of the elements are
of character type.
-You can create vectors in R using the `c` function \index{c function} (`c` stands for "concatenate"). For
+You can create vectors in R using the `c` function \index{c function} (`c` stands for "concatenate"). For
example, to create the vector `region` as shown in Figure
\@ref(fig:02-vector), you would write:
@@ -115,11 +110,11 @@ year
> **Note:** Technically, these objects are called "atomic vectors." In this book
> we have chosen to call them "vectors," which is how they are most commonly
> referred to in the R community. To be totally precise, "vector" is an umbrella term that
-> encompasses both atomic vector and list objects in R. But this creates a
-> confusing situation where the term "vector" could
-> mean "atomic vector" *or* "the umbrella term for atomic vector and list,"
-> depending on context. Very confusing indeed! So to keep things simple, in
-> this book we *always* use the term "vector" to refer to "atomic vector."
+> encompasses both atomic vector and list objects in R. But this creates a
+> confusing situation where the term "vector" could
+> mean "atomic vector" *or* "the umbrella term for atomic vector and list,"
+> depending on context. Very confusing indeed! So to keep things simple, in
+> this book we *always* use the term "vector" to refer to "atomic vector."
> We encourage readers who are enthusiastic to learn more to read the
> Vectors chapter of *Advanced R* [@wickham2019advanced].
@@ -146,8 +141,8 @@ Table: (#tab:datatype-table) Basic data types in R
\index{double}\index{dbl|see{double}}
\index{logical}\index{lgl|see{logical}}
\index{factor}\index{fct|see{factor}}
-It is important in R to make sure you represent your data with the correct type.
-Many of the `tidyverse` functions we use in this book treat
+It is important in R to make sure you represent your data with the correct type.
+Many of the `tidyverse` functions we use in this book treat
the various data types differently. You should use integers and double types
(which both fall under the "numeric" umbrella type) to represent numbers and perform
arithmetic. Doubles are more common than integers in R, though; for instance, a double data type is the
@@ -167,7 +162,7 @@ Lists \index{list} are also objects in R that have multiple, ordered elements.
Vectors and lists differ by the requirement of element type
consistency. All elements within a single vector must be of the same type (e.g.,
all elements are characters), whereas elements within a single list can be of
-different types (e.g., characters, integers, logicals, and even other lists). See Figure \@ref(fig:02-vec-vs-list).
+different types (e.g., characters, integers, logicals, and even other lists). See Figure \@ref(fig:02-vec-vs-list).
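A quick sketch in the console makes the difference concrete; the values here are arbitrary.

```r
# A vector coerces everything to one type (here, character),
# while a list keeps each element's original type.
v <- c("Toronto", 2016, TRUE)
l <- list("Toronto", 2016, TRUE)

typeof(v)  # "character"
typeof(l)  # "list"
```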
``` {r 02-vec-vs-list, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "A vector versus a list.", fig.retina = 2, out.width = "100%"}
image_read("img/wrangling/data_frame_slides_cdn.008.jpeg") %>%
@@ -178,16 +173,16 @@ image_read("img/wrangling/data_frame_slides_cdn.008.jpeg") %>%
A data frame \index{data frame!definition} is really a special kind of list that follows two rules:
-1. Each element itself must either be a vector or a list.
+1. Each element itself must either be a vector or a list.
2. Each element (vector or list) must have the same length.
-Not all columns in a data frame need to be of the same type.
+Not all columns in a data frame need to be of the same type.
Figure \@ref(fig:02-dataframe) shows a data frame where
the columns are vectors of different types.
-But remember: because the columns in this example are *vectors*,
-the elements must be the same data type *within each column.*
+But remember: because the columns in this example are *vectors*,
+the elements must be the same data type *within each column.*
On the other hand, if our data frame had *list* columns, there would be no such requirement.
-It is generally much more common to use *vector* columns, though,
+It is generally much more common to use *vector* columns, though,
as the values for a single variable are usually all of the same type.
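As a small sketch (with made-up values), here is a data frame whose columns are vectors of three different types; within each column every element has the same type, and all columns have the same length.

```r
library(tibble)

cities <- tibble(
  region     = c("Toronto", "Montréal", "Vancouver"),  # character
  population = c(100, 200, 300),                       # double (made-up values)
  coastal    = c(FALSE, FALSE, TRUE)                   # logical
)
cities
```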
``` {r 02-dataframe, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Data frame and vector types.", fig.retina = 2, out.width = "100%"}
@@ -195,13 +190,13 @@ image_read("img/wrangling/data_frame_slides_cdn.009.jpeg") %>%
image_crop("3632x700")
```
-Data frames are actually included in R itself, without the need for any additional packages. However, the
-`tidyverse` functions that we use
+Data frames are actually included in R itself, without the need for any additional packages. However, the
+`tidyverse` functions that we use
throughout this book all work with a special kind of data frame called a *tibble*. Tibbles have some additional \index{tibble}
features and benefits over built-in data frames in R. These include the
ability to add useful attributes (such as grouping, which we will discuss later)
-and more predictable type preservation when subsetting.
-Because a tibble is just a data frame with some added features,
+and more predictable type preservation when subsetting.
+Because a tibble is just a data frame with some added features,
we will collectively refer to both built-in R data frames and
tibbles as *data frames* in this book.
@@ -219,7 +214,7 @@ tibbles as *data frames* in this book.
Vectors, data frames and lists are basic types of *data structure* in R, which
are core to most data analyses. We summarize them in Table
-\@ref(tab:datastructure-table). There are several other data structures in the R programming
+\@ref(tab:datastructure-table). There are several other data structures in the R programming
language (*e.g.,* matrices), but these are beyond the scope of this book.
Table: (#tab:datastructure-table) Basic data structures in R
@@ -228,13 +223,13 @@ Table: (#tab:datastructure-table) Basic data structures in R
| --- |------------ |
| vector | An ordered collection of one, or more, values of the *same data type*. |
| list | An ordered collection of one, or more, values of *possibly different data types*. |
-| data frame | A list of either vectors or lists of the *same length*, with column names. We typically use a data frame to represent a data set. |
+| data frame | A list of either vectors or lists of the *same length*, with column names. We typically use a data frame to represent a data set. |
## Tidy data
There are many ways a tabular data set can be organized. This chapter will focus
on introducing the **tidy data** \index{tidy data!definition} format of organization and how to make your raw
-(and likely messy) data tidy. A tidy data frame satisfies
+(and likely messy) data tidy. A tidy data frame satisfies
the following three criteria [@wickham2014tidy]:
- each row is a single observation,
@@ -242,8 +237,8 @@ the following three criteria [@wickham2014tidy]:
- each value is a single cell (i.e., its entry in the data
frame is not shared with another value).
-Figure \@ref(fig:02-tidy-image) demonstrates a tidy data set that satisfies these
-three criteria.
+Figure \@ref(fig:02-tidy-image) demonstrates a tidy data set that satisfies these
+three criteria.
``` {r 02-tidy-image, echo = FALSE, message = FALSE, warning = FALSE, fig.align = "center", fig.cap = "Tidy data satisfies three criteria.", fig.retina = 2, out.width = "80%"}
image_read("img/wrangling/tidy_data.001.jpeg") |>
@@ -252,7 +247,7 @@ image_read("img/wrangling/tidy_data.001.jpeg") |>
There are many good reasons for making sure your data are tidy as a first step in your analysis.
The most important is that it is a single, consistent format that nearly every function
-in the `tidyverse` recognizes. No matter what the variables and observations
+in the `tidyverse` recognizes. No matter what the variables and observations
in your data represent, as long as the data frame \index{tidy data!arguments for}
is tidy, you can manipulate it, plot it, and analyze it using the same tools.
If your data is *not* tidy, you will have to write special bespoke code
@@ -275,17 +270,17 @@ below!
### Tidying up: going from wide to long using `pivot_longer`
One task that is commonly performed to get data into a tidy format \index{pivot\_longer}
-is to combine values that are stored in separate columns,
+is to combine values that are stored in separate columns,
but are really part of the same variable, into one.
-Data is often stored this way
-because this format is sometimes more intuitive for human readability
+Data is often stored this way
+because this format is sometimes more intuitive for human readability
and understanding, and humans create data sets.
-In Figure \@ref(fig:02-wide-to-long),
-the table on the left is in an untidy, "wide" format because the year values
-(2006, 2011, 2016) are stored as column names.
-And as a consequence,
-the values for population for the various cities
-over these years are also split across several columns.
+In Figure \@ref(fig:02-wide-to-long),
+the table on the left is in an untidy, "wide" format because the year values
+(2006, 2011, 2016) are stored as column names.
+And as a consequence,
+the values for population for the various cities
+over these years are also split across several columns.
For humans, this table is easy to read, which is why you will often find data
stored in this wide format. However, this format is difficult to work with
@@ -301,9 +296,9 @@ greatly simplified once the data is tidied.
Another problem with data in this format is that we don't know what the
numbers under each year actually represent. Do those numbers represent
-population size? Land area? It's not clear.
-To solve both of these problems,
-we can reshape this data set to a tidy data format
+population size? Land area? It's not clear.
+To solve both of these problems,
+we can reshape this data set to a tidy data format
by creating a column called "year" and a column called
"population." This transformation—which makes the data
"longer"—is shown as the right table in
@@ -314,15 +309,15 @@ knitr::include_graphics("img/wrangling/pivot_functions.001.jpeg")
```
We can achieve this effect in R using the `pivot_longer` function from the `tidyverse` package.
-The `pivot_longer` function combines columns,
-and is usually used during tidying data
-when we need to make the data frame longer and narrower.
+The `pivot_longer` function combines columns,
+and is usually used when tidying data
+when we need to make the data frame longer and narrower.
To learn how to use `pivot_longer`, we will work through an example with the
`region_lang_top5_cities_wide.csv` data set. This data set contains the
-counts of how many Canadians cited each language as their mother tongue for five
+counts of how many Canadians cited each language as their mother tongue for five
major Canadian cities (Toronto, Montréal, Vancouver, Calgary and Edmonton) from
the 2016 Canadian census. \index{Canadian languages}
-To get started,
+To get started,
we will load the `tidyverse` package and use `read_csv` to load the (untidy) data.
``` {r 02-tidyverse, warning=FALSE, message=FALSE}
@@ -332,26 +327,26 @@ lang_wide <- read_csv("data/region_lang_top5_cities_wide.csv")
lang_wide
```
-What is wrong with the untidy format above?
-The table on the left in Figure \@ref(fig:img-pivot-longer-with-table)
+What is wrong with the untidy format above?
+The table on the left in Figure \@ref(fig:img-pivot-longer-with-table)
represents the data in the "wide" (messy) format.
-From a data analysis perspective, this format is not ideal because the values of
-the variable *region* (Toronto, Montréal, Vancouver, Calgary and Edmonton)
+From a data analysis perspective, this format is not ideal because the values of
+the variable *region* (Toronto, Montréal, Vancouver, Calgary and Edmonton)
are stored as column names. Thus they
are not easily accessible to the data analysis functions we will apply
to our data set. Additionally, the *mother tongue* variable values are
spread across multiple columns, which will prevent us from doing any desired
visualization or statistical tasks until we combine them into one column. For
-instance, suppose we want to know the languages with the highest number of
+instance, suppose we want to know the language with the highest number of
Canadians reporting it as their mother tongue among all five regions. This
-question would be tough to answer with the data in its current format.
-We *could* find the answer with the data in this format,
+question would be tough to answer with the data in its current format.
+We *could* find the answer with the data in this format,
though it would be much easier to answer if we tidy our
-data first. If mother tongue were instead stored as one column,
-as shown in the tidy data on the right in
+data first. If mother tongue were instead stored as one column,
+as shown in the tidy data on the right in
Figure \@ref(fig:img-pivot-longer-with-table),
we could simply use the `max` function in one line of code
-to get the maximum value.
+to get the maximum value.
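To make that last point concrete, here is a minimal one-line sketch, assuming the tidied data frame ends up being called `lang_mother_tidy` with its counts stored in a column named `mother_tongue` (both names are assumptions at this point, so the chunk is not evaluated):

``` {r, eval = FALSE}
# the largest mother tongue count across all languages and regions,
# assuming a tidied data frame `lang_mother_tidy` with a numeric
# `mother_tongue` column
max(lang_mother_tidy$mother_tongue)
```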
(ref:img-pivot-longer-with-table) Going from wide to long with the `pivot_longer` function.
@@ -359,7 +354,7 @@ to get the maximum value.
knitr::include_graphics("img/wrangling/pivot_functions.003.jpeg")
```
-Figure \@ref(fig:img-pivot-longer) details the arguments that we need to specify
+Figure \@ref(fig:img-pivot-longer) details the arguments that we need to specify
in the `pivot_longer` function to accomplish this data transformation.
(ref:img-pivot-longer) Syntax for the `pivot_longer` function.
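As a rough preview before we walk through the syntax in detail, the call will look something like the sketch below. This is a sketch only: it assumes the city columns sit next to each other from `Toronto` through `Edmonton` and that we name the new columns `region` and `mother_tongue`, so the chunk is not evaluated.

``` {r, eval = FALSE}
# combine the five city columns into a `region` column (their names) and
# a `mother_tongue` column (their values); the column range and the new
# column names are assumptions for illustration
lang_mother_tidy <- pivot_longer(lang_wide,
  cols = Toronto:Edmonton,
  names_to = "region",
  values_to = "mother_tongue"
)
lang_mother_tidy
```

Note how the call is split across several lines, with each argument ending in a comma; this style is discussed in the note below.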
@@ -389,9 +384,9 @@ lang_mother_tidy
> **Note**: In the code above, the call to the
> `pivot_longer` function is split across several lines. This is allowed in
-> certain cases; for example, when calling a function as above, as long as the
+> certain cases; for example, when calling a function as above, as long as the
> line ends with a comma `,` R knows to keep reading on the next line.
-> Splitting long lines like this across multiple lines is encouraged
+> Splitting long lines like this across multiple lines is encouraged
> as it helps significantly with code readability. Generally speaking, you should
> limit each line of code to about 80 characters.
@@ -409,18 +404,18 @@ been met:
Suppose we have observations spread across multiple rows rather than in a single \index{pivot\_wider}
row. For example, in Figure \@ref(fig:long-to-wide), the table on the left is in an
untidy, long format because the `count` column contains three variables
-(population, commuter count, and year the city was incorporated)
-and information about each observation
-(here, population, commuter, and incorporated values for a region) is split across three rows.
-Remember: one of the criteria for tidy data
+(population, commuter count, and year the city was incorporated)
+and information about each observation
+(here, population, commuter, and incorporated values for a region) is split across three rows.
+Remember: one of the criteria for tidy data
is that each observation must be in a single row.
Using data in this format—where two or more variables are mixed together
in a single column—makes it harder to apply many usual `tidyverse` functions.
-For example, finding the maximum number of commuters
+For example, finding the maximum number of commuters
would require an additional step of filtering for the commuter values
before the maximum can be computed.
-In comparison, if the data were tidy,
+In comparison, if the data were tidy,
all we would have to do is compute the maximum value for the commuter column.
To reshape this untidy data set to a tidy (and in this case, wider) format,
we need to create columns called "population", "commuters", and "incorporated."
@@ -431,12 +426,12 @@ knitr::include_graphics("img/wrangling/pivot_functions.002.jpeg")
```
To tidy this type of data in R, we can use the `pivot_wider` function.
-The `pivot_wider` function generally increases the number of columns (widens)
-and decreases the number of rows in a data set.
-To learn how to use `pivot_wider`,
-we will work through an example
-with the `region_lang_top5_cities_long.csv` data set.
-This data set contains the number of Canadians reporting
+The `pivot_wider` function generally increases the number of columns (widens)
+and decreases the number of rows in a data set.
+To learn how to use `pivot_wider`,
+we will work through an example
+with the `region_lang_top5_cities_long.csv` data set.
+This data set contains the number of Canadians reporting
the primary language at home and work for five
major cities (Toronto, Montréal, Vancouver, Calgary and Edmonton).
@@ -445,14 +440,14 @@ lang_long <- read_csv("data/region_lang_top5_cities_long.csv")
lang_long
```
-What makes the data set shown above untidy?
-In this example, each observation is a language in a region.
-However, each observation is split across multiple rows:
-one where the count for `most_at_home` is recorded,
-and the other where the count for `most_at_work` is recorded.
-Suppose the goal with this data was to
+What makes the data set shown above untidy?
+In this example, each observation is a language in a region.
+However, each observation is split across multiple rows:
+one where the count for `most_at_home` is recorded,
+and the other where the count for `most_at_work` is recorded.
+Suppose the goal with this data was to
visualize the relationship between the number of
-Canadians reporting their primary language at home and work.
+Canadians reporting their primary language at home and work.
Doing that would be difficult with this data in its current form,
since these two variables are stored in the same column.
Figure \@ref(fig:img-pivot-wider-table) shows how this data
@@ -464,7 +459,7 @@ will be tidied using the `pivot_wider` function.
knitr::include_graphics("img/wrangling/pivot_functions.004.jpeg")
```
-Figure \@ref(fig:img-pivot-wider) details the arguments that we need to specify
+Figure \@ref(fig:img-pivot-wider) details the arguments that we need to specify
in the `pivot_wider` function.
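A sketch of what this call might look like is given below. The names here are assumptions made for illustration: we assume the column holding the `most_at_home` / `most_at_work` labels is called `type`, the column holding the counts is called `count`, and we store the result as `lang_home_tidy`; because of these assumptions the chunk is not evaluated.

``` {r, eval = FALSE}
# spread the assumed `type` column into separate `most_at_home` and
# `most_at_work` columns, filling them with the corresponding `count` values
lang_home_tidy <- pivot_wider(lang_long,
  names_from = type,
  values_from = count
)
lang_home_tidy
```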
@@ -518,10 +513,10 @@ lang_messy
```
First we’ll use `pivot_longer` to create two columns, `region` and `value`,
-similar to what we did previously.
+similar to what we did previously.
The new `region` column will contain the region names,
-and the new column `value` will be a temporary holding place for the
-data that we need to further separate, i.e., the
+and the new column `value` will be a temporary holding place for the
+data that we need to further separate, i.e., the
number of Canadians reporting their primary language at home and work.
``` {r}
@@ -534,12 +529,12 @@ lang_messy_longer <- pivot_longer(lang_messy,
lang_messy_longer
```
-Next we'll use `separate` to split the `value` column into two columns.
-One column will contain only the counts of Canadians
-that speak each language most at home,
-and the other will contain the counts of Canadians
-that speak each language most at work for each region.
-Figure \@ref(fig:img-separate)
+Next we'll use `separate` to split the `value` column into two columns.
+One column will contain only the counts of Canadians
+that speak each language most at home,
+and the other will contain the counts of Canadians
+that speak each language most at work for each region.
+Figure \@ref(fig:img-separate)
outlines what we need to specify to use `separate`.
(ref:img-separate) Syntax for the `separate` function.
@@ -580,19 +575,19 @@ messy data set. R read these columns in as character types, and by default,
It makes sense for `region`, `category`, and `language` to be stored as a
character (or perhaps factor) type. However, suppose we want to apply any functions that treat the
-`most_at_home` and `most_at_work` columns as a number (e.g., finding rows
-above a numeric threshold of a column).
-In that case,
-it won't be possible to do if the variable is stored as a `character`.
+`most_at_home` and `most_at_work` columns as a number (e.g., finding rows
+above a numeric threshold of a column).
+In that case,
+that won't be possible if the variable is stored as a `character`.
Fortunately, the `separate` function provides a natural way to fix problems
-like this: we can set `convert = TRUE` to convert the `most_at_home`
+like this: we can set `convert = TRUE` to convert the `most_at_home`
and `most_at_work` columns to the correct data type.
``` {r}
tidy_lang <- separate(lang_messy_longer,
col = value,
into = c("most_at_home", "most_at_work"),
- sep = "/",
+ sep = "/",
convert = TRUE
)
@@ -605,19 +600,19 @@ indicating they are integer data types (i.e., numbers)!
## Using `select` to extract a range of columns
Now that the `tidy_lang` data is indeed *tidy*, we can start manipulating it \index{select!helpers}
-using the powerful suite of functions from the `tidyverse`.
-For the first example, recall the `select` function from Chapter \@ref(intro),
-which lets us create a subset of columns from a data frame.
+using the powerful suite of functions from the `tidyverse`.
+For the first example, recall the `select` function from Chapter \@ref(intro),
+which lets us create a subset of columns from a data frame.
Suppose we wanted to select only the columns `language`, `region`,
`most_at_home` and `most_at_work` from the `tidy_lang` data set. Using what we
learned in Chapter \@ref(intro), we would pass the `tidy_lang` data frame as
well as all of these column names into the `select` function:
``` {r}
-selected_columns <- select(tidy_lang,
- language,
- region,
- most_at_home,
+selected_columns <- select(tidy_lang,
+ language,
+ region,
+ most_at_home,
most_at_work)
selected_columns
```
@@ -636,8 +631,8 @@ column_range <- select(tidy_lang, language:most_at_work)
column_range
```
-Notice that we get the same output as we did above,
-but with less (and clearer!) code. This type of operator
+Notice that we get the same output as we did above,
+but with less (and clearer!) code. This type of operator
is especially handy for large data sets.
Suppose instead we wanted to extract columns that followed a particular pattern
@@ -659,14 +654,14 @@ select(tidy_lang, contains("_"))
```
There are many different `select` helpers that select
-variables based on certain criteria.
-The additional resources section at the end of this chapter
+variables based on certain criteria.
+The additional resources section at the end of this chapter
provides a comprehensive resource on `select` helpers.
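For instance, `starts_with` is another helper that selects every column whose name begins with a given string; in `tidy_lang` this picks out the two count columns.

``` {r}
# select all columns whose names begin with "most"
# (here, most_at_home and most_at_work)
select(tidy_lang, starts_with("most"))
```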
## Using `filter` to extract rows
-Next, we revisit the `filter` function from Chapter \@ref(intro),
-which lets us create a subset of rows from a data frame.
+Next, we revisit the `filter` function from Chapter \@ref(intro),
+which lets us create a subset of rows from a data frame.
Recall the two main arguments to the `filter` function:
the first is the name of the data frame object, and \index{filter!logical statements}
the second is a *logical statement* to use when filtering the rows.
@@ -678,12 +673,12 @@ one can use in the `filter` function to select subsets of rows.
### Extracting rows that have a certain value with `==`
Suppose we are only interested in the subset of rows in `tidy_lang` corresponding to the
official languages of Canada (English and French).
-We can `filter` for these rows by using the *equivalency operator* (`==`)
-to compare the values of the `category` column
-with the value `"Official languages"`.
-With these arguments, `filter` returns a data frame with all the columns
-of the input data frame
-but only the rows we asked for in the logical statement, i.e.,
+We can `filter` for these rows by using the *equivalency operator* (`==`)
+to compare the values of the `category` column
+with the value `"Official languages"`.
+With these arguments, `filter` returns a data frame with all the columns
+of the input data frame
+but only the rows we asked for in the logical statement, i.e.,
those where the `category` column holds the value `"Official languages"`.
We name this data frame `official_langs`.
@@ -695,7 +690,7 @@ official_langs
### Extracting rows that do not have a certain value with `!=`
What if we want all the other language categories in the data set *except* for
-those in the `"Official languages"` category? We can accomplish this with the `!=`
+those in the `"Official languages"` category? We can accomplish this with the `!=`
operator, which means "not equal to". So if we want to find all the rows
where the `category` does *not* equal `"Official languages"` we write the code
below.
@@ -706,21 +701,21 @@ filter(tidy_lang, category != "Official languages")
### Extracting rows satisfying multiple conditions using `,` or `&` {#filter-and}
-Suppose now we want to look at only the rows
-for the French language in Montréal.
-To do this, we need to filter the data set
-to find rows that satisfy multiple conditions simultaneously.
-We can do this with the comma symbol (`,`), which in the case of `filter`
-is interpreted by R as "and".
-We write the code as shown below to filter the `official_langs` data frame
-to subset the rows where `region == "Montréal"`
+Suppose now we want to look at only the rows
+for the French language in Montréal.
+To do this, we need to filter the data set
+to find rows that satisfy multiple conditions simultaneously.
+We can do this with the comma symbol (`,`), which in the case of `filter`
+is interpreted by R as "and".
+We write the code as shown below to filter the `official_langs` data frame
+to subset the rows where `region == "Montréal"`
*and* the `language == "French"`.
``` {r}
filter(official_langs, region == "Montréal", language == "French")
```
-We can also use the ampersand (`&`) logical operator, which gives
+We can also use the ampersand (`&`) logical operator, which gives
us cases where *both* one condition *and* another condition
are satisfied. You can use either comma (`,`) or ampersand (`&`) in the `filter`
function interchangeably.
@@ -733,12 +728,12 @@ filter(official_langs, region == "Montréal" & language == "French")
### Extracting rows satisfying at least one condition using `|`
Suppose we were interested in only those rows corresponding to cities in Alberta
-in the `official_langs` data set (Edmonton and Calgary).
+in the `official_langs` data set (Edmonton and Calgary).
We can't use `,` as we did above because `region`
-cannot be both Edmonton *and* Calgary simultaneously.
-Instead, we can use the vertical pipe (`|`) logical operator,
-which gives us the cases where one condition *or*
-another condition *or* both are satisfied.
+cannot be both Edmonton *and* Calgary simultaneously.
+Instead, we can use the vertical pipe (`|`) logical operator,
+which gives us the cases where one condition *or*
+another condition *or* both are satisfied.
In the code below, we ask R to return the rows
where the `region` column is equal to "Calgary" *or* "Edmonton".
@@ -748,10 +743,10 @@ filter(official_langs, region == "Calgary" | region == "Edmonton")
### Extracting rows with values in a vector using `%in%`
-Next, suppose we want to see the populations of our five cities.
-Let's read in the `region_data.csv` file
-that comes from the 2016 Canadian census,
-as it contains statistics for number of households, land area, population
+Next, suppose we want to see the populations of our five cities.
+Let's read in the `region_data.csv` file
+that comes from the 2016 Canadian census,
+as it contains statistics for number of households, land area, population
and number of dwellings for different regions.
```{r, include = FALSE}
@@ -763,16 +758,16 @@ region_data <- read_csv("data/region_data.csv")
region_data
```
-To get the population of the five cities
-we can filter the data set using the `%in%` operator.
-The `%in%` operator is used to see if an element belongs to a vector.
+To get the population of the five cities
+we can filter the data set using the `%in%` operator.
+The `%in%` operator is used to see if an element belongs to a vector.
Here we are filtering for rows where the value in the `region` column
matches any of the five cities we are interested in: Toronto, Montréal,
Vancouver, Calgary, and Edmonton.
``` {r}
city_names <- c("Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton")
-five_cities <- filter(region_data,
+five_cities <- filter(region_data,
region %in% city_names)
five_cities
```
@@ -786,7 +781,7 @@ five_cities
> elements in `vectorB`. Then the second element of `vectorA` is compared
> to all the elements in `vectorB`, and so on. Notice the difference between `==` and
> `%in%` in the example below.
->
+>
>``` {r}
>c("Vancouver", "Toronto") == c("Toronto", "Vancouver")
>c("Vancouver", "Toronto") %in% c("Toronto", "Vancouver")
@@ -799,13 +794,13 @@ census_popn <- 35151728
most_french <- 2669195
```
-We saw in Section \@ref(filter-and) that
-`r format(most_french, scientific = FALSE, big.mark = ",")` people reported
-speaking French in Montréal as their primary language at home.
-If we are interested in finding the official languages in regions
-with higher numbers of people who speak it as their primary language at home
-compared to French in Montréal, then we can use `filter` to obtain rows
-where the value of `most_at_home` is greater than
+We saw in Section \@ref(filter-and) that
+`r format(most_french, scientific = FALSE, big.mark = ",")` people reported
+speaking French in Montréal as their primary language at home.
+If we are interested in finding the official languages in regions
+with more people speaking them as their primary language at home
+than speak French in Montréal, then we can use `filter` to obtain rows
+where the value of `most_at_home` is greater than
`r format(most_french, scientific = FALSE, big.mark = ",")`.
We use the `>` symbol to look for values *above* a threshold, and the `<` symbol
to look for values *below* a threshold. The `>=` and `<=` symbols similarly look
@@ -815,27 +810,27 @@ for *equal to or above* a threshold and *equal to or below* a threshold.
filter(official_langs, most_at_home > 2669195)
```
-`filter` returns a data frame with only one row, indicating that when
-considering the official languages,
-only English in Toronto is reported by more people
-as their primary language at home
+`filter` returns a data frame with only one row, indicating that when
+considering the official languages,
+only English in Toronto is reported by more people
+as their primary language at home
than French in Montréal according to the 2016 Canadian census.
## Using `mutate` to modify or add columns
### Using `mutate` to modify columns
-In Section \@ref(separate),
+In Section \@ref(separate),
when we first read in the `"region_lang_top5_cities_messy.csv"` data,
all of the variables were "character" data types. \index{mutate}
-During the tidying process,
-we used the `convert` argument from the `separate` function
-to convert the `most_at_home` and `most_at_work` columns
-to the desired integer (i.e., numeric class) data types.
+During the tidying process,
+we used the `convert` argument from the `separate` function
+to convert the `most_at_home` and `most_at_work` columns
+to the desired integer (i.e., numeric class) data types.
But suppose we didn't use the `convert` argument,
and needed to modify the column type some other way.
-Below we create such a situation
+Below we create such a situation
so that we can demonstrate how to use `mutate`
-to change the column types of a data frame.
+to change the column types of a data frame.
`mutate` is a useful function to modify or create new data frame columns.
``` {r warning=FALSE, message=FALSE}
@@ -846,26 +841,26 @@ lang_messy_longer <- pivot_longer(lang_messy,
values_to = "value")
tidy_lang_chr <- separate(lang_messy_longer, col = value,
into = c("most_at_home", "most_at_work"),
- sep = "/")
+ sep = "/")
official_langs_chr <- filter(tidy_lang_chr, category == "Official languages")
-official_langs_chr
+official_langs_chr
```
-To use `mutate`, again we first specify the data set in the first argument,
-and in the following arguments,
-we specify the name of the column we want to modify or create
+To use `mutate`, again we first specify the data set in the first argument,
+and in the following arguments,
+we specify the name of the column we want to modify or create
(here `most_at_home` and `most_at_work`), an `=` sign,
and then the function we want to apply (here `as.numeric`).
-In the function we want to apply,
-we refer directly to the column name upon which we want it to act
+In the function we want to apply,
+we refer directly to the column name upon which we want it to act
(here `most_at_home` and `most_at_work`).
In our example, we are naming the columns the same
-names as columns that already exist in the data frame
-("most\_at\_home", "most\_at\_work")
-and this will cause `mutate` to *overwrite* those columns
+names as columns that already exist in the data frame
+("most\_at\_home", "most\_at\_work")
+and this will cause `mutate` to *overwrite* those columns
(also referred to as modifying those columns *in-place*).
-If we were to give the columns a new name,
+If we were to give the columns a new name,
then `mutate` would create new columns with the names we specified.
`mutate`'s general syntax is detailed in Figure \@ref(fig:img-mutate).
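Putting this together for our example, the call might look like the sketch below. The name `official_langs_numeric` for the result is a hypothetical choice made for illustration, so the chunk is not evaluated.

``` {r, eval = FALSE}
# overwrite the two character columns with numeric versions of themselves;
# the result name `official_langs_numeric` is hypothetical
official_langs_numeric <- mutate(official_langs_chr,
  most_at_home = as.numeric(most_at_home),
  most_at_work = as.numeric(most_at_work)
)
official_langs_numeric
```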
@@ -896,7 +891,7 @@ indicating they are double data types (which is a numeric data type)!
``` {r , include = FALSE}
census_popn <- 35151728
-number_most_home <- filter(official_langs,
+number_most_home <- filter(official_langs,
language == "English" & region == "Toronto") |>
pull(most_at_home)
@@ -911,27 +906,27 @@ the 2016 Canadian census. What does this number mean to us? To understand this
number, we need context. In particular, how many people were in Toronto when
this data was collected? From the 2016 Canadian census profile, the population
of Toronto was reported to be
-`r format(toronto_popn, scientific = FALSE, big.mark = ",")` people.
-The number of people who report that English is their primary language at home
-is much more meaningful when we report it in this context.
-We can even go a step further and transform this count to a relative frequency
+`r format(toronto_popn, scientific = FALSE, big.mark = ",")` people.
+The number of people who report that English is their primary language at home
+is much more meaningful when we report it in this context.
+We can even go a step further and transform this count to a relative frequency
or proportion.
-We can do this by dividing the number of people reporting a given language
-as their primary language at home by the number of people who live in Toronto.
-For example,
-the proportion of people who reported that their primary language at home
+We can do this by dividing the number of people reporting a given language
+as their primary language at home by the number of people who live in Toronto.
+For example,
+the proportion of people who reported that their primary language at home
was English in the 2016 Canadian census was
`r format(round(number_most_home/toronto_popn, 2), scientific = FALSE, big.mark = ",")`
in Toronto.
-Let's use `mutate` to create a new column in our data frame
-that holds the proportion of people who speak English
-for our five cities of focus in this chapter.
-To accomplish this, we will need to do two tasks
+Let's use `mutate` to create a new column in our data frame
+that holds the proportion of people who speak English
+for our five cities of focus in this chapter.
+To accomplish this, we will need to do two tasks
beforehand:
1. Create a vector containing the population values for the cities.
-2. Filter the `official_langs` data frame
+2. Filter the `official_langs` data frame
so that we only keep the rows where the language is English.
To create a vector containing the population values for the five cities
@@ -943,7 +938,7 @@ city_pops <- c(5928040, 4098927, 2463431, 1392609, 1321426)
city_pops
```
-And next, we will filter the `official_langs` data frame
+And next, we will filter the `official_langs` data frame
so that we only keep the rows where the language is English.
We will name the new data frame we get from this `english_langs`:
@@ -952,14 +947,14 @@ english_langs <- filter(official_langs, language == "English")
english_langs
```
-Finally, we can use `mutate` to create a new column,
-named `most_at_home_proportion`, that will have value that corresponds to
+Finally, we can use `mutate` to create a new column,
+named `most_at_home_proportion`, whose values correspond to
the proportion of people reporting English as their primary
language at home.
-We will compute this by dividing the column by our vector of city populations.
+We will compute this by dividing the `most_at_home` column by our vector of city populations.
```{r, include = TRUE}
-english_langs <- mutate(english_langs,
+english_langs <- mutate(english_langs,
most_at_home_proportion = most_at_home / city_pops)
english_langs
@@ -967,14 +962,14 @@ english_langs
In the computation above, we had to ensure that we ordered the `city_pops` vector in the
same order as the cities were listed in the `english_langs` data frame.
-This is because R will perform the division computation we did by dividing
-each element of the `most_at_home` column by each element of the
+This is because R will perform the division computation we did by dividing
+each element of the `most_at_home` column by each element of the
`city_pops` vector, matching them up by position.
Failing to do this would have resulted in the incorrect math being performed.
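To see this element-by-element, position-matched behavior on its own, here is a tiny example with made-up numbers:

``` {r}
# each element of the first vector is divided by the element of the
# second vector in the same position
c(10, 20, 30) / c(2, 4, 5)
```

If the two vectors had been ordered differently, R would still carry out the division without complaint; it would simply produce numbers that do not mean what we intended.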
-> **Note:** In more advanced data wrangling,
-> one might solve this problem in a less error-prone way though using
-> a technique called "joins."
+> **Note:** In more advanced data wrangling,
+> one might solve this problem in a less error-prone way by using
+> a technique called "joins."
> We link to resources that discuss this in the additional
> resources at the end of this chapter.
@@ -982,21 +977,21 @@ Failing to do this would have resulted in the incorrect math being performed.