Merge pull request #444 from UBC-DSCI/dev
Update master with dev
trevorcampbell authored Jul 5, 2022
2 parents fec7a19 + 3dee232 commit 40741b6
Showing 17 changed files with 152 additions and 147 deletions.
14 changes: 8 additions & 6 deletions classification1.Rmd
@@ -170,6 +170,7 @@ total set of variables per image in this data set is:
11. Symmetry: how similar the nucleus is when mirrored
12. Fractal Dimension: a measurement of how "rough" the perimeter is

\pagebreak

Below we use `glimpse` \index{glimpse} to preview the data frame. This function can
make it easier to inspect the data when we have a lot of columns,
@@ -192,7 +193,7 @@ glimpse(cancer)
```

Recall that factors have what are called "levels", which you can think of as categories. We
can verify the levels of the `Class` column by using the `levels` \index{levels}\index{factor!levels} function.
can verify the levels of the `Class` column by using the `levels`\index{levels}\index{factor!levels} function.
This function should return the name of each category in that column. Given
that we only have two different values in our `Class` column (B for benign and M
for malignant), we only expect to get two names back. Note that the `levels` function requires a *vector* argument;
@@ -582,7 +583,6 @@ three predictors.
new_obs_Perimeter <- 0
new_obs_Concavity <- 3.5
new_obs_Symmetry <- 1
cancer |>
select(ID, Perimeter, Concavity, Symmetry, Class) |>
mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
@@ -846,8 +846,8 @@ loaded, and the standardized version of that same data. But first, we need to
standardize the `unscaled_cancer` data set with `tidymodels`.

In the `tidymodels` framework, all data preprocessing happens
using a `recipe` from [the `recipes` R package](https://recipes.tidymodels.org/) [@recipes]
Here we will initialize a recipe \index{recipe} \index{tidymodels!recipe|see{recipe}} for
using a `recipe` from [the `recipes` R package](https://recipes.tidymodels.org/) [@recipes].
Here we will initialize a recipe\index{recipe} \index{tidymodels!recipe|see{recipe}} for
the `unscaled_cancer` data above, specifying
that the `Class` variable is the target, and all other variables are predictors:
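
For illustration, a minimal sketch of what initializing such a recipe might look like (assuming `tidymodels` is loaded and `unscaled_cancer` is the data frame read in above; the object name `uc_recipe` is just a placeholder):

```r
# a sketch: Class is the target, and every other column is a predictor
uc_recipe <- recipe(Class ~ ., data = unscaled_cancer)
uc_recipe
```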

@@ -1296,7 +1296,7 @@ The `tidymodels` package collection also provides the `workflow`, a way to chain
To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data.
First we will load the data, create a model, and specify a recipe for how the data should be preprocessed:

```{r 05-workflow}
```{r 05-workflow, message = FALSE, warning = FALSE}
# load the unscaled cancer data
# and make sure the target Class variable is a factor
unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") |>
@@ -1320,7 +1320,7 @@ formula `Class ~ Area + Smoothness` (instead of `Class ~ .`) in the recipe.
You will also notice that we did not call `prep()` on the recipe; this is unnecessary when it is
placed in a workflow.

We will now place these steps in a `workflow` using the `add_recipe` and `add_model` functions, \index{tidymodels!add\_recipe}\index{tidymodels!add\_model}
We will now place these steps in a `workflow` using the `add_recipe` and `add_model` functions,\index{tidymodels!add\_recipe}\index{tidymodels!add\_model}
and finally we will use the `fit` function to run the whole workflow on the `unscaled_cancer` data.
Note another difference from earlier here: we do not include a formula in the `fit` function. This \index{tidymodels!fit}
is again because we included the formula in the recipe, so there is no need to respecify it:
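
As a rough sketch (the `uc_recipe` and `knn_spec` objects are assumed to have been created earlier in the chapter), the chained workflow might look something like this:

```r
# a sketch: chain the recipe and model spec into one workflow,
# then fit it on the unscaled data (no formula needed in fit)
knn_fit <- workflow() |>
  add_recipe(uc_recipe) |>
  add_model(knn_spec) |>
  fit(data = unscaled_cancer)

knn_fit
```
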
@@ -1364,6 +1364,8 @@ The basic idea is to create a grid of synthetic new observations using the `expa
predict the label of each, and visualize the predictions with a colored scatter having a very high transparency
(low `alpha` value) and large point radius. See if you can figure out what each line is doing!

\pagebreak

> **Note:** Understanding this code is not required for the remainder of the
> textbook. It is included for those readers who would like to use similar
> visualizations in their own data analyses.
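
A rough sketch of that grid-prediction idea is given below. It is not the textbook's code: the predictor names `Area` and `Smoothness`, the fitted workflow `knn_fit`, and the use of tidyr's `expand_grid` are all assumptions for illustration, with `tidyverse` and `tidymodels` assumed to be loaded.

```r
# 1. a grid of synthetic observations spanning the two predictors
prediction_grid <- expand_grid(
  Area = seq(min(cancer$Area), max(cancer$Area), length.out = 100),
  Smoothness = seq(min(cancer$Smoothness), max(cancer$Smoothness), length.out = 100)
)

# 2. predict the label of each synthetic observation
grid_predictions <- predict(knn_fit, prediction_grid) |>
  bind_cols(prediction_grid)

# 3. plot the predictions with high transparency and large points
ggplot(grid_predictions, aes(x = Area, y = Smoothness, color = .pred_class)) +
  geom_point(alpha = 0.05, size = 5)
```
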
33 changes: 16 additions & 17 deletions classification2.Rmd
@@ -120,7 +120,7 @@ in the analysis, would we not get a different result each time?
The trick is that in R&mdash;and other programming languages&mdash;randomness
is not actually random! Instead, R uses a *random number generator* that
produces a sequence of numbers that
are completely determined by a \index{seed} \index{random seed|see{seed}}
are completely determined by a\index{seed} \index{random seed|see{seed}}
*seed value*. Once you set the seed value
using the \index{seed!set.seed} `set.seed` function, everything after that point may *look* random,
but is actually totally reproducible. As long as you pick the same seed
@@ -134,34 +134,34 @@ Here, we pass in the number `1`.

```{r}
set.seed(1)
random_numbers <- sample(0:9, 10, replace=TRUE)
random_numbers
random_numbers1 <- sample(0:9, 10, replace=TRUE)
random_numbers1
```

You can see that `random_numbers` is a list of 10 numbers
You can see that `random_numbers1` is a list of 10 numbers
from 0 to 9 that, from all appearances, looks random. If
we run the `sample` function again, we will
get a fresh batch of 10 numbers that also look random.

```{r}
random_numbers <- sample(0:9, 10, replace=TRUE)
random_numbers
random_numbers2 <- sample(0:9, 10, replace=TRUE)
random_numbers2
```

If we want to force R to produce the same sequences of random numbers,
we can simply call the `set.seed` function again with the same argument
value.
value.

```{r}
set.seed(1)
random_numbers <- sample(0:9, 10, replace=TRUE)
random_numbers
random_numbers1_again <- sample(0:9, 10, replace=TRUE)
random_numbers1_again
random_numbers <- sample(0:9, 10, replace=TRUE)
random_numbers
random_numbers2_again <- sample(0:9, 10, replace=TRUE)
random_numbers2_again
```

And if we choose
Notice that after setting the seed, we get the same two sequences of numbers in the same order. `random_numbers1` and `random_numbers1_again` produce the same sequence of numbers, and the same can be said about `random_numbers2` and `random_numbers2_again`. And if we choose
a different value for the seed&mdash;say, 4235&mdash;we
obtain a different sequence of random numbers.

@@ -323,7 +323,7 @@ our test data does not influence any aspect of our model training. Once we have
created the standardization preprocessor, we can then apply it separately to both the
training and test data sets.

Fortunately, the `recipe` framework from `tidymodels` helps us handle \index{recipe}\index{recipe!step\_scale}\index{recipe!step\_center}
Fortunately, the `recipe` framework from `tidymodels` helps us handle\index{recipe}\index{recipe!step\_scale}\index{recipe!step\_center}
this properly. Below we construct and prepare the recipe using only the training
data (due to `data = cancer_train` in the first line).
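
A sketch of such a recipe is shown below; the two predictor columns are placeholders, and the key point is that `data = cancer_train` means the centering and scaling parameters are learned from the training set only.

```r
# a sketch: learn centering and scaling from the training data only
cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
```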

@@ -411,7 +411,6 @@ the table of predicted labels and correct labels, using the `conf_mat` function:
```{r 06-confusionmat}
confusion <- cancer_test_predictions |>
conf_mat(truth = Class, estimate = .pred_class)
confusion
```

@@ -497,7 +496,7 @@ for the application.
## Tuning the classifier

The vast majority of predictive models in statistics and machine learning have
*parameters*. A *parameter* \index{parameter}\index{tuning parameter|see{parameter}}
*parameters*. A *parameter*\index{parameter}\index{tuning parameter|see{parameter}}
is a number you have to pick in advance that determines
some aspect of how the model behaves. For example, in the $K$-nearest neighbors
classification algorithm, $K$ is a parameter that we have to pick
@@ -663,7 +662,7 @@ cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)
cancer_vfold
```

Then, when we create our data analysis workflow, we use the `fit_resamples` function \index{cross-validation!fit\_resamples}\index{tidymodels!fit\_resamples}
Then, when we create our data analysis workflow, we use the `fit_resamples` function\index{cross-validation!fit\_resamples}\index{tidymodels!fit\_resamples}
instead of the `fit` function for training. This runs cross-validation on each
train/validation split.
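
A hedged sketch of that substitution (assuming the `knn_spec`, `cancer_recipe`, and `cancer_vfold` objects from earlier in the chapter):

```r
# a sketch: fit_resamples trains and evaluates the workflow once per fold
knn_fit <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec) |>
  fit_resamples(resamples = cancer_vfold)
```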

@@ -689,7 +688,7 @@ knn_fit <- workflow() |>
knn_fit
```

The `collect_metrics` \index{tidymodels!collect\_metrics}\index{cross-validation!collect\_metrics} function is used to aggregate the *mean* and *standard error*
The `collect_metrics`\index{tidymodels!collect\_metrics}\index{cross-validation!collect\_metrics} function is used to aggregate the *mean* and *standard error*
of the classifier's validation accuracy across the folds. You will find results
related to the accuracy in the row with `accuracy` listed under the `.metric` column.
You should consider the mean (`mean`) to be the estimated accuracy, while the standard
33 changes: 16 additions & 17 deletions clustering.Rmd
@@ -32,7 +32,7 @@ including techniques to choose the number of clusters.
## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:

* Describe a case where clustering is appropriate,
* Describe a situation in which clustering is an appropriate technique to use,
and what insight it might extract from the data.
* Explain the K-means clustering algorithm.
* Interpret the output of a K-means analysis.
@@ -46,7 +46,7 @@ and do this using R.
limitations and assumptions of the K-means clustering algorithm.

## Clustering
Clustering \index{clustering} is a data analysis task
Clustering \index{clustering} is a data analysis technique
involving separating a data set into subgroups of related data.
For example, we might use clustering to separate a
data set of documents into groups that correspond to topics, a data set of
@@ -70,12 +70,11 @@ and examine the structure of data without any response variable labels
or values to help us.
This approach has both advantages and disadvantages.
Clustering requires no additional annotation or input on the data.
For example, it would be nearly impossible to annotate
all the articles on Wikipedia with human-made topic labels.
However, we can still cluster the articles without this information
For example, while it would be nearly impossible to annotate
all the articles on Wikipedia with human-made topic labels,
we can cluster the articles without this information
to find groupings corresponding to topics automatically.

Given that there is no response variable, it is not as easy to evaluate
However, given that there is no response variable, it is not as easy to evaluate
the "quality" of a clustering. With classification, we can use a test data set
to assess prediction performance. In clustering, there is not a single good
choice for evaluation. In this book, we will use visualization to ascertain the
@@ -248,9 +247,9 @@ It starts with an initial clustering of the data, and then iteratively
improves it by making adjustments to the assignment of data
to clusters until it cannot improve any further. But how do we measure
the "quality" of a clustering, and what does it mean to improve it?
In K-means clustering, we measure the quality of a cluster by its
\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD}
*within-cluster sum-of-squared-distances* (WSSD). Computing this involves two steps.
In K-means clustering, we measure the quality of a cluster
by its\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
Computing this involves two steps.
First, we find the cluster centers by computing the mean of each variable
over data points in the cluster. For example, suppose we have a
cluster containing four observations, and we are using two variables, $x$ and $y$, to cluster the data.
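
As a small illustration with made-up numbers, the center and WSSD of one such four-observation cluster could be computed as follows:

```r
library(tidyverse)

# four hypothetical observations assigned to a single cluster
toy_cluster <- tibble(x = c(1, 2, 3, 2), y = c(4, 3, 5, 4))

# step 1: the cluster center is the mean of each variable
center_x <- mean(toy_cluster$x)
center_y <- mean(toy_cluster$y)

# step 2: the WSSD is the sum of squared distances to that center
wssd <- sum((toy_cluster$x - center_x)^2 + (toy_cluster$y - center_y)^2)
wssd
```
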
@@ -839,7 +838,7 @@ p1
```

If we set K less than 3, then the clustering merges separate groups of data; this causes a large
total WSSD, since the cluster center (denoted by an "x") is not close to any of the data in the cluster. On
total WSSD, since the cluster center is not close to any of the data in the cluster. On
the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly
@@ -890,7 +889,7 @@ not_standardized_data
```

And then we apply the `scale` function to every column in the data frame
using `mutate` + `across`.
using `mutate` and `across`.

```{r 10-mapdf-scale-data}
standardized_data <- not_standardized_data |>
@@ -903,8 +902,8 @@ standardized_data

To perform K-means clustering in R, we use the `kmeans` function. \index{K-means!kmeans function} It takes at
least two arguments: the data frame containing the data you wish to cluster,
and K, the number of clusters (here we choose K = 3). Note that since the K-means
algorithm uses a random initialization of assignments, but since we set the random seed
and K, the number of clusters (here we choose K = 3). Note that the K-means
algorithm uses a random initialization of assignments; but since we set the random seed
earlier, the clustering will be reproducible.
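
A minimal sketch of the call (assuming `standardized_data` contains only the standardized clustering variables):

```r
# a sketch: cluster the standardized data into K = 3 groups
penguin_clust <- kmeans(standardized_data, centers = 3)
penguin_clust
```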

```{r 10-kmeans-seed, echo = FALSE, warning = FALSE, message = FALSE}
@@ -1000,8 +999,8 @@ penguin_clust_ks
If we wanted to get one of the clusterings out
of the list column in the data frame,
we could use a familiar friend: `pull`.
`pull` will return to us a data frame column as a simpler data structure,
here that would be a list.
`pull` will return to us a data frame column as a simpler data structure;
here, that would be a list.
And then to extract the first item of the list,
we can use the `pluck` function. We pass
it the index for the element we would like to extract
@@ -1074,7 +1073,7 @@ The more times we perform K-means clustering,
the more likely we are to find a good clustering (if one exists).
What value should you choose for `nstart`? The answer is that it depends
on many factors: the size and characteristics of your data set,
as well as the speed and size of your computer.
as well as how powerful your computer is.
The larger the `nstart` value the better from an analysis perspective,
but there is a trade-off that doing many clusterings
could take a long time.
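
For example, a sketch of asking for 10 restarts (again assuming `standardized_data` from above):

```r
# a sketch: run K-means from 10 random starts and keep the best clustering
penguin_clust <- kmeans(standardized_data, centers = 3, nstart = 10)
```
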
Binary file modified img/intro-bootstrap.jpeg
Binary file modified img/pivot_functions.key
Binary file not shown.
Binary file modified img/pivot_functions/pivot_functions.002.jpeg
Binary file modified img/tidydata_bootstrap_train_test_images.key
Binary file not shown.
16 changes: 8 additions & 8 deletions inference.Rmd
@@ -39,7 +39,7 @@ By the end of the chapter, readers will be able to do the following:

* Describe real-world examples of questions that can be answered with statistical inference.
* Define common population parameters (e.g., mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample.
* Define the following statistical sampling terms (population, sample, population parameter, point estimate, sampling distribution).
* Define the following statistical sampling terms: population, sample, population parameter, point estimate, and sampling distribution.
* Explain the difference between a population parameter and a sample point estimate.
* Use R to draw random samples from a finite population.
* Use R to create a sampling distribution from a finite population.
@@ -90,14 +90,14 @@ knitr::include_graphics("img/population_vs_sample.png")
Note that proportions are not the *only* kind of population parameter we might
be interested in. For example, suppose an undergraduate student studying at the University
of British Columbia in Canada is looking for an apartment
to rent. They need to create a budget, so they want to know something about
studio apartment rental prices in Vancouver, BC. This student might
formulate the following question:
to rent. They need to create a budget, so they want to know about
studio apartment rental prices in Vancouver. This student might
formulate the question:

*What is the average price-per-month of studio apartment rentals in Vancouver, Canada?*
*What is the average price per month of studio apartment rentals in Vancouver?*

In this case, the population consists of all studio apartment rentals in Vancouver, and the
population parameter is the *average price-per-month*. Here we used the average
population parameter is the *average price per month*. Here we used the average
as a measure of the center to describe the "typical value" of studio apartment
rental prices. But even within this one example, we could also be interested in
many other population parameters. For instance, we know that not every studio
@@ -1148,9 +1148,9 @@ boot_est_dist +

To finish our estimation of the population parameter, we would report the point
estimate and our confidence interval's lower and upper bounds. Here the sample
mean price-per-night of 40 Airbnb listings was
mean price per night of 40 Airbnb listings was
\$`r round(mean(one_sample$price),2)`, and we are 95\% "confident" that the true
population mean price-per-night for all Airbnb listings in Vancouver is between
population mean price per night for all Airbnb listings in Vancouver is between
\$(`r round(bounds[1],2)`, `r round(bounds[2],2)`).
Notice that our interval does indeed contain the true
population mean value, \$`r round(mean(airbnb$price),2)`\! However, in
10 changes: 5 additions & 5 deletions intro.Rmd
@@ -25,10 +25,10 @@ By the end of the chapter, readers will be able to do the following:
- Identify the different types of data analysis question and categorize a question into the correct type.
- Load the `tidyverse` package into R.
- Read tabular data with `read_csv`.
- Use `?` to access help and documentation tools in R.
- Create new variables and objects in R using the assignment symbol.
- Create and organize subsets of tabular data using `filter`, `select`, `arrange`, and `slice`.
- Visualize data with a `ggplot` bar plot.
- Use `?` to access help and documentation tools in R.

## Canadian languages data set

@@ -312,7 +312,7 @@ to be surrounded by quotes.
After making the assignment, we can use the special name words we have created in
place of their values. For example, if we want to do something with the value `3` later on,
we can just use `my_number` instead. Let's try adding 2 to `my_number`; you will see that
R just interprets this as adding 2 and 3:
R just interprets this as adding 3 and 2:
```{r naming-things2}
my_number + 2
```
@@ -374,7 +374,7 @@ Aboriginal languages in the data set, and then use `select` to obtain only the
columns we want to include in our table.

### Using `filter` to extract rows
Looking at the `can_lang` data above, we see the column `category` contains different
Looking at the `can_lang` data above, we see the `category` column contains different
high-level categories of languages, which include "Aboriginal languages",
"Non-Official & Non-Aboriginal languages" and "Official languages". To answer
our question we want to filter our data set so we restrict our attention
@@ -528,7 +528,7 @@ image_read("img/ggplot_function.jpeg") |>
image_crop("1625x1900")
```

```{r barplot-mother-tongue, fig.width=5, fig.height=3, warning=FALSE, fig.cap = "Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. Note that this visualization is not done yet; there are still improvements to be made."}
```{r barplot-mother-tongue, fig.width=5, fig.height=3.1, warning=FALSE, fig.cap = "Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. Note that this visualization is not done yet; there are still improvements to be made."}
ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
geom_bar(stat = "identity")
```
@@ -687,7 +687,7 @@ Figure \@ref(fig:01-help) shows the documentation that will pop up,
including a high-level description of the function, its arguments,
a description of each, and more. Note that you may find some of the
text in the documentation a bit too technical right now
(for example, what is `dbplyr`, and what is grouped data?).
(for example, what is `dbplyr`, and what is a lazy data frame?).
Fear not: as you work through this book, many of these terms will be introduced
to you, and slowly but surely you will become more adept at understanding and navigating
documentation like that shown in Figure \@ref(fig:01-help). But do keep in mind that the documentation