From 6e39efb4217af37ed63b1fa63cafe926d5c8ed9c Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 15 Nov 2023 17:08:36 -0800 Subject: [PATCH 01/18] empty commit From 44fd82795ed499e267b595b47173a1028870fcdb Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Wed, 15 Nov 2023 22:31:07 -0800 Subject: [PATCH 02/18] move citation out of caption for pdf build sync with python --- source/reading.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/source/reading.Rmd b/source/reading.Rmd index 2fdeae677..df94e2886 100644 --- a/source/reading.Rmd +++ b/source/reading.Rmd @@ -1160,9 +1160,9 @@ of accessing data through an API in this book, with the hope that it gives you e idea that you can learn how to use another API if needed. In particular, in this book we will show you the basics of how to use the `httr2` package in R\index{API!httr2}\index{httr2}\index{NASA} to access data from the NASA "Astronomy Picture of the Day" API (a great source of desktop backgrounds, by the way—take a look at the stunning -picture of the Rho-Ophiuchi cloud complex in Figure \@ref(fig:NASA-API-Rho-Ophiuchi) from July 13, 2023!). +picture of the Rho-Ophiuchi cloud complex [@rhoophiuchi] in Figure \@ref(fig:NASA-API-Rho-Ophiuchi) from July 13, 2023!). -(ref:NASA-API-Rho-Ophiuchi) The James Webb Space Telescope's NIRCam image of the Rho Ophiuchi molecular cloud complex [@rhoophiuchi]. +(ref:NASA-API-Rho-Ophiuchi) The James Webb Space Telescope's NIRCam image of the Rho Ophiuchi molecular cloud complex. ```{r NASA-API-Rho-Ophiuchi, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:NASA-API-Rho-Ophiuchi)", fig.retina = 2, out.width="60%"} knitr::include_graphics("img/reading/NASA-API-Rho-Ophiuchi.png") From a321df297f01457fe48dac28e9e131761efa7d53 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Thu, 16 Nov 2023 00:09:44 -0800 Subject: [PATCH 03/18] intro index --- source/intro.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/intro.Rmd b/source/intro.Rmd index 4d2e5bb78..8e11d79aa 100644 --- a/source/intro.Rmd +++ b/source/intro.Rmd @@ -590,7 +590,7 @@ Canadian Residents)" would be much more informative. Adding additional layers \index{plot!layers} to our visualizations that we create in `ggplot` is one common and easy way to improve and refine our data visualizations. New layers are added to `ggplot` objects using the `+` symbol. For example, we can -use the `xlab` (short for x axis label) and `ylab` (short for y axis label) functions +use the `xlab` (short for x axis label) \index{ggplot!xlab} and `ylab` (short for y axis label) \index{ggplot!ylab} functions to add layers where we specify meaningful and informative labels for the x and y axes. \index{plot!axis labels} Again, since we are specifying words (e.g. `"Mother Tongue (Number of Canadian Residents)"`) as arguments to From e4b5de2efe897313f3c4bd6d7b5911e6e69749e4 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Thu, 16 Nov 2023 10:31:33 -0800 Subject: [PATCH 04/18] Reading index --- source/reading.Rmd | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/source/reading.Rmd b/source/reading.Rmd index df94e2886..c87a1a548 100644 --- a/source/reading.Rmd +++ b/source/reading.Rmd @@ -92,7 +92,7 @@ Suppose our computer's filesystem looks like the picture in Figure \@ref(fig:file-system-for-export-to-intro-datascience). 
We are working in a file titled `worksheet_02.ipynb`, and our current working directory is `worksheet_02`; typically, as is the case here, the working directory is the directory containing the file you are currently -working on.\index{Happiness Report} +working on. ```{r file-system-for-export-to-intro-datascience, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Example file system.", fig.retina = 2, out.width="100%"} knitr::include_graphics("img/reading/filesystem.jpeg") @@ -120,8 +120,9 @@ Note that there is no forward slash at the beginning of a relative path; if we a R would look for a folder named `data` in the root folder of the computer—but that doesn't exist! Aside from specifying places to go in a path using folder names (like `data` and `worksheet_02`), we can also specify two additional -special places: the *current directory* and the *previous directory*. We indicate the current working directory with a single dot `.`, and -the previous directory with two dots `..`. So for instance, if we wanted to reach the `bike_share.csv` file from the `worksheet_02` folder, we could +special places: the *current directory* \index{path!current} and the *previous directory*. \index{path!previous} +We indicate the current working directory with a single dot `.`, and \index{aaaaaacurdirsymb@\texttt{.}|see{path}} +the previous directory with two dots `..`. \index{aaaaaprevdirsymb@\texttt{..}|see{path}} So for instance, if we wanted to reach the `bike_share.csv` file from the `worksheet_02` folder, we could use the relative path `../tutorial_01/bike_share.csv`. We can even combine these two; for example, we could reach the `bike_share.csv` file using the (very silly) path `../tutorial_01/../tutorial_01/./bike_share.csv` with quite a few redundant directions: it says to go back a folder, then open `tutorial_01`, then go back a folder again, then open `tutorial_01` again, then stay in the current directory, then finally get to `bike_share.csv`. Whew, what a long trip! @@ -1202,15 +1203,15 @@ That should be more than enough for our purposes in this section. #### Accessing the NASA API {-} -The NASA API is what is known as an *HTTP API*: this is a particularly common +The NASA API is what is known as an *HTTP API*: \index{API!HTTP} this is a particularly common kind of API, where you can obtain data simply by accessing a particular URL as if it were a regular website. To make a query to the NASA -API, we need to specify three things. First, we specify the URL *endpoint* of +API, we need to specify three things. First, we specify the URL *endpoint* of \index{API!endpoint} the API, which is simply a URL that helps the remote server understand which API you are trying to access. NASA offers a variety of APIs, each with its own endpoint; in the case of the NASA "Astronomy Picture of the Day" API, the URL endpoint is `https://api.nasa.gov/planetary/apod`. Second, we write `?`, which denotes that a -list of *query parameters* will follow. And finally, we specify a list of +list of *query parameters* \index{API!query parameters} will follow. And finally, we specify a list of query parameters of the form `parameter=value`, separated by `&` characters. The NASA "Astronomy Picture of the Day" API accepts the parameters shown in Figure \@ref(fig:NASA-API-parameters). @@ -1249,7 +1250,8 @@ Rho Ophiuchi","url":"https://apod.nasa.gov/apod/image/2307/STScI-01_RhoOph1024.p Neat! There is definitely some data there, but it's a bit hard to see what it all is. 
As it turns out, this is a common format for data called -*JSON* (JavaScript Object Notation). We won't encounter this kind of data much in this book, +*JSON* (JavaScript Object Notation). \index{JSON}\index{JavaScript Object Notation|see{JSON}} +We won't encounter this kind of data much in this book, but for now you can interpret this data as `key : value` pairs separated by commas. For example, if you look closely, you'll see that the first entry is `"date":"2023-07-13"`, which indicates that we indeed successfully received @@ -1260,6 +1262,7 @@ the `httr2` package, and construct the query using the `request` function, which you will recognize the same query URL that we pasted into the browser earlier. We will then send the query using the `req_perform` function, and finally obtain a JSON representation of the response using the `resp_body_json` function. +\index{httr2!req_perform}\index{httr2!resp_body_json} From 3a602f8ed6b7acb0acebabe21f72cf7e519d72ab Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Thu, 16 Nov 2023 11:04:17 -0800 Subject: [PATCH 05/18] index wrangling --- source/wrangling.Rmd | 59 +++++++++++--------------------------------- 1 file changed, 15 insertions(+), 44 deletions(-) diff --git a/source/wrangling.Rmd b/source/wrangling.Rmd index d88edebf5..f3ed0a87e 100644 --- a/source/wrangling.Rmd +++ b/source/wrangling.Rmd @@ -136,11 +136,11 @@ Table: (#tab:datatype-table) Basic data types in R | factor | fct | used to represent data with a limited number of values (usually categories) | a `color` variable with levels `red`, `green` and `orange` | \index{data types} -\index{character}\index{chr|see{character}} -\index{integer}\index{int|see{integer}} -\index{double}\index{dbl|see{double}} -\index{logical}\index{lgl|see{logical}} -\index{factor}\index{fct|see{factor}} +\index{data types!character (chr)}\index{chr|see{character}} +\index{data types!integer (int)}\index{int|see{integer}} +\index{data types!double (dbl)}\index{dbl|see{double}} +\index{data types!logical (lgl)}\index{lgl|see{logical}} +\index{data types!factor (fct)}\index{fct|see{factor}} It is important in R to make sure you represent your data with the correct type. Many of the `tidyverse` functions we use in this book treat the various data types differently. You should use integers and double types @@ -216,6 +216,7 @@ Vectors, data frames and lists are basic types of *data structure* in R, which are core to most data analyses. We summarize them in Table \@ref(tab:datastructure-table). There are several other data structures in the R programming language (*e.g.,* matrices), but these are beyond the scope of this book. +\index{data structures!vector}\index{data structures!list}\index{data structures!data frame} Table: (#tab:datastructure-table) Basic data structures in R @@ -669,11 +670,12 @@ the second is a *logical statement* to use when filtering the rows. This section will highlight more advanced usage of the `filter` function. In particular, this section provides an in-depth treatment of the variety of logical statements one can use in the `filter` function to select subsets of rows. +\index{logical statement|see{logical operator}} ### Extracting rows that have a certain value with `==` Suppose we are only interested in the subset of rows in `tidy_lang` corresponding to the official languages of Canada (English and French). 
-We can `filter` for these rows by using the *equivalency operator* (`==`) +We can `filter` for these rows by using the *equivalency operator* (`==`) \index{logical operator!equivalency} to compare the values of the `category` column with the value `"Official languages"`. With these arguments, `filter` returns a data frame with all the columns @@ -690,7 +692,7 @@ official_langs ### Extracting rows that do not have a certain value with `!=` What if we want all the other language categories in the data set *except* for -those in the `"Official languages"` category? We can accomplish this with the `!=` +those in the `"Official languages"` category? We can accomplish this with the `!=` \index{logical operator!inequivalency} operator, which means "not equal to". So if we want to find all the rows where the `category` does *not* equal `"Official languages"` we write the code below. @@ -709,7 +711,7 @@ We can do this with the comma symbol (`,`), which in the case of `filter` is interpreted by R as "and". We write the code as shown below to filter the `official_langs` data frame to subset the rows where `region == "Montréal"` -*and* the `language == "French"`. +*and* the `language == "French"`. \index{logical operator!and} ``` {r} filter(official_langs, region == "Montréal", language == "French") @@ -735,7 +737,7 @@ Instead, we can use the vertical pipe (`|`) logical operator, which gives us the cases where one condition *or* another condition *or* both are satisfied. In the code below, we ask R to return the rows -where the `region` columns are equal to "Calgary" *or* "Edmonton". +where the `region` columns are equal to "Calgary" *or* "Edmonton". \index{logical operator!or} ``` {r} filter(official_langs, region == "Calgary" | region == "Edmonton") @@ -760,7 +762,7 @@ region_data To get the population of the five cities we can filter the data set using the `%in%` operator. -The `%in%` operator is used to see if an element belongs to a vector. +The `%in%` operator is used to see if an element belongs to a vector. \index{logical operator!containment} Here we are filtering for rows where the value in the `region` column matches any of the five cities we are intersted in: Toronto, Montréal, Vancouver, Calgary, and Edmonton. @@ -804,7 +806,8 @@ where the value of `most_at_home` is greater than `r format(most_french, scientific = FALSE, big.mark = ",")`. We use the `>` symbol to look for values *above* a threshold, and the `<` symbol to look for values *below* a threshold. The `>=` and `<=` symbols similarly look -for *equal to or above* a threshold and *equal to or below* a threshold. +for *equal to or above* a threshold and *equal to or below* a +threshold. \index{logical operator!greater than}\index{logical operator!less than} ``` {r} filter(official_langs, most_at_home > 2669195) @@ -973,38 +976,6 @@ Failing to do this would have resulted in the incorrect math being performed. > We link to resources that discuss this in the additional > resources at the end of this chapter. - - - ## Combining functions using the pipe operator, `|>` In R, we often have to call multiple functions in a sequence to process a data @@ -1425,7 +1396,7 @@ simpler alternative is to just use a different `map` function. There are quite a few to choose from, they all work similarly, but their name reflects the type of output you want from the mapping operation. Table \@ref(tab:map-table) lists the commonly used `map` functions as well -as their output type. \index{map!map\_\* functions} +as their output type. 
\index{map!map functions} Table: (#tab:map-table) The `map` functions in R. From 2cabe8ea056efef90de6e58fc775e97f1de87b59 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Thu, 16 Nov 2023 11:22:11 -0800 Subject: [PATCH 06/18] viz index --- source/viz.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/source/viz.Rmd b/source/viz.Rmd index a76e0a12d..76e654d03 100644 --- a/source/viz.Rmd +++ b/source/viz.Rmd @@ -241,7 +241,7 @@ We see that there are two columns in the `co2_df` data frame; `date_measured` an The `date_measured` column holds the date the measurement was taken, and is of type `date`. The `ppm` column holds the value of CO$_{\text{2}}$ in parts per million -that was measured on each date, and is type `double`. +that was measured on each date, and is type `double`.\index{dates and times} > **Note:** `read_csv` was able to parse the `date_measured` column into the > `date` vector type because it was entered @@ -1225,9 +1225,9 @@ admirable job given the technology available at the time. When you create a histogram in R, the default number of bins used is 30. Naturally, this is not always the right number to use. You can set the number of bins yourself by using -the `bins` argument in the `geom_histogram` geometric object. +the `bins` argument in the `geom_histogram` geometric object. \label{ggplot!bins} You can also set the *width* of the bins using the -`binwidth` argument in the `geom_histogram` geometric object. +`binwidth` argument in the `geom_histogram` geometric object. \label{ggplot!binwidth} But what number of bins, or bin width, is the right one to use? Unfortunately there is no hard rule for what the right bin number From fbbd73d842972d14f2476c57c386d24df459e90a Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Thu, 16 Nov 2023 12:09:51 -0800 Subject: [PATCH 07/18] jupyter vsnctl idcs --- source/jupyter.Rmd | 2 +- source/version-control.Rmd | 10 ++++------ 2 files changed, 5 insertions(+), 7 deletions(-) diff --git a/source/jupyter.Rmd b/source/jupyter.Rmd index 77bbea022..26aa49471 100644 --- a/source/jupyter.Rmd +++ b/source/jupyter.Rmd @@ -377,7 +377,7 @@ right-clicking on the file's name in the Jupyter file explorer, selecting **Open with**, and then selecting **Editor** (Figure \@ref(fig:open-data-w-editor-1)). Suppose you do not specify to open the data file with an editor. In that case, Jupyter will render a nice table -for you, and you will not be able to see the column delimiters, and therefore +for you, and you will not be able to see the column delimiters, \index{delimiter} and therefore you will not know which function to use, nor which arguments to use and values to specify for them. diff --git a/source/version-control.Rmd b/source/version-control.Rmd index 77325a0b0..ae215155f 100644 --- a/source/version-control.Rmd +++ b/source/version-control.Rmd @@ -137,6 +137,7 @@ a workspace on a server (e.g., JupyterHub). The other copy is typically stored in a repository hosting service (e.g., GitHub), where we can easily share it with our collaborators. This copy is commonly referred to as \index{repository!remote} the **remote repository**. 
+\index{repository|see{version control}} ```{r vc1-no-changes, fig.cap = 'Schematic of local and remote version control repositories.', fig.retina = 2, out.width="100%"} image_read("img/version-control/vc1-no-changes.png") |> @@ -200,10 +201,10 @@ image_read("img/version-control/vc2-changes.png") |> ``` Once you reach a point that you want Git to keep a record -of the current version of your work, you need to commit +of the current version of your work, you need to **commit** \index{git!commit} (i.e., snapshot) your changes. A prerequisite to this is telling Git which files should be included in that snapshot. We call this step **adding** the -files to the **staging area**. \index{git!add, staging area} +files to the **staging area**. \index{git!add, staging area}\index{staging area|see{git}} Note that the staging area is not a real physical location on your computer; it is instead a conceptual placeholder for these files until they are committed. The benefit of the Git version control system using a staging area is that you @@ -309,7 +310,7 @@ Repositories can be set up with a variety of configurations, including a name, optional description, and the inclusion (or not) of several template files. One of the most important configuration items to choose is the visibility to the outside world, either public or private. *Public* repositories \index{repository!public} can be viewed by anyone. -*Private* repositories can be viewed by only you. Both public and private repositories +*Private* repositories \index{repository!private} can be viewed by only you. Both public and private repositories are only editable by you, but you can change that by giving access to other collaborators. To get started with a *public* repository having a template `README.md` file, take the @@ -530,9 +531,6 @@ image_read("img/version-control/generate-pat_03.png") ### Cloning a repository using Jupyter - - *Cloning* a \index{git!clone} remote repository from GitHub to create a local repository results in a copy that knows where it was obtained from so that it knows where to send/receive From e869d581af7ae5a14970e52f1f7bbae55b73773f Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Thu, 16 Nov 2023 12:17:42 -0800 Subject: [PATCH 08/18] setup index --- source/setup.Rmd | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/source/setup.Rmd b/source/setup.Rmd index b52bb87d7..c416500a4 100644 --- a/source/setup.Rmd +++ b/source/setup.Rmd @@ -61,7 +61,7 @@ exactly right! To keep things simple, we instead recommend that you install [Docker](https://docker.com). Docker lets you run your Jupyter notebooks inside a pre-built *container* that comes with precisely the right versions of all software packages needed run the worksheets that come with this book. -\index{Docker} +\index{Docker}\index{container} > **Note:** A *container* is a virtualized user space within your computer. > Within the container, you can run software in isolation without interfering with the @@ -77,6 +77,7 @@ all software packages needed run the worksheets that come with this book. visit [the online Docker documentation](https://docs.docker.com/desktop/install/windows-install/), and download the `Docker Desktop Installer.exe` file. Double-click the file to open the installer and follow the instructions on the installation wizard, choosing **WSL-2** instead of **Hyper-V** when prompted. +\index{Docker!installation} > **Note:** Occasionally, when you first run Docker on Windows, you will encounter an error message. 
Some common errors you may see: > @@ -90,7 +91,7 @@ and follow the instructions on the installation wizard, choosing **WSL-2** inste > to help you with this, as editing the BIOS can be dangerous. Detailed instructions for doing this are beyond the scope of this book. **Running JupyterLab** Run Docker Desktop. Once it is running, you need to download and run the -Docker *image* that we have made available for the worksheets (an *image* is like a "snapshot" of a +Docker *image* that \index{Docker!image}\index{Docker!tag} we have made available for the worksheets (an *image* is like a "snapshot" of a computer with all the right packages pre-installed). You only need to do this step one time; the image will remain the next time you run Docker Desktop. In the Docker Desktop search bar, enter `ubcdsci/r-dsci-100`, as this is @@ -177,7 +178,8 @@ sudo chmod u+x get-docker.sh sudo sh get-docker.sh ``` -**Running JupyterLab** First, open the [`Dockerfile` in the worksheets repository](https://raw.githubusercontent.com/UBC-DSCI/data-science-a-first-intro-worksheets/main/Dockerfile), +**Running JupyterLab** First, open +the [`Dockerfile` in the worksheets repository](https://raw.githubusercontent.com/UBC-DSCI/data-science-a-first-intro-worksheets/main/Dockerfile), and look for the line `FROM ubcdsci/r-dsci-100:` followed by a tag consisting of a sequence of numbers and letters. Then in the terminal, navigate to the directory where you want to run JupyterLab, and run the following command, replacing `TAG` with the *tag* you found earlier. From 6a7057a1439e7e94a4fc39ffbe33c37a09d26227 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Thu, 16 Nov 2023 12:28:17 -0800 Subject: [PATCH 09/18] index inference --- source/inference.Rmd | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/source/inference.Rmd b/source/inference.Rmd index c3ce541c8..45dd898fa 100644 --- a/source/inference.Rmd +++ b/source/inference.Rmd @@ -270,7 +270,7 @@ We first group the data by the `replicate` variable—to group the set of listings in each sample together—and then use `summarize` to compute the proportion in each sample. We print both the first and last few entries of the resulting data frame -below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples. +below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples.\index{group\_by}\index{summarize} ```{r 11-example-proportions6, echo = TRUE, message = FALSE, warning = FALSE} sample_estimates <- samples |> @@ -381,7 +381,7 @@ one_sample <- airbnb |> We can create a histogram to visualize the distribution of observations in the sample (Figure \@ref(fig:11-example-means-sample-hist)), and calculate the mean -of our sample. +of our sample.\index{ggplot!geom\_histogram} ```{r 11-example-means-sample-hist, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Distribution of price per night (dollars) for sample of 40 Airbnb listings.", fig.height = 3.5, fig.width = 4.5} sample_distribution <- ggplot(one_sample, aes(price)) + @@ -1116,6 +1116,7 @@ To calculate a 95\% percentile bootstrap confidence interval, we will do the fol To do this in R, we can use the `quantile()` function. Quantiles are expressed in proportions rather than percentages, so the 2.5th and 97.5th percentiles would be the 0.025 and 0.975 quantiles, respectively. 
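As a concrete sketch of this step, the snippet below pulls out those two bounds with `quantile`. This is an illustrative sketch rather than the book's own chunk: it assumes the `tidyverse` is loaded and that the bootstrap point estimates live in a data frame called `boot20000_means` with a column named `mean`.

```{r}
# illustrative sketch: `boot20000_means` is assumed to hold one bootstrap
# sample mean per row, in a column named `mean`
ci_bounds <- boot20000_means |>
  select(mean) |>
  pull() |>                    # extract the estimates as a plain numeric vector
  quantile(c(0.025, 0.975))    # 2.5th and 97.5th percentiles = 95% interval bounds
ci_bounds
```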
+\index{percentile} \index{quantile} \index{pull} \index{select} From 948ef88c49b371f6b81e4467ef6383f46b48e238 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Thu, 16 Nov 2023 12:38:48 -0800 Subject: [PATCH 10/18] clustering index --- source/clustering.Rmd | 18 +++--------------- 1 file changed, 3 insertions(+), 15 deletions(-) diff --git a/source/clustering.Rmd b/source/clustering.Rmd index 77a04341d..c3c1dec8d 100644 --- a/source/clustering.Rmd +++ b/source/clustering.Rmd @@ -164,7 +164,7 @@ library(tidyverse) set.seed(1) ``` -Now we can load and preview the `penguins` data. +Now we can load and preview the `penguins` data.\index{read function!read\_csv} ```{r message = FALSE, warning = FALSE} penguins <- read_csv("data/penguins.csv") @@ -639,7 +639,7 @@ in the fourth iteration; both the centers and labels will remain the same from t ### Random restarts -Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart, nstart} can get "stuck" in a bad solution. +Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart} can get "stuck" in a bad solution. For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means. ```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3.25, fig.width = 3.75, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Random initialization of labels."} @@ -910,7 +910,7 @@ set.seed(1) We can perform K-means clustering in R using a `tidymodels` workflow similar to those in the earlier classification and regression chapters. -We will begin by loading the `tidyclust`\index{tidyclust} library, which contains the necessary +We will begin by loading the `tidyclust`\index{K-means}\index{tidyclust} library, which contains the necessary functionality. ```{r, echo = TRUE, warning = FALSE, message = FALSE} library(tidyclust) @@ -993,18 +993,6 @@ clustered_data <- kmeans_fit |> clustered_data ``` - - Now that we have the cluster assignments included in the `clustered_data` tidy data frame, we can visualize them as shown in Figure \@ref(fig:10-plot-clusters-2). Note that we are plotting the *un-standardized* data here; if we for some reason wanted to From 11256f454b780f459956b02d82bfd968a23320ee Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Thu, 16 Nov 2023 13:00:47 -0800 Subject: [PATCH 11/18] cls1 index --- source/classification1.Rmd | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/source/classification1.Rmd b/source/classification1.Rmd index b214e8d16..607b8fac4 100644 --- a/source/classification1.Rmd +++ b/source/classification1.Rmd @@ -1295,7 +1295,7 @@ upsampled_plot ### Missing data -One of the most common issues in real data sets in the wild is *missing data*, +One of the most common issues in real data sets in the wild is *missing data*,\index{missing data} i.e., observations where the values of some of the variables were not recorded. Unfortunately, as common as it is, handling missing data properly is very challenging and generally relies on expert knowledge about the data, setting, @@ -1329,7 +1329,7 @@ data. So how can we perform K-nearest neighbors classification in the presence of missing data? Well, since there are not too many observations with missing entries, one option is to simply remove those observations prior to building the K-nearest neighbors classifier. 
We can accomplish this by using the -`drop_na` function from `tidyverse` prior to working with the data. +`drop_na` function from `tidyverse` prior to working with the data.\label{missing data!drop\_na} ```{r 05-naomit} no_missing_cancer <- missing_cancer |> drop_na() @@ -1342,7 +1342,8 @@ possible approach is to *impute* the missing entries, i.e., fill in synthetic values based on the other observations in the data set. One reasonable choice is to perform *mean imputation*, where missing entries are filled in using the mean of the present entries in each variable. To perform mean imputation, we -add the `step_impute_mean` step to the `tidymodels` preprocessing recipe. +add the `step_impute_mean` \index{recipe!step\_impute\_mean}\index{missing data!mean imputation} +step to the `tidymodels` preprocessing recipe. ```{r 05-impute, results=FALSE, message=FALSE, echo=TRUE} impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |> step_impute_mean(all_predictors()) |> From 39b41a31dbd18c0bf303c45cd1ebc261172d387c Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Thu, 16 Nov 2023 13:21:11 -0800 Subject: [PATCH 12/18] cls2 index --- source/classification2.Rmd | 32 +++++++++----------------------- 1 file changed, 9 insertions(+), 23 deletions(-) diff --git a/source/classification2.Rmd b/source/classification2.Rmd index ac649c2ad..dea7cd0b8 100644 --- a/source/classification2.Rmd +++ b/source/classification2.Rmd @@ -117,7 +117,7 @@ a single number. But prediction accuracy by itself does not tell the whole story. In particular, accuracy alone only tells us how often the classifier makes mistakes in general, but does not tell us anything about the *kinds* of mistakes the classifier makes. A more comprehensive view of performance can be -obtained by additionally examining the **confusion matrix**. The confusion +obtained by additionally examining the **confusion matrix**. The confusion\index{confusion matrix} matrix shows how many test set labels of each type are predicted correctly and incorrectly, which gives us more detail about the kinds of mistakes the classifier tends to make. Table \@ref(tab:confusion-matrix) shows an example @@ -148,7 +148,8 @@ disastrous error, since it may lead to a patient who requires treatment not rece Since we are particularly interested in identifying malignant cases, this classifier would likely be unacceptable even with an accuracy of 89%. -Focusing more on one label than the other is +Focusing more on one label than the other +is\index{positive label}\index{negative label}\index{true positive}\index{false positive}\index{true negative}\index{false negative} common in classification problems. In such cases, we typically refer to the label we are more interested in identifying as the *positive* label, and the other as the *negative* label. In the tumor example, we would refer to malignant @@ -166,7 +167,7 @@ therefore, 100% accuracy). However, classifiers in practice will almost always make some errors. So you should think about which kinds of error are most important in your application, and use the confusion matrix to quantify and report them. Two commonly used metrics that we can compute using the confusion -matrix are the **precision** and **recall** of the classifier. These are often +matrix are the **precision** and **recall** of the classifier.\index{precision}\index{recall} These are often reported together with accuracy. *Precision* quantifies how many of the positive predictions the classifier made were actually positive. 
Intuitively, we would like a classifier to have a *high* precision: for a classifier with @@ -582,7 +583,7 @@ We now know that the classifier was `r round(100*cancer_acc_1$.estimate, 0)`% ac on the test data set, and had a precision of `r round(100*cancer_prec_1$.estimate, 0)`% and a recall of `r round(100*cancer_rec_1$.estimate, 0)`%. That sounds pretty good! Wait, *is* it good? Or do we need something higher? -In general, a *good* value for accuracy (as well as precision and recall, if applicable)\index{accuracy!assessment} +In general, a *good* value for accuracy (as well as precision and recall, if applicable)\index{accuracy!assessment}\index{precision!assessment}\index{recall!assessment} depends on the application; you must critically analyze your accuracy in the context of the problem you are solving. For example, if we were building a classifier for a kind of tumor that is benign 99% of the time, a classifier with 99% accuracy is not terribly impressive (just always guess benign!). @@ -845,7 +846,7 @@ The `collect_metrics`\index{tidymodels!collect\_metrics}\index{cross-validation! of the classifier's validation accuracy across the folds. You will find results related to the accuracy in the row with `accuracy` listed under the `.metric` column. You should consider the mean (`mean`) to be the estimated accuracy, while the standard -error (`std_err`) is a measure of how uncertain we are in the mean value. A detailed treatment of this +error (`std_err`) is\index{standard error}\index{sem|see{standard error}} a measure of how uncertain we are in the mean value. A detailed treatment of this is beyond the scope of this chapter; but roughly, if your estimated mean is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2)` and standard error is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2)`, you can expect the *true* average accuracy of the classifier to be somewhere roughly between `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) - round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% and `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) + round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% (although it may @@ -859,7 +860,7 @@ knn_fit |> collect_metrics() ``` -We can choose any number of folds, and typically the more we use the better our +We can choose any number of folds,\index{cross-validation!folds} and typically the more we use the better our accuracy estimate will be (lower standard error). However, we are limited by computational power: the more folds we choose, the more computation it takes, and hence the more time @@ -1180,6 +1181,7 @@ knn_fit Then to make predictions and assess the estimated accuracy of the best model on the test data, we use the `predict` and `metrics` functions as we did earlier in the chapter. We can then pass those predictions to the `precision`, `recall`, and `conf_mat` functions to assess the estimated precision and recall, and print a confusion matrix. 
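Before looking at those `tidymodels` calls, it can help to see the arithmetic that precision and recall boil down to. The counts below are made up purely for illustration (they are not the breast cancer results): precision divides the true positives by all positive predictions, while recall divides the true positives by all truly positive observations.

```{r}
# made-up counts for illustration only
tp <- 90   # malignant tumors correctly predicted as malignant (true positives)
fp <- 10   # benign tumors incorrectly predicted as malignant (false positives)
fn <- 20   # malignant tumors incorrectly predicted as benign (false negatives)

tp / (tp + fp)   # precision: 0.9
tp / (tp + fn)   # recall: about 0.82
```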
+\index{predict}\index{precision}\index{recall}\index{conf\_mat} ```{r 06-predictions-after-tuning, message = FALSE, warning = FALSE} cancer_test_predictions <- predict(knn_fit, cancer_test) |> @@ -1393,24 +1395,8 @@ accs <- accs |> unlist() nghbrs <- nghbrs |> unlist() fixedaccs <- fixedaccs |> unlist() -## get accuracy if we always just guess the most frequent label -#base_acc <- cancer_irrelevant |> -# group_by(Class) |> -# summarize(n = n()) |> -# mutate(frac = n/sum(n)) |> -# summarize(mx = max(frac)) |> -# select(mx) -#base_acc <- base_acc$mx |> unlist() - # plot res <- tibble(ks = ks, accs = accs, fixedaccs = fixedaccs, nghbrs = nghbrs) -#res <- res |> mutate(base_acc = base_acc) -#plt_irrelevant_accuracies <- res |> -# ggplot() + -# geom_line(mapping = aes(x=ks, y=accs, linetype="Tuned K-NN")) + -# geom_hline(data=res, mapping=aes(yintercept=base_acc, linetype="Always Predict Benign")) + -# labs(x = "Number of Irrelevant Predictors", y = "Model Accuracy Estimate") + -# scale_linetype_manual(name="Method", values = c("dashed", "solid")) plt_irrelevant_accuracies <- ggplot(res) + geom_line(mapping = aes(x=ks, y=accs)) + @@ -1533,7 +1519,7 @@ Therefore we will continue the rest of this section using forward selection. ### Forward selection in R -We now turn to implementing forward selection in R. +We now turn to implementing forward selection in R.\index{variable selection!implementation} Unfortunately there is no built-in way to do this using the `tidymodels` framework, so we will have to code it ourselves. First we will use the `select` function to extract a smaller set of predictors to work with in this illustrative example—`Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`—as From e564234dc32c4579ee78107900a3ff8f58a1b376 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Thu, 16 Nov 2023 13:34:09 -0800 Subject: [PATCH 13/18] regresin1 index --- source/regression1.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/regression1.Rmd b/source/regression1.Rmd index 636a799c8..09bcb0d46 100644 --- a/source/regression1.Rmd +++ b/source/regression1.Rmd @@ -85,7 +85,7 @@ another example, we could try to use the size of a house to predict its sale price. Both of these response variables—race time and sale price—are numerical, and so predicting them given past data is considered a regression problem. -Just like in the \index{classification!comparison to regression} +Just like in the \index{classification!comparison to regression}\index{regression!comparison to classification} classification setting, there are many possible methods that we can use to predict numerical response variables. In this chapter we will focus on the **K-nearest neighbors** algorithm [@knnfix; @knncover], and in the next chapter From fdd9f34c1e8870820365d3499ad8f0467172c561 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Thu, 16 Nov 2023 13:39:47 -0800 Subject: [PATCH 14/18] reg2 indx --- source/regression2.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/regression2.Rmd b/source/regression2.Rmd index bcc666127..225eee7b2 100644 --- a/source/regression2.Rmd +++ b/source/regression2.Rmd @@ -706,7 +706,7 @@ lm_plot_outlier_large ### Multicollinearity The second, and much more subtle, issue can occur when performing multivariable -linear regression. In particular, if you include multiple predictors that are \index{colinear}\index{multicolinear|see{colinear}} +linear regression. 
In particular, if you include multiple predictors that are \index{multicollinearity} strongly linearly related to one another, the coefficients that describe the plane of best fit can be very unreliable—small changes to the data can result in large changes in the coefficients. Consider an extreme example using From 8569651d60a3a3ddf9fcacf7f81768dcc8ef3f72 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Thu, 16 Nov 2023 16:51:10 -0800 Subject: [PATCH 15/18] bugfix index cls1 --- source/classification1.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/source/classification1.Rmd b/source/classification1.Rmd index 607b8fac4..666450b46 100644 --- a/source/classification1.Rmd +++ b/source/classification1.Rmd @@ -1329,7 +1329,7 @@ data. So how can we perform K-nearest neighbors classification in the presence of missing data? Well, since there are not too many observations with missing entries, one option is to simply remove those observations prior to building the K-nearest neighbors classifier. We can accomplish this by using the -`drop_na` function from `tidyverse` prior to working with the data.\label{missing data!drop\_na} +`drop_na` function from `tidyverse` prior to working with the data.\index{missing data!drop\_na} ```{r 05-naomit} no_missing_cancer <- missing_cancer |> drop_na() From e076a5ad3868d73beac9548cbc93d5088da33b5b Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Fri, 17 Nov 2023 09:00:02 -0800 Subject: [PATCH 16/18] bugfix index viz --- source/viz.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/source/viz.Rmd b/source/viz.Rmd index 76e654d03..d4e68e6fc 100644 --- a/source/viz.Rmd +++ b/source/viz.Rmd @@ -1225,9 +1225,9 @@ admirable job given the technology available at the time. When you create a histogram in R, the default number of bins used is 30. Naturally, this is not always the right number to use. You can set the number of bins yourself by using -the `bins` argument in the `geom_histogram` geometric object. \label{ggplot!bins} +the `bins` argument in the `geom_histogram` geometric object. \index{ggplot!bins} You can also set the *width* of the bins using the -`binwidth` argument in the `geom_histogram` geometric object. \label{ggplot!binwidth} +`binwidth` argument in the `geom_histogram` geometric object. \index{ggplot!binwidth} But what number of bins, or bin width, is the right one to use? Unfortunately there is no hard rule for what the right bin number From 9e3a0ddb49c1fd62bb82e25a150a6dd401a4cee3 Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Fri, 17 Nov 2023 09:58:55 -0800 Subject: [PATCH 17/18] bugfixing index --- source/reading.Rmd | 2 +- source/viz.Rmd | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/source/reading.Rmd b/source/reading.Rmd index c87a1a548..9e2844580 100644 --- a/source/reading.Rmd +++ b/source/reading.Rmd @@ -1262,7 +1262,7 @@ the `httr2` package, and construct the query using the `request` function, which you will recognize the same query URL that we pasted into the browser earlier. We will then send the query using the `req_perform` function, and finally obtain a JSON representation of the response using the `resp_body_json` function. 
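A minimal sketch of that three-step workflow appears below. It is illustrative rather than the book's own code chunk: `YOUR_API_KEY` is a placeholder for a real NASA API key, and the exact query URL the book pastes in may differ slightly.

```{r}
library(httr2)

# sketch of the request -> req_perform -> resp_body_json workflow;
# YOUR_API_KEY is a placeholder, not a working key
nasa_request <- request(
  "https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&date=2023-07-13"
)
nasa_response <- req_perform(nasa_request)      # send the query to the NASA server
nasa_data <- resp_body_json(nasa_response)      # parse the JSON body into an R list
nasa_data$date
```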
-\index{httr2!req_perform}\index{httr2!resp_body_json} +\index{httr2!req\_perform}\index{httr2!resp\_body\_json} diff --git a/source/viz.Rmd b/source/viz.Rmd index d4e68e6fc..1cf1ab09d 100644 --- a/source/viz.Rmd +++ b/source/viz.Rmd @@ -774,7 +774,7 @@ To change the color palette, we add the `scale_color_brewer` layer indicating the palette we want to use. You can use this [color blindness simulator](https://www.color-blindness.com/coblis-color-blindness-simulator/) to check -if your visualizations \index{color palette!color blindness simulator} +if your visualizations \index{color blindness simulator} are color-blind friendly. Below we pick the `"Set2"` palette, with the result shown in Figure \@ref(fig:scatter-color-by-category-palette). From 4e49b0162a290e961811aaf4c44753c80d0f756a Mon Sep 17 00:00:00 2001 From: Trevor Campbell Date: Fri, 17 Nov 2023 10:05:53 -0800 Subject: [PATCH 18/18] index bugfixes --- source/clustering.Rmd | 2 +- source/intro.Rmd | 2 +- source/reading.Rmd | 2 +- source/regression1.Rmd | 4 ++-- source/version-control.Rmd | 2 +- source/viz.Rmd | 8 ++++---- 6 files changed, 10 insertions(+), 10 deletions(-) diff --git a/source/clustering.Rmd b/source/clustering.Rmd index c3c1dec8d..8a7aefd09 100644 --- a/source/clustering.Rmd +++ b/source/clustering.Rmd @@ -295,7 +295,7 @@ improves it by making adjustments to the assignment of data to clusters until it cannot improve any further. But how do we measure the "quality" of a clustering, and what does it mean to improve it? In K-means clustering, we measure the quality of a cluster -by its\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD). +by its\index{within-cluster sum of squared distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD). Computing this involves two steps. First, we find the cluster centers by computing the mean of each variable over data points in the cluster. For example, suppose we have a diff --git a/source/intro.Rmd b/source/intro.Rmd index 8e11d79aa..3e81dee21 100644 --- a/source/intro.Rmd +++ b/source/intro.Rmd @@ -388,7 +388,7 @@ filtering the rows. A logical statement evaluates to either `TRUE` or `FALSE`; `filter` keeps only those rows for which the logical statement evaluates to `TRUE`. For example, in our analysis, we are interested in keeping only languages in the "Aboriginal languages" higher-level category. We can use -the *equivalency operator* `==` \index{logical statement!equivalency operator} to compare the values +the *equivalency operator* `==` \index{logical operator!equivalency} to compare the values of the `category` column with the value `"Aboriginal languages"`; you will learn about many other kinds of logical statements in Chapter \@ref(wrangling). Similar to when we loaded the data file and put quotes around the file name, here we need diff --git a/source/reading.Rmd b/source/reading.Rmd index 9e2844580..3563553dd 100644 --- a/source/reading.Rmd +++ b/source/reading.Rmd @@ -78,7 +78,7 @@ into R, but before we can talk about *how* we read the data into R with these functions, we first need to talk about *where* the data lives. When you load a data set into R, you first need to tell R where those files live. The file could live on your computer (*local*) -\index{location|see{path}} \index{path!local, remote, relative, absolute} +\index{location|see{path}} \index{path!local}\index{path!remote}\index{path!relative}\index{path!absolute} or somewhere on the internet (*remote*). 
The place where the file lives on your computer is referred to as its "path". You can diff --git a/source/regression1.Rmd b/source/regression1.Rmd index 09bcb0d46..d279df77b 100644 --- a/source/regression1.Rmd +++ b/source/regression1.Rmd @@ -303,8 +303,8 @@ Note that for the remainder of the chapter we'll be working with the entire Sacramento data set, as opposed to the smaller sample of 30 points that we used earlier in the chapter (Figure \@ref(fig:07-small-eda-regr)). -\index{training data} -\index{test data} +\index{training set} +\index{test set} ```{r 07-sacramento-seed-before-train-test-split, echo = FALSE, message = FALSE, warning = FALSE} # hidden seed -- make sure this is the same as what appears in reg2 right before train/test split diff --git a/source/version-control.Rmd b/source/version-control.Rmd index ae215155f..ba85cb653 100644 --- a/source/version-control.Rmd +++ b/source/version-control.Rmd @@ -204,7 +204,7 @@ Once you reach a point that you want Git to keep a record of the current version of your work, you need to **commit** \index{git!commit} (i.e., snapshot) your changes. A prerequisite to this is telling Git which files should be included in that snapshot. We call this step **adding** the -files to the **staging area**. \index{git!add, staging area}\index{staging area|see{git}} +files to the **staging area**. \index{git!add}\index{git!staging area}\index{staging area|see{git}} Note that the staging area is not a real physical location on your computer; it is instead a conceptual placeholder for these files until they are committed. The benefit of the Git version control system using a staging area is that you diff --git a/source/viz.Rmd b/source/viz.Rmd index 1cf1ab09d..bc052a15d 100644 --- a/source/viz.Rmd +++ b/source/viz.Rmd @@ -356,7 +356,7 @@ visual noise to remove. But there are a few things we must do to improve clarity, such as adding informative axis labels and making the font a more readable size. To add axis labels, we use the `xlab` and `ylab` functions. To change the font size, we use the `theme` function with the `text` argument: -\index{ggplot!xlab,ylab} +\index{ggplot!xlab}\index{ggplot!ylab} \index{ggplot!theme} ```{r 03-data-co2-line-2, warning=FALSE, message=FALSE, fig.height = 3.1, fig.width = 4.5, fig.align = "center", fig.cap = "Line plot of atmospheric concentration of CO$_{2}$ over time with clearer axes and labels."} @@ -675,11 +675,11 @@ to assess a few key characteristics of the data: - **Direction:** if the y variable tends to increase when the x variable increases, then y has a **positive** relationship with x. If y tends to decrease when x increases, then y has a **negative** relationship with x. If y does not meaningfully increase or decrease - as x increases, then y has **little or no** relationship with x. \index{relationship!positive, negative, none} + as x increases, then y has **little or no** relationship with x. \index{relationship!positive}\index{relationship!negative}\index{relationship!none} - **Strength:** if the y variable *reliably* increases, decreases, or stays flat as x increases, - then the relationship is **strong**. Otherwise, the relationship is **weak**. Intuitively, \index{relationship!strong, weak} + then the relationship is **strong**. Otherwise, the relationship is **weak**. Intuitively, \index{relationship!strong}\index{relationship!weak} the relationship is strong when the scatter points are close together and look more like a "line" or "curve" than a "cloud." 
-- **Shape:** if you can draw a straight line roughly through the data points, the relationship is **linear**. Otherwise, it is **nonlinear**. \index{relationship!linear, nonlinear} +- **Shape:** if you can draw a straight line roughly through the data points, the relationship is **linear**. Otherwise, it is **nonlinear**. \index{relationship!linear}\index{relationship!nonlinear} In Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props), we see that as the percentage of people who have a language as their mother tongue increases,