Index update #564

Merged
merged 18 commits into from
Nov 17, 2023
7 changes: 4 additions & 3 deletions source/classification1.Rmd
Original file line number Diff line number Diff line change
@@ -1295,7 +1295,7 @@ upsampled_plot

### Missing data

One of the most common issues in real data sets in the wild is *missing data*,
One of the most common issues in real data sets in the wild is *missing data*,\index{missing data}
i.e., observations where the values of some of the variables were not recorded.
Unfortunately, as common as it is, handling missing data properly is very
challenging and generally relies on expert knowledge about the data, setting,
@@ -1329,7 +1329,7 @@ data. So how can we perform K-nearest neighbors classification in the presence
of missing data? Well, since there are not too many observations with missing
entries, one option is to simply remove those observations prior to building
the K-nearest neighbors classifier. We can accomplish this by using the
`drop_na` function from `tidyverse` prior to working with the data.
`drop_na` function from `tidyverse` prior to working with the data.\index{missing data!drop\_na}

```{r 05-naomit}
no_missing_cancer <- missing_cancer |> drop_na()
@@ -1342,7 +1342,8 @@ possible approach is to *impute* the missing entries, i.e., fill in synthetic
values based on the other observations in the data set. One reasonable choice
is to perform *mean imputation*, where missing entries are filled in using the
mean of the present entries in each variable. To perform mean imputation, we
add the `step_impute_mean` step to the `tidymodels` preprocessing recipe.
add the `step_impute_mean` \index{recipe!step\_impute\_mean}\index{missing data!mean imputation}
step to the `tidymodels` preprocessing recipe.
```{r 05-impute, results=FALSE, message=FALSE, echo=TRUE}
impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |>
step_impute_mean(all_predictors()) |>
32 changes: 9 additions & 23 deletions source/classification2.Rmd
@@ -117,7 +117,7 @@ a single number. But prediction accuracy by itself does not tell the whole
story. In particular, accuracy alone only tells us how often the classifier
makes mistakes in general, but does not tell us anything about the *kinds* of
mistakes the classifier makes. A more comprehensive view of performance can be
obtained by additionally examining the **confusion matrix**. The confusion
obtained by additionally examining the **confusion matrix**. The confusion\index{confusion matrix}
matrix shows how many test set labels of each type are predicted correctly and
incorrectly, which gives us more detail about the kinds of mistakes the
classifier tends to make. Table \@ref(tab:confusion-matrix) shows an example
@@ -148,7 +148,8 @@ disastrous error, since it may lead to a patient who requires treatment not rece
Since we are particularly interested in identifying malignant cases, this
classifier would likely be unacceptable even with an accuracy of 89%.

Focusing more on one label than the other is
Focusing more on one label than the other
is\index{positive label}\index{negative label}\index{true positive}\index{false positive}\index{true negative}\index{false negative}
common in classification problems. In such cases, we typically refer to the label we are more
interested in identifying as the *positive* label, and the other as the
*negative* label. In the tumor example, we would refer to malignant
@@ -166,7 +167,7 @@ therefore, 100% accuracy). However, classifiers in practice will almost always
make some errors. So you should think about which kinds of error are most
important in your application, and use the confusion matrix to quantify and
report them. Two commonly used metrics that we can compute using the confusion
matrix are the **precision** and **recall** of the classifier. These are often
matrix are the **precision** and **recall** of the classifier.\index{precision}\index{recall} These are often
reported together with accuracy. *Precision* quantifies how many of the
positive predictions the classifier made were actually positive. Intuitively,
we would like a classifier to have a *high* precision: for a classifier with
@@ -582,7 +583,7 @@ We now know that the classifier was `r round(100*cancer_acc_1$.estimate, 0)`% ac
on the test data set, and had a precision of `r round(100*cancer_prec_1$.estimate, 0)`% and a recall of `r round(100*cancer_rec_1$.estimate, 0)`%.
That sounds pretty good! Wait, *is* it good? Or do we need something higher?

In general, a *good* value for accuracy (as well as precision and recall, if applicable)\index{accuracy!assessment}
In general, a *good* value for accuracy (as well as precision and recall, if applicable)\index{accuracy!assessment}\index{precision!assessment}\index{recall!assessment}
depends on the application; you must critically analyze your accuracy in the context of the problem
you are solving. For example, if we were building a classifier for a kind of tumor that is benign 99%
of the time, a classifier with 99% accuracy is not terribly impressive (just always guess benign!).
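
To make the baseline argument concrete, here is a minimal sketch in base R. The label counts (990 benign, 10 malignant) are made up for illustration and do not come from the cancer data set, and the "classifier" is an assumption of the example: it ignores the predictors and always guesses benign.

```r
# Hypothetical labels: 990 benign and 10 malignant tumors (99% benign).
labels <- factor(c(rep("Benign", 990), rep("Malignant", 10)))

# A "classifier" that ignores the predictors and always guesses benign.
predictions <- factor(rep("Benign", length(labels)), levels = levels(labels))

# Accuracy: the fraction of predictions that match the true label.
baseline_accuracy <- mean(predictions == labels)

# Recall for the malignant label: the fraction of malignant tumors
# that the classifier actually identified as malignant.
malignant_recall <- sum(predictions == "Malignant" & labels == "Malignant") /
  sum(labels == "Malignant")

baseline_accuracy
malignant_recall
```

In this made-up setting the accuracy is 99%, yet the recall for malignant tumors is 0, because the classifier never identifies a single malignant case. That is why a high accuracy has to be judged against what a trivial classifier would achieve on the same data.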
@@ -845,7 +846,7 @@ The `collect_metrics`\index{tidymodels!collect\_metrics}\index{cross-validation!
of the classifier's validation accuracy across the folds. You will find results
related to the accuracy in the row with `accuracy` listed under the `.metric` column.
You should consider the mean (`mean`) to be the estimated accuracy, while the standard
error (`std_err`) is a measure of how uncertain we are in the mean value. A detailed treatment of this
error (`std_err`) is\index{standard error}\index{sem|see{standard error}} a measure of how uncertain we are in the mean value. A detailed treatment of this
is beyond the scope of this chapter; but roughly, if your estimated mean is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2)` and standard
error is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2)`, you can expect the *true* average accuracy of the
classifier to be somewhere roughly between `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) - round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% and `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) + round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% (although it may
@@ -859,7 +860,7 @@ knn_fit |>
collect_metrics()
```
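
As a rough illustration of how the mean and standard error relate to the per-fold results, the sketch below computes both by hand in base R. The fold accuracies are made-up numbers, and the formula used (standard deviation of the fold estimates divided by the square root of the number of folds) is the usual standard error of a mean; treat it as an assumption about how `std_err` is obtained rather than a statement about the `tidymodels` internals.

```r
# Hypothetical validation accuracies from 5 folds (made-up numbers).
fold_accuracies <- c(0.88, 0.91, 0.86, 0.90, 0.90)

# Estimated accuracy: the mean across folds.
mean_accuracy <- mean(fold_accuracies)

# Standard error of the mean: sample standard deviation / sqrt(number of folds).
std_err <- sd(fold_accuracies) / sqrt(length(fold_accuracies))

# Rough range for the true average accuracy: one standard error either side.
c(lower = mean_accuracy - std_err, upper = mean_accuracy + std_err)
```

Because the standard error shrinks as the number of folds in the denominator grows, this also suggests why using more folds tends to give a less uncertain estimate.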

We can choose any number of folds, and typically the more we use the better our
We can choose any number of folds,\index{cross-validation!folds} and typically the more we use the better our
accuracy estimate will be (lower standard error). However, we are limited
by computational power: the
more folds we choose, the more computation it takes, and hence the more time
@@ -1180,6 +1181,7 @@ knn_fit
Then to make predictions and assess the estimated accuracy of the best model on the test data, we use the
`predict` and `metrics` functions as we did earlier in the chapter. We can then pass those predictions to
the `precision`, `recall`, and `conf_mat` functions to assess the estimated precision and recall, and print a confusion matrix.
\index{predict}\index{precision}\index{recall}\index{conf\_mat}

```{r 06-predictions-after-tuning, message = FALSE, warning = FALSE}
cancer_test_predictions <- predict(knn_fit, cancer_test) |>
@@ -1393,24 +1395,8 @@ accs <- accs |> unlist()
nghbrs <- nghbrs |> unlist()
fixedaccs <- fixedaccs |> unlist()

## get accuracy if we always just guess the most frequent label
#base_acc <- cancer_irrelevant |>
# group_by(Class) |>
# summarize(n = n()) |>
# mutate(frac = n/sum(n)) |>
# summarize(mx = max(frac)) |>
# select(mx)
#base_acc <- base_acc$mx |> unlist()

# plot
res <- tibble(ks = ks, accs = accs, fixedaccs = fixedaccs, nghbrs = nghbrs)
#res <- res |> mutate(base_acc = base_acc)
#plt_irrelevant_accuracies <- res |>
# ggplot() +
# geom_line(mapping = aes(x=ks, y=accs, linetype="Tuned K-NN")) +
# geom_hline(data=res, mapping=aes(yintercept=base_acc, linetype="Always Predict Benign")) +
# labs(x = "Number of Irrelevant Predictors", y = "Model Accuracy Estimate") +
# scale_linetype_manual(name="Method", values = c("dashed", "solid"))

plt_irrelevant_accuracies <- ggplot(res) +
geom_line(mapping = aes(x=ks, y=accs)) +
@@ -1533,7 +1519,7 @@ Therefore we will continue the rest of this section using forward selection.

### Forward selection in R

We now turn to implementing forward selection in R.
We now turn to implementing forward selection in R.\index{variable selection!implementation}
Unfortunately there is no built-in way to do this using the `tidymodels` framework,
so we will have to code it ourselves. First we will use the `select` function to extract a smaller set of predictors
to work with in this illustrative example&mdash;`Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`&mdash;as
20 changes: 4 additions & 16 deletions source/clustering.Rmd
@@ -164,7 +164,7 @@ library(tidyverse)
set.seed(1)
```

Now we can load and preview the `penguins` data.
Now we can load and preview the `penguins` data.\index{read function!read\_csv}

```{r message = FALSE, warning = FALSE}
penguins <- read_csv("data/penguins.csv")
@@ -295,7 +295,7 @@ improves it by making adjustments to the assignment of data
to clusters until it cannot improve any further. But how do we measure
the "quality" of a clustering, and what does it mean to improve it?
In K-means clustering, we measure the quality of a cluster
by its\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
by its\index{within-cluster sum of squared distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
Computing this involves two steps.
First, we find the cluster centers by computing the mean of each variable
over data points in the cluster. For example, suppose we have a
@@ -639,7 +639,7 @@ in the fourth iteration; both the centers and labels will remain the same from t

### Random restarts

Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart, nstart} can get "stuck" in a bad solution.
Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart} can get "stuck" in a bad solution.
For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.

```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3.25, fig.width = 3.75, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Random initialization of labels."}
@@ -910,7 +910,7 @@ set.seed(1)

We can perform K-means clustering in R using a `tidymodels` workflow similar
to those in the earlier classification and regression chapters.
We will begin by loading the `tidyclust`\index{tidyclust} library, which contains the necessary
We will begin by loading the `tidyclust`\index{K-means}\index{tidyclust} library, which contains the necessary
functionality.
```{r, echo = TRUE, warning = FALSE, message = FALSE}
library(tidyclust)
@@ -993,18 +993,6 @@ clustered_data <- kmeans_fit |>
clustered_data
```

<!--
If for some reason we need access to just the cluster assignments,
we can extract those from the fit as a data frame using
the `extract_cluster_assignment` function. Note that in this case,
the cluster assignments variable is named `.cluster`, while the `augment`
function earlier creates a variable named `.pred_cluster`.

```{r 10-kmeans-extract-clusterasgn}
extract_cluster_assignment(kmeans_fit)
```
-->

Now that we have the cluster assignments included in the `clustered_data` tidy data frame, we can
visualize them as shown in Figure \@ref(fig:10-plot-clusters-2).
Note that we are plotting the *un-standardized* data here; if we for some reason wanted to
5 changes: 3 additions & 2 deletions source/inference.Rmd
@@ -270,7 +270,7 @@ We first group the data by the `replicate` variable&mdash;to group the
set of listings in each sample together&mdash;and then use `summarize`
to compute the proportion in each sample.
We print both the first and last few entries of the resulting data frame
below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples.
below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples.\index{group\_by}\index{summarize}

```{r 11-example-proportions6, echo = TRUE, message = FALSE, warning = FALSE}
sample_estimates <- samples |>
@@ -381,7 +381,7 @@ one_sample <- airbnb |>

We can create a histogram to visualize the distribution of observations in the
sample (Figure \@ref(fig:11-example-means-sample-hist)), and calculate the mean
of our sample.
of our sample.\index{ggplot!geom\_histogram}

```{r 11-example-means-sample-hist, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Distribution of price per night (dollars) for sample of 40 Airbnb listings.", fig.height = 3.5, fig.width = 4.5}
sample_distribution <- ggplot(one_sample, aes(price)) +
@@ -1116,6 +1116,7 @@ To calculate a 95\% percentile bootstrap confidence interval, we will do the fol

To do this in R, we can use the `quantile()` function. Quantiles are expressed in proportions rather than
percentages, so the 2.5th and 97.5th percentiles would be the 0.025 and 0.975 quantiles, respectively.
\index{percentile}
\index{quantile}
\index{pull}
\index{select}
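
The conversion from percentiles to quantiles described above can be sketched in base R as follows. The bootstrap distribution here is simulated with `rnorm` purely for illustration (the mean of 150 and standard deviation of 5 are made-up values, not results from the Airbnb data), and `boot_means` is a hypothetical name.

```r
set.seed(1)

# Hypothetical bootstrap distribution: 1,000 simulated sample means.
boot_means <- rnorm(1000, mean = 150, sd = 5)

# The 2.5th and 97.5th percentiles are the 0.025 and 0.975 quantiles.
bounds <- quantile(boot_means, c(0.025, 0.975))
bounds
```

The result is a named vector whose two entries, labelled `2.5%` and `97.5%`, are the lower and upper bounds of the 95% percentile interval.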
4 changes: 2 additions & 2 deletions source/intro.Rmd
@@ -388,7 +388,7 @@ filtering the rows. A logical statement evaluates to either `TRUE` or `FALSE`;
`filter` keeps only those rows for which the logical statement evaluates to `TRUE`.
For example, in our analysis, we are interested in keeping only languages in the
"Aboriginal languages" higher-level category. We can use
the *equivalency operator* `==` \index{logical statement!equivalency operator} to compare the values
the *equivalency operator* `==` \index{logical operator!equivalency} to compare the values
of the `category` column with the value `"Aboriginal languages"`; you will learn about
many other kinds of logical statements in Chapter \@ref(wrangling). Similar to
when we loaded the data file and put quotes around the file name, here we need
@@ -590,7 +590,7 @@ Canadian Residents)" would be much more informative.
Adding additional layers \index{plot!layers} to our visualizations that we create in `ggplot` is
one common and easy way to improve and refine our data visualizations. New
layers are added to `ggplot` objects using the `+` symbol. For example, we can
use the `xlab` (short for x axis label) and `ylab` (short for y axis label) functions
use the `xlab` (short for x axis label) \index{ggplot!xlab} and `ylab` (short for y axis label) \index{ggplot!ylab} functions
to add layers where we specify meaningful
and informative labels for the x and y axes. \index{plot!axis labels} Again, since we are specifying
words (e.g. `"Mother Tongue (Number of Canadian Residents)"`) as arguments to
2 changes: 1 addition & 1 deletion source/jupyter.Rmd
@@ -377,7 +377,7 @@ right-clicking on the file's name in the Jupyter file explorer, selecting
**Open with**, and then selecting **Editor** (Figure \@ref(fig:open-data-w-editor-1)).
Suppose you do not specify to open
the data file with an editor. In that case, Jupyter will render a nice table
for you, and you will not be able to see the column delimiters, and therefore
for you, and you will not be able to see the column delimiters, \index{delimiter} and therefore
you will not know which function to use, nor which arguments to use and values
to specify for them.
