Merge pull request #444 from UBC-DSCI/dev
Update master with dev
trevorcampbell authored Jul 5, 2022
2 parents fec7a19 + 3dee232 commit 40741b6
Showing 17 changed files with 152 additions and 147 deletions.
14 changes: 8 additions & 6 deletions classification1.Rmd
@@ -170,6 +170,7 @@ total set of variables per image in this data set is:
11. Symmetry: how similar the nucleus is when mirrored
12. Fractal Dimension: a measurement of how "rough" the perimeter is

\pagebreak

Below we use `glimpse` \index{glimpse} to preview the data frame. This function can
make it easier to inspect the data when we have a lot of columns,
@@ -192,7 +193,7 @@ glimpse(cancer)
```

Recall that factors have what are called "levels", which you can think of as categories. We
can verify the levels of the `Class` column by using the `levels` \index{levels}\index{factor!levels} function.
can verify the levels of the `Class` column by using the `levels`\index{levels}\index{factor!levels} function.
This function should return the name of each category in that column. Given
that we only have two different values in our `Class` column (B for benign and M
for malignant), we only expect to get two names back. Note that the `levels` function requires a *vector* argument;
@@ -582,7 +583,6 @@ three predictors.
new_obs_Perimeter <- 0
new_obs_Concavity <- 3.5
new_obs_Symmetry <- 1
cancer |>
select(ID, Perimeter, Concavity, Symmetry, Class) |>
mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
@@ -846,8 +846,8 @@ loaded, and the standardized version of that same data. But first, we need to
standardize the `unscaled_cancer` data set with `tidymodels`.

In the `tidymodels` framework, all data preprocessing happens
using a `recipe` from [the `recipes` R package](https://recipes.tidymodels.org/) [@recipes]
Here we will initialize a recipe \index{recipe} \index{tidymodels!recipe|see{recipe}} for
using a `recipe` from [the `recipes` R package](https://recipes.tidymodels.org/) [@recipes].
Here we will initialize a recipe\index{recipe} \index{tidymodels!recipe|see{recipe}} for
the `unscaled_cancer` data above, specifying
that the `Class` variable is the target, and all other variables are predictors:
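
For illustration, a minimal sketch of what initializing such a recipe might look like (assuming `tidymodels` is loaded and `unscaled_cancer` is the data frame read in above; the object name `uc_recipe` is just a placeholder):

```r
# a sketch: Class is the target, and every other column is a predictor
uc_recipe <- recipe(Class ~ ., data = unscaled_cancer)
uc_recipe
```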

@@ -1296,7 +1296,7 @@ The `tidymodels` package collection also provides the `workflow`, a way to chain
To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data.
First we will load the data, create a model, and specify a recipe for how the data should be preprocessed:

```{r 05-workflow}
```{r 05-workflow, message = FALSE, warning = FALSE}
# load the unscaled cancer data
# and make sure the target Class variable is a factor
unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") |>
@@ -1320,7 +1320,7 @@ formula `Class ~ Area + Smoothness` (instead of `Class ~ .`) in the recipe.
You will also notice that we did not call `prep()` on the recipe; this is unnecessary when it is
placed in a workflow.

We will now place these steps in a `workflow` using the `add_recipe` and `add_model` functions, \index{tidymodels!add\_recipe}\index{tidymodels!add\_model}
We will now place these steps in a `workflow` using the `add_recipe` and `add_model` functions,\index{tidymodels!add\_recipe}\index{tidymodels!add\_model}
and finally we will use the `fit` function to run the whole workflow on the `unscaled_cancer` data.
Note another difference from earlier here: we do not include a formula in the `fit` function. This \index{tidymodels!fit}
is again because we included the formula in the recipe, so there is no need to respecify it:
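
As a rough sketch (the `uc_recipe` and `knn_spec` objects are assumed to have been created earlier in the chapter), the chained workflow might look something like this:

```r
# a sketch: chain the recipe and model spec into one workflow,
# then fit it on the unscaled data (no formula needed in fit)
knn_fit <- workflow() |>
  add_recipe(uc_recipe) |>
  add_model(knn_spec) |>
  fit(data = unscaled_cancer)

knn_fit
```
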
@@ -1364,6 +1364,8 @@ The basic idea is to create a grid of synthetic new observations using the `expa
predict the label of each, and visualize the predictions with a colored scatter having a very high transparency
(low `alpha` value) and large point radius. See if you can figure out what each line is doing!

\pagebreak

> **Note:** Understanding this code is not required for the remainder of the
> textbook. It is included for those readers who would like to use similar
> visualizations in their own data analyses.
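
A rough sketch of that grid-prediction idea is given below. It is not the textbook's code: the predictor names `Area` and `Smoothness`, the fitted workflow `knn_fit`, and the use of tidyr's `expand_grid` are all assumptions for illustration, with `tidyverse` and `tidymodels` assumed to be loaded.

```r
# 1. a grid of synthetic observations spanning the two predictors
prediction_grid <- expand_grid(
  Area = seq(min(cancer$Area), max(cancer$Area), length.out = 100),
  Smoothness = seq(min(cancer$Smoothness), max(cancer$Smoothness), length.out = 100)
)

# 2. predict the label of each synthetic observation
grid_predictions <- predict(knn_fit, prediction_grid) |>
  bind_cols(prediction_grid)

# 3. plot the predictions with high transparency and large points
ggplot(grid_predictions, aes(x = Area, y = Smoothness, color = .pred_class)) +
  geom_point(alpha = 0.05, size = 5)
```
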
33 changes: 16 additions & 17 deletions classification2.Rmd
@@ -120,7 +120,7 @@ in the analysis, would we not get a different result each time?
The trick is that in R&mdash;and other programming languages&mdash;randomness
is not actually random! Instead, R uses a *random number generator* that
produces a sequence of numbers that
are completely determined by a \index{seed} \index{random seed|see{seed}}
are completely determined by a\index{seed} \index{random seed|see{seed}}
*seed value*. Once you set the seed value
using the \index{seed!set.seed} `set.seed` function, everything after that point may *look* random,
but is actually totally reproducible. As long as you pick the same seed
@@ -134,34 +134,34 @@ Here, we pass in the number `1`.

```{r}
set.seed(1)
random_numbers <- sample(0:9, 10, replace=TRUE)
random_numbers
random_numbers1 <- sample(0:9, 10, replace=TRUE)
random_numbers1
```

You can see that `random_numbers` is a list of 10 numbers
You can see that `random_numbers1` is a list of 10 numbers
from 0 to 9 that, from all appearances, looks random. If
we run the `sample` function again, we will
get a fresh batch of 10 numbers that also look random.

```{r}
random_numbers <- sample(0:9, 10, replace=TRUE)
random_numbers
random_numbers2 <- sample(0:9, 10, replace=TRUE)
random_numbers2
```

If we want to force R to produce the same sequences of random numbers,
we can simply call the `set.seed` function again with the same argument
value.
value.

```{r}
set.seed(1)
random_numbers <- sample(0:9, 10, replace=TRUE)
random_numbers
random_numbers1_again <- sample(0:9, 10, replace=TRUE)
random_numbers1_again
random_numbers <- sample(0:9, 10, replace=TRUE)
random_numbers
random_numbers2_again <- sample(0:9, 10, replace=TRUE)
random_numbers2_again
```

And if we choose
Notice that after setting the seed, we get the same two sequences of numbers in the same order. `random_numbers1` and `random_numbers1_again` produce the same sequence of numbers, and the same can be said about `random_numbers2` and `random_numbers2_again`. And if we choose
a different value for the seed&mdash;say, 4235&mdash;we
obtain a different sequence of random numbers.

@@ -323,7 +323,7 @@ our test data does not influence any aspect of our model training. Once we have
created the standardization preprocessor, we can then apply it separately to both the
training and test data sets.

Fortunately, the `recipe` framework from `tidymodels` helps us handle \index{recipe}\index{recipe!step\_scale}\index{recipe!step\_center}
Fortunately, the `recipe` framework from `tidymodels` helps us handle\index{recipe}\index{recipe!step\_scale}\index{recipe!step\_center}
this properly. Below we construct and prepare the recipe using only the training
data (due to `data = cancer_train` in the first line).
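
A sketch of such a recipe is shown below; the two predictor columns are placeholders, and the key point is that `data = cancer_train` means the centering and scaling parameters are learned from the training set only.

```r
# a sketch: learn centering and scaling from the training data only
cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
```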

@@ -411,7 +411,6 @@ the table of predicted labels and correct labels, using the `conf_mat` function:
```{r 06-confusionmat}
confusion <- cancer_test_predictions |>
conf_mat(truth = Class, estimate = .pred_class)
confusion
```

@@ -497,7 +496,7 @@ for the application.
## Tuning the classifier

The vast majority of predictive models in statistics and machine learning have
*parameters*. A *parameter* \index{parameter}\index{tuning parameter|see{parameter}}
*parameters*. A *parameter*\index{parameter}\index{tuning parameter|see{parameter}}
is a number you have to pick in advance that determines
some aspect of how the model behaves. For example, in the $K$-nearest neighbors
classification algorithm, $K$ is a parameter that we have to pick
@@ -663,7 +662,7 @@ cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)
cancer_vfold
```

Then, when we create our data analysis workflow, we use the `fit_resamples` function \index{cross-validation!fit\_resamples}\index{tidymodels!fit\_resamples}
Then, when we create our data analysis workflow, we use the `fit_resamples` function\index{cross-validation!fit\_resamples}\index{tidymodels!fit\_resamples}
instead of the `fit` function for training. This runs cross-validation on each
train/validation split.
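
A hedged sketch of that substitution (assuming the `knn_spec`, `cancer_recipe`, and `cancer_vfold` objects from earlier in the chapter):

```r
# a sketch: fit_resamples trains and evaluates the workflow once per fold
knn_fit <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec) |>
  fit_resamples(resamples = cancer_vfold)
```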

@@ -689,7 +688,7 @@ knn_fit <- workflow() |>
knn_fit
```

The `collect_metrics` \index{tidymodels!collect\_metrics}\index{cross-validation!collect\_metrics} function is used to aggregate the *mean* and *standard error*
The `collect_metrics`\index{tidymodels!collect\_metrics}\index{cross-validation!collect\_metrics} function is used to aggregate the *mean* and *standard error*
of the classifier's validation accuracy across the folds. You will find results
related to the accuracy in the row with `accuracy` listed under the `.metric` column.
You should consider the mean (`mean`) to be the estimated accuracy, while the standard
33 changes: 16 additions & 17 deletions clustering.Rmd
@@ -32,7 +32,7 @@ including techniques to choose the number of clusters.
## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:

* Describe a case where clustering is appropriate,
* Describe a situation in which clustering is an appropriate technique to use,
and what insight it might extract from the data.
* Explain the K-means clustering algorithm.
* Interpret the output of a K-means analysis.
@@ -46,7 +46,7 @@ and do this using R.
limitations and assumptions of the K-means clustering algorithm.

## Clustering
Clustering \index{clustering} is a data analysis task
Clustering \index{clustering} is a data analysis technique
involving separating a data set into subgroups of related data.
For example, we might use clustering to separate a
data set of documents into groups that correspond to topics, a data set of
@@ -70,12 +70,11 @@ and examine the structure of data without any response variable labels
or values to help us.
This approach has both advantages and disadvantages.
Clustering requires no additional annotation or input on the data.
For example, it would be nearly impossible to annotate
all the articles on Wikipedia with human-made topic labels.
However, we can still cluster the articles without this information
For example, while it would be nearly impossible to annotate
all the articles on Wikipedia with human-made topic labels,
we can cluster the articles without this information
to find groupings corresponding to topics automatically.

Given that there is no response variable, it is not as easy to evaluate
However, given that there is no response variable, it is not as easy to evaluate
the "quality" of a clustering. With classification, we can use a test data set
to assess prediction performance. In clustering, there is not a single good
choice for evaluation. In this book, we will use visualization to ascertain the
@@ -248,9 +247,9 @@ It starts with an initial clustering of the data, and then iteratively
improves it by making adjustments to the assignment of data
to clusters until it cannot improve any further. But how do we measure
the "quality" of a clustering, and what does it mean to improve it?
In K-means clustering, we measure the quality of a cluster by its
\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD}
*within-cluster sum-of-squared-distances* (WSSD). Computing this involves two steps.
In K-means clustering, we measure the quality of a cluster
by its\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
Computing this involves two steps.
First, we find the cluster centers by computing the mean of each variable
over data points in the cluster. For example, suppose we have a
cluster containing four observations, and we are using two variables, $x$ and $y$, to cluster the data.
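
As a small illustration with made-up numbers, the center and WSSD of one such four-observation cluster could be computed as follows:

```r
library(tidyverse)

# four hypothetical observations assigned to a single cluster
toy_cluster <- tibble(x = c(1, 2, 3, 2), y = c(4, 3, 5, 4))

# step 1: the cluster center is the mean of each variable
center_x <- mean(toy_cluster$x)
center_y <- mean(toy_cluster$y)

# step 2: the WSSD is the sum of squared distances to that center
wssd <- sum((toy_cluster$x - center_x)^2 + (toy_cluster$y - center_y)^2)
wssd
```
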
@@ -839,7 +838,7 @@ p1
```

If we set K less than 3, then the clustering merges separate groups of data; this causes a large
total WSSD, since the cluster center (denoted by an "x") is not close to any of the data in the cluster. On
total WSSD, since the cluster center is not close to any of the data in the cluster. On
the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly
@@ -890,7 +889,7 @@ not_standardized_data
```

And then we apply the `scale` function to every column in the data frame
using `mutate` + `across`.
using `mutate` and `across`.

```{r 10-mapdf-scale-data}
standardized_data <- not_standardized_data |>
@@ -903,8 +902,8 @@ standardized_data

To perform K-means clustering in R, we use the `kmeans` function. \index{K-means!kmeans function} It takes at
least two arguments: the data frame containing the data you wish to cluster,
and K, the number of clusters (here we choose K = 3). Note that since the K-means
algorithm uses a random initialization of assignments, but since we set the random seed
and K, the number of clusters (here we choose K = 3). Note that the K-means
algorithm uses a random initialization of assignments; but since we set the random seed
earlier, the clustering will be reproducible.
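
A minimal sketch of the call (assuming `standardized_data` contains only the standardized clustering variables):

```r
# a sketch: cluster the standardized data into K = 3 groups
penguin_clust <- kmeans(standardized_data, centers = 3)
penguin_clust
```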

```{r 10-kmeans-seed, echo = FALSE, warning = FALSE, message = FALSE}
@@ -1000,8 +999,8 @@ penguin_clust_ks
If we wanted to get one of the clusterings out
of the list column in the data frame,
we could use a familiar friend: `pull`.
`pull` will return to us a data frame column as a simpler data structure,
here that would be a list.
`pull` will return to us a data frame column as a simpler data structure;
here, that would be a list.
And then to extract the first item of the list,
we can use the `pluck` function. We pass
it the index for the element we would like to extract
@@ -1074,7 +1073,7 @@ The more times we perform K-means clustering,
the more likely we are to find a good clustering (if one exists).
What value should you choose for `nstart`? The answer is that it depends
on many factors: the size and characteristics of your data set,
as well as the speed and size of your computer.
as well as how powerful your computer is.
The larger the `nstart` value the better from an analysis perspective,
but there is a trade-off that doing many clusterings
could take a long time.
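
For example, a sketch of asking for 10 restarts (again assuming `standardized_data` from above):

```r
# a sketch: run K-means from 10 random starts and keep the best clustering
penguin_clust <- kmeans(standardized_data, centers = 3, nstart = 10)
```
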
Binary file modified img/intro-bootstrap.jpeg
Binary file modified img/pivot_functions.key
Binary file not shown.
Binary file modified img/pivot_functions/pivot_functions.002.jpeg
Binary file modified img/tidydata_bootstrap_train_test_images.key
Binary file not shown.
16 changes: 8 additions & 8 deletions inference.Rmd
@@ -39,7 +39,7 @@ By the end of the chapter, readers will be able to do the following:

* Describe real-world examples of questions that can be answered with statistical inference.
* Define common population parameters (e.g., mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample.
* Define the following statistical sampling terms (population, sample, population parameter, point estimate, sampling distribution).
* Define the following statistical sampling terms: population, sample, population parameter, point estimate, and sampling distribution.
* Explain the difference between a population parameter and a sample point estimate.
* Use R to draw random samples from a finite population.
* Use R to create a sampling distribution from a finite population.
@@ -90,14 +90,14 @@ knitr::include_graphics("img/population_vs_sample.png")
Note that proportions are not the *only* kind of population parameter we might
be interested in. For example, suppose an undergraduate student studying at the University
of British Columbia in Canada is looking for an apartment
to rent. They need to create a budget, so they want to know something about
studio apartment rental prices in Vancouver, BC. This student might
formulate the following question:
to rent. They need to create a budget, so they want to know about
studio apartment rental prices in Vancouver. This student might
formulate the question:

*What is the average price-per-month of studio apartment rentals in Vancouver, Canada?*
*What is the average price per month of studio apartment rentals in Vancouver?*

In this case, the population consists of all studio apartment rentals in Vancouver, and the
population parameter is the *average price-per-month*. Here we used the average
population parameter is the *average price per month*. Here we used the average
as a measure of the center to describe the "typical value" of studio apartment
rental prices. But even within this one example, we could also be interested in
many other population parameters. For instance, we know that not every studio
@@ -1148,9 +1148,9 @@ boot_est_dist +

To finish our estimation of the population parameter, we would report the point
estimate and our confidence interval's lower and upper bounds. Here the sample
mean price-per-night of 40 Airbnb listings was
mean price per night of 40 Airbnb listings was
\$`r round(mean(one_sample$price),2)`, and we are 95\% "confident" that the true
population mean price-per-night for all Airbnb listings in Vancouver is between
population mean price per night for all Airbnb listings in Vancouver is between
\$(`r round(bounds[1],2)`, `r round(bounds[2],2)`).
Notice that our interval does indeed contain the true
population mean value, \$`r round(mean(airbnb$price),2)`\! However, in
10 changes: 5 additions & 5 deletions intro.Rmd
@@ -25,10 +25,10 @@ By the end of the chapter, readers will be able to do the following:
- Identify the different types of data analysis question and categorize a question into the correct type.
- Load the `tidyverse` package into R.
- Read tabular data with `read_csv`.
- Use `?` to access help and documentation tools in R.
- Create new variables and objects in R using the assignment symbol.
- Create and organize subsets of tabular data using `filter`, `select`, `arrange`, and `slice`.
- Visualize data with a `ggplot` bar plot.
- Use `?` to access help and documentation tools in R.

## Canadian languages data set

@@ -312,7 +312,7 @@ to be surrounded by quotes.
After making the assignment, we can use the special name words we have created in
place of their values. For example, if we want to do something with the value `3` later on,
we can just use `my_number` instead. Let's try adding 2 to `my_number`; you will see that
R just interprets this as adding 2 and 3:
R just interprets this as adding 3 and 2:
```{r naming-things2}
my_number + 2
```
@@ -374,7 +374,7 @@ Aboriginal languages in the data set, and then use `select` to obtain only the
columns we want to include in our table.

### Using `filter` to extract rows
Looking at the `can_lang` data above, we see the column `category` contains different
Looking at the `can_lang` data above, we see the `category` column contains different
high-level categories of languages, which include "Aboriginal languages",
"Non-Official & Non-Aboriginal languages" and "Official languages". To answer
our question we want to filter our data set so we restrict our attention
@@ -528,7 +528,7 @@ image_read("img/ggplot_function.jpeg") |>
image_crop("1625x1900")
```

```{r barplot-mother-tongue, fig.width=5, fig.height=3, warning=FALSE, fig.cap = "Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. Note that this visualization is not done yet; there are still improvements to be made."}
```{r barplot-mother-tongue, fig.width=5, fig.height=3.1, warning=FALSE, fig.cap = "Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. Note that this visualization is not done yet; there are still improvements to be made."}
ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
geom_bar(stat = "identity")
```
@@ -687,7 +687,7 @@ Figure \@ref(fig:01-help) shows the documentation that will pop up,
including a high-level description of the function, its arguments,
a description of each, and more. Note that you may find some of the
text in the documentation a bit too technical right now
(for example, what is `dbplyr`, and what is grouped data?).
(for example, what is `dbplyr`, and what is a lazy data frame?).
Fear not: as you work through this book, many of these terms will be introduced
to you, and slowly but surely you will become more adept at understanding and navigating
documentation like that shown in Figure \@ref(fig:01-help). But do keep in mind that the documentation