# Convolutional neural networks {#dlcnn}
```{r include = FALSE}
library(keras)
knitr::opts_chunk$set(message = FALSE,
warning = FALSE, error = TRUE)
tensorflow::tf$random$set_seed(1234)
junk <- keras_model_sequential()
hook_output = knit_hooks$get('output')
knit_hooks$set(output = function(x, options) {
# this hook is used only when the linewidth option is not NULL
if (!is.null(n <- options$linewidth)) {
x = knitr:::split_lines(x)
# any lines wider than n should be wrapped
if (any(nchar(x) > n)) x = strwrap(x, width = n)
x = paste(x, collapse = '\n')
}
hook_output(x, options)
})
```
The first neural networks\index{network architecture} we built in Chapter \@ref(dldnn) did not have the capacity to learn much about structure, sequences, or long-range dependencies in our text data. The LSTM networks we trained in Chapter \@ref(dllstm) were especially suited to learning long-range dependencies. In this final chapter, we will focus on \index{neural network!convolutional} **convolutional neural network** (CNN) architecture [@kim2014], which can learn local, spatial structure within data.
CNNs can be well-suited for modeling text data because text often contains quite a lot of local structure. A CNN does not learn long-range structure within a sequence the way an LSTM does; instead, it detects local patterns. A CNN layer takes data (like text) as input and then hopefully produces output that represents specific structures\index{language!structure} in the data.
```{block, type = "rmdnote"}
Let's take more time with CNNs in this chapter to explore their construction, different features, and the hyperparameters we can tune.
```
## What are CNNs?
CNNs can work with data of different dimensions (like two-dimensional images or three-dimensional video), but for text modeling, we typically work in one dimension. The illustrations and explanations in this chapter use only one dimension to match the text use case.
Figure \@ref(fig:cnn-architecture) illustrates a typical CNN architecture.
A convolutional filter slides along the sequence to produce a new, smaller sequence. This is repeated multiple times, typically with different parameters for each layer, until we are left with a small data cube that we can transform into our required output shape, a value between 0 and 1 in the case of binary classification.
```{r cnn-architecture, echo = FALSE, fig.cap="A template CNN architecture for one-dimensional input data. A sequence of consecutive CNN layers incrementally reduces the size, ending with a single output value.", out.width="100%"}
knitr::include_graphics("diagram-files/cnn-architecture.png")
```
This figure isn't entirely accurate because we technically don't feed characters into a CNN, but instead use one-hot sequence encoding (Section \@ref(onehotsequence)) with a possible word embedding.
Let's talk about two of the most important CNN concepts, **kernels** and **kernel size**.
### Kernel
The kernel is a small vector that slides along the input. When it is sliding, it performs element-wise multiplication of the values in the input and its own weights, and then sums up the values to get a single value.
Sometimes an activation function is applied as well.
It is these weights that are trained via gradient descent to find the best fit.
In Keras, the `filters` represent how many different kernels are trained in each layer. You typically start with fewer `filters` at the beginning of your network and then increase them as you go along.
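To make this sliding computation concrete, below is a minimal toy sketch (ours, not part of the book's modeling code) of what a single kernel does; the input sequence and the kernel weights are made-up values standing in for real data and learned weights.
```{r}
# Toy sketch of a single 1-d convolution kernel: slide along the input,
# multiply element-wise with the weights, and sum to get one output value.
# Both the input values and the weights are made up for illustration.
input_seq <- c(0.2, -1.0, 0.5, 0.8, -0.3, 1.2, 0.1)
kernel_weights <- c(0.4, -0.2, 0.7, 0.1, -0.5)  # a kernel of size 5 has 5 weights

convolve_1d <- function(x, w) {
  n_out <- length(x) - length(w) + 1
  vapply(
    seq_len(n_out),
    function(i) sum(x[i:(i + length(w) - 1)] * w),
    numeric(1)
  )
}

convolve_1d(input_seq, kernel_weights)  # a length 7 input gives a length 3 output
```
In a real convolutional layer these weights are learned during training, and an activation function may be applied to each output value.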
### Kernel size
The most prominent hyperparameter is the kernel size.
The kernel size is the length of the vector that contains the weights. A kernel of size 5 will have 5 weights. These kernels can capture local information, similar to how n-grams capture local patterns. Increasing the size of the kernel decreases the size of the output, as shown in Figure \@ref(fig:cnn-kernel-size).
```{r cnn-kernel-size, echo= FALSE, fig.cap="The kernel size affects the size of the output. A kernel size of 3 uses the information from 3 values to compute 1 value.", out.width="100%"}
knitr::include_graphics("diagram-files/cnn-kernel-size.png")
```
Larger kernels learn larger and less frequent patterns, while smaller kernels will find fine-grained features.
Notice how the choice of token affects how we think about kernel size.
For character tokenization, a kernel size of 5 will (in early layers) find patterns in subwords more often than patterns across words, since 5 characters will typically not span multiple words.
By contrast, a kernel size of 5 with word tokenization will learn patterns within sentences instead.
## A first CNN model {#firstcnn}
\index{neural network!convolutional}We will be using the same data, which we examine in Sections \@ref(kickstarter) and \@ref(kickstarter-blurbs) and use throughout Chapters \@ref(dldnn) and \@ref(dllstm). This data set contains short text blurbs for prospective crowdfunding campaigns on Kickstarter, along with whether they were successful. Our modeling goal is to predict successful campaigns from the text contained in the blurb. We will also use the same \index{preprocessing}preprocessing and feature engineering recipe that we created and described in Sections \@ref(dnnrecipe) and \@ref(firstlstm).
```{r include=FALSE}
library(tidyverse)
kickstarter <- read_csv("data/kickstarter.csv.gz")
kickstarter
library(tidymodels)
set.seed(1234)
kickstarter_split <- kickstarter %>%
filter(nchar(blurb) >= 15) %>%
initial_split()
kickstarter_train <- training(kickstarter_split)
kickstarter_test <- testing(kickstarter_split)
library(textrecipes)
max_words <- 20000
max_length <- 30
kick_rec <- recipe(~blurb, data = kickstarter_train) %>%
step_tokenize(blurb) %>%
step_tokenfilter(blurb, max_tokens = max_words) %>%
step_sequence_onehot(blurb, sequence_length = max_length)
set.seed(234)
kick_val <- validation_split(kickstarter_train, strata = state)
kick_prep <- prep(kick_rec)
kick_analysis <- bake(kick_prep, new_data = analysis(kick_val$splits[[1]]),
composition = "matrix")
kick_assess <- bake(kick_prep, new_data = assessment(kick_val$splits[[1]]),
composition = "matrix")
state_analysis <- analysis(kick_val$splits[[1]]) %>% pull(state)
state_assess <- assessment(kick_val$splits[[1]]) %>% pull(state)
```
Our first CNN will look a lot like what is shown in Figure \@ref(fig:cnn-architecture).
We start with an embedding layer, followed by a single one-dimensional convolution layer `layer_conv_1d()`, then a global max pooling layer `layer_global_max_pooling_1d()`, a densely connected layer, and end with a dense layer with a sigmoid activation function to give us one value between 0 and 1 to use in our binary classification task.
```{r}
library(keras)
simple_cnn_model <- keras_model_sequential() %>%
layer_embedding(input_dim = max_words + 1, output_dim = 16,
input_length = max_length) %>%
layer_conv_1d(filter = 32, kernel_size = 5, activation = "relu") %>%
layer_global_max_pooling_1d() %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
simple_cnn_model
```
We are using the same embedding layer with the same `max_length` as in the previous networks so there is nothing new there.
The `layer_global_max_pooling_1d()` layer collapses the remaining CNN output into one dimension so we can finish it off with a densely connected layer and the sigmoid activation function.
This might not end up being the best CNN configuration, but it is a good starting point.
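As a rough illustration (not part of the book's modeling code) of what `layer_global_max_pooling_1d()` does, think of the convolutional output as a matrix with one row per sequence position and one column per filter; global max pooling keeps only the largest activation in each column. The numbers below are random stand-ins for real activations.
```{r}
# Rough illustration of global max pooling: collapse the sequence dimension
# by keeping the maximum activation for each filter.
set.seed(1)
conv_output <- matrix(runif(26 * 32), nrow = 26, ncol = 32)  # 26 positions, 32 filters
pooled <- apply(conv_output, 2, max)
length(pooled)  # 32: one value per filter, ready for the dense layers
```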
One of the challenges when working with CNNs is to ensure that we manage the dimensionality correctly. The length of the sequence decreases by `(kernel_size - 1)` for each layer. For this input, we have a sequence of length `max_length = 30`, which is decreased by `(5 - 1) = 4`, resulting in a sequence of length 26, as shown in the printed output of `simple_cnn_model`. We could stack seven layers with `kernel_size = 5`, since we would end with `30 - 4 - 4 - 4 - 4 - 4 - 4 - 4 = 2` elements in the resulting sequence. However, we could not build a network with 3 layers of `kernel_size = 7` followed by 3 layers of `kernel_size = 5`, since the resulting sequence would be `30 - 6 - 6 - 6 - 4 - 4 - 4 = 0` and we must have a positive length for our sequence.
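A small helper function (ours, not part of the book's code) makes it easy to check this arithmetic for any stack of kernel sizes, assuming a stride of 1 and no padding.
```{r}
# Sequence length after stacking 1-d convolutions with stride 1 and no
# padding: each layer shrinks the length by (kernel_size - 1).
seq_length_after <- function(input_length, kernel_sizes) {
  Reduce(function(len, k) len - (k - 1), kernel_sizes, init = input_length)
}

seq_length_after(30, rep(5, 7))            # 2, so seven layers with kernel_size = 5 work
seq_length_after(30, c(7, 7, 7, 5, 5, 5))  # 0, so this stack is not valid
```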
Remember that `kernel_size` is not the only argument that will change the length of the resulting sequence. \index{network architecture}
```{block, type = "rmdnote"}
Constructing a sequential model layer by layer and using the print method from **keras** to check the configuration is a great way to make sure your architecture is valid.
```
The compilation and fitting are the same as we have seen before, using a validation split created with tidymodels as shown in Sections \@ref(evaluate-dnn) and \@ref(lstmevaluation).
```{r}
simple_cnn_model %>% compile(
optimizer = "adam",
loss = "binary_crossentropy",
metrics = c("accuracy")
)
cnn_history <- simple_cnn_model %>% fit(
x = kick_analysis,
y = state_analysis,
batch_size = 512,
epochs = 10,
validation_data = list(kick_assess, state_assess)
)
```
\index{optimization algorithm}
```{block, type = "rmdnote"}
We are using the `"adam"` optimizer since it performs well for many kinds of models. You may have to experiment to find the optimizer that works best for your model and data.
```
Now that the model is done fitting, we can evaluate it on the validation data set using the same `keras_predict()` function we created in Section \@ref(evaluate-dnn) and used throughout Chapters \@ref(dldnn) and \@ref(dllstm).
```{r}
val_res <- keras_predict(simple_cnn_model, kick_assess, state_assess)
val_res
```
We can calculate some standard metrics with `metrics()`.
```{r}
metrics(val_res, state, .pred_class, .pred_1)
```
We already see improvement over the densely connected network from Chapter \@ref(dldnn), our best performing model on the Kickstarter data so far.
The heatmap\index{matrix!confusion} in Figure \@ref(fig:cnnheatmap) shows that the model performs about the same for the two classes, success and failure, of the crowdfunding campaigns; we are getting fairly good results from a baseline CNN model!
```{r cnnheatmap, fig.cap="Confusion matrix for first CNN model predictions of Kickstarter campaign success"}
val_res %>%
conf_mat(state, .pred_class) %>%
autoplot(type = "heatmap")
```
The ROC curve in Figure \@ref(fig:cnnroccurve) shows how the model performs at different thresholds.
```{r cnnroccurve, opts.label = "fig.square", fig.cap="ROC curve for first CNN model predictions of Kickstarter campaign success"}
val_res %>%
roc_curve(truth = state, .pred_1) %>%
autoplot() +
labs(
title = "Receiver operator curve for Kickstarter blurbs"
)
```
## Case study: adding more layers
Now that we know how our basic CNN performs, we can see what happens when we apply some common modifications to it.
This case study will examine:
- how we can add additional _convolutional_ layers to our base model and
- how additional _dense_ layers can be added.
\index{network architecture}Let's start by adding another fully connected layer. We take the architecture we used in `simple_cnn_model` and add another `layer_dense()` after the first `layer_dense()` in the model.
Increasing the depth of the model via the fully connected layers allows the model to find more complex patterns.
There is, however, a trade-off. Adding more layers adds more weights to the model, making it more complex and harder to train. If you don't have enough data or the patterns you are trying to classify aren't that complex, then model performance will suffer since the model will start overfitting as it starts memorizing patterns in the training data that don't generalize to new data.
```{block, type = "rmdwarning"}
When working with CNNs, the different layers perform different tasks. A convolutional layer extracts local patterns as it slides along the sequences, while a fully connected layer finds global patterns.
```
We can think of the convolutional layers as doing preprocessing\index{preprocessing} on the text, which is then fed into the dense neural network that tries to fit the best curve. Adding more fully connected layers allows the network to create more intricate curves, and adding more convolutional layers creates richer features that are used when fitting the curves. Your job when constructing a CNN is to make the architecture just complex enough to match the data without overfitting. One ad-hoc rule to follow when refining your network architecture is to start small and keep adding layers until the validation error does not improve anymore.
```{r}
cnn_double_dense <- keras_model_sequential() %>%
layer_embedding(input_dim = max_words + 1, output_dim = 16,
input_length = max_length) %>%
layer_conv_1d(filter = 32, kernel_size = 5, activation = "relu") %>%
layer_global_max_pooling_1d() %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
cnn_double_dense
```
We can compile and fit this new model. We will try to keep as much as we can constant as we compare the different models.
```{r}
cnn_double_dense %>% compile(
optimizer = "adam",
loss = "binary_crossentropy",
metrics = c("accuracy")
)
history <- cnn_double_dense %>% fit(
x = kick_analysis,
y = state_analysis,
batch_size = 512,
epochs = 10,
validation_data = list(kick_assess, state_assess)
)
```
```{r}
val_res_double_dense <- keras_predict(
cnn_double_dense,
kick_assess,
state_assess
)
metrics(val_res_double_dense, state, .pred_class, .pred_1)
```
This model performs well, but it is not entirely clear that it is working much better than the first CNN model we tried. This could be an indication that the original model had enough fully connected layers for the amount of training data we have available.
```{block, type = "rmdwarning"}
If we have two models with nearly identical performance, we should choose the less complex of the two, since it will train and predict faster.
```
We can also increase the number of convolutional layers by adding more of them to the network.
```{r}
cnn_double_conv <- keras_model_sequential() %>%
layer_embedding(input_dim = max_words + 1, output_dim = 16,
input_length = max_length) %>%
layer_conv_1d(filter = 32, kernel_size = 5, activation = "relu") %>%
layer_max_pooling_1d(pool_size = 2) %>%
layer_conv_1d(filter = 64, kernel_size = 3, activation = "relu") %>%
layer_global_max_pooling_1d() %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
cnn_double_conv
```
There are a lot of different ways we can extend the network by adding convolutional layers with `layer_conv_1d()`. We must consider the individual characteristics of each layer, with respect to kernel size, as well as other CNN parameters we have not yet discussed in detail, such as stride, padding, and dilation rate. We also have to consider the progression of these layers within the network itself.
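As a hedged sketch of how some of these arguments interact, the standard output-length formula for a 1-d convolution with "valid" (no) padding is `floor((length - dilation * (kernel_size - 1) - 1) / stride) + 1`; the helper below is ours, not part of the book's code.
```{r}
# Output length of a 1-d convolution with "valid" (no) padding; larger
# kernels, strides, and dilation rates all shrink the output further.
conv_output_length <- function(input_length, kernel_size,
                               stride = 1, dilation = 1) {
  floor((input_length - dilation * (kernel_size - 1) - 1) / stride) + 1
}

conv_output_length(30, kernel_size = 5)                # 26, as in simple_cnn_model
conv_output_length(30, kernel_size = 5, stride = 2)    # 13
conv_output_length(30, kernel_size = 5, dilation = 2)  # 22
```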
This model uses an increasing number of filters as the layers progress, doubling from 32 to 64. This is to ensure that there are more filters later on to capture enough of the global information.
This model uses two different kernel sizes: 5 in the first convolutional layer and 3 in the second. There aren't any hard rules about how you structure kernel sizes, but the sizes you choose will change what features the model can detect.
```{block, type = "rmdnote"}
The early layers extract general, low-level features, while the later layers combine them into more abstract, high-level features. The choice of kernel size determines the size of these features.
```
Having a small kernel size in the first layer will let the model detect low-level features locally.
We are also including a max-pooling layer with `layer_max_pooling_1d()` between the convolutional layers. This layer performs a pooling operation that takes the maximum value in each pooling window; in this model, the window size is set to 2.
This is done in the hope that the pooled features will perform better, since pooling keeps the strongest activations and discards the weaker ones.
The pooling window size is another parameter you can tinker with when you are designing the network.
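For intuition, here is a tiny sketch (not part of the book's modeling code) of what `layer_max_pooling_1d(pool_size = 2)` does to a single filter's activations; the values are made up.
```{r}
# Toy max pooling with a window of 2: each non-overlapping pair of
# activations is collapsed to its maximum.
activations <- c(0.1, 0.7, 0.3, 0.2, 0.9, 0.4)
groups <- rep(seq_len(length(activations) / 2), each = 2)
tapply(activations, groups, max)  # 0.7 0.3 0.9
```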
We compile this model like the others, again trying to keep as much as we can constant. The only thing that changed in this model compared to the first is the addition of a `layer_max_pooling_1d()` and a `layer_conv_1d()`.
```{r}
cnn_double_conv %>% compile(
optimizer = "adam",
loss = "binary_crossentropy",
metrics = c("accuracy")
)
history <- cnn_double_conv %>% fit(
x = kick_analysis,
y = state_analysis,
batch_size = 512,
epochs = 10,
validation_data = list(kick_assess, state_assess)
)
```
```{r}
val_res_double_conv <- keras_predict(
cnn_double_conv,
kick_assess,
state_assess
)
metrics(val_res_double_conv, state, .pred_class, .pred_1)
```
This model also performs well compared to earlier results. Let's combine the predictions from all three models, each computed with the `keras_predict()` function we defined in Section \@ref(evaluate-dnn), so we can compare them.
```{r}
all_cnn_model_predictions <- bind_rows(
mutate(val_res, model = "Basic CNN"),
mutate(val_res_double_dense, model = "Double Dense"),
mutate(val_res_double_conv, model = "Double Conv")
)
all_cnn_model_predictions
```
Now that the results are combined in `all_cnn_model_predictions`, we can calculate group-wise evaluation statistics by grouping by the `model` variable.
```{r}
all_cnn_model_predictions %>%
group_by(model) %>%
metrics(state, .pred_class, .pred_1)
```
We can also compute ROC curves for all our models so far. Figure \@ref(fig:allcnnroccurve) shows the three different ROC curves together in one chart.
```{r allcnnroccurve, opts.label = "fig.square", fig.cap="ROC curve for three CNN variants' predictions of Kickstarter campaign success"}
all_cnn_model_predictions %>%
group_by(model) %>%
roc_curve(truth = state, .pred_1) %>%
autoplot() +
labs(
title = "Receiver operator curve for Kickstarter blurbs"
)
```
The curves are _very_ close in this chart, indicating that we don't have much to gain by adding more layers and that they don't improve performance substantively.
This doesn't mean that we are done with CNNs! There are still many things we can explore, like different tokenization approaches and hyperparameters that can be trained.
## Case study: byte pair encoding
\index{tokenization!subword}In our models in this chapter so far we have used words as the token of interest. We saw in Section \@ref(casestudyngrams) how n-grams can be used in modeling as well.
One of the reasons the Kickstarter data set is hard to work with is that the text is quite short, so we don't have many individual tokens to work with in a given blurb.
Another choice of token is _subwords_, where we split the text into smaller units than words; longer words especially will be broken into multiple subword units. One way to tokenize text into subword units is _byte pair encoding_ [@Gage1994ANA].
This algorithm was originally developed for data compression and has been repurposed to work on text by iteratively merging frequently occurring subword pairs.
Methods such as [BERT](https://github.com/google-research/bert) and [GPT-2](https://openai.com/blog/better-language-models/) use subword units for text with great success.
The byte pair encoding algorithm has a hyperparameter controlling the size of the vocabulary. Setting it to higher values allows the models to find more rarely used character sequences in the text.
Byte pair encoding offers a good trade-off between character-level and word-level information, and can also encode unknown words. For example, suppose that the model is aware of the word "woman". A simple tokenizer would have to put a word such as "womanhood" into an unknown bucket or ignore it completely, whereas byte pair encoding should be able to pick up on the subwords "woman" and "hood" (or "woman", "h", and "ood", depending on whether the model found "hood" as a common enough subword).
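If you want to see this behavior directly, the sketch below (not part of the book's code) trains a small byte pair encoding model with the **tokenizers.bpe** package, which textrecipes wraps, and encodes an example word. The `vocab_size` of 2,500 is an arbitrary choice, and the exact subword splits depend on the learned vocabulary, so treat this as illustrative only.
```{r, eval=FALSE}
# Hedged sketch: learn a byte pair encoding vocabulary from the training
# blurbs and inspect how an example word is split into subwords.
library(tokenizers.bpe)

bpe_model <- bpe(kickstarter_train$blurb, vocab_size = 2500)
bpe_encode(bpe_model, x = "womanhood", type = "subwords")
```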
Using a subword tokenizer such as byte pair encoding should let us see the text with more granularity since we will have more and smaller tokens for each observation.
```{block2, type = "rmdnote"}
Character-level CNNs have also proven successful in some contexts. They have been explored by @Zhang2015 and work quite well on some shorter texts such as headlines and tweets [@Vosoughi2016].
```
We need to remind ourselves that these models don't contain any linguistic knowledge at all; they only "learn" the morphological\index{morphology} patterns of sequences of characters (Section \@ref(morphology)) in the training set. This does not make the models useless, but it should set our expectations about what any given model is capable of.
Since we are using a completely different preprocessing approach, we need to specify a new feature engineering recipe.
```{block, type = "rmdpackage"}
The **textrecipes** package has a tokenization engine to perform byte pair encoding, but we need to determine the vocabulary size and the appropriate sequence length.
```
Let's write a function that takes a character vector and a vocabulary size and returns a dataframe with the number of tokens in each observation.
```{r}
library(textrecipes)
get_bpe_token_dist <- function(vocab_size, x) {
recipe(~text, data = tibble(text = x)) %>%
step_mutate(text = tolower(text)) %>%
step_tokenize(text,
engine = "tokenizers.bpe",
training_options = list(vocab_size = vocab_size)) %>%
prep() %>%
bake(new_data = NULL) %>%
transmute(n_tokens = lengths(textrecipes:::get_tokens(text)),
vocab_size = vocab_size)
}
```
We can use `map()` to try a handful of different vocabulary sizes.
```{r}
bpe_token_dist <- map_dfr(
c(2500, 5000, 10000, 20000),
get_bpe_token_dist,
kickstarter_train$blurb
)
bpe_token_dist
```
If we compare with the word count distribution we saw in Figure \@ref(fig:kickstarterwordlength), then we see in Figure \@ref(fig:kickstartersubwordlength) that any of these choices for vocabulary size will result in more tokens overall.
```{r kickstartersubwordlength, fig.cap="Distribution of subword count for Kickstarter campaign blurbs for different vocabulary sizes"}
bpe_token_dist %>%
ggplot(aes(n_tokens)) +
geom_bar() +
facet_wrap(~vocab_size) +
labs(x = "Number of subwords per campaign blurb",
y = "Number of campaign blurbs")
```
Let's pick a vocabulary size of 10,000 and a corresponding sequence length of 40. To use byte pair encoding as a tokenizer in textrecipes, set `engine = "tokenizers.bpe"`; the vocabulary size can be set using the `training_options` argument. Everything else in the recipe stays the same.
```{r}
max_subwords <- 10000
bpe_max_length <- 40
bpe_rec <- recipe(~blurb, data = kickstarter_train) %>%
step_mutate(blurb = tolower(blurb)) %>%
step_tokenize(blurb,
engine = "tokenizers.bpe",
training_options = list(vocab_size = max_subwords)) %>%
step_sequence_onehot(blurb, sequence_length = bpe_max_length)
bpe_prep <- prep(bpe_rec)
bpe_analysis <- bake(bpe_prep, new_data = analysis(kick_val$splits[[1]]),
composition = "matrix")
bpe_assess <- bake(bpe_prep, new_data = assessment(kick_val$splits[[1]]),
composition = "matrix")
```
Our model will be very similar to the baseline CNN model from Section \@ref(firstcnn); we'll use a larger kernel size of 7 to account for the finer detail in the tokens.
```{r}
cnn_bpe <- keras_model_sequential() %>%
layer_embedding(input_dim = max_subwords + 1, output_dim = 16,
input_length = bpe_max_length) %>%
layer_conv_1d(filter = 32, kernel_size = 7, activation = "relu") %>%
layer_global_max_pooling_1d() %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
cnn_bpe
```
We can compile and train like we have done so many times now.
```{r}
cnn_bpe %>% compile(
optimizer = "adam",
loss = "binary_crossentropy",
metrics = c("accuracy")
)
bpe_history <- cnn_bpe %>% fit(
bpe_analysis,
state_analysis,
epochs = 10,
validation_data = list(bpe_assess, state_assess),
batch_size = 512
)
bpe_history
```
The model is performing quite well, which is a pleasant surprise! This is what we hoped would happen if we switched to a higher-detail tokenizer.
The \index{matrix!confusion}confusion matrix in Figure \@ref(fig:bpeheatmap) also clearly shows that there isn't much bias between the two classes with this new tokenizer.
```{r bpeheatmap, fig.cap="Confusion matrix for CNN model using byte pair encoding tokenization"}
val_res_bpe <- keras_predict(cnn_bpe, bpe_assess, state_assess)
val_res_bpe %>%
conf_mat(state, .pred_class) %>%
autoplot(type = "heatmap")
```
What are the subwords being used in this model? We can extract them from `step_sequence_onehot()` using `tidy()` on the prepped recipe. All the tokens that start with an `"h"` are seen here.
```{r}
bpe_rec %>%
prep() %>%
tidy(3) %>%
filter(str_detect(token, "^h")) %>%
pull(token)
```
Notice how some of these subword tokens\index{tokenization!subword} are full words, and some are parts of words. This is what allows the model to be able to "read" long unknown words by combining many smaller subwords.
We can also look at common long words.
```{r}
bpe_rec %>%
prep() %>%
tidy(3) %>%
arrange(desc(nchar(token))) %>%
slice_head(n = 25) %>%
pull(token)
```
These 25 words were common enough to get their own subword tokens, and they help us understand the nature of these Kickstarter crowdfunding campaigns.
```{block, type = "rmdwarning"}
Examining the longest subword tokens gives you a good sense of the data you are working with!
```
## Case study: explainability with LIME {#lime}
\index{models!explainability}
\index{models!interpretability|see {models, explainability}}
We noted in Section \@ref(dllimitations) that one of the significant limitations of deep learning models is that they are hard to reason about. One of the ways to understand a predictive model, even a "black box"\index{black box} one, is using an algorithm for observation-level variable importance like the *Local Interpretable Model-Agnostic Explanations* [@ribeiro2016why] algorithm, or **LIME** for short.
```{block, type = "rmdnote"}
As indicated by its name, LIME is an approach to compute local feature importance, or explainability at the individual observation level. It does not offer global feature importance, or explainability for the model as a whole.
```
```{block2, type = "rmdpackage"}
The **lime** package in R [@R-lime] implements the LIME algorithm; it can take a prediction from a model and determine a small set of features in the original data that drives the outcome of the prediction.
```
To use this package, we need to write a helper function to get the data in the format we want. The `lime()` function takes two mandatory arguments, `x` and `model`. The `model` argument is the trained model we are trying to explain; `lime()` works out of the box with Keras models, so we should be good to go there. The `x` argument is the training data used for training the model. This is where we need to create a helper function; the lime package expects `x` to be a character vector, so we'll need a function that takes a character vector as input and returns the matrix the Keras model is expecting.
```{r}
kick_prepped_rec <- prep(kick_rec)
text_to_matrix <- function(x) {
bake(
kick_prepped_rec,
new_data = tibble(blurb = x),
composition = "matrix"
)
}
```
```{block, type = "rmdnote"}
Since the function needs to be able to work with just the `x` parameter alone, we need to refer to the prepped recipe `kick_prepped_rec` inside the function rather than passing it in as an argument. This works because of R's scoping rules, but it does require you to create a new function for each recipe.
```
Let's select a couple of training observations to explain.
```{r, linewidth=80}
sentence_to_explain <- kickstarter_train %>%
slice(c(1, 5)) %>%
pull(blurb)
sentence_to_explain
```
We now load the lime package and pass observations into `lime()` along with the model we are trying to explain and the preprocess function.
```{block, type = "rmdwarning"}
Be sure that the preprocessing function _matches_ the preprocessing that was used to train the model.
```
```{r}
library(lime)
explainer <- lime(
x = sentence_to_explain,
model = simple_cnn_model,
preprocess = text_to_matrix
)
```
This `explainer` object can now be used with `explain()` to generate explanations for the sentences. We set `n_labels = 1` to only get explanations for the first label, since we are working with a binary classification model^[The explanations of the second label would just be the inverse of the first label. If you have more than two labels, it makes sense to explore some or all of them.]. We set `n_features = 12` to return the 12 most important features. If we were dealing with longer text, we might want to change `n_features` to return more features (tokens).
```{r}
explanation <- explain(
x = sentence_to_explain,
explainer = explainer,
n_labels = 1,
n_features = 12
)
explanation
```
The output comes in a tibble format where `feature` and `feature_weight` are included, but fortunately lime contains some functions to visualize these weights. Figure \@ref(fig:limeplotfeatures) shows the result of using `plot_features()`, with each facet containing an observation-label pair and the bars showing the weight of the different tokens. Bars in the positive direction (darker) indicate that the weights _support_ the prediction and bars in the negative direction (lighter) indicate _contradictions_. This chart is great for finding the most prominent features in an observation.
```{r, eval=FALSE}
plot_features(explanation)
```
```{r limeplotfeatures, echo=FALSE, fig.cap="Plot of most important features for a CNN model predicting two observations."}
suppressMessages(
plot_features(explanation) +
scale_fill_manual(values = discrete_colors[1:2], drop = FALSE)
)
```
\index{models!explainability}Figure \@ref(fig:limeplottextexplanations) shows the weights by highlighting the words directly in the text. This gives us a way to see if any local patterns contain a lot of weight.
```{r, eval=FALSE}
plot_text_explanations(explanation)
```
```{r limeplottextexplanations, echo=FALSE, fig.cap="Feature highlighting of words for two examples explained by a CNN model.", out.width="90%"}
if (knitr:::is_html_output()) {
positive_colors <- prismatic::clr_lighten(
rep(discrete_colors[1], 6),
c(-0.1, 0.1, 0.3, 0.5, 0.7, 0.9)
)
negative_colors <- prismatic::clr_lighten(
rep(discrete_colors[2], 6),
c(-0.1, 0.1, 0.3, 0.5, 0.7, 0.9)
)
all_styles <- paste(collapse = "\n",
paste0(".match_positive, .positive_1, .positive_2, .positive_3, .positive_4, .positive_5
{ border: 1px solid ", positive_colors[1], ";}"),
paste0(".match_negative, .negative_1, .negative_2, .negative_3, .negative_4, .negative_5
{ border: 1px solid ", negative_colors[1], ";}"),
paste0(".plot_text_explanations .", "positive", "_", 1:5, " {
background-color: ", positive_colors[-1], ";}"),
paste0(".plot_text_explanations .", "negative", "_", 1:5, " {
background-color: ", negative_colors[-1], ";}")
)
plot_text_explanations(explanation) %>%
htmlwidgets::prependContent(htmltools::tags$style(all_styles))
} else {
knitr::include_graphics("images/plot_text_explanations_1.png")
}
```
```{block, type = "rmdnote"}
The `interactive_text_explanations()` function can be used to launch an interactive Shiny app where you can explore the model weights.
```
\index{models!explainability}One of the ways a deep learning model is hard to explain is that changes to one part of the input can affect how the rest of the input is used. Remember that in a bag-of-words model, adding another token to an observation simply adds that token's weight to the prediction; this is not always the case when using deep learning models.
The following example shows this effect. We have created two very similar sentences in `fake_sentences`.
```{r}
fake_sentences <- c(
"Fun and exciting dice game for the whole family",
"Fun and exciting dice game for the family"
)
explainer <- lime(
x = fake_sentences,
model = simple_cnn_model,
preprocess = text_to_matrix
)
explanation <- explain(
x = fake_sentences,
explainer = explainer,
n_labels = 1,
n_features = 12
)
```
Explanations based on these two sentences are fairly similar as we can see in Figure \@ref(fig:robustlimeplottextexplanations). However, notice how the removal of the word "whole" affects the weights of the other words in the examples, in some cases switching the sign from supporting to contradicting.
```{r, eval=FALSE}
plot_text_explanations(explanation)
```
```{r robustlimeplottextexplanations, echo=FALSE, fig.cap="Feature highlighting of words in two examples explained by a CNN model.", out.width="90%"}
if (knitr:::is_html_output()) {
plot_text_explanations(explanation) %>%
htmlwidgets::prependContent(htmltools::tags$style(all_styles))
} else {
knitr::include_graphics("images/plot_text_explanations_2.png")
}
```
\index{models!explainability}It is these kinds of correlated patterns that can make deep learning models hard to reason about and can deliver surprising results.
```{block, type = "rmdnote"}
The LIME algorithm and **lime** R package are not limited to explaining CNNs. This approach can be used with any of the models we have used in this book, even the ones trained with **parsnip**.
```
## Case study: hyperparameter search {#keras-hyperparameter}
\index{models!tuning}So far in all our deep learning models, we have only used one configuration of hyperparameters. Sometimes we want to try different hyperparameters out and find what works best for our model like we did in Sections \@ref(mlregressionfull) and \@ref(mlclassificationfull) using the **tune** package. We can use the [**tfruns**](https://tensorflow.rstudio.com/tools/tfruns/overview/) package to run multiple Keras models and compare the results.
This workflow will be a little different than what we have seen in the book so far since we will have to create a `.R` file that contains the necessary modeling steps and then use that file to fit multiple models. Such an example file named `cnn-spec.R` used for the following models is available [on GitHub](https://raw.githubusercontent.com/EmilHvitfeldt/smltar/master/cnn-spec.R). The first thing we need to do is specify what hyperparameters we want to vary. By convention, this object is named `FLAGS` and it is created using the `flags()` function. For each parameter we want to tune, we add a corresponding `flag_*()` function, which can be `flag_integer()`, `flag_boolean()`, `flag_numeric()`, or `flag_string()` depending on what we need to tune.
```{block, type = "rmdwarning"}
Be sure you are using the right type for each of these flags; Keras is quite picky! If Keras is expecting an integer and gets a numeric then you will get an error.
```
```{r, eval=FALSE}
FLAGS <- flags(
flag_integer("kernel_size1", 5),
flag_integer("strides1", 1)
)
```
Notice how we are giving each flag a name and a default value. The default itself isn't important, since it is overridden once we start running multiple models, but it needs to be the right type for the model we are using.
Next, we specify the Keras model we want to run.
```{r, eval=FALSE}
model <- keras_model_sequential() %>%
layer_embedding(input_dim = max_words + 1, output_dim = 16,
input_length = max_length) %>%
layer_conv_1d(filter = 32,
kernel_size = FLAGS$kernel_size1,
strides = FLAGS$strides1,
activation = "relu") %>%
layer_global_max_pooling_1d() %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
model %>% compile(
optimizer = "adam",
loss = "binary_crossentropy",
metrics = c("accuracy")
)
```
We target the hyperparameters we want to change by marking them as `FLAGS$name`. So in this model, we are tuning different values of `kernel_size` and `strides`, which are denoted by the `kernel_size1` and `strides1` flags, respectively.
Lastly, we must specify how the model is trained and evaluated.
```{r, eval=FALSE}
history <- model %>%
fit(
x = kick_analysis,
y = state_analysis,
batch_size = 512,
epochs = 10,
validation_data = list(kick_assess, state_assess)
)
plot(history)
score <- model %>% evaluate(
kick_assess, state_assess
)
cat("Test accuracy:", score["accuracy"], "\n")
```
This is mostly the same as what we have seen before. When we run these different models, the script is executed in the environment it is initialized from, so the models have access to objects like `kick_analysis` and `state_analysis`, and we don't have to create them inside the file.
Now that we have the file set up we need to specify the different hyperparameters we want to try. Three different values for the kernel size and two different values for the stride length give us `3 * 2 = 6` different runs.
```{r}
hyperparams <- list(
kernel_size1 = c(3, 5, 7),
strides1 = c(1, 2)
)
```
```{block, type = "rmdnote"}
This is a small selection of hyperparameters and ranges. There is much more room for experimentation.
```
Now we have everything we need for hyperparameter searching. Load up **tfruns** and pass the name of the file we just created along with `hyperparams` to the `tuning_run()` function.
```{r, results='hide', cache=FALSE, message=FALSE, eval=FALSE}
library(tfruns)
runs <- tuning_run(
file = "cnn-spec.R",
runs_dir = "_tuning",
flags = hyperparams
)
runs_results <- as_tibble(ls_runs())
```
```{r, echo=FALSE, message=FALSE}
runs_results <- readr::read_csv("inst/runs_results.csv")
runs_results
```
You don't have to, but we have manually specified the `runs_dir` argument, which is where the results of the tuning will be saved.
A summary of all the runs in the folder can be retrieved with `ls_runs()`; here we use `as_tibble()` to get the results as a tibble.
```{r, cache=FALSE}
runs_results
```
We can condense the results down a little bit by only pulling out the flags we are looking at and arranging them according to their performance.
```{r, cache=FALSE}
best_runs <- runs_results %>%
select(metric_val_accuracy, flag_kernel_size1, flag_strides1) %>%
arrange(desc(metric_val_accuracy))
best_runs
```
There isn't much difference in performance between the different choices, but using a kernel size of `r best_runs$flag_kernel_size1[1]` and a stride length of `r best_runs$flag_strides1[1]` narrowly came out on top.
## Cross-validation for evaluation
In Section \@ref(dnncross), we saw how we can use resampling to create cross-validation folds for evaluation. The Kickstarter data set we are using is big enough that we have ample data for a single training set, validation set, and testing set that all contain enough observations in them to give reliable performance metrics. However, it is important to understand how to implement other resampling strategies for situations when your data budget may not be as plentiful or when you need to compute performance metrics that are more precise.
```{r}
set.seed(345)
kick_folds <- vfold_cv(kickstarter_train, v = 5)
kick_folds
```
Each of these folds has an analysis or training set and an assessment or validation set. Instead of training our model one time and getting one measure of performance, we can train our model `v` times and get `v` measures (five, in this case), for more reliability.
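We can peek inside one of these splits with the same rsample functions used below; this quick check (ours, not part of the book's code) shows how many observations land in each part of the first fold.
```{r}
# The analysis set is used for fitting and the assessment set for
# computing performance metrics on this fold.
first_fold <- kick_folds$splits[[1]]
nrow(analysis(first_fold))
nrow(assessment(first_fold))
```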
Last time we saw how to create a custom function to handle preprocessing, fitting, and evaluation. We will use the same approach of creating the function, but this time use the model specification from Section \@ref(firstcnn).
```{r}
fit_split <- function(split, prepped_rec) {
## preprocessing
x_train <- bake(prepped_rec, new_data = analysis(split),
composition = "matrix")
x_val <- bake(prepped_rec, new_data = assessment(split),
composition = "matrix")
## create model
y_train <- analysis(split) %>% pull(state)
y_val <- assessment(split) %>% pull(state)
mod <- keras_model_sequential() %>%
layer_embedding(input_dim = max_words + 1, output_dim = 16,
input_length = max_length) %>%
layer_conv_1d(filter = 32, kernel_size = 5, activation = "relu") %>%
layer_global_max_pooling_1d() %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid") %>%
compile(
optimizer = "adam",
loss = "binary_crossentropy",
metrics = c("accuracy")
)
## fit model
mod %>%
fit(
x_train,
y_train,
epochs = 10,
validation_data = list(x_val, y_val),
batch_size = 512,
verbose = FALSE
)
## evaluate model
keras_predict(mod, x_val, y_val) %>%
metrics(state, .pred_class, .pred_1)
}
```
We can `map()` this function across all our cross-validation folds. This takes longer than our previous models to train, since we are training for 10 epochs each on 5 folds.
```{r}
cv_fitted <- kick_folds %>%
mutate(validation = map(splits, fit_split, kick_prep))
cv_fitted
```
Now we can use `unnest()` to find the metrics we computed.
```{r}
cv_fitted %>%
unnest(validation)
```
We can summarize the unnested results to match what we would normally get from `collect_metrics()`.
```{r}
cv_fitted %>%
unnest(validation) %>%
group_by(.metric) %>%
summarize(
mean = mean(.estimate),
n = n(),
std_err = sd(.estimate) / sqrt(n)
)
```
The metrics have little variance just like they did last time, which is reassuring; our model is robust with respect to the evaluation metrics.
## The full game: CNN {#cnnfull}
We've come a long way in this chapter and looked at the many different modifications to the simple CNN model we started with. Most of the alterations didn't add much, so this final model is not going to be much different from what we have seen so far.
\index{models!challenges}
```{block, type = "rmdwarning"}
There are an incredible number of ways to change a deep learning network architecture, but in most realistic situations, the benefit in model performance from such changes is modest.
```
### Preprocess the data {#cnnfullpreprocess}
For this final model, we are not going to use our separate validation data again, so we only need to \index{preprocess}preprocess the training data.
```{r}
max_words <- 2e4
max_length <- 30
kick_rec <- recipe(~ blurb, data = kickstarter_train) %>%
step_tokenize(blurb) %>%
step_tokenfilter(blurb, max_tokens = max_words) %>%
step_sequence_onehot(blurb, sequence_length = max_length)
kick_prep <- prep(kick_rec)
kick_matrix <- bake(kick_prep, new_data = NULL, composition = "matrix")
dim(kick_matrix)
```
### Specify the model {#cnnfullmodel}
Instead of using specific validation data that we can then compute performance metrics for, let's go back to specifying `validation_split = 0.1` and let the Keras model choose the validation set.
```{r cnnfinalmod}
final_mod <- keras_model_sequential() %>%
layer_embedding(input_dim = max_words + 1, output_dim = 16,
input_length = max_length) %>%
layer_conv_1d(filter = 32, kernel_size = 7,
strides = 1, activation = "relu") %>%
layer_global_max_pooling_1d() %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
final_mod %>%
compile(
optimizer = "adam",
loss = "binary_crossentropy",
metrics = c("accuracy")
)
final_history <- final_mod %>%
fit(
kick_matrix,
kickstarter_train$state,
epochs = 10,
validation_split = 0.1,
batch_size = 512,
verbose = FALSE
)
final_history
```
This looks promising! Let's finally turn to the testing set, for the first time during this chapter, to evaluate this last model on data that has never been touched as part of the fitting process.
```{r}
kick_matrix_test <- bake(kick_prep, new_data = kickstarter_test,
composition = "matrix")
final_res <- keras_predict(final_mod, kick_matrix_test, kickstarter_test$state)
final_res %>% metrics(state, .pred_class, .pred_1)
```
This is our best-performing model in this chapter on CNN models, although not by much. We can again create an ROC curve, this time using the test data in Figure \@ref(fig:cnnfinalroc).
```{r cnnfinalroc, opts.label = "fig.square", fig.cap="ROC curve for final CNN model predictions on testing set of Kickstarter campaign success"}
final_res %>%
roc_curve(state, .pred_1) %>%
autoplot()
```
We have been able to incrementally improve our model by adding to the structure and making good choices about \index{preprocessing}preprocessing. We can visualize this final CNN model's performance using a \index{matrix!confusion}confusion matrix as well, in Figure \@ref(fig:cnnheatmapfinal).
```{r cnnheatmapfinal, fig.cap="Confusion matrix for final CNN model predictions on testing set of Kickstarter campaign success"}
final_res %>%
conf_mat(state, .pred_class) %>%
autoplot(type = "heatmap")
```
Notice that this final model performs better than any of the models we have tried so far in this chapter, Chapter \@ref(dldnn), and Chapter \@ref(dllstm).
```{block, type = "rmdnote"}
For this particular data set of short text blurbs, a CNN model able to learn local features performed the best, better than either a densely connected neural network or an LSTM.
```
## Summary {#dlcnnsummary}
CNNs are a type of neural network that can learn local spatial patterns. They essentially perform feature extraction\index{feature engineering}, which can then be used efficiently in later layers of a network. Their simplicity and fast running time, compared to models like LSTMs, makes them excellent candidates for supervised models for text.
### In this chapter, you learned:
- how to preprocess text data for CNN models
- about CNN network architectures
- how CNN layers can be stacked to extract patterns of varying detail
- how byte pair encoding can be used to tokenize for finer detail
- how to do hyperparameter search in Keras with **tfruns**
- how to evaluate CNN models for text