# Stop words {#stopwords}
Once we have split text into tokens, it often becomes clear that not all words carry the same amount of information, if any information at all, for a predictive modeling task. Common words that carry little (or perhaps no) meaningful information are called *stop words*. It is common advice and practice to remove stop words for various NLP tasks, but the task of stop word removal is more nuanced than many resources may lead you to believe. In this chapter, we will investigate what a stop word list is, the differences between them, and the effects of using them in your preprocessing workflow.
The concept of stop words has a long history with Hans Peter Luhn credited with coining the term in 1960 [@Luhn1960]. Examples of these words in English are "a", "the", "of", and "didn't". These words are very common and typically don't add much to the meaning of a text but instead ensure the structure of a sentence is sound.
```{block, type = "rmdnote"}
Categorizing words as either informative or non-informative is limiting, and we prefer to consider words as having a more fluid or continuous amount of information associated with them. This informativeness is context-specific as well. In fact, stop words themselves are often important in genre or authorship identification.
```
Historically, one of the main reasons for removing stop words was to decrease the computational time for text mining; it can be regarded as a dimensionality reduction of text data and was commonly used in search engines to give better results [@Huston2010].
Stop words can have different roles in a corpus. We generally categorize stop words into three groups: global, subject, and document stop words.
\index{stop words!global}Global stop words are words that are almost always low in meaning in a given language; these are words such as "of" and "and" in English that are needed to glue text together. These words are likely a safe bet for removal, but they are low in number. You can find some global stop words in pre-made stop word lists (Section \@ref(premadestopwords)).
\index{stop words!subject}Next up are subject-specific stop words. These words are uninformative for a given subject area. Subjects can be broad like finance and medicine or can be more specific like obituaries, health code violations, and job listings for librarians in Kansas.
Words like "bath", "bedroom", and "entryway" are generally not considered stop words in English, but they may not provide much information for differentiating suburban house listings and could be subject stop words for certain analysis. You will likely need to manually construct such a stop word list (Section \@ref(homemadestopwords)). These kinds of stop words may improve your performance if you have the domain expertise to create a good list.
\index{stop words!document}Lastly, we have document-level stop words. These words do not provide any or much information for a given document. These are difficult to classify and are often not worth the trouble to identify. Even if you can find document stop words, it is not obvious how to incorporate this kind of information in a regression or classification task.
## Using premade stop word lists {#premadestopwords}
A quick option for using stop words is to get a list that has already been created. This is appealing because it is not difficult, but be aware that not all lists are created equal. @nothman-etal-2018-stop found some alarming results in a study of 52 stop word lists available in open-source software packages. Among some of the more grave issues were misspellings\index{misspellings} ("fify" instead of "fifty"), the inclusion of clearly informative words such as "computer" and "cry", and internal inconsistencies, such as including the word "has" but not the word "does". This is not to say that you should never use a stop word list that has been included in an open-source software project. However, you should always inspect and verify the list you are using, both to make sure it hasn't changed since you used it last, and also to check that it is appropriate for your use case.
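As a quick illustration of that kind of verification, here is a minimal sketch in R; the words checked below are drawn from the issues reported by @nothman-etal-2018-stop for *some* open-source lists and may well not appear in this particular list.
```{r}
smart_list <- stopwords::stopwords(source = "smart")

# Do any of these suspect entries (a misspelling and two clearly informative
# words reported for some lists) appear in this particular list?
intersect(c("fify", "computer", "cry"), smart_list)
```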
There is a broad selection of stop word lists available today. For the purpose of this chapter, we will focus on three of the lists of English stop words provided by the **stopwords** package [@R-stopwords]. The first is from the SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System\index{stop word lists!SMART}, an information retrieval system developed at Cornell University in the 1960s [@Lewis2014]. The second is the \index{stop word lists!Snowball}English Snowball stop word list [@porter2001snowball], and the last is the English list from the [Stopwords ISO](https://github.com/stopwords-iso/stopwords-iso) collection\index{stop word lists!Stopwords ISO}. These stop word lists are all considered general purpose and not domain-specific.
```{block, type = "rmdpackage"}
The **stopwords** package contains a comprehensive collection of stop word lists in one place for ease of use in analysis and other packages.
```
Before we start delving into the content inside the lists, let's take a look at how many words are included in each.
```{r, results='hold'}
library(stopwords)
length(stopwords(source = "smart"))
length(stopwords(source = "snowball"))
length(stopwords(source = "stopwords-iso"))
```
The lengths of these lists are quite different, with the longest list being over seven times longer than the shortest! Let's examine the overlap of the words that appear in the three lists in an UpSet plot in Figure \@ref(fig:stopwordoverlap). An UpSet plot [@Lex2014] visualizes intersections and aggregates of intersections of sets using a matrix layout, presenting the number of elements as well as summary statistics.
\index{stop word lists!SMART}
\index{stop word lists!Snowball}
\index{stop word lists!Stopwords ISO}
```{r stopwordoverlap, echo=FALSE, fig.cap="Set intersections for three common stop word lists visualized as an UpSet plot"}
library(UpSetR)
fromList(list(smart = stopwords(source = "smart"),
snowball = stopwords(source = "snowball"),
iso = stopwords(source = "stopwords-iso"))) %>%
upset(empty.intersections = "on")
```
The UpSet plot in Figure \@ref(fig:stopwordoverlap) shows us that these three lists are almost true subsets of each other. The only exception is a set of 10 words that appear in Snowball and ISO but not in the SMART list. What are those words?
```{r}
setdiff(stopwords(source = "snowball"),
stopwords(source = "smart"))
```
\index{stop word lists!SMART}
\index{stop word lists!Snowball}
All these words are contractions. This is *not* because the SMART lexicon doesn't include contractions; if we look, there are almost 50 of them.
```{r}
library(stringr)
str_subset(stopwords(source = "smart"), "'")
```
We seem to have stumbled upon an inconsistency: why does \index{stop word lists!SMART}SMART include `"he's"` but not `"she's"`? It is hard to say, but this could be worth rectifying before applying these stop word lists to an analysis or model preprocessing.\index{preprocessing!challenges} This stop word list was likely generated by selecting the most frequent words across a large corpus of text that had more representation for text about men than women. This is once again a reminder that we should always look carefully at any pre-made word list or another artifact we use to make sure it works well with our needs^[This advice applies to any kind of pre-made lexicon or word list, not just stop words. For instance, the same concerns apply to sentiment lexicons. The NRC sentiment lexicon of @Mohammad13 associates the word "white" with trust and the word "black" with sadness, which could have unintended consequences when analyzing text about racial groups.].
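We can check claims like this directly against the list; here is a quick sketch, where the particular contraction pairs chosen are only illustrations.
```{r}
# Which of these gendered contractions are present in the SMART list?
c("he's", "she's", "he'd", "she'd") %in% stopwords(source = "smart")
```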
```{block, type = "rmdwarning"}
It is perfectly acceptable to start with a premade word list and remove or append additional words according to your particular use case, as in the sketch below.
```
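For instance, here is a minimal sketch of tailoring a premade list; the specific words appended and removed are only illustrations, not recommendations.
```{r}
# Start from the Snowball list, append extra words, and drop words you want
# to keep in your analysis (such as negations).
my_stop_words <- setdiff(
  union(stopwords(source = "snowball"), c("thee", "thou")),
  c("no", "not")
)
length(my_stop_words)
```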
\index{preprocessing}When you select a stop word list, it is important that you consider its size and breadth. Having a small and concise list of words can moderately reduce your token count while not having too great of an influence on your models, assuming that you picked appropriate words. As the size of your stop word list grows, each word added will have a diminishing positive effect with the increasing risk that a meaningful word has been placed on the list by mistake. In Section \@ref(casestudystopwords), we show the effects of different stop word lists on model training.
### Stop word removal in R
Now that we have seen stop word lists, we can move forward with removing these words. The particular way we remove stop words depends on the shape of our data. If you have your text in a tidy format with one word per row, you can use `filter()` from **dplyr** with a negated `%in%` if you have the stop words as a vector, or you can use `anti_join()` from **dplyr** if the stop words are in a `tibble()`. Like in our previous chapter, let's examine the text of "The Fir-Tree" by Hans Christian Andersen, and use **tidytext** to tokenize the text into words.
```{r}
library(hcandersenr)
library(tidyverse)
library(tidytext)
fir_tree <- hca_fairytales() %>%
filter(book == "The fir tree",
language == "English")
tidy_fir_tree <- fir_tree %>%
unnest_tokens(word, text)
```
\index{stop word lists!Snowball}Let's use the Snowball stop word list as an example. Since this function returns the stop words as a vector, we will use `filter()`.
```{r, results='hold'}
tidy_fir_tree %>%
filter(!(word %in% stopwords(source = "snowball")))
```
If we use the `get_stopwords()` function from **tidytext** instead, then we can use the `anti_join()` function.
```{r}
tidy_fir_tree %>%
anti_join(get_stopwords(source = "snowball"))
```
The result of these two stop word removals is the same since we used the same stop word list in both cases.
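If you want to convince yourself that the two approaches agree, here is a quick sketch of a check.
```{r}
removed_filter <- tidy_fir_tree %>%
  filter(!(word %in% stopwords(source = "snowball")))

removed_join <- tidy_fir_tree %>%
  anti_join(get_stopwords(source = "snowball"), by = "word")

# The same number of rows remain, and the same words remain
nrow(removed_filter) == nrow(removed_join)
identical(sort(removed_filter$word), sort(removed_join$word))
```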
## Creating your own stop words list {#homemadestopwords}
Another way to get a stop word list is to create one yourself. Let's explore a few different ways to find appropriate words to use. We will use the tokenized data from "The Fir-Tree" as a first example. Let's take the words and rank them by their count or frequency.
(ref:tidyfirtree) Words from "The Fir Tree" ordered by count or frequency
```{r, eval=!knitr:::is_html_output(), echo=FALSE, fig.cap='(ref:tidyfirtree)'}
tidy_fir_tree %>%
count(word, sort = TRUE) %>%
slice(1:120) %>%
mutate(row = rep(1:5, each = n() / 5),
column = rep(rev(seq_len(n() / 5)), length.out = n())) %>%
mutate(word = paste0(row_number(), ": ", word)) %>%
ggplot(aes(row, column, label = word)) +
geom_text(hjust = 0) +
xlim(c(1, 5.5)) +
theme_void() +
labs(title = 'Most frequent tokens in "The Fir-Tree"')
```
```{r, eval=knitr:::is_html_output(), echo=FALSE, results='markup'}
tidy_fir_tree %>%
count(word, sort = TRUE) %>%
slice(1:120) %>%
mutate(word = paste0(row_number(), ": ", word)) %>%
pull(word) %>%
columnize()
```
We recognize many of what we would consider stop words in the first column here, with three big exceptions. We see `"tree"` at 3, `"fir"` at 12, and `"little"` at 22. These words appear high on our list, but they do provide valuable information as they all reference the main character. \index{preprocessing!challenges}What went wrong with this approach? Creating a stop word list using high-frequency words works best when it is created on a **corpus** of documents\index{corpus}, not an individual document. This is because the words found in a single document will be document-specific and the overall pattern of words will not generalize that well.
```{block2, type = "rmdnote"}
In NLP, a corpus is a set of texts or documents. The set of Hans Christian Andersen's fairy tales can be considered a corpus, with each fairy tale a document within that corpus. The set of United States Supreme Court opinions can be considered a different corpus, with each written opinion being a document within *that* corpus. Both data sets are described in more detail in Appendix \@ref(appendixdata).
```
\index{corpus!definition}
The word `"tree"` does seem important as it is about the main character, but it could also be appearing so often that it stops providing any information. Let's try a different approach, extracting high-frequency words from the corpus of *all* English fairy tales by H.C. Andersen.
```{r, eval=!knitr:::is_html_output(), echo=FALSE, fig.cap="Words in all English fairy tales by Hans Christian Andersen ordered by count or frequency"}
library(hcandersenr)
library(tidytext)
hcandersen_en %>%
unnest_tokens(word, text) %>%
count(word, sort = TRUE) %>%
slice(1:120) %>%
mutate(row = rep(1:5, each = n() / 5),
column = rep(rev(seq_len(n() / 5)), length.out = n())) %>%
mutate(word = paste0(row_number(), ": ", word)) %>%
ggplot(aes(row, column, label = word)) +
geom_text(hjust = 0) +
xlim(c(1, 5.5)) +
theme_void() +
labs(
title = "120 most frequent tokens in H.C. Andersen's English fairy tales"
)
```
```{r, eval=knitr:::is_html_output(), echo=FALSE, results='markup'}
library(hcandersenr)
library(tidytext)
hcandersen_en %>%
unnest_tokens(word, text) %>%
count(word, sort = TRUE) %>%
slice(1:120) %>%
mutate(word = paste0(row_number(), ": ", word)) %>%
pull(word) %>%
columnize()
```
This list is more appropriate for our concept of stop words, and now it is time for us to make some choices. How many words do we want to include in our stop word list? Which words should we add and/or remove based on prior information? Selecting the number of words to remove is best done on a case-by-case basis as it can be difficult to determine a priori how many different "meaningless" words appear in a corpus. Our suggestion is to start with a low number like 20 and increase by 10 words until you get to words that are not appropriate as stop words for your analytical purpose.
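As a concrete illustration of that workflow, here is a minimal sketch; the object name `candidate_stop_words` and the cutoff of 20 are only for illustration.
```{r}
# Start with a small cutoff, inspect the resulting words, and raise the cutoff
# until words that are clearly informative start to appear.
cutoff <- 20

candidate_stop_words <- hcandersen_en %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  slice(seq_len(cutoff)) %>%
  pull(word)

candidate_stop_words
```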
It is worth keeping in mind that such a list is not perfect.\index{preprocessing!challenges} Depending on how your text was generated or processed, strange tokens can surface as possible stop words due to encoding or optical character recognition errors. Further, these results are based on the corpus of documents we have available, which is potentially biased. In our example here, all the fairy tales were written by the same European white man from the early 1800s.
```{block, type = "rmdnote"}
This bias can be minimized by removing words we would expect to be over-represented or by adding words we expect to be under-represented.
```
An easy adjustment is to include the complements of the words already in the list if they are not present: include "big" if "small" is present, and "old" if "young" is present. In this example list, words associated with women are often ranked lower than words associated with men. With `"man"` being at rank 79 but `"woman"` at rank `r hcandersenr::hcandersen_en %>% tidytext::unnest_tokens(word, text) %>% count(word, sort = TRUE) %>% pull(word) %>% magrittr::equals("woman") %>% which()`, choosing a threshold of 100 would lead to only one of these words being included. Depending on how important you think such nouns are going to be in your texts, consider either adding `"woman"` or deleting `"man"`.^[On the other hand, the more biased stop word list may be helpful when modeling a corpus with gender imbalance, depending on your goal; words like "she" and "her" can identify where women are mentioned.]
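A minimal sketch of such an adjustment, assuming the `candidate_stop_words` object from the earlier sketch (and a higher cutoff than used there):
```{r}
# Add the under-represented form if its counterpart made the cut...
candidate_stop_words <- union(candidate_stop_words, "woman")

# ...or instead drop the over-represented form entirely:
# candidate_stop_words <- setdiff(candidate_stop_words, "man")
```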
\index{bias}Figure \@ref(fig:genderrank) shows how the words associated with men have a higher rank than the words associated with women. By using a single threshold to create a stop word list, you would likely only include one form of such words.
```{r genderrank, echo=FALSE, fig.width = 8, fig.cap="Tokens ranked according to total occurrences, with rank 1 having the most occurrences"}
gender_words <- tribble(
~men, ~women,
"he", "she",
"his", "her",
"man", "woman",
"men", "women",
"boy", "girl",
"he's", "she's",
"he'd", "she'd",
"he'll", "she'll",
"himself", "herself"
)
ordered_words <- hcandersen_en %>%
unnest_tokens(word, text) %>%
count(word, sort = TRUE) %>%
pull(word)
gender_words_plot <- gender_words %>%
mutate(male_index = match(men, ordered_words),
female_index = match(women, ordered_words)) %>%
mutate(slope = log10(male_index) - log10(female_index)) %>%
pivot_longer(male_index:female_index) %>%
mutate(value = log10(value),
label = ifelse(name == "male_index", men, women)) %>%
mutate(name = factor(x = name,
levels = c("male_index", "female_index"),
labels = c("men", "women")))
limit <- max(abs(gender_words_plot$slope)) * c(-1, 1)
gender_words_plot %>%
ggplot(aes(name, value, group = women)) +
geom_line(aes(color = slope), size = 1) +
scale_y_reverse(labels = function(x) 10 ^ x) +
geom_text(aes(label = label)) +
scale_color_distiller(type = "div", limit = limit) +
guides(color = "none") +
theme(panel.border = element_blank(), panel.grid.major.x = element_blank()) +
labs(x = NULL, y = "Word rank (log scale)") +
labs(title = paste("Masculine gendered words appear more often in",
"H.C. Andersen's fairy tales"))
```
Imagine now that we would like to create a stop word list that spans multiple different genres, in such a way that the subject-specific stop words don't overlap. For this case, we would like a word to be denoted as a stop word only if it is a stop word in all of the genres. You could find the words individually in each genre and take the appropriate intersections, but that approach might take a substantial amount of time.
Below is a bad approach, in which we try to create a multi-language list of stop words. To accomplish this, we calculate the [*inverse document frequency*](https://www.tidytextmining.com/tfidf.html) (IDF) \index{inverse document frequency}of each word. The IDF of a word is a quantity that is low for words used in many of the documents in a collection and high for words used in only a few of them. It is typically defined as
$$idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$
If the word "dog" appears in 4 out of 100 documents then it would have an `idf("dog") = log(100/4) = 3.22`, and if the word "cat" appears in 99 out of 100 documents then it would have an `idf("cat") = log(100/99) = 0.01`. Notice how the idf values goes to zero (as a matter of fact when a term appears in all the documents then the idf of that word is 0 `log(100/100) = log(1) = 0`), the more documents it is contained in.
What happens if we create a stop word list based on words with the lowest IDF? The following function takes a tokenized dataframe and returns a dataframe with a column for each word and a column for the IDF.
```{r}
library(rlang)
calc_idf <- function(df, word, document) {
words <- df %>% pull({{word}}) %>% unique()
n_docs <- length(unique(pull(df, {{document}})))
n_words <- df %>%
nest(data = c({{word}})) %>%
pull(data) %>%
map_dfc(~ words %in% unique(pull(.x, {{word}}))) %>%
rowSums()
tibble(word = words,
idf = log(n_docs / n_words))
}
```
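A sketch of how this function can be called, treating each language-book pair in the collection of H.C. Andersen's fairy tales as a separate document and looking at the lowest-IDF candidates:
```{r}
hca_fairytales() %>%
  unnest_tokens(word, text) %>%
  mutate(document = paste(language, book)) %>%
  select(word, document) %>%
  calc_idf(word, document) %>%
  arrange(idf) %>%
  slice(1:10)
```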
Here is the result when we try to create a cross-language list of stop words, by taking each fairy tale as a document. It is not very good!
```{block, type = "rmdnote"}
The overlap between the words that appear in each language is very small, yet these overlapping words are mostly what we see in this list.
```
```{r, eval=!knitr:::is_html_output(), echo=FALSE, fig.cap="Words from all of H.C. Andersen's fairy tales in Danish, English, French, German, and Spanish, counted and ordered by IDF"}
hcandersenr::hca_fairytales() %>%
unnest_tokens(word, text) %>%
mutate(document = paste(language, book)) %>%
select(word, document) %>%
calc_idf(word, document) %>%
arrange(idf) %>%
slice(1:120) %>%
mutate(row = rep(1:5, each = n() / 5),
column = rep(rev(seq_len(n() / 5)), length.out = n())) %>%
mutate(word = paste0(row_number(), ": ", word)) %>%
ggplot(aes(row, column, label = word)) +
geom_text(hjust = 0) +
xlim(c(1, 5.5)) +
theme_void() +
labs(title = paste("120 tokens in H.C. Andersen's fairy tales with",
"lowest IDF, multi-language"))
```
```{r, eval=knitr:::is_html_output(), echo=FALSE, results='markup'}
hcandersenr::hca_fairytales() %>%
unnest_tokens(word, text) %>%
mutate(document = paste(language, book)) %>%
select(word, document) %>%
calc_idf(word, document) %>%
arrange(idf) %>%
slice(1:120) %>%
mutate(word = paste0(row_number(), ": ", word)) %>%
pull(word) %>%
columnize()
```
```{block, type = "rmdwarning"}
This didn't work very well because there is very little overlap between common words across languages. Instead, let us limit the calculation to only one language and calculate the IDF of each word, so that the words appearing in many documents within that language surface as candidates.
```
\index{inverse document frequency}
```{r, eval=!knitr:::is_html_output(), echo=FALSE, fig.cap="Words from all of H.C. Andersen's fairy tales in English, counted and ordered by IDF"}
hcandersenr::hcandersen_en %>%
unnest_tokens(word, text) %>%
select(word, book) %>%
calc_idf(word, book) %>%
arrange(idf) %>%
slice(1:120) %>%
mutate(row = rep(1:5, each = n() / 5),
column = rep(rev(seq_len(n() / 5)), length.out = n())) %>%
mutate(word = paste0(row_number(), ": ", word)) %>%
ggplot(aes(row, column, label = word)) +
geom_text(hjust = 0) +
xlim(c(1, 5.5)) +
theme_void() +
labs(title = paste("120 tokens in H.C. Andersen's fairy tales with",
"lowest IDF, English only"))
```
```{r, eval=knitr:::is_html_output(), echo=FALSE, results='markup'}
hcandersenr::hcandersen_en %>%
unnest_tokens(word, text) %>%
select(word, book) %>%
calc_idf(word, book) %>%
arrange(idf) %>%
slice(1:120) %>%
mutate(word = paste0(row_number(), ": ", word)) %>%
pull(word) %>%
columnize()
```
This time we get better results. The list starts with "a", "the", "and", and "to" and continues with many more reasonable choices of stop words. To turn these results into a list, we need to inspect them manually and go as far down in rank as we are comfortable with. You as a data practitioner are in full control of how you want to create the list: if you don't want to include "little", you are still able to add "are" to your list even though it is lower on the list.
## All stop word lists are context-specific
\index{preprocessing!challenges}Context is important in text modeling, so it is important to ensure that the stop word lexicon you use reflects the word space that you are planning on using it in. One common concern to consider is how pronouns bring information to your text. Pronouns are included in many different stop word lists (although inconsistently), but they will often *not* be noise in text data. Similarly, @Bender2021 discuss how a list of about 400 "Dirty, Naughty, Obscene or Otherwise Bad Words"\index{language!obscene} was used to filter and remove text before training a trillion-parameter large language model, to protect it from learning offensive language, but the authors point out that in some community contexts, such words are reclaimed or used to describe marginalized identities.\index{context!importance of}
On the other hand, sometimes you will have to add in words yourself, depending on the domain. If you are working with texts for dessert recipes, certain ingredients (sugar, eggs, water) and actions (whisking, baking, stirring) may be frequent enough to pass your stop word threshold, but you may want to keep them as they may be informative. Throwing away "eggs" as a common word would make it harder or downright impossible to determine whether certain recipes are vegan, while "whisking" and "stirring" may be fine to remove, since distinguishing between recipes that do and don't require a whisk might not be that big of a deal.
## What happens when you remove stop words
We have discussed different ways of finding and removing stop words; now let's see what happens once you do remove them. First, let's explore the impact of the number of words that are included in the list. Figure \@ref(fig:stopwordresults) shows what percentage of words are removed as a function of the number of words in a text. The different colors represent the three different stop word lists we have considered in this chapter.
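For a single document, this proportion is easy to compute directly; a minimal sketch for "The Fir-Tree" and the Snowball list:
```{r}
# Proportion of tokens in "The Fir-Tree" that appear on the Snowball list,
# i.e., the fraction of words that would be removed
mean(tidy_fir_tree$word %in% stopwords(source = "snowball"))
```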
```{r stopwordresults, echo=FALSE, fig.cap="Proportion of words removed for different stop word lists and different document lengths"}
library(tokenizers)
count_no_stopwords <- function(tokens, source) {
map_int(tokens, ~ length(setdiff(.x, stopwords(source = source))))
}
plotting_data <- hcandersen_en %>%
nest(data = c(text)) %>%
mutate(tokens = map(data, ~ unlist(tokenize_words(.x$text))),
no_snowball = count_no_stopwords(tokens, "snowball"),
no_smart = count_no_stopwords(tokens, "smart"),
no_iso = count_no_stopwords(tokens, "stopwords-iso"),
n_tokens = lengths(tokens)) %>%
pivot_longer(no_snowball:no_iso) %>%
mutate(value = 1 - value / n_tokens)
stopwords_labels <- c("snowball (175)", "smart (571)", "stopwords-iso (1298)")
plotting_data %>%
mutate(name = factor(name,
levels = c("no_snowball", "no_smart", "no_iso"),
labels = stopwords_labels),
name = fct_rev(name)) %>%
ggplot(aes(n_tokens, value, color = name)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = FALSE) +
scale_y_continuous(labels = scales::percent) +
labs(
x = "Number of words in fairy tale",
y = "Percentage of words removed",
color = "Removed",
title = paste("Stop words take up a larger part of the text in",
"longer fairy tales"),
subtitle = paste("Each vertical trio of points represents an",
"H.C. Andersen fairy tale")
)
```
We notice, as we would predict, that larger stop word lists remove more words than shorter stop word lists. In this example with fairy tales, over half of the words have been removed, with the largest list removing over 80% of the words. We observe that shorter texts have a lower percentage of stop words. Since we are looking at fairy tales, this could be explained by the fact that a story has to be told regardless of the length of the fairy tale, so shorter texts are going to be denser with more informative words.
Another problem you may face is dealing with misspellings.\index{misspellings}
```{block, type = "rmdwarning"}
Most premade stop word lists assume that all the words are spelled correctly.
```
Handling misspellings when using premade lists can be done by manually adding common misspellings.\index{misspellings} You could imagine creating all words that are a certain string distance away from the stop words, but we do not recommend this as you would quickly include informative words this way.
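A minimal sketch; the misspellings below are hypothetical, not drawn from a real corpus.
```{r}
# Append misspellings you have actually observed in your corpus to a premade list
stop_words_extended <- c(stopwords(source = "snowball"), "teh", "adn", "thier")
```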
\index{preprocessing!challenges}One of the downsides of creating your own stop word lists using frequencies is that you are limited to using words that you have already observed. It could happen that "she'd" is included in your training corpus but the word "he'd" did not reach the threshold. This is a case where you need to look at your words and adjust accordingly. Here the large premade stop word lists can serve as inspiration for missing words.
In Section \@ref(casestudystopwords), we investigate the influence of removing stop words in the context of modeling. Given the right list of words, we see no harm to the model performance, and sometimes find improvement due to noise reduction [@Feldman2007].
## Stop words in languages other than English
So far in this chapter, we have focused on English stop words, but English is not representative of every language. The notions of "short" and "long" lists that we have used so far are specific to English as a language. You should expect different languages\index{language!Non-English} to have a different number of "uninformative" words, and for this number to depend on the morphological\index{morphology} richness of a language; lists that contain all possible morphological variants of each stop word could become quite large.
Different languages have different numbers of words in each word class. An example is how grammatical case influences the articles used in German. The following tables show the use of [definite and indefinite articles in German](https://deutsch.lingolia.com/en/grammar/nouns-and-articles/articles-noun-markers). Notice how German nouns have three genders (masculine, feminine, and neuter), which is not uncommon in languages around the world. Articles are almost always considered to be stop words in English as they carry very little information. However, German articles give some indication of the case, which can be used when selecting a list of stop words in German.
```{r, echo=FALSE}
library(magrittr)
library(gt)
tibble::tribble(
~Masculine, ~Feminine, ~Neuter, ~Plural, ~case,
"der", "die", "das", "die", "Nominative",
"den", "die", "das", "die", "Accusative",
"dem", "der", "dem", "den", "Dative",
"des", "der", "des", "der", "Genitive"
) %>%
gt(rowname_col = "case") %>%
tab_header(title = "German Definite Articles (the)")
```
```{r, echo=FALSE}
tibble::tribble(
~Masculine, ~Feminine, ~Neuter, ~case,
"ein", "eine", "ein", "Nominative",
"einen", "eine", "ein", "Accusative",
"einem", "einer", "einem", "Dative",
"eines", "einer", "eines", "Genitive"
) %>%
gt(rowname_col = "case") %>%
tab_header(title = "German Indefinite Articles (a/an)")
```
Building lists of stop words in Chinese has been done both manually and automatically [@Zou2006ACC] but so far none has been accepted as a standard [@Zou2006]. A full discussion of stop word identification in Chinese text would be out of scope for this book, so we will just highlight some of the challenges that differentiate it from English.
```{block, type = "rmdwarning"}
Chinese text is much more complex than portrayed here. With different writing systems and billions of users, there is much we won't be able to touch on here.
```
\index{language!Non-English}The main difference from English is the use of logograms instead of letters to convey information. However, Chinese characters should not be confused with Chinese words. The majority of words in modern Chinese are composed of multiple characters. This means that inferring word boundaries is more complicated, and the notion of stop words interacts with how this segmentation of characters into words is done.
## Summary {#stopwordssummary}
In many standard NLP workflows, the removal of stop words is presented as a default or the correct choice without comment. Although removing stop words can improve the accuracy of your machine learning models using text data, choices around such a step are complex. The content of existing stop word lists varies tremendously, and the available strategies for building your own can have subtle to not-so-subtle effects on your model results.
### In this chapter, you learned:
- what a stop word is and how to remove stop words from text data
- how different stop word lists can vary
- that the impact of stop word removal is different for different kinds of texts
- about the bias built into stop word lists and strategies for building such lists