\cleardoublepage
# (APPENDIX) Appendices {-}
# Regular expressions {#regexp}
> Some people, when confronted with a problem, think: "I know, I'll use regular expressions." Now they have two problems.
> `r tufte::quote_footer('--- [Jamie Zawinski](https://en.wikiquote.org/wiki/Jamie_Zawinski)')`
This section gives a brief overview of how to write and use a regular expression, often abbreviated *regex*. Regular expressions are a way to specify or search for patterns in strings using a sequence of characters. By combining a selection of simple patterns, we can capture quite complicated strings.
Many functions in R take advantage of regular expressions. Some examples from base R include `grep`, `grepl`, `regexpr`, `gregexpr`, `sub`, `gsub`, and `strsplit`, as well as `ls` and `list.files`. The **stringr** package [@Wickham19] uses regular expressions extensively; the regular expressions are passed as the `pattern =` argument. Regular expressions can be used to detect, locate, or extract parts of a string.
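As a minimal sketch of the base R functions mentioned above, the following chunk detects, locates, replaces, and splits with a single literal pattern. The vector `fruits` is made up for illustration only.

```{r}
# Base R only; `fruits` is a hypothetical example vector
fruits <- c("apple", "banana", "cherry")

grepl("an", fruits)       # detect: does each element contain "an"?
regexpr("an", fruits)     # locate: start of the first match, -1 if there is none
sub("an", "AN", fruits)   # replace only the first match in each element
gsub("an", "AN", fruits)  # replace every match in each element
strsplit("a,b,c", ",")    # split a string wherever the pattern matches
```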
## Literal characters
The most basic regular expression consists of only a single character. Here let's detect if each of the following strings in the character vector `animals` contains the letter "j".
```{r}
library(stringr)
animals <- c("jaguar", "jay", "bat")
str_detect(animals, "j")
```
We are also able to *extract* the match with `str_extract`. This may not seem too useful right now, but it becomes very helpful once we use more advanced regular expressions.
```{r}
str_extract(animals, "j")
```
Lastly, we are able to *locate* the position of a match using `str_locate`.
```{r}
str_locate(animals, "j")
```
```{block, type = "rmdnote"}
The functions `str_detect`, `str_extract`, and `str_locate` are some of the simplest and most powerful functions in **stringr**, but the package includes many more. To see the remaining functions, run `help(package = "stringr")` to open the documentation.
```
We can also match multiple characters in a row.
```{r}
animals <- c("jaguar", "jay", "bat")
str_detect(animals, "jag")
```
Notice that these matches are case sensitive.
```{r}
wows <- c("wow", "WoW", "WOW")
str_detect(wows, "wow")
```
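If case should not matter, one option is to wrap the pattern in `regex()` from **stringr** and set `ignore_case = TRUE`. This is a small sketch of that option, not the only way to do it.

```{r}
library(stringr)

wows <- c("wow", "WoW", "WOW")
# regex() wraps a pattern so that matching options can be set;
# ignore_case = TRUE makes the match case insensitive
str_detect(wows, regex("wow", ignore_case = TRUE))
```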
### Meta characters
There are 14 meta characters that carry special meaning inside regular expressions. We need to "escape" them with a backslash if we want to match the literal character (and backslashes need to be doubled in R). Think of "escaping" as stripping the character of its special meaning.
The plus symbol `+` is one of the special meta characters for regular expressions.
```{r}
math <- c("1 + 2", "14 + 5", "3 - 5")
str_detect(math, "\\+")
```
If we tried to use the plus sign without escaping it, like `"+"`, we would get an error and this line of code would not run.
The complete list of meta characters is displayed in Table \@ref(tab:metacharacters) [@levithan2012regular].
```{r metacharacters, echo=FALSE}
tibble::tribble(
~Description, ~Character,
"opening square bracket", "[",
"closing square bracket", "]",
"backslash", "\\\\",
"caret", "^",
"dollar sign", "$",
"period/dot", ".",
"vertical bar", "|",
"question mark", "?",
"asterisk", "*",
"plus sign", "+",
"opening curly brackets", "{",
"closing curly brackets", "}",
"opening parentheses", "(",
"closing parentheses", ")"
) %>%
knitr::kable(caption = "\\textbf{TABLE A.1:} All meta characters", booktabs = TRUE)
```
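The same escaping rule applies to every meta character. The sketch below uses a made-up vector `remarks` to compare an escaped question mark and an escaped dot with `fixed()`, a **stringr** helper that treats the whole pattern as literal text and so needs no escaping.

```{r}
library(stringr)

# Hypothetical example strings
remarks <- c("Why?", "Because.")

str_detect(remarks, "\\?")        # escaped: matches a literal question mark
str_detect(remarks, "\\.")        # escaped: matches a literal dot
str_detect(remarks, ".")          # unescaped: the dot matches any character
str_detect(remarks, fixed("?"))   # fixed() compares the pattern literally
```

Escaping is the better fit when only part of the pattern is literal, while `fixed()` is convenient when none of the pattern should be interpreted as a regular expression.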
## Full stop, the wildcard
Let's start with the full stop/period/dot, which acts as a "wildcard". This means that this character will match any single character in its place other than a newline character.
```{r}
strings <- c("cat", "cut", "cue")
str_extract(strings, "c.")
str_extract(strings, "c.t")
```
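The newline exception can be checked directly. In this sketch the string `two_lines` is made up, and the `dotall = TRUE` option of `regex()` is used to let the dot match a newline as well.

```{r}
library(stringr)

two_lines <- "a\nb"  # hypothetical string with a newline between "a" and "b"

str_detect(two_lines, "a.b")                        # the dot skips the newline
str_detect(two_lines, regex("a.b", dotall = TRUE))  # dotall lets "." match it
```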
## Character classes
So far we have only been able to match either exact characters or wildcards. **Character classes** (also called character sets) let us do more than that. A character class allows us to match a character specified inside the class. A character class is constructed with square brackets. The character class `[ac]` will match *either* an "a" or a "c".
```{r}
strings <- c("a", "b", "c")
str_detect(strings, "[ac]")
```
```{block, type = "rmdnote"}
Spaces inside character classes are meaningful, as they are interpreted as literal characters. Thus the character class `"[ac]"` will match the letters "a" and "c", while the character class `"[a c]"` will match the letters "a" and "c" but also a space.
```
We can use a hyphen character to define a range of characters. Thus `[1-5]` is the same as `[12345]`.
```{r}
numbers <- c("1", "2", "3", "4", "5", "6", "7", "8", "9")
str_detect(numbers, "[2-7]")
sentence <- "This is a long sentence with 2 numbers with 1 digits."
str_locate_all(sentence, "[1-2a-b]")
```
We can also negate characters in a class with a caret `^`. Placing a caret immediately inside the opening square bracket will make the regular expression match anything *not* inside the class. Thus the regular expression `[^ac]` will match anything that isn't the letter "a" or "c".
```{r}
strings <- c("a", "b", "c")
str_detect(strings, "[^ac]")
```
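Ranges and negation are often combined for clean-up tasks. As a small sketch, the made-up string `phone` below is reduced to its digits by removing every character that is *not* in the range `0-9`.

```{r}
library(stringr)

phone <- "Call 555-0199 now"  # hypothetical input

str_extract_all(phone, "[0-9]")[[1]]  # every digit, one match per character
str_remove_all(phone, "[^0-9]")       # drop everything that is not a digit
```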
### Shorthand character classes
Certain character classes are so commonly used that they have been predefined with names. A couple of these character classes have even shorter shorthands. The class `[:digit:]` denotes all the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 but it can also be described by `\\d`. Table \@ref(tab:characterclasses) presents these useful predefined character classes.
```{r characterclasses, echo=FALSE}
tibble::tribble(
~Description, ~Class,
"Digits; [0-9]", "[:digit:] or \\\\\\\\d",
"Alphabetic characters, uppercase and lowercase [A-Za-z]", "[:alpha:]",
"Alphanumeric characters, letters, and digits [A-Za-z0-9]", "[:alnum:]",
"Graphical characters [[:alnum:][:punct:]]", "[:graph:]",
"Printable characters [[:alnum:][:punct:][:space:]]", "[:print:]",
"Lowercase letters [a-z]", "[:lower:]",
"Uppercase letters [A-Z]", "[:upper:]",
"Control characters such as newline, carriage return, etc.", "[:cntrl:]",
"Punctuation characters: !\"#$%&'()*+,-./:;<=>?@[]^_`{|}~", "[:punct:]",
"Space and tab", "[:blank:]",
"Space, tab, vertical tab, newline, form feed, carriage return", "[:space:] or \\\\\\\\s",
"Hexadecimal digits [0-9A-Fa-f]", "[:xdigit:]",
"Not space [^[:space:]]", "\\\\\\\\S",
"Word characters: letters, digits, and underscores [A-Za-z0-9_]", "\\\\\\\\w",
"Non-word characters [^A-Za-z0-9_]", "\\\\\\\\W",
"Non-digits [^0-9]", "\\\\\\\\D"
) %>%
knitr::kable(caption = "\\textbf{TABLE A.2:} All character classes", booktabs = TRUE)
```
Notice that these shorthands are locale specific. This means that the Danish character ø will be picked up in class `[:lower:]` but not in the class `[a-z]` as the character isn't located between a and z.
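A short sketch of both points: the two spellings of the digit class give the same matches, and the character ø (written as the escape `"\u00f8"` so the chunk does not depend on file encoding) is matched by `[:lower:]` but not by `[a-z]`. The string `address` is made up for illustration.

```{r}
library(stringr)

address <- "Room 101, floor 3"  # hypothetical input

str_extract_all(address, "\\d")[[1]]        # shorthand for digits
str_extract_all(address, "[:digit:]")[[1]]  # named class, same matches

o_slash <- "\u00f8"  # the Danish letter ø
str_detect(o_slash, "[:lower:]")  # the named class includes ø
str_detect(o_slash, "[a-z]")      # the range a to z does not
```

In base R functions such as `grepl`, a named class has to be written inside a bracket expression, for example `[[:digit:]]`.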
## Quantifiers
We can specify how many times we expect something to occur using quantifiers. If we want to find a number with four digits, we don't have to write `[:digit:][:digit:][:digit:][:digit:]`. Table \@ref(tab:greedyquantifiers) shows how to specify repetitions. Notice that `?` is shorthand for `{0,1}`, `*` is shorthand for `{0,}`, and `+` is shorthand for `{1,}` [@levithan2012regular].
```{r greedyquantifiers, echo=FALSE}
tibble::tribble(
~Regex, ~Matches,
"?", "zero or one times",
"*", "zero or more times",
"+", "one or more times",
"{n}", "exactly n times",
"{n,}", "at least n times",
"{n,m}", "between n and m times"
) %>%
knitr::kable(caption = "\\textbf{TABLE A.3:} Regular expression quantifiers", booktabs = TRUE)
```
We can detect both color and colour by placing the quantifier `?` after the "u", so that the "u" is matched zero or one times.
```{r}
col <- c("colour", "color", "farver")
str_detect(col, "colou?r")
```
And we can extract four-digit numbers using `{4}`.
```{r}
sentences <- c("The year was 1776.", "Alexander Hamilton died at 47.")
str_extract(sentences, "\\d{4}")
```
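The range forms `{n,m}` and `{n,}` work the same way. In this sketch the vector `codes` is made up; note that a string with fewer digits than the lower bound yields `NA`, and that `{2,4}` stops after four digits even when more are available.

```{r}
library(stringr)

codes <- c("7", "42", "2024", "123456")  # hypothetical input

str_extract(codes, "\\d{2,4}")  # two to four digits, as many as possible
str_extract(codes, "\\d{2,}")   # at least two digits, no upper limit
```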
Sometimes we want the repetition to happen over multiple characters. This can be achieved by wrapping what we want repeated in parentheses. In the following example, we want to match all the instances of "NA" in the string. We put `"NA "` inside a set of parentheses and `+` after to make sure we match at least once.
```{r}
batman <- "NA NA NA NA NA NA NA NA NA NA NA NA NA NA BATMAN!!!"
str_extract(batman, "(NA )+")
```
However, notice that this also matches the last space, which we don't want. We can fix this by matching zero or more "NA " followed by exactly 1 "NA".
```{r}
batman <- "NA NA NA NA NA NA NA NA NA NA NA NA NA NA BATMAN!!!"
str_extract(batman, "(NA )*(NA){1}")
```
By default these matches are "greedy", meaning that they will try to match the longest string possible. We can instead make them "lazy" by placing a `?` after, as shown in Table \@ref(tab:lazyquantifiers). This will make the regular expressions try to match the shortest string possible instead of the longest.
```{r lazyquantifiers, echo=FALSE}
tibble::tribble(
~Regex, ~Matches,
"??", "zero or one times, prefers 0",
"*?", "zero or more times, match as few times as possible",
"+?", "one or more times, match as few times as possible",
"{n}?", "exactly n times, match as few times as possible",
"{n,}?", "at least n times, match as few times as possible",
"{n,m}?", "between n and m times, match as few times as possible but at least n"
) %>%
knitr::kable(caption = "\\textbf{TABLE A.4:} Lazy quantifiers", booktabs = TRUE)
```
Comparing greedy and lazy matches gives us 7 and 3 "NA "'s, respectively.
```{r}
batman <- "NA NA NA NA NA NA NA NA NA NA NA NA NA NA BATMAN!!!"
str_extract(batman, "(NA ){3,7}")
str_extract(batman, "(NA ){3,7}?")
```
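A common place where the difference matters is text with repeated delimiters. The sketch below uses a made-up string `html`: the greedy pattern runs from the first `<` to the last `>`, while the lazy pattern stops at the first `>` it can.

```{r}
library(stringr)

html <- "<b>bold</b> and <i>italic</i>"  # hypothetical input

str_extract(html, "<.+>")             # greedy: spans the whole string
str_extract(html, "<.+?>")            # lazy: stops at the first ">"
str_extract_all(html, "<.+?>")[[1]]   # lazy, all matches: one per tag
```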
## Anchors
The meta characters `^` and `$` have special meaning in regular expressions. They force the engine to check the beginning and end of the string, respectively, hence the name **anchor**. A mnemonic device to remember this is "First you get the power(`^`) and then you get the money(`\$`)".
```{r}
seasons <- c("The summer is hot this year",
             "The spring is a lovely time",
             "Winter is my favorite time of the year",
             "Fall is a time of peace")
str_detect(seasons, "^The")
str_detect(seasons, "year$")
```
We can also combine the two to match a string completely.
```{r}
folder_names <- c("analysis", "data-raw", "data", "R")
str_detect(folder_names, "^data$")
```
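Anchors are handy for matching file extensions or prefixes. As a sketch with made-up file names, `str_subset` keeps only the elements that match, and the escaped dot combined with `$` ensures that ".csv" is the very end of the name.

```{r}
library(stringr)

# Hypothetical file names
files <- c("report.csv", "csv_notes.txt", "archive.csv.zip")

str_subset(files, "\\.csv$")  # names that end in ".csv"
str_subset(files, "^csv")     # names that start with "csv"
```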
## Additional resources
This appendix covered some of the basics of getting started with (or refreshed about) regular expressions. If you want to learn more:
- RStudio maintains [an excellent collection of cheat sheets](https://www.rstudio.com/resources/cheatsheets/), some of which are related to regular expressions.
- www.rexegg.com has many pages of valuable information, including [this "quick start" page with helpful tables](https://www.rexegg.com/regex-quickstart.html).
- https://www.regular-expressions.info/ is another great general regular expression site.
- The [strings chapter](https://r4ds.had.co.nz/strings.html) in *R for Data Science* [@Wickham2017] delves into examples written in R.
- Lastly, if you want to go down to the metal, check out [*Mastering Regular Expressions*](http://shop.oreilly.com/product/9780596528126.do).