NEWS
====
Versioning
----------
Releases will be numbered with the following semantic versioning format:
<major>.<minor>.<patch>
And constructed with the following guidelines:
* Breaking backward compatibility bumps the major (and resets the minor
and patch)
* New additions without breaking backward compatibility bump the minor
(and reset the patch)
* Bug fixes and misc changes bump the patch
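
As a minimal sketch of the three rules above, the helper below bumps a `<major>.<minor>.<patch>` string. The name `bump_version` and its `level` argument are hypothetical and used only for illustration; they are not part of **textclean**.

```r
# Hypothetical helper (not part of textclean): apply the versioning rules
# above to a "<major>.<minor>.<patch>" string.
bump_version <- function(version, level = c("patch", "minor", "major")) {
  level <- match.arg(level)
  parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 3L, !anyNA(parts))
  names(parts) <- c("major", "minor", "patch")
  if (level == "major") {
    # Breaking change: bump major, reset minor and patch
    parts <- c(parts[["major"]] + 1L, 0L, 0L)
  } else if (level == "minor") {
    # Backward-compatible addition: bump minor, reset patch
    parts <- c(parts[["major"]], parts[["minor"]] + 1L, 0L)
  } else {
    # Bug fix or misc change: bump patch only
    parts[["patch"]] <- parts[["patch"]] + 1L
  }
  paste(parts, collapse = ".")
}

bump_version("0.9.4", "patch")
bump_version("0.9.4", "minor")
bump_version("0.9.4", "major")
```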
textclean 0.9.4 -
----------------------------------------------------------------
BUG FIXES
* `replace_emoticon` replaced emoticon-like substrings within actual words.
Spotted thanks to Carolyn Challoner; see issue #46.
* `replace_number` failed if the number pattern contained two leading decimals
or hyphens. Spotted thanks to Stefano De Sabbata; see issue #60.
* `replace_word_elongation` failed when the same character was repeated in
different cases (e.g., `replace_word_elongation("Ooo")` resulted in `NA`). This
has been corrected. Additionally, the `elongation.search.pattern`, defined as
`"(?i)(?:^|\\b)\\w*([a-z])(?:\\1{2,})\\w*($|\\b)"`, has been moved externally to
a parameter, allowing the user to alter this pattern if desired. Spotted
thanks to Stefano De Sabbata; see issue #59.
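
The effect of the leading `(?i)` in the elongation pattern can be checked with base R alone. This is only a sketch of how the pattern behaves under `grepl(..., perl = TRUE)`; it does not call `replace_word_elongation`, and the variable names are illustrative.

```r
# Pattern quoted in the entry above; (?i) makes the match case-insensitive,
# including the \\1 backreference.
elongation_pattern <- "(?i)(?:^|\\b)\\w*([a-z])(?:\\1{2,})\\w*($|\\b)"

# Same pattern with the (?i) flag stripped, for comparison only.
case_sensitive_pattern <- sub("(?i)", "", elongation_pattern, fixed = TRUE)

words <- c("Ooo", "sooo", "so", "book")

# A word counts as elongated when one letter repeats three or more times.
with_flag <- grepl(elongation_pattern, words, perl = TRUE)
without_flag <- grepl(case_sensitive_pattern, words, perl = TRUE)

data.frame(words, with_flag, without_flag)
```

With the flag, "Ooo" is detected because "O" and "o" are treated as the same character; without it, only the same-case run in "sooo" is found.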
NEW FEATURES
* `replace_misspelling` added as a way to replace misspelled words with their
most likely replacement using **hunspell** in the backend. Suggested by Surin
Space; see issue #39.
* `as_ordinal` added as a convenience wrapper for `english::ordinal` that
takes integers and converts them to ordinal form.
* `%like%` added as a binary operator similar to SQL's LIKE.
MINOR FEATURES
* `fix_mdyyyy` added to correct dates in the form of m/d/yyyy to yyyy-mm-dd.
IMPROVEMENTS
* `replace_html` picks up the ability to replace "«" & "»" with ASCII
equivalents "<<" & ">>". Suggested by Ilya Shutov; see issue #48.
* All internal calls to `grepl()` now have `perl = TRUE` added as this is
generally faster. Suggested by Kyle Haynes (see #51).
CHANGES
* `filter_element()` and `filter_row()` have been deprecated for a few years.
They have now been removed.
textclean 0.9.3
----------------------------------------------------------------
Version update to comply with changes in the **glue** package's API.
textclean 0.8.0 - 0.9.2
----------------------------------------------------------------
BUG FIXES
* `fgsub` had a bug in which the original `pattern` matched the correct
location in the string, but the replacement was then performed on the
entire string rather than at the location of the first `pattern` match. This
meant the extracted string was used as a search term and might be found in
places other than the original location (e.g., a leading boundary in '^T'
replaced with '__' may have led to '__he __itle' rather than '__he Title' as
expected in the string 'The Title'). See #35 for details. The fix adds some
time to the computation but is safer.
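
The difference between the two behaviors can be reproduced with base R. This is an illustrative sketch of the problem using `regexpr`, `gsub`, and `regmatches<-`; it is not `fgsub`'s actual implementation, and the variable names are made up for the example.

```r
x <- "The Title"

# Locate the first match of the pattern and extract the matched text.
m <- regexpr("^T", x, perl = TRUE)
hit <- regmatches(x, m)

# Problematic approach: search for the extracted text across the whole
# string, which also hits the "T" in "Title".
buggy <- gsub(hit, "__", x, fixed = TRUE)

# Location-based approach: replace only at the matched position.
located <- x
regmatches(located, m) <- "__"

buggy
located
```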
NEW FEATURES
* `replace_to`/`replace_from` added to remove text from the beginning of a
string up to a character(s) (`replace_to`) or from a character(s) to the end
of the string (`replace_from`).
* The following replacement functions were added to provide remediation for
problems found in `check_text`: `replace_email`, `replace_hash`,
`replace_tag`, & `replace_url`.
MINOR FEATURES
* `check_text` picks up `checks` and `n` arguments. The former allows the user
to specify which checks to conduct. The latter allows the user to truncate the
output to `n` elements with a closing `...[truncated]...`. This makes
the function more useful and the code easier to maintain.
IMPROVEMENTS
* `replace_non_ascii` did not replace all non-ASCII characters. This has been
fixed by an explicit replacement of '[^ -~]+', which matches any run of
characters outside the printable ASCII range (space through tilde).
See issue #34 for details.
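
A short base R sketch of what the '[^ -~]+' character class matches. It uses `grepl` and `gsub` directly rather than `replace_non_ascii`, and the input vector is made up for illustration. Note that the class also catches control characters such as tabs, since they fall outside the space-to-tilde range.

```r
# Anything outside the range from space (0x20) to tilde (0x7E).
non_ascii_pattern <- "[^ -~]+"

x <- c("caf\u00e9 cr\u00e8me", "plain text", "tab\there")

# Flag elements that contain at least one character outside the range.
has_non_ascii <- grepl(non_ascii_pattern, x, perl = TRUE)

# Drop every run of such characters.
cleaned <- gsub(non_ascii_pattern, "", x, perl = TRUE)

has_non_ascii
cleaned
```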
textclean 0.7.3
----------------------------------------------------------------
Maintenance release to bring package up to date with the lexicon package API changes.
textclean 0.7.0 - 0.7.2
----------------------------------------------------------------
NEW FEATURES
* `match_tokens` added to find all the tokens that match a regex(es) within a
given text vector. This is useful when combined with the `replace_tokens`
function.
* Fixed versions of `drop_element`/`keep_element` added to allow for dropping
elements specified by a known vector rather than a regex.
* The `collapse` and `glue` functions from the **glue** package are reexported
for easy string manipulation.
* `replace_date` added for normalizing dates.
* `replace_time` added for normalizing time stamps.
* `replace_money` added for normalizing money references.
* `mgsub` picks up a `safe` argument using the **mgsub** package as the backend.
In addition `mgsub_regex_safe` added to make the usage explicit. The safe mode
comes at the cost of speed.
IMPROVEMENTS
* `replace_names` drops the replacement of
`c('An', 'To', 'Oh', 'So', 'Do', 'He', 'Ha', 'In', 'Pa', 'Un')` which are
likely words and not names.
* `replace_html` picks up some additional symbol replacements including:
`c("™", "“", "”", "‘", "’", "•", "·",
"⋅", "–", "—", "≠", "½", "¼", "¾",
"°", "←", "→", "…")`.
textclean 0.6.0 - 0.6.3
----------------------------------------------------------------
NEW FEATURES
* `replace_kern` added to replace a form of informal emphasis in which the
writer takes words >2 letters long, capitalizes the entire word, and places
spaces in between each letter. This was contributed by Stack Overflow's
@ctwheels: https://stackoverflow.com/a/47438305/1000343.
* `replace_internet_slang` added to replace Internet acronyms and abbreviations
with machine friendly word equivalents.
* `replace_word_elongation` added to replace word elongations (a.k.a. "word
lengthening") with the most likely normalized word form. See
http://www.aclweb.org/anthology/D11-105 for details.
* `fgsub` added for the ability to match, extract, operate a function over the
extracted strings, & replace the original matches with the extracted strings.
This performs similar functionality to `gsubfn::gsubfn` but is less powerful.
For more powerful needs see the **gsubfn** package.
textclean 0.4.0 - 0.5.1
----------------------------------------------------------------
BUG FIXES
* `replace_grade` did not use `fixed = TRUE` for its call to `mgsub`. This could
result in the plus signs being interpreted as meta-characters. This has been
corrected.
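
Why `fixed = TRUE` matters for grades such as "A+" can be shown with base `gsub`, used here as a stand-in for `mgsub`. The input string and replacement are invented for the example.

```r
x <- "Grade: A+ overall"

# Without fixed = TRUE the "+" is a regex quantifier ("one or more A"),
# so only the "A" is replaced and the literal "+" is left behind.
as_regex <- gsub("A+", "excellent", x)

# With fixed = TRUE the pattern is the literal two-character string "A+".
as_fixed <- gsub("A+", "excellent", x, fixed = TRUE)

as_regex
as_fixed
```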
NEW FEATURES
* `replace_names` added to remove/replace common first and last names from text
data.
* `make_plural` added to make a vector of singular noun forms plural.
* `replace_emoji` and `replace_emoji_identifier` added for replacing emojis with
text or an identifier token for use in the **sentimentr** package.
MINOR FEATURES
* `mgsub_regex` and `mgsub_fixed` added to provide wrappers for `mgsub` that
make their use apparent without setting the `fixed` argument.
* `replace_curly_quote` added to replace curly quotes with straight versions.
IMPROVEMENTS
* `replace_non_ascii` now uses `stringi::stri_trans_general` to coerce more
non-ASCII characters to ASCII format.
* `check_text` now checks for HTML characters/tags. Thanks to @Peter Gensler
for suggesting this (see issue #15).
CHANGES
* `filter_` functions deprecated in favor of `drop_`/`keep_` versions of filter
functions. This change was made to address the opposite meaning that **dplyr**'s
`filter` has, which retains rows matching a pattern by default.
textclean 0.3.1
----------------------------------------------------------------
BUG FIXES
* `replace_tokens` added to complement `mgsub` for times when the user wants to
replace fixed tokens with a single value or remove them entirely. This yields
an optimized solution that is much faster than `mgsub`.
CHANGES
* `mgsub` no longer uses `trim = TRUE` by default.
textclean 0.2.1 - 0.3.0
----------------------------------------------------------------
BUG FIXES
* `check_text` suggested using `replace_incomplete` rather than
`add_missing_endmark` when an endmark was missing.
NEW FEATURES
* The `replace_emoticon`, `replace_grade` and `replace_rating` functions have
been moved from the **sentimentr** package to **textclean** as these are
cleaning functions. This makes the functions more modular and generalizable
to all types of text cleaning. These functions are still imported and
exported by **sentimentr**.
* `replace_html` added to remove HTML tags and replace symbols with appropriate
ASCII symbols.
* `add_missing_endmark` added to detect missing endmarks and replace with the
desired symbol.
IMPROVEMENTS
* `replace_number` now uses the **english** package making it faster and more
maintainable. In addition, the function now handles decimal places as well.
textclean 0.1.0 - 0.2.0
----------------------------------------------------------------
BUG FIXES
* `check_text` reported `NA` as non-ASCII. This has been fixed.
NEW FEATURES
* `check_text` added to report on potential problems in a text vector.
* `replace_ordinal` added to replace ordinal numbers (e.g., 1st) with word
representation (e.g., first).
* `swap` added to swap two patterns simultaneously.
* `filter_element` added to exclude matching elements from a vector.
textclean 0.0.1
----------------------------------------------------------------
This package is a collection of tools to clean and process text. Many of these tools have been taken from the **qdap** package and revamped to be more intuitive, better named, and faster.