NEWS 
====

Versioning
----------

Releases will be numbered with the following semantic versioning format:

<major>.<minor>.<patch>

And constructed with the following guidelines:

* Breaking backward compatibility bumps the major (and resets the minor 
  and patch)
* New additions without breaking backward compatibility bump the minor 
  (and reset the patch)
* Bug fixes and misc changes bump the patch
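
The bump rules above can be sketched in a few lines of base R.  Note that
`bump_version` is a hypothetical helper written for illustration only; it is
not part of **textclean**.

```r
# Hypothetical helper (not part of textclean): bump a
# "<major>.<minor>.<patch>" version string according to the kind of change.
bump_version <- function(version, change = c("patch", "minor", "major")) {
    change <- match.arg(change)
    parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
    stopifnot(length(parts) == 3L, !anyNA(parts))
    names(parts) <- c("major", "minor", "patch")
    if (change == "major") {
        # Breaking change: bump major, reset minor and patch
        parts <- c(parts[["major"]] + 1L, 0L, 0L)
    } else if (change == "minor") {
        # Backward-compatible addition: bump minor, reset patch
        parts <- c(parts[["major"]], parts[["minor"]] + 1L, 0L)
    } else {
        # Bug fix or misc change: bump patch only
        parts[["patch"]] <- parts[["patch"]] + 1L
    }
    paste(parts, collapse = ".")
}

bump_version("0.9.3", "patch")  # "0.9.4"
bump_version("0.9.4", "minor")  # "0.10.0"
bump_version("0.9.4", "major")  # "1.0.0"
```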


textclean 0.9.4 -  
----------------------------------------------------------------

BUG FIXES

* `replace_emoticon` replaced emoticon-like substrings within actual words.  
  Spotted thanks to Carolyn Challoner; see issue #46.

* `replace_number` failed if the number pattern contained two leading decimals 
  or hyphens.  Spotted thanks to Stefano De Sabbata; see issue #60.
  
* `replace_word_elongation` failed when the same character was repeated in
  different cases (e.g., `replace_word_elongation("Ooo")` resulted in `NA`).
  This has been corrected.  Additionally, the search pattern, defined as 
  `"(?i)(?:^|\\b)\\w*([a-z])(?:\\1{2,})\\w*($|\\b)"`, has been moved externally
  to the `elongation.search.pattern` parameter, allowing the user to alter this
  pattern if desired.  Spotted thanks to Stefano De Sabbata; see issue #59.
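
To see what the elongation search pattern quoted above detects, it can be
tried directly with base R's `grepl`.  This is only a sketch of the pattern's
behavior, not of how `replace_word_elongation` uses it internally; the variable
names are made up for illustration.

```r
# Pattern copied from the entry above.  perl = TRUE is used so that the
# inline (?i) flag and the \\1 back-reference follow PCRE semantics.
elongation_pattern <- "(?i)(?:^|\\b)\\w*([a-z])(?:\\1{2,})\\w*($|\\b)"

x <- c("Ooo", "sooo good", "too", "book")

# A letter followed by two or more repeats of itself (ignoring case) matches.
has_elongation <- grepl(elongation_pattern, x, perl = TRUE)
has_elongation  # TRUE TRUE FALSE FALSE
```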

NEW FEATURES

* `replace_misspelling` added as a way to replace misspelled words with their 
  most likely replacement using **hunspell** in the backend.  Suggested by Surin
  Space; see issue #39.
  
* `as_ordinal` added as a convenience wrapper for `english::ordinal` that 
  takes integers and converts them to ordinal form.
  
* `%like%` added as a binary operator similar to SQL's LIKE.

MINOR FEATURES

* `fix_mdyyyy` added to correct dates in the form of m/d/yyyy to yyyy-mm-dd.

IMPROVEMENTS

* `replace_html` picks up the ability to replace "&laquo;" and "&raquo;" with
  the ASCII equivalents "<<" and ">>".  Suggested by Ilya Shutov; see issue #48.

* All internal calls to `grepl()` now set `perl = TRUE`, as this is generally
  faster.  Suggested by Kyle Haynes (see #51).
  
CHANGES

* `filter_element()` and `filter_row()` have been deprecated for a few years.  
  They have now been removed.
  

textclean 0.9.3 
----------------------------------------------------------------

Version update to comply with changes in the **glue** package's API.


textclean 0.8.0 - 0.9.2
----------------------------------------------------------------

BUG FIXES

* `fgsub` had a bug in which the original `pattern` matched the correct
  location in the string, but the replacement was then applied to the entire
  string rather than to the location of the first `pattern` match.  This
  means the extracted string was used as a search term and might be found in
  places other than the original location (e.g., in the string 'The Title', a
  leading boundary in '^T' replaced with '__' may have led to '__he __itle'
  rather than the expected '__he Title').  See #35 for details.  The fix adds
  some computation time but is safer.
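
The failure mode described above can be reproduced with base R alone.  The
following is a sketch of the two strategies (search for the extracted text
versus replace at the matched position); it is not the actual `fgsub` source.

```r
x <- "The Title"
m <- regexpr("^T", x)           # match data: starts at 1, length 1
extracted <- regmatches(x, m)   # "T"

# Problematic strategy: use the extracted text as a new search term, which
# also hits the "T" in "Title".
buggy <- gsub(extracted, "__", x, fixed = TRUE)
buggy      # "__he __itle"

# Safer strategy: replace only at the location that was actually matched.
corrected <- x
regmatches(corrected, m) <- "__"
corrected  # "__he Title"
```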

NEW FEATURES

* `replace_to`/`replace_from` added to remove text from the beginning of a
  string up to a character(s) or from a character(s) to the end of a string,
  respectively.
  
* The following replacement functions were added to provide remediation for 
  problems found in `check_text`: `replace_email`, `replace_hash`, 
  `replace_tag`, & `replace_url`.

MINOR FEATURES

* `check_text` picks up `checks` and `n` arguments.  The former allows the user
  to specify which checks to conduct.  The latter allows the user to truncate the
  output to `n` elements with a closing `...[truncated]...`.  This makes
  the function more useful and the code easier to maintain.

IMPROVEMENTS

* `replace_non_ascii` did not replace all non-ASCII characters.  This has been
  fixed by an explicit replacement of '[^ -~]+', which matches every character
  outside the printable ASCII range (space through tilde).  See issue #34 for
  details.
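
As a rough illustration of what the '[^ -~]+' pattern covers, here is a
base R sketch using `gsub`.  Passing `perl = TRUE` is an assumption made here
so that the range is evaluated by code point; the entry above does not say how
`replace_non_ascii` calls the pattern internally.

```r
# Non-ASCII characters are written as escapes to keep this file ASCII-only.
x <- c("caf\u00e9", "na\u00efve text", "plain ASCII", "tab\there")

# Everything outside the range space..tilde is dropped.  Note that this also
# removes control characters such as tabs, not only non-ASCII characters.
stripped <- gsub("[^ -~]+", "", x, perl = TRUE)
stripped  # "caf" "nave text" "plain ASCII" "tabhere"
```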


textclean 0.7.3
----------------------------------------------------------------

Maintenance release to bring package up to date with the lexicon package API changes.


textclean 0.7.0 - 0.7.2
----------------------------------------------------------------

NEW FEATURES

* `match_tokens` added to find all the tokens that match a regex(es) within a
  given text vector.  This is useful when combined with the `replace_tokens` 
  function.
  
* Fixed versions of `drop_element`/`keep_element` added to allow for dropping
  elements specified by a known vector rather than a regex.

* The `collapse` and `glue` functions from the **glue** package are reexported
  for easy string manipulation.
  
* `replace_date` added for normalizing dates.

* `replace_time` added for normalizing time stamps.

* `replace_money` added for normalizing money references.

* `mgsub` picks up a `safe` argument using the **mgsub** package as the backend.
  In addition, `mgsub_regex_safe` was added to make the usage explicit.  The
  safe mode comes at the cost of speed.

IMPROVEMENTS

* `replace_names` drops the replacement of 
  `c('An', 'To', 'Oh', 'So', 'Do', 'He', 'Ha', 'In', 'Pa', 'Un')` which are 
  likely words and not names.
    
* `replace_html` picks up some additional symbol replacements including:
  `c("&trade;", "&ldquo;", "&rdquo;", "&lsquo;", "&rsquo;", "&bull;", "&middot;", 
  "&sdot;", "&ndash;", "&mdash;", "&ne;", "&frac12;", "&frac14;", "&frac34;", 
  "&deg;", "&larr;", "&rarr;", "&hellip;")`.


textclean 0.6.0 - 0.6.3
----------------------------------------------------------------

NEW FEATURES

* `replace_kern` added to replace a form of informal emphasis in which the
  writer takes words >2 letters long, capitalizes the entire word, and places
  spaces in between each letter.  This was contributed by Stack Overflow's
  @ctwheels: https://stackoverflow.com/a/47438305/1000343.

* `replace_internet_slang` added to replace Internet acronyms and abbreviations
  with machine friendly word equivalents.
  
* `replace_word_elongation` added to replace word elongations (a.k.a. "word 
  lengthening") with the most likely normalized word form.  See 
  http://www.aclweb.org/anthology/D11-105 for details.
  
* `fgsub` added for the ability to match, extract, operate a function over the
  extracted strings, & replace the original matches with the extracted strings.
  This performs similar functionality to `gsubfn::gsubfn` but is less powerful.
  For more powerful needs see the **gsubfn** package.


textclean 0.4.0 - 0.5.1
----------------------------------------------------------------

BUG FIXES

* `replace_grade` did not use `fixed = TRUE` for its call to `mgsub`.  This could
  result in the plus signs being interpreted as meta-characters.  This has been 
  corrected.
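
The effect of `fixed = TRUE` on a grade such as "A+" can be seen with base
R's `gsub`.  This sketch uses `gsub` rather than `mgsub`, and the replacement
word is made up for illustration.

```r
x <- "I got an A+ on the exam"

# As a regex, "A+" means "one or more A", so the literal plus sign is left
# behind in the output.
as_regex <- gsub("A+", "excellent", x)
as_regex  # "I got an excellent+ on the exam"

# With fixed = TRUE the plus sign is matched literally.
as_fixed <- gsub("A+", "excellent", x, fixed = TRUE)
as_fixed  # "I got an excellent on the exam"
```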

NEW FEATURES

* `replace_names` added to remove/replace common first and last names from text 
  data.
  
* `make_plural` added to make a vector of singular noun forms plural.

* `replace_emoji` and `replace_emoji_identifier` added for replacing emojis with
  text or an identifier token for use in the **sentimentr** package.

MINOR FEATURES

* `mgsub_regex` and `mgsub_fixed` added as wrappers for `mgsub` that make
  the matching mode apparent without setting the `fixed` argument.
  
* `replace_curly_quote` added to replace curly quotes with straight versions.

IMPROVEMENTS

* `replace_non_ascii` now uses `stringi::stri_trans_general` to coerce more 
  non-ASCII characters to ASCII format.
  
* `check_text` now checks for HTML characters/tags.  Thanks to @Peter Gensler
  for suggesting this (see issue #15). 

CHANGES

* `filter_` functions deprecated in favor of `drop_`/`keep_` versions of filter
  functions.  This change was made to address the opposite meaning that
  **dplyr**'s `filter` has, which retains rows matching a pattern by default.


textclean 0.3.1
----------------------------------------------------------------

NEW FEATURES

* `replace_tokens` added to complement `mgsub` for times when the user wants to 
  replace fixed tokens with a single value or remove them entirely.  This yields 
  an optimized solution that is much faster than `mgsub`.

CHANGES

* `mgsub` no longer uses `trim = TRUE` by default.

textclean 0.2.1 - 0.3.0
----------------------------------------------------------------

BUG FIXES

* `check_text` suggested using `replace_incomplete` rather than 
  `add_missing_endmark` when an endmark was missing.
  
NEW FEATURES

* The `replace_emoticon`, `replace_grade` and `replace_rating` functions have 
  been moved from the **sentimentr** package to **textclean** as these are 
  cleaning functions.  This makes the functions more modular and generalizable 
  to all types of text cleaning.  These functions are still imported and 
  exported by **sentimentr**.
  
* `replace_html` added to remove HTML tags and replace symbols with appropriate
  ASCII symbols.
  
* `add_missing_endmark` added to detect missing endmarks and replace with the 
  desired symbol.

IMPROVEMENTS

* `replace_number` now uses the **english** package making it faster and more 
  maintainable.  In addition, the function now handles decimal places as well.


textclean 0.1.0 - 0.2.0
----------------------------------------------------------------

BUG FIXES

* `check_text` reported `NA` as non-ASCII.  This has been fixed.

NEW FEATURES

* `check_text` added to report on potential problems in a text vector.

* `replace_ordinal` added to replace ordinal numbers (e.g., 1st) with word 
  representation (e.g., first).
  
* `swap` added to swap two patterns simultaneously.

* `filter_element` added to exclude matching elements from a vector.


textclean 0.0.1 
----------------------------------------------------------------

This package is a collection of tools to clean and process text.  Many of these tools have been taken from the **qdap** package and revamped to be more intuitive, better named, and faster.