clean_factor
msberends committed Nov 19, 2024
1 parent b262df1 commit 7dcfb33
Showing 27 changed files with 56 additions and 45 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/check-full.yaml
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
2 changes: 1 addition & 1 deletion .github/workflows/website.yaml
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
6 changes: 6 additions & 0 deletions NEWS.md
@@ -1,3 +1,9 @@
+# cleaner 1.5.5
+
+* For `clean_factor()` switched the names and values of `levels`
+* Fix CRAN check error
+
+
# cleaner 1.5.4

* For `clean_Date()` and `clean_POSIXct()`: allow argument `max_date` to be the same length as `x`
24 changes: 14 additions & 10 deletions R/clean.R
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
@@ -26,7 +26,7 @@
#' @param false [regex] to interpret values as `FALSE` (which defaults to [regex_false()]), see Details
#' @param na [regex] to force interpret values as `NA`, i.e. not as `TRUE` or `FALSE`
#' @param remove [regex] to define the character(s) that should be removed, see Details
-#' @param levels new factor levels, may be named with regular expressions to match existing values, see Details
+#' @param levels new factor levels, may be named regular expressions to match existing values, see Details
#' @param droplevels logical to indicate whether non-existing factor levels should be dropped
#' @param ordered logical to indicate whether the factor levels should be ordered
#' @param fixed logical to indicate whether regular expressions should be turned off
@@ -39,33 +39,33 @@
#' @param format character string giving a date-time format as used by [strptime()].
#'
#' For `clean_Date(..., guess_each = TRUE)`, this can be a vector of values to be used for guessing, see Examples.
-#' @param ... for `clean_Date` and `clean_POSIXct`: other parameters passed on these functions
+#' @param ... for `clean_Date` and `clean_POSIXct`: other arguments passed on these functions
#' @inheritParams base::as.POSIXct
#' @details
#' Using `clean()` on a vector will guess a cleaning function based on the potential number of `NA`s it returns. Using `clean()` on a data frame to apply this guessed cleaning over all columns.
#'
#' Info about the different functions:
#'
#' - **`clean_logical()`**:
-#' Use parameters `true` and `false` to match values using case-insensitive regular expressions ([regex]). Unmatched values are considered `NA`. By default, values are matched with [`regex_true`](#regex_true) and [`regex_false`](#regex_false). This allows support for values "Yes" and "No" in various languages. Use parameter `na` to override values as `NA` that would otherwise be matched with `true` or `false`. See Examples.
+#' Use arguments `true` and `false` to match values using case-insensitive regular expressions ([regex]). Unmatched values are considered `NA`. By default, values are matched with [regex_true()] and [regex_false()]. This allows support for values "Yes" and "No" in various languages. Use argument `na` to override values as `NA` that would otherwise be matched with `true` or `false`. See Examples.
#'
#' - **`clean_factor()`**:
-#' Use parameter `levels` to set new factor levels. They can be case-insensitive regular expressions to match existing values of `x`. For matching, new values for `levels` are internally temporarily sorted descending on text length. See Examples.
+#' Use argument `levels` to set new factor levels. They can be named case-insensitive regular expressions to match existing values of `x`. For matching, new values for `levels` are internally temporarily sorted descending on text length. See Examples.
#'
#' - **`clean_numeric()`, `clean_double()`, `clean_integer()` and `clean_character()`**:
-#' Use parameter `remove` to match values that must be removed from the input, using regular expressions ([regex]). In the case of `clean_numeric()`, commas will be read as dots and only the last dot will be kept. Function `clean_character()` will keep middle spaces by default. See Examples.
+#' Use argument `remove` to match values that must be removed from the input, using regular expressions ([regex]). In the case of `clean_numeric()`, commas will be read as dots and only the last dot will be kept. Function `clean_character()` will keep middle spaces by default. See Examples.
#'
#' - **`clean_percentage()`**:
-#' This new class works like `clean_numeric()`, but transforms it with [`as.percentage`](#as.percentage), which will retain the original values but will print them as percentages. See Examples.
+#' This new class works like `clean_numeric()`, but transforms it with [as.percentage()], which will retain the original values but will print them as percentages. See Examples.
#'
#' - **`clean_currency()`**:
-#' This new class works like `clean_numeric()`, but transforms it with [`as.currency`](#as.currency). The currency symbol is guessed based on the most traded currencies by value (see Source): the United States dollar, Euro, Japanese yen, Pound sterling, Swiss franc, Renminbi, Swedish krona, Mexican peso, South Korean won, Turkish lira, Russian ruble, Indian rupee, and the South African rand. See Examples.
+#' This new class works like `clean_numeric()`, but transforms it with [as.currency()]. The currency symbol is guessed based on the most traded currencies by value (see Source): the United States dollar, Euro, Japanese yen, Pound sterling, Swiss franc, Renminbi, Swedish krona, Mexican peso, South Korean won, Turkish lira, Russian ruble, Indian rupee, and the South African rand. See Examples.
#'
#' - **`clean_Date()`**:
-#' Use parameter `format` to define a date format or leave it empty to have the format guessed. Use `"Excel"` to read values as Microsoft Excel dates. The `format` parameter will be evaluated with [`format_datetime`](#format_datetime), meaning that a format like `"d-mmm-yy"` will be translated internally to `"%e-%b-%y"` for convenience. See Examples.
+#' Use argument `format` to define a date format or leave it empty to have the format guessed. Use `"Excel"` to read values as Microsoft Excel dates. The `format` argument will be evaluated with [format_datetime()], meaning that a format like `"d-mmm-yy"` will be translated internally to `"%e-%b-%y"` for convenience. See Examples.
#'
#' - **`clean_POSIXct()`**:
-#' Use parameter `remove` to match values that must be removed from the input, using regular expressions ([regex]). The resulting string will be coerced to a date/time element with class `POSIXct`, using [`as.POSIXct()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.POSIXct.html). See Examples.
+#' Use argument `remove` to match values that must be removed from the input, using regular expressions ([regex]). The resulting string will be coerced to a date/time element with class `POSIXct`, using [as.POSIXct()]. See Examples.
#'
#' The use of invalid regular expressions in any of the above functions will not return an error (as in base R) but will instead interpret the expression as a fixed value and will throw a warning.
#' @rdname clean
@@ -92,6 +92,7 @@
#' clean_factor(gender_age, c("M", "F"))
#' clean_factor(gender_age, c("Male", "Female"))
#' clean_factor(gender_age, c("0-50", "50+"), ordered = TRUE)
+#' clean_factor(gender_age, levels = c("Group A" = "female", "Group B" = "male 50+", Other = ".*"))
#'
#' clean_Date("13jul18", "ddmmmyy")
#' clean_Date("12 August 2010")
@@ -195,6 +196,9 @@ clean_logical <- function(x, true = regex_true(), false = regex_false(), na = NU
#' @rdname clean
#' @export
clean_factor <- function(x, levels = unique(x), ordered = FALSE, droplevels = FALSE, fixed = FALSE, ignore.case = TRUE) {
+  if (!is.null(names(levels))) {
+    levels <- stats::setNames(names(levels), levels)
+  }
  if (!all(levels %in% x)) {
    new_x <- rep(NA_character_, length(x))
    # sort descending on character length
2 changes: 1 addition & 1 deletion R/currency.R
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
2 changes: 1 addition & 1 deletion R/data.R
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
2 changes: 1 addition & 1 deletion R/format_datetime.R
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
2 changes: 1 addition & 1 deletion R/format_names.R
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
2 changes: 1 addition & 1 deletion R/format_p_value.R
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
2 changes: 1 addition & 1 deletion R/freq.R
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
2 changes: 1 addition & 1 deletion R/helpers.R
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
2 changes: 1 addition & 1 deletion R/na_replace.R
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
2 changes: 1 addition & 1 deletion R/percentage.R
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
2 changes: 1 addition & 1 deletion R/rdate.R
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
2 changes: 1 addition & 1 deletion R/regex_true_false.R
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
2 changes: 1 addition & 1 deletion R/zzz.R
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
6 changes: 3 additions & 3 deletions README.md
@@ -99,9 +99,9 @@ Use `clean()` to clean data. It guesses what kind of data class would best fit y
You can also name your levels to let them match your values. They support regular expressions too:

```r
-clean_factor(gender_age, levels = c("female" = "Group A",
-                                    "male 50+" = "Group B",
-                                    ".*" = "Other"))
+clean_factor(gender_age, levels = c("Group A" = "female",
+                                    "Group B" = "male 50+",
+                                    Other = ".*"))
#> [1] Other Group B Group A Group A
#> Levels: Group A Group B Other
```
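
The lines added to `clean_factor()` in `R/clean.R` reverse the names and values of a named `levels` vector. Presumably this lets callers write `label = pattern` while the rest of the function keeps working with the earlier `pattern = label` layout. A minimal base-R sketch of that swap alone, outside the package (the variable names `new_style` and `swapped` are illustrative and do not appear in the commit):

```r
# New-style argument: names are the labels, values are the regular expressions
new_style <- c("Group A" = "female", "Group B" = "male 50+", Other = ".*")

# Same call as in the commit: the old names become the values,
# and the old values become the names
swapped <- stats::setNames(names(new_style), new_style)

# Each pattern now looks up its label
swapped[["female"]]    # "Group A"
swapped[["male 50+"]]  # "Group B"
swapped[[".*"]]        # "Other"
```

This only illustrates the renaming step; how the patterns are then matched against `x` is handled by the remainder of `clean_factor()`, which this commit view shows only in part.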
2 changes: 1 addition & 1 deletion _pkgdown.yml
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
2 changes: 1 addition & 1 deletion data_raw/unclean.R
@@ -6,7 +6,7 @@
# https://github.com/msberends/cleaner #
# #
# LICENCE #
-# (c) 2022 Berends MS ([email protected]) #
+# 2019-2024 Berends MS ([email protected]) #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #