Uninformative error message when exhausting names #83

Open
joshwlambert opened this issue Jan 18, 2024 · 2 comments

Comments

@joshwlambert
Contributor

It seems that when randomNames() runs out of names (with sample.with.replacement = FALSE), it fails with an uninformative error message about sampling. It would be great if the {randomNames} package could give the user a custom, informative error message when the requested number of names is too large. That message could also suggest setting sample.with.replacement to TRUE as a workaround.

Here is a reprex that shows the problem:

library(randomNames)
set.seed(1)
gender <- rep(c("M", "F"), 2525)
names <- randomNames::randomNames(
    which.names = "both",
    name.sep = " ",
    name.order = "first.last",
    gender = gender,
    sample.with.replacement = FALSE
)
str(names)
#>  chr [1:5050] "Sebastian Clayton" "Melisa White" "Eli Jackson" "Malisse Ha" ...

gender <- rep(c("M", "F"), 3000)
names <- randomNames::randomNames(
    which.names = "both",
    name.sep = " ",
    name.order = "first.last",
    gender = gender,
    sample.with.replacement = FALSE
)
#> Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'

Created on 2024-01-18 with reprex v2.0.2
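
As a rough illustration of the kind of messaging requested above, here is a minimal sketch that catches the base R sampling error and rethrows it with a hint. The helper name `with_informative_error()` is hypothetical and not part of {randomNames}; it also assumes R's messages are in English, since it matches on the message text. The demo uses base `sample()` so it runs without the package.

```r
# Hypothetical helper (not part of {randomNames}): evaluates `expr` and, if
# base R's "sample larger than the population" error occurs, rethrows it
# with a hint about `sample.with.replacement`.
# Assumption: messages are in English, because the match is on message text.
with_informative_error <- function(expr) {
  tryCatch(
    expr,
    error = function(e) {
      exhausted <- grepl(
        "cannot take a sample larger than the population",
        conditionMessage(e),
        fixed = TRUE
      )
      if (exhausted) {
        stop(
          "More unique names were requested than are available. ",
          "Request fewer names or set `sample.with.replacement = TRUE`.",
          call. = FALSE
        )
      }
      # any other error is passed on unchanged
      stop(e)
    }
  )
}

# Demo with base R only: sampling 5 values from 3 without replacement
msg <- tryCatch(
  with_informative_error(sample(1:3, 5)),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Under the same assumptions, the failing `randomNames::randomNames(...)` call from the reprex could be passed as `expr` in place of `sample(1:3, 5)`.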

@dbetebenner
Member

Thank you for the comment.

I will think about how to add better messaging for the circumstance you provide.

If you are interested in getting a longer list of unique first.last name combinations, you can change sample.with.replacement = TRUE and then select out the unique combinations that occur.

The error you report most likely occurs because the internal data doesn't contain enough male or female first names to sample without replacement. Because the package builds combinations of first and last names, there are probably millions of possible combinations.

To get 25,000 first/last name combinations you could do the following:

gender <- rep(c("M", "F"), 15000)
names <- randomNames::randomNames(
    which.names = "both",
    name.sep = " ",
    name.order = "first.last",
    gender = gender,
    sample.with.replacement = TRUE
)

unique_names <- head(unique(names), 25000)

I asked for 30,000 names to begin with to make sure I had 25,000 uniques.

I've considered how to add this little trick for creating LONG lists of names, but haven't quite figured out how to put this into the package well.

@joshwlambert
Contributor Author

Thanks for the response. I hadn't realised that sample.with.replacement = TRUE had a higher capacity for unique names. The suggestion of oversampling and then subsetting out the unique names worked well for my case. Here is a function I put together for the {simulist} package, which uses {randomNames}. Feel free to use some of this code if it would be useful for {randomNames}.

#' Sample names using [randomNames::randomNames()]
#'
#' @description
#' Sample names for specified genders by sampling with replacement to avoid
#' exhausting the number of names when `sample.with.replacement = FALSE`. The
#' names duplicated during sampling need to be removed to ensure each
#' individual has a unique name. In order to have enough unique names, more
#' names than required are sampled from [randomNames()], and the level of
#' oversampling is determined by the `buffer_factor` argument. If
#' `buffer_factor` is too high, more names than needed are sampled, which
#' takes longer; if `buffer_factor` is too low, not enough unique names are
#' sampled and the `.sample_names()` function will need to loop until it has
#' enough unique names.
#'
#' @inheritParams .add_date
#' @param buffer_factor A single `numeric` determining the level of
#' oversampling (or buffer) when creating a vector of unique names from
#' [randomNames()].
#'
#' @return A `character` vector.
#' @keywords internal
.sample_names <- function(.data,
                          buffer_factor = 1.5) {
  m_idx <- .data$gender == "m"
  f_idx <- .data$gender == "f"
  num_m <- sum(m_idx)
  num_f <- sum(f_idx)
  num_sample_m <- ceiling(num_m * buffer_factor)
  num_sample_f <- ceiling(num_f * buffer_factor)

  # create sample of names so there are no duplicates
  names_m <- character(0)
  while(length(names_m) < num_m) {
    names_m <- unique(
      randomNames::randomNames(
        which.names = "both",
        name.sep = " ",
        name.order = "first.last",
        gender = rep("M", num_sample_m),
        sample.with.replacement = TRUE
      )
    )
  }

  names_f <- character(0)
  while(length(names_f) < num_f) {
    names_f <- unique(
      randomNames::randomNames(
        which.names = "both",
        name.sep = " ",
        name.order = "first.last",
        gender = rep("F", num_sample_f),
        sample.with.replacement = TRUE
      )
    )
  }

  # subset to use required number of names
  # head() also handles a count of zero, where 1:0 would index c(1, 0)
  names_m <- head(names_m, num_m)
  names_f <- head(names_f, num_f)

  # order names with gender codes from .data
  names_mf <- vector(mode = "character", length = nrow(.data))
  names_mf[m_idx] <- names_m
  names_mf[f_idx] <- names_f

  # return vector of names
  names_mf
}
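
One possible variation on the loop above, sketched under assumptions: instead of discarding each draw that comes up short, accumulate unique values across rounds and stop with an informative error after a fixed number of attempts. The helper `sample_unique()`, its `generator` argument, and `max_iter` are hypothetical names introduced for illustration; they are not part of {randomNames} or {simulist}. The demo uses a base R generator so it runs without the package.

```r
# Hypothetical helper: draw values from `generator(k)` (any function that
# returns k values, possibly with duplicates) until `n` unique ones exist.
sample_unique <- function(n, generator, buffer_factor = 1.5, max_iter = 100L) {
  stopifnot(n >= 0, buffer_factor >= 1, max_iter >= 1)
  out <- character(0)
  iter <- 0L
  while (length(out) < n) {
    if (iter >= max_iter) {
      stop(
        "Only ", length(out), " unique values found after ", max_iter,
        " rounds; the pool may hold fewer than ", n, " unique values.",
        call. = FALSE
      )
    }
    # oversample only the shortfall, then keep uniques found so far
    needed <- ceiling((n - length(out)) * buffer_factor)
    out <- unique(c(out, generator(needed)))
    iter <- iter + 1L
  }
  head(out, n)
}

# Demo with a base R generator: 20 unique letters from a pool of 26
letter_gen <- function(k) sample(letters, k, replace = TRUE)
picked <- sample_unique(20, letter_gen)
print(length(picked))
print(anyDuplicated(picked))
```

With {randomNames}, the generator could be something like `function(k) randomNames::randomNames(which.names = "both", name.sep = " ", name.order = "first.last", gender = rep("M", k), sample.with.replacement = TRUE)`, mirroring the call used above. The `max_iter` cap is a design choice to avoid looping forever when fewer than `n` unique names exist.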
