[Feature Request]: <add factor_level_method argument for df_explicit_na function> #1322

Open
3 tasks done
kaipingyang opened this issue Oct 8, 2024 · 1 comment

@kaipingyang

Feature description

Hi @shajoezhu, in data preprocessing we often use the df_explicit_na function to convert character variables into factors.
The df_explicit_na source code calls factor(), which by default uses sort(unique(x)) as the factor levels.
We found the following issue:

R's default sort() does not give the same order as SAS's proc sort, because the default method collates strings according to the session locale.
However, sort() with method = "radix" (which always sorts character vectors in the C locale) and the tidyverse arrange() function both give the same order as SAS.

> library(tidyverse)
> data <- data.frame(
+   var = c("Cellulitis","COVID-19","Conjunctivitis","_","-","%")
+ )
> 
> sort(data$var)
[1] "-"              "%"              "_"             
[4] "Cellulitis"     "Conjunctivitis" "COVID-19"      
> sort(unique(data$var), method = "radix")
[1] "%"              "-"              "COVID-19"      
[4] "Cellulitis"     "Conjunctivitis" "_"             
> data %>% arrange(var)
             var
1              %
2              -
3       COVID-19
4     Cellulitis
5 Conjunctivitis
6              _
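
The difference above appears to come from collation rather than from the data. As a minimal sketch (the variable names are illustrative only), the following checks that method = "radix" gives the same order as the default sort() once the session collation is temporarily switched to the C locale:

```r
x <- c("Cellulitis", "COVID-19", "Conjunctivitis", "_", "-", "%")

# method = "radix" sorts character vectors in the C locale (byte order)
radix_sorted <- sort(x, method = "radix")

# Temporarily switch collation to "C", sort with the default method,
# then restore the previous setting
old_collate <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "C")
c_sorted <- sort(x)
Sys.setlocale("LC_COLLATE", old_collate)

identical(radix_sorted, c_sorted)
# [1] TRUE
```

So the radix result does not depend on the machine's locale, whereas the default result can differ between machines.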

To get the same factor level order as SAS proc sort, we need to set the factor levels explicitly to sort(unique(x), method = "radix").

> factor(data$var)
[1] Cellulitis     COVID-19       Conjunctivitis _             
[5] -              %             
Levels: - % _ Cellulitis Conjunctivitis COVID-19
> factor(data$var, levels = sort(unique(data$var), method = "radix"))
[1] Cellulitis     COVID-19       Conjunctivitis _             
[5] -              %             
Levels: % - COVID-19 Cellulitis Conjunctivitis _

As a result, the factor levels produced by df_explicit_na are also ordered differently from SAS.

> data1 <- df_explicit_na(data)
> data1$var
[1] Cellulitis     COVID-19       Conjunctivitis _             
[5] -              %             
Levels: - % _ Cellulitis Conjunctivitis COVID-19

So we would like to add a factor_level_method argument to the df_explicit_na function:

  • When factor_level_method = "data", the factor levels follow the order in which each value first appears in the data, that is, unique(x).
  • When factor_level_method = "sort_auto" or "default", the factor levels are sort(unique(x)).
  • When factor_level_method = "sort_radix", the factor levels are sort(unique(x), method = "radix").

Furthermore, could the method be set per variable by passing a named vector,
such as factor_level_method = c("a" = "data", "b" = "sort_radix")?
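
A rough sketch of the proposed behaviour in base R, to make the three options and the named-vector form concrete. The helpers make_levels() and chars_to_factors() are hypothetical names used only for illustration; they are not part of tern, and they do not reproduce the missing-value handling that df_explicit_na performs.

```r
# Hypothetical helper (not part of tern): choose factor levels for one vector
make_levels <- function(x, method = c("default", "sort_auto", "data", "sort_radix")) {
  method <- match.arg(method)
  ux <- unique(x[!is.na(x)])
  switch(method,
    data = ux,                               # order of first appearance
    sort_radix = sort(ux, method = "radix"), # C-locale order
    sort(ux)                                 # "default" / "sort_auto"
  )
}

# Hypothetical wrapper: convert character columns to factors, accepting either
# a single method or a named vector of per-variable methods
chars_to_factors <- function(data, factor_level_method = "default") {
  char_vars <- names(data)[vapply(data, is.character, logical(1))]
  for (v in char_vars) {
    m <- if (is.null(names(factor_level_method))) {
      factor_level_method[[1]]   # one method applied to every variable
    } else if (v %in% names(factor_level_method)) {
      factor_level_method[[v]]   # method given for this variable
    } else {
      "default"                  # variables not named fall back to the default
    }
    data[[v]] <- factor(data[[v]], levels = make_levels(data[[v]], m))
  }
  data
}

vals <- c("Cellulitis", "COVID-19", "Conjunctivitis", "_", "-", "%")
data <- data.frame(a = vals, b = vals, stringsAsFactors = FALSE)

out <- chars_to_factors(data, factor_level_method = c("a" = "data", "b" = "sort_radix"))
levels(out$a)
# [1] "Cellulitis"     "COVID-19"       "Conjunctivitis" "_"              "-"              "%"
levels(out$b)
# [1] "%"              "-"              "COVID-19"       "Cellulitis"     "Conjunctivitis" "_"
```

In an actual PR the level selection would presumably sit inside df_explicit_na where it currently builds the factor, with "default" keeping today's behaviour so existing outputs do not change.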

Code of Conduct

  • I agree to follow this project's Code of Conduct.

Contribution Guidelines

  • I agree to follow this project's Contribution Guidelines.

Security Policy

  • I agree to follow this project's Security Policy.
@Melkiades
Contributor

@kaipingyang I think this makes perfect sense! Feel free to open a PR with the above new parameter. I can personally review it so we get this in asap!

2 participants