Hi! Thank you for developing and maintaining this great package!
I am using Dirichlet regression and would like to approximate model quality with some sort of R-squared. Is there a sensible way to devise a pseudo-R2 for a Dirichlet model? When doing so, how should we treat precision in the alternative parametrization?
I have tried a few options (discussed below), but I do not have much detailed knowledge of pseudo-R2 measures (I also tried the rsq package, version 2.5, which did not work):
1 – Spearman's correlation between observed and fitted values, squared (fitted values are on the probability scale).
2 – MuMIn::r.squaredLR(); a null model has to be declared manually for Dirichlet models.
3 – performance::r2(), which returns Nagelkerke's R2.
4 – McFadden's pseudo-R2 (manual calculation, with a manually devised null model).
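Option 1 can be sketched in base R. The `obs` and `fit` matrices here are hypothetical stand-ins for the observed compositions and the `fitted()` probabilities of a real model; the point is only the column-wise squared Spearman correlation and its average:

```r
set.seed(1)
# Toy stand-ins: rows normalized to sum to 1, one column per component
obs <- prop.table(matrix(runif(30), ncol = 3), margin = 1)
fit <- prop.table(obs + matrix(runif(30, 0, 0.2), ncol = 3), margin = 1)
# Squared Spearman correlation per component, then averaged
r2.comp <- sapply(seq_len(ncol(obs)),
                  function(j) cor(obs[, j], fit[, j], method = "spearman")^2)
mean(r2.comp)
```
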
The definition of a null model is a bit tricky in the case of the alternative parametrization: should it also drop the explanatory variables from the precision part? I used intercept-only models for all categories in the common parametrization, while in the alternative parametrization I tried two nulls: intercept-only for the means with explanatory variables kept for precision, and intercept-only for both means and precision. I did this for two datasets provided in the DirichletReg package (Arctic lake and blood samples).
As I see it, the only consistent and sensible result is the squared correlation between observed and fitted values. MuMIn::r.squaredLR() could return very high values (regardless of model parametrization): close to 0.9 for a model whose squared observed-fitted correlation is 0.6. performance::r2() gave values similar to MuMIn::r.squaredLR() for the alternative parametrization, while for the common parametrization it appeared more reasonable.
Intriguingly, MuMIn and performance return the same value if the null model in MuMIn is declared with update(mod, . ~ 1). Possibly this is a default behavior that does not account for the pipes that separate the estimation of means and precision (or multiple categories in the common parametrization).
McFadden's pseudo-R2 returned consistently negative values: the log-likelihoods were positive for both the full and the null models, and larger for the full models, so their ratio does not shrink towards 0 as the full model improves. Therefore 1 – loglik(full_model)/loglik(null_model) becomes negative. It has already been pointed out that McFadden's measure will not work if both log-likelihoods are positive (https://stats.stackexchange.com/questions/552281/calculating-different-pseudo-r2-for-a-betareg-model).
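To make the sign problem concrete, here is the arithmetic with the log-likelihoods reported in the reprex for the Arctic-lake alternative model (full: 101.1859, intercept-only null: 39.52929, n = 39 rows). The LR-based (Cox-Snell-type) quantity that MuMIn::r.squaredLR() appears to compute stays below 1, while McFadden's ratio goes negative because both log-likelihoods are positive:

```r
ll.full <- 101.1859  # logLik of the full alternative model (from the reprex)
ll.null <- 39.52929  # logLik of the intercept-only null (means and precision)
n <- 39              # number of observations in ArcticLake

# McFadden: negative whenever both log-likelihoods are positive (~ -1.56)
mcfadden <- 1 - ll.full / ll.null

# LR-based R2, 1 - exp(-2/n * (llf - ll0)): ~ 0.958, matching r.squaredLR()
r2.lr <- 1 - exp(-2 / n * (ll.full - ll.null))
```
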
Thank you!
library(DirichletReg)
#> Warning: package 'DirichletReg' was built under R version 4.2.3
#> Loading required package: Formula
library(MuMIn)
#> Warning: package 'MuMIn' was built under R version 4.2.3
library(performance)
#> Warning: package 'performance' was built under R version 4.2.3
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.2.2
#> Warning: package 'tibble' was built under R version 4.2.3
#> Warning: package 'tidyr' was built under R version 4.2.3
#> Warning: package 'purrr' was built under R version 4.2.3
#> Warning: package 'dplyr' was built under R version 4.2.3
#> Warning: package 'stringr' was built under R version 4.2.3

# Calculate multiple metrics of "model quality"
dirichlet.pseudo.r2 <- function(resp, model) {
  null <- if (model[["parametrization"]] == "common") {
    update(model, formula(paste0(". ~ ", paste0(rep("1 |", ncol(resp) - 1), collapse = ""), "1")))
  } else update(model, . ~ 1 | 1)
  cor.fit.obs.sq <- sapply(1:ncol(resp), function(x, df = resp, mod = model)
    cor(fitted(mod)[, x], df[, x], method = "spearman")^2)
  cor.fit.obs.sq.mean <- mean(cor.fit.obs.sq)
  mumin.lrt <- if (model[["parametrization"]] == "common") {
    MuMIn::r.squaredLR(object = model, null = null)
  } else c(MuMIn::r.squaredLR(object = model, null = update(model, . ~ 1 | .)),
           MuMIn::r.squaredLR(object = model, null = null))
  performance.r2 <- as.numeric(performance::r2(model))
  mc.fadden <- if (model[["parametrization"]] == "common") {
    1 - as.numeric(logLik(model)) / as.numeric(logLik(null))
  } else c(1 - as.numeric(logLik(model)) / as.numeric(logLik(update(model, . ~ 1 | .))),
           1 - as.numeric(logLik(model)) / as.numeric(logLik(null)))
  result <- data.frame(
    "cor.fit.obs.sq" = paste0(round(cor.fit.obs.sq, digits = 3), collapse = ", "),
    "cor.fit.obs.sq.mean" = round(cor.fit.obs.sq.mean, digits = 3),
    "mumin.lrt" = paste0(round(mumin.lrt, digits = 3), collapse = ", "),
    "performance.r2" = round(performance.r2, digits = 3),
    "mc.fadden" = paste0(round(mc.fadden, digits = 3), collapse = ", "))
  return(result)
}
# Arctic lake (from DirichletReg package)
al.data <- DR_data(ArcticLake[, 1:3])
#> Warning in DR_data(ArcticLake[, 1:3]): not all rows sum up to 1 => normalization
#> forced
al.mod.c <- DirichReg(al.data ~ depth, ArcticLake, "common")
al.mod.a <- DirichReg(al.data ~ depth | depth, ArcticLake, "alternative")

# Blood samples (from DirichletReg package)
bld.reduced <- subset(BloodSamples, !is.na(Disease)) # remove NA
bld.data <- DR_data(bld.reduced[, 1:4])
#> Warning in DR_data(bld.reduced[, 1:4]): not all rows sum up to 1 =>
#> normalization forced
bld.mod.c <- DirichReg(bld.data ~ Disease, bld.reduced, "common")
bld.mod.a <- DirichReg(bld.data ~ Disease | Disease, bld.reduced, "alternative")
# Compare the pseudo-r2 metrics
bind_rows(
  list("artic.lake.common" = dirichlet.pseudo.r2(resp = al.data, model = al.mod.c),
       "artic.lake.alternative" = dirichlet.pseudo.r2(resp = al.data, model = al.mod.a),
       "blood.samples.common" = dirichlet.pseudo.r2(resp = bld.data, model = bld.mod.c),
       "blood.samples.alternative" = dirichlet.pseudo.r2(resp = bld.data, model = bld.mod.a)),
  .id = "model_parametrization")
#>       model_parametrization         cor.fit.obs.sq cor.fit.obs.sq.mean
#> 1         artic.lake.common    0.741, 0.195, 0.736               0.557
#> 2    artic.lake.alternative    0.741, 0.194, 0.736               0.557
#> 3      blood.samples.common 0.022, 0.229, 0, 0.248               0.125
#> 4 blood.samples.alternative 0.022, 0.229, 0, 0.248               0.125
#>      mumin.lrt performance.r2      mc.fadden
#> 1        0.958          0.177         -1.564
#> 2 0.921, 0.958          0.921  -0.956, -1.56
#> 3        0.222          0.036         -0.025
#> 4 0.207, 0.222          0.207 -0.023, -0.025

# Compare values with null models declared with update(model, . ~ 1)
performance::r2(bld.mod.a)
#> Nagelkerke's R2: 0.207
MuMIn::r.squaredLR(bld.mod.a, null = update(bld.mod.a, . ~ 1))
#> [1] 0.206889
#> attr(,"adj.r.squared")
#> [1] -1.01549e-05
performance::r2(al.mod.a)
#> Nagelkerke's R2: 0.921
MuMIn::r.squaredLR(al.mod.a, null = update(al.mod.a, . ~ 1))
#> [1] 0.9208783
#> attr(,"adj.r.squared")
#> [1] -0.0698377
performance::r2(bld.mod.c)
#> Nagelkerke's R2: 0.036
MuMIn::r.squaredLR(bld.mod.c, null = update(bld.mod.c, . ~ 1))
#> [1] 0.03581869
#> attr(,"adj.r.squared")
#> [1] -1.446171e-06
performance::r2(al.mod.c)
#> Nagelkerke's R2: 0.177
MuMIn::r.squaredLR(al.mod.c, null = update(al.mod.c, . ~ 1))
#> [1] 0.1771588
#> attr(,"adj.r.squared")
#> [1] -0.001197615

# Log-likelihoods
lapply(list(al.mod.a, al.mod.c, bld.mod.a, bld.mod.c), logLik)
#> [[1]]
#> 'log Lik.' 101.1859 (df=6)
#>
#> [[2]]
#> 'log Lik.' 101.3697 (df=6)
#>
#> [[3]]
#> 'log Lik.' 152.3073 (df=8)
#>
#> [[4]]
#> 'log Lik.' 152.3073 (df=8)
lapply(list(update(al.mod.a, . ~ 1 | 1), update(al.mod.c, . ~ 1 | 1 | 1),
            update(bld.mod.a, . ~ 1 | 1), update(bld.mod.c, . ~ 1 | 1 | 1 | 1)), logLik)
#> [[1]]
#> 'log Lik.' 39.52929 (df=3)
#>
#> [[2]]
#> 'log Lik.' 39.52929 (df=3)
#>
#> [[3]]
#> 'log Lik.' 148.537 (df=4)
#>
#> [[4]]
#> 'log Lik.' 148.537 (df=4)
sessionInfo()
#> R version 4.2.0 (2022-04-22 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22631)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8
#> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C
#> [5] LC_TIME=Spanish_Spain.utf8
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#>  [1] forcats_0.5.1      stringr_1.5.0      dplyr_1.1.2        purrr_1.0.1
#>  [5] readr_2.1.2        tidyr_1.3.0        tibble_3.2.1       ggplot2_3.4.0
#>  [9] tidyverse_1.3.1    performance_0.10.5 MuMIn_1.47.5       DirichletReg_0.7-1
#> [13] Formula_1.2-4
#>
#> loaded via a namespace (and not attached):
#>  [1] lubridate_1.8.0  lattice_0.20-45  zoo_1.8-10       assertthat_0.2.1
#>  [5] digest_0.6.33    utf8_1.2.2       cellranger_1.1.0 R6_2.5.1
#>  [9] backports_1.4.1  reprex_2.0.2     stats4_4.2.0     evaluate_0.15
#> [13] httr_1.4.3       highr_0.9        pillar_1.9.0     miscTools_0.6-28
#> [17] rlang_1.1.1      readxl_1.4.0     rstudioapi_0.13  Matrix_1.4-1
#> [21] rmarkdown_2.14   munsell_0.5.0    broom_0.8.0      modelr_0.1.8
#> [25] compiler_4.2.0   xfun_0.40        pkgconfig_2.0.3  maxLik_1.5-2
#> [29] htmltools_0.5.6  insight_0.19.5   tidyselect_1.2.0 fansi_1.0.3
#> [33] crayon_1.5.1     tzdb_0.3.0       dbplyr_2.1.1     withr_2.5.0
#> [37] grid_4.2.0       nlme_3.1-157     jsonlite_1.8.0   gtable_0.3.0
#> [41] lifecycle_1.0.3  DBI_1.1.2        magrittr_2.0.3   scales_1.2.1
#> [45] cli_3.6.1        stringi_1.7.6    fs_1.5.2         xml2_1.3.3
#> [49] ellipsis_0.3.2   generics_0.1.2   vctrs_0.6.3      sandwich_3.0-1
#> [53] tools_4.2.0      glue_1.6.2       hms_1.1.1        fastmap_1.1.0
#> [57] yaml_2.3.5       colorspace_2.0-3 rvest_1.0.2      knitr_1.39
#> [61] haven_2.5.0
Created on 2024-02-16 with reprex v2.0.2