Summary
While using the caret package to compare machine learning models, we observed that the accuracy of LogitBoost models may be overestimated. The issue arises because observations for which no prediction is made are silently excluded from the confusion matrix and from the performance metrics computed from it.
Description
We recently conducted a model comparison study using machine learning methods to attribute cases of human salmonellosis to various animal-based food sources (e.g., pig, bovine, poultry). We used the caret package in R to compare Random Forest (RF), LogitBoost (LB), and Support Vector Machine (SVM) models, training each with the train function both with and without cross-validation.
During the prediction phase on the test dataset (with a known response variable), we observed the following behavior:
- All models predicted the food source for every observation, except LogitBoost.
- For LogitBoost, whenever two or more predicted class probabilities were identical for a given observation, the model appropriately made no prediction and returned NA instead.
However, the issue arises during accuracy calculation:
- Observations without a prediction (NA values) are excluded from the confusion matrix.
- This exclusion potentially inflates the reported accuracy of LogitBoost, because only the successfully predicted observations are scored (see the short illustration below).
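A minimal, self-contained illustration of the mechanism, independent of our data: table(), which as far as we can tell is what confusionMatrix uses internally to tabulate predictions against the reference, drops NA entries by default, so unpredicted observations simply vanish from the counts.
x <- factor(c("A", NA, "B", NA), levels = c("A", "B"))
length(x)      # 4 observations
sum(table(x))  # 2 -- the two NA entries are not counted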
In our opinion, model performance evaluation should account for these missing predictions to ensure a fair comparison across models; one way to do so is sketched at the end of the example code below.
Below is an example illustrating the discrepancy between RF and LB predictions (without cross-validation), using an artificial dataset. The confusion matrix and accuracy for RF include all observations, while for LB, they exclude cases with NA predictions.
# ------------------------------ Example code --------------------------------
library(caret)
# --------------------------------- DATASET ----------------------------------
set.seed(123)
n=500
# Y variable
response <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
# X variables: qualitative and quantitative
predictor <- factor(sample(c("X", "Y", "Z"), n, replace = TRUE))
quant_var1 <- rnorm(n, mean = 50, sd = 10)
quant_var2 <- runif(n, min = 0, max = 100)
dataset <- data.frame(Species = response,
                      Predictor = predictor,
                      Quantitative1 = quant_var1,
                      Quantitative2 = quant_var2)
# ------------------ TRAIN AND TEST WITHOUT CROSS-VALIDATION ------------------
set.seed(839)
id=sample(1:NROW(dataset), 0.70*NROW(dataset))
data_train=dataset[id,]
data_test= dataset[-id,]
data_test2<-data_test
#---- Logit Boost -------------------------------------------------------------------- #
set.seed(839)
# trainControl(method = "none") fits a single model with the default tuning
# parameters, i.e., no resampling/cross-validation
model_logitboost = train(Species ~ .,
                         data = data_train,
                         method = "LogitBoost", metric = "Accuracy",
                         trControl = trainControl(method = "none"))
model_logitboost
predictions_logitboost=predict(model_logitboost, data_test)
conf_matrix_logitboost=confusionMatrix(predictions_logitboost,data_test$Species)
conf_matrix_logitboost
#Accuracy
accuracy<-conf_matrix_logitboost$overall[1]
accuracy
#0.2545455
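# NOTE: this accuracy is computed only over the observations that received
# a prediction (55 of 150, as shown below), not over the whole test set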
probability<-predict(model_logitboost, data_test, type="prob")
probability2=cbind(probability, predictions_logitboost)
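# Illustrative check (our addition, based on the behaviour described above):
# NA predictions should coincide with tied maximum class probabilities
ties <- apply(probability, 1, function(p) sum(p == max(p)) > 1)
table(ties, is.na(predictions_logitboost))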
#N of observations included in confusion matrix
sum(table(predictions_logitboost)) #55
dim(data_test)[1] #150
#% predicted
sum(table(predictions_logitboost))/dim(data_test)[1] #0.37
#% not predicted
1-sum(table(predictions_logitboost))/dim(data_test)[1] #0.63
# -----------------------------------------------------------------------------
# ------------------------------- Random Forest --------------------------------
set.seed(839)
model_rf = train(Species ~ .,
                 data = data_train,
                 method = "rf", metric = "Accuracy",
                 trControl = trainControl(method = "none"))
model_rf
model_rf$finalModel
predictions_rf=predict(model_rf, data_test)
conf_matrix_rf=confusionMatrix(predictions_rf, data_test$Species)
conf_matrix_rf
#Accuracy
accuracy<-conf_matrix_rf$overall[1]
accuracy
#0.2866667
probability<-predict(model_rf, data_test, type="prob")
probability2=cbind(probability, predictions_rf)
#N of observations included in confusion matrix
sum(table(predictions_rf)) #150
dim(data_test)[1] #150
#% predicted
sum(table(predictions_rf))/dim(data_test)[1] #1
#% not predicted
1-sum(table(predictions_rf))/dim(data_test)[1] #0
# -----------------------------------------------------------------------------
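One possible workaround, shown below as a minimal sketch rather than a caret feature: recode the NA predictions to an explicit "no_prediction" level, so that every test observation is scored and unpredicted cases count as errors.
# ---------------------- Possible workaround (sketch) --------------------------
all_levels <- c(levels(data_test$Species), "no_prediction")
preds_full <- factor(ifelse(is.na(predictions_logitboost), "no_prediction",
                            as.character(predictions_logitboost)),
                     levels = all_levels)
ref_full <- factor(as.character(data_test$Species), levels = all_levels)
# Accuracy over ALL test observations; NA predictions now count as errors
mean(preds_full == ref_full)
# The confusion matrix over all observations can be rebuilt the same way
confusionMatrix(preds_full, ref_full)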