Summary
While using the caret package to compare machine learning models, we observed that the accuracy of LogitBoost models may be overestimated. The issue arises because observations for which no prediction is made are silently excluded from the confusion matrix and from the performance metrics computed from it.
Description
We recently conducted a model comparison study using machine learning methods to attribute cases of human salmonellosis to various animal-based food sources (e.g., pig, bovine, poultry). We used the caret package in R to compare Random Forest (RF), LogitBoost (LB), and Support Vector Machine (SVM) models, training each with the train function both with and without cross-validation.
During the prediction phase on the test dataset (with a known response variable), we observed the following behavior:
- All models predicted the food source for every observation, except LogitBoost.
- For LogitBoost, whenever two or more predicted class probabilities were identical for a given observation, the model appropriately made no prediction and returned NA instead.
However, the issue arises during accuracy calculation:
- Observations without a prediction (NA values) are excluded from the confusion matrix.
- This exclusion potentially inflates the reported accuracy of LogitBoost, because only the successfully predicted observations are scored (see the short illustration below).
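A minimal, self-contained illustration of the mechanism, independent of our data: table(), which as far as we can tell is what confusionMatrix uses internally to tabulate predictions against the reference, drops NA entries by default, so unpredicted observations simply vanish from the counts.
x <- factor(c("A", NA, "B", NA), levels = c("A", "B"))
length(x)      # 4 observations
sum(table(x))  # 2 -- the two NA entries are not counted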
In our opinion, model performance evaluation should account for these missing predictions to ensure a fair comparison across models; one way to do so is sketched at the end of the example code below.
Below is an example illustrating the discrepancy between RF and LB predictions (without cross-validation), using an artificial dataset. The confusion matrix and accuracy for RF include all observations, while for LB, they exclude cases with NA predictions.
# ------------------------------ Example code --------------------------------
library(caret)
# --------------------------------- DATASET ----------------------------------
set.seed(123)
n=500
# Y variable
response <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
# X variables: qualitative and quantitative
predictor <- factor(sample(c("X", "Y", "Z"), n, replace = TRUE))
quant_var1 <- rnorm(n, mean = 50, sd = 10)
quant_var2 <- runif(n, min = 0, max = 100)
dataset <- data.frame(Species = response,
                      Predictor = predictor,
                      Quantitative1 = quant_var1,
                      Quantitative2 = quant_var2)
# ------------------ TRAIN AND TEST WITHOUT CROSS-VALIDATION ------------------
set.seed(839)
id=sample(1:NROW(dataset), 0.70*NROW(dataset))
data_train=dataset[id,]
data_test= dataset[-id,]
data_test2<-data_test
#---- Logit Boost -------------------------------------------------------------------- #
set.seed(839)
# trainControl(method = "none") fits a single model with the default tuning
# parameters, i.e., no resampling/cross-validation
model_logitboost = train(Species ~ .,
                         data = data_train,
                         method = "LogitBoost", metric = "Accuracy",
                         trControl = trainControl(method = "none"))
model_logitboost
predictions_logitboost=predict(model_logitboost, data_test)
conf_matrix_logitboost=confusionMatrix(predictions_logitboost,data_test$Species)
conf_matrix_logitboost
#Accuracy
accuracy<-conf_matrix_logitboost$overall[1]
accuracy
#0.2545455
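# NOTE: this accuracy is computed only over the observations that received
# a prediction (55 of 150, as shown below), not over the whole test set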
probability<-predict(model_logitboost, data_test, type="prob")
probability2=cbind(probability, predictions_logitboost)
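# Illustrative check (our addition, based on the behaviour described above):
# NA predictions should coincide with tied maximum class probabilities
ties <- apply(probability, 1, function(p) sum(p == max(p)) > 1)
table(ties, is.na(predictions_logitboost))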
#N of observations included in confusion matrix
sum(table(predictions_logitboost)) #55
dim(data_test)[1] #150
#% predicted
sum(table(predictions_logitboost))/dim(data_test)[1] #0.37
#% not predicted
1-sum(table(predictions_logitboost))/dim(data_test)[1] #0.63
# -----------------------------------------------------------------------------
# ------------------------------- Random Forest --------------------------------
set.seed(839)
model_rf = train(Species ~ .,
                 data = data_train,
                 method = "rf", metric = "Accuracy",
                 trControl = trainControl(method = "none"))
model_rf
model_rf$finalModel
predictions_rf=predict(model_rf, data_test)
conf_matrix_rf=confusionMatrix(predictions_rf, data_test$Species)
conf_matrix_rf
#Accuracy
accuracy<-conf_matrix_rf$overall[1]
accuracy
#0.2866667
probability<-predict(model_rf, data_test, type="prob")
probability2=cbind(probability, predictions_rf)
#N of observations included in confusion matrix
sum(table(predictions_rf)) #150
dim(data_test)[1] #150
#% predicted
sum(table(predictions_rf))/dim(data_test)[1] #1
#% not predicted
1-sum(table(predictions_rf))/dim(data_test)[1] #0
# -----------------------------------------------------------------------------
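One possible workaround, shown below as a minimal sketch rather than a caret feature: recode the NA predictions to an explicit "no_prediction" level, so that every test observation is scored and unpredicted cases count as errors.
# ---------------------- Possible workaround (sketch) --------------------------
all_levels <- c(levels(data_test$Species), "no_prediction")
preds_full <- factor(ifelse(is.na(predictions_logitboost), "no_prediction",
                            as.character(predictions_logitboost)),
                     levels = all_levels)
ref_full <- factor(as.character(data_test$Species), levels = all_levels)
# Accuracy over ALL test observations; NA predictions now count as errors
mean(preds_full == ref_full)
# The confusion matrix over all observations can be rebuilt the same way
confusionMatrix(preds_full, ref_full)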