Wrong accuracy value for logit boost model #1373

Open

mmancin opened this issue Nov 20, 2024 · 0 comments

mmancin commented Nov 20, 2024

Summary
While using the caret package to compare machine learning models, we observed that the accuracy of LogitBoost models may be overestimated. The problem is that observations for which no prediction is made (NA) are silently excluded from the confusion matrix and from the subsequent performance-metric calculations.
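A minimal base-R sketch of the dropping behaviour (table() omits NA entries by default, and the confusion matrix in the example further below shows the same pattern):

truth <- factor(c("A", "A", "B", "B"))
pred <- factor(c("A", NA, "B", NA), levels = levels(truth))

table(pred, truth) # only the 2 non-NA predictions appear
sum(pred == truth, na.rm = TRUE)/2 # 1.00: accuracy over predicted cases only
sum(pred == truth, na.rm = TRUE)/4 # 0.50: accuracy over all cases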

Description
We recently conducted a model comparison study using AI methods to attribute cases of human salmonellosis to various animal-based food sources (e.g., pig, bovine, poultry). We employed the caret package in R to compare Random Forest (RF), LogitBoost (LB), and Support Vector Machine (SVM) models. The train function was used, both with and without cross-validation, to train the models.

During the prediction phase on the test dataset (with a known response variable), we observed the following behavior:

  1. All models produced a prediction for every observation, except LogitBoost.
  2. For LogitBoost, whenever two or more predicted probabilities were identical for a given observation, the model (by design) made no prediction and returned NA instead (see the sketch below).
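For context, caret's "LogitBoost" method wraps caTools::LogitBoost, whose predict method is documented to return NA when two or more classes tie. A minimal sketch of that behaviour using caTools directly (the iris data and nIter = 10 are arbitrary choices for illustration, not part of our study):

library(caTools)

fit <- LogitBoost(as.matrix(iris[, 1:4]), iris$Species, nIter = 10)
cls <- predict(fit, as.matrix(iris[, 1:4])) # class labels; NA wherever classes tie
raw <- predict(fit, as.matrix(iris[, 1:4]), type = "raw") # per-class scores
sum(is.na(cls)) # number of observations left unpredicted because of ties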

However, the issue arises during accuracy calculation:

  1. Observations without a prediction (NA values) were excluded from the confusion matrix.
  2. This exclusion potentially inflates the reported accuracy of LogitBoost, because the metric considers only the observations for which a prediction was made.

In our opinion, the model performance evaluation could account for the missing predictions to ensure a more accurate comparison, as sketched below.
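One possible adjustment, sketched here as a hypothetical helper (not part of caret), is to compute accuracy over all test observations, counting NA predictions as misclassifications, so that abstaining and non-abstaining models share the same denominator:

# hypothetical helper: accuracy over the full test set, NA counted as wrong
adjusted_accuracy <- function(pred, truth) {
  sum(pred == truth, na.rm = TRUE)/length(truth)
}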

Below is an example illustrating the discrepancy between RF and LB predictions (without cross-validation), using an artificial dataset. The confusion matrix and accuracy for RF include all observations, while for LB, they exclude cases with NA predictions.

#################################################################################################

# ----------------------- Example code ----------------------------------------------

library(caret)

# --------------------- DATASET

set.seed(123)
n <- 500

# Y variable

response <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))

# X variables: qualitative and quantitative

predictor <- factor(sample(c("X", "Y", "Z"), n, replace = TRUE))
quant_var1 <- rnorm(n, mean = 50, sd = 10)
quant_var2 <- runif(n, min = 0, max = 100)

dataset <- data.frame(Species = response,
                      Predictor = predictor,
                      Quantitative1 = quant_var1,
                      Quantitative2 = quant_var2)

# ---------------------- TRAIN AND TEST WITHOUT CROSS VALIDATION

set.seed(839)
id <- sample(1:NROW(dataset), 0.70 * NROW(dataset))
data_train <- dataset[id, ]
data_test <- dataset[-id, ]
data_test2 <- data_test

#---- Logit Boost -------------------------------------------------------------------- #
set.seed(839)
model_logitboost <- train(Species ~ .,
                          data = data_train,
                          method = "LogitBoost", metric = "Accuracy",
                          trControl = trainControl(method = "none"))
model_logitboost

predictions_logitboost <- predict(model_logitboost, data_test)
conf_matrix_logitboost <- confusionMatrix(predictions_logitboost, data_test$Species)
conf_matrix_logitboost

# Accuracy
accuracy <- conf_matrix_logitboost$overall[1]
accuracy
# 0.2545455

probability <- predict(model_logitboost, data_test, type = "prob")
probability2 <- cbind(probability, predictions_logitboost)

# N of observations included in the confusion matrix
sum(table(predictions_logitboost)) # 55
dim(data_test)[1] # 150
# % predicted
sum(table(predictions_logitboost))/dim(data_test)[1] # 0.37
# % not predicted
1 - sum(table(predictions_logitboost))/dim(data_test)[1] # 0.63
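
# Added illustration (not in the original script): accuracy over ALL test
# observations, counting NA predictions as errors. 55 predictions at
# accuracy 0.2545455 means 14 correct, so 14/150:
sum(predictions_logitboost == data_test$Species, na.rm = TRUE)/dim(data_test)[1] # 0.09333333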

# -------------------------------------------------------------------------------------

# ------------------------------------- Random Forest ------------------------------------- #

set.seed(839)
model_rf <- train(Species ~ .,
                  data = data_train,
                  method = "rf", metric = "Accuracy",
                  trControl = trainControl(method = "none"))

model_rf
model_rf$finalModel

predictions_rf <- predict(model_rf, data_test)
conf_matrix_rf <- confusionMatrix(predictions_rf, data_test$Species)
conf_matrix_rf

# Accuracy
accuracy <- conf_matrix_rf$overall[1]
accuracy
# 0.2866667

probability <- predict(model_rf, data_test, type = "prob")
probability2 <- cbind(probability, predictions_rf)

# N of observations included in the confusion matrix
sum(table(predictions_rf)) # 150
dim(data_test)[1] # 150
# % predicted
sum(table(predictions_rf))/dim(data_test)[1] # 1
# % not predicted
1 - sum(table(predictions_rf))/dim(data_test)[1] # 0

# -------------------------------------------------------------------------------------
