In this project we built several supervised machine learning models to predict whether the owner of a given house should consider applying for insurance. The project also covers data preprocessing and visualization.
The machine learning techniques/models used are: Decision Tree Classifier, Logistic Regression, Random Forest Classifier, SVM Classifier, and MLP Classifier.
- `main.ipynb` is the main entry point for the project.
- Other files contain useful code for formatting, normalizing, and preprocessing the data.
Before applying any classification algorithm, we should first understand and visualize the dataset. Here are some of the visualizations we made:
We can see that there are no highly correlated features in our dataset.
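A check like this can also be made numerically by computing pairwise Pearson correlations and flagging any pair above a threshold. The following is a minimal sketch in plain Python, not the project's code: the column names, the toy values, and the 0.8 threshold are all illustrative assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, treated as 0 here
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Hypothetical toy columns, not the project's data.
toy_columns = {
    "area": [50, 80, 120, 200, 65],
    "rooms": [2, 3, 4, 6, 2],
    "age": [30, 5, 12, 40, 8],
}
flagged = highly_correlated_pairs(toy_columns)
```

An empty result from `highly_correlated_pairs` would correspond to the observation above. In the toy data, `area` and `rooms` are deliberately correlated so that the function has something to flag.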
We can see that the classes are imbalanced, so we should apply an oversampling technique to avoid biased models.
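The simplest oversampling technique is random oversampling: minority-class rows are duplicated at random until every class has as many rows as the largest one. The sketch below illustrates the idea in plain Python. It is an assumption about the general approach, not the exact method used in this project, and the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # Rows of this class that are available for duplication.
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: four rows of class 0, one row of class 1.
toy_rows = [[1.0], [2.0], [3.0], [4.0], [5.0]]
toy_labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels)
```

Oversampling should be applied to the training split only. Duplicating rows before the train/test split would leak copies of training rows into the test set and inflate the reported accuracy.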
After understanding the dataset and its features, we can preprocess the data to prepare it for the classification models. In this project we applied the following preprocessing steps:
- NaN values: after a careful study of the data, we removed some entries containing NaN values and replaced others with the mean, the previous value, or the next value.
- Outliers: we used the boxplot method to detect outliers, then examined whether to remove them or replace them with other values.
- Encoding: we encoded non-numeric values so that the ML algorithms could work with them.
- Normalization: to make it easier for the ML algorithms to learn, we applied scaling techniques such as RobustScaler to normalize the data.
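The four steps above can be sketched with the Python standard library alone. This is a hedged illustration, not the project's code: the helper names are hypothetical, only the mean variant of NaN replacement is shown, the 1.5 × IQR whisker factor is the common boxplot convention rather than a value stated in this project, and `robust_scale` mirrors the idea behind RobustScaler (center on the median, divide by the interquartile range) without calling scikit-learn.

```python
import math
import statistics


def fill_nan_with_mean(values):
    """Replace NaN entries with the mean of the non-NaN entries."""
    present = [v for v in values if not math.isnan(v)]
    mean = statistics.fmean(present)
    return [mean if math.isnan(v) else v for v in values]


def iqr_bounds(values, k=1.5):
    """Boxplot whiskers: values outside [Q1 - k*IQR, Q3 + k*IQR] count as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def label_encode(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [v - median for v in values]  # no spread: only center the values
    return [(v - median) / iqr for v in values]


# Hypothetical toy column with one obvious outlier.
toy = [1, 2, 3, 4, 5, 100]
low, high = iqr_bounds(toy)
outliers = [v for v in toy if v < low or v > high]
scaled = robust_scale(toy)
```

Because the median and the interquartile range are barely affected by extreme values, this kind of scaling is a reasonable fit for data in which some outliers were kept rather than removed.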
We trained several classification models, then evaluated and compared them to select the best one.
After training this classifier, we got the following results on the test data:
- Accuracy (in %): 77.17745691662785
- Confusion matrix:
- Accuracy (in %): 76.61853749417791
- Confusion matrix:
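
For reference, accuracy and a confusion matrix like those reported above are derived from the true and predicted test labels. The sketch below shows the computation in plain Python. The function names and the toy labels are hypothetical, and the toy values are not this project's results.

```python
def accuracy_percent(y_true, y_pred):
    """Share of predictions that match the true labels, in percent."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    """Count matrix in which rows are true labels and columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Hypothetical toy labels, not the project's test set.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
toy_accuracy = accuracy_percent(toy_true, toy_pred)
toy_matrix = confusion_matrix(toy_true, toy_pred)
```

With imbalanced classes, accuracy alone can be misleading, because a model that always predicts the majority class still scores well. The off-diagonal cells of the confusion matrix show which class the errors fall on.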