Overview:
This project applies machine learning models to predict credit default risk, offering insights that help financial institutions make data-driven lending decisions. It covers data preprocessing, feature selection, model training, and evaluation.
- Objective: Prepare a clean dataset by handling missing values, normalizing numerical features, and encoding categorical variables.
- Key Steps:
- Address missing data using mean imputation for numeric features.
- Encode categorical variables (e.g., `OneHotEncoder` for nominal data).
- Normalize continuous features to standardize scales and prevent any single feature from dominating (a minimal preprocessing sketch follows this list).
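
The sketch below shows one way to wire these steps together with scikit-learn's `ColumnTransformer`. The column names, CSV file name, and `loan_status` target are hypothetical placeholders, not the project's actual schema.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column groups; the project's real dataset may use different names.
numeric_cols = ["loan_amnt", "person_income", "loan_int_rate"]
categorical_cols = ["loan_grade", "person_home_ownership"]

preprocessor = ColumnTransformer(
    transformers=[
        # Mean imputation followed by scaling for numeric features.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # One-hot encoding for nominal categorical features.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)

df = pd.read_csv("credit_data.csv")   # hypothetical file name
X = preprocessor.fit_transform(df[numeric_cols + categorical_cols])
y = df["loan_status"]                 # hypothetical binary default target
```

Wrapping imputation, scaling, and encoding in a single transformer keeps the same preprocessing applied consistently at training and inference time.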
- Applied Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, and XGBoost models to identify the most predictive features, focusing on critical indicators such as loan grade, home ownership status, and financial burden.
- After the dataset was balanced with SMOTE, feature importance was re-evaluated to improve model interpretability and efficiency (see the sketch below).
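
A hedged sketch of this balancing and re-ranking step, assuming SMOTE from the imbalanced-learn package and a Random Forest's `feature_importances_` as the importance measure; the project may equally have relied on XGBoost's built-in importance. `X`, `y`, and `preprocessor` carry over from the preprocessing sketch above.

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Oversample the minority (default) class so both classes are equally represented.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Re-fit a tree ensemble on the balanced data and re-rank feature importances.
rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_res, y_res)

# Requires a recent scikit-learn for get_feature_names_out on ColumnTransformer.
feature_names = preprocessor.get_feature_names_out()
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking[:10]:
    print(f"{name}: {score:.3f}")
```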
- Models trained (see the training sketch after this list):
- Logistic Regression: Baseline model for simple interpretation.
- K-Nearest Neighbors (KNN): Used for non-linear boundary detection.
- Decision Trees and Random Forest: To capture complex, non-linear interactions.
- XGBoost: Primary model due to its robustness in handling imbalanced data.
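
As referenced above, the following sketch fits the five candidate models on the SMOTE-balanced data from the earlier sketch; the hyperparameters shown are illustrative defaults, not the project's tuned values.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42, stratify=y_res
)

# Candidate models; hyperparameters are placeholders, not the project's tuned values.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=15),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
```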
- Evaluation Metrics: AUC-ROC, F1 Score, Precision, and Recall.
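
A short sketch of computing these metrics on the held-out split for each fitted model from the previous sketch; a 0.5 probability threshold is assumed for the class-based metrics.

```python
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score

for name, model in models.items():
    proba = model.predict_proba(X_test)[:, 1]   # predicted probability of default
    preds = (proba >= 0.5).astype(int)          # assumed decision threshold
    print(
        f"{name}: "
        f"AUC={roc_auc_score(y_test, proba):.3f}, "
        f"F1={f1_score(y_test, preds):.3f}, "
        f"Precision={precision_score(y_test, preds):.3f}, "
        f"Recall={recall_score(y_test, preds):.3f}"
    )
```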
- XGBoost demonstrated the highest performance with an AUC of 0.94, balancing recall and precision effectively.
- Feature importance highlighted loan grade and home ownership as critical risk indicators.
- Python Libraries: scikit-learn, pandas, NumPy, XGBoost
- Tools: Jupyter Notebook, Colab