(Inspired by the 2018 Home Credit Kaggle Competition)
This project demonstrates a proof-of-concept for risk evaluation using machine learning, focusing on predicting loan default risk. As part of a start-up product team simulation, I built a complete pipeline for risk assessment, from data exploration and preprocessing to model development and deployment. The model is aimed at helping financial institutions assess the risk of lending, especially in cases where customers have limited financial history.
Financial institutions often struggle to assess the credit risk of individuals with little or no credit history, such as first-time homebuyers and small business owners. This project explores whether machine learning can improve the accuracy of these assessments by analyzing a range of customer data, including loan applications, credit bureau reports, and payment histories.
The project uses a comprehensive set of financial data from Home Credit, including:
- Current loan applications
- Previous loan applications
- Historical loan balances
- Credit Bureau data
- Payment history records
All data can be found on the Home Credit Default Risk competition page on Kaggle.
- Data Exploration: Analyzed and visualized the relationships within the data to understand key features related to loan default risk.
- Data Preprocessing: Handled missing data, performed feature engineering, and created scalable preprocessing steps.
- Predictive Modeling: Developed and tuned machine learning models to predict the likelihood of loan default, with an emphasis on performance metrics.
- App Deployment: Deployed the final model as a containerized application to demonstrate its real-world use case.
.
├── README.md
├── data                    (ready-made folders for data created in the notebooks)
├── deployment
│   ├── Dockerfile
│   ├── app
│   │   └── main.py
│   ├── data                (holds X.parquet, generated in notebook 5_ML_models)
│   ├── model
│   │   └── model.pkl
│   └── requirements.txt    (requirements for containerization)
├── notebooks               (run these in order to see my development process)
│   ├── 1_feature_investigation_main_dataset.ipynb
│   ├── 2_feature_engineering_supplementary_data.ipynb
│   ├── 3_feature_preprocess_preliminary_models.ipynb
│   ├── 4_EDA.ipynb
│   └── 5_ML_models.ipynb
├── requirements.txt        (an exact replica of my development environment)
└── utils
    ├── __init__.py
    ├── feature_tools.py
    ├── machine_learning.py
    ├── plot.py
    └── utils.py
I recommend creating a virtual environment; here it is named "home-credit". In a terminal:

python -m venv home-credit

Activate the venv:

source home-credit/bin/activate

Side note: the venv can later be deactivated with

deactivate

To install all requirements, first change to the directory containing requirements.txt (e.g. the project root):

cd name/of/root/directory

and then run:

pip install -r requirements.txt
Now you are ready to run the Jupyter notebooks in the notebooks directory, either in your favorite IDE or with

jupyter lab

Step through the notebooks sequentially to follow my workflow and the predictive algorithm I developed.
Moreover, the deployment directory contains all the files needed to run a containerized version of the loan prediction app.
See requirements.txt for the full list of requirements, with exact versions, to recreate my development environment.
Key Requirements:
- Boruta
- jupyterlab
- lightgbm
- matplotlib
- numpy
- optuna
- pandas
- phik
- scikit-learn
- scipy
- seaborn
- shap
- tqdm
Contact: Miguel A. Diaz-Acevedo at [email protected]