Note: For the best viewing experience of the Jupyter Notebook, please use this nbviewer link.
- Introduction
- Dataset Overview
- Data Preprocessing
- Exploratory Data Analysis
- Feature Engineering
- Model Training and Selection
- Hyperparameter Optimization
- Model Evaluation
- Key Findings and Insights
- Conclusion
This repository is a walkthrough of predicting health insurance charges with machine learning. Using demographic and health-related factors, the project aims not only to predict charges but also to explain which variables drive healthcare costs.
The task is a supervised regression problem, approached with models such as Random Forest Regressor and XGBRegressor. We assess model accuracy with RMSE, MAE, and R-squared. Beyond the metrics, we want to clarify the relationships that influence these charges. Through this analysis, we hope to offer insights useful to both insurance companies and individuals, in line with our mission: "To use personal information to accurately and insightfully predict healthcare costs."
The dataset used for this project consists of health insurance details of individuals, including demographics, smoking habits, body mass index, number of children, region, and the corresponding charges. The project uses this data to identify correlations and patterns that influence insurance prices.
Data preprocessing involved handling missing values, converting categorical variables into numerical formats, and checking that the resulting dataset is suitable as input for machine learning models.
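The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the column names (`age`, `sex`, `bmi`, `children`, `smoker`, `region`, `charges`) are assumptions based on the dataset description, the rows are invented, and median imputation is just one common way to handle missing values.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; column names are assumptions.
df = pd.DataFrame({
    "age": [19, 33, np.nan, 45],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 22.7, 33.0, np.nan],
    "children": [0, 1, 3, 2],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "northwest", "southeast", "northeast"],
    "charges": [17000.0, 5000.0, 4500.0, 30000.0],
})

# Fill numeric gaps with the column median (an assumed strategy).
numeric_cols = ["age", "bmi"]
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Map binary categories to 0/1.
df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})
df["sex"] = df["sex"].map({"female": 0, "male": 1})

# One-hot encode the multi-valued region column.
df = pd.get_dummies(df, columns=["region"], dtype=int)

print(df.head())
```

After these steps every column is numeric and free of missing values, which is what most scikit-learn estimators expect.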
Detailed EDA was performed to understand the dataset's structure, unearth patterns, identify outliers, and ascertain potential variables affecting the insurance charges.
Feature engineering was used to make more of the available signal accessible to the models. This involved creating interaction terms, binning, and encoding categorical features.
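As a rough sketch of the three techniques named above (interaction terms, binning, and encoding), consider the snippet below. The specific features are assumptions chosen for illustration: the `smoker_bmi` interaction, the BMI cut points, and the band labels are hypothetical and may differ from what the notebook uses.

```python
import pandas as pd

# Toy rows invented for illustration; smoker is already encoded as 0/1.
df = pd.DataFrame({
    "age": [19, 33, 52, 64],
    "bmi": [27.9, 22.7, 33.0, 36.1],
    "smoker": [1, 0, 0, 1],
})

# Interaction term: lets a model treat BMI differently for smokers.
df["smoker_bmi"] = df["smoker"] * df["bmi"]

# Binning: group BMI into bands (cut points are assumed, left-inclusive).
df["bmi_band"] = pd.cut(
    df["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["under", "normal", "over", "obese"],
    right=False,
)

# Encoding: turn the categorical band into indicator columns.
encoded = pd.get_dummies(df, columns=["bmi_band"], dtype=int)

print(encoded)
```

Tree-based models can often discover such interactions on their own, so engineered terms like these tend to matter most for linear models.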
Multiple models, including Linear Regression, Random Forest, XGBoost, CatBoostRegressor and Support Vector Machines, were trained. Their performance metrics were compared to select the best fit for the prediction task.
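One way to compare several regressors on equal footing is cross-validated RMSE, sketched below. This is an illustration under assumptions, not the notebook's procedure: it uses synthetic data from `make_regression` instead of the insurance dataset, and it leaves out XGBoost and CatBoost so the example depends only on scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the real feature matrix and target.
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    # SVR is sensitive to feature scale, so it is wrapped with a scaler.
    "svr": make_pipeline(StandardScaler(), SVR()),
}

results = {}
for name, model in models.items():
    # scikit-learn reports negated RMSE so that higher is better; flip the sign.
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    results[name] = -scores.mean()

best = min(results, key=results.get)
for name, rmse in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: mean CV RMSE = {rmse:.2f}")
print("lowest RMSE:", best)
```

The ranking on synthetic data says nothing about the ranking on the insurance data; the point is only the comparison loop.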
Hyperparameters were tuned with GridSearchCV, which improved predictive performance over the default settings.
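A minimal sketch of a grid search with scikit-learn's `GridSearchCV` is shown below. The estimator, the parameter grid, and the synthetic data are all assumptions made for illustration; the notebook's actual grids may be larger and cover different models.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real training data.
X, y = make_regression(n_samples=150, n_features=6, noise=10.0, random_state=0)

# Hypothetical grid: 2 x 2 = 4 candidate settings.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign of the negated RMSE

print("best parameters:", best_params)
print(f"best CV RMSE: {best_rmse:.2f}")
```

Because every combination is refit once per fold, the cost grows quickly with grid size, so small grids around sensible defaults are a reasonable starting point.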
The final model's performance was gauged using various metrics, including RMSE, MAE, and R-squared, providing a holistic evaluation of its efficacy.
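For reference, the three metrics can be computed with scikit-learn as in the small worked example below. The values in `y_true` and `y_pred` are invented so the arithmetic is easy to check by hand; they are not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented values chosen so the results are easy to verify by hand.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

# MAE: mean of |error| = (10 + 10 + 20 + 20) / 4 = 15
mae = mean_absolute_error(y_true, y_pred)

# RMSE: sqrt of mean squared error = sqrt((100 + 100 + 400 + 400) / 4) = sqrt(250)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R-squared: 1 - SS_res / SS_tot = 1 - 1000 / 50000 = 0.98
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.2f}")
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a sense of whether a few large misses dominate the error.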
Insights drawn from the model emphasized the importance of certain variables, such as smoking habits, BMI, and age, in determining insurance costs. Detailed interpretations have been provided to understand the magnitude and direction of these impacts.
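One simple way to read both magnitude and direction of an effect is to inspect the coefficients of a linear model, as sketched below. Everything here is an assumption for illustration: the data is simulated from a made-up rule (the numbers 200, 300, and 20000 are arbitrary, not findings from this project), and the notebook may rely on other techniques such as tree-based feature importances.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Simulated features; ranges are arbitrary.
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 5, size=n)
smoker = rng.integers(0, 2, size=n)

# Made-up generating rule, used only so the recovered effects can be checked.
charges = 200 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 1000, size=n)

X = np.column_stack([age, bmi, smoker])
model = LinearRegression().fit(X, charges)

# Sign gives direction; size gives change in charges per unit of the feature.
effects = dict(zip(["age", "bmi", "smoker"], model.coef_))
for name, coef in effects.items():
    direction = "raises" if coef > 0 else "lowers"
    print(f"{name}: one unit {direction} predicted charges by about {abs(coef):.0f}")
```

Coefficients are only comparable across features after accounting for their units, which is why a binary flag and a continuous measure need to be interpreted separately.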
The project surfaced several less obvious determinants of health insurance charges. Using machine learning, we derived actionable insights that can help both consumers and insurance providers make informed decisions.
For an optimal viewing experience of the Jupyter Notebook, use the following nbviewer link: View Notebook on nbviewer
This link provides a superior rendering compared to the default GitHub file viewer.
For successful execution, you'll need:
- Python 3.x
- Jupyter Notebook
- Libraries: NumPy, Pandas, Matplotlib, Seaborn, Scikit-Learn, XGBoost, CatBoost
- Clone this repository.
- Install the required libraries.
- Navigate to and open the Jupyter Notebook.
Your constructive feedback and queries are always welcome!
A huge thanks to the data science community for their continuous efforts in making datasets available for public use and promoting an environment of collective learning.
This project is licensed under the MIT License. Refer to the LICENSE.md file for detailed information.