Using Random Forest and XGBoost for regression to predict health insurance charges based on patient data. Features EDA, preprocessing, and in-depth insights.

FutureGoose/predicting_insurance_charges

Predicting Health Insurance Charges: Unraveling Cost Determinants through Machine Learning 🌡️💸

Note: For the best viewing experience of the Jupyter Notebook, please use this nbviewer link.

Table of Contents 📘

  1. Introduction
  2. Dataset Overview
  3. Data Preprocessing
  4. Exploratory Data Analysis
  5. Feature Engineering
  6. Model Training and Selection
  7. Hyperparameter Optimization
  8. Model Evaluation
  9. Key Findings and Insights
  10. Conclusion

1. Introduction 🌟

This repository is a guide to predicting health insurance charges with machine learning. Drawing on demographic and health-related factors, the project seeks not just to predict charges but also to explain which variables drive healthcare costs.

This is a supervised regression task. We train models such as Random Forest Regressor and XGBRegressor and assess their accuracy with RMSE, MAE, and R-squared. Beyond the numbers, we aim to shed light on the relationships that influence these charges and to offer insights useful to both insurance companies and individuals. Our mission: "To use personal information to accurately and insightfully predict healthcare costs."

2. Dataset Overview 📁

The dataset contains health insurance records for individuals: demographics, smoking habits, body mass index, number of children, region, and the corresponding charges. The project uses this data to identify correlations and patterns that influence insurance charges.

3. Data Preprocessing 🧹

Data preprocessing involved handling missing values, converting categorical variables into numerical formats, and ensuring the dataset is optimized for machine learning models.
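As a rough illustration of these steps, here is a minimal sketch on a tiny synthetic sample. The column names (`age`, `sex`, `bmi`, `children`, `smoker`, `region`, `charges`), the median imputation, and the helper `preprocess` are assumptions made for the example; the notebook may handle these steps differently.

```python
import numpy as np
import pandas as pd

# Synthetic rows for illustration only; column names are assumed, not confirmed.
df = pd.DataFrame({
    "age": [19, 45, np.nan, 33],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, np.nan, 33.0, 22.7],
    "children": [0, 2, 3, 0],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
    "charges": [16900.0, 8200.0, 4400.0, 22000.0],
})


def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute numeric gaps and encode categoricals."""
    out = frame.copy()
    # Fill missing numeric values with the column median (one common choice).
    for col in ["age", "bmi"]:
        out[col] = out[col].fillna(out[col].median())
    # Map two-level categories to 0/1.
    out["smoker"] = out["smoker"].map({"no": 0, "yes": 1})
    out["sex"] = out["sex"].map({"female": 0, "male": 1})
    # One-hot encode the multi-level region column.
    return pd.get_dummies(out, columns=["region"], dtype=int)


clean = preprocess(df)
print(clean.head())
```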

4. Exploratory Data Analysis 📊

Detailed EDA was performed to understand the dataset's structure, unearth patterns, identify outliers, and ascertain potential variables affecting the insurance charges.
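Two typical EDA checks, sketched on synthetic values: a group comparison of charges by smoking status and an outlier scan using the 1.5 × IQR rule. The column names and the choice of the IQR rule are assumptions for the example, not a record of what the notebook does.

```python
import pandas as pd

# Synthetic values for illustration only.
df = pd.DataFrame({
    "smoker": ["no", "no", "no", "no", "yes", "yes"],
    "charges": [4000.0, 5000.0, 6000.0, 7000.0, 8000.0, 60000.0],
})

# Compare average charges between smokers and non-smokers.
mean_by_smoker = df.groupby("smoker")["charges"].mean()

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = df["charges"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["charges"] < lower) | (df["charges"] > upper)]

print(mean_by_smoker)
print(outliers)
```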

5. Feature Engineering ⚙️

Strategic feature engineering techniques were employed to harness the data's full potential. This involved creating interaction terms, binning, and encoding categorical features to ensure the dataset is primed for predictions.
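The three techniques named above could look like the following sketch. The specific interaction (`smoker` × `bmi`), the BMI bin edges, and the helper `add_features` are illustrative assumptions; the notebook's actual engineered features are not listed in this README.

```python
import pandas as pd

# Synthetic rows; smoker is assumed to be already encoded as 0/1.
df = pd.DataFrame({
    "age": [19, 45, 62, 33],
    "bmi": [27.9, 31.5, 24.0, 35.2],
    "smoker": [1, 0, 1, 0],
})


def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper showing an interaction term, binning, and encoding."""
    out = frame.copy()
    # Interaction term: lets a model treat BMI differently for smokers.
    out["smoker_bmi"] = out["smoker"] * out["bmi"]
    # Binning: left-closed intervals, so a BMI of exactly 30 counts as "obese".
    out["bmi_category"] = pd.cut(
        out["bmi"],
        bins=[0, 18.5, 25, 30, float("inf")],
        labels=["underweight", "normal", "overweight", "obese"],
        right=False,
    )
    # Encoding: one indicator column per BMI category.
    return pd.get_dummies(out, columns=["bmi_category"], dtype=int)


features = add_features(df)
print(features)
```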

6. Model Training and Selection 🤖

Multiple models, including Linear Regression, Random Forest, XGBoost, CatBoostRegressor and Support Vector Machines, were trained. Their performance metrics were compared to select the best fit for the prediction task.
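A comparison of this kind can be sketched with cross-validation. To stay runnable with Scikit-Learn alone, the example covers only two of the listed models and uses synthetic data; the data-generating formula, the 5-fold setup, and RMSE as the selection metric are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with a smoker-by-BMI interaction; not the project's dataset.
rng = np.random.default_rng(0)
n = 300
age = rng.integers(18, 65, n)
bmi = rng.normal(30, 6, n)
smoker = rng.integers(0, 2, n)
X = np.column_stack([age, bmi, smoker])
y = 250 * age + 20000 * smoker + 400 * smoker * bmi + rng.normal(0, 2000, n)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Score each model by mean cross-validated RMSE (lower is better).
scores = {}
for name, model in models.items():
    neg_rmse = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    scores[name] = -neg_rmse.mean()

best = min(scores, key=scores.get)
print(scores, "-> best:", best)
```

Other regressors that follow the Scikit-Learn estimator interface can be added to the `models` dictionary in the same way.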

7. Hyperparameter Optimization 🔧

To ensure the models perform optimally, hyperparameters were fine-tuned using GridSearchCV, resulting in improved predictive performance.
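A minimal `GridSearchCV` sketch follows. The estimator, the parameter grid, and the synthetic data from `make_regression` are assumptions chosen to keep the example small; the grids actually searched in the notebook are not stated here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the insurance data.
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# Illustrative grid: 2 x 2 = 4 candidate settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(best_params, best_rmse)
```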

8. Model Evaluation 🎯

The final model's performance was gauged using various metrics, including RMSE, MAE, and R-squared, providing a holistic evaluation of its efficacy.
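For reference, the three metrics can be computed directly from their definitions. The helper `regression_metrics` and the sample values are hypothetical, written only to show the arithmetic.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Hypothetical helper: RMSE, MAE, and R-squared from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    mae = float(np.mean(np.abs(residuals)))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "r2": r2}


# Made-up charges and predictions.
metrics = regression_metrics(
    [1000, 2000, 3000, 4000],
    [1100, 1900, 3200, 3800],
)
print(metrics)  # rmse is about 158.11, mae is 150.0, r2 is 0.98
```

RMSE penalizes large errors more heavily than MAE, so a gap between the two suggests a few predictions are far off.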

9. Key Findings and Insights 💡

Insights drawn from the model emphasized the importance of certain variables, such as smoking habits, BMI, and age, in determining insurance costs. Detailed interpretations have been provided to understand the magnitude and direction of these impacts.
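One common way to rank variables is a tree ensemble's impurity-based feature importances, sketched below. The data is synthetic and deliberately constructed so that `smoker` dominates, so the output demonstrates the technique and does not reproduce the project's findings. These importances measure magnitude only; direction needs a separate tool such as partial dependence.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data built so that smoking has the largest effect (an assumption).
rng = np.random.default_rng(42)
n = 500
feature_names = ["age", "bmi", "smoker"]
age = rng.integers(18, 65, n)
bmi = rng.normal(30, 6, n)
smoker = rng.integers(0, 2, n)
X = np.column_stack([age, bmi, smoker])
y = 250 * age + 300 * bmi + 24000 * smoker + rng.normal(0, 2000, n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Pair each feature with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```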

10. Conclusion 🎉

The project brought to light several determinants of health insurance charges. Using machine learning, we derived actionable insights that can help both consumers and insurance providers make informed decisions.


Getting Started 🏁

For an optimal viewing experience of the Jupyter Notebook, use the following nbviewer link: View Notebook on nbviewer

This link provides a superior rendering compared to the default GitHub file viewer.

Prerequisites 📋

For successful execution, you'll need:

  • Python 3.x
  • Jupyter Notebook
  • Libraries: NumPy, Pandas, Matplotlib, Seaborn, Scikit-Learn, XGBoost, CatBoost

Installing 🛠️

  1. Clone this repository.
  2. Install the required libraries.
  3. Navigate to and open the Jupyter Notebook.
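Before opening the notebook, a short script like the one below can confirm that the libraries are importable. The helper `missing_packages` is hypothetical, and the list uses import names (Scikit-Learn is imported as `sklearn`); extend it with any other packages the notebook needs.

```python
from importlib.util import find_spec

# Import names for the libraries listed under Prerequisites.
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn"]


def missing_packages(names):
    """Hypothetical helper: return the names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```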

Your constructive feedback and queries are always welcome!


Acknowledgments 🙏

A huge thanks to the data science community for their continuous efforts in making datasets available for public use and promoting an environment of collective learning.

License 📄

This project is licensed under the MIT License. Refer to the LICENSE.md file for detailed information.
