This repository has been archived by the owner on Sep 11, 2023. It is now read-only.

Commit

Merge pull request #23 from Nima-Jamshidi/milestone-02
Milestone 02
Nima-Jamshidi authored Mar 8, 2020
2 parents 81163e8 + 9c5c85c commit 82c105e
Showing 9 changed files with 332 additions and 84 deletions.
91 changes: 63 additions & 28 deletions docs/milestone2.Rmd
@@ -1,22 +1,25 @@
---
title: "Draft"
author: "Diana Lin & Nima Jamshidi"
date: "07/03/2020"
output: html_document
author: "Nima Jamshidi & Diana Lin"
date: "3/6/2020"
output:
pdf_document:
toc: true
html_document:
toc: true
keep_md: true

---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


```{r load, echo = FALSE, message = FALSE, warning = FALSE}
library(tidyverse)
library(here)
library(corrplot)
library(scales)
library(glue)
library(psych)
library(hablar)
library(knitr)
```

## Introduction
Expand All @@ -25,13 +28,21 @@ The dataset we have chosen to work with is the "Medical Expenses" dataset used i

This dataset is very interesting as the USA does not have universal healthcare, and is known for bankrupting its citizens with hospital visits despite having insurance. It will be interesting to see the relationship between characteristics of a beneficiary, such as `BMI` and `Smoking` status, and the `charges` incurred.

Originally, this dataset was used to train a machine learning algorithm to accurately predict insurance costs using linear regression.
## Research Question
In this study, we are analyzing the data to find a relationship between the features and the amount of insurance cost.

Does having an increased BMI increase your insurance costs? What about age? Number of dependents? Smoking status?
Are certain areas of the USA associated with higher insurance costs?

In order to answer the questions above we're planning to perform a linear regression analysis and plot the regression line and relevant variables. The variables need to be normalized before performing the regression analysis.


## Data Description

This dataset explains the medical insurance costs of a small sample of the USA population. Each row corresponds to a beneficiary. Various metadata was recorded as well.

```{r load the data}

```{r load the data, echo=FALSE}
# import the data
costs <- read_csv(
here("data", "raw", "Medical_Cost.csv"),
Expand All @@ -47,12 +58,8 @@ costs <- read_csv(
)
```

The columns (except the last one) in this dataset correspond to metadata, where the last column is the monetary charges of medical insurance:
```{r columns}
colnames(costs)
```
The columns (except the last one) in this dataset correspond to metadata, where the last column is the monetary charges of medical insurance. Here are the possible values for each of the columns:

Here are the possible values for each of the above column names:

Variable | Type | Description
---------|------|---------------
Expand All @@ -67,37 +74,61 @@ Charges | double | the monetary charges the beneficiary was billed by health ins
## Exploring the Dataset

Here is a summary of the dataset, and the values of each variable:
```{r summary}
summary(costs)

```{r summary, echo=FALSE}
options(knitr.kable.NA="")
kable(summary(costs))
```

### Correlogram

In this section we are inspecting the data set to see if there is any correlation between the variables. From now on we want to consider charges as our dependent variable.
Next, we want to inspect the data set to see if there is any correlation between the variables. From now on we want to consider charges as our dependent variable.
To analyze the correlation between variables, the categorical variables with two categories are translated into binary vectors. The only categorical variable with more than two categories is region. We split this variable into four binary vectors, each indicating whether the sample belongs to that category (1) or not (0).

After using dummy variables for sex, smoker, and region, according to the correlogram below, smoker and charges have the strongest correlation, at 0.79. No high collinearity between independent variables is observed.

![](../images/corrplot.png)
<center>

![](../images/corrplot.png){width="450" height="450"}

</center>



To check whether the data points form clusters, we use a faceted plot. While the data do not appear to vary much between regions or sexes, the smokers and non-smokers in each facet appear to cluster separately, with the non-smokers having overall lower medical costs.


### Faceted Plot
<center>

Here we want to explore the data to see if there is any cluster of data points. While the data between regions and sex does not appear to vary much, the smokers vs nonsmokers of each facet appear to cluster together, with the non-smokers having an overall lower medical cost.
![](../images/facet.png){width="450" height="450"}

</center>

![](../images/facet.png)

### Histogram

How is sex distributed among the different age groups?
Looking at the dataset, there appear to be more beneficiaries in the 20-60 age range. The biggest difference in the number of beneficiaries between the sexes is seen in the 20-30 bracket.

![](../images/age_histogram.png)

### Stacked Bar Chart
<center>

![](../images/age_histogram.png){width="450" height="450"}

</center>



How about the distribution of sex among the regions?
This plot shows the distribution of sex in each of the four regions. At a glance, the dataset looks very even when it comes to sex, but there are slightly more beneficiaries in the southeast.

![](../images/region_barchart.png)

<center>

![](../images/region_barchart.png){width="450" height="450"}

</center>




## Methods

Expand All @@ -121,4 +152,8 @@ This plot shows the distribution of sex in each of the four regions. At a glance

```{r conclusion}
# PLACE HOLDER FOR LINEAR REGRESSION
```
```

## References
1. Medical Costs Dataset - https://gist.github.com/meperezcuello/82a9f1c1c473d6585e750ad2e3c05a41
2. BMI - https://www.nhlbi.nih.gov/health/educational/lose_wt/BMI/bmi-m.htm
176 changes: 121 additions & 55 deletions docs/milestone2.html

Large diffs are not rendered by default.

147 changes: 147 additions & 0 deletions docs/milestone2.md
@@ -0,0 +1,147 @@
---
title: "Draft"
author: "Nima Jamshidi & Diana Lin"
date: "3/6/2020"
output:
html_document:
toc: true
keep_md: true
pdf_document:
toc: true
---






## Introduction

The dataset we have chosen to work with is the "Medical Expenses" dataset used in the book [Machine Learning with R](https://www.amazon.com/Machine-Learning-R-Brett-Lantz/dp/1782162143), by Brett Lantz. This dataset was extracted from [Kaggle](https://www.kaggle.com/mirichoi0218/insurance/home) by GitHub user [\@meperezcuello](https://gist.github.com/meperezcuello). The information about this dataset has been extracted from their [GitHub Gist](https://gist.github.com/meperezcuello/82a9f1c1c473d6585e750ad2e3c05a41).

This dataset is very interesting as the USA does not have universal healthcare, and is known for bankrupting its citizens with hospital visits despite having insurance. It will be interesting to see the relationship between characteristics of a beneficiary, such as `BMI` and `Smoking` status, and the `charges` incurred.

## Research Question
In this study, we are analyzing the data to find a relationship between the features and the amount of insurance cost.

Does having an increased BMI increase your insurance costs? What about age? Number of dependents? Smoking status?
Are certain areas of the USA associated with higher insurance costs?

In order to answer the questions above we're planning to perform a linear regression analysis and plot the regression line and relevant variables. The variables need to be normalized before performing the regression analysis.
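
As a rough illustration of the planned analysis, the sketch below standardizes two numeric predictors and fits a linear model with `lm()`. It is only a sketch under stated assumptions: the `toy` data frame and its values are made up for illustration and are not taken from the Medical Cost dataset, and the column names `age_z` and `bmi_z` are hypothetical. Standardizing here is one possible reading of "normalized"; it mainly puts the coefficients of numeric predictors on a comparable scale.

```r
# Toy data for illustration only: these values are invented and are NOT
# rows of the Medical Cost dataset.
toy <- data.frame(
  age     = c(19, 33, 45, 52, 61, 28),
  bmi     = c(27.9, 22.7, 30.1, 35.2, 29.0, 24.3),
  smoker  = factor(c("yes", "no", "no", "yes", "no", "no"),
                   levels = c("no", "yes")),
  charges = c(16900, 4400, 8200, 41000, 13500, 3900)
)

# Standardize the numeric predictors to mean 0 and standard deviation 1.
# scale() returns a one-column matrix, so convert it back to a plain vector.
toy$age_z <- as.numeric(scale(toy$age))
toy$bmi_z <- as.numeric(scale(toy$bmi))

# Fit an ordinary least squares model. lm() dummy-codes the factor `smoker`
# on its own, using "no" as the reference level.
fit <- lm(charges ~ age_z + bmi_z + smoker, data = toy)

# Inspect the estimated coefficients.
coef(fit)
```

With the real data, the same formula could be extended with the remaining variables (children, sex, region) once they are loaded as in the report.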


## Data Description

This dataset explains the medical insurance costs of a small sample of the USA population. Each row corresponds to a beneficiary. Various metadata was recorded as well.




The columns (except the last one) in this dataset correspond to metadata, where the last column is the monetary charges of medical insurance. Here are the possible values for each of the columns:


Variable | Type | Description
---------|------|---------------
Age | integer | the primary beneficiary's age in years
Sex | factor | the beneficiary's sex: `female` or `male`
BMI | double | the beneficiary's Body Mass Index, a measure of their body fat based on height and weight (measured in kg/m<sup>2</sup>), an ideal range of 18.5 to 24.9
Children | integer | the number of dependents on the primary beneficiary's insurance policy
Smoker | factor | whether or not the beneficiary is a smoker: `yes` or `no`
Region | factor | the beneficiary's residential area in the USA: `southwest`, `southeast`, `northwest`, or `northeast`
Charges | double | the monetary charges the beneficiary was billed by health insurance

## Exploring the Dataset

Here is a summary of the dataset, and the values of each variable:


age | sex | bmi | children | smoker | region | charges
----|-----|-----|----------|--------|--------|--------
Min. :18.00 | female:662 | Min. :15.96 | Min. :0.000 | yes: 274 | southwest:325 | Min. : 1122
1st Qu.:27.00 | male :676 | 1st Qu.:26.30 | 1st Qu.:0.000 | no :1064 | southeast:364 | 1st Qu.: 4740
Median :39.00 | | Median :30.40 | Median :1.000 | | northwest:325 | Median : 9382
Mean :39.21 | | Mean :30.66 | Mean :1.095 | | northeast:324 | Mean :13270
3rd Qu.:51.00 | | 3rd Qu.:34.69 | 3rd Qu.:2.000 | | | 3rd Qu.:16640
Max. :64.00 | | Max. :53.13 | Max. :5.000 | | | Max. :63770


Next, we want to inspect the data set to see if there is any correlation between the variables. From now on we want to consider charges as our dependent variable.
To analyze the correlation between variables, the categorical variables with two categories are translated into binary vectors. The only categorical variable with more than two categories is region. We split this variable into four binary vectors, each indicating whether the sample belongs to that category (1) or not (0).

After using dummy variables for sex, smoker, and region, according to the correlogram below, smoker and charges have the strongest correlation, at 0.79. No high collinearity between independent variables is observed.

<center>

![](../images/corrplot.png){width="450" height="450"}

</center>
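
The dummy-variable encoding described above can be sketched in base R as follows. This is a minimal sketch, not necessarily how the correlogram in this report was produced: the `toy` data frame is invented for illustration, and the names `dummies`, `region_ind`, `num`, and `cm` are hypothetical.

```r
# Toy data for illustration only: these values are invented and are NOT
# rows of the Medical Cost dataset.
toy <- data.frame(
  sex     = factor(c("female", "male", "male", "female", "male", "female")),
  smoker  = factor(c("yes", "no", "no", "yes", "no", "no")),
  region  = factor(c("southwest", "southeast", "northwest",
                     "northeast", "southeast", "southwest")),
  charges = c(16900, 4400, 8200, 41000, 13500, 3900)
)

# Two-level factors each become a single 0/1 vector.
dummies <- data.frame(
  sex_male   = as.integer(toy$sex == "male"),
  smoker_yes = as.integer(toy$smoker == "yes")
)

# Region becomes one indicator column per level. The "- 1" removes the
# intercept so that all four levels are kept instead of three.
region_ind <- model.matrix(~ region - 1, data = toy)

# Combine everything into one numeric table and compute Pearson correlations.
num <- cbind(dummies, region_ind, charges = toy$charges)
cm  <- cor(num)
round(cm, 2)
```

A matrix like `cm` is the kind of input a correlogram function expects. Because the four region indicators always sum to 1, one of them would typically be dropped before regression.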



To check whether the data points form clusters, we use a faceted plot. While the data do not appear to vary much between regions or sexes, the smokers and non-smokers in each facet appear to cluster separately, with the non-smokers having overall lower medical costs.


<center>

![](../images/facet.png){width="450" height="450"}

</center>
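
One way to put a number on the smoker versus non-smoker gap seen in the facets is to compare group means. The sketch below assumes a data frame with `smoker` and `charges` columns; the `toy` values are invented for illustration and are not taken from the Medical Cost dataset, and `by_smoker` is a hypothetical name.

```r
# Toy data for illustration only: these values are invented and are NOT
# rows of the Medical Cost dataset.
toy <- data.frame(
  smoker  = c("yes", "no", "no", "yes", "no", "no"),
  charges = c(16900, 4400, 8200, 41000, 13500, 3900)
)

# Mean charges per smoking status; aggregate() returns one row per group.
by_smoker <- aggregate(charges ~ smoker, data = toy, FUN = mean)
by_smoker
```

The same call with `charges ~ smoker + region` would give one mean per facet rather than per smoking status.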



How is sex distributed among the different age groups?
Looking at the dataset, there appear to be more beneficiaries in the 20-60 age range. The biggest difference in the number of beneficiaries between the sexes is seen in the 20-30 bracket.


<center>

![](../images/age_histogram.png){width="450" height="450"}

</center>



How about the distribution of sex among the regions?
This plot shows the distribution of sex in each of the four regions. At a glance, the dataset looks very even when it comes to sex, but there are slightly more beneficiaries in the southeast.


<center>

![](../images/region_barchart.png){width="450" height="450"}

</center>




## Methods


```r
# PLACE HOLDER FOR LINEAR REGRESSION
```

## Results


```r
# PLACE HOLDER FOR LINEAR REGRESSION
```

## Discussion


```r
# PLACE HOLDER FOR LINEAR REGRESSION
```

## Conclusion


```r
# PLACE HOLDER FOR LINEAR REGRESSION
```

## References
1. Medical Costs Dataset - https://gist.github.com/meperezcuello/82a9f1c1c473d6585e750ad2e3c05a41
2. BMI - https://www.nhlbi.nih.gov/health/educational/lose_wt/BMI/bmi-m.htm
Binary file added docs/milestone2.pdf
Binary file not shown.
Binary file added docs/milestone2_files/figure-html/bar chart-1.png
Binary file added docs/milestone2_files/figure-html/facet-1.png
Binary file added docs/milestone2_files/figure-html/stack-1.png
2 changes: 1 addition & 1 deletion scripts/load_data.R
Expand Up @@ -13,7 +13,7 @@ suppressMessages(library(here))
suppressMessages(library(RCurl))
suppressMessages(library(glue))

# where are data is: https://gist.github.com/meperezcuello/82a9f1c1c473d6585e750ad2e3c05a41/raw/d42d226d0dd64e7f5395a0eec1b9190a10edbc03/Medical_Cost.csv
# where our data is: https://gist.github.com/meperezcuello/82a9f1c1c473d6585e750ad2e3c05a41/raw/d42d226d0dd64e7f5395a0eec1b9190a10edbc03/Medical_Cost.csv

# read in command-line arguments
opt <- docopt(doc)