# Exercise 3: Logistic regression: analysis
```{r, message=F}
library(dplyr)
library(ggplot2)
```
(The data wrangling script can be found [here](https://github.com/jonathanholder/IODS-project/blob/master/data/create_alc.R))
## 1: Creating this file
## 2: Reading data & short description
[metadata source](https://archive.ics.uci.edu/ml/datasets/Student+Performance)
'This data approach[es] student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features, and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).'

The data has been pre-processed and is used here to analyse the students' alcohol consumption (one of the 'social features', presumably). The binary variable $high_use denotes high alcohol usage, defined as an average of weekend and workday consumption greater than 2 (both measured on a scale from 1 (very low) to 5 (very high)).
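For orientation, here is a minimal sketch of how these variables are presumably constructed in the linked wrangling script; it is not evaluated here and assumes the original Dalc (workday) and Walc (weekend) consumption columns of the UCI data:

```{r, eval=FALSE}
# sketch only (not run): alc_use as the average of workday and weekend
# consumption, high_use as a logical flag for averages above 2
alc <- alc %>%
  mutate(alc_use = (Dalc + Walc) / 2,
         high_use = alc_use > 2)
```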
Below you find an overview of the variables in the data and a few sample observations. Overall, there are 382 observations (students) of 38 variables (for more information on the individual variables, see the metadata link above):
```{r}
# read the pre-processed (joined) student data and inspect its structure,
# dimensions and first few observations
alc <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=",", header=T)
str(alc)
dim(alc)
head(alc)
```
## 3: 4 variables, 4 hypotheses
* H 1: Common assumption that males drink more than females ($sex)
* H 2: Learning interferes with drinking (or vice versa; high use positively correlated with $failures)
* H 3: Hangovers reduce attendance (high use positively correlated with $absences)
* H 4: When in a relationship, heavy drinking is of less interest (high use positively correlated with $romantic == "no")
### H 1: Common assumption that males drink more than females ($sex)
At first glance, H1 seems to be supported by the data: while there are about four times as many females with low alcohol consumption as with high consumption, there are fewer than twice as many males with low consumption as with high consumption (see table below). The average absolute alcohol use (based on the alc_use variable) is also higher among males.
```{r, echo=F}
table("Sex"=alc$sex, "High use"=alc$high_use)
print(paste0("Average alcohol use: F: ", round(mean(alc$alc_use[alc$sex=="F"]), 2), ", M: ", round(mean(alc$alc_use[alc$sex=="M"]), 2)))
ggplot(data=alc, aes(x=high_use, fill=sex))+geom_bar()+facet_wrap("sex")
```
### H 2. Learning interferes with drinking (or vice versa; high use positively correlated with $failures)
H2 seems to be supported as well: while only about a quarter of the students who have not failed any courses indicate high alcohol consumption, about half of the students with 2 or 3 failed courses do.
```{r, echo=F}
table("failures"=alc$failures, "High use"=alc$high_use)
ggplot(data=alc, aes(x=failures, fill=high_use)) + geom_bar()
```
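The shares quoted above can be read off a row-wise proportion table (an extra check using the same variables; rows sum to 1):

```{r, echo=F}
# share of high use within each number of failed courses
table("failures"=alc$failures, "High use"=alc$high_use) %>% prop.table(margin=1) %>% round(2)
```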
### H 3. Hangovers reduce attendance (high use positively correlated with $absences)
H3 seems to be supported as well: on average, students with high alcohol consumption have about 1.8 more absences (about 40% more) than those with low consumption.
```{r, echo=F}
print(paste0("Average absences: low use: ", round(mean(alc$absences[alc$high_use==F]), 2), ", high use: ", round(mean(alc$absences[alc$high_use==T], 2))))
ggplot(data=alc, aes(x=high_use, y=absences))+geom_boxplot()
```
The difference also becomes clear when looking at the histograms of absences for high and low alcohol usage separately; in the case of high usage (top), the tail is heavier than in that of lower usage (bottom):
```{r, echo=F}
hist(alc$absences[alc$high_use==T], breaks=seq(0, 80, 5), main="Absences: high alcohol use", xlab="absences")
hist(alc$absences[alc$high_use==F], breaks=seq(0, 80, 5), main="Absences: low alcohol use", xlab="absences")
```
### H 4. When in a relationship, heavy drinking is of less interest (high use positively correlated with $romantic == "no")
H4 does not seem to be strongly supported by the data, as the share of high alcohol usage among students in a relationship and those not in one does not differ greatly (in a relationship: ~30% high usage, not in a relationship: ~27%).
```{r, echo=F}
table("In a relationship"=alc$romantic, "High use"=alc$high_use)
ggplot(data=alc, aes(x=romantic, fill=sex)) + geom_bar() + facet_wrap("high_use") + ggtitle("Relationship status by sex, faceted by high_use")
```
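The proportions cited above can be checked with a row-wise proportion table (a small supplement to the chunk above; rows sum to 1):

```{r, echo=F}
# share of high use within each relationship status
table("In a relationship"=alc$romantic, "High use"=alc$high_use) %>% prop.table(margin=1) %>% round(2)
```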
## 5: Logistic regression model
Below you find the summary for a logistic regression model of high alcohol usage with sex, absences, failures, and being in a romantic relationship as predictors:
```{r, echo=F}
m1 <- glm(high_use ~ sex + absences + failures + romantic, data = alc, family = "binomial")
summary(m1)
```
As already suspected from a first glance at the data, male sex, absences and failures are positively and significantly associated with high alcohol consumption, supporting H1, H2 and H3. H4 is not supported by the model, as being in a relationship is not significantly associated with high consumption.
```{r, echo = F, message = F}
OR <- coef(m1) %>% exp
CI <- exp(confint(m1))
cbind(OR, CI)
```
When looking at the odds ratios of the model coefficients, the following can be derived:

* Sex: being male multiplies the odds of high use by about 2.7 compared to females.
* Absences: each additional absence increases the odds of high use by about 8% (OR ≈ 1.08).
* Failures: each failed course increases the odds of high use by about 51% (OR ≈ 1.51).
* Romantic: being in a relationship decreases the odds of high use by about 25% (OR ≈ 0.75), though this association is not significant.
The confidence intervals are widest for the sex variable, so its effect (in absolute terms) is the most uncertain.
The only predictor whose 95% confidence interval includes 1 is being in a relationship. In practice, this means it is unclear whether this predictor has a positive (odds ratio > 1) or negative (odds ratio < 1) association with the target variable high_use; in other words, the predictor is not significantly associated with high_use (see model summary).
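As a quick numerical check (a small sketch reusing m1 from above), the odds ratios can also be expressed as percentage changes in the odds:

```{r}
# odds ratios of m1 expressed as percentage change in the odds of high_use
round((exp(coef(m1)) - 1) * 100, 1)
```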
## 6: Predictive power of the model
To analyse the predictive power of the model, only the predictor variables significantly associated with high alcohol usage were used, i.e. high_use ~ sex + absences + failures (the instructions about dropping insignificant predictors were a bit unclear to me).
Below you find a summary of the model (largely equivalent to m1):
```{r, echo=F}
m2 <- glm(high_use ~ sex + absences + failures, data = alc, family = "binomial")
summary(m2)
```
This model was then used to predict high_use (input: the training data, i.e. the whole data set); here are two cross tables of predicted and observed high_use (top: counts, bottom: proportions):
```{r, echo=F}
# predicted probability of high use and the resulting binary prediction
alc$hu_prob <- predict(m2, type = "response")
alc$hu_pred <- ifelse(alc$hu_prob > 0.5, TRUE, FALSE)
# cross tables of predictions vs observations: counts and proportions
table(prediction = alc$hu_pred, observation = alc$high_use) %>% addmargins
table(prediction = alc$hu_pred, observation = alc$high_use) %>% prop.table %>% addmargins
```
Here, it becomes clear that the model is rather conservative in predicting high alcohol usage: it predicts only about 10% of the students to fall within this definition (data: 29%), correspondingly producing relatively more false negatives (23%) than false positives (3%). The training error can be calculated as

training error = false negatives (0.225) + false positives (0.031) = 0.256
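As a cross-check, the same figure can be computed directly from the confusion table built above (a small sketch reusing the hu_pred column):

```{r}
# training error: share of misclassified students (false negatives + false positives)
conf <- table(prediction = alc$hu_pred, observation = alc$high_use)
(conf["FALSE", "TRUE"] + conf["TRUE", "FALSE"]) / sum(conf)
```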
This is also confirmed by the loss function provided in the course:
```{r}
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$hu_prob)
```
The higher prevalence of false negatives also becomes clear from a scatterplot (false negatives: red points on the upper line; false positives: cyan points on the lower line):
```{r, echo=F}
p <- ggplot(alc, aes(x = hu_prob, y = high_use, col=hu_pred))+geom_point()
p
```
A simple guessing scheme to 'predict' high alcohol use can be formulated as
```{r}
set.seed(1)
# guess TRUE with probability 112/382, i.e. the prevalence of high use in the data
alc$hu_guess <- sample(x=c(rep(FALSE, 270), rep(TRUE, 112)), size=nrow(alc), replace=T)
```
i.e. the probability of assigning high usage to a student is based on the prevalence of high usage in the data (112 of 382 students). This simple guessing strategy's 'predictive power' can be assessed in the same way as that of the model above:
```{r}
table(guess = alc$hu_guess, observation = alc$high_use) %>% prop.table %>% addmargins
loss_func(class = alc$high_use, prob = alc$hu_guess)
```
training error = false positives (0.225) + false negatives (0.186) = 0.411
The overall predictive power of the model therefore seems superior to mere guessing.
An interesting closing remark: note that the guess in fact identifies high usage correctly more often (~11% of all students) than the model (~7%) (!).
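A small sketch to verify this, reusing the hu_pred and hu_guess columns created above (the exact guess figure depends on the random seed):

```{r}
# share of all students whose high use is correctly identified (true positives)
mean(alc$hu_pred & alc$high_use)    # model
mean(alc$hu_guess & alc$high_use)   # random guess
```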