---
title: "Homework 1"
output: html_document
---
```{r setup, include=FALSE}
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
# Question 1
Supervised learning fits models that have a response variable: the goal is to learn the relationship between the predictors and the response on a training set, and to generalize those predictions to observations outside the training set. Unsupervised learning fits models without a response variable: the goal is to discover latent patterns or structure in the data. The major difference between the two is therefore whether a response variable is observed.
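As a minimal sketch of the distinction, using the `mpg` data from `ggplot2` (the choice of predictors and of three clusters is arbitrary, for illustration only):
```{r}
set.seed(1)
# Supervised: hwy is a labeled response we try to predict from displ and cyl
supervised_fit <- lm(hwy ~ displ + cyl, data = mpg)

# Unsupervised: no response variable; k-means only looks for groups in the inputs
unsupervised_fit <- kmeans(mpg[, c("displ", "cyl")], centers = 3)
```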
# Question 2
The main difference between regression and classification models is the type of response variable. In classification models, the response is categorical, such as whether a patient has a disease or whether a game is won or lost. In regression models, the response is numerical, such as GDP, height, or weight.
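A hedged sketch of the same predictor used both ways; the 25-mpg cutoff below is an arbitrary choice made only to manufacture a binary response:
```{r}
# Regression: numerical response (highway mileage itself)
reg_fit <- lm(hwy ~ displ, data = mpg)

# Classification: binary response (does the car exceed 25 hwy mpg?)
clf_fit <- glm(as.numeric(hwy > 25) ~ displ, data = mpg, family = binomial)
```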
# Question 3
Metrics for regression ML problems:
- RMSE: the root mean squared error, calculated as $\sqrt{\frac{\sum_i (y_i-\hat{y}_i)^2}{n}}$.
- MAE: the mean absolute error, calculated as $\frac{\sum_i |y_i-\hat{y}_i|}{n}$. Unlike RMSE, MAE is not differentiable at zero error, which makes gradient-based optimization harder to apply.
Metrics for classification ML problems:
- Accuracy: the proportion of correctly predicted observations out of the total number of observations.
- Log-likelihood: when the response is binary, the average log-likelihood is $\frac{\sum_i \left(y_i \log(\hat{p}_i)+(1-y_i)\log(1-\hat{p}_i)\right)}{n}$, where $y_i = 1$ and $y_i = 0$ denote the two levels of the binary response and $\hat{p}_i$ is the predicted probability that $y_i = 1$.
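A small base-R sketch computing all four metrics; `y`, `y_hat`, `y_bin`, and `p_hat` are toy values, not the output of any model:
```{r}
y     <- c(3.1, 4.0, 5.2, 6.8)   # true numerical responses
y_hat <- c(2.9, 4.3, 5.0, 7.1)   # predicted values

rmse <- sqrt(mean((y - y_hat)^2))  # root mean squared error
mae  <- mean(abs(y - y_hat))       # mean absolute error

y_bin <- c(1, 0, 1, 1)             # true binary labels
p_hat <- c(0.8, 0.3, 0.6, 0.9)     # predicted P(y = 1)

accuracy <- mean((p_hat > 0.5) == y_bin)  # proportion classified correctly
loglik   <- mean(y_bin * log(p_hat) + (1 - y_bin) * log(1 - p_hat))

c(rmse = rmse, mae = mae, accuracy = accuracy, loglik = loglik)
```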
# Question 4
- Descriptive models: describe the data set through summary statistics, such as the mean (central tendency), the standard deviation (spread), and quantiles (a summary of the distribution).
- Inferential models: used to learn about how the observations are generated, that is, which predictors actually affect the response and how.
- Predictive models: used to predict outcomes for observations outside the training set, i.e., to generalize.
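For the descriptive case, a one-line sketch of those summary statistics on `mpg$hwy`:
```{r}
# Descriptive statistics: center, spread, and distribution summary of hwy
c(mean = mean(mpg$hwy), sd = sd(mpg$hwy))
quantile(mpg$hwy)
```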
# Question 5
## 1.
- Mechanistic models: constructed from a theory of the mechanism by which the predictors affect the response; the effects of the predictors appear as explicit parameters or coefficients in the model.
- Empirically-driven models: constructed from the data alone, without an explicit, interpretable model of the mechanism.
## 2.
Mechanistic models are easier to understand than empirically-driven models because they are built on an explicit model of the relationship between the predictors and the response variable.
## 3.
In mechanistic models, adding predictors decreases the bias of the fit but increases its variance at the same time. To manage this tradeoff, irrelevant predictors should be removed.
In empirically-driven models, bias decreases as model complexity increases, but variance increases at the same time. To manage this tradeoff, the complexity should be reduced, for example by adding regularization or by removing some predictors.
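A small simulation sketch of this tradeoff (the data-generating process, `sin(x)` plus noise, and the degree range are illustrative assumptions): training error keeps falling as the polynomial degree grows, while test error eventually rises again once variance dominates.
```{r}
set.seed(123)
n <- 100
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)   # assumed data-generating process, for illustration
dat   <- data.frame(x, y)
train <- sample(n, n / 2)

errors <- sapply(1:10, function(d) {
  fit  <- lm(y ~ poly(x, d), data = dat[train, ])
  pred <- predict(fit, newdata = dat[-train, ])
  c(train = mean(residuals(fit)^2),          # training MSE keeps falling with d
    test  = mean((dat$y[-train] - pred)^2))  # test MSE is U-shaped in d
})
round(errors, 3)
```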
# Question 6
The first scenario calls for a predictive model, because we want to predict whether a voter will support the candidate rather than to understand how each element of the voter's profile affects that support.
The second scenario calls for an inferential model, because its main question is "How would a voter's likelihood of support for the candidate change if they had personal contact with the candidate?"; that is, the effect of personal contact on a voter's support must be investigated through statistical inference.
# Exercise 1
```{r}
data(mpg)
ggplot(mpg, aes(x = hwy)) + geom_histogram()
```
# Exercise 2
```{r}
ggplot(mpg, aes(x = hwy, y = cty)) + geom_point()
```
The scatter plot shows that `hwy` and `cty` are strongly, positively, and approximately linearly correlated: `cty` increases as `hwy` increases, and the rate of increase is roughly constant, i.e., a fixed increment in `hwy` corresponds to a roughly fixed increment in `cty`.
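A quick numeric check of that claim:
```{r}
cor(mpg$hwy, mpg$cty)
```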
# Exercise 3
```{r}
mpg %>% group_by(manufacturer) %>% summarise(counts = n()) %>%
  ggplot(aes(x = reorder(manufacturer, counts), y = counts)) +
  geom_col() + coord_flip() + labs(x = 'Manufacturer')
```
Dodge produced the most cars and Lincoln produced the least cars.
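The counts behind the bar chart can also be read off directly:
```{r}
mpg %>% count(manufacturer, sort = TRUE)
```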
# Exercise 4
```{r}
ggplot(mpg, aes(x = cyl, y = hwy, group = cyl)) + geom_boxplot()
```
The boxplots show that `cyl` affects `hwy` negatively: highway mileage tends to decrease as the number of cylinders increases.
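The per-group medians make the trend easier to read:
```{r}
mpg %>% group_by(cyl) %>% summarise(median_hwy = median(hwy))
```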
# Exercise 5
```{r}
library(corrplot)
corrplot(cor(mpg %>% select(displ, cyl, cty, hwy)))
```
The categorical variables and the predictor `year` are removed first: `year` is not meaningful for this question, and categorical variables are not suitable for computing correlations. The plot shows that `hwy` is positively correlated with `cty` but negatively correlated with `displ` and `cyl`; likewise, `cty` is negatively correlated with `displ` and `cyl`; and `displ` is positively correlated with `cyl`.
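The numeric matrix behind the plot, for reference:
```{r}
round(cor(mpg %>% select(displ, cyl, cty, hwy)), 2)
```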