-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathlessons_functional_forms2.Rmd
226 lines (172 loc) · 8.98 KB
/
lessons_functional_forms2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
---
title: "Functional Forms"
output:
html_document:
fig_width: 5
fig_height: 4
---
## Learning Objectives
- Learning about multiple ways of transforming your covariates and specifing your models.
## Required Packages and Datasets
Install the packages if you still don't have them (e.g. `install.packages(gapminder)`). If you don't have the `uwpols501` package, you need to install it from GitHub. Install & load the `devtools` package first, and then execute `install_github("UW-POLS501/r-uwpols501")`.
```{r, message = FALSE}
library(dplyr)
library(uwpols501)
library(gapminder)
library(broom)
library(ggplot2)
data(gapminder)
data(turnout)
```
Also go to the following link to download a WDI dataset from 1980 to 2010.
`https://www.dropbox.com/s/qqb3vg66hxqfqzb/wdi_co2_gdp_1980_2010.csv?dl=0`
Load the `wdi` dataset.
```{r}
wdi <- read.csv("data/wdi_co2_gdp_1980_2010.csv")
```
## Intro
I use as a source of inspiration in this lesson Chris Adolph's POLS-503 [notes](http://faculty.washington.edu/cadolph/503/topic5.pw.pdf).
In a linear model, any unit change of a given covariate has always the same effect on the outcome variable. If the coefficient for that covariate is for example 0.54, it doesn't matter if the unit change is from 1 to 2 or from 100,001 to 100,002. The expected effect on the outcome variable is always the same: 0.54 increase in the units of the dependent variable.
However, theoretically we may sometimes think that the way a covariate affects the response variable is not linear. For example, is the relationship between `income` and `life expectancy` linear? Or Acemoglu and Robinson (2006) argue that the relationship between `inequality` and `democratization` follows a U-shape. How do we deal with this non-linear relationships? Can we still use a linear models to estimate them?
$y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2}... + \epsilon$
## Logrithmic Transformations
Think a little bit more about the relationship between Life Expectancy and GDP per capita. This relation is probably not linear. As the GDP per capita of a country increases, on average its citizens have more resources, better nutrition, etc., so you would expect the Life Expectancy of that country to increase. However, at least with the technology and medicines that are currently available, at a certain point Life Expectancy cannot keep increasing at the same rate or it cannot increase at all.
Load the `gapminder` dataset from the `gapminder` package.
Estimate a model that portrays `lifeExp` as a linear function of `gdpPercap`.
$lifeExp = \beta_{0} + \beta_{1}gdpPercap + \epsilon$
```{r}
mod1a <- lm(lifeExp ~ gdpPercap, data = gapminder)
```
Plot the `lifeExp` that we would predict using this model against the covariate `gdpPercap`.
```{r, fig.align='center'}
mod1a_augm <- augment(mod1a, gapminder)
ggplot(mod1a_augm, aes(x = gdpPercap, y = .fitted)) +
geom_point()
```
Uau! This model predicts that the Life Expectancy in Kuwait is 140 years!
Plot now the outcome variable `lifeExp` against the response variable `gdpPercap`.
```{r, fig.align='center'}
ggplot(gapminder, aes(y = lifeExp, x = gdpPercap)) +
geom_point()
```
This graphs clearly shows that in the beginning, a unit increase on `gdpPercap` matters a lot in explaining `lifeExp`. However, at some point, extra units increase do not make much of a difference. We call this type of relation a *Logarithmic Relation*. How can we specify this logarithmic relation in a linear model?
If we believe that there is a *logarithmic* relation between the outcome variable and a covariate, this is the same as saying that there is a *linear* relation between the outcome variable and the $log$ of the covariate. Thus, we just need to transform the covariate of interest by calculating its $log$ and then add the resulting variable to the linear model.
To show this, include a new variable to the `gapminder` dataset: the *log* of `gdpPercap`.
```{r}
gapminder <- mutate(gapminder, log_gdpPercap = log(gdpPercap))
```
Compare the distribution of `gdpPercap` and `log_gdpPercap` by looking at their `quantiles()`.
```{r}
data.frame(qGdp = quantile(gapminder$gdpPercap),
qLogGdp = quantile(gapminder$log_gdpPercap))
```
Plot `lifeExp` against this new variable. Observe how there is a linear relation between the outcome and the *log* of the covariate. Observe also that `log_gdpPercap` is not expressed on the original units (\$ per capita) anymore but in *log* \$ per capita.
```{r, fig.align='center'}
ggplot(gapminder, aes(x = log_gdpPercap, y = lifeExp)) +
geom_point()
```
Estimate a model now that portrays `lifeExp` as a linear function of the *log* of `gdpPercap`.
$lifeExp = \beta_{0} + \beta_{1}log(gdpPercap) + \epsilon$
```{r}
mod1b <- lm(lifeExp ~ log_gdpPercap, data = gapminder)
```
Plot `mod1b`'s predicted values of `lifeExp` against `gdpPercap`.
```{r, fig.align='center'}
mod1b_augm <- augment(mod1b, gapminder)
ggplot(mod1b_augm, aes(y = .fitted, x = gdpPercap)) +
geom_point()
```
That makes a lot more sense!
Finally let's see how well each model does in predicting the actual `lifeExp` values.
```{r, fig.align='center'}
ggplot(mod1a_augm, aes(x = lifeExp, y = .fitted)) +
geom_point()
ggplot(mod1b_augm, aes(x = lifeExp, y = .fitted)) +
geom_point()
```
`r challenge_start()`
How would you create a dataset that contains the actual and predicted values of both models? Having such dataset would facilitate creating a side-by-side plot using the `facet_wrap` of `ggplot`.
`r challenge_end()`
`r solution_start()`
```{r}
predictions <- NULL
models_list <- list(mod1a, mod1b)
for (i in 1:length(models_list)) {
m <- models_list[[i]]
new_data <- data.frame(actual = m$model$lifeExp,
predicted = m$fitted.values,
model = i)
predictions <- rbind(predictions, new_data)
}
```
```{r, fig.align='center'}
ggplot(predictions, aes(x = actual, y = predicted)) +
geom_point() +
facet_wrap(~model)
```
`r solution_end()`
`r challenge_start()`
1. Look at the regression table for `mod1b`. How would you interpret the coefficients?
2. Can you think of any covariate that you use in your own projects that has a *logarithmic relation* with the outcome variable you study? Describe that relationship.
`r challenge_end()`
## Squared Terms
Sometimes we believe that there is a U-shaped relation between a covariate and the dependent variable. For example, some scholars such as [David I. Stern](http://www.sterndavidi.com/publications_type.html) argue that there is an [Environmental Kuznets Curve](https://en.wikipedia.org/wiki/Kuznets_curve#Environmental_Kuznets_curve). This means that there is an inverted U-shaped relation between environmental quality and economic growth.
As we discussed in the beginning, sometimes there is a squared relation between a covariate and our dependent variable of interest (e.g. [Kuznets curve](https://en.wikipedia.org/wiki/Kuznets_curve)).
![Source: Wikipedia](img/env_curve.png)
How do we incorporate this theoretical expectation into a linera model? Using a *squared term*!
Estimate the following models:
1. $CO2emissions = \beta_{0} + \beta_{1}gdpCapita + \epsilon$
2. $CO2emissions = \beta_{0} + \beta_{1}log(gdpCapita) + \epsilon$
3. $CO2emissions = \beta_{0} + \beta_{1}gdpCapita + \beta_{2}gdpCapita^{2} + \epsilon$
Estimate the models and use the `augment` function to add the predicted values of `co2` to the dataset.
```{r}
mod2a <- lm(co2 ~ gdpCapita, wdi)
mod2b <- lm(co2 ~ gdpCapita + I(gdpCapita^2), wdi)
mod2c <- lm(co2 ~ log(gdpCapita), wdi)
mod2a_aug <- augment(mod2a, wdi)
mod2b_aug <- augment(mod2b, wdi)
mod2c_aug <- augment(mod2c, wdi)
```
For the 3 models, plot the predicted values of `co2`against `gdpCapita`.
```{r, fig.align='center'}
ggplot(mod2a_aug, aes(x = gdpCapita, y = .fitted)) +
geom_point()
ggplot(mod2b_aug, aes(x = gdpCapita, y = .fitted)) +
geom_point()
ggplot(mod2c_aug, aes(x = gdpCapita, y = .fitted)) +
geom_point()
```
For the 3 models, plot the actual v. predicted values of `co2`.
```{r, fig.align='center'}
ggplot(mod2a_aug, aes(x = co2, y = .fitted)) +
geom_point()
ggplot(mod2b_aug, aes(x = co2, y = .fitted)) +
geom_point()
ggplot(mod2c_aug, aes(x = co2, y = .fitted)) +
geom_point()
```
`r challenge_start()`
How would you create a dataset that contains the actual and predicted values of `co2`, and `gdpCapita` for the 3 models ? And how would you replicate the 2 types of plots we just used using the `facet_wrap`?
`r challenge_end()`
`r solution_start()`
```{r}
predictions2 <- NULL
models_list2 <- list(mod2a, mod2b, mod2c)
for (i in 1:length(models_list2)) {
m <- models_list2[[i]]
new_data <- data.frame(actual_co2 = m$model$co2,
predicted_co2 = m$fitted.values,
model = i)
predictions2 <- rbind(predictions2, new_data)
}
predictions2$gdpCapita <- mod2a$model$gdpCapita
```
```{r}
ggplot(predictions2, aes(x = gdpCapita, y = predicted_co2)) +
geom_point() +
facet_wrap(~model)
ggplot(predictions2, aes(x = actual_co2, y = predicted_co2)) +
geom_point() +
facet_wrap(~model)
```
`r solution_end()`