forked from rafavdz/quants_workbook
-
Notifications
You must be signed in to change notification settings - Fork 1
/
11-Answers.Rmd
373 lines (249 loc) · 15.1 KB
/
11-Answers.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
# Workbook suggested answers {#answers .unnumbered}
## Introduction {.unnumbered}
This chapter presents the suggested `R` code to answer the workbook activities and exercises throughout the course labs in Quantitative Research Methods for Social Sciences. This covers from Lab 3 to Lab 9.
Before looking at the answers, try asking your tutor for help. Also, we strongly recommend web resources, such as <https://stackoverflow.com/> or <https://community.rstudio.com/>. By solving the issues, you will learn a lot! ;)
## Lab 3. Data wrangling {.unnumbered}
```{r eval = FALSE}
## Load the packages
library(tidyverse)
# Read the data from the .rds file
clean_data <- readRDS("data/nilt_r_object.rds")
# Glimpse clean_data
glimpse(clean_data)
# Glimpse the nilt data
glimpse(nilt)
```
## Lab 4. Exploratory data analysis {.unnumbered}
Preamble code
```{r}
## Load the packages
library(tidyverse)
# Read the data from the .rds file
nilt <- readRDS("data/nilt_r_object.rds")
```
```{r}
#Subset
nilt_subset <- select(nilt, rsex, rage, highqual, religcat, uninatid, ruhappy, rhourswk, persinc2)
```
### Activity #1
From your R Studio Cloud script, do the following activities using the data stored in the `nilt_subset` object:
- Create a One-Way contingency table for `uninatid` in the `nilt_subset` dataset using the `sumtable()` function;
```{r}
# Load the vtable package to create summary tables
library(vtable)
# Create table
sumtable(nilt_subset, vars = c('uninatid'))
```
- Using the variables `religcat` and `uninatid`, generate a Two-Way contingency table;
```{r}
sumtable(nilt_subset, vars = c('religcat'), group = 'uninatid')
```
### Activity #2
Using the data in the `nilt_subset` object, complete the following activities.
- Using the `hist()` function plot a histogram of personal income `persinc2`. From the NILT documentation this variable refers to annual personal income in £ before taxes and other deductions;
```{r}
hist(nilt_subset$persinc2)
```
- Create a summary of the personal income `persinc2` variable, using the `sumtable()` function.
```{r}
sumtable(nilt_subset, vars = c('persinc2'))
```
- Compute the mean and standard deviation of the personal income `persinc2`, grouped by happiness `ruhappy`.
```{r}
sumtable(nilt_subset, vars = c('persinc2'), group = 'ruhappy')
```
## Lab 6. Visual exploratory analysis {.unnumbered}
Preamble code
```{r}
## Load the packages
library(tidyverse)
# Load the data from the .rds file we created in the last lab
nilt <- readRDS("data/nilt_r_object.rds")
#Create subset
nilt_subset <- select(nilt, rsex, rage, highqual, religcat, uninatid, ruhappy, rhourswk, persinc2)
```
### Excercices
Using the `nilt_subset` object, complete the tasks below in the Rmd file 'Lab_4', which you created earlier. Insert a new chunk for each of these activities and include brief comments as text in the Rmd document to introduce the plots and discuss the results (tip ---leave an empty line between your text and the next chunk to separate the description and the plots):
- Create a first-level header to start a section called "Categorical analysis";
```{r}
## Categorical analysis
```
- Create a simple bar plot using the `geom_bar()` geometry to visualize the political affiliation reported by the respondents using the variable `uninatid`;
```{r}
ggplot(nilt_subset, aes(x=uninatid)) +
geom_bar() +
labs(title = "Political afiliation", x= "Party")
```
- Based on the plot above, create a 'stacked bar plot' to visualize the political affiliation by religion, using the `uninatid` and `religcat` variables;
```{r}
ggplot(nilt_subset, aes(x=uninatid, fill = religcat)) +
geom_bar() +
labs(title = "Political afiliation by religion",
x= "Party", fill = "Religion")
```
- Create a new first-level header to start a section called "Numeric analysis";
```{r}
## Numeric analysis
```
- Create a scatter plot about the relationship between personal income `persinc2` on the Y axis and number of hours worked a week `rhourswk` on the X axis;
```{r}
ggplot(nilt_subset, aes(x= rhourswk, y=persinc2)) +
geom_point() +
labs(title= 'Income and number of hours worked a week',
x = 'Number of hours worked a week', y= 'Personal income (£ a year)' )
```
- Finally, create a box plot to visualize personal income `persinc2` on the Y axis and self-reported level of happiness `ruhappy` on the x axis... Interesting result, Isn't it? Talk to your lab group-mates and tutors about your results on Zoom (live) or your Lab Group on Teams (online anytime);
```{r}
ggplot(nilt_subset, aes(x= ruhappy, y=persinc2)) +
geom_boxplot() +
labs(title= 'Personal income and happiness',
x='Hapiness level', y='Personal income (£ a year)')
```
- Briefly comment each of the plots as text in your Rmd file;
- Knit the .Rmd document as HTML or PDF. The knitted file will be saved automatically in your project. You can come back to the Rmd file to make changes if needed and knit it again as many times as you wish.
## Lab 7. Correlation {.unnumbered}
```{r }
## Load the packages
library(tidyverse)
# Load the data from the .rds file we created in lab 3
nilt <- readRDS("data/nilt_r_object.rds")
```
```{r}
# Age of respondent’s spouse/partner
nilt$spage <- as.numeric(nilt$spage)
# Migration
nilt <- mutate_at(nilt, vars(mil10yrs, miecono, micultur), as.numeric)
```
```{r}
# overall perception towards migrants
nilt <- rowwise(nilt) %>%
# sum values
mutate(mig_per = sum(mil10yrs, miecono, micultur, na.rm = T )) %>%
ungroup() %>%
# assign NA to values that sum 0
mutate(mig_per = na_if(mig_per, 0))
```
### Activity 1
Using the `nilt` data object, visualize the relationship of the following variables by creating a new chunk. Run the chunk individually and comment on what you observe from the result as text in the Rmd file (remember to leave an empty line between your text and the chunk).
- Create a scatter plot to visualize the correlation between the respondent's overall opinion in relation to migration `mig_per` and the respondent's age `rage`. Remember that we just created the `mig_per` variable by summing three variables which were in a 0-10 scale (the higher the value, the better the person's perception is). In `aes()`, specify `rage` on the X axis and `mig_per` on the Y axis. Use the `ggplot()` function and `geom_point()`. Also, include a straight line describing the points using the `geom_smooth()` function. Within this function, set the `method` argument to `'lm'`.
```{r}
ggplot(nilt, aes(x=rage, mig_per)) +
geom_point() +
geom_smooth(method = 'lm') +
labs(title = 'Perception of migration vs age',
x= 'Respondent age', y= 'Perception of migration (0-30)')
```
- What type of relationship do you observe? Comment the overall result of the plot and whether this is in line with your previous expectation.
## Lab 8. Linear model. Simple linear regression {.unnumbered}
```{r}
## Load the packages
library(tidyverse)
# Read the data from the .rds file
nilt <- readRDS("data/nilt_r_object.rds")
```
```{r}
m3 <- lm(persinc2 ~ rhourswk, data = nilt)
```
### Lab activities
Use the `nilt` data set object in your `linear_model_intro` file to:
1. Plot a scatter plot using `ggplot`. In the aesthetics, locate `rhourswk` in the X axis, and `persinc2` in the Y axis. In the `geom_point()`, jitter the points by specifying the `position = 'jitter'`. Also, include the best fit line using the `geom_smooth()` function, and specify the `method = 'lm'` inside.
```{r}
ggplot(nilt, aes(x= rhourswk, y= persinc2)) +
geom_point(position = 'jitter') +
geom_smooth(method = 'lm')
```
2. Print the summary of `m3` using the `summary()` function.
```{r}
summary(m3)
```
3. Is the the relationship of hours worked a week significant? Re: Yes. The p-value (fourth column of the 'Coefficients' table) is lower than 0.05.
4. What is the adjusted r-squared? How would you interpret it? Re: the adjusted R-squared is 0.14. This can be interpreted in terms of percentage, e.g. 14% of the variance in personal income can be explained by the number of hours worked a week.
5. What is the sample size to fit the model? Re: The total number of observations in the data set is 1,204 and the model summary says that 747 observations were deleted due to missingness. Therefore, the sample size is 457 (1204-747).
6. What is the expected income in pounds a year for a respondent who works 30 hours a week according to coefficients of this model?
```{r}
5170.4 + 463.2 * 30
```
7. Plot a histogram of the residuals of `m3` using the `residuals()` function inside `hist()`. Do the residuals look normally distributed (as in a bell-shaped curve)?
```{r}
hist(residuals(m3))
```
Overall, the residuals look normally distributed with the exception of the values to the right-hand side of the plot (between 40000 and 60000).
## Lab 9. Multivariate linear model {.unnumbered}
### Lab activities
1. Load the packages, and the data that you will need in your file using the code below:
```{r eval=FALSE}
## Load the packages
library(moderndive)
library(tidyverse)
# Read the data from the .rds file
nilt <- readRDS("data/nilt_r_object.rds")
```
2. Print a table for the highest level of qualification `highqual` using the `table()` function.
```{r}
table(nilt$highqual)
```
3. Generate a scatter plot using `ggplot`. Within `aes()`, locate the number of hours worked a week `rhourswk` on the X axis and the personal income `persinc2` on the Y axis, and specify the `color` of the dots by the highest level of qualification `highqual`. Use the `geom_point()` function and 'jitter' the points using the argument `position`. Add the parallel slopes using the `geom_parallel_slopes()` function and set the standard error `se` to `FALSE`. What is your interpretation of the plot? Write down your comments to introduce the plot.
```{r}
ggplot(nilt, aes(x = rhourswk, y= persinc2, color = highqual)) +
geom_point(position = 'jitter') +
moderndive::geom_parallel_slopes(se = FALSE) +
labs(title = "Personal income",
subtitle = 'Personal income and number of hours worked a week by education level',
x= 'Number of hours worked a week', y= 'Personal income (£ a year)',
color = 'Highest education level')
```
4. Fit a linear model model using the `lm()` function to analyse the personal income `persinc2` using the number of works worked a week `rhourswk`, the highest level of qualification `highqual`, and the age of the respondent `rage` as independent variables. Store the model in an object called `m4` and print the summary.
```{r}
m4 <- lm(persinc2 ~ rhourswk + rage + highqual, nilt)
summary(m4)
```
5. Comment on the results of the model by mentioning which of the variables is significant and their respective p-value, the adjusted r-squared of the model, and the number of observations used to fit the model. Re: All the independent variables including the number of hours worked a week, age, and all the categories of highest qualification level compared to 'Degree or higher' are significant to predict personal income in the model 'm4', considering that the p-value is lower than 0.05. We can confirm this from the fourth column of the 'Coefficients' table. The adjusted R-squared of the model is 0.37. This means that 37.6% of the variance in personal income can be explained by these variables. The size of the sample used to fit this model is 457, considering that the 'nilt' data set contains 1204 observations but 747 were deleted due to missingness (1204 - 747).
6. Plot a histogram of the residuals for model `m4`. Do they look normally distributed? Can we trust our estimates or would you advise to carry out further actions to verify the adequate interpretation of this model?
```{r}
hist(residuals(m4))
```
The distribution of the residuals in 'm4' look overall normally distributed. However, the distribution is not perfectly symmetric. Therefore, we would advice to conduct further checks to test the linear model assumptions.
<!-- ## Logistic regression -->
<!-- ```{r} -->
<!-- ## Load the packages -->
<!-- library(tidyverse) -->
<!-- library(jtools) -->
<!-- # Read the data from the .rds file -->
<!-- nilt <- readRDS("data/nilt_r_object.rds") -->
<!-- ``` -->
<!-- ### Lab activities -->
<!-- You will explore the interreligious relations in Northern Ireland measured by the acceptance of a close relative marrying someone of a different religion in the variable `smarrrlg`. This variable asks _Would you mind or not mind if one of your close relatives were to marry someone of a different religion?_. The possible answers are three: _1 Would mind a lot_ / _2 Would mind a little_ / _3 Would not mind_. -->
<!-- Using the `nilt` object complete the following activities in your `logit_model.R` R Script: -->
<!-- 1. Coerce the variable `smarrrlg` to factor using the `as_factor()` function. -->
<!-- ```{r} -->
<!-- nilt$smarrrlg <- as_factor(nilt$smarrrlg) -->
<!-- ``` -->
<!-- 2. Print a simple table of this variable using the `table()` function. -->
<!-- ```{r} -->
<!-- table(nilt$smarrrlg) -->
<!-- ``` -->
<!-- 3. From the three possible options, create a new binary variable called `marr_mix` using the `ifelse()` function. Assign '1' if the response is _Would mind a lot_ or _2 Would mind a little_ and '0' otherwise. You can copy and run the code below. -->
<!-- ```{r} -->
<!-- nilt$marr_mix <- ifelse(nilt$smarrrlg == 'Would mind a lot' | nilt$smarrrlg == 'Would mind a little', 1, 0) -->
<!-- ``` -->
<!-- 4. Create a stacked bar plot using `ggplot()`, in `aes()` set `marr_mix` in the x axis, and `fill` to `religcat`. Use the `geom_bar()` function, and within this specify the `position` argument to 'fill'. Filter out the missing values for `marr_mix` before passing the data to ggplot. -->
<!-- ```{r} -->
<!-- nilt %>% -->
<!-- filter(!is.na(marr_mix)) %>% -->
<!-- ggplot(aes(x= marr_mix, fill = religcat)) + -->
<!-- geom_bar(position = 'fill') -->
<!-- ``` -->
<!-- 5. Fit a logit model using the `marr_mix` as the dependent variable, `persinc2` and `religcat` as the independent variables. Assign it to and object called `m9`. -->
<!-- ```{r} -->
<!-- m9 <- glm(marr_mix ~ persinc2 + religcat, data = nilt, family = 'binomial') -->
<!-- ``` -->
<!-- 6. Use the `jtools::summ()` function to print the summary of the results, setting the `exp` argument to `TRUE`. -->
<!-- ```{r} -->
<!-- jtools::summ(m9, exp = TRUE) -->
<!-- ``` -->
<!-- 7. Which of the independent variables is/are significant? -->
<!-- Re: From the results of 'm9', personal income and being protestant are significant to study who is more likely to mind if a respondent's close relative were to marry someone of a different religion. -->
<!-- 8. Do you think personal income is playing a role? -->
<!-- Re: Although personal income p-value is equal to 0.05, which can be consider as significant, we cannot expect this is playing a role, since the odds ratio are the same regardless of changes in personal income. We can confirm this from the first column of the coefficients table (persinc2 = 1). -->
<!-- 9. How do you interpret the coefficient for protestant respondents? -->
<!-- Re: The odds ratio of a respondent minding about a close relative marring someone of a different religion are 3.1 times higher for Protestant respondents compared to Catholic ones. -->