-
Notifications
You must be signed in to change notification settings - Fork 0
/
CW2.Rmd
1275 lines (952 loc) · 55.6 KB
/
CW2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Analytical report for SDG and coffee production data"
author: "10000"
output:
pdf_document:
keep_tex: yes
number_sections: yes
html_document:
number_sections: no
toc: yes
toc_float: yes
code_download: yes
code_folding: hide
always_allow_html: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,warning = FALSE, message = FALSE)
```
```{r libraries,echo= FALSE}
# include your libraries here
# use echo = FALSE if you do not want the Code button to appear
library(tidyverse)
library(modelr)
# install.packages("psych")
library(PerformanceAnalytics)
library(kableExtra)
library(psych)
library(broom)
library(mlmRev)
library(purrr)
library(effects)
library(ggpubr)
library(GGally)
```
<!--
You can include non-coding comments using these delimiters. What is written inside these delimiters wont be knitted into the html document!
Do not change the structure of the document, just fill in the sections as indicated
You can only change the headings indicated with [*]
-->
# Sustainable development goals (SDG) data
```{r part_1}
sdg <- read_csv("https://people.bath.ac.uk/kai21/MA50258/data/sdg.csv")
sdg <- na.omit(sdg) #259
```
### How does the infant mortality rate change over time for different countries?
This data set was provided from 89 countries associated with 9 variables including education and socio-demographic information by year. There are removed as the number of missing rows is 9 in total. Before starting the main analysis, I made a question to decide how to analyse this data set.
```{r summary_table}
sdg <- sdg %>% mutate(year_centre =year - 2015)
sdg_summary <- sdg %>% select(2:3,5:8,10) %>%
group_by(continent, year)
describe(sdg_summary[3:7])[c('mean','sd','median','min','max')] %>% kable(digits=2, caption="summary statistics") %>%
kable_styling(full_width = F)
```
According to the summary statistics, gdp and infant mortality rate are left skewed as the value of mean is bigger than median values and others are shown to be left skewed.
By generating year_centre, I also tried to check the effects in linear mixed models I will mention below.
```{r corr_matrix, out.width="85%", fig.align='center'}
ggpairs(sdg[c(2, 5,6,7,8)])
```
By a correlation matrix, it is shown that the distribution of infant mortality rate in the matrix is exceedingly left-skewed, as indicated by the values in the table above. Moreover, it is highly correlated with gdp and life expectancy and mortality rate, so, I highly recommend using these two variables for modelling and discovering how affect the response variable over years.
```{r visualisation_part1}
# different regression forms for years_education
plot1 <- sdg %>%
ggplot(aes(x = years_education, y = infant_mortality_rate)) +
geom_point() +
scale_y_continuous(trans="log") +
coord_trans(y = scales::exp_trans(exp(1))) +
geom_smooth(method = lm,formula = y ~ x, se = F, col="red")+
labs(title="Linear smoothing")
plot2 <- sdg %>%
ggplot(aes(x = years_education, y = infant_mortality_rate)) +
geom_point() +
scale_y_continuous(trans="log") +
coord_trans(y = scales::exp_trans(exp(1))) +
geom_smooth(method = lm,formula = y ~ I(x^2), se = F, col="red")+
labs(title="Quadratic smoothing")
plot3 <- sdg %>%
ggplot(aes(x = years_education, y = infant_mortality_rate)) +
geom_point() +
scale_y_continuous(trans="log") +
coord_trans(y = scales::exp_trans(exp(1))) +
geom_smooth(method = lm,formula = y ~ I(x^3), se = F, col="red") +
labs(title="Cubic smoothing")
# for gdp
plot4 <- sdg %>% ggplot(aes(gdp, infant_mortality_rate))+
geom_point()+
scale_y_continuous(trans="log") +
coord_trans(y = scales::exp_trans(exp(1))) +
geom_smooth(method = lm,
formula = y ~ I(log(x)), se = F, col="red") +
labs(title="log transformation")
# for life_expectancy
plot5 <- sdg %>%
ggplot(aes(x = life_expectancy, y = infant_mortality_rate)) +
geom_point() +
scale_y_continuous(trans="log") +
coord_trans(y = scales::exp_trans(exp(1))) +
geom_smooth(method = lm,formula = y ~ x, se = F, col="red")
plot6 <- sdg %>%
ggplot(aes(x = life_expectancy, y = infant_mortality_rate)) +
geom_point() +
scale_y_continuous(trans="log") +
coord_trans(y = scales::exp_trans(exp(1))) +
geom_smooth(method = lm,formula = y ~ I(x^2), se = F, col="red")
plot7 <- sdg %>%
ggplot(aes(x = life_expectancy, y = infant_mortality_rate)) +
geom_point() +
scale_y_continuous(trans="log") +
coord_trans(y = scales::exp_trans(exp(1))) +
geom_smooth(method = lm,formula = y ~ I(x^3), se = F, col="red")
ggarrange(plot1, plot2, plot3,plot4, plot5, plot6,plot7, nrow = 2, ncol=4)
```
I decided to plot the loess smoothing to each numeric variable on the scatter plot below. By making the cubic smoothing, I tried to figure out the non-linearity relationship which I didn't notice between the response variable transformed as log and variates. After transforming the response variables as log, it looked linearity from life_expectancy and years_education. Also, the plot between gdp with log transformation and the transformed response variable has a non-linearity relationship, however, as it looks fit, it will take account into when modelling.
We've assumed the data was to be independent which means individuals do not affect each other in terms of their possibilities of the different events so far, however, it is not always a tenable assumption in reality. If the data set followed the non-linear dependency and in particular, collected on some set of subjects at different points in time called longitudinal data, it should be linear mixed effect models. That's why I decide to apply the linear mixed models for this data set. For these models, it should have the group structures which there are dependent within the groups, but not necessarily between groups. To evaluate the differences between continents, we should consider random effects and fixed effects both.
Hence, the response variable is the rate of infant mortality and the explanatory variables are gdp, years_education, life_expectancy and income_class. As I would like to investigate the effect of time on the average response variable and the effect between countries.
```{r mixed_effects_summary}
# random effects
wo_plot <- sdg %>% ggplot(aes(year, infant_mortality_rate, group=country, col=continent))+
geom_line() +
labs(x="year",
y="infant mortality",
title="The general trend by country, year")
fmla_null <- infant_mortality_rate ~ 1 + year_centre
mod_null <- lm(fmla_null, sdg)
# tidy(mod_null)[,1:3] %>%
# kable(digits = 3,
# caption = "Null Model") %>%
# kable_styling(full_width = F)
# overall_mean<-mean(sdg$infant_mortality_rate)
#
mod_null_no_center <- lm(infant_mortality_rate ~ 1 + year, sdg)
# overall_mean-tidy(mod_null_no_center)[2,2]*mean(sdg$year)
#
# tidy(mod_null_no_center)[,1:3] %>%
# kable(digits = 3,
# caption = "Null Model (no centering)") %>%
# kable_styling(full_width = F)
with_plot <- sdg %>% add_predictions(mod_null) %>%
ggplot(aes(x = year_centre,
y = pred,
group = country)) +
geom_line(alpha = 0.2,
col = "blue") +
geom_line(aes(x = year_centre,
y = infant_mortality_rate,
group = country,
col = continent),
alpha = 0.2) +
labs(x = "year",
y = "Infant mortality",
title="prediction for null model")
ggarrange(wo_plot, with_plot, nrow=1, ncol=2)
```
According to the first table on the left, I can roughly say it is shown that the trend generally went down over time in the countries and the rate of decrease is different for different countries. Also, the infant mortality rate in Europe and Oceania is lower than in other continents, and the proportion from Africa is mainly arranged between 30 % and 70%. As can be seen in the right graph, it is represented the general trend of mortality rate for babies so that the percentage of infant mortality is stable regardless of year and country. Hence, I can expect that the rate of mortality will remain almost the same over the years if there is no random effect on countries.
Firstly, I'd investigate the effects in the dataset before modelling. Based on this formula, a factor should be a categorical variable which defines several groups, i.e., I set country in this case as a factor. To detail, 1 is the fixed effect and (1|country) is the random effect of intercept.
As the estimates of fixed effects are different at 12.26 and 15.45 respectively from the null model and random intercept model, I can see there are fixed effects. Also, the standard error in the random intercept and slope model is the highest at 0.068 but it is not noticeable compared to the value from the null model at 0.64. Hence, we can see standard error estimates of fixed effects are much the same and that the variability within the group is not big. By performing ANOVA, I checked the hypothesis of the intercept model is better than the model with the intercept and slope.
$H_0: \textrm{The random intercept model is better.}$
```{r random_effects}
#hypothesis testing for random effects
# random intercept model.
fmla_random_intercept <- infant_mortality_rate ~ 1 + (1|country) + year_centre
mod_random_intercept <- lmer(fmla_random_intercept, sdg,REML=FALSE)
fmla_intercept_slope <- infant_mortality_rate ~ 1 + year_centre + (1 + year_centre | country)
mod_intercept_slope <- lmer(fmla_intercept_slope, sdg, REML=FALSE)
# coefficients tables
summary(mod_null)$coefficients[1:2,1:2] %>%
kable(digits = 3,
caption = "Estimated parameters. Without Random Intercepts") %>%
kable_styling(full_width = F)
summary(mod_random_intercept)$coefficients[1:2,1:2] %>%
kable(digits = 3,
caption = "Estimated parameters. With Random Intercepts") %>%
kable_styling(full_width = F)
summary(mod_intercept_slope)$coefficients[1:2,1:2] %>%
kable(digits = 3,
caption = "Estimated parameters. With Random Intercepts and slope") %>%
kable_styling(full_width = F)
anova(mod_random_intercept,mod_intercept_slope)[1:8] %>%
kable(digits = 2,caption = "ANOVA Test for random effects") %>%
kable_styling(bootstrap_options = "bordered", full_width = F)
```
In accordance with coefficients tables, the estimates of intercepts are not identical but they do not have distinguished changes. After applying a random effect, the estimates for standard error are doubled to the model without random effect.
As the p-value is significantly small, I reject the null hypothesis with $\alpha$=0.05. It means that random intercept and slope have a better fit to this data set.
To develop the models to fit the data, as the interaction and country-to-country variability can not be seen in the plots easily, I'm going to make transformed linear mixed models and generalised linear mixed effects models and compare the models below using ANOVA.
$$
E[\mbox{response|}g_2] = \alpha_0 +(\alpha_{var}+b_1)\mbox{var}
$$
$$
E[\mbox{response|}g_2] = \alpha_0 +(\alpha_{var}+b_2)\mbox{var}
$$
...
$$
E[\mbox{response|}g_k] = \alpha_0 +(\alpha_{var}+b_k)\mbox{var}
$$
I can consider $\alpha_0, \alpha_{var}$ are fixed effects but unknown constants. As I set the country as a factor and there is k number of groups, $\alpha_0$ is also the fixed effect, indicating the overall mean and $\alpha_{var}$ represents the effect that time has on infant mortality in different countries. And $b_k$ depicts the random effects to figure out the variability with groups and within the groups.
For the first data set, I tried to build 12 different models using linear, transformed and generalised regression models with random effects and 6 different explanatory variables. In this case, it was clear to have random effects, I used mixed effects models, instead of regression models without effects. Also, the response variable in this data set characterised the left-skewed, continuous and positive, so I only consider the transformed linear model, the gamma and Gaussian distribution for generalised models.
To find out the best model for this data set, I used backwards stepwise regression to consider all possible variates and their interaction terms to get the lowest AIC. Since $R^2$ is not a goodness to measure the performance for models, in this case, I only consider the results from the ANOVA test, AIC values and residual plots and QQ-plot for normality.
```{r modelling}
fmla_0 <- log(infant_mortality_rate) ~ 1 + year_centre + (1 + year_centre | country) + life_expectancy + years_education + log(gdp)
fmla_1 <- log(infant_mortality_rate) ~ 1 + year_centre + (1 + year_centre | country) + life_expectancy + years_education + gdp + income_class*life_expectancy + years_education*income_class
# best model
fmla_2 <- log(infant_mortality_rate) ~ 1 + year_centre + (1 + year_centre | country) + income_class + log(gdp)+ life_expectancy + years_education + income_class*life_expectancy + years_education*income_class + log(gdp)*income_class
fmla_3 <- log(infant_mortality_rate) ~ life_expectancy +(year_centre | country) + log(gdp) + years_education
mod_0 <- lmer(fmla_0, sdg)
# anova(mod_intercept_slope,mod_0)
mod_1 <- lmer(fmla_1, sdg)
mod_2 <- lmer(fmla_2, sdg)
# anova(mod_1,mod_2)
mod_3 <- lmer(fmla_3, sdg)
# anova(mod_2,mod_3)
#glmm - gamma
#best
fmla_0 <- infant_mortality_rate ~ 1 + year_centre + (1 + year_centre | country) + life_expectancy + years_education + log(gdp)
fmla_1 <- infant_mortality_rate ~ 1 + year_centre + (1 + year_centre | country) + life_expectancy + years_education + gdp + income_class*life_expectancy + years_education*income_class
fmla_2 <- infant_mortality_rate ~ 1 + year_centre + (1 + year_centre | country) + income_class + log(gdp)+ life_expectancy + years_education + income_class*life_expectancy + years_education*income_class + log(gdp)*income_class
fmla_3 <- infant_mortality_rate ~ life_expectancy + I(life_expectancy^2)+(year_centre | country) + log(gdp) + years_education
glmm_0_gamma <- glmer(fmla_0, sdg, family=Gamma(link="log"))
glmm_1_gamma <- glmer(fmla_1, sdg, family=Gamma(link="log"))
glmm_2_gamma <- glmer(fmla_2, sdg, family=Gamma(link="log"))
glmm_3_gamma <- glmer(fmla_3, sdg, family=Gamma(link="log"))
# anova(glmm_0_gamma, glmm_1_gamma)
# anova(glmm_0_gamma, glmm_2_gamma)
# anova(glmm_0_gamma, glmm_3_gamma)
# summary(glmm_0_gamma)
# glmm - gaussian
fmla_0 <- infant_mortality_rate ~ 1 + year_centre + (1 + year_centre | country) + life_expectancy + years_education + log(gdp)
fmla_1 <- infant_mortality_rate ~ 1 + year_centre + (1 + year_centre | country) + life_expectancy + years_education + gdp + income_class*life_expectancy + years_education*income_class
fmla_2 <- infant_mortality_rate ~ 1 + year_centre + (1 + year_centre | country) + income_class + log(gdp)+ life_expectancy + years_education + income_class*life_expectancy + years_education*income_class + log(gdp)*income_class
#best
fmla_3 <- infant_mortality_rate ~ life_expectancy +(year_centre | country) + log(gdp) + years_education
glmm_0_gau <- glmer(fmla_0, sdg, family=gaussian(link="log"))
glmm_1_gau <- glmer(fmla_1, sdg, family=gaussian(link="log"))
glmm_2_gau <- glmer(fmla_2, sdg, family=gaussian(link="log"))
glmm_3_gau <- glmer(fmla_3, sdg, family=gaussian(link="log"))
# anova(glmm_0_gau,glmm_1_gau)
# anova(glmm_0_gau,glmm_2_gau)
# anova(glmm_0_gau,glmm_3_gau)
```
```{r evaluation}
diagnostics1 <-
data.frame(residuals = residuals(mod_2),
fitted_vals = fitted(mod_2),
country = sdg$country)
plt1 <- diagnostics1 %>%
ggplot(aes(x = fitted_vals,
y = residuals))+
geom_point() +
geom_hline(yintercept = 0,
col="red") +
labs(title = "LOG Transformed LMM") + theme(plot.title = element_text(size=8))
ordered_contry<-sdg$country[order(diagnostics1$residuals)]
plt2 <- diagnostics1 %>%
ggplot(aes(sample = residuals))+
stat_qq() +
stat_qq_line()+
labs(y= "residuals",
title = "LOG Transformed(Residuals)")+ theme(plot.title = element_text(size=8))
ranefs<-ranef(mod_2)$country %>% as_tibble(rownames = "country")
ordered_yearc<-ranefs$country[order(ranefs$year_centre)]
ordered_intercept<-ranefs$country[order(ranefs$`(Intercept)`)]
plt3 <- ranefs %>%
ggplot(aes(sample = year_centre))+
stat_qq() + stat_qq_line() +
labs(y= "slopes",
title = "Random(LOG Transformed)(slopes)")+ theme(plot.title = element_text(size=6))
plt4 <- ranefs %>%
ggplot(aes(sample = `(Intercept)`))+
stat_qq() + stat_qq_line() +
labs(y= "intercepts",
title = "Random(LOG Transformed)(intercepts)")+ theme(plot.title = element_text(size=6))
# glmm - gamma
diagnostics2 <-
data.frame(residuals = residuals(glmm_0_gamma),
fitted_vals = fitted(glmm_0_gamma),
country = sdg$country)
plt5 <- diagnostics2 %>%
ggplot(aes(x = fitted_vals,
y = residuals))+
geom_point() +
geom_hline(yintercept = 0,
col="red") +
labs(title = "Gamma GLMM(log)") + theme(plot.title = element_text(size=8))
ordered_contry<-sdg$country[order(diagnostics2$residuals)]
plt6 <- diagnostics2 %>%
ggplot(aes(sample = residuals))+
stat_qq() +
stat_qq_line()+
labs(y= "residuals",
title = "Gamma(Residuals)")+ theme(plot.title = element_text(size=6))
ranefs<-ranef(glmm_0_gamma)$country %>% as_tibble(rownames = "country")
ordered_yearc<-ranefs$country[order(ranefs$year_centre)]
ordered_intercept<-ranefs$country[order(ranefs$`(Intercept)`)]
plt7 <- ranefs %>%
ggplot(aes(sample = year_centre))+
stat_qq() + stat_qq_line() +
labs(y= "slopes",
title = "Random(Gamma)(slopes)")+ theme(plot.title = element_text(size=6))
plt8 <- ranefs %>%
ggplot(aes(sample = `(Intercept)`))+
stat_qq() + stat_qq_line() +
labs(y= "intercepts",
title = "Random(Gamma)(intercepts)")+ theme(plot.title = element_text(size=6))
# GLMM
diagnostics3 <-
data.frame(residuals = residuals(glmm_3_gau),
fitted_vals = fitted(glmm_3_gau),
country = sdg$country)
plt9 <- diagnostics3 %>%
ggplot(aes(x = fitted_vals,
y = residuals))+
geom_point() +
geom_hline(yintercept = 0,
col="red") +
labs(title = "Gaussian GLMM(log)")+ theme(plot.title = element_text(size=8))
ordered_contry<-sdg$country[order(diagnostics3$residuals)]
plt10 <- diagnostics3 %>%
ggplot(aes(sample = residuals))+
stat_qq() +
stat_qq_line()+
labs(y= "residuals",
title = "Gaussian(Residuals)")+ theme(plot.title = element_text(size=8))
ranefs<-ranef(glmm_3_gau)$country %>% as_tibble(rownames = "country")
ordered_yearc<-ranefs$country[order(ranefs$year_centre)]
ordered_intercept<-ranefs$country[order(ranefs$`(Intercept)`)]
plt11 <- ranefs %>%
ggplot(aes(sample = year_centre))+
stat_qq() + stat_qq_line() +
labs(y= "slopes",
title = "Random(Gaussian)(slopes)")+ theme(plot.title = element_text(size=8))
plt12 <- ranefs %>%
ggplot(aes(sample = `(Intercept)`))+
stat_qq() + stat_qq_line() +
labs(y= "intercepts",
title = "Random(Gaussian)(intercepts)")+ theme(plot.title = element_text(size=8))
ggarrange(plt1, plt2, plt3, plt4, plt5, plt6,plt7, plt8,plt9, plt10,plt11, plt12, nrow = 3, ncol=4)
```
Under the transformed models, I contained the interaction between GDP vs. income_level, income_level vs. life expectancy and education years vs. income level with the random effect. The residual plot of this model has no pattern and the QQ-plot for checking the normality of residuals also followed the linearity between -1 and 1 but it seems to need more improvement with some doubts. While the data points in the QQ-plot of the normality of intercept look satisfied with this assumption, points in the range -2 to -1.5 are away from the line.
For Gamma generalised model, the data points in the residual plots are more randomly spread around the line and the normality of residual seems to fit the line reasonably compared to other models even though the outlier is noticeable. Also, points in the QQ-plot for random effects imply the normality of intercept and slope models so that we can accept the assumption for the normality.
```{r AIC}
models <- c("Transformed LMM(mod2)","GLMM(glmm_0_gamma)","GLMM(glmm_3_gau)")
values <- c(AIC(mod_2)-2*sum(log(1/sdg$infant_mortality_rate)),AIC(glmm_0_gamma),AIC(glmm_3_gau))
table_1 <- rbind(models,values)
table_1%>% kable(digits = 2, caption="AIC Comparison") %>%
kable_styling(bootstrap_options = "basic", full_width = F)
```
As generalised models are associated with the distribution, we don't need to apply adjusted AIC values, however, for comparison with the transformed and generalised models, I applied the adjusted formula to the transformed model. The Gamma generalised model has the lowest AIC, so this model is optimal to this data set.
```{r predictions, out.width="70%",fig.align = 'center'}
sdg_fv <- sdg %>%
gather_predictions(mod_2,
glmm_0_gamma,
glmm_3_gau,
.model = "model",
.pred = "fitted_vals",
type = "response")
#fitted values plots
sdg_fv %>%
ggplot(aes(infant_mortality_rate,fitted_vals)) +
geom_point() +
geom_smooth(method = "loess") +
facet_wrap(~model) +ggtitle("Observed Vs. Fitted")
```
According to the observed vs fitted values plots, only Gaussian and Gamma generalised mixed effects model with log link have a linear relationship between the predicted proportion of the mortality for babies and the actual rate of infant mortality, pointing out the estimated and the actual values of mortality are similar. Whereas the estimated and actual figures in the transformed model looked similar, there is no linearity.
In conclusion, even if I remove the fixed effect of year, the rate changes with other treatment effects over countries.
# Production coffee data
```{r part_2}
coffee_production <- read_csv("https://people.bath.ac.uk/kai21/MA50258/data/coffee.csv")
coffee_1990 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 3) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 1990) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_1991 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 4) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 1991) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_1992 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 5) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 1992) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_1993 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 6) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 1993) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_1994 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 7) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 1994) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_1995 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 8) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 1995) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_1996 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 9) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 1996) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_1997 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 10) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 1997) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_1998 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 11) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 1998) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_1999 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 12) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 1999) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2000 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 13) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2000) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2001 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 14) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2001) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2002 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 15) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2002) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2003 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 16) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2003) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2004 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 17) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2004) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2005 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 18) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2005) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2006 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 19) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2006) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2007 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 20) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2007) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2008 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 21) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2008) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2009 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 22) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2009) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2010 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 23) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2010) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2011 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 24) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2011) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2012 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 25) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2012) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2013 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 26) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2013) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2014 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 27) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2014) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2015 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 28) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2015) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2016 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 29) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2016) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2017 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 30) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2017) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
coffee_2018 <- coffee_production%>%
select(country = 1, arabica_robusta = 2, production = 31) %>%
filter(!is.na(production)& !is.na(arabica_robusta)) %>%
mutate(year = 2018) %>%
mutate_at(3, str_replace, " ", "") %>%
mutate_at(3, as.integer)
combined_data <- rbind(coffee_1990,coffee_1991,coffee_1992,coffee_1993,coffee_1994,coffee_1995,coffee_1996,coffee_1997,coffee_1998,coffee_1999,coffee_2000,coffee_2001,coffee_2002,coffee_2003,coffee_2004,coffee_2005,coffee_2006,coffee_2007,coffee_2008,coffee_2009,coffee_2010,coffee_2011,coffee_2012,coffee_2013,coffee_2014,coffee_2015,coffee_2016,coffee_2017,coffee_2018)
combined_data$country <- as.factor(combined_data$country)
combined_data$arabica_robusta <- as.factor(combined_data$arabica_robusta)
# mean(1990:2018) # 2004
combined_data <- combined_data %>% filter(production!=0) %>%
mutate(yearc = year-2004)
describe(combined_data[3:5])[c('mean','sd','median','min','max')] %>%
kable(digits=2, caption="summary statistics") %>%
kable_styling(full_width = F)
```
Since this data set consisted of the bags of coffee by years between 1990 and 2018, types of coffee and country, I should manipulate the data set to determine the effects between countries. First of all, I break down the data by years and then combined the values again with the column "year". Initially, the size of a data frame was 59*31, however, the manipulated data frame has 1,483 rows and 5 columns including the centring column which is to remove the fixed effect after dealing with missing values, 0 and manipulation.
According to the summary statistics, minimum and maximum values of production have a huge gap from 1 to 60,000, and the mean value is around 6 times higher than the value of the median. From those numbers, we can point out that this data set is left-skewed and has long tails as the median is smaller than the expected value and the standard deviation is also too high. In this special case, to make a centring column with year, it is more meaningful to use the median, instead of the mean value.
```{r exploration_part2,out.width="75%", fig.align='center'}
combined_data %>% ggplot(aes(year, production, group=country, col=arabica_robusta))+
geom_line() +
labs(x="year",
y="Coffee production",
title="The general trend by country, year")
```
As made the above plot, I wanted to see the relationship between year, coffee production and arabica robusta which represented four different types of coffee. According to the level of types of coffee, the trend of (R/A) has increased since 1990, and production in one country which cultivated A/R type of coffee was rocketed throughout the years. Because other types of coffee do not have any noticeable change and the production in almost countries shows similar patterns below 10,000 bags, I will investigate the method for dealing with outliers analysis.
```{r outliers, fig.align='center'}
mod_1 <- lm(production ~ 1+ yearc + arabica_robusta*country,combined_data)
mod_2 <- lm(log(production) ~ 1+ yearc + arabica_robusta*country,combined_data)
mod_3 <- glm(production ~ arabica_robusta*country,
family = Gamma(link = "log"),
combined_data)
diagn_m1 <- augment(mod_1) %>%
mutate(residuals = production - .fitted)
diagn_m2 <- augment(mod_2) %>%
mutate(residuals = `log(production)` - .fitted)
qq_m1 <- diagn_m1 %>%
ggplot(aes(sample = residuals)) +
stat_qq() +
stat_qq_line() +
labs(title = "Model 1", y = "residuals")
qq_m2 <- diagn_m2 %>%
ggplot(aes(sample = residuals)) +
stat_qq() +
stat_qq_line() +
labs(title = "Model 2", y = "residuals")
ggarrange(qq_m1, qq_m2, nrow=1, ncol=2)
```
The data points should have been randomly scattered around the line for an ideal, however, they are not sufficiently gathered around the line at the end of the lines for two models. For model 2, as we made a transformed model, for calculating the residuals, we should apply log transformation to the response variable. The transformed linear regression model looks to follow the linear relationships in some parts, but It does not appear any normality in model 1. These two models need to improve systematic departure from the normality of residuals.
```{r outliers_2,out.width="75%", fig.align='center'}
diagn_m1 <- diagn_m1 %>%
mutate(.stud.resid = sigma(mod_1) * (.std.resid/.sigma))
q_val<-qt(0.025,df=nobs(mod_1)-length(coef(mod_1))-1)
res_plot <- diagn_m1 %>%
ggplot(aes(x=1:nobs(mod_1),y=.stud.resid)) +
geom_point() +
geom_hline(yintercept = c(-q_val,0,q_val),
lty = 2,
size = 0.2) +
labs(title = "Model 1", x = "index", y = "Studentized residual")
diagn_m2 <- diagn_m2 %>%
mutate(.stud.resid = sigma(mod_2) * (.std.resid/.sigma))
res_plot2 <- diagn_m2 %>%
ggplot(aes(x=1:nobs(mod_2),y=.stud.resid, label=country)) +
geom_point() +
geom_hline(yintercept = c(-q_val,0,q_val),
lty = 2,
size = 0.2) +
labs(title = "Model 2", x = "index", y = "Studentized residual")
outliers_m1 <- which(abs(diagn_m1$.stud.resid)>abs(q_val))
outliers_m2 <- which(abs(diagn_m2$.stud.resid)>abs(q_val))
ggarrange(res_plot, res_plot2, nrow=1, ncol=2)
total <- data.frame(count(combined_data[outliers_m1,c("country")]), count(combined_data[outliers_m2,"country"]))
total %>%
rename("model1"= n, "model2"=n.1) %>%
kable(caption="The number of outliers") %>%
kable_styling(bootstrap_options = "basic", full_width = F)
# count(combined_data[outliers_m1,c("country")])
# count(combined_data[outliers_m2,"country"])
```
To check the some data points as outliers using the studentized residuals plot, we should make the residuals by dividing the standardised residuals by .sigma and multiply with the sigma for the whole data set. This formula in diagn_m1 is not a formula for studentised error, however, this represented the relationship between these two residuals.
Based on these plots, there are lots of data points outside the lines at 46 and 92, respectively. So we can consider those points are suspicious with respect to each model.
```{r cook_distance,out.width="75%", fig.align='center'}
cook_plot <- diagn_m1 %>%
ggplot(aes(x=1:nobs(mod_1),y=.cooksd)) +
geom_point() +
geom_hline(yintercept = c(0,1),
lty = 2,
size = 0.2) +
labs(title = "Model 1", x = "index", y = "Cook's distance")
cook_plot2 <- diagn_m2 %>%
ggplot(aes(x=1:nobs(mod_2),y=.cooksd)) +
geom_point() +
geom_hline(yintercept = c(0,1),
lty = 2,
size = 0.2) +
labs(title = "Model 2", x = "index", y = "Cook's distance")
influencers_m1 <- which(abs(diagn_m1$.cooksd)>0.2)
influencers_m2 <- which(abs(diagn_m2$.cooksd)>0.2)
ggarrange(cook_plot, cook_plot2,nrow=1, ncol=2)
```
Another method to check the influence of outliers is to use Cook's distance. It is interpretable closer to 1, the more chance to fail applicable data points. Hence, in these graphs, each value is not significantly far away from the point of 0, we can say the outliers which we already checked above are not influential to these models.
Using this information, I'm going to investigate the effects of random or fixed effects and determine which models are suitable for this data set.
```{r, plots_for_effects, out.width="85%", fig.align='center'}
fmla_null <- production ~ 1 + yearc
mod_null <- lm(fmla_null, combined_data)
mod_without_random <- lm(production ~ yearc + country,
data = combined_data,
contrasts = list(country = "contr.sum"))
fmla_random_intercept <- production ~ 1 + (1|country) + yearc
mod_random_intercept <- lmer(fmla_random_intercept, combined_data)
group_means <-
combined_data %>%
group_by(country) %>%
summarise(mean_reaction = mean(production))
grand_mean <- mean(combined_data$production)
re_int <-
ranef(mod_random_intercept) %>%
as_tibble() %>%
bind_cols(fixed=group_means$mean_reaction-grand_mean)
re_int %>%
ggplot(aes(y = grp,
x = condval,
col = "Random")) +
geom_point(size = 0.5) +
geom_point(aes(x = fixed,
y = grp,
col = "Fixed"),
alpha=0.4) +
labs(x = "Production",
y = "Country",
col = "Effect") +
guides(y = guide_axis(check.overlap = TRUE)) +
labs(title="Estimated fixed effects with random effects of each country")
```
We can calculate the fixed effects using R. First at the most, using group_by(), we can identify the mean values by the country which is a factor in this data set. After that, we should know the grand mean which is an expected value of the production which is at around 2351.1.without regard to other explanatory variables. Then by subtracting the values of the mean by countries from the grand mean, we can obtain the fixed effects. Surprisingly, the estimates of random effects are closer to the mean value which is equal to 0 than fixed effects and it is called shrinkage towards the grand average.
```{r summary_table_part2 }
mod_random_int_slope <- lmer(formula = production ~ 1 + (1 + yearc|country) + yearc,combined_data)
summary(mod_without_random )$coefficients[1:2,1:2] %>%
kable(digits = 3,
caption = "Estimated parameters. Without Random Intercepts") %>%
kable_styling(full_width = F)
summary(mod_random_intercept)$coefficients[,1:2] %>%
kable(digits = 3,
caption = "Estimated parameters. Random Intercepts") %>%
kable_styling(full_width = F)
summary(mod_random_int_slope)$coefficients[1:2,1:2] %>%
kable(digits = 3,
caption = "Estimated parameters. Random Intercepts Slope") %>%
kable_styling(full_width = F)
```
When compared to the first two tables, we can observe the estimates of intercepts are not identical but much the same at 2209 and 2213, respectively. And the standard error of intercept in the model with random effects has a huge increase that started at 67.5 to 796 compared to the model without the random effect.
Note that discovered the fixed effects in the models so far, however, we want to know how different groups are correlated. To define the correlation between intercept and slope, I made another random intercept slope model to identify variability between groups and within groups as well. Under the intercept model, I got 5935.8 as the between-country variability and 2134.4 as the estimate of within-subject (also called residual) variability, respectively. So total estimate of variance is $5935.8^2 + 2134.4^2$ in the intercept model. Compared to the estimated variance which is $2098.238^2$ from the model without random effects, it stands out the variability with the random effects.
In the last table for the intercept and slope model, the estimates of intercept are similar but the standard error changes from 67.5 to 800.1 that increasing almost 12 times compared to the model with no random effects.
```{r prediction_graphs, out.width="75%", fig.align="center" }
combined_data %>%
add_predictions(mod_random_int_slope) %>%
ggplot(aes(x = yearc,
y = pred,
group = country)) +
geom_line(alpha=0.2,
col="blue") +
geom_line(aes(x = yearc,
y = production,
group = country),
alpha=0.2)
```
As can be seen in the tables above, the lines are spread with different slopes and intercept in this plot. Whilst it is shown that amount of coffee in countries was increased, most countries followed the trend of stability.
$$
E[\mbox{production|country1}] = \alpha_0 +(\alpha_{var}+b_1)var
$$
$$
E[\mbox{production|country2}] = \alpha_0 +(\alpha_{var}+b_2)var\\
$$
...
$$
E[\mbox{production|countryk}] = \alpha_0 +(\alpha_{var}+b_k)var\\
$$
Each line tells the each observation of countries which is a factor. Specially, in the random intercept model, as our formula in R is
$production$ ~ $1 + (1|country)\\$ where $\alpha_{var}$ is equal to 0. Since this model is satisfied with the normality of intercept, $E[b_1]=E[b_2]=...=E[b_k]$ = 0 and $Var[b_1] = Var[b_2] =...= Var[b_k] = \sigma^2$.
```{r hypothesis_testing_part2}
anova(mod_random_intercept, mod_random_int_slope)
```
To test the hypothesis with a model with random intercept and an intercept and slope model, we should set
$H_0$ : the random intercept model is better.
$H_a$ : The model with intercept and slope is better.
As the p-value is small with a 5% significance level, we can reject the null hypothesis, i.e., the second model is a better fit for this data set.
By this result, I'm going to develop the models to explain this data set well and fit the involved models by maximum likelihood instead of restricted maximum likelihood (REML) as I've learned this method is a better fit in slides.
```{r modelling_part2}
fmla_2_0 <- production~1+(1+yearc|country)+arabica_robusta
mod_lm_1 <- lmer(fmla_2_0, combined_data, REML=FALSE)
# anova(mod_random_int_slope, mod_lm_1) # arabica_robusta is not meaningful
# best model for lmm
fmla_2_1 <- production~1+(1+yearc|country)+arabica_robusta*country
mod_lm_2 <- lmer(fmla_2_1, combined_data, REML=FALSE)
# anova(mod_random_int_slope, mod_lm_2)
fmla_2_2 <- production~ (1|country)+arabica_robusta*country
mod_lm_3 <- lmer(fmla_2_2, combined_data, REML=FALSE)
# anova(mod_lm_2, mod_lm_3)
fmla_2_3 <- production~1+(1+yearc|country)+arabica_robusta*yearc
mod_lm_4 <- lmer(fmla_2_3, combined_data, REML=FALSE)
# anova(mod_random_int_slope, mod_lm_4)
fmla_2_3 <- production~1+(1+yearc|country)
mod_lm_4 <- lmer(fmla_2_3, combined_data, REML=FALSE)
# anova(mod_random_int_slope, mod_lm_4)
# transformed models
fmla_2_4 <- log(production)~ (yearc|country) +arabica_robusta
mod_tr_1 <- lmer(fmla_2_4, combined_data, REML=FALSE)