# Hypothesis Tests and Confidence Intervals in the Simple Linear Regression Model {#htaciitslrm}
This chapter continues our treatment of the simple linear regression model. The following sections discuss how we can use our knowledge about the sampling distribution of the OLS estimator to make statements regarding its uncertainty.
These subsections cover the following topics:
- Testing Hypotheses regarding regression coefficients.
- Confidence intervals for regression coefficients.
- Regression when $X$ is a dummy variable.
- Heteroskedasticity and Homoskedasticity.
The packages `r ttcode('AER')` [@R-AER] and `r ttcode('scales')` [@R-scales] are required for reproduction of the code chunks presented throughout this chapter. The package `r ttcode('scales')` provides additional generic plot scaling methods. Make sure both packages are installed before you proceed. The safest way to do so is by checking whether the following code chunk executes without any errors.
```{r, warning=FALSE, message=FALSE, eval=FALSE}
library(AER)
library(scales)
```
## Testing Two-Sided Hypotheses Concerning the Slope Coefficient
Using the fact that $\hat{\beta}_1$ is approximately normally distributed in large samples (see [Key Concept 4.4](#tsdotoe)), testing hypotheses about the true value $\beta_1$ can be done as in Chapter \@ref(potsm).
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC5.1">
<h3 class = "right"> Key Concept 5.1 </h3>
<h3 class = "left"> General Form of the $t$-Statistic </h3>
Remember from Chapter \\@ref(arosur) that a general $t$-statistic has the form
$$ t = \\frac{\\text{estimated value} - \\text{hypothesized value}}{\\text{standard error of the estimator}}.$$
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[General Form of the $t$-Statistic]{5.1}
Remember from Chapter \\ref{arosur} that a general $t$-statistic has the form
$$ t = \\frac{\\text{estimated value} - \\text{hypothesized value}}{\\text{standard error of the estimator}}.$$
\\end{keyconcepts}
')
```
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC5.2">
<h3 class = "right"> Key Concept 5.2 </h3>
<h3 class = "left"> Testing Hypotheses regarding $\\beta_1$ </h3>
For testing the hypothesis $H_0: \\beta_1 = \\beta_{1,0}$, we need to perform the following steps:
1. Compute the standard error of $\\hat{\\beta}_1$, $SE(\\hat{\\beta}_1)$
\\[ SE(\\hat{\\beta}_1) = \\sqrt{ \\hat{\\sigma}^2_{\\hat{\\beta}_1} } \\ \\ , \\ \\
\\hat{\\sigma}^2_{\\hat{\\beta}_1} = \\frac{1}{n} \\times \\frac{\\frac{1}{n-2} \\sum_{i=1}^n (X_i - \\overline{X})^2 \\hat{u_i}^2 }{ \\left[ \\frac{1}{n} \\sum_{i=1}^n (X_i - \\overline{X})^2 \\right]^2}.
\\]
2. Compute the $t$-statistic
\\[ t = \\frac{\\hat{\\beta}_1 - \\beta_{1,0}}{ SE(\\hat{\\beta}_1) }. \\]
3. Given a two sided alternative ($H_1:\\beta_1 \\neq \\beta_{1,0}$) we reject at the $5\\%$ level if $|t^{act}| > 1.96$ or, equivalently, if the $p$-value is less than $0.05$.<br>
Recall the definition of the $p$-value:
\\begin{align*}
p \\text{-value} =& \\, \\text{Pr}_{H_0} \\left[ \\left| \\frac{ \\hat{\\beta}_1 - \\beta_{1,0} }{ SE(\\hat{\\beta}_1) } \\right| > \\left| \\frac{ \\hat{\\beta}_1^{act} - \\beta_{1,0} }{ SE(\\hat{\\beta}_1) } \\right| \\right] \\\\
=& \\, \\text{Pr}_{H_0} (|t| > |t^{act}|) \\\\
\\approx& \\, 2 \\cdot \\Phi(-|t^{act}|)
\\end{align*}
The last transformation is due to the normal approximation for large samples.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Testing Hypotheses regarding $\\beta_1$]{5.2}
For testing the hypothesis $H_0: \\beta_1 = \\beta_{1,0}$, we need to perform the following steps:\\newline
\\begin{enumerate}
\\item Compute the standard error of $\\hat{\\beta}_1$, $SE(\\hat{\\beta}_1)$
\\[ SE(\\hat{\\beta}_1) = \\sqrt{ \\hat{\\sigma}^2_{\\hat{\\beta}_1} } \\ \\ , \\ \\
\\hat{\\sigma}^2_{\\hat{\\beta}_1} = \\frac{1}{n} \\times \\frac{\\frac{1}{n-2} \\sum_{i=1}^n (X_i - \\overline{X})^2 \\hat{u_i}^2 }{ \\left[ \\frac{1}{n} \\sum_{i=1}^n (X_i - \\overline{X})^2 \\right]^2}.
\\]
\\item Compute the $t$-statistic
\\[ t = \\frac{\\hat{\\beta}_1 - \\beta_{1,0}}{ SE(\\hat{\\beta}_1) }. \\]
\\item Given a two sided alternative ($H_1:\\beta_1 \\neq \\beta_{1,0}$) we reject at the $5\\%$ level if $|t^{act}| > 1.96$ or, equivalently, if the $p$-value is less than $0.05$.\\newline
Recall the definition of the $p$-value:
\\begin{align*}
p \\text{-value} =& \\, \\text{Pr}_{H_0} \\left[ \\left| \\frac{ \\hat{\\beta}_1 - \\beta_{1,0} }{ SE(\\hat{\\beta}_1) } \\right| > \\left| \\frac{ \\hat{\\beta}_1^{act} - \\beta_{1,0} }{ SE(\\hat{\\beta}_1) } \\right| \\right] \\\\
=& \\, \\text{Pr}_{H_0} (|t| > |t^{act}|) \\\\
=& \\, 2 \\cdot \\Phi(-|t^{act}|)
\\end{align*}
The last transformation is due to the normal approximation for large samples.
\\end{enumerate}
\\end{keyconcepts}
')
```
Consider again the OLS regression stored in `r ttcode('linear_model')` from Chapter \@ref(lrwor) that gave us the regression line
\[ \widehat{TestScore} \ = \underset{(9.47)}{698.9} - \underset{(0.49)}{2.28} \times STR \ , \ R^2=0.051 \ , \ SER=18.6. \]
Copy and execute the following code chunk if the above model object is not available in your working environment.
```{r, message=F, warning=F}
# load the `CASchools` dataset
data(CASchools)
# add student-teacher ratio
CASchools$STR <- CASchools$students/CASchools$teachers
# add average test-score
CASchools$score <- (CASchools$read + CASchools$math)/2
# estimate the model
linear_model <- lm(score ~ STR, data = CASchools)
```
For testing a hypothesis concerning the slope parameter (the coefficient on $STR$), we need $SE(\hat{\beta}_1)$, the standard error of the respective point estimator. As is common in the literature, standard errors are presented in parentheses below the point estimates.
Key Concept 5.2 reveals that it is rather cumbersome to compute the standard error and thereby the $t$-statistic by hand. The question you should be asking yourself right now is: can we obtain these values with minimal effort using `r ttcode('R')`? Yes, we can. Let us first use `r ttcode('summary()')` to get a summary of the estimated coefficients in `r ttcode('linear_model')`.
**Note**: Throughout the textbook, robust standard errors are reported. We consider it instructive to keep things simple at the beginning and thus start out with simple examples that do not allow for robust inference. Standard errors that are robust to heteroskedasticity are introduced in Chapter \@ref(hah) where we also demonstrate how they can be computed using `r ttcode('R')`. A discussion of heteroskedasticity-autocorrelation robust standard errors takes place in Chapter \@ref(eodce).
```{r, warning=F, message=F}
# print the summary of the coefficients to the console
summary(linear_model)$coefficients
```
The second column of the coefficients' summary reports $SE(\hat\beta_0)$ and $SE(\hat\beta_1)$. Also, in the third column `r ttcode('t value')`, we find the $t$-statistics $t^{act}$ suitable for tests of the separate hypotheses $H_0: \beta_0=0$ and $H_0: \beta_1=0$. Furthermore, the fourth column of the table provides the $p$-values corresponding to tests of these hypotheses against the two-sided alternatives $H_1:\beta_0\neq0$ and $H_1:\beta_1\neq0$, respectively.
Let us have a closer look at the test of
$$H_0: \beta_1=0 \ \ \ vs. \ \ \ H_1: \beta_1 \neq 0.$$
We have $$ t^{act} = \frac{-2.279808 - 0}{0.4798255} \approx - 4.75. $$
What does this tell us about the significance of the estimated coefficient? We reject the null hypothesis at the $5\%$ level of significance since $|t^{act}| > 1.96$. That is, the observed test statistic falls into the rejection region as $p\text{-value} = 2.78\cdot 10^{-6} < 0.05$. We conclude that the coefficient is significantly different from zero. In other words, we reject the hypothesis that class size *has no influence* on the students' test scores at the $5\%$ level.
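The numbers entering this calculation can be pulled directly from the coefficient table. The following minimal sketch reproduces $t^{act}$ for the coefficient on $STR$ (assuming `r ttcode('linear_model')` from above; the object names are chosen for illustration):
```{r, eval=FALSE}
# reproduce the t-statistic for the coefficient on STR by hand
coefs <- summary(linear_model)$coefficients
t_act <- (coefs["STR", "Estimate"] - 0) / coefs["STR", "Std. Error"]
t_act  # should match the corresponding entry in the 't value' column
```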
Note that, although the difference is negligible in the present case (as we will see below), `r ttcode('summary()')` does not use the normal approximation but calculates $p$-values using the $t$-distribution instead. Generally, the degrees of freedom of the assumed $t$-distribution are determined as follows:
$$ \text{DF} = n - k - 1 $$
where $n$ is the number of observations used to estimate the model and $k$ is the number of regressors, excluding the intercept. In our case, we have $n=420$ observations and the only regressor is $STR$ so $k=1$. The simplest way to determine the model degrees of freedom is
```{r}
# determine residual degrees of freedom
linear_model$df.residual
```
Hence, under the null hypothesis, the $t$-statistic for $\hat\beta_1$ is assumed to follow a $t$-distribution with $418$ degrees of freedom,
$$\frac{\hat\beta_1 - \beta_{1,0}}{SE(\hat\beta_1)} \sim t_{418},$$
such that the $p$-value for a two-sided significance test can be obtained by executing the following code:
```{r}
2 * pt(-4.751327, df = 418)
```
The result is very close to the value provided by `r ttcode('summary()')`. However, since $n$ is sufficiently large, one could just as well use the standard normal distribution to compute the $p$-value:
```{r}
2 * pnorm(-4.751327)
```
The difference is indeed negligible. These findings tell us that, if $H_0: \beta_1 = 0$ is true and we were to repeat the whole process of gathering observations and estimating the model, observing a $|\hat\beta_1| \geq 2.28$ is very unlikely!
Using `r ttcode('R')` we may visualize how such a statement is made when using the normal approximation. This reflects the principles depicted in Figure 5.1 of the book. Do not let the following code chunk deter you: it is somewhat longer than the usual examples, but there is a lot of repetition since color shadings and annotations are added on both tails of the normal distribution. We recommend executing the code step by step in order to see how the graph is augmented with the annotations.
```{r, fig.align='center'}
# Plot the standard normal on the support [-6,6]
t <- seq(-6, 6, 0.01)
plot(x = t,
y = dnorm(t, 0, 1),
type = "l",
col = "steelblue",
lwd = 2,
yaxs = "i",
axes = F,
ylab = "",
main = expression("Calculating the p-value of a Two-sided Test when" ~ t^act ~ "=-4.75"),
cex.lab = 0.7,
cex.main = 1)
tact <- -4.75
axis(1, at = c(0, -1.96, 1.96, -tact, tact), cex.axis = 0.7)
# Shade the critical regions using polygon():
# critical region in left tail
polygon(x = c(-6, seq(-6, -1.96, 0.01), -1.96),
y = c(0, dnorm(seq(-6, -1.96, 0.01)), 0),
col = 'orange')
# critical region in right tail
polygon(x = c(1.96, seq(1.96, 6, 0.01), 6),
y = c(0, dnorm(seq(1.96, 6, 0.01)), 0),
col = 'orange')
# Add arrows and texts indicating critical regions and the p-value
arrows(-3.5, 0.2, -2.5, 0.02, length = 0.1)
arrows(3.5, 0.2, 2.5, 0.02, length = 0.1)
arrows(-5, 0.16, -4.75, 0, length = 0.1)
arrows(5, 0.16, 4.75, 0, length = 0.1)
text(-3.5, 0.22,
labels = expression("0.025"~"="~over(alpha, 2)),
cex = 0.7)
text(3.5, 0.22,
labels = expression("0.025"~"="~over(alpha, 2)),
cex = 0.7)
text(-5, 0.18,
labels = expression(paste("-|",t[act],"|")),
cex = 0.7)
text(5, 0.18,
labels = expression(paste("|",t[act],"|")),
cex = 0.7)
# Add ticks indicating critical values at the 0.05-level, t^act and -t^act
rug(c(-1.96, 1.96), ticksize = 0.145, lwd = 2, col = "darkred")
rug(c(-tact, tact), ticksize = -0.0451, lwd = 2, col = "darkgreen")
```
The $p$-value is the area under the curve to the left of $-4.75$ plus the area to the right of $4.75$. As we already know from the calculations above, this value is very small.
## Confidence Intervals for Regression Coefficients {#cifrc}
As we already know, estimates of the regression coefficients $\beta_0$ and $\beta_1$ are subject to sampling uncertainty, see Chapter \@ref(lrwor). Therefore, we will *never* exactly estimate the true value of these parameters from sample data in an empirical application. However, we may construct confidence intervals for the intercept and the slope parameter.
A $95\%$ confidence interval for $\beta_i$ has two equivalent definitions:
- The interval is the set of hypothesized values for $\beta_i$ that cannot be rejected by a two-sided hypothesis test at the $5\%$ level.
- The interval has a probability of $95\%$ of containing the true value of $\beta_i$: in $95\%$ of all samples that could be drawn, the confidence interval will cover the true value of $\beta_i$.
We also say that the interval has a confidence level of $95\%$. The idea of the confidence interval is summarized in Key Concept 5.3.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC5.3">
<h3 class = "right"> Key Concept 5.3 </h3>
<h3 class = "left"> A Confidence Interval for $\\beta_i$ </h3>
Imagine you could draw all possible random samples of given size. The interval that contains the true value $\\beta_i$ in $95\\%$ of all samples is given by the expression
\\[ \\text{CI}_{0.95}^{\\beta_i} = \\left[ \\hat{\\beta}_i - 1.96 \\times SE(\\hat{\\beta}_i) \\, , \\, \\hat{\\beta}_i + 1.96 \\times SE(\\hat{\\beta}_i) \\right]. \\]
Equivalently, this interval can be seen as the set of null hypotheses for which a $5\\%$ two-sided hypothesis test does not reject.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[A Confidence Interval for $\\beta_i$]{5.3}
Imagine you could draw all possible random samples of given size. The interval that contains the true value $\\beta_i$ in $95\\%$ of all samples is given by the expression
\\[ \\text{CI}_{0.95}^{\\beta_i} = \\left[ \\hat{\\beta}_i - 1.96 \\times SE(\\hat{\\beta}_i) \\, , \\, \\hat{\\beta}_i + 1.96 \\times SE(\\hat{\\beta}_i) \\right]. \\]
Equivalently, this interval can be seen as the set of null hypotheses for which a $5\\%$ two-sided hypothesis test does not reject.
\\end{keyconcepts}
')
```
### Simulation Study: Confidence Intervals {-}
To get a better understanding of confidence intervals we conduct another simulation study. For now, assume that we have the following sample of $n=100$ observations on a single variable $Y$ where
$$ Y_i \overset{i.i.d}{\sim} \mathcal{N}(5,25), \ i = 1, \dots, 100.$$
```{r, fig.align='center'}
# set seed for reproducibility
set.seed(4)
# generate and plot the sample data
Y <- rnorm(n = 100,
mean = 5,
sd = 5)
plot(Y,
pch = 19,
col = "steelblue")
```
We assume that the data is generated by the model
$$ Y_i = \mu + \epsilon_i $$
where $\mu$ is an unknown constant and we know that $\epsilon_i \overset{i.i.d.}{\sim} \mathcal{N}(0,25)$. In this model, the OLS estimator for $\mu$ is given by $$ \hat\mu = \overline{Y} = \frac{1}{n} \sum_{i=1}^n Y_i, $$ i.e., the sample average of the $Y_i$. It further holds that
$$ SE(\hat\mu) = \frac{\sigma_{\epsilon}}{\sqrt{n}} = \frac{5}{\sqrt{100}} $$
(see Chapter [2](#SVSSDASE)). A large-sample $95\%$ confidence interval for $\mu$ is then given by
\begin{equation}
CI^{\mu}_{0.95} = \left[\hat\mu - 1.96 \times \frac{5}{\sqrt{100}} \ , \ \hat\mu + 1.96 \times \frac{5}{\sqrt{100}} \right]. (\#eq:KI)
\end{equation}
It is fairly easy to compute this interval in `r ttcode('R')` by hand. The following code chunk generates a named vector containing the interval bounds:
```{r}
cbind(CIlower = mean(Y) - 1.96 * 5 / 10, CIupper = mean(Y) + 1.96 * 5 / 10)
```
Knowing that $\mu = 5$ we see that, for our example data, the confidence interval covers the true value.
In contrast to real-world applications, where only a single sample is available, we can use `r ttcode('R')` to get a better understanding of confidence intervals by repeatedly sampling data, estimating $\mu$ and computing the confidence interval for $\mu$ as in \@ref(eq:KI).
The procedure is as follows:
- We initialize the vectors `r ttcode('lower')` and `r ttcode('upper')` in which the simulated interval limits are to be saved. We want to simulate $10000$ intervals so both vectors are set to have this length.
- We use a `r ttcode('for()')` loop to sample $100$ observations from the $\mathcal{N}(5,25)$ distribution and compute $\hat\mu$ as well as the boundaries of the confidence interval in every iteration of the loop.
- Finally, we join `r ttcode('lower')` and `r ttcode('upper')` in a matrix.
```{r}
# set seed
set.seed(1)
# initialize vectors of lower and upper interval boundaries
lower <- numeric(10000)
upper <- numeric(10000)
# loop sampling / estimation / CI
for(i in 1:10000) {
Y <- rnorm(100, mean = 5, sd = 5)
lower[i] <- mean(Y) - 1.96 * 5 / 10
upper[i] <- mean(Y) + 1.96 * 5 / 10
}
# join vectors of interval bounds in a matrix
CIs <- cbind(lower, upper)
```
According to Key Concept 5.3 we expect that the fraction of the $10000$ simulated intervals saved in the matrix `r ttcode('CIs')` that contain the true value $\mu=5$ should be roughly $95\%$. We can easily check this using logical operators.
```{r}
mean(CIs[, 1] <= 5 & 5 <= CIs[, 2])
```
The simulation shows that the fraction of intervals covering $\mu=5$, i.e., those intervals for which $H_0: \mu = 5$ cannot be rejected, is close to the theoretical value of $95\%$.
Let us draw a plot of the first $100$ simulated confidence intervals and indicate those which *do not* cover the true value of $\mu$. We do this by drawing the confidence intervals as horizontal lines, stacked on top of each other.
```{r, fig.align='center', fig.height=5, fig.width=4}
# identify intervals not covering mu
# (4 intervals out of 100)
ID <- which(!(CIs[1:100, 1] <= 5 & 5 <= CIs[1:100, 2]))
# initialize the plot
plot(0,
xlim = c(3, 7),
ylim = c(1, 100),
ylab = "Sample",
xlab = expression(mu),
main = "Confidence Intervals")
# set up color vector
colors <- rep(gray(0.6), 100)
colors[ID] <- "red"
# draw reference line at mu=5
abline(v = 5, lty = 2)
# add horizontal bars representing the CIs
for(j in 1:100) {
lines(c(CIs[j, 1], CIs[j, 2]),
c(j, j),
col = colors[j],
lwd = 2)
}
```
For the first $100$ samples, the true null hypothesis is rejected in four cases, so these intervals do not cover $\mu=5$. The intervals which lead to a rejection of the null are shown in red.
Let us now come back to the example of test scores and class sizes. The regression model from Chapter \@ref(lrwor) is stored in `r ttcode('linear_model')`. An easy way to get $95\%$ confidence intervals for $\beta_0$ and $\beta_1$, the coefficients on `r ttcode('(Intercept)')` and `r ttcode('STR')`, is to use the function `r ttcode('confint()')`. We only have to provide a fitted model object as an input to this function. The confidence level is set to $95\%$ by default but can be modified by setting the argument `r ttcode('level')`, see `r ttcode('?confint')`.
```{r}
# compute 95% confidence interval for coefficients in 'linear_model'
confint(linear_model)
```
Let us check if the calculation is done as we expect it to be for $\beta_1$, the coefficient on `r ttcode('STR')`.
```{r}
# compute 95% confidence interval for coefficients in 'linear_model' by hand
lm_summ <- summary(linear_model)
c("lower" = lm_summ$coef[2,1] - qt(0.975, df = lm_summ$df[2]) * lm_summ$coef[2, 2],
"upper" = lm_summ$coef[2,1] + qt(0.975, df = lm_summ$df[2]) * lm_summ$coef[2, 2])
```
The upper and the lower bounds coincide. We have used the $0.975$-quantile of the $t_{418}$ distribution to get the exact result reported by `r ttcode("confint")`. Obviously, this interval *does not* contain the value zero which, as we have already seen in the previous section, leads to the rejection of the null hypothesis $\beta_{1,0} = 0$.
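As noted above, other confidence levels can be requested via the argument `r ttcode('level')`. As a quick illustration, a $90\%$ confidence interval is obtained as follows:
```{r, eval=FALSE}
# compute a 90% confidence interval by adjusting the 'level' argument
confint(linear_model, level = 0.9)
```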
## Regression when X is a Binary Variable {#rwxiabv}
Instead of using a continuous regressor $X$, we might be interested in running the regression
\[ Y_i = \beta_0 + \beta_1 D_i + u_i \tag{5.2} \]
where $D_i$ is a binary variable, a so-called *dummy variable*. For example, we may define $D_i$ as follows:
\[ D_i = \begin{cases}
1 \ \ \text{if $STR$ in $i^{th}$ school district < 20} \\
0 \ \ \text{if $STR$ in $i^{th}$ school district $\geq$ 20} \\
\end{cases} \tag{5.3} \]
The regression model now is
\[ TestScore_i = \beta_0 + \beta_1 D_i + u_i. \tag{5.4} \]
Let us see how these data look in a scatter plot:
```{r, eval=F}
# Create the dummy variable as defined above
CASchools$D <- CASchools$STR < 20
# Plot the data
plot(CASchools$D, CASchools$score, # provide the data to be plotted
pch = 20, # use filled circles as plot symbols
cex = 0.5, # set size of plot symbols to 0.5
col = "Steelblue", # set the symbols' color to "Steelblue"
xlab = expression(D[i]), # Set title and axis names
ylab = "Test Score",
main = "Dummy Regression")
```
```{r, fig.align='center', echo=F}
# Create the dummy variable as defined above
CASchools$D <- CASchools$STR < 20
# estimate the dummy regression
dummy_model <- lm(score ~ D, data = CASchools)
# Plot the data
plot(CASchools$D, CASchools$score,
pch = 20, cex = 0.5 , col = "Steelblue",
xlab = expression(D[i]), ylab = "Test Score",
main = "Dummy Regression")
#
points(CASchools$D, predict(dummy_model), col = "red", pch = 20)
```
With $D$ as the regressor, it is not useful to think of $\beta_1$ as a slope parameter since $D_i \in \{0,1\}$, i.e., we only observe two discrete values instead of a continuum of regressor values. There is no continuous line depicting the conditional expectation function $E(TestScore_i | D_i)$ since this function is solely defined for $x$-positions $0$ and $1$.
Therefore, the interpretation of the coefficients in this regression model is as follows:
- $E(Y_i | D_i = 0) = \beta_0$, so $\beta_0$ is the expected test score in districts where $D_i=0$, i.e., where $STR \geq 20$.
- $E(Y_i | D_i = 1) = \beta_0 + \beta_1$ or, using the result above, $\beta_1 = E(Y_i | D_i = 1) - E(Y_i | D_i = 0)$. Thus, $\beta_1$ is *the difference in group-specific expectations*, i.e., the difference in expected test score between districts with $STR < 20$ and those with $STR \geq 20$.
We will now use `r ttcode('R')` to estimate the dummy regression model as defined by the equations (<a href="#mjx-eqn-5.2">5.2</a>) and (<a href="#mjx-eqn-5.3">5.3</a>) .
```{r}
# estimate the dummy regression model
dummy_model <- lm(score ~ D, data = CASchools)
summary(dummy_model)
```
```{block2, precision, type='rmdnote'}
<tt>summary()</tt> reports the $p$-value of the test that the coefficient on <tt>(Intercept)</tt> is zero to be <tt>< 2e-16</tt>. This scientific notation states that the $p$-value is smaller than $\frac{2}{10^{16}}$, a very small number. The reason for this is that computers cannot handle arbitrarily small numbers. In fact, $\frac{2}{10^{16}}$ is roughly the machine precision of <tt>R</tt>: differences below this threshold cannot be distinguished reliably, so smaller $p$-values are not reported exactly.
```
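For reference, this threshold is tied to the machine precision of double-precision numbers, which can be inspected directly (a small aside):
```{r, eval=FALSE}
# R's relative machine precision for double-precision numbers (about 2.2e-16)
.Machine$double.eps
```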
The vector `r ttcode("CASchools\\$D")` has the type `r ttcode("logical")` (to see this, use `r ttcode("typeof(CASchools$D)")`) which is shown in the output of `r ttcode("summary(dummy_model)")`: the label `r ttcode("DTRUE")` states that all entries `r ttcode("TRUE")` are coded as `r ttcode("1")` and all entries `r ttcode("FALSE")` are coded as `r ttcode("0")`. Thus, the interpretation of the coefficient `r ttcode("DTRUE")` is as stated above for $\beta_1$.
One can see that the expected test score in districts with $STR < 20$ ($D_i = 1$) is predicted to be $650.1 + 7.17 = 657.27$ while districts with $STR \geq 20$ ($D_i = 0$) are expected to have an average test score of only $650.1$.
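This interpretation can be cross-checked by computing the group means directly from the data. A minimal sketch (assuming `r ttcode('CASchools$D')` and `r ttcode('dummy_model')` from above; `r ttcode('group_means')` is just an illustrative name):
```{r, eval=FALSE}
# compare group-specific sample means with the coefficient estimates
group_means <- tapply(CASchools$score, CASchools$D, mean)
group_means        # average scores for D = FALSE and D = TRUE
diff(group_means)  # difference in group means, equals the coefficient on DTRUE
coef(dummy_model)
```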
Group-specific predictions can be added to the plot by executing the following code chunk.
```{r, eval=F}
# add group specific predictions to the plot
points(x = CASchools$D,
y = predict(dummy_model),
col = "red",
pch = 20)
```
Here we use the function `r ttcode('predict()')` to obtain estimates of the group specific means. The red dots represent these sample group averages. Accordingly, $\hat{\beta}_1 = 7.17$ can be seen as the difference in group averages.
`r ttcode('summary(dummy_model)')` also answers the question whether there is a statistically significant difference in group means. This in turn would support the hypothesis that students perform differently when they are taught in small classes. We can assess this by a two-tailed test of the hypothesis $H_0: \beta_1 = 0$. Conveniently, the $t$-statistic and the corresponding $p$-value for this test are computed by `r ttcode('summary()')`.
Since `r ttcode('t value')` $= 3.88 > 1.96$ we reject the null hypothesis at the $5\%$ level of significance. The same conclusion results when using the $p$-value, which reports significance up to the $0.00012\%$ level.
As done with `r ttcode("linear_model")`, we may alternatively use the function `r ttcode('confint()')` to compute a $95\%$ confidence interval for the true difference in means and see if the hypothesized value is an element of this confidence set.
```{r}
# confidence intervals for coefficients in the dummy regression model
confint(dummy_model)
```
We reject the hypothesis that there is no difference between group means at the $5\%$ significance level since $\beta_{1,0} = 0$ lies outside of $[3.54, 10.8]$, the $95\%$ confidence interval for the coefficient on $D$.
## Heteroskedasticity and Homoskedasticity {#hah}
All inference made in the previous chapters relies on the assumption that the error variance does not vary as regressor values change. But this will often not be the case in empirical applications.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC5.4">
<h3 class = "right"> Key Concept 5.4 </h3>
<h3 class = "left"> Heteroskedasticity and Homoskedasticity </h3>
- The error term of our regression model is homoskedastic if the variance of the conditional distribution of $u_i$ given $X_i$, $Var(u_i|X_i=x)$, is constant *for all* observations in our sample:
\\[ \\text{Var}(u_i|X_i=x) = \\sigma^2 \\ \\forall \\ i=1,\\dots,n. \\]
- If instead there is dependence of the conditional variance of $u_i$ on $X_i$, the error term is said to be heteroskedastic. We then write
\\[ \\text{Var}(u_i|X_i=x) = \\sigma_i^2 \\ \\forall \\ i=1,\\dots,n. \\]
- Homoskedasticity is a *special case* of heteroskedasticity.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Heteroskedasticity and Homoskedasticity]{5.4}
\\begin{itemize}
\\item The error term of our regression model is homoskedastic if the variance of the conditional distribution of $u_i$ given $X_i$, $Var(u_i|X_i=x)$, is constant \\textit{for all} observations in our sample:
\\[ \\text{Var}(u_i|X_i=x) = \\sigma^2 \\ \\forall \\ i=1,\\dots,n. \\]
\\item If instead there is dependence of the conditional variance of $u_i$ on $X_i$, the error term is said to be heteroskedastic. We then write
\\[ \\text{Var}(u_i|X_i=x) = \\sigma_i^2 \\ \\forall \\ i=1,\\dots,n. \\]
\\item Homoskedasticity is a \\textit{special case} of heteroskedasticity.
\\end{itemize}
\\end{keyconcepts}
')
```
For a better understanding of heteroskedasticity, we generate some bivariate heteroskedastic data, estimate a linear regression model and then use box plots to depict the conditional distributions of the residuals.
```{r, fig.align='center', warning=FALSE}
# load scales package for adjusting color opacities
library(scales)
# generate some heteroskedastic data:
# set seed for reproducibility
set.seed(123)
# set up vector of x coordinates
x <- rep(c(10, 15, 20, 25), each = 25)
# initialize vector of errors
e <- c()
# sample 100 errors such that the variance increases with x
e[1:25] <- rnorm(25, sd = 10)
e[26:50] <- rnorm(25, sd = 15)
e[51:75] <- rnorm(25, sd = 20)
e[76:100] <- rnorm(25, sd = 25)
# set up y
y <- 720 - 3.3 * x + e
# Estimate the model
mod <- lm(y ~ x)
# Plot the data
plot(x = x,
y = y,
main = "An Example of Heteroskedasticity",
xlab = "Student-Teacher Ratio",
ylab = "Test Score",
cex = 0.5,
pch = 19,
xlim = c(8, 27),
ylim = c(600, 710))
# Add the regression line to the plot
abline(mod, col = "darkred")
# Add boxplots to the plot
boxplot(formula = y ~ x,
add = TRUE,
at = c(10, 15, 20, 25),
col = alpha("gray", 0.4),
border = "black")
```
We have used the `r ttcode("formula")` argument `r ttcode("y ~ x")` in `r ttcode("boxplot()")` to specify that we want to split up the vector `r ttcode("y")` into groups according to `r ttcode("x")`. `r ttcode("boxplot(y ~ x)")` generates a boxplot for each of the groups in `r ttcode("y")` defined by `r ttcode("x")`.
For this artificial data it is clear that the conditional error variances differ. Specifically, we observe that the variance in test scores (and therefore the variance of the errors committed) *increases* with the student-teacher ratio.
### A Real-World Example for Heteroskedasticity {-}
Think about the economic value of education: if there were no expected economic value-added to receiving university education, you probably would not be reading this script right now. A starting point to empirically verify such a relation is to have data on working individuals. More precisely, we need data on wages and education of workers in order to estimate a model like
$$ wage_i = \beta_0 + \beta_1 \cdot education_i + u_i. $$
What can be presumed about this relation? It is likely that, on average, more highly educated workers earn more than workers with less education, so we expect to estimate an upward-sloping regression line. Also, it seems plausible that earnings of better educated workers have a higher dispersion than those of low-skilled workers: a solid education is no guarantee for a high salary, so even highly qualified workers may take on low-income jobs. However, they are more likely to meet the requirements for well-paid jobs than workers with less education, for whom opportunities in the labor market are much more limited.
To verify this empirically we may use real data on hourly earnings and the number of years of education of employees. Such data can be found in `r ttcode('CPSSWEducation')`. This data set is part of the package `r ttcode('AER')` and comes from the Current Population Survey (CPS) which is conducted periodically by the [Bureau of Labor Statistics](http://www.bls.gov/) in the United States.
The subsequent code chunks demonstrate how to import the data into `r ttcode('R')` and how to produce a plot in the fashion of Figure 5.3 in the book.
```{r, fig.align='center'}
# load package and attach data
library(AER)
data("CPSSWEducation")
attach(CPSSWEducation)
# get an overview
summary(CPSSWEducation)
# estimate a simple regression model
labor_model <- lm(earnings ~ education)
# plot observations and add the regression line
plot(education,
earnings,
ylim = c(0, 150))
abline(labor_model,
col = "steelblue",
lwd = 2)
```
The plot reveals that the mean of the distribution of earnings increases with the level of education. This is also supported by a formal analysis: the estimated regression model stored in `r ttcode('labor_model')` shows that there is a positive relation between years of education and earnings.
```{r}
# print the contents of labor_model to the console
labor_model
```
The estimated regression equation states that, on average, an additional year of education increases a worker's hourly earnings by about $\$ 1.47$. Once more we use `r ttcode('confint()')` to obtain a $95\%$ confidence interval for both regression coefficients.
```{r}
# compute a 95% confidence interval for the coefficients in the model
confint(labor_model)
```
Since the interval is $[1.33, 1.60]$, which does not contain zero, we can reject the hypothesis that the coefficient on `r ttcode('education')` is zero at the $5\%$ level.
Furthermore, the plot indicates that there is heteroskedasticity: if we assume the regression line to be a reasonably good representation of the conditional mean function $E(earnings_i\vert education_i)$, the dispersion of hourly earnings around that function clearly increases with the level of education, i.e., the variance of the distribution of earnings increases. In other words: the variance of the errors (the errors made in explaining earnings by education) increases with education so that the regression errors are heteroskedastic.
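An informal way to gauge this from the data is to look at how the spread of the residuals varies across education levels. A rough sketch (assuming `r ttcode('labor_model')` and the attached `r ttcode('CPSSWEducation')` data from above; groups with very few observations may yield unstable values):
```{r, eval=FALSE}
# standard deviation of the residuals for each observed level of education
tapply(residuals(labor_model), education, sd)
```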
This example makes a case that the assumption of homoskedasticity is doubtful in economic applications. Should we care about heteroskedasticity? Yes, we should. As explained in the next section, ignoring heteroskedasticity can have serious negative consequences for hypothesis testing.
### Should We Care About Heteroskedasticity? {-}
To answer the question whether we should worry about heteroskedasticity being present, consider the variance of $\hat\beta_1$ under the assumption of homoskedasticity. In this case we have
$$ \sigma^2_{\hat\beta_1} = \frac{\sigma^2_u}{n \cdot \sigma^2_X} \tag{5.5} $$
which is a simplified version of the general equation (<a href="#mjx-eqn-4.1">4.1</a>) presented in Key Concept 4.4. See Appendix 5.1 of the book for details on the derivation. `r ttcode('summary()')` estimates (<a href="#mjx-eqn-5.5">5.5</a>) by
$$ \overset{\sim}{\sigma}^2_{\hat\beta_1} = \frac{SER^2}{\sum_{i=1}^n (X_i - \overline{X})^2} \ \ \text{where} \ \ SER^2=\frac{1}{n-2} \sum_{i=1}^n \hat u_i^2. $$
Thus `r ttcode('summary()')` estimates the *homoskedasticity-only* standard error
\[ \sqrt{ \overset{\sim}{\sigma}^2_{\hat\beta_1} } = \sqrt{ \frac{SER^2}{\sum_{i=1}^n(X_i - \overline{X})^2} }. \]
This is an estimator of the standard deviation of $\hat{\beta}_1$ which is *inconsistent* when there is heteroskedasticity: the estimated variance does not converge to the true value $\sigma^2_{\hat\beta_1}$. The implication is that $t$-statistics computed in the manner of Key Concept 5.1 do not follow a standard normal distribution, even in large samples. This issue may invalidate inference when using the previously treated tools for hypothesis testing: we should be cautious when making statements about the significance of regression coefficients on the basis of $t$-statistics as computed by `r ttcode('summary()')` or confidence intervals produced by `r ttcode('confint()')` if the assumption of homoskedasticity is in doubt!
We will now use `r ttcode('R')` to compute the homoskedasticity-only standard error for $\hat{\beta}_1$ in the wage regression model `r ttcode('labor_model')` by hand and see that it matches the value produced by `r ttcode('summary()')`.
```{r}
# Store model summary in 'model'
model <- summary(labor_model)
# Extract the standard error of the regression from model summary
SER <- model$sigma
# Compute the variation in 'education'
V <- (nrow(CPSSWEducation)-1) * var(education)
# Compute the standard error of the slope parameter's estimator and print it
SE.beta_1.hat <- sqrt(SER^2/V)
SE.beta_1.hat
# Use logical operators to see if the value computed by hand matches the one provided
# in model$coefficients. Round estimates to four decimal places
round(model$coefficients[2, 2], 4) == round(SE.beta_1.hat, 4)
```
Indeed, the estimated values are equal.
### Computation of Heteroskedasticity-Robust Standard Errors {-}
Consistent estimation of $\sigma_{\hat{\beta}_1}$ under heteroskedasticity is ensured when the following *robust* estimator is used.
\[ SE(\hat{\beta}_1) = \sqrt{ \frac{1}{n} \cdot \frac{ \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X})^2 \hat{u}_i^2 }{ \left[ \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X})^2 \right]^2} } \tag{5.6} \]
Standard error estimates computed this way are also referred to as [Eicker-Huber-White standard errors](https://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors), the most frequently cited paper on this is @white1980.
It can be quite cumbersome to do this calculation by hand. Luckily, there are `r ttcode('R')` functions that serve this purpose. A convenient one named `r ttcode('vcovHC()')` is part of the package `r ttcode("sandwich")`.^[The package `r ttcode('sandwich')` is a dependency of the package `r ttcode('AER')`, meaning that it is attached automatically if you load `r ttcode('AER')`.] This function can compute a variety of standard errors. The one introduced in (<a href="#mjx-eqn-5.6">5.6</a>) is computed when the argument `r ttcode('type')` is set to `r ttcode('"HC0"')`. Most of the examples presented in the book rely on a slightly different formula which is the default in the statistics package *STATA*:
\begin{align}
SE(\hat{\beta}_1)_{HC1} = \sqrt{ \frac{1}{n} \cdot \frac{ \frac{1}{n-2} \sum_{i=1}^n (X_i - \overline{X})^2 \hat{u}_i^2 }{ \left[ \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X})^2 \right]^2}} (\#eq:hc1)
\end{align}
The difference is that the numerator of \@ref(eq:hc1) uses $\frac{1}{n-2}$ instead of $\frac{1}{n}$. This is a degrees-of-freedom correction and was considered by @mackinnon1985. To get `r ttcode('vcovHC()')` to use \@ref(eq:hc1), we have to set `r ttcode('type = "HC1"')`.
Let us now compute robust standard error estimates for the coefficients in `r ttcode('linear_model')`.
```{r}
# compute heteroskedasticity-robust standard errors
vcov <- vcovHC(linear_model, type = "HC1")
vcov
```
The output of `r ttcode('vcovHC()')` is the variance-covariance matrix of coefficient estimates. We are interested in the square root of the diagonal elements of this matrix, i.e., the standard error estimates.
```{block2, vcovmatrix, type='rmdknit'}
When we have k > 1 regressors, writing down the equations for a regression model becomes very messy. A more convenient way to denote and estimate so-called multiple regression models (see Chapter \@ref(rmwmr)) is by using matrix algebra. This is why functions like <tt>vcovHC()</tt> produce matrices. In the simple linear regression model, the variances and covariances of the estimators can be gathered in the symmetric variance-covariance matrix
\begin{equation}
\text{Var}
\begin{pmatrix}
\hat\beta_0 \\
\hat\beta_1
\end{pmatrix} =
\begin{pmatrix}
\text{Var}(\hat\beta_0) & \text{Cov}(\hat\beta_0,\hat\beta_1) \\
\text{Cov}(\hat\beta_0,\hat\beta_1) & \text{Var}(\hat\beta_1)
\end{pmatrix},
\end{equation}
so <tt>vcovHC()</tt> gives us $\widehat{\text{Var}}(\hat\beta_0)$, $\widehat{\text{Var}}(\hat\beta_1)$ and $\widehat{\text{Cov}}(\hat\beta_0,\hat\beta_1)$, but most of the time we are interested in the diagonal elements of the estimated matrix.
```
```{r}
# compute the square root of the diagonal elements in vcov
robust_se <- sqrt(diag(vcov))
robust_se
```
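To see how \@ref(eq:hc1) maps to this output, the robust standard error for the slope can also be computed by hand. A minimal sketch (assuming `r ttcode('linear_model')` and `r ttcode('CASchools')` from above; the object names are chosen for illustration):
```{r, eval=FALSE}
# compute the HC1 standard error for the coefficient on STR by hand
n <- nobs(linear_model)
u_hat <- residuals(linear_model)
X <- CASchools$STR
SE_HC1_by_hand <- sqrt(1/n * (1/(n - 2) * sum((X - mean(X))^2 * u_hat^2)) /
                         (1/n * sum((X - mean(X))^2))^2)
SE_HC1_by_hand  # should match the second entry of 'robust_se'
```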
Now assume we want to generate a coefficient summary as provided by `r ttcode('summary()')` but with *robust* standard errors of the coefficient estimators, robust $t$-statistics and corresponding $p$-values for the regression model `r ttcode("linear_model")`. This can be done using `r ttcode('coeftest()')` from the package `r ttcode('lmtest')`, see `?coeftest`. Furthermore, we use the argument `r ttcode('vcov.')` to specify that `r ttcode("vcov")`, the Eicker-Huber-White estimate of the variance-covariance matrix we have computed before, should be used.
```{r}
# we invoke the function `coeftest()` on our model
coeftest(linear_model, vcov. = vcov)
```
We see that the values reported in the column `r ttcode('Std. Error')` are equal to those from `r ttcode('sqrt(diag(vcov))')`.
How severe are the implications of using homoskedasticity-only standard errors in the presence of heteroskedasticity? The answer is: it depends. As mentioned above, we face the risk of drawing wrong conclusions when conducting significance tests. <br> Let us illustrate this by generating another example of a heteroskedastic data set and using it to estimate a simple regression model. We take
$$ Y_i = \beta_1 \cdot X_i + u_i \ \ , \ \ u_i \overset{i.i.d.}{\sim} \mathcal{N}(0,0.36 \cdot X_i^2) $$
with $\beta_1=1$ as the data generating process. Clearly, the assumption of homoskedasticity is violated here since the variance of the errors is a nonlinear, increasing function of $X_i$. The errors nevertheless have zero mean and are i.i.d., so the assumptions made in Key Concept 4.3 are not violated. As before, we are interested in estimating $\beta_1$.
```{r}
set.seed(905)
# generate heteroskedastic data
X <- 1:500
Y <- rnorm(n = 500, mean = X, sd = 0.6 * X)
# estimate a simple regression model
reg <- lm(Y ~ X)
```
We plot the data and add the regression line.
```{r, fig.align='center'}
# plot the data
plot(x = X, y = Y,
pch = 19,
col = "steelblue",
cex = 0.8)
# add the regression line to the plot
abline(reg,
col = "darkred",
lwd = 1.5)
```
The plot shows that the data are heteroskedastic as the variance of $Y$ grows with $X$. We next conduct a significance test of the (true) null hypothesis $H_0: \beta_1 = 1$ twice, once using the homoskedasticity-only standard error formula and once with the robust version (<a href="#mjx-eqn-5.6">5.6</a>). An easy way to do this in `r ttcode('R')` is the function `r ttcode('linearHypothesis()')` from the package `r ttcode('car')`, see `?linearHypothesis`. It allows us to test linear hypotheses about parameters in linear models in a similar way as with a $t$-statistic and offers various robust covariance matrix estimators. We conduct the tests by comparing their $p$-values to the significance level of $5\%$.
```{block2, linearhypothesis, type='rmdknit'}
<tt>linearHypothesis()</tt> computes a test statistic that follows an $F$-distribution under the null hypothesis. We will not focus on the details of the underlying theory. In general, the idea of the $F$-test is to compare the fit of different models. When testing a hypothesis about a *single* coefficient using an $F$-test, one can show that the test statistic is simply the square of the corresponding $t$-statistic:
$$F = t^2 = \left(\frac{\hat\beta_i - \beta_{i,0}}{SE(\hat\beta_i)}\right)^2 \sim F_{1,n-k-1}$$
In <tt>linearHypothesis()</tt>, there are different ways to specify the hypothesis to be tested, e.g., using a vector of the type <tt>character</tt> (as done in the next code chunk), see <tt>?linearHypothesis</tt> for alternatives. The function returns an object of class <tt>anova</tt> which contains further information on the test that can be accessed using the <tt>$</tt> operator.
```
```{r}
# test the hypothesis using the default standard error formula
linearHypothesis(reg, hypothesis.matrix = "X = 1")$'Pr(>F)'[2] < 0.05
# test hypothesis using the robust standard error formula
linearHypothesis(reg, hypothesis.matrix = "X = 1", white.adjust = "hc1")$'Pr(>F)'[2] < 0.05
```
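To connect this with the note above, the following sketch checks that the robust $F$-statistic equals the square of the corresponding robust $t$-statistic (assuming `r ttcode('reg')` from the chunk above; the column `r ttcode('"F"')` refers to the `r ttcode('anova')` output of `r ttcode('linearHypothesis()')`):
```{r, eval=FALSE}
# compare the robust F-statistic with the squared robust t-statistic
F_stat <- linearHypothesis(reg, "X = 1", white.adjust = "hc1")[2, "F"]
t_stat <- (coef(reg)["X"] - 1) / sqrt(vcovHC(reg, type = "HC1")["X", "X"])
c(F = F_stat, t_squared = unname(t_stat)^2)
```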
This is a good example of what can go wrong if we ignore heteroskedasticity: for the data set at hand the default method rejects the null hypothesis $\beta_1 = 1$ although it is true. When using the robust standard error formula the test does not reject the null. Of course, we could think this might just be a coincidence and both tests do equally well in maintaining the type I error rate of $5\%$. This can be further investigated by computing *Monte Carlo* estimates of the rejection frequencies of both tests on the basis of a large number of random samples. We proceed as follows:
- We initialize the vectors `r ttcode('t')` and `r ttcode('t.rob')`.
- Using a `r ttcode('for()')` loop, we generate $10000$ heteroskedastic random samples of size $1000$, estimate the regression model and check whether the tests falsely reject the null at the level of $5\%$ using comparison operators. The results are stored in the respective vectors `r ttcode('t')` and `r ttcode('t.rob')`.
- After the simulation, we compute the fraction of false rejections for both tests.
```{r, cache=T, echo=-1}
set.seed(905)
# initialize vectors t and t.rob
t <- c()
t.rob <- c()
# loop sampling and estimation
for (i in 1:10000) {
# sample data
X <- 1:1000
Y <- rnorm(n = 1000, mean = X, sd = 0.6 * X)
# estimate regression model
reg <- lm(Y ~ X)
# homoskedasticity-only significance test
t[i] <- linearHypothesis(reg, "X = 1")$'Pr(>F)'[2] < 0.05
# robust significance test
t.rob[i] <- linearHypothesis(reg, "X = 1", white.adjust = "hc1")$'Pr(>F)'[2] < 0.05
}
# compute the fraction of false rejections
round(cbind(t = mean(t), t.rob = mean(t.rob)), 3)
```
These results reveal the increased risk of falsely rejecting the null using the homoskedasticity-only standard error for the testing problem at hand: with the common standard error, $`r round(sum(t)/length(t)*100,3)`\%$ of all tests falsely reject the null hypothesis. In contrast, with the robust test statistic we are closer to the nominal level of $5\%$.
## The Gauss-Markov Theorem
When estimating regression models, we know that the results of the estimation procedure are random. However, when using unbiased estimators, at least on average, we estimate the true parameter. When comparing different unbiased estimators, it is therefore interesting to know which one has the highest precision: being aware that the likelihood of estimating the *exact* value of the parameter of interest is $0$ in an empirical application, we want to make sure that the likelihood of obtaining an estimate very close to the true value is as high as possible. This means we want to use the estimator with the lowest variance of all unbiased estimators, provided we care about unbiasedness. The Gauss-Markov theorem states that, in the class of conditionally unbiased linear estimators, the OLS estimator has this property under certain conditions.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC5.5">
<h3 class = "right"> Key Concept 5.5 </h3>
<h3 class = "left"> The Gauss-Markov Theorem for $\\hat{\\beta}_1$ </h3>
Suppose that the assumptions made in Key Concept 4.3 hold *and* that the errors are *homoskedastic*. The OLS estimator is the best (in the sense of smallest variance) linear conditionally unbiased estimator (BLUE) in this setting.
Let us have a closer look at what this means:
- Estimators of $\\beta_1$ that are linear functions of the $Y_1, \\dots, Y_n$ and that are unbiased conditionally on the regressor $X_1, \\dots, X_n$ can be written as \\[ \\overset{\\sim}{\\beta}_1 = \\sum_{i=1}^n a_i Y_i \\] where the $a_i$ are weights that are allowed to depend on the $X_i$ but *not* on the $Y_i$.
- We already know that $\\overset{\\sim}{\\beta}_1$ has a sampling distribution: $\\overset{\\sim}{\\beta}_1$ is a linear function of the $Y_i$ which are random variables. If now \\[ E(\\overset{\\sim}{\\beta}_1 | X_1, \\dots, X_n) = \\beta_1, \\] $\\overset{\\sim}{\\beta}_1$ is a linear unbiased estimator of $\\beta_1$, conditionally on the $X_1, \\dots, X_n$.
- We may ask if $\\overset{\\sim}{\\beta}_1$ is also the *best* estimator in this class, i.e., the most efficient one of all linear conditionally unbiased estimators where "most efficient" means smallest variance. The weights $a_i$ play an important role here and it turns out that OLS uses just the right weights to have the BLUE property.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[The Gauss-Markov Theorem for $\\hat{\\beta}_1$]{5.5}
Suppose that the assumptions made in Key Concept 4.3 hold \\textit{and} that the errors are \\textit{homoskedastic}. The OLS estimator is the best (in the sense of smallest variance) linear conditionally unbiased estimator (BLUE) in this setting.\\newline
Let us have a closer look at what this means:\\newline
\\begin{itemize}
\\item Estimators of $\\beta_1$ that are linear functions of the $Y_1, \\dots, Y_n$ and that are unbiased conditionally on the regressor $X_1, \\dots, X_n$ can be written as \\[ \\overset{\\sim}{\\beta}_1 = \\sum_{i=1}^n a_i Y_i \\] where the $a_i$ are weights that are allowed to depend on the $X_i$ but \\textit{not} on the $Y_i$.
\\item We already know that $\\overset{\\sim}{\\beta}_1$ has a sampling distribution: $\\overset{\\sim}{\\beta}_1$ is a linear function of the $Y_i$ which are random variables. If now \\[ E(\\overset{\\sim}{\\beta}_1 | X_1, \\dots, X_n) = \\beta_1, \\] $\\overset{\\sim}{\\beta}_1$ is a linear unbiased estimator of $\\beta_1$, conditionally on the $X_1, \\dots, X_n$.
\\item We may ask if $\\overset{\\sim}{\\beta}_1$ is also the \\textit{best} estimator in this class, i.e., the most efficient one of all linear conditionally unbiased estimators where most efficient means smallest variance. The weights $a_i$ play an important role here and it turns out that OLS uses just the right weights to have the BLUE property.
\\end{itemize}
\\end{keyconcepts}
')
```
### Simulation Study: BLUE Estimator {-}
Consider the case of a regression of $Y_1,\dots,Y_n$ only on a constant. Here, the $Y_i$ are assumed to be a random sample from a population with mean $\mu$ and variance $\sigma^2$. The OLS estimator in this model is simply the sample mean, see Chapter \@ref(potsm).
\begin{equation}
\hat{\beta}_1 = \sum_{i=1}^n \underbrace{\frac{1}{n}}_{=a_i} Y_i (\#eq:bluemean)
\end{equation}
Clearly, each observation is weighted by
$$a_i = \frac{1}{n},$$
and we also know that $\text{Var}(\hat{\beta}_1)=\frac{\sigma^2}{n}$.
We now use `r ttcode('R')` to conduct a simulation study that demonstrates what happens to the variance of \@ref(eq:bluemean) if different weights \[ w_i = \frac{1 \pm \epsilon}{n} \] are assigned to either half of the sample $Y_1, \dots, Y_n$ instead of using $\frac{1}{n}$, the OLS weights.
```{r, fig.align='center', cache=T}
# set sample size and number of repetitions
n <- 100
reps <- 1e5
# choose epsilon and create a vector of weights as defined above
epsilon <- 0.8
w <- c(rep((1 + epsilon) / n, n / 2),
rep((1 - epsilon) / n, n / 2) )
# draw a random sample y_1,...,y_n from the standard normal distribution,
# use both estimators 1e5 times and store the result in the vectors 'ols' and
# 'weightedestimator'
ols <- rep(NA, reps)
weightedestimator <- rep(NA, reps)
for (i in 1:reps) {
y <- rnorm(n)
ols[i] <- mean(y)
weightedestimator[i] <- crossprod(w, y)
}
# plot kernel density estimates of the estimators' distributions:
# OLS
plot(density(ols),
col = "purple",
lwd = 3,
main = "Density of OLS and Weighted Estimator",
xlab = "Estimates")
# weighted
lines(density(weightedestimator),
col = "steelblue",
lwd = 3)
# add a dashed line at 0 and add a legend to the plot
abline(v = 0, lty = 2)
legend('topright',
c("OLS", "Weighted"),
col = c("purple", "steelblue"),
lwd = 3)
```
What conclusion can we draw from the result?
- Both estimators seem to be unbiased: the means of their estimated distributions are zero.
- The estimator using weights that deviate from those implied by OLS is less efficient than the OLS estimator: there is higher dispersion when weights are $w_i = \frac{1 \pm 0.8}{100}$ instead of $w_i=\frac{1}{100}$ as required by the OLS solution.
Hence, the simulation results support the Gauss-Markov Theorem.
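A simple numerical check of this efficiency statement compares the simulated variances of both estimators with their theoretical counterparts $\frac{1}{n}$ and $\frac{1+\epsilon^2}{n}$ (a small sketch, assuming the objects `r ttcode('ols')`, `r ttcode('weightedestimator')`, `r ttcode('n')` and `r ttcode('epsilon')` from the chunk above):
```{r, eval=FALSE}
# compare simulated variances with their theoretical counterparts
c(var_ols = var(ols), theory_ols = 1 / n,
  var_weighted = var(weightedestimator), theory_weighted = (1 + epsilon^2) / n)
```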
## Using the t-Statistic in Regression When the Sample Size Is Small
The three OLS assumptions discussed in Chapter \@ref(lrwor) (see Key Concept 4.3) are the foundation for the results on the large sample distribution of the OLS estimators in the simple regression model. What can be said about the distribution of the estimators and their $t$-statistics when the sample size is small and the population distribution of the data is unknown? Provided that the three least squares assumptions hold and the errors are normally distributed and homoskedastic (we refer to these conditions as the homoskedastic normal regression assumptions), we have normally distributed estimators and $t$-distributed test statistics in small samples.
Recall the [definition](#thetdist) of a $t$-distributed variable
$$ \frac{Z}{\sqrt{W/M}} \sim t_M$$
where $Z$ is a standard normal random variable, $W$ is $\chi^2$ distributed with $M$ degrees of freedom and $Z$ and $W$ are independent. See Section 5.6 of the book for a more detailed discussion of the small sample distribution of $t$-statistics in regression methods.
Let us simulate the distribution of regression $t$-statistics based on a large number of small random samples, say $n=20$, and compare the simulated distributions to the theoretical distributions which should be $t_{18}$, the $t$-distribution with $18$ degrees of freedom (recall that $\text{DF}=n-k-1$).
```{r, fig.align="center", cache=T}
# initialize two vectors
beta_0 <- c()
beta_1 <- c()
# loop sampling / estimation / t statistics
for (i in 1:10000) {
X <- runif(20, 0, 20)
Y <- rnorm(n = 20, mean = X)
reg <- summary(lm(Y ~ X))
beta_0[i] <- (reg$coefficients[1, 1] - 0)/(reg$coefficients[1, 2])
beta_1[i] <- (reg$coefficients[2, 1] - 1)/(reg$coefficients[2, 2])
}
# plot the distributions and compare with t_18 density:
# divide plotting area
par(mfrow = c(1, 2))
# plot the simulated density of beta_0