In this computer lab you will practice the following skills.
Let’s start by loading some useful packages:
library(readxl) # enable the read_excel function
library(tidyverse) # for almost all data handling tasks
library(ggplot2) # plotting toolbox
library(stargazer) # for nice regression output
library(AER) # access to HS robust standard errors
and setting the working directory:
setwd(XXXX)
You should also save the separately supplied
stargazer_HC.r file in your working directory. This will
make it straightforward to estimate and compare regressions with robust
standard errors. Once you have done that, include the
following line in your code, which makes this function
available to you.
source("stargazer_HC.r") # includes the robust regression
We are continuing work that eventually aims to replicate some of the work in this paper:
Siegel et al (2019) The Impact of State Firearm Laws on Homicide and Suicide Deaths in the USA, 1991–2016: a Panel Study, J Gen Intern Med 34(10):2021–8.
As in Computer Lab 3 we import the dataset for the gun law example. However, make sure you load the new csv file “US_Gun_example_v2.csv”, which now includes the proportion of 18-to-24-year-olds and information on the region in which each state is located.
data2 <- read.csv("../data/US_Gun_example_v2.csv", stringsAsFactors = TRUE) # import data
str(data2)
## 'data.frame': 1071 obs. of 22 variables:
## $ X.1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ State_code : Factor w/ 51 levels "AK","AL","AR",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 2005 2020 2002 2013 2010 2006 2009 2014 2018 2015 ...
## $ State : Factor w/ 51 levels "Alabama","Alaska",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ X : int 205 970 52 613 460 256 409 664 868 715 ...
## $ Population : int 666835 732441 642270 737626 710246 675150 698895 737075 736624 738430 ...
## $ Mechanism : Factor w/ 1 level "Firearm": 1 1 1 1 1 1 1 1 1 1 ...
## $ pol_ubc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pol_vmisd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pol_mayissue : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Age.Adjusted.Rate: num 17.5 23.5 19.9 19.5 20.5 ...
## $ logy : num 2.86 3.16 2.99 2.97 3.02 ...
## $ ur : num 6.93 8.31 7.17 6.99 8.13 ...
## $ law.officers : int 1838 2002 1857 1937 1944 1937 1977 1938 1985 1936 ...
## $ law.officers.pc : num 276 273 289 263 274 ...
## $ vcrime : int 4162 6077 3594 5332 4506 4574 4402 5196 6508 5980 ...
## $ vcrime.pc : num 624 830 560 723 634 ...
## $ alcc.pc : num 2.64 2.83 2.99 2.75 3.03 ...
## $ incarc : int 3433 4352 2991 5054 3771 3371 3846 5728 4235 5247 ...
## $ incarc.pc : num 515 594 466 685 531 ...
## $ prop18.24 : num 11 NA 10.2 10.8 10.6 ...
## $ Region : Factor w/ 4 levels "Midwest","Northeast",..: 4 4 4 4 4 4 4 4 4 4 ...
You should find that data2 has 1,071 rows and 22
variables.
In last week’s computer lab we concluded by estimating these two regression models:
\(mod2: AAR_i = \alpha + \gamma_1~lo.pc_i + v_i\)
\(mod3: AAR_i = \alpha + \beta_1~lo.pc_i + \beta_2~ur_i + \beta_3~alcc.pc_i + u_i\)
mod2 <- lm(Age.Adjusted.Rate~law.officers.pc,data=data2)
mod3 <- lm(Age.Adjusted.Rate~law.officers.pc+ur+alcc.pc,data=data2)
stargazer(mod2,mod3,type = "text")
##
## ====================================================================
## Dependent variable:
## ------------------------------------------------
## Age.Adjusted.Rate
## (1) (2)
## --------------------------------------------------------------------
## law.officers.pc 0.006*** 0.007***
## (0.002) (0.002)
##
## ur 0.039
## (0.076)
##
## alcc.pc -0.528*
## (0.286)
##
## Constant 10.301*** 11.151***
## (0.501) (0.865)
##
## --------------------------------------------------------------------
## Observations 1,048 1,048
## R2 0.014 0.017
## Adjusted R2 0.013 0.014
## Residual Std. Error 4.907 (df = 1046) 4.902 (df = 1044)
## F Statistic 14.461*** (df = 1; 1046) 6.118*** (df = 3; 1044)
## ====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We were potentially puzzled by the very low \(R^2\) for these regressions (meaning that less than 2% of the variation in the dependent variable is explained by variation in the explanatory variables). We plotted a scatter graph in which different colors represent different states.
ggplot(data2,aes(law.officers.pc,Age.Adjusted.Rate, color = State)) +
geom_point() +
guides(color = "none") + # removes the legend
ggtitle("Number of law enforcement officers v Deaths by Firearms")
What you can see here is that observations for the same state cluster together and that different states seem to differ markedly from one another. This variation is not reflected in the above estimation.
This is how far we got last time. Before we continue to work on that issue, we will look at the issue of performing inference.
You will have learned in the lectures that in practice we wish to
account for the fact that error terms may not be homoskedastic, meaning
that error term variances may well vary in some systematic way. If that
is so, then the way in which standard regression functions (like
lm) estimate coefficient standard errors (the values in
parentheses underneath the parameter estimates in the regression output)
is incorrect. There are different ways in which this can be corrected.
The stargazer_HC function has one of them implemented, and
by displaying results using stargazer_HC you will ensure
that inference on regression coefficients is robust to the presence of
error heteroskedasticity.
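If you are curious how such robust standard errors can be obtained without the helper script, here is a minimal sketch using the sandwich/lmtest machinery that the AER package attaches. The simulated data and the choice of the HC1 variant are illustrative assumptions; stargazer_HC may use a different HC variant (HC0, HC1 and HC3 are all common choices).

```r
library(AER)  # attaches the sandwich and lmtest packages

set.seed(1)                                   # simulated data, for illustration only
x <- runif(200)
y <- 2 + 3 * x + rnorm(200, sd = 0.5 + x)     # errors whose variance grows with x
fit <- lm(y ~ x)

# Coefficient tests with a heteroskedasticity-robust covariance matrix
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```

Note that only the standard errors (and hence t-statistics and p-values) change relative to summary(fit); the coefficient estimates themselves are untouched.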
So let’s display the same regression results but with heteroskedasticity robust standard errors, often also called White standard errors.
stargazer_HC(mod2,mod3,type_out = "text")
##
## ====================================================================
## Dependent variable:
## ------------------------------------------------
## Age.Adjusted.Rate
## (1) (2)
## --------------------------------------------------------------------
## law.officers.pc 0.006*** 0.007***
## (0.002) (0.002)
##
## ur 0.039
## (0.071)
##
## alcc.pc -0.528**
## (0.237)
##
## Constant 10.301*** 11.151***
## (0.628) (0.871)
##
## --------------------------------------------------------------------
## Observations 1,048 1,048
## R2 0.014 0.017
## Adjusted R2 0.013 0.014
## Residual Std. Error 4.907 (df = 1046) 4.902 (df = 1044)
## F Statistic 14.461*** (df = 1; 1046) 6.118*** (df = 3; 1044)
## ====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
You should compare these standard errors to the ones in the previous output. They are different. But note that calculating White standard errors leaves the actual coefficient estimates unchanged.
From now on we will typically calculate heteroskedasticity robust standard errors.
Let us consider whether the variation we see between states is actually variation between regions in the U.S. First check how many states are in each region. There are many ways to achieve this. In the end you should be able to replicate the information in the following table.
## # A tibble: 4 × 2
## Region n
## <fct> <int>
## 1 Midwest 12
## 2 Northeast 9
## 3 South 17
## 4 West 13
One way of doing that would be by completing the following code skeleton:
table1 <- data2 %>% XXXX(XXXX) %>% # pick any year
XXXX(XXXX) %>% # groups by region
XXXX(n = n()) %>% # summarises each group by calculating obs
print()
If we believe that a significant proportion of the variation in the
dependent variable can be explained by regional variation, then we should
include region fixed effects. This is done by including the
Region variable in the regression. As Region
is a factor variable, R will automatically create and include a dummy
variable for each region.
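To see how R expands a factor into dummy columns, here is a minimal sketch using model.matrix on a toy factor (illustrative only; this is not part of the lab data):

```r
# A factor with the four region names as levels
reg <- factor(c("Midwest", "Northeast", "South", "West"))

# The design matrix R builds internally: an intercept column
# plus one 0/1 dummy for each level after the first
model.matrix(~ reg)
```

Inspecting this small matrix makes it easy to see which level R treats as the baseline.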
mod4 <- lm(Age.Adjusted.Rate~law.officers.pc+ur+alcc.pc+Region,data=data2)
stargazer_HC(mod2,mod3,mod4,type_out = "text")
##
## ==============================================================================================
## Dependent variable:
## --------------------------------------------------------------------------
## Age.Adjusted.Rate
## (1) (2) (3)
## ----------------------------------------------------------------------------------------------
## law.officers.pc 0.006*** 0.007*** -0.001
## (0.002) (0.002) (0.002)
##
## ur 0.039 -0.036
## (0.071) (0.057)
##
## alcc.pc -0.528** 1.031***
## (0.237) (0.198)
##
## RegionNortheast -4.292***
## (0.299)
##
## RegionSouth 4.466***
## (0.319)
##
## RegionWest 2.564***
## (0.361)
##
## Constant 10.301*** 11.151*** 8.747***
## (0.628) (0.871) (0.761)
##
## ----------------------------------------------------------------------------------------------
## Observations 1,048 1,048 1,048
## R2 0.014 0.017 0.378
## Adjusted R2 0.013 0.014 0.374
## Residual Std. Error 4.907 (df = 1046) 4.902 (df = 1044) 3.906 (df = 1041)
## F Statistic 14.461*** (df = 1; 1046) 6.118*** (df = 3; 1044) 105.431*** (df = 6; 1041)
## ==============================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
As an example, the new variable RegionWest is a variable
that takes the value 1 for all states that are in the West and 0 for all
others. You can see that the inclusion of the Region dummy variables has
significantly increased the \(R^2\). We
had four regions, but only three dummy variables were included. Why is
there no dummy variable for the Midwest region?
Also compare the coefficient estimates between mod3 and
mod4. Do you notice that the effect of alcohol
consumption, which was previously estimated to be significantly
negative, is now statistically significant and positive? Clearly
regional variation is important in explaining variation in the number of
firearm deaths. Leaving these variables out introduces omitted-variable
bias into the remaining coefficient estimates - and that is a bad thing.
You may wonder whether one shouldn’t include state rather than region
dummies. And that would be a fair suggestion. Estimate a new model
mod5 which is identical to mod4 other than
replacing the Region variable with the State
variable. Then create an output in which you display models
mod2 to mod5.
You should receive a really long output, as the new regression now includes a large number of state dummy variables. Display the results instead with the following line.
stargazer_HC(mod2,mod3,mod4,mod5,keep = c("law.officers.pc", "ur", "alcc.pc"),omit.stat=c("f"),type_out = "text")
##
## ==========================================================================================
## Dependent variable:
## ----------------------------------------------------------------------
## Age.Adjusted.Rate
## (1) (2) (3) (4)
## ------------------------------------------------------------------------------------------
## law.officers.pc 0.006*** 0.007*** -0.001 -0.007***
## (0.002) (0.002) (0.002) (0.002)
##
## ur 0.039 -0.036 -0.220***
## (0.071) (0.057) (0.031)
##
## alcc.pc -0.528** 1.031*** 6.366***
## (0.237) (0.198) (0.533)
##
## StateMissouri -5.591***
## (0.900)
##
## ------------------------------------------------------------------------------------------
## Observations 1,048 1,048 1,048 1,048
## R2 0.014 0.017 0.378 0.839
## Adjusted R2 0.013 0.014 0.374 0.831
## Residual Std. Error 4.907 (df = 1046) 4.902 (df = 1044) 3.906 (df = 1041) 2.029 (df = 995)
## ==========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
Much better. Try to figure out what the additional options
keep = c("law.officers.pc", "ur", "alcc.pc") and
omit.stat=c("f") did. The help function
?stargazer will be useful. (As stargazer_HC
actually calls stargazer, there is no need to use the help
for stargazer_HC; the above options come from
stargazer.)
If you want to perform a hypothesis test say on \(\beta_3\) (the coefficient on the
alcc.pc variable), then the usual hypothesis to pose is
\(H_0: \beta_3 = 0\) versus \(H_A: \beta_3 \neq 0\).
It is the p-value of that hypothesis test which is represented by the
asterisks next to the estimated coefficient in the standard regression
output. Let’s confirm that by looking back at the results using the
robust standard errors. The estimated coefficient on the
alcc.pc variable is -0.528 and the (**) indicates that the
p-value of that test is smaller than 0.05 (but not smaller than
0.01).
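The p-value behind the asterisks can also be computed by hand from a robust t-statistic. Here is a hedged sketch on simulated data (the dataset, the HC1 variant, and the variable names are illustrative assumptions, not the lab data):

```r
library(AER)  # attaches sandwich (vcovHC) and lmtest

set.seed(4)                                         # simulated data, for illustration only
x <- rnorm(150)
y <- 1 - 0.5 * x + rnorm(150, sd = abs(x) + 0.2)    # heteroskedastic errors
fit <- lm(y ~ x)

se    <- sqrt(diag(vcovHC(fit, type = "HC1")))["x"]  # robust standard error for x
tstat <- coef(fit)["x"] / se                         # t-statistic for H0: beta = 0
2 * pt(-abs(tstat), df = fit$df.residual)            # two-sided p-value
```

This is exactly the calculation the regression output summarises with its asterisks.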
Here is how you can perform this test manually using the
lht (stands for Linear Hypothesis Test) function which is
written to use regression output (here saved in mod3) for
hypothesis testing.
lht(mod3,"alcc.pc=0", white.adjust = TRUE)
##
## Linear hypothesis test:
## alcc.pc = 0
##
## Model 1: restricted model
## Model 2: Age.Adjusted.Rate ~ law.officers.pc + ur + alcc.pc
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 1045
## 2 1044 1 4.8884 0.02725 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The option white.adjust = TRUE is there to ensure that
this procedure uses the heteroskedasticity robust standard errors.
There is a lot of information here, but the important value is the one
displayed under “Pr(>F)”: the p-value. Here it is 0.02725
and, as predicted, it is smaller than 0.05 but larger than 0.01.
Confirm that the p-value for \(H_0: \beta_2 =
0\) versus \(H_A: \beta_2 \neq
0\) (the coefficient on ur in mod3) is
larger than 0.1.
XXXX(XXXX,"XXXX", white.adjust = XXXX)
##
## Linear hypothesis test:
## ur = 0
##
## Model 1: restricted model
## Model 2: Age.Adjusted.Rate ~ law.officers.pc + ur + alcc.pc
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 1045
## 2 1044 1 0.3022 0.5826
The advantage of the lht function is that you can test
different hypotheses, say \(H_0: \beta_3 =
-0.5\) versus \(H_A: \beta_3 \neq
-0.5\) (the coefficient on alcc.pc).
lht(mod3,"alcc.pc=-0.5", white.adjust = TRUE)
##
## Linear hypothesis test:
## alcc.pc = - 0.5
##
## Model 1: restricted model
## Model 2: Age.Adjusted.Rate ~ law.officers.pc + ur + alcc.pc
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 1045
## 2 1044 1 0.0136 0.9073
So, that null hypothesis cannot be rejected at any of the usual significance levels.
Even more usefully, you can use this function to test multiple hypotheses.
Say you want to test whether the inclusion of the additional two
variables (in mod3 as opposed to mod2) is
relevant. If it wasn’t, then the following null hypothesis should be
correct: \(H_0: \beta_2=\beta_3=0\). We
call this a multiple hypothesis.
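As a hint before the exercise, here is a minimal sketch on simulated data showing that lht accepts a vector of restriction strings, which produces a joint F-test (the variables and coefficients here are illustrative, not the lab data):

```r
library(AER)  # lht comes via the car package, which AER loads

set.seed(2)                                   # simulated data, for illustration only
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 1 + 0.5 * d$x1 + rnorm(100)            # only x1 truly matters
m <- lm(y ~ x1 + x2 + x3, data = d)

# Two restrictions tested jointly, with robust standard errors
lht(m, c("x2 = 0", "x3 = 0"), white.adjust = TRUE)
```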
Use the help function (?lht) or search online for advice on
how to use the lht function to test this hypothesis. If you
get it right you should get the following output.
##
## Linear hypothesis test:
## ur = 0
## alcc.pc = 0
##
## Model 1: restricted model
## Model 2: Age.Adjusted.Rate ~ law.officers.pc + ur + alcc.pc
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 1046
## 2 1044 2 2.9501 0.05277 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The hypothesis that neither of these two variables is relevant can only be rejected marginally, at the 10% significance level, as the p-value is 0.05277.
Are you getting the same conclusion if you test the same hypothesis in the model with state fixed effects? You should get the following output.
##
## Linear hypothesis test:
## ur = 0
## alcc.pc = 0
##
## Model 1: restricted model
## Model 2: Age.Adjusted.Rate ~ law.officers.pc + ur + alcc.pc + State
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 997
## 2 995 2 99.982 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here it is very obvious that \(H_0: \beta_2=\beta_3=0\) should be rejected,
as the p-value is very small. What we can conclude from this is that not
including the important state fixed effects also made it difficult to
detect whether the alcc.pc and the ur
variables were important in explaining variation in the rate of firearm
deaths. Recall that moving to a model with region or state fixed effects
even changed the sign of a coefficient.
The techniques you covered in this computer lab are absolutely fundamental to the remainder of this unit, so please ensure that you have not rushed over the material. Even better, you should experiment yourself and answer some of your own questions. And if you need questions, here are a few:
If I include Year as a variable, why does it matter
whether I include Year or
as.factor(Year)?
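The Year question can be explored with a minimal sketch on simulated data (illustrative only): a numeric Year enters the regression as a single linear trend, while as.factor(Year) creates a separate dummy for each year.

```r
set.seed(3)                                           # simulated data, for illustration only
d <- data.frame(Year = rep(2001:2004, each = 25), y = rnorm(100))

coef(lm(y ~ Year, data = d))              # one slope: a single linear time trend
coef(lm(y ~ as.factor(Year), data = d))   # one dummy per year, first year as baseline
```

Comparing the two sets of coefficients should make the difference between a trend and year fixed effects concrete.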