Introduction

In this computer lab you will practice the following skills: estimating regressions with heteroskedasticity-robust standard errors, including region and state fixed effects, and testing single and joint hypotheses about regression coefficients.

Let’s start by loading some useful packages

library(readxl)       # enable the read_excel function
library(tidyverse)    # for almost all data handling tasks
library(ggplot2)      # plotting toolbox
library(stargazer)    # for nice regression output
library(AER)          # access to HC robust standard errors and the lht function

and setting the working directory

setwd(XXXX)

You should also save the separately supplied stargazer_HC.r file in your working directory. It makes it straightforward to estimate and compare regressions with robust standard errors. Once you have done that, include the following line in your code to make the function available:

source("stargazer_HC.r")  # makes the stargazer_HC robust-regression function available

Context

We are continuing work that eventually aims to replicate some of the work in this paper:

Siegel et al (2019) The Impact of State Firearm Laws on Homicide and Suicide Deaths in the USA, 1991–2016: a Panel Study, J Gen Intern Med 34(10):2021–8.

Data Import

As in Computer Lab 3 we import the dataset for the gun law example; however, this time use the new csv file “US_Gun_example_v2.csv”, which now includes the proportion of 18-to-24-year-olds and information on the region in which each state is located.

data2 <- read.csv("../data/US_Gun_example_v2.csv", stringsAsFactors = TRUE)          # import data
str(data2)
## 'data.frame':    1071 obs. of  22 variables:
##  $ X.1              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ State_code       : Factor w/ 51 levels "AK","AL","AR",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year             : int  2005 2020 2002 2013 2010 2006 2009 2014 2018 2015 ...
##  $ State            : Factor w/ 51 levels "Alabama","Alaska",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ X                : int  205 970 52 613 460 256 409 664 868 715 ...
##  $ Population       : int  666835 732441 642270 737626 710246 675150 698895 737075 736624 738430 ...
##  $ Mechanism        : Factor w/ 1 level "Firearm": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pol_ubc          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pol_vmisd        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pol_mayissue     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.Adjusted.Rate: num  17.5 23.5 19.9 19.5 20.5 ...
##  $ logy             : num  2.86 3.16 2.99 2.97 3.02 ...
##  $ ur               : num  6.93 8.31 7.17 6.99 8.13 ...
##  $ law.officers     : int  1838 2002 1857 1937 1944 1937 1977 1938 1985 1936 ...
##  $ law.officers.pc  : num  276 273 289 263 274 ...
##  $ vcrime           : int  4162 6077 3594 5332 4506 4574 4402 5196 6508 5980 ...
##  $ vcrime.pc        : num  624 830 560 723 634 ...
##  $ alcc.pc          : num  2.64 2.83 2.99 2.75 3.03 ...
##  $ incarc           : int  3433 4352 2991 5054 3771 3371 3846 5728 4235 5247 ...
##  $ incarc.pc        : num  515 594 466 685 531 ...
##  $ prop18.24        : num  11 NA 10.2 10.8 10.6 ...
##  $ Region           : Factor w/ 4 levels "Midwest","Northeast",..: 4 4 4 4 4 4 4 4 4 4 ...

You should find that data2 has 1,071 rows and 22 variables.

Regression

Previous work

In last week’s computer lab we concluded by estimating these two regression models:

\(mod2: AAR_i = \alpha + \gamma_1~lo.pc_i + v_i\)

\(mod3: AAR_i = \alpha + \beta_1~lo.pc_i + \beta_2~ur_i + \beta_3~alcc.pc_i + u_i\)

mod2 <- lm(Age.Adjusted.Rate ~ law.officers.pc, data = data2)
mod3 <- lm(Age.Adjusted.Rate ~ law.officers.pc + ur + alcc.pc, data = data2)
stargazer(mod2, mod3, type = "text")
## 
## ====================================================================
##                                   Dependent variable:               
##                     ------------------------------------------------
##                                    Age.Adjusted.Rate                
##                               (1)                      (2)          
## --------------------------------------------------------------------
## law.officers.pc             0.006***                0.007***        
##                             (0.002)                  (0.002)        
##                                                                     
## ur                                                    0.039         
##                                                      (0.076)        
##                                                                     
## alcc.pc                                              -0.528*        
##                                                      (0.286)        
##                                                                     
## Constant                   10.301***                11.151***       
##                             (0.501)                  (0.865)        
##                                                                     
## --------------------------------------------------------------------
## Observations                 1,048                    1,048         
## R2                           0.014                    0.017         
## Adjusted R2                  0.013                    0.014         
## Residual Std. Error    4.907 (df = 1046)        4.902 (df = 1044)   
## F Statistic         14.461*** (df = 1; 1046) 6.118*** (df = 3; 1044)
## ====================================================================
## Note:                                    *p<0.1; **p<0.05; ***p<0.01

We were potentially puzzled by the very low \(R^2\) for these regressions (meaning that less than 2% of the variation in the dependent variable is explained by variation in the explanatory variables). We plotted a scatter graph in which different colors represent different states.

ggplot(data2, aes(law.officers.pc, Age.Adjusted.Rate, color = State)) +
  geom_point() +
  guides(color = "none") + # removes the legend
  ggtitle("Number of law enforcement officers v Deaths by Firearms")

What you can see here is that observations for the same state cluster together and that different states differ markedly from one another. This between-state variation is not reflected in the above estimation.

Robust standard errors

This is how far we got last time. Before we continue to work on that issue, we will look at the issue of performing inference.

You will have learned in the lectures that in practice we wish to allow for the possibility that error terms are not homoskedastic, meaning that error term variances may vary in some systematic way. If that is the case, then the way in which standard regression functions (like lm) estimate coefficient standard errors (the values in parentheses underneath the parameter estimates in the regression output) is incorrect. There are different ways in which this can be corrected. The stargazer_HC function has one of them implemented, and by displaying results using stargazer_HC you ensure that inference on regression coefficients is robust to the presence of error heteroskedasticity.
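If you are curious what such a correction looks like in practice, you can compute White standard errors directly with the vcovHC and coeftest functions, which become available when you load AER. This is only a sketch of one possible variant: the type = "HC1" choice is an assumption, and stargazer_HC may implement a different one.

```r
library(AER)  # loads the sandwich and lmtest packages (vcovHC, coeftest)

# mod2 as estimated earlier in this lab:
# mod2 <- lm(Age.Adjusted.Rate ~ law.officers.pc, data = data2)

# Heteroskedasticity-robust (White) covariance matrix of the coefficients;
# "HC1" is one common variant of the correction (an assumption here)
V_robust <- vcovHC(mod2, type = "HC1")

# Coefficient table with robust standard errors: the estimates are the
# same as in summary(mod2), only the standard errors change
coeftest(mod2, vcov = V_robust)
```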

So let’s display the same regression results, but with heteroskedasticity-robust standard errors, often also called White standard errors.

stargazer_HC(mod2, mod3, type_out = "text")
## 
## ====================================================================
##                                   Dependent variable:               
##                     ------------------------------------------------
##                                    Age.Adjusted.Rate                
##                               (1)                      (2)          
## --------------------------------------------------------------------
## law.officers.pc             0.006***                0.007***        
##                             (0.002)                  (0.002)        
##                                                                     
## ur                                                    0.039         
##                                                      (0.071)        
##                                                                     
## alcc.pc                                             -0.528**        
##                                                      (0.237)        
##                                                                     
## Constant                   10.301***                11.151***       
##                             (0.628)                  (0.871)        
##                                                                     
## --------------------------------------------------------------------
## Observations                 1,048                    1,048         
## R2                           0.014                    0.017         
## Adjusted R2                  0.013                    0.014         
## Residual Std. Error    4.907 (df = 1046)        4.902 (df = 1044)   
## F Statistic         14.461*** (df = 1; 1046) 6.118*** (df = 3; 1044)
## ====================================================================
## Note:                                    *p<0.1; **p<0.05; ***p<0.01
##                                Robust standard errors in parenthesis

You should compare these standard errors to the ones in the previous output. They are different. But note that calculating White standard errors leaves the actual coefficient estimates unchanged.

From now on we will typically calculate heteroskedasticity robust standard errors.

Multiple regression and fixed effects

Region Effects

Let us consider whether the variation we see between states is actually variation between regions in the U.S. First check how many states are in each region. There are many ways to achieve this. In the end you should be able to replicate the information in the following table.

## # A tibble: 4 × 2
##   Region        n
##   <fct>     <int>
## 1 Midwest      12
## 2 Northeast     9
## 3 South        17
## 4 West         13

One way of doing that would be by completing the following code skeleton:

table1 <- data2 %>% XXXX(XXXX) %>% # pick any year
  XXXX(XXXX) %>% # groups by region
  XXXX(n = n()) %>% # summarises each group by calculating obs
  print()  
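If you prefer base R, here is one alternative sketch (it assumes, as in the skeleton above, that you pick a single year so that each state appears exactly once):

```r
# Keep the rows for one year only, then count states in each region;
# with one row per state in a given year, counting rows equals counting states
one_year <- subset(data2, Year == 2002)
table(one_year$Region)
```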

If we believe that a significant proportion of the variation in the dependent variable can be explained by regional variation, then we should include region fixed effects. This is done by including the Region variable in the regression. As Region is a factor variable, R will automatically create and include a dummy variable for each region.

mod4 <- lm(Age.Adjusted.Rate ~ law.officers.pc + ur + alcc.pc + Region, data = data2)
stargazer_HC(mod2, mod3, mod4, type_out = "text")
## 
## ==============================================================================================
##                                                Dependent variable:                            
##                     --------------------------------------------------------------------------
##                                                 Age.Adjusted.Rate                             
##                               (1)                      (2)                      (3)           
## ----------------------------------------------------------------------------------------------
## law.officers.pc             0.006***                0.007***                  -0.001          
##                             (0.002)                  (0.002)                  (0.002)         
##                                                                                               
## ur                                                    0.039                   -0.036          
##                                                      (0.071)                  (0.057)         
##                                                                                               
## alcc.pc                                             -0.528**                 1.031***         
##                                                      (0.237)                  (0.198)         
##                                                                                               
## RegionNortheast                                                              -4.292***        
##                                                                               (0.299)         
##                                                                                               
## RegionSouth                                                                  4.466***         
##                                                                               (0.319)         
##                                                                                               
## RegionWest                                                                   2.564***         
##                                                                               (0.361)         
##                                                                                               
## Constant                   10.301***                11.151***                8.747***         
##                             (0.628)                  (0.871)                  (0.761)         
##                                                                                               
## ----------------------------------------------------------------------------------------------
## Observations                 1,048                    1,048                    1,048          
## R2                           0.014                    0.017                    0.378          
## Adjusted R2                  0.013                    0.014                    0.374          
## Residual Std. Error    4.907 (df = 1046)        4.902 (df = 1044)        3.906 (df = 1041)    
## F Statistic         14.461*** (df = 1; 1046) 6.118*** (df = 3; 1044) 105.431*** (df = 6; 1041)
## ==============================================================================================
## Note:                                                              *p<0.1; **p<0.05; ***p<0.01
##                                                          Robust standard errors in parenthesis

As an example, the new variable RegionWest takes the value 1 for all states in the West and 0 for all others. You can see that the inclusion of the Region dummy variables has substantially increased the \(R^2\). We had four regions, but only three dummy variables were included. Why is there no dummy variable for the Midwest region?
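You can inspect the dummy coding R applies to a factor with model.matrix. A small self-contained sketch (the four labels are the regions from the dataset; the toy factor is made up for illustration):

```r
# A toy factor with the four region labels from the dataset
region <- factor(c("Midwest", "Northeast", "South", "West"))

# The design matrix R would build for a regression on this factor:
# inspect which levels receive their own dummy column
model.matrix(~ region)
```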

Also compare the coefficient estimates between mod3 and mod4. Do you notice that the effect of alcohol consumption, which was previously estimated to be significantly negative, is now statistically significant and positive? Clearly regional variation is important in explaining variation in the number of firearm deaths. Leaving these variables out introduces omitted variable bias into the remaining coefficient estimates.

State Effects

You may wonder whether one shouldn’t include state rather than region dummies. And that would be a fair suggestion. Estimate a new model mod5 which is identical to mod4 other than replacing the Region variable with the State variable. Then create an output in which you display models mod2 to mod5.

You should receive a really long output, as the new regression now includes a lot of state dummy variables. Display the results instead with the following line.

stargazer_HC(mod2, mod3, mod4, mod5, keep = c("law.officers.pc", "ur", "alcc.pc"), omit.stat = c("f"), type_out = "text")
## 
## ==========================================================================================
##                                              Dependent variable:                          
##                     ----------------------------------------------------------------------
##                                               Age.Adjusted.Rate                           
##                            (1)               (2)               (3)              (4)       
## ------------------------------------------------------------------------------------------
## law.officers.pc         0.006***          0.007***           -0.001          -0.007***    
##                          (0.002)           (0.002)           (0.002)          (0.002)     
##                                                                                           
## ur                                          0.039            -0.036          -0.220***    
##                                            (0.071)           (0.057)          (0.031)     
##                                                                                           
## alcc.pc                                   -0.528**          1.031***          6.366***    
##                                            (0.237)           (0.198)          (0.533)     
##                                                                                           
## StateMissouri                                                                -5.591***    
##                                                                               (0.900)     
##                                                                                           
## ------------------------------------------------------------------------------------------
## Observations              1,048             1,048             1,048            1,048      
## R2                        0.014             0.017             0.378            0.839      
## Adjusted R2               0.013             0.014             0.374            0.831      
## Residual Std. Error 4.907 (df = 1046) 4.902 (df = 1044) 3.906 (df = 1041) 2.029 (df = 995)
## ==========================================================================================
## Note:                                                          *p<0.1; **p<0.05; ***p<0.01
##                                                      Robust standard errors in parenthesis

Much better. Try to figure out what the additional options keep = c("law.officers.pc", "ur", "alcc.pc") and omit.stat = c("f") did. The help function ?stargazer will be useful (as stargazer_HC actually calls stargazer, there is no need to use the help for stargazer_HC; the above options come from stargazer).

Inference on regression coefficients

If you want to perform a hypothesis test say on \(\beta_3\) (the coefficient on the alcc.pc variable), then the usual hypothesis to pose is \(H_0: \beta_3 = 0\) versus \(H_A: \beta_3 \neq 0\).

It is the p-value of that hypothesis test which is represented by the asterisks next to the estimated coefficient in the standard regression output. Let’s confirm that by looking back at the results using the robust standard errors. The estimated coefficient on the alcc.pc variable is -0.528, and the (**) indicates that the p-value of that test is smaller than 0.05 (but not smaller than 0.01).

Here is how you can perform this test manually using the lht function (short for Linear Hypothesis Test), which uses estimated regression output (here saved in mod3) for hypothesis testing.

lht(mod3,"alcc.pc=0", white.adjust = TRUE)
## 
## Linear hypothesis test:
## alcc.pc = 0
## 
## Model 1: restricted model
## Model 2: Age.Adjusted.Rate ~ law.officers.pc + ur + alcc.pc
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F  Pr(>F)  
## 1   1045                    
## 2   1044  1 4.8884 0.02725 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The option white.adjust = TRUE ensures that this procedure uses the heteroskedasticity-robust standard errors. There is a lot of information here, but the important value is the one displayed under “Pr(>F)”: the p-value. Here it is 0.02725 and, as predicted, it is smaller than 0.05 but larger than 0.01.
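You can reproduce that p-value directly from the reported F statistic and degrees of freedom using R’s F distribution function:

```r
# Upper-tail probability P(F > 4.8884) with 1 and 1044 degrees of freedom,
# i.e. the Pr(>F) value reported by lht above
p_value <- pf(4.8884, df1 = 1, df2 = 1044, lower.tail = FALSE)
p_value
```

This should match the 0.02725 reported in the lht output (up to rounding).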

Confirm that the p-value for \(H_0: \beta_2 = 0\) versus \(H_A: \beta_2 \neq 0\) (the coefficient on ur in mod3) is larger than 0.1.

XXXX(XXXX,"XXXX", white.adjust = XXXX)
## 
## Linear hypothesis test:
## ur = 0
## 
## Model 1: restricted model
## Model 2: Age.Adjusted.Rate ~ law.officers.pc + ur + alcc.pc
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F Pr(>F)
## 1   1045                 
## 2   1044  1 0.3022 0.5826

The advantage of the lht function is that you can test different hypotheses. Say \(H_0: \beta_3 = -0.5\) versus \(H_A: \beta_3 \neq -0.5\) (the coefficient on alcc.pc).

lht(mod3,"alcc.pc=-0.5", white.adjust = TRUE)
## 
## Linear hypothesis test:
## alcc.pc = - 0.5
## 
## Model 1: restricted model
## Model 2: Age.Adjusted.Rate ~ law.officers.pc + ur + alcc.pc
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F Pr(>F)
## 1   1045                 
## 2   1044  1 0.0136 0.9073

So, that null hypothesis cannot be rejected at any of the usual significance levels.

Better still, you can use this function to test multiple hypotheses. Say you want to test whether the inclusion of the two additional variables (in mod3 as opposed to mod2) is relevant. If it were not, then the following null hypothesis should be correct: \(H_0: \beta_2=\beta_3=0\). We call this a joint (or multiple) hypothesis.

Use the help function (?lht) or search online for advice on how to use the lht function to test this hypothesis. If you get it right you should get the following output.

## 
## Linear hypothesis test:
## ur = 0
## alcc.pc = 0
## 
## Model 1: restricted model
## Model 2: Age.Adjusted.Rate ~ law.officers.pc + ur + alcc.pc
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F  Pr(>F)  
## 1   1046                    
## 2   1044  2 2.9501 0.05277 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The hypothesis that neither of these two variables is relevant can only be rejected marginally, at the 10% significance level, as the p-value is 0.05277.

Are you getting the same conclusion if you test the same hypothesis in the model with state fixed effects? You should get the following output.

## 
## Linear hypothesis test:
## ur = 0
## alcc.pc = 0
## 
## Model 1: restricted model
## Model 2: Age.Adjusted.Rate ~ law.officers.pc + ur + alcc.pc + State
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F    Pr(>F)    
## 1    997                        
## 2    995  2 99.982 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here it is very obvious (the p-value is smaller than 2.2e-16) that \(H_0: \beta_2=\beta_3=0\) should be rejected. What we can conclude from this is that omitting the important state fixed effects also made it difficult to detect whether the alcc.pc and ur variables were important in explaining variation in the rate of firearm deaths. Recall that moving to a model with region or state fixed effects even changed the sign of the alcc.pc coefficient.

The techniques you covered in this computer lab are absolutely fundamental to the remainder of this unit, so please ensure that you have not rushed over the material. Even better, you should experiment yourself and answer some of your own questions. And if you need questions, here are a few:

  • What happens if you include other variables?
  • What other types of hypothesis can I test with lht?
  • Can I also include Year fixed effects?
  • If I include Year as a variable, why does it matter whether I include Year or as.factor(Year)?
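As a starting point for the last two questions, here is a sketch (the model names mod6a and mod6b are made up here; estimate both and compare the outputs yourself):

```r
# Year as a numeric variable: a single linear time-trend coefficient
mod6a <- lm(Age.Adjusted.Rate ~ law.officers.pc + ur + alcc.pc + State + Year,
            data = data2)

# Year as a factor: one dummy per year, i.e. year fixed effects
mod6b <- lm(Age.Adjusted.Rate ~ law.officers.pc + ur + alcc.pc + State +
              as.factor(Year), data = data2)

stargazer_HC(mod6a, mod6b,
             keep = c("law.officers.pc", "ur", "alcc.pc"),
             omit.stat = c("f"), type_out = "text")
```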