In this section we demonstrate how to use instrumental variables (IV) estimation (or, more precisely, Two-Stage Least Squares, 2SLS) to estimate the parameters of a linear regression model. If you want more theoretical background on why these techniques may be needed, refer to any decent econometrics textbook.
Here we will be brief on the problem setup and thorough on the implementation! When you estimate a linear regression model, say
\(y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + u\)
the most crucial assumption you make is that the explanatory variables \(x_1\) to \(x_3\) are uncorrelated with the error term \(u\). Of course, the error term \(u\) is unobservable and hence it is impossible to test this assumption empirically (notwithstanding a related test introduced below), so you ought to think very carefully about whether there is any reason to suspect that this assumption is breached. Seasoned econometricians would immediately rattle off simultaneity bias, measurement error and omitted relevant variables as the three causes for this to happen.
In some such situations you can actually fix the problem, e.g. by including additional explanatory variables in the model, but in others that is impossible and you need to accept that there is a high probability that, say, \(x_3\) is correlated with \(u\). We then call \(x_3\) an endogenous variable, and all explanatory variables that do not have that issue are called exogenous. If you nevertheless persist in estimating that model by Ordinary Least Squares (OLS), you will have to accept that your estimated coefficients come from a random distribution that on average will not deliver the correct (yet unknown) value; in technical lingo, the estimators are biased.
To the rescue comes instrumental variables (IV) estimation. To use this technique we need what is called an instrumental variable. If only \(x_3\) is potentially correlated with the error term, we need at least one such variable, but more can be useful as well; in general you always need at least as many instruments as you have endogenous variables. These instruments need the following crucial properties: they must be correlated with the endogenous variable, uncorrelated with the error term, and they should not themselves be part of the model explaining the dependent variable \(y\).
Here is a brief outline of what happens when you use IV in the form of a 2SLS regression:

1. First stage: regress the endogenous variable (here \(x_3\)) on all instruments and all exogenous explanatory variables, and save the fitted values.
2. Second stage: re-estimate the original model by OLS, replacing the endogenous variable with the fitted values from the first stage.

This sounds pretty easy. There is a slight complication: the standard errors that the second-stage OLS regression delivers are incorrect, and we need to calculate different standard errors. But that will happen automatically in the procedure below.
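To see that there is no magic involved, here is a sketch of the two stages run by hand with lm on simulated data (not the mroz data used below; all variable names here are made up for the illustration):

```r
set.seed(123)
n <- 10000
z <- rnorm(n)                        # instrument: moves x but is unrelated to u
u <- rnorm(n)                        # unobservable error term
x <- 0.5 * z + 0.8 * u + rnorm(n)    # endogenous regressor: correlated with u
y <- 1 + 2 * x + u                   # true slope on x is 2

coef(lm(y ~ x))["x"]                 # OLS: biased upwards, around 2.4

# First stage: regress the endogenous variable on the instrument
stage1 <- lm(x ~ z)
xhat <- fitted(stage1)

# Second stage: replace x with its fitted values from the first stage
stage2 <- lm(y ~ xhat)
coef(stage2)["xhat"]                 # close to the true slope of 2
```

Note that the standard errors reported by the second-stage lm are the incorrect ones mentioned above; the ivreg function used below corrects them for us.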
The R package needed is the AER package, which is also recommended in the context of estimating robust standard errors. Included in that package is a function called ivreg, which we will use. We explain how to use it by walking through an example.
If you use IV a lot in your work, you may well want to pack all of the following into one convenient function (just as Alan Fernihough has done here). But if you are applying IV for the first time, it is actually very instructive to go through some of the steps in a bit more detail. It is also good to see that there really is not a lot of technical magic … just a great idea!
library(tidyverse) # for general data handling
library(stargazer) # package for nicer regression output
library(AER) # load the AER package
We will use the Women’s Wages dataset (mroz.csv) to illustrate the use of IV regression. The dependent variable we use here is the log wage, lwage, and we are interested in whether the years of education (educ) have a positive influence on this log wage (here we mirror the analysis in Wooldridge’s Example 15.1 in his Introductory Econometrics textbook).
Let’s first import the data
setwd("YOUR/DIRECTORY/PATH") # This sets the working directory
mydata <- read.csv("mroz.csv", na.strings = ".") # Opens mroz.csv from working directory
And also let’s remove all observations with missing wages from the dataframe
mydata <- subset(mydata, is.na(wage) == FALSE) # remove observations with missing wages from dataset
or, if you wish to use the tidyverse methodology,
mydata <- mydata %>% filter(!is.na(wage))
An extremely simple model would be to estimate the following OLS regression which models lwage as a function of a constant and educ.
reg_ex0 <- lm(lwage~educ,data=mydata)
stargazer(reg_ex0, type = "text", title = "OLS estimation")
##
## OLS estimation
## ===============================================
## Dependent variable:
## ---------------------------
## lwage
## -----------------------------------------------
## educ 0.110***
## (0.014)
##
## Constant -0.180
## (0.180)
##
## -----------------------------------------------
## Observations 428
## R2 0.120
## Adjusted R2 0.120
## Residual Std. Error 0.680 (df = 426)
## F Statistic 57.000*** (df = 1; 426)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
This seems to indicate that every additional year of education increases the wage by almost 11% (recall the interpretation of a coefficient in a log-lin model!). The issue with this sort of model is that education is most likely to be correlated with individual characteristics that are important for the person’s wage, but not modeled (and hence captured by the error term).
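As a quick check of that interpretation, the exact percentage effect implied by a log-lin coefficient of 0.110 is:

```r
# exact percentage change in wage for one additional year of education
(exp(0.110) - 1) * 100   # about 11.6 percent
```

so the usual "coefficient times 100" reading is a good approximation here.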
What we need is an instrument that meets the conditions outlined above. As in Wooldridge’s example, we use the father’s education (fatheduc) as an instrument. The way to do this is as follows:
reg_iv0 <- ivreg(lwage~educ|fatheduc,data=mydata)
stargazer(reg_iv0, type = "text", title = "IV estimation")
##
## IV estimation
## ===============================================
## Dependent variable:
## ---------------------------
## lwage
## -----------------------------------------------
## educ 0.059*
## (0.035)
##
## Constant 0.440
## (0.450)
##
## -----------------------------------------------
## Observations 428
## R2 0.093
## Adjusted R2 0.091
## Residual Std. Error 0.690 (df = 426)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The ivreg function works very similarly to the lm function (as usual, use ?ivreg to get more detailed help). In fact the only difference is the specification of the instrument, |fatheduc, which follows the model specification. Behind the vertical line we find the instrument used to instrument the educ variable (the order of the variables after the vertical line doesn’t matter).
Clearly, the estimated effect of an additional year of education has dropped substantially and is now only marginally significant. It is, of course, often a feature of IV estimation that the estimated standard errors are significantly larger than those of the OLS estimators. The size of the standard error depends a lot on the strength of the relation between the endogenous explanatory variable and the instrument, which can be checked by looking at the R-squared of the regression of educ on fatheduc (which turns out to be 0.1958 if you check it).
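That first-stage check is straightforward to run yourself; assuming mydata has been loaded as above, the following reproduces that R-squared:

```r
# Strength of the instrument: regress the endogenous variable on the instrument
first_stage <- lm(educ ~ fatheduc, data = mydata)
summary(first_stage)$r.squared   # about 0.196 with the mroz data
```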
In order to illustrate the full functionality of the ivreg procedure, we re-estimate the model with extra explanatory variables and more instruments than endogenous variables, which means that we are really applying a 2SLS estimation (this is the model estimated in Wooldridge’s Example 15.5). Let’s start by estimating this model by OLS, as we need this result later.
reg_1 <- lm(lwage~educ+age+exper+expersq, data=mydata) # OLS estimation
stargazer(reg_1, type = "text", title = "OLS estimation")
##
## OLS estimation
## ===============================================
## Dependent variable:
## ---------------------------
## lwage
## -----------------------------------------------
## educ 0.110***
## (0.014)
##
## age 0.0003
## (0.005)
##
## exper 0.042***
## (0.013)
##
## expersq -0.001**
## (0.0004)
##
## Constant -0.530*
## (0.280)
##
## -----------------------------------------------
## Observations 428
## R2 0.160
## Adjusted R2 0.150
## Residual Std. Error 0.670 (df = 423)
## F Statistic 20.000*** (df = 4; 423)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The estimated coefficient for educ is 0.108 (displayed as 0.110 after rounding) with standard error 0.014. Then we estimate the 2SLS regression with fatheduc and motheduc as instruments.
reg_iv1 <- ivreg(lwage~educ+age+exper+expersq|fatheduc+motheduc+age+exper+expersq,data=mydata)
stargazer(reg_1, reg_iv1, type = "text", title = "OLS and IV estimation")
##
## OLS and IV estimation
## ===================================================================
## Dependent variable:
## ------------------------------------
## lwage
## OLS instrumental
## variable
## (1) (2)
## -------------------------------------------------------------------
## educ 0.110*** 0.061*
## (0.014) (0.032)
##
## age 0.0003 -0.0004
## (0.005) (0.005)
##
## exper 0.042*** 0.044***
## (0.013) (0.013)
##
## expersq -0.001** -0.001**
## (0.0004) (0.0004)
##
## Constant -0.530* 0.067
## (0.280) (0.460)
##
## -------------------------------------------------------------------
## Observations 428 428
## R2 0.160 0.140
## Adjusted R2 0.150 0.130
## Residual Std. Error (df = 423) 0.670 0.680
## F Statistic 20.000*** (df = 4; 423)
## ===================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Before the vertical line we can see the model that is to be estimated, lwage~educ+age+exper+expersq. All the action is after the vertical line: first come the instrumental variables used to instrument educ, fatheduc+motheduc; this is followed by all the explanatory variables that are considered exogenous, age+exper+expersq.
When your model contains many variables, this way of calling an IV estimation can be quite unwieldy, as you have to repeat all the exogenous variables. A slightly different, more economical way of asking R to do the same thing is as follows:
reg_iv1b <- ivreg(lwage~educ+age+exper+expersq|.-educ+fatheduc+motheduc,data=mydata)
After the vertical line you are basically telling R which variable to remove from the instrument set (the endogenous variable, .-educ) and which ones to add (+fatheduc+motheduc). Make sure you don’t forget the dot straight after the vertical line when you use this way of specifying the instruments. The results in reg_iv1 and reg_iv1b are identical.
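Since this model is over-identified (two instruments for one endogenous variable), it is also worth looking at the standard IV diagnostics that the AER package can report. Calling summary with diagnostics = TRUE delivers a weak-instruments F-test, the Wu-Hausman endogeneity test and the Sargan over-identification test (a sketch, assuming mydata is loaded as above):

```r
# Re-fit the 2SLS model and request the standard IV diagnostic tests
reg_iv1b <- ivreg(lwage ~ educ + age + exper + expersq |
                    . - educ + fatheduc + motheduc, data = mydata)
summary(reg_iv1b, diagnostics = TRUE)
```

The weak-instruments statistic is the F-test on the excluded instruments in the first stage (as a common rule of thumb, values above 10 are reassuring), and the Wu-Hausman statistic is the test of endogeneity alluded to at the start of this section.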