This exercise is part of the ECLR page.
In the Ridge and Lasso exercise we learned how to apply Ridge and LASSO regressions to select covariates in a linear (in parameters) regression model. As you well know, regression parameters do not automatically have a causal interpretation. This is true for standard regressions in which the researcher selects the covariates, for regressions in which the covariates are chosen by an automated selection procedure, and for regressions estimated by Ridge or LASSO.
Consider a regression model of the following form
\[\begin{equation} y_i = \delta D_i + \mathbf{x}_i\mathbf{\beta} + \epsilon_i \end{equation}\]
where \(y_i\) is the outcome for unit \(i\), \(D_i\) is a treatment variable and \(\mathbf{x}_i\) is a (potentially large) vector of covariates.
Here we are interested in the average treatment effect (ATE), as captured by the parameter \(\delta\). It is, of course, important to understand that the coefficient \(\delta\) identifies the average treatment effect of treatment \(D\) only if certain assumptions hold. This exercise is not the place to discuss these in detail. The one assumption that has consequences for this exercise is the conditional independence assumption (CIA). This implies that the potential outcomes (the outcome that eventuates if \(i\) receives treatment and the outcome that eventuates if \(i\) does not receive treatment) are independent of whether treatment is received, as long as we condition on the covariates \(X\).
This is trivially true if the treatment is randomly allocated (e.g. in a randomised controlled trial). However, in the more common situation in which we are dealing with an observational study where treatments are not randomly allocated, this is unlikely to be true. In such situations the treatment variable is potentially endogenous (correlated with the error term), resulting in \(\delta\) not identifying the ATE.
This is problematic if the treatment is correlated with factors that are relevant for the outcome. Here there are two situations to consider:
1. the relevant confounding factors are not observed, in which case including control variables cannot solve the problem and a different identification strategy is needed;
2. the relevant confounding factors are observed, in which case they can be included as control variables \(\mathbf{x}_i\) so that the CIA becomes defensible.
It is when following the strategy in 2 that machine learning methods like the LASSO can be useful, in particular in situations where economic theory does not provide clear guidance on which control variables to include. As it is important to include all relevant variables, this selection should be done as well as possible. This is crucial in particular when the set of candidate variables is large relative to the number of observations.
The way this will be implemented is by making use of the following result (Frisch-Waugh-Lovell theorem). We can estimate the parameter \(\delta\) using the following regression
\[\begin{equation} r^y_i = \delta r^D_i + u_i \end{equation}\]
where \(r^y_i\) is the estimated residual of the regression \(y_i= \mathbf{x}_i\mathbf{\tau}^y + res^y_i\) and \(r^D_i\) is the estimated residual of the regression \(D_i= \mathbf{x}_i\mathbf{\tau}^D + res^D_i\).
LASSO can be applied for these two helper regressions. This is particularly useful if you have a high-dimensional matrix of covariates \(\mathbf{X}\).
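Before turning to the application, here is a minimal sketch of the Frisch-Waugh-Lovell result using simulated data (all variable names and the data-generating process below are made up purely for illustration): the coefficient on the treatment in the full regression is identical to the coefficient from the residual-on-residual regression.
set.seed(123)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
D <- 0.5*x1 - 0.3*x2 + rnorm(n)    # treatment depends on the covariates
y <- 1 + 0.8*D + 0.4*x1 + 0.2*x2 + rnorm(n)
full <- lm(y ~ D + x1 + x2)        # full regression with treatment and covariates
r_y <- residuals(lm(y ~ x1 + x2))  # partial the covariates out of y
r_D <- residuals(lm(D ~ x1 + x2))  # partial the covariates out of D
fwl <- lm(r_y ~ r_D)               # residual-on-residual regression
coef(full)["D"]                    # delta from the full regression
coef(fwl)["r_D"]                   # the identical estimate from the FWL regression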
Let's load up the libraries we will be using in this project.
# load packages
library(car)
library(AER)
library(glmnet)
library(hdm)
library(tidyverse)
library(stargazer)
library(ggplot2)
library(sandwich)
The hdm package offers the functionality we will be using.
The example we will work through here is a cross-country regression, following the work by Barro and Lee (1994, NBER Working Paper), attempting to establish the determinants of differential country growth rates. The dataset for that work comes pre-loaded with the hdm package. We can load it using the following command.
data(GrowthData)
str(GrowthData)
## 'data.frame': 90 obs. of 63 variables:
## $ Outcome : num -0.0243 0.1005 0.0671 0.0641 0.0279 ...
## $ intercept: int 1 1 1 1 1 1 1 1 1 1 ...
## $ gdpsh465 : num 6.59 6.83 8.9 7.57 7.16 ...
## $ bmp1l : num 0.284 0.614 0 0.2 0.174 ...
## $ freeop : num 0.153 0.314 0.204 0.249 0.299 ...
## $ freetar : num 0.04389 0.06183 0.00919 0.03627 0.03737 ...
## $ h65 : num 0.007 0.019 0.26 0.061 0.017 0.023 0.039 0.024 0.402 0.145 ...
## $ hm65 : num 0.013 0.032 0.325 0.07 0.027 0.038 0.063 0.035 0.488 0.173 ...
## $ hf65 : num 0.001 0.007 0.201 0.051 0.007 0.006 0.014 0.013 0.314 0.114 ...
## $ p65 : num 0.29 0.91 1 1 0.82 0.5 0.92 0.69 1 1 ...
## $ pm65 : num 0.37 1 1 1 0.85 0.55 0.94 0.69 1 1 ...
## $ pf65 : num 0.21 0.65 1 1 0.81 0.5 0.92 0.69 1 1 ...
## $ s65 : num 0.04 0.16 0.56 0.24 0.17 0.08 0.17 0.14 0.9 0.28 ...
## $ sm65 : num 0.06 0.23 0.62 0.22 0.15 0.1 0.21 0.14 0.9 0.26 ...
## $ sf65 : num 0.02 0.09 0.51 0.31 0.13 0.07 0.12 0.13 0.9 0.4 ...
## $ fert65 : num 6.67 6.97 3.11 6.26 6.71 6.7 6.72 7.19 2.91 3.07 ...
## $ mort65 : num 0.16 0.145 0.024 0.072 0.12 0.112 0.082 0.121 0.025 0.058 ...
## $ lifee065 : num 3.69 3.93 4.27 4.17 4 ...
## $ gpop1 : num 0.0203 0.0185 0.0188 0.0345 0.031 0.0303 0.032 0.0268 0.0146 0.0155 ...
## $ fert1 : num 6.68 7.11 3.66 6.83 6.82 ...
## $ mort1 : num 0.165 0.154 0.027 0.085 0.131 0.119 0.087 0.131 0.025 0.06 ...
## $ invsh41 : num 0.119 0.1205 0.231 0.1293 0.0793 ...
## $ geetot1 : num 0.0195 0.0556 0.0465 0.0375 0.0257 0.0151 0.0139 0.0173 0.0474 0.0236 ...
## $ geerec1 : num 0.0176 0.0369 0.0365 0.035 0.0224 0.0156 0.0134 0.0165 0.0383 0.0223 ...
## $ gde1 : num 0.019 0.019 0.04 0.011 0.012 0.009 0.007 0.018 0.088 0.021 ...
## $ govwb1 : num 0.0931 0.1589 0.1442 0.1165 0.0971 ...
## $ govsh41 : num 0.116 0.156 0.137 0.202 0.169 ...
## $ gvxdxe41 : num 0.0788 0.1 0.06 0.1562 0.1343 ...
## $ high65 : num 0.12 0.7 16.67 3.1 0.67 ...
## $ highm65 : num 0.23 1.18 17.95 3.4 0.98 ...
## $ highf65 : num 0.01 0.2 15.41 2.8 0.36 ...
## $ highc65 : num 0.09 0.63 4.5 2.11 0.45 0.48 1.09 0.41 11.6 2.25 ...
## $ highcm65 : num 0.18 1.04 5.7 2.28 0.66 ...
## $ highcf65 : num 0.01 0.2 3.31 1.95 0.25 0.21 0.46 0.07 9.27 1.43 ...
## $ human65 : num 0.301 0.706 8.317 3.833 1.9 ...
## $ humanm65 : num 0.568 1.138 8.249 3.86 2.084 ...
## $ humanf65 : num 0.043 0.257 8.384 3.807 1.72 ...
## $ hyr65 : num 0.004 0.027 0.424 0.104 0.022 0.024 0.059 0.02 0.616 0.12 ...
## $ hyrm65 : num 0.008 0.045 0.473 0.114 0.033 0.037 0.097 0.038 0.723 0.166 ...
## $ hyrf65 : num 0 0.008 0.375 0.095 0.012 0.01 0.022 0.003 0.517 0.074 ...
## $ no65 : num 89.5 89.1 1.4 20.6 58.7 ...
## $ nom65 : num 80 82.3 1.4 20.6 55.6 ...
## $ nof65 : num 98.6 96.1 1.4 20.6 61.8 ...
## $ pinstab1 : num 0 0.0232 0 0 0.2 ...
## $ pop65 : int 12359 4630 19678 1482 3006 4568 44752 1750 194303 22283 ...
## $ worker65 : num 0.347 0.27 0.387 0.301 0.331 ...
## $ pop1565 : num 0.444 0.447 0.318 0.467 0.456 ...
## $ pop6565 : num 0.0276 0.0356 0.0767 0.031 0.0263 ...
## $ sec65 : num 0.45 3 36.74 7.6 5.07 ...
## $ secm65 : num 0.75 4.74 33.5 7.5 5.37 ...
## $ secf65 : num 0.17 1.2 39.95 7.7 4.78 ...
## $ secc65 : num 0.13 1.36 15.68 2.76 2.17 ...
## $ seccm65 : num 0.21 2.05 13.19 2.89 2.23 ...
## $ seccf65 : num 0.04 0.64 18.14 2.63 2.11 ...
## $ syr65 : num 0.033 0.173 2.573 0.438 0.257 ...
## $ syrm65 : num 0.057 0.274 2.478 0.453 0.287 ...
## $ syrf65 : num 0.01 0.067 2.667 0.424 0.229 ...
## $ teapri65 : num 47.6 57.1 26.5 27.8 34.5 34.3 46.6 34 28.2 20.3 ...
## $ teasec65 : num 17.3 18 20.7 22.7 17.6 8.1 14.7 16.1 20.6 7.2 ...
## $ ex1 : num 0.0729 0.094 0.1741 0.1265 0.1211 ...
## $ im1 : num 0.0667 0.1438 0.175 0.1496 0.1308 ...
## $ xr65 : num 0.348 0.525 1.082 6.625 2.5 ...
## $ tot1 : num -0.01473 0.00575 -0.01004 -0.00219 0.00328 ...
There are 90 observations and 63 variables (really only 62 as there
is an intercept
column).
This dataset contains a lot of information and the variable
definitions are not immediately obvious. In this
ECLR exercise you can find more details on a somewhat larger dataset
and a link to a data dictionary (readme.txt)
from which you can derive the definitions of the variables in the
GrowthData
dataset. It is not important to understand the
definitions of all variables here, but with the help of that data
dictionary you can for instance establish the following definitions:
human65, Average schooling years in the total population over age 25 in 1965
seccf65, Percentage of "secondary school complete" in female population
freeop, Measure of "Free trade openness"
The two variables we are particularly interested in here are Outcome and gdpsh465. Outcome is defined as the GDP growth rate over a decade (from 1965 to 1975) for a particular country. The covariate gdpsh465 represents the initial (i.e. in 1965) level of \(\log(\text{GDP per capita})\). The question here is whether countries which are poorer at the beginning of that growth period grow faster, ceteris paribus, implying a negative relation between Outcome (\(y_i\)) and gdpsh465 (\(D_i\)) after controlling for relevant covariates (\(\mathbf{X}\)).
The question then is: what are the relevant covariates?
Before we answer this question let us explore some of the data.
The data file has data for 90 countries.
After inspecting the datafile it is apparent that GrowthData ...
All variables are numeric. Evidently they do not contain any country names or letter abbreviations. It could be that one of the numerical variables represents a country code. In the data dictionary you will actually find such a variable, called SHCODE, but that variable is not included in GrowthData.
If you wanted to identify countries you could go to the data exercise linked above and attempt to match variable values, say for pop65, in order to identify individual countries. In that way you will be able to match about half of the countries.
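As a starting point you could, for instance, look at the largest recorded 1965 populations and compare them to published population figures; a small sketch (the choice of columns shown is just for illustration):
GrowthData %>%
  select(pop65, gdpsh465, Outcome) %>%  # keep a few columns that help with matching
  arrange(desc(pop65)) %>%              # largest 1965 populations first
  head(5)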
First we shall plot a histogram of the outcome variable.
p1 <- ggplot(GrowthData,aes(x=Outcome)) + geom_histogram()
p1
As you can see the majority of countries display positive growth rates.
Let's produce a scatterplot that relates the initial level of GDP to the subsequent growth rate.
p2 <- ggplot(GrowthData,aes(x=gdpsh465,y=Outcome)) + geom_point()
p2
From this plot it is impossible to see any obvious relationship between the initial GDP per person (gdpsh465) and the growth rate over the next decade (Outcome). But that is perhaps not surprising, as the above plot ignores any covariates that may help explain variation. Their absence may hide any such relationship.
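One way to make this visual impression more concrete is to overlay a simple least-squares fit on the scatterplot:
p2b <- ggplot(GrowthData, aes(x = gdpsh465, y = Outcome)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)  # add a fitted least-squares line
p2b
The fitted line should be close to flat, in line with the simple regression below.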
A simple regression shows the following.
ols1 <- lm(Outcome~gdpsh465,data = GrowthData)
stargazer(ols1,type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## Outcome
## -----------------------------------------------
## gdpsh465 0.001
## (0.006)
##
## Constant 0.035
## (0.047)
##
## -----------------------------------------------
## Observations 90
## R2 0.001
## Adjusted R2 -0.011
## Residual Std. Error 0.052 (df = 88)
## F Statistic 0.047 (df = 1; 88)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The coefficient on the initial GDP value is clearly not statistically significant. Even calculating heteroskedasticity-robust standard errors (see the next code chunk) does not change this.
se_hc_ols1 <- sqrt(diag(vcovHC(ols1, type = "HC1"))) # calculate HC1 robust standard errors
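To see these robust standard errors attached to the coefficient estimates you could, for example, use the coeftest function (from the lmtest package, which is loaded together with AER):
coeftest(ols1, vcov. = vcovHC(ols1, type = "HC1"))  # OLS coefficients with HC1 robust standard errors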
In the paper by Barro and Lee (1994, NBER Working Paper) the authors make economic arguments for the covariates that are to be included. They list the following types of variables:
lifee065
seccm65 and seccf65
invsh41
gvxdxe41
log(1+bmp1l)
pinstab1
ols2mod <- "Outcome~gdpsh465+lifee065+seccm65+seccf65+invsh41+gvxdxe41+log(1+bmp1l)+pinstab1"
ols2 <- lm(ols2mod,data = GrowthData)
We want to calculate heteroskedasticity-robust standard errors. This is easily done using a little wrapper function for stargazer. Download the file stargazer_HC.R, save it to your working folder and then load the function using
source("stargazer_HC.R")
Then you can display regression results using this function. Robust standard errors will be automatically displayed.
stargazer_HC(ols1,ols2)
##
## ============================================================
## Dependent variable:
## ----------------------------------------
## Outcome
## (1) (2)
## ------------------------------------------------------------
## gdpsh465 0.001 -0.038***
## (0.005) (0.013)
##
## lifee065 0.103*
## (0.053)
##
## seccm65 0.006***
## (0.002)
##
## seccf65 -0.005***
## (0.002)
##
## invsh41 0.171**
## (0.073)
##
## gvxdxe41 -0.230**
## (0.094)
##
## log(1 + bmp1l) -0.106***
## (0.027)
##
## pinstab1 -0.035
## (0.021)
##
## Constant 0.035 -0.084
## (0.042) (0.158)
##
## ------------------------------------------------------------
## Observations 90 90
## R2 0.001 0.424
## Adjusted R2 -0.011 0.367
## Residual Std. Error 0.052 (df = 88) 0.041 (df = 81)
## F Statistic 0.047 (df = 1; 88) 7.458*** (df = 8; 81)
## ============================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
You now find a statistically significant effect of the initial level of GDP.
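If you cannot use the wrapper function, you can achieve the same result by handing the robust standard errors to stargazer directly via its se argument (one vector of standard errors per model):
se_hc_ols2 <- sqrt(diag(vcovHC(ols2, type = "HC1")))  # HC1 robust standard errors for ols2
stargazer(ols1, ols2, type = "text", se = list(se_hc_ols1, se_hc_ols2))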
In the previous section we used the economic expertise of Barro and Lee to decide which covariates to include in \(\mathbf{X}\). You may wonder whether such expertise is actually needed. Couldn't we just include all available variables (the kitchen sink approach)? Let's try this.
ols3mod <- "Outcome~." # "~." indicates that all other variables in the dataset are included as covariates
ols3 <- lm(ols3mod,data = GrowthData)
stargazer_HC(ols1,ols2,ols3,keep = "gdpsh465")
##
## ===================================================================================
## Dependent variable:
## ---------------------------------------------------------------
## Outcome
## (1) (2) (3)
## -----------------------------------------------------------------------------------
## gdpsh465 0.001 -0.038*** -0.009
## (0.005) (0.013) (0.032)
##
## -----------------------------------------------------------------------------------
## Observations 90 90 90
## R2 0.001 0.424 0.887
## Adjusted R2 -0.011 0.367 0.641
## Residual Std. Error 0.052 (df = 88) 0.041 (df = 81) 0.031 (df = 28)
## F Statistic 0.047 (df = 1; 88) 7.458*** (df = 8; 81) 3.607*** (df = 61; 28)
## ===================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
While we now get a slightly negative coefficient we also get a large standard error resulting in a statistically insignificant coefficient estimate.
The aim is now to evaluate whether the economic expertise can be replaced by the LASSO, or whether we can even make the case that the LASSO, as a statistical tool, is able to identify variables that are statistically relevant but were not identified by the economic expertise.
In order to use the rlassoEffect function, which is part of the hdm package, we need to define matrices containing the variables \(y_i\), \(D_i\) and \(\mathbf{x}_i\). This is done here:
y = as.matrix(GrowthData[, "Outcome"])
d = as.matrix(GrowthData[, "gdpsh465"])
X = as.matrix(GrowthData)[, -c(1, 2, 3)]
varnames = colnames(GrowthData)
varnames.X = colnames(X) # produces a list of all possible covariates
As you can see, for the definition of X the first three columns of data (Outcome, intercept and gdpsh465) have been excluded. An intercept is automatically added.
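A quick sanity check confirms the dimensions of these objects; with 90 countries and 63 columns in GrowthData, X should have 90 rows and 60 columns:
dim(X)            # expect 90 rows and 60 columns
head(varnames.X)  # first few candidate covariate names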
The process that is now implemented is as follows: the rlassoEffect function estimates the two helper regressions
\[\begin{equation} \text{Direct: }y_i = \mathbf{x}_i\mathbf{\gamma} + u_i \end{equation}\]
and
\[\begin{equation} \text{Indirect: }d_i = \mathbf{x}_i\mathbf{\kappa} + v_i \end{equation}\]
For each of these a LASSO is applied to select variables from \(\mathbf{x}_i\). The two selections will typically be different and the Post Double Selection estimator then uses the union of these two selections, \(\mathbf{x}^*_i\), in a new model.
\[\begin{equation} y_i = \delta D_i + \mathbf{x}^*_i\mathbf{\beta} + \epsilon_i \end{equation}\]
This is what is being achieved by the following command:
doublesel.effect = rlassoEffect(x = X, y = y, d = d, method = "double selection")
summary(doublesel.effect)
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## d1 -0.05001 0.01579 -3.167 0.00154 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As you can see, the estimated effect of the initial level of GDP
(gdpsh465
) is stronger than that identified in the model
that used the expert judgement (OLS2
).
You may now be interested in which variables this procedure selected for \(\mathbf{x}^*_i\). This information is contained in doublesel.effect$selection.index.
ds.sel <- varnames.X[doublesel.effect$selection.index]
ds.sel
## [1] "bmp1l" "freetar" "hm65" "sf65" "lifee065" "humanf65" "pop6565"
Only the variables bmp1l and lifee065 were also included in the expertise-driven model OLS2.
In this section we shall reconstruct the double selection process in the above example. We wish to understand why the particular selection (ds.sel) was made. To understand this we shall estimate the two helper regressions of the Double Selection method separately:
\[\begin{equation} \text{Direct: }y_i = \mathbf{x}_i\mathbf{\gamma} + u_i \end{equation}\]
and
\[\begin{equation} \text{Indirect: }d_i = \mathbf{x}_i\mathbf{\kappa} + v_i \end{equation}\]
One of the variables in ds.sel is humanf65. In which of the two helper regressions was this variable selected?
Recall that the set of covariates used in the post double selection model is the union of the covariate selections in the two helper models. Therefore, any variable that appears in ds.sel should have been selected in at least one of the two models.
Let us run the two LASSO selection models.
GrowthData.direkt <- GrowthData %>% select(-gdpsh465)
direkt <- rlasso(Outcome~.,data = GrowthData.direkt, post = TRUE)
You could look at the actual regression output with summary(direkt), but here we are only interested in which variables have been selected by the LASSO. The model output direkt$index contains that information and we will display only the variables for which that index is TRUE.
direkt$index[direkt$index==TRUE]
## bmp1l
## TRUE
So, the only variable selected is bmp1l
.
Now we shall estimate the indirect model.
GrowthData.indirekt <- GrowthData %>% select(-Outcome)
indirekt <- rlasso(gdpsh465~.,data = GrowthData.indirekt, post = TRUE)
indirekt$index[indirekt$index==TRUE]
## freetar hm65 sf65 lifee065 humanf65 pop6565
## TRUE TRUE TRUE TRUE TRUE TRUE
These are the remaining six variables in ds.sel; together with bmp1l, selected in the direct model, they form the union of covariates used in the Double Selection procedure.
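As a final check you can refit the post double selection regression by OLS, using gdpsh465 together with the union of selected covariates stored in ds.sel; the coefficient on gdpsh465 should be (very close to) the estimate of about -0.050 reported by rlassoEffect above.
ps.mod <- paste("Outcome ~ gdpsh465 +", paste(ds.sel, collapse = " + "))  # build the formula from the selection
ols.ps <- lm(ps.mod, data = GrowthData)
coef(ols.ps)["gdpsh465"]  # should be close to the double selection estimate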