This exercise is part of the ECLR page.

Introduction

In the Ridge and Lasso exercise we learned how to apply Ridge and LASSO regressions to select covariates in a linear (in parameters) regression model. As you well know, regression parameters do not automatically have a causal interpretation. This is true for standard regressions in which the researcher selects the covariates, for regressions in which covariates are selected by subsample regression methods, and for regressions estimated by Ridge or LASSO.

Consider a regression model of the following form

\[\begin{equation} y_i = \delta D_i + \mathbf{x}_i\mathbf{\beta} + \epsilon_i \end{equation}\]

where \(y_i\) is the outcome variable, \(D_i\) is a treatment variable, \(\mathbf{x}_i\) is a vector of covariates with associated parameter vector \(\mathbf{\beta}\), and \(\epsilon_i\) is an error term.

Here we are interested in the average treatment effect (ATE), as captured by the parameter \(\delta\). It is, of course, important to understand that the coefficient \(\delta\) identifies the average treatment effect of treatment \(D\) only under certain assumptions. This exercise is not the place to discuss these in detail. The one assumption that has consequences for this exercise is the conditional independence assumption (CIA). It requires that the potential outcomes (the outcome that eventuates if \(i\) receives treatment and the outcome that eventuates if \(i\) does not receive treatment) are independent of whether treatment is received, as long as we condition on the covariates \(\mathbf{X}\).

This is trivially true if the treatment is randomly allocated (e.g. in a randomised controlled trial). However, in the more common situation of an observational study, where treatments are not randomly allocated, it is unlikely to hold. In such situations the treatment variable is potentially endogenous (correlated with the error term), with the result that \(\delta\) does not identify the ATE.

This is problematic if the treatment is correlated with factors that are relevant for the outcome. Here there are two situations to consider:

  1. All or some of the relevant (for the outcome variable) factors that are correlated with the treatment (\(D\)) are unobservable and hence necessarily captured in the error term. This situation is called "selection on unobservables". In such a situation, estimating a regression like the one above cannot recover the ATE; alternative identification strategies are needed (e.g. instrumental variables or difference-in-differences).
  2. All relevant (for the outcome variable) factors that are correlated with the treatment (\(D\)) are observable and hence can be included in the set of covariates ("selection on observables"). In this situation the ATE is identified by \(\delta\) if all these variables are included in \(\mathbf{x}_i\) in the above regression (see the simulation sketch after this list).
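
To see what is at stake, here is a minimal simulation sketch (all names and numbers invented for illustration): omitting an observable confounder biases the estimate of the treatment effect, while controlling for it recovers the ATE.

set.seed(42)
n <- 1000
x <- rnorm(n)                    # observable confounder
D <- 0.8 * x + rnorm(n)          # treatment correlated with x, i.e. not randomly allocated
y <- 1.5 * D + 2 * x + rnorm(n)  # true treatment effect is 1.5
coef(lm(y ~ D))["D"]             # omitting x: biased, around 2.5
coef(lm(y ~ D + x))["D"]         # controlling for x: close to the true 1.5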

It is in the process of following the strategy in 2 that machine learning methods like the LASSO can be useful, in particular in situations where economic theory does not provide clear guidance on what these variables should be. As it is important to include all relevant variables, this selection should be done as well as possible. This is crucial in particular when the set of candidate variables is large relative to the number of observations.

The way this will be implemented makes use of the following result (the Frisch-Waugh-Lovell theorem): we can estimate the parameter \(\delta\) using the following regression

\[\begin{equation} r^y_i = \delta r^D_i + u_i \end{equation}\]

where \(r^y_i\) is the estimated residual of the regression \(y_i= \mathbf{x}_i\mathbf{\tau}^y + res^y_i\) and \(r^D_i\) is the estimated residual of the regression \(D_i= \mathbf{x}_i\mathbf{\tau}^D + res^D_i\).
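
This equivalence is easy to verify numerically. Below is a small sketch with simulated data (names invented): the coefficient from the residual-on-residual regression matches the coefficient on \(D\) in the full regression.

set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
D <- 0.5 * x1 - 0.3 * x2 + rnorm(n)
y <- 2 * D + x1 + x2 + rnorm(n)
r.y <- residuals(lm(y ~ x1 + x2))  # r^y: y purged of the covariates
r.D <- residuals(lm(D ~ x1 + x2))  # r^D: D purged of the covariates
coef(lm(r.y ~ r.D))["r.D"]         # identical to ...
coef(lm(y ~ D + x1 + x2))["D"]     # ... the coefficient on D here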

LASSO can be applied to these two helper regressions. This is particularly useful if you have a high-dimensional matrix of covariates \(\mathbf{X}\).

Setup

Let's load up the libraries we will be using in this project.

# load packages
library(car)        # regression diagnostics
library(AER)        # applied econometrics tools (also loads lmtest and sandwich)
library(glmnet)     # Ridge and LASSO estimation
library(hdm)        # high-dimensional metrics, incl. rlasso and rlassoEffect
library(tidyverse)  # data manipulation and plotting
library(stargazer)  # regression tables
library(ggplot2)    # plotting (already part of tidyverse)
library(sandwich)   # heteroskedasticity robust covariance estimators

The hdm package is the one that offers the functionality we will be using.
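
If any of these packages are not yet installed on your machine, install them first, for instance:

install.packages("hdm")  # repeat for any other missing package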

Empirical example

The example we will work through here is a cross-country regression following the work by Barro and Lee (1994, NBER Working Paper), attempting to establish the determinants of differential country growth rates. The dataset for that work comes pre-packaged with the hdm package. We can load it using the following command.

data(GrowthData)
str(GrowthData)
## 'data.frame':    90 obs. of  63 variables:
##  $ Outcome  : num  -0.0243 0.1005 0.0671 0.0641 0.0279 ...
##  $ intercept: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ gdpsh465 : num  6.59 6.83 8.9 7.57 7.16 ...
##  $ bmp1l    : num  0.284 0.614 0 0.2 0.174 ...
##  $ freeop   : num  0.153 0.314 0.204 0.249 0.299 ...
##  $ freetar  : num  0.04389 0.06183 0.00919 0.03627 0.03737 ...
##  $ h65      : num  0.007 0.019 0.26 0.061 0.017 0.023 0.039 0.024 0.402 0.145 ...
##  $ hm65     : num  0.013 0.032 0.325 0.07 0.027 0.038 0.063 0.035 0.488 0.173 ...
##  $ hf65     : num  0.001 0.007 0.201 0.051 0.007 0.006 0.014 0.013 0.314 0.114 ...
##  $ p65      : num  0.29 0.91 1 1 0.82 0.5 0.92 0.69 1 1 ...
##  $ pm65     : num  0.37 1 1 1 0.85 0.55 0.94 0.69 1 1 ...
##  $ pf65     : num  0.21 0.65 1 1 0.81 0.5 0.92 0.69 1 1 ...
##  $ s65      : num  0.04 0.16 0.56 0.24 0.17 0.08 0.17 0.14 0.9 0.28 ...
##  $ sm65     : num  0.06 0.23 0.62 0.22 0.15 0.1 0.21 0.14 0.9 0.26 ...
##  $ sf65     : num  0.02 0.09 0.51 0.31 0.13 0.07 0.12 0.13 0.9 0.4 ...
##  $ fert65   : num  6.67 6.97 3.11 6.26 6.71 6.7 6.72 7.19 2.91 3.07 ...
##  $ mort65   : num  0.16 0.145 0.024 0.072 0.12 0.112 0.082 0.121 0.025 0.058 ...
##  $ lifee065 : num  3.69 3.93 4.27 4.17 4 ...
##  $ gpop1    : num  0.0203 0.0185 0.0188 0.0345 0.031 0.0303 0.032 0.0268 0.0146 0.0155 ...
##  $ fert1    : num  6.68 7.11 3.66 6.83 6.82 ...
##  $ mort1    : num  0.165 0.154 0.027 0.085 0.131 0.119 0.087 0.131 0.025 0.06 ...
##  $ invsh41  : num  0.119 0.1205 0.231 0.1293 0.0793 ...
##  $ geetot1  : num  0.0195 0.0556 0.0465 0.0375 0.0257 0.0151 0.0139 0.0173 0.0474 0.0236 ...
##  $ geerec1  : num  0.0176 0.0369 0.0365 0.035 0.0224 0.0156 0.0134 0.0165 0.0383 0.0223 ...
##  $ gde1     : num  0.019 0.019 0.04 0.011 0.012 0.009 0.007 0.018 0.088 0.021 ...
##  $ govwb1   : num  0.0931 0.1589 0.1442 0.1165 0.0971 ...
##  $ govsh41  : num  0.116 0.156 0.137 0.202 0.169 ...
##  $ gvxdxe41 : num  0.0788 0.1 0.06 0.1562 0.1343 ...
##  $ high65   : num  0.12 0.7 16.67 3.1 0.67 ...
##  $ highm65  : num  0.23 1.18 17.95 3.4 0.98 ...
##  $ highf65  : num  0.01 0.2 15.41 2.8 0.36 ...
##  $ highc65  : num  0.09 0.63 4.5 2.11 0.45 0.48 1.09 0.41 11.6 2.25 ...
##  $ highcm65 : num  0.18 1.04 5.7 2.28 0.66 ...
##  $ highcf65 : num  0.01 0.2 3.31 1.95 0.25 0.21 0.46 0.07 9.27 1.43 ...
##  $ human65  : num  0.301 0.706 8.317 3.833 1.9 ...
##  $ humanm65 : num  0.568 1.138 8.249 3.86 2.084 ...
##  $ humanf65 : num  0.043 0.257 8.384 3.807 1.72 ...
##  $ hyr65    : num  0.004 0.027 0.424 0.104 0.022 0.024 0.059 0.02 0.616 0.12 ...
##  $ hyrm65   : num  0.008 0.045 0.473 0.114 0.033 0.037 0.097 0.038 0.723 0.166 ...
##  $ hyrf65   : num  0 0.008 0.375 0.095 0.012 0.01 0.022 0.003 0.517 0.074 ...
##  $ no65     : num  89.5 89.1 1.4 20.6 58.7 ...
##  $ nom65    : num  80 82.3 1.4 20.6 55.6 ...
##  $ nof65    : num  98.6 96.1 1.4 20.6 61.8 ...
##  $ pinstab1 : num  0 0.0232 0 0 0.2 ...
##  $ pop65    : int  12359 4630 19678 1482 3006 4568 44752 1750 194303 22283 ...
##  $ worker65 : num  0.347 0.27 0.387 0.301 0.331 ...
##  $ pop1565  : num  0.444 0.447 0.318 0.467 0.456 ...
##  $ pop6565  : num  0.0276 0.0356 0.0767 0.031 0.0263 ...
##  $ sec65    : num  0.45 3 36.74 7.6 5.07 ...
##  $ secm65   : num  0.75 4.74 33.5 7.5 5.37 ...
##  $ secf65   : num  0.17 1.2 39.95 7.7 4.78 ...
##  $ secc65   : num  0.13 1.36 15.68 2.76 2.17 ...
##  $ seccm65  : num  0.21 2.05 13.19 2.89 2.23 ...
##  $ seccf65  : num  0.04 0.64 18.14 2.63 2.11 ...
##  $ syr65    : num  0.033 0.173 2.573 0.438 0.257 ...
##  $ syrm65   : num  0.057 0.274 2.478 0.453 0.287 ...
##  $ syrf65   : num  0.01 0.067 2.667 0.424 0.229 ...
##  $ teapri65 : num  47.6 57.1 26.5 27.8 34.5 34.3 46.6 34 28.2 20.3 ...
##  $ teasec65 : num  17.3 18 20.7 22.7 17.6 8.1 14.7 16.1 20.6 7.2 ...
##  $ ex1      : num  0.0729 0.094 0.1741 0.1265 0.1211 ...
##  $ im1      : num  0.0667 0.1438 0.175 0.1496 0.1308 ...
##  $ xr65     : num  0.348 0.525 1.082 6.625 2.5 ...
##  $ tot1     : num  -0.01473 0.00575 -0.01004 -0.00219 0.00328 ...

There are 90 observations and 63 variables (really only 62 as there is an intercept column).

Exploratory analysis

This dataset contains a lot of information and the variable definitions are not immediately obvious. In this ECLR exercise you can find more details on a somewhat larger dataset and a link to a data dictionary (readme.txt) from which you can derive the definitions of the variables in the GrowthData dataset. It is not important to understand the definitions of all variables here, but with the help of that data dictionary you can for instance establish the following definitions:

  • human65, Average schooling years in the total population over age 25 in 1965
  • seccf65, Percentage of "secondary school complete" in the female population
  • freeop, Measure of "free trade openness"

The two variables we are particularly interested in here are Outcome, which is defined as the GDP growth rate over a decade (from 1965 to 1975) for a particular country, and the covariate gdpsh465, which represents the initial (i.e. 1965) level of log(GDP per capita). The question is whether countries which are poorer at the beginning of that growth period grow faster, ceteris paribus, implying a negative relation between Outcome (\(y_i\)) and gdpsh465 (\(D_i\)) after controlling for relevant covariates (\(\mathbf{X}\)).

The question then is: what are the relevant covariates?

Before we answer this question let us explore some of the data.

Which countries does the data file cover? After inspecting the datafile it is apparent that GrowthData does not tell us: all variables are numeric and none contain country names or letter abbreviations. One of the numerical variables could in principle represent a country; in the data dictionary you will indeed find such a variable, called SHCODE, but it is not included in GrowthData.

If you wanted to identify countries you could go to the data exercise linked above and attempt to match variable values, say for pop65, to identify countries. In that way you will be able to match about half of the countries. A hypothetical sketch of such a matching follows.
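
The lookup table below is invented for illustration; real population values and country names would have to come from the Barro-Lee data files.

# invented lookup table of 1965 populations and country labels
lookup <- tibble(POP65 = c(12359, 4630),
                 COUNTRY = c("Country A", "Country B"))
GrowthData %>%
  inner_join(lookup, by = c("pop65" = "POP65")) %>%
  select(pop65, COUNTRY)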

First we shall plot a histogram of the outcome variable.

p1 <- ggplot(GrowthData,aes(x=Outcome)) + geom_histogram()
p1

As you can see, the majority of countries display positive growth rates.
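
A quick numerical check of this claim:

mean(GrowthData$Outcome > 0)  # proportion of countries with positive growth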

Let's produce a scatterplot that plots the initial level of GDP against the subsequent growth rate.

p2 <- ggplot(GrowthData,aes(x=gdpsh465,y=Outcome)) + geom_point()
p2

From this plot it is impossible to see any obvious relationship between the initial GDP per person (gdpsh465) and the growth rate over the next decade (Outcome). But that is perhaps not surprising, as the above plot ignores any covariates that may help explain variation in growth; their absence may hide such a relationship.

A simple regression shows the following.

ols1 <- lm(Outcome~gdpsh465,data = GrowthData)
stargazer(ols1,type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               Outcome          
## -----------------------------------------------
## gdpsh465                       0.001           
##                               (0.006)          
##                                                
## Constant                       0.035           
##                               (0.047)          
##                                                
## -----------------------------------------------
## Observations                    90             
## R2                             0.001           
## Adjusted R2                   -0.011           
## Residual Std. Error       0.052 (df = 88)      
## F Statistic             0.047 (df = 1; 88)     
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

The coefficient on the initial GDP value is clearly not statistically significant. Even calculating heteroskedasticity robust standard errors (see the next code chunk) does not change this.

se_hc_ols1 <- sqrt(diag(vcovHC(ols1, type = "HC1"))) # calculate HC robust s.e.
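
To display the coefficient tests with these robust standard errors you can, for instance, use the coeftest function from the lmtest package (loaded together with AER):

coeftest(ols1, vcov. = vcovHC(ols1, type = "HC1"))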

Allowing for covariates: economic judgement

In the paper by Barro and Lee (1994, NBER Working Paper) the authors make economic arguments for the covariates that are to be included. They list the following types of variables:

  • life expectancy, lifee065
  • percentage of secondary school completions in male and female population, seccm65 and seccf65
  • ratio of real gross domestic investment to GDP, invsh41
  • ratio of government consumption (not incl defense and education) to GDP, gvxdxe41
  • (the log of) the black market premium on foreign exchange, log(1+bmp1l)
  • the average number of revolutions and assassinations per year, pinstab1

ols2mod <- "Outcome~gdpsh465+lifee065+seccm65+seccf65+invsh41+gvxdxe41+log(1+bmp1l)+pinstab1"
ols2 <- lm(ols2mod,data = GrowthData)

We want to calculate heteroskedasticity robust standard errors. This is easily done using a little wrapper function for stargazer. Download the file stargazer_HC.R, save it to your working folder and then load the function using

source("stargazer_HC.R")

Then you can display regression results using this function. Robust standard errors will be automatically displayed.

stargazer_HC(ols1,ols2)
## 
## ============================================================
##                               Dependent variable:           
##                     ----------------------------------------
##                                     Outcome                 
##                            (1)                  (2)         
## ------------------------------------------------------------
## gdpsh465                  0.001              -0.038***      
##                          (0.005)              (0.013)       
##                                                             
## lifee065                                      0.103*        
##                                               (0.053)       
##                                                             
## seccm65                                      0.006***       
##                                               (0.002)       
##                                                             
## seccf65                                      -0.005***      
##                                               (0.002)       
##                                                             
## invsh41                                       0.171**       
##                                               (0.073)       
##                                                             
## gvxdxe41                                     -0.230**       
##                                               (0.094)       
##                                                             
## log(1 + bmp1l)                               -0.106***      
##                                               (0.027)       
##                                                             
## pinstab1                                      -0.035        
##                                               (0.021)       
##                                                             
## Constant                  0.035               -0.084        
##                          (0.042)              (0.158)       
##                                                             
## ------------------------------------------------------------
## Observations                90                  90          
## R2                        0.001                0.424        
## Adjusted R2               -0.011               0.367        
## Residual Std. Error  0.052 (df = 88)      0.041 (df = 81)   
## F Statistic         0.047 (df = 1; 88) 7.458*** (df = 8; 81)
## ============================================================
## Note:                            *p<0.1; **p<0.05; ***p<0.01
##                        Robust standard errors in parenthesis

You now find a statistically significant effect of the initial level of GDP.

Using LASSO for variable selection

In the previous section we used the economic expertise of Barro and Lee to decide which covariates to include in \(\mathbf{X}\). You may wonder whether such expertise is actually needed. Couldn't we just include all available variables (the kitchen sink approach)? Let's try this.

ols3mod <- "Outcome~."  # ~. indicates that you want to include all variables in the dataset
ols3 <- lm(ols3mod,data = GrowthData)
stargazer_HC(ols1,ols2,ols3,keep = "gdpsh465")
## 
## ===================================================================================
##                                           Dependent variable:                      
##                     ---------------------------------------------------------------
##                                                 Outcome                            
##                            (1)                  (2)                   (3)          
## -----------------------------------------------------------------------------------
## gdpsh465                  0.001              -0.038***               -0.009        
##                          (0.005)              (0.013)               (0.032)        
##                                                                                    
## -----------------------------------------------------------------------------------
## Observations                90                  90                     90          
## R2                        0.001                0.424                 0.887         
## Adjusted R2               -0.011               0.367                 0.641         
## Residual Std. Error  0.052 (df = 88)      0.041 (df = 81)       0.031 (df = 28)    
## F Statistic         0.047 (df = 1; 88) 7.458*** (df = 8; 81) 3.607*** (df = 61; 28)
## ===================================================================================
## Note:                                                   *p<0.1; **p<0.05; ***p<0.01
##                                               Robust standard errors in parenthesis

While we now get a slightly negative coefficient, we also get a large standard error, resulting in a statistically insignificant coefficient estimate.

The aim is now to evaluate whether the economic expertise can be replaced by the LASSO, or even whether the LASSO can identify variables that are statistically relevant but were not picked up by the economic expertise.

In order to use the rlassoEffect function which is part of the hdm package we need to define matrices with variables \(y_i\), \(D_i\) and \(\mathbf{x_i}\). This is done here:

y = as.matrix(GrowthData[, "Outcome"])   # outcome variable
d = as.matrix(GrowthData[, "gdpsh465"])  # treatment variable
X = as.matrix(GrowthData)[, -c(1, 2, 3)] # covariates, excluding Outcome, intercept and gdpsh465
varnames = colnames(GrowthData)
varnames.X = colnames(X) # produces a list of all possible covariates

As you can see, for the definition of X the first three columns of the data (Outcome, intercept and gdpsh465) have been excluded. An intercept is automatically added by the estimation function.
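
A quick dimension check confirms the setup:

dim(X)  # 90 observations and 60 candidate covariates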

The process that is now implemented is that the rlassoEffect function will estimate two helper regressions

\[\begin{equation} \text{Direct: }y_i = \mathbf{x}_i\mathbf{\gamma} + u_i \end{equation}\]

and

\[\begin{equation} \text{Indirect: }d_i = \mathbf{x}_i\mathbf{\kappa} + v_i \end{equation}\]

For each of these a LASSO is applied to select variables from \(\mathbf{x}_i\). The two selections will typically be different and the Post Double Selection estimator then uses the union of these two selections, \(\mathbf{x}^*_i\), in a new model.

\[\begin{equation} y_i = \delta D_i + \mathbf{x}^*_i\mathbf{\beta} + \epsilon_i \end{equation}\]

This is what is being achieved by the following command:

doublesel.effect = rlassoEffect(x = X, y = y, d = d, method = "double selection")
summary(doublesel.effect)
## [1] "Estimates and significance testing of the effect of target variables"
##    Estimate. Std. Error t value Pr(>|t|)   
## d1  -0.05001    0.01579  -3.167  0.00154 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you can see, the estimated effect of the initial level of GDP (gdpsh465), about -0.050, is stronger than the -0.038 identified in the model that used the expert judgement (ols2).

You may now be interested in which variables this procedure selected for \(\mathbf{x}^*_i\). This information is contained in doublesel.effect$selection.index.

ds.sel <- varnames.X[doublesel.effect$selection.index]
ds.sel
## [1] "bmp1l"    "freetar"  "hm65"     "sf65"     "lifee065" "humanf65" "pop6565"

Only the variables bmp1l and lifee065 were also included in the expertise-driven model ols2.

Investigating the double selection method

In this section we shall reconstruct the double selection process in the above example. We wish to understand why the particular selection (ds.sel) was made. To do so we re-estimate the two helper regressions


\[\begin{equation} \text{Direct: }y_i = \mathbf{x}_i\mathbf{\gamma} + u_i \end{equation}\]

and

\[\begin{equation} \text{Indirect: }d_i = \mathbf{x}_i\mathbf{\kappa} + v_i \end{equation}\]

One of the variables in ds.sel is humanf65. In which of the two helper regressions was this variable selected?

Recall that the set of covariates used in the post double selection model is the union of the covariate selections in the two helper models. Therefore, any variable that appears in ds.sel must have been selected in at least one of the two models.

Let us run the two LASSO selection models.

GrowthData.direkt <- GrowthData %>% select(-gdpsh465)  # drop the treatment variable
direkt <- rlasso(Outcome~.,data = GrowthData.direkt, post = TRUE)

You could look at the actual regression output with summary(direkt), but here we are only interested in which variables have been selected by the LASSO. The model output direkt$index contains that information and we will filter out all the variables for which that index is TRUE.

direkt$index[direkt$index==TRUE]
## bmp1l 
##  TRUE

So, the only variable selected in the direct regression is bmp1l.

Now we shall estimate the indirect helper regression (indirekt).

GrowthData.indirekt <- GrowthData %>% select(-Outcome)  # drop the outcome variable
indirekt <- rlasso(gdpsh465~.,data = GrowthData.indirekt, post = TRUE)
indirekt$index[indirekt$index==TRUE]
##  freetar     hm65     sf65 lifee065 humanf65  pop6565 
##     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE

These are the remaining six variables that were selected in the Double Selection procedure; together with bmp1l from the direct regression they make up the seven variables in ds.sel. This also answers the earlier question: humanf65 was selected in the indirect helper regression.
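
As a final check we can rebuild the post double selection regression by hand. This is a sketch: if the selections above coincide with those made inside rlassoEffect, the coefficient on gdpsh465 should reproduce the estimate reported earlier (about -0.050), although the standard errors reported by rlassoEffect need not equal the plain OLS ones.

# union of the two LASSO selections
sel.union <- union(names(which(direkt$index)), names(which(indirekt$index)))
# post double selection regression: treatment plus the union of selected covariates
pds.form <- as.formula(paste("Outcome ~ gdpsh465 +",
                             paste(sel.union, collapse = " + ")))
pds.ols <- lm(pds.form, data = GrowthData)
coef(pds.ols)["gdpsh465"]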