This exercise is part of the ECLR page.
Here we will demonstrate the use of LASSO and Ridge regressions to deal with situations in which you estimate a regression model that contains a large number of potential explanatory variables. These types of models are mainly used for prediction purposes. The predictions may be the final aim of the work, or they may be used as an intermediate step in another analysis.
We begin by loading a range of packages.
# load packages
library(car)
library(AER)
library(glmnet)
library(hdm)
library(stats)
require(graphics)
library(leaps)
library(tidyverse)
library(haven)
library(stargazer)
library(ggplot2)
library(sandwich)
library(coefplot)
library(hdm)
Let's load the dataset which contains an extract from the Panel Study of Income Dynamics (PSID). You can download this file from here: PSIDsmall.txt. The datafile is saved as a text file which can be imported using the read.delim function.
## load data set
PSID1 <- read.delim("../data/PSIDsmall.txt",header=TRUE)
Let's see what variables are contained in this dataset.
names(PSID1)
## [1] "WAGE" "AGE" "AGE2" "AGEW" "AGEY" "CHILDREN"
## [7] "DEGREEW" "FEDUCA" "FEDUCH" "GRADCOLLW" "GRADHS" "GRADHSW"
## [13] "HAPPINESS" "MARRIED" "MEDUCA" "MEDUCH" "SEX" "YEARSCOLL"
## [19] "STATE" "GRADCOLL" "DEGREE"
summary(PSID1)
## WAGE AGE AGE2 AGEW
## Min. : 1.00 Min. :18.00 Min. : 324 Min. : 0.0
## 1st Qu.: 10.00 1st Qu.:28.00 1st Qu.: 784 1st Qu.: 0.0
## Median : 14.00 Median :36.00 Median :1296 Median : 0.0
## Mean : 17.99 Mean :39.21 Mean :1715 Mean :18.9
## 3rd Qu.: 20.00 3rd Qu.:50.00 3rd Qu.:2500 3rd Qu.:37.0
## Max. :975.00 Max. :85.00 Max. :7225 Max. :78.0
## AGEY CHILDREN DEGREEW FEDUCA
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.0000
## Median : 0.000 Median :0.0000 Median :0.0000 Median : 0.0000
## Mean : 2.776 Mean :0.8201 Mean :0.2967 Mean : 0.6266
## 3rd Qu.: 4.000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.: 0.0000
## Max. :17.000 Max. :9.0000 Max. :6.0000 Max. :20.0000
## FEDUCH GRADCOLLW GRADHS GRADHSW
## Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:3.000 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.0000
## Median :4.000 Median :0.0000 Median :1.000 Median :0.0000
## Mean :3.987 Mean :0.6826 Mean :1.194 Mean :0.5109
## 3rd Qu.:5.000 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :8.000 Max. :5.0000 Max. :3.000 Max. :3.0000
## HAPPINESS MARRIED MEDUCA MEDUCH
## Min. :0.000 Min. :1.000 Min. : 0.0000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 0.0000 1st Qu.:3.000
## Median :2.000 Median :2.000 Median : 0.0000 Median :4.000
## Mean :2.238 Mean :2.078 Mean : 0.5849 Mean :4.138
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.: 0.0000 3rd Qu.:6.000
## Max. :5.000 Max. :5.000 Max. :20.0000 Max. :8.000
## SEX YEARSCOLL STATE GRADCOLL
## Min. :1.000 Min. :0.000 Min. : 0.00 Min. :0.000
## 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:17.00 1st Qu.:0.000
## Median :1.000 Median :1.000 Median :28.00 Median :1.000
## Mean :1.343 Mean :1.336 Mean :28.61 Mean :1.538
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:41.00 3rd Qu.:5.000
## Max. :2.000 Max. :5.000 Max. :56.00 Max. :5.000
## DEGREE
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4155
## 3rd Qu.:0.0000
## Max. :6.0000
Let's look at the distribution of wages (measured as hourly wages).
phist <- ggplot(PSID1, aes(x=WAGE)) + geom_histogram()
phist
You can see that there are a few individuals with very large hourly wages. Let us exclude wages that are larger than USD200 per hour and have another look at the wage distribution.
PSID <- PSID1 %>% filter(WAGE <200)
phist <- ggplot(PSID, aes(x=WAGE)) + geom_histogram()
phist
These are observations from the PSID and to do serious work with this dataset we need to understand the variables. The PSID surveys households on their income but also on a number of household and personal characteristics. The following variable search page can be useful here. The main person interviewed is the household head; if there is a couple this was typically the man. You will see some variables which exist in two versions, for instance AGE and AGEW. The former refers to the household head and the latter to the wife's age. You can tell that this labeling is the result of patriarchal thinking. In later versions of the PSID this labeling has changed to household head and (where appropriate) the spouse.
Variable Name | Definition |
---|---|
WAGE | Hourly wage of household head |
AGE | Age of household (HH) head |
AGE2 | Squared Age of HH head |
AGEW | Age of wife (0 if no wife/spouse) |
AGEY | Age of youngest child (0 if no child, children aged 0 and 1 will have an entry of "1") |
CHILDREN | number of children |
DEGREE | highest degree earned (0 = not applicable, 1 = Associate, 2 = Bachelor, 3 = Master, 4 = PhD, 5 = Law, LLB, 6 = Medical, MD ), several reasons for 0, e.g. no college attendance GRADCOLL = 5 |
DEGREEW | highest degree earned (0 = not applicable, 1 = Associate, 2 = Bachelor, 3 = Master, 4 = PhD, 5 = Law, LLB, 6 = Medical, MD ), several reasons for 0, e.g. no college attendance GRADCOLLW = 5 or no spouse |
FEDUCA | Years of foreign education of father, 0 if no education abroad |
FEDUCH | Father completed education in the US (1 = 0 to 5 grades, 2 = 6 to 8 grades, 3 = 9 to 11 grades, 4 = 12 grades, High School, 5 = HS plus further non-academic education, 6 = HS + some college but no completed degree, 7 = 15 to 16 years, College degree, 8 = 17+ years, higher than College degree, 0 not applicable) |
MEDUCA | Years of foreign education of mother, 0 if no education abroad |
MEDUCH | Mother completed education in the US (1 = 0 to 5 grades, 2 = 6 to 8 grades, 3 = 9 to 11 grades, 4 = 12 grades, High School, 5 = HS plus further non-academic education, 6 = HS + some college but no completed degree, 7 = 15 to 16 years, College degree, 8 = 17+ years, higher than College degree, 0 not applicable) |
GRADCOLL | Graduated from college (1 = Yes, 5 = no, 0 = not applicable) |
GRADCOLLW | Graduated from college (1 = Yes, 5 = no, 0 = not applicable or no wife/spouse) |
GRADHS | HH head graduate from High School (1 = Graduated, 2 = General Educational Diploma, 3 = Neither, 4 = not applicable) |
GRADHSW | Wife graduate from High School (1 = Graduated, 2 = General Educational Diploma, 3 = Neither, 4 = not applicable or no wife/spouse) |
HAPPINESS | During the past 30 days, how much of the time did you feel extremely happy (1 = All of the time to 5 = None of the time) |
MARRIED | 1 = Married, 2 = Never Married, 3 = Widowed, 4 = Divorced, 5 = Separated |
SEX | 1 = Male, 2 = Female |
YEARSCOLL | Highest year of college completed |
STATE | State identifier, e.g. 5 = Arkansas |
We will soon explore some aspects of these variables. But first we need to address a fundamental data structure question.
A fundamental problem in modern data science is that of over-fitting the data. The potential availability of many covariates (explanatory variables) and/or the ability to fit highly non-linear models (regression trees, random forests, neural networks, to mention a few) leads to the possibility of fitting data very accurately.
However, fitting the data accurately (in-sample) may not lead to very good out-of-sample predictions, meaning predictions of the outcome variable for observations that were not involved in the estimation process. That, however, is usually the aim.
To emulate this process we will separate the data into a base (or estimation or training) dataset which we will use to estimate a model (PSID_train) and a test dataset which is used to assess the model on data that were not involved in the estimation process (PSID_test).
In real problems the latter may only become available after the former has been used to estimate the model. Here we set the testing dataset to be a randomly selected subset of our data, approximately 30% of the observations.
set.seed(1234) # We set the random seed to ensure that we always randomly select the same set of data
test_prop = 0.3
sel_test <- (runif(n=dim(PSID)[1], min=0, max=1) < test_prop)
sel_train <- !sel_test
The two variables sel_test and sel_train are vectors which we will use for the remainder of this project to select the training and testing datasets.
What values do you find in the sel_test and sel_train vectors?
Both sel_test and sel_train contain as many elements as there are rows in PSID. Each element is either TRUE or FALSE. sel_train is defined as the opposite of sel_test, so when a row is TRUE in sel_test it is FALSE in sel_train. All rows that are TRUE in sel_test belong to the test dataset and all rows that are TRUE in sel_train belong to the training dataset. sel_test was defined such that each element had a 30% probability of taking the value TRUE.
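A quick check (not part of the original code) confirms how the random split worked out:
table(sel_test)   # FALSE = training rows, TRUE = test rows
mean(sel_test)    # proportion of rows in the test set, should be close to 0.3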
Dealing with two datasets has an important complication, but one which carries an important lesson for empirical work: when we make changes to variables in the dataset we need to make them to all datasets. We therefore keep all the data together in the PSID dataset and just select the respective rows when we need them for training or testing purposes.
We perform this data exploration on the training dataset only, to emulate the process where initial data analysis and model estimation use only the training data. The test dataset comes in at the end, as the data to which you apply the estimated model.
Let us start by running a simple regression of wages against the GRADCOLL variable.
OLS1<-lm(WAGE~GRADCOLL, data = PSID[sel_train,])
stargazer(OLS1, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## WAGE
## -----------------------------------------------
## GRADCOLL 0.145
## (0.132)
##
## Constant 16.559***
## (0.342)
##
## -----------------------------------------------
## Observations 1,787
## R2 0.001
## Adjusted R2 0.0001
## Residual Std. Error 11.661 (df = 1785)
## F Statistic 1.196 (df = 1; 1785)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
ggplot(PSID[sel_train,], aes(x = GRADCOLL, y = WAGE)) +
geom_point() +
geom_smooth(method = "lm")
GRADCOLL does not appear to explain the variation in wages (it appears statistically insignificant). But the plot above also reveals another aspect of the data. There are only three possible values (0, 1 and 5) for the GRADCOLL variable. It is defined as a numeric variable, but it really is a categorical variable (see the above table). If we treat it as numeric, as we have done above, we assume that a response of 1 (did receive a college degree) is one unit better or worse than a response of 0 (inapplicable question) and that a response of 5 (no college degree) is five times as good or bad as a response of 1. That does not make sense, as the variable is really categorical.
Let's define the variable as a categorical variable, a factor (in both datasets!).
PSID$GRADCOLL <- as.factor(PSID$GRADCOLL)
Now, let's re-run the regression with GRADCOLL as a categorical variable:
OLS1b<-lm(WAGE~GRADCOLL, data = PSID[sel_train,])
stargazer(OLS1b, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## WAGE
## -----------------------------------------------
## GRADCOLL1 7.481***
## (0.662)
##
## GRADCOLL5 1.758***
## (0.646)
##
## Constant 14.539***
## (0.376)
##
## -----------------------------------------------
## Observations 1,787
## R2 0.067
## Adjusted R2 0.066
## Residual Std. Error 11.269 (df = 1784)
## F Statistic 64.357*** (df = 2; 1784)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
What this regression has done is to create a dummy variable for each possible outcome of the GRADCOLL variable, other than for a base group, here GRADCOLL=0. Now we can see that there is a positive effect of having a college degree, but also that not having one has a positive effect relative to the reference group, which consists of individuals for whom the question was not applicable. One would really have to understand what the composition of this reference group is. It may for instance include immigrants.
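If you wanted a more natural reference group, you could re-level the factor before estimating. This is an illustrative sketch only; the objects PSID_alt and OLS1c are not part of the original analysis:
# illustrative: use GRADCOLL = 5 ("no college degree") as the reference group
PSID_alt <- PSID
PSID_alt$GRADCOLL <- relevel(PSID_alt$GRADCOLL, ref = "5")
OLS1c <- lm(WAGE ~ GRADCOLL, data = PSID_alt[sel_train, ])
stargazer(OLS1c, type = "text")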
Let's look at other variables which we may have to convert to categorical variables.
Which of the variables in this dataset should really be defined as categorical?
So, let's turn the following variables into factors.
cols <- c("DEGREE","DEGREEW", "FEDUCH", "GRADCOLLW","GRADHS","GRADHSW","MARRIED","MEDUCH","SEX","STATE")
PSID <- PSID %>% mutate_at(cols, factor)
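A quick sanity check (not in the original code) confirms that the conversion worked:
sapply(PSID[, cols], class)   # should return "factor" for each of these columns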
In fact, you also need to think carefully about how to code, say, the age of the spouse AGEW. To understand this it is useful to look at which values this variable takes.
table(PSID[sel_train,]$AGEW)
##
## 0 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 931 2 11 11 11 14 17 32 26 15 17 39 40 29 35 29 19 19 18 19
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
## 13 13 17 15 16 15 10 18 18 21 19 14 21 26 18 22 27 14 14 19
## 57 58 59 60 61 62 63 64 65 66 67 68 69 70 72 78
## 12 14 13 12 11 6 9 10 3 1 3 2 4 1 1 1
As you can see there are 931 responses (in the training data) where the age of the wife/spouse is coded as "0". This indicates households where there is no wife or spouse. If we were to use the AGEW variable as a numeric variable (which it is coded as) then R would take the value of "0" literally. To avoid this we may want to create a new variable which indicates whether there is a spouse in the household (SPOUSE). That variable will be TRUE if AGEW>17.
PSID <- PSID %>% mutate(SPOUSE = (AGEW>17))
A similar story applies to the AGEY variable which indicates the age of the youngest child. We can look at a table indicating how often we get certain combinations of the AGEY and CHILDREN variables.
table(PSID[sel_train,]$AGEY,PSID[sel_train,]$CHILDREN, dnn = c("Age of youngest child","Number of children"))
## Number of children
## Age of youngest child 0 1 2 3 4 5 6 7 8 9
## 0 1006 0 0 0 0 0 0 0 0 0
## 1 0 61 54 26 12 4 1 1 1 1
## 2 0 42 25 20 4 0 1 0 0 0
## 3 0 25 22 17 3 1 0 0 0 0
## 4 0 25 21 6 6 2 2 0 0 0
## 5 0 15 19 12 1 0 0 0 0 0
## 6 0 7 15 7 5 1 0 0 0 0
## 7 0 10 17 8 5 0 0 0 0 0
## 8 0 7 16 5 2 0 0 0 0 0
## 9 0 16 12 3 3 0 0 0 0 0
## 10 0 14 9 4 0 0 0 0 0 0
## 11 0 13 11 3 0 0 0 0 0 0
## 12 0 9 16 4 1 0 0 0 0 0
## 13 0 19 11 1 0 0 0 0 0 0
## 14 0 19 7 2 0 0 0 0 0 0
## 15 0 17 12 0 0 0 0 0 0 0
## 16 0 20 0 0 0 0 0 0 0 0
## 17 0 20 0 0 0 0 0 0 0 0
You can see that AGEY=0 corresponds to no child being in the household. We therefore define the variable CHILD as follows:
PSID <- PSID %>% mutate(CHILD = (CHILDREN>0))
Now we can include multiple variables that may help explain variation in wages, such as AGE, MARRIED etc. We therefore define a model with all controls.
reg_wage_all_control <- WAGE~AGE+AGE2+AGEW+AGEY+CHILDREN+DEGREE+DEGREEW+
FEDUCA+FEDUCH+GRADHS+GRADHSW+HAPPINESS+MARRIED+MEDUCA+
MEDUCH+SEX+STATE+YEARSCOLL
# does not use GRADCOLL and GRADCOLLW as that info is collinear with DEGREE and DEGREEW
# check table(PSID$GRADCOLL,PSID$DEGREE) to see why
# run multiple linear regression
OLS2<-lm(reg_wage_all_control, data = PSID[sel_train,])
stargazer(OLS2, type = "text")
Why did the above regression exclude the GRADCOLL and GRADCOLLW variables?
The regression output shows almost 100 coefficients. Recall that dummy variables are created for all possible outcomes of each factor variable; as there are 52 possible outcomes for the STATE variable, 52-1 = 51 dummy variables are included for the STATE variable alone.
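A quick way to see this (not in the original code) is to count the coefficients directly:
length(coef(OLS2))   # number of estimated coefficients, including the intercept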
Of course many of these variables are not significant. For instance, let's test whether the coefficients relating to the MEDUCH variable are statistically significantly different from 0.
test0 <- names(coef(OLS2)) # get all coefficient names
test0 <- test0[grepl("MEDUCH", test0)] # keep the ones that contain "MEDUCH"
lht(OLS2, test0)
##
## Linear hypothesis test:
## MEDUCH1 = 0
## MEDUCH2 = 0
## MEDUCH3 = 0
## MEDUCH4 = 0
## MEDUCH5 = 0
## MEDUCH6 = 0
## MEDUCH7 = 0
## MEDUCH8 = 0
##
## Model 1: restricted model
## Model 2: WAGE ~ AGE + AGE2 + AGEW + AGEY + CHILDREN + DEGREE + DEGREEW +
## FEDUCA + FEDUCH + GRADHS + GRADHSW + HAPPINESS + MARRIED +
## MEDUCA + MEDUCH + SEX + STATE + YEARSCOLL
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1697 140669
## 2 1689 140159 8 509.98 0.7682 0.631
The p-value is larger than 0.6 and hence the null hypothesis that these coefficients are 0 cannot be rejected.
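The same logic applies to any other group of dummy variables. As an illustrative sketch (assuming none of the STATE coefficients are aliased in OLS2), the joint significance of the STATE dummies could be tested in exactly the same way:
test_state <- names(coef(OLS2))                       # get all coefficient names
test_state <- test_state[grepl("STATE", test_state)]  # keep the STATE dummies
lht(OLS2, test_state)                                 # joint test that all STATE coefficients are 0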
The mechanics of the forward, backward and best subset selection methods are described in more detail here.
PSID_reg <- PSID %>% select(-c(GRADCOLL,GRADCOLLW)) # to exclude the collinear variables
regfit.forw = regsubsets(WAGE~.,data=PSID_reg[sel_train,],nvmax=40,method="forward")
## Reordering variables and trying again:
forw.summary = summary(regfit.forw)
rsq_forw <- forw.summary$rsq
bic_forw <- forw.summary$bic
regfit.back = regsubsets(WAGE~.,data=PSID_reg[sel_train,],nvmax=40,method="backward")
## Reordering variables and trying again:
back.summary = summary(regfit.back)
rsq_back <- back.summary$rsq
bic_back <- back.summary$bic
An exhaustive (best subset) search would, in principle, also be a possibility, but with around 100 candidate variables this leaves too many possible models. The code would run for a very long time, even for this modest number of observations. We therefore will not apply it.
regfit.full = regsubsets(WAGE~.,data=PSID_reg[sel_train,],nvmax=40,really.big = TRUE, method="exhaustive")
full.summary = summary(regfit.full)
rsq_full <- full.summary$rsq
bic_full <- full.summary$bic
res_sel <- data.frame(vars = seq(1,41), bic = bic_forw, rsq = rsq_forw, method = "Forward")
res_sel <- rbind(res_sel, data.frame(vars = seq(1,41), bic = bic_back, rsq = rsq_back, method = "Backward"))
Look at the resulting dataframe to understand its structure. Now we can easily plot the results.
ggplot(res_sel, aes(x = vars, y = bic, color = method)) +
geom_line() +
labs(title = "BIC for forward and backward selection")
res_sel %>% group_by(method) %>% summarise(test = which.min(bic))
## # A tibble: 2 × 2
## method test
## <chr> <int>
## 1 Backward 13
## 2 Forward 18
The different methods select slightly different numbers of optimal variables. Let's check which variables were selected.
print("Selected variables from forward procedure")
## [1] "Selected variables from forward procedure"
selvars.forw <- forw.summary$which[18,]
names(selvars.forw[selvars.forw==TRUE])
## [1] "(Intercept)" "AGE" "AGE2" "AGEW" "DEGREEW6"
## [6] "FEDUCH8" "GRADHS1" "GRADHSW1" "MEDUCA" "YEARSCOLL"
## [11] "STATE2" "STATE6" "STATE24" "STATE27" "STATE34"
## [16] "STATE36" "STATE53" "DEGREE4" "DEGREE6"
print("")
## [1] ""
print("Selected variables from backward procedure")
## [1] "Selected variables from backward procedure"
selvars.back <- back.summary$which[13,]
names(selvars.back[selvars.back==TRUE])
## [1] "(Intercept)" "AGE" "AGE2" "DEGREEW6" "FEDUCH8"
## [6] "GRADHS1" "GRADHSW1" "MEDUCA" "SEX2" "YEARSCOLL"
## [11] "STATE28" "STATE34" "DEGREE4" "DEGREE6"
From these lists you can see that the two selections contain common variables (like FEDUCH8 or GRADHS1), but there are also some variables which only appear in one of the two selected models.
We now apply the two estimated models to the test dataset. As discussed in the exercise dedicated to the subset selection methods, this is facilitated by the following function.
predict.regsubsets = function(object,newdata,id,...){
form = as.formula(object$call[[2]]) # Extract the formula used when we called regsubsets()
mat = model.matrix(form,newdata) # Build the model matrix
coefi = coef(object,id=id) # Extract the coefficients of the id-th model
xvars = names(coefi) # Pull out the names of the predictors used in the ith model
mat[,xvars]%*%coefi # Make predictions using matrix multiplication
}
# Create model matrix - may not be needed
#mat <- model.matrix(WAGE ~.,PSID)
#mat_train <- mat[sel_train,]
#mat_test <- mat[sel_test,]
pred.forw=predict.regsubsets(regfit.forw,PSID_reg[sel_test,],id=18)
val.errors.forw=mean((PSID_reg[sel_test,]$WAGE-pred.forw)^2)
pred.back=predict.regsubsets(regfit.back,PSID_reg[sel_test,],id=13)
val.errors.back=mean((PSID_reg[sel_test,]$WAGE-pred.back)^2)
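It can be helpful to look at these two validation MSEs straight away (this printout is not part of the original code):
print(paste("Test MSE, forward selection (18 variables):", val.errors.forw))
print(paste("Test MSE, backward selection (13 variables):", val.errors.back))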
The previous variable selection methods estimated a lot of different models and then compared these using information criteria. Ridge regression estimates one model only (no longer by OLS) but penalises coefficients in a way that "bends" them towards 0.
To apply this method we need to define a vector y which contains the dependent variable and a matrix x which contains all candidate explanatory variables in its columns. So let's define these two objects.
y <- PSID_reg$WAGE
x <- model.matrix(WAGE~.,data = PSID_reg)[,-1]
The model.matrix function is useful as it creates a matrix of all the explanatory variables of a model that regresses the dependent variable (here WAGE) on all other variables in the specified dataset (here PSID_reg). The important role of this function is to convert all categorical variables into numerical dummy variables. The glmnet function we are using to estimate ridge and LASSO regressions can only deal with numeric variables. The [,-1] at the end deletes the first column, which is the constant.
How many rows and columns does the x matrix have?
You can use the command dim(x) to get the numbers of rows and columns of a matrix.
Note that we did this on the entire dataset (training and testing) to ensure that both datasets have the same structure.
You will now use the glmnet function to estimate a ridge regression. While we will not go through the technical details of ridge regressions, it is worthwhile to restate the optimisation problem solved by ridge regression. We are estimating the parameters in the following regression model
\[\begin{equation} y_i = \beta_0 + \sum^p_{j=1} \beta_j x_{ij} + u_i \end{equation}\]
where the regression model contains all \(p\) regressors included in the matrix x we defined above. The parameters are estimated by minimizing
\[\begin{equation} \sum^n_{i=1} \left( y_i - \beta_0 - \sum^p_{j=1} \beta_j x_{ij}\right)^2 + \lambda \sum^p_{j=1} \beta_j^2 = RSS + \lambda \sum^p_{j=1} \beta_j^2 \end{equation}\]
This is the standard residual sum of squares (RSS) plus a penalty term which penalises large parameter values. The parameter \(\lambda\) has to be chosen by you, the empirical researcher. More on that soon.
Before you apply the following function you should briefly browse through its help page (?glmnet) to understand some of the main features. There you will learn that the input option alpha = 0 instructs the function to implement a ridge regression. This sets the penalty term to be equal to the above \(\lambda \sum^p_{j=1} \beta_j^2\).
Let's estimate a ridge regression:
ridge=glmnet(x[sel_train,],y[sel_train],family="gaussian",alpha=0)
Does the estimated ridge regression standardise the variables?
You can use the help function ?glmnet to find information on the function's default choice.
The glmnet function automatically minimises the above objective function for a range of values of \(\lambda\). For each value of \(\lambda\) the estimated coefficients are different. The larger the \(\lambda\), the more heavily large parameter values are penalised and therefore the smaller the parameter estimates will be.
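A brief look at the fitted object (not in the original code) shows the scale of what has been estimated:
length(ridge$lambda)   # number of lambda values glmnet worked through
dim(coef(ridge))       # one row per coefficient, one column per lambda value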
What is the meaning of the family = "gaussian" option in the call to glmnet?
Consult the help function. In the Details section you will find that using this option ensures that the optimisation minimises \((1/(2n)) \cdot RSS + \lambda \cdot \text{penalty}\). That is the optimisation function stated above. The difference is the factor \(1/(2n)\), but as that is a constant this makes no difference to the optimised parameter values.
The following plot shows these parameter estimates for large \(\lambda\)s on the left and small \(\lambda\)s on the right.
plot(ridge, main="Ridge and coefficients path")
You can see one line for each of the 101 variables included in the model. An alternative way to see the parameters is the following. First we re-estimate the model for particular values of \(\lambda\).
paste("Large lambda:", ridge$lambda[25])
## [1] "Large lambda: 511.534471271628"
paste("Medium lambda:", ridge$lambda[50])
## [1] "Medium lambda: 49.9774271934175"
paste("Smalllambda:", ridge$lambda[75])
## [1] "Smalllambda: 4.8828444008166"
ridge_lar=glmnet(x[sel_train,],y[sel_train],family="gaussian",alpha=0,lambda=ridge$lambda[25])
ridge_med=glmnet(x[sel_train,],y[sel_train],family="gaussian",alpha=0,lambda=ridge$lambda[50])
ridge_small=glmnet(x[sel_train,],y[sel_train],family="gaussian",alpha=0,lambda=ridge$lambda[75])
Now we use the coefplot function to plot the coefficients for each of these models.
coefplot(ridge_small,intercept=FALSE,sort="magnitude",ylab="Variable",xlab="Coefficient",title="RIDGE Coefficient Plot (small lambda)")
coefplot(ridge_med,intercept=FALSE,sort="magnitude",ylab="Variable",xlab="Coefficient",title="RIDGE Coefficient Plot (mid lambda)")
coefplot(ridge_lar,intercept=FALSE,sort="magnitude",ylab="Variable",xlab="Coefficient",title="RIDGE Coefficient Plot (large lambda)")
rm(ridge_lar,ridge_med,ridge_small)
When you compare the plots make sure you note the different scales on the horizontal axis. Larger \(\lambda\)s result in smaller coefficient values. The vertical axis shows the associated variables, but there are too many to read them properly.
The ridge regressions estimated above incorporate a trade-off. A \(\lambda = 0\) would have delivered the best in-sample fit; however, this is unlikely to be the best choice when using the model for forecasting, as it would over-fit the data used for estimation. Larger values of \(\lambda\) penalise against this in-sample over-fitting. But what is the best \(\lambda\) to use?
You will apply a cross-validation approach to establish the best value of \(\lambda\). The glmnet package we are using here to estimate Ridge (and LASSO) regressions has a function that makes this easy to apply. Here is the code:
set.seed(1) # choosing CV folds is random, this ensures that you get identical folds when re-running the code
cvridge=cv.glmnet(x[sel_train,],y[sel_train],family="gaussian",alpha=0)
plot(cvridge,main="Ridge and CV criterion")
How many folds are used in the default setting for cv.glmnet?
Why does plot(cvridge,main="Ridge and CV criterion") return the mean square error as the criterion of fit?
As you can see, the cv.glmnet function returns estimation results for a range of possible \(\lambda\) values. Let's see which is the optimal \(\lambda\) by looking at cvridge$lambda.min, which is where the \(\lambda\) with the smallest MSE has been saved.
cvridge$lambda.min
## [1] 2.794143
When calculating the natural log of that \(\lambda\) you will find it to be 1.0275255, which corresponds to the value where you can see the minimum MSE in the above plot (also signaled by the left dashed vertical line).
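Since the horizontal axis of the CV plot is on the log scale, this is the number to look for (a small check, not in the original code):
log(cvridge$lambda.min)   # position of the left dashed line in the CV plot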
As you can see from the above plot, the MSE stays more or less the same even for somewhat larger values of \(\lambda\), and it is often seen as advantageous to impose a somewhat larger penalty. The second vertical dashed line in the above plot indicates the value of \(\lambda\) where the MSE is only one standard error larger than at cvridge$lambda.min. This is this value:
cvridge$lambda.1se
## [1] 23.74333
Once you have chosen a \(\lambda\) you can get the coefficients and predictions from that particular model. Here we list the estimated coefficients using both \(\lambda_{min}\) and \(\lambda_{1se}\). To do that we use the coef function, with the s option selecting the \(\lambda\) value.
# list estimated coefficients with lambda.min
paste("lambda = ", cvridge$lambda.min)
## [1] "lambda = 2.79414308848722"
c.min <- coef(cvridge, s = "lambda.min")
# list estimated coefficients with lambda.1se
paste("lambda = ", cvridge$lambda.1se)
## [1] "lambda = 23.7433268986241"
c.1se <- coef(cvridge, s = "lambda.1se")
cbind(c.min, c.1se)
## 102 x 2 sparse Matrix of class "dgCMatrix"
## s1 s1
## (Intercept) 9.1014067436 12.6742187209
## AGE 0.0802872856 0.0405998911
## AGE2 0.0004950848 0.0004057272
## AGEW 0.0264441625 0.0210302370
## AGEY 0.1021420149 0.0333197847
## CHILDREN -0.0693001375 -0.0856079570
## DEGREEW1 0.1506774212 0.2254285581
## DEGREEW2 1.2577026008 1.4049824045
## DEGREEW3 -1.4065242890 0.0963861901
## DEGREEW4 5.6376863928 3.8931612712
## DEGREEW5 . .
## DEGREEW6 11.7197142469 5.1312209193
## FEDUCA 0.1861737013 0.1232521289
## FEDUCH1 -0.1178456819 -0.4150403277
## FEDUCH2 -0.7432948488 -0.5924396256
## FEDUCH3 -0.0396782274 -0.2846424423
## FEDUCH4 0.4481337574 -0.0600412747
## FEDUCH5 1.1583178578 0.2045373810
## FEDUCH6 0.5972920274 0.1703029202
## FEDUCH7 -0.2202405829 0.0802927201
## FEDUCH8 2.7873580854 1.5419665798
## GRADHS1 1.8519555784 0.8623297565
## GRADHS2 0.2656079422 -0.4974598399
## GRADHS3 -0.2467499094 -0.8551807161
## GRADHSW1 1.6068166869 0.8682308987
## GRADHSW2 0.7052146120 -0.0545319468
## GRADHSW3 0.1687687981 -0.3424071917
## HAPPINESS -0.2847114241 -0.1636467044
## MARRIED2 -1.6566234245 -0.9016139453
## MARRIED3 -3.0382398265 -1.2016453792
## MARRIED4 0.0502963342 0.1311097546
## MARRIED5 -1.2020387753 -0.5594359986
## MEDUCA 0.2596187794 0.1296854734
## MEDUCH1 -1.5739110699 -0.9027864905
## MEDUCH2 -0.4039232333 -0.5072512445
## MEDUCH3 -0.0965886240 -0.3860794002
## MEDUCH4 0.2489500872 0.0568353206
## MEDUCH5 1.9798262397 0.7910881146
## MEDUCH6 0.2921505895 -0.0016639295
## MEDUCH7 0.7016035783 0.1482510129
## MEDUCH8 0.0595108304 0.1013835380
## SEX2 -1.1967053144 -0.7208610770
## YEARSCOLL 0.9481175157 0.5045621532
## STATE1 -0.9524462308 -0.3696938263
## STATE2 9.6436546458 3.7212511921
## STATE4 -0.1376531799 -0.2075287169
## STATE5 -2.4813508962 -0.6288146557
## STATE6 2.5781905768 1.1941635480
## STATE8 -2.3375130540 -0.9890307150
## STATE9 2.1186883743 0.8268321794
## STATE10 -5.2631360971 -2.3426680721
## STATE11 4.3600403454 1.5231882599
## STATE12 -1.6213635920 -0.6463841703
## STATE13 0.3088584956 -0.0384257467
## STATE15 . .
## STATE16 -3.5589325343 -0.8397824782
## STATE17 1.4625047089 0.6859675600
## STATE18 -0.1613226817 -0.1887122086
## STATE19 -1.8640715663 -0.7379751326
## STATE20 -0.0950459172 -0.0352675266
## STATE21 -0.9249492636 -0.3108357225
## STATE22 -0.6486062092 -0.6205942232
## STATE23 1.4726416559 0.6865286362
## STATE24 2.3163726879 0.7980432677
## STATE25 1.8078808024 0.9523231134
## STATE26 -1.5729000766 -0.7389507599
## STATE27 3.2910798191 1.4772488318
## STATE28 -2.2895897200 -1.1767101633
## STATE29 0.0686523576 -0.0616267683
## STATE30 0.0331920591 -0.0069603181
## STATE31 -3.3008764771 -1.0944796406
## STATE32 -0.3294603454 0.0089716052
## STATE33 2.1206206074 0.8666355839
## STATE34 6.9961301095 2.8743553025
## STATE35 -2.2504073187 -0.3789120127
## STATE36 2.5512633319 1.4122461773
## STATE37 -1.3977041728 -0.7079579999
## STATE38 -1.9190881743 -0.5510121743
## STATE39 0.0427188985 0.1573350848
## STATE40 -1.7502426964 -0.8066552679
## STATE41 1.7118663909 0.7320701950
## STATE42 -0.5198135303 -0.2443726690
## STATE44 -6.1195892139 -3.2775690086
## STATE45 -1.1699703136 -0.7613723302
## STATE46 -3.2255215484 -0.9661174279
## STATE47 -2.5769568236 -1.0668846228
## STATE48 -0.6671657814 -0.1549902809
## STATE49 0.8566125335 0.6079918760
## STATE50 3.2260551128 1.0237892948
## STATE51 -1.8839119039 -0.8429527795
## STATE53 3.0015704824 1.4684296750
## STATE54 -0.1234142874 0.1735687031
## STATE55 -2.0914613884 -0.9099678488
## STATE56 -0.6975400126 -0.3886752337
## DEGREE1 0.5790263430 0.3475980178
## DEGREE2 1.7331023330 1.1575254669
## DEGREE3 3.9695066217 2.3172383310
## DEGREE4 11.7600811226 6.0589030042
## DEGREE5 9.3062814482 4.3270434655
## DEGREE6 74.6105565824 31.2325995033
## SPOUSETRUE -0.0408197776 0.6148527780
## CHILDTRUE -0.2873259363 -0.1626462650
If you re-run the above code with a different random seed, will the optimised parameters at any \(\lambda\) remain unchanged?
The folds are created randomly: every time you divide the observations into folds this allocation is random. For different allocations you will find different optimised parameters. Only if you set the random seed to a particular value can you replicate results.
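A minimal sketch to see this for yourself (the seed 42 is arbitrary and not part of the original code):
set.seed(42)   # a different, arbitrary seed creates different folds
cvridge2 <- cv.glmnet(x[sel_train,], y[sel_train], family = "gaussian", alpha = 0)
c(cvridge$lambda.min, cvridge2$lambda.min)   # the two optimal lambdas will typically differ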
Once you have selected a model, for instance the model with \(\lambda_{1se}\), you can also use the predict function to produce forecasts from the estimated model. Typically you will have a new set of observations, as we have here using the observations indexed by sel_test, which were not used in the estimation process. To produce forecasts you will need a new matrix x which has the correct number of columns. Recall what the dimension of the matrix x is.
How many columns does the x matrix have?
You can use the command dim(x) to get the numbers of rows and columns of a matrix. You could also look at dim(x[sel_train,]), which is the actual x matrix used for estimation. Both have the same number of columns.
You can feed in any number of observations, but you will need to ensure that the new input matrix has the same column structure (number of columns and same ordering of variables) as the x used in estimating the model. As we will use x[sel_test,], this is the case by construction.
ridge.predict.lambda.1se <- predict(cvridge, x[sel_test,], type="response",s=cvridge$lambda.1se)
ridge.predict.lambda.min <- predict(cvridge, x[sel_test,], type="response",s=cvridge$lambda.min)
The result is a vector with one prediction for each of the observations we set aside in our test dataset. We can compare these to the original observations of hourly wages, which are stored in y.
yp_ridge <- as.vector(ridge.predict.lambda.1se) # as.vector converts the prediction matrix into a vector
temppred <- data.frame(y = y[sel_test], yp.ridge = yp_ridge )
ggplot(temppred, aes(y = yp.ridge, x = y)) +
geom_point() +
geom_abline(aes(intercept = 0, slope = 1)) +
labs(x = "Observed Wage", y = "Predicted wage",
title ="Predictions from Ridge regression")
As you can see, the model clearly is unable to predict very high wages well.
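One simple way to quantify this visual impression (not part of the original code) is the correlation between observed and predicted wages in the test data:
cor(temppred$y, temppred$yp.ridge)   # correlation between observed and predicted wages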
Ridge regressions introduced a penalty to the least squares fit. This penalty acts against large coefficient values and hence shrinks the coefficients relative to the least squares fit; importantly, it shrinks the parameter values towards 0 but not exactly to 0. The LASSO introduces a different penalty, which ensures that many coefficients are estimated to be exactly 0.
We are estimating the parameters in the following, unchanged, regression model
\[\begin{equation} y_i = \beta_0 + \sum^p_{j=1} \beta_j x_{ij} + u_i \end{equation}\]
where the regression model contains all \(p\) regressors included in the matrix x we defined above. The parameters are estimated by minimizing
\[\begin{equation} \sum^n_{i=1} \left( y_i - \beta_0 - \sum^p_{j=1} \beta_j x_{ij}\right)^2 + \lambda \sum^p_{j=1} |\beta_j| = RSS + \lambda \sum^p_{j=1} |\beta_j| \end{equation}\]
Implementing the LASSO is very similar to the Ridge regression. The glmnet function is used to estimate the LASSO, where we have to set the alpha parameter to alpha = 1.
lasso=glmnet(x[sel_train,],y[sel_train],family="gaussian",alpha=1)
As for the Ridge regression this function implements the optimisation for a range of different \(\lambda\) values.
plot(lasso,main="Lasso and coefficients path")
We can see that larger \(\lambda\) values (on the left) lead to parameter values which are 0. The smaller the \(\lambda\), the more parameters are estimated to be different from 0. This becomes more obvious if we look at the coefficient plot for a particular choice of \(\lambda\).
lasso25=glmnet(x[sel_train,],y[sel_train],family="gaussian",alpha=1,lambda=lasso$lambda[25])
coefplot(lasso25,intercept=FALSE,sort="magnitude",ylab="Variable",xlab="Coefficient",title="LASSO Coefficient Plot (high lambda)")
The plot only shows the variables with coefficients that have been estimated to be unequal to 0. These are only 23 out of 101 variables.
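You can confirm this count programmatically; a small check which is not part of the original code:
sum(coef(lasso25)[-1, ] != 0)   # number of non-zero coefficients, excluding the intercept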
Similar to the Ridge regression, you will eventually have to choose a \(\lambda\) value.
The tool of choice is cross validation and this is implemented in a similar way as for the Ridge regression.
set.seed(1)
cv_lasso=cv.glmnet(x[sel_train,],y[sel_train],family="gaussian",nfolds=10,alpha=1)
# Plot CV criterion
plot(cv_lasso,main="Lasso and CV criterion")
You see from this plot that the minimum of the MSE criterion (average MSE over the 10 cross-validation folds) is where \(log(\lambda)\) is about -1.5 (equivalent to \(\lambda =\) 0.2214311). This could be a \(\lambda\) value to use. It is, however, common practice in this literature to choose (or at least consider) a somewhat larger \(\lambda\), implying even more regularisation. That is the \(\lambda\) value associated with the second dashed line, around \(log(\lambda)=0.17\) or \(\lambda =\) 1.181711.
Let us now check what parameters these different levels of regularisation imply:
# list estimated coefficients with lambda.min
print(paste("lambda min = ", cv_lasso$lambda.min))
## [1] "lambda min = 0.221431061317992"
print(paste("lambda +1s.e. = ", cv_lasso$lambda.1se))
## [1] "lambda +1s.e. = 1.18171097470092"
# get the coefficients for both choices of lambda
lasso.coef.lambda.min <- coef(cv_lasso, s = "lambda.min")
lasso.coef.lambda.1se <- coef(cv_lasso, s = "lambda.1se")
print(cbind(lasso.coef.lambda.min, lasso.coef.lambda.1se))
## 102 x 2 sparse Matrix of class "dgCMatrix"
## s1 s1
## (Intercept) 7.39360389 11.43160186
## AGE 0.12649026 0.06284273
## AGE2 . .
## AGEW 0.02225465 0.05203511
## AGEY 0.05040412 .
## CHILDREN . .
## DEGREEW1 . .
## DEGREEW2 0.41194378 .
## DEGREEW3 -0.12300401 .
## DEGREEW4 . .
## DEGREEW5 . .
## DEGREEW6 9.95505248 .
## FEDUCA 0.10419891 .
## FEDUCH1 . .
## FEDUCH2 -0.74588672 .
## FEDUCH3 . .
## FEDUCH4 0.06797709 .
## FEDUCH5 . .
## FEDUCH6 . .
## FEDUCH7 . .
## FEDUCH8 2.26554506 .
## GRADHS1 1.61005944 .
## GRADHS2 . .
## GRADHS3 . .
## GRADHSW1 1.86900841 0.68113426
## GRADHSW2 . .
## GRADHSW3 . .
## HAPPINESS -0.08197663 .
## MARRIED2 -1.42289172 -0.52094874
## MARRIED3 -1.99047193 .
## MARRIED4 . .
## MARRIED5 -0.37393495 .
## MEDUCA 0.25252174 0.02207431
## MEDUCH1 . .
## MEDUCH2 -0.27205898 .
## MEDUCH3 . .
## MEDUCH4 . .
## MEDUCH5 1.29510288 .
## MEDUCH6 . .
## MEDUCH7 0.03166446 .
## MEDUCH8 . .
## SEX2 -1.23414809 .
## YEARSCOLL 1.36927802 1.24088121
## STATE1 . .
## STATE2 7.54381464 .
## STATE4 . .
## STATE5 -1.37179944 .
## STATE6 3.08567866 0.15176071
## STATE8 -0.66083334 .
## STATE9 . .
## STATE10 . .
## STATE11 2.59598973 .
## STATE12 . .
## STATE13 . .
## STATE15 . .
## STATE16 . .
## STATE17 1.36208901 .
## STATE18 . .
## STATE19 -0.03613881 .
## STATE20 . .
## STATE21 . .
## STATE22 . .
## STATE23 . .
## STATE24 1.93388344 .
## STATE25 1.02461109 .
## STATE26 -0.21175294 .
## STATE27 3.04390671 .
## STATE28 -1.47925003 .
## STATE29 . .
## STATE30 . .
## STATE31 -0.67967388 .
## STATE32 . .
## STATE33 . .
## STATE34 7.49087624 0.02591321
## STATE35 . .
## STATE36 2.40722495 .
## STATE37 -0.43440305 .
## STATE38 . .
## STATE39 . .
## STATE40 . .
## STATE41 1.03770334 .
## STATE42 . .
## STATE44 . .
## STATE45 -0.14928896 .
## STATE46 -0.04182384 .
## STATE47 -0.77044516 .
## STATE48 . .
## STATE49 . .
## STATE50 . .
## STATE51 -0.65198063 .
## STATE53 3.19453623 .
## STATE54 . .
## STATE55 -0.43950153 .
## STATE56 . .
## DEGREE1 . .
## DEGREE2 0.28490050 .
## DEGREE3 1.76564741 .
## DEGREE4 10.33057437 .
## DEGREE5 3.85163572 .
## DEGREE6 87.03248659 69.61234545
## SPOUSETRUE . .
## CHILDTRUE . .
You can see that the larger \(\lambda\) value does deliver a much sparser model.
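Counting the non-zero coefficients makes the difference in sparsity explicit (an additional check, not in the original code):
sum(lasso.coef.lambda.min[-1, ] != 0)   # non-zero coefficients with lambda.min
sum(lasso.coef.lambda.1se[-1, ] != 0)   # non-zero coefficients with lambda.1se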
Let's see how closely the model predictions (in the test dataset) are correlated.
# Extract the predicted values
lasso.predict.lambda.min <- predict(cv_lasso, x[sel_test,], s=cv_lasso$lambda.min)
lasso.predict.lambda.1se <- predict(cv_lasso, x[sel_test,], s=cv_lasso$lambda.1se)
# Correlation between the predicted values of the two Lasso models
cor_fit <- cor(lasso.predict.lambda.min,lasso.predict.lambda.1se)
print(paste0("Correlation between predicted values: ", cor_fit))
## [1] "Correlation between predicted values: 0.935481426603129"
So while the models are quite different, the correlation between the two prediction series is quite strong.
We have now estimated models with forward and backward variable selection as well as Ridge and Lasso regressions. Let us compare how well these models do in predicting wages in our testing dataset.
y.real <- y[sel_test]
MSE.back <- mean((y.real-pred.back)^2)
MSE.forw <- mean((y.real-pred.forw)^2)
MSE.ridge.min <- mean((y.real-ridge.predict.lambda.min)^2)
MSE.ridge.1se <- mean((y.real-ridge.predict.lambda.1se)^2)
MSE.lasso.min <- mean((y.real-lasso.predict.lambda.min)^2)
MSE.lasso.1se <- mean((y.real-lasso.predict.lambda.1se)^2)
print("Mean Square Errors (MSE) in training set")
## [1] "Mean Square Errors (MSE) in training set"
print(paste("MSE backward selection",MSE.back))
## [1] "MSE backward selection 163.1522633524"
print(paste("MSE forward selection",MSE.forw))
## [1] "MSE forward selection 163.881339600827"
print(paste("MSE Ridge min lambda",MSE.ridge.min))
## [1] "MSE Ridge min lambda 153.676638320345"
print(paste("MSE Ridge 1se lambda",MSE.ridge.1se))
## [1] "MSE Ridge 1se lambda 150.588153326857"
print(paste("MSE LASSO min lambda",MSE.lasso.min))
## [1] "MSE LASSO min lambda 157.232877262571"
print(paste("MSE LASSO 1se lambda",MSE.lasso.1se))
## [1] "MSE LASSO 1se lambda 158.954752803404"
From here you can see that in this case the Ridge regression delivered the best out of sample performance.
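To make this comparison easier to read, the MSEs can be collected in a small data frame and sorted; this is an illustrative addition, not part of the original code:
MSE_comp <- data.frame(model = c("Backward", "Forward", "Ridge (min)", "Ridge (1se)",
                                 "LASSO (min)", "LASSO (1se)"),
                       MSE = c(MSE.back, MSE.forw, MSE.ridge.min, MSE.ridge.1se,
                               MSE.lasso.min, MSE.lasso.1se))
MSE_comp[order(MSE_comp$MSE), ]   # smallest (best) test MSE first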
In R, but also in other programming languages like Python, the same thing can be done in different ways. In particular, different packages may have been written to achieve the same thing.
Which function from the hdm package allows you to estimate a LASSO regression?
You can't know!!! But you can find out. If you suspect that a particular package has a certain functionality, you should web search something like "R hdm lasso regression" and you are likely to find a link to the package's page on the CRAN website. These pages contain information on all packages. The reference manual and the vignette are likely to help you find the relevant function.
# replace XXXX with the correct function name
lasso.hdm <- XXXX(WAGE~.,data = PSID_reg[sel_train,],post=TRUE)
summary(lasso.hdm)
##
## Call:
## rlasso.formula(formula = WAGE ~ ., data = PSID_reg[sel_train,
## ], post = TRUE)
##
## Post-Lasso Estimation: TRUE
##
## Total number of variables: 99
## Number of selected variables: 13
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.059 -4.865 -1.422 2.611 90.824
##
## Estimate
## (Intercept) 7.768
## AGE 0.151
## AGE2 0.000
## AGEW 0.026
## AGEY 0.000
## CHILDREN 0.000
## DEGREEW1 0.000
## DEGREEW2 0.000
## DEGREEW3 0.000
## DEGREEW4 0.000
## DEGREEW6 0.000
## FEDUCA 0.000
## FEDUCH1 0.000
## FEDUCH2 -2.235
## FEDUCH3 0.000
## FEDUCH4 0.000
## FEDUCH5 0.000
## FEDUCH6 0.000
## FEDUCH7 0.000
## FEDUCH8 0.000
## GRADHS1 1.603
## GRADHS2 0.000
## GRADHS3 -0.329
## GRADHSW1 1.756
## GRADHSW2 0.000
## GRADHSW3 0.000
## HAPPINESS 0.000
## MARRIED2 -1.407
## MARRIED3 0.000
## MARRIED4 0.000
## MARRIED5 0.000
## MEDUCA 0.000
## MEDUCH1 0.000
## MEDUCH2 0.000
## MEDUCH3 0.000
## MEDUCH4 0.000
## MEDUCH5 0.000
## MEDUCH6 0.000
## MEDUCH7 0.000
## MEDUCH8 0.000
## SEX2 -1.874
## YEARSCOLL 1.717
## STATE1 0.000
## STATE2 0.000
## STATE4 0.000
## STATE5 0.000
## STATE6 0.000
## STATE8 0.000
## STATE9 0.000
## STATE10 0.000
## STATE11 0.000
## STATE12 0.000
## STATE13 0.000
## STATE16 0.000
## STATE17 0.000
## STATE18 0.000
## STATE19 0.000
## STATE20 0.000
## STATE21 0.000
## STATE22 0.000
## STATE23 0.000
## STATE24 0.000
## STATE25 0.000
## STATE26 0.000
## STATE27 0.000
## STATE28 -3.402
## STATE29 0.000
## STATE30 0.000
## STATE31 0.000
## STATE32 0.000
## STATE33 0.000
## STATE34 0.000
## STATE35 0.000
## STATE36 0.000
## STATE37 -2.396
## STATE38 0.000
## STATE39 0.000
## STATE40 0.000
## STATE41 0.000
## STATE42 0.000
## STATE44 0.000
## STATE45 -1.969
## STATE46 0.000
## STATE47 0.000
## STATE48 0.000
## STATE49 0.000
## STATE50 0.000
## STATE51 0.000
## STATE53 0.000
## STATE54 0.000
## STATE55 0.000
## STATE56 0.000
## DEGREE1 0.000
## DEGREE2 0.000
## DEGREE3 0.000
## DEGREE4 0.000
## DEGREE5 0.000
## DEGREE6 90.527
## SPOUSETRUE 0.000
## CHILDTRUE 0.000
##
## Residual standard error: 9.452
## Multiple R-squared: 0.343
## Adjusted R-squared: 0.3382
## Joint significance test:
## the sup score statistic for joint significance test is 1.199e+05 with a p-value of 0.002
While this conceptually estimates the same LASSO model as above, there are slight differences to the earlier approach. To figure out what these are you really need to read the help function. The main differences here are:
- The hdm package does not use cross validation to find the optimal \(\lambda\); it sets it to a data-dependent value (see the vignette for details).
- The coefficients of the selected variables are re-estimated by OLS, a post-LASSO estimation (post = TRUE).
Let's predict from this LASSO model.
lasso.hdm.predict = predict(lasso.hdm, newdata = PSID_reg[sel_test,]) #out-of-sample prediction
When you try this you will most likely get an error message like:
"Error in X %*% beta : non-conformable arguments"
Dealing with error messages like this is a normal and very common activity when applying empirical techniques. The message refers to "X %*% beta".
This refers to the multiplication needed for producing forecasts: \(\hat{y} = X \hat{\beta}\). For this operation to work the number of columns of \(X\) has to be identical to the number of estimated parameters in \(\hat{\beta}\). The error message tells us that this does not seem to be the case ("non-conformable arguments"). In this case the message leads us to a solution. Unfortunately, error messages are not always so informative. A great illustration of that, and an essential watch to understand how to debug, is this YouTube clip by Jenny Bryan.
So how can that be? We are feeding the same dataset into the predict function (newdata = PSID_reg[sel_test,]) as the one we used to estimate the model (data = PSID_reg[sel_train,]), just different rows. Why then do the parameters which come from the estimated model not fit the data used for forecasting?
The key to understanding this is to recognise that PSID_reg contains some factor variables. What the rlasso function does is turn each factor variable with k different values into k-1 dummy variables (e.g. DEGREEW). So it could be that in the training dataset there are observations for all 7 different outcomes, and hence when we called rlasso there were indeed 6 dummy variables created and appropriate parameters estimated. If, however, the testing dataset only contains observations for 6 out of these 7 categories, then only 5 dummy variables will be created and we have a mismatch between the columns of \(X\) and the elements of \(\hat{\beta}\).
This should not be a problem if all variables are numeric as then the size of the dataset handed into the function exactly represents the number of variables used.
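A sketch of how you could check for such a mismatch yourself (the helper code below is illustrative and not part of the original analysis):
# compare the number of factor levels actually present in the training and test rows
fac_vars <- names(PSID_reg)[sapply(PSID_reg, is.factor)]
sapply(fac_vars, function(v) {
  c(train = nlevels(droplevels(PSID_reg[[v]][sel_train])),
    test  = nlevels(droplevels(PSID_reg[[v]][sel_test])))
})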
So, how do we go about this? We will actually use a trick we already used when applying the Ridge and LASSO regressions earlier. There we first created the y and x objects which we used in glmnet. Let's recall how we created y and x:
y <- PSID_reg$WAGE
x <- model.matrix(WAGE~.,data = PSID_reg)[,-1]
y just gets the dependent variable. x uses the model.matrix function and creates the dummy variables from the factor variables in PSID_reg. Importantly, we do this for the entire dataset, not only the training dataset. We then feed the training rows (sel_train) into glmnet. Let's use the same approach with the rlasso function:
lasso.hdm2 <- rlasso(y[sel_train]~x[sel_train,],post=TRUE)
lasso.hdm2.predict = predict(lasso.hdm2, newdata = x[sel_test,]) #out-of-sample prediction
Now that we have forecasts from this LASSO implementation we can again calculate the MSE for these forecasts.
MSE.lasso.hdm <- mean((y.real-lasso.hdm2.predict)^2)
print(paste("MSE LASSO (hdm)",MSE.lasso.hdm))
## [1] "MSE LASSO (hdm) 155.085106352071"
Comparing this to the earlier evaluations you will realise that this is better than the two cross-validated LASSO versions, but not better than the Ridge regressions.
In this section we went through applications of Ridge and LASSO regressions. We learned how to apply cross-validation to find the optimal value for the penalty parameters and then how to apply the estimated models to a new dataset.
You also learned that you have to be very careful with the data setup, in particular when your dataset contains factor variables. As we encountered such a problem we also applied one of the most important skills you will need as an applied researcher, namely debugging your code. Nobody can teach you how to debug, but perhaps the work in this section gave you some hints on how to start.