This exercise is part of the ECLR page.
Here we will demonstrate the use of LASSO and Ridge regressions to deal with situations in which you estimate a regression model that contains a large number of potential explanatory variables. These types of models are mainly used for prediction purposes. The predictions may be the final aim of the work, or they may be used as an intermediate step in another analysis.
We begin by loading a range of packages.
# load packages
library(car)
library(AER)
library(glmnet)
library(hdm)
library(stats)
require(graphics)
library(leaps)
library(tidyverse)
library(haven)
library(stargazer)
library(ggplot2)
library(sandwich)
library(coefplot)
library(hdm)
Let's load the dataset which contains an extract from the Panel Study of Income Dynamics (PSID). You can download this file from here: PSIDsmall.txt. The datafile is saved as a text file which can be imported using the read.delim function.
## load data set
PSID1 <- read.delim("../data/PSIDsmall.txt",header=TRUE)
Let's see what variables are contained in this dataset.
names(PSID1)
## [1] "WAGE" "AGE" "AGE2" "AGEW" "AGEY" "CHILDREN"
## [7] "DEGREEW" "FEDUCA" "FEDUCH" "GRADCOLLW" "GRADHS" "GRADHSW"
## [13] "HAPPINESS" "MARRIED" "MEDUCA" "MEDUCH" "SEX" "YEARSCOLL"
## [19] "STATE" "GRADCOLL" "DEGREE"
summary(PSID1)
## WAGE AGE AGE2 AGEW
## Min. : 1.00 Min. :18.00 Min. : 324 Min. : 0.0
## 1st Qu.: 10.00 1st Qu.:28.00 1st Qu.: 784 1st Qu.: 0.0
## Median : 14.00 Median :36.00 Median :1296 Median : 0.0
## Mean : 17.99 Mean :39.21 Mean :1715 Mean :18.9
## 3rd Qu.: 20.00 3rd Qu.:50.00 3rd Qu.:2500 3rd Qu.:37.0
## Max. :975.00 Max. :85.00 Max. :7225 Max. :78.0
## AGEY CHILDREN DEGREEW FEDUCA
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.0000
## Median : 0.000 Median :0.0000 Median :0.0000 Median : 0.0000
## Mean : 2.776 Mean :0.8201 Mean :0.2967 Mean : 0.6266
## 3rd Qu.: 4.000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.: 0.0000
## Max. :17.000 Max. :9.0000 Max. :6.0000 Max. :20.0000
## FEDUCH GRADCOLLW GRADHS GRADHSW
## Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:3.000 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.0000
## Median :4.000 Median :0.0000 Median :1.000 Median :0.0000
## Mean :3.987 Mean :0.6826 Mean :1.194 Mean :0.5109
## 3rd Qu.:5.000 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :8.000 Max. :5.0000 Max. :3.000 Max. :3.0000
## HAPPINESS MARRIED MEDUCA MEDUCH
## Min. :0.000 Min. :1.000 Min. : 0.0000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 0.0000 1st Qu.:3.000
## Median :2.000 Median :2.000 Median : 0.0000 Median :4.000
## Mean :2.238 Mean :2.078 Mean : 0.5849 Mean :4.138
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.: 0.0000 3rd Qu.:6.000
## Max. :5.000 Max. :5.000 Max. :20.0000 Max. :8.000
## SEX YEARSCOLL STATE GRADCOLL
## Min. :1.000 Min. :0.000 Min. : 0.00 Min. :0.000
## 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:17.00 1st Qu.:0.000
## Median :1.000 Median :1.000 Median :28.00 Median :1.000
## Mean :1.343 Mean :1.336 Mean :28.61 Mean :1.538
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:41.00 3rd Qu.:5.000
## Max. :2.000 Max. :5.000 Max. :56.00 Max. :5.000
## DEGREE
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4155
## 3rd Qu.:0.0000
## Max. :6.0000
Let's look at the distribution of wages (measured as hourly wages).
phist <- ggplot(PSID1, aes(x=WAGE)) + geom_histogram()
phist
You can see that there are a few individuals with very large hourly wages. Let us exclude wages that are larger than USD200 per hour and have another look at the wage distribution.
PSID <- PSID1 %>% filter(WAGE <200)
phist <- ggplot(PSID, aes(x=WAGE)) + geom_histogram()
phist
These are observations from the PSID and to do serious work with this dataset we need to understand the variables. The PSID surveys households on their income but also on a number of household and personal characteristics. The following variable search page can be useful here. The main person interviewed is the household head; if there is a couple this was typically the man. You will see some variables which exist in two versions, for instance AGE and AGEW. The former refers to the household head and the latter to the wife's age. You can tell that this labeling is the result of patriarchal thinking. In later versions of the PSID this labeling has changed to household head and (where appropriate) the spouse.
Variable Name | Definition |
---|---|
WAGE | Hourly wage of household head |
AGE | Age of household (HH) head |
AGE2 | Squared Age of HH head |
AGEW | Age of wife (0 if no wife/spouse) |
AGEY | Age of youngest child (0 if no child, children aged 0 and 1 will have an entry of "1") |
CHILDREN | number of children |
DEGREE | highest degree earned (0 = not applicable, 1 = Associate, 2 = Bachelor, 3 = Master, 4 = PhD, 5 = Law, LLB, 6 = Medical, MD ), several reasons for 0, e.g. no college attendance GRADCOLL = 5 |
DEGREEW | highest degree earned (0 = not applicable, 1 = Associate, 2 = Bachelor, 3 = Master, 4 = PhD, 5 = Law, LLB, 6 = Medical, MD ), several reasons for 0, e.g. no college attendance GRADCOLLW = 5 or no spouse |
FEDUCA | Years of foreign education of father, 0 if no education abroad |
FEDUCH | Father completed education in the US (1 = 0 to 5 grades, 2 = 6 to 8 grades, 3 = 9 to 11 grades, 4 = 12 grades, High School, 5 = HS plus further non-academic education, 6 = HS + some college but no completed degree, 7 = 15 to 16 years, College degree, 8 = 17+ years, higher than College degree, 0 not applicable) |
MEDUCA | Years of foreign education of mother, 0 if no education abroad |
MEDUCH | Mother completed education in the US (1 = 0 to 5 grades, 2 = 6 to 8 grades, 3 = 9 to 11 grades, 4 = 12 grades, High School, 5 = HS plus further non-academic education, 6 = HS + some college but no completed degree, 7 = 15 to 16 years, College degree, 8 = 17+ years, higher than College degree, 0 not applicable) |
GRADCOLL | Graduated from college (1 = Yes, 5 = no, 0 = not applicable) |
GRADCOLLW | Graduated from college (1 = Yes, 5 = no, 0 = not applicable or no wife/spouse) |
GRADHS | HH head graduate from High School (1 = Graduated, 2 = General Educational Diploma, 3 = Neither, 4 = not applicable) |
GRADHSW | Wife graduate from High School (1 = Graduated, 2 = General Educational Diploma, 3 = Neither, 4 = not applicable or no wife/spouse) |
HAPPINESS | During the past 30 days, how much of the time did you feel extremely happy (1 = All of the time to 5 = None of the time) |
MARRIED | 1 = Married, 2 = Never Married, 3 = Widowed, 4 = Divorced, 5 = Separated |
SEX | 1 = Male, 2 = Female |
YEARSCOLL | Highest year of college completed |
STATE | State identifier, e.g. 5 = Arkansas |
We will soon explore some aspects of these variables. But first we need to address a fundamental data structure question.
A fundamental problem in modern data science is that of over-fitting the data. The potential availability of many covariates (explanatory variables) and/or the ability to fit highly non-linear models (regression trees, random forests, neural networks, to mention a few) leads to the possibility of fitting data very accurately.
However, fitting the data accurately (in-sample) may not lead to very good out-of-sample predictions, meaning predictions of the outcome variable for observations that were not involved in the estimation process. That, however, is usually the aim.
To emulate this process we will separate the data into a base (or estimation or training) dataset which we will use to estimate a model (PSID_train) and a test dataset which is used to assess the model on data that were not involved in the estimation process (PSID_test).
In real problems the latter may only become available after the former has been used to estimate the model. Here we set the testing dataset to be a randomly selected subset of our data, approximately 30% of the observations.
set.seed(1234) # We set the random seed to ensure that we always randomly select the same set of data
test_prop = 0.3
sel_test <- (runif(n=dim(PSID)[1], min=0, max=1) < test_prop)
sel_train <- !sel_test
The two variables sel_test and sel_train are vectors which we will use for the remainder of this project to select the training and testing datasets.
What values do you find in the sel_test and sel_train vectors?
Both sel_test and sel_train contain as many elements as there are rows in PSID. Each element is either TRUE or FALSE. sel_train is defined as the opposite of sel_test, so when a row is TRUE in sel_test it is FALSE in sel_train. All rows that are TRUE in sel_test belong to the test dataset and all rows that are TRUE in sel_train belong to the training dataset. sel_test was defined such that each element had a 30% probability of taking the value TRUE.
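A quick check (not part of the original code) confirms how the random split worked out:
table(sel_test)   # FALSE = training rows, TRUE = test rows
mean(sel_test)    # proportion of rows in the test set, should be close to 0.3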
Dealing with two datasets has an important complication, but one which carries an important lesson for empirical work: when we make changes to variables in the dataset we need to make them to all datasets. We therefore keep all the data together in the PSID dataset and just select the respective rows when we need them for training or testing purposes.
We perform this data exploration on the training dataset only, to emulate the process where initial data analysis and model estimation use only the training data. The test dataset comes in at the end, as the data to which you apply the estimated model.
Let us start by running a simple regression of wages against the GRADCOLL variable.
OLS1<-lm(WAGE~GRADCOLL, data = PSID[sel_train,])
stargazer(OLS1, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## WAGE
## -----------------------------------------------
## GRADCOLL 0.145
## (0.132)
##
## Constant 16.559***
## (0.342)
##
## -----------------------------------------------
## Observations 1,787
## R2 0.001
## Adjusted R2 0.0001
## Residual Std. Error 11.661 (df = 1785)
## F Statistic 1.196 (df = 1; 1785)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
ggplot(PSID[sel_train,], aes(x = GRADCOLL, y = WAGE)) +
geom_point() +
geom_smooth(method = "lm")
GRADCOLL does not appear to explain the variation in wages (it appears statistically insignificant). But the plot above also reveals another aspect of the data. There are only three possible values (0, 1 and 5) for the GRADCOLL variable. It is defined as a numeric variable, but it really is a categorical variable (see the above table). If we treat it as numeric, as we have done above, we assume that a response of 1 (did receive a college degree) is one unit better or worse than a response of 0 (inapplicable question) and that a response of 5 (no college degree) is five times as good or bad as a response of 1. That does not make sense, as the variable is really categorical.
Let's define the variable as a categorical variable, a factor (in both datasets!).
PSID$GRADCOLL <- as.factor(PSID$GRADCOLL)
Now, let's re-run the regression with GRADCOLL as a categorical variable:
OLS1b<-lm(WAGE~GRADCOLL, data = PSID[sel_train,])
stargazer(OLS1b, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## WAGE
## -----------------------------------------------
## GRADCOLL1 7.481***
## (0.662)
##
## GRADCOLL5 1.758***
## (0.646)
##
## Constant 14.539***
## (0.376)
##
## -----------------------------------------------
## Observations 1,787
## R2 0.067
## Adjusted R2 0.066
## Residual Std. Error 11.269 (df = 1784)
## F Statistic 64.357*** (df = 2; 1784)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
What this regression has done is to create a dummy variable for each possible outcome of the GRADCOLL variable, other than for a base group, here GRADCOLL=0. Now we can see that there is a positive effect of having a college degree, but also that not having one has a positive effect relative to the reference group, which consists of individuals for whom the question was not applicable. One would really have to understand what the composition of this reference group is. It may for instance include immigrants.
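If you wanted a more natural reference group, you could re-level the factor before estimating. This is an illustrative sketch only; the objects PSID_alt and OLS1c are not part of the original analysis:
# illustrative: use GRADCOLL = 5 ("no college degree") as the reference group
PSID_alt <- PSID
PSID_alt$GRADCOLL <- relevel(PSID_alt$GRADCOLL, ref = "5")
OLS1c <- lm(WAGE ~ GRADCOLL, data = PSID_alt[sel_train, ])
stargazer(OLS1c, type = "text")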
Let's look at other variables which we may have to convert to categorical variables.
Which of the variables in this dataset should really be defined as categorical?
So, let's turn the following variables into factors.
cols <- c("DEGREE","DEGREEW", "FEDUCH", "GRADCOLLW","GRADHS","GRADHSW","MARRIED","MEDUCH","SEX","STATE")
PSID <- PSID %>% mutate_at(cols, factor)
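A quick sanity check (not in the original code) confirms that the conversion worked:
sapply(PSID[, cols], class)   # should return "factor" for each of these columns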
In fact, you also need to think carefully about how to code, say, the age of the spouse AGEW. To understand this it is useful to look at which values this variable takes.
table(PSID[sel_train,]$AGEW)
##
## 0 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 931 2 11 11 11 14 17 32 26 15 17 39 40 29 35 29 19 19 18 19
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
## 13 13 17 15 16 15 10 18 18 21 19 14 21 26 18 22 27 14 14 19
## 57 58 59 60 61 62 63 64 65 66 67 68 69 70 72 78
## 12 14 13 12 11 6 9 10 3 1 3 2 4 1 1 1
As you can see there are 931 responses (in the training data) where the age of the wife/spouse is coded as "0". This indicates households where there is no wife or spouse. If we were to use the AGEW variable as a numeric variable (which it is coded as) then R would take the value of "0" literally. To avoid this we may want to create a new variable which indicates whether there is a spouse in the household (SPOUSE). That variable will be TRUE if AGEW>17.
PSID <- PSID %>% mutate(SPOUSE = (AGEW>17))
A similar story applies to the AGEY variable which indicates the age of the youngest child. We can look at a table indicating how often we get certain combinations of the AGEY and CHILDREN variables.
table(PSID[sel_train,]$AGEY,PSID[sel_train,]$CHILDREN, dnn = c("Age of youngest child","Number of children"))
## Number of children
## Age of youngest child 0 1 2 3 4 5 6 7 8 9
## 0 1006 0 0 0 0 0 0 0 0 0
## 1 0 61 54 26 12 4 1 1 1 1
## 2 0 42 25 20 4 0 1 0 0 0
## 3 0 25 22 17 3 1 0 0 0 0
## 4 0 25 21 6 6 2 2 0 0 0
## 5 0 15 19 12 1 0 0 0 0 0
## 6 0 7 15 7 5 1 0 0 0 0
## 7 0 10 17 8 5 0 0 0 0 0
## 8 0 7 16 5 2 0 0 0 0 0
## 9 0 16 12 3 3 0 0 0 0 0
## 10 0 14 9 4 0 0 0 0 0 0
## 11 0 13 11 3 0 0 0 0 0 0
## 12 0 9 16 4 1 0 0 0 0 0
## 13 0 19 11 1 0 0 0 0 0 0
## 14 0 19 7 2 0 0 0 0 0 0
## 15 0 17 12 0 0 0 0 0 0 0
## 16 0 20 0 0 0 0 0 0 0 0
## 17 0 20 0 0 0 0 0 0 0 0
You can see that AGEY=0 corresponds to no child being in the household. We therefore define the variable CHILD as follows:
PSID <- PSID %>% mutate(CHILD = (CHILDREN>0))
Now we can include multiple variables that may help explain variation in wages, such as AGE, MARRIED etc. We therefore define a model with all controls.
reg_wage_all_control <- WAGE~AGE+AGE2+AGEW+AGEY+CHILDREN+DEGREE+DEGREEW+
FEDUCA+FEDUCH+GRADHS+GRADHSW+HAPPINESS+MARRIED+MEDUCA+
MEDUCH+SEX+STATE+YEARSCOLL
# does not use GRADCOLL and GRADCOLLW as that info is collinear with DEGREE and DEGREEW
# check table(PSID$GRADCOLL,PSID$DEGREE) to see why
# run multiple linear regression
OLS2<-lm(reg_wage_all_control, data = PSID[sel_train,])
stargazer(OLS2, type = "text")
Why did the above regression exclude the GRADCOLL and GRADCOLLW variables?
The regression output shows almost 100 coefficients. Recall that dummy variables are created for all possible outcomes of each factor variable; as there are 52 possible outcomes for the STATE variable, 52-1 = 51 dummy variables are included for the STATE variable alone.
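A quick way to see this (not in the original code) is to count the coefficients directly:
length(coef(OLS2))   # number of estimated coefficients, including the intercept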
Of course many of these variables are not significant. For instance, let's test whether the coefficients relating to the MEDUCH variable are statistically significantly different from 0.
test0 <- names(coef(OLS2)) # get all coefficient names
test0 <- test0[grepl("MEDUCH", test0)] # keep the ones that contain "MEDUCH"
lht(OLS2, test0)
##
## Linear hypothesis test:
## MEDUCH1 = 0
## MEDUCH2 = 0
## MEDUCH3 = 0
## MEDUCH4 = 0
## MEDUCH5 = 0
## MEDUCH6 = 0
## MEDUCH7 = 0
## MEDUCH8 = 0
##
## Model 1: restricted model
## Model 2: WAGE ~ AGE + AGE2 + AGEW + AGEY + CHILDREN + DEGREE + DEGREEW +
## FEDUCA + FEDUCH + GRADHS + GRADHSW + HAPPINESS + MARRIED +
## MEDUCA + MEDUCH + SEX + STATE + YEARSCOLL
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1697 140669
## 2 1689 140159 8 509.98 0.7682 0.631
The p-value is larger than 0.6 and hence the null hypothesis that these coefficients are 0 cannot be rejected.
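The same logic applies to any other group of dummy variables. As an illustrative sketch (assuming none of the STATE coefficients are aliased in OLS2), the joint significance of the STATE dummies could be tested in exactly the same way:
test_state <- names(coef(OLS2))                       # get all coefficient names
test_state <- test_state[grepl("STATE", test_state)]  # keep the STATE dummies
lht(OLS2, test_state)                                 # joint test that all STATE coefficients are 0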
The mechanics of the forward, backward and best subset selection methods are described in more detail here.
PSID_reg <- PSID %>% select(-c(GRADCOLL,GRADCOLLW)) # to exclude the collinear variables
regfit.forw = regsubsets(WAGE~.,data=PSID_reg[sel_train,],nvmax=40,method="forward")
## Reordering variables and trying again:
forw.summary = summary(regfit.forw)
rsq_forw <- forw.summary$rsq
bic_forw <- forw.summary$bic
regfit.back = regsubsets(WAGE~.,data=PSID_reg[sel_train,],nvmax=40,method="backward")
## Reordering variables and trying again:
back.summary = summary(regfit.back)
rsq_back <- back.summary$rsq
bic_back <- back.summary$bic
An exhaustive (best subset) search would, in principle, also be a possibility, but with around 100 candidate variables this leaves too many possible models. The code would run for a very long time, even for this modest number of observations. We therefore will not apply it.
regfit.full = regsubsets(WAGE~.,data=PSID_reg[sel_train,],nvmax=40,really.big = TRUE, method="exhaustive")
full.summary = summary(regfit.full)
rsq_full <- full.summary$rsq
bic_full <- full.summary$bic
res_sel <- data.frame(vars = seq(1,41), bic = bic_forw, rsq = rsq_forw, method = "Forward")
res_sel <- rbind(res_sel, data.frame(vars = seq(1,41), bic = bic_back, rsq = rsq_back, method = "Backward"))
Look at the resulting dataframe to understand its structure. Now we can easily plot the results.
ggplot(res_sel, aes(x = vars, y = bic, color = method)) +
geom_line() +
labs(title = "BIC for forward and backward selection")
res_sel %>% group_by(method) %>% summarise(test = which.min(bic))
## # A tibble: 2 × 2
## method test
## <chr> <int>
## 1 Backward 13
## 2 Forward 18
The different methods select slightly different numbers of optimal variables. Let's check which variables were selected.
print("Selected variables from forward procedure")
## [1] "Selected variables from forward procedure"
selvars.forw <- forw.summary$which[18,]
names(selvars.forw[selvars.forw==TRUE])
## [1] "(Intercept)" "AGE" "AGE2" "AGEW" "DEGREEW6"
## [6] "FEDUCH8" "GRADHS1" "GRADHSW1" "MEDUCA" "YEARSCOLL"
## [11] "STATE2" "STATE6" "STATE24" "STATE27" "STATE34"
## [16] "STATE36" "STATE53" "DEGREE4" "DEGREE6"
print("")
## [1] ""
print("Selected variables from backward procedure")
## [1] "Selected variables from backward procedure"
selvars.back <- back.summary$which[13,]
names(selvars.back[selvars.back==TRUE])
## [1] "(Intercept)" "AGE" "AGE2" "DEGREEW6" "FEDUCH8"
## [6] "GRADHS1" "GRADHSW1" "MEDUCA" "SEX2" "YEARSCOLL"
## [11] "STATE28" "STATE34" "DEGREE4" "DEGREE6"
From these lists you can see that the two selections contain common variables (like FEDUCH8 or GRADHS1), but there are also some variables which only appear in one of the two selected models.
We now apply the two estimated models to the test dataset. As discussed in the exercise dedicated to the subset selection methods, this is facilitated by the following function.
predict.regsubsets = function(object,newdata,id,...){
form = as.formula(object$call[[2]]) # Extract the formula used when we called regsubsets()
mat = model.matrix(form,newdata) # Build the model matrix
coefi = coef(object,id=id) # Extract the coefficients of the id-th model
xvars = names(coefi) # Pull out the names of the predictors used in the ith model
mat[,xvars]%*%coefi # Make predictions using matrix multiplication
}
# Create model matrix - may not be needed
#mat <- model.matrix(WAGE ~.,PSID)
#mat_train <- mat[sel_train,]
#mat_test <- mat[sel_test,]
pred.forw=predict.regsubsets(regfit.forw,PSID_reg[sel_test,],id=18)
val.errors.forw=mean((PSID_reg[sel_test,]$WAGE-pred.forw)^2)
pred.back=predict.regsubsets(regfit.back,PSID_reg[sel_test,],id=13)
val.errors.back=mean((PSID_reg[sel_test,]$WAGE-pred.back)^2)
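It can be helpful to look at these two validation MSEs straight away (this printout is not part of the original code):
print(paste("Test MSE, forward selection (18 variables):", val.errors.forw))
print(paste("Test MSE, backward selection (13 variables):", val.errors.back))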
The previous variable selection methods estimated a lot of different models and then compared these using information criteria. Ridge regression estimates one model only (no longer by OLS) but penalises coefficients in a way that "bends" them towards 0.
To apply this method we need to define a vector y which contains the dependent variable and a matrix x which contains all candidate explanatory variables in its columns. So let's define these two objects.
y <- PSID_reg$WAGE
x <- model.matrix(WAGE~.,data = PSID_reg)[,-1]
The model.matrix function is useful as it creates a matrix of all the explanatory variables of a model that regresses the dependent variable (here WAGE) on all other variables in the specified dataset (here PSID_reg). The important role of this function is to convert all categorical variables into numerical dummy variables. The glmnet function we are using to estimate ridge and LASSO regressions can only deal with numeric variables. The [,-1] at the end deletes the first column, which is the constant.
How many rows and columns does the x matrix have?
You can use the command dim(x) to get the numbers of rows and columns of a matrix.
Note that we did this on the entire dataset (training and testing) to ensure that both datasets have the same structure.
You will now use the glmnet function to estimate a ridge regression. While we will not go through the technical details of ridge regressions, it is worthwhile to restate the optimisation problem solved by ridge regression. We are estimating the parameters in the following regression model
\[\begin{equation} y_i = \beta_0 + \sum^p_{j=1} \beta_j x_{ij} + u_i \end{equation}\]
where the regression model contains all \(p\) regressors included in the matrix x we defined above. The parameters are estimated by minimizing
\[\begin{equation} \sum^n_{i=1} \left( y_i - \beta_0 - \sum^p_{j=1} \beta_j x_{ij}\right)^2 + \lambda \sum^p_{j=1} \beta_j^2 = RSS + \lambda \sum^p_{j=1} \beta_j^2 \end{equation}\]
This is the standard residual sum of squares (RSS) plus a penalty term which penalises large parameter values. The parameter \(\lambda\) has to be chosen by you, the empirical researcher. More on that soon.
Before you apply the following function you should briefly browse through its help page (?glmnet) to understand some of the main features. There you will learn that the input option alpha = 0 instructs the function to implement a ridge regression. This sets the penalty term to be equal to the above \(\lambda \sum^p_{j=1} \beta_j^2\).
Let's estimate a ridge regression:
ridge=glmnet(x[sel_train,],y[sel_train],family="gaussian",alpha=0)
Does the estimated ridge regression standardise the variables?
You can use the help function ?glmnet to find information on the function's default choice.
The glmnet function automatically minimises the above objective function for a range of values of \(\lambda\). For each value of \(\lambda\) the estimated coefficients are different. The larger the \(\lambda\), the more heavily large parameter values are penalised and therefore the smaller the parameter estimates will be.
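A brief look at the fitted object (not in the original code) shows the scale of what has been estimated:
length(ridge$lambda)   # number of lambda values glmnet worked through
dim(coef(ridge))       # one row per coefficient, one column per lambda value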
What is the meaning of the family = "gaussian" option in the call to glmnet?
Consult the help function. In the Details section you will find that using this option ensures that the optimisation minimises \((1/(2n)) \cdot RSS + \lambda \cdot \text{penalty}\). That is the optimisation function stated above. The difference is the factor \(1/(2n)\), but as that is a constant this makes no difference to the optimised parameter values.
The following plot shows these parameter estimates for large \(\lambda\)s on the left and small \(\lambda\)s on the right.
plot(ridge, main="Ridge and coefficients path")
You can see one line for each of the 101 variables included in the model. An alternative way to see the parameters is the following. First we re-estimate the model for particular values of \(\lambda\).
paste("Large lambda:", ridge$lambda[25])
## [1] "Large lambda: 511.534471271628"
paste("Medium lambda:", ridge$lambda[50])
## [1] "Medium lambda: 49.9774271934175"
paste("Smalllambda:", ridge$lambda[75])
## [1] "Smalllambda: 4.8828444008166"
ridge_lar=glmnet(x[sel_train,],y[sel_train],family="gaussian",alpha=0,lambda=ridge$lambda[25])
ridge_med=glmnet(x[sel_train,],y[sel_train],family="gaussian",alpha=0,lambda=ridge$lambda[50])
ridge_small=glmnet(x[sel_train,],y[sel_train],family="gaussian",alpha=0,lambda=ridge$lambda[75])
Now we use the coefplot function to plot the coefficients for each of these models.
coefplot(ridge_small,intercept=FALSE,sort="magnitude",ylab="Variable",xlab="Coefficient",title="RIDGE Coefficient Plot (small lambda)")
coefplot(ridge_med,intercept=FALSE,sort="magnitude",ylab="Variable",xlab="Coefficient",title="RIDGE Coefficient Plot (mid lambda)")
coefplot(ridge_lar,intercept=FALSE,sort="magnitude",ylab="Variable",xlab="Coefficient",title="RIDGE Coefficient Plot (large lambda)")
rm(ridge_lar,ridge_med,ridge_small)
When you compare the plots make sure you note the different scales on the horizontal axis. Larger \(\lambda\)s result in smaller coefficient values. The vertical axis shows the associated variables, but there are too many to read them properly.
The ridge regressions estimated above incorporate a trade-off. A \(\lambda = 0\) would have delivered the best in-sample fit; however, this is unlikely to be the best choice when using the model for forecasting, as it would over-fit the data used for estimation. Larger values of \(\lambda\) penalise against this in-sample over-fitting. But what is the best \(\lambda\) to use?
You will apply a cross-validation approach to establish the best value of \(\lambda\). The glmnet package we are using here to estimate Ridge (and LASSO) regressions has a function that makes this easy to apply. Here is the code:
set.seed(1) # choosing CV folds is random, this ensures that you get identical folds when re-running the code
cvridge=cv.glmnet(x[sel_train,],y[sel_train],family="gaussian",alpha=0)
plot(cvridge,main="Ridge and CV criterion")
How many folds are used in the default setting for cv.glmnet?
Why does plot(cvridge,main="Ridge and CV criterion") return the mean square error as the criterion of fit?
As you can see, the cv.glmnet function returns estimation results for a range of possible \(\lambda\) values. Let's see which is the optimal \(\lambda\) by looking at cvridge$lambda.min, which is where the \(\lambda\) with the smallest MSE has been saved.
cvridge$lambda.min
## [1] 2.794143
When calculating the natural log of that \(\lambda\) you will find it to be 1.0275255, which corresponds to the value where you can see the minimum MSE in the above plot (also signaled by the left dashed vertical line).
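Since the horizontal axis of the CV plot is on the log scale, this is the number to look for (a small check, not in the original code):
log(cvridge$lambda.min)   # position of the left dashed line in the CV plot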
As you can see from the above plot, the MSE stays more or less the same even for somewhat larger values of \(\lambda\), and it is often seen as advantageous to impose a somewhat larger penalty. The second vertical dashed line in the above plot indicates the value of \(\lambda\) where the MSE is only one standard error larger than at cvridge$lambda.min. This is this value:
cvridge$lambda.1se
## [1] 23.74333
Once you have chosen a \(\lambda\) you can get the coefficients and predictions from that particular model. Here we list the estimated coefficients using both \(\lambda_{min}\) and \(\lambda_{1se}\). To do that we use the coef function, with the s option selecting the \(\lambda\) value.
# list estimated coefficients with lambda.min
paste("lambda = ", cvridge$lambda.min)
## [1] "lambda = 2.79414308848722"
c.min <- coef(cvridge, s = "lambda.min")
# list estimated coefficients with lambda.1se
paste("lambda = ", cvridge$lambda.1se)
## [1] "lambda = 23.7433268986241"
c.1se <- coef(cvridge, s = "lambda.1se")
cbind(c.min, c.1se)
## 102 x 2 sparse Matrix of class "dgCMatrix"
## s1 s1
## (Intercept) 9.1014067436 12.6742187209
## AGE 0.0802872856 0.0405998911
## AGE2 0.0004950848 0.0004057272
## AGEW 0.0264441625 0.0210302370
## AGEY 0.1021420149 0.0333197847
## CHILDREN -0.0693001375 -0.0856079570
## DEGREEW1 0.1506774212 0.2254285581
## DEGREEW2 1.2577026008 1.4049824045
## DEGREEW3 -1.4065242890 0.0963861901
## DEGREEW4 5.6376863928 3.8931612712
## DEGREEW5 . .
## DEGREEW6 11.7197142469 5.1312209193
## FEDUCA 0.1861737013 0.1232521289
## FEDUCH1 -0.1178456819 -0.4150403277
## FEDUCH2 -0.7432948488 -0.5924396256
## FEDUCH3 -0.0396782274 -0.2846424423
## FEDUCH4 0.4481337574 -0.0600412747
## FEDUCH5 1.1583178578 0.2045373810
## FEDUCH6 0.5972920274 0.1703029202
## FEDUCH7 -0.2202405829 0.0802927201
## FEDUCH8 2.7873580854 1.5419665798
## GRADHS1 1.8519555784 0.8623297565
## GRADHS2 0.2656079422 -0.4974598399
## GRADHS3 -0.2467499094 -0.8551807161
## GRADHSW1 1.6068166869 0.8682308987
## GRADHSW2 0.7052146120 -0.0545319468
## GRADHSW3 0.1687687981 -0.3424071917
## HAPPINESS -0.2847114241 -0.1636467044
## MARRIED2 -1.6566234245 -0.9016139453
## MARRIED3 -3.0382398265 -1.2016453792
## MARRIED4 0.0502963342 0.1311097546
## MARRIED5 -1.2020387753 -0.5594359986
## MEDUCA 0.2596187794 0.1296854734
## MEDUCH1 -1.5739110699 -0.9027864905
## MEDUCH2 -0.4039232333 -0.5072512445
## MEDUCH3 -0.0965886240 -0.3860794002
## MEDUCH4 0.2489500872 0.0568353206
## MEDUCH5 1.9798262397 0.7910881146
## MEDUCH6 0.2921505895 -0.0016639295
## MEDUCH7 0.7016035783 0.1482510129
## MEDUCH8 0.0595108304 0.1013835380
## SEX2 -1.1967053144 -0.7208610770
## YEARSCOLL 0.9481175157 0.5045621532
## STATE1 -0.9524462308 -0.3696938263
## STATE2 9.6436546458 3.7212511921
## STATE4 -0.1376531799 -0.2075287169
## STATE5 -2.4813508962 -0.6288146557
## STATE6 2.5781905768 1.1941635480
## STATE8 -2.3375130540 -0.9890307150
## STATE9 2.1186883743 0.8268321794
## STATE10 -5.2631360971 -2.3426680721
## STATE11 4.3600403454 1.5231882599
## STATE12 -1.6213635920 -0.6463841703
## STATE13 0.3088584956 -0.0384257467
## STATE15 . .
## STATE16 -3.5589325343 -0.8397824782
## STATE17 1.4625047089 0.6859675600
## STATE18 -0.1613226817 -0.1887122086
## STATE19 -1.8640715663 -0.7379751326
## STATE20 -0.0950459172 -0.0352675266
## STATE21 -0.9249492636 -0.3108357225
## STATE22 -0.6486062092 -0.6205942232
## STATE23 1.4726416559 0.6865286362
## STATE24 2.3163726879 0.7980432677
## STATE25 1.8078808024 0.9523231134
## STATE26 -1.5729000766 -0.7389507599
## STATE27 3.2910798191 1.4772488318
## STATE28 -2.2895897200 -1.1767101633
## STATE29 0.0686523576 -0.0616267683
## STATE30 0.0331920591 -0.0069603181
## STATE31 -3.3008764771 -1.0944796406
## STATE32 -0.3294603454 0.0089716052
## STATE33 2.1206206074 0.8666355839
## STATE34 6.9961301095 2.8743553025
## STATE35 -2.2504073187 -0.3789120127
## STATE36 2.5512633319 1.4122461773
## STATE37 -1.3977041728 -0.7079579999
## STATE38 -1.9190881743 -0.5510121743
## STATE39 0.0427188985 0.1573350848
## STATE40 -1.7502426964 -0.8066552679
## STATE41 1.7118663909 0.7320701950
## STATE42 -0.5198135303 -0.2443726690
## STATE44 -6.1195892139 -3.2775690086
## STATE45 -1.1699703136 -0.7613723302
## STATE46 -3.2255215484 -0.9661174279
## STATE47 -2.5769568236 -1.0668846228
## STATE48 -0.6671657814 -0.1549902809
## STATE49 0.8566125335 0.6079918760
## STATE50 3.2260551128 1.0237892948
## STATE51 -1.8839119039 -0.8429527795
## STATE53 3.0015704824 1.4684296750
## STATE54 -0.1234142874 0.1735687031
## STATE55 -2.0914613884 -0.9099678488
## STATE56 -0.6975400126 -0.3886752337
## DEGREE1 0.5790263430 0.3475980178
## DEGREE2 1.7331023330 1.1575254669
## DEGREE3 3.9695066217 2.3172383310
## DEGREE4 11.7600811226 6.0589030042
## DEGREE5 9.3062814482 4.3270434655
## DEGREE6 74.6105565824 31.2325995033
## SPOUSETRUE -0.0408197776 0.6148527780
## CHILDTRUE -0.2873259363 -0.1626462650
If you re-run the above code with a different random seed, will the optimised parameters at any \(\lambda\) remain unchanged?
The folds are created randomly: every time you divide the observations into folds this allocation is random. For different allocations you will find different optimised parameters. Only if you set the random seed to a particular value can you replicate results.
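A minimal sketch to see this for yourself (the seed 42 is arbitrary and not part of the original code):
set.seed(42)   # a different, arbitrary seed creates different folds
cvridge2 <- cv.glmnet(x[sel_train,], y[sel_train], family = "gaussian", alpha = 0)
c(cvridge$lambda.min, cvridge2$lambda.min)   # the two optimal lambdas will typically differ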
Once you have selected a model, for instance the model with \(\lambda_{1se}\), you can also use the predict function to produce forecasts from the estimated model. Typically you will have a new set of observations, as we have here using the observations indexed by sel_test, which were not used in the estimation process. To produce forecasts you will need a new matrix x which has the correct number of columns. Recall what the dimension of the matrix x is.
How many columns does the x matrix have?
You can use the command dim(x) to get the numbers of rows and columns of a matrix. You could also look at dim(x[sel_train,]), which is the actual x matrix used for estimation. Both have the same number of columns.
You can feed in any number of observations, but you will need to ensure that the new input matrix has the same column structure (number of columns and same ordering of variables) as the x used in estimating the model. As we will use x[sel_test,], this is the case by construction.
ridge.predict.lambda.1se <- predict(cvridge, x[sel_test,], type="response",s=cvridge$lambda.1se)
ridge.predict.lambda.min <- predict(cvridge, x[sel_test,], type="response",s=cvridge$lambda.min)
The result is a vector with one prediction for each of the observations we set aside in our test dataset. We can compare these to the original observations of hourly wages, which are stored in y.
yp_ridge <- as.vector(ridge.predict.lambda.1se) # as.vector converts the prediction matrix into a vector
temppred <- data.frame(y = y[sel_test], yp.ridge = yp_ridge )
ggplot(temppred, aes(y = yp.ridge, x = y)) +
geom_point() +
geom_abline(aes(intercept = 0, slope = 1)) +
labs(x = "Observed Wage", y = "Predicted wage",
title ="Predictions from Ridge regression")
As you can see, the model clearly is unable to predict very high wages well.
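One simple way to quantify this visual impression (not part of the original code) is the correlation between observed and predicted wages in the test data:
cor(temppred$y, temppred$yp.ridge)   # correlation between observed and predicted wages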
Ridge regressions introduced a penalty to the least squares fit. This penalty acts against large coefficient values and hence shrinks the coefficients relative to the least squares fit; importantly, it shrinks the parameter values towards 0 but not exactly to 0. The LASSO introduces a different penalty, which ensures that many coefficients are estimated to be exactly 0.
We are estimating the parameters in the following, unchanged, regression model
\[\begin{equation} y_i = \beta_0 + \sum^p_{j=1} \beta_j x_{ij} + u_i \end{equation}\]
where the regression model contains all \(p\) regressors included in the matrix x we defined above. The parameters are estimated by minimizing
\[\begin{equation} \sum^n_{i=1} \left( y_i - \beta_0 - \sum^p_{j=1} \beta_j x_{ij}\right)^2 + \lambda \sum^p_{j=1} |\beta_j| = RSS + \lambda \sum^p_{j=1} |\beta_j| \end{equation}\]
Implementing the LASSO is very similar to the Ridge regression. The glmnet function is used to estimate the LASSO, where we have to set the alpha parameter to alpha = 1.
lasso=glmnet(x[sel_train,],y[sel_train],family="gaussian",alpha=1)
As for the Ridge regression this function implements the optimisation for a range of different \(\lambda\) values.
plot(lasso,main="Lasso and coefficients path")
We can see that larger \(\lambda\) values (on the left) lead to parameter values which are 0. The smaller the \(\lambda\), the more parameters are estimated to be different from 0. This becomes more obvious if we look at the coefficient plot for a particular choice of \(\lambda\).
lasso25=glmnet(x[sel_train,],y[sel_train],family="gaussian",alpha=1,lambda=lasso$lambda[25])
coefplot(lasso25,intercept=FALSE,sort="magnitude",ylab="Variable",xlab="Coefficient",title="LASSO Coefficient Plot (high lambda)")
The plot only shows the variables with coefficients that have been estimated to be unequal to 0. These are only 23 out of 101 variables.
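You can confirm this count programmatically; a small check which is not part of the original code:
sum(coef(lasso25)[-1, ] != 0)   # number of non-zero coefficients, excluding the intercept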
Similar to the Ridge regression, you will eventually have to choose a \(\lambda\) value.
The tool of choice is cross validation and this is implemented in a similar way as for the Ridge regression.
set.seed(1)
cv_lasso=cv.glmnet(x[sel_train,],y[sel_train],family="gaussian",nfolds=10,alpha=1)
# Plot CV criterion
plot(cv_lasso,main="Lasso and CV criterion")
You see from this plot that the minimum of the MSE criterion (average MSE over the 10 cross-validation folds) is where \(log(\lambda)\) is about -1.5 (equivalent to \(\lambda =\) 0.2214311). This could be a \(\lambda\) value to use. It is, however, common practice in this literature to choose (or at least consider) a somewhat larger \(\lambda\), implying even more regularisation. That is the \(\lambda\) value associated with the second dashed line, around \(log(\lambda)=0.17\) or \(\lambda =\) 1.181711.
Let us now check what parameters these different levels of regularisation imply:
# list estimated coefficients with lambda.min
print(paste("lambda min = ", cv_lasso$lambda.min))
## [1] "lambda min = 0.221431061317992"
print(paste("lambda +1s.e. = ", cv_lasso$lambda.1se))
## [1] "lambda +1s.e. = 1.18171097470092"
# get the coefficients for both choices of lambda
lasso.coef.lambda.min <- coef(cv_lasso, s = "lambda.min")
lasso.coef.lambda.1se <- coef(cv_lasso, s = "lambda.1se")
print(cbind(lasso.coef.lambda.min, lasso.coef.lambda.1se))
## 102 x 2 sparse Matrix of class "dgCMatrix"
## s1 s1
## (Intercept) 7.39360389 11.43160186
## AGE 0.12649026 0.06284273
## AGE2 . .
## AGEW 0.02225465 0.05203511
## AGEY 0.05040412 .
## CHILDREN . .
## DEGREEW1 . .
## DEGREEW2 0.41194378 .
## DEGREEW3 -0.12300401 .
## DEGREEW4 . .
## DEGREEW5 . .
## DEGREEW6 9.95505248 .
## FEDUCA 0.10419891 .
## FEDUCH1 . .
## FEDUCH2 -0.74588672 .
## FEDUCH3 . .
## FEDUCH4 0.06797709 .
## FEDUCH5 . .
## FEDUCH6 . .
## FEDUCH7 . .
## FEDUCH8 2.26554506 .
## GRADHS1 1.61005944 .
## GRADHS2 . .
## GRADHS3 . .
## GRADHSW1 1.86900841 0.68113426
## GRADHSW2 . .
## GRADHSW3 . .
## HAPPINESS -0.08197663 .
## MARRIED2 -1.42289172 -0.52094874
## MARRIED3 -1.99047193 .
## MARRIED4 . .
## MARRIED5 -0.37393495 .
## MEDUCA 0.25252174 0.02207431
## MEDUCH1 . .
## MEDUCH2 -0.27205898 .
## MEDUCH3 . .
## MEDUCH4 . .
## MEDUCH5 1.29510288 .
## MEDUCH6 . .
## MEDUCH7 0.03166446 .
## MEDUCH8 . .
## SEX2 -1.23414809 .
## YEARSCOLL 1.36927802 1.24088121
## STATE1 . .
## STATE2 7.54381464 .
## STATE4 . .
## STATE5 -1.37179944 .
## STATE6 3.08567866 0.15176071
## STATE8 -0.66083334 .
## STATE9 . .
## STATE10 . .
## STATE11 2.59598973 .
## STATE12 . .
## STATE13 . .
## STATE15 . .
## STATE16 . .
## STATE17 1.36208901 .
## STATE18 . .
## STATE19 -0.03613881 .
## STATE20 . .
## STATE21 . .
## STATE22 . .
## STATE23 . .
## STATE24 1.93388344 .
## STATE25 1.02461109 .
## STATE26 -0.21175294 .
## STATE27 3.04390671 .
## STATE28 -1.47925003 .
## STATE29 . .
## STATE30 . .
## STATE31 -0.67967388 .
## STATE32 . .
## STATE33 . .
## STATE34 7.49087624 0.02591321
## STATE35 . .
## STATE36 2.40722495 .
## STATE37 -0.43440305 .
## STATE38 . .
## STATE39 . .
## STATE40 . .
## STATE41 1.03770334 .
## STATE42 . .
## STATE44 . .
## STATE45 -0.14928896 .
## STATE46 -0.04182384 .
## STATE47 -0.77044516 .
## STATE48 . .
## STATE49 . .
## STATE50 . .
## STATE51 -0.65198063 .
## STATE53 3.19453623 .
## STATE54 . .
## STATE55 -0.43950153 .
## STATE56 . .
## DEGREE1 . .
## DEGREE2 0.28490050 .
## DEGREE3 1.76564741 .
## DEGREE4 10.33057437 .
## DEGREE5 3.85163572 .
## DEGREE6 87.03248659 69.61234545
## SPOUSETRUE . .
## CHILDTRUE . .
You can see that the larger \(\lambda\) value does deliver a much sparser model.
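Counting the non-zero coefficients makes the difference in sparsity explicit (an additional check, not in the original code):
sum(lasso.coef.lambda.min[-1, ] != 0)   # non-zero coefficients with lambda.min
sum(lasso.coef.lambda.1se[-1, ] != 0)   # non-zero coefficients with lambda.1se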
Let's see how closely the model predictions (in the test dataset) are correlated.
# Extract the predicted values
lasso.predict.lambda.min <- predict(cv_lasso, x[sel_test,], s=cv_lasso$lambda.min)
lasso.predict.lambda.1se <- predict(cv_lasso, x[sel_test,], s=cv_lasso$lambda.1se)
# Correlation between the predicted values of the two Lasso models
cor_fit <- cor(lasso.predict.lambda.min,lasso.predict.lambda.1se)
print(paste0("Correlation between predicted values: ", cor_fit))
## [1] "Correlation between predicted values: 0.935481426603129"
So while the models are quite different, the correlation between the two prediction series is quite strong.
We have now estimated models with forward and backward variable selection as well as Ridge and Lasso regressions. Let us compare how well these models do in predicting wages in our testing dataset.
y.real <- y[sel_test]
MSE.back <- mean((y.real-pred.back)^2)
MSE.forw <- mean((y.real-pred.forw)^2)
MSE.ridge.min <- mean((y.real-ridge.predict.lambda.min)^2)
MSE.ridge.1se <- mean((y.real-ridge.predict.lambda.1se)^2)
MSE.lasso.min <- mean((y.real-lasso.predict.lambda.min)^2)
MSE.lasso.1se <- mean((y.real-lasso.predict.lambda.1se)^2)
print("Mean Square Errors (MSE) in training set")
## [1] "Mean Square Errors (MSE) in training set"
print(paste("MSE backward selection",MSE.back))
## [1] "MSE backward selection 163.1522633524"
print(paste("MSE forward selection",MSE.forw))
## [1] "MSE forward selection 163.881339600827"
print(paste("MSE Ridge min lambda",MSE.ridge.min))
## [1] "MSE Ridge min lambda 153.676638320345"
print(paste("MSE Ridge 1se lambda",MSE.ridge.1se))
## [1] "MSE Ridge 1se lambda 150.588153326857"
print(paste("MSE LASSO min lambda",MSE.lasso.min))
## [1] "MSE LASSO min lambda 157.232877262571"
print(paste("MSE LASSO 1se lambda",MSE.lasso.1se))
## [1] "MSE LASSO 1se lambda 158.954752803404"
From here you can see that in this case the Ridge regression delivered the best out of sample performance.
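To make this comparison easier to read, the MSEs can be collected in a small data frame and sorted; this is an illustrative addition, not part of the original code:
MSE_comp <- data.frame(model = c("Backward", "Forward", "Ridge (min)", "Ridge (1se)",
                                 "LASSO (min)", "LASSO (1se)"),
                       MSE = c(MSE.back, MSE.forw, MSE.ridge.min, MSE.ridge.1se,
                               MSE.lasso.min, MSE.lasso.1se))
MSE_comp[order(MSE_comp$MSE), ]   # smallest (best) test MSE first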
In R, but also in other programming languages like Python, the same thing can be done in different ways. In particular, different packages may have been written to achieve the same thing.
Which function from the hdm package allows you to estimate a LASSO regression?
You can't know!!! But you can find out. If you suspect that a particular package has a certain functionality, you should web search something like "R hdm lasso regression" and you are likely to find a link to the package's page on the CRAN website. These pages contain information on all packages. The reference manual and the vignette are likely to help you find the relevant function.
# replace XXXX with the correct function name
lasso.hdm <- XXXX(WAGE~.,data = PSID_reg[sel_train,],post=TRUE)
summary(lasso.hdm)
##
## Call:
## rlasso.formula(formula = WAGE ~ ., data = PSID_reg[sel_train,
## ], post = TRUE)
##
## Post-Lasso Estimation: TRUE
##
## Total number of variables: 99
## Number of selected variables: 13
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.059 -4.865 -1.422 2.611 90.824
##
## Estimate
## (Intercept) 7.768
## AGE 0.151
## AGE2 0.000
## AGEW 0.026
## AGEY 0.000
## CHILDREN 0.000
## DEGREEW1 0.000
## DEGREEW2 0.000
## DEGREEW3 0.000
## DEGREEW4 0.000
## DEGREEW6 0.000
## FEDUCA 0.000
## FEDUCH1 0.000
## FEDUCH2 -2.235
## FEDUCH3 0.000
## FEDUCH4 0.000
## FEDUCH5 0.000
## FEDUCH6 0.000
## FEDUCH7 0.000
## FEDUCH8 0.000
## GRADHS1 1.603
## GRADHS2 0.000
## GRADHS3 -0.329
## GRADHSW1 1.756
## GRADHSW2 0.000
## GRADHSW3 0.000
## HAPPINESS 0.000
## MARRIED2 -1.407
## MARRIED3 0.000
## MARRIED4 0.000
## MARRIED5 0.000
## MEDUCA 0.000
## MEDUCH1 0.000
## MEDUCH2 0.000
## MEDUCH3 0.000
## MEDUCH4 0.000
## MEDUCH5 0.000
## MEDUCH6 0.000
## MEDUCH7 0.000
## MEDUCH8 0.000
## SEX2 -1.874
## YEARSCOLL 1.717
## STATE1 0.000
## STATE2 0.000
## STATE4 0.000
## STATE5 0.000
## STATE6 0.000
## STATE8 0.000
## STATE9 0.000
## STATE10 0.000
## STATE11 0.000
## STATE12 0.000
## STATE13 0.000
## STATE16 0.000
## STATE17 0.000
## STATE18 0.000
## STATE19 0.000
## STATE20 0.000
## STATE21 0.000
## STATE22 0.000
## STATE23 0.000
## STATE24 0.000
## STATE25 0.000
## STATE26 0.000
## STATE27 0.000
## STATE28 -3.402
## STATE29 0.000
## STATE30 0.000
## STATE31 0.000
## STATE32 0.000
## STATE33 0.000
## STATE34 0.000
## STATE35 0.000
## STATE36 0.000
## STATE37 -2.396
## STATE38 0.000
## STATE39 0.000
## STATE40 0.000
## STATE41 0.000
## STATE42 0.000
## STATE44 0.000
## STATE45 -1.969
## STATE46 0.000
## STATE47 0.000
## STATE48 0.000
## STATE49 0.000
## STATE50 0.000
## STATE51 0.000
## STATE53 0.000
## STATE54 0.000
## STATE55 0.000
## STATE56 0.000
## DEGREE1 0.000
## DEGREE2 0.000
## DEGREE3 0.000
## DEGREE4 0.000
## DEGREE5 0.000
## DEGREE6 90.527
## SPOUSETRUE 0.000
## CHILDTRUE 0.000
##
## Residual standard error: 9.452
## Multiple R-squared: 0.343
## Adjusted R-squared: 0.3382
## Joint significance test:
## the sup score statistic for joint significance test is 1.199e+05 with a p-value of 0.002
While this conceptually estimates the same LASSO model as above, there are slight differences to the earlier approach. To figure out what these are you really need to read the help function. The main differences here are:
- The hdm package does not use cross validation to find the optimal \(\lambda\); it sets it to a data-dependent value (see the vignette for details).
- The coefficients of the selected variables are re-estimated by OLS, a post-LASSO estimation (post = TRUE).
Let's predict from this LASSO model.
lasso.hdm.predict = predict(lasso.hdm, newdata = PSID_reg[sel_test,]) #out-of-sample prediction
When you try this you will most likely get an error message like:
"Error in X %*% beta : non-conformable arguments"
Dealing with error messages like this is a normal and very common activity when applying empirical techniques. The message refers to "X %*% beta".
This refers to the multiplication needed for producing forecasts: \(\hat{y} = X \hat{\beta}\). For this operation to work the number of columns of \(X\) has to be identical to the number of estimated parameters in \(\hat{\beta}\). The error message tells us that this does not seem to be the case ("non-conformable arguments"). In this case the message leads us to a solution. Unfortunately, error messages are not always so informative. A great illustration of that, and an essential watch to understand how to debug, is this YouTube clip by Jenny Bryan.
So how can that be? We are feeding the same dataset into the predict function (newdata = PSID_reg[sel_test,]) as the one we used to estimate the model (data = PSID_reg[sel_train,]), just different rows. Why then do the parameters which come from the estimated model not fit the data used for forecasting?
The key to understanding this is to recognise that PSID_reg contains some factor variables. What the rlasso function does is turn each factor variable with k different values into k-1 dummy variables (e.g. DEGREEW). So it could be that in the training dataset there are observations for all 7 different outcomes, and hence when we called rlasso there were indeed 6 dummy variables created and appropriate parameters estimated. If, however, the testing dataset only contains observations for 6 out of these 7 categories, then only 5 dummy variables will be created and we have a mismatch between the columns of \(X\) and the elements of \(\hat{\beta}\).
This should not be a problem if all variables are numeric as then the size of the dataset handed into the function exactly represents the number of variables used.
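A sketch of how you could check for such a mismatch yourself (the helper code below is illustrative and not part of the original analysis):
# compare the number of factor levels actually present in the training and test rows
fac_vars <- names(PSID_reg)[sapply(PSID_reg, is.factor)]
sapply(fac_vars, function(v) {
  c(train = nlevels(droplevels(PSID_reg[[v]][sel_train])),
    test  = nlevels(droplevels(PSID_reg[[v]][sel_test])))
})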
So, how do we go about this? We will actually use a trick we already used when applying the Ridge and LASSO regressions earlier. There we first created the y and x objects which we used in glmnet. Let's recall how we created y and x:
y <- PSID_reg$WAGE
x <- model.matrix(WAGE~.,data = PSID_reg)[,-1]
y just gets the dependent variable. x uses the model.matrix function and creates the dummy variables from the factor variables in PSID_reg. Importantly, we do this for the entire dataset, not only the training dataset. We then feed the training rows (sel_train) into glmnet. Let's use the same approach with the rlasso function:
lasso.hdm2 <- rlasso(y[sel_train]~x[sel_train,],post=TRUE)
lasso.hdm2.predict = predict(lasso.hdm2, newdata = x[sel_test,]) #out-of-sample prediction
Now that we have forecasts from this LASSO implementation we can again calculate the MSE for these forecasts.
MSE.lasso.hdm <- mean((y.real-lasso.hdm2.predict)^2)
print(paste("MSE LASSO (hdm)",MSE.lasso.hdm))
## [1] "MSE LASSO (hdm) 155.085106352071"
Comparing this to the earlier evaluations you will realise that this is better than the two cross-validated LASSO versions, but not better than the Ridge regressions.
In this section we went through applications of Ridge and LASSO regressions. We learned how to apply cross-validation to find the optimal value for the penalty parameters and then how to apply the estimated models to a new dataset.
You also learned that you have to be very careful with the data setup, in particular when your dataset contains factor variables. As we encountered such a problem we also applied one of the most important skills you will need as an applied researcher, namely debugging your code. Nobody can teach you how to debug, but perhaps the work in this section gave you some hints on how to start.