This exercise is part of the ECLR page.
The problem addressed here is deciding between models of different complexity. To fix ideas, let us load some packages and the dataset we will work with.
library(tidyverse)
library(ggplot2)
library(stargazer)
library(ISLR2)
library(cv)
data(Auto) # This dataset, Auto, comes with the ISLR2 package
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
This is a dataset of different cars and their characteristics (name plus eight features). For instance
Auto[269,]
## mpg cylinders displacement horsepower weight acceleration year origin
## 271 21.1 4 134 95 2515 14.8 78 3
## name
## 271 toyota celica gt liftback
is a Toyota Celica.
How much horsepower does the "dodge rampage" have? You could of course open the object Auto, find the "dodge rampage" and check its horsepower ... but that is a little tedious. Imagine you had 2,000 car models and not just 392. Here is what you should do instead:
Auto$horsepower[Auto$name == "dodge rampage"]
Do you understand what this does?
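If you prefer the tidyverse style loaded above, the same lookup can also be written with filter() and select(); this is just an alternative way of doing the same thing and not required for the exercise.
Auto %>% filter(name == "dodge rampage") %>% select(horsepower)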
You are wondering how the fuel efficiency, as measured by miles-per-gallon (mpg, higher is more efficient), relates to the horsepower of the car. Let's start by estimating a simple linear model.
OLS1 <- lm(mpg ~ horsepower,data = Auto)
stargazer(OLS1, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## mpg
## -----------------------------------------------
## horsepower -0.158***
## (0.006)
##
## Constant 39.936***
## (0.717)
##
## -----------------------------------------------
## Observations 392
## R2 0.606
## Adjusted R2 0.605
## Residual Std. Error 4.906 (df = 390)
## F Statistic 599.718*** (df = 1; 390)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As you can see, horsepower appears to be a highly influential factor explaining variation in the outcome variable, mpg. The more horsepower, the lower the fuel efficiency. But is the relationship really linear? Let's estimate models with quadratic and cubic terms in horsepower.
Complete the following code to replicate the subsequent output.
OLS2 <- lm(mpg ~ poly(horsepower,2),data = XXXX)
OLS3 <- lm(mpg ~ XXXX(horsepower,XXXX),XXXX)
stargazer(OLS1, XXXX, XXXX, type = "text")
##
## ===============================================================================================
## Dependent variable:
## --------------------------------------------------------------------------
## mpg
## (1) (2) (3)
## -----------------------------------------------------------------------------------------------
## horsepower -0.158***
## (0.006)
##
## poly(horsepower, 2)1 -120.138***
## (4.374)
##
## poly(horsepower, 2)2 44.090***
## (4.374)
##
## poly(horsepower, 3)1 -120.138***
## (4.375)
##
## poly(horsepower, 3)2 44.090***
## (4.375)
##
## poly(horsepower, 3)3 -3.949
## (4.375)
##
## Constant 39.936*** 23.446*** 23.446***
## (0.717) (0.221) (0.221)
##
## -----------------------------------------------------------------------------------------------
## Observations 392 392 392
## R2 0.606 0.688 0.688
## Adjusted R2 0.605 0.686 0.686
## Residual Std. Error 4.906 (df = 390) 4.374 (df = 389) 4.375 (df = 388)
## F Statistic 599.718*** (df = 1; 390) 428.018*** (df = 2; 389) 285.481*** (df = 3; 388)
## ===============================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
From here you can see that the proportion of variation in the dependent variable explained by the model increases from about 0.61 for the linear model to 0.69 for the quadratic model, but does not increase any further with the cubic model.
You should check out the help page (?poly) to see the function's description. poly() includes higher-order versions of horsepower in the model. When you read the help you will realise that it uses orthogonalised higher powers, so not simply horsepower^2 and horsepower^3.
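To convince yourself that the orthogonalisation changes only the parameterisation and not the fit, you can compare the fitted values from the default and the raw parameterisation. This assumes you have completed OLS2 in the exercise above; OLS2.raw is a name chosen here just for this check.
OLS2.raw <- lm(mpg ~ poly(horsepower,2,raw = TRUE), data = Auto)
all.equal(fitted(OLS2), fitted(OLS2.raw)) # same fitted values, different coefficients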
Traditional ways of selecting between models are to perform hypothesis tests on the additional, higher-order variables, or to compare the in-sample fit of the models while accounting for the different number of variables used (e.g. select the model with the smallest information criterion).
Both of these methods rely on in-sample fit, meaning that the same observations are used for estimating the parameters and for evaluating how well the model fits. If you are building a model for the purpose of forecasting, you may instead want to estimate the model on one set of data and then evaluate its fit on another. This is what you would have to do when forecasting, and it is therefore attractive to use such a scheme for model evaluation as well.
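For reference, the in-sample information criteria mentioned above are easy to obtain; the call below again assumes you have completed OLS2 and OLS3 in the exercise above.
AIC(OLS1, OLS2, OLS3) # smaller values indicate a better trade-off between fit and complexity
BIC(OLS1, OLS2, OLS3)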
Let us randomly select half of the observations (the training or estimation set), estimate the three different models, and then see how well they predict mpg on the other half of the data (the test or evaluation set).
set.seed(123)
obs <- dim(Auto)[1]
train <- sample(obs,floor(dim(Auto)[1]/2)) # randomly selects half the obs
test <- seq(1:obs)
test <- test[-train] # all obs not in train
OLS1.train <- lm(mpg ~ horsepower,data = Auto[train,])
OLS2.train <- lm(mpg ~ poly(horsepower,2),data = Auto[train,])
OLS3.train <- lm(mpg ~ poly(horsepower,3),data = Auto[train,])
Now we can use the estimated models to predict the outcome variable in the test data set (Auto[test,]).
p1 <- predict(OLS1.train,Auto[test,])
p2 <- predict(OLS2.train,Auto[test,])
p3 <- predict(OLS3.train,Auto[test,])
Let us compare the predictions with the actual observations. In order to use ggplot we put the actual realisations and the three sets of predictions into a dataframe (as four variables) and then reshape the data into long format.
comp_data <- data.frame(real = Auto[test,"mpg"], p1 = p1, p2 = p2, p3 = p3)
comp_data <- comp_data %>% pivot_longer(cols = starts_with("p"),names_to = "forecast")
ggplot(comp_data,aes(x = real, y = value, colour = forecast)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, linewidth = 1) +
labs(title = "Comparing forecasts to realisations", x = "mpg", y = "Forecast mpg")
On the horizontal axis you have the realised mpg and on the vertical axis the forecast mpg. If a point lies on the black 45-degree line, the forecast exactly matches the realisation.
From the plot it is difficult to establish which model is better, so we calculate summary statistics to get a numerical measure of forecast quality.
# calculates the mean square prediction error
mse1.1 = mean((p1-Auto[test,"mpg"])^2)
mse2.1 = mean((p2-Auto[test,"mpg"])^2)
mse3.1 = mean((p3-Auto[test,"mpg"])^2)
# or alternatively use the mse(y,yhat) function from the cv package
mse1.1 <- mse(Auto[test,"mpg"],p1)
mse2.1 <- mse(Auto[test,"mpg"],p2)
mse3.1 <- mse(Auto[test,"mpg"],p3)
paste("MSE.OLS1.1 =", mse1.1)
## [1] "MSE.OLS1.1 = 21.2499053024022"
paste("MSE.OLS2.1 =", mse2.1)
## [1] "MSE.OLS2.1 = 16.4811183970587"
paste("MSE.OLS3.1 =", mse3.1)
## [1] "MSE.OLS3.1 = 16.5827624744033"
You can see that, on this occasion, the in-sample findings are confirmed: OLS2 and OLS3 perform better when it comes to forecasting.
An obvious question here is whether we would get a similar result if we estimated the models using the test data and then evaluated them on the train data.
train <- test # change the role of test and train data
test <- seq(1:obs)
test <- test[-train] # all obs not in train
OLS1.train <- lm(mpg ~ horsepower,data = Auto[train,])
OLS2.train <- lm(mpg ~ poly(horsepower,2),data = Auto[train,])
OLS3.train <- lm(mpg ~ poly(horsepower,3),data = Auto[train,])
p1 <- predict(OLS1.train,Auto[test,])
p2 <- predict(OLS2.train,Auto[test,])
p3 <- predict(OLS3.train,Auto[test,])
mse1.2 <- mse(Auto[test,"mpg"],p1)
mse2.2 <- mse(Auto[test,"mpg"],p2)
mse3.2 <- mse(Auto[test,"mpg"],p3)
paste("MSE.OLS1.2 =", mse1.2)
## [1] "MSE.OLS1.2 = 27.4554663165504"
paste("MSE.OLS2.2 =", mse2.2)
## [1] "MSE.OLS2.2 = 21.7291725516889"
paste("MSE.OLS3.2 =", mse3.2)
## [1] "MSE.OLS3.2 = 22.554576695954"
Qualitatively we get the same result: OLS2 and OLS3 fit the data better.
You should recognise that the above operations simulate the process of estimating a model on one dataset and then applying that model to forecast on a fresh dataset. This is done by assigning observations from the existing dataset to a training role and a testing role. In this case we split the data into two subsets (or folds) that were then used for training/estimation or testing/prediction purposes.
The introductory example above implemented a 2-fold cross-validation. Averaging the two MSE measures we obtained for each model gives a summary measure describing the quality of fit of each model.
paste("MSE.OLS1 =", (mse1.1+mse1.2)/2)
## [1] "MSE.OLS1 = 24.3526858094763"
paste("MSE.OLS2 =", (mse2.1+mse2.2)/2)
## [1] "MSE.OLS2 = 19.1051454743738"
paste("MSE.OLS3 =", (mse3.1+mse3.2)/2)
## [1] "MSE.OLS3 = 19.5686695851786"
In that example we split the data into two parts (folds). We could instead split the sample into, say, 4 parts. Then one could estimate and test the model with the following scheme (a hand-coded sketch of this follows after the table):
Estimation/Training parts | Prediction/Testing parts |
---|---|
1,2,3 | 4 |
1,2,4 | 3 |
1,3,4 | 2 |
2,3,4 | 1 |
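To make the scheme concrete, here is a minimal hand-coded sketch of such a 4-fold cross-validation for the quadratic model. The object names (fold, mse.fold, fit, pred) are chosen here for illustration only; the cv package introduced below does all of this for you.
set.seed(123)
k <- 4
n <- nrow(Auto)
fold <- sample(rep(1:k, length.out = n)) # randomly assign each observation to one of the k folds
mse.fold <- rep(NA, k)
for (j in 1:k) {
  fit <- lm(mpg ~ poly(horsepower,2), data = Auto[fold != j,]) # estimate on the other k-1 folds
  pred <- predict(fit, Auto[fold == j,]) # predict the left-out fold
  mse.fold[j] <- mean((Auto[fold == j,"mpg"] - pred)^2)
}
mean(mse.fold) # average MSE across the 4 folds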
In a similar way you could split the data into \(k\) parts. The larger \(k\), the more data you use to estimate each model, but the more models have to be estimated. This can get tedious and, fortunately, a package has been written that makes the job easier: the cv package.
You use the cv function to do the hard work, i.e. take your model, create all the necessary data splits, obtain the respective model parameters, predict the left-out observations, and calculate and summarise a measure of fit across all the folds. In fact, the cv function can do the whole thing for several models at the same time. This is useful as we would typically use cross-validation to decide which of a set of models we should be using.
Here is how you do the cross-validation for the three models OLS1, OLS2 and OLS3 using a 4-fold cross-validation.
cv4 <- cv(models(OLS1,OLS2,OLS3), data = Auto, k = 4)
Note that the input is the saved, full-sample models (OLS1, OLS2 and OLS3). You also need to specify the dataset used (the full dataset; the function will do the sample splitting) and how many folds you wish to have. If you want the random fold assignment to be reproducible you can also pass a seed argument, as we do further below.
You can look at the results as follows:
summary(cv4)
##
## Model model.1:
## 4-Fold Cross Validation
## method: Woodbury
## cross-validation criterion = 24.21926
## bias-adjusted cross-validation criterion = 24.18006
## full-sample criterion = 23.94366
##
## Model model.2:
## 4-Fold Cross Validation
## method: Woodbury
## cross-validation criterion = 19.42406
## bias-adjusted cross-validation criterion = 19.36081
## full-sample criterion = 18.98477
##
## Model model.3:
## 4-Fold Cross Validation
## method: Woodbury
## cross-validation criterion = 19.77161
## bias-adjusted cross-validation criterion = 19.65248
## full-sample criterion = 18.94499
Now you can compare the cross-validation criteria and choose the model with the smallest value. You can do that by looking at the above results or by plotting them. To plot the results we first have to extract the summary measures (the CV criterion, which by default is a mean squared error) into a useful structure.
cv4.crit <- as.data.frame(cv4,
rows="cv",
columns="criteria"
)
cv4.crit$p <- seq(1,3) # this adds a model number, indexing the polynomial order
This is now a dataframe and we can use ggplot to plot results.
ggplot(cv4.crit, aes(y = adjusted.criterion, x = p)) +
geom_line() +
labs(title = "CV mse for models with increasing polynomials in horsepower",
x = "polynomial order")
You should now attempt to redo the above exercise for polynomials in horsepower up to order 11. Use a 10-fold cross-validation to decide which of the models has the lowest cross-validation criterion. When you plot the results, also add the full.criterion, which shows you the non-cross-validated in-sample fit.
# Add models for higher polynomials, OLS1 to OLS3 are already available
OLS4 <- lm(mpg ~ poly(horsepower,4),data = Auto)
OLS5 <- lm(mpg ~ poly(horsepower,5),data = Auto)
OLS6 <- lm(mpg ~ poly(horsepower,6),data = Auto)
OLS7 <- lm(mpg ~ poly(horsepower,7),data = Auto)
OLS8 <- lm(mpg ~ poly(horsepower,8),data = Auto)
OLS9 <- lm(mpg ~ poly(horsepower,9),data = Auto)
OLS10 <- lm(mpg ~ poly(horsepower,10),data = Auto)
OLS11 <- lm(mpg ~ poly(horsepower,11),data = Auto)
# seed ensures that you always get the same random splits
cv10 <- cv(models(OLS1,OLS2,OLS3,OLS4,OLS5,OLS6,OLS7,OLS8,OLS9,OLS10,OLS11),
data = Auto,
k = 10,
seed = 1234)
cv10.crit <- as.data.frame(cv10,
rows="cv",
columns="criteria"
)
# save the criteria in a separate data frame - will add other crit later
# remove the fold variable and rename some columns
cv.crit <- cv10.crit %>%
select(-fold) %>%
rename(CV10.adj.crit = adjusted.criterion) %>%
rename(CV10.crit = criterion)
cv.crit$p <- seq(1,dim(cv10.crit)[1]) # this adds a model number, indexing the polynomial order
ggplot(cv.crit, aes(x = p)) +
geom_line(aes(y = CV10.adj.crit), color = "darkred") +
geom_line(aes(y = full.criterion), color="steelblue") +
labs(title = "CV mse for models with increasing polynomials in horsepower",
x = "polynomial order")
In your solution you should see that the full.criterion keeps decreasing, but the CV criterion will identify an optimal polynomial order.
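Rather than reading the optimal order off the plot you can also extract it directly; the snippet below is a small convenience addition using the column names created above.
cv.crit$p[which.min(cv.crit$CV10.adj.crit)] # polynomial order with the smallest adjusted CV criterion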
In the above example we applied a 4-fold and then a 10-fold cross-validation. Could you increase the number of folds further? Indeed, you could increase it up to the number of observations, here 392.
This implies the following structure:
Estimation/Training observations | Prediction/Testing observations |
---|---|
1-391 | 392 |
1-390, 392 | 391 |
1-389, 391-392 | 390 |
1-388, 390-392 | 389 |
... | ... |
1-2, 4-392 | 3 |
1, 3-392 | 2 |
2-392 | 1 |
This is called the leave-one-out (loo) cross-validation scheme. The cv function can implement it by using k = "loo" as the input for the number of folds.
# no seed required for loo CV as there are no random draws
cvloo <- cv(models(OLS1,OLS2,OLS3,OLS4,OLS5,OLS6,OLS7,OLS8,OLS9,OLS10,OLS11), data = Auto, k = "loo")
summary(cvloo)
##
## Model model.1:
## n-Fold Cross Validation
## method: hatvalues
## cross-validation criterion = 24.23151
##
## Model model.2:
## n-Fold Cross Validation
## method: hatvalues
## cross-validation criterion = 19.24821
##
## Model model.3:
## n-Fold Cross Validation
## method: hatvalues
## cross-validation criterion = 19.33498
##
## Model model.4:
## n-Fold Cross Validation
## method: hatvalues
## cross-validation criterion = 19.42443
##
## Model model.5:
## n-Fold Cross Validation
## method: hatvalues
## cross-validation criterion = 19.03321
##
## Model model.6:
## n-Fold Cross Validation
## method: hatvalues
## cross-validation criterion = 18.97864
##
## Model model.7:
## n-Fold Cross Validation
## method: hatvalues
## cross-validation criterion = 18.83305
##
## Model model.8:
## n-Fold Cross Validation
## method: hatvalues
## cross-validation criterion = 18.96115
##
## Model model.9:
## n-Fold Cross Validation
## method: hatvalues
## cross-validation criterion = 19.06863
##
## Model model.10:
## n-Fold Cross Validation
## method: hatvalues
## cross-validation criterion = 19.49093
##
## Model model.11:
## n-Fold Cross Validation
## method: hatvalues
## cross-validation criterion = 19.30755
Now we add the leave-one-out (loo) cross-validation criterion to the cv.crit dataframe.
cvloo.crit <- as.data.frame(cvloo,
rows="cv",
columns="criteria"
)
cv.crit$CVloo.crit <- cvloo.crit$criterion
With the criteria for the 10-fold and the loo cross-validation in one data frame, we can display them together. In order to show multiple lines with ggplot it is best to transfer the data into long format.
cv.crit.long <- cv.crit %>% pivot_longer(!c(model,p),names_to = "Eval.type",values_to = "criterion")
ggplot(cv.crit.long, aes(x = p, y = criterion, colour = Eval.type)) +
geom_line() +
labs(title = "CV mse for models with increasing polynomials in horsepower",
x = "polynomial order")
On this occasion the messages from the 10-fold and the loo cross-validation are very similar.
As the loo cross-validation has been described above, it looks as if we have to run 392 regressions. As it turns out, there is a neat algebraic trick that enables the calculation of the loo-CV criterion after running only the full-sample regression. With 392 observations this is not essential, but if you have thousands, if not millions, of observations then it does matter for computation time.
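For illustration only, this is what the brute-force version of the leave-one-out calculation would look like for the p = 1 model, estimating one regression per left-out observation (err.sq and fit.i are names chosen here for this sketch; the trick described next avoids this loop altogether).
n <- nrow(Auto)
err.sq <- rep(NA, n)
for (i in 1:n) {
  fit.i <- lm(mpg ~ horsepower, data = Auto[-i,]) # estimate without observation i
  err.sq[i] <- (Auto$mpg[i] - predict(fit.i, Auto[i,]))^2 # squared prediction error for observation i
}
mean(err.sq) # should match the loo criterion for model.1 reported above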
We will illustrate this trick here for a simple regression. This means that we will reconstruct the loo-CV criterion for the p = 1 model, which was previously calculated to be 24.2315135.
The \(CV_{loo,n}\) criterion is calculated as
\[\begin{equation} CV_{loo,n} = \frac{1}{n} \Sigma_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1-h_i}\right)^2 \end{equation}\]
where
\[\begin{equation} h_i = \frac{1}{n} + \frac{(x_{i} - \bar{x})^2}{\Sigma_{j=1}^{n} (x_{j} - \bar{x})^2} \end{equation}\]
is called the leverage of the \(i\)th observation, \(x_i\) is the \(i\)th observation of the (single) explanatory variable (here horsepower) and \(\hat{y}_i\) is the model prediction for the \(i\)th observation coming from the full-sample estimation, i.e. from OLS1 in our case.
Let us calculate the above. We already have the full-sample model OLS1, so all the remaining terms can be prepared directly.
# prepare all the terms needed
n <- dim(Auto)[1]
y <- Auto$mpg
x <- Auto$horsepower
xbar <- mean(x)
yhat <- predict(OLS1,Auto)
x_m_xbar.sq <- (x - xbar)^2
sum.x_m_xbar.sq <- sum(x_m_xbar.sq)
# calculate leverage
h = (1/n) + (x_m_xbar.sq/sum.x_m_xbar.sq)
# Calculate CV criterion
temp <- ((y-yhat)/(1-h))^2
CVloo.1 = mean(temp)
paste("CVloo.1 =", CVloo.1, ", from cv procedure: ", cv.crit[cv.crit$p==1,"CVloo.crit"])
## [1] "CVloo.1 = 24.2315135179292 , from cv procedure: 24.2315135179292"
As you can see, they are identical. That is actually not a surprise, as the clever programmers of the cv function are of course aware of this trick and use this approach rather than estimating 392 regressions. So, in effect, we have just recreated one of the pieces of code hidden under the hood of the cv function.
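In fact, the leverages calculated by hand above are exactly what the built-in hatvalues() function returns (and, presumably, what the "method: hatvalues" line in the loo output refers to); hatvalues() also works for models with more than one regressor. A quick check:
all.equal(as.numeric(h), as.numeric(hatvalues(OLS1))) # should be TRUE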
For the general k-fold CV such exact results do not hold in general, and you (or the cv procedure) will, at least conceptually, have to estimate multiple regressions; for linear models the cv function can nevertheless use computational shortcuts (note the "method: Woodbury" line in the earlier output).
Above we built a model with only horsepower as the explanatory variable. Now we include some additional variables.
OLS1.ext <- lm(mpg ~ horsepower+weight+year,data = Auto)
summary(OLS1.ext)
##
## Call:
## lm(formula = mpg ~ horsepower + weight + year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7911 -2.3220 -0.1753 2.0595 14.3527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.372e+01 4.182e+00 -3.281 0.00113 **
## horsepower -5.000e-03 9.439e-03 -0.530 0.59663
## weight -6.448e-03 4.089e-04 -15.768 < 2e-16 ***
## year 7.487e-01 5.212e-02 14.365 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.43 on 388 degrees of freedom
## Multiple R-squared: 0.8083, Adjusted R-squared: 0.8068
## F-statistic: 545.4 on 3 and 388 DF, p-value: < 2.2e-16
You can see that the inclusion of other explanatory variables has rendered the horsepower variable insignificant. However, higher-order terms may still be relevant.
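A quick, conventional (in-sample) way to check this is an F-test against a version of the model that adds a quadratic in horsepower. OLS2.ext.check is a name used here just for this check so that it does not clash with the models you will build in the exercise below.
OLS2.ext.check <- lm(mpg ~ poly(horsepower,2) + weight + year, data = Auto)
anova(OLS1.ext, OLS2.ext.check) # F-test for the additional quadratic term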
Your task is now to replicate the above analysis. Consider models which extend OLS1.ext with higher powers of the horsepower variable (up to \(p=11\)). Then apply the 10-fold and the loo CV to decide which model is most suitable for forecasting purposes.
# Add models for higher polynomials, OLS1.ext is already available
OLS2.ext <- lm(mpg ~ poly(horsepower,2) + weight + year,data = Auto)
OLS3.ext <- lm(mpg ~ poly(horsepower,3) + weight + year,data = Auto)
OLS4.ext <- lm(mpg ~ poly(horsepower,4) + weight + year,data = Auto)
OLS5.ext <- lm(mpg ~ poly(horsepower,5) + weight + year,data = Auto)
OLS6.ext <- lm(mpg ~ poly(horsepower,6) + weight + year,data = Auto)
OLS7.ext <- lm(mpg ~ poly(horsepower,7) + weight + year,data = Auto)
OLS8.ext <- lm(mpg ~ poly(horsepower,8) + weight + year,data = Auto)
OLS9.ext <- lm(mpg ~ poly(horsepower,9) + weight + year,data = Auto)
OLS10.ext <- lm(mpg ~ poly(horsepower,10) + weight + year,data = Auto)
OLS11.ext <- lm(mpg ~ poly(horsepower,11) + weight + year,data = Auto)
# 10 fold CV
# seed ensures that you always get the same random splits
cv10.ext <- cv(models(OLS1.ext,OLS2.ext,OLS3.ext,OLS4.ext,OLS5.ext,
OLS6.ext,OLS7.ext,OLS8.ext,OLS9.ext,OLS10.ext,OLS11.ext),
data = Auto,
k = 10,
seed = 1234)
cv10.ext.crit <- as.data.frame(cv10.ext,
rows="cv",
columns="criteria"
)
# save the criteria in a separate data frame - will add other crit later
# remove the fold variable and rename some columns
cv.ext.crit <- cv10.ext.crit %>%
select(-fold) %>%
rename(CV10.adj.crit = adjusted.criterion) %>%
rename(CV10.crit = criterion)
cv.ext.crit$p <- seq(1,dim(cv10.ext.crit)[1]) # this adds a model number, indexing the polynomial order
# loo CV
# no seed required for loo CV as there are no random draws
cvloo.ext <- cv(models(OLS1.ext,OLS2.ext,OLS3.ext,OLS4.ext,OLS5.ext,
OLS6.ext,OLS7.ext,OLS8.ext,OLS9.ext,OLS10.ext,OLS11.ext),
data = Auto,
k = "loo")
cvloo.ext.crit <- as.data.frame(cvloo.ext,
rows="cv",
columns="criteria"
)
# add the loo CV crit to the results dataframe
cv.ext.crit$CVloo.crit <- cvloo.ext.crit$criterion
# change results dataframe to long format for easier plotting
cv.ext.crit.long <- cv.ext.crit %>% pivot_longer(!c(model,p),
names_to = "Eval.type",
values_to = "criterion")
ggplot(cv.ext.crit.long, aes(x = p, y = criterion, colour = Eval.type)) +
geom_line() +
labs(title = "CV mse for extended models with increasing polynomials in horsepower",
x = "polynomial order")
When inspecting the results you will see that now, after including the other variables, the cross-validation results suggest that we should include only up to the third power of the horsepower variable. Without the inclusion of weight and year, the cross-validation procedure suggested including 7 powers.
Here we learned how to implement cross-validation approaches to support model selection. The useful cv function implements k-fold and leave-one-out cross-validation in a straightforward manner. All you need to do is estimate the full-sample models and then feed them into the cv function.