Introduction

Heterogeneous Autoregressive (HAR) models for realized volatility have become a very popular tool for volatility modelling and forecasting. In this project we replicate some of the work presented in Clements and Preve (2021), in order to illustrate how to estimate such models and forecast with them in R.

The seminal paper is that of Corsi (2009), which introduced the general methodology. At the core of this methodology is the availability of realized volatilities, volatility proxies based on high-frequency data. This implies that volatility can be modeled on the basis of observed data, as opposed to treating volatility as a latent (unobserved) variable which is filtered from observed asset return data (the idea underlying GARCH and stochastic volatility type models).

One reason for the popularity is the increasing availability of high-frequency data which can then be used to construct realized volatility measures. While such intra-day (high-frequency) data are now widely available, they usually have to be purchased from commercial data providers.

Luckily, realized volatility data (without access to the underlying high-frequency data) for some indices and assets can be obtained, free of charge, from the Oxford Realized Library.

Preparing your workfile and data download

As usual you should prepare your script by setting the correct working directory.

setwd("YOUR/WORKING/DIRECTORY")

Further, we load a few packages whose functionality we will use:

library(tidyverse)
library(readxl)
library(xts)
library(ggplot2)
library(ggpubr)  # to combine ggplots
library(AER)

The particular data used here are the same as those used in the Clements and Preve (2021) paper, who in turn use data from Bollerslev et al. (2016). These data are available from James Patton’s website: search for “Bollerslev, Patton and Quaedvlieg (2015, Journal of Econometrics)”, download the zip file with MATLAB code and data, and take the SP500_RV_5min.xlsx file from the “Market” folder. Download that data file into your working directory.

We shall now load the data from the working directory:

rv_data <- read_xlsx("SP500_RV_5min.xlsx") # ensure this file is in your WD

Let us check out the structure of this file.

str(rv_data)
## tibble [4,096 × 12] (S3: tbl_df/tbl/data.frame)
##  $ Date   : num [1:4096] 2e+07 2e+07 2e+07 2e+07 2e+07 ...
##  $ RV     : num [1:4096] 0.372 0.559 0.551 1.97 0.692 ...
##  $ RQ     : num [1:4096] 6.64e-05 1.61e-04 1.94e-04 1.64e-03 1.83e-04 ...
##  $ RJ     : num [1:4096] 0.06913 0 0.00961 0.20376 0 ...
##  $ BPV    : num [1:4096] 0.303 0.591 0.541 1.766 0.72 ...
##  $ RVn    : num [1:4096] 0.136 0.395 0.324 1.212 0.28 ...
##  $ RVp    : num [1:4096] 0.237 0.164 0.226 0.758 0.412 ...
##  $ TPQ    : num [1:4096] 3.57e-05 2.31e-04 1.20e-04 1.20e-03 1.88e-04 ...
##  $ MedRQ  : num [1:4096] 2.35e-05 1.61e-04 5.53e-05 1.06e-03 7.34e-05 ...
##  $ TrRQ   : num [1:4096] 6.64e-05 9.02e-05 1.94e-04 4.38e-04 1.83e-04 ...
##  $ RQ15Min: num [1:4096] 0.00052 0.002214 0.000756 0.012159 0.002034 ...
##  $ RQBoot : num [1:4096] 0.00195 0.00437 0.00548 0.04579 0.00474 ...

As you can see we have 4096 observations for realized volatility (RV) and related variables. Before we look at the details of the series, we shall look at the Date variable.

min(rv_data$Date)
## [1] 19970408
max(rv_data$Date)
## [1] 20130830

You can see that the first day in the sample is the 8th of April 1997 and the last is the 30th of August 2013. Have a look at the data itself to confirm that we typically observe 5 days of data interrupted by two days (the weekend) without data. Altogether this gives us 4,096 observations.

The date information is currently saved as a numeric variable in the format “YYYYMMDD”. As single-digit months and days are given leading zeros (e.g. “19990408” for the 8th of April 1999), the data actually sort correctly according to the underlying dates. However, it will be useful to let R know that we are dealing with dated data. At this stage R just thinks that it has a bunch of numbers.

We shall use the xts date format and turn rv_data into a dated format.

rv_data <- xts(rv_data, order.by=as.Date(as.character(rv_data$Date),"%Y%m%d"))

The xts function takes the rv_data object and transforms it. You need to supply the date information via order.by=. That date information is currently contained in rv_data$Date as a numerical variable. This needs transforming into a date variable, which is done by the as.Date function. The as.Date function takes a character variable (hence we first transform the num variable to a chr variable with as.character(rv_data$Date)) and a format string, and translates these into a date variable.
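As a quick check of the format string, you can convert a single value by hand:

as.Date(as.character(19970408), "%Y%m%d")   # should return "1997-04-08"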

Checking our data structure we can confirm the success of this operation.

str(rv_data)
## An xts object on 1997-04-08 / 2013-08-30 containing: 
##   Data:    double [4096, 12]
##   Columns: Date, RV, RQ, RJ, BPV ... with 7 more columns
##   Index:   Date [4096] (TZ: "UTC")

If you look at a subset of data you will see that they are now dated.

head(rv_data$RV)
##                   RV
## 1997-04-08 0.3720967
## 1997-04-09 0.5587527
## 1997-04-10 0.5506843
## 1997-04-11 1.9699422
## 1997-04-14 0.6923827
## 1997-04-15 1.0595450

Here you can see again that there are no observations for the 12th and 13th of April 1997, which were a Saturday and Sunday respectively. On weekends there is no trading and hence there are no intra-day return data on the basis of which we could calculate a realized volatility.

Explore data

Let’s look at the data and identify the data series we will be using. We start by printing the variable names

names(rv_data)
##  [1] "Date"    "RV"      "RQ"      "RJ"      "BPV"     "RVn"     "RVp"    
##  [8] "TPQ"     "MedRQ"   "TrRQ"    "RQ15Min" "RQBoot"

Apart from the date, this dataset contains daily observations of realized volatility estimates (e.g. RV, RJ, BPV) but also daily estimates of realized quarticity (e.g. RQ, TrRQ).

This is not the place to discuss details of these estimators. They are discussed in detail in Bollerslev et al. (2016). But importantly, realized volatility estimates are estimates of the integrated one-day variance of the underlying asset return process. They will be used as the observed estimates on the basis of which we will estimate the volatility process.

It is important to recognise that these are estimates, and hence that the underlying volatility is only measured with error. The realized quarticity is an estimate of the variance of this estimation error and can be used in variations of the HAR model to take account of this. For the work here we will use RV as the estimate of realized volatility and RQ as the measure of realized quarticity.
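As a point of reference (these are the standard definitions, not something specific to this dataset): with \(M\) intra-day returns \(r_{t,i}\) on day \(t\), the usual 5-minute estimators are

\[\begin{equation} RV_t = \sum_{i=1}^{M} r_{t,i}^2, \qquad RQ_t = \frac{M}{3}\sum_{i=1}^{M} r_{t,i}^4. \end{equation}\]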

Let’s visualise the data.

p1 <- ggplot(rv_data,aes(y=RV, x=Index)) + 
  geom_line() + 
  labs(x = "Time",title="S&P500, Realized Volatility") +
  theme_light()

p2 <- ggplot(rv_data,aes(y=RQ, x=Index)) + 
  geom_line() + 
  labs(x = "Time",title="S&P500, Realized Quarticity") +
  theme_light()
ggarrange(p1,p2,nrow = 2, ncol = 1)

A feature of these realized volatility and realized quarticity estimates is that they are strongly positively correlated; you can check this directly with the snippet below. The correlation is difficult to see in the time series plots as both variables are very strongly skewed, as the density plots that follow make clear.
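A quick check of the correlation claim (the exact value depends on the sample and is not reported here):

cor(as.numeric(rv_data$RV), as.numeric(rv_data$RQ))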

p3 <- ggplot(rv_data,aes(x=RV)) +
  geom_density() + 
  labs(x = "RV",title="S&P500, Density of Realized Volatility") +
  theme_light()
p4 <- ggplot(rv_data,aes(x=RQ)) +
  geom_density() + 
  labs(x = "RQ",title="S&P500, Density of Realized Quarticity") +
  theme_light()
ggarrange(p3,p4,nrow = 2, ncol = 1)

In order to work with less skewed data, the following transformations are often used:

\(logRV = \log{RV}\) and \(srRV = \frac{\sqrt{RV}-1}{1/2}\).

These are special cases of the so-called Box-Cox transform. We shall add these series to rv_data and then look at their densities to evaluate whether their distributions are more symmetric.

rv_data$logRV <- log(rv_data$RV)
rv_data$srRV <- 2*(sqrt(rv_data$RV)-1)
p5 <- ggplot(rv_data) +
  geom_density(aes(x=logRV), color="blue", size = 1.5) + 
  geom_density(aes(x=srRV), color="red", size = 1.5) + 
  labs(x = "logRV (blue), srRV (red)",title="S&P500, Density of log and square root Realized Volatility") +
  theme_light()
p5

These are still skewed but much closer to symmetry than RV itself. In a later section we will apply HAR type models to such transformations.

The HAR model

Here we will introduce the basic HAR model. For details of HAR-type models and their variations you should refer to the work listed in the references.

The RV variable in our dataset contains observed realized volatilities. Let’s call this variable \(RV_t\). This is an observed proxy for the volatility of a returns process (here the returns of the S&P 500 share index).

The following model is used to model the dynamics of, and eventually forecast, realized volatility:

\[\begin{equation} RV_t = \beta_0 + \beta_1 RV_{t-1} + \beta_2 RV_{t-1}^5 + \beta_3 RV_{t-1}^{22} + u_t \end{equation}\]

Here \(RV_{t-1}^5\) stands for the 5-day realized volatility, \(RV_{t-1}^5=\frac{1}{5}\sum_{i=1}^5 RV_{t-i}\), and \(RV_{t-1}^{22}=\frac{1}{22}\sum_{i=1}^{22} RV_{t-i}\) for the 22-day realized volatility.

Estimating the HAR model

Estimating the HAR models will be done using the HARmodel function which is part of the highfrequency package. So this is a good time to install the highfrequency package if you have not yet done so and then to load it.

library(highfrequency)

The HARmodel function takes a realized volatility series as input and estimates a HAR model. You need to specify what realized volatility terms you want included as explanatory variables. Here we specify that we want 1, 5 and 22 day averages (periods = c(1,5,22)). The input inputType = "RM" indicates that we are feeding in a realized variance measure (as opposed to intra-day returns) and type = "HAR" indicates that we want a HAR model. The h = 1 input to the function indicates that we estimate a model for daily realized volatility as the dependent variable. This choice will become important when we use the estimated models to forecast.

data_temp <- rv_data[,"RV"]
mod_HAR <- HARmodel(data = data_temp , periods = c(1,5,22), 
              type = "HAR", h = 1, inputType = "RM")
summary(mod_HAR)
## 
## Call:
## "RV1 = beta0  +  beta1 * RV1 +  beta2 * RV5 +  beta3 * RV22"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.413  -0.278  -0.122   0.074  47.453 
## 
## Coefficients:
##       Estimate Std. Error t value Pr(>|t|)    
## beta0  0.11231    0.03513   3.197 0.001397 ** 
## beta1  0.22734    0.10091   2.253 0.024318 *  
## beta2  0.49035    0.14646   3.348 0.000821 ***
## beta3  0.18638    0.06016   3.098 0.001963 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.605 on 4070 degrees of freedom
## Multiple R-squared:  0.5224, Adjusted R-squared:  0.5221 
## F-statistic:  1484 on 3 and 4070 DF,  p-value: < 2.2e-16

The estimated parameters are basically identical to those reported in Table 1 of Clements and Preve (2021). The standard errors are slightly different but very similar.

Replication of Table 1 from Clements and Preve (2021)

Replicate the basic result with the lm function

We will later explore some other HAR-type model estimates available from the HARmodel function. But before we do so we shall reproduce these results by manually calculating the explanatory variables and re-estimating the model using the lm function. This is to gain a better understanding of the inner workings of the HARmodel function.

We begin by calculating \(RV_{t-1}\), \(RV_{t-1}^5\) and \(RV_{t-1}^{22}\) and saving them in the dataset. Here we use the lag.xts function which allows us to calculate lags of xts-formatted variables. lag.xts(rv_data$RV,4), for instance, gives us \(RV_{t-4}\).

rv_data$RV1 <- lag.xts(rv_data$RV,1)

rv_data$RV5 <- (lag.xts(rv_data$RV,1)+lag.xts(rv_data$RV,2)+lag.xts(rv_data$RV,3)+
                  lag.xts(rv_data$RV,4)+lag.xts(rv_data$RV,5))/5

rv_data$RV22 <- (lag.xts(rv_data$RV,1)+lag.xts(rv_data$RV,2)+
                  lag.xts(rv_data$RV,3)+lag.xts(rv_data$RV,4)+lag.xts(rv_data$RV,5)+
                  lag.xts(rv_data$RV,6)+lag.xts(rv_data$RV,7)+lag.xts(rv_data$RV,8)+
                  lag.xts(rv_data$RV,9)+lag.xts(rv_data$RV,10)+lag.xts(rv_data$RV,11)+
                  lag.xts(rv_data$RV,12)+lag.xts(rv_data$RV,13)+lag.xts(rv_data$RV,14)+
                  lag.xts(rv_data$RV,15)+lag.xts(rv_data$RV,16)+lag.xts(rv_data$RV,17)+
                  lag.xts(rv_data$RV,18)+lag.xts(rv_data$RV,19)+lag.xts(rv_data$RV,20)+
                  lag.xts(rv_data$RV,21)+lag.xts(rv_data$RV,22))/22
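As an aside, the same lagged averages can be constructed more compactly with the rolling-window helpers from the zoo package (loaded together with xts). This is just an equivalent sketch; the remainder of this piece uses the variables created above.

# right-aligned 5- and 22-day means of RV, lagged by one day
RV5_alt  <- lag.xts(rollapplyr(rv_data$RV, width = 5,  FUN = mean, fill = NA), 1)
RV22_alt <- lag.xts(rollapplyr(rv_data$RV, width = 22, FUN = mean, fill = NA), 1)
# max(abs(RV5_alt - rv_data$RV5), na.rm = TRUE) should be essentially zero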

Now that we have created the three explanatory variables, we can use them to estimate the standard HAR model with the lm function.

mod_HAR_vlm <- lm(RV~RV1+RV5+RV22,data=rv_data)
summary(mod_HAR_vlm)
## 
## Call:
## lm(formula = RV ~ RV1 + RV5 + RV22, data = rv_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.413  -0.278  -0.122   0.074  47.453 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.11231    0.03065   3.664 0.000252 ***
## RV1          0.22734    0.01870  12.157  < 2e-16 ***
## RV5          0.49035    0.03144  15.595  < 2e-16 ***
## RV22         0.18638    0.02813   6.624 3.94e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.605 on 4070 degrees of freedom
##   (22 observations deleted due to missingness)
## Multiple R-squared:  0.5224, Adjusted R-squared:  0.5221 
## F-statistic:  1484 on 3 and 4070 DF,  p-value: < 2.2e-16

When comparing these results to the results produced earlier by the HARmodel function and indeed the results from Clements and Preve (Table 1) you can see that the coefficient estimates are essentially identical, although the standard errors from the lm estimation are significantly smaller.

The standard errors from the HARmodel function are, as per the package documentation (p. 34), Newey-West standard errors. These are heteroskedasticity and autocorrelation consistent (robust) standard errors. The notes to Clements and Preve’s Table 1 also state that they use robust standard errors, presumably also of the Newey-West type, explaining why these are very similar to those coming from the HARmodel function.

The standard errors coming from the lm function are non-robust standard errors. We can calculate robust standard errors and see whether we can replicate the standard errors obtained earlier. To do so we report the coefficients from the mod_HAR_vlm model using the coeftest function, but making sure that we use a different method to calculate the coefficient estimates’ standard errors (vcov = vcovHAC(mod_HAR_vlm)). You can consult the details of the vcovHAC function to confirm that this calculates Newey-West type standard errors.

coeftest(mod_HAR_vlm, vcov = vcovHAC(mod_HAR_vlm))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 0.112314   0.062840  1.7873 0.0739629 .  
## RV1         0.227344   0.107791  2.1091 0.0349955 *  
## RV5         0.490349   0.147156  3.3322 0.0008694 ***
## RV22        0.186377   0.093825  1.9864 0.0470525 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These standard errors are not identical but very similar to those from both the Clements and Preve paper as well as the HARmodel estimation. When calculating heteroskedasticity and autocorrelation robust standard errors there are many small choices to be made. Without knowing precisely which choices are implemented by the HARmodel function and in Clements and Preve, it is understandable that we cannot replicate the exact standard errors.
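If you want more control over these choices you can, for instance, use the NeweyWest function from the sandwich package (loaded as a dependency of AER) and fix the lag truncation explicitly; the lag of 5 below is purely illustrative and not necessarily the choice made by HARmodel or in the paper.

coeftest(mod_HAR_vlm, vcov = NeweyWest(mod_HAR_vlm, lag = 5, prewhite = FALSE))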

Forecasting from HAR models

Above we demonstrated how we estimate HAR models. Once we have estimated parameters we can use this model to forecast realized volatility. Let’s look at the last observation in our dataset:

tail(rv_data[,c("RV", "RV1", "RV5", "RV22")])
##                   RV       RV1       RV5      RV22
## 2013-08-23 0.1806362 0.2469839 0.3536340 0.2343011
## 2013-08-26 0.1804883 0.1806362 0.3260802 0.2320665
## 2013-08-27 0.4467254 0.1804883 0.3255740 0.2291035
## 2013-08-28 0.2576929 0.4467254 0.3725868 0.2388208
## 2013-08-29 0.3483140 0.2576929 0.2625053 0.2426365
## 2013-08-30 0.5403510 0.3483140 0.2827714 0.2492720

Look at the last day, 30 Aug 2013 (day 4096). The realized volatility for this day is 0.5404. But what would we have predicted the realized volatility to be at the end of 29 Aug 2013 (day 4095), i.e. \(E(RV_{4096}|I_{4095})\), where \(I_{4095}\) contains all information available at the end of 29 Aug 2013? This information includes

  • \(RV_{4096-1}=RV_{4095}=\) 0.3483
  • \(RV_{4096-1}^5=RV_{4095}^5=\) 0.2828
  • \(RV_{4096-1}^{22}=RV_{4095}^{22}=\) 0.2493

Plugging these values into the estimated model we get

\[\begin{eqnarray} F_{4096}=E(RV_{4096}|I_{4095}) &=& \hat{\beta}_0 + \hat{\beta}_1 RV_{4095} + \hat{\beta}_2 RV_{4095}^5 + \hat{\beta}_3 RV_{4095}^{22}\\ &=& 0.112314 + 0.227344 \cdot 0.3483 + 0.490349 \cdot 0.2828+ 0.186377 \cdot 0.2493\\ &=&0.3766 \end{eqnarray}\]

This implies that at the end of 29 Aug 2013 we would have predicted the realized volatility on 30 Aug 2013 to be 0.3766. In reality it turned out to be 0.5404. We would have therefore underestimated the realized volatility. The forecast error would have been \(RV_{4096}-F_{4096}=0.5404-0.3766=0.1638\).
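You can reproduce this arithmetic directly from the stored coefficients (the RV values are those shown in the table above):

b <- mod_HAR$coefficients
unname(b[1] + b[2]*0.3483140 + b[3]*0.2827714 + b[4]*0.2492720)   # approx 0.3766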

If you wanted to produce this forecast, \(F_{4096}\), without using the step-by-step algebra above you could simply use the predict function.

predict(mod_HAR)
## [1] 0.3766164

Clearly this produces exactly the same forecast as the one calculated above.

There is just a little snag in this example. To estimate the coefficients used above we did actually use the data from 30 Aug 2013 (\(t=4096\)). In reality that would be impossible: if we wanted to make a forecast at the end of day 4095 we would have to use coefficients estimated on the basis of data up to \(t=4095\) only. In our later (pseudo) forecasting exercise we will take that into account.
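As a minimal sketch of how to avoid the snag (using the lm-based version of the model and the lagged variables we created earlier), we could estimate on the first 4095 days only and then predict day 4096:

mod_HAR_oos <- lm(RV ~ RV1 + RV5 + RV22, data = rv_data[1:4095, ])
predict(mod_HAR_oos, newdata = as.data.frame(rv_data[4096, c("RV1","RV5","RV22")]))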

Forecasting multiple periods ahead

Warning, this section will be messy! But there is no way around this!

When models like the HAR model are used to forecast multiple periods ahead we usually apply a method called direct forecasting. Say we are interested in forecasting the average 5-day volatility. We would then first estimate a slightly different model:

\[\begin{equation} RV_t^5 = \beta_0 + \beta_1 RV_{t-5} + \beta_2 RV_{t-5}^5 + \beta_3 RV_{t-5}^{22} + u_t \end{equation}\]

The dependent variable has been transformed into the term we wish to forecast and the explanatory variables have been lagged such that they precede all information going into the calculation of the dependent variable.

Before we re-estimate this model we shall also control the estimation sample (as this will soon become important). The date information in xts formatted rv_data can be called from index(rv_data). Let’s look at the last 6 observations.

tail(index(rv_data),6)
## [1] "2013-08-23" "2013-08-26" "2013-08-27" "2013-08-28" "2013-08-29"
## [6] "2013-08-30"

We now want to pretend that we are at the end of day “2013-08-23” and want to use all available data to forecast \(RV_{2013-08-30}^5\), which averages the RV from days “2013-08-26” to “2013-08-30”.

data_temp <- rv_data[index(rv_data)<="2013-08-23","RV"]

You can check data_temp to confirm that its last observation is from “2013-08-23”. As we estimate the model on the basis of this reduced estimation sample we also need to let HARmodel know that the definition of the dependent variable is now \(RV_t^5\). This change is reflected by the h=5 input in the HARmodel function, which indicates over how many days the dependent variable should be averaged.

mod_HAR5 <- HARmodel(data = data_temp , periods = c(1,5,22), 
              type = "HAR", h = 5, inputType = "RM")
summary(mod_HAR5)
## 
## Call:
## "RV5 = beta0  +  beta1 * RV1 +  beta2 * RV5 +  beta3 * RV22"
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.5785  -0.2524  -0.1237   0.0606  19.8571 
## 
## Coefficients:
##       Estimate Std. Error t value Pr(>|t|)    
## beta0  0.17188    0.08073   2.129  0.03331 *  
## beta1  0.18643    0.03892   4.791 1.72e-06 ***
## beta2  0.39571    0.12109   3.268  0.00109 ** 
## beta3  0.27090    0.13093   2.069  0.03861 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.157 on 4061 degrees of freedom
## Multiple R-squared:  0.6406, Adjusted R-squared:  0.6403 
## F-statistic:  2413 on 3 and 4061 DF,  p-value: < 2.2e-16

When comparing the degrees of freedom of this model with mod_HAR you will note that we have 9 fewer observations: 5 because the estimation sample now ends on “2013-08-23”, and 4 more through the additional aggregation of the dependent variable. You will also note that the parameter estimates changed, although not overly much. The in-sample fit has improved as the dependent variable has now smoothed out some of the daily random variation which is unforecastable.

If you now wanted to predict the last observation we can again use the predict function. If we just applied predict(mod_HAR5) we would merely obtain the in-sample fit of the last observation used in the parameter estimation. We can, however, use the predict function to forecast beyond the in-sample period. Before we can do so we need to understand how to feed in the data correctly, which in turn requires understanding how the HARmodel function treats the data.

When we applied the HARmodel function and saved the output into mod_HAR5 we saved not only the estimated parameters and regression diagnostics (like \(R^2\)); the object contains a lot more information. For instance, look at the last 6 observations of mod_HAR5$model:

tail(mod_HAR5$model,6)
##                    y       RV1       RV5      RV22
## 2013-08-16 0.2399225 0.2315964 0.1593696 0.1774491
## 2013-08-19 0.2459137 0.1530634 0.1762318 0.1755209
## 2013-08-20 0.2467341 0.2075593 0.1890873 0.1754221
## 2013-08-21 0.3586860 0.2483406 0.2074296 0.1822711
## 2013-08-22 0.3536340 0.2722440 0.2225608 0.1880321
## 2013-08-23 0.3260802 0.3184053 0.2399225 0.1918824

This displays the data which have been used to estimate the regression model. The explanatory variables RV1, RV5 and RV22 were internally calculated by the HARmodel function as the only variable we fed into the function was the RV variable from rv_data.

Let’s first understand the information in the above table, looking at the last row (“2013-08-23”). The dependent variable in this row is \(RV_{2013-08-23}^5\). Let’s confirm this. Recall how we previously “hand”-calculated the \(RV_t^5\) variable and saved it into the rv_data dataset. In fact we calculated \(RV_{t-1}^5\), which means that we should find the value for \(RV_{2013-08-23}^5\) in the row for “2013-08-26” (the next available day). Let us just confirm that:

rv_data[index(rv_data)=="2013-08-26","RV5"]
##                  RV5
## 2013-08-26 0.3260802

Yes, indeed that is exactly the right value. Where does the value in the RV1 column for day “2013-08-23” come from? From the model we know that we need \(RV_{2013-08-16}\) which is 5 days before the dependent variable. We will find that in two places in our dataset:

rv_data[index(rv_data)=="2013-08-16","RV"]
##                   RV
## 2013-08-16 0.3184053
rv_data[index(rv_data)=="2013-08-19","RV1"]
##                  RV1
## 2013-08-19 0.3184053

So, let’s return to the task at hand. We want to forecast \(RV_{2013-08-30}^5\) using only information available at the end of “2013-08-23”, \(E[RV_{2013-08-30}^5|I_{2013-08-23}]\) or, in terms of observation numbers, \(E[RV_{4096}^5|I_{4091}]\).

\[\begin{equation} E[RV_{4096}^5|I_{4091}] = \hat{\beta}_0 + \hat{\beta}_1 RV_{4091} + \hat{\beta}_2 RV_{4091}^5 + \hat{\beta}_3 RV_{4091}^{22} \end{equation}\]

The parameter estimates are saved in mod_HAR5, remembering that we restricted the estimation sample to the first 4091 days (index(rv_data)<="2013-08-23"). What we now need to feed into the predict function are the values \(RV_{4091}\), \(RV_{4091}^5\) and \(RV_{4091}^{22}\). We need to do so under the names RV1, RV5 and RV22 as these are the variable names used by mod_HAR5 (as seen when we looked at mod_HAR5$model). In fact we have already calculated these values and they are stored in row 4092 (day “2013-08-26”), remembering that we calculated them as \(t-1\) values in rv_data.

rv_data[index(rv_data)=="2013-08-26",c("RV1","RV5","RV22")]
##                  RV1       RV5      RV22
## 2013-08-26 0.1806362 0.3260802 0.2320665

The crucial information from here is that if we want to use the predict function to produce forecasts we will need to feed in the correct explanatory variables. In this particular case we need to feed in a value for RV1, RV5 and RV22.

predict(mod_HAR5,newdata = rv_data[index(rv_data)=="2013-08-26",c("RV1","RV5","RV22")])
## [1] 0.3974498

So, this is the predicted value for \(RV_{4096}^5\) only using information five days earlier, \(E[RV_{4096}^5|I_{4091}]\).

I promised it would be messy! Keeping track of the dates is really tricky!

A trick

As you have seen above, assembling the new information required to apply the predict function to make out-of-sample forecasts is a little messy. In fact we used information which we had previously calculated by hand, the variables RV1, RV5 and RV22. But we also saw above that the HARmodel function calculates these values internally. Let’s see whether we can use the HARmodel function to do some of that dirty work for us. Recall that the aim was to get \(E[RV_{4096}^5|I_{4091}]\). For that reason we first calculate coefficients using the information in \(I_{4091}\). They are already stored in mod_HAR5; to create it we used data_temp <- rv_data[index(rv_data)<="2013-08-23","RV"], or equivalently data_temp <- rv_data[1:4091,"RV"], acknowledging that “2013-08-23” is day 4091.

Let’s call HARmodel with a dataset that includes the period we want to forecast. We do this only because it conveniently assembles the right-hand side data for us, not because we need the new estimates; for that reason we use only a small number of observations, which takes less time.

data_fore <- rv_data[(4096-40):4096,"RV"]
fore_HAR5 <- HARmodel(data = data_fore , periods = c(1,5,22), 
              type = "HAR", h = 5, inputType = "RM")
tail(fore_HAR5$model,6)
##                    y       RV1       RV5      RV22
## 2013-08-23 0.3260802 0.3184053 0.2399225 0.1918824
## 2013-08-26 0.3255740 0.1830191 0.2459137 0.1932473
## 2013-08-27 0.3725868 0.2116613 0.2467341 0.1972984
## 2013-08-28 0.2625053 0.8081004 0.3586860 0.2280568
## 2013-08-29 0.2827714 0.2469839 0.3536340 0.2343011
## 2013-08-30 0.3547143 0.1806362 0.3260802 0.2320665

You can see that the last line contains exactly the values for RV1, RV5 and RV22 we need to forecast \(RV_{4096}^5\). We can now send that to the predict function.

data_fore <- fore_HAR5$model[nrow(fore_HAR5$model),c("RV1","RV5","RV22")]
data_fore <- as.xts(data_fore)  # when extracting $model the data lose their
                                # xts formatting, we need to add that back
predict(mod_HAR5,newdata = data_fore)
## [1] 0.3974498

This is of course the same result as the one we got above, the difference being that we let HARmodel do the hard work of picking out the correct right hand side variable for our forecast.

Pseudo out-of-sample forecasting

When we investigate the forecasting ability of different forecasting models we would usually want to do so for repeated forecasts over a period of time. Here we are attempting to replicate results of such an exercise published in Clements and Preve (2021, Figure 2 and Table 4).

Here is how the exercise is set up. We fix the length of the estimation window at 1,000 observations and estimate the model using the first 1,000 observations. Let’s say we concentrate on the model for the 1-day forecasts:

\[\begin{equation} RV_t = \beta_0 + \beta_1 RV_{t-1} + \beta_2 RV_{t-1}^5 + \beta_3 RV_{t-1}^{22} + u_t \end{equation}\]

Given the structure of the model, we actually require 1022 days of RV data to have 1000 observations in our model, as we need to calculate the lagged \(RV_{t-1}^{22}\) regressor. So let’s set some parameters.

l <- 22   # max lag in explanatory variables
h <- 1    # forecast horizon
n <- 1000 # estimation sample

With this information we can extract the first set of RV data to estimate a model

data_temp <- rv_data[1:1022,"RV"] 

We could send this to the HARmodel function, but what we really want to do is estimate, forecast, then re-estimate with the next sample and forecast, then re-estimate with the next sample, etc. We will do this in a loop.

So let’s first think about the beginning and end of each estimation sample. We already know them for the first sample: 1 and 1022. The second one will be 2 and 1023, then 3 and 1024, etc. What will the last one be? We have, altogether, 4,096 observations. If we want to make one-step-ahead forecasts, then the last day for which we will make a forecast is day 4096. That means that the last estimation window will end h=1 periods before that, i.e. at 4095. This dictates that the last sample begins 1021 observations earlier, i.e. at 3074 (=4095-1021).

So let us define all the estimation sample start days, end days and end dates. (These would change if you were forecasting, say, the 5-day realized volatility.)

est_start <- 1:3074
est_end <- 1022:4095
est_end_dates <- index(rv_data)[est_end]

est_info <- tibble(est_end_dates,est_start,est_end)
est_samples <- nrow(est_info)  # number of estimation samples

Now we have the information we need to run through 3074 estimations and forecasts. But before we do so we should also think about the things we want to save from each estimation and forecast. We will add that information to the est_info tibble.

We will want to save the estimated parameters, the forecast and the actual realisation. The latter two will then allow us to calculate measures of forecast accuracy.

est_info$beta0_1 <- NA   # four HAR coefficients to be saved
est_info$beta1_1 <- NA
est_info$beta2_1 <- NA
est_info$beta3_1 <- NA
est_info$f1 <- NA
est_info$r1 <- NA

Now we can think of doing the hard work

for(j in 1:est_samples) {
  st <- unlist(est_info[j,"est_start"])   # the unlist function strips the number
  en <- unlist(est_info[j,"est_end"])     # out of the tibble structure
  
  # Estimate the HAR model
  data_temp <- rv_data[st:en,"RV"] 
  mod_HAR_loop <- HARmodel(data = data_temp , periods = c(1,5,22), 
              type = "HAR", h = 1, inputType = "RM")
  est_info$beta0_1[j] <- mod_HAR_loop$coefficients[1]  # saves the estimated coefficients 
  est_info$beta1_1[j] <- mod_HAR_loop$coefficients[2]  # into est_info
  est_info$beta2_1[j] <- mod_HAR_loop$coefficients[3]
  est_info$beta3_1[j] <- mod_HAR_loop$coefficients[4]
  
  # We need the realisation from en+h
  est_info$r1[j] <- rv_data[en+1,"RV"]
  
  # now forecast one step ahead
  # we use the trick
  ## Pretend call of HARmodel to get the right data structure
  data_fore <- rv_data[(en-40):(en+h),"RV"]
  fore_HAR_loop <- HARmodel(data = data_fore , periods = c(1,5,22), 
                        type = "HAR", h = 1, inputType = "RM")

  ## Feed the data into predict
  data_fore<- fore_HAR_loop$model[nrow(fore_HAR_loop$model),c("RV1","RV5","RV22")]
  data_fore <- as.xts(data_fore)  # when extracting $model the data lose their
                                  # xts formatting, we need to add that back
  est_info$f1[j] <- predict(mod_HAR_loop,newdata = data_fore)
  
}

The following code replicates elements of Figure 2 in Clements and Preve (2021), which illustrates how the estimated coefficients vary through time.

p5 <- ggplot(est_info, aes(y=beta0_1,x=est_end_dates)) +
  geom_line() +
  labs(x = "Time",title="HAR1, beta_0") +
  theme_light()
p6 <- ggplot(est_info, aes(y=beta1_1,x=est_end_dates)) +
  geom_line() +
  labs(x = "Time",title="HAR1, beta_1") +
  theme_light()
p7 <- ggplot(est_info, aes(y=beta2_1,x=est_end_dates)) +
  geom_line() +
  labs(x = "Time",title="HAR1, beta_2") +
  theme_light()
p8 <- ggplot(est_info, aes(y=beta3_1,x=est_end_dates)) +
  geom_line() +
  labs(x = "Time",title="HAR1, beta_3") +
  theme_light()

ggarrange(p5,p6,p7,p8,nrow = 2, ncol = 2)

This basically replicates the OLS part of Figure 2 in Clements and Preve (2021).

Forecast evaluation

Let us also visualise the actual one step ahead forecasts against the realisations. We saved them in the variables f1 and r1.

p9 <- ggplot(est_info, aes(y=r1,x=est_end_dates)) +
  geom_line() +
  geom_line(aes(y = f1), color = "blue") +
  labs(x = "Time",title="HAR1, forecast and realisation") +
  theme_light()
p9

So we have a lot of forecasts and realisations. What we want is a summary statistic which describes the difference between them. Here we will calculate two common loss functions: the mean square error (MSE) and the QLIKE loss.
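With forecasts \(f_t\), realisations \(r_t\) and \(n\) forecast periods, these are defined as

\[\begin{equation} MSE = \frac{1}{n}\sum_{t}\left(r_t-f_t\right)^2, \qquad QLIKE = \frac{1}{n}\sum_{t}\left(\frac{r_t}{f_t}-\log\frac{r_t}{f_t}-1\right), \end{equation}\]

which is exactly what the code below computes.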

MSE = mean((est_info$r1-est_info$f1)^2)
QLIKE = mean(est_info$r1/est_info$f1 - log(est_info$r1/est_info$f1) - 1)

print(paste("MSE = ", MSE))
## [1] "MSE =  3.22861543645418"
print(paste("QLIKE = ", QLIKE))
## [1] "QLIKE =  0.13987581341473"

These measures could now be compared to the forecast losses of other forecasting models.

Variations to the basic model

HAR models come in a lot of different flavours, many of which are discussed and compared with the standard HAR model in Clements and Preve (2021).

In particular we could think of estimating standard HAR models on transformations of \(RV_t\), or extensions to the standard HAR model and combinations thereof.

Variable transformations

In the Explore data section we introduced transformations of the realized volatility measure, in particular a square root and a log transformation.

rv_data$logRV <- log(rv_data$RV)
rv_data$srRV <- 2*(sqrt(rv_data$RV)-1)

Here we will feed the square-root-transformed series into the standard HARmodel function (type = "HAR").

data_temp <- rv_data[,c("srRV")]
mod_HARsq <- HARmodel(data = data_temp , periods = c(1,5,22),
              type = "HAR", h = 1, inputType = "RM")
summary(mod_HARsq)
## 
## Call:
## "RV1 = beta0  +  beta1 * RV1 +  beta2 * RV5 +  beta3 * RV22"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9157 -0.2804 -0.0668  0.1869  8.3269 
## 
## Coefficients:
##        Estimate Std. Error t value Pr(>|t|)    
## beta0 -0.009102   0.011790  -0.772     0.44    
## beta1  0.396835   0.051875   7.650 2.49e-14 ***
## beta2  0.385709   0.074684   5.165 2.53e-07 ***
## beta3  0.161511   0.036630   4.409 1.06e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6159 on 4070 degrees of freedom
## Multiple R-squared:  0.7059, Adjusted R-squared:  0.7057 
## F-statistic:  3257 on 3 and 4070 DF,  p-value: < 2.2e-16

When comparing these parameters to those in column “sqr-HAR” of Table 2 in Clements and Preve (2021) we get basically identical parameter values.

The HARmodel function also has a transform option which allows you to apply the square root transformation inside the function, in which case we feed in the untransformed RV variable. What you get is the following:

data_temp <- rv_data[,c("RV")]
mod_HARsq2 <- HARmodel(data = data_temp , periods = c(1,5,22),
              type = "HAR", h = 1, inputType = "RM", transform = "sqrt")
summary(mod_HARsq2)
## 
## Call:
## "sqrt(RV1) = beta0  +  beta1 * sqrt(RV1) +  beta2 * sqrt(RV5) +  beta3 * sqrt(RV22)"
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5145 -0.1414 -0.0333  0.0937  4.2026 
## 
## Coefficients:
##       Estimate Std. Error t value Pr(>|t|)    
## beta0  0.06216    0.01507   4.125 3.78e-05 ***
## beta1  0.41021    0.04863   8.435  < 2e-16 ***
## beta2  0.33331    0.06617   5.037 4.93e-07 ***
## beta3  0.17455    0.03398   5.137 2.93e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3094 on 4070 degrees of freedom
## Multiple R-squared:  0.7032, Adjusted R-squared:  0.703 
## F-statistic:  3215 on 3 and 4070 DF,  p-value: < 2.2e-16

It turns out that the results are actually different. To understand the difference we look at the last line of the $model data for the standard model and the two transformed models:

mod_HAR$model[nrow(mod_HAR$model),]
##                   y      RV1       RV5     RV22
## 2013-08-30 0.540351 0.348314 0.2827714 0.249272
mod_HARsq$model[nrow(mod_HARsq$model),]
##                     y        RV1        RV5      RV22
## 2013-08-30 -0.5298285 -0.8196373 -0.9535826 -1.036222
mod_HARsq2$model[nrow(mod_HARsq2$model),]
##                    y       RV1       RV5      RV22
## 2013-08-30 0.7350857 0.5901814 0.5317625 0.4992715

First looking at the dependent variable y we can see that the dependent variable for mod_HARsq is \(-0.5298 = 2(\sqrt{0.3483}-1)\), i.e. exactly as we should expect given the way we calculated srRV. But for mod_HARsq2 the dependent variable is \(0.7351 = \sqrt{0.3483}\). Hence the square root transformation used inside the HARmodel function is slightly different to the square root transform used in Clements and Preve (2021).

There is an additional difference when it comes to the aggregation of the multi-day RV measures. For mod_HARsq we fed in the transformed variable and we let the function do the aggregation on this transformed variable. So for instance the RV5 value for the last day (“2013-08-30”) is -0.9536. This is the result of aggregating the transformed variable (remembering that explanatory variables are lagged by one period):

\[\begin{equation} RVsq_{4095}^5 = -0.9536=\frac{1}{5} \sum_{j=0}^4 RVsq_{4095-j} = \frac{1}{5} \sum_{j=0}^4 2(\sqrt{RV_{4095-j}} -1) \end{equation}\]

How does the HARmodel function handle the aggregation? In fact what the function does is as follows:

\[\begin{equation} RVsq_{4095}^5 = 0.5318=\sqrt{\frac{1}{5} \sum_{j=0}^4 RV_{4095-j}} = \sqrt{0.2827714} \ne \frac{1}{5} \sum_{j=0}^4 \sqrt{RV_{4095-j}} \end{equation}\]

The differences in definitions discussed above explain why the two models mod_HARsq and mod_HARsq2 have different parameter estimates. In terms of forecasting, the differences may not be so large though.
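You can verify the two conventions directly from the raw data; the five days averaged in the “2013-08-30” row are 2013-08-23 to 2013-08-29, i.e. rows 4091 to 4095:

rv_last5 <- as.numeric(rv_data$RV[4091:4095])   # RV for 2013-08-23 ... 2013-08-29
mean(2*(sqrt(rv_last5) - 1))    # average of the transformed series, approx -0.9536
sqrt(mean(rv_last5))            # transform of the average, approx 0.5318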

When it comes to forecasting from a transformed model, we will need to look out for one extra aspect. Even if we are modelling \(\sqrt{RV_t}\) or \(2(\sqrt{RV_t}-1)\), in the end we will be interested in predicting \(RV_t\). This means that after obtaining a prediction for either of the two transformed variables we have to transform it back. Such a transformation also involves a bias correction involving the estimation’s standard error. The details of this go beyond this piece, but are available from Clements and Preve (2021) and Taylor (2017).
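To illustrate the idea only (this is a rough sketch, not the exact specification or bias correction used in the paper): for a model of \(\log RV_t\) with approximately Gaussian errors, a common back-transformation is \(\widehat{RV}_t = \exp(f_t + \hat{\sigma}^2/2)\), where \(\hat{\sigma}^2\) is the residual variance of the log model.

# hedged sketch: a log-HAR fitted with lm() on logs of the lagged averages built earlier,
# estimated on the first 4095 days, forecasting day 4096 and transforming back
mod_HARlog <- lm(logRV ~ log(RV1) + log(RV5) + log(RV22), data = rv_data[1:4095, ])
sig2  <- summary(mod_HARlog)$sigma^2            # residual variance on the log scale
f_log <- predict(mod_HARlog,
                 newdata = as.data.frame(rv_data[4096, c("RV1","RV5","RV22")]))
exp(f_log + sig2/2)                             # bias-corrected forecast of RV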

Alternative HAR models

In the discussion above we restricted our attention to the standard HAR model. There is a wide range of variations of this model. Such variations may include additional covariates, realized jump measures, or realized quarticity measures. One of the alternative models is the HARQ model, which uses the realized quarticity, a measure of the variance of the realized volatility estimate. For model details refer to the papers by Bollerslev et al. (2016) or Clements and Preve (2021). The way in which such a model is estimated is shown here:

data_temp <- rv_data[,c("RV","RQ")]
mod_HARQ <- HARmodel(data = data_temp , periods = c(1,5,22), periodsQ = c(1),
              type = "HARQ", h = 1, inputType = "RM")
summary(mod_HARQ)
## 
## Call:
## "RV1 = beta0  +  beta1 * RV1 +  beta2 * RV5 +  beta3 * RV22 +  beta4 * RQ1"
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.890 -0.253 -0.054  0.120 45.960 
## 
## Coefficients:
##        Estimate Std. Error t value Pr(>|t|)    
## beta0 -0.009806   0.053806  -0.182   0.8554    
## beta1  0.576823   0.050057  11.523  < 2e-16 ***
## beta2  0.358626   0.116514   3.078   0.0021 ** 
## beta3  0.097615   0.083862   1.164   0.2445    
## beta4 -0.360197   0.060617  -5.942 3.05e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.533 on 4069 degrees of freedom
## Multiple R-squared:  0.5627, Adjusted R-squared:  0.5623 
## F-statistic:  1309 on 4 and 4069 DF,  p-value: < 2.2e-16

You can see that this basically replicates the results for HARQ in Clements and Preve (2021, Table 1).

Let us briefly demonstrate how we would forecast from this model, using the trick introduced earlier. Say we want to predict the realized volatility on 30 Aug 2013 (day 4096) at the end of 29 Aug 2013 (day 4095), \(E(RV_{4096}|I_{4095})\), where \(I_{4095}\) contains all information available at the end of 29 Aug 2013. Further, we want to use 1,000 observations to estimate the model parameters.

data_temp <- rv_data[3074:4095,c("RV","RQ")]
mod_HARQ <- HARmodel(data = data_temp , periods = c(1,5,22), periodsQ = c(1),
              type = "HARQ", h = 1, inputType = "RM")

This model now contains the parameter information to use in the forecast. Before we use the predict function to forecast we will use the HARmodel function again just to get the correct data for the right hand side variables in our model. Importantly, as we call the HARmodel function we now need to include the observation for which we want to forecast.

data_temp <- rv_data[4056:4096,c("RV","RQ")]
fore_HARQ <- HARmodel(data = data_temp , periods = c(1,5,22), periodsQ = c(1),
              type = "HARQ", h = 1, inputType = "RM")

We really re-estimated the model here, but we are not interested in the new coefficient estimates (which is why we only included a very short sample); we are just interested in the last line of fore_HARQ$model:

fore_HARQ$model[nrow(fore_HARQ$model),]
##                   y      RV1       RV5     RV22          RQ1
## 2013-08-30 0.540351 0.348314 0.2827714 0.249272 6.554222e-05

This has all the explanatory variables we require for our forecast

data_fore <- fore_HARQ$model[nrow(fore_HARQ$model),c("RV1","RV5","RV22","RQ1")]
data_fore <- as.xts(data_fore)  # when extracting $model the data lose their
                                # xts formatting, we need to add that back
predict(mod_HARQ,newdata = data_fore)
## [1] 0.381139

This is \(E(RV_{4096}|I_{4095})=\) 0.381139 and can be compared to the realized volatility at time 4096, \(RV_{4096}=0.540351\). So on this occasion we are underestimating the one day ahead realized volatility.

Calculating pseudo out-of-sample forecasts using the HARQ model

In order to produce all the out-of-sample forecasts we will, again, have to run through a loop (or, to avoid a second loop, you could expand the loop we wrote earlier). We will save the results in the est_info dataframe, in the variables fQ1 (for the forecast) and rQ1 (for the realisation).

est_info$fQ1 <- NA
est_info$rQ1 <- NA

Now the loop. Inside the loop we will have to 1) select the estimation sample, 2) estimate the model, 3) get the explanatory variables for the forecast, and 4) produce the forecast.

for(j in 1:est_samples) {
  
  # 1) get the estimation sample
  st <- unlist(est_info[j,"est_start"])   # the unlist function strips the number
  en <- unlist(est_info[j,"est_end"])     # out of the tibble structure
  data_temp <- rv_data[st:en,c("RV","RQ")] 
  
  # 2) Estimate the model
  mod_HARQ_loop <- HARmodel(data = data_temp , periods = c(1,5,22), periodsQ = c(1),
              type = "HARQ", h = 1, inputType = "RM")

  # 3) get the explanatory variables for the forecast
  data_temp <- rv_data[(en-40):(en+1),c("RV","RQ")]
  fore_HARQ_loop <- HARmodel(data = data_temp , periods = c(1,5,22), periodsQ = c(1),
              type = "HARQ", h = 1, inputType = "RM")
  data_fore <- fore_HARQ_loop$model[nrow(fore_HARQ_loop$model),c("RV1","RV5","RV22","RQ1")]
  data_fore <- as.xts(data_fore)  
                               
  # 4) forecast and save results
  est_info$fQ1[j] <- predict(mod_HARQ_loop,newdata = data_fore)
  est_info$rQ1[j] <- rv_data[en+1,"RV"]

}

Let us also visualise the actual one-step-ahead forecasts against the realisations. We saved them in the variables fQ1 and rQ1; the earlier HAR forecasts (f1) are added in red for comparison.

p10 <- ggplot(est_info, aes(y=rQ1,x=est_end_dates)) +
  geom_line() +
  geom_line(aes(y = fQ1), color = "blue") +
  geom_line(aes(y = f1), color = "red") +
  labs(x = "Time",title="HARQ1, forecast and realisation") +
  theme_light()
p10

As you can see from the above figure there is a particular time (actually the estimation period ending 2008-09-29) for which we have a negative forecast. The reason is a negative parameter estimate for the coefficient on RQ1. As we are forecasting a volatility measure this is not a reasonable forecast. As described in Clements and Preve (2021, Section 2.6.3) one can apply a so-called insanity filter to eliminate such forecasts; a sketch is given at the end of this section.

We can now use the information stored in fQ1 and rQ1 (the latter being identical to r1) to calculate our forecast loss function for the HARQ forecast. We will print the respective loss functions for the HAR forecasts which we calculated earlier.

MSE_Q = mean((est_info$rQ1-est_info$fQ1)^2)
QLIKE_Q = mean(est_info$rQ1/est_info$fQ1 - log(est_info$rQ1/est_info$fQ1) - 1)


print(paste("MSE_Q = ", MSE_Q))
## [1] "MSE_Q =  2.86474709372225"
print(paste("QLIKE_Q = ", QLIKE_Q))
## [1] "QLIKE_Q =  NaN"
print(paste("MSE = ", MSE))
## [1] "MSE =  3.22861543645418"
print(paste("QLIKE = ", QLIKE))
## [1] "QLIKE =  0.13987581341473"
print(paste("MSE_Q/MSE = ", MSE_Q/MSE))
## [1] "MSE_Q/MSE =  0.887298952168936"
print(paste("QLIKE_Q/QLIKE = ", QLIKE_Q/QLIKE))
## [1] "QLIKE_Q/QLIKE =  NaN"

As previously discussed we had one (actually two) periods in which the HARQ model forecast negative values. This causes the QLIKE calculation to fail, which is the reason for the missing QLIKE value for this model (as in Clements and Preve, 2021).
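To close, here is a sketch of the insanity-filter idea mentioned above, under the simple (assumed) rule that a non-positive forecast is replaced by the realized volatility observed on the last day of the corresponding estimation window; Clements and Preve (2021) and Bollerslev et al. (2016) describe the filter in terms of the in-sample minimum, maximum and mean instead.

last_rv <- as.numeric(rv_data$RV[est_info$est_end])   # last observed RV at forecast time
est_info$fQ1_filt <- ifelse(est_info$fQ1 <= 0, last_rv, est_info$fQ1)   # filtered forecasts
QLIKE_Q_filt <- mean(est_info$rQ1/est_info$fQ1_filt -
                     log(est_info$rQ1/est_info$fQ1_filt) - 1)
print(paste("QLIKE_Q_filt = ", QLIKE_Q_filt))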