This exercise is part of the ECLR page.
Here, we will demonstrate the use of factor analysis to address situations similar to Lasso and Ridge Regression, where you estimate a regression model containing a large number of potential or weak explanatory variables. These types of models are primarily used for prediction purposes. Let's assume we want to predict the return of the Dow Jones Industrial portfolio using the returns of its constituents from the previous period. Formally, given a \(k \times 1\) vector of individual returns from the previous day \(\boldsymbol r_{t-1}\), we aim to predict the return on the value-weighted portfolio at time \(t\), \(r^p_t\), i.e. assuming linearity:
\[r^p_t = \mu_0+{\boldsymbol{\beta r_{t-1}}} + u_t\]
According to the efficient market hypothesis (EMH), the explanatory power of this regression should be close to zero. One formulation of the EMH is that all relevant information at time \(t-1\) is already reflected in the stock prices at time \(t-1\). Thus, changes in prices (returns) cannot be predicted using past changes in prices, i.e. \(E(r_{t} | r_{t-1,1}, \ldots, r_{t-1,k}) = E(r_{t})\).
At the very least, we can anticipate that any individual return at time \(t-1\) has close to zero explanatory power for predicting the portfolio return at time \(t\). To check whether this is the case we will identify and download the necessary data. Then we will perform the above regression, but also estimate a model that uses principal components rather than individual returns.
The data to download includes the Dow Jones Industrial Average and its constituents. The data will be downloaded from Yahoo Finance using the pdfetch package. The list of constituents will be extracted from Wikipedia using the xml2 and rvest packages.
library(xml2) # package to read websites
library(rvest) # works with xml2 to access content of websites
library(pdfetch) # to access data
library(quantmod) # to access data
library(pls) # used for Principal Component Regression
We start by copying the entire content of the corresponding Wikipedia page:
url <- "https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average"
webpage <- xml2::read_html(url)
If you look at the webpage object in your environment you will see that it is actually a list of two things (node and doc). And if you try to look at webpage$node or webpage$doc, R will reply with a message like <pointer: 0x0000016a157ceba0>. Don't worry, all the webpage info is there (somewhere in your computer's memory, and the webpage object links/points to that memory), and you could look at it using as.character(webpage). But we are rarely interested in the entire html document, rather in particular elements. Below we will extract a table from that html document.
Second, we extract the table with the relevant data (the component companies of the DJIA). We are interested in what appears to be the first table on the page linked above. However, elements of an html page that are not obviously tables are sometimes coded as tables, so you may need to try a few indices to find the table you want. Here it turns out to be the 3rd element returned by html_table().
table_DJI <- rvest::html_table(webpage)[[3]]
print(table_DJI)
## # A tibble: 30 × 7
## Company Exchange Symbol Industry `Date added` Notes `Index weighting`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 3M NYSE MMM Conglom… 1976-08-09 "As … 2.06%
## 2 American Expre… NYSE AXP Financi… 1982-08-30 "" 4.34%
## 3 Amgen NASDAQ AMGN Biophar… 2020-08-31 "" 4.88%
## 4 Amazon NASDAQ AMZN Retaili… 2024-02-26 "" 2.85%
## 5 Apple NASDAQ AAPL Informa… 2015-03-19 "" 3.53%
## 6 Boeing NYSE BA Aerospa… 1987-03-12 "" 2.36%
## 7 Caterpillar NYSE CAT Constru… 1991-05-06 "" 5.99%
## 8 Chevron NYSE CVX Petrole… 2008-02-19 "Als… 2.30%
## 9 Cisco NASDAQ CSCO Informa… 2009-06-08 "" 0.86%
## 10 Coca-Cola NYSE KO Drink i… 1987-03-12 "Als… 1.06%
## # ℹ 20 more rows
Now we store the column with the company symbols (Symbol) as a vector of strings. We will use that information shortly to download the return data for these companies.
dji_ticks <- table_DJI$Symbol
print(dji_ticks)
## [1] "MMM" "AXP" "AMGN" "AMZN" "AAPL" "BA" "CAT" "CVX" "CSCO" "KO"
## [11] "DIS" "DOW" "GS" "HD" "HON" "IBM" "INTC" "JNJ" "JPM" "MCD"
## [21] "MRK" "MSFT" "NKE" "PG" "CRM" "TRV" "UNH" "VZ" "V" "WMT"
Here's how you can download the adjusted close daily prices of the Dow Jones Industrial Average constituents from January 2005 to July 2024 using the pdfetch package, and in particular its pdfetch_YAHOO function, as we get the data from the Yahoo Finance website.
DJIC = pdfetch_YAHOO(identifiers = dji_ticks, fields = "adjclose",
from = as.Date("2005-01-01"),
to = as.Date("2024-07-07"),
interval = "1d")
This puts the DJIC object in your environment. This is an xts type object, which means that R knows that these are time-series data.
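If you want to reassure yourself that the download worked, a quick optional check (assuming the pdfetch call above succeeded) could look like this:
class(DJIC)        # should report "xts" "zoo"
head(DJIC[, 1:3])  # first few rows for the first three tickers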
It is important to keep in mind that most things can be achieved in different ways. Above we used the pdfetch package to get the data. There is another package (quantmod) which can be used to get the same data. It has a function called getSymbols which obtains data for a particular ticker.
# the 6th column is the adjusted close, hence [,6]
DJIC <- getSymbols(dji_ticks[1],auto.assign = FALSE)[,6]
names(DJIC) <- dji_ticks[1]
for(i in seq(2, length(dji_ticks))){
temp <- getSymbols(dji_ticks[i],auto.assign = FALSE)[,6]
names(temp) <- dji_ticks[i]
DJIC = merge(DJIC, temp)
}
isna = colSums(is.na(DJIC))
print(isna)
But if the pdfetch function worked there is no need to do this as well.
Next we check for missing values.
isna = colSums(is.na(DJIC))
print(isna)
## MMM AXP AMGN AMZN AAPL BA CAT CVX CSCO KO DIS DOW GS HD HON IBM
## 0 0 0 0 0 0 0 0 0 0 0 3576 0 0 0 0
## INTC JNJ JPM MCD MRK MSFT NKE PG CRM TRV UNH VZ V WMT
## 0 0 0 0 0 0 0 0 0 0 0 0 807 0
Two of the stocks have a significant number of missing observations. Here's how you can drop Visa (V) and Dow Inc. (DOW) from the list of constituents:
DJIC_nmiss = DJIC[,isna==0]
Now we import the Dow Jones Industrial Average index itself (^DJI):
dji_ind = pdfetch_YAHOO(identifiers = "^DJI",fields = "adjclose",
from = as.Date("2005-01-01"),
to = as.Date("2024-07-07"),
interval = "1d")
plot(dji_ind,main = "Dow Jones Industrial")
Finally, we construct the log-return series for ^DJI and the log-returns of its constituents DJIC_nmiss, in percent:
djret = diff(log(dji_ind))[-1]*100
Xret = diff(log(DJIC_nmiss))[-1]*100
To see why the [-1] is needed, run the code without it and compare the result to the one above. The return calculation creates a missing value for the first observation and [-1] removes that first observation.
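You can see this directly by comparing the first few rows with and without the [-1]; a quick optional check using the dji_ind series from above:
head(diff(log(dji_ind)), 3)       # the first observation is NA
head(diff(log(dji_ind))[-1], 3)   # the [-1] removes that first (NA) observation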
Now that the data have been set up we can apply PCA to the transformed data.
Xpca <- prcomp(Xret, scale. = TRUE)
By default, the prcomp() function centers the variables to have a mean of zero. By using scale. = TRUE, we also scale the variables to have a standard deviation of one. The output from prcomp() contains several useful quantities.
names(Xpca)
## [1] "sdev" "rotation" "center" "scale" "x"
The center and scale components correspond to the means and standard deviations of the variables that were used for centering and scaling prior to implementing PCA. Since our data consists of daily returns (in percentages), the means are quite small.
print(Xpca$center)
## MMM AXP AMGN AMZN AAPL BA CAT CVX CSCO KO DIS
## 0.0197 0.0381 0.0390 0.0916 0.1114 0.0335 0.0495 0.0375 0.0260 0.0349 0.0299
## GS HD HON IBM INTC JNJ JPM MCD MRK MSFT NKE
## 0.0363 0.0514 0.0465 0.0248 0.0177 0.0285 0.0446 0.0530 0.0438 0.0657 0.0434
## PG CRM TRV UNH VZ WMT
## 0.0334 0.0846 0.0440 0.0538 0.0222 0.0361
print(Xpca$scale)
## MMM AXP AMGN AMZN AAPL BA CAT CVX CSCO KO DIS GS HD HON IBM INTC
## 1.47 2.23 1.62 2.38 2.04 2.21 2.00 1.80 1.78 1.15 1.75 2.20 1.65 1.59 1.44 2.00
## JNJ JPM MCD MRK MSFT NKE PG CRM TRV UNH VZ WMT
## 1.08 2.29 1.27 1.53 1.72 1.81 1.14 2.58 1.71 1.93 1.29 1.27
The rotation matrix provides the principal component loadings. The loadings for the first four factors are:
Xpca$rotation[,1:4]
## PC1 PC2 PC3 PC4
## MMM 0.205 0.00815 -0.10068 0.00167
## AXP 0.214 0.17652 -0.22970 -0.02541
## AMGN 0.159 -0.25218 0.14226 0.44770
## AMZN 0.155 0.19849 0.42360 0.05310
## AAPL 0.172 0.16457 0.32265 0.09081
## BA 0.184 0.16588 -0.17077 -0.02562
## CAT 0.199 0.18782 -0.16367 0.09177
## CVX 0.196 0.04205 -0.24122 0.18958
## CSCO 0.201 0.06323 0.16338 0.01717
## KO 0.189 -0.27132 -0.04337 -0.16665
## DIS 0.206 0.10713 -0.05766 -0.09269
## GS 0.207 0.20782 -0.21600 0.03725
## HD 0.206 0.03326 0.05785 -0.24750
## HON 0.226 0.08662 -0.13909 0.03053
## IBM 0.198 0.02620 0.00443 -0.01163
## INTC 0.189 0.12294 0.20137 0.06023
## JNJ 0.187 -0.36125 0.00284 0.18185
## JPM 0.208 0.17257 -0.27253 -0.03233
## MCD 0.183 -0.11510 0.00599 -0.19918
## MRK 0.163 -0.32640 -0.02154 0.31609
## MSFT 0.201 0.08593 0.31809 0.07422
## NKE 0.182 0.08906 0.08567 -0.30791
## PG 0.182 -0.36448 0.04454 -0.20044
## CRM 0.162 0.21000 0.33156 0.07927
## TRV 0.200 -0.03811 -0.22869 -0.00504
## UNH 0.167 -0.14823 -0.05102 0.31970
## VZ 0.173 -0.22104 -0.02149 -0.17901
## WMT 0.151 -0.26988 0.16368 -0.42904
What will be the result of sum(Xpca$rotation^2)?
By construction, \[\sum_{i=1}^{k} w_{i,j}^2 = 1,\] meaning that the squared loadings for one factor (over all \(k\) elements in Xret) sum to 1. As there are \(k\) factors, \[\sum_{j=1}^{k}\sum_{i=1}^{k} w_{i,j}^2 = k.\] In this case, \(k=28\).
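You can verify this numerically; a quick optional check using the Xpca object from above:
sum(Xpca$rotation^2)                           # should equal k = 28
round(crossprod(Xpca$rotation), 10)[1:3, 1:3]  # w'w is the identity matrix (3x3 corner shown)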
The standard deviations of the principal components are stored in Xpca$sdev:
Xpca$sdev
## [1] 3.578 1.236 1.123 0.917 0.893 0.875 0.810 0.800 0.765 0.754 0.743 0.732
## [13] 0.727 0.720 0.706 0.695 0.683 0.674 0.654 0.647 0.629 0.627 0.610 0.608
## [25] 0.583 0.556 0.523 0.465
The fraction of the total variance explained by each factor is then:
pve = Xpca$sdev^2/sum(Xpca$sdev^2)
pve
## [1] 0.45726 0.05458 0.04508 0.03006 0.02849 0.02734 0.02343 0.02284 0.02089
## [10] 0.02030 0.01971 0.01914 0.01890 0.01852 0.01781 0.01723 0.01666 0.01624
## [19] 0.01526 0.01496 0.01413 0.01402 0.01327 0.01320 0.01215 0.01102 0.00978
## [28] 0.00773
The first factor explains 45.7% of all the variance and the second one explains 5.4% of it. The last factor explains 0.7% of the total variance. We can plot the proportion of variance explained:
plot(pve, xlab = "Principal Component", ylab = "Proportion of Variance Explained", ylim = c(0,0.5), type = "o")
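It can also be useful to look at the cumulative proportion of variance explained; a small optional addition to the plot above:
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "o")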
What will be the result of the sum(Xpca$sdev^2) calculation? Will it be different from the sum(var(Xpca$x)) calculation?
Let's define a \(T\times k\) matrix of factors \(F\) (Xpca$x), a \(k\times k\) matrix of loadings \(w\) (Xpca$rotation), and a \(T\times k\) matrix of demeaned standardised returns \(R\). By construction, \(F = R w\). The sample covariance matrix of \(F\) (var(Xpca$x)), given demeaned \(R\), can be written as \[\widehat{Cov}(F) = \frac{1}{T-1} w'R'Rw.\] As \(w'w = ww' = I_{k\times k}\) (see the previous hint) and \(\widehat{Cov}(f_i,f_j)=0\) for \(i\ne j\), it follows that \(\widehat{Cov}(F)\) is a diagonal matrix, up to rounding error. sum(Xpca$sdev^2) is the sum of the variances of all factors, i.e. \(trace(\widehat{Cov}(F))=k\). sum(var(Xpca$x)) is \[\sum_{i=1}^k\sum_{j=1}^k \widehat{Cov}(f_i,f_j) = \sum_{i=1}^k \widehat{Cov}(f_i,f_i) = Tr(\widehat{Cov}(F)).\] On the other hand, \[Tr(\widehat{Cov}(F)) = \frac{1}{T-1} Tr(w'R'Rw) = \frac{1}{T-1} Tr(ww'R'R) = Tr\left(\frac{1}{T-1}R'R\right) = \sum_{i=1}^k \widehat{Cov}(r_i,r_i) = k,\] as for standardised returns \(\widehat{Cov}(r_i,r_i)=1\).
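Again, this can be confirmed numerically; a quick optional check (any small discrepancies are rounding error):
sum(Xpca$sdev^2)        # trace of Cov(F), equal to k = 28
sum(diag(var(Xpca$x)))  # the same trace computed from the factors directly
sum(var(Xpca$x))        # off-diagonal covariances are numerically zero, so this is (almost exactly) 28 as well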
In Xpca$x you now have the principal components, and you could attempt to explain the variation in daily DJI index returns by regressing them on, say, the first two principal components lagged by one day. But how do you know how many principal components you should use? This calls for a cross-validation approach!
To facilitate this we shall use the pcr() function, which is part of the pls library (loaded at the top of the code). pcr is short for principal components regression. We now apply PCR to the Dow Jones data, imported in section one, in an attempt to predict one-day-ahead Dow Jones returns using 28 out of its 30 previous-day constituents (recall that we dropped DOW and V).
First, we need to construct a vector of dependent variables and a (lagged) matrix of explanatory variables.
set.seed(2) # as we are using cross validation (which uses random data splits)
            # we set a random seed such that repeated applications
            # produce identical results
T = dim(djret)[1]
y = djret[-1] # remove first observation, y_{t}
X = Xret[-T,] # remove last observation, X_{t-1}
We can check that the data are now aligned correctly (i.e. with a one-day lag for X) by looking at the start and end dates.
print(c(start(y),start(X)))
## [1] "2005-01-05" "2005-01-04"
print(c(end(y),end(X)))
## [1] "2024-07-05" "2024-07-03"
Now we can apply the principal component regression.
out_full <- pcr(y~.,data = X, scale = TRUE, validation = "CV")
The syntax for the pcr() function is similar to that for lm(), with a few additional options. Setting scale = TRUE has the same effect as scale. = TRUE in prcomp(). Setting validation = "CV" causes pcr() to compute the ten-fold cross-validation error for each possible value of \(M\), the number of principal components used.
summary(out_full)
## Data: X dimension: 4907 28
## Y dimension: 4907 1
## Fit method: svdpc
## Number of components considered: 28
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 1.152 1.144 1.144 1.144 1.143 1.143 1.142
## adjCV 1.152 1.144 1.144 1.144 1.143 1.142 1.142
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1.142 1.142 1.142 1.143 1.143 1.144 1.144
## adjCV 1.142 1.142 1.142 1.143 1.143 1.143 1.144
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## CV 1.145 1.145 1.145 1.144 1.144 1.144 1.145
## adjCV 1.144 1.145 1.145 1.144 1.143 1.144 1.144
## 21 comps 22 comps 23 comps 24 comps 25 comps 26 comps 27 comps
## CV 1.146 1.144 1.145 1.145 1.144 1.145 1.146
## adjCV 1.146 1.144 1.144 1.144 1.143 1.144 1.145
## 28 comps
## CV 1.146
## adjCV 1.145
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 45.729 51.187 55.692 58.697 61.547 64.281 66.624 68.907
## y 1.564 1.738 1.776 2.074 2.418 2.443 2.466 2.551
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 70.997 73.027 75.00 76.912 78.80 80.653 82.434
## y 2.552 2.628 2.63 2.679 2.73 2.854 2.896
## 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps 22 comps
## X 84.157 85.822 87.446 88.972 90.469 91.882 93.284
## y 2.915 3.425 3.481 3.484 3.484 3.486 3.845
## 23 comps 24 comps 25 comps 26 comps 27 comps 28 comps
## X 94.611 95.931 97.146 98.248 99.227 100.000
## y 3.867 3.868 4.119 4.131 4.162 4.413
The \(CV\) score is provided for each possible number of components, ranging from \(M = 0\) onwards. Please note that pcr() reports the root mean squared error. We can plot the cross-validation scores using the validationplot() function. In particular, to visualise the mean squared error:
validationplot(out_full, val.type = "MSEP")
The smallest cross-validation error occurs when \(M = 6\), which suggests using six principal components in the model.
All of the information you need to figure out which number of components delivers the minimum cross-validated error is contained in out_full, although that object does not directly report the mean squared prediction error (MSEP). That can be obtained with:
MSEP(out_full,estimate = "CV")
This has now calculated the cross-validated MSEP. You can plot this and identify the minimum (just as you could identify the minimum in the earlier plot and see that it is associated with a 6-component regression).
If you wish to have R identify this directly you could for instance do this with the following command:
names(which.min(MSEP(out_full,estimate = "CV")$val[1,1,]))
## [1] "6 comps"
This tells you that the 6 component model worked best.
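Having settled on \(M = 6\), you can extract the implied coefficients and fitted values directly from out_full; a short sketch using the coef() and predict() methods provided by the pls package (the names b6 and fit6 are just illustrative):
b6 <- drop(coef(out_full, ncomp = 6))       # implied regression coefficients of the 6-component model
head(b6)
fit6 <- drop(predict(out_full, ncomp = 6))  # in-sample fitted values of the 6-component model
head(fit6)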
The summary() function also provides the percentage of variance explained in the predictors and in the response for different numbers of components (see the previous section). For example, the first factor explains 45.729% of the variance in X, with a corresponding \(R^2 = 1.564\)%. For \(M=6\) the cumulative figures are \(64.281\)% and \(2.443\)%, respectively.
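If you prefer to pull these numbers out programmatically rather than reading them off the summary() table, the pls package provides the explvar() and R2() helpers; a short sketch:
cumsum(explvar(out_full))[1:6]    # cumulative % of X variance explained by the first 6 components
R2(out_full, estimate = "train")  # cumulative R^2 for the response, by number of components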
As you can see, even though the explanatory power of the principal component regression is small, it is statistically significant, which contradicts the Efficient Market Hypothesis discussed briefly above. Let's take a closer look at the distribution of the returns:
hist(y,breaks = 50)
Most of the returns are concentrated around 0. However, there are outliers with 5% to 10% daily (!) returns. These outliers are typically attributed to times of financial crises, when all stocks tend to move together in the same (negative) direction, followed by positive corrections. Let's check whether these outliers are the main driver for the predictability of returns.
# indicator variables to indicate which obs to include
insample1 <- (abs(y) < 5) # only use obs where the absolute return is smaller than 5%
insample2 <- (abs(y) < 4) # ... smaller than 4%
insample3 <- (abs(y) < 3) # ... smaller than 3%
sum(!insample1)
## [1] 27
sum(!insample2)
## [1] 54
sum(!insample3)
## [1] 111
out1 <- pcr(y~.,data = X, scale = TRUE, validation = "CV",subset = insample1)
out2 <- pcr(y~.,data = X, scale = TRUE, validation = "CV",subset = insample2)
out3 <- pcr(y~.,data = X, scale = TRUE, validation = "CV",subset = insample3)
summary(out1)
## Data: X dimension: 4880 28
## Y dimension: 4880 1
## Fit method: svdpc
## Number of components considered: 28
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 1.001 0.9984 0.9988 0.9990 0.9985 0.9980 0.9982
## adjCV 1.001 0.9984 0.9988 0.9989 0.9984 0.9979 0.9981
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 0.9972 0.9975 0.9982 0.9979 0.9980 0.9983 0.9982
## adjCV 0.9970 0.9973 0.9980 0.9976 0.9978 0.9981 0.9975
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## CV 0.9977 0.9982 0.9986 0.9980 0.9984 0.9988 0.9989
## adjCV 0.9974 0.9979 0.9983 0.9977 0.9980 0.9984 0.9985
## 21 comps 22 comps 23 comps 24 comps 25 comps 26 comps 27 comps
## CV 0.9987 0.9991 0.9993 0.9995 0.9989 0.9992 0.9992
## adjCV 0.9981 0.9986 0.9988 0.9991 0.9984 0.9987 0.9988
## 28 comps
## CV 1.001
## adjCV 1.000
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 44.319 49.9246 54.4879 57.620 60.56 63.287 65.712 68.031
## y 0.677 0.7035 0.7136 0.867 1.07 1.077 1.332 1.335
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 70.213 72.314 74.344 76.316 78.217 80.113 81.946
## y 1.338 1.441 1.459 1.459 1.677 1.698 1.698
## 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps 22 comps
## X 83.735 85.44 87.095 88.67 90.212 91.661 93.086
## y 1.698 1.81 1.811 1.84 1.846 1.963 1.967
## 23 comps 24 comps 25 comps 26 comps 27 comps 28 comps
## X 94.487 95.878 97.136 98.266 99.27 100.000
## y 1.983 1.986 2.154 2.154 2.24 2.241
#summary(out2)
#summary(out3)
validationplot(out1,val.type = "MSEP")
By dropping 27, 54 and 111 observations, the explanatory power of the full model goes down from \(4.162\)% to \(2.24\)%, \(1.779\)% and \(1.13\)%, respectively. The validation plot below corresponds to the last subsample, for which the minimum occurs at \(M = 8\) with an \(R^2\) of \(0.657\)%.
validationplot(out3,val.type = "MSEP")
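To see how the optimal number of components changes across the three trimmed samples, you can apply the same which.min() logic as before; a small sketch assuming out1, out2 and out3 from above:
# optimal number of components for each trimmed sample, by cross-validated MSEP
names(which.min(MSEP(out1, estimate = "CV")$val[1, 1, ]))
names(which.min(MSEP(out2, estimate = "CV")$val[1, 1, ]))
names(which.min(MSEP(out3, estimate = "CV")$val[1, 1, ]))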
Extra Question
Repeat the analysis using the S&P 500 index and its constituents.
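A possible starting point is sketched below, following the same steps as for the DJIA. Note that the Wikipedia URL, the position of the constituents table and the Yahoo ticker for the index (^GSPC) are assumptions you should verify, that some tickers are written with a dot on Wikipedia but a dash on Yahoo Finance (e.g. BRK.B versus BRK-B), and that many constituents will have missing early observations, so the missing-value step matters even more here. The object names are just illustrative.
url_sp <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
webpage_sp <- xml2::read_html(url_sp)
table_SP <- rvest::html_table(webpage_sp)[[1]]   # check which table holds the constituents
sp_ticks <- gsub("\\.", "-", table_SP$Symbol)    # convert Wikipedia tickers to Yahoo format
SPC <- pdfetch_YAHOO(identifiers = sp_ticks, fields = "adjclose",
                     from = as.Date("2005-01-01"), to = as.Date("2024-07-07"),
                     interval = "1d")
sp_ind <- pdfetch_YAHOO(identifiers = "^GSPC", fields = "adjclose",
                        from = as.Date("2005-01-01"), to = as.Date("2024-07-07"),
                        interval = "1d")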