This exercise is part of the ECLR page.

Principal Component Analysis. Time Series Application.

PCA Framework

Here, we will demonstrate the use of factor analysis to address situations similar to those tackled by Lasso and Ridge regression, where you estimate a regression model containing a large number of potential, possibly weak, explanatory variables. These types of models are primarily used for prediction. Let's assume we want to predict the return of the Dow Jones Industrial portfolio using the previous period's returns of its constituents. Formally, given a \(k \times 1\) vector of individual returns from the previous day, \(\boldsymbol{r}_{t-1}\), we aim to predict the return on the value-weighted portfolio at time \(t\), \(r^p_t\), i.e. assuming linearity:

\[r^p_t = \mu_0 + \boldsymbol{\beta}'\boldsymbol{r}_{t-1} + u_t\]

According to the efficient market hypothesis (EMH), the explanatory power of this regression should be close to zero. One formulation of the EMH is that all relevant information at time \(t-1\) is already reflected in the stock prices at time \(t-1\). Thus, changes in prices (returns) cannot be predicted using past changes in prices, i.e., \(E(r_{t} \mid r_{t-1,1}, \ldots, r_{t-1,k}) = E(r_{t})\).

At the very least, we can anticipate that any individual return at time \(t-1\) has close to zero explanatory power for the portfolio return at time \(t\). To check whether this is the case we will identify and download the necessary data. Then we will estimate the above regression, but also a model that uses principal components rather than individual returns.

Data Import

The data to download includes the Dow Jones Industrial Average and its constituents. The data will be downloaded from Yahoo Finance using the pdfetch package. The list of constituents will be extracted from Wikipedia using the xml2 and rvest packages.

library(xml2)   # package to read websites
library(rvest)  # works with xml2 to access content of websites
library(pdfetch)  # to access data  
library(quantmod) # to access data 
library(pls)      # used for Principal Component Regression

We start by copying the entire content of the corresponding Wikipedia page:

url <- "https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average"
webpage <- xml2::read_html(url)

If you look at the webpage object in your environment you see that it is actually a list of two things (node and doc). If you try to look at webpage$node or webpage$doc, R will reply with a message like <pointer: 0x0000016a157ceba0>. Don't worry, all the webpage info is there (somewhere in your computer's memory, and the webpage object points to that memory), and you could look at it using as.character(webpage). We are rarely interested in the entire html document, however, but rather in particular elements. Below we will extract a table from that html document.
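
If you do want a quick peek at the raw html, a small sketch:

substr(as.character(webpage), 1, 200)   # first 200 characters of the html source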

Next, we extract the table with the relevant data (the component companies of the DJIA). We want the first table displayed on the linked page. However, some elements of an html page that are not obviously tables are coded as tables, so you may need to try a few to find the one you want. Here it turns out to be the 3rd; one way to find it is to compare the dimensions of all candidate tables, as sketched below.
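
A sketch of that check (tables is our name for the list of all tables on the page; the constituents table should have 30 rows):

tables <- rvest::html_table(webpage)   # all tables found on the page
sapply(tables, function(tb) paste(dim(tb), collapse = " x "))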

table_DJI <- rvest::html_table(webpage)[[3]] 
print(table_DJI)
## # A tibble: 30 × 7
##    Company         Exchange Symbol Industry `Date added` Notes `Index weighting`
##    <chr>           <chr>    <chr>  <chr>    <chr>        <chr> <chr>            
##  1 3M              NYSE     MMM    Conglom… 1976-08-09   "As … 2.06%            
##  2 American Expre… NYSE     AXP    Financi… 1982-08-30   ""    4.34%            
##  3 Amgen           NASDAQ   AMGN   Biophar… 2020-08-31   ""    4.88%            
##  4 Amazon          NASDAQ   AMZN   Retaili… 2024-02-26   ""    2.85%            
##  5 Apple           NASDAQ   AAPL   Informa… 2015-03-19   ""    3.53%            
##  6 Boeing          NYSE     BA     Aerospa… 1987-03-12   ""    2.36%            
##  7 Caterpillar     NYSE     CAT    Constru… 1991-05-06   ""    5.99%            
##  8 Chevron         NYSE     CVX    Petrole… 2008-02-19   "Als… 2.30%            
##  9 Cisco           NASDAQ   CSCO   Informa… 2009-06-08   ""    0.86%            
## 10 Coca-Cola       NYSE     KO     Drink i… 1987-03-12   "Als… 1.06%            
## # ℹ 20 more rows

Now we store the column with the company symbols (Symbol) as a vector of strings. We will use that information soon to download the return data for these companies.

dji_ticks <- table_DJI$Symbol
print(dji_ticks)
##  [1] "MMM"  "AXP"  "AMGN" "AMZN" "AAPL" "BA"   "CAT"  "CVX"  "CSCO" "KO"  
## [11] "DIS"  "DOW"  "GS"   "HD"   "HON"  "IBM"  "INTC" "JNJ"  "JPM"  "MCD" 
## [21] "MRK"  "MSFT" "NKE"  "PG"   "CRM"  "TRV"  "UNH"  "VZ"   "V"    "WMT"

Here is how you can download the daily adjusted close prices of the Dow Jones Industrial Average constituents from January 2005 to July 2024 using the pdfetch package, in particular its pdfetch_YAHOO function, since we get the data from the Yahoo Finance website.

DJIC = pdfetch_YAHOO(identifiers = dji_ticks, fields = "adjclose",
                     from = as.Date("2005-01-01"),
                     to = as.Date("2024-07-07"),
                     interval = "1d")

This puts the DJIC object in your environment. This is an xts-type object, which means that R knows that these are time-series data.
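
You can confirm this by looking at the object's class and its time index:

class(DJIC)         # "xts" "zoo"
head(index(DJIC))   # the dates attached to the observations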

It is important to keep in mind that most things can be achieved in different ways. Above we used the pdfetch package to get the data. Another package, quantmod, can be used to get the same data: its getSymbols function obtains data for a particular ticker.

# the 6th column is the adjusted close, hence [,6]
DJIC <- getSymbols(dji_ticks[1],auto.assign = FALSE)[,6] 
names(DJIC) <- dji_ticks[1]
for(i in seq(2, length(dji_ticks))){
  temp <- getSymbols(dji_ticks[i],auto.assign = FALSE)[,6]
  names(temp) <- dji_ticks[i]
  DJIC = merge(DJIC, temp)
}

But if the pdfetch function worked, there is no need to do this as well.

Next we check for missing values.

isna = colSums(is.na(DJIC))
print(isna)
##  MMM  AXP AMGN AMZN AAPL   BA  CAT  CVX CSCO   KO  DIS  DOW   GS   HD  HON  IBM 
##    0    0    0    0    0    0    0    0    0    0    0 3576    0    0    0    0 
## INTC  JNJ  JPM  MCD  MRK MSFT  NKE   PG  CRM  TRV  UNH   VZ    V  WMT 
##    0    0    0    0    0    0    0    0    0    0    0    0  807    0

Two of the stocks have a significant number of missing observations. Here is how you can drop Visa (V) and Dow Inc. (DOW) from the list of constituents:

DJIC_nmiss = DJIC[,isna==0]
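
A quick check that the two columns were indeed dropped:

dim(DJIC_nmiss)   # should now have 28 columns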

Now we import the Dow Jones Industrial Average ^DJI index:

dji_ind = pdfetch_YAHOO(identifiers = "^DJI",fields = "adjclose",
                    from = as.Date("2005-01-01"),
                    to = as.Date("2024-07-07"),
                    interval = "1d")
plot(dji_ind,main = "Dow Jones Industrial")

Finally, we construct log-return series, in percent, for ^DJI and for its constituents in DJIC_nmiss:

djret = diff(log(dji_ind))[-1]*100
Xret = diff(log(DJIC_nmiss))[-1]*100

To investigate why the [-1] is needed, run the code without it and compare the results to those above. The return calculation creates a missing value for the first observation, and [-1] removes that first observation.
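
You can see the problem directly by looking at the first rows before removing them:

head(diff(log(dji_ind)) * 100, 3)   # the first row is NA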

PCA construction

Now that the data have been set up we can apply PCA to the transformed data.

Xpca <-  prcomp(Xret, scale. = TRUE)

By default, the prcomp() function centers the variables to have a mean of zero. By using scale. = TRUE, we scale the variables to have a standard deviation of one. The output from prcomp() contains several useful quantities.

names(Xpca)
## [1] "sdev"     "rotation" "center"   "scale"    "x"

The center and scale components correspond to the means and standard deviations of the variables that were used for scaling prior to implementing PCA. Since our data consists of daily returns (in percentages), the means are quite small.

print(Xpca$center)
##    MMM    AXP   AMGN   AMZN   AAPL     BA    CAT    CVX   CSCO     KO    DIS 
## 0.0197 0.0381 0.0390 0.0916 0.1114 0.0335 0.0495 0.0375 0.0260 0.0349 0.0299 
##     GS     HD    HON    IBM   INTC    JNJ    JPM    MCD    MRK   MSFT    NKE 
## 0.0363 0.0514 0.0465 0.0248 0.0177 0.0285 0.0446 0.0530 0.0438 0.0657 0.0434 
##     PG    CRM    TRV    UNH     VZ    WMT 
## 0.0334 0.0846 0.0440 0.0538 0.0222 0.0361
print(Xpca$scale)
##  MMM  AXP AMGN AMZN AAPL   BA  CAT  CVX CSCO   KO  DIS   GS   HD  HON  IBM INTC 
## 1.47 2.23 1.62 2.38 2.04 2.21 2.00 1.80 1.78 1.15 1.75 2.20 1.65 1.59 1.44 2.00 
##  JNJ  JPM  MCD  MRK MSFT  NKE   PG  CRM  TRV  UNH   VZ  WMT 
## 1.08 2.29 1.27 1.53 1.72 1.81 1.14 2.58 1.71 1.93 1.29 1.27
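
These are simply the column means and standard deviations of the return series, which you can verify directly (a quick sketch):

max(abs(Xpca$center - colMeans(Xret)))       # should be ~0
max(abs(Xpca$scale - apply(Xret, 2, sd)))    # should be ~0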

The rotation matrix provides the principal component loadings. The loadings for the first four factors are:

Xpca$rotation[,1:4]
##        PC1      PC2      PC3      PC4
## MMM  0.205  0.00815 -0.10068  0.00167
## AXP  0.214  0.17652 -0.22970 -0.02541
## AMGN 0.159 -0.25218  0.14226  0.44770
## AMZN 0.155  0.19849  0.42360  0.05310
## AAPL 0.172  0.16457  0.32265  0.09081
## BA   0.184  0.16588 -0.17077 -0.02562
## CAT  0.199  0.18782 -0.16367  0.09177
## CVX  0.196  0.04205 -0.24122  0.18958
## CSCO 0.201  0.06323  0.16338  0.01717
## KO   0.189 -0.27132 -0.04337 -0.16665
## DIS  0.206  0.10713 -0.05766 -0.09269
## GS   0.207  0.20782 -0.21600  0.03725
## HD   0.206  0.03326  0.05785 -0.24750
## HON  0.226  0.08662 -0.13909  0.03053
## IBM  0.198  0.02620  0.00443 -0.01163
## INTC 0.189  0.12294  0.20137  0.06023
## JNJ  0.187 -0.36125  0.00284  0.18185
## JPM  0.208  0.17257 -0.27253 -0.03233
## MCD  0.183 -0.11510  0.00599 -0.19918
## MRK  0.163 -0.32640 -0.02154  0.31609
## MSFT 0.201  0.08593  0.31809  0.07422
## NKE  0.182  0.08906  0.08567 -0.30791
## PG   0.182 -0.36448  0.04454 -0.20044
## CRM  0.162  0.21000  0.33156  0.07927
## TRV  0.200 -0.03811 -0.22869 -0.00504
## UNH  0.167 -0.14823 -0.05102  0.31970
## VZ   0.173 -0.22104 -0.02149 -0.17901
## WMT  0.151 -0.26988  0.16368 -0.42904

What will be the result of sum(Xpca$rotation^2)?

By construction, \[\sum_{i=1}^{k} w_{i,j}^2 = 1\] meaning that the sum of squared loadings for one factor (over all \(k\) elements in Xret) sum to 1.

As there are \(k\) factors:

\[\sum_{j=1}^{k}\sum_{i=1}^{k} w_{i,j}^2 = k\]

In this case, \(k=28\).
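
You can confirm this numerically:

sum(Xpca$rotation^2)   # = 28, since each of the k = 28 loading vectors has unit length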

The standard deviations of the principal components are stored in

Xpca$sdev
##  [1] 3.578 1.236 1.123 0.917 0.893 0.875 0.810 0.800 0.765 0.754 0.743 0.732
## [13] 0.727 0.720 0.706 0.695 0.683 0.674 0.654 0.647 0.629 0.627 0.610 0.608
## [25] 0.583 0.556 0.523 0.465

The fraction of the total variance explained by each factor is then:

pve = Xpca$sdev^2/sum(Xpca$sdev^2)
pve
##  [1] 0.45726 0.05458 0.04508 0.03006 0.02849 0.02734 0.02343 0.02284 0.02089
## [10] 0.02030 0.01971 0.01914 0.01890 0.01852 0.01781 0.01723 0.01666 0.01624
## [19] 0.01526 0.01496 0.01413 0.01402 0.01327 0.01320 0.01215 0.01102 0.00978
## [28] 0.00773

The first factor explains 45.7% of the total variance and the second one explains 5.5%. The last factor explains only 0.8% of the total variance. We can plot the proportion of variance explained:

plot(pve, xlab = "Principal Component", ylab = "Proportion of Variance Explained", ylim = c(0,0.5), type = "o")
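
It is often also useful to look at the cumulative proportion of variance explained, for instance (a small sketch using the same pve vector):

plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "o")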

What will be the result of the sum(Xpca$sdev^2) calculation? Will it be different from the sum(var(Xpca$x)) calculation?

Let's define a \(T\times k\) matrix of factors \(F\) (Xpca$x), a \(k\times k\) matrix of loadings \(w\) (Xpca$rotation), and a \(T\times k\) matrix of demeaned standardised returns \(R\). By construction, \(F = R \times w\).
A sample covariance matrix of \(F\) (var(Xpca$x)), given demeaned \(R\), can be defined as \[\widehat{Cov}(F) = \frac{1}{T-1} w'R'Rw\] As \(w'w = ww' = I_{k\times k}\) (see previous hint) and \(Cov(f_i,f_j)=0, i\ne j\), it follows that \(\widehat{Cov}(F)\) is a diagonal matrix up to rounding error. sum(Xpca$sdev^2) is a sum of variances of all factors, i.e. \(trace(\widehat{Cov}(F))=k\).

sum(var(Xpca$x)) is \[\sum_{i=1}^k\sum_{j=1}^k \widehat{cov(f_i,f_j)} = \sum_{i=1}^k \widehat{cov(f_i,f_i)} = Tr(\widehat{Cov}(F))\] On the other hand,

\[Tr(\widehat{Cov}(F)) = \frac{1}{T-1} Tr(w'R'Rw) = \frac{1}{T-1} Tr(ww'R'R) = Tr\left(\frac{1}{T-1}R'R\right) = \sum_{i=1}^k \widehat{cov(r_i,r_i)}=k\] since for standardised returns \(\widehat{cov(r_i,r_i)}=1\). Both calculations therefore return \(k=28\).
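
You can verify both results numerically:

sum(Xpca$sdev^2)    # trace of the factor covariance matrix: k = 28
sum(var(Xpca$x))    # off-diagonal covariances are ~0, so this is also ~28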

PCA regression

In Xpca$x you now have the principal components, and you could attempt to explain the variation in daily DJI index returns by regressing them on, say, the first two principal components lagged by one day (see the sketch below). But how do you know how many principal components you should use? This calls for a cross-validation approach!
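
For concreteness, here is a sketch of such a two-component regression (the names nT, F2 and y2 are ours, for illustration only):

nT <- nrow(Xpca$x)            # number of daily observations
F2 <- Xpca$x[-nT, 1:2]        # first two principal components, lagged one day
y2 <- as.numeric(djret[-1])   # day-t index returns as a plain vector
summary(lm(y2 ~ F2))          # the t- and F-statistics hint at predictive power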

To implement the cross-validation we shall use the pcr() function, which is part of the pls library (loaded at the top of the code); pcr is short for principal component regression. We now apply PCR to the Dow Jones data imported in the Data Import section, in an attempt to predict one-day-ahead Dow Jones returns using 28 out of its 30 previous-day constituents (recall that we dropped DOW and V).

First, we need to construct a vector of dependent variables and a (lagged) matrix of explanatory variables.

set.seed(2) # as we are using cross validation (which uses random data splits)
            # we set a random seed so that repeated applications
            # produce identical results

T = dim(djret)[1]
y = djret[-1] # remove first observation, y_{t}
X = Xret[-T,] # remove last observation, X_{t-1}

We can test that the data are aligned correctly now (i.e. with one day lag for X) by looking at the start and end dates.

print(c(start(y),start(X)))
## [1] "2005-01-05" "2005-01-04"
print(c(end(y),end(X)))
## [1] "2024-07-05" "2024-07-03"

Now we can apply the principal component regression.

out_full <- pcr(y~.,data = X, scale = TRUE, validation = "CV")

The syntax for the pcr() function is similar to that for lm(), with a few additional options. Setting scale = TRUE has the same effect as scale. = TRUE in prcomp(). Setting validation = "CV" causes pcr() to compute the ten-fold cross-validation error for each possible value of \(M\), the number of principal components used.

summary(out_full)
## Data:    X dimension: 4907 28 
##  Y dimension: 4907 1
## Fit method: svdpc
## Number of components considered: 28
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           1.152    1.144    1.144    1.144    1.143    1.143    1.142
## adjCV        1.152    1.144    1.144    1.144    1.143    1.142    1.142
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       1.142    1.142    1.142     1.143     1.143     1.144     1.144
## adjCV    1.142    1.142    1.142     1.143     1.143     1.143     1.144
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps  20 comps
## CV        1.145     1.145     1.145     1.144     1.144     1.144     1.145
## adjCV     1.144     1.145     1.145     1.144     1.143     1.144     1.144
##        21 comps  22 comps  23 comps  24 comps  25 comps  26 comps  27 comps
## CV        1.146     1.144     1.145     1.145     1.144     1.145     1.146
## adjCV     1.146     1.144     1.144     1.144     1.143     1.144     1.145
##        28 comps
## CV        1.146
## adjCV     1.145
## 
## TRAINING: % variance explained
##    1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X   45.729   51.187   55.692   58.697   61.547   64.281   66.624   68.907
## y    1.564    1.738    1.776    2.074    2.418    2.443    2.466    2.551
##    9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X   70.997    73.027     75.00    76.912     78.80    80.653    82.434
## y    2.552     2.628      2.63     2.679      2.73     2.854     2.896
##    16 comps  17 comps  18 comps  19 comps  20 comps  21 comps  22 comps
## X    84.157    85.822    87.446    88.972    90.469    91.882    93.284
## y     2.915     3.425     3.481     3.484     3.484     3.486     3.845
##    23 comps  24 comps  25 comps  26 comps  27 comps  28 comps
## X    94.611    95.931    97.146    98.248    99.227   100.000
## y     3.867     3.868     4.119     4.131     4.162     4.413

The \(CV\) score is provided for each possible number of components, ranging from \(M = 0\) onwards. Please note that pcr() reports the root mean squared error. We can plot the cross-validation scores using the validationplot() function. In particular, to visualize the mean squared error:

validationplot(out_full, val.type = "MSEP")

The smallest cross-validation error occurs when \(M = 6\). This suggests using six principal components in the model.

All of the information you need to determine which number of components delivers the minimum cross-validated error is contained in out_full, although it does not directly report the mean squared prediction error (MSEP). That can be obtained with

MSEP(out_full,estimate = "CV")

This calculates the cross-validated MSEP. You can plot it and identify the minimum (just as you could identify the minimum in the earlier plot and see that it is associated with a 6 component regression).

If you wish to have R identify this directly you could for instance do this with the following command:

names(which.min(MSEP(out_full,estimate = "CV")$val[1,1,]))
## [1] "6 comps"

This tells you that the 6 component model worked best.

The summary() function also provides the percentage of variance explained in the predictors and in the response for different numbers of components (see the previous section). For example, the first factor explains 45.729% of the total variance in X, with a corresponding \(R^2\) of \(1.564\)%. For \(M=6\) these numbers are \(64.281\)% and \(2.443\)%, respectively.
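
If you want a rough check that this modest explanatory power is statistically significant, you can run a plain regression of the index return on the first six lagged components (a sketch; F6 is our name, and this only approximates the pcr fit, since Xpca was computed on the full sample):

F6 <- Xpca$x[-T, 1:6]             # first six principal components, lagged one day
summary(lm(as.numeric(y) ~ F6))   # the overall F-test gauges joint significance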

Predictability of returns

As you can see, even though the explanatory power of the PCA regression is small, it is statistically significant, which contradicts the Efficient Market Hypothesis discussed briefly above. Let's take a closer look at the distribution of returns:

hist(y,breaks = 50)

Most of the returns are concentrated around 0. However, there are outliers with 5% to 10% daily (!) returns. These outliers are typically attributed to times of financial crises, when all stocks tend to move together in the same (negative) direction, followed by positive corrections. Let's check whether these outliers are the main driver for the predictability of returns.

# indicator variables to select which observations to include
insample1 <- (abs(y) < 5) # only use obs with absolute returns below 5%
insample2 <- (abs(y) < 4)
insample3 <- (abs(y) < 3)

sum(!insample1)
## [1] 27
sum(!insample2)
## [1] 54
sum(!insample3)
## [1] 111
out1 <- pcr(y~.,data = X, scale = TRUE, validation = "CV",subset = insample1)
out2 <- pcr(y~.,data = X, scale = TRUE, validation = "CV",subset = insample2)
out3 <- pcr(y~.,data = X, scale = TRUE, validation = "CV",subset = insample3)
summary(out1)
## Data:    X dimension: 4880 28 
##  Y dimension: 4880 1
## Fit method: svdpc
## Number of components considered: 28
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           1.001   0.9984   0.9988   0.9990   0.9985   0.9980   0.9982
## adjCV        1.001   0.9984   0.9988   0.9989   0.9984   0.9979   0.9981
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV      0.9972   0.9975   0.9982    0.9979    0.9980    0.9983    0.9982
## adjCV   0.9970   0.9973   0.9980    0.9976    0.9978    0.9981    0.9975
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps  20 comps
## CV       0.9977    0.9982    0.9986    0.9980    0.9984    0.9988    0.9989
## adjCV    0.9974    0.9979    0.9983    0.9977    0.9980    0.9984    0.9985
##        21 comps  22 comps  23 comps  24 comps  25 comps  26 comps  27 comps
## CV       0.9987    0.9991    0.9993    0.9995    0.9989    0.9992    0.9992
## adjCV    0.9981    0.9986    0.9988    0.9991    0.9984    0.9987    0.9988
##        28 comps
## CV        1.001
## adjCV     1.000
## 
## TRAINING: % variance explained
##    1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X   44.319  49.9246  54.4879   57.620    60.56   63.287   65.712   68.031
## y    0.677   0.7035   0.7136    0.867     1.07    1.077    1.332    1.335
##    9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X   70.213    72.314    74.344    76.316    78.217    80.113    81.946
## y    1.338     1.441     1.459     1.459     1.677     1.698     1.698
##    16 comps  17 comps  18 comps  19 comps  20 comps  21 comps  22 comps
## X    83.735     85.44    87.095     88.67    90.212    91.661    93.086
## y     1.698      1.81     1.811      1.84     1.846     1.963     1.967
##    23 comps  24 comps  25 comps  26 comps  27 comps  28 comps
## X    94.487    95.878    97.136    98.266     99.27   100.000
## y     1.983     1.986     2.154     2.154      2.24     2.241
#summary(out2)
#summary(out3)

validationplot(out1,val.type = "MSEP")

By dropping 27, 54 and 111 observations, the full explanatory power falls from \(4.162\)% to \(2.24\)%, \(1.779\)% and \(1.13\)%, respectively. The validation plot below corresponds to the last subsample, for which the minimum cross-validated error occurs at \(M = 8\), with \(0.657\)% of the variance in y explained.

validationplot(out3,val.type = "MSEP")

Extra Question

Repeat the analysis using S&P 500 index and S&P 500 constituents.
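
As a starting point, the list of S&P 500 constituents can be scraped from Wikipedia in the same way (a sketch; the table index on that page may differ, so inspect the candidates as before):

url_sp <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
webpage_sp <- xml2::read_html(url_sp)
tables_sp <- rvest::html_table(webpage_sp)
sapply(tables_sp, function(tb) paste(dim(tb), collapse = " x "))  # find the 500-row table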
