Empirical macroeconomic research depends on the availability of time-series data for macroeconomic variables. Collating the data, and making them comparable to previously published work, can be very burdensome. To make this process easier and to make work more comparable, Michael McCracken and Serena Ng collaborated with the Federal Reserve Bank of St. Louis, which maintains the FRED database, to create a standard macroeconomic dataset for the U.S. You can read in detail about this dataset here.
Here we will show how to access and use these data. The walkthrough is largely based on the example presented in Yankang Chen’s vignette to the fbi package.
After importing and preparing the data we will apply factor estimation to the dataset. As you will see, there are more than 120 variables in the dataset. Many of these variables are highly correlated with each other and potentially represent very similar information. For many economic applications it may therefore be advantageous to extract a smaller number of factors from this large dataset. It is this process of extracting factors which is demonstrated below.
Ensure that your working directory is set correctly.
setwd("YOUR/WORKING/DIRECTORY")
We will also use a standard set of libraries.
library(tidyverse)
library(ggplot2)
library(stargazer)
There are two ways in which you could load the FRED-MD dataset. You can either download the current (or indeed a past) version of the FRED-MD database from Michael McCracken’s FRED website, or you can use an R package that was designed to access the data directly. The package is called fbi and is written by Yankang Chen. This package is not available from the CRAN webpage from which you can download most R packages; it is available from the author’s Github page. If you go to that page and read the README section you will find the installation instructions. There the package’s author tells you that you should ensure that you already have the stats, readr and pracma packages installed. So please go ahead and check that you have these, and if not, install them in the normal way.
Then you can install the package directly from the Github page using the install_github function from the devtools package (which is a package you also need to have installed). As usual, this only needs to be done once on your computer.
devtools::install_github("cykbennie/fbi")
When I installed this I got a prompt. I answered “1” and the package installed fully.
Let’s use the fbi package to import the data. We first need to point the code at the location where the dataset is stored (filepath).
library(fbi)
filepath <- "https://files.stlouisfed.org/files/htdocs/fred-md/monthly/current.csv"
data.md <- fredmd(filepath, date_start = NULL, date_end = NULL, transform = TRUE)
In your environment you will now have an object data.md. The file was stored at “https://files.stlouisfed.org/files/htdocs/fred-md/monthly/current.csv”. The file “current.csv” always contains the most current version, but you can also get timed snapshots (vintages). We will reload the data using the vintage fixed in September 2024.
filepath <- "https://files.stlouisfed.org/files/htdocs/fred-md/monthly/2024-09.csv"
data.md <- fredmd(filepath, date_start = NULL, date_end = NULL, transform = TRUE)
data.md.0 <- fredmd(filepath, date_start = NULL, date_end = NULL, transform = FALSE)
names(data.md)
## [1] "date" "RPI" "W875RX1" "DPCERA3M086SBEA"
## [5] "CMRMTSPLx" "RETAILx" "INDPRO" "IPFPNSS"
## [9] "IPFINAL" "IPCONGD" "IPDCONGD" "IPNCONGD"
## [13] "IPBUSEQ" "IPMAT" "IPDMAT" "IPNMAT"
## [17] "IPMANSICS" "IPB51222S" "IPFUELS" "CUMFNS"
## [21] "HWI" "HWIURATIO" "CLF16OV" "CE16OV"
## [25] "UNRATE" "UEMPMEAN" "UEMPLT5" "UEMP5TO14"
## [29] "UEMP15OV" "UEMP15T26" "UEMP27OV" "CLAIMSx"
## [33] "PAYEMS" "USGOOD" "CES1021000001" "USCONS"
## [37] "MANEMP" "DMANEMP" "NDMANEMP" "SRVPRD"
## [41] "USTPU" "USWTRADE" "USTRADE" "USFIRE"
## [45] "USGOVT" "CES0600000007" "AWOTMAN" "AWHMAN"
## [49] "HOUST" "HOUSTNE" "HOUSTMW" "HOUSTS"
## [53] "HOUSTW" "PERMIT" "PERMITNE" "PERMITMW"
## [57] "PERMITS" "PERMITW" "ACOGNO" "AMDMNOx"
## [61] "ANDENOx" "AMDMUOx" "BUSINVx" "ISRATIOx"
## [65] "M1SL" "M2SL" "M2REAL" "BOGMBASE"
## [69] "TOTRESNS" "NONBORRES" "BUSLOANS" "REALLN"
## [73] "NONREVSL" "CONSPI" "S&P 500" "S&P div yield"
## [77] "S&P PE ratio" "FEDFUNDS" "CP3Mx" "TB3MS"
## [81] "TB6MS" "GS1" "GS5" "GS10"
## [85] "AAA" "BAA" "COMPAPFFx" "TB3SMFFM"
## [89] "TB6SMFFM" "T1YFFM" "T5YFFM" "T10YFFM"
## [93] "AAAFFM" "BAAFFM" "TWEXAFEGSMTHx" "EXSZUSx"
## [97] "EXJPUSx" "EXUSUKx" "EXCAUSx" "WPSFD49207"
## [101] "WPSFD49502" "WPSID61" "WPSID62" "OILPRICEx"
## [105] "PPICMM" "CPIAUCSL" "CPIAPPSL" "CPITRNSL"
## [109] "CPIMEDSL" "CUSR0000SAC" "CUSR0000SAD" "CUSR0000SAS"
## [113] "CPIULFSL" "CUSR0000SA0L2" "CUSR0000SA0L5" "PCEPI"
## [117] "DDURRG3M086SBEA" "DNDGRG3M086SBEA" "DSERRG3M086SBEA" "CES0600000008"
## [121] "CES2000000008" "CES3000000008" "UMCSENTx" "DTCOLNVHFNM"
## [125] "DTCTHFNM" "INVEST" "VIXCLSx"
We will continue to use this timed version to ensure that the following results do not change as the data are updated.
The fredmd function has a number of input options:

date_start: if = NULL then the earliest available data are used; if you want to begin at a certain date, set e.g. date_start = as.Date("2020-01-01").

date_end: if = NULL then the latest available data are used; if you want to end at a certain date, set e.g. date_end = as.Date("2022-09-01").

transform: if = FALSE the original data are returned; if = TRUE the transformed data (see details below).
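For instance, a call restricted to a particular sample window might look as follows (a minimal sketch; the object name data.md.sub and the two dates are arbitrary illustrations):
data.md.sub <- fredmd(filepath,
                      date_start = as.Date("2000-01-01"),  # hypothetical start date
                      date_end = as.Date("2019-12-01"),    # hypothetical end date
                      transform = TRUE)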
It is important to understand the data structure. Fortunately the fbi package has a useful function which allows you to access information on the variables in the dataset, including information on the transformations used if transform = TRUE is set.
var.info <- describe_md(names(data.md), name.only = FALSE)
head(var.info)
## id tcode ttype fred
## 1 1 5 First difference of natural log: ln(x)-ln(x-1) RPI
## 1.1 1 5 First difference of natural log: ln(x)-ln(x-1) RPI
## 2 2 5 First difference of natural log: ln(x)-ln(x-1) W875RX1
## 3 3 5 First difference of natural log: ln(x)-ln(x-1) DPCERA3M086SBEA
## 4 4 5 First difference of natural log: ln(x)-ln(x-1) CMRMTSPLx
## 5 5 5 First difference of natural log: ln(x)-ln(x-1) RETAILx
## description gsi gsi:description
## 1 Real Personal Income M_14386177 PI
## 1.1 Real Personal Income M_14386177 PI
## 2 Real personal income ex transfer receipts M_145256755 PI less transfers
## 3 Real personal consumption expenditures M_123008274 Real Consumption
## 4 Real Manu. and Trade Industries Sales M_110156998 M&T sales
## 5 Retail and Food Services Sales M_130439509 Retail sales
## group edited
## 1 Output and Income FALSE
## 1.1 Output and Income FALSE
## 2 Output and Income FALSE
## 3 Consumption, Orders, and Inventories FALSE
## 4 Consumption, Orders, and Inventories TRUE
## 5 Consumption, Orders, and Inventories TRUE
In var.info you can now find the descriptions of the data. Let’s look up one of them.
var.info[var.info$fred == "VXOCLSx",]
## [1] id tcode ttype fred
## [5] description gsi gsi:description group
## [9] edited
## <0 rows> (or 0-length row.names)
Note: The function describe_md returns the variable information as available in 2015-01. Some variables have changed and some have been added since, which is why the lookup above comes back empty. You may have to refer to this file, which tracks changes to the database. See the example below.
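For a series that is still in the database under its 2015 name, such as industrial production, the lookup returns the variable information (a quick check; output not shown):
var.info[var.info$fred == "INDPRO",]  # the ttype column shows the first difference of the natural log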
Here you can see that the transformed version of the industrial production (INDPRO) series is the log difference. You may be interested in what types of data transformations have been used here.
unique(var.info$ttype)
## [1] First difference of natural log: ln(x)-ln(x-1)
## [2] First difference: x(t)-x(t-1)
## [3] Level (i.e. no transformation): x(t)
## [4] Natural log: ln(x)
## [5] Second difference of natural log: (ln(x)-ln(x-1))-(ln(x-1)-ln(x-2))
## [6] First difference of percent change: (x(t)/x(t-1)-1)-(x(t-1)/x(t-2)-1)
## 6 Levels: First difference of natural log: ln(x)-ln(x-1) ...
As you can see some series are not transformed at all and then there are five different types of data transformations. Essentially the transformations are done to create stationary data series. Let us illustrate this by displaying the untransformed INDPRO (from data.md.0) and the transformed series (from data.md).
p1 <- ggplot(data.md.0, aes(x = date, y = INDPRO)) +
geom_line() +
labs(title = "Industrial Production", y = "INDPRO")
p1
p2 <- ggplot(data.md, aes(x = date, y = INDPRO)) +
geom_line() +
labs(title = "Industrial Production Growth", y = "delta(log(INDPRO))")
p2
You can look at data summaries using the standard summary function.
var.info[var.info$fred == "FEDFUNDS",]
## id tcode ttype fred description
## 84 84 2 First difference: x(t)-x(t-1) FEDFUNDS Effective Federal Funds Rate
## gsi gsi:description group edited
## 84 M_110155157 Fed Funds Interest and Exchange Rates FALSE
summary(data.md$FEDFUNDS)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -6.630000 -0.070000 0.010000 0.003621 0.120000 3.060000 1
Here you can see the summary info for the first difference of the Effective Federal Funds rate. You can also see that there is one missing observation. This is most likely the first observation which gets “lost” as a first difference was calculated.
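We can check this directly; if that reasoning is right, the following should return only the first date of the sample (a small sanity check):
data.md$date[is.na(data.md$FEDFUNDS)]  # expected: only the first sample date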
Let’s look at another series, the number of new private housing permits, PERMIT.
var.info[var.info$fred == "PERMIT",]
## id tcode ttype fred description
## 55 55 4 Natural log: ln(x) PERMIT New Private Housing Permits (SAAR)
## gsi gsi:description group edited
## 55 M_110155532 BP: total Housing FALSE
summary(data.md$PERMIT)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6.240 7.024 7.213 7.182 7.408 7.791 12
Here you can see that there are 12 missing observations. Let’s look at a plot of the data.
p3 <- ggplot(data.md, aes(x = date, y = PERMIT)) +
geom_line() +
labs(title = "New private housing permits", y = "log(PERMIT)")
p3
You could investigate which observations are missing.
data.md[is.na(data.md$PERMIT),c("date", "PERMIT")]
## date PERMIT
## 1 1959-01-01 NA
## 2 1959-02-01 NA
## 3 1959-03-01 NA
## 4 1959-04-01 NA
## 5 1959-05-01 NA
## 6 1959-06-01 NA
## 7 1959-07-01 NA
## 8 1959-08-01 NA
## 9 1959-09-01 NA
## 10 1959-10-01 NA
## 11 1959-11-01 NA
## 12 1959-12-01 NA
Clearly it is the 1959 data that are not available for this data series. In fact we can find out how many observations are missing altogether.
sum(is.na(data.md))
## [1] 1159
If you read the paper describing the FRED-MD dataset you will find a discussion about outlier removal. The function that removes outlying observations (replacing them with NA) is called rm_outliers.fredmd(). We apply this to the data.md dataset, creating data.md.clean. Refer to the paper to see how exactly an outlier is identified. There are different ways to deal with outliers and in practical work you have to think carefully about this procedure.
data.md.clean <- rm_outliers.fredmd(data.md)
As outliers are replaced by NA we can easily find out how many outliers were identified.
sum(is.na(data.md.clean))-sum(is.na(data.md))
## [1] 163
Before applying a factor model we will have to think about how to deal with missing data. In the end we will want to use only observations that have data available for all variables. In data.md.clean we have 788 observations for 127 variables (but note that the first one is just a date variable and we will eventually not use it).
First, we will remove variables if they have too many missing values (recall that they could come from outliers being removed).
col_na_prop <- apply(is.na(data.md.clean), 2, mean)
This identifies the proportion of missing values for each variable. Recall that for all variables for which a growth rate was calculated we know that at least one observation (the first one) is missing.
Let’s see which variables have more than 5% of observations missing.
col_na_prop[col_na_prop>0.05]
## ACOGNO ANDENOx TWEXAFEGSMTHx UMCSENTx VIXCLSx
## 0.50634518 0.13959391 0.21446701 0.29060914 0.05329949
You can refer to the information in var.info to confirm that ACOGNO and ANDENOx relate to order information for consumer and capital goods respectively. UMCSENTx is a consumer sentiment index. As it turns out, for the variables TWEXAFEGSMTHx and VIXCLSx we cannot find any information in var.info. These are variables that have been added to the database after 2015. We can find information on these variables from this file. VIXCLSx is a measure of stock market volatility implied by option prices. The series TWEXAFEGSMTHx is new and has replaced another series which is also not included in var.info. In fact, the information in the file linked above is not conclusive either, but a web search for the variable id TWEXAFEGSMTH reveals that it is a “Nominal Advanced Foreign Economies U.S. Dollar Index”.
Let us now remove these five series from the dataset.
data.md.select <- data.md.clean[, (col_na_prop < 0.05)]
The remaining variables will still have missing data. We shall now remove all rows of data (dates) in which we have any missing information.
data.md.select.2 <- na.omit(data.md.select)
By comparing the number of rows of data.md.select and data.md.select.2 you can identify that nrow(data.md.select) - nrow(data.md.select.2) = 76 dates have been removed.
It may be good to know whether this removed mainly dispersed dates or a whole period. As we still have the date variable in both datasets we can compare the two sets of dates and identify which ones have disappeared.
data.md.select$date[!(data.md.select$date %in% data.md.select.2$date)]
## [1] "1959-01-01" "1959-02-01" "1959-03-01" "1959-04-01" "1959-05-01"
## [6] "1959-06-01" "1959-07-01" "1959-08-01" "1959-09-01" "1959-10-01"
## [11] "1959-11-01" "1959-12-01" "1960-04-01" "1971-10-01" "1971-12-01"
## [16] "1973-07-01" "1974-01-01" "1974-02-01" "1977-01-01" "1977-02-01"
## [21] "1977-12-01" "1978-01-01" "1978-04-01" "1979-10-01" "1980-03-01"
## [26] "1980-04-01" "1980-05-01" "1980-07-01" "1980-10-01" "1980-11-01"
## [31] "1980-12-01" "1981-02-01" "1981-04-01" "1981-05-01" "1981-06-01"
## [36] "1981-09-01" "1981-11-01" "1982-01-01" "1982-08-01" "2001-02-01"
## [41] "2001-10-01" "2006-10-01" "2006-11-01" "2008-01-01" "2008-02-01"
## [46] "2008-03-01" "2008-04-01" "2008-05-01" "2008-09-01" "2008-11-01"
## [51] "2008-12-01" "2009-01-01" "2009-02-01" "2009-03-01" "2009-04-01"
## [56] "2009-05-01" "2010-03-01" "2010-04-01" "2010-12-01" "2011-01-01"
## [61] "2013-01-01" "2015-12-01" "2016-01-01" "2020-03-01" "2020-04-01"
## [66] "2020-05-01" "2020-06-01" "2020-07-01" "2021-01-01" "2021-02-01"
## [71] "2021-03-01" "2021-04-01" "2021-06-01" "2021-07-01" "2024-07-01"
## [76] "2024-08-01"
You can see that 12 of the dates arise from the entire year 1959 being dropped. The only coherent periods of missing data are March to July 2020 and essentially the first half of 2021, which is most likely related to missing data from the Covid period.
In fact, for this exercise we will reduce the sample to include data until March 2014. This results in a data period which has also been used in the McCracken and Ng (2016) paper introducing the FRED-MD dataset.
data.md.select.2 <- as.data.frame(data.md.select.2)
datesel <- (data.md.select.2$date < "2014-04-01")
data.md.select.2 <- data.md.select.2[datesel,]
We will continue working with the data.md.select.2 dataset. But before proceeding we will remove the date variable and instead keep the date information as row names.
dates <- data.md.select.2$date # keep the date information
data.md.select.2 <- data.md.select.2[,2:ncol(data.md.select.2)] # remove the date variable
rownames(data.md.select.2) <- dates # add the dates as row names
This is not the place where the theory behind factor models is explained; here we will merely show how to implement factor estimation. The data we currently hold in data.md.select.2 can be represented as the \((T \times N)\) matrix \(\mathbf{X}\). The idea of a factor representation is that \(\mathbf{X}\) can be decomposed into
\[\begin{equation} \mathbf{X} = \mathbf{F} \mathbf{\Lambda'} + e \end{equation}\]
where \(\mathbf{F}\) is a \((T \times r)\) matrix of common factors with \(r \ll N\). The \((N \times r)\) matrix of factor loadings, \(\mathbf{\Lambda}\), describes how each of the \(N\) variables in \(\mathbf{X}\) is related to each of the \(r\) factors in \(\mathbf{F}\). The term \(\mathbf{F} \mathbf{\Lambda'}\) delivers a \((T \times N)\) matrix of common components.
The error component \(e\) is also a \((T \times N)\) matrix.
There are a number of key issues when estimating the factors. A good discussion is available from Bai and Ng, 2019. The method proposed there is implemented through the fbi package we installed earlier.
out <- rpca(data.md.select.2, kmax = 8, standardize = TRUE, tau = 0)
Here we implemented the factor estimation without rank regularisation of the common component (tau = 0), for standardised variables (meaning that all variables are scaled to have mean 0 and standard deviation 1), and for 8 factors (kmax = 8).
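To get a feel for what such a principal-components factor estimator does, here is a minimal by-hand sketch under the usual normalisation. This is an illustration of the general technique only, not necessarily the exact algorithm implemented in rpca, and the object names X, Lam.pc and F.pc are just for this illustration.
# Illustrative sketch: principal-components factors computed by hand
X <- scale(as.matrix(data.md.select.2))       # standardise each series: mean 0, sd 1
eig <- eigen(crossprod(X) / nrow(X))          # eigen-decomposition of the sample covariance matrix
r <- 8                                        # number of factors retained
Lam.pc <- eig$vectors[, 1:r] * sqrt(ncol(X))  # loadings under the Lambda'Lambda/N = I normalisation
F.pc <- X %*% Lam.pc / ncol(X)                # (T x r) matrix of estimated factors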
The output of the factor model estimation (out) saves a number of things, amongst others the estimated factors ($Fhat) and the factor loadings ($Lamhat). Let’s look at the factors.
factors.est <- as.data.frame(out$Fhat) # extract factors and put in dataframe
factors.est$date <- dates # add date variable
# create a long dataframe
factors.est <- factors.est %>%
pivot_longer(!date, names_to = "Factor", values_to = "Values")
# plot the time series for the extracted factors
ggplot(factors.est, aes(x=date, y = Values)) +
geom_line() +
facet_wrap(vars(Factor))
Let us see how the factors load on the different variables. As we are using standardised variables, looking at the coefficients in \(\Lambda\) is informative.
lambda.1 <- sort(out$Lamhat[,1])
lambda.1[1:10]
## USGOOD IPMANSICS PAYEMS INDPRO MANEMP IPFPNSS CUMFNS
## -0.8433376 -0.8281917 -0.8275448 -0.8097664 -0.8062688 -0.7778244 -0.7747936
## DMANEMP IPFINAL IPDMAT
## -0.7725531 -0.7248871 -0.6988267
It is important to understand that the signs of the factor loadings do not have a useful interpretation. These are the 10 variables with the largest (absolute) loadings on factor 1. This can be compared to those reported in McCracken and Ng (2016, Table 2). Of the ten variables loading most heavily (as reported in Table 2), seven appear on the above list. The three variables that are not identified here (NAPM, NAPMNOI, NAPMPI) are all variables that are no longer in the FRED-MD database (see the change log for confirmation).
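You can confirm that these three series are indeed absent from the current version of the dataset:
c("NAPM", "NAPMNOI", "NAPMPI") %in% names(data.md)  # expected: FALSE FALSE FALSE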
Here we will attempt to replicate Table 2 in the McCracken and Ng (2016) paper.
After the factors are estimated, we regress the \(i\)th series in the dataset on the first \(k\) of the \(r\) (orthogonal) factors. For \(k = 1,\ldots,r\), this yields \(R_i^2(k)\) for series \(i\). The incremental explanatory power of factor \(k\) is \(incR_i^2(k) = R_i^2(k)-R_i^2(k-1)\), \(k = 2, \ldots, r\). The average importance of factor \(k\) is \(mR^2(k)=\frac{1}{N}\sum_{i=1}^N incR_i^2(k)\). We compute these quantities in two ways. First, as described, we regress series \(i\) on the first \(k\) factors at each iteration and afterwards construct the incremental \(R^2\) by differencing. Second, we regress series \(i\) on the \(k\)th factor only.
K = dim(out$Fhat)[2] # extracts the number of Factors, same as kmax as set in rpca
N = dim(out$X)[2] # extracts number of series
# create a matrix which stores various R^2
# dimension KxN
R2.ik <- matrix(0,K,N) # stores R_i^2(k)
colnames(R2.ik) <- colnames(out$X)
R2.inc.ik <- R2.ik # version 1
# loop through all factors (k) and all series (n)
for (k in seq(1,K)) {
for (i in seq(1,N)){
# regress series i on factors 1 to k
temp = summary(lm(out$X[,i]~ out$Fhat[,1:k]))
R2.ik[k,i] = temp$r.squared
# regress series i on factor k only
temp = summary(lm(out$X[,i]~ out$Fhat[,k]))
R2.inc.ik[k,i] = temp$r.squared
}
}
# Alternative way to calculate the contribution of the kth factor
R2.inc.ik.alt <- R2.inc.ik # version 2
R2.inc.ik.alt[2:K,] <- R2.ik[2:K,] - R2.ik[1:(K-1),]
At this stage we have saved \(R_i^2(k)\) in R2.ik and \(incR_i^2(k)\) in R2.inc.ik as well as R2.inc.ik.alt. We could calculate the contribution of the \(k\)th factor either by \(incR_i^2(k) = R_i^2(k)-R_i^2(k-1)\), \(k = 2, \ldots, r\) (as implemented in R2.inc.ik.alt), or by merely regressing the \(i\)th variable on the \(k\)th factor (as implemented in R2.inc.ik). The latter version works only because the factors are orthogonal. Compare R2.inc.ik and R2.inc.ik.alt to convince yourself that they are (numerically) identical.
Also compare R2.ik and R2.inc.ik. Looking at these two matrices you should see that the first row is identical but subsequent rows are not.
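A quick way to make both comparisons numerically is the following (the expected values follow from the discussion above):
max(abs(R2.inc.ik - R2.inc.ik.alt))  # numerically close to zero: the two versions agree
max(abs(R2.ik[1,] - R2.inc.ik[1,]))  # exactly zero: the first rows are identical
max(abs(R2.ik[2,] - R2.inc.ik[2,]))  # clearly positive: subsequent rows differ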
Now we are in a position to use that information to calculate \(mR^2(k)\). This will be saved in R2.m.k (the code also notes, in a comment, an alternative way to compute the same quantity).
# Storage vectors
R2.m.k <- matrix(0,K,1)
# Importance of factors 1 to K
for (i in seq(1,K)){
R2.m.k[i] = mean(R2.inc.ik[i,])
}
# an alternative way to calculate this would be
# R2.m.k <- rowMeans(R2.inc.ik)
print(R2.m.k)
## [,1]
## [1,] 0.13961874
## [2,] 0.07735284
## [3,] 0.06698293
## [4,] 0.05319509
## [5,] 0.04622442
## [6,] 0.03471929
## [7,] 0.02753143
## [8,] 0.02539408
Here you can see that the first factor explains almost 14% of the variation in the variables, and the second factor explains a further 8% or so. This compares to 16% and 7% for the first two factors in Table 2 of the McCracken and Ng paper.
Next we wish to identify the 10 variables that load most heavily on the first 8 factors (also replicating the 8 subtables in Table 2). In fact we will only show the first two, but you can replicate this for all 8.
for (k in seq(1,2)){
names(k) = "Factor" # this gives k a name, for display reasons only
# order starting with variable loading heaviest onto the kth factor
temp <- sort(R2.inc.ik[k,],decreasing = TRUE)
temp2 <- c(k,temp[1:10])
stargazer(temp2,type = "text") # stargazer produces a good looking table
}
##
## ==================================================================================
## Factor USGOOD IPMANSICS PAYEMS INDPRO MANEMP IPFPNSS CUMFNS DMANEMP IPFINAL IPDMAT
## ----------------------------------------------------------------------------------
## 1 0.712 0.687 0.686 0.657 0.651 0.606 0.601 0.598 0.526 0.489
## ----------------------------------------------------------------------------------
##
## ==========================================================================================================
## Factor CUSR0000SAC AAAFFM DNDGRG3M086SBEA T10YFFM CUSR0000SA0L2 BAAFFM PCEPI CPIAUCSL T5YFFM CUSR0000SA0L5
## ----------------------------------------------------------------------------------------------------------
## 2 0.331 0.331 0.330 0.326 0.322 0.321 0.319 0.316 0.284 0.284
## ----------------------------------------------------------------------------------------------------------
You can compare these to the ones shown in Table 2. You will find that there is a lot of overlap with the variables loading on Factor 1, and indeed the loadings are in the same ballpark (between 0.5 and 0.7). For the second factor some variables are similar, but there is less overlap. The loadings are also markedly lower (here between 0.25 and 0.33; between 0.2 and 0.6 in Table 2).
A practical question in factor analysis is how many factors one should use. When calling the rpca function earlier we set kmax = 8, indicating that we wanted the 8 strongest factors. But should we have used 6 instead, or perhaps more, like 10? The information criterion used to determine the number of factors is stored in out$IC2.
plot(out$IC2)
out$IC2
## [1] -0.104441107 -0.151046670 -0.192870900 -0.222464212 -0.247149826
## [6] -0.257501304 -0.258368527 -0.257656434 -0.001662511
As you can see there are 9 such values. The minimum is the 7th value, which is associated with 7 factors. The last value, here the 9th (kmax + 1), is the information criterion for not using any factor. You can also find the optimal number of factors from out$ic2 = 7.
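As a quick cross-check, the position of the minimum of out$IC2 should coincide with out$ic2:
which.min(out$IC2)  # expected to return 7, in line with out$ic2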
The optimal number of factors depends on the estimated covariance matrix and the metric used, and the covariance matrix does vary over time. The result above was for data from January 1960 to March 2014. Let’s attempt to replicate the top panel of Figure 2, in which McCracken and Ng estimate the optimal number of factors for a recursive sample.
We have to re-estimate the model using different time spans. In order to ensure that results do not change we shall download a particular vintage of data, the October 2024 vintage, for this analysis.
filepath <- "https://files.stlouisfed.org/files/htdocs/fred-md/monthly/2024-10.csv"
data.md <- fredmd(filepath, date_start = NULL, date_end = NULL, transform = TRUE)
data.md.clean <- rm_outliers.fredmd(data.md)
col_na_prop <- apply(is.na(data.md.clean), 2, mean)
col_na_prop[col_na_prop>0.05]
## ACOGNO AMDMNOx ANDENOx AMDMUOx TWEXAFEGSMTHx
## 0.50570342 0.50443599 0.50443599 0.50316857 0.21419518
## UMCSENTx VIXCLSx
## 0.29024081 0.05323194
data.md.select <- data.md.clean[, (col_na_prop < 0.05)]
data.md.select.2 <- na.omit(data.md.select)
This has applied the same data-cleaning procedure as discussed earlier, removing outliers and removing series with too many missing values.
dates <- data.md.select.2$date
data.md.select.2 <- data.md.select.2[,2:ncol(data.md.select.2)] # removes date variable
rownames(data.md.select.2) <- dates # adds date info as rownames
data.md.select.2 <- as.data.frame(data.md.select.2)
dates <- as.Date(dates) # date variable
date.years <- year(dates) # years corresponding to each observation
T = dim(data.md.select.2)[1]
N = 2023-1969 # Number of Years for which we will re-estimate
# create results storage vector
IC <- matrix(0,N,1)
ddate <- matrix(0,N,1)
# We re-estimate the factors for samples ending in 1970, 1971, ..., 2023
for (t in seq(1970,2023)) {
count <- t-1969 # this keeps track of the iteration we are in: 1,2,3, etc
T.sel <- (date.years <= t) # boolean variable indicating which obs is "in-sample"
temp.data <- data.md.select.2[1:sum(T.sel),] # Selects the first sum(T.sel) obs
out1 <- rpca(temp.data, kmax = 12, standardize = TRUE, tau = 0)
# Save the optimal factor number
IC[count] = out1$ic2
ddate[count] <- t
}
We have now stored in IC the optimal factor numbers from an increasing sample, with sample end-points starting in 1970 and ending in 2023.
IC <- ts(IC, start = ddate[1], frequency = 1) # define IC as an annual time series (ddate already holds calendar years)
plot(IC)
You can see from the plot that the pattern is roughly similar to that shown in McCracken and Ng (2016). It is not exactly the same, for instance because we use a more recent data vintage and a slightly different set of cleaned series.
Also note that this analysis uses an increasing sample size which may well explain the generally increasing number of factors.
Repeat the above analysis, but keep the sample size constant (a rolling window) as you move through time, instead of using an increasing sample size as we did above. Do you still find an increasing number of factors?
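A minimal sketch of how the sample selection could be changed to a rolling window is given below (the 15-year window length and the object names ends and IC.roll are arbitrary choices; everything else mirrors the loop above):
window <- 15                                    # hypothetical window length in years
ends <- seq(1970 + window - 1, 2023)            # sample end years
IC.roll <- matrix(0, length(ends), 1)           # storage for the optimal factor numbers
for (j in seq_along(ends)) {
  t <- ends[j]
  T.sel <- (date.years > t - window) & (date.years <= t)  # fixed-length window ending in year t
  out1 <- rpca(data.md.select.2[T.sel, ], kmax = 12, standardize = TRUE, tau = 0)
  IC.roll[j] <- out1$ic2                        # save the optimal factor number
}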
In this walkthrough you learned about the FRED-MD dataset, a very valuable resource for reproducible macroeconomic research. We illustrated its use by replicating and updating the factor model estimation in McCracken, M. and Ng, S. (2016).