Introduction

Empirical macroeconomic research depends on the availability of time-series data for macroeconomic variables. Collating the data and making them comparable to previously published work can be very burdensome. To make this process easier and to make work more comparable, Michael McCracken and Serena Ng collaborated with the Federal Reserve Bank of St. Louis, which maintains the FRED database, to create a standard macroeconomic dataset for the U.S. You can read about this dataset in detail here.

Here we will show how to access and use these data. The walkthrough is largely based on the example presented in Yankang Chen’s vignette to the fbi package.

After importing and preparing the data we will apply factor estimation to the dataset. As you will see, there are more than 120 variables in the dataset. Many of these variables are highly correlated with each other and potentially represent very similar information. For many economic applications it may be advantageous to extract a smaller number of factors from this large dataset. The process of extracting such factors is demonstrated below.

Prepare your workfile

Ensure that your working directory is set correctly.

setwd("YOUR/WORKING/DIRECTORY")

We will also use a standard set of libraries.

library(tidyverse)
library(ggplot2)
library(stargazer)

Data upload

There are two ways in which you can obtain the FRED-MD dataset. You can either download the current (or indeed a past) version of the FRED-MD database from Michael McCracken’s FRED-MD website, or you can use an R package that was designed to access the data directly. The package is called fbi and is written by Yankang Chen. This package is not available from the CRAN webpage from which you can download most R packages; it is available from the author’s Github page. If you go to that page and read the README section you will find the installation instructions. There the package’s author tells you to ensure that you already have the stats, readr and pracma packages installed. So please go ahead and check that you have these, and if not, install them in the normal way.
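
For instance, to install the prerequisites (stats ships with base R, so typically only readr and pracma need to be installed):

install.packages(c("readr", "pracma"))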

Then you can install the package directly from the Github page using the install_github function from the devtools package (which is a package you also need to have installed). As usual, this only needs to be done once on your computer.

devtools::install_github("cykbennie/fbi")

When I installed this I was prompted with a message asking whether I also wanted to update a number of already installed packages. I answered “1” and the package installed fully.

Let’s use the fbi package to import the data. We first need to point the code to the location where the dataset is stored (filepath).

library(fbi)

filepath <- "https://files.stlouisfed.org/files/htdocs/fred-md/monthly/current.csv"
data.md <- fredmd(filepath, date_start = NULL, date_end = NULL, transform = TRUE)

In your environment you will now have an object data.md. The file was stored at “https://files.stlouisfed.org/files/htdocs/fred-md/monthly/current.csv”. The file “current.csv” always contains the most current version, but you can also get dated snapshots (vintages). We will reload the data using the vintage fixed in September 2024.

filepath <- "https://files.stlouisfed.org/files/htdocs/fred-md/monthly/2024-09.csv"
data.md <- fredmd(filepath, date_start = NULL, date_end = NULL, transform = TRUE)
data.md.0 <- fredmd(filepath, date_start = NULL, date_end = NULL, transform = FALSE)
names(data.md)
##   [1] "date"            "RPI"             "W875RX1"         "DPCERA3M086SBEA"
##   [5] "CMRMTSPLx"       "RETAILx"         "INDPRO"          "IPFPNSS"        
##   [9] "IPFINAL"         "IPCONGD"         "IPDCONGD"        "IPNCONGD"       
##  [13] "IPBUSEQ"         "IPMAT"           "IPDMAT"          "IPNMAT"         
##  [17] "IPMANSICS"       "IPB51222S"       "IPFUELS"         "CUMFNS"         
##  [21] "HWI"             "HWIURATIO"       "CLF16OV"         "CE16OV"         
##  [25] "UNRATE"          "UEMPMEAN"        "UEMPLT5"         "UEMP5TO14"      
##  [29] "UEMP15OV"        "UEMP15T26"       "UEMP27OV"        "CLAIMSx"        
##  [33] "PAYEMS"          "USGOOD"          "CES1021000001"   "USCONS"         
##  [37] "MANEMP"          "DMANEMP"         "NDMANEMP"        "SRVPRD"         
##  [41] "USTPU"           "USWTRADE"        "USTRADE"         "USFIRE"         
##  [45] "USGOVT"          "CES0600000007"   "AWOTMAN"         "AWHMAN"         
##  [49] "HOUST"           "HOUSTNE"         "HOUSTMW"         "HOUSTS"         
##  [53] "HOUSTW"          "PERMIT"          "PERMITNE"        "PERMITMW"       
##  [57] "PERMITS"         "PERMITW"         "ACOGNO"          "AMDMNOx"        
##  [61] "ANDENOx"         "AMDMUOx"         "BUSINVx"         "ISRATIOx"       
##  [65] "M1SL"            "M2SL"            "M2REAL"          "BOGMBASE"       
##  [69] "TOTRESNS"        "NONBORRES"       "BUSLOANS"        "REALLN"         
##  [73] "NONREVSL"        "CONSPI"          "S&P 500"         "S&P div yield"  
##  [77] "S&P PE ratio"    "FEDFUNDS"        "CP3Mx"           "TB3MS"          
##  [81] "TB6MS"           "GS1"             "GS5"             "GS10"           
##  [85] "AAA"             "BAA"             "COMPAPFFx"       "TB3SMFFM"       
##  [89] "TB6SMFFM"        "T1YFFM"          "T5YFFM"          "T10YFFM"        
##  [93] "AAAFFM"          "BAAFFM"          "TWEXAFEGSMTHx"   "EXSZUSx"        
##  [97] "EXJPUSx"         "EXUSUKx"         "EXCAUSx"         "WPSFD49207"     
## [101] "WPSFD49502"      "WPSID61"         "WPSID62"         "OILPRICEx"      
## [105] "PPICMM"          "CPIAUCSL"        "CPIAPPSL"        "CPITRNSL"       
## [109] "CPIMEDSL"        "CUSR0000SAC"     "CUSR0000SAD"     "CUSR0000SAS"    
## [113] "CPIULFSL"        "CUSR0000SA0L2"   "CUSR0000SA0L5"   "PCEPI"          
## [117] "DDURRG3M086SBEA" "DNDGRG3M086SBEA" "DSERRG3M086SBEA" "CES0600000008"  
## [121] "CES2000000008"   "CES3000000008"   "UMCSENTx"        "DTCOLNVHFNM"    
## [125] "DTCTHFNM"        "INVEST"          "VIXCLSx"

We will continue to use this timed version to ensure that the following results do not change as the data are updated.

The fredmd function has a number of input options

  • date_start, if = NULL then the earliest available date is used; if you want to begin at a certain date, set e.g. date_start = as.Date("2020-01-01").
  • date_end, if = NULL then the latest available date is used; if you want to end at a certain date, set e.g. date_end = as.Date("2022-09-01").
  • transform, if = FALSE then the original data, if = TRUE then transformed data, see details below.
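
For example, a call restricting the sample to a particular (hypothetical) sub-period would look like this:

data.sub <- fredmd(filepath,
                   date_start = as.Date("2000-01-01"),
                   date_end = as.Date("2019-12-01"),
                   transform = TRUE)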

It is important to understand the data structure. Fortunately the fbi package has a useful function which allows you to access information on the variables in the dataset, including information on the transformations used if transform = TRUE is set.

var.info <- describe_md(names(data.md), name.only = FALSE)
head(var.info)
##     id tcode                                          ttype            fred
## 1    1     5 First difference of natural log: ln(x)-ln(x-1)             RPI
## 1.1  1     5 First difference of natural log: ln(x)-ln(x-1)             RPI
## 2    2     5 First difference of natural log: ln(x)-ln(x-1)         W875RX1
## 3    3     5 First difference of natural log: ln(x)-ln(x-1) DPCERA3M086SBEA
## 4    4     5 First difference of natural log: ln(x)-ln(x-1)       CMRMTSPLx
## 5    5     5 First difference of natural log: ln(x)-ln(x-1)         RETAILx
##                                   description         gsi   gsi:description
## 1                        Real Personal Income  M_14386177                PI
## 1.1                      Real Personal Income  M_14386177                PI
## 2   Real personal income ex transfer receipts M_145256755 PI less transfers
## 3      Real personal consumption expenditures M_123008274  Real Consumption
## 4       Real Manu. and Trade Industries Sales M_110156998         M&T sales
## 5              Retail and Food Services Sales M_130439509      Retail sales
##                                    group edited
## 1                      Output and Income  FALSE
## 1.1                    Output and Income  FALSE
## 2                      Output and Income  FALSE
## 3   Consumption, Orders, and Inventories  FALSE
## 4   Consumption, Orders, and Inventories   TRUE
## 5   Consumption, Orders, and Inventories   TRUE

In var.info you can now find the descriptions of the variables. Let’s look up one of them, VXOCLSx.

var.info[var.info$fred == "VXOCLSx",]
## [1] id              tcode           ttype           fred           
## [5] description     gsi             gsi:description group          
## [9] edited         
## <0 rows> (or 0-length row.names)

Note: The function describe_md returns the variable information as available in 2015-01. Some variables have changed since then and some have been added, which is why the lookup above returns no rows. You may have to refer to this file, which tracks changes to the database. See the example below.
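
For series that do appear in that 2015 information the lookup works as expected. For instance, to see the entry for industrial production (INDPRO):

var.info[var.info$fred == "INDPRO",]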

Here you can see that the transformed version of the industrial production (INDPRO) series is the log difference. You may be interested in what types of data transformations have been used here.

unique(var.info$ttype)
## [1] First difference of natural log: ln(x)-ln(x-1)                       
## [2] First difference: x(t)-x(t-1)                                        
## [3] Level (i.e. no transformation): x(t)                                 
## [4] Natural log: ln(x)                                                   
## [5] Second difference of natural log: (ln(x)-ln(x-1))-(ln(x-1)-ln(x-2))  
## [6] First difference of percent change: (x(t)/x(t-1)-1)-(x(t-1)/x(t-2)-1)
## 6 Levels: First difference of natural log: ln(x)-ln(x-1) ...

As you can see, some series are not transformed at all, while the others are subject to one of five different transformations. Essentially the transformations are applied to create stationary data series. Let us illustrate this by displaying the untransformed INDPRO series (from data.md.0) and the transformed series (from data.md).

p1 <- ggplot(data.md.0, aes(x = date, y = INDPRO)) + 
  geom_line() +
  labs(title = "Industrial Production", y = "INDPRO")
p1

p2 <- ggplot(data.md, aes(x = date, y = INDPRO)) + 
  geom_line() +
  labs(title = "Industrial Production Growth", y = "delta(log(INDPRO))")
p2
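
If you prefer a numerical rather than a visual check of the transformation, a minimal sketch (assuming, as the ttype entry states, that the transformation is the unscaled log difference) is to compute it manually from the untransformed series and compare:

manual <- c(NA, diff(log(data.md.0$INDPRO)))   # manual log difference of raw INDPRO
head(cbind(manual = manual, transformed = data.md$INDPRO))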

Data summary and preparation

You can look at data summaries using the standard summary function.

var.info[var.info$fred == "FEDFUNDS",]
##    id tcode                         ttype     fred                 description
## 84 84     2 First difference: x(t)-x(t-1) FEDFUNDS Effective Federal Funds Rate
##            gsi gsi:description                       group edited
## 84 M_110155157       Fed Funds Interest and Exchange Rates  FALSE
summary(data.md$FEDFUNDS)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## -6.630000 -0.070000  0.010000  0.003621  0.120000  3.060000         1

Here you can see the summary information for the first difference of the Effective Federal Funds Rate. You can also see that there is one missing observation. This is most likely the first observation, which gets “lost” when the first difference is calculated.
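
You can verify this with a quick check, listing the dates at which FEDFUNDS is missing (if the reasoning above is right, this should return only the first observation):

data.md[is.na(data.md$FEDFUNDS), c("date", "FEDFUNDS")]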

Let’s look at another series, the number of new private housing permits, PERMIT.

var.info[var.info$fred == "PERMIT",]
##    id tcode              ttype   fred                        description
## 55 55     4 Natural log: ln(x) PERMIT New Private Housing Permits (SAAR)
##            gsi gsi:description   group edited
## 55 M_110155532       BP: total Housing  FALSE
summary(data.md$PERMIT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   6.240   7.024   7.213   7.182   7.408   7.791      12

Here you can see that there are 12 missing observations. Let’s look at a plot of the data.

p3 <- ggplot(data.md, aes(x = date, y = PERMIT)) + 
  geom_line() +
  labs(title = "New private housing permits", y = "log(PERMIT)")
p3

You could investigate which observations are missing.

data.md[is.na(data.md$PERMIT),c("date", "PERMIT")]
##          date PERMIT
## 1  1959-01-01     NA
## 2  1959-02-01     NA
## 3  1959-03-01     NA
## 4  1959-04-01     NA
## 5  1959-05-01     NA
## 6  1959-06-01     NA
## 7  1959-07-01     NA
## 8  1959-08-01     NA
## 9  1959-09-01     NA
## 10 1959-10-01     NA
## 11 1959-11-01     NA
## 12 1959-12-01     NA

Clearly it is the 1959 data that are not available for this data series. In fact we can find out how many observations are missing altogether.

sum(is.na(data.md))
## [1] 1159

If you read the paper describing the FRED-MD dataset you will find a discussion of outlier removal. The function that removes outlying observations (replacing them with NA) is called rm_outliers.fredmd(). We apply it to data.md and store the result in data.md.clean. Refer to the paper to see how exactly an outlier is identified. There are different ways to deal with outliers and in practical work you have to think carefully about this procedure.

data.md.clean <- rm_outliers.fredmd(data.md)

As outliers are replaced by NA we can easily find out how many outliers were identified.

sum(is.na(data.md.clean))-sum(is.na(data.md))
## [1] 163

Factor Model estimation

Data matrix setup

Before applying a factor model we have to think about how to deal with missing data. In the end we want to use only observations that have data available for all variables. In data.md.clean we have 788 observations on 127 variables (note that the first column is just the date variable, which we will eventually drop).
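
You can confirm these dimensions directly:

dim(data.md.clean)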

First, we will remove variables if they have too many missing values (recall that they could come from outliers being removed).

col_na_prop <- apply(is.na(data.md.clean), 2, mean)

This identifies, for each variable, the proportion of missing values. Recall that for all variables for which a growth rate was calculated we know that at least one observation (the first one) is missing.

Let’s see which variables have more than 5% of observations missing.

col_na_prop[col_na_prop>0.05]
##        ACOGNO       ANDENOx TWEXAFEGSMTHx      UMCSENTx       VIXCLSx 
##    0.50634518    0.13959391    0.21446701    0.29060914    0.05329949

You can refer to the information in var.info to confirm that ACOGNO and ANDENOx relate to order information for consumer and capital goods respectively. UMCSENTx is a consumer sentiment index. As it turns out, for the variables TWEXAFEGSMTHx and VIXCLSx we cannot find any information in var.info. These are variables that have been added to the database after 2015. We can find information on them in this file. VIXCLSx is a measure of stock market volatility implied by option prices. The series TWEXAFEGSMTHx is new and has replaced another series which is also not included in var.info. In fact, the information in the file linked above is not conclusive either, but a web search for the variable id TWEXAFEGSMTH reveals that it is a “Nominal Advanced Foreign Economies U.S. Dollar Index”.

Let us now remove these five series from the dataset.

data.md.select <- data.md.clean[, (col_na_prop < 0.05)]

The remaining variables will still have missing data. We shall now remove all rows of data (dates) in which we have any missing information.

data.md.select.2 <- na.omit(data.md.select)

By comparing the number of rows of data.md.select and data.md.select.2 you can identify that nrow(data.md.select)-nrow(data.md.select.2) = 76 dates have been removed.

It may be good to know whether this removed mainly dispersed dates or a whole period. As we still have the date variable in both datasets we can compare the two sets of dates and identify which ones have disappeared.

data.md.select$date[!(data.md.select$date %in% data.md.select.2$date)]
##  [1] "1959-01-01" "1959-02-01" "1959-03-01" "1959-04-01" "1959-05-01"
##  [6] "1959-06-01" "1959-07-01" "1959-08-01" "1959-09-01" "1959-10-01"
## [11] "1959-11-01" "1959-12-01" "1960-04-01" "1971-10-01" "1971-12-01"
## [16] "1973-07-01" "1974-01-01" "1974-02-01" "1977-01-01" "1977-02-01"
## [21] "1977-12-01" "1978-01-01" "1978-04-01" "1979-10-01" "1980-03-01"
## [26] "1980-04-01" "1980-05-01" "1980-07-01" "1980-10-01" "1980-11-01"
## [31] "1980-12-01" "1981-02-01" "1981-04-01" "1981-05-01" "1981-06-01"
## [36] "1981-09-01" "1981-11-01" "1982-01-01" "1982-08-01" "2001-02-01"
## [41] "2001-10-01" "2006-10-01" "2006-11-01" "2008-01-01" "2008-02-01"
## [46] "2008-03-01" "2008-04-01" "2008-05-01" "2008-09-01" "2008-11-01"
## [51] "2008-12-01" "2009-01-01" "2009-02-01" "2009-03-01" "2009-04-01"
## [56] "2009-05-01" "2010-03-01" "2010-04-01" "2010-12-01" "2011-01-01"
## [61] "2013-01-01" "2015-12-01" "2016-01-01" "2020-03-01" "2020-04-01"
## [66] "2020-05-01" "2020-06-01" "2020-07-01" "2021-01-01" "2021-02-01"
## [71] "2021-03-01" "2021-04-01" "2021-06-01" "2021-07-01" "2024-07-01"
## [76] "2024-08-01"

You can see that 12 of the dates arise from the entire year 1959 being dropped. The only contiguous periods of missing data are March to July 2020 and, essentially, the first half of 2021. This is likely related to data missing during the Covid period.

In fact, for this exercise we will reduce the sample to include data until March 2014. This results in a data period which has also been used in the McCracken and Ng (2016) paper introducing the FRED-MD dataset.

data.md.select.2 <- as.data.frame(data.md.select.2)
datesel <- (data.md.select.2$date < "2014-04-01")
data.md.select.2 <- data.md.select.2[datesel,]

We will continue working with the data.md.select.2 dataset. Before proceeding, however, we will drop the date column and instead store the date information as row names.

dates <- data.md.select.2$date
data.md.select.2 <- data.md.select.2[,2:ncol(data.md.select.2)]
rownames(data.md.select.2) <- dates

Factor Model estimation

This is not the place to explain the theory behind factor models; here we merely show how to implement the factor estimation. The data we currently hold in data.md.select.2 can be represented as the \((T \times N)\) matrix \(\mathbf{X}\). The idea of a factor representation is that \(\mathbf{X}\) can be decomposed into

\[\begin{equation} \mathbf{X} = \mathbf{F} \mathbf{\Lambda'} + e \end{equation}\]

where \(\mathbf{F}\) is a \((T \times r)\) matrix of common factors with \(r \ll N\). The \((N \times r)\) matrix of factor loadings, \(\mathbf{\Lambda}\), describes how each of the \(N\) variables in \(\mathbf{X}\) is related to each of the \(r\) factors in \(\mathbf{F}\). The term \(\mathbf{F} \mathbf{\Lambda'}\) delivers a \((T \times N)\) matrix of common components.

The error component \(e\) is also a \((T \times N)\) matrix.

There are a number of key issues when estimating the factors.

  • What could be the variance covariance structure of the error components in \(e\)?
  • How many factors (\(r\)) are required?

A good discussion is available from Bai and Ng, 2019. The method proposed there is implemented through the fbi package we installed earlier.

out <- rpca(data.md.select.2, kmax = 8, standardize = TRUE, tau = 0)

Here we implemented the factor estimation without rank regularisation of the common component (tau = 0), for standardised variables (meaning that all variables are scaled to have mean 0 and standard deviation 1), and with at most 8 factors (kmax = 8).
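
As a minimal sketch of the decomposition above (assuming the estimated factors are stored in out$Fhat and the loadings in out$Lamhat, as discussed in the next subsection), you could reconstruct the estimated common component \(\mathbf{F}\mathbf{\Lambda'}\):

common <- out$Fhat %*% t(out$Lamhat)   # (T x N) matrix of common components
dim(common)                            # should match the dimensions of the data matrix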

Investigating the factors

The output of the factor model estimation (out) contains a number of things, amongst others the estimated factors ($Fhat) and the factor loadings ($Lamhat). Let’s look at the factors.

factors.est <- as.data.frame(out$Fhat) # extract factors and put in dataframe
factors.est$date <- dates               # add date variable

# create a long dataframe
factors.est <- factors.est %>% 
  pivot_longer(!date, names_to = "Factor", values_to = "Values")

# plot the time series for the extracted factors
ggplot(factors.est, aes(x=date, y = Values)) + 
  geom_line() + 
  facet_wrap(vars(Factor))

Let us see how the factors load on the different variables. As we are using standardised variables, looking at the coefficients in \(\Lambda\) is informative.

lambda.1 <- sort(out$Lamhat[,1])
lambda.1[1:10]
##     USGOOD  IPMANSICS     PAYEMS     INDPRO     MANEMP    IPFPNSS     CUMFNS 
## -0.8433376 -0.8281917 -0.8275448 -0.8097664 -0.8062688 -0.7778244 -0.7747936 
##    DMANEMP    IPFINAL     IPDMAT 
## -0.7725531 -0.7248871 -0.6988267

It is important to understand that the signs of the factor loadings do not have a useful interpretation. These are the 10 variables with the largest loadings (in absolute value) on factor 1. This can be compared to those reported in McCracken and Ng (2016, Table 2). Of the ten variables loading most heavily there, seven appear on the above list. The three variables that are not identified here (NAPM, NAPMNOI, NAPMPI) are all variables that are no longer in the FRED-MD database (see the change log for confirmation).
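
Since the sign of the loadings is arbitrary, you could equally rank the variables by the absolute value of their loadings on factor 1; this quick check reproduces the list above regardless of sign:

sort(abs(out$Lamhat[,1]), decreasing = TRUE)[1:10]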

Replicate Table 2 from McCracken and Ng

Here we will attempt to replicate Table 2 in the McCracken and Ng (2016) paper.

After the factors are estimated, we regress the \(i\)th series in the dataset on a set of \(r\) (orthogonal) factors. For \(k = 1,\ldots,r\), this yields \(R_i^2(k)\) for series \(i\). The incremental explanatory power of factor \(k\) is \(incR_i^2(k) = R_i^2(k)-R_i^2(k-1)\), \(k = 2, \ldots, r\). The average importance of factor \(k\) is \(mR^2(k)=\frac{1}{N}\sum_{i=1}^N incR_i^2(k)\). We compute these quantities using two approaches: first, as described, by regressing series \(i\) on the first \(k\) factors at each iteration and then constructing the incremental \(R^2\); second, by regressing series \(i\) on the \(k\)th factor only.

K = dim(out$Fhat)[2] # extracts the number of Factors, same as kmax as set in rpca
N = dim(out$X)[2]    # extracts number of series

# create a matrix which stores various R^2
# dimension KxN
R2.ik <- matrix(0,K,N)  # stores R_i^2(k)
colnames(R2.ik)  <-  colnames(out$X)
R2.inc.ik <- R2.ik      # version 1

# loop through all factors (k) and all series (i)
 for (k in seq(1,K)) {
   for (i in seq(1,N)){
     # regress series i on factors 1 to k
     temp = summary(lm(out$X[,i]~ out$Fhat[,1:k]))
     R2.ik[k,i] = temp$r.squared
     # regress series i on factor k only
     temp = summary(lm(out$X[,i]~ out$Fhat[,k]))
     R2.inc.ik[k,i] = temp$r.squared
   }
 }

# Alternative way to calculate the contribution of the kth factor
R2.inc.ik.alt <- R2.inc.ik  # version 2
R2.inc.ik.alt[2:K,] <- R2.ik[2:K,] - R2.ik[1:(K-1),]

At this stage we have saved \(R_i^2(k)\) in R2.ik and \(incR_i^2(k)\) in R2.inc.ik as well as R2.inc.ik.alt. We could calculate the contribution of the \(k\)th factor either by \(incR_i^2(k) = R_i^2(k)-R_i^2(k-1)\), \(k = 2, \ldots, r\) (as implemented in R2.inc.ik.alt) or by merely regressing the \(i\)th variable on the \(k\)th factor (as implemented in R2.inc.ik). The latter version works only because the factors are orthogonal. Compare R2.inc.ik and R2.inc.ik.alt to convince yourself that they are identical.
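
One quick way to convince yourself is to look at the largest absolute difference between the two matrices, which should be essentially zero (up to numerical precision):

max(abs(R2.inc.ik - R2.inc.ik.alt))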

Also compare R2.ik and R2.inc.ik. Looking at these two matrices you should see that the first row is identical but subsequent rows are not.

Now we are in a position to use this information to calculate \(mR^2(k)\). This will be saved in R2.m.k; a one-line alternative way to compute the same quantity is shown in the code comments.

# Storage vectors
R2.m.k <- matrix(0,K,1)

# Importance of factors 1 to K
for (i in seq(1,K)){
  R2.m.k[i] = mean(R2.inc.ik[i,])
}

# an alternative way to calculate this would be
# R2.m.k <- rowMeans(R2.inc.ik)
print(R2.m.k)
##            [,1]
## [1,] 0.13961874
## [2,] 0.07735284
## [3,] 0.06698293
## [4,] 0.05319509
## [5,] 0.04622442
## [6,] 0.03471929
## [7,] 0.02753143
## [8,] 0.02539408

Here you can see that the first factor explains almost 14% of the variation in the variables. The second factor contributes close to a further 8%. This compares to 16% and 7% for the first two factors in Table 2 of the McCracken and Ng paper.
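
Summing these contributions shows how much of the (standardized) variation the first \(k\) factors explain jointly; adding up the values printed above, the eight factors together account for roughly 47%:

cumsum(R2.m.k)   # cumulative explanatory power of the first k factors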

Next we wish to identify the 10 variables that load most heavily on the first 8 factors (also replicating the 8 subtables in Table 2). In fact we will only show the first two, but you can replicate this for all 8.

for (k in seq(1,2)){
  names(k) = "Factor" #  this gives k a name, for display reasons only
  # order starting with variable loading heaviest onto the kth factor
  temp <- sort(R2.inc.ik[k,],decreasing = TRUE)  
  temp2 <- c(k,temp[1:10])
  stargazer(temp2,type = "text")  # stargazer produces a good looking table
}
## 
## ==================================================================================
## Factor USGOOD IPMANSICS PAYEMS INDPRO MANEMP IPFPNSS CUMFNS DMANEMP IPFINAL IPDMAT
## ----------------------------------------------------------------------------------
## 1      0.712    0.687   0.686  0.657  0.651   0.606  0.601   0.598   0.526  0.489 
## ----------------------------------------------------------------------------------
## 
## ==========================================================================================================
## Factor CUSR0000SAC AAAFFM DNDGRG3M086SBEA T10YFFM CUSR0000SA0L2 BAAFFM PCEPI CPIAUCSL T5YFFM CUSR0000SA0L5
## ----------------------------------------------------------------------------------------------------------
## 2         0.331    0.331       0.330       0.326      0.322     0.321  0.319  0.316   0.284      0.284    
## ----------------------------------------------------------------------------------------------------------

You can compare these to the ones shown in Table 2. You will find that there is a lot of overlap with the variables loading on Factor 1, and indeed the loadings are in the same ballpark (between 0.5 and 0.7). For the second factor some of the variables are similar, but there is less overlap. Also, the loadings are markedly lower (here between 0.28 and 0.33; between 0.2 and 0.6 in Table 2).

Best number of factors

A practical question in factor analysis is how many factors one should use. When calling the rpca function earlier we set kmax = 8, indicating that we wanted at most 8 factors. But should we have used 6 instead, or perhaps more, like 10?

The information criterion used to determine the number of factors is stored in out$IC2.

plot(out$IC2)

out$IC2
## [1] -0.104441107 -0.151046670 -0.192870900 -0.222464212 -0.247149826
## [6] -0.257501304 -0.258368527 -0.257656434 -0.001662511

As you can see there are 9 such values. The minimum is the 7th value, which is associated with 7 factors. The last value, here the 9th (kmax + 1), is the information criterion for not using any factors at all. You can also read the optimal number of factors directly from out$ic2 = 7.
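
If you want to continue working with only the selected number of factors, one simple option (a sketch, relying on the factors being ordered by explanatory power, as principal components are) is to keep the first out$ic2 columns of the estimated factor matrix:

Fhat.opt <- out$Fhat[, 1:out$ic2]   # keep only the selected factors (here 7)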

The optimal number of factors depends on the estimated covariance matrix and the metric used, and the covariance matrix varies over time. The result above was for data from January 1960 to March 2014. Let’s attempt to replicate the top panel of Figure 2, in which McCracken and Ng estimate the optimal number of factors over a recursive (expanding) sample.

We have to re-estimate the model using different time spans. In order to ensure that results do not change we shall download a particular vintage of data, the October 2024 vintage, for this analysis.

filepath <- "https://files.stlouisfed.org/files/htdocs/fred-md/monthly/2024-10.csv"
data.md <- fredmd(filepath, date_start = NULL, date_end = NULL, transform = TRUE)

data.md.clean <- rm_outliers.fredmd(data.md)
col_na_prop <- apply(is.na(data.md.clean), 2, mean)
col_na_prop[col_na_prop>0.05]
##        ACOGNO       AMDMNOx       ANDENOx       AMDMUOx TWEXAFEGSMTHx 
##    0.50570342    0.50443599    0.50443599    0.50316857    0.21419518 
##      UMCSENTx       VIXCLSx 
##    0.29024081    0.05323194
data.md.select <- data.md.clean[, (col_na_prop < 0.05)]
data.md.select.2 <- na.omit(data.md.select)

This has applied the same data-cleaning procedure as discussed earlier, removing outliers and removing series with too many missing values.

dates <- data.md.select.2$date
data.md.select.2 <- data.md.select.2[,2:ncol(data.md.select.2)] # removes date variable
rownames(data.md.select.2) <- dates   # adds date info as rownames
data.md.select.2 <- as.data.frame(data.md.select.2)
dates <- as.Date(dates)       # date variable
date.years <- year(dates)     # years corresponding to each observation

T = dim(data.md.select.2)[1]
N = 2023-1969     # Number of Years for which we will re-estimate

# create results storage vector
IC <- matrix(0,N,1)
ddate <- matrix(0,N,1)

# We re-estimate the factors for samples ending in 1970, 1971, ..., 2023

for (t in seq(1970,2023)) {
  count <- t-1969   # this keeps track of the iteration we are in: 1,2,3, etc
  T.sel <- (date.years <= t)  # boolean variable indicating which obs is "in-sample"
  temp.data <- data.md.select.2[1:sum(T.sel),]  # Selects the first sum(T.sel) obs
  out1 <- rpca(temp.data, kmax = 12, standardize = TRUE, tau = 0)
  
  # Save the optimal factor number
  IC[count] = out1$ic2
  ddate[count] <- t
}

We have now stored in IC the optimal factor numbers from an increasing sample with sample end-points starting from 1970, ending in 2023.

IC <- ts(IC, start = ddate[1], frequency = 1) # Define IC as an annual time series
plot(IC)

You can see from the plot that the pattern is roughly similar to that shown in McCracken and Ng (2016). It is not exactly the same, for the following reasons:

  • They have used different data vintages, whereas we have used only the Oct 2024 data vintage and only changed the sample end-point. As discussed earlier, different data vintages also have different variable compositions.
  • Our data sample extends beyond the availability in McCracken and Ng (2016).
  • They re-estimated more frequently, whereas we have only re-estimated annually.

Also note that this analysis uses an increasing sample size which may well explain the generally increasing number of factors.

Extension

Repeat the above analysis but keep the sample size constant as you move through time (a rolling window), instead of the expanding sample used above. Do you still find an increasing number of factors?
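
A minimal sketch of how this could be implemented, assuming a rolling window of 240 months (the window length is a choice you have to make and is not taken from McCracken and Ng):

window <- 240                                  # length of the rolling window (months)
n.iter <- nrow(data.md.select.2) - window + 1  # number of window positions
IC.roll <- matrix(0, n.iter, 1)                # storage for optimal factor numbers

for (j in seq(1, n.iter)) {
  temp.data <- data.md.select.2[j:(j + window - 1), ]   # fixed-length sample
  out.roll <- rpca(temp.data, kmax = 12, standardize = TRUE, tau = 0)
  IC.roll[j] <- out.roll$ic2                             # optimal number of factors
}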

Summary

In this walkthrough you learned about the FRED-MD dataset, a very valuable resource for producing reproducible macroeconomic research. We illustrated its use by replicating and updating the factor model estimation in McCracken, M. and Ng, S. (2016).