We add the basic libraries needed for this week’s work:
library(tidyverse) # for almost all data handling tasks
library(ggplot2) # to produce nice graphics
library(stargazer) # to produce nice results tables
library(haven) # to import stata file
library(AER) # access to HS robust standard errors
We will also use one function saved in a .r file. The function in this file will be explained below. This file is available for download from here: stargazer_HC.R. When you get to that page you will see a download button.
You should save this files into your working directory. You can then this functions available to your code using this line.
source("stargazer_HC.r") # includes the robust regression display
As we are using panel methods we also require an additional package
plm
.
#install.packages("plm") # only execute this if plm is not installed yet
library(plm)
The data are an extract from the Understanding Society Survey (formerly the British Household Survey Panel). The question we are looking at using these data is how responsive wages are to local unemployment rates. Or in other words, do wages respond to unemployment rates (increase if unemployment is low and decrease if unemployment is high)? For instance Blanchflower and Oswald (1989) find that there is such a negative relationship.
Load the data (20222_USoc_extract.dta),
which are saved in a STATA datafile (extension .dta
), into
your working directory. There is a function which loads STATA file. It
is called read_dta
and is supplied by the
haven
package.
## [1] "pidp" "age" "jbhrs" "paygu" "wave" "cpi" "year"
## [8] "region" "urate" "male" "race" "educ" "degree" "mfsize9"
data_USoc <- read_dta("20222_USoc_extract.dta")
data_USoc <- as.data.frame(data_USoc) # ensure data frame structure
names(data_USoc)
Let us ensure that categorical variables are stored as
factor
variables. It is easiest to work with these in
R.
data_USoc$region <- as_factor(data_USoc$region)
data_USoc$male <- as_factor(data_USoc$male)
data_USoc$degree <- as_factor(data_USoc$degree)
data_USoc$race <- as_factor(data_USoc$race)
Here we used the as_factor
function to turn the
variables into factor variables. You could have used
as.factor
. But when you get data from STATA files, they
often come with labels and as_factor
ensures that the
factor levels are defined according to the labels rather than just
numbers.
The pay information (paygu
) is provided as a measure of
the (usual) gross pay per month. As workers work for varying numbers of
hours per week (jbhrs
) we divide the monthly pay by the
approximate monthly hours (4*jbhrs
). We shall also adjust
for increasing price levels (as measured by cpi
). These two
adjustments leave us with an inflation adjusted hourly wage. We call
this variable hrpay
and also calculate the natural log of
this variable (lnhrpay
).
data_USoc <- data_USoc %>%
mutate(hrpay = paygu/(jbhrs*4)/(cpi/100)) %>%
mutate(lnhrpay = log(hrpay))
As we wanted to save these additional variables we assign the result
of the operation to data_USoc
.
We will also use the logarithm of the unemployment rate
data_USoc <- data_USoc %>%
mutate(lnurate=log(urate))
To explain the meaning of these let us just pick out all the
observations that pertain to one particular individual
(pidp == 272395767
). The following command does the
following in words: “Take data_USoc
filter/keep all
observations which belong to individual pidp == 272395767, then select a
list of variables (we don’t need to see all 14 variables) and print the
result”:
data_USoc %>% filter(pidp == 272395767) %>%
select(c("pidp","male","wave","year","paygu","age","educ")) %>%
print()
## pidp male wave year paygu age educ
## 1 272395767 female 1 2009 775 40 11
## 2 272395767 female 2 2010 812 41 11
## 3 272395767 female 3 2011 772 42 11
The same person (female) was observed three years in a row (from 2009 to 2011). Their gross monthly income changed, as did, of course, their age, but not their education. This particular person was observed in three consecutive waves. Let’s see whether this is a common pattern.
In the context of this exercise we will ignore the second wave and only look at waves 1 and 3.
data_USoc <- data_USoc %>%
filter(wave != 2) %>%
filter(!is.na(lnhrpay))
The code below figures how many waves we have for each individual (1
or 2) and then saves this in a new variable (n_wave
). This
information will be used later as we may want to know whether only using
observations for which we have both waves makes a difference to the
analysis.
data_USoc <- data_USoc %>%
group_by(pidp) %>%
mutate(n_wave = n())
Now we need to let R know that we are dealing with panel data. This
is why we loaded up the plm
library which contains the
plm.data
function. Using the
index = c("pidp","wave")
we let the function know what
identifies the individuals and what identifies the wave.
pdata_USoc <- pdata.frame(data_USoc, index = c("pidp","wave")) # defines the panel dimensions
We saved the output in pdata_USoc
and we will use this
for any panel data estimations.
When dealing with panel data it is super useful to understand in how
many and in which wave individuals are presented. We already calculated
the n_wave
variable which tells us in how many of our
remaining two waves observations are represented. We also have
information (wave
) on which wave someone is represented in.
To disentagle this we merely need a contingency table
of
the n_wave
and waves
variables.
table(pdata_USoc$n_wave,pdata_USoc$wave, dnn = c("n_waves","waves"))
## waves
## n_waves 1 3
## 1 9666 4112
## 2 13078 13078
Naturally the 13078 respondents which have two observations
(n_wave == 2)
are represented in waves 1 and 3. Then we
have (n_wave == 1
) 9666 respondents which are represented
in wave 1 and the 4112 which are represented in wave 3.
For the respondents for which we have 2 waves of observations we can
actually calculate a difference, or change in variables. This will
become important in a later model estimation (although for that we could
let R do the work in the background) and hence we will calculate these
variables explicitly, here for lnhrpay
and
lnurate
.
# the lag function below will recognise the panel nature of the data and
# will only calculate lags for individuals
# we also need to specify that we are calculating k=2 step lag as
# we calculate the difference between wave 3 and 1
Dlnhrpay <- pdata_USoc$lnhrpay-lag(pdata_USoc$lnhrpay,k=2)
Dlnurate <- pdata_USoc$lnurate-lag(pdata_USoc$lnurate,k=2)
pdata_USoc$Dlnhrpay <- Dlnhrpay # add the new series to the dataframe
pdata_USoc$Dlnurate <- Dlnurate
For a later purpose we will also identify all individuals who moved from one region to another between waves 1 and 3. It is not so important to understand this code.
temp <- pdata_USoc # create a temporary dataframe
temp <- temp %>% filter(n_wave == 2) %>% # only keep individuals with two waves
group_by(pidp) %>% # group data by individual
mutate(move = ifelse(length(unique(region))==1,"no move","move")) %>%
select(pidp,wave,move) # only keep these 3 variables
# the move variable will take the value 1 if both regions are identical (no move)
# and 2 if there are two different regions (move)
temp$move <- as_factor(temp$move) # convert to factor variable
# the following merges the new variable into the pdata_USoc dataframe
pdata_USoc <- merge(pdata_USoc,temp,all.x = TRUE)
Let’s check how many movers there are.
pdata_USoc %>% count(move)
## move n
## 1 no move 25804
## 2 move 352
## 3 <NA> 13778
So ther are 352 observations associated with movers. That means that there are 176 movers as each mover will have two observations.
We will use the lnhrpay
and the urate
variables below. We therefore will have a look at these variables. Below
we will use the stargazer
function to show some summary
statistics.
stargazer(pdata_USoc[,c("lnhrpay","urate","year","Dlnhrpay","Dlnurate")],type = "text")
##
## ==================================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------------
## lnhrpay 39,934 2.300 0.640 -7.800 8.900
## urate 39,934 7.900 1.300 5.800 11.000
## year 39,934 2,010.000 1.100 2,009 2,013
## Dlnhrpay 13,078 -0.009 0.520 -10.000 9.500
## Dlnurate 13,078 0.037 0.065 -0.460 0.550
## --------------------------------------------------
Let us look at some summary statistics grouped by region
pdata_USoc %>% group_by(region) %>%
summarise(n = n(), mean_lnhrpay = mean(lnhrpay),mean_urate = mean(urate))
## # A tibble: 12 × 4
## region n mean_lnhrpay mean_urate
## <fct> <int> <dbl> <dbl>
## 1 north east 1576 2.21 9.88
## 2 north west 4280 2.24 8.44
## 3 yorkshire and the humber 3247 2.20 9.00
## 4 east midlands 3107 2.20 9.15
## 5 west midlands 3454 2.23 7.67
## 6 east of england 3724 2.32 6.58
## 7 london 5736 2.42 9.30
## 8 south east 5125 2.39 6.10
## 9 south west 3119 2.25 6.11
## 10 wales 1831 2.14 8.45
## 11 scotland 3020 2.27 7.76
## 12 northern ireland 1715 2.24 6.78
Below we will want to use the mean lnhrpay
and mean
lnurate
as calculated for every region-year. The following
will group the data by region-wave (as we have 12 regions and 2 waves we
will 24 such groups). This is similar to the above command but note that
we start with pdata_USoc <-
to ensure that the
calculated average wage and unemployment rate values are added as
variables to the data frame. Also, instead of summarise
(which displays the calculated statistics) we use the
mutate
function as we want the calculated series to be
saved in the data frame.
pdata_USoc <- pdata_USoc %>%
group_by(region,year) %>%
mutate(mean_lnhrpay = mean(lnhrpay),mean_urate = mean(urate))
We start by estimating a model which does not use the panel nature of
the data. We relate the log hourly pay (lnhrpay
) to the
region/wave specific log unemployment rate (lnurate
).
POLS0 <- lm(lnhrpay~lnurate, data = pdata_USoc)
stargazer_HC(POLS0)
##
## =========================================================
## Dependent variable:
## -------------------------------------
## lnhrpay
## ---------------------------------------------------------
## lnurate -0.100***
## (0.019)
##
## Constant 2.500***
## (0.039)
##
## ---------------------------------------------------------
## Observations 39,934
## R2 0.001
## Adjusted R2 0.001
## Residual Std. Error 0.630 (df = 39932)
## F Statistic 30.000*** (df = 1; 39932)
## =========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
Let’s add the predicted model values to the data frame. As our explanatory variable only has 24 different values we will only get 24 different predicted values.
pdata_USoc$pred_POLS0 <- POLS0$fitted.values
Here we basically used all observations available, whether they were
from wave 1 or 3. We pooled the observations and hence
we could use our normal lm
function to estimate this model.
We could estimate this model also with the plm
function
(see below).
POLS0a <- plm(lnhrpay~lnurate, data = pdata_USoc, model = "pooling")
stargazer_HC(POLS0a)
##
## ==================================================
## Dependent variable:
## -------------------------------------
## lnhrpay
## --------------------------------------------------
## lnurate -0.100***
## (0.022)
##
## Constant 2.500***
## (0.046)
##
## --------------------------------------------------
## Observations 39,934
## R2 0.001
## Adjusted R2 0.001
## F Statistic 30.000*** (df = 1; 39932)
## ==================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
However, the plm
package we imported earlier has a few
additional panel specific tricks up its sleeve (we will use them
shortly).
Now we plot the predicted values of log hourly pay and compare them against the realised values.
# pdf("Lecture6plot_Pooled.pdf",width = 5.5, height = 4) # uncomment to save as pdf
ggplot(pdata_USoc, aes(x=lnurate,y=pred_POLS0)) +
geom_point(aes(colour = "red")) +
geom_line(aes(colour = "red")) +
geom_point(aes(y = mean_lnhrpay,colour = "blue")) +
ggtitle("Predicted values - Pooled OLS") +
ylab("Log Hourly Wage") +
xlab("Log Unemployment Rate") +
scale_colour_manual(name="Data/Model", values = c(red = "red", blue = "blue"),
labels=c("Data", "POLS"))
# dev.off() # uncomment to save as pdf
Now we will include a dummy variable for wave == 3
. The
wave
variable is a factor variable with two levels (1 and
3) for waves 1 and 3.
POLS1 <- lm(lnhrpay~lnurate+wave, data = pdata_USoc)
stargazer_HC(POLS1)
##
## =========================================================
## Dependent variable:
## -------------------------------------
## lnhrpay
## ---------------------------------------------------------
## lnurate -0.097***
## (0.019)
##
## wave3 -0.019***
## (0.006)
##
## Constant 2.500***
## (0.039)
##
## ---------------------------------------------------------
## Observations 39,934
## R2 0.001
## Adjusted R2 0.001
## Residual Std. Error 0.630 (df = 39931)
## F Statistic 20.000*** (df = 2; 39931)
## =========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
The first wave is the base category of wave
and hence is
not included. So far we have used the standard lm
function
to estimate this model.
Alternatively this could be estimated using the plm
package
POLS1a <- plm(lnhrpay~lnurate+wave, data = pdata_USoc, model = "pooling")
stargazer_HC(POLS1a)
##
## ==================================================
## Dependent variable:
## -------------------------------------
## lnhrpay
## --------------------------------------------------
## lnurate -0.097***
## (0.022)
##
## wave3 -0.019***
## (0.005)
##
## Constant 2.500***
## (0.046)
##
## --------------------------------------------------
## Observations 39,934
## R2 0.001
## Adjusted R2 0.001
## F Statistic 20.000*** (df = 2; 39931)
## ==================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
This regression will have observations for individuals for which we
only observe one wave (n_wave == 1
). Let’s restrict the
analysis to only individuals for which we have two waves
(n_wave == 2
).
POLS2 <- lm(lnhrpay~lnurate+wave, data = pdata_USoc, subset = (n_wave ==2))
stargazer_HC(POLS2)
##
## =========================================================
## Dependent variable:
## -------------------------------------
## lnhrpay
## ---------------------------------------------------------
## lnurate -0.096***
## (0.023)
##
## wave3 -0.005
## (0.008)
##
## Constant 2.500***
## (0.046)
##
## ---------------------------------------------------------
## Observations 26,156
## R2 0.001
## Adjusted R2 0.001
## Residual Std. Error 0.610 (df = 26153)
## F Statistic 10.000*** (df = 2; 26153)
## =========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
or using the plm
function
POLS2a <- plm(lnhrpay~lnurate+wave, data = pdata_USoc, subset = (n_wave ==2), model = "pooling")
stargazer_HC(POLS2a)
##
## ==================================================
## Dependent variable:
## -------------------------------------
## lnhrpay
## --------------------------------------------------
## lnurate -0.096***
## (0.029)
##
## wave3 -0.005
## (0.005)
##
## Constant 2.500***
## (0.058)
##
## --------------------------------------------------
## Observations 26,156
## R2 0.001
## Adjusted R2 0.001
## F Statistic 10.000*** (df = 2; 26153)
## ==================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
Now we estimate a first difference (FD) model. We will only do this
using the plm
function. If we were to use the
lm
function we had to first calculate differenced series
(which we have done on this occasion, but only to illustrate the
mechanics). Before we estimate the model let’s look at the data for a
few respondents.
pdata_USoc %>% filter(pidp %in% c("3915445","68001367","68004087","68195851")) %>%
select(c("pidp","male","wave","lnhrpay","Dlnhrpay","lnurate", "Dlnurate")) %>%
print()
## # A tibble: 6 × 9
## # Groups: region, year [5]
## region year pidp male wave lnhrpay Dlnhrpay lnurate Dlnurate
## <fct> <dbl> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 scotland 2011 3915445 female 3 1.88 NA 2.09 NA
## 2 north east 2009 68001367 male 1 2.45 NA 2.22 NA
## 3 north east 2009 68004087 male 1 1.83 NA 2.22 NA
## 4 north east 2011 68004087 male 3 1.90 0.0728 2.38 0.160
## 5 north west 2009 68195851 female 1 2.20 NA 2.13 NA
## 6 west midlands 2011 68195851 female 3 1.84 -0.360 2.08 -0.0488
When estimating a FD model we are basically running a regression of
Dlnhrpay
on Dlnurate
. Respondents for whom we
do not have two waves will not be used in such a model. The calculation
of the Dlnhrpay
and Dlnurate
series happens
automatically inside the plm
function when we specify
model = "fd"
.
FD1a <- plm(lnhrpay~lnurate, data = pdata_USoc, subset = (n_wave ==2), model = "fd")
stargazer_HC(FD1a)
##
## ==================================================
## Dependent variable:
## -------------------------------------
## lnhrpay
## --------------------------------------------------
## lnurate 0.042
## (0.060)
##
## Constant -0.011***
## (0.004)
##
## --------------------------------------------------
## Observations 13,078
## R2 0.00003
## Adjusted R2 -0.00005
## F Statistic 0.350 (df = 1; 13076)
## ==================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
We can show a scatter plot of the available difference observations
and the regression line estimated by FD1a
.
# pdf("Lecture6plot_FD_R.pdf",width = 5.5, height = 4) # uncomment to save as pdf
ggplot(pdata_USoc, aes(x=Dlnurate,y=Dlnhrpay,color=move)) +
geom_point() +
geom_abline(intercept = FD1a$coefficients[1], slope = FD1a$coefficients[2],colour = "blue") +
ggtitle("Scatter plot and FD regression line") +
ylab("D(Log Hourly Wage)") +
xlab("D(Log Unemployment Rate)")
# dev.off() # uncomment to save as pdf
As you can see, there is no obvious relationship between the changes in hourly pay and the respective local unemployment rate.
Now we will show models, POLS0a
, POLS1a
,
POLS2a
and FD1a
in one table. In previous
tables you may have seen that the F-stat takes up a lot of space and
hence we use the omit_stat
option to indicate that we do
not want to see the F-statistic.
stargazer_HC(POLS0a,POLS1a,POLS2a,FD1a,omit.stat = "f")
##
## =====================================================
## Dependent variable:
## ----------------------------------------
## lnhrpay
## (1) (2) (3) (4)
## -----------------------------------------------------
## lnurate -0.100*** -0.097*** -0.096*** 0.042
## (0.022) (0.022) (0.029) (0.060)
##
## wave3 -0.019*** -0.005
## (0.005) (0.005)
##
## Constant 2.500*** 2.500*** 2.500*** -0.011***
## (0.046) (0.046) (0.058) (0.004)
##
## -----------------------------------------------------
## Observations 39,934 39,934 26,156 13,078
## R2 0.001 0.001 0.001 0.00003
## Adjusted R2 0.001 0.001 0.001 -0.00005
## =====================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
Looking at this table we can briefly investigate whether the data
provide evidence for a negative relationship between local unemployment
rates and wages. As the regressions are run with logarithmic variables,
the coefficients are interpreted as elasticities (refer to your
econometric textbook or lecture notes to remind yourself of this). The
pooled regressions all indicate that 10% increase in the unemployment
rate (say from 8% to 8.8%) has the following effect on on real hourly
pay (using model POLS2a
): \(−0.096 ∗ 0.10 = −0.0096\) i.e. a drop of
0.0096 log-points (1 percent). This effect also is highly statistically
significant. This is similar to the findings in Blanchflower and
Oswald.
However, using the panel data structure available here, the results from the first difference model suggests that this effect is not statistically significant and in fact actually slightly positive (0.042).
A different panel data technique that can be applied is that of a fixed effects regression. In this regression. In the pooled regresison models we estimated
\[y_{it} = \alpha + \beta x_{it} + \epsilon_{it}\]
where \(y\) and \(x\) represent the log wage and the log unemployment rate respectively. In a fixed effects regression we recognise that there are individual specific effects which may explain a significant part of the variation in the dependent variable. Estimating a fixed effects model replaces the constant \(\alpha\) in the above model (which applies to all individuals \(i\)) with an \(i\) specific constant.
\[y_{it} = \alpha_i + \beta x_{it} + \epsilon_{it}\]
In R this is implemented by using the “within” model
FE1 <- plm(lnhrpay~lnurate, data = pdata_USoc, subset = (n_wave ==2), model = "within")
stargazer_HC(POLS0a,POLS1a,POLS2a,FD1a,FE1,omit.stat = "f")
##
## ============================================================
## Dependent variable:
## -----------------------------------------------
## lnhrpay
## (1) (2) (3) (4) (5)
## ------------------------------------------------------------
## lnurate -0.100*** -0.097*** -0.096*** 0.042 -0.028
## (0.022) (0.022) (0.029) (0.060) (0.071)
##
## wave3 -0.019*** -0.005
## (0.005) (0.005)
##
## Constant 2.500*** 2.500*** 2.500*** -0.011***
## (0.046) (0.046) (0.058) (0.004)
##
## ------------------------------------------------------------
## Observations 39,934 39,934 26,156 13,078 26,156
## R2 0.001 0.001 0.001 0.00003 0.00002
## Adjusted R2 0.001 0.001 0.001 -0.00005 -1.000
## ============================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
As you can see this output does not show any constant. Effectifely the model estimated 13078 constants, one for each individual.