On this page you will learn how to implement regression inference procedures. We start by loading some relevant packages.
library(tidyverse) # for general data handling
library(stargazer) # package for nicer regression output
library(AER) # load the AER package (Applied Econometrics with R)
We will use the Women’s Wages dataset (mroz.csv).
Let’s first import the data
setwd("YOUR/DIRECTORY/PATH") # This sets the working directory
mydata <- read.csv("mroz.csv", na.strings = ".") # Opens mroz.csv from working directory
Let’s remove all observations with missing wages from the dataframe
mydata <- subset(mydata, is.na(wage) == FALSE) # remove observations with missing wages from dataset
or, if you wish to use the tidyverse approach,
mydata <- mydata %>% filter(!is.na(wage))
An extremely simple model would be to estimate the following OLS regression, which models lwage as a function of a constant, experience (exper) and the log of the husband’s wage (log(huswage)).
\[\begin{equation} lwage_i = \beta_0 + \beta_1 exper_i + \beta_2 log(huswage_i) + u_i \end{equation}\]
This is how we estimate this model:
reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata)
We will introduce inference in this model.
We use t-tests to test simple coefficient restrictions on regression coefficients. Let’s initially have a look at our regression output:
stargazer(reg_ex1, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## lwage
## -----------------------------------------------
## exper 0.017***
## (0.004)
##
## log(huswage) 0.240***
## (0.064)
##
## Constant 0.540***
## (0.140)
##
## -----------------------------------------------
## Observations 428
## R2 0.059
## Adjusted R2 0.055
## Residual Std. Error 0.700 (df = 425)
## F Statistic 13.000*** (df = 2; 425)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As you can see, this output contains t-statistics and their associated p-values. These test statistics and their p-values all relate to the following hypothesis test: \(H_0:\beta_i=0;H_A:\beta_i\neq 0\). Here \(\beta_i\) represents the \(i\)th unknown population parameter. If you want to test any other hypothesis (rather than the two-sided, equal-to-0 hypothesis) you will need to access the regression output and calculate the relevant test statistic yourself. For the moment let us think about the following
\[\begin{equation} H_0: \beta_{exper} = 0.01; H_A: \beta_{exper} \gt 0.01 \end{equation}\]
We then need to calculate the test statistic
\[\begin{equation} t\text{-stat} = \frac{\widehat{\beta}_{i} - 0.01}{se_{\widehat{\beta}_{i}}} \end{equation}\]
We start by saving the regression summary
reg_ex1_sm <- summary(reg_ex1)
This will save a lot of summary statistics in reg_ex1_sm, such as the \(R^2\), the \(F\)-statistic and, importantly, the standard errors for the coefficient estimates. You can get to the regression coefficients and the associated standard errors as follows:
reg_ex1_sm$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.535 0.1391 3.8 0.000139
## exper 0.017 0.0042 3.9 0.000098
## log(huswage) 0.236 0.0637 3.7 0.000232
Noting that the results for the exper variable are in the 2nd row, we can calculate the relevant test statistic as follows:
t_test = (reg_ex1_sm$coefficients[2,1]-0.01)/reg_ex1_sm$coefficients[2,2]
where we recognise that the experience coefficient is saved in the 2nd row of the coefficient matrix. As it turns out, the value of this t-test is 1.5755. Sometimes, especially if you have many explanatory variables, it can be awkward to have to count in which row the relevant coefficient sits. You can instead use the name of the relevant variable as follows:
t_test = (reg_ex1_sm$coefficients["exper",1]-0.01)/reg_ex1_sm$coefficients["exper",2]
We do not know the p-value for this particular test yet. We will turn to this shortly.
Let’s say we are interested in whether two additional variables, age and educ, should be included in the model. As a good econometrics student, or even master, you know that this type of hypothesis is tested with an F-test, and that to calculate an F-test you need the residual sums of squares from a restricted model (that is, model reg_ex1) and an unrestricted model. The latter we estimate here:
reg_ex2 <- lm(lwage~exper+log(huswage)+age+educ,data=mydata)
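As an aside, here is a minimal sketch of how you could calculate the F statistic by hand from the residual sums of squares of the two models (the object names rss_r, rss_u, q and df_u are illustrative choices, not part of the original code):
rss_r <- sum(residuals(reg_ex1)^2)       # residual sum of squares, restricted model
rss_u <- sum(residuals(reg_ex2)^2)       # residual sum of squares, unrestricted model
q <- 2                                   # number of restrictions (age and educ)
df_u <- reg_ex2$df.residual              # residual degrees of freedom, unrestricted model
F_stat <- ((rss_r - rss_u)/q)/(rss_u/df_u)
pf(F_stat, q, df_u, lower.tail = FALSE)  # p-value of the F-test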
The sketch above shows how you could use the output from both regressions to calculate the appropriate F-test yourself, as you have both the restricted and the unrestricted model. However, let’s make our lives a little easier and use the lht function (short for linear hypothesis test).
lht(reg_ex2, c("age = 0"," educ = 0"))
##
## Linear hypothesis test:
## age = 0
## educ = 0
##
## Model 1: restricted model
## Model 2: lwage ~ exper + log(huswage) + age + educ
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 425 210
## 2 423 188 2 22 24.7 0.000000000069 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This requires as input the unrestricted model (here reg_ex2) and the restrictions. Here we have two restrictions, namely that the coefficients on the two variables age and educ are 0 and hence both variables are irrelevant. You could add further restrictions to the vector c( ) (see the sketch below); everything else remains unchanged.
Here we find that the null hypothesis of both these variables being irrelevant (their parameters being equal to 0) is clearly rejected. The p-value is very small.
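For instance, if you additionally wanted to test whether exper is irrelevant, you would simply add a third restriction (purely illustrative; output not shown):
lht(reg_ex2, c("age = 0", "educ = 0", "exper = 0"))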
The lht function is very convenient for testing purposes, as you only need the unrestricted model and the restrictions. Internally the function will then estimate the restricted model, but you as the user will not see it.
Often you will want to calculate p-values for test statistics; we did that above for the t-test we calculated manually. You need a number of ingredients to do that: the value of the test statistic, the distribution it follows under the null hypothesis (including any degrees of freedom), and whether you need a one- or two-tailed probability.
Once you know all these ingredients you can use some R internal functions to get p-values.
Let’s go to the t-test we calculated earlier (and saved in t_test) to test
\[\begin{equation} H_0: \beta_{exper} = 0.01; H_A: \beta_{exper} \gt 0.01 \end{equation}\]
To get the p-value here (a right-tail area) we can call on the function pt, which calculates probabilities from a t-distribution (by default the probability under the t-distribution, here with 425 degrees of freedom, to the left of the test statistic t_test). The degrees of freedom are the number of observations minus the number of estimated regression coefficients (here 428 - 3 = 425), but we can read that directly from the regression output (reg_ex1$df.residual).
pt(t_test, reg_ex1$df.residual, lower.tail = FALSE)
## [1] 0.058
The option lower.tail = FALSE
ensured that we were
looking at an upper tail probability.
Of course, different test statistics require different distributions, but the principle is the same. The relevant function to get p-values for an F-test is pf. Type ?pf into R to get to the help page, where you can see exactly how to use it.
These probability functions exist for a range of common distributions. For a list see here.
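For example, you can replicate the p-value of the F-test reported by lht above, using the (rounded) test statistic and degrees of freedom from that output:
pf(24.7, df1 = 2, df2 = 423, lower.tail = FALSE)  # upper-tail probability of the F(2, 423) distribution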
The t-tests and the F-tests calculated above are typically built on a number of assumptions (no details here; refer to your econometrics notes or textbook). One of these is a homoskedasticity assumption, namely that the error terms \(u_i\) have a constant variance across all observations.
This is an assumption that is often not justified. If that is the case, then the standard way of calculating standard errors needs to be adjusted to allow for the heteroskedasticity. If that does not happen, then the inference we performed above will be incorrect.
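If you want to check whether heteroskedasticity is actually present, one option (not covered in detail here) is a Breusch-Pagan test; the lmtest package, which is loaded together with AER, provides the bptest function for this:
bptest(reg_ex1)  # Breusch-Pagan test; a small p-value points towards heteroskedastic errors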
The AER
package we loaded initially will provide the
functionality we require to adjust the inference.
The function needed to calculate heteroskedasticity-robust standard errors is the vcovHC function. This really comes from the sandwich package, which is automatically loaded by the AER package.
vcvHC <- vcovHC(reg_ex1, type = "HC1")
# the type argument selects one of several variants (actually "HC0" to "HC4").
# Using "HC1" will replicate the robust standard errors you would obtain using STATA
# and is a sort of industry standard
This saves the heteroskedasticity-robust variance-covariance matrix of the coefficient estimates in vcvHC. In this case this is actually a \(3 \times 3\) matrix which also contains the covariances between the three estimated coefficients.
Now you can calculate robust t-tests by using the estimated coefficients and the new standard errors (the square roots of the diagonal elements of vcvHC). But note that inference using these standard errors is only valid for sufficiently large sample sizes (the t-statistics are asymptotically normally distributed).
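A minimal sketch of such a manual calculation, testing each coefficient against zero and using the normal distribution for the p-values (the object names are illustrative):
se_robust <- sqrt(diag(vcvHC))                          # robust standard errors
t_robust <- coef(reg_ex1)/se_robust                     # t-statistics for H0: beta_i = 0
p_robust <- 2*pnorm(abs(t_robust), lower.tail = FALSE)  # asymptotic two-sided p-values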
You could calculate these t-tests manually (as in the sketch above), but the coeftest function provides an easier way to do that.
coeftest(reg_ex1, vcvHC)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.53487 0.14423 3.71 0.00024 ***
## exper 0.01668 0.00429 3.89 0.00011 ***
## log(huswage) 0.23647 0.06540 3.62 0.00034 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You will find the (unchanged) regression coefficients along with their new standard errors, t-stats and p-values.
Often you may wish to produce the standard regression output (as above, coming from the stargazer function) but including the robust standard errors. You can use the stargazer_HC function, which is available for download from here: stargazer_HC.R. You should save this file into your working directory. You can then make the function available to your code using this line:
source("stargazer_HC.r") # includes the robust regression display
Once you have done that you can call
stargazer_HC(reg_ex1, type_out = "text")
##
## =========================================================
## Dependent variable:
## -------------------------------------
## lwage
## ---------------------------------------------------------
## exper 0.017***
## (0.004)
##
## log(huswage) 0.240***
## (0.065)
##
## Constant 0.540***
## (0.140)
##
## ---------------------------------------------------------
## Observations 428
## R2 0.059
## Adjusted R2 0.055
## Residual Std. Error 0.700 (df = 425)
## F Statistic 13.000*** (df = 2; 425)
## =========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
and you can see that the standard errors reported in parentheses are the same as those calculated in vcvHC.
In fact you can achieve the same directly using the
stargazer
function in combination with the previously
calculated vcvHC
.
stargazer(reg_ex1, type="text", se=list(sqrt(diag(vcvHC))))
##
## ===============================================
## Dependent variable:
## ---------------------------
## lwage
## -----------------------------------------------
## exper 0.017***
## (0.004)
##
## log(huswage) 0.240***
## (0.065)
##
## Constant 0.540***
## (0.140)
##
## -----------------------------------------------
## Observations 428
## R2 0.059
## Adjusted R2 0.055
## Residual Std. Error 0.700 (df = 425)
## F Statistic 13.000*** (df = 2; 425)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
So, if calculating heteroskedasticity-robust standard errors is something you do frequently, it is best to use the stargazer_HC function.
When the error terms are autocorrelated (and potentially heteroskedastic), all of the above applies but we need to use yet another estimator for the standard errors of the coefficient estimates, sometimes called the Newey-West estimator. The function from the sandwich package that you want to use is called vcovHAC() and you use it as follows:
vcvHAC <- vcovHAC(reg_ex1)
Everything else works as for heteroskedastic error terms, i.e. you would print these standard errors along with the coefficient estimates, t-statistics and p-values using:
coeftest(reg_ex1, vcov = vcvHAC)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.53487 0.14652 3.65 0.00029 ***
## exper 0.01668 0.00421 3.96 0.000087 ***
## log(huswage) 0.23647 0.06899 3.43 0.00067 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
or use them as part of a stargazer
display:
stargazer(reg_ex1, type="text", se=list(sqrt(diag(vcvHAC))))
##
## ===============================================
## Dependent variable:
## ---------------------------
## lwage
## -----------------------------------------------
## exper 0.017***
## (0.004)
##
## log(huswage) 0.240***
## (0.069)
##
## Constant 0.540***
## (0.150)
##
## -----------------------------------------------
## Observations 428
## R2 0.059
## Adjusted R2 0.055
## Residual Std. Error 0.700 (df = 425)
## F Statistic 13.000*** (df = 2; 425)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Again, you could also use the specifically written function stargazer_HAC which you can download from here: stargazer_HAC.R. You should save this file into your working directory. You can then make the function available to your code using the source function, as used above.
source("stargazer_HAC.r") # includes the robust regression display
And once you have done that you can call stargazer_HAC to generate the usual regression output but with autocorrelation- and heteroskedasticity-robust standard errors (aka Newey-West standard errors).
stargazer_HAC(reg_ex1, type_out = "text")
##
## =============================================================
## Dependent variable:
## -----------------------------------------
## lwage
## -------------------------------------------------------------
## exper 0.017***
## (0.004)
##
## log(huswage) 0.240***
## (0.069)
##
## Constant 0.540***
## (0.150)
##
## -------------------------------------------------------------
## Observations 428
## R2 0.059
## Adjusted R2 0.055
## Residual Std. Error 0.700 (df = 425)
## F Statistic 13.000*** (df = 2; 425)
## =============================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Newey-West standard errors in parenthesis
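As an aside, if you want more control over the lag length used by the Newey-West estimator, the sandwich package also provides the NeweyWest() function (the choice of lag = 4 below is purely illustrative):
vcvNW <- NeweyWest(reg_ex1, lag = 4, prewhite = FALSE)  # Newey-West covariance matrix with a fixed lag
coeftest(reg_ex1, vcov = vcvNW)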
Of course, on this occasion we are using an example with cross-sectional data, where autocorrelation in the error terms is not really meaningful. But R does not know this and hence just does as instructed.
F-tests are also based on a homoskedasticity assumption. Fortunately, allowing for error heteroskedasticity works in exactly the same manner as above.
We begin by specifying the same larger model as used in the standard inference section.
reg_ex2 <- lm(lwage~exper+log(huswage)+age+educ,data=mydata)
At the core of tackling the heteroskedasticity issue is again the
calculation of the correct variance-covariance matrix, which is then fed
into the lht
function via the vcov
option.
vcvHC <- vcovHC(reg_ex2, type = "HC1")
lht(reg_ex2, c("age = 0"," educ = 0"), vcov = vcvHC)
##
## Linear hypothesis test:
## age = 0
## educ = 0
##
## Model 1: restricted model
## Model 2: lwage ~ exper + log(huswage) + age + educ
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 425
## 2 423 2 29.6 0.00000000000096 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result remains unchanged, an overwhelming rejection of the null
hypothesis that both age
and educ
are
irrelevant.