We add the basic libraries needed for this week’s work:
library(tidyverse) # for almost all data handling tasks
library(readxl) # to import Excel data
library(ggplot2) # to produce nice graphics
library(stargazer) # to produce nice results tables
library(haven) # to import Stata files
library(AER) # access to HC robust standard errors
library(estimatr) # to use robust standard errors
The data are an extract from the Understanding Society Survey (formerly the British Household Panel Survey).
Load the data, which are saved in a Stata data file (extension
.dta). The function which loads Stata files is
called read_dta and is supplied by the haven
package.
data_USoc <- read_dta("../data/20222_USoc_extract.dta")
data_USoc <- as.data.frame(data_USoc) # ensure data frame structure
names(data_USoc)
## [1] "pidp" "age" "jbhrs" "paygu" "wave" "cpi" "year"
## [8] "region" "urate" "male" "race" "educ" "degree" "mfsize9"
Let us ensure that categorical variables are stored as
factor variables, which are easiest to work with in
R.
data_USoc$region <- as_factor(data_USoc$region)
data_USoc$male <- as_factor(data_USoc$male)
data_USoc$degree <- as_factor(data_USoc$degree)
data_USoc$race <- as_factor(data_USoc$race)
Click on the little table symbol in your environment tab to see the actual data table.
The variable pidp contains a unique person identifier
and the variable wave indicates the wave and
year the year of observation.
To explain the meaning of these, let us pick out all the
observations that pertain to one particular individual
(pidp == 272395767). In words, the following command does
this: take data_USoc, filter/keep all
observations which belong to individual pidp == 272395767, then select a
list of variables (we don’t need to see all 14 variables) and print the
result:
data_USoc %>% filter(pidp == 272395767) %>%
select(c("pidp","male","wave","year","paygu","age","educ")) %>%
print()
## pidp male wave year paygu age educ
## 1 272395767 female 1 2009 774.8302 40 11
## 2 272395767 female 2 2010 812.2778 41 11
## 3 272395767 female 3 2011 772.1625 42 11
The same person (female) was observed three years in a row (from 2009 to 2011). Their gross monthly income changed, as did, of course, their age, but not their education. This particular person was observed in three consecutive waves. Let’s see whether this is a common pattern.
The code below figures out for how many individuals we have 1, 2 and 3 waves of observations. It is not important to understand that code.
pattern <- data_USoc %>% group_by(pidp) %>%
mutate(n_wave = n()) %>%
select(pidp,n_wave) %>%
unique() %>%
group_by(n_wave)%>%
summarise(n_pat = n()) %>%
print()
## # A tibble: 3 × 2
## n_wave n_pat
## <int> <int>
## 1 1 15099
## 2 2 12450
## 3 3 31091
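As a quick sanity check on these counts, we can compute the share of individuals observed in all three waves (the counts below are hard-coded from the output above):

```r
# counts of individuals by number of observed waves (from the output above)
n_pat <- c(`1` = 15099, `2` = 12450, `3` = 31091)
share_three <- n_pat["3"] / sum(n_pat)
round(share_three, 3)  # about 0.53, i.e. just over half
```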
As you can see, only just over half of the individuals have records for
three waves. Let us look at the observations for an individual
(pidp == 2670365) who only has observations for two
waves.
data_USoc %>% filter(pidp == 2670365) %>%
select(c("pidp","male","wave","year","paygu","age","educ")) %>%
print()
## pidp male wave year paygu age educ
## 1 2670365 female 2 2010 230.0000 16 11
## 2 2670365 female 3 2011 346.6667 17 11
Let’s use the stargazer function to produce a nice
summary table.
stargazer(data_USoc, type = "text")
##
## =======================================================================
## Statistic N Mean St. Dev. Min Max
## -----------------------------------------------------------------------
## pidp 133,272 839,218,358.000 467,699,610.000 280,165 1,639,568,724
## age 133,272 46.172 18.295 9 103
## jbhrs 64,217 32.594 11.614 0.100 97.000
## paygu 59,216 1,823.574 1,475.064 0.083 15,000.000
## wave 133,272 1.912 0.818 1 3
## cpi 133,272 116.790 4.199 110.800 126.100
## year 133,272 2,010.453 0.991 2,009 2,013
## urate 133,272 7.955 1.311 5.800 10.800
## educ 133,041 12.838 2.316 11 17
## mfsize9 58,989 303.135 484.430 1 1,500
## -----------------------------------------------------------------------
Here are some frequency tables:
data_USoc %>% count(wave)
data_USoc %>% count(region)
data_USoc %>% count(male)
data_USoc %>% count(year)
data_USoc %>% count(race)
data_USoc %>% count(educ)
data_USoc %>% count(degree)
data_USoc %>% count(mfsize9)
The pay information (paygu) is provided as a measure of
the (usual) gross pay per month. We convert it into an hourly wage
(dividing by jbhrs*4, the approximate number of working hours
per month) and adjust for the increasing price level (as measured by
cpi). We call this variable hrpay and also
calculate its natural log (lnhrpay). Recall that, in order
to change a variable or create a new one, we use the mutate function.
data_USoc <- data_USoc %>%
mutate(hrpay = paygu/(jbhrs*4)/(cpi/100)) %>%
mutate(lnhrpay = log(hrpay))
As we want to save these additional variables for further use we
assign the result of the operation to data_USoc.
Let’s check whether the log of the hourly pay differs if we analyse the data by certain criteria.
data_USoc %>% group_by(degree) %>%
summarise(n = sum(!is.na(lnhrpay)),
mean = mean(lnhrpay,na.rm=TRUE),
sd = sd(lnhrpay,na.rm=TRUE))
data_USoc %>% group_by(educ) %>%
summarise(n = sum(!is.na(lnhrpay)),
mean = mean(lnhrpay,na.rm=TRUE),
sd = sd(lnhrpay,na.rm=TRUE))
data_USoc %>% group_by(male) %>%
summarise(n = sum(!is.na(lnhrpay)),
mean = mean(lnhrpay,na.rm=TRUE),
sd = sd(lnhrpay,na.rm=TRUE))
You can see that the difference in the averages of lnhrpay
between males and females is about 0.183.
The first hypothesis test we may want to implement is to test whether this difference in the raw data is statistically significant (recall that statistical significance is not the same as economic significance).
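A minimal check, with the two group means hard-coded from the t-test output further below:

```r
# group means of lnhrpay (female and male)
mean_female <- 2.201202
mean_male   <- 2.384207
diff_means  <- mean_male - mean_female
diff_means  # about 0.183
```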
We use the t.test function. We could create two subsets
of data for lnhrpay, one for males and one for females, and
then feed these two series into the t.test function
(t.test(data_male, data_female, mu = 0)), but there is a more
straightforward way to achieve this. We call
t.test(lnhrpay~male, mu=0, data = data_USoc). The formula
lnhrpay~male is almost like a regression call: the variable
we are interested in is lnhrpay, and we want to know whether
it differs according to male. The other inputs set the data
frame (data = data_USoc) and the null hypothesis
(mu=0).
t.test(lnhrpay~male, mu=0, data = data_USoc) # testing that mu = 0
##
## Welch Two Sample t-test
##
## data: lnhrpay by male
## t = -35.022, df = 53923, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -0.1932469 -0.1727632
## sample estimates:
## mean in group female mean in group male
## 2.201202 2.384207
So here the p-value is extremely small, indicating that the raw differential in log hourly wages between males and females is certainly statistically significant.
Regression analysis is our bread-and-butter technique and we can obtain the same result with a regression model.
mod1 <- lm(lnhrpay~male,data = data_USoc)
cov1 <- vcovHC(mod1, type = "HC1")
robust_se <- sqrt(diag(cov1))
stargazer(mod1,type="text",se=list(NULL, robust_se))
##
## ================================================
## Dependent variable:
## ----------------------------
## lnhrpay
## ------------------------------------------------
## malemale 0.183***
## (0.005)
##
## Constant 2.201***
## (0.003)
##
## ------------------------------------------------
## Observations 58,960
## R2 0.021
## Adjusted R2 0.021
## Residual Std. Error 0.625 (df = 58958)
## F Statistic 1,251.147*** (df = 1; 58958)
## ================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
In the regression output you can see the variable male,
but it is labelled malemale. This indicates that a
dummy variable has been included in your regression which takes the
value 1 for every male respondent and 0 for everyone else.
The coefficient you can see for that is 0.183, which is exactly the
difference between the male and female averages of
lnhrpay.
If you were merely to calculate the regression model
(mod1 <- lm(lnhrpay~male,data = data_USoc)) and then
show the results with stargazer, you would get a very
similar regression output, but the standard errors for the parameter
estimates would have been calculated on the basis of a homoskedasticity
assumption. Without going into further details, this is an assumption
which is breached on many occasions. However, the consequences are not too
worrisome as long as we change how we calculate the standard errors, and
that is what the extra lines achieve.
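As a self-contained illustration (with simulated data, not the USoc extract), the sketch below shows how classical and HC1 robust standard errors diverge when the error variance depends on the regressor. The vcovHC function comes from the sandwich package, which is loaded along with AER:

```r
library(sandwich) # provides vcovHC (loaded automatically with AER)

set.seed(123)
n <- 1000
x <- runif(n)
# error standard deviation grows with x, i.e. heteroskedastic errors
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + 2 * x)

mod <- lm(y ~ x)
classical_se <- sqrt(diag(vcov(mod)))                 # assumes homoskedasticity
robust_se    <- sqrt(diag(vcovHC(mod, type = "HC1"))) # HC1 robust
round(cbind(classical_se, robust_se), 4)
```

With the error variance increasing in x, the robust slope standard error comes out noticeably larger than the classical one; with homoskedastic errors the two would be close.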
We can actually write a little function which adds the HC robust
standard errors to the usual output. We have prepared this function for
you. Download it and save it as stargazer_HC.r in your
working directory.
Once you have saved stargazer_HC.r into your working
directory you need to make it accessible to your code by including
source("stargazer_HC.r")
into your script. It is possibly best to do that right at the top
where you load the packages (library commands).
Then we call a regression and, if we want robust standard errors, we
merely call stargazer_HC rather than
stargazer. An added bonus is that you no longer have to use
the type = "text" option, as I made it the
default.
mod1 <- lm(lnhrpay~male,data = data_USoc)
stargazer_HC(mod1)
##
## =========================================================
## Dependent variable:
## -------------------------------------
## lnhrpay
## ---------------------------------------------------------
## malemale 0.183***
## (0.005)
##
## Constant 2.201***
## (0.003)
##
## ---------------------------------------------------------
## Observations 58,960
## R2 0.021
## Adjusted R2 0.021
## Residual Std. Error 0.625 (df = 58958)
## F Statistic 1,251.147*** (df = 1; 58958)
## =========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
Sometimes it is useful to extract the actual estimated coefficient
and use it for some subsequent calculation. On this occasion let us
extract the estimated coefficient for male. When we
estimated the regression model we saved the output in mod1.
In fact mod1 contains a lot of information we may want to
use, most notably the coefficient estimates, estimated residuals, fitted
values etc. We access these as follows
mod1$coefficients
## (Intercept) malemale
## 2.2012023 0.1830051
# mod1$residuals # uncomment if you want to see these
# mod1$fitted.values
And if we want a particular coefficient we can get it as follows
mod1$coefficients["malemale"] # or
## malemale
## 0.1830051
# mod1$coefficients[1]
where we use the coefficient name we also saw in the regression output.
Let’s say we want to calculate the raw differential in percent, which is a function of this coefficient:
(exp(mod1$coefficients["malemale"])-1)*100
## malemale
## 20.08205
Let’s produce a scatter plot of the data on which the above regression is based.
ggplot(data_USoc, aes(x = male, y = lnhrpay)) +
geom_point() +
geom_abline(intercept = mod1$coefficients[1], slope = mod1$coefficients[2])
As you can see this plot has limited informative value as, on the
horizontal axis, we only have two possible outcomes, female and male. On
each of these we have a wide range of outcomes for lnhrpay.
We want to find a way to illustrate the distribution of outcomes on the
lnhrpay scale depending on the value for
male.
ggplot(subset(data_USoc, (lnhrpay>0) & (lnhrpay<5)), aes(x = lnhrpay)) +
geom_histogram() +
facet_grid(~male)
Recall that the first input into the ggplot function is
the dataframe we are using. However, in the above command, instead of
just using data_USoc we use
subset(data_USoc, (lnhrpay>0) & (lnhrpay<5)). All
that does is exclude some data, or rather select a subset of the
data, namely the observations with a log hourly wage larger than 0 and
smaller than 5. Try for yourself how these histograms look if you include
all the data (just use data_USoc).
You could, if you wanted to, fit a normal distribution to these
(remember the variable is the log hourly pay rate). This is not super
straightforward, and it turns out that in this case you would want to
generate the graphs separately. I had to google “R ggplot add normal
distribution fit to geom_histogram” and came across https://stackoverflow.com/questions/1376967/using-stat-function-and-facet-wrap-together-in-ggplot2-in-r/1379074#1379074
which helped me to adapt a solution. The normal density is added with a
stat_function:
# first create a relevant subset of the data
data_temp <- data_USoc %>%
filter((lnhrpay>0) & (lnhrpay<5)) %>% # remove very small and large wages
filter(male == "male")
# now create the plot
ggplot(data_temp,aes(x = lnhrpay)) +
geom_histogram(binwidth = 0.25,aes(y = ..density..)) +
stat_function(fun = dnorm, args=list(mean=mean(data_temp$lnhrpay),
sd=sd(data_temp$lnhrpay)), color = "darkred", size = 1)
# Note: aes(y = ..density..) puts the histogram on a density scale
And now we can do the same for females.
# first create a relevant subset of the data
data_temp <- data_USoc %>%
filter((lnhrpay>0) & (lnhrpay<5)) %>%
filter(male == "female")
# now create the plot
ggplot(data_temp,aes(x = lnhrpay)) +
geom_histogram(binwidth = 0.25,aes(y = ..density..)) +
stat_function(fun = dnorm, args=list(mean=mean(data_temp$lnhrpay),
sd=sd(data_temp$lnhrpay)), color = "darkred", size = 1)
After looking at gender differences we will now look at differences in earnings according to education differences.
Let’s first look at the degree variable in our
dataset.
data_USoc %>% count(degree)
Let us create a new variable which merely differentiates between
having any degree or no degree. So we essentially want to collapse the
first degree and higher degree categories into
an any degree category. Recall that this is a factor
variable, and there is a very convenient function
(fct_recode) which allows you to change the levels. The
mutate(grad = ...) call creates a new variable, with the
definition of that variable following the equals sign.
data_USoc <- data_USoc %>%
mutate(grad = fct_recode(degree,
"no degree" = "no degree", # new level = old level
"any degree" = "first degree",
"any degree" = "higher degree"))
And now let’s look at the counts of this new variable.
data_USoc %>% count(grad)
And now we run a regression just as we did for gender differences.
mod1 <- lm(lnhrpay~grad,data = data_USoc)
stargazer_HC(mod1)
##
## =========================================================
## Dependent variable:
## -------------------------------------
## lnhrpay
## ---------------------------------------------------------
## gradany degree 0.472***
## (0.005)
##
## Constant 2.138***
## (0.003)
##
## ---------------------------------------------------------
## Observations 58,942
## R2 0.119
## Adjusted R2 0.119
## Residual Std. Error 0.592 (df = 58940)
## F Statistic 7,959.665*** (df = 1; 58940)
## =========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
And let’s calculate the raw differential in percent, which is a function of the estimated coefficient on the grad variable:
(exp(mod1$coefficients["gradany degree"])-1)*100
## gradany degree
## 60.28478
Clearly there is a massive difference: graduates, on average, earn hourly wages that are more than 60% higher.
We found that, individually, gender and degree status make significant differences to hourly pay. Let’s use the full power of regression analysis and see how hourly pay changes as a function of both these factors.
mod1 <- lm(lnhrpay~grad+male,data = data_USoc)
stargazer_HC(mod1)
##
## =========================================================
## Dependent variable:
## -------------------------------------
## lnhrpay
## ---------------------------------------------------------
## gradany degree 0.469***
## (0.005)
##
## malemale 0.176***
## (0.005)
##
## Constant 2.060***
## (0.003)
##
## ---------------------------------------------------------
## Observations 58,942
## R2 0.138
## Adjusted R2 0.138
## Residual Std. Error 0.586 (df = 58939)
## F Statistic 4,727.510*** (df = 2; 58939)
## =========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
Recall that both variables, grad and male,
have a base category (no degree and female respectively). It is best to
think about this in the form of a table:

| Name | Gender | Degree | male | grad |
|---|---|---|---|---|
| John | male | no | 1 | 0 |
| Maria | female | no | 0 | 0 |
| Jess | female | yes | 0 | 1 |
| Pete | male | yes | 1 | 1 |
The variable male will contribute to the conditional
expectation for John and Pete, and the grad variable will
kick in for Jess and Pete. This sort of setup implies that the “effect”
of being male is the same regardless of whether you have a degree or not,
and also that the “effect” of having a degree is the same regardless of
whether you are male or female.
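Using the coefficients from the additive regression output above (hard-coded here), the implied conditional expectations of lnhrpay for the four example people are:

```r
b0 <- 2.060; b_grad <- 0.469; b_male <- 0.176 # from the additive regression output
# E[lnhrpay] = b0 + b_grad * grad + b_male * male
exp_pay <- c(Maria = b0,                   # female, no degree (base categories)
             John  = b0 + b_male,          # male, no degree
             Jess  = b0 + b_grad,          # female, degree
             Pete  = b0 + b_grad + b_male) # male, degree
exp_pay
```

Note how the gap between John and Maria (0.176) is, by construction, the same as the gap between Pete and Jess.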
If you want a model setup that does not make that assumption you need to include an interaction term:
mod2 <- lm(lnhrpay~grad*male,data = data_USoc)
stargazer_HC(mod2)
##
## =============================================================
## Dependent variable:
## -------------------------------------
## lnhrpay
## -------------------------------------------------------------
## gradany degree 0.484***
## (0.007)
##
## malemale 0.187***
## (0.006)
##
## gradany degree:malemale -0.034***
## (0.011)
##
## Constant 2.056***
## (0.004)
##
## -------------------------------------------------------------
## Observations 58,942
## R2 0.138
## Adjusted R2 0.138
## Residual Std. Error 0.586 (df = 58938)
## F Statistic 3,155.706*** (df = 3; 58938)
## =============================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
The interaction term will take the value 1 only for those who are male and have a degree (in the above table, only Pete). This setup allows the “effect” of a degree to differ between males and females.
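We can make this concrete by computing the implied degree “effect” for each gender from the coefficients in the output above (hard-coded here):

```r
b_grad  <- 0.484  # gradany degree
b_inter <- -0.034 # gradany degree:malemale
degree_effect <- c(female = b_grad,           # degree effect for females
                   male   = b_grad + b_inter) # degree effect for males
degree_effect
```

The degree premium is slightly smaller for males (0.450) than for females (0.484).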