Here you will learn to use the group_by, summarise and spread commands, the ggplot and geom_point graphing functions, and the lm function for regression.
We add the basic libraries needed for this week’s work:
library(tidyverse) # for almost all data handling tasks
library(readxl) # to import Excel data
library(ggplot2) # to produce nice graphics
library(AER) # to calculate robust regression standard errors
library(stargazer) # to produce nice results tables
We will also use one function saved in a .r file. The function in this file will be explained below. The file is available for download from here: stargazer_HC.R.
You should save this file into your working directory. You can then make this function available to your code using this line:
source("stargazer_HC.r") # includes the robust regression display
The example we are using here is taken from the CORE - Doing Economics resource. In particular we are using Project 8 which deals with international data on well-being. The data represent several waves of data from the European Value Study (EVS). A wave means that the same survey is repeated at regular intervals (waves). You will learn more about the variables included in that dataset below.
The data have been prepared as demonstrated in the Doing Economics Project 8, up to and including Walk-Through 8.3. Please have a look at this to understand the amount of data work required before an empirical analysis can begin. The datafile is saved as an R data structure (WBdata.Rdata). Download the file and ensure that it is in your working directory.
load("WBdata.Rdata")
str(wb_data) # prints some basic info on variables
## tibble [129,515 × 19] (S3: tbl_df/tbl/data.frame)
## $ S002EVS : chr [1:129515] "1981-1984" "1981-1984" "1981-1984" "1981-1984" ...
## $ S003 : chr [1:129515] "Belgium" "Belgium" "Belgium" "Belgium" ...
## $ S006 : num [1:129515] 1001 1002 1003 1004 1005 ...
## $ A009 : num [1:129515] 3 5 2 5 5 5 5 5 4 4 ...
## $ A170 : num [1:129515] 9 9 3 9 9 9 9 10 8 10 ...
## $ C036 : num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
## $ C037 : num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
## $ C038 : num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
## $ C039 : num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
## $ C041 : num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
## $ X001 : chr [1:129515] "Male" "Male" "Male" "Female" ...
## $ X003 : num [1:129515] 53 30 61 60 60 19 38 39 44 76 ...
## $ X007 : chr [1:129515] "Single/Never married" "Married" "Separated" "Married" ...
## $ X011_01 : num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
## $ X025A : chr [1:129515] NA NA NA NA ...
## $ Education_1: num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
## $ Education_2: chr [1:129515] NA NA NA NA ...
## $ X028 : chr [1:129515] "Full time" "Full time" "Unemployed" "Housewife" ...
## $ X047D : num [1:129515] NA NA NA NA NA NA NA NA NA NA ...
You can create .RData objects yourself by saving your Environment. In that case all the objects in your environment will be saved and you can reload them as above using the load function. This is how this datafile was created.
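For illustration, a file like WBdata.Rdata could have been created along these lines (a sketch; exactly which objects the original authors saved, and how, is an assumption):

save(wb_data, wb_data_Des, file = "WBdata.Rdata") # saves the listed objects into an .RData file
# save.image("WBdata.Rdata") would instead save everything currently in the environment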
Checking your environment you will see two objects. Alongside the actual datafile (wb_data) you will find wb_data_Des, which contains some information on each of the variables. It will help us to navigate the obscure variable names.
wb_data_Des
## Names Labels
## 1 S002EVS EVS-wave
## 2 S003 Country/region
## 3 S006 Respondent number
## 4 A009 Health
## 5 A170 Life satisfaction
## 6 C036 Work Q1
## 7 C037 Work Q2
## 8 C038 Work Q3
## 9 C039 Work Q4
## 10 C041 Work Q5
## 11 X001 Sex
## 12 X003 Age
## 13 X007 Marital status
## 14 X011_01 Number of children
## 15 X025A Education
## 16 Education_1 Education category
## 17 Education_2 Education Description
## 18 X028 Employment
## 19 X047D Monthly household income
## Description
## 1 EVS-wave
## 2 Country/region
## 3 Original respondent number
## 4 State of health (subjective), 1 = Very Poor, 5 = Very good
## 5 Satisfaction with your life
## 6 To develop talents you need to have a job, 1 = Strongly Agree, 5 = Strongly Disagree
## 7 Humiliating to receive money without having to work for it, 1 = Strongly Agree, 5 = Strongly Disagree
## 8 People who don't work become lazy, 1 = Strongly Agree, 5 = Strongly Disagree
## 9 Work is a duty towards society, 1 = Strongly Agree, 5 = Strongly Disagree
## 10 Work should come first even if it means less spare time, 1 = Strongly Agree, 5 = Strongly Disagree
## 11 Sex
## 12 Age
## 13 Marital status
## 14 How many children you have-deceased children not included
## 15 Educational level respondent: ISCED-code one digit
## 16 Educational ISCED-code one digit
## 17 Education level description
## 18 Employment status
## 19 Monthly household income (× 1,000), corrected for ppp in euros
As you can see there are a number of interesting questions in this dataset. These questions will allow us to investigate whether attitudes to work differ between countries and whether such differences correlate with different levels of self-reported happiness/life satisfaction.
Details on the variables are available from here.
Before you embark on any regression analysis, it is important that you understand your data. Let us investigate some of the features of this dataset. It has 129,515 observations and 19 variables. Let’s see which countries are represented in our dataset.
unique(wb_data$S003) # unique finds all the different values in a variable
## [1] "Belgium" "Canada" "Denmark"
## [4] "France" "Germany" "Iceland"
## [7] "Ireland" "Italy" "Malta"
## [10] "Netherlands" "Norway" "Spain"
## [13] "Sweden" "Great Britain" "United States"
## [16] "Northern Ireland" "Austria" "Bulgaria"
## [19] "Czech Republic" "Estonia" "Finland"
## [22] "Hungary" "Latvia" "Lithuania"
## [25] "Poland" "Portugal" "Romania"
## [28] "Slovakia" "Slovenia" "Croatia"
## [31] "Greece" "Russian Federation" "Turkey"
## [34] "Albania" "Armenia" "Bosnia Herzegovina"
## [37] "Belarus" "Cyprus" "Northern Cyprus"
## [40] "Georgia" "Luxembourg" "Moldova"
## [43] "Montenegro" "Serbia" "Switzerland"
## [46] "Ukraine" "Macedonia" "Kosovo"
As you can see there are 48 countries, almost all European, with Canada and the U.S. being the exceptions. In the same manner we can find out how many waves of data we have available.
unique(wb_data$S002EVS)
## [1] "1981-1984" "1990-1993" "1999-2001" "2008-2010"
Let’s find out how many observations/respondents we have for each country in each wave. To do this we will resort to the powerful piping technique delivered through the functionality of the tidyverse (check out this page in case you are not familiar with that technique):
table1 <- wb_data %>% group_by(S002EVS,S003) %>% # groups by Wave and Country
summarise(n = n()) %>% # summarises each group by calculating obs
spread(S002EVS,n) %>% # put Waves across columns
print(n=Inf) # n = Inf makes sure that all rows are printed
## # A tibble: 48 × 5
## S003 `1981-1984` `1990-1993` `1999-2001` `2008-2010`
## <chr> <int> <int> <int> <int>
## 1 Albania NA NA NA 1200
## 2 Armenia NA NA NA 1224
## 3 Austria NA 1432 NA 1216
## 4 Belarus NA NA NA 1237
## 5 Belgium 1025 2721 1402 1343
## 6 Bosnia Herzegovina NA NA NA 1104
## 7 Bulgaria NA 984 858 1183
## 8 Canada 1241 1717 NA NA
## 9 Croatia NA NA 849 1188
## 10 Cyprus NA NA NA 775
## 11 Czech Republic NA 2076 1637 1308
## 12 Denmark 1163 1020 854 1061
## 13 Estonia NA 982 767 1273
## 14 Finland NA 544 NA 940
## 15 France 1187 981 1233 1341
## 16 Georgia NA NA NA 1233
## 17 Germany 1273 3391 1432 1683
## 18 Great Britain 1153 1459 NA 997
## 19 Greece NA NA 885 1246
## 20 Hungary NA 990 NA 1248
## 21 Iceland 907 690 853 666
## 22 Ireland 1197 997 782 504
## 23 Italy 1329 1970 1422 876
## 24 Kosovo NA NA NA 1339
## 25 Latvia NA 777 840 1197
## 26 Lithuania NA 975 731 1143
## 27 Luxembourg NA NA NA 1165
## 28 Macedonia NA NA NA 1290
## 29 Malta 424 NA 698 730
## 30 Moldova NA NA NA 1174
## 31 Montenegro NA NA NA 1166
## 32 Netherlands 1090 996 923 1250
## 33 Northern Cyprus NA NA NA 404
## 34 Northern Ireland 311 304 625 309
## 35 Norway 1031 1205 NA 992
## 36 Poland NA 958 NA 1050
## 37 Portugal NA 1171 635 764
## 38 Romania NA 1091 NA 1025
## 39 Russian Federation NA NA 2074 1102
## 40 Serbia NA NA NA 1216
## 41 Slovakia NA 1081 1147 1042
## 42 Slovenia NA 1011 627 801
## 43 Spain 2287 2619 741 908
## 44 Sweden 905 944 931 788
## 45 Switzerland NA NA NA 934
## 46 Turkey NA NA 1143 2010
## 47 Ukraine NA NA NA 1178
## 48 United States 2253 1741 NA NA
You can see that the number of countries has increased through time, although Canada and the U.S. have dropped out.
If you look at the dataframe itself (either view(wb_data) or double click on the little spreadsheet icon on the right hand edge of the Environment window) you will recognise that there are a lot of missing observations (coded as NA). Ordinarily we would be interested in finding out how many effective observations we have for the life satisfaction variable (A170). However, the initial data work in https://www.core-econ.org/doing-economics/book/text/08-03.html#part-81-cleaning-and-summarizing-the-data has made sure that all the observations you can see are those with available data for this variable.
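If you wanted to confirm this, a quick check (a minimal sketch) could be:

sum(is.na(wb_data$A170)) # counts missing life satisfaction responses; given the data preparation this should be 0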
Let’s look at a couple of graphical representations of our data (Look at this page for an introduction to basic graphing techniques). For instance we may be interested in figuring out whether life satisfaction (1 (dissatisfied) to 10 (satisfied)) and self-reported health are related to each other. We shall look at this on the basis of data aggregated for country-waves.
table2 <- wb_data %>% group_by(S002EVS,S003) %>% # groups by Wave and Country
summarise(Avg_LifeSatis = mean(A170),Avg_Health = mean(A009)) # summarises each group by calculating obs
ggplot(table2,aes(Avg_Health,Avg_LifeSatis, colour=S002EVS)) +
geom_point() +
labs(title ="Health v Life Satisfaction",
x = "Avg Health (1 = Very Poor, 5 = Very good)",
y = "Avg Life Satisfaction (1 = dissatisfied, 10 = satisfied)")
We can see a clear positive relation between the two variables.
Let’s see whether we can see a similarly clear relationship between life satisfaction and respondents’ attitude towards work (C041 - “Work should come first even if it means less spare time, 1 = Strongly Agree, 5 = Strongly Disagree”). What would you expect to see?
table2 <- wb_data %>% group_by(S002EVS,S003) %>% # groups by Wave and Country
summarise(Avg_LifeSatis = mean(A170),Avg_WorkFirst = mean(C041)) # summarises each group by calculating obs
ggplot(table2,aes(Avg_WorkFirst, Avg_LifeSatis,colour=S002EVS)) +
geom_point() +
labs(title ="Work First v Life Satisfaction",
x = "Work First (1 = Strongly Agree, 5 = Strongly Disagree)",
y = "Avg Life Satisfaction (1 = dissatisfied, 10 = satisfied)")
The relationship is less clear. Recall that small values for the “Work First” question correspond to the countries where, on average, respondents agreed more strongly with the statement that work should come first!
The correlation between the two variables is not very clear when looking at it at the country level. Let us check whether the correlation between these two variables within a country is more clear (i.e., do people in, say, Spain tend to indicate that they are more satisfied with life if they are more inclined to think that work is more important? If so we should expect a negative correlation in Spain).
We can calculate within-country correlations after grouping the data by country and then applying the summarise command to calculate the correlation for each country. We restrict the analysis to the 2008-2010 wave.
table3 <- wb_data %>% filter(S002EVS == "2008-2010") %>%
group_by(S003) %>% # groups by Country
summarise(cor_LS_WF = cor(A170,C041,use = "pairwise.complete.obs"),
med_income = median(X047D)) %>% # correlation, remove missing data
arrange(cor_LS_WF)
ggplot(table3,aes( med_income,cor_LS_WF)) +
geom_point() +
ggtitle("Corr(Life Satisfaction, Work First) v Median Income")
There isn’t any really obvious relation between these correlations and the median income. Also the correlations are fairly close to 0 in all countries.
Regression analysis is a crucial tool in any data analyst’s toolbox. In the first instance regression analysis is a tool of descriptive analysis, just like calculating sample means and correlations.
Often regression analysis is seen as some magic tool to discover causal relationships, but it is certainly not. Having said that, a good causal analysis will often use regression analysis, but this is not the place to discuss when regression results can be interpreted causally.
In order to introduce how regressions are implemented in R we start by creating a new dataset which only contains the British data of the 2008-10 wave.
test_data <- wb_data %>%
filter(S003 =="Great Britain") %>% # pick British data
filter(S002EVS == "2008-2010") # pick latest wave
Now we run a regression of the Life Satisfaction variable
(A170
) against a constant only.
\(LifeSatis_{i} = \alpha + u_{i}\)
To implement a regression we use the lm function (lm for linear model). In the function call we need to tell R which data object to use (data=test_data) and what model to estimate. To the left of the ~ symbol we state the dependent variable (here life satisfaction, or the A170 variable). To the right of the ~ symbol we list the explanatory variables. Here we only have a constant (or a variable which only takes the value 1, and hence the model here is specified as A170~1). The results of the regression are saved in a new object which, here, we call mod1.
mod1 <- lm(A170~1,data=test_data)
After executing this line you will see a new object mod1 in your environment. This object is a list with 11 elements. With names(mod1) you can find out what elements are saved in that list. You can use particular elements of this object, e.g. the residuals, by calling mod1$residuals.
If you want a standard regression output you can obtain this as follows:
summary(mod1)
##
## Call:
## lm(formula = A170 ~ 1, data = test_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.53 -0.53 0.47 1.47 2.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.5296 0.0634 119 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 996 degrees of freedom
In fact, this is not very nice and a better way to display regression
results is to use the stargazer
function.
stargazer(mod1, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## A170
## -----------------------------------------------
## Constant 7.500***
## (0.063)
##
## -----------------------------------------------
## Observations 997
## R2 0.000
## Adjusted R2 0.000
## Residual Std. Error 2.000 (df = 996)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
This looks more like a standard regression output. Check out ?stargazer to find out what the type option does.
What we will find is that the estimated value for \(\alpha\), \(\widehat{\alpha}=7.530\) is nothing else but the sample mean of all British observations in the 2008-2010 wave.
mean(test_data$A170)
## [1] 7.5
We could now calculate a t-test \(=(\widehat{\alpha}-\alpha_0)/se(\widehat{\alpha})\) \(=(7.530-0)/0.063=119.524\) which tests the hypothesis that the average response to the Life Satisfaction question is equal to 0 (Yes, this makes no sense as the smallest possible response is 1, but R doesn’t know that!).
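You can reproduce this t-statistic directly from the regression output (a minimal sketch):

reg_coefs <- summary(mod1)$coefficients # matrix with estimates, standard errors, t values and p values
(reg_coefs["(Intercept)", "Estimate"] - 0) / reg_coefs["(Intercept)", "Std. Error"] # approximately 119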
This could also be calculated without a regression model using the
t.test
function:
t.test(test_data$A170, mu=0) # testing that mu = 0
##
## One Sample t-test
##
## data: test_data$A170
## t = 119, df = 996, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 7.4 7.7
## sample estimates:
## mean of x
## 7.5
The differences in the t-tests are rounding differences.
Let’s see how we could use a regression to test for the difference in means. First we adjust our dataset test_data to include British and Swedish data. Note here how we use the incredibly useful S003 %in% c("Sweden","Great Britain") condition, which selects all observations for which the country variable (S003) is included in the list c("Sweden","Great Britain"). Type ?c in the console to see what the c() function does.
test_data <- wb_data %>%
filter(S003 %in% c("Sweden","Great Britain")) %>% # pick British and Swedish data
filter(S002EVS == "2008-2010") # pick latest wave
Then we run a regression with Life Satisfaction as the dependent variable and, as explanatory variables, a constant and a dummy variable which takes the value 1 if the respondent is from Sweden and 0 if the respondent is from Britain. This is achieved by specifying the model as A170~S003. The variable name before the ~ is the dependent variable, here A170. The variable after the ~ is the explanatory variable, here S003. (Note that R automatically includes a constant in the regression model, even if you do not specify it explicitly.)
mod1 <- lm(A170~S003,data=test_data)
stargazer(mod1, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## A170
## -----------------------------------------------
## S003Sweden 0.150
## (0.097)
##
## Constant 7.500***
## (0.064)
##
## -----------------------------------------------
## Observations 1,785
## R2 0.001
## Adjusted R2 0.001
## Residual Std. Error 2.000 (df = 1783)
## F Statistic 2.400 (df = 1; 1783)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
This regression is a very special one as it uses the country variable as the explanatory variable. Recall that test_data only contains British and Swedish data. The regression picked one country as the base country (here Britain, as it is first in the alphabet) and for the other it created a dummy variable, i.e. a variable (\(Sweden_i\)) which takes the value 1 if the observation comes from Sweden and 0 otherwise.
\(LifeSatis_{i} = \alpha + \beta~ Sweden_{i} + u_{i}\)
You can see that the constant (\(\widehat{\alpha}\)) still reports the sample average for the British observations. It is identical to the value we saw in the previous regression. But what is the meaning of \(\widehat{\beta}=0.149\)? This is not the average response value for Swedish respondents, but the difference between the Swedish and the British average. As it is positive it means that the Swedish average response is larger than the British. In fact it is 7.530+0.149=7.679.
Let’s confirm this by briefly calculating the two means in the usual manner.
test_data %>% group_by(S003) %>% summarise(mean = mean(A170))
## # A tibble: 2 × 2
## S003 mean
## <chr> <dbl>
## 1 Great Britain 7.53
## 2 Sweden 7.68
You can now test whether that difference is significantly different from 0 (\(H_0:\beta = 0\)) which is equivalent to testing that the two averages are identical. The t-test for this hypothesis test would be \(0.149/0.097=1.536\).
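As before, the same comparison could be done without a regression, here with a two-sample t-test (a sketch; with var.equal = TRUE the test statistic is equivalent to the regression t-test):

t.test(A170 ~ S003, data = test_data, var.equal = TRUE) # compares the British and Swedish means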
The regressions we ran so far were special in the sense that they involved explanatory variables which were either a constant (i.e. ones) or dummy variables (0s or 1s). The result of this was that the resulting estimates represented sample means or differences in sample means.
The interpretation of coefficient estimates changes as explanatory variables take a more general form.
test_data <- wb_data %>%
filter(S003 =="Great Britain") %>% # pick British data
filter(S002EVS == "2008-2010") # pick latest wave
We now estimate a regression model which includes a constant and the household’s monthly income (in 1,000 Euros) as an explanatory variable (\(Inc_i\), or variable X047D in our dataset).
\(LifeSatis_{i} = \alpha + \beta~ Inc_{i} + u_{i}\)
mod1 <- lm(A170~X047D,data=test_data)
stargazer(mod1, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## A170
## -----------------------------------------------
## X047D 0.180***
## (0.039)
##
## Constant 7.200***
## (0.095)
##
## -----------------------------------------------
## Observations 997
## R2 0.022
## Adjusted R2 0.021
## Residual Std. Error 2.000 (df = 995)
## F Statistic 22.000*** (df = 1; 995)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Here you see that \(\widehat{\beta}=0.184\). What does this mean? As income increases by one unit (here that represents an increase of Euro 1,000) we should expect Life Satisfaction to increase by 0.184 units. What is the interpretation of \(\widehat{\alpha}=7.190\)? For someone with 0 income we should expect the Life Satisfaction to be 7.190. Let’s present a graphical representation, including a representation of the estimated regression line (+ geom_abline(intercept = mod1$coefficients[1], slope = mod1$coefficients[2])).
ggplot(test_data, aes(x=X047D, y=A170, colour = S003)) +
geom_point() +
geom_abline(intercept = mod1$coefficients[1], slope = mod1$coefficients[2]) +
ggtitle("Income v Life Satisfaction, Britain")
This plot is somewhat awkward as each of the points you see may represent many observations, because incomes are only reported in 12 different values. R has a neat option for cases like this. Instead of using geom_point you can use geom_jitter, which randomises the location of the points a little such that you can see the individual points.
ggplot(test_data, aes(x=X047D, y=A170, colour = S003)) +
geom_jitter(width=0.2, size = 0.5) + # Use jitter rather than point so we can see indiv obs
geom_abline(intercept = mod1$coefficients[1], slope = mod1$coefficients[2])+
ggtitle("Income v Life Satisfaction, Britain")
The regression parameters, which deliver the line of best fit, are estimated by Ordinary Least Squares (OLS). The name comes from the fact that these parameters are the ones which minimise the sum of squared residuals, \(\Sigma \widehat{u}^2_i = \Sigma (LifeSatis_{i} - \widehat{\alpha} - \widehat{\beta}~ Inc_{i})^2\). These parameters achieve another thing, they ensure that \(Corr(Inc_{i},\widehat{u}_{i})=0\) is true.
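You can confirm the zero correlation between the regressor and the residuals numerically (a quick sketch):

mod_frame <- model.frame(mod1) # the observations actually used in the regression
cor(mod_frame$X047D, residuals(mod1)) # essentially zero by construction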
This last point is incredibly important, as one of the assumptions underpinning the estimation of regression models by OLS is that \(Corr(Inc_{i},u_{i})=0\), called the exogeneity assumption. Why is that assumption important? If the assumption is not true, then we need to accept that OLS estimation imposes a feature on the model that is not appropriate for the data. The resulting regression coefficients are then biased, and the regression model cannot be said to have any causal interpretation.
As we cannot observe \(u_i\), the assumption of exogeneity cannot be tested directly and we need to make an argument using economic understanding.
A lot of econometric work is therefore directed at building either models or estimation methods (alternatives to OLS) which make this assumption more defendable. This could be the inclusion of additional explanatory variables (leading to multivariate regression analysis) or the application of alternative estimation methods (like instrumental variables estimation).
In the earlier model we related life satisfaction to income. There is of course no suggestion that life satisfaction should only be related to income. It is clear that other variables should also be related, such as health (variable A009).
Let’s add this health variable, as well as a dummy variable indicating whether a respondent is male or female (a variable taking the value 0 if female or 1 if male), to the regression to estimate this model:
\[LifeSatis_{i} = \alpha + \beta~ Inc_{i} + \gamma~Health_{i} + \delta~Sex_{i}+ u_{i}\]
mod2 <- lm(A170~X047D+A009+X001 ,data=test_data)
stargazer(mod1, mod2, type="text")
##
## ===================================================================
## Dependent variable:
## -----------------------------------------------
## A170
## (1) (2)
## -------------------------------------------------------------------
## X047D 0.180*** 0.120***
## (0.039) (0.039)
##
## A009 0.500***
## (0.064)
##
## X001Male -0.240*
## (0.120)
##
## Constant 7.200*** 5.400***
## (0.095) (0.260)
##
## -------------------------------------------------------------------
## Observations 997 997
## R2 0.022 0.084
## Adjusted R2 0.021 0.081
## Residual Std. Error 2.000 (df = 995) 1.900 (df = 993)
## F Statistic 22.000*** (df = 1; 995) 30.000*** (df = 3; 993)
## ===================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
You can see that the effect of income (X047D) has reduced somewhat with the inclusion of extra covariates. You can see that being male is associated with a somewhat lower life satisfaction. The health variable, coded as (1 = very poor and 5 = very good), seems to be positively correlated with life satisfaction (coded from 1 to 10). This is of course not a surprise as bad health is bound to impact negatively on life satisfaction.
If you check the data format of the Health variable you will see that it is numeric. The way the parameter is then interpreted implies that (on average) an increase of 1 in the health score should be expected to relate to an increase in the life satisfaction score of about 0.5. This interpretation implies that increasing your health from a score of 1 to 2 has the same effect on life satisfaction as an increase from 3 to 4. This may not be the case.
In order to allow for a nonlinear relationship between health and
life satisfaction it will be instructive to not treat the health
variable as a numeric variable, but rather as a categorical one. This is
achieved by turning A009
into a factor variable
(as.factor(A009)
). Let’s look at the outcome.
mod3 <- lm(A170~X047D+as.factor(A009)+X001 ,data=test_data)
stargazer(mod1, mod2, mod3, type="text")
##
## ===========================================================================================
## Dependent variable:
## -----------------------------------------------------------------------
## A170
## (1) (2) (3)
## -------------------------------------------------------------------------------------------
## X047D 0.180*** 0.120*** 0.120***
## (0.039) (0.039) (0.039)
##
## A009 0.500***
## (0.064)
##
## as.factor(A009)2 0.680
## (0.470)
##
## as.factor(A009)3 1.500***
## (0.430)
##
## as.factor(A009)4 1.800***
## (0.420)
##
## as.factor(A009)5 2.300***
## (0.430)
##
## X001Male -0.240* -0.240*
## (0.120) (0.120)
##
## Constant 7.200*** 5.400*** 5.600***
## (0.095) (0.260) (0.410)
##
## -------------------------------------------------------------------------------------------
## Observations 997 997 997
## R2 0.022 0.084 0.086
## Adjusted R2 0.021 0.081 0.081
## Residual Std. Error 2.000 (df = 995) 1.900 (df = 993) 1.900 (df = 990)
## F Statistic 22.000*** (df = 1; 995) 30.000*** (df = 3; 993) 16.000*** (df = 6; 990)
## ===========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
What R has done here is the following: there are five possible values for the health variable. The first (A009=1) is used as the default. Then there are four extra dummy variables included, for instance a dummy variable which takes the value 1 if A009=3 (and equally for outcomes 2, 4 and 5). For instance the estimated coefficient of 1.515 for the outcome 3 indicates that, relative to the default answer (A009=1), we should expect the answer to the Life Satisfaction question to be 1.515 larger if a respondent answered 3 to the health question. Now you can see from the coefficients that answering 4 instead of 3 actually makes very little difference to the life satisfaction question.
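You can inspect the dummy variables R has created by looking at the design matrix (a minimal sketch):

head(model.matrix(mod3)) # first rows: intercept, income, the four health dummies and the sex dummy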
This page is not a complete econometric discussion of regression analysis and its assumptions. One of the commonly made assumptions is that the variance of regression error terms is constant across all observations. This is called the homoskedasticity assumption. This is an assumption frequently breached, in which case we say that there is heteroskedasticity.
There are diagnostic tests available to test the null hypothesis of the absence of heteroskedasticity (e.g. the bptest function of the AER package). However, here we concentrate on how to deal with the presence of heteroskedasticity. The immediate consequence is that the standard errors calculated by R, or most other econometric packages, are incorrect. For all but very small sample sizes there is a straightforward fix: heteroskedasticity robust standard errors can be calculated.
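Although our focus here is the remedy, the Breusch-Pagan test mentioned above could be run as follows (a sketch; bptest is available once the AER package is loaded):

bptest(mod1) # null hypothesis: the error variance is constant (homoskedasticity)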
Calculating robust standard errors is not done by the lm function but by another function, vcovHC, provided by the AER package. You can use the help function for vcovHC to see what options you have and how to do this. However, the stargazer_HC function which you loaded at the beginning of this exercise implements this in an easy-to-use manner. If you wish to calculate heteroskedasticity-robust standard errors you can call stargazer_HC instead of stargazer to display the regression output.
stargazer_HC(mod1, type_out="text")
##
## =========================================================
## Dependent variable:
## -------------------------------------
## A170
## ---------------------------------------------------------
## X047D 0.180***
## (0.034)
##
## Constant 7.200***
## (0.100)
##
## ---------------------------------------------------------
## Observations 997
## R2 0.022
## Adjusted R2 0.021
## Residual Std. Error 2.000 (df = 995)
## F Statistic 22.000*** (df = 1; 995)
## =========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Robust standard errors in parenthesis
Comparing the standard errors, the values in parentheses underneath the coefficient estimates, you will see that they are somewhat different to the ones in the earlier regression output. The coefficients, however, remain unchanged.
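If you want to do this without the helper function, similar robust standard errors can be obtained directly with coeftest and vcovHC (a sketch; the exact vcovHC type used by stargazer_HC is an assumption here, HC1 being a common choice):

coeftest(mod1, vcov. = vcovHC(mod1, type = "HC1")) # coefficient table with robust standard errors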
You have learned how to use R to estimate a regression model and how to calculate heteroskedasticity robust standard errors.