We add the basic libraries needed for this week’s work:
library(tidyverse) # for almost all data handling tasks
library(readxl) # to import Excel data
library(ggplot2) # to produce nice graphiscs
library(stargazer) # to produce nice results tables
This is the example from the Siegel et al. (2019) paper which attempts to establish whether gun control laws have a causal impact on firearm deaths. The data will later form the basis of a Diff-in-Diff analysis. The data are from a variety of sources and some of the data have been collated for you in “US_Gun_example.csv”.
the main variables to consider are:
In this lesson we will revise some hypothesis testing and basic (simple) regression analysis.
In terms of R skills you learn how to
group_by,
summarise and spread commandsggplot and
geom_point and line charts geom_linetmap library)t.testfor loops in a little statistical experimentlmThis is a lot and you should expect to take quite a bit of time to replicate this work.
Let’s import the csv file, but you should always, if you are dealing with spreadsheet dta, foirst open the spreadsheet to see whether the data already contain variable names or perhaps contain some extra information which you may not want to import into R.
merge_data <- read.csv("../data/US_Gun_example.csv") # import data
head(merge_data)
Checking your environment you will see one object. You should open that object (click on the little spreadsheet symbol to the right of the object name) or look at the above output. Each line represents one State-Year. Understanding what one line in your dataset represents is absolutly crucial and unless you understand that it is unlikely that you can perform any sensible data analysis.
To get to this dataset a significant amount of data handling and cleaning had to happen. We will share that with you shortly and tou will then also see that the documentation in the above paper was not of super high quality such that it is not 100% certain that the data here are identical to those in the Siegel et al. (2019) paper. In that documentation we will also share with you the exact sources of the data here. you should make sure that you always exactly record where your data are from.
Let’s look at the structure of the data frame.
str(merge_data) # prints some basic info on variables
## 'data.frame': 1071 obs. of 19 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : int 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
## $ State_code : chr "AK" "AL" "AR" "AZ" ...
## $ State : chr "Alaska" "Alabama" "Arkansas" "Arizona" ...
## $ Population : int 633687 4467634 2691571 5273477 34479458 4425687 3432835 574504 795699 16356966 ...
## $ Mechanism : chr "Firearm" "Firearm" "Firearm" "Firearm" ...
## $ pol_ubc : int 0 0 0 0 1 0 1 NA 0 0 ...
## $ pol_vmisd : int 0 0 0 0 1 0 1 NA 1 0 ...
## $ pol_mayissue : int 0 1 0 0 1 1 1 NA 1 0 ...
## $ Age.Adjusted.Rate: num 14.83 16.41 15.27 15.92 9.32 ...
## $ logy : num 2.7 2.8 2.73 2.77 2.23 ...
## $ ur : num 6.28 5.18 4.76 4.72 5.47 ...
## $ law.officers : int 1821 15303 7538 18548 106244 15172 9825 4716 2964 63041 ...
## $ law.officers.pc : num 287 343 280 352 308 ...
## $ vcrime : int 3696 19203 12042 28275 210661 15334 11387 4845 4845 129839 ...
## $ vcrime.pc : num 583 430 447 536 611 ...
## $ alcc.pc : num 2.67 1.86 1.74 2.5 2.2 ...
## $ incarc : int 3033 24741 11489 27710 157142 14888 17507 NA 6841 72404 ...
## $ incarc.pc : num 479 554 427 525 456 ...
Some of the names in merge_data are obvious, but let us
introduce a few in more detail.
Age.Adjusted.rate - Deaths by firearm for every 100,000
population in a State-Yearur - Average annual unemployment rate in a
State-Yearlaw.officers.pc - number of law officers per 100,000
population in a State-Yearvcrime.pc - number of violent crimise (excl. homicides)
per 100,000 population in a State-YearThese will be the substantial variables on which we will concentrate in this data introduction.
Let us investigate some of the features of this dataset. It has 1071 observations and 19 variables. Let’s see which states and which years are represented in our dataset.
unique(merge_data$State) # unique finds all the different values in a variable
## [1] "Alaska" "Alabama" "Arkansas"
## [4] "Arizona" "California" "Colorado"
## [7] "Connecticut" "District of Columbia" "Delaware"
## [10] "Florida" "Georgia" "Hawaii"
## [13] "Iowa" "Idaho" "Illinois"
## [16] "Indiana" "Kansas" "Kentucky"
## [19] "Louisiana" "Massachusetts" "Maryland"
## [22] "Maine" "Michigan" "Minnesota"
## [25] "Missouri" "Mississippi" "Montana"
## [28] "North Carolina" "North Dakota" "Nebraska"
## [31] "New Hampshire" "New Jersey" "New Mexico"
## [34] "Nevada" "New York" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania"
## [40] "Rhode Island" "South Carolina" "South Dakota"
## [43] "Tennessee" "Texas" "Utah"
## [46] "Virginia" "Vermont" "Washington"
## [49] "Wisconsin" "West Virginia" "Wyoming"
unique(merge_data$Year)
## [1] 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
## [16] 2016 2017 2018 2019 2020 2021
As you can see these are 51 US states and 21 years of data which explains why we have, in total 1,071 observations (State-Years!). Let us create some summary statistics for the above variables.
sel_vars <- c("Age.Adjusted.Rate","ur","law.officers.pc","vcrime.pc")
summary(merge_data[sel_vars]) # prints some basic info on variables
## Age.Adjusted.Rate ur law.officers.pc vcrime.pc
## Min. : 2.14 Min. : 2.100 Min. : 37.08 Min. : 76.94
## 1st Qu.: 8.84 1st Qu.: 4.204 1st Qu.:255.98 1st Qu.: 268.22
## Median :11.72 Median : 5.283 Median :296.35 Median : 358.09
## Mean :12.05 Mean : 5.648 Mean :310.21 Mean : 393.14
## 3rd Qu.:15.05 3rd Qu.: 6.737 3rd Qu.:343.69 3rd Qu.: 493.70
## Max. :33.82 Max. :13.733 Max. :894.46 Max. :1056.47
## NA's :23
You get some summary statistics, e.g. the mean Age adjusted number of
firearm deaths (per 100,000 population) across all states and all years
is 12.05. Interestingly we can also see that in the
law.officers.pc variable there are 23 missing observations
(NA).
When you work with data it is important to understand its shortcomings and missing data are certainly one of them. Let us identify for which State-Years we do not have data.
test <- merge_data %>%
filter(is.na(law.officers.pc)) %>%
select(Year,State,law.officers.pc)
test
## Year State law.officers.pc
## 1 2001 Nebraska NA
## 2 2002 Nebraska NA
## 3 2003 Nebraska NA
## 4 2004 Nebraska NA
## 5 2005 Nebraska NA
## 6 2006 Nebraska NA
## 7 2007 Nebraska NA
## 8 2008 Nebraska NA
## 9 2008 West Virginia NA
## 10 2009 Nebraska NA
## 11 2010 Nebraska NA
## 12 2011 Nebraska NA
## 13 2012 Nebraska NA
## 14 2013 Nebraska NA
## 15 2014 Nebraska NA
## 16 2014 West Virginia NA
## 17 2015 Nebraska NA
## 18 2016 Nebraska NA
## 19 2017 Nebraska NA
## 20 2018 Nebraska NA
## 21 2019 Nebraska NA
## 22 2020 Nebraska NA
## 23 2021 Nebraska NA
You will realise that there are no law officers data for Nebraska at all and two years of missing data for West Virginia (2008 and 2014). This is important information to keep in mind.
Let us now calculate some summary statistics by State. We shall use the awesome power of the tidyverse methodology. This was used previously but here we shall turn it up a bit (not quite to this level yet, that will come).
We want to create a table that shows, for four states (just a
selection to make the result easier to see), the average values of the
four variables in sel_vars. We start by selecting the rows
of data from merge_data for which the variable
State equals one of the state names in
sel_states
(merge_data %>% filter(State %in% sel_states)). Recall
that the %>% command is called “a pipe” and indicates
that the result to the left of the pipe is sent to the function to the
right of the pipe. This selection is then sent to the
summarise function which we instruct to calculate five
variables, the number of observations (n = n(), the
function n() just counts the number of observations in each
group) and then four averages. For instance
avg.ur = mean( ur) calculates the average unemployment rate
in each state.
sel_states <- c("New York", "California","West Virginia","Nebraska") # define a list with countries
table1 <- merge_data %>% filter(State %in% sel_states) %>% # only looks at selected states
group_by(State) %>% # groups by State
summarise(n = n(),
avg.fds = mean( Age.Adjusted.Rate),
avg.ur = mean( ur),
avg.off = mean( law.officers.pc),
avg.vcr = mean(vcrime.pc)) %>% # calculating no of obs
print()
## # A tibble: 4 × 6
## State n avg.fds avg.ur avg.off avg.vcr
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 California 21 8.40 7.32 314. 481.
## 2 Nebraska 21 8.57 3.54 NA 297.
## 3 New York 21 4.76 6.21 424. 404.
## 4 West Virginia 21 15.0 6.24 NA 301.
What you can see here is that there is significant variation in
deaths due to firearms, with there being three times as many in West
Virginia compared to New York. We can also see that the code did not
calculate the average number of officers for West Virginia and Nebraska.
This is not a surprise for Nebraska as we had previously seen that there
were no officer data for this state. For West Virginia., however, we did
have data, just observations for 2008 and 2014 were missing. So there
could be an average calculated on the basis of the remaining years. By
default, the mean fuunction will not calculate a mean if
there are any missing data. You can change that by adding the option
na.rm = TRUE to the mean function
table1 <- merge_data %>% filter(State %in% sel_states) %>% # only looks at selected states
group_by(State) %>% # groups by State
summarise(n = n(),
avg.fds = mean( Age.Adjusted.Rate, na.rm = TRUE),
avg.ur = mean( ur, na.rm = TRUE),
avg.off = mean( law.officers.pc, na.rm = TRUE),
avg.vcr = mean(vcrime.pc, na.rm = TRUE)) %>% # calculating no of obs
print()
## # A tibble: 4 × 6
## State n avg.fds avg.ur avg.off avg.vcr
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 California 21 8.40 7.32 314. 481.
## 2 Nebraska 21 8.57 3.54 NaN 297.
## 3 New York 21 4.76 6.21 424. 404.
## 4 West Virginia 21 15.0 6.24 244. 301.
Let’s look at a couple of graphical representations of our data. For
instance we may be interested in figuring out whether the rate of
firearm deaths (Age.Adjusted.Rate) is related to the number
of law enforcement officers (law.officers.pc). A good first
look at this may be a scatterplot across the entire dataset.
plot1 <- ggplot(merge_data, aes(x=law.officers.pc, y=Age.Adjusted.Rate)) +
geom_point()
plot1
A relationship is not super clear. Let’s concentrate on the three states we looked at before (recall Nebraska has no data on officers).
temp_data <-merge_data %>% filter(State %in% sel_states) # selection of data
plot2 <- ggplot(temp_data, aes(x=law.officers.pc, y=Age.Adjusted.Rate)) +
geom_point()
plot2
Now this looks as if there was a clear negative relation between the number of officers and the number of firearm deaths. Can we conclude that having more officers will lead to a reduction of firearm deaths? The answer is an emphatic no!
This looks as if there are three clusters of data. Perhaps not
surprising given that we are looking at three states. Let’s make sure
that we can see in the graph which state the data come from. We do this
by adding color = State to the aesthetics.
plot3 <- ggplot(temp_data, aes(x=law.officers.pc, y=Age.Adjusted.Rate, color = State)) +
geom_point()
plot3
Looking at this graph it becomes a little more obvious that it may just be that these states are different and that for some reason (perhaps unrelated to the number of officers) there are more firearm deaths in West Virginia than New York.
If there was a genuine negative relationship between these two variables we would expect to see it as well inside a state. Let’s calculate these and then show a histogram illustrating the values of the intra-state correlations.
table2 <- merge_data %>% group_by(State) %>% # groups by State
summarise(cor_fds_off = cor(Age.Adjusted.Rate,law.officers.pc,use = "pairwise.complete.obs"))
ggplot(table2,aes(x=cor_fds_off)) + geom_histogram(bins=16)
The majority of states show a negative correlation but there are also some states which show a positive correlation.
As our data also have a time dimension, it is often of interest to see how variables evolve through time. This can be done with just a little change in the code. Say we want to show the development of the death rates for the four states then we can do so with the following code.
plot4 <- ggplot(temp_data, aes(x=Year, y=Age.Adjusted.Rate, color = State)) +
geom_line() +
labs(title = "Firearm death rates (per 100,000)")
plot4
From this we can see that death rates increased in West Virginia, from an already high level, after 2015.
A great tool to visualise the differences between regional entities (e.g. countries or here states) is to produce a map where a color scale represents the values of some statistic. The first time you create maps, we are afraid, you will have to struggle a little. But the rewards are great! You can easily skip this section. Just remember that you come back here should you ever think that such a map would add to your work.
We start by uploading a few libraries we will need to work with maps.
We follow the advice of the Lovelace et al. and use
the tmap package. Recall that you will have to install
these packages first in case you havn’t done so yet.
library(tmap) # mapping package
library(sf) # required to deal with shape files
library(spData) # delivers shape files
We want to visualise data, for instance those on Firearm death rates or unemployment rates and how they differ between countries. Perhaps we will be able to see regional differences. Let’s start by creating a list of states for which we have data.
state_list <- unique(merge_data$State) # List of states in our dataset
state_list
## [1] "Alaska" "Alabama" "Arkansas"
## [4] "Arizona" "California" "Colorado"
## [7] "Connecticut" "District of Columbia" "Delaware"
## [10] "Florida" "Georgia" "Hawaii"
## [13] "Iowa" "Idaho" "Illinois"
## [16] "Indiana" "Kansas" "Kentucky"
## [19] "Louisiana" "Massachusetts" "Maryland"
## [22] "Maine" "Michigan" "Minnesota"
## [25] "Missouri" "Mississippi" "Montana"
## [28] "North Carolina" "North Dakota" "Nebraska"
## [31] "New Hampshire" "New Jersey" "New Mexico"
## [34] "Nevada" "New York" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania"
## [40] "Rhode Island" "South Carolina" "South Dakota"
## [43] "Tennessee" "Texas" "Utah"
## [46] "Virginia" "Vermont" "Washington"
## [49] "Wisconsin" "West Virginia" "Wyoming"
Now we need to get the map data in. What we need is a list of US
states and their respective shapes, sometimes this info is called a
shape-file. There are a lot of ways to get these and shape files exist
for countries, parliamentary constituencies, postcodes and many more
things. For some objects (like the countries of the world) there is a
package which makes these data available, it is called SpData.
Once this package is loaded (as we have done above) you can get the
shape files for all the countries in the world using the
world command.
us_st <- us_states # save US State file shape files
head(us_st)
## Simple feature collection with 6 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -114.8136 ymin: 24.55868 xmax: -71.78699 ymax: 42.04964
## Geodetic CRS: NAD83
## GEOID NAME REGION AREA total_pop_10 total_pop_15
## 1 01 Alabama South 133709.27 [km^2] 4712651 4830620
## 2 04 Arizona West 295281.25 [km^2] 6246816 6641928
## 3 08 Colorado West 269573.06 [km^2] 4887061 5278906
## 4 09 Connecticut Norteast 12976.59 [km^2] 3545837 3593222
## 5 12 Florida South 151052.01 [km^2] 18511620 19645772
## 6 13 Georgia South 152725.21 [km^2] 9468815 10006693
## geometry
## 1 MULTIPOLYGON (((-88.20006 3...
## 2 MULTIPOLYGON (((-114.7196 3...
## 3 MULTIPOLYGON (((-109.0501 4...
## 4 MULTIPOLYGON (((-73.48731 4...
## 5 MULTIPOLYGON (((-81.81169 2...
## 6 MULTIPOLYGON (((-85.60516 3...
You can see from the data extract what type of data are contained.
The actual geographical shapes of the countries are contained in the
geometry information. Basically they are coordinate points
(latitude and longitude) which, when connected, describe the shape of a
country. Importantly, NAME is the state name. We will use
that information to extract the shape information for the states we have
in our datafile. In particular we will
d_sel <- us_st %>%
filter(NAME %in% state_list)
Let’s see whether this did indeed select all the countries we wanted
state_map <- d_sel$NAME # states included in d_sel
setdiff(state_list,state_map) # finds the difference between the two arguments
## [1] "Alaska" "Hawaii"
Why are these states not included in state_map? The two
missing states are not connected to the main body of the US and the
coders of the spData package decided to exclude them. We
will roll with this, but with a bit of research you can find shape files
that include these two states as well. As the names of US states are
very standardised we, here, have no issues with different spellings of
names. When you are dealing with countries you will often have to check
that countries are spelled identically in both the shape file and your
dataframe with the data.
Let’s create a first map. As it so happens, d_sel
contains a number of numerical variables, such as
total_pop_15 which represents the population of each state
in 2015.
names(d_sel)
## [1] "GEOID" "NAME" "REGION" "AREA" "total_pop_10"
## [6] "total_pop_15" "geometry"
Let’s create a very basic map of the U.S. (map1) and
then one which looks a little nicer and has the country color-filled
depending on the total_pop_15 variable (map2).
For more guidance look at the excexellent (Chapter 8 of Geocomputing
with R by Lovelace et al.)[https://geocompr.robinlovelace.net/adv-map.html]
d_sel$pop <- d_sel$total_pop_15/1000000 # scale a new pop var to measure pop in millions
map1 <- tm_shape(d_sel) + # basic map
tm_borders() # adds borders
map2 <- tm_shape(d_sel) +
tm_borders() +
tm_fill(col = "pop") +
tm_style("natural")
tmap_arrange(map1, map2) # this arranges the maps next to each other
Now, all we need to do, is add the variable which we want to
visualise with a map to the d_sel dataframe. Let’s
calculate two maps, one with the age adjusted firearm death rates in
2001 and another for the year 2021. That measn we wish to add two
variables to d_sel
fds2001 <- merge_data %>% filter(Year == 2001) %>%
mutate(fds2001 = Age.Adjusted.Rate,
NAME = State) %>% #just create a var with a new name
select(NAME, fds2001)
fds2021 <- merge_data %>% filter(Year == 2021) %>%
mutate(fds2021 = Age.Adjusted.Rate,
NAME = State) %>% #just create a var with a new name
select(NAME, fds2021)
d_sel <- merge(d_sel, fds2001)
d_sel <- merge(d_sel, fds2021)
Now we can create the maps
map3 <- tm_shape(d_sel) +
tm_borders() +
tm_fill(col = "fds2001") +
tm_style("natural")
map4 <- tm_shape(d_sel) +
tm_borders() +
tm_fill(col = "fds2021") +
tm_style("natural")
tmap_arrange(map3, map4) # this arranges the maps next to each other
There is an issues with these graphs. The scales are different. If we want to compare the two maps, then the colour coding should be identical. After a bit of a web-search we the solution is implemented below.
map5 <- tm_shape(d_sel) +
tm_borders() +
tm_fill(col = "fds2001", breaks = c(0,5,10,20,30,40)) +
tm_style("natural")
map6 <- tm_shape(d_sel) +
tm_borders() +
tm_fill(col = "fds2021", breaks = c(0,5,10,20,30,40)) +
tm_style("natural")
tmap_arrange(map5, map6) # this arranges the maps next to each other
Now we can clearly see that there have been some states where the number of firearm deaths have increase.
Let’s investigate whether there are differences in some of the responses between countries. But before we do so we need to revisit some basic hypothesis testing. At the core is the understanding that there is some underlying population statistic (for instance the difference between the average firearm death rates in 2001 and 2021), but all we observe is a sample statistics. What hypothesis testing does is that it uses the sample information to help us judge on some hypothesis regarding the true underlying (unknown!) population statistic.
Let’s create a sample statistic. We shall compare the violent crime
rates (vcrime.pc) in 2001 to that in 2021
test_data_2001 <- merge_data %>%
filter(Year == 2001) # pick 2001
mean_2001 <- mean(test_data_2001$vcrime.pc)
test_data_2021 <- merge_data %>%
filter(Year == 2021) # pick 2021
mean_2021 <- mean(test_data_2021$vcrime.pc)
sample_diff <- mean_2021 - mean_2001
So we can see that the sample difference is -54.1107799, hence the average violent crime rate across the states has gone down. There is the proof, 2021 is safer than 2001. Or is it? The outcomes we observe are the result of many random processes. Random processes that determine whether a crime happens or not and randomness in the reporting of crime.
Just because in 2021 the average is below 2001 does not necessarily mean that the U.S is safer in 2021. In a different state of the world we would have received a different statistic. Is this difference perhaps just the chance of some chance variation? It is to answer this question that we perform hypothesis tests.
In order to perform a hypothesis test we first formulate a null
hypothesis. Here that the difference in population means
(mu) is equal to 0 using the t.test
function.
t.test(test_data_2021$vcrime.pc,test_data_2001$vcrime.pc, mu=0) # testing that mu = 0
##
## Welch Two Sample t-test
##
## data: test_data_2021$vcrime.pc and test_data_2001$vcrime.pc
## t = -1.5375, df = 94.887, p-value = 0.1275
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -123.98112 15.75956
## sample estimates:
## mean of x mean of y
## 370.2264 424.3372
How can we use this information to evaluate our initial null hypothesis. To judge this we need to know what random distribution the sample test statistic follows (if the null hypothesis was true). In this case this is an approximate normal distribution. The p-value then then tells us how likely it is to get a result like the one we got (a difference of -54 or larger) if the null hypothesis was true (i.e. the true population means were the same). Here the p-value is 0.1275. This implies that we judge there to be a probability of more than 10% that we could have received that difference or larger if the two means were actually the same.
What about the difference between 2021 and 2001?
test_data_2011 <- merge_data %>%
filter(Year == 2011)
t.test(test_data_2021$vcrime.pc,test_data_2011$vcrime.pc, mu=0) # testing that mu = 0
##
## Welch Two Sample t-test
##
## data: test_data_2021$vcrime.pc and test_data_2011$vcrime.pc
## t = 0.37061, df = 99.381, p-value = 0.7117
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -47.90612 69.91315
## sample estimates:
## mean of x mean of y
## 370.2264 359.2229
Here you can see that the p-value is 0.7117, hence the probability of getting a result like the one we got, a difference in means of about 11, if the true population means were equal, is about 71%. So this says that it is quite likely to get the data we see if there were no differences between 2011 and 2021. When is a p-value small enough for us to reject a null hypothesis?. There are actually some “conventions” in the sense that we often say that we reject the null hypothesis if the p-value is smaller than either 0.1, 0.05 or 0.01.
But we shouldn’t just adopt such a convention without understanding what these values mean. In order to do so we will add another variable to our dataset, a truly random variable, but drawn from the same distribution for all individuals
set.seed(19) # set random seed for replicability
merge_data$rvar <- rnorm(nrow(merge_data)) # add random variable
We will now check whether the average value for that variable differs between Years Of course we know that it shouldn’t as all observations are draws from the same random variable, a standard normal random variable and hence the true population mean for all years is 0.
But let’s pretend we didn’t know that.
test_data <- merge_data
years <- unique(merge_data$Year) # List of all years
n_years <- length(years)
Now we will perform the above test for all possible combinations of
years and will record the respective p-value. Don’t worry too much about
this double for loop.
save_pvalue <- matrix(NA,n_years,n_years)
for (i in seq(2,n_years)){
for (j in seq(1,(i-1))){
test_data_1 <- test_data %>% filter(Year == years[i])
test_data_2 <- test_data %>% filter(Year == years[j])
tt <- t.test(test_data_1$rvar,test_data_2$rvar, mu=0) # testing that mu = 0
save_pvalue[i,j] <- unlist(tt["p.value"]) # this will just pick the p-value
}
}
This leaves us with (21*21-21)/2=210 hypothesis tests. All of which with a null hypothesis which we know to be true (population means are identical). Let’s see how many of these hypothesis tests delivered p-values which are smaller than 10%.
tre <- (save_pvalue<0.1) # value of TRUE if pvalue < 0.1
cols <- c("TRUE" = "#FFFFFF","FALSE" = "#66FF33")
image(1:nrow(tre), 1:ncol(tre), as.matrix(tre), col=cols)
table(tre)
## tre
## FALSE TRUE
## 189 21
The green blots on the graph indicate rejections of the null hypothesis. As you can see, 21 of the 210 tests produced a test statistic with a p-value smaller than 10%. So for these we may be tempted to say that we reject the null hypothesis. So here we have arrived at the point where we can perhaps understand what it means to perform a hypothesis test. Even if the null hypothesis is correct (which in reality we will of course not know) we may actually reject the null hypothesis. We call this making a Type 1 error. Vice versa, if in truth the null hypothesis is incorrect we may come to the conclusion not to reject the null hypothesis (this is what is called a Type 2 error).
As you can see here we have made a Type 1 error in about 15% of cases (this may differ slightly to wjhat was presented in the lecture or what you get when you replicate this as the numbers are random draws. But the number should be fairly close to 10%). This is no accident. If we had checked what percentage of these tests (remember for all the null hypothesis is true) had p-values < 5% we would have found approximately 5% of tests that had p-values smaller than 5%. In fact this is what a hypothesis test is designed to do. So this gives us now a clue of the role of this threshold against which we compare the p-value.
You may wonder then why we do not use a threshold as small as possible, after all that would minimise the probability of making a Type 1 error. However, the flip side of reducing a Type 1 error is that we would at the same time increase the probability of making a Type 2 error, i.e. a failure to reject an incorrect null hypothesis.
Hypothesis testing is a crucial tool of empirical analysis. Another tool we will use repeatedly is that of regression analysis. In fact, sometimes, running a regression is a convenient way to deliver a hypothesis test. Let us demonstrate this with one of the above examples, the difference in violent crime rates between 2001 and 2021 across all states in the US.
Let’s start by creating a new dataset which only contains the 2001 data.
test_data <- merge_data %>%
filter(Year == 2001)
Now we run a regression of the violent crime variable
(vcrime.pc) against a constant only.
\(vcrime.pc_{i} = \alpha + u_{i}\)
mod1 <- lm(vcrime.pc~1,data=test_data)
stargazer(mod1, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## vcrime.pc
## -----------------------------------------------
## Constant 424.337***
## (27.624)
##
## -----------------------------------------------
## Observations 51
## R2 0.000
## Adjusted R2 0.000
## Residual Std. Error 197.274 (df = 50)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
What we will find is that the estimated value for \(\alpha\), \(\widehat{\alpha}=424.337\) is nothing else
but the sample mean of vcrime.pc in 2001. We could now
calculate a t-test \(=\widehat{\alpha}/se{\widehat{\alpha}}\)
\(=(424.337-0)/27.624=15.36\) which
tests the hypothesis that the average value for vcrime.pc
is equal to 0 (Yes, this is basically impossible, but R doesn’t know
that!). This could also be calculated
t.test(test_data$vcrime.pc, mu=0) # testing that mu = 0
##
## One Sample t-test
##
## data: test_data$vcrime.pc
## t = 15.361, df = 50, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 368.8529 479.8216
## sample estimates:
## mean of x
## 424.3372
The differences are rounding differences.
Let’s see how we could use a regression to test for the difference in
means. First we adjust our dataset test_data to include the
2001 and 2021 data. Note here we use the incredibly useful
Year %in% c(2001,2021) condition which selects all
observations for which the value for the year variable
(Year) is included in the list c(2001,2021).
Type ?c in the console to see what the c()
function does.
test_data <- merge_data %>%
filter(Year %in% c(2001,2021)) %>%
mutate(Year2021 = (Year == 2021))
In the above code we created a new variable (Year2021).
This is a Boolean variable, which takes a value of TRUE (or 1) if the
condition we specified is correct (Year == 2021) and FALSE
(or 0) otherwise. This type of variable is also called a dummy
variable.
Then we run a regression with the vcrime.pc as the
dependent variable and a constant and Year2021a dummy
variable which takes the value 1 if the Year is 2021 and 0 if the Year
is not 2021. This is achieved by specifying the model as
~S003. The variable name before the ~ is the
dependent variable, here A170. The variable after the
~ is the explanatory variable, here S003.
(Note that R automatically includes a constant into the regression
model, even if you do not specify it explicitly.)
mod1 <- lm(vcrime.pc~Year2021,data=test_data)
stargazer(mod1, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## vcrime.pc
## -----------------------------------------------
## Year2021 -54.111
## (35.194)
##
## Constant 424.337***
## (24.886)
##
## -----------------------------------------------
## Observations 102
## R2 0.023
## Adjusted R2 0.013
## Residual Std. Error 177.722 (df = 100)
## F Statistic 2.364 (df = 1; 100)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
This regression is a very special one as it uses the
Year2021 variable as the explanatory variable. Recall that
test_data only contains data from 2001 and 2021. The
regression picked one year (2001) as the base year and for the other we
created a dummy the variable Year2021, a variable which
takes the value 1 if the observation relates to the year 2021 and 0
otherwise.
\(vcrime.pc_{i} = \alpha + \beta~ Year2021_{i} + u_{i}\)
You can see that the constant (\(\widehat{\alpha}\)) still reports the sample average for the 2001 observations. It is identical to the value we saw in the previous regression. But what is the meaning of \(\widehat{\beta}=-54.111\)? This is not the average response value for year 2021 observations, but the difference between the average of the 2021 and 2001 observations. As it is negative it means that the 2021 average is smaller than the 2001. In fact it is 424.337 - 54.111=370.226.
If all you were after is the calculation of the average values in 2001 and 2021, then there was no need to estimate a regression, you could just calculate the following:
test_data %>% group_by(Year) %>% summarise(avg_vcrime.pc = mean(vcrime.pc))
## # A tibble: 2 × 2
## Year avg_vcrime.pc
## <int> <dbl>
## 1 2001 424.
## 2 2021 370.
But in the regression framework you can now test whether that difference is significantly different from 0 (\(H_0:\beta = 0\)) which is equivalent to testing that the two averages are identical. The t-test for this hypothesis test would be \((-54.111-0)/35.194=1.538\) which, but for rounding differences, is identical to the test on equality of two means we performed previously.
The regressions we ran so far were special in the sense that they involved explanatory variables which were either a constant (i.e. ones) or dummy variables (0s or 1s). The result of this was that the resulting estimates represented sample means or differences in sample means.
The interpretation of coefficient estimates changes as explanatory variables take a more general form.
test_data <- merge_data %>%
filter(Year == 2021) # pick latest wave
We now estimate a regression model for vcrime.pc which
includes a constant and the number of law enforcement officers (per
capita) as an explanatory variable (law.officers.pc).
\(vcrime.pc_{i} = \alpha + \beta~ law.officers.pc_{i} + u_{i}\)
mod1 <- lm(vcrime.pc~law.officers.pc,data=test_data)
stargazer(mod1, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## vcrime.pc
## -----------------------------------------------
## law.officers.pc 0.204
## (0.193)
##
## Constant 311.421***
## (61.296)
##
## -----------------------------------------------
## Observations 50
## R2 0.023
## Adjusted R2 0.002
## Residual Std. Error 156.747 (df = 48)
## F Statistic 1.114 (df = 1; 48)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Here you see that \(\widehat{\beta}=0.204\). What does this mean? As the number of law officers increases by one unit (here that represents an increase of one officer per 100,000 population) we should expect that the number of violent crimes increases by 0.204 units (0.204 crimes per 100,000 population).
What is the interpretation for \(\widehat{\alpha}=311.421\)? For a state with 0 law officers (per 100,000 population) we should expect there to be 311 violent crimes per 100,000 population. Let’s present a graphical representation.
ggplot(test_data, aes(x=law.officers.pc, y=vcrime.pc)) +
geom_point() +
geom_abline(intercept = mod1$coefficients[1], slope = mod1$coefficients[2])+
ggtitle("Violent Crime v Law Officers")
Note a few tweaks in the graphical representation.
geom_abline adds a line. We specify the intercept and slope
from our regression model (mod1$coefficients[1] and
mod1$coefficients[2]). ggtitle adds the title
to the graph.
The regression parameters, which deliver the line of best fit, are estimated by Ordinary Least Squares (OLS). The name comes from the fact that these parameters are the ones which minimise the sum of squared residuals, \(\Sigma \widehat{u}^2_i = \Sigma (vcrime.pc_{i} - \widehat{\alpha} - \widehat{\beta}~ law.officers.pc_{i})^2\). These parameters achieve another thing, they ensure that \(Corr(law.officers.pc_{i},\widehat{u}_{i})=0\) is true.
This last point is increadibly important, as one of the assumptions underpinning the estimation of regression models by OLS is that \(Corr(law.officers.pc_{i},u_{i})=0\). Why is that assumption important? If the assumption was not true, then we need to accept that the OLS estimation imposes a feature into the model that is not appropriate for the data. As a result the resulting regression coefficients are biased. As a consequence the resulting regression model cannot be said to have any causal interpretation.
As we cannot observe \(u_i\), the assumption of exogeneity cannot be tested directly and we need to make an argument using economic understanding.
A lot of econometric work is therefore directed at building either models or estimation methods (alternatives to OLS) which make this assumption more defendable. This could be the inclusion of additional explanatory variables (leading to multivariate regression analysis) or the application of alternative estimation methods (like instrumental variables estimation).