In this computer lab you will be practicing the following
Let’s start by loading some useful packages
library(readxl) # enable the read_excel function
library(tidyverse) # for almost all data handling tasks
library(ggplot2) # plotting toolbox
library(stargazer) # for nice regression output
Here we are looking at replicating aspects of this paper:
Siegel et al (2019) The Impact of State Firearm Laws on Homicide and Suicide Deaths in the USA, 1991–2016: a Panel Study, J Gen Intern Med 34(10):2021–8.
this is the paper which we will replicate throughout the unit in order to demonstrate some Diff-in-Diff techniques (not in this session but later).
Import the data from the “US_Gun_example.csv” file. Recall, make sure
the file (from the Week 3 BB page) is saved in your working directory,
that you set the working directory correctly and that you set the the
na= option in the read.csv function to the
value in which missing values are coded in the csv file. To do this
correctly you will have to open the csv file (with your spreadsheet
software, e.g. Excel) and check for instance cell G9. The
stringsAsFactors = TRUE option in read.csv
automatically converts character variables into factor (categorical)
variables. This is useful when yo know that these variables represent
categories (like here states).
setwd("YOUR WORKING DORECTORY")
data <- read.csv(XXXX,na="XXXX", stringsAsFactors = TRUE)
str(data)
## 'data.frame': 1071 obs. of 19 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : int 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
## $ State_code : Factor w/ 51 levels "AK","AL","AR",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ State : Factor w/ 51 levels "Alabama","Alaska",..: 2 1 4 3 5 6 7 9 8 10 ...
## $ Population : int 633687 4467634 2691571 5273477 34479458 4425687 3432835 574504 795699 16356966 ...
## $ Mechanism : Factor w/ 1 level "Firearm": 1 1 1 1 1 1 1 1 1 1 ...
## $ pol_ubc : int 0 0 0 0 1 0 1 NA 0 0 ...
## $ pol_vmisd : int 0 0 0 0 1 0 1 NA 1 0 ...
## $ pol_mayissue : int 0 1 0 0 1 1 1 NA 1 0 ...
## $ Age.Adjusted.Rate: num 14.83 16.41 15.27 15.92 9.32 ...
## $ logy : num 2.7 2.8 2.73 2.77 2.23 ...
## $ ur : num 6.28 5.18 4.76 4.72 5.47 ...
## $ law.officers : int 1821 15303 7538 18548 106244 15172 9825 4716 2964 63041 ...
## $ law.officers.pc : num 287 343 280 352 308 ...
## $ vcrime : int 3696 19203 12042 28275 210661 15334 11387 4845 4845 129839 ...
## $ vcrime.pc : num 583 430 447 536 611 ...
## $ alcc.pc : num 2.67 1.86 1.74 2.5 2.2 ...
## $ incarc : int 3033 24741 11489 27710 157142 14888 17507 NA 6841 72404 ...
## $ incarc.pc : num 479 554 427 525 456 ...
You got it right if the output from str(data) looks like
the above.
A variable that is used in the paper but not yet included in the “US_Gun_example.csv” dataset is the age structure of a state’s population. In the Siegel et al. (2019) paper you will find that they use a variable called “Percent male among population ages 15-29”. We shall attempt to use a different variable “Proportion of 18-24 year olds in the population”. You will see below that adding this data to our datasets require a bit of work. With enough work we could add the data used in the paper, but for today’s exercise we will make our life a little easier. But the work done in what follows is quite typical of the work that needs doing when you merge data.
We shall import a new datafile that contains some of that information. The data are sourced from the StatsAmerica website. Download the “US states population age and sex.csv” file from the Week 3 BB page and save it into your working folder.
data_pop <- read.csv(XXXX, stringsAsFactors = TRUE)
str(data_pop)
## 'data.frame': 63882 obs. of 17 variables:
## $ IBRC_Geo_ID : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Statefips : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Countyfips : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Description : Factor w/ 3197 levels "Abbeville County, SC",..: 2892 2892 2892 2892 2892 2892 2892 2892 2892 2892 ...
## $ Year : int 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
## $ Total.Population : int 282162411 284968955 287625193 290107933 292805298 295516599 298379912 301231207 304093966 306771529 ...
## $ Population.0.4 : int 19178293 19298217 19429192 19592446 19785885 19917400 19938883 20125962 20271127 20244518 ...
## $ Population.5.17 : int 53197896 53372958 53507265 53508312 53511850 53606269 53818831 53893443 53833475 53889649 ...
## $ Population.18.24 : int 27315274 27992652 28480708 28916746 29302179 29441546 29602839 29808025 30194274 30530346 ...
## $ Population.25.44 : int 84973340 84523274 83990295 83398001 83066831 82764185 82638980 82509693 82399959 82211153 ...
## $ Population.45.64 : int 62428040 64491563 66695526 68828899 70935234 73137401 75216272 77068373 78617510 80272688 ...
## $ Population.65. : int 35069568 35290291 35522207 35863529 36203319 36649798 37164107 37825711 38777621 39623175 ...
## $ Population.Under.18: int 72376189 72671175 72936457 73100758 73297735 73523669 73757714 74019405 74104602 74134167 ...
## $ Population.18.54 : int 150287588 151902194 152463197 153134701 153998940 154701635 155527978 156257657 157054680 157608587 ...
## $ Population.55. : int 59498634 60395586 62225539 63872474 65508623 67291295 69094220 70954145 72934684 75028775 ...
## $ Male.Population : int 138443407 139891492 141230559 142428897 143828012 145197078 146647265 148064854 149489951 150807454 ...
## $ Female.Population : int 143719004 145077463 146394634 147679036 148977286 150319521 151732647 153166353 154604015 155964075 ...
This file has 63882 rows. How many would you have expected if there
were 51 states and 21 years for each state? Exactly, 1,071, which is the
number of rows in the data object. You need to figure out
why there are so many rows before we can merge data into the
data dataframe. Have a look at the spreadsheet. Can you see
the problem/issue?
The spreadsheet includes data for every county and not only for the whole state of Alabama. For instance you see that some rows have in the description column only the name of a state and others have the name of a county. To illustrate this look at the following snippet from the data table:
data_pop[1979:1982,]
Note that in lines 1979 and 1980 we are having data for the whole
state of Arizona and in lines 1981 and 1982 you find data for Apache
County (in Arizona). We only want statewide data. In the above snippet
you can see that there is a variable called Countyfips
which is a numerical code for the different counties. The statewide data
have a value of 0 in the Countyfips variable. You should
confirm (by looking at the data) that this is true for the other states
as well.
One additional aspect of the data is that you will see that
population data are only available from 2000 to 2019. This is not
aligned with the 2001 to 2021 date range in data. The
common years are 2001 to 2019 and therefore we should expect to get 969
(=51*19) observations which we can match.
Let us first filter out the statewide data and remove the county level data.
data_pop <- data_pop %>% filter(Countyfips == 0) # we only keep data with Countyfips equal to 0
You will notice that this dataframe now has
rnrow(data_pop2)` rows of data. This is still too many
rows. Let’s look at the different geographies in our dataset.
unique(data_pop$Description)
## [1] U.S. Alabama Alaska
## [4] Arizona Arkansas California
## [7] Colorado Connecticut Delaware
## [10] District of Columbia Florida Georgia
## [13] Hawaii Idaho Illinois
## [16] Indiana Iowa Kansas
## [19] Kentucky Louisiana Maine
## [22] Maryland Massachusetts Michigan
## [25] Minnesota Mississippi Missouri
## [28] Montana Nebraska Nevada
## [31] New Hampshire New Jersey New Mexico
## [34] New York North Carolina North Dakota
## [37] Ohio Oklahoma Oregon
## [40] Pennsylvania Rhode Island South Carolina
## [43] South Dakota Tennessee Texas
## [46] Utah Vermont Virginia
## [49] Washington West Virginia Wisconsin
## [52] Wyoming Puerto Rico
## 3197 Levels: Abbeville County, SC Acadia Parish, LA ... Ziebach County, SD
You will immediately see that there are also observations for the
entire U.S.. So, let’s extract the data that are from states and years
which are also represented in data, our original dataset.
Complete the following code for this task.
state_list <- unique(data$XXXX) # creates a list with state names in data
year_list <- XXXX(XXXX$Year) # creates a list of years in data
data_pop <- data_pop %>% filter(Description %in% XXXX) %>%
filter(XXXX XXXX year_list)
You got it right if data_pop has 969 observations and
you can replicate the following table:
summary(data_pop$Total.Population)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 494657 1651059 4219239 6041273 6859789 39512223
Let’s look at the variables that are contained in this datafile.
names(data_pop)
## [1] "IBRC_Geo_ID" "Statefips" "Countyfips"
## [4] "Description" "Year" "Total.Population"
## [7] "Population.0.4" "Population.5.17" "Population.18.24"
## [10] "Population.25.44" "Population.45.64" "Population.65."
## [13] "Population.Under.18" "Population.18.54" "Population.55."
## [16] "Male.Population" "Female.Population"
We shall not merge all of these variables into data but
only what we want, namely the “Proportion of 18-24 year olds in the
population”. That is actually not one of the variables in the list.
There is the population between 18 and 24
(Population.18.24) and the overall population
(Total.Population) and we can calculate the proportion we
need as a new variable, prop18.24. Complete the following
code:
data_pop$prop18.24 <- 100*XXXX$Population.18.24/data_pop$XXXX
You get it right if you can replicate these summary statistics for the new variable.
summary(data_pop$prop18.24)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.955 9.463 9.867 9.944 10.282 14.385
Now we select only the variables we wish to merge into
data, namely only prop18.24. However, in order
to merge the data into data we also need the year
(Year) and state name (Description).
data_pop <- data_pop %>% select(Year, Description, prop18.24)
It is easiest to merge datafiles if the variables on which we want to
match (state name and Year) are called the same in both datasets
(data and data_pop). This is true for the
Year variable, but not for the state name
(State in data and Description in
data_pop). Let’s fix that and change the state variable
name in data_pop to State.
names(data_pop)[names(data_pop)=="Describtion"] <- "State"
Then look at names(data_pop) and see whether you
achieved what you wanted … no, you didn’t? The name has not changed?
Sometimes you make a mistake but there is no error message. Look at the
previous line again and try and figure out what the problem is, correct
it and rename the variable. But the message here is an important one.
Don’t assume that just because R ran your line and didn’t spit out an
error message that everything you wanted to happen did happen. You
should always check whether the result is as expected.
names(data_pop)[names(data_pop)=="Description"] <- "State"
Now we are in a position to merge the two datafiles.
data2 <- merge(data,data_pop)
As result your datafile has gained one variable,
prop18.24, but lost a few rows. By default, the merge
function deletes rows for which it did not have matching rows in both
datafiles and therefore all 2020 and 2021 observations have gone. Look
at the help for merge (by typing ?merge into the console)
and find the change you need in the above line to make sure that we keep
all 1071 observations from the data dataframe. Then re-run
the above line.
data2 <- merge(data,data_pop, all.x = TRUE)
Your data2 dataframe should end up with 1071 rows and 20
variables.
Merging datasets is a super important skill, so let’s practice this
here again. We wish to differentiate between different regions in the
U.S. In your data2 dataframe one of the information is the
state, coded by both State and State_code
variables. What we need is an additional variable that tells you which
region the state is in.
So, for instance:
| State | State code | Region |
|---|---|---|
| Alabama | AL | South |
| Alaska | AK | West |
| Arizona | AZ | West |
You will first have to find a dataset on the internet that maps
states to regions. Go to your favorite search engine and search for
something like “csv us States and regions”. Alternatively you could
enlist the help of an AI. For instance you could go to Google Bart and
ask something like “create a csv file that maps US states to regions”.
Then save that file into your working directory and merge the Region
variable into your data2 file.
states_info <- read_xlsx("../data/states.xlsx")
# The variable names may be different in your file
states_info <- states_info %>% select(STATEAB, Region)
names(states_info)[names(states_info)=="STATEAB"] <- "State_code"
data2 <- merge(data2, states_info)
Make sure that the region variable is called Region and
that the State or State code variable name aligns with that in
data. If you got it right the following code should give
you the same result
tab_regions <- data2 %>% select(State_code, Region) %>%
unique() %>%
group_by(Region) %>%
summarise(n = n()) %>%
print()
## # A tibble: 4 × 2
## Region n
## <chr> <int>
## 1 Midwest 12
## 2 Northeast 9
## 3 South 17
## 4 West 13
This table shows that there are 12 states in the Midwest region and 9 in the Northeast. Altogether the U.S. is divided into four regions.
Here we will practice some time-series plotting. Let’s start with a
simple plot for prop18.24 for Florida. You googled the
internet to find an example for how to create a time-series plot using
ggplot and found the following lines which seem relevant. In the bit of
code you did find on the internet you see the element
subset(dataset, country == "Brazil"). This takes a
dataframe called dataset and extracts all rows for which
the variable country takes the value “Brazil”. Adjust this
code to create the following plot from the data in your
data2.
g1 <- ggplot(subset(dataset, country == "Brazil"), aes(x=dates,y=deaths_weekly)) +
geom_line(size=1) +
ggtitle("Covid-19 weekly deaths, Brazil")
g1
Now you want to compare this proportion between Florida and Texas. As
you create your graph you should be able to select the data from these
two states by using
subset(data2, State %in% c("Florida","Texas")). How to
create separate lines for the two states you learned in the Week 2
lecture (or better the accompanying code).
If you wish you could experiment with the theme options
in ggplot. Why not try the following code:
Check out the GGplot cheat sheet for more tricks and illustrations of the ggplot packages’ capabilities and other themes available.
What we now do is to aggregate or average data across the sample
period. We shall use the awesome power of the tidyverse language to do
this. We want to calculate the average prop18.24, the
average rate of firearm deaths (Age.Adjusted.Rate), the
average for law.officers.pc and the average of
ur for each state.
tab1 <- data2 %>% group_by(State) %>%
summarise(avg_prop18.24 = mean(prop18.24),
avg_fad.rate = mean(Age.Adjusted.Rate),
avg_law.officers.pc = mean(law.officers.pc),
avg_ur = mean(ur)) %>%
print()
## # A tibble: 51 × 5
## State avg_prop18.24 avg_fad.rate avg_law.officers.pc avg_ur
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama NA 18.5 310. 6.04
## 2 Alaska NA 19.9 274. 6.93
## 3 Arizona NA 15.3 338. 6.26
## 4 Arkansas NA 17.0 316. 5.55
## 5 California NA 8.40 314. 7.32
## 6 Colorado NA 12.4 336. 5.33
## 7 Connecticut NA 5.22 275. 5.89
## 8 Delaware NA 10.4 355. 5.32
## 9 District of Columbia NA 18.1 791. 7.40
## 10 Florida NA 11.9 371. 5.69
## # ℹ 41 more rows
As you can see the code calculated the average values for the firearm
death rate (for instance, on average there were 18.5 firearm deaths per
100,000 in Alabama per year), but the code did not calculate the average
proportion of 18 to 24 year olds. We get “NA” for all states. The reason
for that is that some of the prop18.24 observations are not
available, in particular the data for years 2020 and 2021. The
mean function by default refuses to calculate the mean
value if any of the data are “NA”s. However, there is a way to instruct
the mean function to ignore missing values and calculate
the mean value on the basis of the available data. Check the help for
the mean function (type ?mean into the console) to find the
option you should add to the mean function to achieve this.
If you get it right you should be able to replicate the following table.
## # A tibble: 51 × 5
## State avg_prop18.24 avg_fad.rate avg_law.officers.pc avg_ur
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 9.86 18.5 310. 6.04
## 2 Alaska 10.4 19.9 274. 6.93
## 3 Arizona 9.93 15.3 338. 6.26
## 4 Arkansas 9.73 17.0 316. 5.55
## 5 California 10.2 8.40 314. 7.32
## 6 Colorado 9.79 12.4 336. 5.33
## 7 Connecticut 9.17 5.22 275. 5.89
## 8 Delaware 9.67 10.4 355. 5.32
## 9 District of Columbia 12.5 18.1 791. 7.40
## 10 Florida 8.89 11.9 371. 5.69
## # ℹ 41 more rows
It is possibly not good practice to calculate averages over different
sample sizes (over 2001 to 2019 for prop18.24 and over 2001
to 2021 for Age.Adjusted.Rate). We therefore repeat the
calculation but only for the years up to and including 2019.
There is one mistake in the code and you should get an error message.
tab1 <- data2 %>% filter(year <= 2019) %>%
group_by(State) %>%
summarise(avg_prop18.24 = mean(prop18.24, na.rm = TRUE),
avg_fad.rate = mean(Age.Adjusted.Rate),
avg_law.officers.pc = mean(law.officers.pc),
avg_ur = mean(ur)) %>%
print()
Fix the error to obtain the following table.
## # A tibble: 51 × 6
## State Region avg_prop18.24 avg_fad.rate avg_law.officers.pc avg_ur
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama South 9.86 17.9 305. 6.16
## 2 Alaska West 10.4 19.5 275. 6.88
## 3 Arizona West 9.93 15.1 341. 6.24
## 4 Arkansas South 9.73 16.4 313. 5.59
## 5 California West 10.2 8.36 315. 7.17
## 6 Colorado West 9.79 12.0 336. 5.25
## 7 Connecticut North… 9.17 5.11 278. 5.76
## 8 Delaware South 9.67 9.87 354. 5.19
## 9 District of Col… South 12.5 17.6 797. 7.40
## 10 Florida South 8.89 11.7 374. 5.61
## # ℹ 41 more rows
Let’s create a few plots which show the average death numbers against some of our state specific information.
ggplot(tab1,aes(avg_prop18.24,avg_fad.rate)) +
geom_point() +
ggtitle("Proportion of 18 to 24 year olds v Deaths by Firearms")
In order to demonstrate another trick in ggplot box of tricks we will
also use the Region information.
ggplot(tab1,aes(avg_prop18.24,avg_fad.rate)) +
geom_point() +
facet_wrap(vars(Region)) +
ggtitle("Proportion of 18 to 24 year olds v Deaths by Firearms")
Very neat indeed. What we learn from these scatterplots is that there is no obvious correlation between the average proportions of 18 to 24 year olds and the rate of firearm deaths.
Let’s perform some hypothesis tests to check whether there are significant differences between the average rates of cases and deaths since June 2020 between regions.
We therefore continue to work with the data in tab1. In
tab2 we calculate regional averages.
tab2 <- tab1 %>%
group_by(Region) %>%
summarise(RAvg_cases = mean(avg_fad.rate),
n = n()) %>% print()
## # A tibble: 4 × 3
## Region RAvg_cases n
## <chr> <dbl> <int>
## 1 Midwest 10.1 12
## 2 Northeast 6.55 9
## 3 South 14.4 17
## 4 West 13.0 13
Let’s see whether we find the regional averages to be statistically
significantly different. Say we compare the avg_fad.rate in
the Northeast to that in the Midwest. So test the null hypothesis that
\(H_0: \mu_{NE}=\mu_{MW}\) (or \(H_0: \mu_{NE}-\mu_{MW}=0\)) against the
alternative hypothesis that \(H_A:
\mu_{NE}\neq\mu_{MW}\), where \(\mu\) represents the average firearm death
rate (per 100,000 population) in states in the respective region over
the sample period.
test_data_NE <- tab1 %>%
filter(Region == "Northeast") # pick Northeast states
test_data_MW <- tab1 %>%
filter(Region == "Midwest") # pick Midwest states
t.test(test_data_NE$avg_fad.rate,test_data_MW$avg_fad.rate, mu=0) # testing that mu = 0
##
## Welch Two Sample t-test
##
## data: test_data_NE$avg_fad.rate and test_data_MW$avg_fad.rate
## t = -3.2399, df = 15.565, p-value = 0.005282
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.895230 -1.225443
## sample estimates:
## mean of x mean of y
## 6.548129 10.108465
The difference in the averages is 6.548 - 10.108 = -3.56 (more than 3 in 100,000 population). We get a t-test statistic of about -3.2. If in truth the two means were the same (\(H_0\) was correct) then we should expect the test statistic to be around 0. Is -3.2 far enough away from 0 for us to conclude that we should stop supporting the null hypothesis? Is -3.2 large (in absolute terms) enough?
The answer is yes and the p-value does tell us that it is. The p-value is 0.005282 0.53%. This means that if the \(H_0\) was correct, the probability of getting a difference of -3.56 (per 100,000 population) or a more extreme difference is 0.53%. We judge this probability to be too small for us to continue to support the \(H_0\) and we reject the \(H_0\). We do so as the p-value is smaller than any of the usual significance levels (10%, 5% or 1%).
We are not restricted to testing whether two population means are the same. You could also test whether the difference in the population is anything different but 0. Say a politician claims that evidently the firearm death rate rate in the Northeast is smaller by more than 3 per 100,000 population than the firearm death rate in the Midwest.
Here our \(H_0\) is \(H_0: \mu_{NE}=\mu_{MW}-3\) (or \(\mu_{NE}-\mu_{MW}=-3\)) and we would test this against an alternative hypothesis of \(H_0: \mu_{NE}<\mu_{MW}-3\) (or \(H_0: \mu_{NE}-\mu_{MW}<-3\)). Here the statement of the politician is represented in the \(H_A\).
# testing that mu = -3
t.test(test_data_NE$avg_fad.rate,test_data_MW$avg_fad.rate, mu=-3, alternative = "less")
##
## Welch Two Sample t-test
##
## data: test_data_NE$avg_fad.rate and test_data_MW$avg_fad.rate
## t = -0.5099, df = 15.565, p-value = 0.3086
## alternative hypothesis: true difference in means is less than -3
## 95 percent confidence interval:
## -Inf -1.638469
## sample estimates:
## mean of x mean of y
## 6.548129 10.108465
Note the following. The parameter mu now takes the value
-3 as we are hypothesising that the difference in the means is -3 (or
smaller than that in the \(H_A\)).
Also, in contrast to the previous test we now care whether the deviation
is less than -3. In this case we wonder whether it is really smaller.
Hence we use the additional input into the test function,
alternative = "less". (The default for this input is
alternative = "two.sided" and that is what is used, as in
the previous case, if you don’t add it to the t.test
function). Also check ?t.test for an explanation of these
optional input parameters.
Again we find ourselves asking whether the sample difference we obtained (-3.56) is consistent with the null hypothesis (of the population difference being -3). The p-value is 0.3086, so the probability of obtaining a sample difference as big as -3.56 (or smaller) is just a little over 30%. Say we set out to perform a test at a 10% significance level, then we would judge that a probability of just above 30% is larger than that p-value and we would fail to reject the null hypothesis.
So let’s perform another test. A Republican governor of a Southern state in the U.S. claims that the average firearm death rate in the South is just as big as the one in the West. Perform the appropriate hypothesis test.
test_data_SO XXXX tab1 %>%
filter(XXXX == "South") # pick Southern states
XXXX <- tab1 XXXX
XXXX(XXXX == "West") # pick Western states
XXXX(XXXX$avg_fad.rate,XXXX$XXXX, XXXX=0) # testing that mu = 0
##
## Welch Two Sample t-test
##
## data: test_data_SO$avg_fad.rate and test_data_WE$avg_fad.rate
## t = 1.0149, df = 19.808, p-value = 0.3224
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.515748 4.384902
## sample estimates:
## mean of x mean of y
## 14.42065 12.98607
The p-value is certainly larger than any of the usual significance levels and we fail to reject \(H_0\). This means that the opposition governor’s statement is supported by the data or at least the data do not contradict it.
To perform inference in the context of regressions it pays to use an
additional package, the car package. So please load this
package.
library(car)
If you get an error message it is likely that you first have to install that package.
In the lecture we talked about the following regression (Lecture Week 2 - Regression Analysis - Example 3)
\(vcrime.pc_i = \alpha + \beta ~law.officer.pc_i + u_i\)
Let us estimate this again, using the subset function to filter the
2021 data (as in the lecture) from data2.
mod1 <- lm(vcrime.pc~law.officers.pc,data=subset(data2, Year == 2021))
stargazer(mod1,type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## vcrime.pc
## -----------------------------------------------
## law.officers.pc 0.204
## (0.193)
##
## Constant 311.421***
## (61.296)
##
## -----------------------------------------------
## Observations 50
## R2 0.023
## Adjusted R2 0.002
## Residual Std. Error 156.747 (df = 48)
## F Statistic 1.114 (df = 1; 48)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Let’s change the dependent variable to the rate of firearm deaths
(Age.Adjusted.rate or AAR for short) and use
several explanatory variables, the number of law officers, the
unemployment rate and the amount of per capita alcohol consumption.
Also, let’s use all the years of data in our dataset.
\(AAR_i = \alpha + \beta_1 ~law.officer.pc_i + \beta_2 ~ur_i + \beta_3 ~alcc.pc_i + u_i\)
We will estimate two models, one with only
law.officers.pc as the explanatory variable and one with
all three explanatory variables.
mod2 <- lm(Age.Adjusted.Rate~law.officers.pc,data=data2)
mod3 <- lm(Age.Adjusted.Rate~law.officers.pc+ur+alcc.pc,data=data2)
stargazer(mod2,mod3,type = "text")
##
## ====================================================================
## Dependent variable:
## ------------------------------------------------
## Age.Adjusted.Rate
## (1) (2)
## --------------------------------------------------------------------
## law.officers.pc 0.006*** 0.007***
## (0.002) (0.002)
##
## ur 0.039
## (0.076)
##
## alcc.pc -0.528*
## (0.286)
##
## Constant 10.301*** 11.151***
## (0.501) (0.865)
##
## --------------------------------------------------------------------
## Observations 1,048 1,048
## R2 0.014 0.017
## Adjusted R2 0.013 0.014
## Residual Std. Error 4.907 (df = 1046) 4.902 (df = 1044)
## F Statistic 14.461*** (df = 1; 1046) 6.118*** (df = 3; 1044)
## ====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
How would we interpret the value of \(\hat{\beta}_1 = 0.007\)? For a one unit increase in the explanatory variable (law officers per 100,000) we would expect the number of firearm deaths to increase by 0.007. Is that a lot or is that economically significant? There are two aspects to that. Is an increase of 1 officer per 100,000 a lot? To answer that we need to know how many officers there typically are. And to know whether 0.007 is a large increase we need to know how many firearm deaths there typically are.
Let’s look at the summary stats to help with this judgement.
summary(data2[c("Age.Adjusted.Rate", "law.officers.pc", "ur", "alcc.pc")])
## Age.Adjusted.Rate law.officers.pc ur alcc.pc
## Min. : 2.14 Min. : 37.08 Min. : 2.100 Min. :1.271
## 1st Qu.: 8.84 1st Qu.:255.98 1st Qu.: 4.204 1st Qu.:2.110
## Median :11.72 Median :296.35 Median : 5.283 Median :2.344
## Mean :12.05 Mean :310.21 Mean : 5.648 Mean :2.441
## 3rd Qu.:15.05 3rd Qu.:343.69 3rd Qu.: 6.737 3rd Qu.:2.657
## Max. :33.82 Max. :894.46 Max. :13.733 Max. :4.799
## NA's :23
Now we can judge the economic importance of this effect. One officer
extra per 100,000 population is not a large change when on average,
across all states and years, the average number of officers is 310. So
let’s consider a 5% increase in the number of officers as this seems
like a perhaps feasible but significant policy. That would be around 15
officers per 100,000. This means the effect on the rate of firearm
deaths would be 15*0.007 =0.105. Now the question is whether this would
be a sizable effect on the outcome variable
(Age.Adjusted.Rate)? The average rate is 12.05, which
implies that the effect of increasing the number of law enforcement
officers by 5% would be to increase the number of firearm deaths by less
than 1%. This certainly does not imply a very large effect.
You can see that on the face of it, higher numbers of law enforcement officers seem to suggest a higher rate of firearm deaths. This initially certainly seems counter-intuitive until you realise that it is quite likely that police will have higher numbers in states in which crime is a bigger problem. This is a classic example of simultaneity. Crime impacts the numbers of police and the numbers of police may impact crime. So this is an excellent example to understand that just looking at regression results you cannot make causal statements, here you would not be justified in arguing that higher numbers of law enforcement officers cause more crime.
We may also want to look at the \(R^2\) of the above regression. You can see that they are both very small 0.014 and 0.017, meaning that both regressions explain less than 2% o fthe variation in the dependent variable.
Let us investigate a little further why this regression explains so little variation. We plot a scatter graph where different colors represents different states.
ggplot(data2,aes(law.officers.pc,Age.Adjusted.Rate, color = State)) +
geom_point() +
guides(color = "none") + # removes the legend
ggtitle("Number of law enforcement officers v Deaths by Firearms")
What you can see from here is that observations for the same state cluster together and that different states seem to differ significantly between each other. This variation is not reflected in the above estimation. In next week’s computer lab you will see how this issue can be tackled.
In today’s computer lab you achieved a lot. You produced various plots and started to explore the amazing power of the ggplot package. You merged data, performed hypothesis tests and learned how to run and display multiple regressions.
TAKE A BREAK!