Firearm Laws - Data Acquisition and Management (Part 1)

Introduction

Here we are looking at replicating aspects of this paper:

Siegel et al (2019) The Impact of State Firearm Laws on Homicide and Suicide Deaths in the USA, 1991–2016: a Panel Study, J Gen Intern Med 34(10):2021–8.

In this paper the authors evaluate whether the introduction of different types of gun control laws have a causal impact on a range of outcome variables, e.g. the rate of gun related deaths. In these walk-throughs we will attempt to first access the data used in the paper (this Part 1) and then to replicate the empirical results. We will be unable to obtain the exact same dataset and therefore our attempt will only be partially successful. But in any case the process shown here will reflect the process of an empirical piece of research.

In Part 2 we will use the data acquired to implement a Difference in Difference methodology to evaluate the effectiveness of Laws that attempt to restrict the ownership of firearms on the number of firearm deaths.

Load packages

Let’s start by loading the packages we will use throughout this work.

library(tidyverse)
library(readxl)
library(stringr)
library(ggplot2)
library(plm)          # for panel data
library(sandwich)     # for cluster robust standard errors
library(stargazer)    # for nice regression output
library(stringr)      # to extract strings

Data

The core data are information about Gun Laws and fatality statistics

Gun Law data

Siegel and co-authors collated this information and published them on https://www.statefirearmlaws.org/. The following uploads the database (“Firearmlaws DATABASE_0.xlsx”). Each row indicates a state-year and in columns indicates whether in a particular state-year a policy is in place. From the database you can also download a spreedsheet that adds explanations to the variable definitions. This is important information and we download it also (“Firearmlaws codebook_0.xlsx”). Both files (and any later datafiles) are saved in a data subfolder.

# set data directory
datadir <- "../data/"

law_data <- read_excel(paste0(datadir,"Firearmlaws DATABASE_0.xlsx"))
law_data_codebook <- read_excel(paste0(datadir,"Firearmlaws codebook_0.xlsx"))
names(law_data_codebook) = make.names(names(law_data_codebook)) # This eliminates spaces in variable names

The last line of the above code ensures that variable names have no spaces. For instance it turns the variable name Sub Category into Sub.Category. This makes working with variables much easier.

The following lists all the different policy categories used.

names(law_data)

##   [1] "state"                        "year"                        
##   [3] "felony"                       "invcommitment"               
##   [5] "invoutpatient"                "danger"                      
##   [7] "drugmisdemeanor"              "alctreatment"                
##   [9] "alcoholism"                   "relinquishment"              
##  [11] "violent"                      "violenth"                    
##  [13] "violentpartial"               "dealer"                      
##  [15] "dealerh"                      "recordsall"                  
##  [17] "recordsallh"                  "recordsdealer"               
##  [19] "recordsdealerh"               "reportall"                   
##  [21] "reportallh"                   "reportdealer"                
##  [23] "reportdealerh"                "purge"                       
##  [25] "residential"                  "theft"                       
##  [27] "security"                     "inspection"                  
##  [29] "ammlicense"                   "ammrecords"                  
##  [31] "permit"                       "permith"                     
##  [33] "fingerprint"                  "training"                    
##  [35] "permitlaw"                    "registration"                
##  [37] "registrationh"                "defactoreg"                  
##  [39] "defactoregh"                  "ammpermit"                   
##  [41] "ammrestrict"                  "age21handgunsale"            
##  [43] "age18longgunsale"             "age21longgunsaled"           
##  [45] "age21longgunsale"             "age21handgunpossess"         
##  [47] "age18longgunpossess"          "age21longgunpossess"         
##  [49] "loststolen"                   "amm18"                       
##  [51] "amm21h"                       "universal"                   
##  [53] "universalh"                   "gunshow"                     
##  [55] "gunshowh"                     "universalpermit"             
##  [57] "universalpermith"             "backgroundpurge"             
##  [59] "ammbackground"                "threedaylimit"               
##  [61] "mentalhealth"                 "statechecks"                 
##  [63] "statechecksh"                 "waiting"                     
##  [65] "waitingh"                     "assault"                     
##  [67] "onefeature"                   "assaultlist"                 
##  [69] "assaultregister"              "assaulttransfer"             
##  [71] "magazine"                     "tenroundlimit"               
##  [73] "magazinepreowned"             "onepermonth"                 
##  [75] "traffickingbackground"        "traffickingprohibited"       
##  [77] "traffickingprohibitedh"       "strawpurchase"               
##  [79] "strawpurchaseh"               "microstamp"                  
##  [81] "gvro"                         "gvrolawenforcement"          
##  [83] "college"                      "collegeconcealed"            
##  [85] "elementary"                   "opencarryh"                  
##  [87] "opencarryl"                   "opencarrypermith"            
##  [89] "opencarrypermitl"             "permitconcealed"             
##  [91] "mayissue"                     "showing"                     
##  [93] "ccbackground"                 "ccbackgroundnics"            
##  [95] "ccrenewbackground"            "ccrevoke"                    
##  [97] "nosyg"                        "personalized"                
##  [99] "lockd"                        "lockp"                       
## [101] "locked"                       "lockstandards"               
## [103] "capliability"                 "capaccess"                   
## [105] "capuses"                      "capunloaded"                 
## [107] "cap18"                        "cap16"                       
## [109] "cap14"                        "junkgun"                     
## [111] "liability"                    "immunity"                    
## [113] "preemption"                   "preemptionnarrow"            
## [115] "preemptionbroad"              "mcdv"                        
## [117] "mcdvdating"                   "mcdvsurrender"               
## [119] "mcdvsurrendernoconditions"    "mcdvsurrenderdating"         
## [121] "mcdvremovalallowed"           "mcdvremovalrequired"         
## [123] "incidentremoval"              "incidentall"                 
## [125] "dvro"                         "dvrodating"                  
## [127] "exparte"                      "expartedating"               
## [129] "dvrosurrender"                "dvrosurrendernoconditions"   
## [131] "dvrosurrenderdating"          "expartesurrender"            
## [133] "expartesurrendernoconditions" "expartesurrenderdating"      
## [135] "dvroremoval"                  "stalking"                    
## [137] "lawtotal"

You can get details on particular policies by looking at the codebook. For instance if you wish to know what policy is “junkgun”.

print(toString(law_data_codebook[law_data_codebook$Variable.Name == "junkgun","Detailed.Description.of.Provision"]))

## [1] "The law prohibits the sale of handguns that fail to meet one or more of the following requirements: (1) Passes drop testing and firing testing; (2) Passes a melting point test; (3) Possesses specific handgun safety features; (4) Appears on a list of approved handguns. This may or may not apply to private sellers."

The policies are also in a number of different categories (variable category in law_data_codebook) and sub categories (variable Sub.category). Let’s see what categories there are and how many different policies fall under each of them.

table(law_data_codebook$Category)

## 
##                       Ammunition regulations 
##                                            7 
## Assault weapons and large-capacity magazines 
##                                            8 
##                            Background checks 
##                                           11 
##                            Buyer regulations 
##                                           17 
##                      Child access prevention 
##                                           11 
##                   Concealed carry permitting 
##                                            7 
##                           Dealer regulations 
##                                           17 
##                            Domestic violence 
##                                           21 
##                              Gun trafficking 
##                                            7 
##                                     Immunity 
##                                            1 
##                       Possession regulations 
##                                           12 
##                                   Preemption 
##                                            3 
##    Prohibitions for high-risk gun possession 
##                                           11 
##                            Stand your ground 
##                                            1

The most recent observation from the policy database is 2020.

State level fatality data

Data on state level fatalities and their reason can be obtained from the US Center for Disease Control, in particular the “Fatal and Nonfatal Injury Reports”. Following the lead from the Siegel et al (2019) paper we are after the age adjusted homicide rates.

Data are in state level files for Years. For instance you can find a file that provides annual information on causes of deaths and also broken down by the race of the deceased. The data for the state of Alabama is available from (“AL by race.csv”). Let’s upload these data.

deaths_data_race <- read_csv(paste0(datadir,"AL by race.csv"), na = "--")
names(deaths_data_race) = make.names(names(deaths_data_race))
# only keep rows that start with the state name 
deaths_data_race <- deaths_data_race %>% filter(State == "Alabama")
head(deaths_data_race) # displays the first few observations

## # A tibble: 6 × 9
##   State    Year Mechanism   Race  Deaths Population Crude.Rate Age.Adjusted.Rate
##   <chr>   <dbl> <chr>       <chr> <chr>       <dbl> <chr>      <chr>            
## 1 Alabama  2020 Cut/Pierce  White 24        3445489 0.7        0.69             
## 2 Alabama  2020 Cut/Pierce  Black 24        1350111 1.78       1.77             
## 3 Alabama  2020 Cut/Pierce  Asia… <NA>        87585 <NA>       <NA>             
## 4 Alabama  2020 Drowning (… White 59        3445489 1.71       1.7              
## 5 Alabama  2020 Drowning (… Black 16**      1350111 1.19**     1.26**           
## 6 Alabama  2020 Drowning (… Amer… <NA>        38347 <NA>       <NA>             
## # ℹ 1 more variable: Years.of.Potential.Life.Lost <dbl>

When checking the notes to the table you will realise that they do not report number of deaths if they are below 10. For example there is no reported number of deaths by “Cut/Pierce” for “Asien / HI Native / Pac.Islander” ethnicity. That does not mean that there were none. All we know that there were not more than 10. For values below 20 they are labeled as unsafe values (labelled with a **). The fact that numbers below 10 are not recorded means that we cannot get statewide numbers by aggregating across all races.

While it would be very interesting to do this analysis broken down by races, and the above data are suitable for such an analysis, we will not pursue this line here and we will have to base our analysis on different datasets, not the datafiles that break the numbers down by race.

Let’s look at the data for Alabama not broken down by race (“AL.csv”):

deaths_data <- read_csv(paste0(datadir,"AL.csv"), na = "--")
names(deaths_data) = make.names(names(deaths_data))
# only keep rows that start with the state name 
deaths_data <- deaths_data %>% filter(State == "Alabama")

# remove "**", needs applying twice
remove_astast <- function(x){ return (str_remove(x, "[**]"))}
test <- data.frame(lapply(deaths_data,remove_astast))
deaths_data <- data.frame(lapply(test,remove_astast))

# turn data into numeric
deaths_data$Deaths <- as.numeric(deaths_data$Deaths)
head(deaths_data,10)

##      State Year                           Mechanism Deaths Population
## 1  Alabama 2021                          Cut/Pierce     48    5039877
## 2  Alabama 2021 Drowning (includes water transport)     82    5039877
## 3  Alabama 2021                                Fall    321    5039877
## 4  Alabama 2021                          Fire/Flame    113    5039877
## 5  Alabama 2021                Hot object/Substance     NA    5039877
## 6  Alabama 2021                             Firearm     NA    5039877
## 7  Alabama 2021                           Machinery     12    5039877
## 8  Alabama 2021               Natural/Environmental     34    5039877
## 9  Alabama 2021                      Drug Poisoning     NA    5039877
## 10 Alabama 2021                  Non-Drug Poisoning     79    5039877
##    Crude.Rate Age.Adjusted.Rate Years.of.Potential.Life.Lost
## 1        0.95                 1                         1009
## 2        1.63              1.59                         1924
## 3        6.37              5.43                         1382
## 4        2.24              1.89                         1208
## 5        <NA>              <NA>                         <NA>
## 6       26.09             26.29                        34236
## 7        0.24              0.21                          160
## 8        0.68              0.56                          504
## 9       27.94              30.1                        33079
## 10       1.57              1.59                         1500

Let’s explore which types of causes of death are in the dataset.

print("Causes of death")

## [1] "Causes of death"

unique(deaths_data$Mechanism)

##  [1] "Cut/Pierce"                                    
##  [2] "Drowning (includes water transport)"           
##  [3] "Fall"                                          
##  [4] "Fire/Flame"                                    
##  [5] "Hot object/Substance"                          
##  [6] "Firearm"                                       
##  [7] "Machinery"                                     
##  [8] "Natural/Environmental"                         
##  [9] "Drug Poisoning"                                
## [10] "Non-Drug Poisoning"                            
## [11] "Struck by / against"                           
## [12] "Suffocation"                                   
## [13] "Motor vehicle, traffic"                        
## [14] "Pedal cyclist, other"                          
## [15] "Pedestrian, other"                             
## [16] "Transport, other land"                         
## [17] "Transport, other (excl. drown by water transp)"
## [18] "Other specified and classifiable"              
## [19] "Other specified / NEC"                         
## [20] "Unspecified"                                   
## [21] "Overexertion"

You will see that these indicate the “mechanism” of death, but do not indicate whether a death was a homicide, accident or suicide. Recall that we are really after the homicide data. We are returning to this point a little later.

When downloading the data from CDC website you can only download the data for one state at the time. Here are the filter settings used when downloading data from the WISQARS Fatal and Nonfatal Injury Reports.

image:

After downloading these data you should have 51 such files, one for each state. This is the point where we have to do a bit of organising. Make sure that the files have a consistent and clear naming structure. Here we saved the files under the two letter abbreviation for each state, for instance “AL” for Alabama. At this stage you should make sure that you have a list of all the states and the associated abbreviations. You can get such a list from many places. Search the internet for something like “list of us states and abbreviations”.

I saved such a table into an Excel file

USstates <- read_excel(paste0(datadir,"USStates_Dictionary.xlsx"))
head(USstates)

## # A tibble: 6 × 4
##   State      State_code State_num_code Region
##   <chr>      <chr>               <dbl> <chr> 
## 1 Alabama    AL                      1 South 
## 2 Alaska     AK                      2 West  
## 3 Arizona    AZ                      4 West  
## 4 Arkansas   AR                      5 South 
## 5 California CA                      6 West  
## 6 Colorado   CO                      8 West

Working with 51 files is awkward and we should merge these data together into one dataframe. If we know that there will be 52 files all structured the same way (as they all come from the same source) and all named consistently, using the two letter abbreviation, then we should be able to “automate” the merging process.

Roughly, the following code does the following. Import the data from the first state (Alabama) and save that into deaths_data. Then loop through the remaining states. Upload the new states data into temp_data and add (using the bind_rows function) temp_data into deaths_data. Then move on to the next state and do the same. In that way the deaths_data file will grow until it contains the data for all 51 states.

# start with first state
USstates_code <- USstates$State_code[1]   # pick first state code
USstates_name <- USstates$State[1]        # pick first state name
filename <- paste0(datadir, USstates_code, ".csv") # create text variable with filename to upload

deaths_data <- read_csv(filename, na = "--")
# only keep rows that start with the state name 
deaths_data <- deaths_data %>% filter(State == "Alabama")

for (i in seq(2,nrow(USstates))){
  USstates_code <- USstates$State_code[i]
  USstates_name <- USstates$State[i]
  filename <- paste0(datadir, USstates_code, ".csv")
  
  # get the next file
  temp_data <- read_csv(filename, na = "--")
  # only keep rows that start with the state name 
  temp_data <- temp_data %>% filter(State == USstates_name)
  
  # attach to deaths_data
  deaths_data <- bind_rows(deaths_data,temp_data)

}


#after loop
names(deaths_data) = make.names(names(deaths_data))

# remove "**", needs applying twice
remove_astast <- function(x){ return (str_remove(x, "[**]"))}
test <- data.frame(lapply(deaths_data,remove_astast))
deaths_data <- data.frame(lapply(test,remove_astast))

# turn data into numeric
deaths_data$Year <- as.numeric(deaths_data$Year)
deaths_data$Deaths <- as.numeric(deaths_data$Deaths)
deaths_data$Age.Adjusted.Rate <- as.numeric(deaths_data$Age.Adjusted.Rate)
deaths_data$Population <- as.numeric(deaths_data$Population)
deaths_data$Crude.Rate <- as.numeric(deaths_data$Crude.Rate)
deaths_data$Years.of.Potential.Life.Lost <- as.numeric(deaths_data$Years.of.Potential.Life.Lost)

After going through this code you should find that tdeaths_data has 20,895 rows of data and 8 variables.

At this stage we have the information on the laws and the information on the death rates in two different data objects, law_data and deaths_data. Let us now merge these two data sets into one big data object. First we compare the variable names to ensure that we have matching variables

names(deaths_data)

## [1] "State"                        "Year"                        
## [3] "Mechanism"                    "Deaths"                      
## [5] "Population"                   "Crude.Rate"                  
## [7] "Age.Adjusted.Rate"            "Years.of.Potential.Life.Lost"

names(law_data)

##   [1] "state"                        "year"                        
##   [3] "felony"                       "invcommitment"               
##   [5] "invoutpatient"                "danger"                      
##   [7] "drugmisdemeanor"              "alctreatment"                
##   [9] "alcoholism"                   "relinquishment"              
##  [11] "violent"                      "violenth"                    
##  [13] "violentpartial"               "dealer"                      
##  [15] "dealerh"                      "recordsall"                  
##  [17] "recordsallh"                  "recordsdealer"               
##  [19] "recordsdealerh"               "reportall"                   
##  [21] "reportallh"                   "reportdealer"                
##  [23] "reportdealerh"                "purge"                       
##  [25] "residential"                  "theft"                       
##  [27] "security"                     "inspection"                  
##  [29] "ammlicense"                   "ammrecords"                  
##  [31] "permit"                       "permith"                     
##  [33] "fingerprint"                  "training"                    
##  [35] "permitlaw"                    "registration"                
##  [37] "registrationh"                "defactoreg"                  
##  [39] "defactoregh"                  "ammpermit"                   
##  [41] "ammrestrict"                  "age21handgunsale"            
##  [43] "age18longgunsale"             "age21longgunsaled"           
##  [45] "age21longgunsale"             "age21handgunpossess"         
##  [47] "age18longgunpossess"          "age21longgunpossess"         
##  [49] "loststolen"                   "amm18"                       
##  [51] "amm21h"                       "universal"                   
##  [53] "universalh"                   "gunshow"                     
##  [55] "gunshowh"                     "universalpermit"             
##  [57] "universalpermith"             "backgroundpurge"             
##  [59] "ammbackground"                "threedaylimit"               
##  [61] "mentalhealth"                 "statechecks"                 
##  [63] "statechecksh"                 "waiting"                     
##  [65] "waitingh"                     "assault"                     
##  [67] "onefeature"                   "assaultlist"                 
##  [69] "assaultregister"              "assaulttransfer"             
##  [71] "magazine"                     "tenroundlimit"               
##  [73] "magazinepreowned"             "onepermonth"                 
##  [75] "traffickingbackground"        "traffickingprohibited"       
##  [77] "traffickingprohibitedh"       "strawpurchase"               
##  [79] "strawpurchaseh"               "microstamp"                  
##  [81] "gvro"                         "gvrolawenforcement"          
##  [83] "college"                      "collegeconcealed"            
##  [85] "elementary"                   "opencarryh"                  
##  [87] "opencarryl"                   "opencarrypermith"            
##  [89] "opencarrypermitl"             "permitconcealed"             
##  [91] "mayissue"                     "showing"                     
##  [93] "ccbackground"                 "ccbackgroundnics"            
##  [95] "ccrenewbackground"            "ccrevoke"                    
##  [97] "nosyg"                        "personalized"                
##  [99] "lockd"                        "lockp"                       
## [101] "locked"                       "lockstandards"               
## [103] "capliability"                 "capaccess"                   
## [105] "capuses"                      "capunloaded"                 
## [107] "cap18"                        "cap16"                       
## [109] "cap14"                        "junkgun"                     
## [111] "liability"                    "immunity"                    
## [113] "preemption"                   "preemptionnarrow"            
## [115] "preemptionbroad"              "mcdv"                        
## [117] "mcdvdating"                   "mcdvsurrender"               
## [119] "mcdvsurrendernoconditions"    "mcdvsurrenderdating"         
## [121] "mcdvremovalallowed"           "mcdvremovalrequired"         
## [123] "incidentremoval"              "incidentall"                 
## [125] "dvro"                         "dvrodating"                  
## [127] "exparte"                      "expartedating"               
## [129] "dvrosurrender"                "dvrosurrendernoconditions"   
## [131] "dvrosurrenderdating"          "expartesurrender"            
## [133] "expartesurrendernoconditions" "expartesurrenderdating"      
## [135] "dvroremoval"                  "stalking"                    
## [137] "lawtotal"

When you merge data files you need to be very clear what you wish to merge. Here we have data at the State-Year level. So we want to merge data from the same State-Year together into the same row. In law_data we have one row of data for each state-year.

law_data %>% filter(state == "Alabama", year == "2015")

## # A tibble: 1 × 137
##   state    year felony invcommitment invoutpatient danger drugmisdemeanor
##   <chr>   <dbl>  <dbl>         <dbl>         <dbl>  <dbl>           <dbl>
## 1 Alabama  2015      0             1             0      1               0
## # ℹ 130 more variables: alctreatment <dbl>, alcoholism <dbl>,
## #   relinquishment <dbl>, violent <dbl>, violenth <dbl>, violentpartial <dbl>,
## #   dealer <dbl>, dealerh <dbl>, recordsall <dbl>, recordsallh <dbl>,
## #   recordsdealer <dbl>, recordsdealerh <dbl>, reportall <dbl>,
## #   reportallh <dbl>, reportdealer <dbl>, reportdealerh <dbl>, purge <dbl>,
## #   residential <dbl>, theft <dbl>, security <dbl>, inspection <dbl>,
## #   ammlicense <dbl>, ammrecords <dbl>, permit <dbl>, permith <dbl>, …

But note that in deaths_data we actually have 20 rows of data for each state-year. One row for each mechanism of death.

deaths_data %>% filter(State == "Alabama", Year == "2015")

##      State Year                                      Mechanism Deaths
## 1  Alabama 2015                                     Cut/Pierce     41
## 2  Alabama 2015            Drowning (includes water transport)     90
## 3  Alabama 2015                                           Fall    265
## 4  Alabama 2015                                     Fire/Flame     92
## 5  Alabama 2015                           Hot object/Substance     NA
## 6  Alabama 2015                                        Firearm    958
## 7  Alabama 2015                                      Machinery     18
## 8  Alabama 2015                          Natural/Environmental     25
## 9  Alabama 2015                                 Drug Poisoning    736
## 10 Alabama 2015                             Non-Drug Poisoning     71
## 11 Alabama 2015                            Struck by / against     17
## 12 Alabama 2015                                    Suffocation    246
## 13 Alabama 2015                         Motor vehicle, traffic    929
## 14 Alabama 2015                           Pedal cyclist, other     NA
## 15 Alabama 2015                              Pedestrian, other     21
## 16 Alabama 2015                          Transport, other land     28
## 17 Alabama 2015 Transport, other (excl. drown by water transp)     NA
## 18 Alabama 2015               Other specified and classifiable     25
## 19 Alabama 2015                          Other specified / NEC     20
## 20 Alabama 2015                                    Unspecified    271
##    Population Crude.Rate Age.Adjusted.Rate Years.of.Potential.Life.Lost
## 1     4854803       0.85              0.86                          823
## 2     4854803       1.85              1.80                         2356
## 3     4854803       5.46              4.81                         1019
## 4     4854803       1.90              1.74                         1330
## 5     4854803         NA                NA                           NA
## 6     4854803      19.73             19.60                        22293
## 7     4854803       0.37              0.31                          155
## 8     4854803       0.52              0.47                          222
## 9     4854803      15.16             15.70                        16715
## 10    4854803       1.46              1.43                         1360
## 11    4854803       0.35              0.32                          238
## 12    4854803       5.07              4.90                         4720
## 13    4854803      19.14             18.56                        21472
## 14    4854803         NA                NA                           NA
## 15    4854803       0.43              0.45                          662
## 16    4854803       0.58              0.59                          566
## 17    4854803         NA                NA                           NA
## 18    4854803       0.52              0.52                          704
## 19    4854803       0.41              0.38                          277
## 20    4854803       5.58              5.09                         2466

The aim here is to keep the 20 rows of data for the different mechanisms and add the same law data to each of the 20 rows of data (say for Alabama in 2015).

We want to match on State and Year, but note that the variables are capitalised in deaths_data and not capitalised in law_data. As R is case sensitive we will have to harmonise the names before merging.

names(law_data)[names(law_data) == "state"] <- "State"
names(law_data)[names(law_data) == "year"] <- "Year"
merge_data <- merge(deaths_data,law_data, all.x = TRUE)

We end up with 20895 rows of data. This is the same number of row as in deaths_data. Check out what would have happened had you set the all.x option to FALSE. Read the help for the merge function to understand what this option does.

It is important to understand why you have the number of observations you have. Here we have 51 states, 21 years and 21 methods of death. If we had a complete set of observations this would deliver 22,491 obs, but for some states we are “missing” some methods of death.

Let us check this to make sure

table1 <- merge_data %>% group_by(State, Year) %>%
                          summarise(n = n()) %>%  # counts number of rows in each group
                          spread(Year,n) %>% print()

## # A tibble: 51 × 22
## # Groups:   State [51]
##    State   `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010`
##    <chr>    <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
##  1 Alabama     20     19     19     20     20     19     20     21     19     20
##  2 Alaska      18     19     19     19     17     19     19     18     17     18
##  3 Arizona     20     20     21     20     20     20     20     20     20     20
##  4 Arkans…     19     19     21     20     20     20     20     20     19     20
##  5 Califo…     20     21     21     20     20     21     21     20     21     21
##  6 Colora…     19     20     19     21     20     20     20     21     21     20
##  7 Connec…     20     18     19     20     20     19     19     20     19     20
##  8 Delawa…     18     18     17     17     18     17     19     18     18     17
##  9 Distri…     18     16     14     17     13     18     18     14     18     18
## 10 Florida     20     20     21     20     20     20     21     21     20     21
## # ℹ 41 more rows
## # ℹ 11 more variables: `2011` <int>, `2012` <int>, `2013` <int>, `2014` <int>,
## #   `2015` <int>, `2016` <int>, `2017` <int>, `2018` <int>, `2019` <int>,
## #   `2020` <int>, `2021` <int>

As you can see from this table, the number of observations by State/Year is often less than 21.

Unfortunately the data series, when accessed through this database, only start in 2001. In the Siegel et al. (2019) paper data from 1991 are used (up to and including 2016). This means that here we can use data from 2001 to 2020 (last policy observation).

Before we continue there is one extra item to take care off. We already merged two datafiles and we will merge data from more datafiles a little later. When one merges on state names there is always the risk that different datasets use slightly different spelling and therefore the data may not match perfectly. This is one reason why there are codes for states and countries. In the US there are two letter codes representing states. They are contained in the USstates dataframe we previously loaded, but they are not yet contained in the merge_data dataframe. Let us add an additional column State_code with that two letter code.

merge_data <- merge(merge_data, USstates)
names(merge_data)[names(merge_data)=="Code"] <- "State_code"

We will use this to facilitate the merging of datafiles.

Summary Stats and Data Exploration

Let’s calculate some firearm death statistics. The data from the CDC deliver an age adjusted rate which indicates the number of deaths for every 100,000 of population in state $s$ in a particular year $t$. Let’s average that rate in each sate across all years.

table2 <- merge_data %>% filter(Mechanism == "Firearm") %>% 
                          group_by(State) %>% 
                          summarise(avg_aarate = mean(Age.Adjusted.Rate,na.rm = TRUE), n=n()) %>% 
                          arrange(-avg_aarate) %>%  # sort in decreasing order of avg_aarate
                          print()

## # A tibble: 51 × 3
##    State                avg_aarate     n
##    <chr>                     <dbl> <int>
##  1 Louisiana                  20.2    21
##  2 Alaska                     19.9    21
##  3 Mississippi                19.7    21
##  4 Alabama                    18.5    21
##  5 District of Columbia       18.1    21
##  6 Wyoming                    17.9    21
##  7 New Mexico                 17.2    21
##  8 Montana                    17.1    21
##  9 Arkansas                   17.0    21
## 10 Tennessee                  16.2    21
## # ℹ 41 more rows

If you are working on replicating or extending some published work you should try and see whether your data are the same as the ones used in the publication you are working off. This is best done by matching your own summary statistics to those in the paper. The Siegel et al (2019) paper does not really have any descriptive statistic, but report some data for 2016 (their Table 2). They report, for that year an age adjusted homicide rate of 14.2 and an age adjusted suicide rate of 14.1 in Louisiana. Let’s look at our data for state state-year:

merge_data %>% filter(Year == 2016, State == "Louisiana") %>% 
  select(State, Year, Mechanism, Age.Adjusted.Rate)

##        State Year                                      Mechanism
## 1  Louisiana 2016            Drowning (includes water transport)
## 2  Louisiana 2016                                           Fall
## 3  Louisiana 2016                          Other specified / NEC
## 4  Louisiana 2016                          Transport, other land
## 5  Louisiana 2016                                      Machinery
## 6  Louisiana 2016                                     Cut/Pierce
## 7  Louisiana 2016 Transport, other (excl. drown by water transp)
## 8  Louisiana 2016                             Non-Drug Poisoning
## 9  Louisiana 2016                                     Fire/Flame
## 10 Louisiana 2016                                        Firearm
## 11 Louisiana 2016                                 Drug Poisoning
## 12 Louisiana 2016                              Pedestrian, other
## 13 Louisiana 2016                            Struck by / against
## 14 Louisiana 2016               Other specified and classifiable
## 15 Louisiana 2016                         Motor vehicle, traffic
## 16 Louisiana 2016                          Natural/Environmental
## 17 Louisiana 2016                           Hot object/Substance
## 18 Louisiana 2016                                    Suffocation
## 19 Louisiana 2016                           Pedal cyclist, other
## 20 Louisiana 2016                                    Unspecified
##    Age.Adjusted.Rate
## 1               2.57
## 2               6.45
## 3               0.72
## 4               0.50
## 5               0.22
## 6               0.78
## 7               0.36
## 8               1.25
## 9               1.31
## 10             21.20
## 11             21.91
## 12              0.50
## 13              0.37
## 14              0.75
## 15             17.08
## 16              0.63
## 17                NA
## 18              7.37
## 19                NA
## 20              3.19

You will realise that the data we obtained does not actually record whether a death was a suicide or homicide, but rather we have information on the mechanism of death. Of course a death by firearm could be either a homicide or a suicide. Therefore, we really do not have the data to replicate what they have actually done. Here we will proceed by working with the firearm data.

In the table above you can see that we have a value of 21.20 (per 100,000 population) firearm deaths in Louisiana for 2016. That is not the same as the 14.2 in the paper but we wouldn’t expect this as this doesn’t measure homicides. But most likely it is quite highly correlated and it is certainly, size wise in the same ballpark.

Let’s look at a few time-series plots of the data. Whenever you are working with data that have a time-series dimension you should do this as this will give the reader a better understanding of the data.

plot1data <- merge_data %>% filter(Mechanism == "Firearm") %>% 
                            filter(State %in% c("Alabama","Tennessee","Maine", "New York"))
p1 <- ggplot(plot1data, aes( x = Year, y = Age.Adjusted.Rate, color = State)) +
            geom_line() +
            labs(title="Age Adjusted Death by Firearm Rates")
p1

There clearly seems to be an upward trend in the number of firearm deaths after 2015, but for New York in this selection of States.

More Data Setup

In this section we will prepare the information on gun control laws in a way that makes it useable in further regression analysis. We will also load more covariates into the dataset.

Policy set-up

Late we wish to evaluate whether a particular gun control policy has a causal effect. For instance consider universal background checks. We can refer back to the law_data_codebook to see which policy codes are available. In fact there are two policy columns that relate to universal background checks.

# the toString function makes the output easier to read. Try what happens if you don't use it
print(toString(law_data_codebook[law_data_codebook$Variable.Name == "universal","Detailed.Description.of.Provision"]))

## [1] "Both licensed dealers and private sellers must conduct background checks at point of purchase for all firearms. This may or may not include exemptions for buyers who have already undergone a background check for a concealed carry permit or other licensing requirements. Background checks must be explicitly required."

print(toString(law_data_codebook[law_data_codebook$Variable.Name == "universalh","Detailed.Description.of.Provision"]))

## [1] "Both licensed dealers and private sellers must conduct background checks at point of purchase for handguns. This may or may not include exemptions for buyers who have already undergone a background check for a concealed carry permit or other licensing requirements. Background checks must be explicitly required."

So the difference is that the universalh policy only covers handguns whereas universal covers all firearms. We shall create a new variable that combines these two. We will create a new variable $pol_{st}$ that takes the value 1 if either of these two policies is in place in state $s$ at time $t$.

To make this happen with the tidyverse approach we have to indicate to R, before calling the definition of the new variable mutate(pol_ubc = max(universal,universalh)) that the max here is an operation that should happen row-by-row (rowwise()).

merge_data <- merge_data %>% rowwise() %>%  mutate(pol_ubc = max(universal,universalh)) # universal background checks

Open the merge_data object to check that your operation actually did what it was meant to do.

A different policy to consider, as it was found to have significant effects in the Siegel et al. (2019) paper, is that of a ban for weapons for people who had violent misdemeanors on their records.

print(toString(law_data_codebook[law_data_codebook$Variable.Name == "violent","Detailed.Description.of.Provision"]))

## [1] "Law prohibits firearm possession by people who have committed violent misdemeanors punishable by less than one year of imprisonment. Simple assault misdemeanors must be included. Does not count if there is an explicit exemption for crimes punishable by less than one year of imprisonment."

print(toString(law_data_codebook[law_data_codebook$Variable.Name == "violenth","Detailed.Description.of.Provision"]))

## [1] "Law prohibits handgun possession by people who have committed violent misdemeanors punishable by less than one year of imprisonment. Simple assault misdemeanors must be included. Does not count if there is an explicit exemption for crimes punishable by less than one year of imprisonment."

print(toString(law_data_codebook[law_data_codebook$Variable.Name == "violentpartial","Detailed.Description.of.Provision"]))

## [1] "Law prohibits firearm possession by people who have committed violent misdemeanors punishable by more than one year of imprisonment."

As for universal background checks we set up a policy variable pol_vmisd.

merge_data <- merge_data %>% rowwise() %>%  mutate(pol_vmisd = max(violent,violenth,violentpartial)) # universal background checks

In the Siegel et al. (2019) paper there was one policy which was estimated to have a positive effect on firearm deaths. This policy was labelled “Shall issue laws” and described as the “absence of any discretion of law enforcement authorities in deciding whether to grant a concealed carry permit”. When scanning the law_data_codebook it is not obvious which entry represents this policy.

You search the dataframe with the following function which finds the entries in law_data_codebook$Detailed.Description.of.Provision which contains the word “shall”

grepl("shall",law_data_codebook$Detailed.Description.of.Provision)

##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE

If R finds a cell which contains that word it will represent this with a “TRUE” in the results. There is none, so the word “shall” is not contained in any of the descriptions.

Re-read the above description and you will realise that it actually describes the absence of a policy to control who should be allowed to carry a concealed weapon. With that in mind scan the policies categorised as “Concealed carry permitting” policies. You will find a policy labelled “mayissue” (in line 77). The description is:

print(toString(law_data_codebook[law_data_codebook$Variable.Name == "mayissue","Detailed.Description.of.Provision"]))

## [1] "Law provides authorities with discretion in deciding whether to grant a concealed carry permit, or the law bans all concealed weapons."

Therefore it may well be that the policy in the Siegel et al. (2019) paper is coded as the absence of that policy (we cannot really be certain about this as the authors do not exactly describe this, although we could investigate further as in their Table 1 they also provide listings of states and number of policy changes). We could therefore code it as 1-mayissue. However, for now we just use “mayissue” as a policy which has a potential firearm death reducing effect (rather than an increasing effect of the absence of the policy).

merge_data <- merge_data %>% rowwise() %>%  mutate(pol_mayissue = mayissue) # universal background checks

There is a range of other policies that were all estimated to not have a policy impact. Of course the conclusion may change as we are looking at a different sample period.

Before we go to estimating models it is useful to figure out how much variation we can observe in the policy in the states. What we require for a difference-in-difference setup is variation in policy in at least one state. A good way to do that is to visualise the series of 0s and 1s for a particular policy across the states. This is not a standard visualisation task, so it is likely that you may wish to search the internet for advise on “R ggplot visualise elements of a matrix”. This admittedly already uses the a pre-conception that representing the 0s and 1s in a matrix is useful.

ggplot(merge_data, aes(x = Year, y = State)) + 
  geom_raster(aes(fill=pol_ubc)) + 
  scale_fill_gradient(low="grey90", high="red") +
  labs(x="Year", y="State", title="Policy - Universal Background Check")

There are two important things you can see from here.

There are no data for 2021 (as discussed above) and no data for the District of Columbia, reducing the number of effective states to 50.
Most states do not have any universal background checks policy. Some states have this policy in place for the entire sample period (e.g. Indiana).
Some states have a universal background check policy in place for the entire sample period, e.g. Rhode Island.
Some states introduce the policy during the sample period, e.g. Virginia. This is important as a DiD estimation relies on these states for identifying the policy effect.

Let’s create the similar plot for the other two policies.

ggplot(merge_data, aes(x = Year, y = State)) + 
  geom_raster(aes(fill=pol_vmisd)) + 
  scale_fill_gradient(low="grey90", high="red") +
  labs(x="Year", y="State", title="Policy - Violent Misdemeanors")

Here you can see that there is variation across states, but no variation in a state. This means that, with the given sample (only starting in 2000 due to the lack of earlier data for mechanisms of death), we cannot investigate the effectiveness of the violent misdemeanor policy. In the Siegel et al. (2019) paper they can investigate the policy as there is some variation between 1991 and 2000. Running the following code would demonstrate this (Result not shown here).

temp <- law_data
temp <- temp %>% rowwise() %>%  mutate(pol_vmisd = max(violent,violenth,violentpartial)) # universal background checks

ggplot(temp, aes(x = Year, y = State)) + 
  geom_raster(aes(fill=pol_vmisd)) + 
  scale_fill_gradient(low="grey90", high="red") +
  labs(x="Year", y="State", title="Policy -Violent Misdeameanor - 1991-2020")

Lastly for the mayissue policy.

ggplot(merge_data, aes(x = Year, y = State)) + 
  geom_raster(aes(fill=pol_mayissue)) + 
  scale_fill_gradient(low="grey90", high="red") +
  labs(x="Year", y="State", title="Policy - May Issue")

We can see that there is variation in the policy inside states and therefore we can investigate the policy. The variation is of the type that this policy existed at the beginning of the sample, but then was withdrawn, e.g. Wisconsin or Alabama. In fact this may be the reason why Siegel et al. (2019) used the reverse of this as a policy.

Data on covariates

The Siegel et al. (2019) paper uses a number of extra covariates that can be used in the DiD regressions.

State unemployment data
Number of law enforcement officers by state
Violent crime rate (excl homicides)
per capita alcohol consumption
age structure of the population

We will obtain the data where we can and then create datafiles with State-Year observations such that we can merge them into the merge_data dataframe. For this to work smoothly, each of these datafiles should have a column called State with the name of the state and a column called Year with the year. The Spelling has to match this spelling.

Unemployment data

Unemployment data are standard economic data and we can obtain them from FRED.

# note the skip = 1 to skip the first row, column headings are in row 2
# recall, always have a look at the original datafile
ur_data <- read.csv(paste0(datadir,"Unemployment Rate by State (Percent).csv"),skip = 1)
head(ur_data[,1:7])

##   Series.ID Region.Name Region.Code X01.01.2001 X01.02.2001 X01.03.2001
## 1      ALUR     Alabama           1         4.7         4.8         4.8
## 2      AKUR      Alaska           2         6.2         6.1         6.1
## 3      AZUR     Arizona           4         4.0         4.1         4.2
## 4      ARUR    Arkansas           5         4.3         4.4         4.5
## 5      CAUR  California           6         4.8         4.8         4.9
## 6      COUR    Colorado           8         2.7         2.8         2.9
##   X01.04.2001
## 1         4.9
## 2         6.1
## 3         4.2
## 4         4.5
## 5         5.0
## 6         3.1

The data are actually monthly data. And they are not in the long format (i.e. one row for a state-month observation). So there are two things we need to do with this file. 1) we need to turn it into a long format and then we need to average the unemployment rate over years. Let’s start with turning this into a long format.

We start by removing the Series.ID variable and the Region.Code and renaming the Region.Name variable to State.

ur_data <- ur_data %>% select(-Series.ID) %>% select(-Region.Code)
names(ur_data)[names(ur_data)=="Region.Name"] <- "State"

Now we use the pivot_longer function to turn the wide into a long dataset. The current column names (e.g. “X01.01.2001”) is turned into values for the new variable Date and the values in the table are turned into a new variable ur. The State variable is left as it is (!State basically tells the function “Not State”, meaning do not do anything with the sate variable).

It is unlikely that you will apply this function often enough to remember how this function works. You just need to know this function exists and then use the help function or examples on the internet to figure out what to do. And remember you cannot break the computer. Just try and test what does the job and what doesn’t.

ur_data_long <- ur_data %>% pivot_longer(!State, names_to = "Date", values_to = "ur")
head(ur_data_long)

## # A tibble: 6 × 3
##   State   Date           ur
##   <chr>   <chr>       <dbl>
## 1 Alabama X01.01.2001   4.7
## 2 Alabama X01.02.2001   4.8
## 3 Alabama X01.03.2001   4.8
## 4 Alabama X01.04.2001   4.9
## 5 Alabama X01.05.2001   4.9
## 6 Alabama X01.06.2001   5

Step 1 is done. You can now see that in the date column we have the monthly dates. They all have an “X” at the beginning which results from the fact that variable names in R have to start with a letter and when the data were imported R added an “X” to the column headings which were the dates. All we need from the dates is the year which in the last four digits. So what we want to extract from the date column is the last four digits. This is not an operation you come across very often and hence a web-search sounds like the way to go here. Go to your favourite search engine and search for something like “R extracting last characters”.

You may find different pieces of advice as there are many different ways to achieve this. Here I adopt a method which uses the str_sub function (Check out the help function by typing ?str_sub into the console.)

ur_data_long$Year <- as.numeric(str_sub(ur_data_long$Date,-4))
head(ur_data_long)

## # A tibble: 6 × 4
##   State   Date           ur  Year
##   <chr>   <chr>       <dbl> <dbl>
## 1 Alabama X01.01.2001   4.7  2001
## 2 Alabama X01.02.2001   4.8  2001
## 3 Alabama X01.03.2001   4.8  2001
## 4 Alabama X01.04.2001   4.9  2001
## 5 Alabama X01.05.2001   4.9  2001
## 6 Alabama X01.06.2001   5    2001

As you can see, this has now created a Year column. All we got to do now is to calculate an average unemployment rate for each state in each year.

ur_data_long <- ur_data_long %>% group_by(State, Year) %>% 
                                    summarise(ur = mean(ur))

Let’s plot some data to confirm that we have obtained sensible data.

plot1ur <- ur_data_long %>% 
                            filter(State %in% c("Alabama","Tennessee","Maine", "New York"))
p2 <- ggplot(plot1ur, aes( x = Year, y = ur, color = State)) +
            geom_line() +
            labs(title="Unemployment rate")
p2

This plot looks reasonable showing unemployment rate peaks in the aftermath of the financial crisis and Covid. ur_data_long is now ready to be merged with merge_data.

merge_data <- merge(merge_data,ur_data_long, all.x = TRUE)

Law enforcement officer numbers

The F.B.I. has a Crime Data Explorer and in there you can find a number of crime related data, amongst others the Law Enforcement Employees Data which provides a detailed breakdown of law enforcement employees for very small geographies like cities. As the datafile contains state information we can aggregate these to state level data.

The data are saved in “law enforcement employees FBIdata_1960_2022.csv”.

officers_data <- read.csv(paste0(datadir,"law enforcement employees FBIdata_1960_2022.csv"))

Have a look at the spreadsheet to get an idea of the data. There are more than 700,000 rows of data. One row represents the information for one agency in a particular year. But when inspecting the datafile you would have seen that there are lots of agencies.

Let’s have a look at one particular row of data to get a feel for the data.

officers_data[99,]

##    data_year pub_agency_name state_abbr county_name agency_type_name
## 99      2021       LaFayette         AL    CHAMBERS             City
##    total_pe_ct pe_ct_per_1000
## 99          12            4.2

Let’s start by changing the names of the State and Year variables.

names(officers_data)[names(officers_data)=="data_year"] <- "Year"
names(officers_data)[names(officers_data)=="state_abbr"] <- "State_code"

As we need the number of officers in a state in one year we need to aggregate over all agencies in a particular year in a state. We also restrict our attention to years 2001 onwards.

officers_data_agg <- officers_data %>% filter(Year >= 2001) %>% 
                                      group_by(State_code, Year) %>% 
                                      summarise(law.officers = sum(total_pe_ct))

Let’s look at the data to get a feel for the information in these data.

plot1off <- officers_data_agg %>% filter(State_code %in% c("AL","TN","ME", "WV"))
p2 <- ggplot(plot1off, aes( x = Year, y = law.officers, color = State_code)) +
            geom_line() +
            labs(title="Number of Law Enforcement Officers")
p2

You can see that the data in officer_data_agg are absolute numbers and we will have to scale them to number of officers per 100,000 population. We will do this after merging the data into merge_data.

merge_data <- merge(merge_data,officers_data_agg, all.x = TRUE)

As it turns out that there are no officers data for West Virginia for the years 2008 and 2014, it was important to add the option all.x = TRUE to ensure that these two years remain preserved for 2008 and 2014 in the merged dataset. Without this option, these two observations would be removed from the merge_data dataframe.

Before continuing we shall scale this variable by the population to calculate how many law enforcement officers a state has per 100,000 population.

merge_data$law.officers.pc <- 100000*merge_data$law.officers/merge_data$Population

plot1off <- merge_data %>% filter(State_code %in% c("AL","TN","ME", "WV")) %>% 
                            select(Year,law.officers.pc, State_code)
p2 <- ggplot(plot1off, aes( x = Year, y = law.officers.pc, color = State_code)) +
            geom_line() +
            labs(title="Number of Law Enforcement Officers (per 100,000 population)")
p2

There are three things we learn from this figure.

There are missing observations for West Virginia (WV) in 2008 and 2014. We didn’t see these missing observations in the previous graph as the R just interpolated the missing info. Now we tried to calculate with a missing observation, that produces a NA entry and causes a missing observation in the plot.
When comparing law officers by 100,000 population all states have similar type of numbers of officers. Roughly between 200 and 400.
There seems to be quite a lot of variation in these numbers in Alabama. And in Tennesee there is a peak of law officers in the year 2007. You may want to investigate why this is the case.

Violent crime rate

From the F.B.I. website you can access the Data Discovery Tool which is an interface which gives you access for a range of data. There are four categories of violent crime, homicide, rape, robbery and aggravated assault. Tick the last three and then download the data state by state for the years 2001 to 2021. (“add query” to add up to 5 states at a time, then submit, show the table and download; then reset query and chose the next 5 states).

image:

While doing this for every state is a chore (although you can do it for 5 states at a time) it is ok as the resulting tables are all formatted identically and therefore it is fairly straightforward to get all data into one table.

In fact, when you download data like this, it may be wise to download more data than you now think you need. For instance you could download data back to 1991 (the Siegel sample) just in case you were to find the additional earlier years of firearm deaths. The data are available from “Violent crime excl homicide by State.xlsx”.

vcrime_data <- read_excel(paste0(datadir,"Violent crime excl homicide by State.xlsx"))
head(vcrime_data)

## # A tibble: 6 × 52
##    Year    AL    AK    AZ    AR     CA    CO    CT    DE    DC     FL    GA
##   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1  2001 19203  3696 28275 12042 210661 15334 11387  4845  4845 129839 41073
## 2  2002 19628  3594 29784 11359 205993 15703 10723  4810  4810 127810 38665
## 3  2003 19032  3838 28197 12269 203144 15572 10933  5504  5504 123356 38778
## 4  2004 19070  4122 28538 13638 186783 16920 10013  5077  5077 122808 39604
## 5  2005 19304  4162 30033 14481 187675 18325  9437  5295  5295 125074 40161
## 6  2006 19171  4574 32923 15333 191997 18623 10368  5941  5941 128473 44149
## # ℹ 40 more variables: HI <dbl>, ID <dbl>, IL <dbl>, IN <dbl>, IA <dbl>,
## #   KS <dbl>, KY <dbl>, LA <dbl>, ME <dbl>, MD <dbl>, MA <dbl>, MI <dbl>,
## #   MN <dbl>, MS <dbl>, MO <dbl>, MT <dbl>, NE <dbl>, NV <dbl>, NH <dbl>,
## #   NJ <dbl>, NM <dbl>, NY <dbl>, NC <dbl>, ND <dbl>, OH <dbl>, OK <dbl>,
## #   OR <dbl>, PA <dbl>, RI <dbl>, SC <dbl>, SD <dbl>, TN <dbl>, TX <dbl>,
## #   UT <dbl>, VT <dbl>, VA <dbl>, WA <dbl>, WV <dbl>, WI <dbl>, WY <dbl>

You can see that the data are not arranged in the format which we need, i.e. one row for each State-Year. Rather we have years in rows and states in columns. We shall again apply the pivot_longer function.

vcrime_data_long <- vcrime_data %>% pivot_longer(!Year, names_to = "State_code", values_to = "vcrime")
head(vcrime_data_long)

## # A tibble: 6 × 3
##    Year State_code vcrime
##   <dbl> <chr>       <dbl>
## 1  2001 AL          19203
## 2  2001 AK           3696
## 3  2001 AZ          28275
## 4  2001 AR          12042
## 5  2001 CA         210661
## 6  2001 CO          15334

These are now in the right format for merging. Again we have number of crimes and we will want to transform it to a per 100,000 population measure once we merged this number into merge_data.

merge_data <- merge(merge_data,vcrime_data_long, all.x = TRUE)

Before continuing we shall scale this variable by the population to calculate how many law enforcement officers a state has per 100,000 population.

merge_data$vcrime.pc <- 100000*merge_data$vcrime/merge_data$Population

Homicide data

As exploring where to get the violent crime data we eventually also discovered a source for the homicide data as that was the fourth category of violent crime. With the same process (as for other violent crime) we can download the state by state homicide data. They have been collated in the “Homicide by State.xlsx” file.

homi_data <- read_excel(paste0(datadir,"Homicide by State.xlsx"))
head(homi_data)

## # A tibble: 6 × 52
##    Year    AL    AK    AZ    AR    CA    CO    CT    DE    DC    FL    GA    HI
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  1991   469    42   291   264  3859   199   187    37   482  1248   849    45
## 2  1992   455    44   312   259  3921   216   166    32   443  1208   741    42
## 3  1993   484    54   339   247  4096   206   206    35   454  1224   789    45
## 4  1994   501    38   426   294  3703   199   215    33   399  1165   703    50
## 5  1995   475    55   439   259  3531   216   150    25   360  1037   683    56
## 6  1996   444    45   377   219  2916   180   158    31   397  1077   630    40
## # ℹ 39 more variables: ID <dbl>, IL <dbl>, IN <dbl>, IA <dbl>, KS <dbl>,
## #   KY <dbl>, LA <dbl>, ME <dbl>, MD <dbl>, MA <dbl>, MI <dbl>, MN <dbl>,
## #   MS <dbl>, MO <dbl>, MT <dbl>, NE <dbl>, NV <dbl>, NH <dbl>, NJ <dbl>,
## #   NM <dbl>, NY <dbl>, NC <dbl>, ND <dbl>, OH <dbl>, OK <dbl>, OR <dbl>,
## #   PA <dbl>, RI <dbl>, SC <dbl>, SD <dbl>, TN <dbl>, TX <dbl>, UT <dbl>,
## #   VT <dbl>, VA <dbl>, WA <dbl>, WV <dbl>, WI <dbl>, WY <dbl>

As for the violent crime data we will go through the same process of applying pivot_longer, then merging with merge_data and then calculating the homicide rate per 100,000 population.

homi_data_long <- homi_data %>% pivot_longer(!Year, names_to = "State_code", values_to = "homi")
merge_data <- merge(merge_data,homi_data_long, all.x = TRUE)
merge_data$homi.pc <- 100000*merge_data$homi/merge_data$Population

Do we now have the homicide data used in the Siegel et al (2019) paper. Let’s see what the value for Louisiana is in 2016. In the paper that was 14.2.

merge_data %>% filter(State == "Louisiana", Year == 2016) %>% select(State, Year, homi.pc) %>% unique()

##       State Year  homi.pc
## 1 Louisiana 2016 11.85556

As you can see the value is not exactly the same. The reason is that the data we got from the F.B.I. website are not age adjusted data (as are the data from the CDC). This means that in the later analysis we can either use the age adjusted firearm deaths data or the non-age adjusted homicide data. We could of course also attempt to apply the age adjustment procedure, but we will not do this here.

Per-capita alcohol consumption

The Siegel paper indicates that two different sources were used to obtain a full set of data. One of the sources was the National Institute on Alcohol Abuse and Alcoholism. I therefore performed a web search for “National institute on alcohol abuse and alcoholism state data” and was directed to Surveillance Report webpage which seemed to contain under Surveillance Report #120 exactly what was needed. Under that item there was a link to a datafile which seemed to contain exactly the information needed.

This is a text file (it is best opened with a basic text editor, in Windows there is one called Notepad, on a Mac there is one called TextEdit). At the beginning of the file you will see information on the variables contained in the file. Here we have converted this text file into a csv file (which you can achieve by copying and pasting the portions of the file that actually contain the data into Excel and then applying the “Text to columns” function).

alcc_data <- read.csv(paste0(datadir,"Alcohol consumption by US state.csv"))
head(alcc_data)

##   Year State.no Type.of.beverage Gallons.of.beverage
## 1 1970        1                1             3863000
## 2 1970        1                2             1412000
## 3 1970        1                3            33098000
## 4 1970        1                4                   .
## 5 1970        2                1              945000
## 6 1970        2                2              470000
##   Gallons.of.ethanol..absolute.alcohol. Population..age.14.and.older.
## 1                               1738350                       2499000
## 2                                225920                       2499000
## 3                               1489410                       2499000
## 4                               3453680                       2499000
## 5                                425250                        205000
## 6                                 75200                        205000
##   Gallons.of.ethanol.per.capita.age.14.and.older
## 1                                           6956
## 2                                            904
## 3                                           5960
## 4                                          13820
## 5                                          20744
## 6                                           3668
##   Decile.for.per.capita.consumption.age.14.and.older
## 1                                                  9
## 2                                                  9
## 3                                                 10
## 4                                                 10
## 5                                                  1
## 6                                                  2
##   Population..age.21.and.older. Gallons.of.ethanol.per.capita.age.21.and.older
## 1                       2020000                                           8606
## 2                       2020000                                           1118
## 3                       2020000                                           7373
## 4                       2020000                                          17097
## 5                        165000                                          25773
## 6                        165000                                           4558
##   Decile.for.per.capita.consumption.age.21.and.older Type.of.data.source
## 1                                                  9                   .
## 2                                                  9                   .
## 3                                                 10                   .
## 4                                                 10                   .
## 5                                                  1                   .
## 6                                                  2                   .
##   Time.varying.alcohol.by.volume..ABV.
## 1                                    .
## 2                                    .
## 3                                    .
## 4                                    .
## 5                                    .
## 6                                    .
##   Gallons.of.ethanol.derived.from.time.varying.ABV
## 1                                                .
## 2                                                .
## 3                                                .
## 4                                                .
## 5                                                .
## 6                                                .

You can see that here, the states are not indexed by their name, nor by their two letter code, but rather by a number. So to be able to merge this with the remaining data we need to introduce either the name or the two letter code. At the top of the text file linked above you can see which number relates to which state. This is a standard state numbering, E.g. Wyoming is state 56, and this numbering was actually contained in the US States dictionary we uploaded earlier under the State_num_code variable which was also merged into the merge_data dataset.

We adjust the state number variable in alcc_data to the name of the numeric code column in merge_data such that we can match on these.

names(alcc_data)[names(alcc_data)=="State.no"] <- "State_num_code"

Finally we should identify the variable which we want to merge into merge_data. The Siegel et al. (2019) paper does not clearly state which series was to use, but for now we use the variable called Gallons.of.ethanol.per.capita.age.14.and.older and will then rename this variable to alcc_pc. Finally, upon inspecting the dataframe you will see that for each State-Year there are 4 lines with observations for beverage types 1, 2, 3 and 4. The data description in the earlier text file comes to help to identify 1:Spirits, 2: Wine, 3: Beer and 4: All beverages. We chose Type.of.beverage == "4".

names(alcc_data)[names(alcc_data)=="Gallons.of.ethanol.per.capita.age.14.and.older"] <- "alcc.pc"
alcc_data <- alcc_data %>% filter(Type.of.beverage == "4") %>% 
                  filter(Year >= 2001) %>% 
                  select(State_num_code, Year, alcc.pc) %>% 
                  mutate(alcc.pc = alcc.pc/10000) %>% 
                  arrange(State_num_code, Year)

Note that we divided the alcc.pc variable by 10,000. As described in the text file, this scales the variable to gallons per person. This file is ready for merging to merge_data.

merge_data <- merge(merge_data,alcc_data, all.x = TRUE)

Incarceration rate

The paper says that the data come from the U.S. Bureau of Justice Statistics website. If we search for that we get to this website. Using the search function from that website you do fairly quickly find your way to annual reports like this one for 2021 from where you can download the data tables of the report, one of which is the table which lists the number of prisoners by state. However, having to find this table for all years is a rather burdensome endeavour. Furthermore, there is no guarantee that the tables, across the years, have stayed in the same format and therefore collating a complete dataset in this way is unlikely to be straightforward.

On that page is a “Related Datsets” section and there is a link labeled “National Archive of Criminal Justice Data’s (NACJD) 1978-2021”. If you follow that link you get to this website from where you can download a complete dataset. I downloaded the ASCII version and converted that to a “csv” file which we can upload here as “Incarceration data.csv”. The file has data organised in State/Years by row. There are a lot (231) variables and as you download the data you should also download the codebook which explains what the variables are. From that document you can see that you can get information on incarceration numbers by gender and length of maximum sentence. There is a variable that gives the total number of person’s incarcerated in a state at 31 December of any given year. This variable is labelled CUSTTOT. While the Siegel et al. (2019) paper is unclear on what number they use this seems the most likely and we will use it here.

# we also specify codes for missing data
incarceration_data <- read.csv(paste0(datadir,"Incarceration data.csv"),na.strings = c("-9" , "-8" , "-2" , "-1"))

Let’s look at a few plots to become familiar with the data

plot1inc <- incarceration_data %>% 
                            filter(STATE %in% c("AL","TN","ME", "NY"))
p2 <- ggplot(plot1inc, aes( x = YEAR, y = CUSTOTT, color = STATE)) +
            geom_line() +
            labs(title="Incarceration Numbers")
p2

As you can see from the plot, there are only data up to 1982. To investigate what is going on we shall look at data for one state, say Alabama (AL).

test <- incarceration_data %>% filter(STATE == "AL")

If you look at the resulting table you will realise that apparently the variables that were collected changed between 1982 and 1983. In particular, the breakdown by gender is only available from 1983 onwards. The variables that may be of interest to us (after 1983) are CUSTOTM and CUSTOTF which capture the male and female totals in custody. The sum of the two should then represent the total number of people in custody. We therefore may now want to calculate CUSTOTT as the sum of these two for 1983 onwards.

incarceration_data <- incarceration_data %>% mutate(CUSTOTT = case_when(
                                                  !is.na(CUSTOTT) ~ CUSTOTT,
                                                  is.na(CUSTOTT) ~ CUSTOTF+CUSTOTM))

You should look at the dataset to confirm that we now have data in the CUSTOTT column for all years. We can also confirm this graphically.

plot1inc <- incarceration_data %>% 
                            filter(STATE %in% c("AL","TN","ME", "NY"))
p2 <- ggplot(plot1inc, aes( x = YEAR, y = CUSTOTT, color = STATE)) +
            geom_line() +
            labs(title="Incarceration Numbers")
p2

This looks good and in particular we cannot see any obvious structural break between 1982 and 1983, giving us confidence that our calculation is sensible.

Now we extract the data we will need for our analysis and which we will merge into our merge_data dataframe.

incarceration_data <- incarceration_data %>% filter(YEAR >= 2001) %>% 
                          select(YEAR, STATE, CUSTOTT)

Finally we will have to adjust the column names for the state code and the year to match those in merge_data to facilitate the merging process. We also change the name of the CUSTOTT variable.

names(incarceration_data)[names(incarceration_data)=="YEAR"] <- "Year"
names(incarceration_data)[names(incarceration_data)=="STATE"] <- "State_code"
names(incarceration_data)[names(incarceration_data)=="CUSTOTT"] <- "incarc"

As you can see these are the total number of person incarcerated. We will want to scale this to a number like person per 100,000 population. We can do that after merging as merge_data contains a variable which contains the size of the population in a state in a particular year.

merge_data <- merge(merge_data,incarceration_data, all.x = TRUE)

We now scale the variable to represent the number of people incarcerated for every 100,000.

merge_data$incarc.pc <- 100000*merge_data$incarc/merge_data$Population

Age proportion

A variable that is used in the paper is the age structure of a state’s population. In the Siegel et al. (2019) paper you will find that they use a variable called “Percent male among population ages 15-29”. We shall attempt to use a different variable “Proportion of 18-24 year olds in the population”. You will see below that adding this data to our datasets require a bit of work. With enough work we could add the data used in the paper, but for today’s exercise we will make our life a little easier.

We shall import a new datafile that contains some of that information. The data are sourced from the StatsAmerica website. The “US states population age and sex.csv” file should be saved into your working folder.

## 'data.frame':    63882 obs. of  17 variables:
##  $ IBRC_Geo_ID        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Statefips          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Countyfips         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Description        : Factor w/ 3197 levels "Abbeville County, SC",..: 2892 2892 2892 2892 2892 2892 2892 2892 2892 2892 ...
##  $ Year               : int  2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
##  $ Total.Population   : int  282162411 284968955 287625193 290107933 292805298 295516599 298379912 301231207 304093966 306771529 ...
##  $ Population.0.4     : int  19178293 19298217 19429192 19592446 19785885 19917400 19938883 20125962 20271127 20244518 ...
##  $ Population.5.17    : int  53197896 53372958 53507265 53508312 53511850 53606269 53818831 53893443 53833475 53889649 ...
##  $ Population.18.24   : int  27315274 27992652 28480708 28916746 29302179 29441546 29602839 29808025 30194274 30530346 ...
##  $ Population.25.44   : int  84973340 84523274 83990295 83398001 83066831 82764185 82638980 82509693 82399959 82211153 ...
##  $ Population.45.64   : int  62428040 64491563 66695526 68828899 70935234 73137401 75216272 77068373 78617510 80272688 ...
##  $ Population.65.     : int  35069568 35290291 35522207 35863529 36203319 36649798 37164107 37825711 38777621 39623175 ...
##  $ Population.Under.18: int  72376189 72671175 72936457 73100758 73297735 73523669 73757714 74019405 74104602 74134167 ...
##  $ Population.18.54   : int  150287588 151902194 152463197 153134701 153998940 154701635 155527978 156257657 157054680 157608587 ...
##  $ Population.55.     : int  59498634 60395586 62225539 63872474 65508623 67291295 69094220 70954145 72934684 75028775 ...
##  $ Male.Population    : int  138443407 139891492 141230559 142428897 143828012 145197078 146647265 148064854 149489951 150807454 ...
##  $ Female.Population  : int  143719004 145077463 146394634 147679036 148977286 150319521 151732647 153166353 154604015 155964075 ...

This file has 63882 rows. How many would you have expected if there were 51 states and 21 years for each state? 1,071 rows. The spreadsheet includes data for every county and not only for the whole state of Alabama. For instance you see that some rows have in the description column only the name of a state and others have the name of a county. To illustrate this look at the following snippet from the data table:

data_pop[1979:1982,]

##      IBRC_Geo_ID Statefips Countyfips       Description Year Total.Population
## 1979        4000         4          0           Arizona 2018          7158024
## 1980        4000         4          0           Arizona 2019          7278717
## 1981        4001         4          1 Apache County, AZ 2000            69507
## 1982        4001         4          1 Apache County, AZ 2001            67863
##      Population.0.4 Population.5.17 Population.18.24 Population.25.44
## 1979         432798         1204802           686858          1863583
## 1980         429788         1210448           693844          1903120
## 1981           6281           20398             6601            17384
## 1982           5821           19498             6530            16785
##      Population.45.64 Population.65. Population.Under.18 Population.18.54
## 1979          1713811        1256172             1637600          3399827
## 1980          1732884        1308633             1640236          3449054
## 1981            13070           5773               26679            31663
## 1982            13312           5917               25319            31174
##      Population.55. Male.Population Female.Population
## 1979        2120597         3560169           3604059
## 1980        2189427         3622802           3669041
## 1981          11165           34441             35066
## 1982          11370           33725             34138

Note that in lines 1979 and 1980 we are having data for the whole state of Arizona and in lines 1981 and 1982 you find data for Apache County (in Arizona). We only want statewide data. In the above snippet you can see that there is a variable called Countyfips which is a numerical code for the different counties. The statewide data have a value of 0 in the Countyfips variable. You should confirm (by looking at the data) that this is true for the other states as well.

One additional aspect of the data is that you will see that population data are only available from 2000 to 2019. This is not aligned with the 2001 to 2021 date range in data. The common years are 2001 to 2019 and therefore we should expect to get 969 (=51*19) observations which we can match.

Let us first filter out the statewide data and remove the county level data.

data_pop <- data_pop %>% filter(Countyfips == 0)  # we only keep data with Countyfips equal to 0

You will notice that this dataframe now has nrow(data_pop2) rows of data. This is still too many rows. Let’s look at the different geographies in our dataset.

unique(data_pop$Description)

##  [1] U.S.                 Alabama              Alaska              
##  [4] Arizona              Arkansas             California          
##  [7] Colorado             Connecticut          Delaware            
## [10] District of Columbia Florida              Georgia             
## [13] Hawaii               Idaho                Illinois            
## [16] Indiana              Iowa                 Kansas              
## [19] Kentucky             Louisiana            Maine               
## [22] Maryland             Massachusetts        Michigan            
## [25] Minnesota            Mississippi          Missouri            
## [28] Montana              Nebraska             Nevada              
## [31] New Hampshire        New Jersey           New Mexico          
## [34] New York             North Carolina       North Dakota        
## [37] Ohio                 Oklahoma             Oregon              
## [40] Pennsylvania         Rhode Island         South Carolina      
## [43] South Dakota         Tennessee            Texas               
## [46] Utah                 Vermont              Virginia            
## [49] Washington           West Virginia        Wisconsin           
## [52] Wyoming              Puerto Rico         
## 3197 Levels: Abbeville County, SC Acadia Parish, LA ... Ziebach County, SD

You will immediately see that there are also observations for the entire U.S.. So, let’s extract the data that are from states and years which are also represented in merge_data, our original dataset. Complete the following code for this task.

You got it right if data_pop has 969 observations and you can replicate the following table:

summary(data_pop$Total.Population)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   494657  1651059  4219239  6041273  6859789 39512223

Let’s look at the variables that are contained in this datafile.

names(data_pop)

##  [1] "IBRC_Geo_ID"         "Statefips"           "Countyfips"         
##  [4] "Description"         "Year"                "Total.Population"   
##  [7] "Population.0.4"      "Population.5.17"     "Population.18.24"   
## [10] "Population.25.44"    "Population.45.64"    "Population.65."     
## [13] "Population.Under.18" "Population.18.54"    "Population.55."     
## [16] "Male.Population"     "Female.Population"

We shall not merge all of these variables into data but only what we want, namely the “Proportion of 18-24 year olds in the population”. That is actually not one of the variables in the list. There is the population between 18 and 24 (Population.18.24) and the overall population (Total.Population) and we can calculate the proportion we need as a new variable, prop18.24. Complete the following code:

You get it right if you can replicate these summary statistics for the new variable.

summary(data_pop$prop18.24)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.955   9.463   9.867   9.944  10.282  14.385

Now we select only the variables we wish to merge into data, namely only prop18.24. However, in order to merge the data into data we also need the year (Year) and state name (Description).

data_pop <- data_pop %>%  select(Year, Description, prop18.24)

It is easiest to merge datafiles if the variables on which we want to match (state name and Year) are called the same in both datasets (merge_data and data_pop). This is true for the Year variable, but not for the state name (State in merge_data and Description in data_pop). Let’s fix that and change the state variable name in data_pop to State.

names(data_pop)[names(data_pop)=="Description"] <- "State"

Now we are in a position to merge the two datafiles.

merge_data <- merge(merge_data,data_pop, all.x = TRUE)

As result your datafile has gained one variable, prop18.24.

Final data prep

We have now assembled all the data. Recall that at this stage we still have all the mechanisms of death in the dataset. That means that for every year state we still have up to 21 observations. Let’s racall the different mechanisms of death.

unique(merge_data$Mechanism)

##  [1] "Unspecified"                                   
##  [2] "Firearm"                                       
##  [3] "Pedal cyclist, other"                          
##  [4] "Motor vehicle, traffic"                        
##  [5] "Drug Poisoning"                                
##  [6] "Fall"                                          
##  [7] "Cut/Pierce"                                    
##  [8] "Hot object/Substance"                          
##  [9] "Struck by / against"                           
## [10] "Other specified and classifiable"              
## [11] "Fire/Flame"                                    
## [12] "Machinery"                                     
## [13] "Natural/Environmental"                         
## [14] "Other specified / NEC"                         
## [15] "Suffocation"                                   
## [16] "Drowning (includes water transport)"           
## [17] "Pedestrian, other"                             
## [18] "Transport, other (excl. drown by water transp)"
## [19] "Non-Drug Poisoning"                            
## [20] "Transport, other land"                         
## [21] "Overexertion"

At this stage we will extract only the firearm data and continue working with these.

data3 <- merge_data %>% filter(Mechanism == "Firearm")

As expected we get $51 8 21 = 1071$ observations. Let us also rename the variable Age.Adjusted.Rate which contains the age adjusted rate of firearm deaths (per 100,000). The new name will be firearm.deaths.pc.

names(data3)[names(data3)=="Age.Adjusted.Rate"] <- "firearm.deaths.pc"

The datafile also has loads of variables. You can check the whole list with names(data3). In principle there is no problem with keeping them all in the datafile as there are not that many observations that the memory in your computer would be challenged. But sometimes it is nicer to work with a compact dataset. For this reason we will select the variables we are likely to use in further analysis.

data3 <- data3 %>% select(Year, State, State_code, State_num_code, Population, firearm.deaths.pc, Region, pol_ubc, pol_vmisd, pol_mayissue, ur, law.officers.pc,vcrime.pc, homi.pc, alcc.pc, incarc.pc,prop18.24)

Now we have a dataset with 1,071 rows and 17 variables. Let us save this into a csv file such that for further work we can just start with that new csv file.

write.csv(data3,paste0(datadir,"US Gun_example_v3.csv"))