Preparing your workfile

R is a powerful piece of software for statistical analysis. It is open source and hence free. It is constantly being developed and its functionality keeps improving. The “price” we pay for this is that we have to do a little more set-up work to get it running.

In particular we need packages, which are provided for free by researchers and programmers and which add useful functionality to base R. We load them with the library() function:

library(tidyverse)    # for almost all data handling tasks
library(readxl)       # to import Excel data
library(ggplot2)      # to produce nice graphics
library(stargazer)    # to produce nice results tables
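Before these library() calls can work, each package has to be installed once on your machine. A minimal sketch of a one-time setup (assuming an internet connection and a default CRAN mirror) is:

```r
# One-time setup: install any of the required packages that are missing.
pkgs <- c("tidyverse", "readxl", "ggplot2", "stargazer")
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

You only need to install a package once; library() then loads it at the start of every new R session.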

Introduction

Here we will replicate a seminal piece of work by David Card and Alan B. Krueger, Minimum Wage and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania, AER, 1994.

The data and the code in STATA are available from the data page of Angrist and Pischke’s Mostly Harmless Econometrics.

How to learn R

Learning any programming language, and yes, that is what R essentially is, can only be done by doing. Here we are not exposing you to a bottom-up introduction of R, but we are immediately exposing you to the exciting possibilities of R and the beauty and excitement of working with real-life data. As we do so we expose you to the basic tools of working with data in R and help you to learn these.

However, you have to accept from the outset that this is going to be a bumpy road! The most crucial skill you will have to embrace and develop is that of finding solutions to problems. You will, like any programmer, yes, even the most experienced ones, search the internet for solutions to your problems. Don’t think that you can remember all the important commands off the top of your head.

Whenever we introduce a new piece of R functionality we will briefly explain it, but importantly we will also give you links to further resources where you can find more usage examples and details. Also, do not forget to use the built-in help function in R (type ?FUNCTIONNAME into the Console and press Enter). This is important as the help is available offline as well.

R programming concepts used

The following R programming concepts were used in this document. The links lead to pages that provide help with these issues.

Importing Data

The data here are stored in xlsx format in CK_public.xlsx. In order to import them we require the readxl package which we loaded earlier.

ck_data <- read_xlsx("data/CK_public.xlsx")
str(ck_data)  # prints some basic info on variables
## Classes 'tbl_df', 'tbl' and 'data.frame':    410 obs. of  46 variables:
##  $ SHEET   : num  46 49 506 56 61 62 445 451 455 458 ...
##  $ CHAIN   : num  1 2 2 4 4 4 1 1 2 2 ...
##  $ CO_OWNED: num  0 0 1 1 1 1 0 0 1 1 ...
##  $ STATE   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SOUTHJ  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CENTRALJ: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NORTHJ  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PA1     : num  1 1 1 1 1 1 0 0 0 1 ...
##  $ PA2     : num  0 0 0 0 0 0 1 1 1 0 ...
##  $ SHORE   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NCALLS  : num  0 0 0 0 0 2 0 0 0 2 ...
##  $ EMPFT   : chr  "30" "6.5" "3" "20" ...
##  $ EMPPT   : chr  "15" "6.5" "7" "20" ...
##  $ NMGRS   : chr  "3" "4" "2" "4" ...
##  $ WAGE_ST : chr  "." "." "." "5" ...
##  $ INCTIME : chr  "19" "26" "13" "26" ...
##  $ FIRSTINC: chr  "." "." "0.37" "0.1" ...
##  $ BONUS   : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ PCTAFF  : chr  "." "." "30" "0" ...
##  $ MEALS   : num  2 2 2 2 3 2 2 2 1 1 ...
##  $ OPEN    : num  6.5 10 11 10 10 10 6 0 11 11 ...
##  $ HRSOPEN : num  16.5 13 10 12 12 12 18 24 10 10 ...
##  $ PSODA   : chr  "1.03" "1.01" "0.95" "0.87" ...
##  $ PFRY    : chr  "1.03" "0.9" "0.74" "0.82" ...
##  $ PENTREE : chr  "0.52" "2.35" "2.33" "1.79" ...
##  $ NREGS   : chr  "3" "4" "3" "2" ...
##  $ NREGS11 : chr  "3" "3" "3" "2" ...
##  $ TYPE2   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ STATUS2 : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ DATE2   : num  111792 111292 111292 111492 111492 ...
##  $ NCALLS2 : chr  "1" "." "." "." ...
##  $ EMPFT2  : chr  "3.5" "0" "3" "0" ...
##  $ EMPPT2  : chr  "35" "15" "7" "36" ...
##  $ NMGRS2  : chr  "3" "4" "4" "2" ...
##  $ WAGE_ST2: chr  "4.3" "4.45" "5" "5.25" ...
##  $ INCTIME2: chr  "26" "13" "19" "26" ...
##  $ FIRSTIN2: chr  "0.08" "0.05" "0.25" "0.15" ...
##  $ SPECIAL2: chr  "1" "0" "." "0" ...
##  $ MEALS2  : chr  "2" "2" "1" "2" ...
##  $ OPEN2R  : chr  "6.5" "10" "11" "10" ...
##  $ HRSOPEN2: chr  "16.5" "13" "11" "12" ...
##  $ PSODA2  : chr  "1.03" "1.01" "0.95" "0.92" ...
##  $ PFRY2   : chr  "." "0.89" "0.74" "0.79" ...
##  $ PENTREE2: chr  "0.94" "2.35" "2.33" "0.87" ...
##  $ NREGS2  : chr  "4" "4" "4" "2" ...
##  $ NREGS112: chr  "4" "4" "3" "2" ...

Not all the variable names are intuitive. The codebook contains details regarding the variables. Importantly you can see which data format the variables have. In particular you can see numeric (num) and character/text (chr) variables. As you can see, even the variables which are labelled as chr are actually numbers. The reason for this is that there are some missing observations, coded as ".". As we use the read_xlsx function to import the data we can specify that missing data are coded with a ".". By specifying this at the outset we ensure that the data are formatted as numerical data where appropriate.

ck_data <- read_xlsx("data/CK_public.xlsx", na = ".")
str(ck_data)  # prints some basic info on variables
## Classes 'tbl_df', 'tbl' and 'data.frame':    410 obs. of  46 variables:
##  $ SHEET   : num  46 49 506 56 61 62 445 451 455 458 ...
##  $ CHAIN   : num  1 2 2 4 4 4 1 1 2 2 ...
##  $ CO_OWNED: num  0 0 1 1 1 1 0 0 1 1 ...
##  $ STATE   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SOUTHJ  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CENTRALJ: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NORTHJ  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PA1     : num  1 1 1 1 1 1 0 0 0 1 ...
##  $ PA2     : num  0 0 0 0 0 0 1 1 1 0 ...
##  $ SHORE   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NCALLS  : num  0 0 0 0 0 2 0 0 0 2 ...
##  $ EMPFT   : num  30 6.5 3 20 6 0 50 10 2 2 ...
##  $ EMPPT   : num  15 6.5 7 20 26 31 35 17 8 10 ...
##  $ NMGRS   : num  3 4 2 4 5 5 3 5 5 2 ...
##  $ WAGE_ST : num  NA NA NA 5 5.5 5 5 5 5.25 5 ...
##  $ INCTIME : num  19 26 13 26 52 26 26 52 13 19 ...
##  $ FIRSTINC: num  NA NA 0.37 0.1 0.15 0.07 0.1 0.25 0.25 0.15 ...
##  $ BONUS   : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ PCTAFF  : num  NA NA 30 0 0 45 0 0 0 0 ...
##  $ MEALS   : num  2 2 2 2 3 2 2 2 1 1 ...
##  $ OPEN    : num  6.5 10 11 10 10 10 6 0 11 11 ...
##  $ HRSOPEN : num  16.5 13 10 12 12 12 18 24 10 10 ...
##  $ PSODA   : num  1.03 1.01 0.95 0.87 0.87 0.87 1.04 1.05 0.73 0.94 ...
##  $ PFRY    : num  1.03 0.9 0.74 0.82 0.77 0.77 0.88 0.84 0.73 0.73 ...
##  $ PENTREE : num  0.52 2.35 2.33 1.79 1.65 0.95 0.94 0.96 2.32 2.32 ...
##  $ NREGS   : num  3 4 3 2 2 2 3 6 2 4 ...
##  $ NREGS11 : num  3 3 3 2 2 2 3 4 2 4 ...
##  $ TYPE2   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ STATUS2 : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ DATE2   : num  111792 111292 111292 111492 111492 ...
##  $ NCALLS2 : num  1 NA NA NA NA NA NA 2 NA 1 ...
##  $ EMPFT2  : num  3.5 0 3 0 28 NA 15 26 3 2 ...
##  $ EMPPT2  : num  35 15 7 36 3 NA 18 9 12 9 ...
##  $ NMGRS2  : num  3 4 4 2 6 NA 5 6 2 2 ...
##  $ WAGE_ST2: num  4.3 4.45 5 5.25 4.75 NA 4.75 5 5 5 ...
##  $ INCTIME2: num  26 13 19 26 13 26 26 26 13 13 ...
##  $ FIRSTIN2: num  0.08 0.05 0.25 0.15 0.15 NA 0.15 0.2 0.25 0.25 ...
##  $ SPECIAL2: num  1 0 NA 0 0 0 0 0 0 0 ...
##  $ MEALS2  : num  2 2 1 2 2 2 2 2 2 1 ...
##  $ OPEN2R  : num  6.5 10 11 10 10 10 6 0 11 11 ...
##  $ HRSOPEN2: num  16.5 13 11 12 12 12 18 24 11 10.5 ...
##  $ PSODA2  : num  1.03 1.01 0.95 0.92 1.01 NA 1.04 1.11 0.94 0.9 ...
##  $ PFRY2   : num  NA 0.89 0.74 0.79 0.84 0.84 0.86 0.84 0.84 0.73 ...
##  $ PENTREE2: num  0.94 2.35 2.33 0.87 0.95 1.79 0.94 0.94 2.32 2.32 ...
##  $ NREGS2  : num  4 4 4 2 2 3 3 6 4 4 ...
##  $ NREGS112: num  4 4 3 2 2 3 3 3 3 3 ...

We can see that now all variables are numeric (num) variables which is as we expect.

There are 410 observations, each representing one fast-food restaurant. Each restaurant has some variables which characterise the store and then two sets of variables which are observed before the new minimum wage was introduced in New Jersey (Wave 1: Feb 15 to Mar 14, 1992) and after the policy change (Wave 2: Nov 5 to Dec 31, 1992). Variables which relate to the second wave have a 2 at the end of the variable name.

The following variables will be important for the analysis here:

  • STATE, 1 if New Jersey (NJ); 0 if Pennsylvania (Pa)
  • WAGE_ST, starting wage ($/hr), Wave 1
  • WAGE_ST2, starting wage ($/hr), after policy, Wave 2
  • STATUS2, the status of the Wave 2 interview: 0 = refused, 1 = completed, 3 = permanently closed, 2, 4 and 5 = temporarily closed
  • CHAIN, 1 = Burger King; 2 = KFC; 3 = Roy Rogers; 4 = Wendy’s
  • EMPFT, # full-time employees before policy implementation
  • EMPFT2, # full-time employees after policy implementation
  • EMPPT, # part-time employees before policy implementation
  • EMPPT2, # part-time employees after policy implementation
  • NMGRS, # managers/ass’t managers before policy implementation
  • NMGRS2, # managers/ass’t managers after policy implementation

For later it will be convenient to have STATE and CHAIN variables which aren’t numeric, but categorical variables. In R these are called factor variables.

ck_data$STATEf <- as.factor(ck_data$STATE)  # translates a variable into a factor variable
levels(ck_data$STATEf) <- c("Pennsylvania","New Jersey") # changes the names of the categories

ck_data$CHAINf <- as.factor(ck_data$CHAIN)  
levels(ck_data$CHAINf) <- c("Burger King","KFC", "Roy Rogers", "Wendy's") 

Some Summary Statistics

Let’s create some summary statistics, replicating elements of Card and Krueger’s Table 1.

Tab1 <- ck_data %>% group_by(STATEf) %>% 
          summarise(n = n()) %>% 
          print()
## # A tibble: 2 x 2
##   STATEf           n
##   <fct>        <int>
## 1 Pennsylvania    79
## 2 New Jersey     331

We can obtain a frequency table for the STATUS2 variable as follows (recall that STATE = 1 for New Jersey):

table(ck_data$STATUS2, ck_data$STATEf)
##    
##     Pennsylvania New Jersey
##   0            0          1
##   1           78        321
##   2            0          2
##   3            1          5
##   4            0          1
##   5            0          1

All these data are identical to those in the Card and Krueger paper.

Now we replicate some of the summary statistics in Table 2. First we want the proportions of different chain types. At the core of this we again calculate a frequency table (table()), but then feed the result straight into the prop.table() function, which translates frequencies into proportions. The margin = 2 option ensures that proportions are calculated by state (2 = columns). Try for yourself what changes if you either set margin = 1 (1 for rows) or leave this option out.

prop.table(table(ck_data$CHAINf, ck_data$STATEf, dnn = c("Chain", "State")), margin = 2)
##              State
## Chain         Pennsylvania New Jersey
##   Burger King    0.4430380  0.4108761
##   KFC            0.1518987  0.2054381
##   Roy Rogers     0.2151899  0.2477341
##   Wendy's        0.1898734  0.1359517
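To see what the margin option does, here is a sketch of the two alternatives (assuming ck_data, CHAINf and STATEf have been created as above). With margin = 1 each row (chain) sums to one, so each cell is the share of that chain’s stores located in each state; with no margin each cell is a share of all 410 stores.

```r
# margin = 1: proportions within each row (chain); each row sums to 1.
prop.table(table(ck_data$CHAINf, ck_data$STATEf,
                 dnn = c("Chain", "State")), margin = 1)

# No margin argument: shares of the grand total; all cells sum to 1.
prop.table(table(ck_data$CHAINf, ck_data$STATEf,
                 dnn = c("Chain", "State")))
```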

Let’s also see whether there are other differences between the characteristics. For instance we can look at the distribution of starting wages before the change in minimum wage in New Jersey (WAGE_ST).

At this stage it is not so important to understand the commands for these plots.

ggplot(ck_data, aes(WAGE_ST, stat(density), colour = STATEf)) +
  geom_freqpoly(bins = 10) +
  ggtitle(paste("Starting wage distribution, Feb/Mar 1992"))

Or here an alternative visualisation.

ggplot(ck_data, aes(WAGE_ST, colour = STATEf)) + 
    geom_histogram(position = "identity", 
                   aes(y = ..density..),
                   bins = 10,
                   alpha = 0.2) +
    ggtitle(paste("Starting wage distribution, Feb/Mar 1992"))

Both plots show that the starting wage distribution is fairly similar in both states, with peaks at the then-minimum wage of $4.25 and at $5.00.

Policy Evaluation

First we can evaluate whether the legislation has been implemented.

Tab1 <- ck_data %>% group_by(STATEf) %>% 
          summarise(wage_FEB = mean(WAGE_ST,na.rm = TRUE), 
                    wage_DEC = mean(WAGE_ST2,na.rm = TRUE)) %>% 
          print()
## # A tibble: 2 x 3
##   STATEf       wage_FEB wage_DEC
##   <fct>           <dbl>    <dbl>
## 1 Pennsylvania     4.63     4.62
## 2 New Jersey       4.61     5.08

We can clearly see that the average wage in New Jersey has increased. We could also compare the wage distributions as above.

ggplot(ck_data, aes(WAGE_ST2, stat(density), colour = STATEf)) +
  geom_freqpoly(bins = 10) +
  ggtitle(paste("Starting wage distribution, Nov/Dec 1992"))

The difference is very obvious.

In order to evaluate whether the increased minimum wage has an impact on employment we want to compare the employment numbers in the two states, before and after the policy implementation.

In the list of variables above you can see that we have before and after policy employee numbers for full-time staff and part-time staff. Card and Krueger calculated a full-time equivalent (FTE) employee number. In order to calculate this they made the assumption that, on average, part-time employees work half the hours of a full-time employee, and that managers (NMGRS and NMGRS2) work full-time.

Hence we will generate two new variables, FTE and FTE2. As almost always, the same result can be achieved in different ways in R, so we demonstrate two different ways to create these two variables.

ck_data$FTE <- ck_data$EMPFT + ck_data$NMGRS + 0.5*ck_data$EMPPT
ck_data <- ck_data %>%  mutate(FTE2 = EMPFT2 + NMGRS2 + 0.5*EMPPT2)
Tab2 <- ck_data %>% group_by(STATEf) %>% 
          summarise(meanFTE_FEB = mean(FTE,na.rm = TRUE), 
                    meanFTE_DEC = mean(FTE2,na.rm = TRUE)) %>% 
          print()
## # A tibble: 2 x 3
##   STATEf       meanFTE_FEB meanFTE_DEC
##   <fct>              <dbl>       <dbl>
## 1 Pennsylvania        23.3        21.2
## 2 New Jersey          20.4        21.0

From here you can clearly see that, on average, stores in New Jersey increased employment, while average employment in Pennsylvania stores actually decreased. That employment would increase despite the minimum wage increasing was a truly earth-shattering result at the time.
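This comparison is a difference-in-differences argument, and we can compute the estimate directly from the state means in Tab2 (this sketch assumes Tab2, with columns meanFTE_FEB and meanFTE_DEC, was created as in the chunk above):

```r
# Difference-in-differences: (change in NJ) minus (change in PA).
d_NJ <- Tab2$meanFTE_DEC[Tab2$STATEf == "New Jersey"] -
        Tab2$meanFTE_FEB[Tab2$STATEf == "New Jersey"]
d_PA <- Tab2$meanFTE_DEC[Tab2$STATEf == "Pennsylvania"] -
        Tab2$meanFTE_FEB[Tab2$STATEf == "Pennsylvania"]
d_NJ - d_PA   # diff-in-diff estimate of the employment effect
```

With the rounded means shown above this comes to (21.0 − 20.4) − (21.2 − 23.3) = 2.7 FTE employees, i.e. a positive estimated employment effect of the minimum wage increase.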

These findings clearly contradicted a simplistic analysis of labour markets (assuming they were competitive), which would have suggested that an increased minimum wage should reduce employment. This particular empirical finding was, perhaps not surprisingly, disputed:

  • These findings were challenged in Neumark, David, and William Wascher. 2000. “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania: Comment.” American Economic Review, 90 (5): 1362-1396.
  • A reply (by and large defending the original findings) to this criticism was published in the same journal edition by Card, David, and Alan B. Krueger. 2000. “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania: Reply.” American Economic Review, 90 (5): 1397-1420.
  • An overview of the empirical evidence is provided in this report by Arindrajit Dube for the UK Government: “Especially for the set of studies that consider broad groups of workers, the overall evidence base suggests an employment impact of close to zero.”