In this little project we will demonstrate how to use the mightily powerful packages of the “tidyverse” to perform some data analysis. In particular we learn how to perform more advanced filtering and grouping tasks such that data analysis can then be applied to a range of different data slices. Those of you who have some Excel experience may be familiar with pivot tables, and we are aiming to perform tasks that are similar to what pivot tables can do.
So before we do anything else you should install the
tidyverse
package and then load it:
library(tidyverse)
By the way, at this stage you should take five minuted to learn about https://priceonomics.com/hadley-wickham-the-man-who-revolutionized-r/ a real hero for data nerds. And if you think at the end of this section “Wow, that is powerful and quite straightforward” you got him to thank for it.
Let’s get a dataset to look at. We shall use the Baseball wages dataset, including 353 Baseball Players in 1993 (get the datafile from here: mlb1.csv).
mydata <- read.csv("data/mlb1.csv")
Let’s check out what variables we have in this data-file
names(mydata)
## [1] "salary" "teamsal" "nl" "years" "games" "atbats"
## [7] "runs" "hits" "doubles" "triples" "hruns" "rbis"
## [13] "bavg" "bb" "so" "sbases" "fldperc" "frstbase"
## [19] "scndbase" "shrtstop" "thrdbase" "outfield" "catcher" "yrsallst"
## [25] "hispan" "black" "whitepop" "blackpop" "hisppop" "pcinc"
## [31] "gamesyr" "hrunsyr" "atbatsyr" "allstar" "slugavg" "rbisyr"
## [37] "sbasesyr" "runsyr" "percwhte" "percblck" "perchisp" "blckpb"
## [43] "hispph" "whtepw" "blckph" "hisppb" "lsalary"
You can find short variable descriptions from a link in the Datasets
page (see above). You need to understand what data types the variables
represent (check str(mydata)
to confirm the R
datatypes.)
You can perhaps see that the positional information is organised in individual positional variables (“frstbase” “scndbase” “shrtstop” “thrdbase” “outfield” “catcher”) that take the value 1 if a player plays in a particular position.
To confirm that each player is only assigned one position we calculate the following:
temp <- rowSums(mydata[,c("frstbase","scndbase","shrtstop","thrdbase","outfield","catcher")])
min(temp)
## [1] 1
max(temp)
## [1] 1
As the result is one for both min and max value we have confirmed that every player has been assigned exactly one position.
A similar situation exists with the ethnicity variable. We have two variables (“hispan” “black”) which are 1 if the respective player is ither black or hispanic. If both are 0 the player is white.
Let us now create two variables (“position” and “race”) which summarise the respective information in one variable each.
mydata$position <- "First Base"
mydata$position[mydata$scndbase == 1] <- "Second Base"
mydata$position[mydata$shrtstop == 1] <- "Short Stop"
mydata$position[mydata$thrdbase == 1] <- "Third Base"
mydata$position[mydata$outfield == 1] <- "Outfield"
mydata$position[mydata$catcher == 1] <- "Catcher"
mydata$position <- as.factor(mydata$position) # now ensure it is a factor variable
mydata$race <- "White"
mydata$race[mydata$hispan == 1] <- "Hispanic"
mydata$race[mydata$black == 1] <- "Black"
mydata$race <- as.factor(mydata$race) # now ensure it is a factor variable
Almost the most difficult task in data analysis, in particular if you have data with so many different variables as the dataset here, is to know what you are interested in. Once you know that you have to find ways to slice the data into the right bits before you analyse them. That is the main task to learn here.
Remember a few basis commands before we proceed. If you want a quick
summaries for a particular variable in the data frame, say
salary
you use:
summary(mydata$salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 109000 253600 675000 1345672 2250000 6329213
summary(mydata$position)
## Catcher First Base Outfield Second Base Short Stop Third Base
## 52 45 136 37 49 34
If you know exactly the particular statistic you are after, you can immediately calculate it as such
max(mydata$salary)
## [1] 6329213
Other useful statistics can be called using the following function:
mean()
, median()
, sd()
and
var()
.
So let’s learn by doing.
Let’s say we want to see the average salary for each position. First we’ll see how we do it and we explain what happened afterwards.
mydata %>% group_by(position) %>% summarise(mean(salary))
## # A tibble: 6 × 2
## position `mean(salary)`
## <fct> <dbl>
## 1 Catcher 892519.
## 2 First Base 1586781.
## 3 Outfield 1539324.
## 4 Second Base 1309641.
## 5 Short Stop 1069211.
## 6 Third Base 1382647.
Here we used the %>%
piping operator. What this does
is best described in words. Here we did the following: “Thake the
dataset mydata, group the data by position and then summarise the data
by presenting the mean salary for each group”.
We can clearly see that average salaries vary between positions. Amongst the fielding positions included in this dataset, the average salary is highest amongst players on the First Base (closely followed by Outfielders) and lowest for catchers. Note that, although this is a rather old dataset, this ordering has not changed.
Let’s show a few variations here:
mydata %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary))
## # A tibble: 6 × 3
## position number avg.salary
## <fct> <int> <dbl>
## 1 Catcher 52 892519.
## 2 First Base 45 1586781.
## 3 Outfield 136 1539324.
## 4 Second Base 37 1309641.
## 5 Short Stop 49 1069211.
## 6 Third Base 34 1382647.
Here we added another aspect of the above groups to the final
display, namely the number of observations. By checking
length(salary)
we are basically finding out how many
observations for each position (as that is what we grouped by before)
there are. Here, for instance, we see that there are 52 catchers in the
database.
Also by not just, in summarise
, saying
mean(salary)
but rather
avg.salary = mean(salary)
we can rename the column in which
the salary mean is displayed.
Let’s start with what I call simple pivot tables. Tables where we group by one variable.
Now we look at each of the main tools in our toolbox
The main work in the example above was done by the
group_by
command. The variables by which we group will
typically be categorical variables. Often these will be defined as
factor variables. But they could also be, for instance, int
variables, such as black
.
mydata %>% group_by(black) %>% summarise(length(salary),mean(salary))
## # A tibble: 2 × 3
## black `length(salary)` `mean(salary)`
## <int> <int> <dbl>
## 1 0 245 1209602.
## 2 1 108 1654350.
Interestingly this would suggest that black players earn higher salaries. However,
mydata %>% group_by(hispan) %>% summarise(length(salary),mean(salary))
## # A tibble: 2 × 3
## hispan `length(salary)` `mean(salary)`
## <int> <int> <dbl>
## 1 0 289 1410990.
## 2 1 64 1050723.
reveals that it is hispanics that earned significantly less than the others and the full variety is only revealed by using our race variable:
mydata %>% group_by(race) %>% summarise(length(salary),mean(salary))
## # A tibble: 3 × 3
## race `length(salary)` `mean(salary)`
## <fct> <int> <dbl>
## 1 Black 108 1654350.
## 2 Hispanic 64 1050723.
## 3 White 181 1265780.
On face value these resuts suggest that, on average, black players earn most and hispanic players the least. Of course there are a numer of other factors at play which these very simple summary statistics does not take account of and the three groups very likely differ in other aspects that are relevant for player salary.
The filter_by
command allows us to remove a subset of
the data. Here is how we could use this command if we only wanted to
look at players that have not (by 1993) been an All Star player
(yrsallst == 0
).
mydata %>% filter(yrsallst == 0) %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary))
## # A tibble: 6 × 3
## position number avg.salary
## <fct> <int> <dbl>
## 1 Catcher 42 587167.
## 2 First Base 31 827747.
## 3 Outfield 93 858689.
## 4 Second Base 25 717133.
## 5 Short Stop 38 687741.
## 6 Third Base 21 701786.
When comparing this table to the table above we can of course see that we are now looking at fewer players and their salaries are lower.
We can look at all All Stars (yrsallst > 0
) by
changing the input into the filter
command:
mydata %>% filter(yrsallst > 0) %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary))
## # A tibble: 6 × 3
## position number avg.salary
## <fct> <int> <dbl>
## 1 Catcher 10 2175000
## 2 First Base 14 3267500
## 3 Outfield 43 3011395.
## 4 Second Base 12 2544032.
## 5 Short Stop 11 2387014.
## 6 Third Base 13 2482500
immediately seeing that All Stars attract significantly higher salaries (note, this is not a causal relationship!). They are All Stars because they are good players and it is being a good player that earns them a high salary. Of course there may still be a premium for All Stars, but you cannot conclude this from the above statistics.
Let’s say you wanted to arrange the table such that positions with
lower salaries are shown first. The arrange
command is the
tool you need.
mydata %>% filter(yrsallst == 0) %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary)) %>% arrange(avg.salary)
## # A tibble: 6 × 3
## position number avg.salary
## <fct> <int> <dbl>
## 1 Catcher 42 587167.
## 2 Short Stop 38 687741.
## 3 Third Base 21 701786.
## 4 Second Base 25 717133.
## 5 First Base 31 827747.
## 6 Outfield 93 858689.
Clearly, different positions pay, on average, differently.
These are tables where we group the data by at least two dimensions, say position and race. So in the end we want a table that has positions in rows, race in columns and the respective group averages in the cells.
mydata %>% group_by(position,race) %>% summarise(avg.salary = mean(salary))
## # A tibble: 18 × 3
## # Groups: position [6]
## position race avg.salary
## <fct> <fct> <dbl>
## 1 Catcher Black 736000
## 2 Catcher Hispanic 970214.
## 3 Catcher White 887151.
## 4 First Base Black 1582917.
## 5 First Base Hispanic 977833.
## 6 First Base White 1799058.
## 7 Outfield Black 1728032.
## 8 Outfield Hispanic 1344532.
## 9 Outfield White 1319637.
## 10 Second Base Black 1715208.
## 11 Second Base Hispanic 1315357.
## 12 Second Base White 1160343
## 13 Short Stop Black 2007098.
## 14 Short Stop Hispanic 682711.
## 15 Short Stop White 1103050.
## 16 Third Base Black 1019889.
## 17 Third Base Hispanic 1309722.
## 18 Third Base White 1540992.
As you can see it is pretty straightforward to group by more than one
variable (you merely add another variable to the group_by()
command), but we would like to display the result differently (positions
in rows and race in columns).
At this stage it is useful to notice that R returned the above tables
in what are known as tibbles
, which are a type of
dataframe. The above result had three variables: position
,
race
and avg.salary
, the last being the new
display variable we created containing the grouped averages.
Rearranging the data display such that variation on one of the grouping variables is shown across different columns is achieved as follows:
mydata %>% group_by(position,race) %>% summarise(avg.salary = mean(salary)) %>% spread(race,avg.salary)
## # A tibble: 6 × 4
## # Groups: position [6]
## position Black Hispanic White
## <fct> <dbl> <dbl> <dbl>
## 1 Catcher 736000 970214. 887151.
## 2 First Base 1582917. 977833. 1799058.
## 3 Outfield 1728032. 1344532. 1319637.
## 4 Second Base 1715208. 1315357. 1160343
## 5 Short Stop 2007098. 682711. 1103050.
## 6 Third Base 1019889. 1309722. 1540992.
As you see we merely added the spread
command at the
end, meaning that we send the previous result to the spread
command. The spread command takes as the first input the variable that
should form the colums (here race
) and as the second input
the variable that should show in the cells (here `avg.salary’).
To illustrate that you can also group by more than two variables we
first create a new variable AS
which is a boolean variable
(TRUE or FALSE) depending on whether a player was an all start in 1993.
Then we merely add this new variable into our list of group_by
variables.
mydata$AS <- (mydata$yrsallst>0)
mydata %>% group_by(AS,position,race) %>% summarise(avg.salary = mean(salary)) %>% spread(race,avg.salary) %>% arrange(AS)
## # A tibble: 12 × 5
## # Groups: AS, position [12]
## AS position Black Hispanic White
## <lgl> <fct> <dbl> <dbl> <dbl>
## 1 FALSE Catcher 172000 238300 647153.
## 2 FALSE First Base 625694. 521500 1014194.
## 3 FALSE Outfield 831629. 762221. 931295.
## 4 FALSE Second Base 708750 1014000 626458.
## 5 FALSE Short Stop 269375 510719. 938065.
## 6 FALSE Third Base 553167. 456250 808154.
## 7 TRUE Catcher 1300000 2800000 2121429.
## 8 TRUE First Base 3018750 2575000 3565000
## 9 TRUE Outfield 3136667. 2975000 2678833.
## 10 TRUE Second Base 2721666. 2068750 2584036.
## 11 TRUE Short Stop 4324061. 1600000 1696995.
## 12 TRUE Third Base 1953333. 3016667 2599537
We smuggled one extra tool into this analysis. The last command here
is arrange(AS)
. This merely told R to order the rows in the
display table according to the variable AS
. The rows are
ordered in ascending order (as AS
is a boolean variable
that means from FALSE to TRUE). If you wanted a reversed ordering of
AS
and in addition a secondary ordering according to
position name you would achieve this by using
arrange(desc(AS),position)
instead.
Looking at this table we can already see that the above result, that black players earn on average more than white players, must have been due to some underlying confounding factor. If you compare each row in the above table you can see that for almost all combinations of position and All Status it is the white players that earn the most (one exception, for instance, being All Star short stops).
Without even attempting to settle this for good, we will take one more step to investigating this by looking at the number of players of different races in the different positions.
mydata %>% group_by(AS,position,race) %>% summarise(number = length(salary)) %>% spread(race,number) %>% arrange(AS)
## # A tibble: 12 × 5
## # Groups: AS, position [12]
## AS position Black Hispanic White
## <lgl> <fct> <int> <int> <int>
## 1 FALSE Catcher 1 5 36
## 2 FALSE First Base 6 7 18
## 3 FALSE Outfield 44 14 35
## 4 FALSE Second Base 4 5 16
## 5 FALSE Short Stop 4 16 18
## 6 FALSE Third Base 6 2 13
## 7 TRUE Catcher 1 2 7
## 8 TRUE First Base 4 2 8
## 9 TRUE Outfield 28 5 10
## 10 TRUE Second Base 4 2 6
## 11 TRUE Short Stop 3 3 5
## 12 TRUE Third Base 3 1 9
This paints an interesting picture. The vast majority of black professional baseball players (at least on the fielding positions) played as Outfielders and these were quite well paid positions. This is the reason why the overall average appeared to be highest for black players.
Through this small exercise you got a taste of how to use the mighty piping technique. Once you understand the architecture of the commands you will realise that this is an almighty technique.