Here we will introduce one of the most powerful tools R has to offer, graphical representation of data. Adrian Pagan, an excellent Australian econometrician once said “A simple plot tells a lot!”. So let’s see how to create simple plots.
With R being on open source software you will find quite a few
different ways to produce graphics. Some are build, by default, into the
R software, like the plot function. But as a regular user of R you will
want to develop a wider range of graphics and then you will want to use
a graphics package. The package we will introduce here is the
ggplot2
package. Install this package in case you have not
yet done so on your computer.
install.packages("ggplot2")
As you know, the above package installation you will only have to do once on your computer.
library(tidyverse)
When you load a package with the library()
command you
will often get some warning messages. On most occasions you can just
ignore them. ggplot
will automatically load the
ggplot
package and you are likely to want to do some
manipulation using the tiyverse package.
Let’s get a dataset to look at. We shall use the Baseball wages
dataset, mlb1.csv.
Make sure that it is saved into your working directory. (Recall, use
setwd()
to set your working directory).
mydata <- read.csv("mlb1.csv", na.strings = ".")
Let’s check out what variables we have in this data-file
names(mydata)
## [1] "salary" "teamsal" "nl" "years" "games" "atbats"
## [7] "runs" "hits" "doubles" "triples" "hruns" "rbis"
## [13] "bavg" "bb" "so" "sbases" "fldperc" "frstbase"
## [19] "scndbase" "shrtstop" "thrdbase" "outfield" "catcher" "yrsallst"
## [25] "hispan" "black" "whitepop" "blackpop" "hisppop" "pcinc"
## [31] "gamesyr" "hrunsyr" "atbatsyr" "allstar" "slugavg" "rbisyr"
## [37] "sbasesyr" "runsyr" "percwhte" "percblck" "perchisp" "blckpb"
## [43] "hispph" "whtepw" "blckph" "hisppb" "lsalary"
You can find short variable descriptions here and of
course you need to understand what data types the variables represent
(check str(mydata)
to confirm the R datatypes.) The
datafile contains information on 353 baseball players.
You can see that there are several variables which indicate which fielding position a player has. For instance player 10 is an outfield player as the player has a value of 1 in the outfield variable and 0s otherwise.
mydata[10,18:23]
## frstbase scndbase shrtstop thrdbase outfield catcher
## 10 0 0 0 0 1 0
The following will create one new variable called
poition
, which indicates the position a player plays
in.
mydata <- mydata %>% mutate(position = case_when(frstbase == 1 ~ "First Base",
scndbase == 1 ~ "Second Base",
shrtstop == 1 ~ "Shortstop",
thrdbase == 1 ~ "Third Base",
outfield == 1 ~ "Outfield",
catcher == 1 ~ "Catcher")) %>% mutate(position = as.factor(position))
In a similar manner we create a variable which describes the ethnicity of players.
mydata <- mydata %>% mutate(ethnicity = case_when(black == 1 ~ "black",
hispan == 1 ~ "hispanic",
(black == 0) & (hispan == 0) ~ "white")) %>%
mutate(ethnicity = as.factor(ethnicity))
Here we categorise players as either black hispanic or white. This is the information available from the data but may not well represent more complicated ethnicities.
Rhas some basic build-in graphics capability. This is called with the
plot
function. The plot function recognises the datatype
and adjusts the graphics type accordingly. For instance:
plot(mydata$salary)
It has recognised that the data in salary
are numeric
and this triggers a scatterplot. To produce a scatter plot you need a
second variable and as we did not deliver any it automatically uses the
data index (which here is just like a numbering of the players in the
datafile). If you want to set which variable appears on the horizontal
axis (e.g. the batting average, bavg
), then you need to set
this as follows:
plot(mydata$bavg,mydata$salary) # first variable for horizontal, 2nd for vertical axis
If we are sending a categrical/factor variable to plot
such as the position
variable, then the function will
automatically create a bar graph.
plot(mydata$position)
These plots can be very useful to get a first idea of what the data look like and it is possible to adapt certain aspects of these plots, however, when you want to prepare nice looking plots for a report, then you are likely to want to use a more powerful tool.
ggplot
The ggplot
package is one of these packages you can use
to produce really beautiful graphics. There are literally books written just to help you with
using this package to produce nice graphics. This here is just the
briefest of introductions.
There is a large range of different graphs that are available from
the ggplot
package. A great overview and an excellent
source of help is available from this ggplot
cheatsheet. It is well worth having this available on your computer.
Of ourse, al ot more detail is available from Hadley Wikham’s online book.
ggplot
In general the structure of creating
g1 <- ggplot(mydata, aes(y = VARIABLENAME, x = VARIABLENAME, colour = VARIABLENAME)) +
geom_GRAPHTYPE() +
geom_OTHERSTUFF()
It is worth briefly thinking about what this does, such that what
follows makes more sense. First, we create a new object into which we
will save the graph, here called g1
. We call the function
ggplot(mydata, )
which instructs R to prepare a graph using
data stored in the mydata
object. This implies that in what
follows R just assumes that variables we are using are in this
dataframe.
aes(y = VARIABLENAME1, x = VARIABLENAME2, colour = VARIABLENAME3)
then provides further detail. In particular it tells R that
VARIABLENAME1
should be on the vertical/y axis and
VARIABLENAM2
should be on the horizontal/x axis. Further,
data points will be displayed in different colors as per
VARIABLENAME3
. Not all graph types will actually need all
of this information.
At this stage, R has just prepared a blank canvas with those
parameters, but has not yet created a graph. That is achieved by
+ geom_GRAPHTYPE()
. As you will soon see there are
different graph types available. Then you can add further details to
your graph + geom_OTHERSTUFF()
.
It is useful to differentiate between graphs that use one and those that use two variables.
Say we want to graphically represent the distribution of a
categorical variable, like the position
variable.
g1 <- ggplot(mydata, aes(x = position)) +
geom_bar()
This has produced the graph and it is stored in g1
. You
can see it by calling g1
.
g1
This is very similar to the bar graph produced in the basic graph section. Let’s change the esthetics a little by adding a theme and a title.
g1 <- ggplot(mydata, aes(x = position)) +
geom_bar() +
theme_classic() +
labs(x = "Player position", y = "Number of players", title ="Bar Chart for Player positions")
g1
If you wish to represent numerical data you have a few more options than a bar chart.
Let’s assume you want a histogram of the hruns
variable,
which indicates how many career home runs a player has.
g2 <- ggplot(mydata, aes(x = hruns)) +
geom_histogram() +
theme_classic() +
labs(x = "Home Runs", y = "", title ="Career Home Runs")
g2
## `stat_bin()` using `bins = 30`. Pick
## better value with `binwidth`.
You get a good idea from this about the distribution of the home run
variable. Instead of geom_histogram()
you could use a
smooth versio of the histogram, called geom_density()
. Try
for yourself what the option size = 1
inside the function
call does by changing the value (remember you cannot break your
computer).
g3 <- ggplot(mydata, aes(x = hruns)) +
geom_density(size = 1) +
theme_classic() +
labs(x = "Home Runs", y = "", title ="Career Home Runs")
g3
And now you will learn one of the incredibly powerful modifications
you can undertake with ggplot
. In the initial function
call, in the aesthetics (aes
) settings we add
colour = position
. See what happens:
g4 <- ggplot(mydata, aes(x = hruns, color = position)) +
geom_density(size = 1) +
theme_classic() +
labs(x = "Home Runs", y = "", title ="Career Home Runs")
g4
We now get different density plots by position. You can now easily see that Catchers, Shortstops and Second Base players have fewer home runs, they are the very specilist fielding positions.
This is pretty nice already, but we can do even better. Try and
identify the what was added and play with the alpha
parameter to see what it does.
g4 <- ggplot(mydata, aes(x = hruns, fill = position, colour = position)) +
geom_density(size = 1, alpha = 0.2) +
theme_classic() +
labs(x = "Home Runs", y = "", title ="Career Home Runs")
g4
It is important to understand that a data analyst will rarely remember all of these commands. A crucial programming technique is searching the internet. You could, for instance, search for “r ggplot overlying density plots” and then you should find an example which produces a similar plot. You can then copy and paste the code and adjust it to your dataset.
In this section you can learn how to produce graphs that visualise two variables.
Perhaps the most classic of such plots is a scatter diagram.
g5 <- ggplot(mydata, aes(x = hruns, y = salary)) +
geom_point() +
theme_classic() +
labs(x = "Home Runs", y = "Salary", title ="Scatter plot")
g5
Here we can see that in general there is a positive relationship between the number of career home runs and the salary variables.
After we learned that the number of home runs differs significantly between positions (some positions job it is to get home runs, other positions have other specialisations), it is obvious that there is more to the data. Let’s see how we can use position information in this graph. We could add a colour differentiation
g7 <- ggplot(mydata, aes(x = hruns, y = salary, colour = position)) +
geom_point() +
theme_classic() +
labs(x = "Home Runs", y = "Salary", title ="Scatter plot")
g7
But this is somewhat too messy to actually see anything here. Another way to do this is what is called facetting.
g8 <- ggplot(mydata, aes(x = hruns, y = salary)) +
geom_point() +
theme_classic() +
facet_wrap(vars(position)) +
labs(x = "Home Runs", y = "Salary", title ="Scatter plots")
g8
The addition of + facet_wrap(vars(position))
produced a
scatter plot for each position separately. In this case this does not
really help us a lot other than to see that outfielders are the players
which score, on average, more home runs than other positions.
But faceting can be incredibly revealing. Look at this graph:
g9 <- ggplot(mydata, aes(x = position)) +
geom_bar() +
theme_classic() +
facet_grid(vars(ethnicity)) +
labs(title ="Bar graph of positions (conditional on ethnicity) ")
g9
This very clearly illustrates that, at the time, black players predominantly played in the outfield.
In the above you really only scratched the surface of what
ggplot
can do. Another very common type of graph is a
time-series plot. The Basketball data do not have any time dimension
such that there is no obvious use for a time-series plot. However, it
will be constructed in a very similar way, just using the
geom_line()
option. Explore the possibilities in Hadley Wickham’s book.
The following is a fairly straightforward way to save graphs as an
image. You can then import that image into any software you wish, such
as Microsoft Word. Let’s say you wish to include the graph we saved in
g9
into another document.
ggsave("baseball.png", g9)
## Saving 7 x 5 in image
This will create “baseball.png” in your working directory from where
you can load it to wherever you want to use it. Use the help function
(?ggsave
) in the console to find out what options you have.
For instance you could, instead of the square image saved on this
occasion, save a wide format picture.
ggsave("baseball_wide.png", width = 15, height = 9, g9)
This very useful function but only works for plots produced with
ggplot
. If you want to save any other plot you can produce
that plot in the console. Then you will see it in the “Plots” window in
RStudio, and from there you can save it using the “Export” button.
In this section you learned how to produce visual representations of data. These can be a very important tool when you communicate your work.