%load_ext pretty_jupyter
Introduction¶
This walk-through is part of the ECLR page.
Here we will introduce one of the most powerful tools Python has to offer: graphical representation of data. Adrian Pagan, an excellent Australian econometrician, once said “A simple plot tells a lot!”. So let’s see how to create simple plots.
With Python being open source software you will find quite a few different ways to produce graphics. The packages matplotlib and seaborn have been the most popular graphing packages in Python. A more recent package is the lets_plot package and this is the package we will introduce here. It is basically a replica of the very popular ggplot package for the R programming language. As econometricians will often also be familiar with R we shall introduce graphing in this package. But as a regular Python user you may want to develop a wider range of graphics, and then you may also want to use other graphics packages.
In this walkthrough you will learn to
- tidy a dataset using the .apply method and lambda functions
- create univariate plots (bar charts, histograms and density plots)
- create bivariate plots (scatter plots) and refine them with colours and facets
Preparing your workfile¶
Install the lets-plot
package in case you have not yet done so on your computer (pip install lets-plot
in your terminal).
We add the basic packages needed for this work:
import numpy as np
import pandas as pd
from lets_plot import *
It turns out that, before you can create any graphs, you have to call the LetsPlot.setup_html() function.
# Lets_plot requires a setup
LetsPlot.setup_html()
Loading a dataset and data tidying¶
Let's get a dataset to look at. We shall use the Baseball wages dataset, mlb1.csv. Either download that file to your working directory or directly link to the file as shown below. Before you load a csv (or Excel) file you should always inspect the file to understand what to expect. Upon inspection you will find that missing values are coded as "." and hence we pass that information on to the pd.read_csv function.
# Load the dataset
url = "https://raw.githubusercontent.com/datasquad/ECLR/refs/heads/gh-pages/data/mlb1.csv"
df_bball = pd.read_csv(url, na_values = '.')
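To see what the na_values option does, here is a minimal sketch. It uses a small made-up CSV string (the column values are invented for illustration and are not taken from mlb1.csv), so it runs without downloading anything.

```python
import io
import pandas as pd

# Made-up CSV text in which "." marks a missing value
csv_text = "salary,whitepop\n100,5.5\n200,.\n300,7.0\n"

# Without na_values the "." is kept as text, so the column is not numeric
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values the "." is turned into NaN and the column stays numeric
clean = pd.read_csv(io.StringIO(csv_text), na_values=".")

print(raw["whitepop"].dtype)
print(clean["whitepop"].dtype)
print(clean["whitepop"].isna().sum())  # number of missing values
```

Comparing the two dtypes shows why it pays to inspect a file before loading it: a single unrecognised missing-value code is enough to turn a numeric column into text.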
Let's check out what variables we have in this data-file.
df_bball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 353 entries, 0 to 352
Data columns (total 47 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   salary    353 non-null    int64
 1   teamsal   353 non-null    int64
 2   nl        353 non-null    int64
 3   years     353 non-null    int64
 4   games     353 non-null    int64
 5   atbats    353 non-null    int64
 6   runs      353 non-null    int64
 7   hits      353 non-null    int64
 8   doubles   353 non-null    int64
 9   triples   353 non-null    int64
 10  hruns     353 non-null    int64
 11  rbis      353 non-null    int64
 12  bavg      353 non-null    int64
 13  bb        353 non-null    int64
 14  so        353 non-null    int64
 15  sbases    353 non-null    int64
 16  fldperc   353 non-null    int64
 17  frstbase  353 non-null    int64
 18  scndbase  353 non-null    int64
 19  shrtstop  353 non-null    int64
 20  thrdbase  353 non-null    int64
 21  outfield  353 non-null    int64
 22  catcher   353 non-null    int64
 23  yrsallst  353 non-null    int64
 24  hispan    353 non-null    int64
 25  black     353 non-null    int64
 26  whitepop  330 non-null    float64
 27  blackpop  339 non-null    float64
 28  hisppop   330 non-null    float64
 29  pcinc     353 non-null    int64
 30  gamesyr   353 non-null    float64
 31  hrunsyr   353 non-null    float64
 32  atbatsyr  353 non-null    float64
 33  allstar   353 non-null    float64
 34  slugavg   353 non-null    float64
 35  rbisyr    353 non-null    float64
 36  sbasesyr  353 non-null    float64
 37  runsyr    353 non-null    float64
 38  percwhte  330 non-null    float64
 39  percblck  330 non-null    float64
 40  perchisp  330 non-null    float64
 41  blckpb    330 non-null    float64
 42  hispph    330 non-null    float64
 43  whtepw    330 non-null    float64
 44  blckph    330 non-null    float64
 45  hisppb    330 non-null    float64
 46  lsalary   353 non-null    float64
dtypes: float64(20), int64(27)
memory usage: 129.7 KB
All variables are coded as numeric variables. You can find short variable descriptions here.
df_bball.shape
(353, 47)
The datafile contains information on 353 baseball players and contains 47 variables. Let's look at an individual observation.
df_bball.iloc[9,17:23]
frstbase    0.0
scndbase    0.0
shrtstop    0.0
thrdbase    0.0
outfield    1.0
catcher     0.0
Name: 9, dtype: float64
You can see that there are several variables which indicate which fielding position a player has. For instance player 10 (row index 9, as Python starts counting at 0) is an outfield player, as the player has a value of 1 in the outfield variable and 0s otherwise.
It is often more convenient to have one (categorical) variable which contains this positional information. The following will create one new variable called position, which indicates the position a player plays in. It turns out that this is not a completely straightforward operation.
If you work through the next short section you will learn a few things about Python, in particular the use of the .apply method and of lambda functions. If you wish, you can also skip that section.
apply and lambda functions¶
apply method¶
With the apply method you can apply a function easily to each row or column of a DataFrame. It allows you to quickly perform operations on your data without needing to write complicated loops or manually iterate over each row or column. Let's illustrate this with an example. We will look at our dataframe df_bball. We wish to find the maximum of each of the ['salary', 'teamsal', 'hits'] columns.
df_bball[['salary', 'teamsal', 'hits']].apply(max,axis=0) # applies the max function to each column (axis = 0)
salary      6329213
teamsal    42866000
hits           3025
dtype: int64
You could also find the max value across all variables for the first four players (rows 0 to 3).
df_bball.iloc[0:4,].apply(max,axis=1) # applies the max function to each row (axis = 1)
0    38407380.0
1    38407380.0
2    38407380.0
3    38407380.0
dtype: float64
For each of these players the largest value is the team salary, and since it is identical for all four they clearly all play for the same team.
Hopefully you can see that the .apply
method can be used to apply functions to rows or columns.
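The difference between axis = 0 and axis = 1 is easiest to see on a very small example. The sketch below uses a made-up dataframe (the names toy, col_max and row_max are only for illustration).

```python
import pandas as pd

# Small made-up dataframe (values are for illustration only)
toy = pd.DataFrame({"a": [1, 5, 3], "b": [4, 2, 6]})

# axis=0: the function receives one column at a time, one result per column
col_max = toy.apply(max, axis=0)

# axis=1: the function receives one row at a time, one result per row
row_max = toy.apply(max, axis=1)

print(col_max)
print(row_max)
```

The first result has one entry per column (a and b), the second has one entry per row (0, 1 and 2).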
lambda functions and the apply method¶
In the apply example we used an existing function, the max function. But often you will want to apply a bespoke function to either columns or rows in a dataframe. In Python you can define a named function which implements a chunk of code that you want to use again and again. lambda functions are not that. lambda functions are used to do little things which you will only do once, i.e. only apply to one dataframe but not repeatedly to several dataframes. For instance we could write a short function which checks whether a particular baseball player can be considered a veteran (a very experienced player, years > 10).
temp = df_bball.copy()
temp['Exper_cat'] = temp.apply(lambda row: 'Veteran' if row['years'] > 10 else 'Not Veteran', axis = 1)
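For a simple two-way split like this one, a common alternative to a row-wise lambda is numpy's np.where, which works on the whole column at once. The following sketch compares the two approaches on made-up data (the dataframe toy and its column names are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Made-up years of experience
toy = pd.DataFrame({"years": [2, 11, 10, 15]})

# Row-by-row version, as in the text
toy["via_apply"] = toy.apply(
    lambda row: "Veteran" if row["years"] > 10 else "Not Veteran", axis=1
)

# Vectorised version working on the whole column at once
toy["via_where"] = np.where(toy["years"] > 10, "Veteran", "Not Veteran")

print(toy)
```

Both columns contain the same labels. The lambda version is more flexible, while the np.where version is usually faster on large datasets.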
The apply method¶
Adjust the code of the simple example to our problem at hand. See how this is done in the next step.
This is the code which creates the new (categorical) variable position.
df_bball['position'] = df_bball.apply(
    lambda row: 'First Base' if row['frstbase'] == 1 else
                'Second Base' if row['scndbase'] == 1 else
                'Shortstop' if row['shrtstop'] == 1 else
                'Third Base' if row['thrdbase'] == 1 else
                'Outfield' if row['outfield'] == 1 else
                'Catcher' if row['catcher'] == 1 else None, axis=1
)
# Convert the 'position' column to a categorical type (similar to as.factor() in R)
df_bball['position'] = pd.Categorical(df_bball['position'])
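As an aside, the same kind of variable can be built without a chain of if/else statements. The sketch below is an alternative technique, not the approach used in this walkthrough: it uses idxmax(axis=1) on made-up dummy variables and assumes that every row has exactly one dummy equal to 1. The dataframe toy and the dictionary labels are hypothetical.

```python
import pandas as pd

# Made-up dummy variables; each row has exactly one 1
toy = pd.DataFrame({
    "frstbase": [1, 0, 0],
    "outfield": [0, 1, 0],
    "catcher":  [0, 0, 1],
})

# Hypothetical mapping from column names to readable labels
labels = {"frstbase": "First Base", "outfield": "Outfield", "catcher": "Catcher"}

# idxmax(axis=1) returns, for each row, the name of the column with the largest value
col_names = toy[list(labels)].idxmax(axis=1)

# Translate column names into labels and store them as a categorical variable
toy["position"] = pd.Categorical(col_names.map(labels))

print(toy["position"])
```

Note that a row consisting only of 0s would be assigned the first column, so this shortcut is only safe once you have checked that each player has exactly one position.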
In a similar manner we categorise players as either black, hispanic or white. This is the information available from the data, but it may not represent more complicated ethnicities well.
df_bball['ethnicity'] = df_bball.apply(
    lambda row: 'black' if row['black'] == 1 else
                'hispanic' if row['hispan'] == 1 else
                'white' if (row['black'] == 0) and (row['hispan'] == 0) else None, axis=1
)
# Convert the 'ethnicity' column to a categorical type (similar to as.factor() in R)
df_bball['ethnicity'] = pd.Categorical(df_bball['ethnicity'])
Basic graphics in Python¶
Univariate data representations¶
Often you will want to provide a graphical representation of variables to get a good understanding of a variable's distribution. For categorical variables this will often be a bar chart. We will use the functionality of the lets-plot package, which by and large mimics the functionality of R's ggplot package. A bar chart for the position variable in df_bball is implemented as follows:
(
ggplot(df_bball, aes(x = 'position')) +
geom_bar()
)
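The heights of the bars are simply the number of observations in each category. If you want to see these numbers rather than read them off the graph, value_counts() provides them. The sketch below uses made-up positions for six players.

```python
import pandas as pd

# Made-up positions for six players
toy = pd.DataFrame({
    "position": ["Outfield", "Catcher", "Outfield",
                 "Shortstop", "Outfield", "Catcher"]
})

# Number of rows per category, i.e. what a bar chart of position displays
counts = toy["position"].value_counts()
print(counts)
```

With the baseball data, df_bball['position'].value_counts() would give the numbers behind the bar chart.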
Note that the entire command was enclosed by ( ). This is to ensure that Python does not complain about the line breaks which make this much easier to read. One of the great features of the package is that you can get really nice plots with the basic function call as above, but also that you have a lot of possibilities to change the look of the graphs by adding themes or titles.
(
ggplot(df_bball, aes(x = 'position')) +
geom_bar() +
theme_classic() +
labs(x = "Player position", y = "Number of players", title ="Bar Chart for Player positions")
)
If you wish to represent numerical data you have a few more options than a bar chart.
Let's assume you want a histogram of the hruns
variable, which indicates how many career home runs a player has.
(
ggplot(df_bball, aes(x = 'hruns')) +
geom_histogram() +
theme_classic() +
labs(x = "Home Runs", y = "", title ="Career Home Runs")
)
You get a good idea from this about the distribution of the home run variable. Instead of geom_histogram()
you could use a smooth version of the histogram, called geom_density()
. Try for yourself what the option size = 1
inside the function call does by changing the value (remember you cannot break your computer).
(
ggplot(df_bball, aes(x = 'hruns')) +
geom_density(size = 1) +
theme_classic() +
labs(x = "Home Runs", y = "", title ="Career Home Runs")
)
And now you will learn one of the incredibly powerful modifications you can undertake with ggplot. In the initial function call, in the aesthetics (aes) settings, we add color = 'position'. See what happens:
(
ggplot(df_bball, aes(x = 'hruns', color = 'position')) +
geom_density(size = 1) +
theme_classic() +
labs(x = "Home Runs", y = "", title ="Career Home Runs")
)
We now get different density plots by position. You can now easily see that Catchers, Shortstops and Second Base players have fewer home runs, as these are specialist fielding positions.
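A visual impression like this can be backed up with summary statistics by group. The following is a minimal sketch on made-up numbers (they are not taken from the baseball data) showing how groupby delivers the count, median and mean of a variable for each position.

```python
import pandas as pd

# Made-up home run counts, for illustration only
toy = pd.DataFrame({
    "position": ["Outfield", "Outfield", "Catcher", "Catcher", "Shortstop"],
    "hruns": [200, 100, 40, 20, 30],
})

# One row per position, one column per summary statistic
summary = toy.groupby("position")["hruns"].agg(["count", "median", "mean"])
print(summary)
```

Applied to df_bball, the same pattern would show by how much the typical number of home runs differs between positions.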
It is important to understand that a data analyst will rarely remember all of these commands. A crucial programming technique is searching the internet. You could, for instance, search for "Python lets-plot overlying density plots" and then you should find an example which produces a similar plot. You can then copy and paste the code and adjust it to your dataset.
The general structure of the ggplot function¶
It is useful to understand the general structure of creating graphs through ggplot. This is the general architecture:
(
ggplot(mydata, aes(y = 'VARIABLENAME1', x = 'VARIABLENAME2', color = 'VARIABLENAME3')) +
geom_GRAPHTYPE() +
OTHERSTUFF()
)
It is worth briefly thinking about what this does, such that what follows makes more sense. We call the function ggplot(mydata, ), which instructs Python to prepare a graph using data stored in the mydata dataframe. This implies that in what follows Python just assumes that the variables we are using are in this dataframe.
aes(y = 'VARIABLENAME1', x = 'VARIABLENAME2', color = 'VARIABLENAME3') then provides further detail. In particular it tells Python that VARIABLENAME1 should be on the vertical/y axis and VARIABLENAME2 should be on the horizontal/x axis. Further, data points will be displayed in different colours as per VARIABLENAME3. Not all graph types will actually need all of this information. For instance the bar chart above only needed to know an x axis variable.
At this stage, Python has just prepared a blank canvas with those parameters, but has not yet created a graph. That is achieved by + geom_GRAPHTYPE(). As you will soon see there are different graph types available. Above we used + geom_bar(). Then you can add further details to your graph with + OTHERSTUFF(). Above we added information on axis and graph titles (using the labs() option) and added a different theme with + theme_classic().
Bivariate data representations¶
In this section you can learn how to produce graphs that visualise two variables.
Perhaps the most classic of such plots is a scatter diagram.
(
ggplot(df_bball, aes(x = 'hruns', y = 'salary')) +
geom_point() +
theme_classic() +
labs(x = "Home Runs", y = "Salary", title ="Scatter plot")
)
Here we can see that in general there is a positive relationship between the number of career home runs and salary.
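The strength of a linear relationship seen in a scatter plot can be summarised by the correlation coefficient. Here is a minimal sketch using made-up numbers (not the actual baseball data).

```python
import pandas as pd

# Made-up home runs and salaries, for illustration only
toy = pd.DataFrame({
    "hruns": [10, 50, 120, 200],
    "salary": [300000, 500000, 900000, 2000000],
})

# Pearson correlation between the two columns
corr = toy["hruns"].corr(toy["salary"])
print(round(corr, 3))
```

For the baseball data the equivalent call would be df_bball['hruns'].corr(df_bball['salary']). A positive value confirms the upward pattern in the scatter plot.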
After we learned that the number of home runs differs significantly between positions (for some positions the job is to hit home runs, other positions have other specialisations), it is obvious that there is more to the data. Let’s see how we can use position information in this graph. We could add a colour differentiation.
(
ggplot(df_bball, aes(x = 'hruns', y = 'salary', color = 'position')) +
geom_point() +
theme_classic() +
labs(x = "Home Runs", y = "Salary", title ="Scatter plot")
)
But this is somewhat too messy to actually see anything. Another way to do this is what is called faceting.
(
ggplot(df_bball, aes(x = 'hruns', y = 'salary')) +
geom_point() +
theme_classic() +
facet_wrap('position') +
labs(x = "Home Runs", y = "Salary", title ="Scatter plots")
)
The addition of + facet_wrap('position') produced a scatter plot for each position separately. In this case this does not really help us a lot, other than to see that outfielders hit, on average, more home runs than players in other positions. We should also be aware that there is most likely a correlation between experience and the number of home runs, and experience will have an important role in determining salaries.
But faceting can be incredibly revealing. Look at this graph:
(
ggplot(df_bball, aes(x = 'position')) +
geom_bar() +
theme_classic() +
facet_grid('ethnicity') +
labs(title ="Bar graph of positions (conditional on ethnicity) ")
)
This very clearly illustrates that, at the time, black players predominantly played in the outfield.
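The numbers behind such a faceted bar chart are a two-way table of counts, which pd.crosstab produces directly. The sketch below uses made-up data for six players.

```python
import pandas as pd

# Made-up ethnicity and position data, for illustration only
toy = pd.DataFrame({
    "ethnicity": ["black", "black", "white", "white", "hispanic", "white"],
    "position": ["Outfield", "Outfield", "Catcher",
                 "Outfield", "Shortstop", "Catcher"],
})

# Rows: ethnicity, columns: position, cells: number of players
table = pd.crosstab(toy["ethnicity"], toy["position"])
print(table)
```

With the baseball data, pd.crosstab(df_bball['ethnicity'], df_bball['position']) would give the counts displayed in the graph above.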
Other graphs¶
In the above you really only scratched the surface of what ggplot can do. Another very common type of graph is a time-series plot. The Baseball data do not have any time dimension, so there is no obvious use for a time-series plot. However, it would be constructed in a very similar way, just using the geom_line() option. Explore the possibilities in Hadley Wickham's book.
Summary¶
In this walkthrough you learned how to produce visual representations of data. These can be a very important tool when you communicate your work.