%load_ext pretty_jupyter

Introduction¶

This walk-through is part of the ECLR page.

Here we introduce one of the most powerful tools Python has to offer: the graphical representation of data. Adrian Pagan, an excellent Australian econometrician, once said “A simple plot tells a lot!”. So let’s see how to create simple plots.

Because Python is open-source software, you will find quite a few different ways to produce graphics. The packages matplotlib and seaborn have been the most popular graphing packages in Python. A more recent package is lets-plot (imported as lets_plot), and this is the package we will introduce here. It is essentially a replica of the very popular ggplot2 package for the R programming language. As econometricians will often also be familiar with R, we introduce graphing with this package. But as a regular Python user you may want to develop a wider range of graphics, and then you may also want to use other graphics packages.

In this walkthrough you will learn to

create bar charts, histograms and density plots for a single variable
create scatter plots for two variables
use colour and faceting to bring a categorical variable into a plot

Preparing your workfile¶

Install the lets-plot package in case you have not yet done so on your computer (pip install lets-plot in your terminal).

We add the basic packages needed for this work:

import numpy as np
import pandas as pd
from lets_plot import *

Before you can create any graphs you have to call the LetsPlot.setup_html() function.

# Lets_plot requires a setup
LetsPlot.setup_html()

Loading a dataset and data tidying¶

Let's get a dataset to look at. We shall use the Baseball wages dataset, mlb1.csv. Either download that file to your working directory or link directly to the file as shown below. Before you load a CSV (or Excel) file you should always inspect it to understand what to expect. Upon inspection you will find that missing values are coded as "." and hence we pass that information on to the pd.read_csv function via the na_values argument.

# Load the dataset
url = "https://raw.githubusercontent.com/datasquad/ECLR/refs/heads/gh-pages/data/mlb1.csv"
df_bball = pd.read_csv(url, na_values = '.')
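To see what the na_values argument does without downloading anything, here is a minimal sketch using a made-up three-row CSV held in memory. The column names mirror the dataset, but the values are invented for illustration.

```python
import io
import pandas as pd

# Invented mini CSV in which one missing value is coded as "."
raw = "salary,whitepop\n100,5.5\n200,.\n300,7.0\n"

# Without na_values the "." cannot be parsed as a number,
# so the whole column is read as text
df_text = pd.read_csv(io.StringIO(raw))

# With na_values="." the entry becomes NaN and the column stays numeric
df_num = pd.read_csv(io.StringIO(raw), na_values=".")

print(df_text["whitepop"].dtype)         # a non-numeric (text) dtype
print(df_num["whitepop"].dtype)          # a float dtype
print(df_num["whitepop"].isna().sum())   # number of missing values
```

This is why whitepop and similar columns show fewer non-null entries than rows in the df_bball.info() output.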

Let's check out what variables we have in this data-file.

df_bball.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 353 entries, 0 to 352
Data columns (total 47 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   salary    353 non-null    int64  
 1   teamsal   353 non-null    int64  
 2   nl        353 non-null    int64  
 3   years     353 non-null    int64  
 4   games     353 non-null    int64  
 5   atbats    353 non-null    int64  
 6   runs      353 non-null    int64  
 7   hits      353 non-null    int64  
 8   doubles   353 non-null    int64  
 9   triples   353 non-null    int64  
 10  hruns     353 non-null    int64  
 11  rbis      353 non-null    int64  
 12  bavg      353 non-null    int64  
 13  bb        353 non-null    int64  
 14  so        353 non-null    int64  
 15  sbases    353 non-null    int64  
 16  fldperc   353 non-null    int64  
 17  frstbase  353 non-null    int64  
 18  scndbase  353 non-null    int64  
 19  shrtstop  353 non-null    int64  
 20  thrdbase  353 non-null    int64  
 21  outfield  353 non-null    int64  
 22  catcher   353 non-null    int64  
 23  yrsallst  353 non-null    int64  
 24  hispan    353 non-null    int64  
 25  black     353 non-null    int64  
 26  whitepop  330 non-null    float64
 27  blackpop  339 non-null    float64
 28  hisppop   330 non-null    float64
 29  pcinc     353 non-null    int64  
 30  gamesyr   353 non-null    float64
 31  hrunsyr   353 non-null    float64
 32  atbatsyr  353 non-null    float64
 33  allstar   353 non-null    float64
 34  slugavg   353 non-null    float64
 35  rbisyr    353 non-null    float64
 36  sbasesyr  353 non-null    float64
 37  runsyr    353 non-null    float64
 38  percwhte  330 non-null    float64
 39  percblck  330 non-null    float64
 40  perchisp  330 non-null    float64
 41  blckpb    330 non-null    float64
 42  hispph    330 non-null    float64
 43  whtepw    330 non-null    float64
 44  blckph    330 non-null    float64
 45  hisppb    330 non-null    float64
 46  lsalary   353 non-null    float64
dtypes: float64(20), int64(27)
memory usage: 129.7 KB

All variables are coded as numeric variables. You can find short variable descriptions here.

df_bball.shape

(353, 47)

The datafile contains information on 353 baseball players and contains 47 variables. Let's look at an individual observation.

df_bball.iloc[9,17:23]

frstbase    0.0
scndbase    0.0
shrtstop    0.0
thrdbase    0.0
outfield    1.0
catcher     0.0
Name: 9, dtype: float64

You can see that there are several variables which indicate a player's fielding position. For instance, the tenth player (row index 9) is an outfielder, as the player has a value of 1 in the outfield variable and 0 in the other position variables.

It is often more convenient to have one (categorical) variable which contains this positional information. The following will create one new variable called position, which indicates the position a player plays in. It turns out that this is not an entirely straightforward operation.

If you stick around for the next short section you will learn a few things about Python, in particular the use of the .apply method and of lambda functions. If you wish, you can skip it and jump ahead to the next section.

apply and lambda functions¶


apply method¶

With the apply method you can easily apply a function to each row or column of a DataFrame. It allows you to quickly perform operations on your data without needing to write complicated loops or manually iterate over each row or column. Let's illustrate this with an example using our dataframe df_bball. We wish to find the maximum of each of the ['salary', 'teamsal', 'hits'] columns.

df_bball[['salary', 'teamsal', 'hits']].apply(max,axis=0) # applies the max function to each column (axis = 0)

salary      6329213
teamsal    42866000
hits           3025
dtype: int64

You could also find the max value across all variables for the first four players (rows 0 to 3).

df_bball.iloc[0:4,].apply(max,axis=1) # applies the max function to each row (axis = 1)

0    38407380.0
1    38407380.0
2    38407380.0
3    38407380.0
dtype: float64

For each of these players the maximum is the team salary (teamsal), and clearly they all play for the same team.

Hopefully you can see that the .apply method can be used to apply functions to rows or columns.
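The difference between axis = 0 and axis = 1 is easiest to see on a tiny dataframe. The following sketch uses invented toy data (not df_bball) so that the results can be checked by eye.

```python
import pandas as pd

# Invented toy data: three players, two numeric columns
toy = pd.DataFrame({"hits": [10, 40, 25], "runs": [7, 3, 30]})

# axis=0: the function receives each column in turn -> one result per column
col_max = toy.apply(max, axis=0)

# axis=1: the function receives each row in turn -> one result per row
row_max = toy.apply(max, axis=1)

print(col_max)  # hits 40, runs 30
print(row_max)  # 10, 40, 30
```

For common summaries such as the maximum, pandas also offers direct methods; toy.max(axis=1) gives the same row-wise result here.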

lambda functions and the apply method¶

In the apply example we used an existing function, the max function. But often you will want to apply a bespoke function to either columns or rows in a dataframe. In Python you can define a named function for a chunk of code which you want to use again and again. lambda functions are not that. lambda functions are used to do little things which you will only do once, i.e. only apply to one dataframe but not repeatedly to several dataframes. For instance we could write a short function which checks whether a particular baseball player can be considered a veteran (very experienced player, years > 10).

temp = df_bball.copy()
temp['Exper_cat'] = temp.apply(lambda row: 'Veteran' if row['years'] > 10 else 'Not Veteran', axis = 1)
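For comparison, here is a sketch of the same classification written with a named function, and with a vectorised alternative based on np.where. The toy data and the names experience_label, by_function and by_where are invented for illustration; only the years column and the threshold of 10 come from the example above.

```python
import numpy as np
import pandas as pd

# Invented toy data; 'years' mirrors the column used above
toy = pd.DataFrame({"years": [2, 12, 10, 15]})

def experience_label(row):
    """Return 'Veteran' for players with more than 10 years, else 'Not Veteran'."""
    return "Veteran" if row["years"] > 10 else "Not Veteran"

# Named function passed to apply: same logic as the lambda version
toy["by_function"] = toy.apply(experience_label, axis=1)

# Vectorised alternative that avoids the row-by-row loop altogether
toy["by_where"] = np.where(toy["years"] > 10, "Veteran", "Not Veteran")

print(toy)
```

A named function pays off when the same rule is reused in several places; the lambda keeps a one-off rule next to the line that uses it.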

The apply method¶

Adjust the code of the simple example to our problem at hand. See how this is done in the next step.


This is the code which creates the new (categorical) variable position.

df_bball['position'] = df_bball.apply(
    lambda row: 'First Base' if row['frstbase'] == 1 else 
                'Second Base' if row['scndbase'] == 1 else
                'Shortstop' if row['shrtstop'] == 1 else
                'Third Base' if row['thrdbase'] == 1 else
                'Outfield' if row['outfield'] == 1 else
                'Catcher' if row['catcher'] == 1 else None, axis=1
)

# Convert the 'position' column to a categorical type (similar to as.factor() in R)
df_bball['position'] = pd.Categorical(df_bball['position'])

In a similar manner we categorise players as either black, hispanic or white. This is the information available from the data, but it may not represent more complicated ethnicities well.

df_bball['ethnicity'] = df_bball.apply(
    lambda row: 'black' if row['black'] == 1 else 
                'hispanic' if row['hispan'] == 1 else
                'white' if (row['black'] == 0) and (row['hispan'] == 0) else None, axis=1
)

# Convert the 'ethnicity' column to a categorical type (similar to as.factor() in R)
df_bball['ethnicity'] = pd.Categorical(df_bball['ethnicity'])
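The chained lambda expressions above work row by row. As an alternative sketch (not the approach used in this walkthrough), the same kind of variables can be built with vectorised tools: idxmax for the position indicators and np.select for ethnicity. The toy data below is invented and covers only three of the six positions to keep it short.

```python
import numpy as np
import pandas as pd

# Invented toy data with the same 0/1 indicator layout as df_bball
toy = pd.DataFrame({
    "frstbase": [1, 0, 0, 0],
    "outfield": [0, 1, 0, 0],
    "catcher":  [0, 0, 1, 0],
    "black":    [1, 0, 0, 0],
    "hispan":   [0, 1, 0, 0],
})

# Map indicator column names to readable labels (subset of positions for brevity)
labels = {"frstbase": "First Base", "outfield": "Outfield", "catcher": "Catcher"}
dummies = toy[list(labels)]

# idxmax(axis=1) returns the name of the first column holding the row maximum;
# rows without any 1 are set to missing, matching the None of the lambda version
position = dummies.idxmax(axis=1).map(labels).where(dummies.sum(axis=1) > 0)
toy["position"] = pd.Categorical(position)

# np.select picks the label of the first condition that is True in each row
conditions = [toy["black"] == 1, toy["hispan"] == 1]
toy["ethnicity"] = pd.Categorical(
    np.select(conditions, ["black", "hispanic"], default="white")
)

print(toy[["position", "ethnicity"]])
```

Both versions resolve ties by taking the first match, so the order of the columns (or conditions) plays the same role as the order of the if/else chain.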

Basic graphics in Python¶

Univariate data representations¶

Often you will want a graphical representation of a variable to get a good understanding of its distribution. For categorical variables this will often be a bar chart. We will use the functionality of the lets-plot package, which by and large mimics the functionality of R's ggplot2 package. A bar chart for the position variable in df_bball is implemented as follows:

(
  ggplot(df_bball, aes(x = 'position')) + 
    geom_bar() 
)

Note that the entire command is enclosed in ( ). This ensures that Python does not complain about the line breaks, which make the code much easier to read. One of the great features of the package is that you can get really nice plots with the basic function call as above, but also that you have a lot of possibilities to change the look of the graphs by adding themes or titles.

(
    ggplot(df_bball, aes(x = 'position')) + 
        geom_bar() + 
        theme_classic() +
        labs(x = "Player position", y = "Number of players", title ="Bar Chart for Player positions")
)
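The bar heights in these charts are simply the number of rows in each category, since geom_bar() counts observations by default. A quick way to check the numbers behind such a chart is value_counts(); the sketch below uses invented toy data in place of df_bball.

```python
import pandas as pd

# Invented toy data standing in for df_bball['position']
toy = pd.DataFrame(
    {"position": ["Outfield", "Catcher", "Outfield", "Shortstop", "Outfield"]}
)

# Number of rows per category: these are the bar heights of a count bar chart
counts = toy["position"].value_counts()
print(counts)
```

Applied to the real data, df_bball['position'].value_counts() should reproduce the heights of the bars shown above.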

If you wish to represent numerical data you have a few more options than a bar chart.

Let's assume you want a histogram of the hruns variable, which indicates how many career home runs a player has.

(
    ggplot(df_bball, aes(x = 'hruns')) + 
        geom_histogram() + 
        theme_classic() +
        labs(x = "Home Runs", y = "", title ="Career Home Runs")
)

You get a good idea from this about the distribution of the home run variable. Instead of geom_histogram() you could use a smooth version of the histogram, called geom_density(). Try for yourself what the option size = 1 inside the function call does by changing the value (remember you cannot break your computer).

(
    ggplot(df_bball, aes(x = 'hruns')) + 
        geom_density(size = 1) + 
        theme_classic() +
        labs(x = "Home Runs", y = "", title ="Career Home Runs")
)
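To make the link between the histogram and its smooth version more concrete, here is a small numerical sketch. It uses an invented right-skewed sample rather than the hruns data, and the bandwidth rule (Silverman's rule of thumb) is an assumption made for illustration; lets-plot may choose its bins and bandwidth differently.

```python
import numpy as np

# Invented right-skewed sample standing in for the hruns column
rng = np.random.default_rng(0)
x = rng.exponential(scale=50, size=300)

# Histogram: split the range into equal-width bins and count observations per bin
counts, edges = np.histogram(x, bins=10)

# Bandwidth from Silverman's rule of thumb (an assumption for this sketch)
iqr = np.percentile(x, 75) - np.percentile(x, 25)
h = 0.9 * min(x.std(ddof=1), iqr / 1.34) * len(x) ** (-1 / 5)

# Gaussian kernel density estimate: average of normal curves centred on each point
grid = np.linspace(x.min() - 3 * h, x.max() + 3 * h, 400)
z = (grid[:, None] - x[None, :]) / h
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))

# Area under the density curve (trapezoid rule); it should be close to 1
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))

print(counts.sum())  # every observation falls into exactly one bin
print(area)
```

The histogram reports counts per bin, whereas the density curve is scaled so that the area under it is (approximately) one.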

And now you will learn one of the incredibly powerful modifications you can undertake with ggplot. In the initial function call, in the aesthetics (aes) settings, we add color = 'position'. See what happens:

(
    ggplot(df_bball, aes(x = 'hruns', color = 'position')) + 
        geom_density(size = 1) + 
        theme_classic() +
        labs(x = "Home Runs", y = "", title ="Career Home Runs")
)

We now get different density plots by position. You can now easily see that Catchers, Shortstops and Second Base players have fewer home runs, as these are specialist fielding positions.
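A visual impression like this can be backed up with a grouped summary. The sketch below shows the idea on invented toy data; with the real data you would replace toy by df_bball.

```python
import pandas as pd

# Invented toy data; in the walkthrough you would use df_bball instead
toy = pd.DataFrame({
    "position": ["Outfield", "Outfield", "Catcher", "Catcher", "Shortstop"],
    "hruns":    [120, 80, 30, 50, 20],
})

# Mean, median and number of players by position, sorted from high to low mean
summary = (
    toy.groupby("position")["hruns"]
       .agg(["mean", "median", "count"])
       .sort_values("mean", ascending=False)
)
print(summary)
```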

It is important to understand that a data analyst will rarely remember all of these commands. A crucial programming technique is searching the internet. You could, for instance, search for "Python lets-plot overlying density plots" and then you should find an example which produces a similar plot. You can then copy and paste the code and adjust it to your dataset.

The general structure of the ggplot function¶

It is useful to understand the general structure of creating graphs through ggplot. This is the general architecture:

( ggplot(mydata, aes(y = 'VARIABLENAME1', x = 'VARIABLENAME2', color = 'VARIABLENAME3')) + geom_GRAPHTYPE() + OTHERSTUFF() )

It is worth briefly thinking about what this does, such that what follows makes more sense. We call the function ggplot(mydata, ), which instructs Python to prepare a graph using data stored in the mydata dataframe. This implies that in what follows lets-plot assumes that the variables we use are columns of this dataframe.

aes(y = 'VARIABLENAME1', x = 'VARIABLENAME2', color = 'VARIABLENAME3') then provides further detail. In particular it tells Python that VARIABLENAME1 should be on the vertical/y axis and VARIABLENAME2 should be on the horizontal/x axis. Further, data points will be displayed in different colors as per VARIABLENAME3. Note that, unlike in R, the variable names are passed as strings. Not all graph types need all of this information. For instance the bar chart above only needed an x axis variable.

At this stage, Python has just prepared a blank canvas with those parameters, but has not yet created a graph. That is achieved by + geom_GRAPHTYPE(). As you will soon see there are different graph types available. Above we used + geom_bar(). Then you can add further details to your graph with + OTHERSTUFF(). Above we added axis and graph titles (using labs()) and a different theme with + theme_classic().
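If you wonder how a plot can be built by "adding" pieces together, the following toy sketch may help. It is purely conceptual: ToyPlot is a hypothetical class written for this illustration and is not how lets-plot is actually implemented. It only shows that Python allows a class to define what + means, so that each + returns a richer plot description.

```python
class ToyPlot:
    """Hypothetical stand-in for a plot specification (not lets-plot's real class)."""

    def __init__(self, data, mapping, layers=()):
        self.data = data              # the dataframe-like object
        self.mapping = mapping        # which variable goes on which axis
        self.layers = tuple(layers)   # graph types, themes, labels added so far

    def __add__(self, layer):
        # '+' returns a NEW specification with one more layer; nothing is drawn yet
        return ToyPlot(self.data, self.mapping, self.layers + (layer,))


# Build up a specification step by step, mirroring ggplot(...) + geom + extras
canvas = ToyPlot(data={"x": [1, 2, 3]}, mapping={"x": "x"})
with_geom = canvas + "geom_bar"
final = with_geom + "theme_classic" + "labs"

print(canvas.layers)  # ()
print(final.layers)   # ('geom_bar', 'theme_classic', 'labs')
```

The blank canvas has no layers; each addition leaves the earlier object untouched and returns an extended description, which is why a base plot can be stored once and reused with different additions.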

Bivariate data representations¶

In this section you can learn how to produce graphs that visualise two variables.

Perhaps the most classic of such plots is a scatter diagram.

(
    ggplot(df_bball, aes(x = 'hruns', y = 'salary')) + 
        geom_point() + 
        theme_classic() +
        labs(x = "Home Runs", y = "Salary", title ="Scatter plot")
)

Here we can see that in general there is a positive relationship between the number of career home runs and salary.
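The strength of a linear relationship seen in a scatter plot can be summarised by the correlation coefficient. A minimal sketch on invented toy data (the numbers are made up and are not taken from mlb1.csv):

```python
import pandas as pd

# Invented toy data with the same variable names as in the scatter plot
toy = pd.DataFrame({
    "hruns":  [5, 20, 60, 150, 300],
    "salary": [200_000, 350_000, 900_000, 2_000_000, 3_500_000],
})

# Pearson correlation between the two variables (between -1 and 1)
r = toy["hruns"].corr(toy["salary"])
print(round(r, 3))
```

With the real data, df_bball['hruns'].corr(df_bball['salary']) gives the corresponding number for the plot above.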

After we learned that the number of home runs differs markedly between positions (for some positions the job is to hit home runs, other positions have other specialisations), it is obvious that there is more to the data. Let’s see how we can use position information in this graph. We could add a colour differentiation.

(
    ggplot(df_bball, aes(x = 'hruns', y = 'salary', color = 'position')) + 
        geom_point() + 
        theme_classic() +
        labs(x = "Home Runs", y = "Salary", title ="Scatter plot")
)

But this is too messy to actually see anything. Another way to do this is what is called faceting.

(
    ggplot(df_bball, aes(x = 'hruns', y = 'salary')) + 
        geom_point() + 
        theme_classic() +
        facet_wrap('position') +
        labs(x = "Home Runs", y = "Salary", title ="Scatter plots")
)

The addition of + facet_wrap('position') produced a separate scatter plot for each position. In this case this does not really help us a lot other than to see that outfielders are the players who hit, on average, more home runs than players in other positions. We should also be aware that there is most likely a correlation between experience and the number of home runs and experience will have an important role in determining salaries.

But faceting can be incredibly revealing. Look at this graph:

(
    ggplot(df_bball, aes(x = 'position')) + 
        geom_bar() + 
        theme_classic() +
        facet_grid('ethnicity') +
        labs(title ="Bar graph of positions (conditional on ethnicity) ")
)

This very clearly illustrates that, at the time, black players predominantly played in the outfield.
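The numbers behind such a faceted bar chart are a cross-tabulation of the two categorical variables. The following sketch uses invented toy data to show how pd.crosstab produces the counts and, with normalize="index", the share of each position within an ethnicity.

```python
import pandas as pd

# Invented toy data; with the real data use df_bball['ethnicity'] and df_bball['position']
toy = pd.DataFrame({
    "ethnicity": ["black", "black", "black", "white", "white", "hispanic"],
    "position":  ["Outfield", "Outfield", "First Base", "Catcher", "Outfield", "Shortstop"],
})

# Counts: one row per ethnicity, one column per position
counts = pd.crosstab(toy["ethnicity"], toy["position"])

# Row shares: each row sums to 1, which makes groups of different size comparable
shares = pd.crosstab(toy["ethnicity"], toy["position"], normalize="index")

print(counts)
print(shares.round(2))
```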

Other graphs¶

In the above you have really only scratched the surface of what ggplot can do. Another very common type of graph is a time-series plot. The Baseball data do not have any time dimension, so there is no obvious use for a time-series plot. However, it would be constructed in a very similar way, just using geom_line(). Explore the possibilities in Hadley Wickham's ggplot2 book.

Summary¶

In this walkthrough you learned how to produce visual representations of data. These can be a very important tool when you communicate your work.