Simple data handling and analysis in Python - Football data¶

This exercise is part of the ECLR page.

Here we analyse some football data. Data are downloaded from Football-data.com. We will use these data to expose you to some basic data handling skills in Python.

Prepare the workspace¶

We start by loading a range of packages. Recall that you need to install these first before you can use them. For instance to install the pingouin package you would use the command pip install pingouin in the terminal. You can do the same when you want to update a package.

Now

In [365]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pingouin as pg
from lets_plot import *
from datetime import datetime

The file contains a lot of variables. Explanations by the website can be found here.

The following is a bespoke plot style for the lets_plot function. There is no need to do this, you could also use the default style.

In [366]:
LetsPlot.setup_html(no_js=True)

### You don't need to use these settings yourself,
### they are just here to make the charts look nicer!
# Set the plot style for prettier charts:
plt.style.use(
    "C:/Users/msassrb2/The University of Manchester Dropbox/Ralf Becker/Python/Football/plot_style.txt"
)

Load the data¶

The following loads the data directly from the Football-data.com website. As this links to the 2024/25 season, this will be constantly updated as more matches are being played.

In [367]:
df2425 = pd.read_csv(
    "https://www.football-data.co.uk/mmz4281/2324/E0.csv",
    na_values="",
)

In order to do some analysis that will not change everytime a new match is being played, we shall also load data from previous seasons and attach these to each other to create one big dataset.

In [368]:
df2324 = pd.read_csv(
    "https://www.football-data.co.uk/mmz4281/2324/E0.csv",
    na_values="",
)

df2223 = pd.read_csv(
    "https://www.football-data.co.uk/mmz4281/2223/E0.csv",
    na_values="",
)

df2122 = pd.read_csv(
    "https://www.football-data.co.uk/mmz4281/2122/E0.csv",
    na_values="",
)

Explore the data¶

Let's explore what sort of information is contained in the dataframes. See what the following commands do. Each of these give you some glimpse of the information in the dataframe.

In [369]:
# df2324.columns
# df2324.shape
# df2324.info
# df2324.head
# df2324.tail
df2324.dtypes
Out[369]:
Div           object
Date          object
Time          object
HomeTeam      object
AwayTeam      object
FTHG           int64
FTAG           int64
FTR           object
HTHG           int64
HTAG           int64
HTR           object
Referee       object
HS             int64
AS             int64
HST            int64
AST            int64
HF             int64
AF             int64
HC             int64
AC             int64
HY             int64
AY             int64
HR             int64
AR             int64
B365H        float64
B365D        float64
B365A        float64
BWH          float64
BWD          float64
BWA          float64
IWH          float64
IWD          float64
IWA          float64
PSH          float64
PSD          float64
PSA          float64
WHH          float64
WHD          float64
WHA          float64
VCH          float64
VCD          float64
VCA          float64
MaxH         float64
MaxD         float64
MaxA         float64
AvgH         float64
AvgD         float64
AvgA         float64
B365>2.5     float64
B365<2.5     float64
P>2.5        float64
P<2.5        float64
Max>2.5      float64
Max<2.5      float64
Avg>2.5      float64
Avg<2.5      float64
AHh          float64
B365AHH      float64
B365AHA      float64
PAHH         float64
PAHA         float64
MaxAHH       float64
MaxAHA       float64
AvgAHH       float64
AvgAHA       float64
B365CH       float64
B365CD       float64
B365CA       float64
BWCH         float64
BWCD         float64
BWCA         float64
IWCH         float64
IWCD         float64
IWCA         float64
PSCH         float64
PSCD         float64
PSCA         float64
WHCH         float64
WHCD         float64
WHCA         float64
VCCH         float64
VCCD         float64
VCCA         float64
MaxCH        float64
MaxCD        float64
MaxCA        float64
AvgCH        float64
AvgCD        float64
AvgCA        float64
B365C>2.5    float64
B365C<2.5    float64
PC>2.5       float64
PC<2.5       float64
MaxC>2.5     float64
MaxC<2.5     float64
AvgC>2.5     float64
AvgC<2.5     float64
AHCh         float64
B365CAHH     float64
B365CAHA     float64
PCAHH        float64
PCAHA        float64
MaxCAHH      float64
MaxCAHA      float64
AvgCAHH      float64
AvgCAHA      float64
dtype: object

When looking at the .dtypes you will see that Python has assigned certain types to variables. Variables with dtype float64 are numbers. If a variable has datatype objet this means that Python has not yet made up its mind on what the right datatype is. Below we will change some datatypes when this is necessary.

Let's also look at one particular observation. Appending the dataframe name with .iloc[rows,cols] where rows and cold indicate which rows and columns you wish to see is one way to single out which parts of the dataframe you want to see. By indicating : in the cols space we say that we wish to see all columns.

In [370]:
df2324.iloc[1,:]
Out[370]:
Div                     E0
Date            12/08/2023
Time                 12:30
HomeTeam           Arsenal
AwayTeam     Nott'm Forest
FTHG                     2
FTAG                     1
FTR                      H
HTHG                     2
HTAG                     0
HTR                      H
Referee           M Oliver
HS                      15
AS                       6
HST                      7
AST                      2
HF                      12
AF                      12
HC                       8
AC                       3
HY                       2
AY                       2
HR                       0
AR                       0
B365H                 1.18
B365D                  7.0
B365A                 15.0
BWH                   1.17
BWD                    7.5
BWA                   15.5
IWH                    1.2
IWD                   7.25
IWA                   14.0
PSH                   1.18
PSD                   7.86
PSA                  15.67
WHH                   1.12
WHD                    6.5
WHA                   12.0
VCH                   1.14
VCD                    7.5
VCA                   17.0
MaxH                  1.21
MaxD                   8.5
MaxA                  17.5
AvgH                  1.18
AvgD                  7.64
AvgA                 15.67
B365>2.5              1.44
B365<2.5              2.75
P>2.5                 1.42
P<2.5                 2.93
Max>2.5               1.45
Max<2.5               2.98
Avg>2.5               1.42
Avg<2.5               2.85
AHh                   -2.0
B365AHH               1.88
B365AHA               2.02
PAHH                  1.88
PAHA                  2.01
MaxAHH                1.91
MaxAHA                2.06
AvgAHH                1.87
AvgAHA                1.99
B365CH                1.18
B365CD                 7.0
B365CA                15.0
BWCH                  1.18
BWCD                   7.0
BWCA                  14.5
IWCH                   1.2
IWCD                   7.0
IWCA                  14.0
PSCH                  1.19
PSCD                   8.0
PSCA                  16.0
WHCH                  1.12
WHCD                   6.5
WHCA                  12.0
VCCH                  1.22
VCCD                   7.0
VCCA                  13.0
MaxCH                 1.22
MaxCD                  8.4
MaxCA                 19.0
AvgCH                 1.19
AvgCD                 7.43
AvgCA                15.98
B365C>2.5              1.5
B365C<2.5             2.63
PC>2.5                1.49
PC<2.5                2.65
MaxC>2.5              1.52
MaxC<2.5              2.79
AvgC>2.5              1.49
AvgC<2.5              2.63
AHCh                  -2.0
B365CAHH              1.95
B365CAHA              1.98
PCAHH                 1.93
PCAHA                 1.97
MaxCAHH               2.01
MaxCAHA               2.09
AvgCAHH               1.95
AvgCAHA               1.92
Name: 1, dtype: object

Most likely you can see that not all elements of this list and only the first and last variables are shown. If you want to show more lines you can change the default display restriction.

In [371]:
pd.options.display.max_rows = 200
df2324.iloc[1,:]
Out[371]:
Div                     E0
Date            12/08/2023
Time                 12:30
HomeTeam           Arsenal
AwayTeam     Nott'm Forest
FTHG                     2
FTAG                     1
FTR                      H
HTHG                     2
HTAG                     0
HTR                      H
Referee           M Oliver
HS                      15
AS                       6
HST                      7
AST                      2
HF                      12
AF                      12
HC                       8
AC                       3
HY                       2
AY                       2
HR                       0
AR                       0
B365H                 1.18
B365D                  7.0
B365A                 15.0
BWH                   1.17
BWD                    7.5
BWA                   15.5
IWH                    1.2
IWD                   7.25
IWA                   14.0
PSH                   1.18
PSD                   7.86
PSA                  15.67
WHH                   1.12
WHD                    6.5
WHA                   12.0
VCH                   1.14
VCD                    7.5
VCA                   17.0
MaxH                  1.21
MaxD                   8.5
MaxA                  17.5
AvgH                  1.18
AvgD                  7.64
AvgA                 15.67
B365>2.5              1.44
B365<2.5              2.75
P>2.5                 1.42
P<2.5                 2.93
Max>2.5               1.45
Max<2.5               2.98
Avg>2.5               1.42
Avg<2.5               2.85
AHh                   -2.0
B365AHH               1.88
B365AHA               2.02
PAHH                  1.88
PAHA                  2.01
MaxAHH                1.91
MaxAHA                2.06
AvgAHH                1.87
AvgAHA                1.99
B365CH                1.18
B365CD                 7.0
B365CA                15.0
BWCH                  1.18
BWCD                   7.0
BWCA                  14.5
IWCH                   1.2
IWCD                   7.0
IWCA                  14.0
PSCH                  1.19
PSCD                   8.0
PSCA                  16.0
WHCH                  1.12
WHCD                   6.5
WHCA                  12.0
VCCH                  1.22
VCCD                   7.0
VCCA                  13.0
MaxCH                 1.22
MaxCD                  8.4
MaxCA                 19.0
AvgCH                 1.19
AvgCD                 7.43
AvgCA                15.98
B365C>2.5              1.5
B365C<2.5             2.63
PC>2.5                1.49
PC<2.5                2.65
MaxC>2.5              1.52
MaxC<2.5              2.79
AvgC>2.5              1.49
AvgC<2.5              2.63
AHCh                  -2.0
B365CAHH              1.95
B365CAHA              1.98
PCAHH                 1.93
PCAHA                 1.97
MaxCAHH               2.01
MaxCAHA               2.09
AvgCAHH               1.95
AvgCAHA               1.92
Name: 1, dtype: object

As you can see from the output this is information, this line represents information on a match played between Arsenal (home team) and Nottingham Forest (away team). Arsenal won the match 2-1. The half time score was 2-0. Also note that, although having called df2324.iloc[1,:] this is not the first match in the dataframe. It is the second as Python starts counting at 0. This is a very computer science thing to do and you will have to get used to that. So this is actually information on the 2nd match in the dataframe.

Dataframes have an index, for this dataframe you can find it as follows:

In [372]:
df2324.index
Out[372]:
RangeIndex(start=0, stop=380, step=1)

Here you see that the index starts at 0 and it stops at 380. This actually means that the last index value is 379 and therefore there are 380 matches in this dataframe.

Now we will add a new variable indicating the season, as well as a variable (MatchNo) that indexes the match and then join these dataframes into one. We add a season variable as it makes identifying matches from particular seasons easier. A nice introduction to merging and concatenating dataframes in python is available from here.

In [373]:
# creates Season variable in each dataframe
# insert the variable in position 1, i.e. 2nd variable
df2122.insert(1, 'Season', "2122")
df2223.insert(1, 'Season', "2223")
df2324.insert(1, 'Season', "2324")

# add Match number variable, e.g. "202122_4" is the 5th match in the 202122 season
# insert the variable as the 3rd variable
df2122["MNo"] = range(380)
df2122.insert(2,'MatchNo', df2122["Season"] + "_" + df2122["MNo"].astype("str"))
df2223["MNo"] = range(380)
df2223.insert(2,'MatchNo', df2223["Season"] + "_" + df2223["MNo"].astype("str"))
df2324["MNo"] = range(380)
df2324.insert(2,'MatchNo', df2324["Season"] + "_" + df2324["MNo"].astype("str"))

# combines the dataframe
df_combined = pd.concat([df2122,df2223,df2324])

The dataframe contains a lot of variables that relate to different betting companies and their odds for betting on matches. We are not interested in these. The last variable that relates to the football match itself is AR the number of red cards for the away team. We shall therefore only select the variabless in columns 0 to 25 (or (0:26) in Python speak, recall this will pick columns indexed 0 to 25). Or you could use the .loc method which allows you to locate columns by names, here selecting columns named "Div" to "AR".

In [374]:
df_matches = df_combined.iloc[:,0:26]
# df_matches = df_combined.loc[:,"Div": "AR"]  # both lines achieve the same
df_matches.shape
Out[374]:
(1140, 26)

The last line tells us how many rows and columns we have in df_matches.

Data types¶

Before we embark on some real data analysis we ensure that some variables have the correct data types. Previously you saw that some variables were object variables. This is Python speak for "uncategorised" data types. For instance the variables HomeTeam and AwayTeam were labelled as object variables. This is a categorical variable and we will enforce this now (also for Referee, FTR and HTR).

In [375]:
df_matches["HomeTeam"] = df_matches["HomeTeam"].astype("category")
df_matches["AwayTeam"] = df_matches["AwayTeam"].astype("category")
df_matches["Referee"] = df_matches["Referee"].astype("category")
df_matches["FTR"] = df_matches["FTR"].astype("category")
df_matches["HTR"] = df_matches["HTR"].astype("category")
df_matches.dtypes
Out[375]:
Div           object
Season        object
MatchNo       object
Date          object
Time          object
HomeTeam    category
AwayTeam    category
FTHG           int64
FTAG           int64
FTR         category
HTHG           int64
HTAG           int64
HTR         category
Referee     category
HS             int64
AS             int64
HST            int64
AST            int64
HF             int64
AF             int64
HC             int64
AC             int64
HY             int64
AY             int64
HR             int64
AR             int64
dtype: object

The Date and Time variables will be turned into types that Python can recognise as dates and times.

In [376]:
df_matches['Date'] = pd.to_datetime(df_combined['Date'], format = '%d/%m/%Y')
df_matches['Time'] = pd.to_datetime(df_combined['Time'], format = '%H:%M')
df_matches.dtypes
Out[376]:
Div                 object
Season              object
MatchNo             object
Date        datetime64[ns]
Time        datetime64[ns]
HomeTeam          category
AwayTeam          category
FTHG                 int64
FTAG                 int64
FTR               category
HTHG                 int64
HTAG                 int64
HTR               category
Referee           category
HS                   int64
AS                   int64
HST                  int64
AST                  int64
HF                   int64
AF                   int64
HC                   int64
AC                   int64
HY                   int64
AY                   int64
HR                   int64
AR                   int64
dtype: object

Analysis plan¶

There is all sorts of things you may be interested from this dataset. Here are a number of things we may want to do.

  • Summary statistics for number of goals scored by home and away teams. Equally for any other variable of interest.
  • Are there more goals scored in evening matches?
  • Are away teams more efficient goal scorers than home teams?

All of the above are questions which have a match as a reference point. But we may also have questions that refer to teams rather than matches.

  • Does Liverpool score more goals than Tottenham?
  • Does Chelsea perform better in the first half of seasons as opposed to the second half of a season?
  • Does Manchester City concede more freekicks than it is awarded free kicks (the other team conceding freekicks)?

In what follows we will produce summary statistics that speak to these questions. We will however, here, not produce hypothesis tests. The analysis will remain purely descriptive.

Organise the data¶

Now that we have a basic understanding of the data and we know what sort of questions we wish to investigate it is worthwhile thinking about whether the dataframe df_matches is in the form which we would like it to be for some subsequent analysis. Recall that each row relates to one match at a time. It is therefore in a shape from which we can easily analyse the questions that relate to matches. However, the questions that relate to teams will be awkward to analyse form df_matches as the same team will sometimes be a home team (and hence appear in the HomeTeam variable) and sometimes an away team (appearing in the AwayTeam variable).

To make our life a little easier for that type of analysis we shall re-organise the information in an alternative dataframe.

The following shows a schematic of df_matches:

MatchNo HomeTeam Away Team FTHG FTAG
1 A B 2 1
2 C D 0 3

A way to organise the data that is better suited to team level analysis would be the following

MatchNo Team Opponent Home_or_Away FTG FTG_Opp
1 A B Home 2 1
1 B A Away 1 2
2 C D Home 0 3
2 D C Away 3 0

This process will double the number of rows, as for every match we will obtain two rows of data, one for the home team and one for the away team.

Once the data are in this format we should be able to perform the analysis tasks based on particular teams. We shall save the data in df_teams and then we can refer to either df_matches or df_teams depending on which is more convenient.

Reshape the data¶

We start by duplicating the dataframe into a df--Home and df_away dataframe. When you do this you use, for instance the code df_home = df_matches.copy(). If you don't add the .copy(), but merely write df_home = df_matches, you will be creating a new item in your object list called df_home, but really all you are creating is a new name for the same object df_matches. You could think of this as an alias for df_matches. If you then change something in df_home you will actually be changing the same thing in df_matches. Try and run the code below without the .copy methods added and you will get an error message.

replicate the data into two datasets

In [377]:
df_home = df_matches.copy()  
df_away = df_matches.copy()
# add the HomeAway variable
df_home.insert(5,'HomeAway', "Home")
df_away.insert(5,'HomeAway', "Away")

df_away.columns
Out[377]:
Index(['Div', 'Season', 'MatchNo', 'Date', 'Time', 'HomeAway', 'HomeTeam',
       'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG', 'HTAG', 'HTR', 'Referee',
       'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR',
       'AR'],
      dtype='object')

Now we change the names of the variables in the two dataframes, preparing for attaching them to each other. For df_home we label characteristics for the away team as opponent characteristics, e.g. corners for the home tem are labeled Corners and for the away team they are Corners_Opp.

In [378]:
# variable names not mentioned in the rename method stay the same
df_home = df_home.rename(columns={'HomeTeam': 'Team', 'AwayTeam': 'Opponent', 'FTHG': 'FTG', 'FTAG': 'FTG_Opp', 'FTR': 'FTResult', 
                        'HTHG': 'HTG', 'HTAG': 'HTG_Opp', 'HTR': 'HTResult', 'HS': 'Shots', 'AS': 'Shots_Opp', 
                        'HST': 'ShotsTarget', 'AST': 'ShotsTarget_Opp', 'HF': 'FoulsComm', 'AF': 'FoulsComm_Opp', 
                        'HC': 'Corners', 'AC': 'Corners_Opp', 'HY': 'Yellow', 'AY': 'Yellow_Opp', 'HR': 'Red', 'AR': 'Red_Opp'})

For df_away we label in the opposite way, e.g. the corners for the awy team are Corners and for the home team they are Corners_Opp. Hence this contains the same information but now presented from the perspective of the away team.

In [379]:
df_away = df_away.rename(columns={'HomeTeam': 'Opponent', 'AwayTeam': 'Team', 'FTHG': 'FTG_Opp', 'FTAG': 'FTG', 'FTR': 'FTResult', 
                        'HTHG': 'HTG_Opp', 'HTAG': 'HTG', 'HTR': 'HTResult', 'HS': 'Shots_Opp', 'AS': 'Shots', 
                        'HST': 'ShotsTarget_Opp', 'AST': 'ShotsTarget', 'HF': 'FoulsComm_Opp', 'AF': 'FoulsComm', 
                        'HC': 'Corners_Opp', 'AC': 'Corners', 'HY': 'Yellow_Opp', 'AY': 'Yellow', 'HR': 'Red_Opp', 'AR': 'Red'})

Before joining up the two dataframes we need to take care of the results columns (both the full time result FTResult and the half-time result HTResult), they are still defined as either H (=Home Win), D (=Draw) or A (=Away Win). We need to redefine them as either a Loss (=L), a draw (=D) or a win (=W), but that depends on whether we are thinking of the home or away team.

In [380]:
df_home['FTResult'] = df_home['FTResult'].cat.rename_categories({'H': 'W', 'D': 'D', 'A': 'L'})
df_home['HTResult'] = df_home['HTResult'].cat.rename_categories({'H': 'W', 'D': 'D', 'A': 'L'})
df_away['FTResult'] = df_away['FTResult'].cat.rename_categories({'H': 'L', 'D': 'D', 'A': 'W'})
df_away['HTResult'] = df_away['HTResult'].cat.rename_categories({'H': 'L', 'D': 'D', 'A': 'W'})

Now both dataframes (df_home and df_away) have the same variables. However, they are not in the same order (check df_home/away.columns). To concatenate two dataframes we earlier used the pd.concat function, but you may wonder whether that works if the variables are not in the same order. Well just try it.

In [381]:
df_teams = pd.concat([df_home,df_away])
df_teams = df_teams.sort_values(by=['MatchNo']) # sort by MatchNo
df_teams.head
Out[381]:
<bound method NDFrame.head of    Div Season  MatchNo       Date                Time HomeAway           Team  \
0   E0   2122   2122_0 2021-08-13 1900-01-01 20:00:00     Home      Brentford   
0   E0   2122   2122_0 2021-08-13 1900-01-01 20:00:00     Away        Arsenal   
1   E0   2122   2122_1 2021-08-14 1900-01-01 12:30:00     Away          Leeds   
1   E0   2122   2122_1 2021-08-14 1900-01-01 12:30:00     Home     Man United   
10  E0   2122  2122_10 2021-08-21 1900-01-01 12:30:00     Away        Burnley   
..  ..    ...      ...        ...                 ...      ...            ...   
97  E0   2324  2324_97 2023-10-29 1900-01-01 14:00:00     Home       Brighton   
98  E0   2324  2324_98 2023-10-29 1900-01-01 14:00:00     Away  Nott'm Forest   
98  E0   2324  2324_98 2023-10-29 1900-01-01 14:00:00     Home      Liverpool   
99  E0   2324  2324_99 2023-10-29 1900-01-01 15:30:00     Home     Man United   
99  E0   2324  2324_99 2023-10-29 1900-01-01 15:30:00     Away       Man City   

         Opponent  FTG  FTG_Opp  ... ShotsTarget  ShotsTarget_Opp  FoulsComm  \
0         Arsenal    2        0  ...           3                4         12   
0       Brentford    0        2  ...           4                3          8   
1      Man United    1        5  ...           3                8          9   
1           Leeds    5        1  ...           8                3         11   
10      Liverpool    0        2  ...           3                9         12   
..            ...  ...      ...  ...         ...              ...        ...   
97         Fulham    1        1  ...           7                5         12   
98      Liverpool    0        3  ...           1                8         13   
98  Nott'm Forest    3        0  ...           8                1          9   
99       Man City    0        3  ...           3               10          9   
99     Man United    3        0  ...          10                3          4   

   FoulsComm_Opp Corners  Corners_Opp  Yellow  Yellow_Opp  Red  Red_Opp  
0              8       2            5       0           0    0        0  
0             12       5            2       0           0    0        0  
1             11       4            5       2           1    0        0  
1              9       5            4       1           2    0        0  
10             6       4            8       0           0    0        0  
..           ...     ...          ...     ...         ...  ...      ...  
97             8       7            3       0           3    0        0  
98             9       3            8       3           2    0        0  
98            13       8            3       2           3    0        0  
99             4       7           12       4           1    0        0  
99             9      12            7       1           4    0        0  

[2280 rows x 27 columns]>

You can now see that the first two rows represent the information from the same match, once from the home team's perspective (Brentford) and once from the away team's (Arsenal).

We now have two dataframes, df_matches and df_teams which contain identical information but structured differently. If you are a novice data analyst, then it was probably not obvious that this pre-work will be helpful. And that is fine, but you shall see in what follows that we have simplified future work significantly.

Analysis¶

Match-based questions¶

The questions we shall tackle here are

  • Summary statistics for number of goals scored by home and away teams. Equally for any other variable of interest.
  • Are there more goals scored in evening matches?
  • Are away teams more efficient goal scorers than home teams?

We will use the df_matches dataframe to answer these questions. Let's start by looking at the home (FTHG) and away goals (FTAG) at full time. We use the .describe() method to these two variables to get summary statistics.

In [382]:
df_matches[['FTHG', 'FTAG']].describe()
Out[382]:
FTHG FTAG
count 1140.000000 1140.000000
mean 1.649123 1.334211
std 1.375073 1.244361
min 0.000000 0.000000
25% 1.000000 0.000000
50% 1.000000 1.000000
75% 2.000000 2.000000
max 9.000000 8.000000

As you can see, the average number of goals scored by the home team is larger by about 0.3 on average.

Next we turn to asking whether evening matches produce more goals than earlier matches. We do have the time variable we can use to separate out the matches by the time of the day it is played. What we do not yet have in the dataframe is the number of goals scored altogether. Let us create this variable ('FTG') first.

In [383]:
df_matches['FTG'] = df_matches['FTHG'] + df_matches['FTAG'] 

Now we calculate the average number of goals by kick-off time ('Time').

The following command does the following: take df_matches and group the dataframe by the Time variable. This basically creates different dataframes for all matches of a particular time. Then only keep the FTG variable. Now you have "buckets" for each different kick-off time and these buckets only contain the FTG variable. Now we aggregate these "Buckets" (.agg) by counting the number of matches in each bucket (np.size), as well as calculating the mean (np.mean) and variance (np.var).

In [384]:
tab1 = df_matches.groupby('Time')['FTG'].agg([np.size,np.mean, np.var])
tab1
Out[384]:
size mean var
Time
1900-01-01 12:00:00 3 1.666667 1.333333
1900-01-01 12:30:00 84 2.809524 3.095812
1900-01-01 13:00:00 4 2.500000 1.666667
1900-01-01 13:30:00 4 2.250000 1.583333
1900-01-01 14:00:00 168 2.940476 2.774879
1900-01-01 14:15:00 3 2.000000 1.000000
1900-01-01 15:00:00 387 3.043928 2.886666
1900-01-01 15:30:00 4 3.250000 0.916667
1900-01-01 15:45:00 1 2.000000 NaN
1900-01-01 16:00:00 21 3.666667 2.333333
1900-01-01 16:15:00 1 1.000000 NaN
1900-01-01 16:30:00 89 3.370787 3.213228
1900-01-01 17:30:00 98 2.867347 3.023459
1900-01-01 18:00:00 2 2.000000 8.000000
1900-01-01 19:00:00 2 2.000000 2.000000
1900-01-01 19:30:00 61 2.737705 2.930055
1900-01-01 19:45:00 56 2.428571 2.249351
1900-01-01 20:00:00 127 3.055118 2.846144
1900-01-01 20:15:00 25 3.600000 3.583333

This way of chaining methods is a popular way of creating transformed data.

As you can see there were many different kick-off times, some with very few matches (e.g. 16:15 with only one match). We therefore create a variable that categorises the times into "noon", "early afternoon", "late afternoon" or "evening". Here we use the pd.cut function. You will not have to remember all these commands. But if you search for terms like "create categorical variable from continuous in python" you will find advice that most likely leads you to pd.cut. Searching for the right function to use is an essential coding skill.

In [385]:
# create cut-offs. As the Time variable is in Time format these need to have the same format
time_cuts = pd.to_datetime(['11:00','14:45','15:50','19:15','23:00'], format = '%H:%M')
df_matches["TimeofDay"] = pd.cut(
    df_matches["Time"],
    bins=time_cuts,
    labels=["noon", "early afternoon", "late afternoon", "evening"],
    ordered=True,
)

Confirm that this has done what you expect.

In [386]:
df_matches[['MatchNo','Time','TimeofDay']].head
Out[386]:
<bound method NDFrame.head of       MatchNo                Time        TimeofDay
0      2122_0 1900-01-01 20:00:00          evening
1      2122_1 1900-01-01 12:30:00             noon
2      2122_2 1900-01-01 15:00:00  early afternoon
3      2122_3 1900-01-01 15:00:00  early afternoon
4      2122_4 1900-01-01 15:00:00  early afternoon
..        ...                 ...              ...
375  2324_375 1900-01-01 16:00:00   late afternoon
376  2324_376 1900-01-01 16:00:00   late afternoon
377  2324_377 1900-01-01 16:00:00   late afternoon
378  2324_378 1900-01-01 16:00:00   late afternoon
379  2324_379 1900-01-01 16:00:00   late afternoon

[1140 rows x 3 columns]>

Looks right. That means we can repeat the earlier analysis and group by TimeofDay rather than Time to create more equally sized groups.

In [387]:
tab2 = df_matches.groupby('TimeofDay')['FTG'].agg([np.size,np.mean, np.var])
tab2
Out[387]:
size mean var
TimeofDay
noon 266 2.857143 2.809704
early afternoon 392 3.043367 2.860007
late afternoon 213 3.131455 3.114713
evening 269 2.903346 2.886145

Not surprisingly you do get different average numbers of goals. You could continue to perform hypothesis tests to check whether there are statistically significant differences, but that is going too far for this walk-through.

The last match based question we will investigate here is whether away teams are more efficient goal scorers than home teams? We already used the the goals scored variables (FTHG, FTAG). Here we shall relate this to the number of shots taken by the home team (HS) or the away team (AS).

In effect we will calculate how many shots are required per goal (HS/FTHG and AS/FTAG). Lower values are better.

In [388]:
df_matches['HG_eff'] = df_matches['HS']/df_matches['FTHG'] 
df_matches['AG_eff'] = df_matches['AS']/df_matches['FTAG'] 
df_matches[['FTHG', 'HS', 'HG_eff', 'FTAG', 'AS', 'AG_eff']].describe()
c:\Users\msassrb2\Anaconda3\lib\site-packages\numpy\lib\function_base.py:4009: RuntimeWarning: invalid value encountered in subtract
  diff_b_a = subtract(b, a)
Out[388]:
FTHG HS HG_eff FTAG AS AG_eff
count 1140.000000 1140.000000 1140.000000 1140.000000 1140.000000 1140.0
mean 1.649123 14.400877 inf 1.334211 11.742982 inf
std 1.375073 5.888024 NaN 1.244361 5.264696 NaN
min 0.000000 1.000000 1.400000 0.000000 1.000000 1.0
25% 1.000000 10.000000 5.666667 0.000000 8.000000 5.5
50% 1.000000 14.000000 9.500000 1.000000 11.000000 9.5
75% 2.000000 18.000000 20.000000 2.000000 15.000000 NaN
max 9.000000 36.000000 inf 8.000000 31.000000 inf

Why can the above not calculate average efficiencies? The issue is that there are a substantial number of missing values, the reason being that we are dividing by the number of goals scored and frequently that will be 0.

However, as the minimum number of shots taken (HS and AS) is always larger than 0 (at least in this sample) we can calculate how many goals are scored per shot (FTHG/HS and FTAG/AS). Higher values are better.

In [389]:
df_matches['HG_eff'] = df_matches['FTHG']/df_matches['HS'] 
df_matches['AG_eff'] = df_matches['FTAG']/df_matches['AS'] 
df_matches[['FTHG', 'HS', 'HG_eff', 'FTAG', 'AS', 'AG_eff']].describe().round(2)   # round(2) forces rounding to 2 digits
Out[389]:
FTHG HS HG_eff FTAG AS AG_eff
count 1140.00 1140.00 1140.00 1140.00 1140.00 1140.00
mean 1.65 14.40 0.12 1.33 11.74 0.12
std 1.38 5.89 0.10 1.24 5.26 0.12
min 0.00 1.00 0.00 0.00 1.00 0.00
25% 1.00 10.00 0.05 0.00 8.00 0.00
50% 1.00 14.00 0.11 1.00 11.00 0.11
75% 2.00 18.00 0.18 2.00 15.00 0.18
max 9.00 36.00 0.71 8.00 31.00 1.00

Approximately the efficiency seems to be similar between away and home team. While the home team does score more on average, they also take more shots.

As an aside. We see that there was (at least) one match where the home team took 36 shots and another where the away team scored 31. Let's figure out which matches these were.

In [390]:
df_matches[(df_matches['HS'] == 36)|(df_matches['AS'] == 31)]
Out[390]:
Div Season MatchNo Date Time HomeTeam AwayTeam FTHG FTAG FTR ... HC AC HY AY HR AR FTG TimeofDay HG_eff AG_eff
362 E0 2122 2122_362 2022-05-15 1900-01-01 14:00:00 West Ham Man City 2 2 D ... 4 9 3 1 0 0 4 noon 0.333333 0.064516
376 E0 2324 2324_376 2024-05-19 1900-01-01 16:00:00 Liverpool Wolves 2 0 H ... 10 2 1 1 0 1 2 late afternoon 0.055556 0.000000

2 rows × 30 columns

It may be interesting whether there is a relationship between efficiency and the number of shots taken.

We create a scatterplot with shots taken by the home team (HS) on the horizontal axis and the home team's goal efficiency on the vertical axis (HG_eff). The package we use for plotting here is the lets_plo package which implements the powerful graphing language introduced for ggplot in the R programming language. You can find guidance to the functionality here.

In [391]:
p1 = ggplot(df_matches, aes(x = 'HS', y = 'HG_eff')) + geom_point()
p1
Out[391]:
0 5 10 15 20 25 30 35 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 HG_eff HS

This seems to suggest that there is a negative relationship. But note that behind each point there may be multiple observations.

Team-based questions¶

The questions in this section were:

  • Does Liverpool score more goals than Tottenham?
  • Does Chelsea perform better in the first half of seasons as opposed to the second half of a season?
  • Does Manchester City concede more freekicks than it is awarded free kicks (the other team conceding freekicks)?

We will use the df_teams dataframe to answer these questions.

Beginning with the first question we create a temporary dataframe that only keeps observations involving Liverpool and Tottenham. We call this temp as we know that after this particular analysis we will not need this dataframe any longer.

In [392]:
temp = df_teams.query("Team == 'Liverpool' or Team == 'Tottenham'").copy()
tab3 = temp.groupby('Team')['FTG'].agg(np.mean)
tab3
Out[392]:
Team
Arsenal                  NaN
Aston Villa              NaN
Bournemouth              NaN
Brentford                NaN
Brighton                 NaN
Burnley                  NaN
Chelsea                  NaN
Crystal Palace           NaN
Everton                  NaN
Fulham                   NaN
Leeds                    NaN
Leicester                NaN
Liverpool           2.236842
Luton                    NaN
Man City                 NaN
Man United               NaN
Newcastle                NaN
Norwich                  NaN
Nott'm Forest            NaN
Sheffield United         NaN
Southampton              NaN
Tottenham           1.868421
Watford                  NaN
West Ham                 NaN
Wolves                   NaN
Name: FTG, dtype: float64

You can see that the table does show what we ant to see. In our sample Liverpool averages 2.24 goals a match whereas Tottenham averages 1.87. However, the table does show more than that. It shows a list of all the teams in our sample. Recall that we defined the Team variable as a category variable. Python does have all these categories attached to the variable, but in temp we only find matches for Liverpool and Tottenham (as intended). Surely there is a way to remove these. An internet search for "python remove unused categories" should deliver a way to do so. Having said that, you may have to look around for alternative solutions as the most obvious one using .remove_unused_categories() does not work here.

In [393]:
temp = df_teams.query("Team == 'Liverpool' or Team == 'Tottenham'").copy()
tab4 = temp.groupby('Team')['FTG'].agg(np.mean).dropna()  # .dropna() drops all rows with NA
tab4
Out[393]:
Team
Liverpool    2.236842
Tottenham    1.868421
Name: FTG, dtype: float64

This is a nice opportunity to show another data visualisation. The number of goals scored is discrete variable and comparing bar charts of the number of goals scored in matches may be a nice way to visualise the difference between the two teams.

In [394]:
p2 = ggplot(temp, aes(x = 'FTG', fill = 'Team')) + geom_bar(position = "dodge") 
p2
Out[394]:
0 2 4 6 8 0 5 10 15 20 25 30 35 count FTG Team Liverpool Tottenham

We can see confirmed that Tottenham seems to have more matches with smaller number of goals (0, 1 and 2) than Liverpool.

Let us turn to the next question, whether Chelsea perform better in the first half of seasons as opposed to the second half of a season? To answer this question we start by creating a temporary dataframe containing data with Chelsea matches.

In [395]:
temp = df_teams.query("Team == 'Chelsea'").sort_values('Date').copy()

This contains 114 matches for three seasons (2021/22, 2022/23 and 2023/24), 38 matches in each season. We now need to identify the first 19 matches in each season. We shall create a variable ('Order') that takes values "1" or "2" for the fits and second half of the season respectively. Then we can apply the .groupby method to get separate statistics for the two halfs.

There are different ways to do that. In each season Chelsea will play each other team twice, we could associate the first of these matches to "first half". This is not a task for which the solution is immediately obvious. Here you could either search the internet for a solution, but it is not immediately obvious what the important keywords would be. This feels like a problem which you first have to explain properly. This is a type of problem which is a great test to see whether ChatGPT can be of help.

You could prepare a little problem description to ChatGPT: "I have a python dataframe with variables 'Date' and 'Opponent' listing all dates and opponents against which a sports teanm played matches. Each opponent occurs twice. I want to create a new variable called 'Order' that takes the value 1 for the first match and the value 2 for the second match against each opponent. Can you please help me how to achieve this."

The answer I got from ChatGPT

Group by 'Opponent' and assign order number (1 for first, 2 for second)¶

df['Order'] = df.groupby('Opponent').cumcount() + 1

There was more in the reply (incl. explanations) but this is the core of the answer. The problem was set without the mention of different seasons, so the application had to adjust for that.

In [396]:
temp['Order'] = temp.groupby(['Season','Opponent']).cumcount() + 1

Now we have the order variable and we can check whether average goals ('FTG'), shots takes ('Shots') and shots on target ('ShotsTarget').

In [397]:
tab4 = temp.groupby('Order')[['FTG','Shots','ShotsTarget']].agg(np.mean)
tab4
Out[397]:
FTG Shots ShotsTarget
Order
1 1.614035 13.491228 4.789474
2 1.736842 15.000000 5.456140

The last question we turn to is whether Manchester City ("Man City" in the 'Team' column) concede more freekicks than it is awarded free kicks (the other team conceding freekicks)? This is straightforward if you understand that the relevant information is in the FoulsComm and the FoulsComm_Opp variable. There are a number of ways you could answer this question. For instance you could compare the mean value across the 114 matches for Manchester City for both variables. But let's look at this in another way. Let's check in how many matches Manchester City players were fouled more often than Manchester City players fouled their opponent players. To this we create a new variable which takes the value true if Man City committed more fouls than their opponents (FoulsComm>FoulsComm_Opp) and false otherwise.

In [402]:
temp = df_teams.query("Team == 'Man City'").copy()
temp['Foul_compare'] = (temp['FoulsComm'] > temp['FoulsComm_Opp'])
p3 = ggplot(temp, aes(x = 'Foul_compare')) + geom_bar() 
p3
Out[402]:
true false 0 10 20 30 40 50 60 70 80 count Foul_compare

So clearly there are many more matches in which Manchester City players are fouled more often by their opponents than the other way round.

Summary¶

In this walk-through you have practiced a lot of important generic coding skills

  • Searched for the right functions to solve a problem
  • Asked ChatGPT to help with a problem
  • Adjusted advice from the internet or ChatGPT to fit your problem
  • Checked whether your code has actually done what you wanted to do

And you have done the following substantial tasks

  • Loaded data
  • Reorganised data
  • Descriptive statistics

This exercise is part of the ECLR page.