This exercise is part of the ECLR page.
Here we analyse some football data. The data are downloaded from Football-data.com. We will use these data to expose you to some basic data handling skills in Python.
We start by loading a range of packages. Recall that you need to install a package before you can use it. For instance, to install the pingouin package you would run the command pip install pingouin in the terminal. You can do the same when you want to update a package.
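For example (these are terminal commands, not Python code):

# install the pingouin package
pip install pingouin
# update an already installed package to its latest version
pip install --upgrade pingouin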
Now import the packages:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import pingouin as pg
from lets_plot import *
from datetime import datetime
The file contains a lot of variables. Explanations by the website can be found here.
The following sets up lets_plot for HTML output and applies a bespoke plot style to the matplotlib charts. There is no need to do this yourself; you could also use the default styles.
LetsPlot.setup_html(no_js=True)
### You don't need to use these settings yourself,
### they are just here to make the charts look nicer!
# Set the plot style for prettier charts:
plt.style.use(
"C:/Users/msassrb2/The University of Manchester Dropbox/Ralf Becker/Python/Football/plot_style.txt"
)
The following loads the data directly from the Football-data.com website. As this links to the 2024/25 season, the file will be updated continuously as more matches are played.
df2425 = pd.read_csv(
"https://www.football-data.co.uk/mmz4281/2324/E0.csv",
na_values="",
)
In order to do some analysis that will not change every time a new match is played, we shall also load data from previous seasons and attach these to each other to create one big dataset.
df2324 = pd.read_csv(
"https://www.football-data.co.uk/mmz4281/2324/E0.csv",
na_values="",
)
df2223 = pd.read_csv(
"https://www.football-data.co.uk/mmz4281/2223/E0.csv",
na_values="",
)
df2122 = pd.read_csv(
"https://www.football-data.co.uk/mmz4281/2122/E0.csv",
na_values="",
)
Let's explore what sort of information is contained in the dataframes. See what the following commands do. Each of these gives you a glimpse of the information in the dataframe.
# df2324.columns
# df2324.shape
# df2324.info()
# df2324.head()
# df2324.tail()
df2324.dtypes
Div object Date object Time object HomeTeam object AwayTeam object FTHG int64 FTAG int64 FTR object HTHG int64 HTAG int64 HTR object Referee object HS int64 AS int64 HST int64 AST int64 HF int64 AF int64 HC int64 AC int64 HY int64 AY int64 HR int64 AR int64 B365H float64 B365D float64 B365A float64 BWH float64 BWD float64 BWA float64 IWH float64 IWD float64 IWA float64 PSH float64 PSD float64 PSA float64 WHH float64 WHD float64 WHA float64 VCH float64 VCD float64 VCA float64 MaxH float64 MaxD float64 MaxA float64 AvgH float64 AvgD float64 AvgA float64 B365>2.5 float64 B365<2.5 float64 P>2.5 float64 P<2.5 float64 Max>2.5 float64 Max<2.5 float64 Avg>2.5 float64 Avg<2.5 float64 AHh float64 B365AHH float64 B365AHA float64 PAHH float64 PAHA float64 MaxAHH float64 MaxAHA float64 AvgAHH float64 AvgAHA float64 B365CH float64 B365CD float64 B365CA float64 BWCH float64 BWCD float64 BWCA float64 IWCH float64 IWCD float64 IWCA float64 PSCH float64 PSCD float64 PSCA float64 WHCH float64 WHCD float64 WHCA float64 VCCH float64 VCCD float64 VCCA float64 MaxCH float64 MaxCD float64 MaxCA float64 AvgCH float64 AvgCD float64 AvgCA float64 B365C>2.5 float64 B365C<2.5 float64 PC>2.5 float64 PC<2.5 float64 MaxC>2.5 float64 MaxC<2.5 float64 AvgC>2.5 float64 AvgC<2.5 float64 AHCh float64 B365CAHH float64 B365CAHA float64 PCAHH float64 PCAHA float64 MaxCAHH float64 MaxCAHA float64 AvgCAHH float64 AvgCAHA float64 dtype: object
When looking at the .dtypes
you will see that Python has assigned certain types to variables. Variables with dtype float64
are numbers. If a variable has datatype object
this means that Python has not yet made up its mind on what the right datatype is. Below we will change some datatypes when this is necessary.
Let's also look at one particular observation. Appending the dataframe name with .iloc[rows,cols]
where rows
and cols
indicate which rows and columns you wish to see is one way to single out which parts of the dataframe you want to see. By indicating :
in the cols
space we say that we wish to see all columns.
df2324.iloc[1,:]
Div E0 Date 12/08/2023 Time 12:30 HomeTeam Arsenal AwayTeam Nott'm Forest FTHG 2 FTAG 1 FTR H HTHG 2 HTAG 0 HTR H Referee M Oliver HS 15 AS 6 HST 7 AST 2 HF 12 AF 12 HC 8 AC 3 HY 2 AY 2 HR 0 AR 0 B365H 1.18 B365D 7.0 B365A 15.0 BWH 1.17 BWD 7.5 BWA 15.5 IWH 1.2 IWD 7.25 IWA 14.0 PSH 1.18 PSD 7.86 PSA 15.67 WHH 1.12 WHD 6.5 WHA 12.0 VCH 1.14 VCD 7.5 VCA 17.0 MaxH 1.21 MaxD 8.5 MaxA 17.5 AvgH 1.18 AvgD 7.64 AvgA 15.67 B365>2.5 1.44 B365<2.5 2.75 P>2.5 1.42 P<2.5 2.93 Max>2.5 1.45 Max<2.5 2.98 Avg>2.5 1.42 Avg<2.5 2.85 AHh -2.0 B365AHH 1.88 B365AHA 2.02 PAHH 1.88 PAHA 2.01 MaxAHH 1.91 MaxAHA 2.06 AvgAHH 1.87 AvgAHA 1.99 B365CH 1.18 B365CD 7.0 B365CA 15.0 BWCH 1.18 BWCD 7.0 BWCA 14.5 IWCH 1.2 IWCD 7.0 IWCA 14.0 PSCH 1.19 PSCD 8.0 PSCA 16.0 WHCH 1.12 WHCD 6.5 WHCA 12.0 VCCH 1.22 VCCD 7.0 VCCA 13.0 MaxCH 1.22 MaxCD 8.4 MaxCA 19.0 AvgCH 1.19 AvgCD 7.43 AvgCA 15.98 B365C>2.5 1.5 B365C<2.5 2.63 PC>2.5 1.49 PC<2.5 2.65 MaxC>2.5 1.52 MaxC<2.5 2.79 AvgC>2.5 1.49 AvgC<2.5 2.63 AHCh -2.0 B365CAHH 1.95 B365CAHA 1.98 PCAHH 1.93 PCAHA 1.97 MaxCAHH 2.01 MaxCAHA 2.09 AvgCAHH 1.95 AvgCAHA 1.92 Name: 1, dtype: object
Most likely not all elements of this list are shown; only the first and last variables appear. If you want to show more lines you can change the default display restriction.
pd.options.display.max_rows = 200
df2324.iloc[1,:]
Div E0 Date 12/08/2023 Time 12:30 HomeTeam Arsenal AwayTeam Nott'm Forest FTHG 2 FTAG 1 FTR H HTHG 2 HTAG 0 HTR H Referee M Oliver HS 15 AS 6 HST 7 AST 2 HF 12 AF 12 HC 8 AC 3 HY 2 AY 2 HR 0 AR 0 B365H 1.18 B365D 7.0 B365A 15.0 BWH 1.17 BWD 7.5 BWA 15.5 IWH 1.2 IWD 7.25 IWA 14.0 PSH 1.18 PSD 7.86 PSA 15.67 WHH 1.12 WHD 6.5 WHA 12.0 VCH 1.14 VCD 7.5 VCA 17.0 MaxH 1.21 MaxD 8.5 MaxA 17.5 AvgH 1.18 AvgD 7.64 AvgA 15.67 B365>2.5 1.44 B365<2.5 2.75 P>2.5 1.42 P<2.5 2.93 Max>2.5 1.45 Max<2.5 2.98 Avg>2.5 1.42 Avg<2.5 2.85 AHh -2.0 B365AHH 1.88 B365AHA 2.02 PAHH 1.88 PAHA 2.01 MaxAHH 1.91 MaxAHA 2.06 AvgAHH 1.87 AvgAHA 1.99 B365CH 1.18 B365CD 7.0 B365CA 15.0 BWCH 1.18 BWCD 7.0 BWCA 14.5 IWCH 1.2 IWCD 7.0 IWCA 14.0 PSCH 1.19 PSCD 8.0 PSCA 16.0 WHCH 1.12 WHCD 6.5 WHCA 12.0 VCCH 1.22 VCCD 7.0 VCCA 13.0 MaxCH 1.22 MaxCD 8.4 MaxCA 19.0 AvgCH 1.19 AvgCD 7.43 AvgCA 15.98 B365C>2.5 1.5 B365C<2.5 2.63 PC>2.5 1.49 PC<2.5 2.65 MaxC>2.5 1.52 MaxC<2.5 2.79 AvgC>2.5 1.49 AvgC<2.5 2.63 AHCh -2.0 B365CAHH 1.95 B365CAHA 1.98 PCAHH 1.93 PCAHA 1.97 MaxCAHH 2.01 MaxCAHA 2.09 AvgCAHH 1.95 AvgCAHA 1.92 Name: 1, dtype: object
As you can see from the output, this line represents information on a match played between Arsenal (home team) and Nottingham Forest (away team). Arsenal won the match 2-1; the half-time score was 2-0. Also note that, although we called df2324.iloc[1,:], this is not the first match in the dataframe but the second, as Python starts counting at 0. This is a very computer-science thing to do and you will have to get used to it.
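To see the very first match you would therefore ask for row 0:

# the first match in the dataframe (row index 0)
df2324.iloc[0,:]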
Dataframes have an index
, for this dataframe you can find it as follows:
df2324.index
RangeIndex(start=0, stop=380, step=1)
Here you see that the index starts at 0 and it stops at 380. This actually means that the last index value is 379 and therefore there are 380 matches in this dataframe.
Now we will add a new variable indicating the season, as well as a variable (MatchNo
) that indexes the match and then join these dataframes into one. We add a season variable as it makes identifying matches from particular seasons easier. A nice introduction to merging and concatenating dataframes in python is available from here.
# creates Season variable in each dataframe
# insert the variable in position 1, i.e. 2nd variable
df2122.insert(1, 'Season', "2122")
df2223.insert(1, 'Season', "2223")
df2324.insert(1, 'Season', "2324")
# add Match number variable, e.g. "202122_4" is the 5th match in the 202122 season
# insert the variable as the 3rd variable
df2122["MNo"] = range(380)
df2122.insert(2,'MatchNo', df2122["Season"] + "_" + df2122["MNo"].astype("str"))
df2223["MNo"] = range(380)
df2223.insert(2,'MatchNo', df2223["Season"] + "_" + df2223["MNo"].astype("str"))
df2324["MNo"] = range(380)
df2324.insert(2,'MatchNo', df2324["Season"] + "_" + df2324["MNo"].astype("str"))
# combines the dataframe
df_combined = pd.concat([df2122,df2223,df2324])
The dataframe contains a lot of variables that relate to different betting companies and their odds for betting on matches. We are not interested in these. The last variable that relates to the football match itself is AR
the number of red cards for the away team. We shall therefore only select the variables in columns 0 to 25 (0:26 in Python speak; recall this picks the columns indexed 0 to 25). Alternatively you could use the .loc
method which allows you to locate columns by names, here selecting columns named "Div" to "AR".
df_matches = df_combined.iloc[:,0:26]
# df_matches = df_combined.loc[:,"Div": "AR"] # both lines achieve the same
df_matches.shape
(1140, 26)
The last line tells us how many rows and columns we have in df_matches
.
Before we embark on some real data analysis we ensure that some variables have the correct data types. Previously you saw that some variables were object
variables. This is Python speak for "uncategorised" data types. For instance the variables HomeTeam
and AwayTeam
were labelled as object
variables. These are categorical variables and we will enforce this now (also for Referee
, FTR
and HTR
).
df_matches["HomeTeam"] = df_matches["HomeTeam"].astype("category")
df_matches["AwayTeam"] = df_matches["AwayTeam"].astype("category")
df_matches["Referee"] = df_matches["Referee"].astype("category")
df_matches["FTR"] = df_matches["FTR"].astype("category")
df_matches["HTR"] = df_matches["HTR"].astype("category")
df_matches.dtypes
Div object Season object MatchNo object Date object Time object HomeTeam category AwayTeam category FTHG int64 FTAG int64 FTR category HTHG int64 HTAG int64 HTR category Referee category HS int64 AS int64 HST int64 AST int64 HF int64 AF int64 HC int64 AC int64 HY int64 AY int64 HR int64 AR int64 dtype: object
The Date
and Time
variables will be turned into types that Python can recognise as dates and times.
df_matches['Date'] = pd.to_datetime(df_combined['Date'], format = '%d/%m/%Y')
df_matches['Time'] = pd.to_datetime(df_combined['Time'], format = '%H:%M')
df_matches.dtypes
Div object Season object MatchNo object Date datetime64[ns] Time datetime64[ns] HomeTeam category AwayTeam category FTHG int64 FTAG int64 FTR category HTHG int64 HTAG int64 HTR category Referee category HS int64 AS int64 HST int64 AST int64 HF int64 AF int64 HC int64 AC int64 HY int64 AY int64 HR int64 AR int64 dtype: object
There are all sorts of things you may be interested in from this dataset. Here are a number of things we may want to do.
All of the above are questions which have a match as a reference point. But we may also have questions that refer to teams rather than matches.
In what follows we will produce summary statistics that speak to these questions. We will however, here, not produce hypothesis tests. The analysis will remain purely descriptive.
Now that we have a basic understanding of the data and we know what sort of questions we wish to investigate it is worthwhile thinking about whether the dataframe df_matches
is in the form which we would like it to be in for the subsequent analysis. Recall that each row relates to one match. It is therefore in a shape from which we can easily analyse the questions that relate to matches. However, the questions that relate to teams will be awkward to analyse from df_matches
as the same team will sometimes be a home team (and hence appear in the HomeTeam
variable) and sometimes an away team (appearing in the AwayTeam
variable).
To make our life a little easier for that type of analysis we shall re-organise the information in an alternative dataframe.
The following shows a schematic of df_matches
:
MatchNo | HomeTeam | AwayTeam | FTHG | FTAG |
---|---|---|---|---|
1 | A | B | 2 | 1 |
2 | C | D | 0 | 3 |
A way to organise the data that is better suited to team level analysis would be the following
MatchNo | Team | Opponent | Home_or_Away | FTG | FTG_Opp |
---|---|---|---|---|---|
1 | A | B | Home | 2 | 1 |
1 | B | A | Away | 1 | 2 |
2 | C | D | Home | 0 | 3 |
2 | D | C | Away | 3 | 0 |
This process will double the number of rows, as for every match we will obtain two rows of data, one for the home team and one for the away team.
Once the data are in this format we should be able to perform the analysis tasks based on particular teams. We shall save the data in df_teams
and then we can refer to either df_matches
or df_teams
depending on which is more convenient.
We start by duplicating the dataframe into a df_home
and df_away
dataframe. When you do this you use, for instance the code df_home = df_matches.copy()
. If you don't add the .copy()
, but merely write df_home = df_matches
, you will be creating a new item in your object list called df_home
, but really all you are creating is a new name for the same object df_matches
. You could think of this as an alias for df_matches
. If you then change something in df_home
you will actually be changing the same thing in df_matches
. Try and run the code below without the .copy
methods added and you will get an error message.
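If you want to see the aliasing problem in isolation, here is a minimal sketch using a small throwaway dataframe (not part of the exercise):

# a toy dataframe to illustrate the difference between an alias and a copy
demo = pd.DataFrame({'x': [1, 2, 3]})
demo_alias = demo         # just another name for the same object
demo_copy = demo.copy()   # an independent copy

demo_alias['x'] = 0       # changing the alias ...
print(demo)               # ... also changes demo
print(demo_copy)          # the copy is unaffected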
# replicate the data into two datasets
df_home = df_matches.copy()
df_away = df_matches.copy()
# add the HomeAway variable
df_home.insert(5,'HomeAway', "Home")
df_away.insert(5,'HomeAway', "Away")
df_away.columns
Index(['Div', 'Season', 'MatchNo', 'Date', 'Time', 'HomeAway', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR'], dtype='object')
Now we change the names of the variables in the two dataframes, preparing for attaching them to each other. For df_home
we label characteristics of the away team as opponent characteristics, e.g. corners for the home team are labelled Corners
and for the away team they are Corners_Opp
.
# variable names not mentioned in the rename method stay the same
df_home = df_home.rename(columns={'HomeTeam': 'Team', 'AwayTeam': 'Opponent', 'FTHG': 'FTG', 'FTAG': 'FTG_Opp', 'FTR': 'FTResult',
'HTHG': 'HTG', 'HTAG': 'HTG_Opp', 'HTR': 'HTResult', 'HS': 'Shots', 'AS': 'Shots_Opp',
'HST': 'ShotsTarget', 'AST': 'ShotsTarget_Opp', 'HF': 'FoulsComm', 'AF': 'FoulsComm_Opp',
'HC': 'Corners', 'AC': 'Corners_Opp', 'HY': 'Yellow', 'AY': 'Yellow_Opp', 'HR': 'Red', 'AR': 'Red_Opp'})
For df_away
we label in the opposite way, e.g. the corners for the away team are Corners
and for the home team they are Corners_Opp
. Hence this contains the same information but now presented from the perspective of the away team.
df_away = df_away.rename(columns={'HomeTeam': 'Opponent', 'AwayTeam': 'Team', 'FTHG': 'FTG_Opp', 'FTAG': 'FTG', 'FTR': 'FTResult',
'HTHG': 'HTG_Opp', 'HTAG': 'HTG', 'HTR': 'HTResult', 'HS': 'Shots_Opp', 'AS': 'Shots',
'HST': 'ShotsTarget_Opp', 'AST': 'ShotsTarget', 'HF': 'FoulsComm_Opp', 'AF': 'FoulsComm',
'HC': 'Corners_Opp', 'AC': 'Corners', 'HY': 'Yellow_Opp', 'AY': 'Yellow', 'HR': 'Red_Opp', 'AR': 'Red'})
Before joining up the two dataframes we need to take care of the results columns (both the full time result FTResult
and the half-time result HTResult
), they are still defined as either H (=Home Win), D (=Draw) or A (=Away Win). We need to redefine them as either a Loss (=L), a draw (=D) or a win (=W), but that depends on whether we are thinking of the home or away team.
df_home['FTResult'] = df_home['FTResult'].cat.rename_categories({'H': 'W', 'D': 'D', 'A': 'L'})
df_home['HTResult'] = df_home['HTResult'].cat.rename_categories({'H': 'W', 'D': 'D', 'A': 'L'})
df_away['FTResult'] = df_away['FTResult'].cat.rename_categories({'H': 'L', 'D': 'D', 'A': 'W'})
df_away['HTResult'] = df_away['HTResult'].cat.rename_categories({'H': 'L', 'D': 'D', 'A': 'W'})
Now both dataframes (df_home
and df_away
) have the same variables. However, they are not in the same order (check df_home.columns and df_away.columns
). To concatenate two dataframes we earlier used the pd.concat
function, but you may wonder whether that works if the variables are not in the same order. Well just try it.
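If you want to convince yourself on a small example first, here is a sketch with two made-up toy dataframes showing that pd.concat matches columns by name rather than by position:

# two toy dataframes with the same columns in a different order
a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
b = pd.DataFrame({'y': [30, 40], 'x': [10, 20]})
print(pd.concat([a, b]))   # the values line up by column name, not by position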
df_teams = pd.concat([df_home,df_away])
df_teams = df_teams.sort_values(by=['MatchNo']) # sort by MatchNo
df_teams.head
<bound method NDFrame.head of Div Season MatchNo Date Time HomeAway Team \ 0 E0 2122 2122_0 2021-08-13 1900-01-01 20:00:00 Home Brentford 0 E0 2122 2122_0 2021-08-13 1900-01-01 20:00:00 Away Arsenal 1 E0 2122 2122_1 2021-08-14 1900-01-01 12:30:00 Away Leeds 1 E0 2122 2122_1 2021-08-14 1900-01-01 12:30:00 Home Man United 10 E0 2122 2122_10 2021-08-21 1900-01-01 12:30:00 Away Burnley .. .. ... ... ... ... ... ... 97 E0 2324 2324_97 2023-10-29 1900-01-01 14:00:00 Home Brighton 98 E0 2324 2324_98 2023-10-29 1900-01-01 14:00:00 Away Nott'm Forest 98 E0 2324 2324_98 2023-10-29 1900-01-01 14:00:00 Home Liverpool 99 E0 2324 2324_99 2023-10-29 1900-01-01 15:30:00 Home Man United 99 E0 2324 2324_99 2023-10-29 1900-01-01 15:30:00 Away Man City Opponent FTG FTG_Opp ... ShotsTarget ShotsTarget_Opp FoulsComm \ 0 Arsenal 2 0 ... 3 4 12 0 Brentford 0 2 ... 4 3 8 1 Man United 1 5 ... 3 8 9 1 Leeds 5 1 ... 8 3 11 10 Liverpool 0 2 ... 3 9 12 .. ... ... ... ... ... ... ... 97 Fulham 1 1 ... 7 5 12 98 Liverpool 0 3 ... 1 8 13 98 Nott'm Forest 3 0 ... 8 1 9 99 Man City 0 3 ... 3 10 9 99 Man United 3 0 ... 10 3 4 FoulsComm_Opp Corners Corners_Opp Yellow Yellow_Opp Red Red_Opp 0 8 2 5 0 0 0 0 0 12 5 2 0 0 0 0 1 11 4 5 2 1 0 0 1 9 5 4 1 2 0 0 10 6 4 8 0 0 0 0 .. ... ... ... ... ... ... ... 97 8 7 3 0 3 0 0 98 9 3 8 3 2 0 0 98 13 8 3 2 3 0 0 99 4 7 12 4 1 0 0 99 9 12 7 1 4 0 0 [2280 rows x 27 columns]>
You can now see that the first two rows represent the information from the same match, once from the home team's perspective (Brentford) and once from the away team's (Arsenal).
We now have two dataframes, df_matches
and df_teams
which contain identical information but structured differently. If you are a novice data analyst, then it was probably not obvious that this pre-work will be helpful. And that is fine, but you shall see in what follows that we have simplified future work significantly.
The questions we shall tackle here are
We will use the df_matches
dataframe to answer these questions. Let's start by looking at the home (FTHG
) and away goals (FTAG
) at full time. We use the .describe()
method to these two variables to get summary statistics.
df_matches[['FTHG', 'FTAG']].describe()
FTHG | FTAG | |
---|---|---|
count | 1140.000000 | 1140.000000 |
mean | 1.649123 | 1.334211 |
std | 1.375073 | 1.244361 |
min | 0.000000 | 0.000000 |
25% | 1.000000 | 0.000000 |
50% | 1.000000 | 1.000000 |
75% | 2.000000 | 2.000000 |
max | 9.000000 | 8.000000 |
As you can see, the average number of goals scored by the home team is larger by about 0.3.
Next we turn to asking whether evening matches produce more goals than earlier matches. We do have the Time variable, which we can use to separate the matches by the time of day at which they are played. What we do not yet have in the dataframe is the total number of goals scored in a match. Let us create this variable ('FTG') first.
df_matches['FTG'] = df_matches['FTHG'] + df_matches['FTAG']
Now we calculate the average number of goals by kick-off time ('Time').
The following command does the following: take df_matches
and group the dataframe by the Time
variable. This basically creates different dataframes for all matches of a particular time. Then only keep the FTG
variable. Now you have "buckets" for each different kick-off time and these buckets only contain the FTG
variable. Now we aggregate these "Buckets" (.agg
) by counting the number of matches in each bucket (np.size
), as well as calculating the mean (np.mean
) and variance (np.var
).
tab1 = df_matches.groupby('Time')['FTG'].agg([np.size,np.mean, np.var])
tab1
size | mean | var | |
---|---|---|---|
Time | |||
1900-01-01 12:00:00 | 3 | 1.666667 | 1.333333 |
1900-01-01 12:30:00 | 84 | 2.809524 | 3.095812 |
1900-01-01 13:00:00 | 4 | 2.500000 | 1.666667 |
1900-01-01 13:30:00 | 4 | 2.250000 | 1.583333 |
1900-01-01 14:00:00 | 168 | 2.940476 | 2.774879 |
1900-01-01 14:15:00 | 3 | 2.000000 | 1.000000 |
1900-01-01 15:00:00 | 387 | 3.043928 | 2.886666 |
1900-01-01 15:30:00 | 4 | 3.250000 | 0.916667 |
1900-01-01 15:45:00 | 1 | 2.000000 | NaN |
1900-01-01 16:00:00 | 21 | 3.666667 | 2.333333 |
1900-01-01 16:15:00 | 1 | 1.000000 | NaN |
1900-01-01 16:30:00 | 89 | 3.370787 | 3.213228 |
1900-01-01 17:30:00 | 98 | 2.867347 | 3.023459 |
1900-01-01 18:00:00 | 2 | 2.000000 | 8.000000 |
1900-01-01 19:00:00 | 2 | 2.000000 | 2.000000 |
1900-01-01 19:30:00 | 61 | 2.737705 | 2.930055 |
1900-01-01 19:45:00 | 56 | 2.428571 | 2.249351 |
1900-01-01 20:00:00 | 127 | 3.055118 | 2.846144 |
1900-01-01 20:15:00 | 25 | 3.600000 | 3.583333 |
This way of chaining methods is a popular way of creating transformed data.
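For instance, the same table could also be written as one chained expression using named aggregations (a sketch; this passes the function names as strings rather than the numpy functions):

# same grouping, expressed as a single chain with named aggregations
tab1_alt = (
    df_matches
    .groupby('Time')['FTG']
    .agg(size='size', mean='mean', var='var')
)
tab1_alt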
As you can see there were many different kick-off times, some with very few matches (e.g. 16:15 with only one match). We therefore create a variable that categorises the times into "noon", "early afternoon", "late afternoon" or "evening". Here we use the pd.cut
function. You will not have to remember all these commands. But if you search for terms like "create categorical variable from continuous in python" you will find advice that most likely leads you to pd.cut
. Searching for the right function to use is an essential coding skill.
# create cut-offs. As the Time variable is in Time format these need to have the same format
time_cuts = pd.to_datetime(['11:00','14:45','15:50','19:15','23:00'], format = '%H:%M')
df_matches["TimeofDay"] = pd.cut(
df_matches["Time"],
bins=time_cuts,
labels=["noon", "early afternoon", "late afternoon", "evening"],
ordered=True,
)
Confirm that this has done what you expect.
df_matches[['MatchNo','Time','TimeofDay']].head
<bound method NDFrame.head of MatchNo Time TimeofDay 0 2122_0 1900-01-01 20:00:00 evening 1 2122_1 1900-01-01 12:30:00 noon 2 2122_2 1900-01-01 15:00:00 early afternoon 3 2122_3 1900-01-01 15:00:00 early afternoon 4 2122_4 1900-01-01 15:00:00 early afternoon .. ... ... ... 375 2324_375 1900-01-01 16:00:00 late afternoon 376 2324_376 1900-01-01 16:00:00 late afternoon 377 2324_377 1900-01-01 16:00:00 late afternoon 378 2324_378 1900-01-01 16:00:00 late afternoon 379 2324_379 1900-01-01 16:00:00 late afternoon [1140 rows x 3 columns]>
Looks right. That means we can repeat the earlier analysis and group by TimeofDay
rather than Time
to create more equally sized groups.
tab2 = df_matches.groupby('TimeofDay')['FTG'].agg([np.size,np.mean, np.var])
tab2
size | mean | var | |
---|---|---|---|
TimeofDay | |||
noon | 266 | 2.857143 | 2.809704 |
early afternoon | 392 | 3.043367 | 2.860007 |
late afternoon | 213 | 3.131455 | 3.114713 |
evening | 269 | 2.903346 | 2.886145 |
Not surprisingly you do get different average numbers of goals. You could continue to perform hypothesis tests to check whether there are statistically significant differences, but that is going too far for this walk-through.
The last match-based question we will investigate here is whether away teams are more efficient goal scorers than home teams. We already used the goals scored variables (FTHG
, FTAG
). Here we shall relate this to the number of shots taken by the home team (HS
) or the away team (AS
).
In effect we will calculate how many shots are required per goal (HS/FTHG
and AS/FTAG
). Lower values are better.
df_matches['HG_eff'] = df_matches['HS']/df_matches['FTHG']
df_matches['AG_eff'] = df_matches['AS']/df_matches['FTAG']
df_matches[['FTHG', 'HS', 'HG_eff', 'FTAG', 'AS', 'AG_eff']].describe()
c:\Users\msassrb2\Anaconda3\lib\site-packages\numpy\lib\function_base.py:4009: RuntimeWarning: invalid value encountered in subtract diff_b_a = subtract(b, a)
FTHG | HS | HG_eff | FTAG | AS | AG_eff | |
---|---|---|---|---|---|---|
count | 1140.000000 | 1140.000000 | 1140.000000 | 1140.000000 | 1140.000000 | 1140.0 |
mean | 1.649123 | 14.400877 | inf | 1.334211 | 11.742982 | inf |
std | 1.375073 | 5.888024 | NaN | 1.244361 | 5.264696 | NaN |
min | 0.000000 | 1.000000 | 1.400000 | 0.000000 | 1.000000 | 1.0 |
25% | 1.000000 | 10.000000 | 5.666667 | 0.000000 | 8.000000 | 5.5 |
50% | 1.000000 | 14.000000 | 9.500000 | 1.000000 | 11.000000 | 9.5 |
75% | 2.000000 | 18.000000 | 20.000000 | 2.000000 | 15.000000 | NaN |
max | 9.000000 | 36.000000 | inf | 8.000000 | 31.000000 | inf |
Why can the above not calculate average efficiencies? The issue is that we are dividing by the number of goals scored, and frequently that will be 0. Dividing by zero produces infinite (inf) values (and 0/0 produces a missing value), which makes the mean and standard deviation unusable.
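You can see this directly in a small sketch: pandas turns a division by zero into inf rather than raising an error. If you wanted to stick with shots per goal you could, for instance, treat those infinite values as missing before averaging:

# division by zero gives inf (and 0/0 gives NaN) instead of an error
pd.Series([2, 0, 3]) / pd.Series([0, 0, 1])   # -> inf, NaN, 3.0
# one possible fix: replace inf by NaN before taking the mean
df_matches['HG_eff'].replace(np.inf, np.nan).mean()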
However, as the minimum number of shots taken (HS
and AS
) is always larger than 0 (at least in this sample) we can calculate how many goals are scored per shot (FTHG/HS
and FTAG/AS
). Higher values are better.
df_matches['HG_eff'] = df_matches['FTHG']/df_matches['HS']
df_matches['AG_eff'] = df_matches['FTAG']/df_matches['AS']
df_matches[['FTHG', 'HS', 'HG_eff', 'FTAG', 'AS', 'AG_eff']].describe().round(2) # round(2) forces rounding to 2 digits
FTHG | HS | HG_eff | FTAG | AS | AG_eff | |
---|---|---|---|---|---|---|
count | 1140.00 | 1140.00 | 1140.00 | 1140.00 | 1140.00 | 1140.00 |
mean | 1.65 | 14.40 | 0.12 | 1.33 | 11.74 | 0.12 |
std | 1.38 | 5.89 | 0.10 | 1.24 | 5.26 | 0.12 |
min | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 |
25% | 1.00 | 10.00 | 0.05 | 0.00 | 8.00 | 0.00 |
50% | 1.00 | 14.00 | 0.11 | 1.00 | 11.00 | 0.11 |
75% | 2.00 | 18.00 | 0.18 | 2.00 | 15.00 | 0.18 |
max | 9.00 | 36.00 | 0.71 | 8.00 | 31.00 | 1.00 |
The efficiency seems to be roughly similar between the home and away teams. While the home team does score more on average, they also take more shots.
As an aside, we see that there was (at least) one match where the home team took 36 shots and another where the away team took 31 shots. Let's figure out which matches these were.
df_matches[(df_matches['HS'] == 36)|(df_matches['AS'] == 31)]
Div | Season | MatchNo | Date | Time | HomeTeam | AwayTeam | FTHG | FTAG | FTR | ... | HC | AC | HY | AY | HR | AR | FTG | TimeofDay | HG_eff | AG_eff | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
362 | E0 | 2122 | 2122_362 | 2022-05-15 | 1900-01-01 14:00:00 | West Ham | Man City | 2 | 2 | D | ... | 4 | 9 | 3 | 1 | 0 | 0 | 4 | noon | 0.333333 | 0.064516 |
376 | E0 | 2324 | 2324_376 | 2024-05-19 | 1900-01-01 16:00:00 | Liverpool | Wolves | 2 | 0 | H | ... | 10 | 2 | 1 | 1 | 0 | 1 | 2 | late afternoon | 0.055556 | 0.000000 |
2 rows × 30 columns
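If you had not already spotted these extreme values in the summary table, you could also let pandas find them for you, for instance:

# matches attaining the maximum number of home (HS) or away (AS) shots
df_matches[(df_matches['HS'] == df_matches['HS'].max()) | (df_matches['AS'] == df_matches['AS'].max())]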
It may be interesting to see whether there is a relationship between efficiency and the number of shots taken.
We create a scatterplot with shots taken by the home team (HS
) on the horizontal axis and the home team's goal efficiency on the vertical axis (HG_eff
). The package we use for plotting here is the lets_plot
package which implements the powerful graphing language introduced for ggplot
in the R programming language. You can find guidance to the functionality here.
p1 = ggplot(df_matches, aes(x = 'HS', y = 'HG_eff')) + geom_point()
p1
This seems to suggest that there is a negative relationship. But note that behind each point there may be multiple observations.
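One way to get a feel for how many observations sit on top of each other is to make the points semi-transparent, for instance (a sketch using the alpha argument of geom_point):

# semi-transparent points: darker areas indicate many overlapping observations
p1b = ggplot(df_matches, aes(x = 'HS', y = 'HG_eff')) + geom_point(alpha = 0.3)
p1b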
The questions in this section were:
We will use the df_teams
dataframe to answer these questions.
Beginning with the first question we create a temporary dataframe that only keeps observations involving Liverpool and Tottenham. We call this temp
as we know that after this particular analysis we will not need this dataframe any longer.
temp = df_teams.query("Team == 'Liverpool' or Team == 'Tottenham'").copy()
tab3 = temp.groupby('Team')['FTG'].agg(np.mean)
tab3
Team Arsenal NaN Aston Villa NaN Bournemouth NaN Brentford NaN Brighton NaN Burnley NaN Chelsea NaN Crystal Palace NaN Everton NaN Fulham NaN Leeds NaN Leicester NaN Liverpool 2.236842 Luton NaN Man City NaN Man United NaN Newcastle NaN Norwich NaN Nott'm Forest NaN Sheffield United NaN Southampton NaN Tottenham 1.868421 Watford NaN West Ham NaN Wolves NaN Name: FTG, dtype: float64
You can see that the table does show what we want to see. In our sample Liverpool averages 2.24 goals a match whereas Tottenham averages 1.87. However, the table shows more than that: it lists all the teams in our sample. Recall that we defined the Team
variable as a category variable. Python does have all these categories attached to the variable, but in temp
we only find matches for Liverpool and Tottenham (as intended). Surely there is a way to remove these. An internet search for "python remove unused categories" should deliver a way to do so. Having said that, you may have to look around for alternative solutions as the most obvious one using .remove_unused_categories()
does not work here.
temp = df_teams.query("Team == 'Liverpool' or Team == 'Tottenham'").copy()
tab4 = temp.groupby('Team')['FTG'].agg(np.mean).dropna() # .dropna() drops all rows with NA
tab4
Team Liverpool 2.236842 Tottenham 1.868421 Name: FTG, dtype: float64
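An alternative, depending on your pandas version, may be to tell .groupby to only use the categories that actually occur in the data via its observed argument (a sketch):

# only group by the categories that are present in temp
tab4_alt = temp.groupby('Team', observed=True)['FTG'].agg('mean')
tab4_alt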
This is a nice opportunity to show another data visualisation. The number of goals scored is a discrete variable, and comparing bar charts of the number of goals scored per match may be a nice way to visualise the difference between the two teams.
p2 = ggplot(temp, aes(x = 'FTG', fill = 'Team')) + geom_bar(position = "dodge")
p2
This confirms that Tottenham has more matches with a small number of goals (0, 1 and 2) than Liverpool.
Let us turn to the next question: do Chelsea perform better in the first half of a season than in the second half? To answer this question we start by creating a temporary dataframe containing Chelsea's matches.
temp = df_teams.query("Team == 'Chelsea'").sort_values('Date').copy()
This contains 114 matches from three seasons (2021/22, 2022/23 and 2023/24), 38 matches in each season. We now need to identify the first 19 matches in each season. We shall create a variable ('Order') that takes the value "1" or "2" for the first and second half of the season respectively. Then we can apply the .groupby
method to get separate statistics for the two halves.
There are different ways to do that. In each season Chelsea play every other team twice, so we could associate the first of these matches with the first half of the season. This is not a task for which the solution is immediately obvious. You could search the internet, but it is not clear what the right keywords would be. This feels like a problem which you first have to describe properly, which makes it a great test of whether ChatGPT can be of help.
You could prepare a little problem description for ChatGPT: "I have a python dataframe with variables 'Date' and 'Opponent' listing all dates and opponents against which a sports team played matches. Each opponent occurs twice. I want to create a new variable called 'Order' that takes the value 1 for the first match and the value 2 for the second match against each opponent. Can you please help me achieve this."
The answer I got from ChatGPT was:
df['Order'] = df.groupby('Opponent').cumcount() + 1
There was more in the reply (incl. explanations) but this is the core of the answer. The problem was posed without mentioning the different seasons, so the application here has to adjust for that.
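To see what cumcount does, here is a small sketch with a toy dataframe:

# cumcount numbers the rows within each group, starting at 0
toy = pd.DataFrame({'Opponent': ['A', 'B', 'A', 'B']})
toy['Order'] = toy.groupby('Opponent').cumcount() + 1
print(toy)   # the first A and first B get Order 1, the second occurrences get 2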
temp['Order'] = temp.groupby(['Season','Opponent']).cumcount() + 1
Now we have the order variable and we can check whether average goals ('FTG'), shots taken ('Shots') and shots on target ('ShotsTarget') differ between the two halves of the season.
tab4 = temp.groupby('Order')[['FTG','Shots','ShotsTarget']].agg(np.mean)
tab4
FTG | Shots | ShotsTarget | |
---|---|---|---|
Order | |||
1 | 1.614035 | 13.491228 | 4.789474 |
2 | 1.736842 | 15.000000 | 5.456140 |
The last question we turn to is whether Manchester City ("Man City" in the 'Team' column) concede more free kicks than they are awarded (i.e. than the other team concedes). This is straightforward once you realise that the relevant information is in the FoulsComm
and the FoulsComm_Opp
variable. There are a number of ways you could answer this question. For instance, you could compare the mean values of both variables across the 114 Manchester City matches. But let's look at this in another way: let's check in how many matches Manchester City players were fouled more often than they fouled their opponents. To do this we create a new variable which takes the value True
if Man City committed more fouls than their opponents (FoulsComm>FoulsComm_Opp
) and False
otherwise.
temp = df_teams.query("Team == 'Man City'").copy()
temp['Foul_compare'] = (temp['FoulsComm'] > temp['FoulsComm_Opp'])
p3 = ggplot(temp, aes(x = 'Foul_compare')) + geom_bar()
p3
So clearly there are many more matches in which Manchester City players are fouled more often by their opponents than the other way round.
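If you prefer a numeric summary to the bar chart, you could also count the two outcomes directly, for instance:

# True: Man City committed more fouls than their opponents; False: the reverse
temp['Foul_compare'].value_counts()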
In this walk-through you have practised a lot of important generic coding skills.
And you have completed the following substantial tasks.
This exercise is part of the ECLR page.