%load_ext pretty_jupyter
The pretty_jupyter extension is already loaded. To reload it, use: %reload_ext pretty_jupyter
Introduction¶
This walk-through is part of the ECLR page.
If you write code for data analysis you will eventually have to master a number of basic programming techniques. They are not directly related to data analysis but will be required for any bit of coding. These are
- Subsetting
- Objects and pointers
- Boolean variables
- if else conditions
- loops
- apply methods
- functions
- lambda functions
In this walkthrough we will show simple examples of these to help you understand code written by other people and to be able to write more complex code yourself.
Preparing your workfile¶
Install the lets-plot
package in case you have not yet done so on your computer (pip install lets-plot
in your terminal).
We add the basic packages needed for this work:
import numpy as np
import pandas as pd
Loading a dataset¶
Let's get a dataset to look at. We shall use the Baseball wages dataset, mlb1.csv. Either download that file to your working directory or directly link to the file as shown below. Before you upload a csv (or EXCEL) file you should always inspect the file to understand what you should expect. Upon inspection you will find that missing values are coded as "." and hence we pass that information on to the pd.read_csv
function.
# Load the dataset
url = "https://raw.githubusercontent.com/datasquad/ECLR/refs/heads/gh-pages/data/mlb1.csv"
df_bball = pd.read_csv(url, na_values = '.')
Let's check out what variables we have in this data-file. Note that Python starts counting variables from 0!
df_bball.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 353 entries, 0 to 352 Data columns (total 47 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 salary 353 non-null int64 1 teamsal 353 non-null int64 2 nl 353 non-null int64 3 years 353 non-null int64 4 games 353 non-null int64 5 atbats 353 non-null int64 6 runs 353 non-null int64 7 hits 353 non-null int64 8 doubles 353 non-null int64 9 triples 353 non-null int64 10 hruns 353 non-null int64 11 rbis 353 non-null int64 12 bavg 353 non-null int64 13 bb 353 non-null int64 14 so 353 non-null int64 15 sbases 353 non-null int64 16 fldperc 353 non-null int64 17 frstbase 353 non-null int64 18 scndbase 353 non-null int64 19 shrtstop 353 non-null int64 20 thrdbase 353 non-null int64 21 outfield 353 non-null int64 22 catcher 353 non-null int64 23 yrsallst 353 non-null int64 24 hispan 353 non-null int64 25 black 353 non-null int64 26 whitepop 330 non-null float64 27 blackpop 339 non-null float64 28 hisppop 330 non-null float64 29 pcinc 353 non-null int64 30 gamesyr 353 non-null float64 31 hrunsyr 353 non-null float64 32 atbatsyr 353 non-null float64 33 allstar 353 non-null float64 34 slugavg 353 non-null float64 35 rbisyr 353 non-null float64 36 sbasesyr 353 non-null float64 37 runsyr 353 non-null float64 38 percwhte 330 non-null float64 39 percblck 330 non-null float64 40 perchisp 330 non-null float64 41 blckpb 330 non-null float64 42 hispph 330 non-null float64 43 whtepw 330 non-null float64 44 blckph 330 non-null float64 45 hisppb 330 non-null float64 46 lsalary 353 non-null float64 dtypes: float64(20), int64(27) memory usage: 129.7 KB
There is a bunch of variables who describe characteristics of the 353 baseball players in the sample.
Subsetting¶
The process of accessing only a particular part of a dataframe is often called subsetting. There are, as usual several ways to do that.
Subsetting a single column/variable¶
Here we will see how to access particular columns in a dataframe. Say you wish to access the hits
variable. Each of the following three commands do exactly the same, access the hits
variable.
#df_bball['hits']
# df_bball.loc[:,'hits'] # use iloc if you are using a name
df_bball.iloc[:, 7] # use iloc if using integers like col numbers to access
0 1939 1 863 2 840 3 816 4 928 ... 348 260 349 183 350 176 351 510 352 4 Name: hits, Length: 353, dtype: int64
Subsetting multiple columns/variables¶
If you want to access multiple columns you can do that in the following way. Say you want to access atbats
to hits
.
df_bball[['atbats', 'runs', 'hits' ]]
# df_bball.loc[:,'atbats':'hits']
# df_bball.iloc[:, 5:8] # 8 is the first column NOT included
atbats | runs | hits | |
---|---|---|---|
0 | 6705 | 1076 | 1939 |
1 | 3333 | 407 | 863 |
2 | 2807 | 370 | 840 |
3 | 3337 | 405 | 816 |
4 | 3603 | 437 | 928 |
... | ... | ... | ... |
348 | 1098 | 150 | 260 |
349 | 700 | 63 | 183 |
350 | 828 | 112 | 176 |
351 | 2087 | 217 | 510 |
352 | 24 | 1 | 4 |
353 rows × 3 columns
Note that ['atbats', 'runs', 'hits' ]
basically produces a list of variables. That list does not have to have the same order as the variables in the original dataframe.
Let's say you also want the salary
variable. The easiest is to use a list.
sel_vars = ['salary', 'atbats', 'runs', 'hits' ]
df_bball[sel_vars]
salary | atbats | runs | hits | |
---|---|---|---|---|
0 | 6329213 | 6705 | 1076 | 1939 |
1 | 3375000 | 3333 | 407 | 863 |
2 | 3100000 | 2807 | 370 | 840 |
3 | 2900000 | 3337 | 405 | 816 |
4 | 1650000 | 3603 | 437 | 928 |
... | ... | ... | ... | ... |
348 | 312000 | 1098 | 150 | 260 |
349 | 275000 | 700 | 63 | 183 |
350 | 250000 | 828 | 112 | 176 |
351 | 200000 | 2087 | 217 | 510 |
352 | 109000 | 24 | 1 | 4 |
353 rows × 4 columns
Subsetting rows¶
If you want to subset a set of rows you can do this as follows if you wanted to view the 8th observation.
# df_bball.iloc[7,:] # gives you the eights row/observation, recall we start counting at 0
df_bball.loc[7,:] # gives you the eighth row/observation
salary 2.750000e+05 teamsal 3.840738e+07 nl 1.000000e+00 years 1.000000e+01 games 4.320000e+02 atbats 1.005000e+03 runs 7.800000e+01 hits 2.400000e+02 doubles 3.500000e+01 triples 5.000000e+00 hruns 1.300000e+01 rbis 9.500000e+01 bavg 2.390000e+02 bb 3.900000e+01 so 1.400000e+02 sbases 1.000000e+00 fldperc 9.900000e+02 frstbase 0.000000e+00 scndbase 0.000000e+00 shrtstop 0.000000e+00 thrdbase 0.000000e+00 outfield 0.000000e+00 catcher 1.000000e+00 yrsallst 0.000000e+00 hispan 0.000000e+00 black 0.000000e+00 whitepop 5.772110e+06 blackpop 1.547725e+06 hisppop 8.934220e+05 pcinc 1.884000e+04 gamesyr 4.320000e+01 hrunsyr 1.300000e+00 atbatsyr 1.005000e+02 allstar 0.000000e+00 slugavg 3.223881e+01 rbisyr 9.500000e+00 sbasesyr 1.000000e-01 runsyr 7.800000e+00 percwhte 7.027797e+01 percblck 1.884423e+01 perchisp 1.087780e+01 blckpb 0.000000e+00 hispph 0.000000e+00 whtepw 7.027797e+01 blckph 0.000000e+00 hisppb 0.000000e+00 lsalary 1.252453e+01 Name: 7, dtype: float64
To get multiple rows you would call
#df_bball.iloc[5:7,:] # gives you the first row/observation, recall we start counting at 0
df_bball.loc[5:7,:]
# df_bball[5:7] # this also works if you are looking at multiple columns, but better to use .loc or .iloc
salary | teamsal | nl | years | games | atbats | runs | hits | doubles | triples | ... | runsyr | percwhte | percblck | perchisp | blckpb | hispph | whtepw | blckph | hisppb | lsalary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 700000 | 38407380 | 1 | 17 | 2032 | 7489 | 1136 | 2145 | 270 | 142 | ... | 66.82353 | 70.27797 | 18.84423 | 10.8778 | 18.84423 | 0.0000 | 0.00000 | 10.8778 | 0.00000 | 13.45884 |
6 | 695000 | 38407380 | 1 | 4 | 394 | 1293 | 179 | 303 | 51 | 13 | ... | 44.75000 | 70.27797 | 18.84423 | 10.8778 | 0.00000 | 10.8778 | 0.00000 | 0.0000 | 18.84423 | 13.45167 |
7 | 275000 | 38407380 | 1 | 10 | 432 | 1005 | 78 | 240 | 35 | 5 | ... | 7.80000 | 70.27797 | 18.84423 | 10.8778 | 0.00000 | 0.0000 | 70.27797 | 0.0000 | 0.00000 | 12.52453 |
3 rows × 47 columns
salary | teamsal | nl | years | games | atbats | runs | hits | doubles | triples | ... | runsyr | percwhte | percblck | perchisp | blckpb | hispph | whtepw | blckph | hisppb | lsalary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 700000 | 38407380 | 1 | 17 | 2032 | 7489 | 1136 | 2145 | 270 | 142 | ... | 66.82353 | 70.27797 | 18.84423 | 10.8778 | 18.84423 | 0.0000 | 0.0 | 10.8778 | 0.00000 | 13.45884 |
6 | 695000 | 38407380 | 1 | 4 | 394 | 1293 | 179 | 303 | 51 | 13 | ... | 44.75000 | 70.27797 | 18.84423 | 10.8778 | 0.00000 | 10.8778 | 0.0 | 0.0000 | 18.84423 | 13.45167 |
2 rows × 47 columns
Objects and Pointers¶
Let's start with a little example. We define test
which is defined as the first 5 observations of the salary
variable.
test = df_bball.iloc[0:5,0]
test
0 6329213 1 3375000 2 3100000 3 2900000 4 1650000 Name: salary, dtype: int64
Now, for som inexplicable reason, you decide to change the first three values in test
to 0.
test[0:3] = 0
test
0 0 1 0 2 0 3 2900000 4 1650000 Name: salary, dtype: int64
So, that worked. Let's have a look at the original dataframe from where we copied these data.
df_bball.iloc[0:5,0]
0 0 1 0 2 0 3 2900000 4 1650000 Name: salary, dtype: int64
What??!!! How did that happen? Here is what that original line of code test = df_bball.iloc[0:5,0]
actually did. It did not create a "copy" in the usual way. What it actually did was to just create a new and alternative name for these particular five elements of df_bball.iloc[0:5,0]
. Think of a street address and a PO.Box which delivers mail to the same street address. You could either use the street address or the PO.Box and your mail would go to the same place. This is what we did, test
only points at df_bball.iloc[0:5,0]
, that is why it is also called a pointer.
So it is best to think of test
not as a variable but rather as an object, one that points at another.
Clearly the above behaviour is not one we really want, if you thought that, once defined test
is its own independent object. So what to do?
Return to the earlier line where you defined the new object test
and replace it by test = df_bball.iloc[0:5,0].copy()
. But before doing that we have to re-load the data.
df_bball = pd.read_csv(url, na_values = '.')
test = df_bball.iloc[0:5,0].copy()
test[0:3] = 0
df_bball.iloc[0:5,0]
0 6329213 1 3375000 2 3100000 3 2900000 4 1650000 Name: salary, dtype: int64
Observe the different behaviour. Now the values in the original dataframe df_bball
have not changed.
By using the .copy()
when you defined test
you actually created what we would understand as being a copy. This you will have to keep in mind when you actually want to create a copy of data. Especially if you think data will be changed you need to carefully think about whether you will want them changed in the original dataframe or not.
Boolean variables and subsetting objects¶
Boolean variables are variables that take the value TRUE or FALSE. Let's illustrate this.
test = df_bball['salary'] > 1000000
test[0:7]
0 True 1 True 2 True 3 True 4 True 5 False 6 False Name: salary, dtype: bool
For each of the 353 baseball players we asked the question whether they earned more than USD 1,000,000. We then looked at the first 7 observations in the sample and saw that the first 5 did indeed earn more than USD 1,000,000 (TRUE) but the next two did not (FALSE). In test
you will find the answer to that question for all 353 baseball players.
You may be wondering how many in the sample did earn more than that threshold? sum(test)
counts the number of TRUE
values. np.mean(test)
calculates the proportion and test.describe()
tells you which answer was the most frequent.
sum(test)
# np.mean(test)
# test.describe()
143
You may wonder what the big deal is. In what comes in this walkthrough you will find some uses of Boolean variables. But here is one. Say you wanted to create a new dataframe with all the players that did earn more than USD 1,000,000, then here are a number of ways to do that (note that here we are creating copies).
millionaires1 = df_bball[test].copy() # where we use the variable defined above
millionaires2 = df_bball.loc[test].copy()
millionaires3 = df_bball.loc[test,:].copy()
millionaires4 = df_bball[df_bball['salary'] > 1000000].copy()
millionaires5 = df_bball.query('salary > 1000000').copy()
There is not a right or wrong way of doing this.
It is important to understand that you can create Boolean variables from a combination of conditions, say you wanted to select all players who earn more than USD 1,000,000 and have not more than 5 years of experience.
test2 = ( df_bball['salary'] > 1000000) & (df_bball['years'] < 6)
sum(test2)
28
There are only 28 players that fit this criterion. Here the two conditions were connected with the AND (&
) opeator. You may also connect conditions with an OR (|
) operator. There are a few more important operator you may need: IS EQUAL TO is represented by ==
, UNEQUAL TO is !=
.
if-else¶
An important tool in the toolbox of any applied economics and econometrician is the if-else condition. This is important as one will usually write code that works irrespective of certain conditions. Here we will demonstrate this with a simple example.
var1 = "salary"
var2 = "years"
# np.corrcoef delivers a 2x2 matrix, the corr is in the off-diagonal , eg [0,1]
c12 = np.corrcoef(df_bball[var1], df_bball[var2])[0,1]
if c12 > 0:
print("the correlation is positive")
else:
print("the correlation is not positive")
the correlation is positive
You can also think of more than 2 cases.
if c12 > 0.1:
print("the correlation is positive")
elif c12 < -0.1:
print("the correlation is not positive")
else:
print("the variables are basically uncorrelated")
the correlation is positive
Here, we only have a single line of code (print ( )
), but there could be multiple lines of code after every condition. Also note that the code lines between if
, elif
and else
are indent. This is important, the indent tells Python when the code ends and when the next condition comes.
Loops¶
Loops are a way to make your code repeat bits of code, potentially many, many times. Above we calculated a correlation between two variables. We have many variables (check df_bball.shape
) to confirm that there are 47 variables altogether. We saw previously that they are all numerical variables. Let's say you wanted to figure out whether any of the variables is negatively correlated with salary
. We would now have to calculate 46 correlations, the correlations of all variables, other than salary
, with the variable salary
. As salary is the first variable (at position 0), we need to loop through variables 1 to 46.
range(df_bball.shape[1])
range(0, 47)
words = ["Here", "I","want","to","concatenate","words","using","pipe","delimeter"]
print(" ".join([df_bball.columns[1], "is uncorrelated"]))
teamsal is uncorrelated
df_bball.columns[1]
'teamsal'
var1 = "salary"
for i in range(1,47):
# np.corrcoef delivers a 2x2 matrix, the corr is in the off-diagonal , eg [0,1]
c12 = np.corrcoef(df_bball[var1], df_bball.iloc[:,i])[0,1]
if c12 > 0.1:
print(" ".join([df_bball.columns[i], "is positively correlated with salary"]))
elif c12 < -0.1:
print(" ".join(["!!!", df_bball.columns[i], "is negatively correlated with salary"]))
else:
print(" ".join([df_bball.columns[i], "is uncorrelated with salary"]))
teamsal is positively correlated with salary nl is uncorrelated with salary years is positively correlated with salary games is positively correlated with salary atbats is positively correlated with salary runs is positively correlated with salary hits is positively correlated with salary doubles is positively correlated with salary triples is positively correlated with salary hruns is positively correlated with salary rbis is positively correlated with salary bavg is positively correlated with salary bb is positively correlated with salary so is positively correlated with salary sbases is positively correlated with salary fldperc is uncorrelated with salary frstbase is uncorrelated with salary scndbase is uncorrelated with salary shrtstop is uncorrelated with salary thrdbase is uncorrelated with salary outfield is positively correlated with salary !!! catcher is negatively correlated with salary yrsallst is positively correlated with salary hispan is uncorrelated with salary black is positively correlated with salary whitepop is uncorrelated with salary blackpop is uncorrelated with salary hisppop is uncorrelated with salary pcinc is uncorrelated with salary gamesyr is positively correlated with salary hrunsyr is positively correlated with salary atbatsyr is positively correlated with salary allstar is positively correlated with salary slugavg is positively correlated with salary rbisyr is positively correlated with salary sbasesyr is positively correlated with salary runsyr is positively correlated with salary percwhte is uncorrelated with salary percblck is uncorrelated with salary perchisp is uncorrelated with salary blckpb is uncorrelated with salary hispph is uncorrelated with salary whtepw is uncorrelated with salary blckph is uncorrelated with salary hisppb is uncorrelated with salary lsalary is positively correlated with salary
Only one variable is negatively correlated (as defined by a correlation smaller than -0.1). It turns out that very often loops are actually not necessary. The above could have been achieved by using the .corr()
method applied to our dataframe df_bball
.
df_bball.corr()
salary | teamsal | nl | years | games | atbats | runs | hits | doubles | triples | ... | runsyr | percwhte | percblck | perchisp | blckpb | hispph | whtepw | blckph | hisppb | lsalary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
salary | 1.000000 | 0.224725 | -0.049160 | 0.478152 | 0.590643 | 0.618650 | 0.651922 | 0.627098 | 0.628841 | 0.480468 | ... | 0.703346 | -0.005084 | 0.026013 | -0.029627 | 0.121162 | -0.009062 | -0.033995 | 0.069191 | -0.078252 | 0.902083 |
teamsal | 0.224725 | 1.000000 | -0.218707 | 0.177434 | 0.176442 | 0.172746 | 0.213004 | 0.172080 | 0.160806 | 0.160525 | ... | 0.188722 | -0.208500 | 0.224475 | 0.011364 | 0.102823 | -0.056093 | 0.040622 | -0.005969 | -0.016805 | 0.211406 |
nl | -0.049160 | -0.218707 | 1.000000 | -0.119650 | -0.109963 | -0.113653 | -0.118294 | -0.112177 | -0.122457 | -0.025921 | ... | -0.086451 | -0.202080 | 0.115924 | 0.158995 | 0.090623 | 0.025824 | -0.104910 | 0.123718 | 0.017757 | -0.058049 |
years | 0.478152 | 0.177434 | -0.119650 | 1.000000 | 0.941298 | 0.899876 | 0.840024 | 0.882413 | 0.841498 | 0.743295 | ... | 0.501848 | 0.048514 | 0.001075 | -0.080254 | 0.047350 | -0.081055 | 0.049047 | 0.044106 | -0.077223 | 0.580721 |
games | 0.590643 | 0.176442 | -0.109963 | 0.941298 | 1.000000 | 0.988557 | 0.937558 | 0.977384 | 0.938055 | 0.824304 | ... | 0.683550 | 0.070722 | -0.038511 | -0.058642 | 0.052607 | -0.073656 | 0.031436 | 0.065671 | -0.103336 | 0.654003 |
atbats | 0.618650 | 0.172746 | -0.113653 | 0.899876 | 0.988557 | 1.000000 | 0.956522 | 0.994506 | 0.957641 | 0.844203 | ... | 0.722409 | 0.085024 | -0.054600 | -0.058414 | 0.055159 | -0.069456 | 0.017362 | 0.064702 | -0.107641 | 0.655399 |
runs | 0.651922 | 0.213004 | -0.118294 | 0.840024 | 0.937558 | 0.956522 | 1.000000 | 0.963780 | 0.930113 | 0.856561 | ... | 0.788967 | 0.076957 | -0.041508 | -0.064391 | 0.090453 | -0.071851 | -0.016378 | 0.088177 | -0.116891 | 0.663776 |
hits | 0.627098 | 0.172080 | -0.112177 | 0.882413 | 0.977384 | 0.994506 | 0.963780 | 1.000000 | 0.964148 | 0.854465 | ... | 0.732789 | 0.090167 | -0.062515 | -0.055232 | 0.057955 | -0.067400 | 0.005132 | 0.079937 | -0.108191 | 0.652359 |
doubles | 0.628841 | 0.160806 | -0.122457 | 0.841498 | 0.938055 | 0.957641 | 0.930113 | 0.964148 | 1.000000 | 0.795545 | ... | 0.716852 | 0.107900 | -0.069011 | -0.074537 | 0.023351 | -0.089703 | 0.065754 | 0.048074 | -0.130983 | 0.647727 |
triples | 0.480468 | 0.160525 | -0.025921 | 0.743295 | 0.824304 | 0.844203 | 0.856561 | 0.854465 | 0.795545 | 1.000000 | ... | 0.646384 | 0.082546 | -0.054937 | -0.053904 | 0.097226 | -0.024679 | -0.065440 | 0.090673 | -0.072336 | 0.504266 |
hruns | 0.613213 | 0.139215 | -0.114294 | 0.674362 | 0.771054 | 0.792835 | 0.760319 | 0.782930 | 0.797947 | 0.518236 | ... | 0.598251 | 0.075205 | -0.058548 | -0.036741 | 0.040310 | -0.090531 | 0.030598 | 0.107513 | -0.123739 | 0.573756 |
rbis | 0.649485 | 0.160775 | -0.128128 | 0.822251 | 0.924259 | 0.943888 | 0.901255 | 0.942423 | 0.935874 | 0.716881 | ... | 0.690810 | 0.092244 | -0.060728 | -0.061204 | 0.052452 | -0.080551 | 0.022409 | 0.085208 | -0.120404 | 0.647311 |
bavg | 0.281900 | 0.083113 | -0.031582 | 0.197296 | 0.267397 | 0.287366 | 0.310146 | 0.320975 | 0.313955 | 0.271371 | ... | 0.380220 | 0.145895 | -0.071949 | -0.131887 | 0.099048 | -0.030701 | -0.095194 | 0.102388 | -0.082557 | 0.283009 |
bb | 0.586754 | 0.153339 | -0.097623 | 0.814355 | 0.906455 | 0.909815 | 0.908050 | 0.910219 | 0.883706 | 0.726980 | ... | 0.703634 | 0.063377 | -0.048783 | -0.031773 | 0.059800 | -0.117700 | 0.047723 | 0.116229 | -0.152144 | 0.606896 |
so | 0.544744 | 0.168199 | -0.042090 | 0.765119 | 0.834218 | 0.834941 | 0.784454 | 0.808133 | 0.785937 | 0.654337 | ... | 0.633199 | 0.078465 | -0.066673 | -0.030198 | 0.049641 | -0.069235 | 0.004627 | 0.091412 | -0.113120 | 0.593214 |
sbases | 0.376116 | 0.125609 | 0.014823 | 0.560441 | 0.613132 | 0.631819 | 0.702982 | 0.636881 | 0.553066 | 0.751554 | ... | 0.565905 | 0.015315 | -0.015711 | -0.001967 | 0.184581 | -0.040290 | -0.153763 | 0.210751 | -0.078756 | 0.398676 |
fldperc | 0.071402 | -0.041351 | 0.016791 | 0.112168 | 0.094774 | 0.080378 | 0.072477 | 0.081311 | 0.069012 | 0.060381 | ... | 0.105207 | -0.048408 | 0.058598 | -0.006798 | -0.011337 | -0.029590 | 0.018552 | -0.003822 | -0.017526 | 0.118883 |
frstbase | 0.065578 | 0.049948 | -0.058107 | -0.012408 | 0.038432 | 0.055595 | 0.073159 | 0.071536 | 0.099911 | -0.032794 | ... | 0.108391 | 0.028566 | 0.006097 | -0.055211 | -0.057334 | 0.017638 | 0.039442 | -0.082586 | -0.029754 | 0.045705 |
scndbase | -0.008773 | 0.009934 | -0.048314 | 0.007032 | 0.036160 | 0.027751 | 0.033201 | 0.030417 | 0.008469 | 0.017570 | ... | 0.087002 | -0.026734 | 0.003028 | 0.038955 | -0.076319 | -0.080344 | 0.078264 | 0.011302 | -0.010356 | 0.003029 |
shrtstop | -0.078979 | 0.006103 | -0.005252 | 0.000078 | -0.011573 | -0.025996 | -0.051919 | -0.040814 | -0.039417 | 0.020976 | ... | -0.094147 | 0.032263 | 0.001170 | -0.054034 | -0.109217 | 0.151295 | -0.021565 | -0.121644 | 0.181039 | -0.088137 |
thrdbase | 0.008589 | 0.011661 | 0.034967 | -0.000190 | -0.001043 | 0.005332 | 0.005217 | 0.002289 | 0.028301 | -0.066958 | ... | 0.044522 | -0.056700 | 0.019792 | 0.063150 | -0.009098 | -0.028096 | 0.096326 | -0.022121 | -0.042604 | 0.036343 |
outfield | 0.109087 | -0.049246 | 0.049825 | 0.043108 | 0.084796 | 0.097174 | 0.128265 | 0.100470 | 0.069242 | 0.206340 | ... | 0.162219 | 0.037675 | -0.047903 | 0.008636 | 0.296717 | -0.037551 | -0.287567 | 0.276932 | -0.076852 | 0.098903 |
catcher | -0.134022 | -0.003627 | 0.004035 | -0.053513 | -0.171696 | -0.188811 | -0.227358 | -0.193654 | -0.181522 | -0.232384 | ... | -0.345159 | -0.039179 | 0.039852 | 0.005527 | -0.176030 | -0.018857 | 0.231697 | -0.177190 | 0.003305 | -0.125705 |
yrsallst | 0.585597 | 0.108233 | -0.052015 | 0.575058 | 0.694271 | 0.740150 | 0.758991 | 0.767766 | 0.752787 | 0.620296 | ... | 0.567721 | 0.066294 | -0.069767 | -0.005951 | 0.047902 | -0.046810 | -0.063231 | 0.122544 | -0.072812 | 0.486355 |
hispan | -0.098765 | -0.076300 | -0.006757 | -0.113578 | -0.109627 | -0.102324 | -0.118684 | -0.102932 | -0.128557 | -0.054698 | ... | -0.115614 | -0.012705 | 0.007143 | 0.010209 | -0.218643 | 0.726811 | -0.459960 | -0.207599 | 0.769132 | -0.104890 |
black | 0.145830 | 0.003009 | 0.056631 | 0.082212 | 0.101883 | 0.111042 | 0.153198 | 0.123018 | 0.083906 | 0.155459 | ... | 0.186330 | 0.054468 | 0.004719 | -0.095218 | 0.693441 | -0.229165 | -0.652618 | 0.658415 | -0.242509 | 0.120053 |
whitepop | 0.066950 | 0.361005 | -0.032267 | 0.058571 | 0.071079 | 0.076078 | 0.090724 | 0.081674 | 0.066012 | 0.054388 | ... | 0.076966 | -0.101745 | -0.228453 | 0.497645 | -0.081125 | 0.043796 | 0.056013 | 0.211849 | -0.147453 | 0.037944 |
blackpop | 0.092536 | 0.437356 | -0.010157 | 0.089686 | 0.086237 | 0.087322 | 0.105629 | 0.090023 | 0.072147 | 0.060924 | ... | 0.095545 | -0.277057 | 0.107569 | 0.292771 | 0.063291 | -0.011444 | 0.009045 | 0.143331 | -0.062272 | 0.086067 |
hisppop | 0.013839 | 0.202042 | 0.032004 | -0.014732 | 0.003918 | 0.005798 | 0.008567 | 0.009761 | -0.002821 | -0.000564 | ... | -0.004348 | -0.301056 | -0.204505 | 0.786060 | -0.095480 | 0.173869 | 0.012882 | 0.285679 | -0.100998 | -0.017919 |
pcinc | 0.087388 | 0.189144 | -0.065420 | 0.084329 | 0.094369 | 0.102198 | 0.110055 | 0.104433 | 0.104527 | 0.104063 | ... | 0.097856 | 0.419942 | -0.326618 | -0.205609 | -0.108292 | -0.167394 | 0.107294 | -0.011007 | -0.129724 | 0.062538 |
gamesyr | 0.631446 | 0.151354 | -0.067027 | 0.562394 | 0.727361 | 0.732002 | 0.702771 | 0.722837 | 0.703369 | 0.593436 | ... | 0.904229 | 0.038884 | -0.008199 | -0.051131 | 0.049235 | -0.044713 | 0.045313 | 0.051931 | -0.102418 | 0.748082 |
hrunsyr | 0.625160 | 0.150169 | -0.080508 | 0.380162 | 0.511883 | 0.545031 | 0.542792 | 0.533216 | 0.568394 | 0.287547 | ... | 0.653594 | 0.014940 | 0.007304 | -0.034866 | 0.090495 | -0.080439 | 0.004783 | 0.105047 | -0.108702 | 0.595469 |
atbatsyr | 0.671876 | 0.145633 | -0.079880 | 0.542086 | 0.728203 | 0.763757 | 0.740678 | 0.759899 | 0.741534 | 0.629762 | ... | 0.940864 | 0.065100 | -0.037271 | -0.051327 | 0.068186 | -0.037086 | 0.016774 | 0.052075 | -0.104621 | 0.746726 |
allstar | 0.663772 | 0.118567 | -0.039446 | 0.427987 | 0.559417 | 0.607785 | 0.636431 | 0.635221 | 0.624506 | 0.488006 | ... | 0.615430 | 0.039720 | -0.053905 | 0.014057 | 0.058046 | -0.029291 | -0.066734 | 0.137831 | -0.060998 | 0.554186 |
slugavg | 0.143585 | 0.088965 | 0.024717 | 0.076681 | 0.098745 | 0.075407 | 0.110289 | 0.107653 | 0.136653 | 0.056517 | ... | 0.138068 | -0.010690 | -0.029322 | 0.060031 | 0.053253 | -0.051838 | -0.061845 | 0.248921 | -0.054986 | 0.160448 |
rbisyr | 0.707499 | 0.154711 | -0.102559 | 0.487107 | 0.662632 | 0.701526 | 0.684117 | 0.701694 | 0.719388 | 0.479284 | ... | 0.851111 | 0.057420 | -0.021187 | -0.062288 | 0.085368 | -0.064587 | 0.011242 | 0.080958 | -0.118167 | 0.727766 |
sbasesyr | 0.327704 | 0.040428 | 0.051674 | 0.279965 | 0.373591 | 0.401892 | 0.478339 | 0.408158 | 0.336432 | 0.554895 | ... | 0.623379 | 0.014109 | -0.021120 | 0.007865 | 0.228588 | -0.021651 | -0.177808 | 0.231123 | -0.088694 | 0.363151 |
runsyr | 0.703346 | 0.188722 | -0.086451 | 0.501848 | 0.683550 | 0.722409 | 0.788967 | 0.732789 | 0.716852 | 0.646384 | ... | 1.000000 | 0.058002 | -0.022174 | -0.061795 | 0.136203 | -0.051977 | -0.035570 | 0.102455 | -0.123019 | 0.742489 |
percwhte | -0.005084 | -0.208500 | -0.202080 | 0.048514 | 0.070722 | 0.085024 | 0.076957 | 0.090167 | 0.107900 | 0.082546 | ... | 0.058002 | 1.000000 | -0.794112 | -0.465818 | -0.302471 | -0.155558 | 0.163746 | -0.148026 | -0.211004 | -0.036455 |
percblck | 0.026013 | 0.224475 | 0.115924 | 0.001075 | -0.038511 | -0.054600 | -0.041508 | -0.062515 | -0.069011 | -0.054937 | ... | -0.022174 | -0.794112 | 1.000000 | -0.167894 | 0.425222 | -0.033992 | -0.168855 | -0.068376 | 0.256956 | 0.085693 |
perchisp | -0.029627 | 0.011364 | 0.158995 | -0.080254 | -0.058642 | -0.058414 | -0.064391 | -0.055232 | -0.074537 | -0.053904 | ... | -0.061795 | -0.465818 | -0.167894 | 1.000000 | -0.128492 | 0.301806 | -0.019753 | 0.339650 | -0.031864 | -0.065634 |
blckpb | 0.121162 | 0.102823 | 0.090623 | 0.047350 | 0.052607 | 0.055159 | 0.090453 | 0.057955 | 0.023351 | 0.097226 | ... | 0.136203 | -0.302471 | 0.425222 | -0.128492 | 1.000000 | -0.158912 | -0.452552 | 0.368624 | -0.168166 | 0.130458 |
hispph | -0.009062 | -0.056093 | 0.025824 | -0.081055 | -0.073656 | -0.069456 | -0.071851 | -0.067400 | -0.089703 | -0.024679 | ... | -0.051977 | -0.155558 | -0.033992 | 0.301806 | -0.158912 | 1.000000 | -0.334304 | -0.150886 | 0.495370 | -0.000700 |
whtepw | -0.033995 | 0.040622 | -0.104910 | 0.049047 | 0.031436 | 0.017362 | -0.016378 | 0.005132 | 0.065754 | -0.065440 | ... | -0.035570 | 0.163746 | -0.168855 | -0.019753 | -0.452552 | -0.334304 | 1.000000 | -0.429694 | -0.353770 | -0.007709 |
blckph | 0.069191 | -0.005969 | 0.123718 | 0.044106 | 0.065671 | 0.064702 | 0.088177 | 0.079937 | 0.048074 | 0.090673 | ... | 0.102455 | -0.148026 | -0.068376 | 0.339650 | 0.368624 | -0.150886 | -0.429694 | 1.000000 | -0.159671 | 0.047705 |
hisppb | -0.078252 | -0.016805 | 0.017757 | -0.077223 | -0.103336 | -0.107641 | -0.116891 | -0.108191 | -0.130983 | -0.072336 | ... | -0.123019 | -0.211004 | 0.256956 | -0.031864 | -0.168166 | 0.495370 | -0.353770 | -0.159671 | 1.000000 | -0.075591 |
lsalary | 0.902083 | 0.211406 | -0.058049 | 0.580721 | 0.654003 | 0.655399 | 0.663776 | 0.652359 | 0.647727 | 0.504266 | ... | 0.742489 | -0.036455 | 0.085693 | -0.065634 | 0.130458 | -0.000700 | -0.007709 | 0.047705 | -0.075591 | 1.000000 |
47 rows × 47 columns
Looking down the first column (correlations with salary
) you will find that only the correlation with the catcher
variable is smaller than -0.1. Note that the catcher
variable is a binary/dummy variable which takes the value 1 only for players which are catchers and 0 otherwise.
You can see that the .apply
method can be used to apply functions to rows or columns.
Practice the loop¶
The task¶
Calculate the sum of the numbers from 1 to 1000 using a loop.
The solution¶
k = 1000
sol = 0
for i in range(1,k+1):
sol = sol + i
print(sol)
500500
¶
The .apply method¶
With the apply method you can apply a function easily to each row or column of a DataFrame. It allows you to quickly perform operations on your data without needing to write complicated loops or manually iterate over each row or column. Let's illustrate this with an example. We will look at our dataframe df_bball
. We wish to find the maximum across the ['salary', 'teamsalary', 'hits']
columns.
df_bball[['salary', 'teamsal', 'hits']].apply(max,axis=0) # applies the max functions to all columns (axis = 0)
salary 6329213 teamsal 42866000 hits 3025 dtype: int64
You could also find the max value across all variables for players 0 to 4.
df_bball.iloc[0:4,].apply(max,axis=1) # applies the max functions to all rows (axis = 1)
0 38407380.0 1 38407380.0 2 38407380.0 3 38407380.0 dtype: float64
For each player we actually pick out his team's salary and clearly they all play for the same team (with a salary of 38407380).
You can see that the .apply
method can be used to apply functions to rows or columns.
Practice the .apply method¶
The task¶
Calculate the mean of every variable in df_bball
using the apply
method. The mean can be calculated using the np.mean
function.
The solution¶
df_bball.apply(np.mean,axis=0) # applies the max functions to all columns (axis = 0)
salary 1.345672e+06 teamsal 3.079483e+07 nl 4.759207e-01 years 6.325779e+00 games 6.484249e+02 atbats 2.168592e+03 runs 2.904023e+02 hits 5.847365e+02 doubles 1.038810e+02 triples 1.673371e+01 hruns 5.503116e+01 rbis 2.647450e+02 bavg 2.589858e+02 bb 2.120822e+02 so 3.404079e+02 sbases 6.580170e+01 fldperc 9.765099e+02 frstbase 1.274788e-01 scndbase 1.048159e-01 shrtstop 1.388102e-01 thrdbase 9.631728e-02 outfield 3.852691e-01 catcher 1.473088e-01 yrsallst 8.923513e-01 hispan 1.813031e-01 black 3.059490e-01 whitepop 3.984508e+06 blackpop 7.887030e+05 hisppop 9.006723e+05 pcinc 1.883314e+04 gamesyr 9.007604e+01 hrunsyr 7.119053e+00 atbatsyr 2.930280e+02 allstar 9.242893e+00 slugavg 3.953016e+01 rbisyr 3.505021e+01 sbasesyr 8.501081e+00 runsyr 3.858299e+01 percwhte 7.263109e+01 percblck 1.654893e+01 perchisp 1.081999e+01 blckpb 5.144886e+00 hispph 2.004179e+00 whtepw 3.666764e+01 blckph 2.931898e+00 hisppb 3.046493e+00 lsalary 1.349218e+01 dtype: float64
¶
functions¶
Functions are at the core of all programming languages. In fact, even in this walk-through we have already used functions. Earlier we calculated the mean value of some variables. When you called the np.mean
for instance, Python called up a function that calculated a mean of a variable. You could do that without you seeing the details of the mean calculation. That function was written by someone else and you could use that function as it was imported with the pandas
package.
Here we will learn how to write a function ourselves. You would want to do that in case there is a combination of operations you may wish to repeatedly apply.
# here we define a function, sum_integers_to_k is the name of the function
# it requires one input
def sum_integers_to_k(k):
sol = 0
for i in range(1,k+1):
sol = sol + i
return sol
Once you run this code, noting obvious will happen. However, Python will have stored this function in its memory and when you run the next line (sum_integers_to_k(40)
), it will go to that code use 40 as the value of k
, calculate the sum of all numbers to 40 and then return that value.
ans1 = sum_integers_to_k(40)
print(ans1)
820
Practice functions¶
The task¶
Write a function that takes two numbers as inputs, k1
and k2
, and sums up all the integers between these two numbers (including k1
and k2
). The function should return that sum. In order to understand how to use two parameters you may have to find on the internet examples of functions that accept multiple inputs. You could use search terms like "Python functions with two parameters".
The solution¶
Adjust the code of the simple example to our problem at hand. See how this is done in the next step.
# here we define a function, sum_integers_to_k is the name of the function
# it requires one input
def sum_integers_between(k1,k2):
sol = 0
for i in range(k1,k2+1):
sol = sol + i
return sol
And now you can call that function.
ans2 = sum_integers_between(10,40)
print(ans2)
775
¶
lambda functions¶
Lambda functions are funny creatures and it may not be immediately obvious what their values are. Above we introduced functions as something you write when you want to repeatedly call the same set of code lines (with potentially different parameters). lambda
functions are not that. lambda
functions are used to do little things which you will only do once, i.e. only apply to one dataframe but not repeatedly to several dataframes. For instance we could consider writing a short function which checks whether a particular baseball player can be considered a veteran (very experienced player, years
> 10). We use that to create a new variable Exper_cat
.
temp = df_bball.copy()
temp['Exper_cat'] = temp.apply(lambda row: 'Veteran' if (row['years'] > 10) else 'Not Veteran', axis = 1)
# Convert the 'Exper_cat' column to a categorical type
temp['Exper_cat'] = pd.Categorical(temp['Exper_cat'])
Here is what happened in the above line in which we define the new variable (temp['Exper_cat'] =
). We use the .apply( .... , axis = 1)
method meaning that we do something for every row of dataframe temp
. Recall that in place of the four dots ....
we would name the function we want to apply. However, Python does not have a is_the_player_a_veteran_function
ready for use. We therefore need to write that function. And we write that function on teh spot (as we do not think that we will need it again later). So what did lambda row: 'Veteran' if (row['years'] > 10) else 'Not Veteran'
do?
This started with lambda
. This signals to Python that a special type of function is coming. This function has one input which here is labeled row
but it could be labeled however you want. Here we call it row, as the apply
method is applied by rows, so the apply
method will "hand-over2 one row at a time to the lambda
function. The rest of the line now determines what should happen. Here we are asking the function to return Veteran
if the years
variable is larger 10 and Not Veteran
in case that is not correct.
This looks of course very much like an if else
function, although its architecture is somewhat different (compare to the earlier section on if else
functions).
The reason why this is incredibly useful is that the function can become a little more complicated allowing a new variable definition to depend on multiple other variables. Here we illustrate this by defining a new variable in the df_bball
dataframe called position
which indicates which position a player plays in. The definition depends on which of the positional dummy variables (frstbase
to cathcher
) is equal to 1.
df_bball['position'] = df_bball.apply(
lambda row: 'First Base' if row['frstbase'] == 1 else
'Second Base' if row['scndbase'] == 1 else
'Shortstop' if row['shrtstop'] == 1 else
'Third Base' if row['thrdbase'] == 1 else
'Outfield' if row['outfield'] == 1 else
'Catcher' if row['catcher'] == 1 else None, axis=1
)
# Convert the 'position' column to a categorical type
df_bball['position'] = pd.Categorical(df_bball['position'])
Summary¶
In this walkthrough you have learned some basic programming techniques. You will encounter these whenever you try to understanding someone else's code or you have to develop code yourself that does more than just simple work.
This walk-through is part of the ECLR page.