%load_ext pretty_jupyter
The pretty_jupyter extension is already loaded. To reload it, use:
  %reload_ext pretty_jupyter

Introduction

This walk-through is part of the ECLR page.

If you write code for data analysis you will eventually have to master a number of basic programming techniques. They are not directly related to data analysis but will be required for any bit of coding. These are

  • Subsetting
  • Objects and pointers
  • Boolean variables
  • if else conditions
  • loops
  • apply methods
  • functions
  • lambda functions

In this walkthrough we will show simple examples of these to help you understand code written by other people and to be able to write more complex code yourself.

Preparing your workfile

Install the lets-plot package in case you have not yet done so on your computer (pip install lets-plot in your terminal).

We add the basic packages needed for this work:

import numpy as np
import pandas as pd

Loading a dataset

Let's get a dataset to look at. We shall use the Baseball wages dataset, mlb1.csv. Either download that file to your working directory or directly link to the file as shown below. Before you upload a csv (or EXCEL) file you should always inspect the file to understand what you should expect. Upon inspection you will find that missing values are coded as "." and hence we pass that information on to the pd.read_csv function.

# Load the dataset
url = "https://raw.githubusercontent.com/datasquad/ECLR/refs/heads/gh-pages/data/mlb1.csv"
df_bball = pd.read_csv(url, na_values = '.')

Let's check out what variables we have in this data-file. Note that Python starts counting variables from 0!

df_bball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 353 entries, 0 to 352
Data columns (total 47 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   salary    353 non-null    int64  
 1   teamsal   353 non-null    int64  
 2   nl        353 non-null    int64  
 3   years     353 non-null    int64  
 4   games     353 non-null    int64  
 5   atbats    353 non-null    int64  
 6   runs      353 non-null    int64  
 7   hits      353 non-null    int64  
 8   doubles   353 non-null    int64  
 9   triples   353 non-null    int64  
 10  hruns     353 non-null    int64  
 11  rbis      353 non-null    int64  
 12  bavg      353 non-null    int64  
 13  bb        353 non-null    int64  
 14  so        353 non-null    int64  
 15  sbases    353 non-null    int64  
 16  fldperc   353 non-null    int64  
 17  frstbase  353 non-null    int64  
 18  scndbase  353 non-null    int64  
 19  shrtstop  353 non-null    int64  
 20  thrdbase  353 non-null    int64  
 21  outfield  353 non-null    int64  
 22  catcher   353 non-null    int64  
 23  yrsallst  353 non-null    int64  
 24  hispan    353 non-null    int64  
 25  black     353 non-null    int64  
 26  whitepop  330 non-null    float64
 27  blackpop  339 non-null    float64
 28  hisppop   330 non-null    float64
 29  pcinc     353 non-null    int64  
 30  gamesyr   353 non-null    float64
 31  hrunsyr   353 non-null    float64
 32  atbatsyr  353 non-null    float64
 33  allstar   353 non-null    float64
 34  slugavg   353 non-null    float64
 35  rbisyr    353 non-null    float64
 36  sbasesyr  353 non-null    float64
 37  runsyr    353 non-null    float64
 38  percwhte  330 non-null    float64
 39  percblck  330 non-null    float64
 40  perchisp  330 non-null    float64
 41  blckpb    330 non-null    float64
 42  hispph    330 non-null    float64
 43  whtepw    330 non-null    float64
 44  blckph    330 non-null    float64
 45  hisppb    330 non-null    float64
 46  lsalary   353 non-null    float64
dtypes: float64(20), int64(27)
memory usage: 129.7 KB

There is a bunch of variables who describe characteristics of the 353 baseball players in the sample.

Subsetting

The process of accessing only a particular part of a dataframe is often called subsetting. There are, as usual several ways to do that.

Subsetting a single column/variable

Here we will see how to access particular columns in a dataframe. Say you wish to access the hits variable. Each of the following three commands do exactly the same, access the hits variable.

#df_bball['hits']
# df_bball.loc[:,'hits']    # use iloc if you are using a name
df_bball.iloc[:, 7]  # use iloc if using integers like col numbers to access
0      1939
1       863
2       840
3       816
4       928
       ... 
348     260
349     183
350     176
351     510
352       4
Name: hits, Length: 353, dtype: int64

Subsetting multiple columns/variables

If you want to access multiple columns you can do that in the following way. Say you want to access atbats to hits.

df_bball[['atbats', 'runs', 'hits' ]]
# df_bball.loc[:,'atbats':'hits']    
# df_bball.iloc[:, 5:8]  # 8 is the first column NOT included
atbats runs hits
0 6705 1076 1939
1 3333 407 863
2 2807 370 840
3 3337 405 816
4 3603 437 928
... ... ... ...
348 1098 150 260
349 700 63 183
350 828 112 176
351 2087 217 510
352 24 1 4

353 rows × 3 columns

Note that ['atbats', 'runs', 'hits' ] basically produces a list of variables. That list does not have to have the same order as the variables in the original dataframe.

Let's say you also want the salary variable. The easiest is to use a list.

sel_vars = ['salary', 'atbats', 'runs', 'hits' ]
df_bball[sel_vars]
salary atbats runs hits
0 6329213 6705 1076 1939
1 3375000 3333 407 863
2 3100000 2807 370 840
3 2900000 3337 405 816
4 1650000 3603 437 928
... ... ... ... ...
348 312000 1098 150 260
349 275000 700 63 183
350 250000 828 112 176
351 200000 2087 217 510
352 109000 24 1 4

353 rows × 4 columns

Subsetting rows

If you want to subset a set of rows you can do this as follows if you wanted to view the 8th observation.

# df_bball.iloc[7,:]   # gives you the eights row/observation, recall we start counting at 0
df_bball.loc[7,:]   # gives you the eighth row/observation 
salary      2.750000e+05
teamsal     3.840738e+07
nl          1.000000e+00
years       1.000000e+01
games       4.320000e+02
atbats      1.005000e+03
runs        7.800000e+01
hits        2.400000e+02
doubles     3.500000e+01
triples     5.000000e+00
hruns       1.300000e+01
rbis        9.500000e+01
bavg        2.390000e+02
bb          3.900000e+01
so          1.400000e+02
sbases      1.000000e+00
fldperc     9.900000e+02
frstbase    0.000000e+00
scndbase    0.000000e+00
shrtstop    0.000000e+00
thrdbase    0.000000e+00
outfield    0.000000e+00
catcher     1.000000e+00
yrsallst    0.000000e+00
hispan      0.000000e+00
black       0.000000e+00
whitepop    5.772110e+06
blackpop    1.547725e+06
hisppop     8.934220e+05
pcinc       1.884000e+04
gamesyr     4.320000e+01
hrunsyr     1.300000e+00
atbatsyr    1.005000e+02
allstar     0.000000e+00
slugavg     3.223881e+01
rbisyr      9.500000e+00
sbasesyr    1.000000e-01
runsyr      7.800000e+00
percwhte    7.027797e+01
percblck    1.884423e+01
perchisp    1.087780e+01
blckpb      0.000000e+00
hispph      0.000000e+00
whtepw      7.027797e+01
blckph      0.000000e+00
hisppb      0.000000e+00
lsalary     1.252453e+01
Name: 7, dtype: float64

To get multiple rows you would call

#df_bball.iloc[5:7,:]   # gives you the first row/observation, recall we start counting at 0
df_bball.loc[5:7,:]
# df_bball[5:7]     # this also works if you are looking at multiple columns, but better to use .loc or .iloc
salary teamsal nl years games atbats runs hits doubles triples ... runsyr percwhte percblck perchisp blckpb hispph whtepw blckph hisppb lsalary
5 700000 38407380 1 17 2032 7489 1136 2145 270 142 ... 66.82353 70.27797 18.84423 10.8778 18.84423 0.0000 0.00000 10.8778 0.00000 13.45884
6 695000 38407380 1 4 394 1293 179 303 51 13 ... 44.75000 70.27797 18.84423 10.8778 0.00000 10.8778 0.00000 0.0000 18.84423 13.45167
7 275000 38407380 1 10 432 1005 78 240 35 5 ... 7.80000 70.27797 18.84423 10.8778 0.00000 0.0000 70.27797 0.0000 0.00000 12.52453

3 rows × 47 columns

 
salary teamsal nl years games atbats runs hits doubles triples ... runsyr percwhte percblck perchisp blckpb hispph whtepw blckph hisppb lsalary
5 700000 38407380 1 17 2032 7489 1136 2145 270 142 ... 66.82353 70.27797 18.84423 10.8778 18.84423 0.0000 0.0 10.8778 0.00000 13.45884
6 695000 38407380 1 4 394 1293 179 303 51 13 ... 44.75000 70.27797 18.84423 10.8778 0.00000 10.8778 0.0 0.0000 18.84423 13.45167

2 rows × 47 columns

Objects and Pointers

Let's start with a little example. We define test which is defined as the first 5 observations of the salary variable.

test = df_bball.iloc[0:5,0]
test
0    6329213
1    3375000
2    3100000
3    2900000
4    1650000
Name: salary, dtype: int64

Now, for som inexplicable reason, you decide to change the first three values in test to 0.

test[0:3] = 0
test
0          0
1          0
2          0
3    2900000
4    1650000
Name: salary, dtype: int64

So, that worked. Let's have a look at the original dataframe from where we copied these data.

df_bball.iloc[0:5,0]
0          0
1          0
2          0
3    2900000
4    1650000
Name: salary, dtype: int64

What??!!! How did that happen? Here is what that original line of code test = df_bball.iloc[0:5,0] actually did. It did not create a "copy" in the usual way. What it actually did was to just create a new and alternative name for these particular five elements of df_bball.iloc[0:5,0]. Think of a street address and a PO.Box which delivers mail to the same street address. You could either use the street address or the PO.Box and your mail would go to the same place. This is what we did, test only points at df_bball.iloc[0:5,0], that is why it is also called a pointer.

So it is best to think of test not as a variable but rather as an object, one that points at another.

Clearly the above behaviour is not one we really want, if you thought that, once defined test is its own independent object. So what to do?

Return to the earlier line where you defined the new object test and replace it by test = df_bball.iloc[0:5,0].copy(). But before doing that we have to re-load the data.

df_bball = pd.read_csv(url, na_values = '.')
test = df_bball.iloc[0:5,0].copy()
test[0:3] = 0
df_bball.iloc[0:5,0]
0    6329213
1    3375000
2    3100000
3    2900000
4    1650000
Name: salary, dtype: int64

Observe the different behaviour. Now the values in the original dataframe df_bball have not changed.

By using the .copy() when you defined test you actually created what we would understand as being a copy. This you will have to keep in mind when you actually want to create a copy of data. Especially if you think data will be changed you need to carefully think about whether you will want them changed in the original dataframe or not.

Boolean variables and subsetting objects

Boolean variables are variables that take the value TRUE or FALSE. Let's illustrate this.

test = df_bball['salary'] > 1000000
test[0:7]
0     True
1     True
2     True
3     True
4     True
5    False
6    False
Name: salary, dtype: bool

For each of the 353 baseball players we asked the question whether they earned more than USD 1,000,000. We then looked at the first 7 observations in the sample and saw that the first 5 did indeed earn more than USD 1,000,000 (TRUE) but the next two did not (FALSE). In test you will find the answer to that question for all 353 baseball players.

You may be wondering how many in the sample did earn more than that threshold? sum(test) counts the number of TRUE values. np.mean(test) calculates the proportion and test.describe() tells you which answer was the most frequent.

sum(test)
# np.mean(test)
# test.describe()
143

You may wonder what the big deal is. In what comes in this walkthrough you will find some uses of Boolean variables. But here is one. Say you wanted to create a new dataframe with all the players that did earn more than USD 1,000,000, then here are a number of ways to do that (note that here we are creating copies).

millionaires1 = df_bball[test].copy() # where we use the variable defined above
millionaires2 = df_bball.loc[test].copy()
millionaires3 = df_bball.loc[test,:].copy()
millionaires4 = df_bball[df_bball['salary'] > 1000000].copy()
millionaires5 = df_bball.query('salary > 1000000').copy()

There is not a right or wrong way of doing this.

It is important to understand that you can create Boolean variables from a combination of conditions, say you wanted to select all players who earn more than USD 1,000,000 and have not more than 5 years of experience.

test2 = ( df_bball['salary'] > 1000000) & (df_bball['years'] < 6)
sum(test2)
28

There are only 28 players that fit this criterion. Here the two conditions were connected with the AND (&) opeator. You may also connect conditions with an OR (|) operator. There are a few more important operator you may need: IS EQUAL TO is represented by ==, UNEQUAL TO is !=.

if-else

An important tool in the toolbox of any applied economics and econometrician is the if-else condition. This is important as one will usually write code that works irrespective of certain conditions. Here we will demonstrate this with a simple example.

var1 = "salary"
var2 = "years"
# np.corrcoef delivers a 2x2 matrix, the corr is in the off-diagonal , eg [0,1]
c12 = np.corrcoef(df_bball[var1], df_bball[var2])[0,1]  

if c12 > 0:
    print("the correlation is positive")
else:
    print("the correlation is not positive")
the correlation is positive

You can also think of more than 2 cases.

if c12 > 0.1:
    print("the correlation is positive")
elif c12 < -0.1:
    print("the correlation is not positive")
else:
    print("the variables are basically uncorrelated")
the correlation is positive

Here, we only have a single line of code (print ( )), but there could be multiple lines of code after every condition. Also note that the code lines between if, elif and else are indent. This is important, the indent tells Python when the code ends and when the next condition comes.

Loops

Loops are a way to make your code repeat bits of code, potentially many, many times. Above we calculated a correlation between two variables. We have many variables (check df_bball.shape) to confirm that there are 47 variables altogether. We saw previously that they are all numerical variables. Let's say you wanted to figure out whether any of the variables is negatively correlated with salary. We would now have to calculate 46 correlations, the correlations of all variables, other than salary, with the variable salary. As salary is the first variable (at position 0), we need to loop through variables 1 to 46.

range(df_bball.shape[1])
range(0, 47)
words = ["Here", "I","want","to","concatenate","words","using","pipe","delimeter"]
print(" ".join([df_bball.columns[1], "is uncorrelated"]))
teamsal is uncorrelated
df_bball.columns[1]
'teamsal'
var1 = "salary"

for i in range(1,47):
    
    # np.corrcoef delivers a 2x2 matrix, the corr is in the off-diagonal , eg [0,1]
    c12 = np.corrcoef(df_bball[var1], df_bball.iloc[:,i])[0,1]  
    if c12 > 0.1:
        print(" ".join([df_bball.columns[i], "is positively correlated with salary"]))
    elif c12 < -0.1:
        print(" ".join(["!!!", df_bball.columns[i], "is negatively correlated with salary"]))
    else:
        print(" ".join([df_bball.columns[i], "is uncorrelated with salary"]))
teamsal is positively correlated with salary
nl is uncorrelated with salary
years is positively correlated with salary
games is positively correlated with salary
atbats is positively correlated with salary
runs is positively correlated with salary
hits is positively correlated with salary
doubles is positively correlated with salary
triples is positively correlated with salary
hruns is positively correlated with salary
rbis is positively correlated with salary
bavg is positively correlated with salary
bb is positively correlated with salary
so is positively correlated with salary
sbases is positively correlated with salary
fldperc is uncorrelated with salary
frstbase is uncorrelated with salary
scndbase is uncorrelated with salary
shrtstop is uncorrelated with salary
thrdbase is uncorrelated with salary
outfield is positively correlated with salary
!!! catcher is negatively correlated with salary
yrsallst is positively correlated with salary
hispan is uncorrelated with salary
black is positively correlated with salary
whitepop is uncorrelated with salary
blackpop is uncorrelated with salary
hisppop is uncorrelated with salary
pcinc is uncorrelated with salary
gamesyr is positively correlated with salary
hrunsyr is positively correlated with salary
atbatsyr is positively correlated with salary
allstar is positively correlated with salary
slugavg is positively correlated with salary
rbisyr is positively correlated with salary
sbasesyr is positively correlated with salary
runsyr is positively correlated with salary
percwhte is uncorrelated with salary
percblck is uncorrelated with salary
perchisp is uncorrelated with salary
blckpb is uncorrelated with salary
hispph is uncorrelated with salary
whtepw is uncorrelated with salary
blckph is uncorrelated with salary
hisppb is uncorrelated with salary
lsalary is positively correlated with salary

Only one variable is negatively correlated (as defined by a correlation smaller than -0.1). It turns out that very often loops are actually not necessary. The above could have been achieved by using the .corr() method applied to our dataframe df_bball.

df_bball.corr()
salary teamsal nl years games atbats runs hits doubles triples ... runsyr percwhte percblck perchisp blckpb hispph whtepw blckph hisppb lsalary
salary 1.000000 0.224725 -0.049160 0.478152 0.590643 0.618650 0.651922 0.627098 0.628841 0.480468 ... 0.703346 -0.005084 0.026013 -0.029627 0.121162 -0.009062 -0.033995 0.069191 -0.078252 0.902083
teamsal 0.224725 1.000000 -0.218707 0.177434 0.176442 0.172746 0.213004 0.172080 0.160806 0.160525 ... 0.188722 -0.208500 0.224475 0.011364 0.102823 -0.056093 0.040622 -0.005969 -0.016805 0.211406
nl -0.049160 -0.218707 1.000000 -0.119650 -0.109963 -0.113653 -0.118294 -0.112177 -0.122457 -0.025921 ... -0.086451 -0.202080 0.115924 0.158995 0.090623 0.025824 -0.104910 0.123718 0.017757 -0.058049
years 0.478152 0.177434 -0.119650 1.000000 0.941298 0.899876 0.840024 0.882413 0.841498 0.743295 ... 0.501848 0.048514 0.001075 -0.080254 0.047350 -0.081055 0.049047 0.044106 -0.077223 0.580721
games 0.590643 0.176442 -0.109963 0.941298 1.000000 0.988557 0.937558 0.977384 0.938055 0.824304 ... 0.683550 0.070722 -0.038511 -0.058642 0.052607 -0.073656 0.031436 0.065671 -0.103336 0.654003
atbats 0.618650 0.172746 -0.113653 0.899876 0.988557 1.000000 0.956522 0.994506 0.957641 0.844203 ... 0.722409 0.085024 -0.054600 -0.058414 0.055159 -0.069456 0.017362 0.064702 -0.107641 0.655399
runs 0.651922 0.213004 -0.118294 0.840024 0.937558 0.956522 1.000000 0.963780 0.930113 0.856561 ... 0.788967 0.076957 -0.041508 -0.064391 0.090453 -0.071851 -0.016378 0.088177 -0.116891 0.663776
hits 0.627098 0.172080 -0.112177 0.882413 0.977384 0.994506 0.963780 1.000000 0.964148 0.854465 ... 0.732789 0.090167 -0.062515 -0.055232 0.057955 -0.067400 0.005132 0.079937 -0.108191 0.652359
doubles 0.628841 0.160806 -0.122457 0.841498 0.938055 0.957641 0.930113 0.964148 1.000000 0.795545 ... 0.716852 0.107900 -0.069011 -0.074537 0.023351 -0.089703 0.065754 0.048074 -0.130983 0.647727
triples 0.480468 0.160525 -0.025921 0.743295 0.824304 0.844203 0.856561 0.854465 0.795545 1.000000 ... 0.646384 0.082546 -0.054937 -0.053904 0.097226 -0.024679 -0.065440 0.090673 -0.072336 0.504266
hruns 0.613213 0.139215 -0.114294 0.674362 0.771054 0.792835 0.760319 0.782930 0.797947 0.518236 ... 0.598251 0.075205 -0.058548 -0.036741 0.040310 -0.090531 0.030598 0.107513 -0.123739 0.573756
rbis 0.649485 0.160775 -0.128128 0.822251 0.924259 0.943888 0.901255 0.942423 0.935874 0.716881 ... 0.690810 0.092244 -0.060728 -0.061204 0.052452 -0.080551 0.022409 0.085208 -0.120404 0.647311
bavg 0.281900 0.083113 -0.031582 0.197296 0.267397 0.287366 0.310146 0.320975 0.313955 0.271371 ... 0.380220 0.145895 -0.071949 -0.131887 0.099048 -0.030701 -0.095194 0.102388 -0.082557 0.283009
bb 0.586754 0.153339 -0.097623 0.814355 0.906455 0.909815 0.908050 0.910219 0.883706 0.726980 ... 0.703634 0.063377 -0.048783 -0.031773 0.059800 -0.117700 0.047723 0.116229 -0.152144 0.606896
so 0.544744 0.168199 -0.042090 0.765119 0.834218 0.834941 0.784454 0.808133 0.785937 0.654337 ... 0.633199 0.078465 -0.066673 -0.030198 0.049641 -0.069235 0.004627 0.091412 -0.113120 0.593214
sbases 0.376116 0.125609 0.014823 0.560441 0.613132 0.631819 0.702982 0.636881 0.553066 0.751554 ... 0.565905 0.015315 -0.015711 -0.001967 0.184581 -0.040290 -0.153763 0.210751 -0.078756 0.398676
fldperc 0.071402 -0.041351 0.016791 0.112168 0.094774 0.080378 0.072477 0.081311 0.069012 0.060381 ... 0.105207 -0.048408 0.058598 -0.006798 -0.011337 -0.029590 0.018552 -0.003822 -0.017526 0.118883
frstbase 0.065578 0.049948 -0.058107 -0.012408 0.038432 0.055595 0.073159 0.071536 0.099911 -0.032794 ... 0.108391 0.028566 0.006097 -0.055211 -0.057334 0.017638 0.039442 -0.082586 -0.029754 0.045705
scndbase -0.008773 0.009934 -0.048314 0.007032 0.036160 0.027751 0.033201 0.030417 0.008469 0.017570 ... 0.087002 -0.026734 0.003028 0.038955 -0.076319 -0.080344 0.078264 0.011302 -0.010356 0.003029
shrtstop -0.078979 0.006103 -0.005252 0.000078 -0.011573 -0.025996 -0.051919 -0.040814 -0.039417 0.020976 ... -0.094147 0.032263 0.001170 -0.054034 -0.109217 0.151295 -0.021565 -0.121644 0.181039 -0.088137
thrdbase 0.008589 0.011661 0.034967 -0.000190 -0.001043 0.005332 0.005217 0.002289 0.028301 -0.066958 ... 0.044522 -0.056700 0.019792 0.063150 -0.009098 -0.028096 0.096326 -0.022121 -0.042604 0.036343
outfield 0.109087 -0.049246 0.049825 0.043108 0.084796 0.097174 0.128265 0.100470 0.069242 0.206340 ... 0.162219 0.037675 -0.047903 0.008636 0.296717 -0.037551 -0.287567 0.276932 -0.076852 0.098903
catcher -0.134022 -0.003627 0.004035 -0.053513 -0.171696 -0.188811 -0.227358 -0.193654 -0.181522 -0.232384 ... -0.345159 -0.039179 0.039852 0.005527 -0.176030 -0.018857 0.231697 -0.177190 0.003305 -0.125705
yrsallst 0.585597 0.108233 -0.052015 0.575058 0.694271 0.740150 0.758991 0.767766 0.752787 0.620296 ... 0.567721 0.066294 -0.069767 -0.005951 0.047902 -0.046810 -0.063231 0.122544 -0.072812 0.486355
hispan -0.098765 -0.076300 -0.006757 -0.113578 -0.109627 -0.102324 -0.118684 -0.102932 -0.128557 -0.054698 ... -0.115614 -0.012705 0.007143 0.010209 -0.218643 0.726811 -0.459960 -0.207599 0.769132 -0.104890
black 0.145830 0.003009 0.056631 0.082212 0.101883 0.111042 0.153198 0.123018 0.083906 0.155459 ... 0.186330 0.054468 0.004719 -0.095218 0.693441 -0.229165 -0.652618 0.658415 -0.242509 0.120053
whitepop 0.066950 0.361005 -0.032267 0.058571 0.071079 0.076078 0.090724 0.081674 0.066012 0.054388 ... 0.076966 -0.101745 -0.228453 0.497645 -0.081125 0.043796 0.056013 0.211849 -0.147453 0.037944
blackpop 0.092536 0.437356 -0.010157 0.089686 0.086237 0.087322 0.105629 0.090023 0.072147 0.060924 ... 0.095545 -0.277057 0.107569 0.292771 0.063291 -0.011444 0.009045 0.143331 -0.062272 0.086067
hisppop 0.013839 0.202042 0.032004 -0.014732 0.003918 0.005798 0.008567 0.009761 -0.002821 -0.000564 ... -0.004348 -0.301056 -0.204505 0.786060 -0.095480 0.173869 0.012882 0.285679 -0.100998 -0.017919
pcinc 0.087388 0.189144 -0.065420 0.084329 0.094369 0.102198 0.110055 0.104433 0.104527 0.104063 ... 0.097856 0.419942 -0.326618 -0.205609 -0.108292 -0.167394 0.107294 -0.011007 -0.129724 0.062538
gamesyr 0.631446 0.151354 -0.067027 0.562394 0.727361 0.732002 0.702771 0.722837 0.703369 0.593436 ... 0.904229 0.038884 -0.008199 -0.051131 0.049235 -0.044713 0.045313 0.051931 -0.102418 0.748082
hrunsyr 0.625160 0.150169 -0.080508 0.380162 0.511883 0.545031 0.542792 0.533216 0.568394 0.287547 ... 0.653594 0.014940 0.007304 -0.034866 0.090495 -0.080439 0.004783 0.105047 -0.108702 0.595469
atbatsyr 0.671876 0.145633 -0.079880 0.542086 0.728203 0.763757 0.740678 0.759899 0.741534 0.629762 ... 0.940864 0.065100 -0.037271 -0.051327 0.068186 -0.037086 0.016774 0.052075 -0.104621 0.746726
allstar 0.663772 0.118567 -0.039446 0.427987 0.559417 0.607785 0.636431 0.635221 0.624506 0.488006 ... 0.615430 0.039720 -0.053905 0.014057 0.058046 -0.029291 -0.066734 0.137831 -0.060998 0.554186
slugavg 0.143585 0.088965 0.024717 0.076681 0.098745 0.075407 0.110289 0.107653 0.136653 0.056517 ... 0.138068 -0.010690 -0.029322 0.060031 0.053253 -0.051838 -0.061845 0.248921 -0.054986 0.160448
rbisyr 0.707499 0.154711 -0.102559 0.487107 0.662632 0.701526 0.684117 0.701694 0.719388 0.479284 ... 0.851111 0.057420 -0.021187 -0.062288 0.085368 -0.064587 0.011242 0.080958 -0.118167 0.727766
sbasesyr 0.327704 0.040428 0.051674 0.279965 0.373591 0.401892 0.478339 0.408158 0.336432 0.554895 ... 0.623379 0.014109 -0.021120 0.007865 0.228588 -0.021651 -0.177808 0.231123 -0.088694 0.363151
runsyr 0.703346 0.188722 -0.086451 0.501848 0.683550 0.722409 0.788967 0.732789 0.716852 0.646384 ... 1.000000 0.058002 -0.022174 -0.061795 0.136203 -0.051977 -0.035570 0.102455 -0.123019 0.742489
percwhte -0.005084 -0.208500 -0.202080 0.048514 0.070722 0.085024 0.076957 0.090167 0.107900 0.082546 ... 0.058002 1.000000 -0.794112 -0.465818 -0.302471 -0.155558 0.163746 -0.148026 -0.211004 -0.036455
percblck 0.026013 0.224475 0.115924 0.001075 -0.038511 -0.054600 -0.041508 -0.062515 -0.069011 -0.054937 ... -0.022174 -0.794112 1.000000 -0.167894 0.425222 -0.033992 -0.168855 -0.068376 0.256956 0.085693
perchisp -0.029627 0.011364 0.158995 -0.080254 -0.058642 -0.058414 -0.064391 -0.055232 -0.074537 -0.053904 ... -0.061795 -0.465818 -0.167894 1.000000 -0.128492 0.301806 -0.019753 0.339650 -0.031864 -0.065634
blckpb 0.121162 0.102823 0.090623 0.047350 0.052607 0.055159 0.090453 0.057955 0.023351 0.097226 ... 0.136203 -0.302471 0.425222 -0.128492 1.000000 -0.158912 -0.452552 0.368624 -0.168166 0.130458
hispph -0.009062 -0.056093 0.025824 -0.081055 -0.073656 -0.069456 -0.071851 -0.067400 -0.089703 -0.024679 ... -0.051977 -0.155558 -0.033992 0.301806 -0.158912 1.000000 -0.334304 -0.150886 0.495370 -0.000700
whtepw -0.033995 0.040622 -0.104910 0.049047 0.031436 0.017362 -0.016378 0.005132 0.065754 -0.065440 ... -0.035570 0.163746 -0.168855 -0.019753 -0.452552 -0.334304 1.000000 -0.429694 -0.353770 -0.007709
blckph 0.069191 -0.005969 0.123718 0.044106 0.065671 0.064702 0.088177 0.079937 0.048074 0.090673 ... 0.102455 -0.148026 -0.068376 0.339650 0.368624 -0.150886 -0.429694 1.000000 -0.159671 0.047705
hisppb -0.078252 -0.016805 0.017757 -0.077223 -0.103336 -0.107641 -0.116891 -0.108191 -0.130983 -0.072336 ... -0.123019 -0.211004 0.256956 -0.031864 -0.168166 0.495370 -0.353770 -0.159671 1.000000 -0.075591
lsalary 0.902083 0.211406 -0.058049 0.580721 0.654003 0.655399 0.663776 0.652359 0.647727 0.504266 ... 0.742489 -0.036455 0.085693 -0.065634 0.130458 -0.000700 -0.007709 0.047705 -0.075591 1.000000

47 rows × 47 columns

Looking down the first column (correlations with salary) you will find that only the correlation with the catcher variable is smaller than -0.1. Note that the catcher variable is a binary/dummy variable which takes the value 1 only for players which are catchers and 0 otherwise.

You can see that the .apply method can be used to apply functions to rows or columns.


Practice the loop

The task

Calculate the sum of the numbers from 1 to 1000 using a loop.

The solution
k = 1000
sol = 0

for i in range(1,k+1):
    sol = sol + i
    
print(sol)
500500


The .apply method

With the apply method you can apply a function easily to each row or column of a DataFrame. It allows you to quickly perform operations on your data without needing to write complicated loops or manually iterate over each row or column. Let's illustrate this with an example. We will look at our dataframe df_bball. We wish to find the maximum across the ['salary', 'teamsalary', 'hits'] columns.

df_bball[['salary', 'teamsal', 'hits']].apply(max,axis=0) # applies the max functions to all columns (axis = 0)
salary      6329213
teamsal    42866000
hits           3025
dtype: int64

You could also find the max value across all variables for players 0 to 4.

df_bball.iloc[0:4,].apply(max,axis=1) # applies the max functions to all rows (axis = 1)
0    38407380.0
1    38407380.0
2    38407380.0
3    38407380.0
dtype: float64

For each player we actually pick out his team's salary and clearly they all play for the same team (with a salary of 38407380).

You can see that the .apply method can be used to apply functions to rows or columns.


Practice the .apply method

The task

Calculate the mean of every variable in df_bball using the apply method. The mean can be calculated using the np.mean function.

The solution
df_bball.apply(np.mean,axis=0) # applies the max functions to all columns (axis = 0)
salary      1.345672e+06
teamsal     3.079483e+07
nl          4.759207e-01
years       6.325779e+00
games       6.484249e+02
atbats      2.168592e+03
runs        2.904023e+02
hits        5.847365e+02
doubles     1.038810e+02
triples     1.673371e+01
hruns       5.503116e+01
rbis        2.647450e+02
bavg        2.589858e+02
bb          2.120822e+02
so          3.404079e+02
sbases      6.580170e+01
fldperc     9.765099e+02
frstbase    1.274788e-01
scndbase    1.048159e-01
shrtstop    1.388102e-01
thrdbase    9.631728e-02
outfield    3.852691e-01
catcher     1.473088e-01
yrsallst    8.923513e-01
hispan      1.813031e-01
black       3.059490e-01
whitepop    3.984508e+06
blackpop    7.887030e+05
hisppop     9.006723e+05
pcinc       1.883314e+04
gamesyr     9.007604e+01
hrunsyr     7.119053e+00
atbatsyr    2.930280e+02
allstar     9.242893e+00
slugavg     3.953016e+01
rbisyr      3.505021e+01
sbasesyr    8.501081e+00
runsyr      3.858299e+01
percwhte    7.263109e+01
percblck    1.654893e+01
perchisp    1.081999e+01
blckpb      5.144886e+00
hispph      2.004179e+00
whtepw      3.666764e+01
blckph      2.931898e+00
hisppb      3.046493e+00
lsalary     1.349218e+01
dtype: float64


functions

Functions are at the core of all programming languages. In fact, even in this walk-through we have already used functions. Earlier we calculated the mean value of some variables. When you called the np.mean for instance, Python called up a function that calculated a mean of a variable. You could do that without you seeing the details of the mean calculation. That function was written by someone else and you could use that function as it was imported with the pandas package.

Here we will learn how to write a function ourselves. You would want to do that in case there is a combination of operations you may wish to repeatedly apply.

# here we define a function, sum_integers_to_k is the name of the function 
# it requires one input
def sum_integers_to_k(k):
  sol = 0

  for i in range(1,k+1):
    sol = sol + i
    
  return sol

Once you run this code, noting obvious will happen. However, Python will have stored this function in its memory and when you run the next line (sum_integers_to_k(40)), it will go to that code use 40 as the value of k, calculate the sum of all numbers to 40 and then return that value.

ans1 = sum_integers_to_k(40)
print(ans1)
820

Practice functions

The task

Write a function that takes two numbers as inputs, k1 and k2, and sums up all the integers between these two numbers (including k1 and k2). The function should return that sum. In order to understand how to use two parameters you may have to find on the internet examples of functions that accept multiple inputs. You could use search terms like "Python functions with two parameters".

The solution

Adjust the code of the simple example to our problem at hand. See how this is done in the next step.

# here we define a function, sum_integers_to_k is the name of the function 
# it requires one input
def sum_integers_between(k1,k2):
    sol = 0

    for i in range(k1,k2+1):
        sol = sol + i
    
    return sol

And now you can call that function.

ans2 = sum_integers_between(10,40)
print(ans2)
775


lambda functions

Lambda functions are funny creatures and it may not be immediately obvious what their values are. Above we introduced functions as something you write when you want to repeatedly call the same set of code lines (with potentially different parameters). lambda functions are not that. lambda functions are used to do little things which you will only do once, i.e. only apply to one dataframe but not repeatedly to several dataframes. For instance we could consider writing a short function which checks whether a particular baseball player can be considered a veteran (very experienced player, years > 10). We use that to create a new variable Exper_cat.

temp = df_bball.copy()
temp['Exper_cat'] = temp.apply(lambda row: 'Veteran' if (row['years'] > 10)  else 'Not Veteran', axis = 1)

# Convert the 'Exper_cat' column to a categorical type
temp['Exper_cat'] = pd.Categorical(temp['Exper_cat'])

Here is what happened in the above line in which we define the new variable (temp['Exper_cat'] =). We use the .apply( .... , axis = 1) method meaning that we do something for every row of dataframe temp. Recall that in place of the four dots .... we would name the function we want to apply. However, Python does not have a is_the_player_a_veteran_function ready for use. We therefore need to write that function. And we write that function on teh spot (as we do not think that we will need it again later). So what did lambda row: 'Veteran' if (row['years'] > 10) else 'Not Veteran' do?

This started with lambda. This signals to Python that a special type of function is coming. This function has one input which here is labeled row but it could be labeled however you want. Here we call it row, as the apply method is applied by rows, so the apply method will "hand-over2 one row at a time to the lambda function. The rest of the line now determines what should happen. Here we are asking the function to return Veteran if the years variable is larger 10 and Not Veteran in case that is not correct.

This looks of course very much like an if else function, although its architecture is somewhat different (compare to the earlier section on if else functions).

The reason why this is incredibly useful is that the function can become a little more complicated allowing a new variable definition to depend on multiple other variables. Here we illustrate this by defining a new variable in the df_bball dataframe called position which indicates which position a player plays in. The definition depends on which of the positional dummy variables (frstbase to cathcher) is equal to 1.

df_bball['position'] = df_bball.apply(
    lambda row: 'First Base' if row['frstbase'] == 1 else 
                'Second Base' if row['scndbase'] == 1 else 
                'Shortstop' if row['shrtstop'] == 1 else 
                'Third Base' if row['thrdbase'] == 1 else 
                'Outfield' if row['outfield'] == 1 else 
                'Catcher' if row['catcher'] == 1 else None, axis=1
)

# Convert the 'position' column to a categorical type
df_bball['position'] = pd.Categorical(df_bball['position'])

Summary

In this walkthrough you have learned some basic programming techniques. You will encounter these whenever you try to understanding someone else's code or you have to develop code yourself that does more than just simple work.

This walk-through is part of the ECLR page.