%load_ext pretty_jupyter

Introduction - Loading Data

This walkthrough is part of the ECLR page.

In this walkthrough you will explore methods that are useful to support your variable selection.

The dataset we will use is the credit dataset which comes with the resources provided by James et al. in their textbook An Introduction to Statistical Learning (ISL).

Preparing your workspace

As usual we begin a notebook or script file by loading the libraries we anticipate using.

# Standard packages to deal with data
import pandas as pd
import numpy as np

# Plotting
from lets_plot import *
# Lets_plot requires a setup
LetsPlot.setup_html()

# Statistics
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import RidgeCV, Ridge
from sklearn.model_selection import KFold

Later in this walkthrough we will use a few more libraries/packages and we will import them when we introduce them. If you already know which packages you will use, it is easiest to add them to your list here.

Data upload

The data file we practice with here is a Comma-Separated Values (CSV) file. To load such a file in Python, we use the read_csv function from the pandas library. The file is available directly from this url: https://www.statlearning.com/s/Credit.csv.

# Load the CSV file
data_credit = pd.read_csv("https://www.statlearning.com/s/Credit.csv")

If this worked, you should now be able to see an object called data_credit in your list of variables.

A typical error message you could receive when loading a local file is "FileNotFoundError: [Errno 2] No such file or directory: 'Credit.csv'". If you get this message the most likely reason is that the file is not in the place where Python is looking: either it is not in the working directory, or you did not set the working directory correctly.
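
If you load the data from a file on your own computer rather than from the URL, you can check where Python is currently looking for files. The following is only an illustrative sketch using the standard os library; the path in the commented-out line is a placeholder you would have to adjust.

# Sketch: check (and, if needed, change) the working directory
import os
print(os.getcwd())  # the folder in which Python currently looks for files
# os.chdir("path/to/your/data/folder")  # uncomment and adjust to change it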

These are simulated data for 400 customers of a financial institution. Let's look at the variables we can find in this dataset.


Explore the data

The task

Whenever you work with data you should explore a few things about the data.

  1. What variable types are the variables? Ensure they have the expected type.
  2. Are there missing data?

Variable Types

To find the variable data types you can use the .info() method.

data_credit.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Income     400 non-null    float64
 1   Limit      400 non-null    int64  
 2   Rating     400 non-null    int64  
 3   Cards      400 non-null    int64  
 4   Age        400 non-null    int64  
 5   Education  400 non-null    int64  
 6   Own        400 non-null    object 
 7   Student    400 non-null    object 
 8   Married    400 non-null    object 
 9   Region     400 non-null    object 
 10  Balance    400 non-null    int64  
dtypes: float64(1), int64(6), object(4)
memory usage: 34.5+ KB

You can see that the variables Own, Student, Married and Region are object variables. They are basically text variables. We will convert these to categorical variables.

data_credit['Own'] = data_credit['Own'].astype('category')
data_credit['Student'] = data_credit['Student'].astype('category')
data_credit['Married'] = data_credit['Married'].astype('category')
data_credit['Region'] = data_credit['Region'].astype('category')
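
As an aside, if there are many such variables you can convert them in one go. The following is only an alternative sketch which achieves the same as the four lines above.

# Sketch: convert all remaining object (text) columns to categorical in one loop
for col in data_credit.select_dtypes('object').columns:
    data_credit[col] = data_credit[col].astype('category')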

Missing data

It is important to understand whether our dataset has many observations with missing values, and for which variables we have missing information.

credit_na = (data_credit.isnull().sum() / len(data_credit)) * 100
missing_data = pd.DataFrame({'Missing Ratio' :credit_na})
missing_data.style.bar()
           Missing Ratio
Income          0.000000
Limit           0.000000
Rating          0.000000
Cards           0.000000
Age             0.000000
Education       0.000000
Own             0.000000
Student         0.000000
Married         0.000000
Region          0.000000
Balance         0.000000

As it turns out there are no missing observations in this dataframe.


Remember that the data are simulated, which is one reason why there are no missing data. The variable Balance is the amount of credit card debt (a positive number being a debt). Education is measured in years of completed education, Cards is the number of credit cards owned by an individual, Limit is the credit card limit for the individual and Income is the individual's annual income in thousands of dollars.

Let's produce a scatter plot for the variables Income and Age.

(
    ggplot(data_credit, aes(x='Age', y = 'Income')) + 
        geom_point()
)

The plot reveals that there is a weak positive correlation between these variables.

The variable Rating is a credit rating for an individual. One could think of many different ways a credit rating variable could be defined. Without any further definitions available you will need to use the data to establish whether a higher number in the Rating variable means a better or a worse credit rating.


Explore the data

The question

Are higher numbers of the Rating variable indicative of a better or a worse credit rating?

The answer

You could, for instance, create a scatter plot between Rating and Income. There you will find a clearly positive correlation, indicating that higher ratings relate to better creditworthiness.
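
A minimal sketch of such a plot, using the same lets_plot syntax as before:

(
    ggplot(data_credit, aes(x='Rating', y='Income')) + 
        geom_point()
)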


Let's calculate some summary statistics for Balance, Rating and Region:

# for numeric variables
data_credit[['Balance','Rating','Region']].describe()
           Balance      Rating
count   400.000000  400.000000
mean    520.015000  354.940000
std     459.758877  154.724143
min       0.000000   93.000000
25%      68.750000  247.250000
50%     459.500000  344.000000
75%     863.000000  437.250000
max    1999.000000  982.000000
# for categorical variables
data_credit[['Region']].describe()
       Region
count     400
unique      3
top     South
freq      199

You can see that there are three regions.
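
If you want to see how many customers fall into each of the three regions, a quick sketch is to tabulate the Region variable:

# Sketch: frequency of each region
print(data_credit['Region'].value_counts())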

Modelling credit balances

Let's consider whether variation in credit balances can be modeled using a linear regression model. For instance we can estimate the following model:

$$ balances_i = \beta_0 + \beta_1~rating_i + \beta_2~cards_i + \epsilon_i $$

data_credit.columns
Index(['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education', 'Own',
       'Student', 'Married', 'Region', 'Balance'],
      dtype='object')
y = data_credit['Balance']
x = data_credit[['Rating', 'Cards']]
mod = sm.OLS(y,sm.add_constant(x))
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Balance   R-squared:                       0.747
Model:                            OLS   Adj. R-squared:                  0.746
Method:                 Least Squares   F-statistic:                     587.6
Date:                Sun, 22 Dec 2024   Prob (F-statistic):          2.25e-119
Time:                        21:39:20   Log-Likelihood:                -2744.1
No. Observations:                 400   AIC:                             5494.
Df Residuals:                     397   BIC:                             5506.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -428.8183     37.414    -11.462      0.000    -502.372    -355.265
Rating         2.5598      0.075     34.110      0.000       2.412       2.707
Cards         13.6099      8.468      1.607      0.109      -3.037      30.257
==============================================================================
Omnibus:                       23.329   Durbin-Watson:                   1.993
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               39.146
Skew:                           0.390   Prob(JB):                     3.16e-09
Kurtosis:                       4.319   Cond. No.                     1.26e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.26e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Here we chose two particular explanatory variables to explain the credit card debt (Balance). But could it be that choosing other and/or more variables would explain more of the variation in the credit card debt?

In fact, how would you be able to maximise the $R^2$? It is well known that the $R^2$ is maximised by including as many variables as possible (as long as they do not create perfect collinearity).

Maximising the $R^2$, however, is not the best strategy as it may produce a model that does not have the optimal forecast performance. Information criteria are used to find the best trade-off between fit and number of variables.
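
For the two-variable model estimated above, statsmodels reports both criteria directly on the results object. This is a minimal sketch using the res object from above; the AIC also appears in the summary output.

# Information criteria for the fitted model stored in res
print("AIC:", res.aic)
print("BIC:", res.bic)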

Optimal variable choice

Here you will explore three possible ways to choose the optimal set of explanatory variables.

  • Best Subset Selection
  • Forward Selection
  • Backward Selection

The Python implementation for these techniques follows the solutions given in this workbook by Jordan Crouser.

Best Subset Selection

For this we want to consider the inclusion of all variables into the model, but equally we also wish to consider that any variable could be excluded. The data_credit dataset contains 11 variables, one of which is the dependent variable Balance. So at first sight it may look as if there are 10 potential variables to be included as explanatory variables. However, we also know that we have four categorical variables.

data_credit[['Own','Student','Married','Region']].describe()
        Own Student Married Region
count   400     400     400    400
unique    2       2       2      3
top     Yes      No     Yes  South
freq    207     360     245    199

All but one have two categories, such that adding these to a regression model will result in the inclusion of one dummy variable each. The Region variable has three categories, so including this in a regression model will result in the inclusion of two dummy variables.

The sm.OLS function only accepts numerical data and hence we will have to convert the categorical variables to dummy variables. You may have used other functions or languages where you can include categorical variables in a regression specification, but there the conversion to dummy variables happened inside the function. Here, in Python, you need to do that explicitly, but it is quite straightforward.

# create the dummies for all variables of category type
# but do not create a dummy for the first category
data_credit_dum = pd.get_dummies(data_credit.select_dtypes('category'), drop_first=True)
# Drop the column with the dependent variable (Balance) and the columns for which we created dummy variables
x_del_list = ['Balance','Own','Student','Married','Region']
x = data_credit.drop(columns = x_del_list)

# add the dummy variables
x = pd.concat([x, data_credit_dum], axis=1)
x.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Income        400 non-null    float64
 1   Limit         400 non-null    int64  
 2   Rating        400 non-null    int64  
 3   Cards         400 non-null    int64  
 4   Age           400 non-null    int64  
 5   Education     400 non-null    int64  
 6   Own_Yes       400 non-null    uint8  
 7   Student_Yes   400 non-null    uint8  
 8   Married_Yes   400 non-null    uint8  
 9   Region_South  400 non-null    uint8  
 10  Region_West   400 non-null    uint8  
dtypes: float64(1), int64(5), uint8(5)
memory usage: 20.8 KB

The dataframe x now contains only numeric variables and we can run the model with all possible variables.

y = data_credit['Balance']
mod = sm.OLS(y,sm.add_constant(x))
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Balance   R-squared:                       0.955
Model:                            OLS   Adj. R-squared:                  0.954
Method:                 Least Squares   F-statistic:                     750.3
Date:                Sun, 22 Dec 2024   Prob (F-statistic):          1.11e-253
Time:                        21:57:44   Log-Likelihood:                -2398.7
No. Observations:                 400   AIC:                             4821.
Df Residuals:                     388   BIC:                             4869.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const         -479.2079     35.774    -13.395      0.000    -549.543    -408.873
Income          -7.8031      0.234    -33.314      0.000      -8.264      -7.343
Limit            0.1909      0.033      5.824      0.000       0.126       0.255
Rating           1.1365      0.491      2.315      0.021       0.171       2.102
Cards           17.7245      4.341      4.083      0.000       9.190      26.259
Age             -0.6139      0.294     -2.088      0.037      -1.192      -0.036
Education       -1.0989      1.598     -0.688      0.492      -4.241       2.043
Own_Yes        -10.6532      9.914     -1.075      0.283     -30.145       8.839
Student_Yes    425.7474     16.723     25.459      0.000     392.869     458.626
Married_Yes     -8.5339     10.363     -0.824      0.411     -28.908      11.841
Region_South    10.1070     12.210      0.828      0.408     -13.899      34.113
Region_West     16.8042     14.119      1.190      0.235     -10.955      44.564
==============================================================================
Omnibus:                       34.899   Durbin-Watson:                   1.968
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               41.766
Skew:                           0.782   Prob(JB):                     8.52e-10
Kurtosis:                       3.241   Cond. No.                     3.87e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.87e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

What we want to do, however, is not to fit the largest possible model but to find the model which fits the data best, meaning it is optimal by some sort of information criterion, say AIC or BIC.

How many different models are there? There are $2^{11} = 2048$ possible models, as each of the 11 variables can either be included or excluded ($2^{11} - 1 = 2047$ if we leave out the model with no regressors at all). With 11 possible regressors this is a feasible approach, as it delivers roughly 2,000 different regressions, but just imagine you have 1,000 possible regressors.
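
A quick check of this count, as a small sketch using the standard math library:

import math
# number of ways to choose k regressors out of 11, summed over k = 1,...,11
n_models = sum(math.comb(11, k) for k in range(1, 12))
print(n_models)  # 2047 = 2**11 - 1 (excluding the empty model)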

So, while we go through the steps below, this is not a process recommended when you have a very large set of regressors.

The way to organise this is as follows.

  1. Set a variable $k$ that determines how many regressors should be included.
  2. For every $k$ from 1 to 11 estimate a model for all possible combinations of $k$ regressors.
  3. Choose the best model from these (minimum RSS).
  4. Present the best models for all values of $k$.

To do this effectively we require a function that operationalises step 2 as efficiently as possible. Here we present functions essentially as they were written by Jordan Crouser.

The function processSubset assumes that your dependent variable is saved in y and your matrix of explanatory variables is saved in x and then uses a list of variable names (feature_set) to run a regression which only includes the variables included in feature_set. It returns the estimated model and the residual sum of squares.

The function getBest takes as input the value for $k$. It also assumes that your dependent variable is saved in y and your matrix of explanatory variables is saved in x. It then creates all possible $k$ variable combinations, sends them to processSubset, collects the results and identifies which of these models has the smallest RSS. The function returns the output for that chosen model.

def processSubset(feature_set):
    # Fit model on feature_set and calculate RSS
    model = sm.OLS(y,sm.add_constant(x[list(feature_set)]))
    regr = model.fit()
    RSS = ((regr.fittedvalues - y) ** 2).sum()
    return {"model":regr, "RSS":RSS}

def getBest(k):
    
    tic = time.time()
    
    results = []
    
    for combo in itertools.combinations(x.columns, k):
        results.append(processSubset(combo))
    
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    
    # Choose the model with the lowest RSS
    best_model = models.loc[models['RSS'].argmin()]
    
    toc = time.time()
    print("Processed", models.shape[0], "models on", k, "predictors in", (toc-tic), "seconds.")
    
    # Return the best model, along with some other useful information about the model
    return best_model

Explaining the Code

Defining functions

When you run the above lines, nothing actually happens yet. All you have done is give these collections of lines of code new names (here processSubset and getBest). This means that in the following, when you call either of these, Python knows what code to run.

Selecting the right variables

In getBest the right variables are being selected. The command itertools.combinations(x.columns, k) creates all possible combinations of $k$ of the variable names (in x.columns). The loop then goes through all these combinations. The currently selected combination (combo) is then sent to the processSubset function in which the actual model is estimated.
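
As a small illustration of what itertools.combinations produces, here is a sketch with only three variable names:

import itertools
print(list(itertools.combinations(['Income', 'Limit', 'Rating'], 2)))
# [('Income', 'Limit'), ('Income', 'Rating'), ('Limit', 'Rating')]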

Adding a regression constant

The list of variables sent to the processSubset function (feature_set) is merely a list of variable names. It does not contain a constant. However, we always wish to include a constant, and sending the selected columns of x through the sm.add_constant function adds one.
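
To see what sm.add_constant does, here is a small self-contained sketch with a made-up dataframe: it adds a column of ones, named const, as the first column.

import pandas as pd
import statsmodels.api as sm
demo = pd.DataFrame({'Rating': [100, 200, 300], 'Cards': [2, 3, 1]})
print(sm.add_constant(demo))  # now has a 'const' column of ones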

Use of .argmin

Inside the getBest function we collected all possible models with combinations of $k$ variables in the object models. But we only want to keep information on the model that fits the data best (has the minimum RSS). The way to select that is as follows: we only want to use a particular instance of the models we saved and we will access it using models.loc[]. But how do we find the location of the model we want? We use the .argmin() method applied to models['RSS'], which identifies the position of the model with the smallest RSS. This leads to the line models.loc[models['RSS'].argmin()].
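
A small sketch of how .argmin() works: it returns the position of the smallest value. Because models was created with a default index starting at 0, this position can be used directly in .loc[].

import pandas as pd
rss_demo = pd.Series([150.0, 90.0, 120.0])
print(rss_demo.argmin())  # 1, the position of the smallest value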

Adding Stopwatches

The combination of tic = time.time(), toc = time.time() and print("Elapsed time: ", (toc-tic), "seconds.") allows you to check how much time it took to get from tic to toc. In that way you can test how long it takes to process the code which is between tic and toc. This can be very useful when you develop code to understand how long it will take to run your code. Before actually running the code you will have to import the time library. We do this below before we actually call on any of that code.
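
A minimal sketch of this stopwatch pattern (the summation is just some arbitrary work to time):

import time
tic = time.time()
_ = sum(range(1_000_000))  # some work to time
toc = time.time()
print("Elapsed time: ", (toc-tic), "seconds.")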


Given these functions are now available to us, we can write a loop that delivers the best model for any given value of $k$. We then save the respective model summary in the models_best object. To do that we require two more libraries: the itertools library, as it allows us to use itertools.combinations in the getBest function, and the time library, as it allows us to basically run a few stopwatches to see how long the code takes.

import itertools
import time

models_best = pd.DataFrame(columns=["RSS", "model"])

tic = time.time()
for i in range(1,12):
    models_best.loc[i] = getBest(i)

toc = time.time()
print("Total elapsed time:", (toc-tic), "seconds.")
Processed 11 models on 1 predictors in 0.06599211692810059 seconds.
Processed 55 models on 2 predictors in 0.21643424034118652 seconds.
Processed 165 models on 3 predictors in 0.860020637512207 seconds.
Processed 330 models on 4 predictors in 1.2015345096588135 seconds.
Processed 462 models on 5 predictors in 1.734302043914795 seconds.
Processed 462 models on 6 predictors in 1.9868927001953125 seconds.
Processed 330 models on 7 predictors in 1.425506830215454 seconds.
Processed 165 models on 8 predictors in 1.1635918617248535 seconds.
Processed 55 models on 9 predictors in 0.5449655055999756 seconds.
Processed 11 models on 10 predictors in 0.1634523868560791 seconds.
Processed 1 models on 11 predictors in 0.014091253280639648 seconds.
Total elapsed time: 9.450889825820923 seconds.

Now we have 11 best models saved in models_best, one for each number of explanatory variables $k=1, ...,11$. You can look at the details for any given $k$, e.g. $k=3$.

print(models_best.loc[3, "model"].summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Balance   R-squared:                       0.950
Model:                            OLS   Adj. R-squared:                  0.949
Method:                 Least Squares   F-statistic:                     2502.
Date:                Tue, 24 Dec 2024   Prob (F-statistic):          6.23e-257
Time:                        09:40:15   Log-Likelihood:                -2420.7
No. Observations:                 400   AIC:                             4849.
Df Residuals:                     396   BIC:                             4865.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const        -581.0789     13.835    -42.002      0.000    -608.277    -553.880
Income         -7.8749      0.240    -32.783      0.000      -8.347      -7.403
Rating          3.9875      0.055     72.888      0.000       3.880       4.095
Student_Yes   418.7603     17.230     24.304      0.000     384.886     452.634
==============================================================================
Omnibus:                       13.580   Durbin-Watson:                   1.946
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               14.236
Skew:                           0.445   Prob(JB):                     0.000810
Kurtosis:                       2.750   Cond. No.                     1.32e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.32e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The best model with 3 explanatory variables includes Income, Rating and the student dummy (Student_Yes).

You will now have to decide which of these models you would prefer, in other words you will have to decide on the trade-off between extra explanatory variables and the risk of over-fitting. This is often done by looking at an information criterion such as AIC or BIC.

# Gets the second element from each row ('model') and 
# pulls out its rsquared or bic attribute
rsq = models_best.apply(lambda row: row[1].rsquared, axis=1)
bic = models_best.apply(lambda row: row[1].bic, axis=1)

data = {
  "k": range(1,12),
  "bic": bic,
  "rsq": rsq
}

#load data into a DataFrame object:
model_comp = pd.DataFrame(data)

Now we can plot these criteria against the number of included explanatory variables.

(
    ggplot(model_comp, aes(x = 'k', y = 'rsq')) + 
        geom_line() +
        labs(title="R squared v number of explanatory variables",
             x = "Number of explanatory variables")
)

As expected, the $R^2$ increases with extra explanatory variables. What about the BIC?

(
    ggplot(model_comp, aes(x = 'k', y = 'bic')) + 
        geom_line() +
        labs(title="BIC v number of explanatory variables",
             x = "Number of explanatory variables")
)

Forward Stepwise Selection

This method has a smaller search space: after selecting the best model with one variable, it only needs to decide, at each subsequent step, which single variable to add to that model.

In order to achieve this we need to adjust the getBest function, or rather write a new version, ForwardSelect. What we do in this procedure is to first select the best model with $k=1$ variables. Whatever variable is selected here will stay in the model, even for $k>1$. That means that for $k=2$ we fix that variable and only search over combinations of two variables that contain that first variable.

To achieve this, we define the new function ForwardSelect. This not only needs to use the objects y and x as before, but we also need to feed in the information on which variables have already been preselected (predictors).


Explaining the Code

The task

Create a new list of variable names from x.columns that excludes variables previously selected, say Cards and Age.

The solution

This will be done in the following way. Consider that we have already selected Cards and Age as variables. We now need to create a new list which consists of the names in x.columns minus those two names. As usual there are several ways to achieve this. Here are two such ways:

# This is only illustrative code
prev_pred = ["Cards", "Age"]
rem_pred  = [p for p in x.columns if p not in prev_pred]
rem_pred2 = list(set(x.columns) - set(prev_pred))
print(x.columns)
print(rem_pred2)
print(rem_pred)
Index(['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education', 'Own_Yes',
       'Student_Yes', 'Married_Yes', 'Region_South', 'Region_West'],
      dtype='object')
['Rating', 'Limit', 'Education', 'Student_Yes', 'Region_West', 'Income', 'Region_South', 'Married_Yes', 'Own_Yes']
['Income', 'Limit', 'Rating', 'Education', 'Own_Yes', 'Student_Yes', 'Married_Yes', 'Region_South', 'Region_West']

You can see that the two lists (rem_pred and rem_pred2) contain the same variables. They only differ in the order of variables. The first method ([p for p in x.columns if p not in prev_pred]) maintains the same order of variables as in x.columns and hence we use that. However, the remainder of the code would also work with the other method.


def ForwardSelect(predictors):

    # Pull out predictors we still need to process
    remaining_predictors = [p for p in x.columns if p not in predictors]
    
    tic = time.time()
    
    results = []
    
    for p in remaining_predictors:
        # we use the variables in predictors plus one from remaining_predictors
        results.append(processSubset(predictors+[p]))
    
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    
    # Choose the model with the lowest RSS
    best_model = models.loc[models['RSS'].argmin()]
    
    toc = time.time()
    print("Processed ", models.shape[0], "models on", len(predictors)+1, "predictors in", (toc-tic), "seconds.")
    
    # Return the best model, along with some other useful information about the model
    return best_model

Now that we have this new function we can again run through a loop, which now adds one variable at a time to the variables selected in the previous round (and saved in predictors).

models_fwd = pd.DataFrame(columns=["RSS", "model"])

tic = time.time()
predictors = [] # start with no preselected predictors

for i in range(1,len(x.columns)+1):    
    models_fwd.loc[i] = ForwardSelect(predictors)
    predictors = models_fwd.loc[i]["model"].model.exog_names
    # remove the first element [0], which is the constant
    predictors = predictors[1:]  

toc = time.time()
print("Total elapsed time:", (toc-tic), "seconds.")
Processed  11 models on 1 predictors in 0.058068037033081055 seconds.
Processed  10 models on 2 predictors in 0.06487774848937988 seconds.
Processed  9 models on 3 predictors in 0.030359268188476562 seconds.
Processed  8 models on 4 predictors in 0.028423547744750977 seconds.
Processed  7 models on 5 predictors in 0.028908491134643555 seconds.
Processed  6 models on 6 predictors in 0.028341054916381836 seconds.
Processed  5 models on 7 predictors in 0.02079939842224121 seconds.
Processed  4 models on 8 predictors in 0.02014613151550293 seconds.
Processed  3 models on 9 predictors in 0.01541900634765625 seconds.
Processed  2 models on 10 predictors in 0.01562643051147461 seconds.
Processed  1 models on 11 predictors in 0.008931159973144531 seconds.
Total elapsed time: 0.3416600227355957 seconds.

As you can see this estimated far fewer models and therefore took much less time. You could compare the selected models from the best subset selection (models_best) and forward selection (models_fwd) with a certain number of selected variables, say $k=4$.

print(models_best.loc[4, "model"].summary())
#print(models_fwd.loc[4, "model"].summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Balance   R-squared:                       0.954
Model:                            OLS   Adj. R-squared:                  0.953
Method:                 Least Squares   F-statistic:                     2029.
Date:                Thu, 26 Dec 2024   Prob (F-statistic):          8.94e-262
Time:                        05:57:20   Log-Likelihood:                -2405.4
No. Observations:                 400   AIC:                             4821.
Df Residuals:                     395   BIC:                             4841.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const        -499.7272     15.890    -31.449      0.000    -530.967    -468.488
Income         -7.8392      0.232    -33.780      0.000      -8.295      -7.383
Limit           0.2666      0.004     75.271      0.000       0.260       0.274
Cards          23.1754      3.639      6.368      0.000      16.021      30.330
Student_Yes   429.6064     16.611     25.862      0.000     396.949     462.264
==============================================================================
Omnibus:                       34.751   Durbin-Watson:                   1.950
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               41.363
Skew:                           0.771   Prob(JB):                     1.04e-09
Kurtosis:                       3.319   Cond. No.                     1.84e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.84e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

You will find that, on this occasion, these models are different. If, for this example, you compare the models with 6 variables you will actually find them to be identical again.
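
A quick way to check this is to compare the names of the selected regressors for $k = 6$. The following is a sketch using the objects created above.

# Sketch: compare the selected variables for k = 6
print(models_best.loc[6, "model"].model.exog_names)
print(models_fwd.loc[6, "model"].model.exog_names)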

Backward selection

The backward selection procedure is another way to make subset selection feasible for larger datasets. It works in the reverse way to forward selection: we start with the model which contains all variables and then drop one variable at a time.

We will again have to write a new function to work through this, BackwardSelection.

def BackwardSelection(predictors):
    
    tic = time.time()
    
    results = []
    
    for combo in itertools.combinations(predictors, len(predictors)-1):
        results.append(processSubset(combo))
    
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    
    # Choose the model with the lowest RSS
    best_model = models.loc[models['RSS'].argmin()]
    
    toc = time.time()
    print("Processed ", models.shape[0], "models on", len(predictors)-1, "predictors in", (toc-tic), "seconds.")
    
    # Return the best model, along with some other useful information about the model
    return best_model

With this function in the bag, we can now iterate, starting from the full set of predictors and removing one variable at a time.

models_bwd = pd.DataFrame(columns=["RSS", "model"], index = range(1,len(x.columns)))

tic = time.time()
# start with the full set of predictors
predictors = x.columns

while(len(predictors) > 1):  
    models_bwd.loc[len(predictors)-1] = BackwardSelection(predictors)
    predictors = models_bwd.loc[len(predictors)-1]["model"].model.exog_names
    # remove the first element [0], which is the constant
    predictors = predictors[1:]  

toc = time.time()
print("Total elapsed time:", (toc-tic), "seconds.")
Processed  11 models on 10 predictors in 0.10194849967956543 seconds.
Processed  10 models on 9 predictors in 0.05987977981567383 seconds.
Processed  9 models on 8 predictors in 0.03385615348815918 seconds.
Processed  8 models on 7 predictors in 0.029391050338745117 seconds.
Processed  7 models on 6 predictors in 0.023579835891723633 seconds.
Processed  6 models on 5 predictors in 0.02250194549560547 seconds.
Processed  5 models on 4 predictors in 0.015988826751708984 seconds.
Processed  4 models on 3 predictors in 0.009361982345581055 seconds.
Processed  3 models on 2 predictors in 0.007973909378051758 seconds.
Processed  2 models on 1 predictors in 0.030282020568847656 seconds.
Total elapsed time: 0.3457319736480713 seconds.

This basically takes the same time as the forward stepwise selection. At any given level of $k < 11$ the results of the three methods may differ.

Select the best model

We have seen three potential ways to select the best model for a given number of explanatory variables. But how do we decide how many explanatory variables to include? As before, we compare the models using an information criterion, here the BIC.

# Gets the second element from each row ('model') and 
# pulls out its bic attribute
best_bic = models_best.apply(lambda row: row[1].bic, axis=1)
fwd_bic = models_fwd.apply(lambda row: row[1].bic, axis=1)
bwd_bic = models_bwd.apply(lambda row: row[1].bic, axis=1)

data = {
  "k": range(1,12),
  "best": best_bic,
  "fwd": fwd_bic,
  "bwd": bwd_bic
}

#load data into a DataFrame object:
model_comp = pd.DataFrame(data)

We now wish to plot the BIC measures, depending on $k$, for all three selection methods. When using the ggplot function this is best achieved when the data are in long and not wide format.

model_comp_long = pd.melt(model_comp, 
                          id_vars = 'k', 
                          value_vars = ['best', 'fwd', 'bwd'],
                          var_name = "Method",
                          value_name = "BIC")

Look at the resulting dataframe to understand its structure. Now we can easily plot the results.

(
    ggplot(model_comp_long, aes(x = 'k', y = 'BIC', color = 'Method')) + 
        geom_line() +
        labs(title="BIC v number of explanatory variables",
             x = "Number of explanatory variables")
)

If you wish to find which model sizes deliver the minimum BIC you can do that as follows:

# need to add 1 as python starts counting at 0, 
# e.g. if best_bic.argmin() delivers 3, this means the 4th model which represents 
# the model with 4 explanatory variables
print("Best Subset Selection: best k = ", best_bic.argmin()+1)
print("Forward Stepwise Selection: best k = ", fwd_bic.argmin()+1)
print("Backward Stepwise Selection: best k = ", bwd_bic.argmin()+1)
Best Subset Selection: best k =  4
Forward Stepwise Selection: best k =  5
Backward Stepwise Selection: best k =  4

Summary

In this walkthrough you learned

  • how to write functions to support the repeated application of similar code
  • how to loop through a recursive procedure
  • how to select the best set of explanatory variables using three different methods

This walkthrough is part of the ECLR page.