%load_ext pretty_jupyter
Introduction - Loading Data¶
This walkthrough is part of the ECLR page.
In this walkthrough you will explore methods that are useful to support your variable selection.
The dataset we will use is the credit dataset, which comes with the resources provided by James et al. in their textbook An Introduction to Statistical Learning (ISL).
Preparing your workspace¶
As usual, we begin a notebook or script file by loading the libraries we anticipate using.
# Standard packages to deal with data
import pandas as pd
import numpy as np
# Plotting
from lets_plot import *
# Lets_plot requires a setup
LetsPlot.setup_html()
# Statistics
import statsmodels.formula.api as smf
import statsmodels.api as sm
# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import RidgeCV, Ridge
from sklearn.model_selection import KFold
Later in this walkthrough we will use a few more libraries/packages and we will import them when we introduce them. If you already know which packages you will use, it is easiest to add them to your list here.
Data upload¶
The data file we practice with here is a Comma-Separated Values (CSV) file. To load such a file in Python, we use the read_csv function from the pandas library. The file is available directly from this URL: https://www.statlearning.com/s/Credit.csv.
# Load the CSV file
data_credit = pd.read_csv("https://www.statlearning.com/s/Credit.csv")
If this worked, you should now be able to see an object called data_credit in your list of variables.
A typical error message you could receive when loading from a local copy is "FileNotFoundError: [Errno 2] No such file or directory: 'Credit.csv'". If you get this message, the most likely reason is that the file Credit.csv is not in the place where Python is looking: either it is not in the working directory, or you did not set the working directory correctly.
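If you prefer to work from a local copy rather than the URL, the same call with a relative path works (a minimal sketch, assuming you saved the file as Credit.csv in your working directory):
# This is only illustrative code: load a local copy from the working directory
data_credit = pd.read_csv("Credit.csv")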
These are simulated data for 400 customers of a financial institution. Let's look at the variables we can find in this dataset.
Explore the data¶
The task¶
Whenever you work with data you should explore a few things about the data.
- What type does each variable have? Ensure they have the expected type.
- Are there missing data?
Variable Types¶
To find the variable data types you can use the .info() method.
data_credit.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 400 entries, 0 to 399 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Income 400 non-null float64 1 Limit 400 non-null int64 2 Rating 400 non-null int64 3 Cards 400 non-null int64 4 Age 400 non-null int64 5 Education 400 non-null int64 6 Own 400 non-null object 7 Student 400 non-null object 8 Married 400 non-null object 9 Region 400 non-null object 10 Balance 400 non-null int64 dtypes: float64(1), int64(6), object(4) memory usage: 34.5+ KB
You can see that the variables Own, Student, Married and Region are object variables. They are basically text variables. We will convert these to categorical variables.
data_credit['Own'] = data_credit['Own'].astype('category')
data_credit['Student'] = data_credit['Student'].astype('category')
data_credit['Married'] = data_credit['Married'].astype('category')
data_credit['Region'] = data_credit['Region'].astype('category')
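If you prefer not to repeat the same line for each column, the following sketch achieves the same conversion in one step (using the same four column names as above):
# Equivalent, more compact alternative: convert all four columns in one step
cat_cols = ['Own', 'Student', 'Married', 'Region']
data_credit[cat_cols] = data_credit[cat_cols].astype('category')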
Missing data¶
It is important to understand whether our dataset has many observations with missing values and for which variables we have missing information.
credit_na = (data_credit.isnull().sum() / len(data_credit)) * 100
missing_data = pd.DataFrame({'Missing Ratio' :credit_na})
missing_data.style.bar()
 | Missing Ratio |
---|---|
Income | 0.000000 |
Limit | 0.000000 |
Rating | 0.000000 |
Cards | 0.000000 |
Age | 0.000000 |
Education | 0.000000 |
Own | 0.000000 |
Student | 0.000000 |
Married | 0.000000 |
Region | 0.000000 |
Balance | 0.000000 |
As it turns out there are no missing observations in this dataframe.
Remember that the data are simulated, which is one reason why there are no missing data. The variable Balance is the amount of credit card debt (a positive number being a debt). Education is measured in years of completed education, Cards is the number of credit cards owned by an individual, Limit is the individual's credit card limit and Income is the individual's annual income in thousands of dollars.
Let's produce a scatter plot for the variables Income and Age.
(
ggplot(data_credit, aes(x='Age', y = 'Income')) +
geom_point()
)
The plot reveals that there is a weak positive correlation between these variables.
The variable Rating is a credit rating for an individual. One could think of many different ways a credit rating variable could be defined. Without any further definition available you will need to use the data to establish whether a higher number in the Rating variable means a better or a worse credit rating.
Explore the data¶
The question¶
Are higher numbers of the Rating variable indicative of a better or a worse credit rating?
The answer¶
You could for instance create a scatter plot between Rating and Income. There you will find a clearly positive correlation, indicating that higher ratings relate to better creditworthiness.
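A sketch of that scatter plot, using the same lets_plot pattern as above:
(
ggplot(data_credit, aes(x='Rating', y='Income')) +
geom_point()
)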
Let's calculate some summary statistics for Balance, Rating and Region:
# for numeric variables (describe() reports only the numeric columns from this selection)
data_credit[['Balance','Rating','Region']].describe()
 | Balance | Rating |
---|---|---|
count | 400.000000 | 400.000000 |
mean | 520.015000 | 354.940000 |
std | 459.758877 | 154.724143 |
min | 0.000000 | 93.000000 |
25% | 68.750000 | 247.250000 |
50% | 459.500000 | 344.000000 |
75% | 863.000000 | 437.250000 |
max | 1999.000000 | 982.000000 |
# for categorical variables
data_credit[['Region']].describe()
 | Region |
---|---|
count | 400 |
unique | 3 |
top | South |
freq | 199 |
You can see that there are three regions.
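If you want the counts for all three regions rather than just the most frequent one, a quick check is:
# Frequency of each category in Region
print(data_credit['Region'].value_counts())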
Modelling credit balances¶
Let's consider whether variation in credit balances can be modeled using a linear regression model. For instance we can estimate the following model:
$$ balances_i = \beta_0 + \beta_1~rating_i + \beta_2~cards_i + \epsilon_i $$
data_credit.columns
Index(['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education', 'Own', 'Student', 'Married', 'Region', 'Balance'], dtype='object')
y = data_credit['Balance']
x = data_credit[['Rating', 'Cards']]
mod = sm.OLS(y,sm.add_constant(x))
res = mod.fit()
print(res.summary())
OLS Regression Results ============================================================================== Dep. Variable: Balance R-squared: 0.747 Model: OLS Adj. R-squared: 0.746 Method: Least Squares F-statistic: 587.6 Date: Sun, 22 Dec 2024 Prob (F-statistic): 2.25e-119 Time: 21:39:20 Log-Likelihood: -2744.1 No. Observations: 400 AIC: 5494. Df Residuals: 397 BIC: 5506. Df Model: 2 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const -428.8183 37.414 -11.462 0.000 -502.372 -355.265 Rating 2.5598 0.075 34.110 0.000 2.412 2.707 Cards 13.6099 8.468 1.607 0.109 -3.037 30.257 ============================================================================== Omnibus: 23.329 Durbin-Watson: 1.993 Prob(Omnibus): 0.000 Jarque-Bera (JB): 39.146 Skew: 0.390 Prob(JB): 3.16e-09 Kurtosis: 4.319 Cond. No. 1.26e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.26e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Here we chose two particular explanatory variables to explain the credit card debt (Balance). But could it be that choosing other and/or more variables would explain more of the variation in the credit card debt?
In fact, how would you be able to maximise the $R^2$? It is well known that the $R^2$ is maximised by including as many variables as possible (as long as they do not create perfect collinearity).
Maximising the $R^2$, however, is not the best strategy as it may produce a model that does not have the optimal forecast performance. Information criteria are used to find the best trade-off between fit and number of variables.
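For reference, the fitted statsmodels results object exposes these quantities directly; a quick look at the two-variable model estimated above:
# Fit statistics of the two-variable model estimated above
print("R squared:", res.rsquared)
print("AIC:", res.aic)
print("BIC:", res.bic)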
Optimal variable choice¶
Here you will explore three possible ways to choose the optimal set of explanatory variables.
- Best Subset Selection
- Forward Selection
- Backward Selection
The Python implementation for these techniques follows the solutions given in this workbook by Jordan Crouser.
Best Subset Selection¶
For this we want to consider the inclusion of all variables into the model, but equally we also wish to consider that any variable could be excluded. The data_credit dataset contains 11 variables, one of which is the dependent variable Balance. So at first sight it may look as if there are 10 potential variables to be included as explanatory variables. However, we also know that we have four categorical variables.
data_credit[['Own','Student','Married','Region']].describe()
 | Own | Student | Married | Region |
---|---|---|---|---|
count | 400 | 400 | 400 | 400 |
unique | 2 | 2 | 2 | 3 |
top | Yes | No | Yes | South |
freq | 207 | 360 | 245 | 199 |
All but one have two categories, such that adding these to a regression model will result in the inclusion of one dummy variable each. The Region variable has three categories, so including it in a regression model will result in the inclusion of two dummy variables.
The sm.OLS function only accepts numerical data and hence we will have to convert the categorical variables to dummy variables. You may have used other functions or languages where you can include categorical variables in a regression specification, but there the conversion to dummy variables happens inside the function. Here, in Python, you need to do that explicitly. But it is also quite straightforward.
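As an aside, the formula interface imported above as smf would do this conversion internally; a minimal sketch (we do not use this route in the remainder of the walkthrough):
# This is only illustrative code: the formula interface expands categorical
# variables (here Region) into dummy variables automatically
res_formula = smf.ols('Balance ~ Rating + Cards + Region', data=data_credit).fit()
print(res_formula.summary())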
# create the dummies for all variables of category type
# but do not create dummy for first category
data_credit_dum = pd.get_dummies(data_credit.select_dtypes('category'), drop_first=True)
# Drop the column with the dependent variable (Balance) and the columns for which we created dummy variables
x_del_list = ['Balance','Own','Student','Married','Region']
x = data_credit.drop(columns = x_del_list)
# add the dummy variables
x = pd.concat([x, data_credit_dum], axis=1)
x.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 400 entries, 0 to 399 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Income 400 non-null float64 1 Limit 400 non-null int64 2 Rating 400 non-null int64 3 Cards 400 non-null int64 4 Age 400 non-null int64 5 Education 400 non-null int64 6 Own_Yes 400 non-null uint8 7 Student_Yes 400 non-null uint8 8 Married_Yes 400 non-null uint8 9 Region_South 400 non-null uint8 10 Region_West 400 non-null uint8 dtypes: float64(1), int64(5), uint8(5) memory usage: 20.8 KB
The dataframe x now contains only numeric variables and we can run the model with all possible variables.
y = data_credit['Balance']
mod = sm.OLS(y,sm.add_constant(x))
res = mod.fit()
print(res.summary())
OLS Regression Results ============================================================================== Dep. Variable: Balance R-squared: 0.955 Model: OLS Adj. R-squared: 0.954 Method: Least Squares F-statistic: 750.3 Date: Sun, 22 Dec 2024 Prob (F-statistic): 1.11e-253 Time: 21:57:44 Log-Likelihood: -2398.7 No. Observations: 400 AIC: 4821. Df Residuals: 388 BIC: 4869. Df Model: 11 Covariance Type: nonrobust ================================================================================ coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------- const -479.2079 35.774 -13.395 0.000 -549.543 -408.873 Income -7.8031 0.234 -33.314 0.000 -8.264 -7.343 Limit 0.1909 0.033 5.824 0.000 0.126 0.255 Rating 1.1365 0.491 2.315 0.021 0.171 2.102 Cards 17.7245 4.341 4.083 0.000 9.190 26.259 Age -0.6139 0.294 -2.088 0.037 -1.192 -0.036 Education -1.0989 1.598 -0.688 0.492 -4.241 2.043 Own_Yes -10.6532 9.914 -1.075 0.283 -30.145 8.839 Student_Yes 425.7474 16.723 25.459 0.000 392.869 458.626 Married_Yes -8.5339 10.363 -0.824 0.411 -28.908 11.841 Region_South 10.1070 12.210 0.828 0.408 -13.899 34.113 Region_West 16.8042 14.119 1.190 0.235 -10.955 44.564 ============================================================================== Omnibus: 34.899 Durbin-Watson: 1.968 Prob(Omnibus): 0.000 Jarque-Bera (JB): 41.766 Skew: 0.782 Prob(JB): 8.52e-10 Kurtosis: 3.241 Cond. No. 3.87e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 3.87e+04. This might indicate that there are strong multicollinearity or other numerical problems.
What we want to do, however, is not to fit the largest possible model but to find the model which fits the data best, meaning it is optimal by some sort of information criterion, say AIC or BIC.
How many different models are there? $2^{11} - 1 = 2047$ models (excluding the model with no regressors), as each of the 11 variables can either be switched on or off. With 11 possible regressors this is still a feasible approach, but just imagine you have 1,000 possible regressors. So, while we go through the steps below, this is not a process we recommend when you have very large sets of regressors.
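As a sanity check on the size of this search, the number of models with exactly $k$ regressors can be computed directly (compare with the counts printed by the loop further below):
# This is only illustrative code
from math import comb
# Number of distinct models with exactly k of the 11 candidate regressors
counts = [comb(11, k) for k in range(1, 12)]
print(counts)       # [11, 55, 165, 330, 462, 462, 330, 165, 55, 11, 1]
print(sum(counts))  # 2047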
The way to organise this is as follows.
- Set a variable $k$ that determines how many regressors should be included.
- For every $k$ from 1 to 11 estimate a model for all possible combinations of $k$ regressors.
- Choose the best model from these (minimum RSS)
- Present the best models for all values of $k$
To do this we require a function that operationalises step 2 as efficiently as possible. Here we present the functions basically as they were written by Jordan Crouser.
The function processSubset assumes that your dependent variable is saved in y and your matrix of explanatory variables is saved in x. It then uses a list of variable names (feature_set) to run a regression which only includes the variables in feature_set. It returns the estimated model and the residual sum of squares.
The function getBest takes as input the value for $k$. It also assumes that your dependent variable is saved in y and your matrix of explanatory variables is saved in x. It then creates all possible $k$-variable combinations, sends them to processSubset, collects the results and identifies which of these models has the smallest RSS. The function returns the output for that chosen model.
def processSubset(feature_set):
# Fit model on feature_set and calculate RSS
model = sm.OLS(y,sm.add_constant(x[list(feature_set)]))
regr = model.fit()
RSS = ((regr.fittedvalues - y) ** 2).sum()
return {"model":regr, "RSS":RSS}
def getBest(k):
tic = time.time()
results = []
for combo in itertools.combinations(x.columns, k):
results.append(processSubset(combo))
# Wrap everything up in a nice dataframe
models = pd.DataFrame(results)
# Choose the model with the lowest RSS
best_model = models.loc[models['RSS'].argmin()]
toc = time.time()
print("Processed", models.shape[0], "models on", k, "predictors in", (toc-tic), "seconds.")
# Return the best model, along with some other useful information about the model
return best_model
Explaining the Code¶
Defining functions¶
When you run the above lines, nothing is actually executed yet. All you have done is give these collections of lines of code new names (here processSubset and getBest). This means that in the following, when you call either of these, Python knows which combination of code to run.
Selecting the right variables¶
In getBest the right variables are being selected. The command itertools.combinations(x.columns, k) creates all possible combinations of $k$ of the variable names (in x.columns). The loop then works through all these combinations. The currently selected combination (combo) is sent to the processSubset function, in which the actual model is estimated.
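A quick illustration of what itertools.combinations produces, sketched on just four of the column names:
# This is only illustrative code: all 2-variable combinations of four names
import itertools
for combo in itertools.combinations(['Income', 'Limit', 'Rating', 'Cards'], 2):
    print(combo)
# ('Income', 'Limit'), ('Income', 'Rating'), ... 6 tuples in total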
Adding a regression constant¶
The list of variables sent to the processSubset function (feature_set) is merely a list of variable names; it does not contain a constant. However, we always wish to include a constant, and sending the selected columns through the sm.add_constant function will add one.
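A small illustration of what sm.add_constant does, using two columns of the x dataframe built above:
# This is only illustrative code: add_constant prepends a column of ones named 'const'
print(sm.add_constant(x[['Rating', 'Cards']]).head())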
Use of .argmin¶
Inside the getBest function we collected all possible models with combinations of $k$ variables in the object models. But we only want to keep information on the model that fits the data best (has the minimum RSS). The way to select it is as follows: we want to access a particular row of the saved models and we do so using models.loc[]. But how do we find the location of the model we want? We apply the .argmin method to models['RSS'], which identifies the position of the model with the smallest RSS. This leads to the line models.loc[models['RSS'].argmin()].
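A toy illustration of this .argmin pattern, with made-up RSS values:
# This is only illustrative code
toy = pd.DataFrame({'RSS': [12.3, 7.1, 9.8]})
print(toy['RSS'].argmin())           # 1, the position of the smallest RSS
print(toy.loc[toy['RSS'].argmin()])  # the corresponding row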
Adding Stopwatches¶
The combination of tic = time.time(), toc = time.time() and print("Elapsed time: ", (toc-tic), "seconds.") allows you to check how much time it took to get from tic to toc. In that way you can measure how long the code between tic and toc takes to run. This can be very useful when you develop code, as it helps you understand how long your code will take. Before actually running the code you will have to import the time library. We do this below, before we call any of that code.
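A minimal sketch of this stopwatch pattern, with a placeholder computation in between:
# This is only illustrative code
import time
tic = time.time()
# code to be timed, here just a simple placeholder computation
total = sum(i ** 2 for i in range(1_000_000))
toc = time.time()
print("Elapsed time: ", (toc - tic), "seconds.")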
Given these functions are now available to us, we can write a loop that delivers the best model for any given value of k. We then save the respective model results in the models_best object. To do that we require two more libraries: the itertools library, as it allows us to use itertools.combinations in the getBest function, and the time library, as it allows us to run a few stopwatches to see how long the code takes.
import itertools
import time
models_best = pd.DataFrame(columns=["RSS", "model"])
tic = time.time()
for i in range(1,12):
models_best.loc[i] = getBest(i)
toc = time.time()
print("Total elapsed time:", (toc-tic), "seconds.")
Processed 11 models on 1 predictors in 0.06599211692810059 seconds. Processed 55 models on 2 predictors in 0.21643424034118652 seconds. Processed 165 models on 3 predictors in 0.860020637512207 seconds. Processed 330 models on 4 predictors in 1.2015345096588135 seconds. Processed 462 models on 5 predictors in 1.734302043914795 seconds. Processed 462 models on 6 predictors in 1.9868927001953125 seconds. Processed 330 models on 7 predictors in 1.425506830215454 seconds. Processed 165 models on 8 predictors in 1.1635918617248535 seconds. Processed 55 models on 9 predictors in 0.5449655055999756 seconds. Processed 11 models on 10 predictors in 0.1634523868560791 seconds. Processed 1 models on 11 predictors in 0.014091253280639648 seconds. Total elapsed time: 9.450889825820923 seconds.
Now we have 11 best models saved in models_best, one for each number of explanatory variables $k=1,\dots,11$. You can look at the details for any given $k$, e.g. $k=3$.
print(models_best.loc[3, "model"].summary())
OLS Regression Results ============================================================================== Dep. Variable: Balance R-squared: 0.950 Model: OLS Adj. R-squared: 0.949 Method: Least Squares F-statistic: 2502. Date: Tue, 24 Dec 2024 Prob (F-statistic): 6.23e-257 Time: 09:40:15 Log-Likelihood: -2420.7 No. Observations: 400 AIC: 4849. Df Residuals: 396 BIC: 4865. Df Model: 3 Covariance Type: nonrobust =============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------- const -581.0789 13.835 -42.002 0.000 -608.277 -553.880 Income -7.8749 0.240 -32.783 0.000 -8.347 -7.403 Rating 3.9875 0.055 72.888 0.000 3.880 4.095 Student_Yes 418.7603 17.230 24.304 0.000 384.886 452.634 ============================================================================== Omnibus: 13.580 Durbin-Watson: 1.946 Prob(Omnibus): 0.001 Jarque-Bera (JB): 14.236 Skew: 0.445 Prob(JB): 0.000810 Kurtosis: 2.750 Cond. No. 1.32e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.32e+03. This might indicate that there are strong multicollinearity or other numerical problems.
The best model with 3 explanatory variables includes Income, Rating and the student dummy (Student_Yes).
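If you only want the names of the selected variables rather than the full summary, the exog_names attribute (used again further below) gives them:
# Names of the regressors in the best 3-variable model (the first entry is the constant)
print(models_best.loc[3, "model"].model.exog_names)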
You will now have to decide which of these models you would prefer, in other words you will have to decide on the trade-off between extra explanatory variables and the risk of over-fitting. This is often done by looking at an information criterion such as AIC or BIC.
# Gets the second element from each row ('model') and
# pulls out its rsquared or bic attribute
rsq = models_best.apply(lambda row: row[1].rsquared, axis=1)
bic = models_best.apply(lambda row: row[1].bic, axis=1)
data = {
"k": range(1,12),
"bic": bic,
"rsq": rsq
}
#load data into a DataFrame object:
model_comp = pd.DataFrame(data)
Now we can plot these criteria against the number of included explanatory variables.
(
ggplot(model_comp, aes(x = 'k', y = 'rsq')) +
geom_line() +
labs(title="R squared v number of explanatory variables",
x = "Number of explanatory variables")
)
As expected, the $R^2$ increases with extra explanatory variables. What about the BIC?
(
ggplot(model_comp, aes(x = 'k', y = 'bic')) +
geom_line() +
labs(title="BIC v number of explanatory variables",
x = "Number of explanatory variables")
)
Forward Stepwise Selection¶
This method has a smaller search space: after selecting the best one-variable model, it only needs to decide which variable to add at each subsequent step. For the 11 candidate regressors here this means fitting $11 + 10 + \dots + 1 = 66$ regressions rather than 2047.
In order to achieve this we need to adjust the getBest function, or better, write a new version, ForwardSelect. What we do in this procedure is to first select the best model with $k=1$ variables. Whatever variable is selected here will stay in the model, even for $k>1$. That means that for $k=2$ we fix that variable and only search over combinations of two variables that contain that first variable.
To achieve this, we define the new function ForwardSelect. This not only needs to use the objects y and x as before, but we also need to feed in the information on which variables have already been preselected (predictors).
Explaining the Code¶
The task¶
Create a new list of variable names from x.columns that excludes variables previously selected, say Cards and Age.
The solution¶
This can be done in the following way. Consider that we have already selected Cards and Age as variables. We now need to create a new list which consists of the names in x.columns minus those two names. As usual there are several ways to achieve this. Here are two such ways:
# This is only illustrative code
prev_pred = ["Cards", "Age"]
rem_pred = [p for p in x.columns if p not in prev_pred]
rem_pred2 = list(set(x.columns) - set(prev_pred))
print(x.columns)
print(rem_pred2)
print(rem_pred)
Index(['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education', 'Own_Yes', 'Student_Yes', 'Married_Yes', 'Region_South', 'Region_West'], dtype='object') ['Rating', 'Limit', 'Education', 'Student_Yes', 'Region_West', 'Income', 'Region_South', 'Married_Yes', 'Own_Yes'] ['Income', 'Limit', 'Rating', 'Education', 'Own_Yes', 'Student_Yes', 'Married_Yes', 'Region_South', 'Region_West']
You can see that the two lists (rem_pred and rem_pred2) contain the same variables. They only differ in the order of the variables. The first method ([p for p in x.columns if p not in prev_pred]) maintains the same order of variables as in x.columns and hence we use that one. However, the remainder of the code would also work with the other method.
def ForwardSelect(predictors):
# Pull out predictors we still need to process
remaining_predictors = [p for p in x.columns if p not in predictors]
tic = time.time()
results = []
for p in remaining_predictors:
# we use the variables in predictors plus one from remaining_predictors
results.append(processSubset(predictors+[p]))
# Wrap everything up in a nice dataframe
models = pd.DataFrame(results)
# Choose the model with the lowest RSS
best_model = models.loc[models['RSS'].argmin()]
toc = time.time()
print("Processed ", models.shape[0], "models on", len(predictors)+1, "predictors in", (toc-tic), "seconds.")
# Return the best model, along with some other useful information about the model
return best_model
Now that we have this new function we can again run through a loop which adds one variable at a time to the variables selected in the previous round (and saved in predictors).
models_fwd = pd.DataFrame(columns=["RSS", "model"])
tic = time.time()
predictors = [] # start with no preselected predictors
for i in range(1,len(x.columns)+1):
models_fwd.loc[i] = ForwardSelect(predictors)
predictors = models_fwd.loc[i]["model"].model.exog_names
# remove the first element [0] which is the constant
predictors = predictors[1:]
toc = time.time()
print("Total elapsed time:", (toc-tic), "seconds.")
Processed 11 models on 1 predictors in 0.058068037033081055 seconds. Processed 10 models on 2 predictors in 0.06487774848937988 seconds. Processed 9 models on 3 predictors in 0.030359268188476562 seconds. Processed 8 models on 4 predictors in 0.028423547744750977 seconds. Processed 7 models on 5 predictors in 0.028908491134643555 seconds. Processed 6 models on 6 predictors in 0.028341054916381836 seconds. Processed 5 models on 7 predictors in 0.02079939842224121 seconds. Processed 4 models on 8 predictors in 0.02014613151550293 seconds. Processed 3 models on 9 predictors in 0.01541900634765625 seconds. Processed 2 models on 10 predictors in 0.01562643051147461 seconds. Processed 1 models on 11 predictors in 0.008931159973144531 seconds. Total elapsed time: 0.3416600227355957 seconds.
As you can see this estimated far fewer models and therefore took much less time. You could compare the selected models from the best subset selection (models_best) and forward selection (models_fwd) for a certain number of selected variables, say $k=4$.
print(models_best.loc[4, "model"].summary())
#print(models_fwd.loc[4, "model"].summary())
OLS Regression Results ============================================================================== Dep. Variable: Balance R-squared: 0.954 Model: OLS Adj. R-squared: 0.953 Method: Least Squares F-statistic: 2029. Date: Thu, 26 Dec 2024 Prob (F-statistic): 8.94e-262 Time: 05:57:20 Log-Likelihood: -2405.4 No. Observations: 400 AIC: 4821. Df Residuals: 395 BIC: 4841. Df Model: 4 Covariance Type: nonrobust =============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------- const -499.7272 15.890 -31.449 0.000 -530.967 -468.488 Income -7.8392 0.232 -33.780 0.000 -8.295 -7.383 Limit 0.2666 0.004 75.271 0.000 0.260 0.274 Cards 23.1754 3.639 6.368 0.000 16.021 30.330 Student_Yes 429.6064 16.611 25.862 0.000 396.949 462.264 ============================================================================== Omnibus: 34.751 Durbin-Watson: 1.950 Prob(Omnibus): 0.000 Jarque-Bera (JB): 41.363 Skew: 0.771 Prob(JB): 1.04e-09 Kurtosis: 3.319 Cond. No. 1.84e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.84e+04. This might indicate that there are strong multicollinearity or other numerical problems.
You will find that, on this occasion, these models are different. If, for this example, you do compare the models with 6 variables you will actually find them to be identical again.
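A quick way to check this is to compare just the selected variable names at $k=6$:
# Compare the variables selected by best subset and forward selection for k = 6
print(models_best.loc[6, "model"].model.exog_names)
print(models_fwd.loc[6, "model"].model.exog_names)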
Backward selection¶
The backward selection procedure is another way to make the best subset selection feasible for large datasets. It works in the reverse way to forward selection: we start with the model which contains all variables and then drop one variable at a time.
We will again have to write a new function to work through this, BackwardSelection.
def BackwardSelection(predictors):
tic = time.time()
results = []
for combo in itertools.combinations(predictors, len(predictors)-1):
results.append(processSubset(combo))
# Wrap everything up in a nice dataframe
models = pd.DataFrame(results)
# Choose the model with the lowest RSS
best_model = models.loc[models['RSS'].argmin()]
toc = time.time()
print("Processed ", models.shape[0], "models on", len(predictors)-1, "predictors in", (toc-tic), "seconds.")
# Return the best model, along with some other useful information about the model
return best_model
With this function in the bag, we can now iterate:
models_bwd = pd.DataFrame(columns=["RSS", "model"], index = range(1,len(x.columns)))
tic = time.time()
# start with the full set of predictors
predictors = x.columns
while(len(predictors) > 1):
models_bwd.loc[len(predictors)-1] = BackwardSelection(predictors)
predictors = models_bwd.loc[len(predictors)-1]["model"].model.exog_names
# remove the first element [0] which is the constant
predictors = predictors[1:]
toc = time.time()
print("Total elapsed time:", (toc-tic), "seconds.")
Processed 11 models on 10 predictors in 0.10194849967956543 seconds. Processed 10 models on 9 predictors in 0.05987977981567383 seconds. Processed 9 models on 8 predictors in 0.03385615348815918 seconds. Processed 8 models on 7 predictors in 0.029391050338745117 seconds. Processed 7 models on 6 predictors in 0.023579835891723633 seconds. Processed 6 models on 5 predictors in 0.02250194549560547 seconds. Processed 5 models on 4 predictors in 0.015988826751708984 seconds. Processed 4 models on 3 predictors in 0.009361982345581055 seconds. Processed 3 models on 2 predictors in 0.007973909378051758 seconds. Processed 2 models on 1 predictors in 0.030282020568847656 seconds. Total elapsed time: 0.3457319736480713 seconds.
This basically takes the same time as the forward stepwise selection. At any given level of $k < 11$ the results of the three methods may differ.
Select the best model¶
We have seen three potential ways to select the best model for a given number of explanatory variables. But how do we choose between them? As before, we compare the selected models using an information criterion, here the BIC.
# Gets the second element from each row ('model') and
# pulls out its bic attribute
best_bic = models_best.apply(lambda row: row[1].bic, axis=1)
fwd_bic = models_fwd.apply(lambda row: row[1].bic, axis=1)
bwd_bic = models_bwd.apply(lambda row: row[1].bic, axis=1)
data = {
"k": range(1,12),
"best": best_bic,
"fwd": fwd_bic,
"bwd": bwd_bic
}
#load data into a DataFrame object:
model_comp = pd.DataFrame(data)
We now wish to plot the BIC measures against $k$ for all three selection methods. When using the ggplot function this is best achieved when the data are in long rather than wide format.
model_comp_long = pd.melt(model_comp,
id_vars = 'k',
value_vars = ['best', 'fwd', 'bwd'],
var_name = "Method",
value_name = "BIC")
Look at the resulting dataframe to understand its structure; a quick way to do that is sketched below. Afterwards we can easily plot the results.
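For instance (output not shown here):
# Peek at the long-format dataframe: one row per (k, Method) combination
print(model_comp_long.head())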
(
ggplot(model_comp_long, aes(x = 'k', y = 'BIC', color = 'Method')) +
geom_line() +
labs(title="BIC v number of explanatory variables",
x = "Number of explanatory variables")
)
If you wish to find which model sizes deliver the minimum BIC you can do that as follows:
# need to add 1 as python starts counting at 0,
# e.g. if best_bic.argmin() delivers 3, this means the 4th model which represents
# the model with 4 explanatory variables
print("Best Subset Selection: best k = ", best_bic.argmin()+1)
print("Forward Stepwise Selection: best k = ", fwd_bic.argmin()+1)
print("Backward Stepwise Selection: best k = ", bwd_bic.argmin()+1)
Best Subset Selection: best k = 4 Forward Stepwise Selection: best k = 5 Backward Stepwise Selection: best k = 4
Summary¶
In this walkthrough you learned
- how to write functions to support the repeated application of similar code
- how to loop through a recursive procedure
- how to select the best set of explanatory variables using three different methods
This walkthrough is part of the ECLR page.