%load_ext pretty_jupyter

Introduction

This walk-through is part of the ECLR page.

In this walk-through you will learn to

  • use regression results to obtain in-sample predictions
  • use regression results to obtain out-of-sample predictions

Preparing your workfile

We import the basic packages needed for this work:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from lets_plot import *
import statsmodels.api as sm 

# lets_plot requires this setup step
LetsPlot.setup_html(no_js=True)

Data upload

Here we will be using the Women's Wages dataset (mroz), which is made available here.

# Load the dataset
url = "https://raw.githubusercontent.com/datasquad/ECLR/refs/heads/gh-pages/data/Mroz.csv"
df_wages = pd.read_csv(url)

Let's look at the list of variable names and their formats (more detailed variable descriptions are available from the data source):

print(df_wages.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 753 entries, 0 to 752
Data columns (total 22 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   inlf      753 non-null    int64  
 1   hours     753 non-null    int64  
 2   kidslt6   753 non-null    int64  
 3   kidsge6   753 non-null    int64  
 4   age       753 non-null    int64  
 5   educ      753 non-null    int64  
 6   wage      753 non-null    object 
 7   repwage   753 non-null    float64
 8   hushrs    753 non-null    int64  
 9   husage    753 non-null    int64  
 10  huseduc   753 non-null    int64  
 11  huswage   753 non-null    float64
 12  faminc    753 non-null    int64  
 13  mtr       753 non-null    float64
 14  motheduc  753 non-null    int64  
 15  fatheduc  753 non-null    int64  
 16  unem      753 non-null    float64
 17  city      753 non-null    int64  
 18  exper     753 non-null    int64  
 19  nwifeinc  753 non-null    float64
 20  lwage     753 non-null    object 
 21  expersq   753 non-null    int64  
dtypes: float64(5), int64(15), object(2)
memory usage: 129.5+ KB
None

All variables except wage and lwage are numeric. Why are wage and lwage not?

# Display summary statistics for the wage variable
print(df_wages['wage'].describe())
count     753
unique    374
top         .
freq      325
Name: wage, dtype: object

There are a lot of observations that take the value ".". These are observations for which there is no wage. As we didn't give pd.read_csv any other instruction, it interpreted these entries as text and therefore treated the whole variable as text rather than as numeric. It is best to re-load the data, letting pd.read_csv know that "." should be interpreted as a missing value.

# Re-load the dataset
df_wages = pd.read_csv(url, na_values = '.')
df_wages[['wage','lwage']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 753 entries, 0 to 752
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   wage    428 non-null    float64
 1   lwage   428 non-null    float64
dtypes: float64(2)
memory usage: 11.9 KB

This confirms that we are now dealing with numerical wage and lwage variables.

Details on the data are available from here. There are 753 observations of women with information on wages (wage), their education (educ), their age (age) and a number of other relevant variables.

Estimating a regression model

What we are about to do is the following. We shall first estimate the regression model

$$wage_i = \alpha_0 + \alpha_1 age_i + \alpha_2 educ_i + \epsilon_i$$

on the basis of all observations that have wage information. We shall then use the estimated model to predict the wages of those who are not currently working, i.e. we answer the question of how much they could expect to earn (on the basis of their age and education).

Before we do this, there are two reasons why we should take this analysis with a pinch of salt:

  1. Including the education variable in such an analysis is likely to be problematic as it is likely to be correlated with the error term. This will produce biased coefficient estimates (an issue you would want to address, for instance, by using an instrumental variables estimation).
  2. The reason why a woman is not working is most likely not random, and therefore the non-working women are not a randomly selected subset of the sample.

Despite these reservations we shall use this example to demonstrate how to use Python to produce predictions. Let's start by splitting our dataframe into two subgroups (those for whom we do have wages and those for whom we don't).

We shall use the isna() and notna() methods to figure out which observations should go into the estimation dataset (df_wages_est) and which into the prediction dataset (df_wages_pred). Try to figure out what df_wages['wage'].isna() and df_wages['wage'].notna() do to understand how the next two lines work. We use the copy() method to ensure that we actually copy the data into a new object.

df_wages_est = df_wages[df_wages['wage'].notna()].copy()
df_wages_pred = df_wages[df_wages['wage'].isna()].copy()

It is good to remind yourself that in coding there are usually multiple ways to achieve the same result. We could have used the .loc method to select the appropriate data. The following code does exactly the same:

df_wages_est = df_wages.loc[df_wages['wage'].notna()].copy()
df_wages_pred = df_wages.loc[df_wages['wage'].isna()].copy()

Those observations for which wage is missing go into the df_wages_pred dataframe and all others into df_wages_est.
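As a quick sanity check, the two subsets should together account for the full sample: 428 observations with wage information and 325 without, matching the frequency of "." we saw above.

# The two subsets should add up to the full sample: 428 + 325 = 753
print(len(df_wages_est), len(df_wages_pred), len(df_wages))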

We now run our regression on the data in the df_wages_est dataframe. To make the code easier to read we predefine a y and an x object containing the explained variable (y) and the explanatory variables (x) respectively.

y = df_wages_est['wage']
x = df_wages_est[['age', 'educ']]
mod = sm.OLS(y,sm.add_constant(x))  # sm.add_constant adds a constant to the x dataframe
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.119
Model:                            OLS   Adj. R-squared:                  0.115
Method:                 Least Squares   F-statistic:                     28.78
Date:                Thu, 12 Dec 2024   Prob (F-statistic):           1.90e-12
Time:                        13:14:25   Log-Likelihood:                -1092.0
No. Observations:                 428   AIC:                             2190.
Df Residuals:                     425   BIC:                             2202.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.0090      1.211     -2.485      0.013      -5.389      -0.629
age            0.0207      0.020      1.061      0.289      -0.018       0.059
educ           0.4990      0.066      7.557      0.000       0.369       0.629
==============================================================================
Omnibus:                      339.142   Durbin-Watson:                   2.052
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6051.151
Skew:                           3.313   Prob(JB):                         0.00
Kurtosis:                      20.187   Cond. No.                         358.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

All the regression results are in res. You can learn more about accessing regression results from this walk-through.
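For instance, individual elements such as the estimated coefficients or the $R^2$ can be accessed directly as attributes of res:

# Access individual elements of the results object
print(res.params)    # estimated coefficients (const, age, educ)
print(res.rsquared)  # R-squared, 0.119 as in the summary table above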

In-sample predictions

If you run dir(res) you can see which elements you can access from the regression results object res. We will have to use the .get_prediction method. If you call ols_pred = res.get_prediction() you get an object that itself contains a number of elements. You can check out dir(ols_pred) to see what elements it contains.

ols_pred = res.get_prediction()

The bit we are interested in is predicted_mean, which contains the expected/predicted values, and we will add this to the df_wages_est dataframe as a new variable:

df_wages_est["wage_pred1"] = ols_pred.predicted_mean

Now we have both the realisation and the prediction in the same dataframe. Let's start by confirming that indeed the prediction for the first observation results from the estimated coefficients.

df_wages_est.loc[0,['wage','age','educ','wage_pred1']]
wage           3.354000
age           32.000000
educ          12.000000
wage_pred1     3.642144
Name: 0, dtype: float64

We can use the estimated coefficients (available from res.params) to attempt to reconstruct the value in wage_pred1.

res.params['const'] + res.params['age']*32 + res.params['educ']*12  # using age = 32 and educ = 12
3.64214444633467

This is indeed the value in the new wage_pred1 variable.
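Instead of checking one observation by hand, we can also verify all in-sample predictions at once, as they coincide with the fitted values stored in res.fittedvalues:

# All in-sample predictions equal the fitted values
print(np.allclose(res.fittedvalues, df_wages_est['wage_pred1']))  # should print True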

We can plot a scatter graph to see how wage and wage_pred1 line up against each other.

(
ggplot(df_wages_est,aes(x = 'wage', y = 'wage_pred1')) + 
    geom_point() + geom_abline(intercept = 0, slope = 1) + 
    xlim(0, 25) + ylim(0, 25) +
    labs(
        title="In-sample wage and predictions",
        x="Observed wage",
        y="Predicted wage"
    )
)
[Figure: scatter plot "In-sample wage and predictions" of observed wage (x-axis, 0 to 25) against predicted wage (y-axis, 0 to 25) with a 45-degree line]

As you can see from the scatter graph, the predictions are not very accurate. Points that lie on the line represent observations for which realisation and prediction are identical. While the realisations vary between 0 and 25, the predictions range between 1 and 6.

Note that the model was a very simplistic model to explain variation in wages (only using education and age) so perhaps we should not expect to be able to explain a lot of the variation in realised wages ($R^2 = 0.119$).
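If you want a single number summarising the prediction accuracy, one simple option is the in-sample root mean squared error of the predictions:

# In-sample root mean squared error of the predictions
rmse = np.sqrt(np.mean((df_wages_est['wage'] - df_wages_est['wage_pred1'])**2))
print(rmse)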

Out-of-sample predictions, actual data

We now use the estimated model to produce predicted wages for all those observations that are in df_wages_pred, i.e. the observations for which we did not observe any wages because these women were not working.

We start by setting up a dataframe with only the predictor variables, x_pred, very much like we set up the object x for the model estimation. Then we call the .get_prediction method on the original regression results res (note we are not re-estimating the model!). Finally we extract the predictions (.predicted_mean) and save them into the df_wages_pred dataframe.

x_pred = df_wages_pred[['age', 'educ']]  
ols_pred_oos = res.get_prediction(sm.add_constant(x_pred))
df_wages_pred["wage_pred1"] = ols_pred_oos.predicted_mean

In df_wages_pred you now have a variable wage_pred1 that contains the wages predicted by the model that was estimated on the basis of the data in df_wages_est.
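You can inspect the first few of these predictions together with the conditioning values they are based on:

# First few out-of-sample predictions
print(df_wages_pred[['age', 'educ', 'wage_pred1']].head())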

Out-of-sample predictions, generated data

Sometimes you don't have actual observations for which you want to produce forecasts, but rather a set of conditioning values at which you want to calculate predictions. You then have to create this set of age and education combinations yourself. If only there were an easy way to do that...

This is not a task you are likely to do very often and therefore it is unlikely to be something you remember by heart. So you should go to your favourite search engine and search for something like "Python create combinations of two variables". You are likely to find advice that refers to the itertools package. This is a short example you may see in such advice:

import itertools
list1 = [1, 2, 3]
list2 = [4, 5, 6]
list(itertools.product(list1, list2))
[(1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 6)]
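itertools.product generates the same pairs as a nested loop over the two lists, which you could also write as a list comprehension:

# Equivalent construction with a nested list comprehension
[(a, b) for a in list1 for b in list2]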

We wish to adjust this to get all possible age and educ combinations. Let's first see in what range age and educ vary.

df_wages[['age','educ']].describe()
              age        educ
count  753.000000  753.000000
mean    42.537849   12.286853
std      8.072574    2.280246
min     30.000000    5.000000
25%     36.000000   12.000000
50%     43.000000   12.000000
75%     49.000000   13.000000
max     60.000000   17.000000

So we want age to vary from 30 to 60 and educ from 5 to 17.

list_age = range(30,61)  # the end point, 61, is excluded; this creates 30, 31, ..., 60
list_educ = range(5,18)
combos = list(itertools.product(list_age, list_educ))

You can look at the resulting combos to confirm that this is a list with 403 (age, educ) combinations.
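A quick check confirms this:

print(len(combos))  # 31 age values x 13 educ values = 403 combinations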

The next question is how to turn this into a dataframe with two variables.


Using Chat-GPT to find a solution

Here you can see how to use Chat-GPT to help you with a coding problem.

Ask the Question

First you need to ask Chat-GPT (or another genAI) a precise question. It is best to clearly define a problem that is similar to yours, but stripped down to the smallest possible problem that represents the issue at hand. Here you could ask:

"I am working in Python and have list = [(1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 6)]. Can you help me turn this into a dataframe with two variables."

Understand the suggested solution

In this case we got the following proposed solution:

# Your list of tuples
data = [(1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 6)]

# Create a DataFrame with two variables from the list of tuples
df_wages_pred2 = pd.DataFrame(data, columns=['age', 'educ'])

Adapt the solution to your problem

Now you need to adapt the solution to your problem at hand.

x_pred_grid = pd.DataFrame(combos, columns=['age', 'educ'])
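As an aside, you could have built this grid without itertools; for instance, pandas' pd.MultiIndex.from_product produces the same set of combinations (just one of several alternatives):

# Alternative grid construction using pandas only
x_pred_grid_alt = pd.MultiIndex.from_product(
    [list_age, list_educ], names=['age', 'educ']
).to_frame(index=False)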


Now we can proceed to apply our estimated model to the generated age-educ combinations.

ols_pred_grid = res.get_prediction(sm.add_constant(x_pred_grid))
x_pred_grid["wage_pred1"] = ols_pred_grid.predicted_mean

Let's display this as if it were a function. The height of the function will be represented by the colour.

(
ggplot(x_pred_grid,aes(x = 'age', y='educ')) + 
    geom_raster(aes(fill = 'wage_pred1'), hjust = 0.5, vjust = 0.5)
)
[Figure: heat map of wage_pred1 (colour scale, roughly 1 to 6) over age (x-axis, 30 to 60) and educ (y-axis, 5 to 17)]

You can see that all the variation in the predicted wage is along the educ axis. That corresponds to the regression coefficient on educ being statistically significantly different from 0, while the coefficient on age is not.
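You can confirm this directly from the p-values stored in the results object:

# p-values of the coefficient estimates (age: 0.289, educ: effectively 0, as in the summary table)
print(res.pvalues)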

Summary

In this walk-through you learned how

  • to use the results from a regression to get predicted values for in- and out-of-sample observations
  • to create a grid of value combinations for the explanatory variables for which to create predictions

This walk-through is part of the ECLR page.