%load_ext pretty_jupyter
Introduction¶
This walk-through is part of the ECLR page.
In this walk-through you will learn to
- use regression results to obtain in-sample predictions
- use regression results to obtain out-of-sample predictions
Preparing your workfile¶
We import the basic packages needed for this work:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from lets_plot import *
import statsmodels.api as sm
# Lets_plot requires a setup
LetsPlot.setup_html(no_js=True)
Data upload¶
Here we will be using the Women's Wages dataset (mroz) which is made available here.
# Load the dataset
url = "https://raw.githubusercontent.com/datasquad/ECLR/refs/heads/gh-pages/data/Mroz.csv"
df_wages = pd.read_csv(url)
Let's look at the list of variable names and their formats (more detailed variable descriptions are linked below):
print(df_wages.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 753 entries, 0 to 752
Data columns (total 22 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   inlf      753 non-null    int64  
 1   hours     753 non-null    int64  
 2   kidslt6   753 non-null    int64  
 3   kidsge6   753 non-null    int64  
 4   age       753 non-null    int64  
 5   educ      753 non-null    int64  
 6   wage      753 non-null    object 
 7   repwage   753 non-null    float64
 8   hushrs    753 non-null    int64  
 9   husage    753 non-null    int64  
 10  huseduc   753 non-null    int64  
 11  huswage   753 non-null    float64
 12  faminc    753 non-null    int64  
 13  mtr       753 non-null    float64
 14  motheduc  753 non-null    int64  
 15  fatheduc  753 non-null    int64  
 16  unem      753 non-null    float64
 17  city      753 non-null    int64  
 18  exper     753 non-null    int64  
 19  nwifeinc  753 non-null    float64
 20  lwage     753 non-null    object 
 21  expersq   753 non-null    int64  
dtypes: float64(5), int64(15), object(2)
memory usage: 129.5+ KB
None
All variables except wage and lwage are numeric. Why are wage and lwage not?
# Display basic summary statistics and structure of the dataset
print(df_wages['wage'].describe())
count     753
unique    374
top         .
freq      325
Name: wage, dtype: object
There are a lot of observations that take the value ".". These are observations for which there is no wage. As we didn't give pd.read_csv any other instruction, it interpreted these as text and could not treat the variable as numeric. It is best to re-load the data, letting pd.read_csv know that "." should be interpreted as a missing value.
# Re-load the dataset
df_wages = pd.read_csv(url, na_values = '.')
df_wages[['wage','lwage']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 753 entries, 0 to 752
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   wage    428 non-null    float64
 1   lwage   428 non-null    float64
dtypes: float64(2)
memory usage: 11.9 KB
This confirms that we are now dealing with numerical wage and lwage variables.
Details on the data are available from here. There are 753 observations of females with information on wages (wage), their education (educ), their age (age) and a number of other relevant variables.
Estimating a regression model¶
What we are about to do is the following. We shall first estimate the regression model

$$wage_i = \alpha_0 + \alpha_1 age_i + \alpha_2 educ_i + \epsilon_i$$

on the basis of all observations that have wage information. Subsequently we shall use the estimated model to predict the wages of those who are not currently working, i.e. we will be answering the question of how much they should expect to earn (on the basis of their age and education).
Before we do this, there are two reasons why we should take this analysis with a pinch of salt:
- Including the education variable in such an analysis is likely to be problematic, as education is likely to be correlated with the error term. This produces biased coefficient estimates (an issue you would want to address, for instance, by using an instrumental variables estimation).
- The decision of whether a woman works is most likely not random, and therefore the non-working women are not a randomly selected subset.
Despite these reservations we shall use this example to demonstrate how we use Python to produce predictions. Let's start by splitting our dataframe into the two subgroups (those for whom we do have wages and those for whom we don't).
We shall use the isna() and notna() methods to figure out which observations should go into the estimation dataset (df_wages_est) and the prediction dataset (df_wages_pred). Try and figure out what df_wages['wage'].isna() and df_wages['wage'].notna() do to understand how the splitting code below works. We use the copy() method to ensure that we actually copy the data into a new object.
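If you want to check, both methods return a Boolean Series which you can inspect and count. A minimal sketch (the counts in the comments follow from the summary statistics above):

print(df_wages['wage'].isna().head())  # Boolean Series: True where wage is missing
print(df_wages['wage'].isna().sum())   # 325 missing wages
print(df_wages['wage'].notna().sum())  # 428 observed wages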
df_wages_est = df_wages[df_wages['wage'].notna()].copy()
df_wages_pred = df_wages[df_wages['wage'].isna()].copy()
It is good to remind yourself that on most occasions in coding there are multiple ways to achieve the same thing. We could have used the .loc indexer to select the appropriate data. The following code achieves exactly the same thing.
df_wages_est = df_wages.loc[df_wages['wage'].notna()].copy()
df_wages_pred = df_wages.loc[df_wages['wage'].isna()].copy()
Those observations for which wage is missing go into the df_wages_pred dataframe and all others into df_wages_est.
We now run our regression on the data in the df_wages_est dataframe. To make the code easier to read we predefine a y and an x object to contain the explained variable (y) and the explanatory variables (x) respectively.
y = df_wages_est['wage']
x = df_wages_est[['age', 'educ']]
mod = sm.OLS(y, sm.add_constant(x)) # sm.add_constant adds a constant to the x dataframe
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.119
Model:                            OLS   Adj. R-squared:                  0.115
Method:                 Least Squares   F-statistic:                     28.78
Date:                Thu, 12 Dec 2024   Prob (F-statistic):           1.90e-12
Time:                        13:14:25   Log-Likelihood:                -1092.0
No. Observations:                 428   AIC:                             2190.
Df Residuals:                     425   BIC:                             2202.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.0090      1.211     -2.485      0.013      -5.389      -0.629
age            0.0207      0.020      1.061      0.289      -0.018       0.059
educ           0.4990      0.066      7.557      0.000       0.369       0.629
==============================================================================
Omnibus:                      339.142   Durbin-Watson:                   2.052
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6051.151
Skew:                           3.313   Prob(JB):                         0.00
Kurtosis:                      20.187   Cond. No.                         358.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
All the regression results are in res. You can learn more about accessing regression results from this walkthrough.
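For instance, a few commonly used results can be accessed directly as attributes of res (these are standard statsmodels attributes):

print(res.params)    # estimated coefficients
print(res.rsquared)  # R-squared of the regression
print(res.bse)       # standard errors of the coefficients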
In-sample predictions¶
If you run dir(res) you can see which elements you can access from the regression results object res. We will have to use the .get_prediction method. If you call ols_pred = res.get_prediction() you get an object that itself contains a number of elements. You can check out dir(ols_pred) to see what elements it contains.
ols_pred = res.get_prediction()
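To see what this prediction object offers, you could inspect it as follows. A sketch; conf_int() is a standard method of the statsmodels prediction results object:

# list the public elements of the prediction object
print([a for a in dir(ols_pred) if not a.startswith('_')])
print(ols_pred.conf_int()[:3])  # 95% confidence intervals for the first three predictions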
The bit we are interested in is predicted_mean, which contains the expected/predicted values, and we will add this to the df_wages_est dataframe as a variable.
df_wages_est["wage_pred1"] = ols_pred.predicted_mean
Now we have both the realisation and the prediction in the same dataframe. Let's start by confirming that indeed the prediction for the first observation results from the estimated coefficients.
df_wages_est.loc[0,['wage','age','educ','wage_pred1']]
wage           3.354000
age           32.000000
educ          12.000000
wage_pred1     3.642144
Name: 0, dtype: float64
We can use the estimated coefficients (available from res.params) to attempt to reconstruct the value in wage_pred1.
res.params.iloc[0] + res.params.iloc[1]*32 + res.params.iloc[2]*12 # using age = 32 and educ = 12; .iloc avoids deprecated positional indexing
3.64214444633467
This is indeed the value in the new wage_pred1 variable.
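An equivalent and more general check is to let statsmodels evaluate the fitted equation for you; res.predict accepts any rows of regressors that include the constant (a minimal sketch):

# evaluate the fitted model at the first observation's regressor values
x_first = sm.add_constant(x).iloc[[0]]
print(res.predict(x_first))  # should reproduce 3.642144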
We can plot a scatter graph to see how wage and wage_pred1 line up against each other.
(
ggplot(df_wages_est,aes(x = 'wage', y = 'wage_pred1')) +
geom_point() + geom_abline(intercept = 0, slope = 1) +
xlim(0, 25) + ylim(0, 25) +
labs(
title="In-sample wage and predictions",
x="Observed wage",
y="Predicted wage"
)
)
As you can see from the scatter graph, the predictions are not very accurate. Points that lie on the line represent observations for which realisation and prediction are identical. While the realisations vary between 0 and 25, the predictions range between 1 and 6.
Note that the model was a very simplistic model to explain variation in wages (only using education and age) so perhaps we should not expect to be able to explain a lot of the variation in realised wages ($R^2 = 0.119$).
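If you want a single number summarising in-sample accuracy, you could compute the root mean squared error of the predictions (a sketch, using the columns created above):

# root mean squared error of the in-sample predictions
rmse = np.sqrt(((df_wages_est['wage'] - df_wages_est['wage_pred1'])**2).mean())
print(rmse)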
Out-of-sample predictions, actual data¶
We now use this estimated model to produce predicted wages for all those observations that are in df_wages_pred, i.e. the observations where we did not observe any wages as these were women who were not working.
We start by setting up a dataframe x_pred with only the predictor variables, very much like we set up the object x for the model estimation. Then we call the .get_prediction method on the original regression results res (note we are not re-estimating the model!). Finally we draw the predictions from .predicted_mean and save them into the df_wages_pred dataframe.
x_pred = df_wages_pred[['age', 'educ']]
ols_pred_oos = res.get_prediction(sm.add_constant(x_pred))
df_wages_pred["wage_pred1"] = ols_pred_oos.predicted_mean
In df_wages_pred you now have a variable wage_pred1 that contains the wages predicted by the model that was estimated on the basis of the data in df_wages_est.
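A quick look at a few rows confirms that each non-working woman now has a predicted wage attached (the column selection is just for display):

print(df_wages_pred[['age', 'educ', 'wage_pred1']].head())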
Out-of-sample predictions, generated data¶
Sometimes you don't have actual observations for which you want to produce forecasts, but rather you have a set of conditioning values at which you want to calculate predictions. You have to create this set of age and education combinations yourself. If only there was an easy way to do that!
This is not a task you are likely to do very often and therefore it is unlikely to be a memory item. So you should go to your favourite search engine and search for something like "Python create combinations of two variables". You are likely to find advice that refers to the itertools module. This is a possible short example you may see in such advice:
import itertools
list1 = [1, 2, 3]
list2 = [4, 5, 6]
list(itertools.product(list1, list2))
[(1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 6)]
We wish to adjust this to get all possible age and educ combinations. Let's first see in what range age and educ vary.
df_wages[['age','educ']].describe()
|       | age        | educ       |
|-------|------------|------------|
| count | 753.000000 | 753.000000 |
| mean  | 42.537849  | 12.286853  |
| std   | 8.072574   | 2.280246   |
| min   | 30.000000  | 5.000000   |
| 25%   | 36.000000  | 12.000000  |
| 50%   | 43.000000  | 12.000000  |
| 75%   | 49.000000  | 13.000000  |
| max   | 60.000000  | 17.000000  |
So we want age to vary from 30 to 60 and educ from 5 to 17.
list_age = range(30, 61) # the end point, 61, is the first number NOT included
list_educ = range(5,18)
combos = list(itertools.product(list_age, list_educ))
You can look at the resulting combos to confirm that this is a list with 403 (age, educ) combinations.
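For instance (a quick check; 31 age values times 13 education levels gives 403):

print(len(combos))  # 403 combinations
print(combos[:3])   # the first few (age, educ) pairs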
The next question is how to turn this into a dataframe with two variables.
Using ChatGPT to find a solution¶
Here you can see how to use ChatGPT to help you with a coding problem.
Ask the Question¶
First you need to ask ChatGPT (or another genAI tool) a precise question. It is best to clearly define a problem that is similar to yours, but stripped down to the smallest possible problem that represents the issue at hand. Here you could ask:
"I am working in Python and have list = [(1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 6)]. Can you help me turn this into a dataframe with two variables."
Understand the suggested solution¶
In this case we got the following proposed solution:
# Your list of tuples
data = [(1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 6)]
# Create a DataFrame with two variables from the list of tuples
df_wages_pred2 = pd.DataFrame(data, columns=['age', 'educ'])
Adapt the solution to your problem¶
Now you need to adapt the solution to your problem at hand.
x_pred_grid = pd.DataFrame(combos, columns=['age', 'educ'])
Now we can proceed to apply our estimated model to the generated age-educ combinations.
ols_pred_grid = res.get_prediction(sm.add_constant(x_pred_grid))
x_pred_grid["wage_pred1"] = ols_pred_grid.predicted_mean
Let's display this as if it were a function. The height of the function is represented by the colour.
(
ggplot(x_pred_grid,aes(x = 'age', y='educ')) +
geom_raster(aes(fill = 'wage_pred1'), hjust = 0.5, vjust = 0.5)
)
You can see that all the variation in the predicted wage is along the educ axis. That corresponds to the regression coefficient for that variable being statistically significantly different from 0, while the coefficient on age is not.
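You can confirm this numerically from the estimated coefficients: over the grid ranges used above, age and educ move the predicted wage by very different amounts (a rough check using res.params):

# predicted wage span attributable to each variable over its grid range
print(res.params['age'] * (60 - 30))  # roughly 0.62 across all ages
print(res.params['educ'] * (17 - 5))  # roughly 5.99 across all education levels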
Summary¶
In this walk-through you learned how
- to use the results from a regression to get predicted values for in-sample and out-of-sample observations
- to create a grid of value combinations for the explanatory variables for which to create predictions
This walk-through is part of the ECLR page.