%load_ext pretty_jupyter
Introduction
This walk-through is part of the ECLR page.
Here you will learn to
- create summary tables
- create scatter plots using `ggplot` and `geom_point`
- run simple regression models
- apply heteroskedasticity-robust inference on regression coefficients
- display regression results
This follows the example in Kevin Sheppard's Introduction to Python Chapter 21.1 Regression. The statsmodels package has good documentation here.
Preparing your workfile
We import the basic packages needed for this work:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from lets_plot import *
import statsmodels.api as sm
Loading the Data
Here we use a dataset that is provided through the statsmodels package (see the documentation of the statecrime dataset for details).
d = sm.datasets.statecrime.load_pandas()
The data are now loaded into `d`. This is a dataset object and you can see the actual spreadsheet through its `.data` attribute, which we shall save into `df_crime`. The object `d` has additional attributes; for instance, it pre-sets which variables are endogenous and exogenous. Here we shall not use that information.
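If you are curious, the following minimal sketch shows what the dataset pre-defines (it assumes the standard statsmodels Dataset attributes `endog_name` and `exog_name`; check these against your statsmodels version):
# Inspect the pre-defined split of the statecrime dataset (not used further below)
print(d.endog_name)   # the pre-set dependent variable
print(d.exog_name)    # the pre-set explanatory variables
print(d.data.shape)   # rows and columns of the underlying DataFrame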
df_crime = d.data
df_crime.info()
(Output: a summary of the DataFrame — 51 rows, one per U.S. state plus the District of Columbia, indexed by state, with 7 numeric columns: violent, murder, hs_grad, poverty, single, white, urban.)
Data definitions are available from the statecrime dataset documentation referenced above. In short, the data are a cross-section of crime-related U.S. state-wide information:
- `violent`, rate of violent crimes per 100,000 population
- `murder`, rate of murders per 100,000 population
- `hs_grad`, percent of the population having graduated from high school or higher
- `poverty`, percent of individuals below the poverty line
- `white`, percent of the population that is one race, white only
- `single`, percent of single households
- `urban`, percent of the population living in Urbanized Areas
Here we may be interested in whether the state characteristics `urban`, `poverty`, `hs_grad` and `single` explain variation in either the rate of violent crimes (`violent`) or the murder rate (`murder`).
Simple descriptive data analysis
Before embarking on a regression analysis we would typically first look at some simple descriptive analysis and a graphical representation of the data.
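For instance, a quick first look at the data (a minimal sketch using the standard pandas `head()` method):
df_crime.head()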
df_crime.describe()
|  | violent | murder | hs_grad | poverty | single | white | urban |
|---|---|---|---|---|---|---|---|
| count | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 |
| mean | 411.482353 | 4.900000 | 86.878431 | 13.854902 | 25.186275 | 77.968627 | 60.670196 |
| std | 208.017447 | 3.646094 | 3.377769 | 3.110583 | 4.786565 | 13.553029 | 20.802432 |
| min | 119.900000 | 0.900000 | 79.900000 | 8.500000 | 17.900000 | 26.900000 | 17.380000 |
| 25% | 263.350000 | 2.650000 | 84.050000 | 11.500000 | 22.100000 | 70.600000 | 46.565000 |
| 50% | 366.400000 | 4.700000 | 87.400000 | 14.200000 | 25.500000 | 80.000000 | 59.170000 |
| 75% | 507.400000 | 6.150000 | 89.800000 | 16.400000 | 27.300000 | 87.750000 | 78.415000 |
| max | 1348.900000 | 24.200000 | 91.800000 | 21.900000 | 48.000000 | 95.800000 | 100.000000 |
We could also produce a scatter plot between, say, one of the variables to be explained (say `murder`) and one of the variables used to explain variation (say `poverty`).
# before using lets_plot you need to call the setup
LetsPlot.setup_html(no_js=True)
p1 = ggplot(df_crime, aes(x = 'poverty', y = 'murder')) + geom_point()
p1
p2 = ggplot(df_crime, aes(x = 'urban', y = 'murder')) + geom_point()
p2
There appears to be a slight positive correlation between `murder` and `poverty`, but no apparent correlation between `murder` and `urban`. Of course, this is not evidence of any causal relationship.
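If you want a number to back up this visual impression, you can compute the pairwise correlations; a minimal sketch using pandas' `corr()` method (output not shown here):
# Pairwise (Pearson) correlations between murder and two candidate regressors
df_crime[['murder', 'poverty', 'urban']].corr()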
Estimating a regression
The above scatter plots point at the value of estimating a regression model. While we could produce such plots for all combinations of variables, they ignore that `poverty` and `urban` may, of course, be correlated with each other. A regression model is able to disentangle this. Let's consider estimating such a model.
Before we can estimate the regression model we specify it and save that specification in an object `mod`. The first argument specifies the explained variable and the second the explanatory variables. The result is an object of the model class OLS (see the statsmodels documentation referenced above for other types of model classes).
mod = sm.OLS(df_crime['murder'],df_crime[['urban', 'poverty', 'hs_grad', 'single']])
In order to confirm what type of object `mod` is, you could run `type(mod)`, which will confirm that `mod` is of type `statsmodels.regression.linear_model.OLS`. Every object of that type has attributes and methods associated with it. You can figure out which by running `dir(mod)`.
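For instance, the following minimal sketch prints the type of `mod` and lists its public attributes and methods, filtering out the double-underscore names that `dir()` also returns:
# Confirm the class of the model object and list its public attributes/methods
print(type(mod))
print([name for name in dir(mod) if not name.startswith('_')])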
You can think of attributes as characteristics of the object and of methods as tools that can be applied to the object. The method of immediate importance is the one that actually estimates (or `fit`s) the model. We apply that method to the object `mod` using the command `mod.fit()` and save the result in `res`.
res = mod.fit()
print(res.summary())
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                 murder   R-squared (uncentered):              0.915
Model:                            OLS   Adj. R-squared (uncentered):         0.908
Method:                 Least Squares   F-statistic:                         126.9
Date:                Wed, 11 Dec 2024   Prob (F-statistic):               1.45e-24
Time:                        13:43:02   Log-Likelihood:                    -101.53
No. Observations:                  51   AIC:                                 211.1
Df Residuals:                      47   BIC:                                 218.8
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
urban         -0.0118      0.016     -0.745      0.460      -0.044       0.020
poverty        0.0348      0.106      0.327      0.745      -0.179       0.249
hs_grad       -0.1187      0.016     -7.614      0.000      -0.150      -0.087
single         0.6137      0.079      7.803      0.000       0.455       0.772
==============================================================================
Omnibus:                        3.288   Durbin-Watson:                   2.523
Prob(Omnibus):                  0.193   Jarque-Bera (JB):                2.663
Skew:                           0.200   Prob(JB):                        0.264
Kurtosis:                       4.046   Cond. No.                         53.0
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This looks very much like the standard regression output you would see from most statistical computing packages. One thing you may note is that there are two degrees-of-freedom (Df) entries: the model and the residual degrees of freedom. The model Df tells you how many explanatory variables were used (here 4) and the residual Df is the number of observations minus the number of estimated coefficients, here 51 - 4 = 47. The latter is the usual definition of degrees of freedom in the context of regression models.
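These quantities can also be retrieved directly from the results object; a minimal sketch using attributes that appear in the `dir(res)` listing shown further below:
# Degrees of freedom are available directly from the results object
print(res.nobs)      # number of observations, 51
print(res.df_model)  # number of explanatory variables, 4
print(res.df_resid)  # residual degrees of freedom, 51 - 4 = 47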
What you will notice here is that the regression did not include a constant. That is because the matrix of explanatory variables we handed to `sm.OLS` did not contain one. Many statistical procedures automatically include a constant (like the `lm` function in R), but this one does not, so we need to add a constant ourselves. This is done with `sm.add_constant` when we set up the model, as seen below. Let's re-specify and re-estimate the model with a constant.
# to make the code a little clearer we define y and x
y = df_crime['murder']
x = df_crime[['urban', 'poverty', 'hs_grad', 'single']]
mod = sm.OLS(y,sm.add_constant(x))
res = mod.fit()
print(res.summary())
                                 OLS Regression Results
==============================================================================
Dep. Variable:                 murder   R-squared:                       0.813
Model:                            OLS   Adj. R-squared:                  0.797
Method:                 Least Squares   F-statistic:                     50.08
Date:                Wed, 11 Dec 2024   Prob (F-statistic):           3.42e-16
Time:                        13:46:58   Log-Likelihood:                -95.050
No. Observations:                  51   AIC:                             200.1
Df Residuals:                      46   BIC:                             209.8
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -44.1024     12.086     -3.649      0.001     -68.430     -19.774
urban          0.0109      0.015      0.707      0.483      -0.020       0.042
poverty        0.4121      0.140      2.939      0.005       0.130       0.694
hs_grad        0.3059      0.117      2.611      0.012       0.070       0.542
single         0.6374      0.070      9.065      0.000       0.496       0.779
==============================================================================
Omnibus:                        1.618   Durbin-Watson:                   2.507
Prob(Omnibus):                  0.445   Jarque-Bera (JB):                0.831
Skew:                          -0.220   Prob(JB):                        0.660
Kurtosis:                       3.445   Cond. No.                     5.80e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.8e+03. This might indicate that there are strong multicollinearity or other numerical problems.
The `res` object we used to store the regression results can be thought of as a shelf full of interesting information. The `summary()` method gave us the big hits from that shelf, but you can access every individual element on it. To find out what is on that shelf you can again use the `dir(res)` command.
dir(res)
['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abat_diagonal', '_cache', '_data_attr', '_data_in_cache', '_get_robustcov_results', '_is_nested', '_use_t', '_wexog_singular_values', 'aic', 'bic', 'bse', 'centered_tss', 'compare_f_test', 'compare_lm_test', 'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params', 'cov_type', 'df_model', 'df_resid', 'diagn', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test', 'fittedvalues', 'fvalue', 'get_influence', 'get_prediction', 'get_robustcov_results', 'info_criteria', 'initialize', 'k_constant', 'llf', 'load', 'model', 'mse_model', 'mse_resid', 'mse_total', 'nobs', 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues', 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale', 'ssr', 'summary', 'summary2', 't_test', 't_test_pairwise', 'tvalues', 'uncentered_tss', 'use_t', 'wald_test', 'wald_test_terms', 'wresid']
Here are a few examples of what you could extract from `res`.
print("\nConfidence intervals for coefficient estimates")
print(res.conf_int())
print("\nAIC information criterion")
print(res.aic)
print("\nRegression fitted values")
print(res.fittedvalues)
Confidence intervals for coefficient estimates
                 0          1
const   -68.430362 -19.774469
urban    -0.020104   0.041880
poverty   0.129901   0.694399
hs_grad   0.070059   0.541795
single    0.495840   0.778910

AIC information criterion
200.10018977656853

Regression fitted values
state
Alabama                  7.240371
Alaska                   4.305784
Arizona                  5.709436
Arkansas                 6.047842
California               5.103820
Colorado                 3.010261
Connecticut              3.734915
Delaware                 5.426470
District of Columbia    21.810162
Florida                  6.040398
Georgia                  7.752260
Hawaii                   5.380746
Idaho                    1.495336
Illinois                 5.253719
Indiana                  4.585734
Iowa                     1.839635
Kansas                   3.940428
Kentucky                 5.193414
Louisiana                8.856294
Maine                    2.869247
Maryland                 4.940706
Massachusetts            4.287778
Michigan                 6.504815
Minnesota                1.930016
Mississippi             10.726801
Missouri                 5.211376
Montana                  3.126335
Nebraska                 2.345950
Nevada                   5.782611
New Hampshire            0.276226
New Jersey               3.958383
New Mexico               7.779862
New York                 7.810840
North Carolina           5.765752
North Dakota             0.319488
Ohio                     6.435508
Oklahoma                 5.768319
Oregon                   4.197993
Pennsylvania             4.325676
Rhode Island             4.934577
South Carolina           7.229609
South Dakota             2.836099
Tennessee                5.722915
Texas                    5.842075
Utah                     0.585888
Vermont                  2.200749
Virginia                 2.775294
Washington               3.374663
West Virginia            3.735693
Wisconsin                3.237746
Wyoming                  0.333984
dtype: float64
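You can also combine several of these elements into a table of your own; a minimal sketch using pandas (the column labels are my own choice):
# Build a compact coefficient table from individual result attributes
coef_table = pd.DataFrame({'coef': res.params, 'std err': res.bse,
                           't': res.tvalues, 'p-value': res.pvalues})
print(coef_table)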
Robust standard errors
The previous regression output states "Covariance Type: nonrobust". This means that the coefficient standard errors have been calculated with the standard formula, which assumes that the error terms are iid (also see note [1] in the regression summary output). You will have learned that it is almost standard practice to calculate heteroskedasticity-robust (or heteroskedasticity- and autocorrelation-robust) standard errors, often referred to as White (or Newey-West) standard errors. If you want these you have to let the `.fit` method know.
res_white=mod.fit(cov_type='HC0')
print(res_white.summary())
                                 OLS Regression Results
==============================================================================
Dep. Variable:                 murder   R-squared:                       0.813
Model:                            OLS   Adj. R-squared:                  0.797
Method:                 Least Squares   F-statistic:                     31.45
Date:                Wed, 11 Dec 2024   Prob (F-statistic):           1.23e-12
Time:                        13:52:19   Log-Likelihood:                -95.050
No. Observations:                  51   AIC:                             200.1
Df Residuals:                      46   BIC:                             209.8
Df Model:                           4
Covariance Type:                  HC0
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -44.1024     11.873     -3.715      0.000     -67.373     -20.832
urban          0.0109      0.013      0.814      0.416      -0.015       0.037
poverty        0.4121      0.115      3.595      0.000       0.187       0.637
hs_grad        0.3059      0.111      2.763      0.006       0.089       0.523
single         0.6374      0.082      7.733      0.000       0.476       0.799
==============================================================================
Omnibus:                        1.618   Durbin-Watson:                   2.507
Prob(Omnibus):                  0.445   Jarque-Bera (JB):                0.831
Skew:                          -0.220   Prob(JB):                        0.660
Kurtosis:                       3.445   Cond. No.                     5.80e+03
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC0)
[2] The condition number is large, 5.8e+03. This might indicate that there are strong multicollinearity or other numerical problems.
It turns out that these standard errors were already available from the first set of results (`res.HC0_se`), but if you want them to appear in your summary output it is useful to request them when you call the `fit` method.
The regression output illustrates that this did not change the coefficient estimates; only the standard errors of the coefficient estimates changed. They are now White standard errors.
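To see the difference directly you can put the two sets of standard errors next to each other; a minimal sketch:
# Compare non-robust and White (HC0) standard errors side by side
se_compare = pd.DataFrame({'nonrobust': res.bse, 'HC0 (White)': res_white.bse})
print(se_compare)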
If you wanted to implement Newey-West standard errors (here using 2 lags) to make standard errors robust to the presence of autocorrelation you would use the following.
res_NW=mod.fit(cov_type='HAC',cov_kwds={'maxlags':2})
print(res_NW.summary())
                                 OLS Regression Results
==============================================================================
Dep. Variable:                 murder   R-squared:                       0.813
Model:                            OLS   Adj. R-squared:                  0.797
Method:                 Least Squares   F-statistic:                     36.41
Date:                Wed, 11 Dec 2024   Prob (F-statistic):           1.03e-13
Time:                        13:57:12   Log-Likelihood:                -95.050
No. Observations:                  51   AIC:                             200.1
Df Residuals:                      46   BIC:                             209.8
Df Model:                           4
Covariance Type:                  HAC
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -44.1024     12.488     -3.532      0.000     -68.578     -19.627
urban          0.0109      0.014      0.804      0.422      -0.016       0.037
poverty        0.4121      0.114      3.604      0.000       0.188       0.636
hs_grad        0.3059      0.116      2.634      0.008       0.078       0.534
single         0.6374      0.080      7.922      0.000       0.480       0.795
==============================================================================
Omnibus:                        1.618   Durbin-Watson:                   2.507
Prob(Omnibus):                  0.445   Jarque-Bera (JB):                0.831
Skew:                          -0.220   Prob(JB):                        0.660
Kurtosis:                       3.445   Cond. No.                     5.80e+03
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 2 lags and without small sample correction
[2] The condition number is large, 5.8e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Again the parameter estimates remain unchanged while the standard errors change.
This particular application of Newey-West standard errors, of course, makes no sense, as these are not time-series data. It is therefore another example that not everything you can do in your statistical software actually makes sense. You need to keep your thinking hat on at all times.
Interpretation
As this is not an econometrics course but rather some support to implement econometric techniques we will not go into great detail here. But there are a couple of points worth remembering when looking at regression results.
First, how would we interpret a coefficient, say the coefficient on `hs_grad` in the above regression model? In the sample this was estimated to be $\hat{\beta}_3 = 0.31$. A mechanically correct interpretation is: "Ceteris paribus, a one unit increase in `hs_grad` correlates with a 0.31 unit increase in `murder`." Recall that ceteris paribus here implies that in our little thought experiment of increasing `hs_grad` by one unit we keep all other explanatory variables unchanged.
This is a very sterile and mechanical statement. To breathe more life into it, it is good to remind yourself of the relevant summary statistics:
df_crime[['murder','hs_grad']].describe()
|  | murder | hs_grad |
|---|---|---|
| count | 51.000000 | 51.000000 |
| mean | 4.900000 | 86.878431 |
| std | 3.646094 | 3.377769 |
| min | 0.900000 | 79.900000 |
| 25% | 2.650000 | 84.050000 |
| 50% | 4.700000 | 87.400000 |
| 75% | 6.150000 | 89.800000 |
| max | 24.200000 | 91.800000 |
A one unit increase in `hs_grad` is equivalent to a one percentage point increase in the high school graduation rate, a smallish but sensible change. But what is a 0.31 unit increase in `murder`? The average value of `murder` is 4.9, so an increase of 0.31 is equivalent to about a 6% increase relative to the mean. A one standard deviation increase in the high school graduation rate (about 3.4 percentage points) therefore correlates with roughly a 22% increase in the murder rate relative to its mean ($\approx 3.4 \times 0.31 / 4.9$).
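If you prefer to let Python do this back-of-the-envelope calculation, here is a minimal sketch that reproduces the numbers from the estimates and the summary statistics (the unrounded result is close to, but not exactly, the 22% quoted above):
# Effect of a one-standard-deviation increase in hs_grad on murder,
# expressed relative to the mean murder rate
beta_hs = res.params['hs_grad']          # estimated coefficient, about 0.31
sd_hs = df_crime['hs_grad'].std()        # about 3.4 percentage points
mean_murder = df_crime['murder'].mean()  # about 4.9
print(beta_hs * sd_hs / mean_murder)     # roughly 0.21, i.e. about a 21% increase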
You might find the direction of this correlation counter-intuitive: more education relates to more crime, which would suggest that a policy reducing high school education could reduce crime. This is a good reminder of two things:
- The effects estimated are partial correlations, i.e. measurements of the linear relation between an explanatory variable and `murder` accounting for the other explanatory variables. The more highly correlated the explanatory variables are with each other, the more difficult these partial effects are to interpret (check `x.corr()` to see how the variables in `x` are correlated amongst each other; see the sketch after this list).
- Irrespective of the previous point, in general you cannot attach a causal interpretation to these coefficient estimates. To be precise, you cannot claim that a policy that reduces the percentage of high school graduates in a state by 3.4 percentage points would cause a drop in the murder rate by 22%.
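A minimal sketch of the correlation check mentioned in the first point:
# Correlation matrix of the explanatory variables
print(x.corr())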
In order to make causal statements you will have to think very carefully. This is often called causal inference. While, in the end, a particular OLS regression may deliver the evidence, most OLS regressions will not deliver causal inference.
Summary
In this walkthrough you learned
- how to produce simple summary statistics (incl. some scatterplots)
- how to set up a linear regression model
- how to estimate a linear regression model by OLS
- how to access model statistics
- how to implement robust regression inference on single coefficients
- to be cautious about the interpretation of regression results
OLS regressions are a workhorse of empirical economics, so you have learned a lot.
This walk-through is part of the ECLR page.