%load_ext pretty_jupyter
Introduction
This walk-through is part of the ECLR page.
Here you will learn to
- create summary tables
- create scatter plots using `ggplot` and `geom_point`
- run simple regression models
- apply heteroskedasticity-robust inference on regression coefficients
- display regression results
This follows the example in Kevin Sheppard's Introduction to Python Chapter 21.1 Regression. The statsmodels package has good documentation here.
Preparing your workfile
We import the basic packages needed for this work:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from lets_plot import *
import statsmodels.api as sm
Loading the Data
Here we use a dataset that is provided through the statsmodels package (see the documentation of the statecrime dataset for details).
d = sm.datasets.statecrime.load_pandas()
The data are now loaded into `d`. This is a dataset object and you can see the actual spreadsheet through its `.data` attribute, which we shall save into `df_crime`. The object `d` has additional attributes; for instance, it pre-sets which variables are endogenous and exogenous. Here we shall not use that information.
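If you are curious, the following minimal sketch shows what the dataset pre-defines (it assumes the standard statsmodels Dataset attributes `endog_name` and `exog_name`; check these against your statsmodels version):
# Inspect the pre-defined split of the statecrime dataset (not used further below)
print(d.endog_name)   # the pre-set dependent variable
print(d.exog_name)    # the pre-set explanatory variables
print(d.data.shape)   # rows and columns of the underlying DataFrame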
df_crime = d.data
df_crime.info()
(Output: a summary of the DataFrame — 51 rows, one per U.S. state plus the District of Columbia, indexed by state, with 7 numeric columns: violent, murder, hs_grad, poverty, single, white, urban.)
Data definitions are available from the statecrime dataset documentation referenced above. In short, the data are a cross-section of crime-related U.S. state-wide information:
- `violent`, rate of violent crimes per 100,000 population
- `murder`, rate of murders per 100,000 population
- `hs_grad`, percent of the population having graduated from high school or higher
- `poverty`, percent of individuals below the poverty line
- `white`, percent of the population that is one race, white only
- `single`, percent of single households
- `urban`, percent of the population living in Urbanized Areas
Here we may be interested in whether the state characteristics `urban`, `poverty`, `hs_grad` and `single` explain variation in either the rate of violent crimes (`violent`) or the murder rate (`murder`).
Simple descriptive data analysis
Before embarking on a regression analysis we would typically first look at some simple descriptive analysis and a graphical representation of the data.
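For instance, a quick first look at the data (a minimal sketch using the standard pandas `head()` method):
df_crime.head()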
df_crime.describe()
|  | violent | murder | hs_grad | poverty | single | white | urban |
|---|---|---|---|---|---|---|---|
| count | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 |
| mean | 411.482353 | 4.900000 | 86.878431 | 13.854902 | 25.186275 | 77.968627 | 60.670196 |
| std | 208.017447 | 3.646094 | 3.377769 | 3.110583 | 4.786565 | 13.553029 | 20.802432 |
| min | 119.900000 | 0.900000 | 79.900000 | 8.500000 | 17.900000 | 26.900000 | 17.380000 |
| 25% | 263.350000 | 2.650000 | 84.050000 | 11.500000 | 22.100000 | 70.600000 | 46.565000 |
| 50% | 366.400000 | 4.700000 | 87.400000 | 14.200000 | 25.500000 | 80.000000 | 59.170000 |
| 75% | 507.400000 | 6.150000 | 89.800000 | 16.400000 | 27.300000 | 87.750000 | 78.415000 |
| max | 1348.900000 | 24.200000 | 91.800000 | 21.900000 | 48.000000 | 95.800000 | 100.000000 |
We could also produce a scatter plot between, say, one of the variables to be explained (say `murder`) and one of the variables used to explain variation (say `poverty`).
# before using lets_plot you need to call the setup
LetsPlot.setup_html(no_js=True)
p1 = ggplot(df_crime, aes(x = 'poverty', y = 'murder')) + geom_point()
p1
p2 = ggplot(df_crime, aes(x = 'urban', y = 'murder')) + geom_point()
p2
There appears to be a slight positive correlation between `murder` and `poverty`, but no apparent correlation between `murder` and `urban`. Of course, this is not evidence of any causal relationship.
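If you want a number to back up this visual impression, you can compute the pairwise correlations; a minimal sketch using pandas' `corr()` method (output not shown here):
# Pairwise (Pearson) correlations between murder and two candidate regressors
df_crime[['murder', 'poverty', 'urban']].corr()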
Estimating a regression
The above scatter plots point at the value of estimating a regression model. While we could produce such plots for all combinations of variables, they ignore that `poverty` and `urban` may, of course, be correlated with each other. A regression model is able to disentangle this. Let's consider estimating such a model.
Before we can estimate the regression model we specify it and save that specification in an object `mod`. The first argument specifies the explained variable and the second the explanatory variables. The result is an object of the model class OLS (see the statsmodels documentation referenced above for other types of model classes).
mod = sm.OLS(df_crime['murder'],df_crime[['urban', 'poverty', 'hs_grad', 'single']])
In order to confirm what type of object `mod` is, you could run `type(mod)`, which will confirm that `mod` is of type `statsmodels.regression.linear_model.OLS`. Every object of that type has attributes and methods associated with it. You can figure out which by running `dir(mod)`.
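For instance, the following minimal sketch prints the type of `mod` and lists its public attributes and methods, filtering out the double-underscore names that `dir()` also returns:
# Confirm the class of the model object and list its public attributes/methods
print(type(mod))
print([name for name in dir(mod) if not name.startswith('_')])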
You can think of attributes as characteristics of the object and of methods as tools that can be applied to the object. The method of immediate importance is the one that actually estimates (or `fit`s) the model. We apply that method to the object `mod` using the command `mod.fit()` and save the result in `res`.
res = mod.fit()
print(res.summary())
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                 murder   R-squared (uncentered):              0.915
Model:                            OLS   Adj. R-squared (uncentered):         0.908
Method:                 Least Squares   F-statistic:                         126.9
Date:                Wed, 11 Dec 2024   Prob (F-statistic):               1.45e-24
Time:                        13:43:02   Log-Likelihood:                    -101.53
No. Observations:                  51   AIC:                                 211.1
Df Residuals:                      47   BIC:                                 218.8
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
urban         -0.0118      0.016     -0.745      0.460      -0.044       0.020
poverty        0.0348      0.106      0.327      0.745      -0.179       0.249
hs_grad       -0.1187      0.016     -7.614      0.000      -0.150      -0.087
single         0.6137      0.079      7.803      0.000       0.455       0.772
==============================================================================
Omnibus:                        3.288   Durbin-Watson:                   2.523
Prob(Omnibus):                  0.193   Jarque-Bera (JB):                2.663
Skew:                           0.200   Prob(JB):                        0.264
Kurtosis:                       4.046   Cond. No.                         53.0
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This looks very much like the standard regression output you would see from most statistical computing packages. One thing you may note is that there are two degrees-of-freedom (Df) entries: the model and the residual degrees of freedom. The model Df tells you how many explanatory variables were used (here 4) and the residual Df is the number of observations minus the number of estimated coefficients, here 51 - 4 = 47. The latter is the usual definition of degrees of freedom in the context of regression models.
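These quantities can also be retrieved directly from the results object; a minimal sketch using attributes that appear in the `dir(res)` listing shown further below:
# Degrees of freedom are available directly from the results object
print(res.nobs)      # number of observations, 51
print(res.df_model)  # number of explanatory variables, 4
print(res.df_resid)  # residual degrees of freedom, 51 - 4 = 47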
What you will notice here is that the regression did not include a constant. That is because the matrix of explanatory variables we handed to `sm.OLS` did not contain one. Many statistical procedures automatically include a constant (like the `lm` function in R), but this one does not, so we need to add a constant ourselves. This is done with `sm.add_constant` when we set up the model, as seen below. Let's re-specify and re-estimate the model with a constant.
# to make the code a little clearer we define y and x
y = df_crime['murder']
x = df_crime[['urban', 'poverty', 'hs_grad', 'single']]
mod = sm.OLS(y,sm.add_constant(x))
res = mod.fit()
print(res.summary())
                                 OLS Regression Results
==============================================================================
Dep. Variable:                 murder   R-squared:                       0.813
Model:                            OLS   Adj. R-squared:                  0.797
Method:                 Least Squares   F-statistic:                     50.08
Date:                Wed, 11 Dec 2024   Prob (F-statistic):           3.42e-16
Time:                        13:46:58   Log-Likelihood:                -95.050
No. Observations:                  51   AIC:                             200.1
Df Residuals:                      46   BIC:                             209.8
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -44.1024     12.086     -3.649      0.001     -68.430     -19.774
urban          0.0109      0.015      0.707      0.483      -0.020       0.042
poverty        0.4121      0.140      2.939      0.005       0.130       0.694
hs_grad        0.3059      0.117      2.611      0.012       0.070       0.542
single         0.6374      0.070      9.065      0.000       0.496       0.779
==============================================================================
Omnibus:                        1.618   Durbin-Watson:                   2.507
Prob(Omnibus):                  0.445   Jarque-Bera (JB):                0.831
Skew:                          -0.220   Prob(JB):                        0.660
Kurtosis:                       3.445   Cond. No.                     5.80e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.8e+03. This might indicate that there are strong multicollinearity or other numerical problems.
The `res` object we used to store the regression results can be thought of as a shelf full of interesting information. The `summary()` method gave us the big hits from that shelf, but you can access every individual element on it. To find out what is on that shelf you can again use the `dir(res)` command.
dir(res)
['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abat_diagonal', '_cache', '_data_attr', '_data_in_cache', '_get_robustcov_results', '_is_nested', '_use_t', '_wexog_singular_values', 'aic', 'bic', 'bse', 'centered_tss', 'compare_f_test', 'compare_lm_test', 'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params', 'cov_type', 'df_model', 'df_resid', 'diagn', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test', 'fittedvalues', 'fvalue', 'get_influence', 'get_prediction', 'get_robustcov_results', 'info_criteria', 'initialize', 'k_constant', 'llf', 'load', 'model', 'mse_model', 'mse_resid', 'mse_total', 'nobs', 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues', 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale', 'ssr', 'summary', 'summary2', 't_test', 't_test_pairwise', 'tvalues', 'uncentered_tss', 'use_t', 'wald_test', 'wald_test_terms', 'wresid']
Here are a few examples of what you could extract from `res`.
print("\nConfidence intervals for coefficient estimates")
print(res.conf_int())
print("\nAIC information criterion")
print(res.aic)
print("\nRegression fitted values")
print(res.fittedvalues)
Confidence intervals for coefficient estimates
                 0          1
const   -68.430362 -19.774469
urban    -0.020104   0.041880
poverty   0.129901   0.694399
hs_grad   0.070059   0.541795
single    0.495840   0.778910

AIC information criterion
200.10018977656853

Regression fitted values
state
Alabama                  7.240371
Alaska                   4.305784
Arizona                  5.709436
Arkansas                 6.047842
California               5.103820
Colorado                 3.010261
Connecticut              3.734915
Delaware                 5.426470
District of Columbia    21.810162
Florida                  6.040398
Georgia                  7.752260
Hawaii                   5.380746
Idaho                    1.495336
Illinois                 5.253719
Indiana                  4.585734
Iowa                     1.839635
Kansas                   3.940428
Kentucky                 5.193414
Louisiana                8.856294
Maine                    2.869247
Maryland                 4.940706
Massachusetts            4.287778
Michigan                 6.504815
Minnesota                1.930016
Mississippi             10.726801
Missouri                 5.211376
Montana                  3.126335
Nebraska                 2.345950
Nevada                   5.782611
New Hampshire            0.276226
New Jersey               3.958383
New Mexico               7.779862
New York                 7.810840
North Carolina           5.765752
North Dakota             0.319488
Ohio                     6.435508
Oklahoma                 5.768319
Oregon                   4.197993
Pennsylvania             4.325676
Rhode Island             4.934577
South Carolina           7.229609
South Dakota             2.836099
Tennessee                5.722915
Texas                    5.842075
Utah                     0.585888
Vermont                  2.200749
Virginia                 2.775294
Washington               3.374663
West Virginia            3.735693
Wisconsin                3.237746
Wyoming                  0.333984
dtype: float64
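You can also combine several of these elements into a table of your own; a minimal sketch using pandas (the column labels are my own choice):
# Build a compact coefficient table from individual result attributes
coef_table = pd.DataFrame({'coef': res.params, 'std err': res.bse,
                           't': res.tvalues, 'p-value': res.pvalues})
print(coef_table)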
Robust standard errors
The previous regression output states "Covariance Type: nonrobust". This means that the coefficient standard errors have been calculated with the standard formula, which assumes that the error terms are iid (also see note [1] in the regression summary output). You will have learned that it is almost standard practice to calculate heteroskedasticity-robust (or heteroskedasticity- and autocorrelation-robust) standard errors, often referred to as White (or Newey-West) standard errors. If you want these you have to let the `.fit` method know.
res_white=mod.fit(cov_type='HC0')
print(res_white.summary())
                                 OLS Regression Results
==============================================================================
Dep. Variable:                 murder   R-squared:                       0.813
Model:                            OLS   Adj. R-squared:                  0.797
Method:                 Least Squares   F-statistic:                     31.45
Date:                Wed, 11 Dec 2024   Prob (F-statistic):           1.23e-12
Time:                        13:52:19   Log-Likelihood:                -95.050
No. Observations:                  51   AIC:                             200.1
Df Residuals:                      46   BIC:                             209.8
Df Model:                           4
Covariance Type:                  HC0
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -44.1024     11.873     -3.715      0.000     -67.373     -20.832
urban          0.0109      0.013      0.814      0.416      -0.015       0.037
poverty        0.4121      0.115      3.595      0.000       0.187       0.637
hs_grad        0.3059      0.111      2.763      0.006       0.089       0.523
single         0.6374      0.082      7.733      0.000       0.476       0.799
==============================================================================
Omnibus:                        1.618   Durbin-Watson:                   2.507
Prob(Omnibus):                  0.445   Jarque-Bera (JB):                0.831
Skew:                          -0.220   Prob(JB):                        0.660
Kurtosis:                       3.445   Cond. No.                     5.80e+03
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC0)
[2] The condition number is large, 5.8e+03. This might indicate that there are strong multicollinearity or other numerical problems.
It turns out that these standard errors were already available from the first set of results (`res.HC0_se`), but if you want them to appear in your summary output it is useful to request them when you call the `fit` method.
The regression output illustrates that this did not change the coefficient estimates; only the standard errors of the coefficient estimates changed. They are now White standard errors.
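To see the difference directly you can put the two sets of standard errors next to each other; a minimal sketch:
# Compare non-robust and White (HC0) standard errors side by side
se_compare = pd.DataFrame({'nonrobust': res.bse, 'HC0 (White)': res_white.bse})
print(se_compare)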
If you wanted to implement Newey-West standard errors (here using 2 lags) to make standard errors robust to the presence of autocorrelation you would use the following.
res_NW=mod.fit(cov_type='HAC',cov_kwds={'maxlags':2})
print(res_NW.summary())
                                 OLS Regression Results
==============================================================================
Dep. Variable:                 murder   R-squared:                       0.813
Model:                            OLS   Adj. R-squared:                  0.797
Method:                 Least Squares   F-statistic:                     36.41
Date:                Wed, 11 Dec 2024   Prob (F-statistic):           1.03e-13
Time:                        13:57:12   Log-Likelihood:                -95.050
No. Observations:                  51   AIC:                             200.1
Df Residuals:                      46   BIC:                             209.8
Df Model:                           4
Covariance Type:                  HAC
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -44.1024     12.488     -3.532      0.000     -68.578     -19.627
urban          0.0109      0.014      0.804      0.422      -0.016       0.037
poverty        0.4121      0.114      3.604      0.000       0.188       0.636
hs_grad        0.3059      0.116      2.634      0.008       0.078       0.534
single         0.6374      0.080      7.922      0.000       0.480       0.795
==============================================================================
Omnibus:                        1.618   Durbin-Watson:                   2.507
Prob(Omnibus):                  0.445   Jarque-Bera (JB):                0.831
Skew:                          -0.220   Prob(JB):                        0.660
Kurtosis:                       3.445   Cond. No.                     5.80e+03
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 2 lags and without small sample correction
[2] The condition number is large, 5.8e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Again the parameter estimates remain unchanged while the standard errors change.
This particular application of Newey-West standard errors, of course, makes no sense, as these are not time-series data. It is therefore another example that not everything you can do in your statistical software actually makes sense. You need to keep your thinking hat on at all times.
Interpretation
As this is not an econometrics course but rather some support to implement econometric techniques we will not go into great detail here. But there are a couple of points worth remembering when looking at regression results.
First, how would we interpret a coefficient, say the coefficient on `hs_grad` in the above regression model? In the sample this was estimated to be $\hat{\beta}_3 = 0.31$. A mechanically correct interpretation is: "Ceteris paribus, a one unit increase in `hs_grad` correlates with a 0.31 unit increase in `murder`." Recall that ceteris paribus here implies that in our little thought experiment of increasing `hs_grad` by one unit we keep all other explanatory variables unchanged.
This is a very sterile and mechanical statement. To breathe more life into it, it is good to remind yourself of the relevant summary statistics:
df_crime[['murder','hs_grad']].describe()
|  | murder | hs_grad |
|---|---|---|
| count | 51.000000 | 51.000000 |
| mean | 4.900000 | 86.878431 |
| std | 3.646094 | 3.377769 |
| min | 0.900000 | 79.900000 |
| 25% | 2.650000 | 84.050000 |
| 50% | 4.700000 | 87.400000 |
| 75% | 6.150000 | 89.800000 |
| max | 24.200000 | 91.800000 |
A one unit increase in `hs_grad` is equivalent to a one percentage point increase in the high school graduation rate, a smallish but sensible change. But what is a 0.31 unit increase in `murder`? The average value of `murder` is 4.9, so an increase of 0.31 is equivalent to about a 6% increase relative to the mean. A one standard deviation increase in the high school graduation rate (about 3.4 percentage points) therefore correlates with roughly a 22% increase in the murder rate relative to its mean ($\approx 3.4 \times 0.31 / 4.9$).
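If you prefer to let Python do this back-of-the-envelope calculation, here is a minimal sketch that reproduces the numbers from the estimates and the summary statistics (the unrounded result is close to, but not exactly, the 22% quoted above):
# Effect of a one-standard-deviation increase in hs_grad on murder,
# expressed relative to the mean murder rate
beta_hs = res.params['hs_grad']          # estimated coefficient, about 0.31
sd_hs = df_crime['hs_grad'].std()        # about 3.4 percentage points
mean_murder = df_crime['murder'].mean()  # about 4.9
print(beta_hs * sd_hs / mean_murder)     # roughly 0.21, i.e. about a 21% increase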
You might find the direction of this correlation counter-intuitive: more education relates to more crime, which would suggest that a policy reducing high school education could reduce crime. This is a good reminder of two things:
- The effects estimated are partial correlations, i.e. measurements of the linear relation between an explanatory variable and `murder` accounting for the other explanatory variables. The more highly correlated the explanatory variables are with each other, the more difficult these partial effects are to interpret (check `x.corr()` to see how the variables in `x` are correlated amongst each other; see the sketch after this list).
- Irrespective of the previous point, in general you cannot attach a causal interpretation to these coefficient estimates. To be precise, you cannot claim that a policy that reduces the percentage of high school graduates in a state by 3.4 percentage points would cause a drop in the murder rate by 22%.
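A minimal sketch of the correlation check mentioned in the first point:
# Correlation matrix of the explanatory variables
print(x.corr())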
In order to make causal statements you will have to think very carefully. This is often called causal inference. While, in the end, a particular OLS regression may deliver the evidence, most OLS regressions will not deliver causal inference.
Summary
In this walkthrough you learned
- how to produce simple summary statistics (incl. some scatterplots)
- how to set up a linear regression model
- how to estimate a linear regression model by OLS
- how to access model statistics
- how to implement robust regression inference on single coefficients
- to be cautious about the interpretation of regression results
OLS regressions are a workhorse of empirical economics, so you have learned a lot.
This walk-through is part of the ECLR page.