Background Notes for ECON20222

Introduction

These are the background notes for ECON20222 Quantitative Methods. You are assumed to have taken a first-year course in statistics, and we therefore assume that you are familiar with descriptive statistics, statistical inference (hypothesis testing and confidence intervals) and the basics of regression analysis as a descriptive tool, as well as basic inference on regression parameters.

These notes establish the methodology/econometrics (algebra/notation/terminology) that we will use throughout the unit.

These background notes are essential reading and are examinable. The notes and slides go side by side, except that these notes focus solely on the methodology/econometrics, whereas the slides also refer to the empirical examples estimated using R. The book by Angrist and Pischke (2015) gives a flavour of where the subject currently stands, but is too detailed for what we want to cover. The same topics are covered in Cunningham’s (2021) text, Causal Inference: The Mixtape, which can be accessed online. The notes are taken mainly from another well-known econometrics text, Introductory Econometrics: A Modern Approach, by Wooldridge (2025).

I am grateful to my colleague Martyn Andrews, on whose material these notes are based.

The simple regression model

The general case

We start with the population model:

\[ y = \alpha + \beta x + u. \tag{1}\]

The sample version of this model is written:

\[ y_i = \alpha + \beta x_i + u_i, \]

as Equation 1 holds for each observation $i$. This notation emphasises that we have a sample of data at hand:

\[ \{(y_i, x_i): i=1,\dots,n \}. \]

The subscript is useful as it reflects the type of data we are using, eg subscript $i$ is used for cross-section data, $t$ is used for time-series data, and $it$ is used for panel data or repeated cross-sections. The subscript is also known as the unit of observation.

Card and Krueger - Example

In the Card and Krueger example, $y_{it}$ and $x_{it}$ are $EMPFT_{it}$ (the number of full-time employees) and $WAGE_{it}$ (the starting wage), and the Card-Krueger dataset is an example of microeconometric panel data. The unit of observation is a fast-food outlet ($i$) in a given half-year ($t$).

By contrast to $(y_i, x_i)$, $u_i$ is unobserved, but, like $(y_i,x_i)$, $u_i$ is a random variable with mean, variance, etc.

The ultimate objective is to estimate the parameters $\alpha$ and $\beta$ using some econometric estimator. The most well-known is OLS. To do this we need 2 assumptions because there are 2 unknown parameters $\alpha$ and $\beta$. They are:

\[ E(u) = 0~(A1) \]

\[ E(u | x) = 0 ~ (A2). \]

These are often called moment conditions or exogeneity assumptions. Assumption A2 is absolutely crucial throughout everything we do in empirical economics. It is best to think of $E(u | x) = 0$ as the same as $E(u x) = 0$. If $E(u x)=0$ holds then, together with A1, it follows that $Cov(u,x) = E(ux) - E(u)E(x) = 0$. In words, $u$ and $x$ are said to be uncorrelated. What this means, and whether it is true, is a crucial and recurring theme in this course.

Next, from Equation 1

\[ \begin{align*} E(y|x) &= E(\alpha |x) + E(\beta x |x) + E(u|x) \\ &= \alpha + \beta x + E(u|x) \end{align*} \]

by conditioning on $x$. We now use A2 to define the population regression function or conditional expectation function:

\[ E(y|x) = \alpha + \beta x. \]

When this is estimated, it is written

\[ \hat{y}_i = \hat{\alpha} + \hat{\beta} x_i, \]

and is known as the sample regression function or regression line or line of best fit. It is often written out in full, with numbers in brackets underneath the numerical estimates. These are called standard errors (more below). The estimate and its standard error allow us to test various hypotheses about the parameters (typically on $\beta$).

From this sample regression function,

\[ \hat{\beta} = \Delta \hat{y}_i/ \Delta x_i, \]

and so the correct way to interpret $\hat{\beta}$ is the resulting change in $\hat{y}_i$ for a one unit change in $x_i$. If $y$ is specified as a logged variable in Equation 1, $\log y_i = \alpha + \beta x_i + u_i$, we have

\[ 100\hat{\beta} \approx 100 \frac{\Delta \hat{y}_i/\hat{y}_i}{\Delta x_i} \equiv \frac{\%\Delta \hat{y}_i}{\Delta x_i}, \]

which means that $100\hat{\beta}$ is approximately the percentage change in $\hat{y}$ for a one-unit change in $x$.

Percentage point v. percentage change

A percentage change is not the same as a percentage point change. For instance, if the unemployment rate increases from 4% to 6%, that is a 2 percentage point increase but a 50% increase in the unemployment rate. The interpretation of $100\hat{\beta}$ above is of the second kind: a change relative to the initial level of $y$.

If both $y$ and $x$ are specified as logged variables in Equation 1, $\log y_i = \alpha + \beta \log x_i + u_i$, we have

\[ \hat{\beta} \approx \frac{\Delta \hat{y}_i/\hat{y}_i}{\Delta x_i/x_i} \equiv \frac{\%\Delta \hat{y}_i}{\%\Delta x_i}, \]

which means that $\hat{\beta}$ is the percentage change in $y$ for a one percent change in $x$. Now $\hat{\beta}$ is interpreted as an elasticity. Elasticities are very useful because they are unit-free. (See the appendix to this section for another unit-free interpretation.)

Derivation of OLS Estimators

We now derive the Ordinary Least Squares or OLS estimators $\hat{\alpha}$ and $\hat{\beta}$, using A1 and A2. These are formulae for $\hat{\alpha}$ and $\hat{\beta}$ which only involve quantities that can be computed from a sample of data. First, we form expectations of both sides of Equation 1:

\[ \begin{align*} E(y) &= E(\alpha) + E(\beta x) + E(u), \\ &= \alpha + \beta E(x) + E(u), \end{align*} \]

and then use A1:

\[ E(y) = \alpha + \beta E(x). \tag{2}\]

Subtracting Equation 2 from Equation 1:

\[ y-E(y) = \beta [x-E(x)] + u. \]

Now multiply by $x-E(x)$,

\[ [y-E(y)][x-E(x)] = \beta [x-E(x)]^2 + u[x-E(x)], \]

and form expectations of each term:

\[ Cov(y,x) = \beta~Var(x) + Cov(u,x). \]

We now use Assumption A2, in the form $Cov(u,x)=0$, and so obtain the following expression for the population parameter $\beta$:

\[ \beta = \frac{Cov(y,x)}{Var(x)}. \tag{3}\]

To compute this, we replace population variances and covariances by their sample equivalents (the analogy principle), and so obtain the sample estimator (unique for any given sample):

\[ \hat{\beta} = \frac{\widehat{Cov}(y,x)}{\widehat{Var}(x)} = \frac{\Sigma_i (y_i-\bar{y})(x_i-\bar{x})/n}{\Sigma_i (x_i-\bar{x})^2/n} = \frac{\Sigma_i (y_i-\bar{y})(x_i-\bar{x}) }{\Sigma_i (x_i-\bar{x})^2 }, \tag{4}\]

where $\bar{x} = \frac{1}{n} \Sigma_i x_i$ and $\bar{y} = \frac{1}{n} \Sigma_i y_i$ are the usual sample means. All this assumes that $\widehat{Var}(x)>0$, that is, $x$ exhibits some variation in the sample. Finally, from Equation 2, we can write:

\[ \alpha = E(y)- \beta E(x), \]

which we compute using the sample analogue of Equation 3 (that is, $\hat{\beta}$ from Equation 4) and the sample means $\bar{y}$ and $\bar{x}$:

\[ \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}. \]
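The two formulae can be checked by hand on a small dataset. The following is a minimal sketch in Python (used here purely for illustration; the unit itself works in R). The function name `ols_simple` and the five data points are made up for this example and are not part of the course material.

```python
from statistics import fmean

def ols_simple(x, y):
    """Return (alpha_hat, beta_hat) for y = alpha + beta*x + u, using Equation 4."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    # Sample covariance and variance; the 1/n factors cancel in the ratio.
    cov_yx = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - x_bar) ** 2 for xi in x) / n
    if var_x == 0:
        raise ValueError("x must exhibit some variation in the sample")
    beta_hat = cov_yx / var_x
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha_hat, beta_hat = ols_simple(x, y)
print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
```

For these numbers the sum of cross-products is 19.9 and the sum of squared deviations of $x$ is 10, so $\hat{\beta} = 1.99$ and $\hat{\alpha} = 6.02 - 1.99 \times 3 = 0.05$.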

Regression without explanatory variable x

When the model does not contain $x$, ie $\beta=0$, then $\alpha = E(y)$ and $\hat{\alpha} = \bar{y}$.

The expression for the OLS estimator $\hat{\beta}$ in Equation 4 is key. The corresponding key assumption is A2, and is said to identify the underlying population parameter $\beta$. (A1 is less important, but is needed to derive the expression for $\hat{\alpha}$.)

Next, we define $\hat{u}_i \equiv y_i-\hat{y}_i$ and this residual is computed for every observation $i$. By construction,

\[ \Sigma_i \hat{u}_i=0 \qquad \Sigma_i \hat{u}_i x_i =0, \]

and this is because they match Assumptions A1/A2: $E(u) = E(ux)=0$. Whilst this makes intuitive sense, the obvious problem is that OLS imposes $\Sigma_i \hat{u}_i x_i =0$ even though A2 may not be true.
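That these two sums are zero by construction, whatever the data look like, can be seen numerically. This is a small Python sketch with made-up data; the variable names are chosen for this example only.

```python
from statistics import fmean

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = fmean(x), fmean(y)
beta_hat = (sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
alpha_hat = y_bar - beta_hat * x_bar

# Residuals: observed y minus fitted y, one per observation.
residuals = [yi - (alpha_hat + beta_hat * xi) for xi, yi in zip(x, y)]

sum_u = sum(residuals)                                 # sample analogue of E(u) = 0
sum_ux = sum(ui * xi for ui, xi in zip(residuals, x))  # sample analogue of E(ux) = 0

# Both are zero up to floating-point rounding error.
print(f"sum of residuals       = {sum_u:.2e}")
print(f"sum of residuals * x_i = {sum_ux:.2e}")
```

Changing the $y$ values to anything else leaves both sums at zero, which is exactly why the residuals cannot be used to test A2.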

An alternative and equivalent way of deriving $\hat{\alpha}, \hat{\beta}$ is to minimize

\[ \Sigma_i \hat{u}_i^2 \equiv \Sigma_i (y_i-\hat{y}_i)^2 = \Sigma_i (y_i - \hat{\alpha} - \hat{\beta} x_i)^2. \]

Whilst this is less appealing, it does convey the idea of fitting a line as closely as possible to a scatter-plot of data and it is where OLS gets its name from. We omit the details of this derivation.
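Although the derivation is omitted, the least-squares property is easy to check numerically: moving the intercept or slope away from the OLS values should always increase the sum of squared residuals. A minimal Python sketch follows, with made-up data and a hypothetical helper `ssr`; the grid of small shifts is an arbitrary choice for illustration.

```python
from statistics import fmean

def ssr(a, b, x, y):
    """Sum of squared residuals for the line y = a + b*x."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = fmean(x), fmean(y)
beta_hat = (sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
alpha_hat = y_bar - beta_hat * x_bar

ssr_ols = ssr(alpha_hat, beta_hat, x, y)

# Try every combination of small shifts to the intercept and slope.
shifts = (-0.1, 0.0, 0.1)
ssr_nearby = [
    ssr(alpha_hat + da, beta_hat + db, x, y)
    for da in shifts
    for db in shifts
    if (da, db) != (0.0, 0.0)
]

is_minimum = all(s > ssr_ols for s in ssr_nearby)
print(f"SSR at OLS estimates: {ssr_ols:.4f}")
print(f"Every nearby line fits worse: {is_minimum}")
```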

We now give an intuitive explanation as to when $\hat{\beta}$ can be said to be unbiased. First, from Equation 4,

\[ \begin{align*} \hat{\beta} = \frac{\widehat{Cov}(y,x)}{\widehat{Var}(x)} &= \frac{\widehat{Cov}(\alpha + \beta x +u,x)}{\widehat{Var}(x)} && \text{[using Equation 1]} \\ &= \frac{\beta \widehat{Cov}(x,x) + \widehat{Cov}(u,x)}{\widehat{Var}(x)} && \text{[as $\widehat{Cov}(\alpha,x) \equiv 0$]} \\ &= \beta + \frac{\widehat{Cov}(u,x)}{\widehat{Var}(x)} && \text{[as $\widehat{Cov}(x,x) \equiv \widehat{Var}(x)$]}. \end{align*} \]

In one dataset we cannot know whether $\widehat{Cov}(u,x)$ is zero or otherwise. The key assumption is that $E [\widehat{Cov}(u,x)] =Cov(u,x)=0$ when averaging over all possible datasets that nature could have generated, not just the one dataset we end up with. This rather subtle argument means that $\hat{\beta}$ is unbiased:

\[ E(\hat{\beta}) = \beta + \frac{E [\widehat{Cov}(u,x)]}{E [\widehat{Var}(x)]} = \beta + \frac{Cov(u,x)}{Var(x)} = \beta. \tag{5}\]

Caution

Normally $E(X/Y) \neq E(X)/E(Y)$, but here it turns out one can write $E[\frac{\widehat{Cov}(u,x)}{\widehat{Var}(x)}] = \frac{E[\widehat{Cov}(u,x)]}{E [\widehat{Var}(x)]}$.
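The idea of averaging over all the datasets that nature could have generated can be mimicked by simulation. The sketch below is a Python illustration under assumptions made only for this example: the true values $\alpha = 1$ and $\beta = 2$, standard normal draws for $x$ and for the noise, 50 observations per dataset and 2,000 datasets. The hypothetical parameter `gamma` controls how strongly $u$ depends on $x$, so that `gamma=0` satisfies A2 and `gamma=0.5` violates it.

```python
import random
from statistics import fmean

def beta_hat(x, y):
    """OLS slope from Equation 4."""
    x_bar, y_bar = fmean(x), fmean(y)
    num = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den

def average_estimate(gamma, reps=2000, n=50, alpha=1.0, beta=2.0, seed=20222):
    """Average beta_hat over many simulated datasets.

    The error is u = gamma*x + noise, so Cov(u, x) = gamma * Var(x).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        u = [gamma * xi + rng.gauss(0.0, 1.0) for xi in x]
        y = [alpha + beta * xi + ui for xi, ui in zip(x, u)]
        estimates.append(beta_hat(x, y))
    return fmean(estimates)

mean_exogenous = average_estimate(gamma=0.0)   # A2 holds
mean_endogenous = average_estimate(gamma=0.5)  # A2 fails

print(f"Average beta_hat when Cov(u,x) = 0:  {mean_exogenous:.3f}")
print(f"Average beta_hat when Cov(u,x) != 0: {mean_endogenous:.3f}")
```

Under these assumptions the first average should sit very close to the true value of 2, while the second should sit close to $2 + Cov(u,x)/Var(x) = 2.5$. Any single estimate can be some way from either number; it is only the average across datasets that settles down.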

In words, over all possible datasets, the average value of the estimator $\hat{\beta}$ is the true $\beta$, providing we can assume $Cov(u,x)=0$. To slightly misquote Angrist and Pischke (2015, p.35):

Assumption A2 is absolutely crucial in all applied economics. As $u$ is fundamentally unobserved, we cannot look at the data to examine/test/check whether it is true or not. We can only use common sense/reasoned argument/economic theory to establish that Assumption A2 is true. Applied economists spend a lot of time arguing about these things. In a later section on endogeneity, we document three common situations when $Cov(u,x)$ often is not zero. In later sections, we discuss what we can do about it, which is to use other estimators than OLS. If $Cov(u,x)\neq 0$, then $x$ is said to be endogenous.

Three important consequences of believing A2 to be true are:

$x$ is said to be exogenous (this is an econometric definition, more general than that used by economists);
the OLS estimator $\hat{\beta}$ is said to be unbiased; and
the effect of $x$ on $y$ (given by the OLS estimate $\hat{\beta}$) is said to be causal (rather than just a correlation).

Finding causal relationships is often seen as the “holy grail” of applied economists.

Unbiasedness and Consistency

A property related to unbiasedness is consistency, which, loosely speaking, is unbiasedness that holds only in “large” samples.

Estimating regression models in R

The formulae above for estimating regression models are implemented in R via the function lm(y ~ x).

Regression Inference

As you have seen from the discussion above, we can use sample statistics like $\widehat{Cov}(y,x)$, $\widehat{Var}(x)$, $\bar{y}$ and $\bar{x}$ to obtain the sample estimates $\hat{\beta}$ and $\hat{\alpha}$. This immediately implies that for different samples we would obtain different estimates. In other words, $\hat{\beta}$ and $\hat{\alpha}$ are actually random variables and the values we obtain are draws from these random variables.

We wish to learn from the estimated values for $\hat{\beta}$ and $\hat{\alpha}$ something about the true (and assumed fixed) population values $\beta$ and $\alpha$. The process of using $\hat{\beta}$ and $\hat{\alpha}$ to learn something about the population values $\beta$ and $\alpha$ is called statistical inference.

What does it mean when $\hat{\beta}$ and $\hat{\alpha}$ are random variables?

First recall the difference between a random variable and a realisation of a random variable. Take a dice. The outcome of rolling the dice once is a random variable. If it is a fair dice then we know exactly how to describe this random variable: it is fully characterised by the distribution that puts a probability of $1/6$ on each of the possible outcomes 1 to 6.

Once you have rolled the dice you have a particular outcome, say a 4.

Once you have calculated, say, $\hat{\beta} = 0.34$, that is the equivalent of the 4, i.e. one particular outcome. However, we are not interested in the particular outcome of that one sample for its own sake; we want to know what it tells us about the population value $\beta$. This is what inference is about.

In econometrics, for every estimator like $\hat{\beta}$, there is an associated standard error, called $\text{se}(\hat{\beta})$. It captures the so-called sampling variation which comes about because a second, third, fourth sample etc would generate a different $\hat{\beta}$ each time. This pair of numbers, $\hat{\beta}$ and $\text{se}(\hat{\beta})$, allows us to conduct hypothesis tests about $\beta$, written:

\[ H_0: \beta = b, \]

where $b$ is often, but not always, zero. If the null hypothesis is true, then the quantity

\[ \frac{\hat{\beta}-b}{\text{se}(\hat{\beta})} \sim t(n-2). \]

This says that the LHS, the so-called test statistic, has a certain statistical distribution (hence the “$\sim$” symbol), which is given on the RHS. Here we have a $t$ distribution with $n-2$ degrees of freedom. We can then use this distribution to evaluate how likely it would have been to obtain a test statistic at least as extreme as the one in our actual sample, assuming the null hypothesis is indeed true; this probability is known as a p-value.

Two equivalent decision rules follow:

Decision Rule 1 (DR1): If this p-value is smaller than a pre-specified significance level (typically 5%), then we “reject $H_0$”. The significance level is the probability of a Type-I error, that is, of rejecting a true $H_0$. If, however, the p-value is larger than our pre-specified significance level, then we “do not reject $H_0$”.
Decision Rule 2 (DR2): If the absolute value of the test statistic is larger than the so-called critical value, then we reject $H_0$. If, however, the absolute value of the test statistic is smaller than the critical value, then we do not reject $H_0$.

The critical value in decision rule 2 is chosen so that DR1 and DR2 give the same decision.
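The two decision rules can be compared on a worked example. The sketch below is in Python and uses only the standard library, so the p-value is obtained by numerically integrating the $t$ density rather than by calling a statistics package. The estimate $\hat{\beta} = 0.34$ echoes the example above, but the standard error of 0.15 and the sample size of 20 are made up for illustration, as are the function names. The critical value 2.101 is the tabulated 97.5th percentile of $t(18)$.

```python
import math

def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|) for T ~ t(df), using Simpson's rule on [0, |t_stat|]."""
    a = abs(t_stat)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_mass = 2 * total * h / 3  # P(-a < T < a), by symmetry
    return max(0.0, 1.0 - central_mass)

# Hypothetical numbers, for illustration only.
beta_hat, se_beta, b, n = 0.34, 0.15, 0.0, 20
df = n - 2

t_stat = (beta_hat - b) / se_beta
p_value = two_sided_p_value(t_stat, df)

significance_level = 0.05
critical_value = 2.101  # 97.5th percentile of t(18), from statistical tables

reject_dr1 = p_value < significance_level
reject_dr2 = abs(t_stat) > critical_value

print(f"t statistic = {t_stat:.3f}, p-value = {p_value:.4f}")
print(f"DR1 rejects H0: {reject_dr1}; DR2 rejects H0: {reject_dr2}")
```

Feeding the critical value itself into `two_sided_p_value` returns a p-value of almost exactly 0.05, which is the sense in which the critical value is chosen to make the two rules agree.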

Further comments on hypothesis testing

We never “accept $H_0$”; instead we “do not reject $H_0$”
This general approach to hypothesis testing recurs throughout econometrics. It is only the details that vary (the hypothesis itself, thereby generating different distributions such as Normal, Chi-squared, F, etc)
In general, students do not need to know the formula for $\text{se}(\hat{\beta})$ (the software needs to, of course). But see the next point.
Depending on the type of problem there are different versions of $\text{se}(\hat{\beta})$. The correct version depends on the type of assumption made on the unobserved error terms. There are “traditional” and “robust” (or “heteroskedasticity robust” or “White”) standard errors. By default we almost always want to use the latter, the “robust” standard errors. There are some additional specialised versions of standard errors, such as the “heteroskedasticity and autocorrelation robust” (or “HAC” or “Newey-West”) standard errors and the “cluster robust” standard errors. We will mention later when these may be important. You will not need to know the formulae but you will need to know which ones to use, so that you can ask your software to calculate the correct ones.

Special case: When $x$ is a dummy variable

A very important special case is when $x$ is a so-called dummy or binary variable. Everything we discussed about the regression model applies, but there are differences as well, on which we now focus.

We start with the same model, given by Equation 1,

\[ y = \alpha + \beta x + u, \]

but now $x$ is either 0 (eg a woman) or 1 (eg a man). $x$ is said to be a dummy variable. There are many of these variables in cross-section datasets.

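The scatter plot changes in a big way. Instead of a “fuzz” of data-points, we get something like this:

[Figure: scatter plot of $y_i$ against the dummy $x_i$, with the two group means marked by red squares.]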

We can see two vertical distributions of data, one for $x_i=0$ (women) and the other for $x_i=1$ (men). As an example, suppose $y_i$ are each individual’s earnings. The sample mean for the $x_i=0$ (female) sub-sample is denoted $\bar{y}_0$, and is located in the middle of the female distribution (highlighted in red). $\bar{y}_1$ is defined analogously for men.

The population regression function allows us to understand what is being estimated and what can be tested. Assumptions A1/A2 imply:

\[ E(u|x=0) = E(u|x=1) = 0 ~ (A3). \]

We next write out the population regression function for both values of $x$:

\[ \mu_0 \equiv E (y | x=0) = \alpha + E(u|x=0) \qquad \mu_1 \equiv E (y | x=1) = \alpha + \beta + E(u|x=1). \tag{6}\]

We now impose Assumption A3,

\[ E (y | x=0) = \alpha + 0 \qquad E (y | x=1) = \alpha + \beta + 0, \]

from which

\[ \beta = E (y | x=1) - E (y | x=0) \equiv \mu_1-\mu_0. \tag{7}\]

From this, Equation 1 can be rewritten as

\[ y = \mu_0 + (\mu_1-\mu_0)x + u. \]

Equation 7 is the key result, that $\beta$ can be interpreted as the population raw differential. Using the same technique of replacing population means by their sample averages, we have

\[ \hat{\beta} = \bar{y}_1-\bar{y}_0, \tag{8}\]

which is the sample raw differential. Also, $\hat{\alpha}=\bar{y}_0$.

This result is the workhorse result in microeconometrics and will be used repeatedly throughout these notes. Note that we can write the $\bar{y}$s in two different but interchangeable ways:

\[ \bar{y}_1 = n_1^{-1}\Sigma_{i \in 1} y_i = n_1^{-1}\Sigma^n_{i=1} x_i y_i, \]

where $n_1 = \Sigma^n_{i=1} x_i$ is the number of observations with $x_i=1$. In the latter, multiplying $y_i$ by $x_i$ before summing means that one is only summing the $y_i$s for whom $x_i=1$. In the former, “$i \in 1$” means “all $i$s who belong to group 1”. Similarly, with $n_0 = n - n_1$,

\[ \bar{y}_0 = n_0^{-1}\Sigma_{i \in 0} y_i = n_0^{-1}\Sigma^n_{i=1} (1-x_i)y_i. \]

In the figure, this result makes perfect sense. $\bar{y}_1-\bar{y}_0$ is the vertical distance between the 2 red squares. The horizontal distance is unity, and so the regression is the line that joins the two blobs, whose slope is $\hat{\beta}$. Comparing with the general case in Section 2.1, the interpretation of the sample regression function is the same, but the scatter plot is different.
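The equivalence between the regression slope and the difference in group means can be checked directly. The following Python sketch uses made-up earnings for four women ($x_i=0$) and three men ($x_i=1$); the numbers and variable names are chosen for this example only. The group means are computed by multiplying $y_i$ by $x_i$ (or by $1-x_i$) before summing and then dividing by the group size.

```python
from statistics import fmean

# Hypothetical earnings data: x = 0 for women, x = 1 for men.
x = [0, 0, 0, 0, 1, 1, 1]
y = [20.0, 24.0, 22.0, 26.0, 30.0, 27.0, 33.0]

n = len(x)
n1 = sum(x)   # number of observations with x = 1
n0 = n - n1   # number of observations with x = 0

# Group means via the "switch x on and off" formulae.
y1_bar = sum(xi * yi for xi, yi in zip(x, y)) / n1
y0_bar = sum((1 - xi) * yi for xi, yi in zip(x, y)) / n0

# OLS estimates from the general formulae (Equation 4 and alpha_hat).
x_bar, y_bar = fmean(x), fmean(y)
beta_hat = (sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
alpha_hat = y_bar - beta_hat * x_bar

print(f"y0_bar = {y0_bar:.2f}, y1_bar = {y1_bar:.2f}")
print(f"alpha_hat = {alpha_hat:.2f}, beta_hat = {beta_hat:.2f}")
```

Here the group means are 23 and 30, and the general OLS formulae return $\hat{\alpha} = 23$ and $\hat{\beta} = 7$, the raw differential.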

We can write the regression line as

\[ \hat{y}_i = \bar{y}_0 + (\bar{y}_1-\bar{y}_0) x_i. \tag{9}\]

The definition of a residual remains the same, $\hat{u}_i \equiv y_i-\hat{y}_i$. To interpret, switch $x_i$ on and off:

\[ \hat{u}_i= \begin{cases} y_i-\bar{y}_0 & \text{if $x_i=0$;} \\ y_i-\bar{y}_1 & \text{if $x_i=1$.} \end{cases} \]

In words, $\hat{u}_i$ is the gap between an individual’s (eg a woman’s) earnings and the average for their group (women). $\hat{u}_i$ is often said to capture within variation. By contrast, $\bar{y}_1-\bar{y}_0$ is said to capture between variation. It is also the case that $\Sigma_{i \in 1}\hat{u}_i=\Sigma_{i \in 0}\hat{u}_i =0$, matching Assumption A3 above.

In this context, the obvious hypothesis to test is whether the population means for the 2 sub-samples are the same:

\[ H_0: \mu_1=\mu_0 \equiv H_0: \beta=0. \]

This occurs literally all the time in economics (and more generally in social science and public discourse).

Examples for interesting dummy setups

There are numerous examples you can think of here:

Do women earn more than men?
Do men run faster ultra marathons than women?
Do high-school graduates live longer than non-high-school graduates?

The key features here are that the explanatory variable is binary and the outcome is a quantitative variable.

If we repeat our general approach to hypothesis testing in Section 2.1, the test statistic still has a $t$ distribution,

\[ \frac{\bar{y}_1-\bar{y}_0-0}{\text{se}(\bar{y}_1-\bar{y}_0)} \sim t(n-2), \]

if the null hypothesis is true. Again, we will not discuss how to calculate $\text{se}(\bar{y}_1-\bar{y}_0)$ but rather assume that the software we use knows how to calculate this standard error.

When the dependent variable is a logarithmic variable

\[ \log y = \alpha + \beta x + u \]

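then, because OLS now averages $\log y_i$ within each group, Equation 8 becomes the difference in mean log outcomes. Writing $\bar{y}_1$ and $\bar{y}_0$ for the group means on the original scale (strictly, the geometric means, whose logs equal the group averages of $\log y_i$), this is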

\[ \hat{\beta} = \log \bar{y}_1 - \log \bar{y}_0. \]

As in the general case before, the following algebra gives a unit-free interpretation:

\[ \begin{align*} \hat{\beta} &= \log \bar{y}_1 - \log \bar{y}_0 \\ &= \log \left(1+ \frac{\bar{y}_1 - \bar{y}_0}{\bar{y}_0} \right) \\ &\equiv \log \left(1+ \%\Delta y/100 \right) \\ &\approx \%\Delta y/100 \quad \text{if $\%\Delta y/100$ is small}. \end{align*} \]

In other words, when $\%\Delta y/100$ is small, one can interpret $\hat{\beta}$ directly as the proportional raw differential in the dependent variable, so that $100\hat{\beta}$ is approximately the percentage differential. This is typically described as a log-point change. When $\%\Delta y/100$ is not small, one needs to compute the same concept exactly, by inverting the third line:

\[ \%\Delta y = 100[\exp(\hat{\beta})-1], \]

which is described as the percentage raw differential.
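How quickly the log-point approximation deteriorates can be seen by tabulating both numbers side by side. This is a short Python sketch; the function name and the three values of $\hat{\beta}$ are made up for illustration.

```python
import math

def exact_percent_differential(beta_hat):
    """Exact percentage differential implied by a log-point estimate."""
    return 100 * (math.exp(beta_hat) - 1)

# Hypothetical estimates, from small to large.
for beta_hat in (0.05, 0.10, 0.50):
    log_points = 100 * beta_hat
    exact = exact_percent_differential(beta_hat)
    print(f"beta_hat = {beta_hat:.2f}: {log_points:5.1f} log points, "
          f"exact differential = {exact:6.3f}%")
```

For $\hat{\beta} = 0.05$ the two numbers are 5 and roughly 5.13, which is close enough for most purposes. For $\hat{\beta} = 0.50$ they are 50 and roughly 64.87, so the exact formula should be used.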

One advantage of the log-point interpretation is that it is invariant to a redefinition of $x$ by swapping $x=1$ with $x=0$ (eg now $x=0$ is man). All that happens is that $\hat{\beta}$ swaps sign.

Finally, one might wonder why we need to estimate a regression model when all one needs to do is compute 2 sample means and their variances. The answers will become apparent as the course develops. At the moment, we simply assert that regression is an all-embracing methodology.

Practice Question

Assume that you estimated a log-level model, where the dependent variable is log-wages and the explanatory variable is a dummy variable indicating whether an individual has a high-school degree ($x=1$) or not ($x = 0$). You obtained $\hat{\beta} = 0.21$.

Which of the following interpretations is correct?

The log income elasticity is 0.21.
Income for high-school graduates is, on average, 21 log points higher than for those who have no high-school diploma.
The raw differential between those with and without a high-school diploma is 21%.

What is the (average) percentage increase for those with a high-school diploma? [to 4 decimal places]

Appendix: standardised coefficients

Another way of producing unit-free estimates—as with elasticities—is the following. Suppose a simple regression has been estimated:

\[ y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{u}_i. \]

Subtract $\bar{y} =\hat{\beta}_0 +\hat{\beta}_1 \bar{x}$ (reflecting the fact that the point of averages is always on the regression line):

\[ y_i - \bar{y} = \hat{\beta}_1 (x_i - \bar{x}) + \hat{u}_i, \]

and divide by the sample standard deviation of $y$, $\hat{\sigma}_y$:

\[ \frac{y_i - \bar{y}}{\hat{\sigma}_y} = \frac{\hat{\sigma}_x}{\hat{\sigma}_y}\hat{\beta}_1 \left( \frac{x_i - \bar{x}}{\hat{\sigma}_x} \right ) + \frac{\hat{u}_i}{\hat{\sigma}_y}, \]

where $\hat{\sigma}_x$ is the sample standard deviation of $x$. Both $y$ and $x$ have been standardised. The new regression is written:

\[ z_{y,i} = \hat{b}_1 z_{x,i} + \text{error}_i, \]

where

\[ \hat{b}_1 \equiv \frac{\hat{\sigma}_x}{\hat{\sigma}_y}\hat{\beta}_1 \]

is the standardised coefficient. Its interpretation is obvious: if $x$ goes up by one standard deviation, the effect on $y$ is $\hat{b}_1$ standard deviations.

In the simple regression case (as here), remember that $\hat{\beta}_1= \hat{\sigma}_{yx}/\hat{\sigma}^2_x$. This means that

\[ \hat{b}_1 = \frac{\hat{\sigma}_x}{\hat{\sigma}_y}\frac{\hat{\sigma}_{yx}}{\hat{\sigma}^2_x} = \frac{\hat{\sigma}_{yx}}{\hat{\sigma}_y\hat{\sigma}_x}, \]

the sample correlation between $x$ and $y$.
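This identity is straightforward to confirm numerically. The Python sketch below uses made-up data and computes all moments with the same $1/n$ convention as Equation 4 (hence `pstdev`, the population-style standard deviation, from the standard library); the variable names are chosen for this example only.

```python
from statistics import fmean, pstdev

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = fmean(x), fmean(y)

# Sample covariance and standard deviations, all dividing by n.
cov_yx = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y)) / n
sd_x, sd_y = pstdev(x), pstdev(y)

beta1_hat = cov_yx / sd_x ** 2         # usual OLS slope
b1_hat = (sd_x / sd_y) * beta1_hat     # standardised coefficient
correlation = cov_yx / (sd_x * sd_y)   # sample correlation

print(f"beta1_hat   = {beta1_hat:.4f}")
print(f"b1_hat      = {b1_hat:.4f}")
print(f"correlation = {correlation:.4f}")
```

The last two printed numbers coincide. Using $1/(n-1)$ throughout instead of $1/n$ would give the same result, since the factor cancels in both expressions.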

In a multiple regression, standardising all the variables by hand can be tedious. It is better to estimate the model as usual, and then ask that standardised coefficients be produced post-estimation.

Reading

Most of the material in the section is taken from Wooldridge (2025, chapter 2). The appendix is from Section 6-1a of Wooldridge (2025).

Angrist, J.D. and J.-S. Pischke (2015) Mastering Metrics. Princeton University Press.

Cunningham, S. (2021) Causal Inference: The Mixtape. Yale University Press.

Wooldridge, J. (2025) Introductory Econometrics: A Modern Approach. Cengage.