Chapter 4: Properties of OLS — BLUE and Gauss-Markov

Why OLS works, when it works, and simulation-based intuition

Note: Learning Objectives

By the end of this chapter you will be able to:

  • State the CLM assumptions in both scalar and matrix notation
  • Distinguish an estimator (a rule) from an estimate (a number)
  • Prove OLS unbiasedness algebraically in both scalar and matrix form
  • Derive \(\text{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\)
  • State and apply the Gauss-Markov theorem
  • Derive the omitted variable bias formula algebraically
  • Explain how rescaling \(x\) or \(y\) affects OLS estimates
  • Use Monte Carlo simulation to verify unbiasedness and efficiency in R

1 Estimator vs. Estimate

A common source of confusion is the distinction between an estimator and an estimate.

An estimator is a function of the data — a rule or formula that, when applied to any sample, produces a number. Before we collect data, the estimator is a random variable with its own distribution (the sampling distribution). The OLS slope

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\]

is an estimator: it varies from sample to sample because it depends on the random \(y_i\) (which in turn depend on the random errors \(u_i\)).

An estimate is the specific number you obtain after plugging in a particular realised sample. If your data give \(\hat{\beta}_1 = 2.34\), that is the estimate for this dataset. It is a fixed number, not a random variable.

This distinction matters for inference. Unbiasedness, variance, and confidence intervals are properties of the estimator (averaged over hypothetical repeated samples), not of any single estimate.
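To make this concrete, here is a minimal sketch (an arbitrary DGP with true slope 2, and an illustrative helper name): the same estimator, applied to two independent samples, returns two different estimates.

# The estimator: a rule that maps any sample (x, y) to a number
ols_slope <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
}

set.seed(1)
x_a <- runif(30, 0, 10); y_a <- 1 + 2 * x_a + rnorm(30)   # sample A
x_b <- runif(30, 0, 10); y_b <- 1 + 2 * x_b + rnorm(30)   # sample B, same DGP

ols_slope(x_a, y_a)   # estimate from sample A (a fixed number)
ols_slope(x_b, y_b)   # a different estimate, produced by the same rule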


2 The Classical Linear Model Assumptions

2.1 Scalar (Simple Regression) Form

Label Assumption Statement
SLR.1 Linearity in parameters \(y = \beta_0 + \beta_1 x + u\)
SLR.2 Random sampling \((x_i, y_i)\) i.i.d. from the population
SLR.3 No perfect collinearity \(\text{Var}(x) > 0\) (not all \(x_i\) identical)
SLR.4 Zero conditional mean \(E[u \mid x] = 0\)
SLR.5 Homoskedasticity \(\text{Var}(u \mid x) = \sigma^2\) (constant)
SLR.6 Normality \(u \mid x \sim \mathcal{N}(0, \sigma^2)\)

SLR.1–4 deliver unbiasedness. Adding SLR.5 delivers BLUE (minimum variance among linear unbiased estimators). Adding SLR.6 delivers exact finite-sample inference via the \(t\) and \(F\) distributions.

2.2 Matrix (Multiple Regression) Form

Stack observations: \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}\), where \(\mathbf{y}\) is \(n \times 1\), \(\mathbf{X}\) is \(n \times (k+1)\) with a leading column of ones, \(\boldsymbol{\beta}\) is \((k+1) \times 1\), and \(\mathbf{u}\) is \(n \times 1\).

Label Assumption Matrix Statement
MLR.1 Linearity \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}\)
MLR.2 Random sampling Rows \((x_i', y_i)\) are i.i.d.
MLR.3 Full column rank \(\text{rank}(\mathbf{X}) = k + 1\) (no perfect multicollinearity)
MLR.4 Zero conditional mean \(E[\mathbf{u} \mid \mathbf{X}] = \mathbf{0}\)
MLR.5 Scalar covariance \(\text{Var}(\mathbf{u} \mid \mathbf{X}) = \sigma^2 \mathbf{I}_n\)
MLR.6 Normality \(\mathbf{u} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)\)

MLR.5 encodes two conditions simultaneously: (i) each error has variance \(\sigma^2\) (homoskedasticity) and (ii) errors are uncorrelated across observations (no serial correlation). Both conditions appear in the single matrix equation \(\text{Var}(\mathbf{u} \mid \mathbf{X}) = \sigma^2 \mathbf{I}_n\).
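MLR.3 can be checked mechanically on a fitted model: extract the design matrix with model.matrix() and compare its rank to its number of columns. A minimal sketch, using the wage1 data that appears later in this chapter:

data(wage1, package = "wooldridge")
fit <- lm(lwage ~ educ + exper, data = wage1)

X <- model.matrix(fit)     # n x (k+1) design matrix, with a leading column of ones
dim(X)                     # n and k+1
qr(X)$rank == ncol(X)      # TRUE: full column rank, so MLR.3 holds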


3 Unbiasedness of OLS

3.1 Scalar Proof

Write \(\hat{\beta}_1\) in terms of the true data-generating process. From Chapter 3:

\[\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})y_i}{\sum(x_i - \bar{x})^2}\]

Substituting \(y_i = \beta_0 + \beta_1 x_i + u_i\):

\[\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i)}{\sum(x_i - \bar{x})^2}\]

The \(\beta_0\) term vanishes because \(\sum(x_i - \bar{x}) = 0\). Therefore:

\[\hat{\beta}_1 = \beta_1 + \frac{\sum(x_i - \bar{x})u_i}{\sum(x_i - \bar{x})^2} = \beta_1 + \sum_{i=1}^n w_i u_i\]

where \(w_i = (x_i - \bar{x}) / \sum(x_i - \bar{x})^2\) are fixed constants (conditional on \(\mathbf{x}\)). Taking conditional expectations:

\[E[\hat{\beta}_1 \mid \mathbf{x}] = \beta_1 + \sum w_i \underbrace{E[u_i \mid \mathbf{x}]}_{=\,0 \text{ by SLR.4}} = \beta_1\]

By the Law of Iterated Expectations, \(E[\hat{\beta}_1] = E_{\mathbf{x}}[E[\hat{\beta}_1 \mid \mathbf{x}]] = \beta_1\). \(\blacksquare\)
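The decomposition \(\hat{\beta}_1 = \beta_1 + \sum w_i u_i\) holds exactly in every sample, which is easy to confirm numerically when the errors are known. A sketch with an arbitrary simulated DGP (true slope 0.5):

set.seed(42)
n <- 100
x <- runif(n, 0, 10)
u <- rnorm(n, sd = 2)
y <- 2 + 0.5 * x + u

w <- (x - mean(x)) / sum((x - mean(x))^2)   # the OLS weights w_i

coef(lm(y ~ x))[2]    # OLS slope from lm()
0.5 + sum(w * u)      # beta_1 + sum(w_i u_i): the same number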

3.2 Matrix Proof

Recall from Chapter 3 that \(\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\). Substitute the true DGP \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}\):

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\]

\[\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\]

Taking conditional expectations:

\[E[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \underbrace{E[\mathbf{u} \mid \mathbf{X}]}_{=\,\mathbf{0} \text{ by MLR.4}} = \boldsymbol{\beta}\]

By the Law of Iterated Expectations, \(E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\). \(\blacksquare\)

Important: The Linearity Trick

The key step in both proofs is that \(\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\) — the estimator equals the truth plus a linear function of the errors. This is what makes the expectation easy to take: \(E[A\mathbf{u} \mid \mathbf{X}] = A\,E[\mathbf{u} \mid \mathbf{X}] = \mathbf{0}\) whenever \(A\) is non-stochastic given \(\mathbf{X}\). Unbiasedness fails when \(\mathbf{X}\) and \(\mathbf{u}\) are correlated (endogeneity), because then \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\) does not have conditional mean zero.
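The decomposition is just as easy to verify in matrix form. A sketch reusing a simulated sample in which \(\boldsymbol{\beta}\) and \(\mathbf{u}\) are known:

set.seed(42)
n    <- 100
x    <- runif(n, 0, 10)
u    <- rnorm(n, sd = 2)
beta <- c(2, 0.5)
X    <- cbind(1, x)              # n x 2 design matrix
y    <- X %*% beta + u

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y    # (X'X)^(-1) X'y
beta_hat - beta                                 # sampling error ...
solve(t(X) %*% X) %*% t(X) %*% u                # ... equals (X'X)^(-1) X'u exactly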


4 Variance of the OLS Estimator

4.1 Scalar Formula

Under SLR.1–5, using \(\hat{\beta}_1 = \beta_1 + \sum w_i u_i\) with \(w_i = (x_i - \bar{x}) / SST_x\):

\[\text{Var}(\hat{\beta}_1 \mid \mathbf{x}) = \sum w_i^2\, \text{Var}(u_i \mid \mathbf{x}) = \sigma^2 \sum w_i^2\]

Since \(\sum w_i^2 = \sum (x_i - \bar{x})^2 / SST_x^2 = SST_x / SST_x^2 = 1/SST_x\):

\[\boxed{\text{Var}(\hat{\beta}_1 \mid \mathbf{x}) = \frac{\sigma^2}{SST_x}}\]

4.2 Matrix Derivation of \(\text{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\)

From the unbiasedness derivation, the sampling error is \(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\). The variance-covariance matrix of \(\hat{\boldsymbol{\beta}}\) is:

\[\text{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = E\!\left[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})' \mid \mathbf{X}\right]\]

\[= E\!\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\,\mathbf{u}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \mid \mathbf{X}\right]\]

\[= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\, \underbrace{E[\mathbf{u}\mathbf{u}' \mid \mathbf{X}]}_{\sigma^2 \mathbf{I}_n \text{ by MLR.5}}\, \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

\[= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\underbrace{\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}}_{\mathbf{I}_{k+1}}\]

\[\boxed{\text{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}}\]

The \((j,j)\) diagonal element of \(\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\) is \(\text{Var}(\hat{\beta}_j \mid \mathbf{X})\); the off-diagonal \((j,\ell)\) element is \(\text{Cov}(\hat{\beta}_j, \hat{\beta}_\ell \mid \mathbf{X})\).

4.3 Estimating \(\sigma^2\)

The population error variance \(\sigma^2 = \text{Var}(u_i \mid x_i)\) is unknown and must be estimated. The natural candidate is the average squared residual, but OLS residuals have only \(n - (k+1)\) degrees of freedom (we estimated \(k+1\) parameters):

\[\hat{\sigma}^2 = \frac{SSR}{n - k - 1} = \frac{\sum_{i=1}^n \hat{u}_i^2}{n - k - 1} = \frac{\mathbf{\hat{u}}'\mathbf{\hat{u}}}{n - k - 1}\]

This estimator is unbiased: \(E[\hat{\sigma}^2 \mid \mathbf{X}] = \sigma^2\). The proof requires the trace identity \(E[\mathbf{\hat{u}}'\mathbf{\hat{u}} \mid \mathbf{X}] = \sigma^2(n - k - 1)\), which follows because \(\mathbf{\hat{u}} = \mathbf{M}_X \mathbf{u}\) and \(\text{rank}(\mathbf{M}_X) = n - k - 1\).

Plugging \(\hat{\sigma}^2\) into \(\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\) yields the estimated variance-covariance matrix \(\widehat{\text{Var}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1}\). The square roots of its diagonal elements are the standard errors reported by lm().
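Both results can be checked directly against lm(): sigma() returns \(\hat{\sigma}\) and vcov() returns \(\hat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1}\). A short sketch using the wage1 data from the wooldridge package:

data(wage1, package = "wooldridge")
fit <- lm(lwage ~ educ + exper, data = wage1)

X        <- model.matrix(fit)
uh       <- resid(fit)
df_resid <- nrow(X) - ncol(X)                     # n - k - 1

sigma2_hat <- sum(uh^2) / df_resid                # SSR / (n - k - 1)
all.equal(sigma2_hat, sigma(fit)^2)               # TRUE: matches lm()'s residual variance

V_hat <- sigma2_hat * solve(t(X) %*% X)           # sigma_hat^2 (X'X)^(-1)
all.equal(V_hat, vcov(fit), check.attributes = FALSE)   # TRUE: matches vcov()
sqrt(diag(V_hat))                                 # the standard errors shown by summary(fit)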

Implications of \(\text{Var}(\hat{\beta}_1) = \sigma^2 / SST_x\):

  1. More noise (\(\sigma^2 \uparrow\)) → less precise estimates
  2. More variation in \(x\) (\(SST_x \uparrow\)) → more precise estimates
  3. Larger \(n\)\(SST_x = \sum(x_i - \bar{x})^2\) grows → estimates improve with data

5 The Gauss-Markov Theorem

Note: Gauss-Markov Theorem

Under MLR.1–5, the OLS estimator \(\hat{\boldsymbol{\beta}}\) is the Best Linear Unbiased Estimator (BLUE) of \(\boldsymbol{\beta}\).

Linear: \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\) for some matrix \(\mathbf{C}\) that may depend on \(\mathbf{X}\) but not on \(\mathbf{y}\). Unbiased: \(E[\tilde{\boldsymbol{\beta}} \mid \mathbf{X}] = \boldsymbol{\beta}\). Best: \(\text{Var}(\tilde{\boldsymbol{\beta}} \mid \mathbf{X}) - \text{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X})\) is positive semi-definite for any alternative linear unbiased \(\tilde{\boldsymbol{\beta}}\).

Proof sketch. Let \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\) be any linear unbiased estimator. For unbiasedness, we need \(\mathbf{C}\mathbf{X} = \mathbf{I}\). Write \(\mathbf{C} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{D}\) where \(\mathbf{D}\mathbf{X} = \mathbf{0}\) (the unbiasedness constraint forces this). Then:

\[\text{Var}(\tilde{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma^2 \mathbf{C}\mathbf{C}' = \sigma^2(\mathbf{X}'\mathbf{X})^{-1} + \sigma^2\mathbf{D}\mathbf{D}'\]

The cross terms vanish because \(\mathbf{D}\mathbf{X} = \mathbf{0}\) (and hence \(\mathbf{X}'\mathbf{D}' = \mathbf{0}\)). Since \(\mathbf{D}\mathbf{D}'\) is positive semi-definite, \(\text{Var}(\tilde{\boldsymbol{\beta}}) \geq \text{Var}(\hat{\boldsymbol{\beta}})\) in the matrix sense. \(\blacksquare\)

The theorem is powerful but its scope is limited. “Best” is only among linear unbiased estimators. Biased estimators such as ridge regression can achieve lower variance (and sometimes lower mean squared error) by accepting some bias. And “best” requires MLR.5: if errors are heteroskedastic, Weighted Least Squares is BLUE, not OLS (Chapter 9).
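A small simulation makes the efficiency claim visible. The sketch below (an illustrative comparison with a hypothetical helper one_draw(), not from the text) pits OLS against another linear unbiased estimator: a grouping estimator that splits the sample at the median of \(x\) and connects the two group means. Both are centred at the true slope, but OLS has the smaller spread, as Gauss-Markov guarantees.

set.seed(7)
n_sims_gm <- 5000
n_gm      <- 50

one_draw <- function() {
  x  <- runif(n_gm, 0, 10)
  y  <- 2 + 0.5 * x + rnorm(n_gm, sd = 2)
  hi <- x > median(x)                           # split the sample at the median of x
  c(ols   = unname(coef(lm(y ~ x))[2]),
    group = (mean(y[hi]) - mean(y[!hi])) / (mean(x[hi]) - mean(x[!hi])))
}

draws <- t(replicate(n_sims_gm, one_draw()))
colMeans(draws)        # both approximately 0.5 (both unbiased)
apply(draws, 2, sd)    # OLS has the smaller standard deviation (BLUE)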


6 Omitted Variable Bias

One of the most important consequences of endogeneity in practice is omitted variable bias (OVB): when a relevant variable is excluded from the regression, its effect leaks into the residual and biases the coefficients on included variables.

6.1 The Algebraic Derivation

Suppose the true (long) model is:

\[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i \quad \text{where } E[u \mid x_1, x_2] = 0 \tag{Long}\]

But the researcher estimates the short model, omitting \(x_2\):

\[y_i = \tilde{\beta}_0 + \tilde{\beta}_1 x_{1i} + \tilde{u}_i \tag{Short}\]

Let \(\delta_1\) be the coefficient from an auxiliary regression of \(x_2\) on \(x_1\):

\[x_{2i} = \delta_0 + \delta_1 x_{1i} + v_i \tag{Auxiliary}\]

Substituting the long model into the short-regression slope formula (the algebra is spelled out in Section 6.2 below), the short-regression slope satisfies:

\[E[\tilde{\beta}_1] = \beta_1 + \beta_2 \delta_1\]

The omitted variable bias is therefore:

\[\text{OVB} = \beta_2 \delta_1 = \underbrace{\beta_2}_{\text{effect of }x_2 \text{ on }y} \times \underbrace{\delta_1}_{\text{slope of }x_2 \text{ on }x_1}\]

The bias is nonzero if and only if both (i) \(x_2\) affects \(y\) (\(\beta_2 \neq 0\)) and (ii) \(x_1\) and \(x_2\) are correlated (\(\delta_1 \neq 0\)). Either condition alone is insufficient.

6.2 Algebraic Derivation Detail

From the short model, \(\tilde{\beta}_1 = \frac{\sum(x_{1i} - \bar{x}_1)y_i}{SST_{x_1}}\). Substitute the long model:

\[\tilde{\beta}_1 = \frac{\sum(x_{1i} - \bar{x}_1)(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i)}{SST_{x_1}}\]

\[= \beta_1 + \beta_2 \frac{\sum(x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{SST_{x_1}} + \frac{\sum(x_{1i} - \bar{x}_1)u_i}{SST_{x_1}}\]

Taking expectations conditional on the regressors (the last term has mean zero because \(E[u \mid x_1, x_2] = 0\) in the long model):

\[E[\tilde{\beta}_1] = \beta_1 + \beta_2 \cdot \underbrace{\frac{\sum(x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{SST_{x_1}}}_{\delta_1 \text{ from auxiliary regression}} = \beta_1 + \beta_2 \delta_1\]

6.3 Sign of the Bias

                    \(\beta_2 > 0\)                                      \(\beta_2 < 0\)
\(\delta_1 > 0\)    Positive bias (\(\tilde{\beta}_1 > \beta_1\))        Negative bias (\(\tilde{\beta}_1 < \beta_1\))
\(\delta_1 < 0\)    Negative bias                                        Positive bias

Example. Estimating the return to education, omitting ability. If ability raises wages (\(\beta_2 > 0\)) and is positively correlated with education (\(\delta_1 > 0\)), then OVB \(> 0\): the OLS coefficient on education overstates the causal return, partly picking up the effect of ability.

# Assumes the tidyverse has been loaded (tibble() is used below)
set.seed(123)
n <- 400

# True DGP: y = 1 + 0.5*x1 + 0.8*x2 + u
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.5)   # x2 correlated with x1
u  <- rnorm(n, sd = 0.5)
y  <- 1 + 0.5 * x1 + 0.8 * x2 + u

df <- tibble(y, x1, x2)

long_fit  <- lm(y ~ x1 + x2, data = df)
short_fit <- lm(y ~ x1, data = df)

cat("Long model (true):  beta_1 =", round(coef(long_fit)["x1"], 3), "\n")
Long model (true):  beta_1 = 0.473 
cat("Short model (OVB):  beta_1 =", round(coef(short_fit)["x1"], 3), "\n")
Short model (OVB):  beta_1 = 1.04 
cat("Auxiliary delta_1  =", round(coef(lm(x2 ~ x1))["x1"], 3), "\n")
Auxiliary delta_1  = 0.714 
cat("Predicted OVB      =",
    round(coef(long_fit)["x2"] * coef(lm(x2 ~ x1))["x1"], 3), "\n")
Predicted OVB      = 0.567 

7 Rescaling and Units of Measurement

OLS estimates change when we rescale variables, but the predictions and fit are invariant. Understanding exactly how they change is essential for interpreting coefficients correctly.

7.1 Case 1: Scaling \(x\) only

Let \(x^* = c \cdot x\) (e.g., converting metres to centimetres, \(c = 100\)). The new slope is:

\[\hat{\beta}_1^* = \frac{\hat{\beta}_1}{c}\]

Intuition: A one-unit increase in \(x^*\) corresponds to a \(1/c\)-unit increase in \(x\), so the slope must be divided by \(c\). The fitted values \(\hat{\beta}_1^* x^* = (\hat{\beta}_1/c)(cx) = \hat{\beta}_1 x\) are unchanged.

7.2 Case 2: Affine transformation of \(x\)

Let \(x^* = a + c \cdot x\). The slope is the same as Case 1: \(\hat{\beta}_1^* = \hat{\beta}_1/c\). The intercept adjusts to keep predictions identical: \(\hat{\beta}_0^* = \hat{\beta}_0 - \hat{\beta}_1 a/c\).

Example: Standardising \(x\) to mean 0, SD 1 (setting \(a = -\bar{x}/s_x\) and \(c = 1/s_x\)). The new slope, \(\hat{\beta}_1 s_x\), measures the effect on \(y\) of a one-standard-deviation increase in \(x\).
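A quick check of this case, sketched on the wage1 data used in the next subsection (fit_raw and fit_std are illustrative names): standardising educ with scale() multiplies the slope by \(s_x\).

data(wage1, package = "wooldridge")

fit_raw <- lm(wage ~ educ, data = wage1)
fit_std <- lm(wage ~ scale(educ), data = wage1)   # educ standardised to mean 0, SD 1

coef(fit_std)[2]                     # slope per one-SD increase in educ
coef(fit_raw)[2] * sd(wage1$educ)    # = original slope times s_x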

7.3 Case 3: Scaling \(y\) only

Let \(y^* = c \cdot y\). Then:

\[\hat{\beta}_0^* = c \hat{\beta}_0, \qquad \hat{\beta}_1^* = c \hat{\beta}_1\]

All slope and intercept estimates scale by \(c\). Standard errors also scale by \(c\); \(t\)-statistics and \(R^2\) are unchanged.

7.4 Case 4: Scaling both \(x\) and \(y\)

Let \(y^* = c_y y\) and \(x^* = c_x x\). Then \(\hat{\beta}_1^* = (c_y/c_x)\hat{\beta}_1\) — the ratio of the scaling factors. \(t\)-statistics and \(R^2\) are unchanged.

data(wage1, package = "wooldridge")

fit_original <- lm(wage ~ educ, data = wage1)

# Scale x: education in months instead of years (c = 12)
fit_x_scaled <- lm(wage ~ I(educ * 12), data = wage1)

# Scale y: wage in cents instead of dollars (c = 100)
fit_y_scaled <- lm(I(wage * 100) ~ educ, data = wage1)

tibble(
  Model       = c("Original ($/years)", "x scaled ($/months)", "y scaled (cents/years)"),
  Intercept   = c(coef(fit_original)[1], coef(fit_x_scaled)[1], coef(fit_y_scaled)[1]),
  Slope       = c(coef(fit_original)[2], coef(fit_x_scaled)[2], coef(fit_y_scaled)[2]),
  `R-squared` = c(broom::glance(fit_original)$r.squared,   # glance() comes from the broom package
                  broom::glance(fit_x_scaled)$r.squared,
                  broom::glance(fit_y_scaled)$r.squared)
) |>
  knitr::kable(digits = 4,
               caption = "Rescaling: predictions and R² are invariant; coefficients transform predictably.")
Rescaling: predictions and R² are invariant; coefficients transform predictably.
Model                     Intercept     Slope   R-squared
Original ($/years)          -0.9049    0.5414      0.1648
x scaled ($/months)         -0.9049    0.0451      0.1648
y scaled (cents/years)     -90.4852   54.1359      0.1648
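Case 4 can be appended to the same comparison: with wage in cents (\(c_y = 100\)) and education in months (\(c_x = 12\)), the slope is multiplied by \(c_y/c_x = 100/12\). A brief sketch continuing the code above (fit_both is an illustrative name):

fit_both <- lm(I(wage * 100) ~ I(educ * 12), data = wage1)

coef(fit_both)[2]                    # slope: cents per month of education
coef(fit_original)[2] * 100 / 12     # = (c_y / c_x) times the original slope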

8 Monte Carlo Simulation: Seeing BLUE in Action

Monte Carlo simulation lets us repeatedly draw samples from a known data-generating process (DGP) and observe the properties of estimators empirically.

8.1 Setup

True DGP: \(y_i = 2 + 0.5 x_i + u_i\), with \(x_i \sim \mathcal{U}(0, 10)\) and \(u_i \sim \mathcal{N}(0, \sigma^2 = 4)\), so sigma = 2 in the code below.

run_ols <- function(n, beta_0 = 2, beta_1 = 0.5, sigma = 2) {
  x <- runif(n, 0, 10)
  u <- rnorm(n, 0, sigma)
  y <- beta_0 + beta_1 * x + u
  coef(lm(y ~ x))
}

n_sims <- 5000

results <- map_dfr(1:n_sims, ~{
  b <- run_ols(n = 50)
  tibble(intercept = b[1], slope = b[2])
})

results |>
  summarise(
    mean_b0  = mean(intercept),
    mean_b1  = mean(slope),
    sd_b0    = sd(intercept),
    sd_b1    = sd(slope)
  )

Across 5,000 simulations, \(\hat{\beta}_1\) averages approximately 0.5 — confirming unbiasedness. We can also check the variance formula: \(\text{Var}(\hat{\beta}_1) = \sigma^2 / SST_x \approx 4 / (n \cdot \text{Var}(x))\).

# Theoretical variance of the slope: sigma^2 / E[SST_x]
# E[SST_x] ≈ n * Var(x) = 50 * (10^2/12) ≈ 416.7
theoretical_var <- 4 / (50 * 100/12)
cat("Theoretical SD of slope:", round(sqrt(theoretical_var), 4), "\n")
Theoretical SD of slope: 0.098 
cat("Simulation SD of slope: ", round(sd(results$slope), 4), "\n")
Simulation SD of slope:  0.1004 

8.2 Distribution of \(\hat{\beta}_1\)

ggplot(results, aes(slope)) +
  geom_histogram(aes(y = after_stat(density)), bins = 60,
                 fill = "#2c7be5", alpha = 0.7) +
  geom_vline(xintercept = 0.5, colour = "#e63946", linewidth = 1.2) +
  stat_function(fun = dnorm,
                args = list(mean = mean(results$slope), sd = sd(results$slope)),
                colour = "#e63946", linewidth = 1) +
  labs(x = expression(hat(beta)[1]), y = "Density",
       title = "Sampling Distribution of OLS Slope",
       subtitle = "Red line = true value (0.5); curve = fitted Normal")
Figure 1: Sampling distribution of OLS slope across 5,000 simulations.

8.3 Effect of Sample Size

map_dfr(c(30, 100, 500, 2000), function(n) {
  slopes <- replicate(n_sims, run_ols(n)[2])
  tibble(n = n, slope = slopes)
}) |>
  mutate(n_label = paste("n =", n)) |>
  ggplot(aes(slope, fill = n_label)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = 0.5, linetype = "dashed") +
  labs(x = expression(hat(beta)[1]), y = "Density", fill = "Sample size",
       title = "Sampling Distribution of OLS Slope by Sample Size",
       subtitle = "Dashed line = true slope (0.5)")
Figure 2: OLS becomes more precise as n grows — consistent with Var(β̂₁) = σ²/SSTₓ.

8.4 What Happens When SLR.4 Fails? (Endogeneity)

run_biased_ols <- function(n, beta_1 = 0.5) {
  x <- runif(n, 0, 10)
  u <- 0.8 * x + rnorm(n, 0, 1)   # u correlated with x
  y <- 2 + beta_1 * x + u
  coef(lm(y ~ x))[2]
}

slopes_biased   <- replicate(n_sims, run_biased_ols(50))
slopes_unbiased <- results$slope

tibble(
  estimate = c(slopes_biased, slopes_unbiased),
  type     = rep(c("Endogenous (biased)", "Exogenous (BLUE)"), each = n_sims)
) |>
  ggplot(aes(estimate, fill = type)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = 0.5, linetype = "dashed") +
  labs(x = expression(hat(beta)[1]), fill = "",
       title = "Endogeneity Causes Irreducible Bias",
       subtitle = "True beta_1 = 0.5; dashed line marks true value")
Figure 3: OLS is biased when the error is correlated with x. More data does not fix this.

When \(\text{Cov}(x, u) \neq 0\), OLS is systematically biased — larger \(n\) tightens the distribution around the wrong value. This motivates the search for exogenous variation (instruments, natural experiments, randomised assignment) in causal inference.
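A quick way to see that more data does not help is to repeat the endogenous simulation at several sample sizes (a sketch reusing run_biased_ols() from above): the mean of \(\hat{\beta}_1\) stays far from 0.5 while its standard deviation shrinks. With \(u = 0.8x + \text{noise}\) in this DGP, the short-regression slope converges to \(0.5 + 0.8 = 1.3\), not 0.5.

sapply(c(50, 500, 5000), function(n) {
  slopes <- replicate(500, run_biased_ols(n))
  c(n = n, mean = mean(slopes), sd = sd(slopes))    # mean stays near 1.3; sd shrinks
})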


9 Tutorials

Tutorial 4.1 Show algebraically that \(\text{Var}(\hat{\beta}_1 \mid \mathbf{x}) = \sigma^2 / SST_x\) starting from \(\hat{\beta}_1 = \beta_1 + \sum w_i u_i\) where \(w_i = (x_i - \bar{x}) / SST_x\).

Let \(S_{xx} = SST_x\) and \(w_i = (x_i - \bar{x})/S_{xx}\).

\[\text{Var}(\hat{\beta}_1 \mid \mathbf{x}) = \text{Var}\!\left(\sum w_i u_i \mid \mathbf{x}\right) = \sum w_i^2 \text{Var}(u_i \mid \mathbf{x}) \stackrel{\text{SLR.5}}{=} \sigma^2 \sum w_i^2\]

\[\sum w_i^2 = \sum \frac{(x_i - \bar{x})^2}{S_{xx}^2} = \frac{S_{xx}}{S_{xx}^2} = \frac{1}{S_{xx}}\]

Therefore \(\text{Var}(\hat{\beta}_1 \mid \mathbf{x}) = \sigma^2 / S_{xx} = \sigma^2 / SST_x\). \(\blacksquare\)
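As a numerical companion to the derivation (a sketch using the wage1 data from the wooldridge package), \(\hat{\sigma}^2 / SST_x\) reproduces the slope variance that lm() reports:

data(wage1, package = "wooldridge")
fit <- lm(lwage ~ educ, data = wage1)

SST_x <- sum((wage1$educ - mean(wage1$educ))^2)
sigma(fit)^2 / SST_x     # sigma_hat^2 / SST_x
vcov(fit)[2, 2]          # Var_hat(beta_1_hat) from lm(): the same number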

Tutorial 4.2 Using the wage1 dataset from wooldridge, regress lwage (log wage) on educ and exper. Separately regress exper on educ. Use the OVB formula to predict what the slope on educ would be if you omit exper. Verify by running the short regression.

data(wage1, package = "wooldridge")

long_fit <- lm(lwage ~ educ + exper, data = wage1)
aux_fit  <- lm(exper ~ educ, data = wage1)
short_fit <- lm(lwage ~ educ, data = wage1)

beta2  <- coef(long_fit)["exper"]
delta1 <- coef(aux_fit)["educ"]

cat("Long model  beta_educ =", round(coef(long_fit)["educ"], 4), "\n")
Long model  beta_educ = 0.0979 
cat("OVB prediction        =", round(beta2 * delta1, 4), "\n")
OVB prediction        = -0.0152 
cat("Short model beta_educ =", round(coef(short_fit)["educ"], 4), "\n")
Short model beta_educ = 0.0827 
cat("Matches long + OVB?   =",
    round(coef(long_fit)["educ"] + beta2 * delta1, 4), "\n")
Matches long + OVB?   = 0.0827 

The short-model coefficient on educ equals the long-model coefficient plus the predicted OVB exactly. Note that delta1 < 0 (in this sample, workers with more education tend to have fewer years of labour-market experience), so omitting experience understates the return to education when \(\beta_2 > 0\).

Tutorial 4.3 Adapt the Monte Carlo simulation to demonstrate OVB. Generate data from: \(y_i = 1 + 0.5x_{1i} + 0.8x_{2i} + u_i\) where \(x_{2i} = 0.7x_{1i} + v_i\). Run 2,000 simulations. In each, estimate both the long model and the short model. Plot the sampling distributions of \(\tilde{\beta}_1\) (short) and \(\hat{\beta}_1\) (long) and compare to the true value 0.5.

set.seed(2024)
n_sims2 <- 2000

simulate_ovb <- function(n = 200) {
  x1 <- rnorm(n)
  x2 <- 0.7 * x1 + rnorm(n, sd = sqrt(1 - 0.7^2))
  u  <- rnorm(n, sd = 0.5)
  y  <- 1 + 0.5 * x1 + 0.8 * x2 + u

  long_b1  <- coef(lm(y ~ x1 + x2))[2]
  short_b1 <- coef(lm(y ~ x1))[2]
  tibble(long = long_b1, short = short_b1)
}

sim_results <- map_dfr(1:n_sims2, ~simulate_ovb())

sim_results |>
  pivot_longer(everything(), names_to = "model", values_to = "slope") |>
  ggplot(aes(slope, fill = model)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = 0.5, linetype = "dashed", linewidth = 1) +
  labs(x = expression(hat(beta)[1]), y = "Density", fill = "Model",
       title = "Omitted Variable Bias in Monte Carlo",
       subtitle = "Dashed = true value 0.5; short model is biased upward")

OVB: the short-model slope is centred above the true value; the long model recovers it.

The long-model estimator is centred at 0.5. The short-model estimator is centred at \(0.5 + 0.8 \times 0.7 = 1.06\) — the OVB formula in action.