Chapter 4: Properties of OLS — BLUE and Gauss-Markov

Why OLS works, when it works, and simulation-based intuition

Note: Learning Objectives

By the end of this chapter you will be able to:

  • State the CLM assumptions in both scalar and matrix notation
  • Distinguish an estimator (a rule) from an estimate (a number)
  • Prove OLS unbiasedness algebraically in both scalar and matrix form
  • Derive \(\text{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\)
  • State and apply the Gauss-Markov theorem
  • Derive the omitted variable bias formula algebraically
  • Explain how rescaling \(x\) or \(y\) affects OLS estimates
  • Use Monte Carlo simulation to verify unbiasedness and efficiency in R

1 Estimator vs. Estimate

A common source of confusion is the distinction between an estimator and an estimate.

An estimator is a function of the data — a rule or formula that, when applied to any sample, produces a number. Before we collect data, the estimator is a random variable with its own distribution (the sampling distribution). The OLS slope

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\]

is an estimator: it varies from sample to sample because it depends on the random \(y_i\) (which in turn depend on the random errors \(u_i\)).

An estimate is the specific number you obtain after plugging in a particular realised sample. If your data give \(\hat{\beta}_1 = 2.34\), that is the estimate for this dataset. It is a fixed number, not a random variable.

This distinction matters for inference. Unbiasedness, variance, and confidence intervals are properties of the estimator (averaged over hypothetical repeated samples), not of any single estimate.
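To make this concrete, here is a minimal sketch (an arbitrary DGP with true slope 2, and an illustrative helper name): the same estimator, applied to two independent samples, returns two different estimates.

# The estimator: a rule that maps any sample (x, y) to a number
ols_slope <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
}

set.seed(1)
x_a <- runif(30, 0, 10); y_a <- 1 + 2 * x_a + rnorm(30)   # sample A
x_b <- runif(30, 0, 10); y_b <- 1 + 2 * x_b + rnorm(30)   # sample B, same DGP

ols_slope(x_a, y_a)   # estimate from sample A (a fixed number)
ols_slope(x_b, y_b)   # a different estimate, produced by the same rule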


2 The Classical Linear Model Assumptions

2.1 Scalar (Simple Regression) Form

Label Assumption Statement
SLR.1 Linearity in parameters \(y = \beta_0 + \beta_1 x + u\)
SLR.2 Random sampling \((x_i, y_i)\) i.i.d. from the population
SLR.3 No perfect collinearity \(\text{Var}(x) > 0\) (not all \(x_i\) identical)
SLR.4 Zero conditional mean \(E[u \mid x] = 0\)
SLR.5 Homoskedasticity \(\text{Var}(u \mid x) = \sigma^2\) (constant)
SLR.6 Normality \(u \mid x \sim \mathcal{N}(0, \sigma^2)\)

SLR.1–4 deliver unbiasedness. Adding SLR.5 delivers BLUE (minimum variance among linear unbiased estimators). Adding SLR.6 delivers exact finite-sample inference via the \(t\) and \(F\) distributions.

2.2 Matrix (Multiple Regression) Form

Stack observations: \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}\), where \(\mathbf{y}\) is \(n \times 1\), \(\mathbf{X}\) is \(n \times (k+1)\) with a leading column of ones, \(\boldsymbol{\beta}\) is \((k+1) \times 1\), and \(\mathbf{u}\) is \(n \times 1\).

Label Assumption Matrix Statement
MLR.1 Linearity \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}\)
MLR.2 Random sampling Rows \((x_i', y_i)\) are i.i.d.
MLR.3 Full column rank \(\text{rank}(\mathbf{X}) = k + 1\) (no perfect multicollinearity)
MLR.4 Zero conditional mean \(E[\mathbf{u} \mid \mathbf{X}] = \mathbf{0}\)
MLR.5 Scalar covariance \(\text{Var}(\mathbf{u} \mid \mathbf{X}) = \sigma^2 \mathbf{I}_n\)
MLR.6 Normality \(\mathbf{u} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)\)

MLR.5 encodes two conditions simultaneously: (i) each error has variance \(\sigma^2\) (homoskedasticity) and (ii) errors are uncorrelated across observations (no serial correlation). Both conditions appear in the single matrix equation \(\text{Var}(\mathbf{u} \mid \mathbf{X}) = \sigma^2 \mathbf{I}_n\).
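MLR.3 can be checked mechanically on a fitted model: extract the design matrix with model.matrix() and compare its rank to its number of columns. A minimal sketch, using the wage1 data that appears later in this chapter:

data(wage1, package = "wooldridge")
fit <- lm(lwage ~ educ + exper, data = wage1)

X <- model.matrix(fit)     # n x (k+1) design matrix, with a leading column of ones
dim(X)                     # n and k+1
qr(X)$rank == ncol(X)      # TRUE: full column rank, so MLR.3 holds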


3 Unbiasedness of OLS

3.1 Scalar Proof

Write \(\hat{\beta}_1\) in terms of the true data-generating process. From Chapter 3:

\[\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})y_i}{\sum(x_i - \bar{x})^2}\]

Substituting \(y_i = \beta_0 + \beta_1 x_i + u_i\):

\[\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i)}{\sum(x_i - \bar{x})^2}\]

The \(\beta_0\) term vanishes because \(\sum(x_i - \bar{x}) = 0\). Therefore:

\[\hat{\beta}_1 = \beta_1 + \frac{\sum(x_i - \bar{x})u_i}{\sum(x_i - \bar{x})^2} = \beta_1 + \sum_{i=1}^n w_i u_i\]

where \(w_i = (x_i - \bar{x}) / \sum(x_i - \bar{x})^2\) are fixed constants (conditional on \(\mathbf{x}\)). Taking conditional expectations:

\[E[\hat{\beta}_1 \mid \mathbf{x}] = \beta_1 + \sum w_i \underbrace{E[u_i \mid \mathbf{x}]}_{=\,0 \text{ by SLR.4}} = \beta_1\]

By the Law of Iterated Expectations, \(E[\hat{\beta}_1] = E_{\mathbf{x}}[E[\hat{\beta}_1 \mid \mathbf{x}]] = \beta_1\). \(\blacksquare\)
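The decomposition \(\hat{\beta}_1 = \beta_1 + \sum w_i u_i\) holds exactly in every sample, which is easy to confirm numerically when the errors are known. A sketch with an arbitrary simulated DGP (true slope 0.5):

set.seed(42)
n <- 100
x <- runif(n, 0, 10)
u <- rnorm(n, sd = 2)
y <- 2 + 0.5 * x + u

w <- (x - mean(x)) / sum((x - mean(x))^2)   # the OLS weights w_i

coef(lm(y ~ x))[2]    # OLS slope from lm()
0.5 + sum(w * u)      # beta_1 + sum(w_i u_i): the same number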

3.2 Matrix Proof

Recall from Chapter 3 that \(\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\). Substitute the true DGP \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}\):

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\]

\[\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\]

Taking conditional expectations:

\[E[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \underbrace{E[\mathbf{u} \mid \mathbf{X}]}_{=\,\mathbf{0} \text{ by MLR.4}} = \boldsymbol{\beta}\]

By the Law of Iterated Expectations, \(E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\). \(\blacksquare\)

Important: The Linearity Trick

The key step in both proofs is that \(\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\) — the estimator equals the truth plus a linear function of the errors. This is what makes the expectation easy to take: \(E[A\mathbf{u} \mid \mathbf{X}] = A\,E[\mathbf{u} \mid \mathbf{X}] = \mathbf{0}\) whenever \(A\) is non-stochastic given \(\mathbf{X}\). Unbiasedness fails when \(\mathbf{X}\) and \(\mathbf{u}\) are correlated (endogeneity), because then \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\) does not have conditional mean zero.
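The decomposition is just as easy to verify in matrix form. A sketch reusing a simulated sample in which \(\boldsymbol{\beta}\) and \(\mathbf{u}\) are known:

set.seed(42)
n    <- 100
x    <- runif(n, 0, 10)
u    <- rnorm(n, sd = 2)
beta <- c(2, 0.5)
X    <- cbind(1, x)              # n x 2 design matrix
y    <- X %*% beta + u

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y    # (X'X)^(-1) X'y
beta_hat - beta                                 # sampling error ...
solve(t(X) %*% X) %*% t(X) %*% u                # ... equals (X'X)^(-1) X'u exactly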


4 Variance of the OLS Estimator

4.1 Scalar Formula

Under SLR.1–5, using \(\hat{\beta}_1 = \beta_1 + \sum w_i u_i\) with \(w_i = (x_i - \bar{x}) / SST_x\):

\[\text{Var}(\hat{\beta}_1 \mid \mathbf{x}) = \sum w_i^2\, \text{Var}(u_i \mid \mathbf{x}) = \sigma^2 \sum w_i^2\]

Since \(\sum w_i^2 = \sum (x_i - \bar{x})^2 / SST_x^2 = SST_x / SST_x^2 = 1/SST_x\):

\[\boxed{\text{Var}(\hat{\beta}_1 \mid \mathbf{x}) = \frac{\sigma^2}{SST_x}}\]

4.2 Matrix Derivation of \(\text{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\)

From the unbiasedness derivation, the sampling error is \(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\). The variance-covariance matrix of \(\hat{\boldsymbol{\beta}}\) is:

\[\text{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = E\!\left[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})' \mid \mathbf{X}\right]\]

\[= E\!\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\,\mathbf{u}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \mid \mathbf{X}\right]\]

\[= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\, \underbrace{E[\mathbf{u}\mathbf{u}' \mid \mathbf{X}]}_{\sigma^2 \mathbf{I}_n \text{ by MLR.5}}\, \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

\[= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\underbrace{\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}}_{\mathbf{I}_{k+1}}\]

\[\boxed{\text{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}}\]

The \((j,j)\) diagonal element of \(\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\) is \(\text{Var}(\hat{\beta}_j \mid \mathbf{X})\); the off-diagonal \((j,\ell)\) element is \(\text{Cov}(\hat{\beta}_j, \hat{\beta}_\ell \mid \mathbf{X})\).

4.3 Estimating \(\sigma^2\)

The population error variance \(\sigma^2 = \text{Var}(u_i \mid x_i)\) is unknown and must be estimated. The natural candidate is the average squared residual, but OLS residuals have only \(n - (k+1)\) degrees of freedom (we estimated \(k+1\) parameters):

\[\hat{\sigma}^2 = \frac{SSR}{n - k - 1} = \frac{\sum_{i=1}^n \hat{u}_i^2}{n - k - 1} = \frac{\mathbf{\hat{u}}'\mathbf{\hat{u}}}{n - k - 1}\]

This estimator is unbiased: \(E[\hat{\sigma}^2 \mid \mathbf{X}] = \sigma^2\). The proof requires the trace identity \(E[\mathbf{\hat{u}}'\mathbf{\hat{u}} \mid \mathbf{X}] = \sigma^2(n - k - 1)\), which follows because \(\mathbf{\hat{u}} = \mathbf{M}_X \mathbf{u}\) and \(\text{rank}(\mathbf{M}_X) = n - k - 1\).

Plugging \(\hat{\sigma}^2\) into \(\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\) yields the estimated variance-covariance matrix \(\widehat{\text{Var}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1}\). The square roots of its diagonal elements are the standard errors reported by lm().
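Both results can be checked directly against lm(): sigma() returns \(\hat{\sigma}\) and vcov() returns \(\hat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1}\). A short sketch using the wage1 data from the wooldridge package:

data(wage1, package = "wooldridge")
fit <- lm(lwage ~ educ + exper, data = wage1)

X        <- model.matrix(fit)
uh       <- resid(fit)
df_resid <- nrow(X) - ncol(X)                     # n - k - 1

sigma2_hat <- sum(uh^2) / df_resid                # SSR / (n - k - 1)
all.equal(sigma2_hat, sigma(fit)^2)               # TRUE: matches lm()'s residual variance

V_hat <- sigma2_hat * solve(t(X) %*% X)           # sigma_hat^2 (X'X)^(-1)
all.equal(V_hat, vcov(fit), check.attributes = FALSE)   # TRUE: matches vcov()
sqrt(diag(V_hat))                                 # the standard errors shown by summary(fit)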

Implications of \(\text{Var}(\hat{\beta}_1) = \sigma^2 / SST_x\):

  1. More noise (\(\sigma^2 \uparrow\)) → less precise estimates
  2. More variation in \(x\) (\(SST_x \uparrow\)) → more precise estimates
  3. Larger \(n\)\(SST_x = \sum(x_i - \bar{x})^2\) grows → estimates improve with data

5 The Gauss-Markov Theorem

Note: Gauss-Markov Theorem

Under MLR.1–5, the OLS estimator \(\hat{\boldsymbol{\beta}}\) is the Best Linear Unbiased Estimator (BLUE) of \(\boldsymbol{\beta}\).

Linear: \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\) for some matrix \(\mathbf{C}\) that may depend on \(\mathbf{X}\) but not on \(\mathbf{y}\). Unbiased: \(E[\tilde{\boldsymbol{\beta}} \mid \mathbf{X}] = \boldsymbol{\beta}\). Best: \(\text{Var}(\tilde{\boldsymbol{\beta}} \mid \mathbf{X}) - \text{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X})\) is positive semi-definite for any alternative linear unbiased \(\tilde{\boldsymbol{\beta}}\).

Proof sketch. Let \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\) be any linear unbiased estimator. For unbiasedness, we need \(\mathbf{C}\mathbf{X} = \mathbf{I}\). Write \(\mathbf{C} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{D}\) where \(\mathbf{D}\mathbf{X} = \mathbf{0}\) (the unbiasedness constraint forces this). Then:

\[\text{Var}(\tilde{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma^2 \mathbf{C}\mathbf{C}' = \sigma^2(\mathbf{X}'\mathbf{X})^{-1} + \sigma^2\mathbf{D}\mathbf{D}'\]

The cross terms vanish because \(\mathbf{D}\mathbf{X} = \mathbf{0}\) (and hence \(\mathbf{X}'\mathbf{D}' = \mathbf{0}\)). Since \(\mathbf{D}\mathbf{D}'\) is positive semi-definite, \(\text{Var}(\tilde{\boldsymbol{\beta}}) \geq \text{Var}(\hat{\boldsymbol{\beta}})\) in the matrix sense. \(\blacksquare\)

The theorem is powerful but its scope is limited. “Best” is only among linear unbiased estimators. Biased estimators such as ridge regression can achieve lower variance (and sometimes lower mean squared error) by accepting some bias. And “best” requires MLR.5: if errors are heteroskedastic, Weighted Least Squares is BLUE, not OLS (Chapter 9).
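A small simulation makes the efficiency claim visible. The sketch below (an illustrative comparison with a hypothetical helper one_draw(), not from the text) pits OLS against another linear unbiased estimator: a grouping estimator that splits the sample at the median of \(x\) and connects the two group means. Both are centred at the true slope, but OLS has the smaller spread, as Gauss-Markov guarantees.

set.seed(7)
n_sims_gm <- 5000
n_gm      <- 50

one_draw <- function() {
  x  <- runif(n_gm, 0, 10)
  y  <- 2 + 0.5 * x + rnorm(n_gm, sd = 2)
  hi <- x > median(x)                           # split the sample at the median of x
  c(ols   = unname(coef(lm(y ~ x))[2]),
    group = (mean(y[hi]) - mean(y[!hi])) / (mean(x[hi]) - mean(x[!hi])))
}

draws <- t(replicate(n_sims_gm, one_draw()))
colMeans(draws)        # both approximately 0.5 (both unbiased)
apply(draws, 2, sd)    # OLS has the smaller standard deviation (BLUE)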


6 Omitted Variable Bias

One of the most important consequences of endogeneity in practice is omitted variable bias (OVB): when a relevant variable is excluded from the regression, its effect leaks into the residual and biases the coefficients on included variables.

6.1 The Algebraic Derivation

Suppose the true (long) model is:

\[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i \quad \text{where } E[u \mid x_1, x_2] = 0 \tag{Long}\]

But the researcher estimates the short model, omitting \(x_2\):

\[y_i = \tilde{\beta}_0 + \tilde{\beta}_1 x_{1i} + \tilde{u}_i \tag{Short}\]

Let \(\delta_1\) be the coefficient from an auxiliary regression of \(x_2\) on \(x_1\):

\[x_{2i} = \delta_0 + \delta_1 x_{1i} + v_i \tag{Auxiliary}\]

Substituting the long model into the short-regression slope formula (the algebra is spelled out in Section 6.2 below), the short-regression slope satisfies:

\[E[\tilde{\beta}_1] = \beta_1 + \beta_2 \delta_1\]

The omitted variable bias is therefore:

\[\text{OVB} = \beta_2 \delta_1 = \underbrace{\beta_2}_{\text{effect of }x_2 \text{ on }y} \times \underbrace{\delta_1}_{\text{slope of }x_2 \text{ on }x_1}\]

The bias is nonzero if and only if both (i) \(x_2\) affects \(y\) (\(\beta_2 \neq 0\)) and (ii) \(x_1\) and \(x_2\) are correlated (\(\delta_1 \neq 0\)). Either condition alone is insufficient.

6.2 Algebraic Derivation Detail

From the short model, \(\tilde{\beta}_1 = \frac{\sum(x_{1i} - \bar{x}_1)y_i}{SST_{x_1}}\). Substitute the long model:

\[\tilde{\beta}_1 = \frac{\sum(x_{1i} - \bar{x}_1)(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i)}{SST_{x_1}}\]

\[= \beta_1 + \beta_2 \frac{\sum(x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{SST_{x_1}} + \frac{\sum(x_{1i} - \bar{x}_1)u_i}{SST_{x_1}}\]

Taking expectations conditional on the regressors (the last term has mean zero because \(E[u \mid x_1, x_2] = 0\) in the long model):

\[E[\tilde{\beta}_1] = \beta_1 + \beta_2 \cdot \underbrace{\frac{\sum(x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{SST_{x_1}}}_{\delta_1 \text{ from auxiliary regression}} = \beta_1 + \beta_2 \delta_1\]

6.3 Sign of the Bias

                    \(\beta_2 > 0\)                                      \(\beta_2 < 0\)
\(\delta_1 > 0\)    Positive bias (\(\tilde{\beta}_1 > \beta_1\))        Negative bias (\(\tilde{\beta}_1 < \beta_1\))
\(\delta_1 < 0\)    Negative bias                                        Positive bias

Example. Estimating the return to education, omitting ability. If ability raises wages (\(\beta_2 > 0\)) and is positively correlated with education (\(\delta_1 > 0\)), then OVB \(> 0\): the OLS coefficient on education overstates the causal return, partly picking up the effect of ability.

# Assumes the tidyverse has been loaded (tibble() is used below)
set.seed(123)
n <- 400

# True DGP: y = 1 + 0.5*x1 + 0.8*x2 + u
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.5)   # x2 correlated with x1
u  <- rnorm(n, sd = 0.5)
y  <- 1 + 0.5 * x1 + 0.8 * x2 + u

df <- tibble(y, x1, x2)

long_fit  <- lm(y ~ x1 + x2, data = df)
short_fit <- lm(y ~ x1, data = df)

cat("Long model (true):  beta_1 =", round(coef(long_fit)["x1"], 3), "\n")
Long model (true):  beta_1 = 0.473 
cat("Short model (OVB):  beta_1 =", round(coef(short_fit)["x1"], 3), "\n")
Short model (OVB):  beta_1 = 1.04 
cat("Auxiliary delta_1  =", round(coef(lm(x2 ~ x1))["x1"], 3), "\n")
Auxiliary delta_1  = 0.714 
cat("Predicted OVB      =",
    round(coef(long_fit)["x2"] * coef(lm(x2 ~ x1))["x1"], 3), "\n")
Predicted OVB      = 0.567 

7 Rescaling and Units of Measurement

OLS estimates change when we rescale variables, but the predictions and fit are invariant. Understanding exactly how they change is essential for interpreting coefficients correctly.

7.1 Case 1: Scaling \(x\) only

Let \(x^* = c \cdot x\) (e.g., converting metres to centimetres, \(c = 100\)). The new slope is:

\[\hat{\beta}_1^* = \frac{\hat{\beta}_1}{c}\]

Intuition: A one-unit increase in \(x^*\) corresponds to a \(1/c\)-unit increase in \(x\), so the slope must be divided by \(c\). The fitted values \(\hat{\beta}_1^* x^* = (\hat{\beta}_1/c)(cx) = \hat{\beta}_1 x\) are unchanged.

7.2 Case 2: Affine transformation of \(x\)

Let \(x^* = a + c \cdot x\). The slope is the same as Case 1: \(\hat{\beta}_1^* = \hat{\beta}_1/c\). The intercept adjusts to keep predictions identical: \(\hat{\beta}_0^* = \hat{\beta}_0 - \hat{\beta}_1 a/c\).

Example: Standardising \(x\) to mean 0, SD 1 (setting \(a = -\bar{x}/s_x\) and \(c = 1/s_x\)). The new slope, \(\hat{\beta}_1 s_x\), measures the effect on \(y\) of a one-standard-deviation increase in \(x\).
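A quick check of this case, sketched on the wage1 data used in the next subsection (fit_raw and fit_std are illustrative names): standardising educ with scale() multiplies the slope by \(s_x\).

data(wage1, package = "wooldridge")

fit_raw <- lm(wage ~ educ, data = wage1)
fit_std <- lm(wage ~ scale(educ), data = wage1)   # educ standardised to mean 0, SD 1

coef(fit_std)[2]                     # slope per one-SD increase in educ
coef(fit_raw)[2] * sd(wage1$educ)    # = original slope times s_x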

7.3 Case 3: Scaling \(y\) only

Let \(y^* = c \cdot y\). Then:

\[\hat{\beta}_0^* = c \hat{\beta}_0, \qquad \hat{\beta}_1^* = c \hat{\beta}_1\]

All slope and intercept estimates scale by \(c\). Standard errors also scale by \(c\); \(t\)-statistics and \(R^2\) are unchanged.

7.4 Case 4: Scaling both \(x\) and \(y\)

Let \(y^* = c_y y\) and \(x^* = c_x x\). Then \(\hat{\beta}_1^* = (c_y/c_x)\hat{\beta}_1\) — the ratio of the scaling factors. \(t\)-statistics and \(R^2\) are unchanged.

data(wage1, package = "wooldridge")

fit_original <- lm(wage ~ educ, data = wage1)

# Scale x: education in months instead of years (c = 12)
fit_x_scaled <- lm(wage ~ I(educ * 12), data = wage1)

# Scale y: wage in cents instead of dollars (c = 100)
fit_y_scaled <- lm(I(wage * 100) ~ educ, data = wage1)

tibble(
  Model       = c("Original ($/years)", "x scaled ($/months)", "y scaled (cents/years)"),
  Intercept   = c(coef(fit_original)[1], coef(fit_x_scaled)[1], coef(fit_y_scaled)[1]),
  Slope       = c(coef(fit_original)[2], coef(fit_x_scaled)[2], coef(fit_y_scaled)[2]),
  `R-squared` = c(broom::glance(fit_original)$r.squared,   # glance() comes from the broom package
                  broom::glance(fit_x_scaled)$r.squared,
                  broom::glance(fit_y_scaled)$r.squared)
) |>
  knitr::kable(digits = 4,
               caption = "Rescaling: predictions and R² are invariant; coefficients transform predictably.")
Rescaling: predictions and R² are invariant; coefficients transform predictably.
Model                     Intercept     Slope   R-squared
Original ($/years)          -0.9049    0.5414      0.1648
x scaled ($/months)         -0.9049    0.0451      0.1648
y scaled (cents/years)     -90.4852   54.1359      0.1648
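Case 4 can be appended to the same comparison: with wage in cents (\(c_y = 100\)) and education in months (\(c_x = 12\)), the slope is multiplied by \(c_y/c_x = 100/12\). A brief sketch continuing the code above (fit_both is an illustrative name):

fit_both <- lm(I(wage * 100) ~ I(educ * 12), data = wage1)

coef(fit_both)[2]                    # slope: cents per month of education
coef(fit_original)[2] * 100 / 12     # = (c_y / c_x) times the original slope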

8 Monte Carlo Simulation: Seeing BLUE in Action

Monte Carlo simulation lets us repeatedly draw samples from a known data-generating process (DGP) and observe the properties of estimators empirically.

8.1 Setup

True DGP: \(y_i = 2 + 0.5 x_i + u_i\), with \(x_i \sim \mathcal{U}(0, 10)\) and \(u_i \sim \mathcal{N}(0, \sigma^2 = 4)\), so sigma = 2 in the code below.

run_ols <- function(n, beta_0 = 2, beta_1 = 0.5, sigma = 2) {
  x <- runif(n, 0, 10)
  u <- rnorm(n, 0, sigma)
  y <- beta_0 + beta_1 * x + u
  coef(lm(y ~ x))
}

n_sims <- 5000

results <- map_dfr(1:n_sims, ~{
  b <- run_ols(n = 50)
  tibble(intercept = b[1], slope = b[2])
})

results |>
  summarise(
    mean_b0  = mean(intercept),
    mean_b1  = mean(slope),
    sd_b0    = sd(intercept),
    sd_b1    = sd(slope)
  )

Across 5,000 simulations, \(\hat{\beta}_1\) averages approximately 0.5 — confirming unbiasedness. We can also check the variance formula: \(\text{Var}(\hat{\beta}_1) = \sigma^2 / SST_x \approx 4 / (n \cdot \text{Var}(x))\).

# Theoretical variance of the slope: sigma^2 / E[SST_x]
# E[SST_x] ≈ n * Var(x) = 50 * (10^2/12) ≈ 416.7
theoretical_var <- 4 / (50 * 100/12)
cat("Theoretical SD of slope:", round(sqrt(theoretical_var), 4), "\n")
Theoretical SD of slope: 0.098 
cat("Simulation SD of slope: ", round(sd(results$slope), 4), "\n")
Simulation SD of slope:  0.1004 

8.2 Distribution of \(\hat{\beta}_1\)

ggplot(results, aes(slope)) +
  geom_histogram(aes(y = after_stat(density)), bins = 60,
                 fill = "#2c7be5", alpha = 0.7) +
  geom_vline(xintercept = 0.5, colour = "#e63946", linewidth = 1.2) +
  stat_function(fun = dnorm,
                args = list(mean = mean(results$slope), sd = sd(results$slope)),
                colour = "#e63946", linewidth = 1) +
  labs(x = expression(hat(beta)[1]), y = "Density",
       title = "Sampling Distribution of OLS Slope",
       subtitle = "Red line = true value (0.5); curve = fitted Normal")
Figure 1: Sampling distribution of OLS slope across 5,000 simulations.

8.3 Effect of Sample Size

map_dfr(c(30, 100, 500, 2000), function(n) {
  slopes <- replicate(n_sims, run_ols(n)[2])
  tibble(n = n, slope = slopes)
}) |>
  mutate(n_label = paste("n =", n)) |>
  ggplot(aes(slope, fill = n_label)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = 0.5, linetype = "dashed") +
  labs(x = expression(hat(beta)[1]), y = "Density", fill = "Sample size",
       title = "Sampling Distribution of OLS Slope by Sample Size",
       subtitle = "Dashed line = true slope (0.5)")
Figure 2: OLS becomes more precise as n grows — consistent with Var(β̂₁) = σ²/SSTₓ.

8.4 What Happens When SLR.4 Fails? (Endogeneity)

run_biased_ols <- function(n, beta_1 = 0.5) {
  x <- runif(n, 0, 10)
  u <- 0.8 * x + rnorm(n, 0, 1)   # u correlated with x
  y <- 2 + beta_1 * x + u
  coef(lm(y ~ x))[2]
}

slopes_biased   <- replicate(n_sims, run_biased_ols(50))
slopes_unbiased <- results$slope

tibble(
  estimate = c(slopes_biased, slopes_unbiased),
  type     = rep(c("Endogenous (biased)", "Exogenous (BLUE)"), each = n_sims)
) |>
  ggplot(aes(estimate, fill = type)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = 0.5, linetype = "dashed") +
  labs(x = expression(hat(beta)[1]), fill = "",
       title = "Endogeneity Causes Irreducible Bias",
       subtitle = "True beta_1 = 0.5; dashed line marks true value")
Figure 3: OLS is biased when the error is correlated with x. More data does not fix this.

When \(\text{Cov}(x, u) \neq 0\), OLS is systematically biased — larger \(n\) tightens the distribution around the wrong value. This motivates the search for exogenous variation (instruments, natural experiments, randomised assignment) in causal inference.
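A quick way to see that more data does not help is to repeat the endogenous simulation at several sample sizes (a sketch reusing run_biased_ols() from above): the mean of \(\hat{\beta}_1\) stays far from 0.5 while its standard deviation shrinks. With \(u = 0.8x + \text{noise}\) in this DGP, the short-regression slope converges to \(0.5 + 0.8 = 1.3\), not 0.5.

sapply(c(50, 500, 5000), function(n) {
  slopes <- replicate(500, run_biased_ols(n))
  c(n = n, mean = mean(slopes), sd = sd(slopes))    # mean stays near 1.3; sd shrinks
})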


9 Tutorials

Tutorial 4.1 Show algebraically that \(\text{Var}(\hat{\beta}_1 \mid \mathbf{x}) = \sigma^2 / SST_x\) starting from \(\hat{\beta}_1 = \beta_1 + \sum w_i u_i\) where \(w_i = (x_i - \bar{x}) / SST_x\).

Let \(S_{xx} = SST_x\) and \(w_i = (x_i - \bar{x})/S_{xx}\).

\[\text{Var}(\hat{\beta}_1 \mid \mathbf{x}) = \text{Var}\!\left(\sum w_i u_i \mid \mathbf{x}\right) = \sum w_i^2 \text{Var}(u_i \mid \mathbf{x}) \stackrel{\text{SLR.5}}{=} \sigma^2 \sum w_i^2\]

\[\sum w_i^2 = \sum \frac{(x_i - \bar{x})^2}{S_{xx}^2} = \frac{S_{xx}}{S_{xx}^2} = \frac{1}{S_{xx}}\]

Therefore \(\text{Var}(\hat{\beta}_1 \mid \mathbf{x}) = \sigma^2 / S_{xx} = \sigma^2 / SST_x\). \(\blacksquare\)
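As a numerical companion to the derivation (a sketch using the wage1 data from the wooldridge package), \(\hat{\sigma}^2 / SST_x\) reproduces the slope variance that lm() reports:

data(wage1, package = "wooldridge")
fit <- lm(lwage ~ educ, data = wage1)

SST_x <- sum((wage1$educ - mean(wage1$educ))^2)
sigma(fit)^2 / SST_x     # sigma_hat^2 / SST_x
vcov(fit)[2, 2]          # Var_hat(beta_1_hat) from lm(): the same number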

Tutorial 4.2 Using the wage1 dataset from wooldridge, regress lwage (log wage) on educ and exper. Separately regress exper on educ. Use the OVB formula to predict what the slope on educ would be if you omit exper. Verify by running the short regression.

data(wage1, package = "wooldridge")

long_fit <- lm(lwage ~ educ + exper, data = wage1)
aux_fit  <- lm(exper ~ educ, data = wage1)
short_fit <- lm(lwage ~ educ, data = wage1)

beta2  <- coef(long_fit)["exper"]
delta1 <- coef(aux_fit)["educ"]

cat("Long model  beta_educ =", round(coef(long_fit)["educ"], 4), "\n")
Long model  beta_educ = 0.0979 
cat("OVB prediction        =", round(beta2 * delta1, 4), "\n")
OVB prediction        = -0.0152 
cat("Short model beta_educ =", round(coef(short_fit)["educ"], 4), "\n")
Short model beta_educ = 0.0827 
cat("Matches long + OVB?   =",
    round(coef(long_fit)["educ"] + beta2 * delta1, 4), "\n")
Matches long + OVB?   = 0.0827 

The short-model coefficient on educ equals the long-model coefficient plus the predicted OVB exactly. Note that delta1 < 0 (in this sample, workers with more education tend to have fewer years of labour-market experience), so omitting experience understates the return to education when \(\beta_2 > 0\).

Tutorial 4.3 Adapt the Monte Carlo simulation to demonstrate OVB. Generate data from: \(y_i = 1 + 0.5x_{1i} + 0.8x_{2i} + u_i\) where \(x_{2i} = 0.7x_{1i} + v_i\). Run 2,000 simulations. In each, estimate both the long model and the short model. Plot the sampling distributions of \(\tilde{\beta}_1\) (short) and \(\hat{\beta}_1\) (long) and compare to the true value 0.5.

set.seed(2024)
n_sims2 <- 2000

simulate_ovb <- function(n = 200) {
  x1 <- rnorm(n)
  x2 <- 0.7 * x1 + rnorm(n, sd = sqrt(1 - 0.7^2))
  u  <- rnorm(n, sd = 0.5)
  y  <- 1 + 0.5 * x1 + 0.8 * x2 + u

  long_b1  <- coef(lm(y ~ x1 + x2))[2]
  short_b1 <- coef(lm(y ~ x1))[2]
  tibble(long = long_b1, short = short_b1)
}

sim_results <- map_dfr(1:n_sims2, ~simulate_ovb())

sim_results |>
  pivot_longer(everything(), names_to = "model", values_to = "slope") |>
  ggplot(aes(slope, fill = model)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = 0.5, linetype = "dashed", linewidth = 1) +
  labs(x = expression(hat(beta)[1]), y = "Density", fill = "Model",
       title = "Omitted Variable Bias in Monte Carlo",
       subtitle = "Dashed = true value 0.5; short model is biased upward")

OVB: the short-model slope is centred above the true value; the long model recovers it.

The long-model estimator is centred at 0.5. The short-model estimator is centred at \(0.5 + 0.8 \times 0.7 = 1.06\) — the OVB formula in action.