Chapter 8: Large Sample Properties

Consistency, the Law of Large Numbers, and asymptotic normality

NoteLearning Objectives

By the end of this chapter you will be able to:

  • State and apply the Weak Law of Large Numbers (WLLN) and the Central Limit Theorem (CLT)
  • Apply the Continuous Mapping Theorem (CMT) and Slutsky’s theorem to derive limiting distributions
  • Derive the consistency of OLS in both scalar and matrix form
  • Derive the asymptotic normality of OLS from the CLT
  • Explain why large-sample theory validates inference without normality of errors
  • Demonstrate CLT convergence and OLS consistency via simulation

1 Why Large-Sample Theory?

The finite-sample results from Chapters 4–5 relied on MLR.6 — normally distributed errors. In practice, errors are rarely normal. How can we justify using \(t\)- and \(F\)-distributions for inference?

The answer is asymptotic (large-sample) theory: as \(n \to \infty\), OLS estimators behave like normal random variables regardless of the error distribution. This is the Central Limit Theorem at work.

Large-sample theory also tells us:

  • OLS is consistent under conditions weaker than unbiasedness requires
  • When MLR.5 (homoskedasticity) fails, the standard SE formula is wrong — we need robust standard errors
  • For time-series data with serial correlation, we need HAC standard errors

Both of these fixes (Chapters 9–10) are justified by asymptotic theory.


2 Convergence Concepts

2.1 Convergence in Probability and Consistency

A sequence of random variables \(\{W_n\}\) converges in probability to a constant \(c\) if:

\[\forall \varepsilon > 0: \quad \lim_{n \to \infty} P(|W_n - c| > \varepsilon) = 0\]

Written \(W_n \xrightarrow{p} c\) or \(\text{plim}_{n \to \infty} W_n = c\). An estimator \(\hat{\theta}_n\) is consistent for \(\theta\) if \(\hat{\theta}_n \xrightarrow{p} \theta\).

2.2 Convergence in Distribution

\(\{W_n\}\) converges in distribution to a random variable \(W\) if for every continuity point \(w\) of the CDF of \(W\):

\[\lim_{n \to \infty} P(W_n \leq w) = P(W \leq w)\]

Written \(W_n \xrightarrow{d} W\). This is weaker than convergence in probability: if \(W_n \xrightarrow{p} c\), then \(W_n \xrightarrow{d} c\), but not vice versa.

2.3 Two Essential Results

NoteContinuous Mapping Theorem (CMT)

If \(W_n \xrightarrow{p} c\) and \(g(\cdot)\) is continuous at \(c\), then \(g(W_n) \xrightarrow{p} g(c)\).

Implication: \(\text{plim}(A_n^{-1}) = [\text{plim}(A_n)]^{-1}\) if \(\text{plim}(A_n)\) is invertible. This is what allows us to carry the matrix inverse through the probability limit.

NoteSlutsky’s Theorem

If \(A_n \xrightarrow{d} A\) and \(B_n \xrightarrow{p} c\) (a constant), then:

\[A_n + B_n \xrightarrow{d} A + c, \qquad A_n B_n \xrightarrow{d} cA, \qquad A_n / B_n \xrightarrow{d} A/c\]

Implication: we can substitute a consistent estimator for an unknown population parameter in a limiting distribution without changing the limit.


3 The Law of Large Numbers

NoteWeak Law of Large Numbers (WLLN)

If \(X_1, X_2, \ldots\) are i.i.d. with \(E[X_i] = \mu < \infty\) and \(\text{Var}(X_i) < \infty\), then:

\[\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{p} \mu\]

Proof sketch. By Chebyshev’s inequality:

\[P(|\bar{X}_n - \mu| > \varepsilon) \leq \frac{\text{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0 \text{ as } n \to \infty. \qquad \blacksquare\]

The Strong LLN (Kolmogorov) guarantees almost-sure convergence (\(P(\lim \bar{X}_n = \mu) = 1\)) under the same conditions; the WLLN is sufficient for econometrics.

true_mu <- 3.5
n_max   <- 2000

tibble(
  n        = 1:n_max,
  cum_mean = cumsum(sample(1:6, n_max, replace = TRUE)) / 1:n_max
) |>
  ggplot(aes(n, cum_mean)) +
  geom_line(colour = "#2c7be5") +
  geom_hline(yintercept = true_mu, colour = "#e63946", linetype = "dashed") +
  labs(x = "Sample size (n)", y = "Running sample mean",
       title = "Law of Large Numbers",
       subtitle = "Simulated fair die rolls. Red dashed = true mean (3.5)")
Figure 1: LLN: sample means converge to the population mean as n grows.

4 The Central Limit Theorem

NoteCentral Limit Theorem (CLT)

If \(X_1, X_2, \ldots\) are i.i.d. with mean \(\mu\) and finite variance \(\sigma^2 > 0\), then:

\[Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1)\]

Equivalently: \(\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)\).

The CLT is remarkable because it says the sampling distribution of \(\bar{X}_n\) is approximately normal for any parent distribution, as long as \(n\) is large enough and \(\sigma^2\) is finite.

Multivariate extension. If \(\mathbf{X}_1, \ldots, \mathbf{X}_n\) are i.i.d. \(p\)-vectors with mean \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\):

\[\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})\]

This is the version we use for deriving asymptotic normality of the OLS estimator.

n_sims <- 10000

simulate_clt <- function(dist_fn, n_vals = c(5, 30, 100)) {
  map_dfr(n_vals, function(n) {
    tibble(n = n, x_bar = replicate(n_sims, mean(dist_fn(n))))
  })
}

bind_rows(
  simulate_clt(\(n) rexp(n, rate = 1))      |> mutate(dist = "Exponential(1)"),
  simulate_clt(\(n) rbinom(n, 1, 0.2))      |> mutate(dist = "Bernoulli(0.2)"),
  simulate_clt(\(n) runif(n, 0, 1))         |> mutate(dist = "Uniform(0, 1)")
) |>
  group_by(dist, n) |>
  mutate(z = (x_bar - mean(x_bar)) / sd(x_bar)) |>
  ggplot(aes(z)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50,
                 fill = "#2c7be5", alpha = 0.6) +
  stat_function(fun = dnorm, colour = "#e63946", linewidth = 1) +
  facet_grid(dist ~ paste("n =", n)) +
  labs(x = "Standardised sample mean (Z)", y = "Density",
       title = "Central Limit Theorem",
       subtitle = "Red = N(0,1). Columns = sample sizes. All three converge by n = 100.")
Figure 2: CLT: sampling distributions from three non-normal parents approach N(0,1) as n grows.

5 Consistency of OLS

5.1 Scalar Case

Recall from Chapter 4: \(\hat{\beta}_1 = \beta_1 + \frac{\frac{1}{n}\sum(x_i - \bar{x})u_i}{\frac{1}{n}\sum(x_i - \bar{x})^2}\).

By the WLLN (assuming i.i.d. observations):

\[\frac{1}{n}\sum(x_i - \bar{x})^2 \xrightarrow{p} \text{Var}(x) \equiv \sigma_x^2 > 0\]

\[\frac{1}{n}\sum(x_i - \bar{x})u_i \xrightarrow{p} \text{Cov}(x, u)\]

By the CMT (division is continuous away from zero):

\[\text{plim}(\hat{\beta}_1) = \beta_1 + \frac{\text{Cov}(x, u)}{\text{Var}(x)}\]

OLS is consistent if and only if \(\text{Cov}(x, u) = 0\).

5.2 Matrix Case

Write \(\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \left(\frac{\mathbf{X}'\mathbf{X}}{n}\right)^{-1}\left(\frac{\mathbf{X}'\mathbf{u}}{n}\right)\).

Define \(\mathbf{Q}_{XX} = E[\mathbf{x}_i\mathbf{x}_i']\) (the population second-moment matrix of the regressors). By the WLLN applied to each element:

\[\frac{\mathbf{X}'\mathbf{X}}{n} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i' \xrightarrow{p} \mathbf{Q}_{XX}\]

By MLR.4 (\(E[\mathbf{u} \mid \mathbf{X}] = \mathbf{0}\), implying \(E[\mathbf{x}_i u_i] = \mathbf{0}\)):

\[\frac{\mathbf{X}'\mathbf{u}}{n} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i u_i \xrightarrow{p} E[\mathbf{x}_i u_i] = \mathbf{0}\]

By the CMT (matrix inversion is continuous at invertible matrices, and \(\mathbf{Q}_{XX}\) is invertible by MLR.3):

\[\text{plim}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta} + \mathbf{Q}_{XX}^{-1} \cdot \mathbf{0} = \boldsymbol{\beta} \qquad \blacksquare\]

ImportantConsistency vs. Unbiasedness
  • Unbiasedness: \(E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\) for any fixed \(n\) — requires \(E[\mathbf{u} \mid \mathbf{X}] = \mathbf{0}\) (MLR.4)
  • Consistency: \(\hat{\boldsymbol{\beta}} \xrightarrow{p} \boldsymbol{\beta}\) — requires only \(E[\mathbf{x}_i u_i] = \mathbf{0}\) (zero correlation, not zero conditional mean)

Zero conditional mean implies zero correlation (by LIE), but not vice versa. So consistency holds under strictly weaker conditions than unbiasedness. An estimator can be biased in finite samples but consistent (many instrumental variables estimators are in this category).


6 Asymptotic Normality of OLS

6.1 Derivation

The key is to express \(\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})\) as a function of a sample average and apply the CLT.

\[\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \left(\frac{\mathbf{X}'\mathbf{X}}{n}\right)^{-1} \cdot \left(\frac{\mathbf{X}'\mathbf{u}}{\sqrt{n}}\right) = \left(\frac{\mathbf{X}'\mathbf{X}}{n}\right)^{-1} \cdot \left(\frac{1}{\sqrt{n}}\sum_{i=1}^n \mathbf{x}_i u_i\right)\]

Step 1: Apply the CLT to the last factor. Under MLR.1–5, \(\{\mathbf{x}_i u_i\}\) are i.i.d. with mean \(\mathbf{0}\) and covariance matrix:

\[E[\mathbf{x}_i u_i \mathbf{x}_i' u_i] = E[\mathbf{x}_i \mathbf{x}_i' u_i^2] = \sigma^2 E[\mathbf{x}_i \mathbf{x}_i'] = \sigma^2 \mathbf{Q}_{XX}\]

(using MLR.5: \(E[u_i^2 \mid \mathbf{x}_i] = \sigma^2\)). By the multivariate CLT:

\[\frac{1}{\sqrt{n}}\sum_{i=1}^n \mathbf{x}_i u_i \xrightarrow{d} \mathcal{N}(\mathbf{0},\, \sigma^2 \mathbf{Q}_{XX})\]

Step 2: Apply Slutsky’s theorem. Since \(\mathbf{X}'\mathbf{X}/n \xrightarrow{p} \mathbf{Q}_{XX}\) (consistent, hence converges in probability to a constant):

\[\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \underbrace{\left(\frac{\mathbf{X}'\mathbf{X}}{n}\right)^{-1}}_{\xrightarrow{p}\;\mathbf{Q}_{XX}^{-1}} \cdot \underbrace{\left(\frac{1}{\sqrt{n}}\sum_{i=1}^n \mathbf{x}_i u_i\right)}_{\xrightarrow{d}\;\mathcal{N}(\mathbf{0},\,\sigma^2\mathbf{Q}_{XX})} \xrightarrow{d} \mathcal{N}\!\left(\mathbf{0},\; \sigma^2 \mathbf{Q}_{XX}^{-1}\right)\]

Step 3: Normalise. Dividing by \(\sqrt{n}\):

\[\hat{\boldsymbol{\beta}} \overset{a}{\sim} \mathcal{N}\!\left(\boldsymbol{\beta},\; \frac{\sigma^2}{n} \mathbf{Q}_{XX}^{-1}\right)\]

Since \(\sigma^2 \mathbf{Q}_{XX}^{-1}/n \approx \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\) for large \(n\), we recover:

\[\boxed{\hat{\boldsymbol{\beta}} \overset{a}{\sim} \mathcal{N}\!\left(\boldsymbol{\beta},\; \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\right)}\]

This is the same variance formula as the exact finite-sample result under normality (Chapter 4) — but now derived without the normality assumption, valid for all large \(n\).

6.2 Asymptotic t-Tests

For a single coefficient:

\[\frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)} \overset{a}{\sim} \mathcal{N}(0,1) \approx t(n-k-1)\]

For large \(n\), \(t(n-k-1) \approx \mathcal{N}(0,1)\), so the \(t\)-critical values converge to \(z\)-critical values (e.g., \(t_{0.025,\infty} = 1.96\)). In practice we always use the \(t(n-k-1)\) critical values, which are conservative (heavier tails) and approach \(\mathcal{N}(0,1)\) as \(n\) grows.

6.3 What Breaks Without Homoskedasticity?

If MLR.5 fails, let \(\omega_i^2 = E[u_i^2 \mid \mathbf{x}_i]\) (the conditional variance, allowed to vary). Then:

\[\text{Cov}\!\left(\frac{1}{\sqrt{n}}\sum_{i=1}^n \mathbf{x}_i u_i\right) \xrightarrow{p} E[\mathbf{x}_i \mathbf{x}_i' u_i^2] = E[\omega_i^2 \mathbf{x}_i \mathbf{x}_i'] \neq \sigma^2 \mathbf{Q}_{XX}\]

The asymptotic distribution of \(\hat{\boldsymbol{\beta}}\) becomes:

\[\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \xrightarrow{d} \mathcal{N}\!\left(\mathbf{0},\; \mathbf{Q}_{XX}^{-1} \boldsymbol{\Omega}\, \mathbf{Q}_{XX}^{-1}\right)\]

where \(\boldsymbol{\Omega} = E[\omega_i^2 \mathbf{x}_i \mathbf{x}_i']\) is the “meat” of the sandwich. The homoskedastic formula \(\sigma^2 \mathbf{Q}_{XX}^{-1}\) is now wrong. We need the heteroskedasticity-robust (sandwich) estimator of the variance — the topic of Chapter 9.


7 Demonstrating Consistency via Simulation

true_beta <- 0.5

results_n <- map_dfr(c(25, 100, 500, 2000, 10000), function(n) {
  slopes <- replicate(1000, {
    x <- runif(n, 0, 10)
    u <- rexp(n, rate = 1) - 1   # non-normal, mean zero
    y <- 1 + true_beta * x + u
    coef(lm(y ~ x))[2]
  })
  tibble(n = n, slope = slopes)
})

results_n |>
  group_by(n) |>
  summarise(mean = mean(slope), sd = sd(slope)) |>
  knitr::kable(digits = 4,
               caption = "OLS slope: mean and SD by sample size. Non-normal (Exponential) errors.")
OLS slope: mean and SD by sample size. Non-normal (Exponential) errors.
n mean sd
25 0.4983 0.0710
100 0.5020 0.0345
500 0.4998 0.0155
2000 0.5000 0.0078
10000 0.4999 0.0034
Figure 3: OLS consistency with non-normal errors: distribution concentrates on true value as n grows.
results_n |>
  ggplot(aes(slope, fill = factor(n))) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = true_beta, linetype = "dashed", colour = "#e63946") +
  scale_fill_viridis_d(name = "n") +
  labs(x = expression(hat(beta)[1]), y = "Density",
       title = "OLS Consistency with Non-Normal Errors",
       subtitle = "True beta = 0.5. Errors are mean-zero Exponential.")
Figure 4: The sampling distribution of β̂₁ narrows around the true value as n grows.

The distribution becomes increasingly concentrated around 0.5 as \(n\) grows, confirming consistency with non-normal errors.


8 Tutorials

Tutorial 8.1 Show via simulation that OLS is inconsistent when \(\text{Cov}(x, u) \neq 0\). Let \(u_i = 0.7 x_i + v_i\) where \(v_i\) is independent noise. Before simulating, compute the theoretical probability limit.

# plim(b1) = beta_1 + Cov(x,u)/Var(x)
# Cov(x, 0.7x + v) = 0.7*Var(x); Var(x) = 1 for N(0,1)
# plim = 0.5 + 0.7 = 1.2
cat("Theoretical plim(beta_hat):", 0.5 + 0.7, "\n\n")
Theoretical plim(beta_hat): 1.2 
for (n in c(100, 1000, 10000)) {
  slopes <- replicate(2000, {
    x <- rnorm(n); v <- rnorm(n)
    u <- 0.7 * x + v
    y <- 1 + 0.5 * x + u
    coef(lm(y ~ x))[2]
  })
  cat(sprintf("n = %6d:  mean slope = %.4f\n", n, mean(slopes)))
}
n =    100:  mean slope = 1.2011
n =   1000:  mean slope = 1.1999
n =  10000:  mean slope = 1.2002

Even at \(n = 10{,}000\), the estimate converges to 1.2 — not 0.5. Inconsistency is not solved by more data.

Tutorial 8.2 The OLS slope can be written as \(\hat{\beta}_1 = \beta_1 + \overline{(x_i - \bar{x})u_i} / \overline{(x_i - \bar{x})^2}\) where the bars denote averages. Use the WLLN and CMT to prove consistency in the scalar case. State precisely which assumptions you invoke at each step.

Step 1 (WLLN on denominator). Under random sampling (MLR.2) and \(\text{Var}(x) < \infty\) (MLR.3): \[\frac{1}{n}\sum(x_i - \bar{x})^2 \xrightarrow{p} E[(x_i - \mu_x)^2] = \text{Var}(x) > 0\] The inequality is MLR.3 (no perfect collinearity).

Step 2 (WLLN on numerator). Under random sampling and MLR.4 (\(E[u_i x_i] = E[u_i]\mu_x = 0\) since MLR.4 implies \(E[u_i] = 0\) and \(\text{Cov}(x_i, u_i) = 0\)): \[\frac{1}{n}\sum(x_i - \bar{x})u_i \xrightarrow{p} \text{Cov}(x_i, u_i) = 0\]

Step 3 (CMT). Since division is a continuous function and \(\text{Var}(x) \neq 0\): \[\hat{\beta}_1 = \beta_1 + \frac{\text{numerator}}{\text{denominator}} \xrightarrow{p} \beta_1 + \frac{0}{\text{Var}(x)} = \beta_1 \qquad \blacksquare\]

Tutorial 8.3 Simulate the CLT for the \(\chi^2(3)\) distribution (highly non-normal, right-skewed). For \(n \in \{5, 30, 100\}\), compute 10,000 sample means, standardise them, and overlay a \(\mathcal{N}(0,1)\) density. At which \(n\) does the approximation become adequate?

map_dfr(c(5, 30, 100), function(n) {
  x_bars <- replicate(10000, mean(rchisq(n, df = 3)))
  tibble(n = n, z = (x_bars - 3) / (sqrt(6/n)))   # mu=3, sigma^2=6
}) |>
  ggplot(aes(z)) +
  geom_histogram(aes(y = after_stat(density)), bins = 60,
                 fill = "#2c7be5", alpha = 0.6) +
  stat_function(fun = dnorm, colour = "#e63946", linewidth = 1) +
  facet_wrap(~paste("n =", n)) +
  labs(x = "Standardised sample mean", y = "Density",
       title = "CLT for Chi-Squared(3) — Skewed Parent Distribution",
       subtitle = "Red = N(0,1). Approximation is adequate by n = 30.")

CLT for chi-squared(3) distribution: convergence to normality.

For \(\chi^2(3)\): \(\mu = 3\), \(\sigma^2 = 2\times 3 = 6\). The approximation is adequate by \(n = 30\); at \(n = 5\) the skewness of the parent distribution is still visible.