---
title: "Chapter 6: Multiple Regression and Specification"
subtitle: "Controlling for confounders, OVB, and model selection"
---
```{r}
#| label: setup
#| include: false
library(tidyverse)
library(broom)
library(modelsummary)
library(car)
library(lmtest)
library(wooldridge)
theme_set(theme_minimal(base_size = 13))
options(digits = 4, scipen = 999)
data("wage1", package = "wooldridge")
```
::: {.callout-note}
## Learning Objectives
By the end of this chapter you will be able to:
- Specify and interpret multiple regression models with ceteris paribus language
- Apply the OVB formula and reason about the direction of bias
- Derive and interpret coefficients in log-level, level-log, and log-log models
- Use the RESET test to detect functional form misspecification
- Compare model specifications using adjusted $R^2$, AIC, BIC, and HQ criteria
- Derive and distinguish confidence intervals for conditional means from prediction intervals
:::
---
## The Multiple Regression Model
The **multiple regression model** with $k$ regressors is:
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + u_i$$
**Interpretation of $\beta_j$:** The expected change in $y$ for a one-unit increase in $x_j$, **holding all other regressors constant** (*ceteris paribus*). This ceteris paribus interpretation is what makes regression useful for causal inference — including a confounder $x_2$ in the regression means the estimated effect of $x_1$ is net of any variation in $x_2$.
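To see the "net of" logic concretely, here is a minimal sketch using the `wage1` data loaded above. By the Frisch-Waugh-Lovell theorem, the multiple-regression coefficient on education equals the slope from regressing wage on the part of education left unexplained by the other regressors:
```{r}
#| label: ceteris-paribus-sketch
# Partial out exper and tenure from educ, then regress wage on the residual
educ_resid <- resid(lm(educ ~ exper + tenure, data = wage1))
coef(lm(wage1$wage ~ educ_resid))["educ_resid"]
# Identical to the multiple-regression coefficient on educ
coef(lm(wage ~ educ + exper + tenure, data = wage1))["educ"]
```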
---
## Omitted Variable Bias (OVB)
If we omit a relevant variable from the regression, OLS is **biased**. The full algebraic derivation was given in Chapter 4. Here we state the formula and apply it.
### The OVB Formula
Suppose the true model is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$ with $E[u \mid x_1, x_2] = 0$. If we estimate the **short regression** (omitting $x_2$):
$$E[\tilde{\beta}_1] = \beta_1 + \beta_2 \cdot \delta_{21}, \qquad \delta_{21} = \frac{\text{Cov}(x_1, x_2)}{\text{Var}(x_1)}$$
The **omitted variable bias** $\beta_2 \delta_{21}$ is nonzero if and only if both (i) $x_2$ affects $y$ and (ii) $x_1$ and $x_2$ are correlated.
| $\text{sign}(\beta_2)$ | $\text{sign}(\delta_{21})$ | Direction of bias |
|---|---|---|
| $+$ | $+$ | Upward: $E[\tilde{\beta}_1] > \beta_1$ |
| $+$ | $-$ | Downward |
| $-$ | $+$ | Downward |
| $-$ | $-$ | Upward |
### OVB Example: Omitting Experience in a Wage Regression
```{r}
#| label: ovb-demonstration
fit_short <- lm(wage ~ educ, data = wage1)
fit_full <- lm(wage ~ educ + exper + tenure, data = wage1)
# Auxiliary regressions: each omitted variable regressed on educ
fit_delta_e <- lm(exper ~ educ, data = wage1)
fit_delta_t <- lm(tenure ~ educ, data = wage1)
# With two omitted variables, the bias is the sum of beta_j * delta_j terms
bias <- coef(fit_full)["exper"] * coef(fit_delta_e)["educ"] +
  coef(fit_full)["tenure"] * coef(fit_delta_t)["educ"]
cat("Short regression beta_educ:", round(coef(fit_short)["educ"], 4), "\n")
cat("Long regression beta_educ: ", round(coef(fit_full)["educ"], 4), "\n")
cat("OVB formula (long + bias): ", round(coef(fit_full)["educ"] + bias, 4), "\n")
```
The OVB formula recovers the short-model coefficient exactly. With two omitted variables, the bias is the sum of one $\hat{\beta}_j \delta_j$ term per omitted regressor. Omitting experience biases the return to education because experience affects wages and is correlated with education (more education → less time accumulating work experience, so $\delta_{exper} < 0$; combined with $\beta_{exper} > 0$, this term pushes the bias downward here).
---
## Building a Wage Model: Step by Step
Presenting results across multiple model specifications lets readers see how estimates change as controls are added — which reveals both OVB and robustness.
```{r}
#| label: model-comparison
models <- list(
"(1) Education only" = lm(wage ~ educ, data = wage1),
"(2) + Experience" = lm(wage ~ educ + exper, data = wage1),
"(3) + Tenure" = lm(wage ~ educ + exper + tenure, data = wage1),
"(4) + Gender" = lm(wage ~ educ + exper + tenure + female, data = wage1),
"(5) + Race" = lm(wage ~ educ + exper + tenure + female + nonwhite, data = wage1)
)
modelsummary(
models,
stars = TRUE,
gof_map = c("nobs", "r.squared", "adj.r.squared"),
title = "Returns to Education: Progressive Model Specifications",
coef_rename = c(
"(Intercept)" = "Intercept",
"educ" = "Education (years)",
"exper" = "Experience (years)",
"tenure" = "Tenure (years)",
"female" = "Female (= 1)",
"nonwhite" = "Non-white (= 1)"
)
)
```
---
## Log Functional Forms
The **level-level** model $y = \beta_0 + \beta_1 x + u$ is the baseline. But wages, prices, and income are often better modelled in logarithms. Understanding how to interpret log transformations — and why they work — is essential.
### The Log Approximation
The key identity is:
$$\Delta \ln(z) \approx \frac{\Delta z}{z}$$
for small $\Delta z$. This follows from the first-order Taylor expansion: $\ln(z + \Delta z) \approx \ln(z) + \Delta z / z$. Rearranging: $\Delta \ln(z) \approx \Delta z / z$. Multiplying both sides by 100:
$$100 \cdot \Delta \ln(z) \approx \frac{100 \cdot \Delta z}{z} = \% \Delta z$$
So $100 \times$ the change in $\ln(z)$ is approximately the percentage change in $z$. The approximation is exact only in the limit $\Delta z \to 0$; for discrete changes it deteriorates as $\Delta z$ grows.
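A quick numerical check (toy percentage changes, not data) shows how the approximation deteriorates:
```{r}
#| label: log-approx-check
# 100 * log(1 + pct) vs the exact percentage change, for growing changes
pct <- c(0.01, 0.05, 0.10, 0.25, 0.50)
tibble(exact_pct = 100 * pct, log_approx = round(100 * log(1 + pct), 2))
```
At 1% the two are nearly identical; at 50% the log approximation understates the change by almost 10 percentage points.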
### The Four Log Models
| Model | Specification | Coefficient interpretation |
|---|---|---|
| Level-level | $y = \beta_0 + \beta_1 x$ | $\Delta y = \beta_1 \Delta x$ (absolute) |
| Log-level | $\ln y = \beta_0 + \beta_1 x$ | $100\beta_1 \Delta x \approx \% \Delta y$ |
| Level-log | $y = \beta_0 + \beta_1 \ln x$ | $\Delta y = \beta_1 (\Delta x / x) \approx (\beta_1 / 100) \times \% \Delta x$ |
| Log-log | $\ln y = \beta_0 + \beta_1 \ln x$ | $\beta_1 = \% \Delta y / \% \Delta x$ (elasticity) |
### Deriving the Log-Level Interpretation
If $\ln y = \beta_0 + \beta_1 x + u$, then:
$$\frac{d(\ln y)}{dx} = \beta_1 \implies \frac{1}{y}\frac{dy}{dx} = \beta_1 \implies \frac{dy}{y} = \beta_1 \, dx$$
So a one-unit increase in $x$ is associated with a $\beta_1 \times 100\%$ approximate change in $y$ (a **semi-elasticity**). The approximation breaks down for large $\beta_1$; the exact percentage change is $100(e^{\beta_1} - 1)\%$.
### Deriving the Log-Log Interpretation
If $\ln y = \beta_0 + \beta_1 \ln x + u$, then:
$$\frac{d(\ln y)}{d(\ln x)} = \beta_1 \implies \frac{dy/y}{dx/x} = \beta_1$$
$\beta_1$ is the **elasticity** of $y$ with respect to $x$: a 1% increase in $x$ is associated with a $\beta_1\%$ change in $y$. Elasticities are unit-free, which is why log-log models are standard for demand equations, production functions, and international trade.
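As a quick illustration, here is a sketch using `hprice1` (the dataset Tutorial 6.1 analyses in full): the slope in a log-log regression of house price on lot size is directly an elasticity.
```{r}
#| label: loglog-sketch
data("hprice1", package = "wooldridge")
# Elasticity of house price with respect to lot size
coef(lm(lprice ~ llotsize, data = hprice1))["llotsize"]
```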
```{r}
#| label: log-models
# Level-level and log-level wage models
fit_level <- lm(wage ~ educ + exper + tenure, data = wage1)
fit_log <- lm(lwage ~ educ + exper + tenure, data = wage1)
modelsummary(
list("Level-level" = fit_level, "Log-level" = fit_log),
stars = TRUE,
gof_map = c("nobs", "r.squared"),
title = "Level vs Log-Level Wage Equation"
)
```
```{r}
#| label: log-interpretation
b_educ_log <- coef(fit_log)["educ"]
cat("Log-level model: one extra year of education is associated with\n")
cat(" approx:", round(100 * b_educ_log, 2), "% change in wage\n")
cat(" exact: ", round(100 * (exp(b_educ_log) - 1), 2), "% change in wage\n")
```
The approximate and exact figures differ by less than 1 percentage point here — the approximation is reasonable for small $\beta_1$, but always report the exact figure for precision.
### Log-Differencing in Time Series
For time series data, log-differencing converts levels to approximate growth rates:
$$\Delta \ln y_t = \ln y_t - \ln y_{t-1} \approx \frac{y_t - y_{t-1}}{y_{t-1}} = \text{growth rate of } y_t$$
This is standard for GDP, prices, and exchange rates. A second benefit: log-differencing often eliminates **non-stationarity** (trending series), which we study formally in Chapter 12.
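A minimal sketch with a hypothetical five-period series (toy numbers, not real data):
```{r}
#| label: log-diff-sketch
y <- c(100, 103, 107, 106, 110)          # hypothetical levels
log_growth   <- diff(log(y))             # approximate growth rates
exact_growth <- diff(y) / head(y, -1)    # exact growth rates
round(cbind(log_growth, exact_growth), 4)
```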
---
## Specification Testing: The RESET Test
The Ramsey **RESET** (Regression Equation Specification Error Test) detects functional form misspecification: if the model is correctly specified, higher powers of $\hat{y}$ should not be significant when added to the regression.
**The test:** Estimate the original model to get $\hat{y}_i$. Then augment with $\hat{y}_i^2$ (and possibly $\hat{y}_i^3$) and test their joint significance using an $F$-test.
$$H_0: \text{model is correctly specified} \quad \text{vs.} \quad H_1: \text{functional form misspecification}$$
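Before reaching for the packaged version, it is worth seeing the mechanics by hand. The sketch below augments the wage model with powers of its fitted values and runs the joint $F$-test; it should replicate `resettest()` with `power = 2:3` and `type = "fitted"`.
```{r}
#| label: reset-by-hand
fit0 <- lm(wage ~ educ + exper + tenure, data = wage1)
yhat <- fitted(fit0)
# Augment with squared and cubed fitted values, then joint F-test
fit_aug <- lm(wage ~ educ + exper + tenure + I(yhat^2) + I(yhat^3),
              data = wage1)
anova(fit0, fit_aug)
```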
```{r}
#| label: reset-test
fit_wage <- lm(wage ~ educ + exper + tenure, data = wage1)
fit_log_wage <- lm(lwage ~ educ + exper + tenure, data = wage1)
# RESET test via lmtest package
resettest(fit_wage, power = 2:3, type = "fitted")
resettest(fit_log_wage, power = 2:3, type = "fitted")
```
In these data the RESET test rejects the level-level model far more decisively than the log-level model, consistent with wages being better modelled in logs (Tutorial 6.3 probes the log model further). A significant RESET does not tell you *how* to fix the specification, only that the current one is wrong.
---
## Model Selection Criteria
Adding regressors never decreases $R^2$, so $R^2$ alone cannot guide specification choice. Better criteria penalise model complexity.
### Adjusted $R^2$
$$\bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)}$$
$\bar{R}^2$ rises when a new variable is added if and only if its $|t|$ statistic exceeds 1. Unlike $R^2$, $\bar{R}^2$ can decrease.
### Information Criteria
Define $\ell = n\ln(SSR/n)$, which equals $-2$ times the maximised Gaussian log-likelihood up to an additive constant, and $p = k+1$ as the number of parameters:
$$\text{AIC} = \ell + 2p$$
$$\text{BIC} = \ell + p\ln(n)$$
$$\text{HQ} = \ell + 2p\ln(\ln(n))$$
**Penalty ordering:** $\text{AIC} \leq \text{HQ} \leq \text{BIC}$ once $n \geq 16$ (AIC $\leq$ HQ requires $2\ln\ln n \geq 2$, i.e. $n \geq e^e \approx 15.2$). The Hannan-Quinn criterion (HQ) is consistent (like BIC) but penalises less aggressively. In practice:
- AIC tends to overfit (selects too many regressors)
- BIC is consistent — selects the true model as $n \to \infty$
- HQ lies between them
**Lower is better** for all three criteria; compare across models with the same dependent variable.
```{r}
#| label: aic-bic-hq
n <- nrow(wage1)
map_dfr(models, function(m) {
ssr <- sum(resid(m)^2)
p <- length(coef(m))
ell <- n * log(ssr / n)
tibble(
AIC = ell + 2 * p,
HQ = ell + 2 * p * log(log(n)),
BIC = ell + p * log(n),
`Adj R2` = glance(m)$adj.r.squared
)
}, .id = "Model")
```
::: {.callout-important}
## Model Selection vs. Causal Inference
Information criteria are tools for **prediction accuracy**. For **causal inference**, we choose controls based on economic theory about which variables could confound the relationship of interest — not data-driven selection rules. Including an irrelevant control does not bias OLS but wastes a degree of freedom; omitting a relevant confounder causes bias.
:::
---
## Confidence Intervals vs. Prediction Intervals
A common confusion in applied work is between:
1. **Confidence interval (CI) for the conditional mean** $E[y \mid \mathbf{x}_0]$
2. **Prediction interval (PI) for a new observation** $y_0 = \mathbf{x}_0'\boldsymbol{\beta} + u_0$
### Confidence Interval for $E[y \mid \mathbf{x}_0]$
The point estimate of $E[y \mid \mathbf{x}_0] = \mathbf{x}_0'\boldsymbol{\beta}$ is $\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}$.
The variance of this estimate is:
$$\text{Var}(\hat{y}_0 \mid \mathbf{X}, \mathbf{x}_0) = \mathbf{x}_0' \text{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) \mathbf{x}_0 = \sigma^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0$$
**Derivation via reparameterisation:** Define a new parameter $\theta = \mathbf{x}_0'\boldsymbol{\beta}$. Reparameterise the model so that $\theta$ is one of the coefficients (by centring around $\mathbf{x}_0$). The OLS estimate of $\theta$ is $\hat{y}_0$, and its standard error is $\text{se}(\hat{y}_0) = \hat{\sigma}\sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}$. The CI is:
$$\hat{y}_0 \pm t_{\alpha/2, n-k-1} \cdot \text{se}(\hat{y}_0)$$
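The reparameterisation is easy to verify in R. Here is a sketch at a hypothetical point $\mathbf{x}_0 = (16, 10, 5)$: centring each regressor at $\mathbf{x}_0$ makes the intercept of the recentred regression equal $\hat{y}_0$, with exactly the standard error we need.
```{r}
#| label: reparam-check
# Centre each regressor at x0 = (educ = 16, exper = 10, tenure = 5)
fit_c <- lm(lwage ~ I(educ - 16) + I(exper - 10) + I(tenure - 5), data = wage1)
coef(summary(fit_c))["(Intercept)", c("Estimate", "Std. Error")]
# Cross-check against predict() at the same point
fit_u <- lm(lwage ~ educ + exper + tenure, data = wage1)
pr <- predict(fit_u, newdata = tibble(educ = 16, exper = 10, tenure = 5),
              se.fit = TRUE)
c(fit = unname(pr$fit), se.fit = pr$se.fit)
```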
### Prediction Interval for $y_0$
A new observation is $y_0 = \mathbf{x}_0'\boldsymbol{\beta} + u_0$. The forecast error is:
$$\hat{e}_0 = y_0 - \hat{y}_0 = (\mathbf{x}_0'\boldsymbol{\beta} + u_0) - \mathbf{x}_0'\hat{\boldsymbol{\beta}} = u_0 - \mathbf{x}_0'(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$$
Since $u_0$ is independent of $\hat{\boldsymbol{\beta}}$:
$$\text{Var}(\hat{e}_0) = \text{Var}(u_0) + \text{Var}(\mathbf{x}_0'\hat{\boldsymbol{\beta}}) = \sigma^2 + \sigma^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0$$
$$\text{se}(\hat{e}_0) = \hat{\sigma}\sqrt{1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}$$
The extra "$+ \sigma^2$" (equivalently, "$+1$" inside the square root) reflects the **irreducible uncertainty** from the new observation's error $u_0$. No matter how large $n$ is, the PI half-width is bounded below by $t_{\alpha/2}\,\hat{\sigma}$.
A practical rule of thumb: when $\mathbf{x}_0$ equals the sample mean of the regressors, $\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0 = 1/n$ exactly, which vanishes for large $n$. This gives the approximation $\hat{y}_0 \pm 2\hat{\sigma}$ (using $t_{\alpha/2} \approx 2$).
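The claim is exact at the regressor means, as a quick check confirms:
```{r}
#| label: leverage-at-mean
fit_tmp <- lm(lwage ~ educ + exper + tenure, data = wage1)
X  <- model.matrix(fit_tmp)
x0 <- colMeans(X)   # includes the intercept column, which averages to 1
c(leverage   = drop(t(x0) %*% solve(crossprod(X)) %*% x0),
  one_over_n = 1 / nrow(X))
```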
```{r}
#| label: prediction-intervals
fit_wage_log <- lm(lwage ~ educ + exper + tenure, data = wage1)
# New observation: 16 years education, 10 years experience, 5 years tenure
new_obs <- tibble(educ = 16, exper = 10, tenure = 5)
ci <- predict(fit_wage_log, newdata = new_obs, interval = "confidence", level = 0.95)
pi <- predict(fit_wage_log, newdata = new_obs, interval = "prediction", level = 0.95)
cat("95% CI for E[log(wage) | x0]: [", round(ci[2], 3), ",", round(ci[3], 3), "]\n")
cat("95% PI for a new log(wage): [", round(pi[2], 3), ",", round(pi[3], 3), "]\n")
cat("Width ratio (PI/CI): ", round((pi[3]-pi[2])/(ci[3]-ci[2]), 2), "\n")
# In level form (anti-log, corrected for log-normal bias)
sigma_hat <- sigma(fit_wage_log)
cat("\nIn dollar terms (approx):\n")
cat("Point prediction: $", round(exp(ci[1] + sigma_hat^2/2), 2), "\n")
cat("95% PI approx: $", round(exp(pi[2]), 2), "to $", round(exp(pi[3]), 2), "\n")
```
The prediction interval is substantially wider than the confidence interval, as it must account for the inherent variability of individual wages around their conditional mean.
---
## Multicollinearity
**Multicollinearity** occurs when regressors are highly correlated. It does **not** bias OLS, but it inflates standard errors, making it hard to separately identify individual effects. The **Variance Inflation Factor**:
$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the $R^2$ from regressing $x_j$ on all other regressors. VIF $> 10$ is often flagged as a concern.
```{r}
#| label: vif
vif(fit_full)
```
No serious multicollinearity here — all VIFs are well below 10.
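As a cross-check on the formula, here is the VIF for `educ` computed directly from its auxiliary regression:
```{r}
#| label: vif-by-hand
# R^2 from regressing educ on the other regressors, then VIF = 1/(1 - R^2)
r2_educ <- summary(lm(educ ~ exper + tenure, data = wage1))$r.squared
1 / (1 - r2_educ)   # matches vif(fit_full)["educ"]
```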
---
## Tutorials
**Tutorial 6.1**
Using `wooldridge::hprice1`, regress `lprice` (log price) on `llotsize` (log lot size), `lsqrft` (log square footage), `bdrms`, and `colonial`.
a. Interpret the coefficients on `llotsize` and `lsqrft` as elasticities.
b. Interpret the coefficient on `bdrms` as a semi-elasticity (exact percentage change).
c. Construct a 95% prediction interval for a 2,000 sq ft colonial house with a 10,000 sq ft lot and 3 bedrooms. Convert back to dollar terms.
::: {.callout-tip collapse="true"}
## Solution
```{r}
#| label: ex6-1
data("hprice1", package = "wooldridge")
fit_h <- lm(lprice ~ llotsize + lsqrft + bdrms + colonial, data = hprice1)
tidy(fit_h)
# b) Exact percentage change for bdrms
b_bdrms <- coef(fit_h)["bdrms"]
cat("One extra bedroom: approx", round(100*b_bdrms, 2), "% change in price\n")
cat(" exact ", round(100*(exp(b_bdrms)-1), 2), "%\n")
# c) Prediction interval
new_h <- tibble(llotsize = log(10000), lsqrft = log(2000),
bdrms = 3, colonial = 1)
pi_h <- predict(fit_h, newdata = new_h, interval = "prediction")
cat("\n95% PI for log(price): [", round(pi_h[2],3), ",", round(pi_h[3],3), "]\n")
cat("In dollars: [$", round(exp(pi_h[2])*1000, 0),
", $", round(exp(pi_h[3])*1000, 0), "]\n")
```
Note: `price` in `hprice1` is in thousands of dollars, so exponentiate and multiply by 1,000.
:::
**Tutorial 6.2**
Using the OVB formula, predict the direction of bias in $\hat{\beta}_{educ}$ if we omit `female` from the wage regression. Verify empirically. Comment on whether the bias is economically meaningful.
::: {.callout-tip collapse="true"}
## Solution
```{r}
#| label: ex6-2
fit_short_f <- lm(wage ~ educ + exper + tenure, data = wage1)
fit_long_f <- lm(wage ~ educ + exper + tenure + female, data = wage1)
# Simple auxiliary regression (for sign reasoning); the exact formula
# regresses female on ALL included regressors, not educ alone
fit_delta_f <- lm(female ~ educ, data = wage1)
delta_exact <- coef(lm(female ~ educ + exper + tenure, data = wage1))["educ"]
beta_female <- coef(fit_long_f)["female"]
delta_21_f <- coef(fit_delta_f)["educ"]
cat("beta_female:", round(beta_female, 3), "\n")
cat("delta (female on educ):", round(delta_21_f, 3), "\n")
cat("Predicted OVB (simple delta):", round(beta_female * delta_21_f, 4), "\n")
cat("Predicted OVB (exact delta): ", round(beta_female * delta_exact, 4), "\n")
cat("Actual difference:", round(coef(fit_short_f)["educ"] - coef(fit_long_f)["educ"], 4), "\n")
```
$\beta_{female} < 0$ (wage penalty) and $\delta_{21} > 0$ (women have slightly more education in this sample), so the bias is negative: omitting gender *slightly understates* the return to education. Note that the simple-delta prediction is only approximate here, because the short regression already includes `exper` and `tenure`; the exact delta, from regressing `female` on all included regressors, reproduces the difference exactly. Either way the magnitude is small (\$0.01/year), so the bias is statistically interesting but economically minor here.
:::
**Tutorial 6.3**
For the log-level wage regression `lwage ~ educ + exper + tenure`, apply the RESET test. If you find evidence of misspecification, suggest and estimate one modification to the functional form that might resolve it. Re-run the RESET test on the modified model.
::: {.callout-tip collapse="true"}
## Solution
```{r}
#| label: ex6-3
fit_log2 <- lm(lwage ~ educ + exper + tenure, data = wage1)
# RESET test on log-level model
resettest(fit_log2, power = 2:3, type = "fitted")
# Add quadratic experience (experience has a concave profile)
fit_log3 <- lm(lwage ~ educ + exper + I(exper^2) + tenure, data = wage1)
resettest(fit_log3, power = 2:3, type = "fitted")
tidy(fit_log3)
```
Adding `exper^2` eliminates the RESET rejection, consistent with experience having diminishing returns. The coefficient on `exper^2` is negative: wage growth from experience slows and the profile eventually peaks. We study this functional form (quadratics and interactions) further in Chapter 7.
:::