Chapter 6: Multiple Regression and Specification

Controlling for confounders, OVB, and model selection

Note: Learning Objectives

By the end of this chapter you will be able to:

  • Specify and interpret multiple regression models with ceteris paribus language
  • Apply the OVB formula and reason about the direction of bias
  • Derive and interpret coefficients in log-level, level-log, and log-log models
  • Use the RESET test to detect functional form misspecification
  • Compare model specifications using adjusted \(R^2\), AIC, BIC, and HQ criteria
  • Derive and distinguish confidence intervals for conditional means from prediction intervals

1 The Multiple Regression Model

The multiple regression model with \(k\) regressors is:

\[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + u_i\]

Interpretation of \(\beta_j\): The expected change in \(y\) for a one-unit increase in \(x_j\), holding all other regressors constant (ceteris paribus). This ceteris paribus interpretation is what makes regression useful for causal inference — including a confounder \(x_2\) in the regression means the estimated effect of \(x_1\) is net of any variation in \(x_2\).
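A quick way to see the partialling-out logic is the Frisch-Waugh-Lovell theorem: the multiple-regression coefficient on \(x_1\) equals the simple-regression coefficient on \(x_1\) after the other regressors have been residualised out. A minimal sketch, assuming the wage1 data from the wooldridge package (used throughout this chapter):

# FWL: the coefficient on educ in the multiple regression equals the slope
# from regressing wage on educ-residualised-on-(exper, tenure)
data("wage1", package = "wooldridge")

fit_multi  <- lm(wage ~ educ + exper + tenure, data = wage1)
educ_resid <- resid(lm(educ ~ exper + tenure, data = wage1))
fit_fwl    <- lm(wage1$wage ~ educ_resid)

coef(fit_multi)["educ"]   # ceteris paribus effect of education
coef(fit_fwl)[2]          # identical, by the FWL theorem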


2 Omitted Variable Bias (OVB)

If we omit a relevant variable from the regression, OLS is biased. The full algebraic derivation was given in Chapter 4. Here we state the formula and apply it.

2.1 The OVB Formula

Suppose the true model is \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\) with \(E[u \mid x_1, x_2] = 0\). If we estimate the short regression (omitting \(x_2\)):

\[E[\tilde{\beta}_1] = \beta_1 + \beta_2 \cdot \delta_{21}, \qquad \delta_{21} = \frac{\text{Cov}(x_1, x_2)}{\text{Var}(x_1)}\]

The omitted variable bias \(\beta_2 \delta_{21}\) is nonzero if and only if both (i) \(x_2\) affects \(y\) and (ii) \(x_1\) and \(x_2\) are correlated.

| \(\text{sign}(\beta_2)\) | \(\text{sign}(\delta_{21})\) | Direction of bias |
|:---:|:---:|:---|
| \(+\) | \(+\) | Upward: \(E[\tilde{\beta}_1] > \beta_1\) |
| \(+\) | \(-\) | Downward: \(E[\tilde{\beta}_1] < \beta_1\) |
| \(-\) | \(+\) | Downward: \(E[\tilde{\beta}_1] < \beta_1\) |
| \(-\) | \(-\) | Upward: \(E[\tilde{\beta}_1] > \beta_1\) |

2.2 OVB Example: Omitting Experience from a Wage Regression

# Short regression (omits exper and tenure), long regression, and the
# auxiliary regression of the omitted exper on educ
fit_short <- lm(wage ~ educ, data = wage1)
fit_full  <- lm(wage ~ educ + exper + tenure, data = wage1)
fit_delta <- lm(exper ~ educ, data = wage1)

beta_exper <- coef(fit_full)["exper"]
delta_21   <- coef(fit_delta)["educ"]

cat("Short regression beta_educ:  ", round(coef(fit_short)["educ"], 4), "\n")
Short regression beta_educ:   0.5414 
cat("Long  regression beta_educ:  ", round(coef(fit_full)["educ"], 4), "\n")
Long  regression beta_educ:   0.599 
cat("OVB formula (long + bias):   ",
    round(coef(fit_full)["educ"] + beta_exper * delta_21, 4), "\n")
OVB formula (long + bias):    0.5662 

The single-variable OVB formula does not exactly recover the short-model coefficient here (0.5662 vs. 0.5414) because the long regression omits two variables relative to the short one: the formula accounts for exper but ignores tenure. The direction of the bias is still what the formula predicts. Experience affects wages and is negatively correlated with education (more education means less time for work experience, so \(\delta_{21} < 0\)); combined with \(\beta_{exper} > 0\), omitting experience biases the return to education downward.
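With both omitted variables in the accounting, the decomposition is exact. A short sketch using only objects from the chunk above (the comment quotes the short-model figure printed earlier):

# Exact decomposition: short coefficient = long coefficient + the sum of
# (long coefficient on each omitted regressor) x (its simple slope on educ)
delta_exper  <- coef(lm(exper  ~ educ, data = wage1))["educ"]
delta_tenure <- coef(lm(tenure ~ educ, data = wage1))["educ"]

coef(fit_full)["educ"] +
  coef(fit_full)["exper"]  * delta_exper +
  coef(fit_full)["tenure"] * delta_tenure
# equals coef(fit_short)["educ"] = 0.5414 exactly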


3 Building a Wage Model: Step by Step

Presenting results across multiple model specifications lets readers see how estimates change as controls are added — which reveals both OVB and robustness.

models <- list(
  "(1) Education only"    = lm(wage ~ educ, data = wage1),
  "(2) + Experience"      = lm(wage ~ educ + exper, data = wage1),
  "(3) + Tenure"          = lm(wage ~ educ + exper + tenure, data = wage1),
  "(4) + Gender"          = lm(wage ~ educ + exper + tenure + female, data = wage1),
  "(5) + Race"            = lm(wage ~ educ + exper + tenure + female + nonwhite, data = wage1)
)

modelsummary(
  models,
  stars    = TRUE,
  gof_map  = c("nobs", "r.squared", "adj.r.squared"),
  title    = "Returns to Education: Progressive Model Specifications",
  coef_rename = c(
    "(Intercept)" = "Intercept",
    "educ"     = "Education (years)",
    "exper"    = "Experience (years)",
    "tenure"   = "Tenure (years)",
    "female"   = "Female (= 1)",
    "nonwhite" = "Non-white (= 1)"
  )
)
Returns to Education: Progressive Model Specifications

|                    | (1) Education only | (2) + Experience | (3) + Tenure | (4) + Gender | (5) + Race |
|:-------------------|:------:|:------:|:------:|:------:|:------:|
| Intercept          | -0.905 | -3.391*** | -2.873*** | -1.568* | -1.540* |
|                    | (0.685) | (0.767) | (0.729) | (0.725) | (0.732) |
| Education (years)  | 0.541*** | 0.644*** | 0.599*** | 0.572*** | 0.570*** |
|                    | (0.053) | (0.054) | (0.051) | (0.049) | (0.050) |
| Experience (years) |        | 0.070*** | 0.022+ | 0.025* | 0.025* |
|                    |        | (0.011) | (0.012) | (0.012) | (0.012) |
| Tenure (years)     |        |        | 0.169*** | 0.141*** | 0.141*** |
|                    |        |        | (0.022) | (0.021) | (0.021) |
| Female (= 1)       |        |        |        | -1.811*** | -1.812*** |
|                    |        |        |        | (0.265) | (0.265) |
| Non-white (= 1)    |        |        |        |        | -0.116 |
|                    |        |        |        |        | (0.427) |
| Num.Obs.           | 526    | 526    | 526    | 526    | 526    |
| R2                 | 0.165  | 0.225  | 0.306  | 0.364  | 0.364  |
| R2 Adj.            | 0.163  | 0.222  | 0.302  | 0.359  | 0.358  |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

4 Log Functional Forms

The level-level model \(y = \beta_0 + \beta_1 x + u\) is the baseline. But wages, prices, and income are often better modelled in logarithms. Understanding how to interpret log transformations — and why they work — is essential.

4.1 The Log Approximation

The key identity is:

\[\Delta \ln(z) \approx \frac{\Delta z}{z}\]

for small \(\Delta z\). This follows from the first-order Taylor expansion: \(\ln(z + \Delta z) \approx \ln(z) + \Delta z / z\). Rearranging: \(\Delta \ln(z) \approx \Delta z / z\). Multiplying both sides by 100:

\[100 \cdot \Delta \ln(z) \approx \frac{100 \cdot \Delta z}{z} = \% \Delta z\]

So \(100 \times\) the change in \(\ln(z)\) is approximately the percentage change in \(z\). The approximation is exact only in the limit \(\Delta z \to 0\) (instantaneous, continuously compounded changes); for discrete changes it becomes less accurate as \(\Delta z\) grows.
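A quick numeric check of how the approximation deteriorates (the values in the comments are rounded):

g <- c(0.01, 0.05, 0.10, 0.25)
round(100 * log(1 + g), 2)   # 1.00  4.88  9.53 22.31  (log approximation)
round(100 * g, 2)            # 1     5    10    25     (true percentage change)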

4.2 The Four Log Models

| Model | Specification | Coefficient interpretation |
|:---|:---|:---|
| Level-level | \(y = \beta_0 + \beta_1 x\) | \(\Delta y = \beta_1 \Delta x\) (absolute change) |
| Log-level | \(\ln y = \beta_0 + \beta_1 x\) | \(100\beta_1 \Delta x \approx \%\Delta y\) (semi-elasticity) |
| Level-log | \(y = \beta_0 + \beta_1 \ln x\) | \(\Delta y \approx (\beta_1 / 100) \times \%\Delta x\) |
| Log-log | \(\ln y = \beta_0 + \beta_1 \ln x\) | \(\beta_1 = \%\Delta y / \%\Delta x\) (elasticity) |

4.3 Deriving the Log-Level Interpretation

If \(\ln y = \beta_0 + \beta_1 x + u\), then:

\[\frac{d(\ln y)}{dx} = \beta_1 \implies \frac{1}{y}\frac{dy}{dx} = \beta_1 \implies \frac{dy}{y} = \beta_1 \, dx\]

So a one-unit increase in \(x\) is associated with a \(\beta_1 \times 100\%\) approximate change in \(y\) (a semi-elasticity). The approximation breaks down for large \(\beta_1\); the exact percentage change is \(100(e^{\beta_1} - 1)\%\).

4.4 Deriving the Log-Log Interpretation

If \(\ln y = \beta_0 + \beta_1 \ln x + u\), then:

\[\frac{d(\ln y)}{d(\ln x)} = \beta_1 \implies \frac{dy/y}{dx/x} = \beta_1\]

\(\beta_1\) is the elasticity of \(y\) with respect to \(x\): a 1% increase in \(x\) is associated with a \(\beta_1\%\) change in \(y\). Elasticities are unit-free, which is why log-log models are standard for demand equations, production functions, and international trade.

# Level-level and log-level wage models
fit_level <- lm(wage   ~ educ + exper + tenure, data = wage1)
fit_log   <- lm(lwage  ~ educ + exper + tenure, data = wage1)

modelsummary(
  list("Level-level" = fit_level, "Log-level" = fit_log),
  stars   = TRUE,
  gof_map = c("nobs", "r.squared"),
  title   = "Level vs Log-Level Wage Equation"
)
Level vs Log-Level Wage Equation

|             | Level-level | Log-level |
|:------------|:-----------:|:---------:|
| (Intercept) | -2.873***   | 0.284**   |
|             | (0.729)     | (0.104)   |
| educ        | 0.599***    | 0.092***  |
|             | (0.051)     | (0.007)   |
| exper       | 0.022+      | 0.004*    |
|             | (0.012)     | (0.002)   |
| tenure      | 0.169***    | 0.022***  |
|             | (0.022)     | (0.003)   |
| Num.Obs.    | 526         | 526       |
| R2          | 0.306       | 0.316     |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
b_educ_log <- coef(fit_log)["educ"]
cat("Log-level model: one extra year of education is associated with\n")
Log-level model: one extra year of education is associated with
cat("  approx:", round(100 * b_educ_log, 2), "% change in wage\n")
  approx: 9.2 % change in wage
cat("  exact: ", round(100 * (exp(b_educ_log) - 1), 2), "% change in wage\n")
  exact:  9.64 % change in wage

The approximate and exact figures differ by less than half a percentage point here (9.20% vs. 9.64%); the approximation is reasonable for small \(\beta_1\), but report the exact figure when precision matters.

4.5 Log-Differencing in Time Series

For time series data, log-differencing converts levels to approximate growth rates:

\[\Delta \ln y_t = \ln y_t - \ln y_{t-1} \approx \frac{y_t - y_{t-1}}{y_{t-1}} = \text{growth rate of } y_t\]

This is standard for GDP, prices, and exchange rates. A second benefit: log-differencing often eliminates non-stationarity (trending series), which we study formally in Chapter 12.
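A minimal sketch with a simulated series (the series and its 2% drift are illustrative assumptions, not data from this chapter):

# Growth rates via log-differences
set.seed(1)
y <- cumprod(1 + rnorm(100, mean = 0.02, sd = 0.01))   # ~2% average growth

growth_exact <- diff(y) / head(y, -1)   # (y_t - y_{t-1}) / y_{t-1}
growth_log   <- diff(log(y))            # log-difference approximation

summary(abs(growth_exact - growth_log)) # discrepancies are tiny at ~2% growth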


5 Specification Testing: The RESET Test

The Ramsey RESET (Regression Equation Specification Error Test) detects functional form misspecification: if the model is correctly specified, higher powers of \(\hat{y}\) should not be significant when added to the regression.

The test: Estimate the original model to get \(\hat{y}_i\). Then augment with \(\hat{y}_i^2\) (and possibly \(\hat{y}_i^3\)) and test their joint significance using an \(F\)-test.

\[H_0: \text{model is correctly specified} \quad \text{vs.} \quad H_1: \text{functional form misspecification}\]

fit_wage <- lm(wage ~ educ + exper + tenure, data = wage1)
fit_log_wage <- lm(lwage ~ educ + exper + tenure, data = wage1)

# RESET test via the lmtest package
library(lmtest)
resettest(fit_wage,     power = 2:3, type = "fitted")

    RESET test

data:  fit_wage
RESET = 12, df1 = 2, df2 = 520, p-value = 0.00001
resettest(fit_log_wage, power = 2:3, type = "fitted")

    RESET test

data:  fit_log_wage
RESET = 6.6, df1 = 2, df2 = 520, p-value = 0.002

The RESET test rejects for both models at the 5% level, though far less strongly for the log-level specification (p = 0.002 vs. p < 0.0001) — consistent with wages being better modelled in logs, even if logging alone does not fix the functional form (Tutorial 6.3 returns to this). A significant RESET does not tell you how to fix the specification, only that the current one is wrong.
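The test is easy to replicate by hand, which shows exactly what resettest() computes. A sketch using the level-level model above:

# Manual RESET: augment with powers of the fitted values, then F-test
# their joint significance against the original model
yhat    <- fitted(fit_wage)
fit_aug <- lm(wage ~ educ + exper + tenure + I(yhat^2) + I(yhat^3),
              data = wage1)
anova(fit_wage, fit_aug)   # F statistic matches resettest(fit_wage, ...) above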


6 Model Selection Criteria

Adding regressors never decreases \(R^2\), so \(R^2\) alone always favours the largest model. Better criteria penalise model complexity.

6.1 Adjusted \(R^2\)

\[\bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)}\]

\(\bar{R}^2\) rises when a new variable is added if and only if that variable's \(|t|\) statistic exceeds 1. Unlike \(R^2\), \(\bar{R}^2\) can decrease.
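A small experiment illustrates this (the noise regressor is our own addition, not a wage1 variable):

# Adding pure noise: R^2 rises (weakly) by construction; adjusted R^2 need not
set.seed(42)
wage1$noise <- rnorm(nrow(wage1))

fit_base  <- lm(wage ~ educ + exper + tenure, data = wage1)
fit_noise <- lm(wage ~ educ + exper + tenure + noise, data = wage1)

c(summary(fit_base)$r.squared,     summary(fit_noise)$r.squared)      # never falls
c(summary(fit_base)$adj.r.squared, summary(fit_noise)$adj.r.squared)  # falls iff |t| of noise < 1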

6.2 Information Criteria

Define \(\ell = n\ln(SSR/n)\), which equals minus twice the maximised Gaussian log-likelihood up to an additive constant, and let \(p = k+1\) be the number of estimated coefficients:

\[\text{AIC} = \ell + 2p\] \[\text{BIC} = \ell + p\ln(n)\] \[\text{HQ} = \ell + 2p\ln(\ln(n))\]

Penalty ordering: \(\text{AIC} \leq \text{BIC}\) once \(n \geq 8\) (since \(2 \leq \ln n\)), and the full ordering \(\text{AIC} \leq \text{HQ} \leq \text{BIC}\) holds once \(n \geq 16\) (since \(2 \leq 2\ln(\ln n)\) requires \(n > e^e \approx 15.2\)). The Hannan-Quinn criterion (HQ) is consistent (like BIC) but penalises less aggressively. In practice:

  • AIC tends to overfit (selects too many regressors)
  • BIC is consistent — selects the true model as \(n \to \infty\)
  • HQ lies between them

Lower is better for all three criteria; compare across models with the same dependent variable.

# Compute the criteria by hand for each specification
# (map_dfr from purrr, tibble, and glance from broom assumed loaded)
n <- nrow(wage1)
map_dfr(models, function(m) {
  ssr <- sum(resid(m)^2)
  p   <- length(coef(m))
  ell <- n * log(ssr / n)
  tibble(
    AIC  = ell + 2 * p,
    HQ   = ell + 2 * p * log(log(n)),
    BIC  = ell + p * log(n),
    `Adj R2` = glance(m)$adj.r.squared
  )
}, .id = "Model")
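For comparison, R's built-in AIC() and BIC() can be applied to the same list. They include the Gaussian likelihood constants and count \(\hat\sigma^2\) as a parameter, so their levels differ from the formulas above by a constant, but the model rankings agree for a fixed sample:

sapply(models, AIC)
sapply(models, BIC)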
Important: Model Selection vs. Causal Inference

Information criteria are tools for prediction accuracy. For causal inference, we choose controls based on economic theory about which variables could confound the relationship of interest — not data-driven selection rules. Including an irrelevant control does not bias OLS but wastes a degree of freedom; omitting a relevant confounder causes bias.


7 Confidence Intervals vs. Prediction Intervals

A common confusion in applied work is between:

  1. Confidence interval (CI) for the conditional mean \(E[y \mid \mathbf{x}_0]\)
  2. Prediction interval (PI) for a new observation \(y_0 = \mathbf{x}_0'\boldsymbol{\beta} + u_0\)

7.1 Confidence Interval for \(E[y \mid \mathbf{x}_0]\)

The point estimate of \(E[y \mid \mathbf{x}_0] = \mathbf{x}_0'\boldsymbol{\beta}\) is \(\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}}\).

The variance of this estimate is:

\[\text{Var}(\hat{y}_0 \mid \mathbf{X}, \mathbf{x}_0) = \mathbf{x}_0' \text{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) \mathbf{x}_0 = \sigma^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0\]

Derivation via reparameterisation: Define a new parameter \(\theta = \mathbf{x}_0'\boldsymbol{\beta}\). Reparameterise the model so that \(\theta\) is one of the coefficients (by centring around \(\mathbf{x}_0\)). The OLS estimate of \(\theta\) is \(\hat{y}_0\), and its standard error is \(\text{se}(\hat{y}_0) = \hat{\sigma}\sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}\). The CI is:

\[\hat{y}_0 \pm t_{\alpha/2, n-k-1} \cdot \text{se}(\hat{y}_0)\]

7.2 Prediction Interval for \(y_0\)

A new observation is \(y_0 = \mathbf{x}_0'\boldsymbol{\beta} + u_0\). The forecast error is:

\[\hat{e}_0 = y_0 - \hat{y}_0 = (\mathbf{x}_0'\boldsymbol{\beta} + u_0) - \mathbf{x}_0'\hat{\boldsymbol{\beta}} = u_0 - \mathbf{x}_0'(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})\]

Since \(u_0\) is independent of \(\hat{\boldsymbol{\beta}}\):

\[\text{Var}(\hat{e}_0) = \text{Var}(u_0) + \text{Var}(\mathbf{x}_0'\hat{\boldsymbol{\beta}}) = \sigma^2 + \sigma^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0\]

\[\text{se}(\hat{e}_0) = \hat{\sigma}\sqrt{1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}\]

The extra “\(+ \sigma^2\)” (equivalently, “\(+1\)” inside the square root) reflects the irreducible uncertainty from the new observation’s error \(u_0\). No matter how large \(n\) is, PI width is bounded below by \(\pm t_{\alpha/2} \hat{\sigma}\).

A practical rule of thumb: if \(\mathbf{x}_0\) is close to \(\bar{\mathbf{x}}\) (the vector of regressor means), then \(\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0 \approx 1/n\), which vanishes for large \(n\), giving the approximate PI \(\hat{y}_0 \pm 2\hat{\sigma}\) (using \(t_{\alpha/2} \approx 2\)).

fit_wage_log <- lm(lwage ~ educ + exper + tenure, data = wage1)

# New observation: 16 years education, 10 years experience, 5 years tenure
new_obs <- tibble(educ = 16, exper = 10, tenure = 5)

ci <- predict(fit_wage_log, newdata = new_obs, interval = "confidence", level = 0.95)
pi <- predict(fit_wage_log, newdata = new_obs, interval = "prediction", level = 0.95)

cat("95% CI for E[log(wage) | x0]:  [", round(ci[2], 3), ",", round(ci[3], 3), "]\n")
95% CI for E[log(wage) | x0]:  [ 1.848 , 1.969 ]
cat("95% PI for a new log(wage):    [", round(pi[2], 3), ",", round(pi[3], 3), "]\n")
95% PI for a new log(wage):    [ 1.04 , 2.777 ]
cat("Width ratio (PI/CI):           ", round((pi[3]-pi[2])/(ci[3]-ci[2]), 2), "\n")
Width ratio (PI/CI):            14.28 
# In level form (anti-log, corrected for log-normal bias)
sigma_hat <- sigma(fit_wage_log)
cat("\nIn dollar terms (approx):\n")

In dollar terms (approx):
cat("Point prediction: $", round(exp(ci[1] + sigma_hat^2/2), 2), "\n")
Point prediction: $ 7.43 
cat("95% PI approx: $", round(exp(pi[2]), 2), "to $", round(exp(pi[3]), 2), "\n")
95% PI approx: $ 2.83 to $ 16.06 

The prediction interval is substantially wider than the confidence interval, as it must account for the inherent variability of individual wages around their conditional mean.
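The predict() intervals can be reproduced directly from the matrix formulas. A sketch using objects defined above:

# Reproducing se(yhat_0) and se(e_0) from x0'(X'X)^{-1} x0
X  <- model.matrix(fit_wage_log)
x0 <- c(1, 16, 10, 5)   # intercept, educ, exper, tenure (same as new_obs)
h0 <- drop(t(x0) %*% solve(crossprod(X)) %*% x0)

s     <- sigma(fit_wage_log)
y0hat <- drop(x0 %*% coef(fit_wage_log))
tcrit <- qt(0.975, df = df.residual(fit_wage_log))

y0hat + c(-1, 1) * tcrit * s * sqrt(h0)       # reproduces the 95% CI
y0hat + c(-1, 1) * tcrit * s * sqrt(1 + h0)   # reproduces the 95% PI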


8 Multicollinearity

Multicollinearity occurs when regressors are highly correlated. It does not bias OLS, but it inflates standard errors, making it hard to separately identify individual effects. The Variance Inflation Factor:

\[\text{VIF}_j = \frac{1}{1 - R_j^2}\]

where \(R_j^2\) is the \(R^2\) from regressing \(x_j\) on all other regressors. VIF \(> 10\) is often flagged as a concern.

library(car)   # vif() lives in the car package
vif(fit_full)
  educ  exper tenure 
 1.113  1.478  1.349 

No serious multicollinearity here — all VIFs are well below 10.
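The VIF formula is easy to verify by hand; a one-line check for educ:

# VIF by hand: regress x_j on the remaining regressors
r2_educ <- summary(lm(educ ~ exper + tenure, data = wage1))$r.squared
1 / (1 - r2_educ)   # matches vif(fit_full)["educ"] above: about 1.11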


9 Tutorials

Tutorial 6.1 Using wooldridge::hprice1, regress lprice (log price) on llotsize (log lot size), lsqrft (log square footage), bdrms, and colonial.

  1. Interpret the coefficients on llotsize and lsqrft as elasticities.
  2. Interpret the coefficient on bdrms as a semi-elasticity (exact percentage change).
  3. Construct a 95% prediction interval for a 2,000 sq ft colonial house with a 10,000 sq ft lot and 3 bedrooms. Convert back to dollar terms.
data("hprice1", package = "wooldridge")

fit_h <- lm(lprice ~ llotsize + lsqrft + bdrms + colonial, data = hprice1)
tidy(fit_h)
# b) Exact percentage change for bdrms
b_bdrms <- coef(fit_h)["bdrms"]
cat("One extra bedroom: approx", round(100*b_bdrms, 2), "% change in price\n")
One extra bedroom: approx 2.68 % change in price
cat("                   exact ", round(100*(exp(b_bdrms)-1), 2), "%\n")
                   exact  2.72 %
# c) Prediction interval
new_h <- tibble(llotsize = log(10000), lsqrft = log(2000),
                bdrms = 3, colonial = 1)
pi_h <- predict(fit_h, newdata = new_h, interval = "prediction")
cat("\n95% PI for log(price): [", round(pi_h[2],3), ",", round(pi_h[3],3), "]\n")

95% PI for log(price): [ 5.333 , 6.078 ]
cat("In dollars:            [$", round(exp(pi_h[2])*1000, 0),
    ", $", round(exp(pi_h[3])*1000, 0), "]\n")
In dollars:            [$ 207126 , $ 436165 ]

Note: price in hprice1 is in thousands of dollars, so exponentiate and multiply by 1,000.

Tutorial 6.2 Using the OVB formula, predict the direction of bias in \(\hat{\beta}_{educ}\) if we omit female from the wage regression. Verify empirically. Comment on whether the bias is economically meaningful.

# Long model adds female; the auxiliary regression of the omitted
# female on educ gives delta_21
fit_short_f <- lm(wage ~ educ + exper + tenure, data = wage1)
fit_long_f  <- lm(wage ~ educ + exper + tenure + female, data = wage1)
fit_delta_f <- lm(female ~ educ, data = wage1)

beta_female <- coef(fit_long_f)["female"]
delta_21_f  <- coef(fit_delta_f)["educ"]
ovb_pred    <- beta_female * delta_21_f

cat("beta_female:", round(beta_female, 3), "\n")
beta_female: -1.811 
cat("delta (female on educ):", round(delta_21_f, 3), "\n")
delta (female on educ): -0.015 
cat("Predicted OVB:", round(ovb_pred, 4), "\n")
Predicted OVB: 0.0278 
cat("Actual difference:", round(coef(fit_short_f)["educ"] - coef(fit_long_f)["educ"], 4), "\n")
Actual difference: 0.0275 

\(\beta_{female} < 0\) (wage penalty) and \(\delta_{21} < 0\) (women have slightly less education in this sample), so the bias is positive — omitting gender slightly overstates the return to education. The predicted (0.0278) and actual (0.0275) biases differ marginally because the auxiliary regression ignores exper and tenure. The magnitude is small (about $0.03 per hour per year of education), so the bias is statistically interesting but economically minor here.

Tutorial 6.3 For the log-level wage regression lwage ~ educ + exper + tenure, apply the RESET test. If you find evidence of misspecification, suggest and estimate one modification to the functional form that might resolve it. Re-run the RESET test on the modified model.

fit_log2 <- lm(lwage ~ educ + exper + tenure, data = wage1)

# RESET test on log-level model
resettest(fit_log2, power = 2:3, type = "fitted")

    RESET test

data:  fit_log2
RESET = 6.6, df1 = 2, df2 = 520, p-value = 0.002
# Add quadratic experience (experience has a concave profile)
fit_log3 <- lm(lwage ~ educ + exper + I(exper^2) + tenure, data = wage1)
resettest(fit_log3, power = 2:3, type = "fitted")

    RESET test

data:  fit_log3
RESET = 5.6, df1 = 2, df2 = 519, p-value = 0.004
tidy(fit_log3)

Adding exper^2 lowers the RESET statistic (6.6 to 5.6) but the test still rejects at the 5% level (p = 0.004): the quadratic captures part of the curvature, not all of it. The coefficient on exper^2 is negative, consistent with diminishing returns to experience: wage growth slows and eventually peaks. We study richer functional forms (quadratics and interactions) further in Chapter 7.
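Since RESET still rejects, one could push further. A sketch one might try next (output not shown; the quadratic in tenure is an illustrative choice, not a recommendation from the text):

# Add a quadratic in tenure as well, then re-run RESET
fit_log4 <- lm(lwage ~ educ + exper + I(exper^2) + tenure + I(tenure^2),
               data = wage1)
resettest(fit_log4, power = 2:3, type = "fitted")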