Chapter 7: Functional Form, Dummies, and Interactions

Non-linearities, qualitative variables, and interaction effects

Learning Objectives

By the end of this chapter you will be able to:

  • Interpret exact percentage changes from log-level models via \(e^{\hat{\beta}} - 1\)
  • Incorporate quadratic and polynomial terms and compute marginal effects
  • Explain the dummy variable trap using the \(\mathbf{X}\)-matrix rank argument
  • Include multi-category dummy variables and show that pairwise differences are invariant to the choice of base category
  • Construct and interpret interaction terms between a dummy and a continuous variable
  • Conduct the Chow test for structural breaks, both manually and via linearHypothesis()

1 Log Functional Forms: Exact Interpretation

Chapter 6 introduced the log approximation \(100\Delta\ln(z) \approx \%\Delta z\). For log-level and log-log models, the exact interpretation matters when coefficients are large.

1.1 Exact Percentage Change in Log-Level Models

In the model \(\ln y = \beta_0 + \beta_1 x + u\), a one-unit increase in \(x\) changes \(\ln y\) by \(\beta_1\). The exact percentage change in \(y\) is:

\[y_{after} = e^{\beta_0 + \beta_1(x+1) + u} = e^{\beta_1} \cdot e^{\beta_0 + \beta_1 x + u} = e^{\beta_1} \cdot y_{before}\]

Therefore:

\[\frac{y_{after} - y_{before}}{y_{before}} = e^{\beta_1} - 1\]

In percentage terms: \(\% \Delta y = 100(e^{\hat{\beta}_1} - 1)\).

The approximation \(100\hat{\beta}_1\) equals the first-order Taylor expansion of \(100(e^{\hat{\beta}_1} - 1)\) around \(\hat{\beta}_1 = 0\). For \(|\hat{\beta}_1| < 0.1\), the error is at most about 5% of the coefficient. For larger values, always report the exact figure.
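To see how quickly the approximation deteriorates, a standalone check (a sketch, not part of the chapter's estimation) tabulates both figures for a range of coefficient values:

tibble(b = c(0.01, 0.05, 0.10, 0.20, 0.30)) |>
  mutate(
    approx  = 100 * b,                   # first-order approximation
    exact   = 100 * (exp(b) - 1),        # exact percentage change
    rel_err = (exact - approx) / approx  # error relative to the coefficient
  )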

# Packages assumed loaded throughout: tidyverse, broom, car, patchwork
data("wage1", package = "wooldridge")
wage1 <- wage1 |> mutate(lwage = log(wage))
fit_log <- lm(lwage ~ educ + exper + I(exper^2) + tenure + female, data = wage1)

b_female <- coef(fit_log)["female"]
cat("Coefficient on female:", round(b_female, 4), "\n")
Coefficient on female: -0.2979 
cat("Approximate % change: ", round(100 * b_female, 2), "%\n")
Approximate % change:  -29.79 %
cat("Exact % change:       ", round(100 * (exp(b_female) - 1), 2), "%\n")
Exact % change:        -25.76 %

The approximate and exact figures differ meaningfully for female because the coefficient is not small. The exact figure should be reported.

1.2 Quadratic Terms and Marginal Effects

When the effect of \(x\) on \(y\) is non-linear:

\[y = \beta_0 + \beta_1 x + \beta_2 x^2 + u\]

The marginal effect is \(\partial y / \partial x = \beta_1 + 2\beta_2 x\) — it varies with \(x\). When \(\beta_2 < 0\), the relationship is concave with a maximum (turning point) at:

\[x^* = -\frac{\beta_1}{2\beta_2}\]

b_exper  <- coef(fit_log)["exper"]
b_exper2 <- coef(fit_log)["I(exper^2)"]
tp <- -b_exper / (2 * b_exper2)

tibble(exper = 0:45) |>
  mutate(marginal = b_exper + 2 * b_exper2 * exper) |>
  ggplot(aes(exper, marginal)) +
  geom_line(colour = "#2c7be5", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = tp, colour = "#e63946", linetype = "dashed") +
  labs(x = "Years of experience", y = "Marginal effect on log wage",
       title = "Marginal Effect of Experience",
       subtitle = paste0("Turning point at ", round(tp, 1), " years"))

Diminishing returns to experience: the marginal effect crosses zero at the turning point.

Why model log wages rather than wage levels in the first place? Comparing the two distributions makes the case:
p1 <- ggplot(wage1, aes(wage)) +
  geom_histogram(bins = 30, fill = "#2c7be5", alpha = 0.7) +
  labs(x = "Hourly wage (USD)", title = "Raw wages")
p2 <- ggplot(wage1, aes(lwage)) +
  geom_histogram(bins = 30, fill = "#00c9a7", alpha = 0.7) +
  labs(x = "Log hourly wage", title = "Log wages")
p1 + p2
Figure 1: Raw wages are right-skewed; log wages are approximately symmetric.

2 Dummy Variables

A dummy variable (indicator variable) takes the value 1 if a condition is met and 0 otherwise:

\[D_i = \begin{cases} 1 & \text{if condition is true} \\ 0 & \text{otherwise} \end{cases}\]
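In R, a dummy is simply a 0/1 column. As a hypothetical illustration (the variable below is not used elsewhere in the chapter), a college-degree indicator can be built from educ:

# Hypothetical example: indicator for 16+ years of education
wage1 |> mutate(college = as.integer(educ >= 16)) |> count(college)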

2.1 Intercept Shift

Including a dummy causes a parallel shift in the regression line — the slope is the same for both groups but the intercept differs by \(\delta\):

\[\ln(wage_i) = \beta_0 + \beta_1 educ_i + \delta \cdot female_i + u_i\]

\(\delta\) measures the expected difference in log wages between women and men at any education level, all else equal.
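A minimal sketch estimating exactly this two-variable specification (later models in the chapter add further controls):

fit_shift <- lm(lwage ~ educ + female, data = wage1)
coef(fit_shift)["female"]   # estimated intercept shift, delta-hat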

wage1 |>
  mutate(pred = predict(fit_shift)) |>   # fitted values from the parallel-shift model
  ggplot(aes(educ, lwage, colour = factor(female))) +
  geom_point(alpha = 0.3) +
  geom_line(aes(y = pred), linewidth = 1.2) +
  scale_colour_manual(values = c("#2c7be5","#e63946"),
                      labels = c("Male","Female")) +
  labs(x = "Education (years)", y = "Log wage",
       colour = NULL, title = "Log Wage on Education by Gender",
       subtitle = "Parallel intercept shift model")
Figure 2: Parallel regression lines: separate intercepts by gender, same slopes.

2.2 The Dummy Variable Trap: A Rank Argument

For a categorical variable with \(m\) categories, include exactly \(m - 1\) dummies. Including all \(m\) creates perfect multicollinearity — a structural failure, not a data problem.

Proof. Suppose we have two regions (A and B) and include both \(D_A\) and \(D_B\) plus an intercept. The design matrix has columns:

\[\mathbf{X} = \begin{bmatrix} 1 & D_{A1} & D_{B1} \\ 1 & D_{A2} & D_{B2} \\ \vdots & \vdots & \vdots \end{bmatrix}\]

Since every observation is in exactly one region, \(D_{Ai} + D_{Bi} = 1\) for all \(i\). Therefore the intercept column equals the sum of the two dummy columns:

\[\mathbf{1} = \mathbf{D}_A + \mathbf{D}_B\]

This is an exact linear dependency. \(\text{rank}(\mathbf{X}) < k + 1\), so \(\mathbf{X}'\mathbf{X}\) is singular and \((\mathbf{X}'\mathbf{X})^{-1}\) does not exist. OLS has no unique solution — there are infinitely many coefficient vectors that minimise SSR equally. The software either drops a dummy automatically or reports an error. \(\blacksquare\)

More generally, for \(m\) categories with dummies \(D_1, \ldots, D_m\): \(\sum_{j=1}^m D_j = \mathbf{1}\), so the \(m + 1\) columns \(\{\mathbf{1}, D_1, \ldots, D_m\}\) are linearly dependent. Dropping any one of them, either the intercept or a single dummy, removes the dependency, which is why we include \(m - 1\) dummies alongside an intercept.
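A toy numerical check of the rank deficiency (illustrative data, not wage1):

D_A <- c(1, 1, 0, 0, 1)
D_B <- 1 - D_A                        # every unit in exactly one of two regions
X   <- cbind(intercept = 1, D_A, D_B)
qr(X)$rank                            # 2, not 3: one column is redundant
# solve(crossprod(X))                 # would fail: X'X is singular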

2.3 Multi-Category Dummies and Invariance of Pairwise Differences

wage1 <- wage1 |>
  mutate(region = case_when(
    northcen == 1 ~ "North Central",
    south    == 1 ~ "South",
    west     == 1 ~ "West",
    TRUE          ~ "Northeast"
  ) |> factor(levels = c("Northeast","North Central","South","West")))

fit_r_NE <- lm(lwage ~ educ + exper + tenure + female + region, data = wage1)
tidy(fit_r_NE) |> dplyr::filter(str_detect(term, "region"))

Each coefficient is the difference in conditional mean log wages relative to the base category (Northeast). A natural concern: does changing the base category change the substantive conclusions?

Pairwise differences are invariant to the choice of base category.

With Northeast as base: \[\hat{\mu}_{South} - \hat{\mu}_{NE} = \hat{\delta}_{South}, \quad \hat{\mu}_{West} - \hat{\mu}_{NE} = \hat{\delta}_{West}\]

The South–West difference is \(\hat{\delta}_{South} - \hat{\delta}_{West}\). If we re-estimate with West as the base, the South coefficient becomes \(\hat{\mu}_{South} - \hat{\mu}_{West}\), which equals \(\hat{\delta}_{South} - \hat{\delta}_{West}\): the same number. Only pairwise differences, not the individual coefficients, have an interpretation independent of the base.

# Re-estimate with South as base
wage1_s <- wage1 |>
  mutate(region_s = relevel(region, ref = "South"))

fit_r_S <- lm(lwage ~ educ + exper + tenure + female + region_s, data = wage1_s)

# West - North Central difference: should be the same regardless of base
d_NC <- tidy(fit_r_NE) |> dplyr::filter(term == "regionNorth Central") |> dplyr::pull(estimate)
d_W  <- tidy(fit_r_NE) |> dplyr::filter(term == "regionWest") |> dplyr::pull(estimate)
d_W_fromS  <- tidy(fit_r_S) |> dplyr::filter(term == "region_sWest") |> dplyr::pull(estimate)
d_NC_fromS <- tidy(fit_r_S) |> dplyr::filter(term == "region_sNorth Central") |> dplyr::pull(estimate)

cat("West - North Central (NE base):", round(d_W - d_NC, 6), "\n")
West - North Central (NE base): 0.1433 
cat("West - North Central (S  base):", round(d_W_fromS - d_NC_fromS, 6), "\n")
West - North Central (S  base): 0.1433 

The pairwise difference is identical to machine precision, confirming the invariance.


3 Interaction Terms

An interaction term \(x \times D\) allows the slope on \(x\) to differ across groups:

\[\ln(wage_i) = \beta_0 + \beta_1 educ_i + \delta_0 female_i + \delta_1 (educ_i \times female_i) + u_i\]

  • For men (\(D=0\)): \(E[\ln wage \mid educ, D=0] = \beta_0 + \beta_1 \, educ\)
  • For women (\(D=1\)): \(E[\ln wage \mid educ, D=1] = (\beta_0 + \delta_0) + (\beta_1 + \delta_1) \, educ\)

\(\delta_1\) captures the differential return to education for women relative to men.
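A minimal version of this model (a sketch without the chapter's other controls) makes the two implied slopes explicit:

fit_int <- lm(lwage ~ educ * female, data = wage1)
b <- coef(fit_int)
c(slope_men   = unname(b["educ"]),                      # beta_1
  slope_women = unname(b["educ"] + b["educ:female"]))   # beta_1 + delta_1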

3.1 Multicollinearity in Interaction Models

A frequent puzzle: after including both female and female × educ, the correlation between the two regressors is high. This is not a bug; the correlation is mechanical:

wage1 <- wage1 |> mutate(female_x_educ = female * educ)
cat("Correlation between female and female×educ:",
    round(cor(wage1$female, wage1$female_x_educ), 3), "\n")
Correlation between female and female×educ: 0.964 

High correlation between female and female × educ is expected: the interaction is literally a function of one of the main effects. This does not invalidate the model or the \(t\)-test on \(\delta_1\). It inflates individual standard errors but the joint test (F-test for both female and female × educ) remains valid. Centring the continuous variable (\(educ^* = educ - \bar{educ}\)) reduces this correlation substantially without changing the substantive results.
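The joint test can be run directly on the interaction model fit_int sketched above:

linearHypothesis(fit_int, c("female = 0", "educ:female = 0"))   # joint F-test on both terms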

wage1 <- wage1 |> mutate(educ_c = educ - mean(educ))
fit_int_c <- lm(lwage ~ educ_c * female + exper + I(exper^2) + tenure, data = wage1)
cat("Correlation after centring:",
    round(cor(wage1$female, wage1$female * wage1$educ_c), 3), "\n")
Correlation after centring: -0.072 
tidy(fit_int_c, conf.int = TRUE) |>
  dplyr::filter(term %in% c("educ_c", "female", "educ_c:female"))

After centring, the female coefficient measures the gender gap at the mean education level rather than at \(educ = 0\), a reference point of little practical interest.
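As a cross-check (a sketch using the uncentred analogue of fit_int_c), the gap at mean education implied by the uncentred model, \(\hat{\delta}_0 + \hat{\delta}_1 \overline{educ}\), reproduces the centred female coefficient exactly:

fit_int_u <- lm(lwage ~ educ * female + exper + I(exper^2) + tenure, data = wage1)
b_u <- coef(fit_int_u)
unname(b_u["female"] + b_u["educ:female"] * mean(wage1$educ))  # gap at mean educ
unname(coef(fit_int_c)["female"])                              # identical value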

newdata <- expand_grid(
  educ_c = seq(-8, 8, 0.1),
  female = 0:1,
  exper  = mean(wage1$exper),
  tenure = mean(wage1$tenure)
)
newdata |>
  mutate(
    pred   = predict(fit_int_c, newdata = newdata),
    gender = if_else(female == 1, "Female", "Male"),
    educ   = educ_c + mean(wage1$educ)
  ) |>
  ggplot(aes(educ, pred, colour = gender)) +
  geom_line(linewidth = 1.2) +
  labs(x = "Education (years)", y = "Predicted log wage", colour = NULL,
       title = "Interaction: Return to Education by Gender",
       subtitle = "Non-parallel lines = different slopes")
Figure 3: Return to education is steeper for men than women (interaction model).

4 The Chow Test for Structural Breaks

A Chow test asks whether the regression relationship differs across two groups or time periods. In the version used here, the restricted model already permits an intercept shift, so the test targets the slope coefficients. It is an \(F\)-test that compares:

  • Restricted: single regression for both groups (pooled, with possible intercept shift)
  • Unrestricted: separate regressions for each group

4.1 Manual Computation

Let \(SSR_R\) = SSR from the restricted (pooled) model, \(SSR_{UR}\) = \(SSR_M + SSR_F\) (sum of SSRs from separate male and female regressions), \(q\) = number of restrictions (the number of slope coefficients tested), \(k\) = number of regressors in each group model.

\[F = \frac{(SSR_R - SSR_{UR})/q}{SSR_{UR}/(n - 2(k+1))} \sim F(q,\, n - 2(k+1))\]

# Male and female subsets
wage_m <- wage1 |> dplyr::filter(female == 0)
wage_f <- wage1 |> dplyr::filter(female == 1)

# Regressors in each model: educ, exper, exper^2, tenure (k=4, so k+1=5 params)
formula_chow <- lwage ~ educ + exper + I(exper^2) + tenure

fit_m   <- lm(formula_chow, data = wage_m)
fit_f   <- lm(formula_chow, data = wage_f)
fit_r   <- lm(lwage ~ educ + exper + I(exper^2) + tenure + female, data = wage1)

SSR_UR <- sum(resid(fit_m)^2) + sum(resid(fit_f)^2)
SSR_R  <- sum(resid(fit_r)^2)
n      <- nrow(wage1)
k1     <- length(coef(fit_m))   # number of params in each group model
q      <- k1 - 1                # slopes only (the intercept is already allowed to differ)

F_chow <- ((SSR_R - SSR_UR) / q) / (SSR_UR / (n - 2 * k1))
p_chow <- pf(F_chow, q, n - 2 * k1, lower.tail = FALSE)

cat("SSR restricted (with female dummy):", round(SSR_R,  2), "\n")
SSR restricted (with female dummy): 83.94 
cat("SSR unrestricted (male + female):  ", round(SSR_UR, 2), "\n")
SSR unrestricted (male + female):   81.7 
cat("F-statistic:", round(F_chow, 4), "\n")
F-statistic: 3.542 
cat("p-value:    ", round(p_chow, 4), "\n")
p-value:     0.0073 
# Verify via linearHypothesis: test all slope interactions jointly
fit_full_int <- lm(lwage ~ (educ + exper + I(exper^2) + tenure) * female, data = wage1)
linearHypothesis(fit_full_int,
                 c("educ:female = 0", "exper:female = 0",
                   "I(exper^2):female = 0", "tenure:female = 0"))

A significant \(F\) rejects the null that all slope coefficients are equal across men and women.

4.2 Structural Break in Time Series

The Chow test also applies to time series at a known break date. Let \(D_t^{post}\) be a post-break dummy. An equivalent implementation: include all interactions of regressors with \(D_t^{post}\) and test them jointly.

data("intdef", package = "wooldridge")  # interest rate & inflation, US 1948-2003

# Mark post-1979 (Volcker shock) as a potential structural break
intdef <- intdef |>
  mutate(post79 = as.integer(year >= 1979))

fit_pre  <- lm(i3 ~ inf + def, data = dplyr::filter(intdef, post79 == 0))
fit_post <- lm(i3 ~ inf + def, data = dplyr::filter(intdef, post79 == 1))
fit_pool <- lm(i3 ~ inf + def + post79, data = intdef)
fit_brk  <- lm(i3 ~ (inf + def) * post79, data = intdef)   # interaction model, tested below

SSR_UR_ts <- sum(resid(fit_pre)^2) + sum(resid(fit_post)^2)
SSR_R_ts  <- sum(resid(fit_pool)^2)
n_ts <- nrow(intdef); k_ts <- 3   # k_ts = params per sub-period model (intercept, inf, def)
F_ts <- ((SSR_R_ts - SSR_UR_ts) / 2) / (SSR_UR_ts / (n_ts - 2 * k_ts))   # q = 2 slope restrictions
cat("Chow F-statistic for 1979 break:", round(F_ts, 3), "\n")
Chow F-statistic for 1979 break: 3.889 
cat("p-value:", round(pf(F_ts, 2, n_ts - 2*k_ts, lower.tail = FALSE), 4), "\n")
p-value: 0.0269 
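To confirm the equivalence noted above, the joint test on the interaction terms in fit_brk reproduces the same F-statistic:

linearHypothesis(fit_brk, c("inf:post79 = 0", "def:post79 = 0"))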

A significant result suggests the relationship between interest rates, inflation, and the deficit changed structurally around the Volcker disinflation.


5 Tutorials

Tutorial 7.1 Using wooldridge::wage1, estimate the log wage model with quadratic experience and the female dummy. Compute the experience level at which the wage-experience profile peaks. Report the exact (not approximate) percentage wage gap for women.

fit_q <- lm(lwage ~ educ + exper + I(exper^2) + tenure + female, data = wage1)
b1    <- coef(fit_q)["exper"]
b2    <- coef(fit_q)["I(exper^2)"]
tp    <- -b1 / (2 * b2)

d_fem    <- coef(fit_q)["female"]
exact_pct <- 100 * (exp(d_fem) - 1)

cat("Turning point:", round(tp, 1), "years\n")
Turning point: 25.3 years
cat("Data range:   ", min(wage1$exper), "to", max(wage1$exper), "\n\n")
Data range:    1 to 51 
cat("Approx gender gap:", round(100*d_fem, 2), "%\n")
Approx gender gap: -29.79 %
cat("Exact gender gap: ", round(exact_pct, 2), "%\n")
Exact gender gap:  -25.76 %

Tutorial 7.2 Show the dummy variable trap in practice. Attempt to include dummies for all four regions (Northeast, North Central, South, West) plus an intercept. What does R do? Explain algebraically why the model is not identified.

# R silently drops one dummy due to perfect collinearity
wage1_trap <- wage1 |>
  mutate(
    d_NE = as.integer(region == "Northeast"),
    d_NC = as.integer(region == "North Central"),
    d_S  = as.integer(region == "South"),
    d_W  = as.integer(region == "West")
  )

fit_trap <- lm(lwage ~ educ + d_NE + d_NC + d_S + d_W, data = wage1_trap)
cat("Coefficients returned (note NA):\n")
Coefficients returned (note NA):
print(coef(fit_trap))
(Intercept)        educ        d_NE        d_NC         d_S         d_W 
    0.66872     0.08210    -0.04562    -0.11548    -0.10586          NA 
# Verify the linear dependency: sum of all dummies = 1
with(wage1_trap, all(d_NE + d_NC + d_S + d_W == 1))
[1] TRUE

R drops d_W (the last dummy) and returns NA for it — a sign of the trap. The algebraic reason: \(\mathbf{D}_{NE} + \mathbf{D}_{NC} + \mathbf{D}_S + \mathbf{D}_W = \mathbf{1}\) (the intercept column), so the columns of \(\mathbf{X}\) are linearly dependent and \(\mathbf{X}'\mathbf{X}\) is singular.
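An alternative parameterisation (a sketch): drop the intercept and keep all four dummies, so each coefficient becomes a region-specific intercept rather than a difference from a base category:

fit_noint <- lm(lwage ~ 0 + educ + d_NE + d_NC + d_S + d_W, data = wage1_trap)
coef(fit_noint)   # four region intercepts; R-squared is not comparable without an intercept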

Tutorial 7.3 Conduct a Chow test for whether the wage regression (lwage ~ educ + exper + tenure) has the same slope coefficients for nonwhite vs. white workers. Compute the F-statistic by hand (using separate SSRs) and verify with linearHypothesis().

formula_w <- lwage ~ educ + exper + tenure

fit_white    <- lm(formula_w, data = dplyr::filter(wage1, nonwhite == 0))
fit_nonwhite <- lm(formula_w, data = dplyr::filter(wage1, nonwhite == 1))
fit_pool_w   <- lm(lwage ~ educ + exper + tenure + nonwhite, data = wage1)

SSR_UR2 <- sum(resid(fit_white)^2) + sum(resid(fit_nonwhite)^2)
SSR_R2  <- sum(resid(fit_pool_w)^2)
n2      <- nrow(wage1)
k2      <- length(coef(fit_white))
q2      <- k2 - 1

F2   <- ((SSR_R2 - SSR_UR2) / q2) / (SSR_UR2 / (n2 - 2 * k2))
p2   <- pf(F2, q2, n2 - 2*k2, lower.tail = FALSE)
cat("Chow F-statistic:", round(F2, 4), "  p-value:", round(p2, 4), "\n")
Chow F-statistic: 1.409   p-value: 0.2393 
# Verify
fit_int_race <- lm(lwage ~ (educ + exper + tenure) * nonwhite, data = wage1)
linearHypothesis(fit_int_race,
                 c("educ:nonwhite = 0", "exper:nonwhite = 0", "tenure:nonwhite = 0"))