Chapter 3: Simple Linear Regression

OLS estimation, matrix form, geometry, and goodness of fit

Note: Learning Objectives

By the end of this chapter you will be able to:

  • State the simple linear regression model and interpret its components
  • Derive the OLS estimator algebraically from first-order conditions
  • Write the regression model and the OLS estimator in matrix form
  • Explain OLS geometrically as an orthogonal projection onto the column space of \(\mathbf{X}\)
  • Derive the normal equations \(\mathbf{X'X}\hat{\boldsymbol{\beta}} = \mathbf{X'y}\) from the orthogonality principle
  • Verify the algebraic properties of OLS residuals
  • Derive the SST = SSE + SSR decomposition as a consequence of orthogonality
  • Compute and interpret \(R^2\) and the standard error of the regression
  • Fit and interpret a regression model in R

1 The Population Regression Function

We want to understand how a variable \(y\) (the dependent variable) relates to \(x\) (the independent variable or regressor). The population model is:

\[ y_i = \beta_0 + \beta_1 x_i + u_i, \quad i = 1, 2, \ldots, n \]

  • \(\beta_0\): the intercept — expected value of \(y\) when \(x = 0\)
  • \(\beta_1\): the slope — the change in \(y\) per unit increase in \(x\), ceteris paribus
  • \(u_i\): the error term — all factors that determine \(y\) other than \(x\)

The interpretation \(\Delta y / \Delta x = \beta_1\) is valid only when \(\Delta u / \Delta x = 0\): that is, only when changes in \(x\) leave the unobservables unchanged. This is the ceteris paribus requirement.

1.1 The Population Regression Function

The zero conditional mean assumption \(E[u \mid x] = 0\) implies:

\[ E[y \mid x] = \beta_0 + \beta_1 x \]

This is the population regression function (PRF). It describes the true conditional mean in the population. We never observe it; we estimate it from data. Estimating the PRF requires a random sample \(\{(x_i, y_i) : i = 1, \ldots, n\}\).
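A small simulation sketch makes the idea concrete (the parameter values, sample size, and variable names below are illustrative, not from the chapter's data): when the data are generated with \(E[u \mid x] = 0\), binned conditional means of \(y\) track the PRF.

set.seed(42)                                   # illustrative simulation, not from the text
n_sim <- 5000
x_sim <- runif(n_sim, 0, 16)                   # regressor
u_sim <- rnorm(n_sim, sd = 2)                  # error with E[u | x] = 0 by construction
y_sim <- 1 + 0.5 * x_sim + u_sim               # PRF: E[y | x] = 1 + 0.5 x

# Binned conditional means of y should sit close to the PRF evaluated at bin midpoints
bins <- cut(x_sim, breaks = 0:16)
cbind(bin_mean        = round(tapply(y_sim, bins, mean), 2),
      prf_at_midpoint = 1 + 0.5 * ((0:15) + 0.5))[1:5, ]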

1.2 SLR Assumptions

  • SLR.1 (Linearity): \(y = \beta_0 + \beta_1 x + u\)
  • SLR.2 (Random sampling): \((x_i, y_i)\) are i.i.d. draws from the population
  • SLR.3 (Sample variation in \(x\)): \(\sum(x_i - \bar{x})^2 > 0\)
  • SLR.4 (Zero conditional mean): \(E[u \mid x] = 0\)

SLR.4 is the load-bearing assumption. Its violation produces endogeneity and makes \(\hat{\beta}_1\) biased.

1.3 What Is a Model?

Studying two random variables means modelling their joint distribution — which is data-intensive. The strategy here is more targeted: model the conditional distribution of \(y\) given \(x\), and in particular its conditional mean \(E[y \mid x]\).

An incomplete model specifies only \(E[y \mid x] = \beta_0 + \beta_1 x\). A complete model (the Classical Linear Model) further specifies:

\[ \text{Var}(u \mid x) = \sigma^2 \quad \text{(homoskedasticity)} \] \[ u \mid x \sim \mathcal{N}(0, \sigma^2) \quad \text{(normality)} \]

In this chapter we work with the incomplete model. The CLM assumptions are taken up in Chapters 4 and 5.


2 OLS Estimation: Scalar Derivation

Given data \((x_1, y_1), \ldots, (x_n, y_n)\), we need to choose \(\hat{\beta}_0\) and \(\hat{\beta}_1\). OLS chooses the values that minimise the sum of squared residuals:

\[ \min_{b_0,\, b_1} \; S(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \]

First-order conditions:

\[ \frac{\partial S}{\partial b_0}\bigg|_{\hat{\beta}} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]

\[ \frac{\partial S}{\partial b_1}\bigg|_{\hat{\beta}} = -2\sum_{i=1}^{n} x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]

These two equations — the normal equations in scalar form — are:

\[ \sum_{i=1}^{n} \hat{u}_i = 0 \tag{1} \]

\[ \sum_{i=1}^{n} x_i \hat{u}_i = 0 \tag{2} \]

Solving the system. Dividing (1) by \(n\) gives \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). Substituting into (2) and simplifying (see Wooldridge p. 28):

\[ \boxed{\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\widehat{\text{Cov}}(x, y)}{\widehat{\text{Var}}(x)}} \]

\[ \boxed{\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}} \]

Notice that \(\hat{\beta}_1 = \hat{\sigma}_{xy} / \hat{\sigma}_x^2\): the OLS slope is the sample covariance of \(x\) and \(y\) divided by the sample variance of \(x\). In finance, this same formula gives the market beta of a stock.

The estimated regression line is \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\). Since \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\), the line always passes through the sample mean \((\bar{x}, \bar{y})\).
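Because R's cov() and var() both use the \(n-1\) denominator, their ratio reproduces the OLS slope exactly. A quick check on simulated data (the numbers below are illustrative, not from the text):

set.seed(1)                               # illustrative data
x_demo <- rnorm(100)
y_demo <- 2 + 0.7 * x_demo + rnorm(100)

b1 <- cov(x_demo, y_demo) / var(x_demo)   # sample covariance over sample variance
b0 <- mean(y_demo) - b1 * mean(x_demo)

c(b0 = b0, b1 = b1)
coef(lm(y_demo ~ x_demo))                 # matches (Intercept) and slope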

2.1 Estimators We Have Seen

Every population moment has a sample analogue. The following list pairs each population parameter with its estimator:

  • \(\mu_y = E[y]\), estimated by \(\bar{y} = \frac{1}{n}\sum y_i\)
  • \(\sigma_y^2 = E[(y - \mu_y)^2]\), estimated by \(\hat{s}_y^2 = \frac{1}{n-1}\sum(y_i - \bar{y})^2\)
  • \(\sigma_{xy} = E[(x-\mu_x)(y-\mu_y)]\), estimated by \(\hat{s}_{xy} = \frac{1}{n-1}\sum(x_i - \bar{x})(y_i - \bar{y})\)
  • \(\rho_{xy} = \sigma_{xy}/(\sigma_x \sigma_y)\), estimated by \(\hat{r}_{xy} = \hat{s}_{xy}/(\hat{s}_x \hat{s}_y)\)
  • \(\beta_1\) in \(E[y\mid x] = \beta_0 + \beta_1 x\), estimated by \(\hat{\beta}_1 = \hat{s}_{xy}/\hat{s}_x^2\)
  • \(\beta_0\), estimated by \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\)

In every case the estimator is a function of sample data only. Because the data are random, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are random variables: different samples from the same population will yield different estimates.
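A short simulation (with illustrative parameter values, not from the chapter's data) makes the point: re-drawing the sample re-draws the estimate.

set.seed(123)                                # illustrative simulation
one_sample <- function(n_obs = 50) {
  x <- rnorm(n_obs)
  y <- 1 + 0.5 * x + rnorm(n_obs)            # true beta_1 = 0.5
  coef(lm(y ~ x))[2]                         # OLS slope from this particular sample
}
b1_draws <- replicate(1000, one_sample())
c(mean = mean(b1_draws), sd = sd(b1_draws))  # centred near 0.5, but spread out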


3 Matrix Form of the Regression Model

For \(n\) observations the model \(y_i = \beta_0 + \beta_1 x_i + u_i\) for \(i = 1, \ldots, n\) can be stacked:

\[ \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}_{\mathbf{y} \; (n \times 1)} = \underbrace{\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}}_{\mathbf{X} \; (n \times 2)} \underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}}_{\boldsymbol{\beta} \; (2 \times 1)} + \underbrace{\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}}_{\mathbf{u} \; (n \times 1)} \]

Compactly: \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}\).

The first column of \(\mathbf{X}\) is a column of ones, which “carries” the intercept \(\beta_0\). The matrix \(\mathbf{X}\) and vector \(\mathbf{y}\) are observable; \(\boldsymbol{\beta}\) and \(\mathbf{u}\) are unobservable.
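In R, model.matrix() builds exactly this \(\mathbf{X}\) from a formula, adding the column of ones automatically. A minimal sketch using the three-house numbers that reappear in Section 4.3:

# Illustrative data frame; model.matrix() adds the intercept column for us
d <- data.frame(price = c(10, 4, 6), bedrooms = c(4, 1, 1))
model.matrix(price ~ bedrooms, data = d)   # n x 2 matrix [1 | bedrooms]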

3.1 Extension to Multiple Regression

With \(k\) regressors plus an intercept the model is identical in form:

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}, \quad \mathbf{X} \in \mathbb{R}^{n \times (k+1)}, \quad \boldsymbol{\beta} \in \mathbb{R}^{k+1} \]

The power of abstraction: four apparently distinct problems — returns to education, stock betas, house price prediction, electricity load forecasting — are all instances of the single problem “estimate \(\boldsymbol{\beta}\) in \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}\) from a sample of observations.”


4 Geometry of OLS

This section gives the geometric interpretation of least squares. It is not in Wooldridge but is essential for understanding what OLS computes, why \(\mathbf{X'\hat{u}} = \mathbf{0}\), and where the \(SST = SSE + SSR\) decomposition comes from.

4.1 Vectors and Inner Products

Think of \(\mathbf{y} = (y_1, \ldots, y_n)' \in \mathbb{R}^n\) as a point (or arrow from the origin) in \(n\)-dimensional space.

Length of a vector \(\mathbf{v} \in \mathbb{R}^n\):

\[ \|\mathbf{v}\| = \sqrt{\mathbf{v}'\mathbf{v}} = \sqrt{\sum_{i=1}^n v_i^2} \]

Orthogonality. Two vectors \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^n\) are orthogonal (perpendicular) if and only if:

\[ \mathbf{u}'\mathbf{v} = \sum_{i=1}^n u_i v_i = 0 \]
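Both quantities are one-liners in R; a small sketch with arbitrary illustrative vectors:

v <- c(3, 4)                   # illustrative vectors
w <- c(-4, 3)

sqrt(sum(v^2))                 # length ||v|| = 5
sum(v * w)                     # inner product v'w = 0, so v and w are orthogonal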

4.2 The Column Space of \(\mathbf{X}\)

For any \((k+1)\)-vector \(\mathbf{b}\), the product \(\mathbf{Xb}\) is an \(n\)-vector that is a linear combination of the columns of \(\mathbf{X}\). The set of all such vectors:

\[ \mathcal{C}(\mathbf{X}) = \{\mathbf{Xb} : \mathbf{b} \in \mathbb{R}^{k+1}\} \]

is the column space (or range) of \(\mathbf{X}\). It is a subspace of \(\mathbb{R}^n\).

  • If \(\mathbf{X}\) has \(k+1\) linearly independent columns, then \(\mathcal{C}(\mathbf{X})\) is a \((k+1)\)-dimensional flat (hyperplane through the origin) inside \(\mathbb{R}^n\).
  • With \(k+1 < n\) (more observations than parameters — the usual case), this flat is a proper subspace: \(\mathbf{y}\) will generically not lie in \(\mathcal{C}(\mathbf{X})\).

4.3 A Concrete Illustration

Two observations, two parameters. Suppose \(n = 2\) and \(\mathbf{X}\) has two linearly independent columns. Then \(\mathcal{C}(\mathbf{X}) = \mathbb{R}^2\): every 2-vector \(\mathbf{y}\) lies in \(\mathcal{C}(\mathbf{X})\), so we can fit the data perfectly with zero residuals.

Three observations, two parameters. Now \(n = 3\), \(k+1 = 2\). \(\mathcal{C}(\mathbf{X})\) is a 2-dimensional plane through the origin in \(\mathbb{R}^3\). A generic \(\mathbf{y} \in \mathbb{R}^3\) does not lie in this plane, so perfect fit is impossible.

# Three houses: bedrooms = 4, 1, 1; prices = 10, 4, 6 (hundreds of thousands)
y <- c(10, 4, 6)
X <- cbind(1, c(4, 1, 1))   # [intercept | bedrooms]

# OLS: beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X) %*% (t(X) %*% y)
y_hat    <- X %*% beta_hat
u_hat    <- y - y_hat

cat("beta_hat:", round(beta_hat, 4), "\n")
beta_hat: 3.333 1.667 
cat("y_hat:   ", round(y_hat, 4),    "\n")
y_hat:    10 5 5 
cat("u_hat:   ", round(u_hat, 4),    "\n")
u_hat:    0 -1 1 

The residuals are non-zero: the data vector \(\mathbf{y}\) does not lie in \(\mathcal{C}(\mathbf{X})\).

4.4 OLS as Orthogonal Projection

The OLS objective is:

\[ \min_{\mathbf{b}} \|\mathbf{y} - \mathbf{Xb}\|^2 = \min_{\mathbf{b}} \sum_{i=1}^n (y_i - \mathbf{x}_i'\mathbf{b})^2 \]

This is the problem of finding the point in \(\mathcal{C}(\mathbf{X})\) closest to \(\mathbf{y}\) in Euclidean distance. The solution is the orthogonal projection of \(\mathbf{y}\) onto \(\mathcal{C}(\mathbf{X})\).

The projection \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) is the unique point in \(\mathcal{C}(\mathbf{X})\) such that the residual \(\hat{\mathbf{u}} = \mathbf{y} - \hat{\mathbf{y}}\) is perpendicular to the entire column space:

\[ \hat{\mathbf{u}} \perp \mathcal{C}(\mathbf{X}) \]

Note: The Orthogonality Principle

\[\mathbf{X'}\hat{\mathbf{u}} = \mathbf{0}\]

The OLS residual vector is orthogonal to every column of \(\mathbf{X}\). This single equation contains the entire geometry of OLS and is more important than any formula.

The orthogonality condition says: the residual is uncorrelated with every regressor (including the constant). Writing it out column by column for \(\mathbf{X} = [\mathbf{1}, \mathbf{x}_1, \ldots, \mathbf{x}_k]\):

\[ \mathbf{1}'\hat{\mathbf{u}} = \sum_{i=1}^n \hat{u}_i = 0 \qquad \mathbf{x}_j'\hat{\mathbf{u}} = \sum_{i=1}^n x_{ij}\hat{u}_i = 0, \quad j = 1, \ldots, k \]

These are exactly the scalar normal equations (1) and (2) we derived from the first-order conditions.

# Verify X'u_hat = 0 for the house example
round(t(X) %*% u_hat, 10)   # should be (0, 0)
     [,1]
[1,]    0
[2,]    0

4.5 Deriving the OLS Formula

Substituting \(\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\) into \(\mathbf{X'}\hat{\mathbf{u}} = \mathbf{0}\):

\[ \mathbf{X}'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0} \]

\[ \mathbf{X'y} = \mathbf{X'X}\hat{\boldsymbol{\beta}} \tag{Normal Equations} \]

\[ \boxed{\hat{\boldsymbol{\beta}} = (\mathbf{X'X})^{-1}\mathbf{X'y}} \]

provided \(\mathbf{X'X}\) is invertible, which requires the columns of \(\mathbf{X}\) to be linearly independent (no perfect multicollinearity).
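The invertibility requirement is easy to see numerically. In the sketch below (illustrative data, not from the text), the third column of \(\mathbf{X}\) is an exact multiple of the second, so \(\mathbf{X'X}\) is singular and solve() fails:

x1    <- c(1, 2, 3, 4)              # illustrative data
X_bad <- cbind(1, x1, 2 * x1)       # third column = 2 x second: perfect collinearity

try(solve(t(X_bad) %*% X_bad))      # fails: X'X is singular
qr(X_bad)$rank                      # rank 2 < 3 columns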

4.6 The Projection and Annihilator Matrices

Substituting back, the fitted values are:

\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X'X})^{-1}\mathbf{X'y} \equiv \mathbf{P}_X \mathbf{y} \]

where \(\mathbf{P}_X = \mathbf{X}(\mathbf{X'X})^{-1}\mathbf{X'}\) is the hat matrix (it puts the “hat” on \(\mathbf{y}\)). The residual vector is:

\[ \hat{\mathbf{u}} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{P}_X)\mathbf{y} \equiv \mathbf{M}_X \mathbf{y} \]

where \(\mathbf{M}_X = \mathbf{I} - \mathbf{P}_X\) is the annihilator matrix. Both \(\mathbf{P}_X\) and \(\mathbf{M}_X\) are symmetric and idempotent (\(\mathbf{P}_X^2 = \mathbf{P}_X\), \(\mathbf{M}_X^2 = \mathbf{M}_X\)), and they are orthogonal to each other: \(\mathbf{P}_X \mathbf{M}_X = \mathbf{0}\).

n <- nrow(X)
P <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
M <- diag(n) - P                          # annihilator

# Idempotency: P^2 = P
cat("Max deviation P^2 - P:", max(abs(P %*% P - P)), "\n")
Max deviation P^2 - P: 0.000000000000000111 
# Orthogonality: P M = 0
cat("Max deviation PM:     ", max(abs(P %*% M)),     "\n")
Max deviation PM:      0.000000000000000111 
# y_hat and u_hat via projection
round(P %*% y, 4)   # same as y_hat above
     [,1]
[1,]   10
[2,]    5
[3,]    5
round(M %*% y, 4)   # same as u_hat above
     [,1]
[1,]    0
[2,]   -1
[3,]    1

5 Algebraic Properties of OLS

The orthogonality conditions \(\mathbf{X'}\hat{\mathbf{u}} = \mathbf{0}\) have direct algebraic consequences. Writing out the two columns of \(\mathbf{X} = [\mathbf{1}, \mathbf{x}]\):

Property 1. \(\displaystyle\sum_{i=1}^n \hat{u}_i = 0\)

The residuals sum to zero. OLS over-predicts as often as it under-predicts. (Follows from the column of ones being in \(\mathbf{X}\).)

Property 2. \(\displaystyle\sum_{i=1}^n x_i \hat{u}_i = 0\)

The residuals are uncorrelated with the regressor in the sample.

Property 3. \(\bar{y} = \bar{\hat{y}} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}\)

The sample mean of fitted values equals the sample mean of \(y\). The estimated regression line passes through \((\bar{x}, \bar{y})\).

data("wage1", package = "wooldridge")
fit <- lm(wage ~ educ, data = wage1)
u   <- residuals(fit)
x   <- wage1$educ

cat("Sum of residuals:       ", round(sum(u), 10),       "\n")
Sum of residuals:        0 
cat("Cov(x, u_hat):          ", round(sum(x * u), 10),   "\n")
Cov(x, u_hat):           0 
cat("Mean(y_hat) == mean(y): ",
    isTRUE(all.equal(mean(fitted(fit)), mean(wage1$wage))), "\n")
Mean(y_hat) == mean(y):  TRUE 

6 Goodness of Fit: A Geometric Derivation

Because \(\hat{\mathbf{u}} \perp \hat{\mathbf{y}}\) — and because the column of ones in \(\mathbf{X}\) forces \(\hat{\mathbf{u}} \perp \mathbf{1}\), which means \(\bar{\hat{u}} = 0\) and hence \(\bar{\hat{y}} = \bar{y}\) — we can work with demeaned vectors.

Define \(\tilde{\mathbf{y}} = \mathbf{y} - \bar{y}\mathbf{1}\) and \(\tilde{\hat{\mathbf{y}}} = \hat{\mathbf{y}} - \bar{y}\mathbf{1}\). Since \(\hat{\mathbf{u}} = \mathbf{y} - \hat{\mathbf{y}}\), we have:

\[ \tilde{\mathbf{y}} = \tilde{\hat{\mathbf{y}}} + \hat{\mathbf{u}} \]

Squaring both sides (taking inner products) and using \(\tilde{\hat{\mathbf{y}}}'\hat{\mathbf{u}} = 0\) (orthogonality; verify: \(\hat{\mathbf{y}}'\hat{\mathbf{u}} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\hat{\mathbf{u}} = 0\) and \(\mathbf{1}'\hat{\mathbf{u}} = 0\)):

\[ \|\tilde{\mathbf{y}}\|^2 = \|\tilde{\hat{\mathbf{y}}}\|^2 + \|\hat{\mathbf{u}}\|^2 \]

\[ \underbrace{\sum_{i=1}^n (y_i - \bar{y})^2}_{SST} = \underbrace{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}_{SSE} + \underbrace{\sum_{i=1}^n \hat{u}_i^2}_{SSR} \]

This is the Pythagorean theorem applied to the OLS decomposition — it holds as an algebraic identity, not as an approximation, and follows directly from orthogonality.

Note: Coefficient of Determination

\[R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\]

\(R^2\) is the squared cosine of the angle between \(\tilde{\mathbf{y}}\) and \(\tilde{\hat{\mathbf{y}}}\) in \(\mathbb{R}^n\). It lies in \([0, 1]\).
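A quick numerical check of the cosine interpretation, reusing y and y_hat from the three-house example in Section 4.3 (assuming those objects are still in the workspace):

y_t  <- y     - mean(y)            # demeaned y
yh_t <- y_hat - mean(y)            # demeaned fitted values (mean(y_hat) = mean(y))

cos_theta <- sum(y_t * yh_t) / (sqrt(sum(y_t^2)) * sqrt(sum(yh_t^2)))
cos_theta^2                        # equals SSE/SST = R^2 (about 0.893 here)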

Important: \(R^2\) Does Not Measure Causal Validity

A high \(R^2\) does not imply a causal interpretation. The election spending example in Wooldridge has \(R^2 = 0.856\), yet the estimated relationship is still only a correlation. The wage–education regression has \(R^2 = 0.165\), but its slope may be closer to a causal effect (especially once confounders are controlled for).

6.1 The Standard Error of the Regression

We need to estimate \(\sigma^2 = \text{Var}(u)\). The natural candidate is \(SSR/n\), but this is biased because we “used up” \(k+1\) degrees of freedom estimating \(\boldsymbol{\beta}\). The unbiased estimator is:

\[ \hat{\sigma}^2 = \frac{SSR}{n - (k+1)} = \frac{\sum \hat{u}_i^2}{n - k - 1} \]

For simple regression (\(k = 1\)): \(\hat{\sigma}^2 = SSR / (n - 2)\).

The standard error of the regression (SER) is \(\hat{\sigma} = \sqrt{\hat{\sigma}^2}\), reported as sigma in glance(). It measures residual spread in the units of \(y\).
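Using the fit and u objects from the wage regression in Section 5 (assuming they are still in the workspace), the SER can be computed by hand and compared with the value summary() reports:

n_obs     <- length(u)                       # u = residuals(fit) from Section 5
sigma_hat <- sqrt(sum(u^2) / (n_obs - 2))    # simple regression: k = 1

c(manual = sigma_hat, from_summary = summary(fit)$sigma)   # both about 3.38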


7 Running Example: Returns to Education

data("wage1", package = "wooldridge")
wage1 |>
  select(wage, educ, exper, female, nonwhite) |>
  datasummary_skim()
           Unique  Missing Pct.  Mean    SD   Min  Median   Max
wage          241             0   5.9   3.7   0.5     4.7  25.0
educ           18             0  12.6   2.8   0.0    12.0  18.0
exper          51             0  17.0  13.6   1.0    13.5  51.0
female          2             0   0.5   0.5   0.0     0.0   1.0
nonwhite        2             0   0.1   0.3   0.0     0.0   1.0
ggplot(wage1, aes(x = educ, y = wage)) +
  geom_jitter(alpha = 0.4, width = 0.2, colour = "#2c7be5") +
  geom_smooth(method = "lm", colour = "#e63946", linewidth = 1.2, se = TRUE) +
  labs(x = "Years of education", y = "Hourly wage (USD)",
       title = "Wages and Education: wage1 Dataset")
Figure 1: Hourly wages and years of education. Line = OLS fit with 95% confidence band.

7.1 Fitting and Reading OLS Output

fit_slr <- lm(wage ~ educ, data = wage1)
tidy(fit_slr, conf.int = TRUE)
glance(fit_slr) |> select(r.squared, adj.r.squared, sigma, nobs)
  • Intercept \((\hat{\beta}_0 \approx -0.9)\): Predicted wage with zero years of education. Interpret with caution — this is outside the support of the data.
  • Slope \((\hat{\beta}_1 \approx 0.54)\): Each additional year of schooling is associated with $0.54 more per hour on average.
  • \(R^2 \approx 0.165\): Education explains 16.5% of the variation in wages. A low \(R^2\) here is not a problem; many factors other than education affect wages.
  • SER \(\approx 3.38\): Residuals are on average $3.38 away from the fitted line.

7.2 Manual Computation and Verification

wage1 |>
  summarise(
    x_bar  = mean(educ),
    y_bar  = mean(wage),
    cov_xy = sum((educ - mean(educ)) * (wage - mean(wage))),
    var_x  = sum((educ - mean(educ))^2),
    beta_1 = cov_xy / var_x,
    beta_0 = y_bar - beta_1 * x_bar
  )
# Matrix form: beta_hat = (X'X)^{-1} X'y
y_vec <- wage1$wage
X_mat <- cbind(1, wage1$educ)

beta_matrix <- solve(t(X_mat) %*% X_mat) %*% (t(X_mat) %*% y_vec)
round(beta_matrix, 4)   # identical to lm() output
        [,1]
[1,] -0.9049
[2,]  0.5414

All three routes — closed-form formula, lm(), and matrix algebra — agree exactly.

7.3 Verifying the Decomposition

aug <- augment(fit_slr)

SST <- sum((aug$wage        - mean(aug$wage))^2)
SSE <- sum((aug$.fitted     - mean(aug$wage))^2)
SSR <- sum( aug$.resid^2)

cat("SST:", round(SST, 2), "\n")
SST: 7160 
cat("SSE:", round(SSE, 2), "\n")
SSE: 1180 
cat("SSR:", round(SSR, 2), "\n")
SSR: 5981 
cat("SST == SSE + SSR:", isTRUE(all.equal(SST, SSE + SSR)), "\n")
SST == SSE + SSR: TRUE 
cat("R-squared:", round(SSE / SST, 4), "\n")
R-squared: 0.1648 

7.4 Residual Diagnostics

aug |>
  ggplot(aes(.fitted, .resid)) +
  geom_point(alpha = 0.4, colour = "#2c7be5") +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "#e63946") +
  geom_smooth(se = FALSE, colour = "#f4a261", linewidth = 1) +
  labs(x = "Fitted values", y = "Residuals",
       title = "Residuals vs. Fitted Values")
Figure 2: Residuals versus fitted values. A pattern here signals model misspecification.

The spread of residuals increases slightly with fitted values — a hint of heteroskedasticity that we return to in Chapter 9.


8 Multiple Regression: A Preview

With two regressors the PRF becomes \(E[y \mid x_1, x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2\).

The partial effect of \(x_1\) holding \(x_2\) fixed is \(\Delta \hat{y} = \hat{\beta}_1 \Delta x_1\). That is, \(\hat{\beta}_1\) estimates the effect of \(x_1\) on \(y\) after partialling out the variation that is linearly attributable to \(x_2\), as we verify numerically at the end of this section.

data("wage2", package = "wooldridge")

fit_iq <- lm(wage ~ educ + IQ, data = wage2)

modelsummary(
  list("SLR (educ only)" = lm(wage ~ educ, data = wage2),
       "MLR (educ + IQ)"  = fit_iq),
  stars   = TRUE,
  gof_map = c("nobs", "r.squared"),
  title   = "Education and IQ as Determinants of Wages"
)
Education and IQ as Determinants of Wages

              SLR (educ only)   MLR (educ + IQ)
(Intercept)       146.952+         -128.890
                  (77.715)          (92.182)
educ               60.214***         42.058***
                    (5.695)           (6.550)
IQ                                     5.138***
                                      (0.956)
Num.Obs.              935               935
R2                  0.107             0.134

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Including IQ as a control reduces the education coefficient substantially. The SLR coefficient was picking up intelligence: smart people both acquire more education and earn more. Once IQ is held constant, the estimated return to education is smaller and represents a more credible partial effect.
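One way to see the partialling-out logic numerically is a sketch of the Frisch–Waugh–Lovell result (anticipating the multiple regression chapter): regress educ on IQ, keep the residuals, and regress wage on those residuals. The slope reproduces the MLR coefficient on educ.

# Partialling out: purge educ of its linear relationship with IQ
educ_resid <- residuals(lm(educ ~ IQ, data = wage2))

# Regressing wage on the purged regressor reproduces the MLR coefficient on educ
coef(lm(wage2$wage ~ educ_resid))["educ_resid"]
coef(fit_iq)["educ"]                              # same number, about 42.06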


9 Tutorials

Tutorial 3.1 Using wooldridge::wage1:

  1. Regress wage on exper (experience). Report and interpret \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
  2. What is the predicted wage for someone with 10 years of experience?
  3. What is \(R^2\)? Is it higher or lower than the education regression? What does this tell you?
fit_exper <- lm(wage ~ exper, data = wage1)
tidy(fit_exper, conf.int = TRUE)
predict(fit_exper, newdata = tibble(exper = 10))
    1 
5.681 
glance(fit_exper) |> select(r.squared)
  • \(\hat{\beta}_0 \approx 5.37\): predicted wage with zero experience.
  • \(\hat{\beta}_1 \approx 0.031\): each additional year of experience adds approximately $0.031 to the hourly wage.
  • Predicted wage at 10 years: $5.68.
  • \(R^2\) from experience: 0.013, lower than the education regression (0.165). Education explains more cross-sectional variation in wages than experience does.

Tutorial 3.2 Verify the geometry of OLS numerically. Use any three-observation dataset of your choice (or the house example from Section 4).

  1. Construct \(\mathbf{y}\), \(\mathbf{X}\), and compute \(\hat{\boldsymbol{\beta}} = (\mathbf{X'X})^{-1}\mathbf{X'y}\) by hand in R.
  2. Verify that \(\mathbf{X'}\hat{\mathbf{u}} = \mathbf{0}\).
  3. Compute \(\mathbf{P}_X\) and \(\mathbf{M}_X\). Check that they are idempotent and orthogonal to each other.
  4. Verify Pythagoras: \(\|\tilde{\mathbf{y}}\|^2 = \|\tilde{\hat{\mathbf{y}}}\|^2 + \|\hat{\mathbf{u}}\|^2\).
y <- c(10, 4, 6)
X <- cbind(1, c(4, 1, 1))

# (a) OLS via matrix formula
beta_hat <- solve(t(X) %*% X) %*% (t(X) %*% y)
y_hat    <- X %*% beta_hat
u_hat    <- y - y_hat

cat("beta_hat:", round(beta_hat, 4), "\n")
beta_hat: 3.333 1.667 
# (b) Orthogonality
cat("X'u_hat:", round(t(X) %*% u_hat, 10), "\n")
X'u_hat: 0 0 
# (c) Projection matrices
n  <- length(y)
P  <- X %*% solve(t(X) %*% X) %*% t(X)
M  <- diag(n) - P
cat("Idempotent P:  ", max(abs(P %*% P - P)) < 1e-10, "\n")
Idempotent P:   TRUE 
cat("Idempotent M:  ", max(abs(M %*% M - M)) < 1e-10, "\n")
Idempotent M:   TRUE 
cat("P orthog. to M:", max(abs(P %*% M))     < 1e-10, "\n")
P orthog. to M: TRUE 
# (d) Pythagorean decomposition
ybar  <- mean(y)
SST   <- sum((y     - ybar)^2)
SSE   <- sum((y_hat - ybar)^2)
SSR   <- sum(u_hat^2)
cat("SST:", round(SST, 4), " SSE:", round(SSE, 4), " SSR:", round(SSR, 4), "\n")
SST: 18.67  SSE: 16.67  SSR: 2 
cat("SST == SSE + SSR:", isTRUE(all.equal(SST, SSE + SSR)), "\n")
SST == SSE + SSR: TRUE 

Tutorial 3.3 Using wooldridge::ceosal1, regress CEO salary (salary) on firm sales (sales).

  1. Interpret the slope coefficient.
  2. Does the linear fit look appropriate? (Plot salary vs. sales.)
  3. Regress log(salary) on log(sales). How does the slope interpretation change?
data("ceosal1", package = "wooldridge")

fit_levels <- lm(salary ~ sales, data = ceosal1)
fit_logs   <- lm(log(salary) ~ log(sales), data = ceosal1)

p1 <- ggplot(ceosal1, aes(sales, salary)) +
  geom_point(alpha = 0.4) + geom_smooth(method = "lm") +
  labs(title = "Levels")

p2 <- ggplot(ceosal1, aes(log(sales), log(salary))) +
  geom_point(alpha = 0.4) + geom_smooth(method = "lm") +
  labs(title = "Log–Log")

p1 + p2

CEO salary on sales: levels vs. log–log.
modelsummary(list("Levels" = fit_levels, "Log-Log" = fit_logs),
             stars = TRUE, gof_map = c("nobs", "r.squared"))
              Levels        Log-Log
(Intercept)   1174.005***   4.822***
              (112.813)     (0.288)
sales            0.015+
                (0.009)
log(sales)                   0.257***
                            (0.035)
Num.Obs.         209           209
R2               0.014         0.211

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001


In the log–log model, \(\hat{\beta}_1\) is an elasticity: a 1% increase in sales is associated with approximately 0.257% increase in salary. The log–log specification fits considerably better because both variables are right-skewed; the levels regression is distorted by outliers.