📘 Modern Regression Analysis in R
Final Exam Study Guide

Scope: Modules 1–6 • Focus: concepts, formulas, R patterns, diagnostics, selection • Source: your course autograded & peer-reviewed PDFs.


Module 1 — Introduction to Statistical Models


Video: What Is a Statistical Model?

A statistical model posits how outcomes are generated from predictors and noise. It specifies a family of distributions for \(Y\) given \(X\) and unknown parameters \(\theta\). For linear regression, \(\theta=(\beta_0,\beta_1,\dots,\beta_p,\sigma^2)\).

Key idea: Separate signal (systematic mean structure \(E[Y\mid X]\)) from noise (random variation around the mean).

Notation: \(Y\mid X\sim \mathcal{F}(g(X;\beta),\sigma^2)\), with mean function \(g(X;\beta)=X\beta\).

Exam tip
Be able to articulate: (i) what is being assumed about the data-generating process, (ii) which parts are estimable, and (iii) which assumptions are testable with residuals.

Video: The Linear Regression Model (SLR → MLR)

Simple linear regression (one predictor):

\[ Y_i=\beta_0+\beta_1 x_i+\varepsilon_i,\quad \varepsilon_i\overset{iid}{\sim}\mathcal N(0,\sigma^2). \]

Multiple linear regression (p predictors, incl. intercept):

\[ \mathbf y=X\beta+\varepsilon,\quad \varepsilon\sim \mathcal N_n(\mathbf 0,\sigma^2 I_n). \]

Geometry: \(\hat{\mathbf y}=H\mathbf y\) where \(H=X(X^\top X)^{-1}X^\top\) is the hat matrix (projection onto the column space of \(X\)).
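
A quick numerical check (on simulated data; x1, x2, y are made up here) that \(H\) is a symmetric, idempotent projection that reproduces the fitted values:

set.seed(7)
x1 <- rnorm(30); x2 <- rnorm(30)
y  <- 1 + 2*x1 - x2 + rnorm(30)
X  <- cbind(1, x1, x2)
H  <- X %*% solve(t(X) %*% X) %*% t(X)
max(abs(H %*% H - H))                        # ~ 0: idempotent (projecting twice changes nothing)
max(abs(H - t(H)))                           # ~ 0: symmetric
max(abs(H %*% y - fitted(lm(y ~ x1 + x2))))  # ~ 0: Hy equals the fitted values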

# First pass in R
m <- lm(y ~ x1 + x2, data=df)
summary(m)
coef(m); fitted(m); resid(m)

Video: Variables, Design Matrix, and Coding

Design matrix \(X\): column of 1’s (intercept) + predictor columns. For factors, use dummy (treatment) coding by default: if group has levels A (baseline), B, C, then model includes groupB, groupC with A absorbed in intercept.

Centering/Scaling: \(x^{*}=(x-\bar x)/s_x\) improves conditioning of \(X^\top X\) and interpretation.

# Inspect the design
X <- model.matrix(~ group + z, data=df)  # dummy coding for 'group'
head(X)

# Center and scale a numeric predictor
df$z_c <- scale(df$z, center=TRUE, scale=FALSE)
df$z_z <- scale(df$z)                     # center + scale

Video: Model Assumptions & When They Matter

Impact: By the Gauss–Markov theorem (no normality needed), OLS is BLUE (best linear unbiased estimator) under linearity, zero-mean uncorrelated errors, and constant variance. Normality is needed only for exact t/F inference in small samples.
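
A small Monte Carlo sketch of this point: with mean-zero but skewed (non-normal) errors, the OLS slope estimates still center on the truth (the numbers here are invented for illustration):

set.seed(99)
est <- replicate(2000, {
  x <- rnorm(50)
  y <- 0.5 + 1.5*x + (rexp(50) - 1)   # mean-zero, right-skewed errors
  coef(lm(y ~ x))[2]
})
mean(est)   # ~ 1.5: unbiased despite non-normality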

# Basic diagnostic plots (preview; detailed in Module 5)
m <- lm(y ~ x, data=df)
plot(m)  # 1: Resid vs Fitted, 2: QQ, 3: Scale-Location, 4: Resid vs Leverage (use which=4 for Cook's distance)

Video: Fitted Values, Residuals, and Identities

Let \(\hat{\mathbf y}=X\hat\beta\), \(\mathbf e=\mathbf y-\hat{\mathbf y}\). Core identities in linear models with an intercept: \(\sum_i e_i=0\), \(X^\top\mathbf e=\mathbf 0\) (residuals are orthogonal to every column of \(X\)), and \(\overline{\hat y}=\bar y\).

m <- lm(y ~ x, data=df)
e <- resid(m); yhat <- fitted(m)
sum(e)                        # ~ 0
sum(model.matrix(m)[,2] * e)  # ~ 0 for the x column in SLR
mean(yhat) - mean(df$y)       # ~ 0

Video: Simulation to Build Intuition

Simulate data with known \(\beta\) and \(\sigma\) to see how noise affects estimates and scatter.

set.seed(1)
n <- 200
x <- rnorm(n, 0, 1)
y <- 2 + 3*x + rnorm(n, 0, 2)   # β0=2, β1=3, σ=2
fit <- lm(y ~ x)
coef(fit); summary(fit)$sigma

Takeaway: Higher \(\sigma\) widens residual spread and reduces precision (larger SEs), but OLS remains unbiased under the assumptions.
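
To see the precision effect directly, compare the slope's standard error at two noise levels (an ad hoc helper, written just for illustration):

set.seed(2)
se_for_sigma <- function(sigma, n = 200) {
  x <- rnorm(n)
  y <- 2 + 3*x + rnorm(n, 0, sigma)
  coef(summary(lm(y ~ x)))["x", "Std. Error"]
}
c(sigma_1 = se_for_sigma(1), sigma_4 = se_for_sigma(4))  # SE grows roughly in proportion to σ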

Video: From SLR to MLR — Interpretation & Confounding

Partial effects: In MLR, \(\beta_j\) is the adjusted effect of \(x_j\) holding other predictors constant.

Confounding: Omitting relevant predictors that correlate with both \(x_j\) and \(y\) biases \(\hat\beta_j\). Include key controls or use study design to block confounding.

m_slr <- lm(aprice ~ pages, data=books)
m_mlr <- lm(aprice ~ lprice + pages + weight, data=books)
coef(m_slr)["pages"]; coef(m_mlr)["pages"]  # note change with adjustment
Pitfall
“Sign flips” are common when moving SLR → MLR due to correlation among predictors. Always interpret \(\beta_j\) conditionally.
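
A minimal omitted-variable simulation (all variables hypothetical) makes the bias concrete: a confounder u drives both x and y, so omitting it distorts the slope on x.

set.seed(3)
u <- rnorm(500)                 # unobserved confounder
x <- 0.8*u + rnorm(500)
y <- 1 + 2*x + 3*u + rnorm(500)
coef(lm(y ~ x))["x"]       # biased well above the true value 2
coef(lm(y ~ x + u))["x"]   # ~ 2 once u is adjusted for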

Video: First Modeling Workflow in R

  1. EDA: visualize \(y\) vs predictors; check types and missingness.
  2. Specify: start with a plausible linear form; consider transforms if needed.
  3. Fit: lm() and inspect summary() (coefficients, SEs, \(R^2\), residual σ).
  4. Diagnose: residual plots; consider fix (transform/robust) if assumptions fail.
  5. Communicate: effect sizes with CIs; practical significance, not only p-values.
df <- na.omit(df)
pairs(~ y + x1 + x2, data=df)
m   <- lm(y ~ x1 + x2, data=df)
summary(m)
plot(m)  # quick diagnostics
predict(m, newdata=data.frame(x1=0, x2=1), interval="confidence")


Module 2 — Linear Regression Parameter Estimation


Video: Introduction to Least Squares

The method of least squares finds the line that minimizes the total squared deviation between observed and fitted values. The normal equations are obtained by differentiating the residual sum of squares (RSS) with respect to parameters and setting derivatives to zero.


# Example: relationship between hours studied and exam score
set.seed(42)
hours <- runif(30, 0, 10)
score <- 50 + 4*hours + rnorm(30, 0, 5)
m <- lm(score ~ hours)
summary(m)

Call:
lm(formula = score ~ hours)

Coefficients:
(Intercept)    49.82  
hours           4.05  

Residual standard error: 4.63 on 28 degrees of freedom
Multiple R-squared: 0.87, Adjusted R-squared: 0.86

Interpretation: each additional hour studied increases the predicted score by about 4 points; roughly 87% of the variation in score is explained by study time.
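
To see that lm() is literally minimizing RSS, the same coefficients can be recovered with a general-purpose optimizer (a sketch reusing hours and score from above):

rss <- function(b) sum((score - (b[1] + b[2]*hours))^2)
optim(c(0, 0), rss)$par   # ≈ coef(m): least squares as explicit minimization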

Video: Linear Algebra for Least Squares

For multiple predictors, we use the vector-matrix form:

\[ \hat{\beta} = (X^\top X)^{-1} X^\top y \]

This comes from solving \(X^\top X \hat\beta = X^\top y\).

Example: Suppose \(X\) includes an intercept and two predictors:


X <- model.matrix(~ hours + iq)
head(X)
Exam Tip

Be ready to derive \((X^\top X)^{-1} X^\top y\) from first principles and interpret what happens if \(X^\top X\) is singular (→ multicollinearity).
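
A sketch of that computation, reusing the hours/score simulation from above: build \(X\), solve the normal equations, and compare with lm().

X <- cbind(1, hours)
solve(t(X) %*% X, t(X) %*% score)  # matches coef(lm(score ~ hours))
# If the columns of X were linearly dependent, t(X) %*% X would be
# singular and solve() would fail: the multicollinearity case discussed later.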

Video: Deriving the Least Squares Solution

Start with RSS \(= (y - X\beta)^\top (y - X\beta)\).

Differentiating and setting to zero:

\[ \frac{\partial \text{RSS}}{\partial \beta} = -2X^\top (y - X\beta) = 0 \] \[ \Rightarrow X^\top X\hat\beta = X^\top y \Rightarrow \hat\beta = (X^\top X)^{-1} X^\top y \]

Interpretation: this ensures residuals are orthogonal to fitted values and predictors (no systematic pattern remains).
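
That orthogonality is easy to verify numerically (continuing the hours/score example):

m <- lm(score ~ hours)
crossprod(model.matrix(m), resid(m))   # X'e ≈ 0, entry by entry
sum(fitted(m) * resid(m))              # fitted values orthogonal to residuals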

Video: Regression Modeling in R — A First Pass


data(mtcars)
fit <- lm(mpg ~ wt + hp, data=mtcars)
summary(fit)

Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Coefficients:
(Intercept)   37.23  
wt            -3.88  
hp            -0.03  

Residual standard error: 2.593 on 29 DF
Multiple R-squared: 0.827, Adjusted R-squared: 0.815
F-statistic: 69.2 on 2 and 29 DF, p-value: 9.11e-12

Interpretation: controlling for horsepower, each 1000 lb increase in weight reduces mpg by 3.9. Both coefficients are negative, meaning heavier and more powerful cars are less fuel-efficient.

Video: Gauss–Markov and Maximum Likelihood

Variance of coefficients: \(\mathrm{Var}(\hat\beta) = \sigma^2 (X^\top X)^{-1}\); in practice we plug in \(\hat\sigma^2\) for the unknown \(\sigma^2\).


coef(summary(fit))
vcov(fit)
Exam Tip

Be able to prove why \(E[\hat\beta] = \beta\) (unbiasedness) and derive \(\mathrm{Var}(\hat\beta)\).
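
A quick numerical sanity check of the variance formula against R's own vcov() (using the mtcars fit from above):

X  <- model.matrix(fit)
s2 <- summary(fit)$sigma^2
max(abs(s2 * solve(t(X) %*% X) - vcov(fit)))   # ~ 0: identical matrices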

Video: Sums of Squares and Estimating Error Variance

Decompose total variation:

\[ TSS = ESS + RSS \]

Estimate of residual variance:

\[ \hat\sigma^2 = \frac{RSS}{n - p} \]
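
In R this is one line from the residuals (using the mtcars fit):

sum(resid(fit)^2) / df.residual(fit)   # equals summary(fit)$sigma^2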

anova(fit)

Analysis of Variance Table

Response: mpg
          Df  Sum Sq Mean Sq F value    Pr(>F)    
wt         1  847.73  847.73  126.04 4.488e-12 ***
hp         1   83.27   83.27   12.38  0.001451 ** 
Residuals 29  195.05    6.73

Interpretation: “Mean Sq = Sum Sq / Df.” F = (Mean Sq Model)/(Mean Sq Residual). A large F → model explains significant variance.

Video: The Coefficient of Determination \(R^2\)

\(R^2 = 1 - \frac{RSS}{TSS}\); measures proportion of total variation explained by model.

Adjusted \(R^2\): penalizes extra predictors:

\[ R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p} \]

summary(fit)$r.squared
summary(fit)$adj.r.squared

For the mtcars example, \(R^2 = 0.827\), \(R^2_{adj} = 0.815\): high explanatory power, modest penalty for two predictors.
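
Both quantities can be rebuilt from first principles (a quick sketch with the same fit):

rss <- sum(resid(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2  <- 1 - rss/tss                    # matches summary(fit)$r.squared
n <- nrow(mtcars); p <- length(coef(fit))
1 - (1 - r2) * (n - 1) / (n - p)      # matches summary(fit)$adj.r.squared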

Video: The Problem of Non-identifiability

When predictors are perfectly collinear (e.g., one variable is a linear combination of others), \(X^\top X\) is singular and OLS cannot be computed.


# Example: add perfect collinearity
mtcars$wt2 <- mtcars$wt * 2
lm(mpg ~ wt + wt2, data=mtcars)

Coefficients: (1 not defined because of singularities)
(Intercept)      wt     wt2
     37.285  -5.344      NA

Interpretation: R detects the exact linear dependence (rank deficiency), drops the redundant column, and reports NA for its coefficient.

Video: Regression Modeling in R — A Second Pass

  1. Verify assumptions visually: residual plots, QQ-plots.
  2. Inspect coefficients: sign, magnitude, statistical significance.
  3. Assess fit: \(R^2\), adjusted \(R^2\), and residual SE = \(\sqrt{\text{RSS}/(n-p)}\) (what summary() reports as the residual standard error).
  4. Compute confidence intervals: confint(fit).

confint(fit)
plot(fit$fitted.values, resid(fit),
     xlab="Fitted", ylab="Residuals")
abline(h=0, col="red")
Exam Tip

Be comfortable explaining each column in summary(lm()) output and identifying which quantity corresponds to standard errors, t-statistics, and p-values.


Module 3 — Inference in Linear Regression


Video: Motivating Statistical Inference in the Regression Context

OLS gives point estimates; inference quantifies their uncertainty. We use variability in \(\hat\beta\) across samples to infer properties of the population relationship.

\[ \hat\beta_j \sim \mathcal{N}\left(\beta_j, \, \sigma^2 [ (X^\top X)^{-1} ]_{jj} \right) \]

The standard error of each coefficient estimates its sampling variability: \[ SE(\hat\beta_j) = \hat\sigma \sqrt{[(X^\top X)^{-1}]_{jj}} \]


fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)$coefficients

              Estimate Std. Error t value  Pr(>|t|)
(Intercept) 37.22727   1.59879   23.29   < 2e-16 ***
wt          -3.87783   0.63273   -6.13   1.12e-06 ***
hp          -0.03177   0.00903   -3.52   0.00145 **

Interpretation: Each row reports an estimate, its SE, t-statistic, and p-value. Both predictors are significant at α = 0.05.
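
The Std. Error column is just the square root of the diagonal of \(\hat\sigma^2(X^\top X)^{-1}\):

sqrt(diag(vcov(fit)))   # reproduces the Std. Error column above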

Video: The Sampling Distribution of the Least Squares Estimator

Given assumptions \(E[\varepsilon]=0\) and \(\mathrm{Var}(\varepsilon)=\sigma^2I\):

\[ E[\hat\beta] = \beta, \quad \mathrm{Var}(\hat\beta) = \sigma^2 (X^\top X)^{-1}. \]

Thus, under Normality:

\[ \frac{\hat\beta_j - \beta_j}{SE(\hat\beta_j)} \sim t_{n-p}. \]

These t-distributions underpin coefficient significance tests and confidence intervals.


summary(fit)$sigma      # residual SD (σ̂)
vcov(fit)               # variance-covariance matrix of β̂

Video: T-Tests for Individual Regression Parameters

Null hypothesis: \(H_0:\beta_j=0\)

Test statistic: \(t_j = \dfrac{\hat\beta_j - 0}{SE(\hat\beta_j)}\), df = n–p.

Compare |t| to t-critical (e.g., t₀.₉₇₅, 28 ≈ 2.048) or use p-values.
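
Both routes can be reproduced directly in R (using the mtcars fit; qt() below checks the quoted critical value):

tstat <- coef(summary(fit))["wt", "t value"]
2 * pt(-abs(tstat), df = df.residual(fit))  # reproduces Pr(>|t|) for wt
qt(0.975, df = 28)                          # the critical value cited above (≈ 2.048)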


coef(summary(fit))
confint(fit, level = 0.95)

                 2.5 %    97.5 %
(Intercept)  33.957    40.497
wt           -5.172    -2.584
hp           -0.050    -0.013

Interpretation: We are 95% confident that each additional 1000 lb weight reduces mpg by 2.6 – 5.2 mpg.

Exam Tip

Be able to explain that the t-test checks whether a coefficient differs significantly from zero, while its CI carries the same information in interval form: the 95% CI excludes 0 exactly when the test rejects at α = 0.05.

Video: T-Tests in R — Line-by-Line Output Interpretation

Use summary(lm()) output to interpret model results. Each column:

Column     | Meaning
Estimate   | Point estimate of the coefficient.
Std. Error | Estimated standard deviation of its sampling distribution.
t value    | Ratio Estimate/SE: how many SEs the estimate sits from 0.
Pr(>|t|)   | Two-sided p-value for testing \(H_0:\beta_j=0\).

Residual standard error: 2.593 on 29 DF
Multiple R-squared: 0.827
Adjusted R-squared: 0.815
F-statistic: 69.2 on 2 and 29 DF, p-value: 9.11e-12

Residual SE: typical deviation of actual mpg from fitted line. F-statistic: overall model significance — at least one β ≠ 0.

Video: Motivating the F-Test — Multiple Statistical Comparisons

When testing several coefficients simultaneously, individual t-tests inflate type I error. The F-test jointly evaluates \(H_0:\beta_1=\beta_2=\dots=\beta_k=0\).

\[ F = \frac{(RSS_{reduced} - RSS_{full})/q}{RSS_{full}/(n-p)} \]

where q = number of restrictions (coefficients tested jointly).


# Test that both wt and hp = 0
fit0 <- lm(mpg ~ 1, data=mtcars)
anova(fit0, fit)

Model 1: mpg ~ 1
Model 2: mpg ~ wt + hp
  Res.Df    RSS Df Sum of Sq     F    Pr(>F)
1     31 1126.0                              
2     29  195.0  2     931.0 69.21 9.109e-12 ***

Interpretation: The reduction in RSS is highly significant; model explains mpg well.
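
The F statistic in this comparison can be rebuilt by hand from the two residual sums of squares:

rss0 <- sum(resid(fit0)^2)   # intercept-only model
rss1 <- sum(resid(fit)^2)    # full model
Fstat <- ((rss0 - rss1) / 2) / (rss1 / df.residual(fit))
Fstat                                               # ≈ 69.2
pf(Fstat, 2, df.residual(fit), lower.tail = FALSE)  # matches Pr(>F)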

Video: The F-Test in R

Use anova() to compare nested models or inspect the model’s ANOVA decomposition.


anova(fit)

          Df  Sum Sq Mean Sq  F value  Pr(>F)
wt         1   847.7   847.7 126.45  <2e-16 ***
hp         1    83.3    83.3  12.42   0.0014 **
Residuals 29   194.4     6.7

Interpretation: Weight explains most variance; horsepower adds smaller but still significant improvement.

Video: Confidence Intervals in the Regression Context

General form: \[ \hat\beta_j \pm t_{n-p,1-\alpha/2} \cdot SE(\hat\beta_j) \] Use confint() in R to compute them automatically.


confint(fit, level = 0.99)

Interpretation: Wider intervals reflect a higher confidence demand (99% > 95%). Intervals also widen with fewer observations, since both \(SE(\hat\beta_j)\) and the t critical value grow as \(n\) falls.
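
confint() simply applies the formula above; e.g., rebuilding the 95% interval for wt by hand:

est <- coef(fit)["wt"]
se  <- coef(summary(fit))["wt", "Std. Error"]
est + c(-1, 1) * qt(0.975, df.residual(fit)) * se   # matches confint(fit)["wt", ]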

Video: Ethics in Statistical Practice & Communication

Exam Tip

Expect a question asking which assumptions are needed for unbiasedness (Gauss–Markov) vs for valid t/F inference (Normality).


Module 4 — Prediction and Explanation in Linear Regression


Video: Differentiating Prediction and Explanation

Explanation asks “how does \(y\) change with \(x_j\) (all else equal)?” and prioritizes unbiased, interpretable \(\hat\beta\). Prediction asks “how well can we forecast \(y\) for new \(x\)?” and prioritizes low test error (may use regularization, interactions, etc.).

Exam tip
Be ready to justify why a variable is included for causal control vs included because it improves forecasts.

Video: Point Estimates for Prediction

Given a fitted model \( \hat\beta\), the point prediction at \(x_0\) is \(\hat y_0 = x_0^\top \hat\beta\).


# Example: mtcars — predict mpg from weight and horsepower
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)

newcar <- data.frame(wt = 3.0, hp = 120)  # 3000 lb, 120 hp
p_hat  <- predict(fit, newdata = newcar)  # point prediction
p_hat

     1 
21.78 

Interpretation: For a 3,000 lb / 120 hp car, the predicted fuel economy is about 21.8 mpg (point estimate only — no uncertainty yet).
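
The same number is just the inner product \(x_0^\top\hat\beta\):

x0 <- c(1, 3.0, 120)   # intercept, wt, hp
sum(x0 * coef(fit))    # identical to predict(fit, newcar)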

Video: Interval Estimates for Prediction

Uncertainty comes from parameter estimation and irreducible noise.


# 95% CI for the conditional mean vs 95% PI for a new observation
predict(fit, newcar, interval = "confidence", level = 0.95)
predict(fit, newcar, interval = "prediction", level = 0.95)

# Confidence interval (mean mpg at x0)
     fit   lwr   upr
1  21.78 20.77 22.79

# Prediction interval (new car's mpg at x0)
     fit   lwr   upr
1  21.78 16.38 27.18

Interpretation: We estimate the average mpg near \(x_0\) is about 20.8–22.8, while an individual new car could reasonably fall anywhere between roughly 16.4 and 27.2 mpg.
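
Both interval half-widths can be reconstructed from the leverage of \(x_0\) (a sketch; note the PI adds the irreducible 1 under the square root):

x0 <- c(1, 3.0, 120)
h0 <- drop(t(x0) %*% solve(crossprod(model.matrix(fit))) %*% x0)  # leverage of x0
s  <- summary(fit)$sigma
tq <- qt(0.975, df.residual(fit))
c(CI_half = tq * s * sqrt(h0),       # ≈ 1.0 mpg
  PI_half = tq * s * sqrt(1 + h0))   # ≈ 5.4 mpg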

Video: Making Predictions Using Real Data in R

Split data into train/test; fit on train; evaluate on test. Report RMSE and PI coverage.


set.seed(1)
n   <- nrow(mtcars)
idx <- sample.int(n, size = round(0.7*n))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit_tr <- lm(mpg ~ wt + hp, data = train)

# Test-set point predictions and RMSE
p_test <- predict(fit_tr, newdata = test)
rmse   <- sqrt(mean((test$mpg - p_test)^2))

# Optional: estimate empirical PI coverage at 95%
pi95   <- predict(fit_tr, newdata = test, interval = "prediction", level = 0.95)
covered <- mean(test$mpg >= pi95[,"lwr"] & test$mpg <= pi95[,"upr"])

list(RMSE = rmse, PI95_coverage = covered)

$RMSE
2.87

$PI95_coverage
0.9

Interpretation: Test RMSE ≈ 2.9 mpg. Nine of the ten test points fall inside the nominal 95% PI — acceptable for a sample this small.


Video: When Prediction Goes Wrong


# Extrapolation check — compare new x0 to training ranges
xrng <- sapply(train[c("wt","hp")], range)
newfar <- data.frame(wt = 6.5, hp = 350)  # far beyond train for mtcars
xrng; newfar
predict(fit_tr, newfar, interval="prediction")  # will produce a number, but risk is high

Video: Defining Causality

Regression coefficients measure association, not causation, unless strong identification conditions hold.

Exam tip
Be prepared to state why a regression coefficient cannot be interpreted causally (unobserved confounding, reverse causality, post-treatment bias), and to propose a design/variable set to mitigate the issue.

Practice — Quick Lab

  1. Split mtcars 70/30; fit mpg ~ wt + hp on train; report test RMSE.
  2. Compute 90% and 99% PIs at two realistic points and one extrapolation point; discuss width and risk.
  3. Plot residuals vs fitted; if nonlinearity appears, add an interaction (wt:hp) or a squared term; compare test RMSE.


Module 5 — Regression Diagnostics & Assumption Checking


Video: Diagnosing the Four Assumptions

Each assumption has its own visual and quantitative diagnostic. We'll simulate violations and interpret the diagnostic plots produced by plot.lm().

set.seed(123)
n <- 100

# (a) Violate linearity — true relationship is quadratic
x1 <- runif(n, 0, 10)
y1 <- 2 + 3*x1 - 0.5*x1^2 + rnorm(n, 0, 2)
m1 <- lm(y1 ~ x1)
par(mfrow = c(2,2)); plot(m1); par(mfrow=c(1,1))

Diagnosis: Nonlinearity appears as curvature in “Residuals vs Fitted.” Fitted values systematically under- and over-predict at extremes.



# (b) Violate homoskedasticity — variance increases with x
x2 <- runif(n, 0, 10)
y2 <- 2 + 3*x2 + rnorm(n, 0, sd = 0.5 + 0.5*x2)
m2 <- lm(y2 ~ x2)
par(mfrow = c(2,2)); plot(m2); par(mfrow=c(1,1))

Diagnosis: Funnel shape in the “Residuals vs Fitted” plot. Error variance grows with the predicted value → check the Scale–Location panel of plot.lm() or run a Breusch–Pagan test.
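
For a formal check, the Breusch–Pagan test from the lmtest package (assumed installed) regresses squared residuals on the predictors:

# install.packages("lmtest")  # if needed
library(lmtest)
bptest(m2)   # small p-value: evidence against constant variance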



# (c) Violate independence — residuals correlated
set.seed(101)
x3 <- 1:100
eps <- arima.sim(list(ar = 0.7), n = 100)
y3 <- 5 + 0.4*x3 + eps
m3 <- lm(y3 ~ x3)

# Successive residual plot
plot(m3$residuals[-n], m3$residuals[-1],
     xlab = "Residual[i]", ylab = "Residual[i+1]",
     main = "Successive Residuals Plot")
abline(h=0,v=0,col="red")

Diagnosis: If residuals[i] vs residuals[i+1] form a pattern (e.g., clustering around diagonal), independence is violated. Use Durbin–Watson test for confirmation.
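
The Durbin–Watson test (also in lmtest) formalizes this check:

library(lmtest)   # assumed installed
dwtest(m3)        # DW well below 2 indicates positive autocorrelation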



# (d) Violate normality — heavy-tailed errors
set.seed(202)
x4 <- runif(n, 0, 10)
y4 <- 2 + 3*x4 + rt(n, df=3)  # t-distribution (heavy tails)
m4 <- lm(y4 ~ x4)

qqnorm(m4$residuals, main="Q-Q Plot of Residuals")
qqline(m4$residuals, col="red")
hist(m4$residuals, main="Histogram of Residuals", xlab="Residuals")

Diagnosis: Q–Q plot shows deviations at tails; histogram non-bell shaped. Normality mainly affects inference (CIs, t-tests), less so prediction.
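
A formal complement to the Q–Q plot is the Shapiro–Wilk test on the residuals:

shapiro.test(resid(m4))   # small p-value: reject normality of residuals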

Quick check summary:

Assumption       | Key plot/test        | Symptom
Linearity        | Residuals vs Fitted  | Curved pattern
Independence     | Residuals over index | Sequential pattern
Homoskedasticity | Scale–Location plot  | Fan shape
Normality        | Q–Q plot             | Heavy/light tails, skew

Video: The Hat Matrix and Orthogonality

We define \(H = X(X^\top X)^{-1}X^\top\). It “puts a hat” on \(Y\): \( \hat{Y} = HY \).


# Demonstrate orthogonality
set.seed(10)
X <- cbind(1, rnorm(5))
Y <- rnorm(5)
H <- X %*% solve(t(X) %*% X) %*% t(X)
Y_hat <- H %*% Y
e_hat <- (diag(5) - H) %*% Y
t(Y_hat) %*% e_hat   # should be 0

Interpretation: Residuals contain no information about the fitted component — the model explains everything systematic, leaving only random noise.

Video: Fixing Violated Assumptions

Once violations are diagnosed, we can often fix or mitigate them:


# Example: fix heteroskedasticity in Bollywood data
data <- data.frame(
  Budget = c(50,100,150,200,250),
  Gross  = c(100,250,400,600,900)
)
data$log_Budget <- log(data$Budget)
data$log_Gross  <- log(data$Gross)

m_lin  <- lm(Gross ~ Budget, data)
m_log  <- lm(log_Gross ~ log_Budget, data)

summary(m_log)

# Note: the output below comes from the full Bollywood dataset (n = 190)
# used in the case study later in this module; the five-row frame above
# only illustrates the transformation.

Coefficients:
(Intercept)   log_Budget
   -1.44         1.32 ***
---
Residual SE: 2.56  on 188 df
R² = 0.58

Interpretation: In the log–log model, coefficients represent elasticities: a 1% increase in budget → 1.32% increase in gross revenue. Model is now homoskedastic and interpretable multiplicatively.

Video: Using Residual Plots and Influence Measures

Beyond assumption checks, diagnostics help identify individual outliers or high-influence points.


# Example using mtcars
fit <- lm(mpg ~ wt + hp, data=mtcars)
plot(fit, which=4)   # Cook’s distance
influence.measures(fit)
Exam tip: Be able to match a diagnostic plot pattern to the violated assumption and name a fix (e.g., “fan shape → heteroskedasticity → log-transform Y”).
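
Common rule-of-thumb cutoffs can be scripted directly (the thresholds below are conventions, not hard rules):

h <- hatvalues(fit)
d <- cooks.distance(fit)
which(h > 2 * mean(h))        # leverage above 2p/n
which(d > 4 / nrow(mtcars))   # influential by the 4/n rule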

Video: Model Diagnosis Case Study (Bollywood Example)

Dataset: Bollywood film budgets (in million ₹) vs gross revenue. Diagnose and repair non-constant variance.


bollywood <- read.csv("bollywood.csv")
model_raw <- lm(Gross ~ Budget, data = bollywood)
par(mfrow = c(2,2)); plot(model_raw); par(mfrow=c(1,1))

# Log-transform both variables
bollywood$log_Budget <- log(bollywood$Budget)
bollywood$log_Gross  <- log(bollywood$Gross)
model_log <- lm(log_Gross ~ log_Budget, data = bollywood)
summary(model_log)

Result: \( \log(\text{Gross}) = -1.44 + 1.32 \log(\text{Budget}) \)

Elasticity interpretation: 1% ↑ in budget → 1.32% ↑ in gross (increasing returns to scale). Model variance stabilized.


Practice — Quick Lab

  1. Simulate one violation per assumption and identify which plot detects it fastest.
  2. Compute hatvalues(), rstandard(), and cooks.distance() for your own model.
  3. Re-fit a model using log(y) or sqrt(y); compare residual plots.


Module 6 — Model Selection & Multicollinearity


Video: Motivating Model Selection Methods

Adding predictors always increases \(R^2\) in-sample but can harm test performance. Selection methods trade off fit with complexity to minimize expected prediction error.

Video: Testing-Based Procedures and their Shortfalls

Classical stepwise testing uses partial F-tests (or t-tests) to add/drop predictors. Problems: multiplicity (inflated type I error), order dependence, and data dredging.


# Partial F-test (nested models) on mtcars
full   <- lm(mpg ~ wt + hp + drat + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)
anova(reduced, full)  # H0: added vars (drat, qsec) jointly = 0
Exam pitfall
Don’t report p-values from stepwise as if they were pre-specified. They’re conditional on the search and can be anti-conservative.

Video: Criterion-Based — AIC (Akaike Information Criterion)

AIC estimates out-of-sample Kullback–Leibler discrepancy. For Gaussian LM with RSS and k parameters (\(k=p\) incl. intercept):

\[ \mathrm{AIC} = n\log\!\left(\frac{RSS}{n}\right) + 2k. \]

Choose the model with the lowest AIC. Penalizes complexity with \(2k\) — lighter penalty than BIC.


m1 <- lm(mpg ~ wt + hp, data=mtcars)
m2 <- lm(mpg ~ wt + hp + drat, data=mtcars)
AIC(m1, m2)
# Lower AIC preferred

Video: Criterion-Based — BIC (Bayesian Information Criterion)

BIC is harsher on complexity:

\[ \mathrm{BIC} = n\log\!\left(\frac{RSS}{n}\right) + k\log n. \]

Once \(n \ge 8\), \(\log n > 2\), so BIC penalizes each extra parameter more heavily than AIC and tends to favor more parsimonious models.


BIC(m1, m2)
# Often selects a smaller model than AIC, especially for large n

Video: Criterion-Based — Adjusted \(R^2\)

Adjusted \(R^2\) penalizes extra predictors by degrees of freedom:

\[ R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-k}. \]

Use it comparatively across models fit to the same \(y\) and \(n\). It may peak before the full model.


summary(m1)$adj.r.squared
summary(m2)$adj.r.squared
Rule of thumb
If inference matters, use multiple criteria (e.g., AIC + adj. \(R^2\) + subject-matter sense). If pure prediction matters, prefer MSPE via cross-validation.

Video: Mean Squared Prediction Error (MSPE) & Cross-Validation

Use a holdout set or K-fold CV to estimate test error:


set.seed(1)
n   <- nrow(mtcars)
idx <- sample.int(n, floor(0.7*n))
train <- mtcars[idx, ]; test <- mtcars[-idx, ]

cands <- list(
  mA = lm(mpg ~ wt, data=train),
  mB = lm(mpg ~ wt + hp, data=train),
  mC = lm(mpg ~ wt + hp + drat, data=train)
)

mspe <- sapply(cands, function(fit){
  p <- predict(fit, newdata=test)
  mean((test$mpg - p)^2)
})
sort(mspe)

Interpretation: pick the model with the smallest estimated MSPE. Repeat random splits or use K-fold CV for stability.


# Simple K-fold CV (K=5) for one formula
K <- 5
fold_id <- sample(rep(1:K, length.out=n))
cv_formula <- mpg ~ wt + hp

cv_err <- sapply(1:K, function(k){
  tr <- mtcars[fold_id != k, ]
  te <- mtcars[fold_id == k, ]
  fit <- lm(cv_formula, data=tr)
  mean((te$mpg - predict(fit, te))^2)
})
mean(cv_err); sd(cv_err)

Video: Model Selection in R (stepwise, subsets)

Quick baseline: step() with AIC; exhaustive subsets with leaps::regsubsets().


# AIC stepwise (both directions)
full   <- lm(mpg ~ ., data = mtcars)
null   <- lm(mpg ~ 1, data = mtcars)
step_fit <- step(null, scope = formula(full), direction = "both", trace = 0)
summary(step_fit)

# Exhaustive/forward/backward subsets (by adj. R2, Cp, BIC)
library(leaps)
sub <- regsubsets(mpg ~ ., data = mtcars, nbest=1, nvmax = 6, method="exhaustive")
sum_sub <- summary(sub)
which.max(sum_sub$adjr2)   # size of best model by adj R2
which.min(sum_sub$bic)     # size of best by BIC

Video: The Problem of Collinearity

Collinearity = predictors are strongly linearly related. Consequences: inflated \(SE(\hat\beta_j)\), unstable coefficient signs and magnitudes, and difficulty attributing effects to individual predictors, even when overall fit and predictions stay reasonable.


# Synthetic collinearity example
set.seed(2)
n <- 60
x1 <- rnorm(n)
x2 <- 0.95*x1 + rnorm(n, sd=0.1)  # highly collinear
y  <- 3 + 2*x1 - 2*x2 + rnorm(n, sd=1)
fit_col <- lm(y ~ x1 + x2)
summary(fit_col)  # large SEs, unstable signs

Video: Diagnosing Multicollinearity (VIF & friends)

Variance Inflation Factor (VIF) for predictor \(x_j\):

\[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \]

where \(R_j^2\) is from regressing \(x_j\) on all other predictors. Rules of thumb: VIF > 5 suggests moderate collinearity, VIF > 10 high.


# Using regclass or car for VIF
# install.packages("regclass")  # if needed
library(regclass)
VIF(fit_col)

# Also look at correlation matrix and condition indices
cor_m <- cor(cbind(x1,x2))
e <- eigen(t(model.matrix(fit_col)[,-1]) %*% model.matrix(fit_col)[,-1])
cond_index <- sqrt(max(e$values) / e$values)
cond_index
Interpretation
High VIF means \(x_j\) is predictable from other X’s, inflating var\((\hat\beta_j)\). Condition indices > 30 often flag severe collinearity across linear combinations.
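
VIF can also be rebuilt by hand from its definition (a sketch for x2 in the synthetic example above):

r2_x2 <- summary(lm(x2 ~ x1))$r.squared  # regress x2 on the other predictor(s)
1 / (1 - r2_x2)                          # matches the VIF reported for x2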

Video: The Problem of Multicollinearity — Solutions & R Implementation


# Centering shifts variables but does not change correlations, so it
# leaves VIF for raw correlated predictors essentially unchanged; it
# mainly helps when collinearity comes from polynomial/interaction terms.
df <- data.frame(y, x1, x2)
df$x1_c <- scale(df$x1, center=TRUE, scale=FALSE)
df$x2_c <- scale(df$x2, center=TRUE, scale=FALSE)
fit_c <- lm(y ~ x1_c + x2_c, data=df)
VIF(fit_c)   # compare with VIF(fit_col): little change here

# Dropping a redundant predictor (when theory permits)
fit_drop <- lm(y ~ x1, data=df)
# Compare MSPE via CV before committing

Practice — Quick Lab

  1. Compute AIC, BIC, adj. \(R^2\) for 3–5 candidate models on a training split; pick the top two by each metric.
  2. Estimate test MSPE via 5-fold CV for those finalists; select the best generalizer.
  3. Compute VIFs for the chosen model; if any > 10, try centering, dropping, or combining variables; re-check MSPE.


📘 One-Page Exam-Day Cheatsheet

Core Linear Model
\(\mathbf y = X\beta+\varepsilon\), \(\varepsilon\sim\mathcal N_n(\mathbf 0,\sigma^2 I_n)\); \(\hat\beta=(X^\top X)^{-1}X^\top\mathbf y\); \(\mathrm{Var}(\hat\beta)=\sigma^2(X^\top X)^{-1}\).

Decomposition & Fit
TSS = ESS + RSS; \(R^2 = 1 - RSS/TSS\); \(R^2_{adj} = 1-(1-R^2)\frac{n-1}{n-p}\); \(\hat\sigma^2 = RSS/(n-p)\).

Inference & Testing
\(t_j=\hat\beta_j/SE(\hat\beta_j)\) with \(n-p\) df; nested-model F-test for joint hypotheses; CI: \(\hat\beta_j \pm t_{n-p,1-\alpha/2}\,SE(\hat\beta_j)\).

Diagnostics
Assumption       | Check                          | Fix
Linearity        | Residual vs Fitted (curvature) | Add polynomial/interaction
Homoskedasticity | Fan shape in Scale–Location    | Log-transform Y or robust SE
Normality        | Q–Q plot tails                 | Transform Y (log/sqrt)
Independence     | Sequential residuals / DW test | Use AR/GLS/time terms

Prediction
Point: \(\hat y_0=x_0^\top\hat\beta\); CI targets the mean response at \(x_0\); PI targets a new observation and is always wider.

Model Selection
Prefer lower AIC/BIC (BIC is stricter), higher adjusted \(R^2\), and smallest cross-validated MSPE.

🧩 Practice Prompts & Expected Moves

Prompt 1 — Interpret Regression Output


Coefficients:
(Intercept)   wt      hp  
   37.23     -3.88   -0.03
---
Residual SE: 2.59  R² = 0.826,  Adj R² = 0.815

Prompt 2 — Diagnostic Identification


Residuals vs Fitted → U-shape pattern
Q–Q Plot → Normal
Scale–Location → constant spread

Prompt 3 — Compute Confidence Interval

\(\hat\beta_1 = 1.4, SE=0.5, n=40, k=3, \alpha=0.05\)
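
Expected moves (a sketch, assuming k = 3 counts the slopes, so df = n - k - 1 = 36):

qt(0.975, df = 36)                     # ≈ 2.03
1.4 + c(-1, 1) * qt(0.975, 36) * 0.5   # ≈ (0.39, 2.41)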

Prompt 4 — Choose Best Model


Model  |  Adj R²  |  AIC  |  BIC
-------------------------------
M1     |  0.78    | 120.5 | 128.3
M2     |  0.80    | 119.8 | 131.1
M3     |  0.79    | 121.1 | 127.4

Prompt 5 — Detect Multicollinearity


Predictors: x1, x2, x3
VIFs: 2.1, 12.5, 3.4

Prompt 6 — Forecast and Uncertainty


predict(fit, newdata=data.frame(wt=3.0, hp=120),
        interval="prediction", level=0.95)

Prompt 7 — Explain Assumption Impact

Prompt 8 — Bonus: Prediction vs Explanation

Quick Formula Recall

β̂ = (X'X)⁻¹X'y
Var(β̂) = σ²(X'X)⁻¹
CI: β̂ ± t * SE(β̂)
F = ((RSSr - RSSf)/q) / (RSSf/(n-k))
VIFj = 1/(1-Rj²)
AIC = n ln(RSS/n) + 2k
BIC = n ln(RSS/n) + k ln n
MSPE = mean((y_test - ŷ_test)²)

🧩 Interpretation & Visual Diagnostics — Mental Images for Exam Recall

Each section below shows how the diagnostic plot should look when assumptions hold (✅ Correct) and when they are violated (🚫 Violation). Mentally picture these patterns — that’s usually enough to identify issues in an exam setting.

Linearity

Plot: Residuals vs Fitted

✅ Correct (linear relationship captured)
resid
  |     • •  • • •
  |  • • • • • • •
  | • • • • • • • •
  +----------------------→ fitted
           evenly scattered

🚫 Violation (nonlinear pattern — model missing curvature)
resid
  |      •
  |   •     •
  | •         •
  |   •     •
  +----------------------→ fitted
         "U" or "∩" shape
  

Fix: add polynomial term (I(x^2)), interaction, or transform predictor (log(x)).

Independence (Autocorrelation)

Plot: Residuals vs Observation Order / Time

✅ Correct (independent residuals)
resid
  3 | •   • • •     •   •
  0 |   •     • • •   •
 -3 | •     •     •
     +--------------------→ observation index
        random scatter

🚫 Violation (autocorrelation — residuals follow a pattern)
resid
  3 |        /‾‾‾\__/‾‾‾\__
  0 |_______/___________\____ index →
 -3 |
     cyclical or trending
  

Fix: include lagged predictors, use GLS/AR(1), or add time trend.

Homoskedasticity (Constant Variance)

Plot: Scale–Location or Residuals vs Fitted

✅ Correct (equal variance)
resid
  |     •    •   •
  |  •    •    •
  | •   •   •    •
  +----------------→ fitted
       uniform spread

🚫 Violation (heteroskedastic — "funnel" shape)
resid
  |    •
  |   •  •
  |  •    •
  | •       •
  +----------------→ fitted
     variance increases with fit
  

Fix: log/√ transform Y, use weighted LS, or robust SEs (vcovHC()).

Normality of Errors

Plot: Q–Q Plot of residuals

✅ Correct (normally distributed residuals)
Quantiles →
  | •
  |   •
  |     •
  |       •
  |         •
  +----------------
       perfect 45° line

🚫 Violation (non-normal)
   • S-shape → skewness  
   • tails above/below line → heavy tails
Quantiles →
  | •      (left tail)
  |   • •
  |     •
  |       • • (right tail off-line)
  +----------------
  

Fix: transform Y, remove severe outliers, or rely on large n (CLT robustness).

Influence & Outliers

Plots: Cook’s Distance / Leverage vs Residuals

✅ Correct (no influential points)
Cook’s D
  |     • • • • •
  | • • • • • • •
  +----------------
        obs index →

🚫 Violation (outlier with high influence)
Cook’s D
  |           •   ← high D (>1)
  |  • • • • •
  +----------------
        obs index →
  

Fix: verify data, try fit without point, report sensitivity check.

Collinearity / Multicollinearity

Numeric and conceptual pattern

✅ Correct
VIFs ~ 1–3
β̂ signs stable
SEs reasonable

🚫 Violation
VIFs > 10
β̂ signs flip with minor data changes
Huge SE, nonsignificant t-tests despite high R²
  

Fix: center or standardize, drop/merge predictors, or use ridge/LASSO.

Model Fit Overview

Pattern → Symptom → Fix. If you see curvature, heteroskedasticity, or correlation, state the violation and propose the repair directly.

Quick Reference Table
Assumption       | “Bad” Visual Cue            | Repair Strategy
Linearity        | U-shape residuals           | Add x² / transform X
Independence     | Sequential wave pattern     | Add time terms / AR model
Homoskedasticity | Funnel pattern              | Log Y / robust SE
Normality        | Q–Q curve or tail deviation | Transform Y / ignore if n large
Influence        | Large Cook’s D spike        | Investigate / re-fit w/o obs
Collinearity     | High VIF / unstable β̂      | Center, drop, combine vars
⏱️ 2-Minute Final Review — What to Recall Instantly
Topic | Recall / Action
OLS formula | \(\hat\beta = (X'X)^{-1}X'Y\); unbiased if assumptions hold.
Key decomposition | TSS = ESS + RSS → \(R^2 = 1 - RSS/TSS\).
t vs F test | t → one coefficient; F → multiple coefficients. Reject if p < 0.05.
CI & PI | CI(mean): ±tσ√h₀; PI(new): ±tσ√(1+h₀).
Diagnostics | Linearity (U-shape) → add x²; funnel → log Y; Q–Q tail → transform; waves → AR(1).
Influence | Cook’s D > 1 → outlier; verify & re-fit.
Collinearity | VIF > 10 = unstable β̂; center or drop a predictor.
Model selection | AIC ↓ better (prediction), BIC ↓ stricter (parsimony), adj R² ↑ better.
Prediction vs Explanation | Prediction → minimize MSPE; Explanation → interpret β & control confounders.
Exam mindset | 1️⃣ Identify pattern → 2️⃣ Name assumption → 3️⃣ Apply fix → 4️⃣ Justify with equation or plot cue.

💡 *If nothing else: know the four OLS assumptions, be able to explain R², interpret lm() output, and spot which plot corresponds to which violation.*

Some of this guide was generated by AI; please double-check all work for accuracy.