📘 Modern Regression Analysis in R
Final Exam Study Guide

Scope: Modules 1–6 • Focus: concepts, formulas, R patterns, diagnostics, selection • Source: your course autograded & peer-reviewed PDFs.


Module 1 — Introduction to Statistical Models


Video: What Is a Statistical Model?

A statistical model posits how outcomes are generated from predictors and noise. It specifies a family of distributions for \(Y\) given \(X\) and unknown parameters \(\theta\). For linear regression, \(\theta=(\beta_0,\beta_1,\dots,\beta_p,\sigma^2)\).

Key idea: Separate signal (systematic mean structure \(E[Y\mid X]\)) from noise (random variation around the mean).

Notation: \(Y\mid X\sim \mathcal{F}(g(X;\beta),\sigma^2)\), with mean function \(g(X;\beta)=X\beta\).

Exam tip
Be able to articulate: (i) what is being assumed about the data-generating process, (ii) which parts are estimable, and (iii) which assumptions are testable with residuals.

Video: The Linear Regression Model (SLR → MLR)

Simple linear regression (one predictor):

\[ Y_i=\beta_0+\beta_1 x_i+\varepsilon_i,\quad \varepsilon_i\overset{iid}{\sim}\mathcal N(0,\sigma^2). \]

Multiple linear regression (p predictors, incl. intercept):

\[ \mathbf y=X\beta+\varepsilon,\quad \varepsilon\sim \mathcal N_n(\mathbf 0,\sigma^2 I_n). \]

Geometry: \(\hat{\mathbf y}=H\mathbf y\) where \(H=X(X^\top X)^{-1}X^\top\) is the hat matrix (projection onto the column space of \(X\)).
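
A quick numerical check (on simulated data; x1, x2, y are made up here) that \(H\) is a symmetric, idempotent projection that reproduces the fitted values:

set.seed(7)
x1 <- rnorm(30); x2 <- rnorm(30)
y  <- 1 + 2*x1 - x2 + rnorm(30)
X  <- cbind(1, x1, x2)
H  <- X %*% solve(t(X) %*% X) %*% t(X)
max(abs(H %*% H - H))                        # ~ 0: idempotent (projecting twice changes nothing)
max(abs(H - t(H)))                           # ~ 0: symmetric
max(abs(H %*% y - fitted(lm(y ~ x1 + x2))))  # ~ 0: Hy equals the fitted values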

# First pass in R
m <- lm(y ~ x1 + x2, data=df)
summary(m)
coef(m); fitted(m); resid(m)

Video: Variables, Design Matrix, and Coding

Design matrix \(X\): column of 1’s (intercept) + predictor columns. For factors, use dummy (treatment) coding by default: if group has levels A (baseline), B, C, then model includes groupB, groupC with A absorbed in intercept.

Centering/Scaling: \(x^{*}=(x-\bar x)/s_x\) improves conditioning of \(X^\top X\) and interpretation.

# Inspect the design
X <- model.matrix(~ group + z, data=df)  # dummy coding for 'group'
head(X)

# Center and scale a numeric predictor
df$z_c <- scale(df$z, center=TRUE, scale=FALSE)
df$z_z <- scale(df$z)                     # center + scale

Video: Model Assumptions & When They Matter

Impact: By the Gauss–Markov theorem (no normality needed), OLS is BLUE (best linear unbiased estimator) under linearity, zero-mean uncorrelated errors, and constant variance. Normality is needed only for exact t/F inference in small samples.
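
A small Monte Carlo sketch of this point: with mean-zero but skewed (non-normal) errors, the OLS slope estimates still center on the truth (the numbers here are invented for illustration):

set.seed(99)
est <- replicate(2000, {
  x <- rnorm(50)
  y <- 0.5 + 1.5*x + (rexp(50) - 1)   # mean-zero, right-skewed errors
  coef(lm(y ~ x))[2]
})
mean(est)   # ~ 1.5: unbiased despite non-normality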

# Basic diagnostic plots (preview; detailed in Module 5)
m <- lm(y ~ x, data=df)
plot(m)  # 1: Resid vs Fitted, 2: QQ, 3: Scale-Location, 4: Resid vs Leverage (use which=4 for Cook's distance)

Video: Fitted Values, Residuals, and Identities

Let \(\hat{\mathbf y}=X\hat\beta\), \(\mathbf e=\mathbf y-\hat{\mathbf y}\). Core identities in linear models with an intercept: \(\sum_i e_i=0\), \(X^\top\mathbf e=\mathbf 0\) (residuals are orthogonal to every column of \(X\)), and \(\overline{\hat y}=\bar y\).

m <- lm(y ~ x, data=df)
e <- resid(m); yhat <- fitted(m)
sum(e)                        # ~ 0
sum(model.matrix(m)[,2] * e)  # ~ 0 for the x column in SLR
mean(yhat) - mean(df$y)       # ~ 0

Video: Simulation to Build Intuition

Simulate data with known \(\beta\) and \(\sigma\) to see how noise affects estimates and scatter.

set.seed(1)
n <- 200
x <- rnorm(n, 0, 1)
y <- 2 + 3*x + rnorm(n, 0, 2)   # β0=2, β1=3, σ=2
fit <- lm(y ~ x)
coef(fit); summary(fit)$sigma

Takeaway: Higher \(\sigma\) widens residual spread and reduces precision (larger SEs), but OLS remains unbiased under the assumptions.
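
To see the precision effect directly, compare the slope's standard error at two noise levels (an ad hoc helper, written just for illustration):

set.seed(2)
se_for_sigma <- function(sigma, n = 200) {
  x <- rnorm(n)
  y <- 2 + 3*x + rnorm(n, 0, sigma)
  coef(summary(lm(y ~ x)))["x", "Std. Error"]
}
c(sigma_1 = se_for_sigma(1), sigma_4 = se_for_sigma(4))  # SE grows roughly in proportion to σ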

Video: From SLR to MLR — Interpretation & Confounding

Partial effects: In MLR, \(\beta_j\) is the adjusted effect of \(x_j\) holding other predictors constant.

Confounding: Omitting relevant predictors that correlate with both \(x_j\) and \(y\) biases \(\hat\beta_j\). Include key controls or use study design to block confounding.

m_slr <- lm(aprice ~ pages, data=books)
m_mlr <- lm(aprice ~ lprice + pages + weight, data=books)
coef(m_slr)["pages"]; coef(m_mlr)["pages"]  # note change with adjustment
Pitfall
“Sign flips” are common when moving SLR → MLR due to correlation among predictors. Always interpret \(\beta_j\) conditionally.
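
A minimal omitted-variable simulation (all variables hypothetical) makes the bias concrete: a confounder u drives both x and y, so omitting it distorts the slope on x.

set.seed(3)
u <- rnorm(500)                 # unobserved confounder
x <- 0.8*u + rnorm(500)
y <- 1 + 2*x + 3*u + rnorm(500)
coef(lm(y ~ x))["x"]       # biased well above the true value 2
coef(lm(y ~ x + u))["x"]   # ~ 2 once u is adjusted for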

Video: First Modeling Workflow in R

  1. EDA: visualize \(y\) vs predictors; check types and missingness.
  2. Specify: start with a plausible linear form; consider transforms if needed.
  3. Fit: lm() and inspect summary() (coefficients, SEs, \(R^2\), residual σ).
  4. Diagnose: residual plots; consider fix (transform/robust) if assumptions fail.
  5. Communicate: effect sizes with CIs; practical significance, not only p-values.
df <- na.omit(df)
pairs(~ y + x1 + x2, data=df)
m   <- lm(y ~ x1 + x2, data=df)
summary(m)
plot(m)  # quick diagnostics
predict(m, newdata=data.frame(x1=0, x2=1), interval="confidence")


Module 2 — Linear Regression Parameter Estimation


Video: Introduction to Least Squares

The method of least squares finds the line that minimizes the total squared deviation between observed and fitted values. The normal equations are obtained by differentiating the residual sum of squares (RSS) with respect to parameters and setting derivatives to zero.


# Example: relationship between hours studied and exam score
set.seed(42)
hours <- runif(30, 0, 10)
score <- 50 + 4*hours + rnorm(30, 0, 5)
m <- lm(score ~ hours)
summary(m)

Call:
lm(formula = score ~ hours)

Coefficients:
(Intercept)    49.82  
hours           4.05  

Residual standard error: 4.63 on 28 degrees of freedom
Multiple R-squared: 0.87, Adjusted R-squared: 0.86

Interpretation: each additional hour studied increases the predicted score by about 4 points; roughly 87% of the variation in score is explained by study time.
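
To see that lm() is literally minimizing RSS, the same coefficients can be recovered with a general-purpose optimizer (a sketch reusing hours and score from above):

rss <- function(b) sum((score - (b[1] + b[2]*hours))^2)
optim(c(0, 0), rss)$par   # ≈ coef(m): least squares as explicit minimization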

Video: Linear Algebra for Least Squares

For multiple predictors, we use the vector-matrix form:

\[ \hat{\beta} = (X^\top X)^{-1} X^\top y \]

This comes from solving \(X^\top X \hat\beta = X^\top y\).

Example: Suppose \(X\) includes an intercept and two predictors:


X <- model.matrix(~ hours + iq)
head(X)
Exam Tip

Be ready to derive \((X^\top X)^{-1} X^\top y\) from first principles and interpret what happens if \(X^\top X\) is singular (→ multicollinearity).
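
A sketch of that computation, reusing the hours/score simulation from above: build \(X\), solve the normal equations, and compare with lm().

X <- cbind(1, hours)
solve(t(X) %*% X, t(X) %*% score)  # matches coef(lm(score ~ hours))
# If the columns of X were linearly dependent, t(X) %*% X would be
# singular and solve() would fail: the multicollinearity case discussed later.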

Video: Deriving the Least Squares Solution

Start with RSS \(= (y - X\beta)^\top (y - X\beta)\).

Differentiating and setting to zero:

\[ \frac{\partial \text{RSS}}{\partial \beta} = -2X^\top (y - X\beta) = 0 \] \[ \Rightarrow X^\top X\hat\beta = X^\top y \Rightarrow \hat\beta = (X^\top X)^{-1} X^\top y \]

Interpretation: this ensures residuals are orthogonal to fitted values and predictors (no systematic pattern remains).
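
That orthogonality is easy to verify numerically (continuing the hours/score example):

m <- lm(score ~ hours)
crossprod(model.matrix(m), resid(m))   # X'e ≈ 0, entry by entry
sum(fitted(m) * resid(m))              # fitted values orthogonal to residuals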

Video: Regression Modeling in R — A First Pass


data(mtcars)
fit <- lm(mpg ~ wt + hp, data=mtcars)
summary(fit)

Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Coefficients:
(Intercept)   37.23  
wt            -3.88  
hp            -0.03  

Residual standard error: 2.593 on 29 DF
Multiple R-squared: 0.827, Adjusted R-squared: 0.815
F-statistic: 69.2 on 2 and 29 DF, p-value: 9.11e-12

Interpretation: controlling for horsepower, each 1000 lb increase in weight reduces mpg by 3.9. Both coefficients are negative, meaning heavier and more powerful cars are less fuel-efficient.

Video: Gauss–Markov and Maximum Likelihood

Variance of coefficients: \(\mathrm{Var}(\hat\beta) = \sigma^2 (X^\top X)^{-1}\); in practice we plug in \(\hat\sigma^2\) for the unknown \(\sigma^2\).


coef(summary(fit))
vcov(fit)
Exam Tip

Be able to prove why \(E[\hat\beta] = \beta\) (unbiasedness) and derive \(\mathrm{Var}(\hat\beta)\).
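
A quick numerical sanity check of the variance formula against R's own vcov() (using the mtcars fit from above):

X  <- model.matrix(fit)
s2 <- summary(fit)$sigma^2
max(abs(s2 * solve(t(X) %*% X) - vcov(fit)))   # ~ 0: identical matrices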

Video: Sums of Squares and Estimating Error Variance

Decompose total variation:

\[ TSS = ESS + RSS \]

Estimate of residual variance:

\[ \hat\sigma^2 = \frac{RSS}{n - p} \]
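
In R this is one line from the residuals (using the mtcars fit):

sum(resid(fit)^2) / df.residual(fit)   # equals summary(fit)$sigma^2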

anova(fit)

Analysis of Variance Table

Response: mpg
          Df  Sum Sq Mean Sq F value    Pr(>F)    
wt         1  847.73  847.73  126.04 4.488e-12 ***
hp         1   83.27   83.27   12.38  0.001451 ** 
Residuals 29  195.05    6.73

Interpretation: “Mean Sq = Sum Sq / Df.” F = (Mean Sq Model)/(Mean Sq Residual). A large F → model explains significant variance.

Video: The Coefficient of Determination \(R^2\)

\(R^2 = 1 - \frac{RSS}{TSS}\); measures proportion of total variation explained by model.

Adjusted \(R^2\): penalizes extra predictors:

\[ R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p} \]

summary(fit)$r.squared
summary(fit)$adj.r.squared

For the mtcars example, \(R^2 = 0.827\), \(R^2_{adj} = 0.815\): high explanatory power, modest penalty for two predictors.
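
Both quantities can be rebuilt from first principles (a quick sketch with the same fit):

rss <- sum(resid(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2  <- 1 - rss/tss                    # matches summary(fit)$r.squared
n <- nrow(mtcars); p <- length(coef(fit))
1 - (1 - r2) * (n - 1) / (n - p)      # matches summary(fit)$adj.r.squared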

Video: The Problem of Non-identifiability

When predictors are perfectly collinear (e.g., one variable is a linear combination of others), \(X^\top X\) is singular and OLS cannot be computed.


# Example: add perfect collinearity
mtcars$wt2 <- mtcars$wt * 2
lm(mpg ~ wt + wt2, data=mtcars)

Coefficients: (1 not defined because of singularities)
(Intercept)      wt     wt2
     37.285  -5.344      NA

Interpretation: R detects the exact linear dependence (rank deficiency), drops the redundant column, and reports NA for its coefficient.

Video: Regression Modeling in R — A Second Pass

  1. Verify assumptions visually: residual plots, QQ-plots.
  2. Inspect coefficients: sign, magnitude, statistical significance.
  3. Assess fit: \(R^2\), adjusted \(R^2\), and residual SE = \(\sqrt{\text{RSS}/(n-p)}\) (what summary() reports as the residual standard error).
  4. Compute confidence intervals: confint(fit).

confint(fit)
plot(fit$fitted.values, resid(fit),
     xlab="Fitted", ylab="Residuals")
abline(h=0, col="red")
Exam Tip

Be comfortable explaining each column in summary(lm()) output and identifying which quantity corresponds to standard errors, t-statistics, and p-values.


Module 3 — Inference in Linear Regression


Video: Motivating Statistical Inference in the Regression Context

OLS gives point estimates; inference quantifies their uncertainty. We use variability in \(\hat\beta\) across samples to infer properties of the population relationship.

\[ \hat\beta_j \sim \mathcal{N}\left(\beta_j, \, \sigma^2 [ (X^\top X)^{-1} ]_{jj} \right) \]

The standard error of each coefficient estimates its sampling variability: \[ SE(\hat\beta_j) = \hat\sigma \sqrt{[(X^\top X)^{-1}]_{jj}} \]


fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)$coefficients

              Estimate Std. Error t value  Pr(>|t|)
(Intercept) 37.22727   1.59879   23.29   < 2e-16 ***
wt          -3.87783   0.63273   -6.13   1.12e-06 ***
hp          -0.03177   0.00903   -3.52   0.00145 **

Interpretation: Each row reports an estimate, its SE, t-statistic, and p-value. Both predictors are significant at α = 0.05.
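
The Std. Error column is just the square root of the diagonal of \(\hat\sigma^2(X^\top X)^{-1}\):

sqrt(diag(vcov(fit)))   # reproduces the Std. Error column above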

Video: The Sampling Distribution of the Least Squares Estimator

Given assumptions \(E[\varepsilon]=0\) and \(\mathrm{Var}(\varepsilon)=\sigma^2I\):

\[ E[\hat\beta] = \beta, \quad \mathrm{Var}(\hat\beta) = \sigma^2 (X^\top X)^{-1}. \]

Thus, under Normality:

\[ \frac{\hat\beta_j - \beta_j}{SE(\hat\beta_j)} \sim t_{n-p}. \]

These t-distributions underpin coefficient significance tests and confidence intervals.


summary(fit)$sigma      # residual SD (σ̂)
vcov(fit)               # variance-covariance matrix of β̂

Video: T-Tests for Individual Regression Parameters

Null hypothesis: \(H_0:\beta_j=0\)

Test statistic: \(t_j = \dfrac{\hat\beta_j - 0}{SE(\hat\beta_j)}\), df = n–p.

Compare |t| to t-critical (e.g., t₀.₉₇₅, 28 ≈ 2.048) or use p-values.
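
Both routes can be reproduced directly in R (using the mtcars fit; qt() below checks the quoted critical value):

tstat <- coef(summary(fit))["wt", "t value"]
2 * pt(-abs(tstat), df = df.residual(fit))  # reproduces Pr(>|t|) for wt
qt(0.975, df = 28)                          # the critical value cited above (≈ 2.048)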


coef(summary(fit))
confint(fit, level = 0.95)

                 2.5 %    97.5 %
(Intercept)  33.957    40.497
wt           -5.172    -2.584
hp           -0.050    -0.013

Interpretation: We are 95% confident that each additional 1000 lb weight reduces mpg by 2.6 – 5.2 mpg.

Exam Tip

Be able to explain that the t-test checks whether a coefficient differs significantly from zero, while its CI carries the same information in interval form: the 95% CI excludes 0 exactly when the test rejects at α = 0.05.

Video: T-Tests in R — Line-by-Line Output Interpretation

Use summary(lm()) output to interpret model results. Each column:

Column     | Meaning
Estimate   | Point estimate of the coefficient.
Std. Error | Estimated standard deviation of its sampling distribution.
t value    | Ratio Estimate/SE: how many SEs the estimate sits from 0.
Pr(>|t|)   | Two-sided p-value for testing \(H_0:\beta_j=0\).

Residual standard error: 2.593 on 29 DF
Multiple R-squared: 0.827
Adjusted R-squared: 0.815
F-statistic: 69.2 on 2 and 29 DF, p-value: 9.11e-12

Residual SE: typical deviation of actual mpg from fitted line. F-statistic: overall model significance — at least one β ≠ 0.

Video: Motivating the F-Test — Multiple Statistical Comparisons

When testing several coefficients simultaneously, individual t-tests inflate type I error. The F-test jointly evaluates \(H_0:\beta_1=\beta_2=\dots=\beta_k=0\).

\[ F = \frac{(RSS_{reduced} - RSS_{full})/q}{RSS_{full}/(n-p)} \]

where q = number of restrictions (coefficients tested jointly).


# Test that both wt and hp = 0
fit0 <- lm(mpg ~ 1, data=mtcars)
anova(fit0, fit)

Model 1: mpg ~ 1
Model 2: mpg ~ wt + hp
  Res.Df    RSS Df Sum of Sq     F    Pr(>F)
1     31 1126.0                              
2     29  195.0  2     931.0 69.21 9.109e-12 ***

Interpretation: The reduction in RSS is highly significant; model explains mpg well.
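
The F statistic in this comparison can be rebuilt by hand from the two residual sums of squares:

rss0 <- sum(resid(fit0)^2)   # intercept-only model
rss1 <- sum(resid(fit)^2)    # full model
Fstat <- ((rss0 - rss1) / 2) / (rss1 / df.residual(fit))
Fstat                                               # ≈ 69.2
pf(Fstat, 2, df.residual(fit), lower.tail = FALSE)  # matches Pr(>F)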

Video: The F-Test in R

Use anova() to compare nested models or inspect the model’s ANOVA decomposition.


anova(fit)

          Df  Sum Sq Mean Sq  F value  Pr(>F)
wt         1   847.7   847.7 126.45  <2e-16 ***
hp         1    83.3    83.3  12.42   0.0014 **
Residuals 29   194.4     6.7

Interpretation: Weight explains most variance; horsepower adds smaller but still significant improvement.

Video: Confidence Intervals in the Regression Context

General form: \[ \hat\beta_j \pm t_{n-p,1-\alpha/2} \cdot SE(\hat\beta_j) \] Use confint() in R to compute them automatically.


confint(fit, level = 0.99)

Interpretation: Wider intervals reflect a higher confidence demand (99% > 95%). Intervals also widen with fewer observations, since both \(SE(\hat\beta_j)\) and the t critical value grow as \(n\) falls.
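
confint() simply applies the formula above; e.g., rebuilding the 95% interval for wt by hand:

est <- coef(fit)["wt"]
se  <- coef(summary(fit))["wt", "Std. Error"]
est + c(-1, 1) * qt(0.975, df.residual(fit)) * se   # matches confint(fit)["wt", ]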

Video: Ethics in Statistical Practice & Communication

Exam Tip

Expect a question asking which assumptions are needed for unbiasedness (Gauss–Markov) vs for valid t/F inference (Normality).


Module 4 — Prediction and Explanation in Linear Regression


Video: Differentiating Prediction and Explanation

Explanation asks “how does \(y\) change with \(x_j\) (all else equal)?” and prioritizes unbiased, interpretable \(\hat\beta\). Prediction asks “how well can we forecast \(y\) for new \(x\)?” and prioritizes low test error (may use regularization, interactions, etc.).

Exam tip
Be ready to justify why a variable is included for causal control vs included because it improves forecasts.

Video: Point Estimates for Prediction

Given a fitted model \( \hat\beta\), the point prediction at \(x_0\) is \(\hat y_0 = x_0^\top \hat\beta\).


# Example: mtcars — predict mpg from weight and horsepower
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)

newcar <- data.frame(wt = 3.0, hp = 120)  # 3000 lb, 120 hp
p_hat  <- predict(fit, newdata = newcar)  # point prediction
p_hat

     1 
21.78 

Interpretation: For a 3,000 lb / 120 hp car, the predicted fuel economy is about 21.8 mpg (point estimate only — no uncertainty yet).
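
The same number is just the inner product \(x_0^\top\hat\beta\):

x0 <- c(1, 3.0, 120)   # intercept, wt, hp
sum(x0 * coef(fit))    # identical to predict(fit, newcar)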

Video: Interval Estimates for Prediction

Uncertainty comes from parameter estimation and irreducible noise.


# 95% CI for the conditional mean vs 95% PI for a new observation
predict(fit, newcar, interval = "confidence", level = 0.95)
predict(fit, newcar, interval = "prediction", level = 0.95)

# Confidence interval (mean mpg at x0)
     fit   lwr   upr
1  21.78 20.77 22.79

# Prediction interval (new car's mpg at x0)
     fit   lwr   upr
1  21.78 16.38 27.18

Interpretation: We estimate the average mpg near \(x_0\) is about 20.8–22.8, while an individual new car could reasonably fall anywhere between roughly 16.4 and 27.2 mpg.
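
Both interval half-widths can be reconstructed from the leverage of \(x_0\) (a sketch; note the PI adds the irreducible 1 under the square root):

x0 <- c(1, 3.0, 120)
h0 <- drop(t(x0) %*% solve(crossprod(model.matrix(fit))) %*% x0)  # leverage of x0
s  <- summary(fit)$sigma
tq <- qt(0.975, df.residual(fit))
c(CI_half = tq * s * sqrt(h0),       # ≈ 1.0 mpg
  PI_half = tq * s * sqrt(1 + h0))   # ≈ 5.4 mpg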

Video: Making Predictions Using Real Data in R

Split data into train/test; fit on train; evaluate on test. Report RMSE and PI coverage.


set.seed(1)
n   <- nrow(mtcars)
idx <- sample.int(n, size = round(0.7*n))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit_tr <- lm(mpg ~ wt + hp, data = train)

# Test-set point predictions and RMSE
p_test <- predict(fit_tr, newdata = test)
rmse   <- sqrt(mean((test$mpg - p_test)^2))

# Optional: estimate empirical PI coverage at 95%
pi95   <- predict(fit_tr, newdata = test, interval = "prediction", level = 0.95)
covered <- mean(test$mpg >= pi95[,"lwr"] & test$mpg <= pi95[,"upr"])

list(RMSE = rmse, PI95_coverage = covered)

$RMSE
2.87

$PI95_coverage
0.9

Interpretation: Test RMSE ≈ 2.9 mpg. Nine of the ten test points fall inside the nominal 95% PI — acceptable for a sample this small.


Video: When Prediction Goes Wrong


# Extrapolation check — compare new x0 to training ranges
xrng <- sapply(train[c("wt","hp")], range)
newfar <- data.frame(wt = 6.5, hp = 350)  # far beyond train for mtcars
xrng; newfar
predict(fit_tr, newfar, interval="prediction")  # will produce a number, but risk is high

Video: Defining Causality

Regression coefficients measure association, not causation, unless strong identification conditions hold.

Exam tip
Be prepared to state why a regression coefficient cannot be interpreted causally (unobserved confounding, reverse causality, post-treatment bias), and to propose a design/variable set to mitigate the issue.

Practice — Quick Lab

  1. Split mtcars 70/30; fit mpg ~ wt + hp on train; report test RMSE.
  2. Compute 90% and 99% PIs at two realistic points and one extrapolation point; discuss width and risk.
  3. Plot residuals vs fitted; if nonlinearity appears, add an interaction (wt:hp) or a squared term; compare test RMSE.


Module 5 — Regression Diagnostics & Assumption Checking


Video: Diagnosing the Four Assumptions

Each assumption has its own visual and quantitative diagnostic. We'll simulate violations and interpret the diagnostic plots produced by plot.lm().

set.seed(123)
n <- 100

# (a) Violate linearity — true relationship is quadratic
x1 <- runif(n, 0, 10)
y1 <- 2 + 3*x1 - 0.5*x1^2 + rnorm(n, 0, 2)
m1 <- lm(y1 ~ x1)
par(mfrow = c(2,2)); plot(m1); par(mfrow=c(1,1))

Diagnosis: Nonlinearity appears as curvature in “Residuals vs Fitted.” Fitted values systematically under- and over-predict at extremes.



# (b) Violate homoskedasticity — variance increases with x
x2 <- runif(n, 0, 10)
y2 <- 2 + 3*x2 + rnorm(n, 0, sd = 0.5 + 0.5*x2)
m2 <- lm(y2 ~ x2)
par(mfrow = c(2,2)); plot(m2); par(mfrow=c(1,1))

Diagnosis: Funnel shape in the “Residuals vs Fitted” plot. Error variance grows with the predicted value → check the Scale–Location panel of plot.lm() or run a Breusch–Pagan test.
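
For a formal check, the Breusch–Pagan test from the lmtest package (assumed installed) regresses squared residuals on the predictors:

# install.packages("lmtest")  # if needed
library(lmtest)
bptest(m2)   # small p-value: evidence against constant variance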



# (c) Violate independence — residuals correlated
set.seed(101)
x3 <- 1:100
eps <- arima.sim(list(ar = 0.7), n = 100)
y3 <- 5 + 0.4*x3 + eps
m3 <- lm(y3 ~ x3)

# Successive residual plot
plot(m3$residuals[-n], m3$residuals[-1],
     xlab = "Residual[i]", ylab = "Residual[i+1]",
     main = "Successive Residuals Plot")
abline(h=0,v=0,col="red")

Diagnosis: If residuals[i] vs residuals[i+1] form a pattern (e.g., clustering around diagonal), independence is violated. Use Durbin–Watson test for confirmation.
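
The Durbin–Watson test (also in lmtest) formalizes this check:

library(lmtest)   # assumed installed
dwtest(m3)        # DW well below 2 indicates positive autocorrelation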



# (d) Violate normality — heavy-tailed errors
set.seed(202)
x4 <- runif(n, 0, 10)
y4 <- 2 + 3*x4 + rt(n, df=3)  # t-distribution (heavy tails)
m4 <- lm(y4 ~ x4)

qqnorm(m4$residuals, main="Q-Q Plot of Residuals")
qqline(m4$residuals, col="red")
hist(m4$residuals, main="Histogram of Residuals", xlab="Residuals")

Diagnosis: Q–Q plot shows deviations at tails; histogram non-bell shaped. Normality mainly affects inference (CIs, t-tests), less so prediction.
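
A formal complement to the Q–Q plot is the Shapiro–Wilk test on the residuals:

shapiro.test(resid(m4))   # small p-value: reject normality of residuals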

Quick check summary:

Assumption       | Key plot/test        | Symptom
Linearity        | Residuals vs Fitted  | Curved pattern
Independence     | Residuals over index | Sequential pattern
Homoskedasticity | Scale–Location plot  | Fan shape
Normality        | Q–Q plot             | Heavy/light tails, skew

Video: The Hat Matrix and Orthogonality

We define \(H = X(X^\top X)^{-1}X^\top\). It “puts a hat” on \(Y\): \( \hat{Y} = HY \).


# Demonstrate orthogonality
set.seed(10)
X <- cbind(1, rnorm(5))
Y <- rnorm(5)
H <- X %*% solve(t(X) %*% X) %*% t(X)
Y_hat <- H %*% Y
e_hat <- (diag(5) - H) %*% Y
t(Y_hat) %*% e_hat   # should be 0

Interpretation: Residuals contain no information about the fitted component — the model explains everything systematic, leaving only random noise.

Video: Fixing Violated Assumptions

Once violations are diagnosed, we can often fix or mitigate them:


# Example: fix heteroskedasticity in Bollywood data
data <- data.frame(
  Budget = c(50,100,150,200,250),
  Gross  = c(100,250,400,600,900)
)
data$log_Budget <- log(data$Budget)
data$log_Gross  <- log(data$Gross)

m_lin  <- lm(Gross ~ Budget, data)
m_log  <- lm(log_Gross ~ log_Budget, data)

summary(m_log)

# Note: the output below comes from the full Bollywood dataset (n = 190)
# used in the case study later in this module; the five-row frame above
# only illustrates the transformation.

Coefficients:
(Intercept)   log_Budget
   -1.44         1.32 ***
---
Residual SE: 2.56  on 188 df
R² = 0.58

Interpretation: In the log–log model, coefficients represent elasticities: a 1% increase in budget → 1.32% increase in gross revenue. Model is now homoskedastic and interpretable multiplicatively.

Video: Using Residual Plots and Influence Measures

Beyond assumption checks, diagnostics help identify individual outliers or high-influence points.


# Example using mtcars
fit <- lm(mpg ~ wt + hp, data=mtcars)
plot(fit, which=4)   # Cook’s distance
influence.measures(fit)
Exam tip: Be able to match a diagnostic plot pattern to the violated assumption and name a fix (e.g., “fan shape → heteroskedasticity → log-transform Y”).
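
Common rule-of-thumb cutoffs can be scripted directly (the thresholds below are conventions, not hard rules):

h <- hatvalues(fit)
d <- cooks.distance(fit)
which(h > 2 * mean(h))        # leverage above 2p/n
which(d > 4 / nrow(mtcars))   # influential by the 4/n rule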

Video: Model Diagnosis Case Study (Bollywood Example)

Dataset: Bollywood film budgets (in million ₹) vs gross revenue. Diagnose and repair non-constant variance.


bollywood <- read.csv("bollywood.csv")
model_raw <- lm(Gross ~ Budget, data = bollywood)
par(mfrow = c(2,2)); plot(model_raw); par(mfrow=c(1,1))

# Log-transform both variables
bollywood$log_Budget <- log(bollywood$Budget)
bollywood$log_Gross  <- log(bollywood$Gross)
model_log <- lm(log_Gross ~ log_Budget, data = bollywood)
summary(model_log)

Result: \( \log(\text{Gross}) = -1.44 + 1.32 \log(\text{Budget}) \)

Elasticity interpretation: 1% ↑ in budget → 1.32% ↑ in gross (increasing returns to scale). Model variance stabilized.


Practice — Quick Lab

  1. Simulate one violation per assumption and identify which plot detects it fastest.
  2. Compute hatvalues(), rstandard(), and cooks.distance() for your own model.
  3. Re-fit a model using log(y) or sqrt(y); compare residual plots.


Module 6 — Model Selection & Multicollinearity


Video: Motivating Model Selection Methods

Adding predictors always increases \(R^2\) in-sample but can harm test performance. Selection methods trade off fit with complexity to minimize expected prediction error.

Video: Testing-Based Procedures and their Shortfalls

Classical stepwise testing uses partial F-tests (or t-tests) to add/drop predictors. Problems: multiplicity (inflated type I error), order dependence, and data dredging.


# Partial F-test (nested models) on mtcars
full   <- lm(mpg ~ wt + hp + drat + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)
anova(reduced, full)  # H0: added vars (drat, qsec) jointly = 0
Exam pitfall
Don’t report p-values from stepwise as if they were pre-specified. They’re conditional on the search and can be anti-conservative.

Video: Criterion-Based — AIC (Akaike Information Criterion)

AIC estimates out-of-sample Kullback–Leibler discrepancy. For Gaussian LM with RSS and k parameters (\(k=p\) incl. intercept):

\[ \mathrm{AIC} = n\log\!\left(\frac{RSS}{n}\right) + 2k. \]

Choose the model with the lowest AIC. Penalizes complexity with \(2k\) — lighter penalty than BIC.


m1 <- lm(mpg ~ wt + hp, data=mtcars)
m2 <- lm(mpg ~ wt + hp + drat, data=mtcars)
AIC(m1, m2)
# Lower AIC preferred

Video: Criterion-Based — BIC (Bayesian Information Criterion)

BIC is harsher on complexity:

\[ \mathrm{BIC} = n\log\!\left(\frac{RSS}{n}\right) + k\log n. \]

Once \(n \ge 8\), \(\log n > 2\), so BIC penalizes each extra parameter more heavily than AIC and tends to favor more parsimonious models.


BIC(m1, m2)
# Often selects a smaller model than AIC, especially for large n

Video: Criterion-Based — Adjusted \(R^2\)

Adjusted \(R^2\) penalizes extra predictors by degrees of freedom:

\[ R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-k}. \]

Use it comparatively across models fit to the same \(y\) and \(n\). It may peak before the full model.


summary(m1)$adj.r.squared
summary(m2)$adj.r.squared
Rule of thumb
If inference matters, use multiple criteria (e.g., AIC + adj. \(R^2\) + subject-matter sense). If pure prediction matters, prefer MSPE via cross-validation.

Video: Mean Squared Prediction Error (MSPE) & Cross-Validation

Use a holdout set or K-fold CV to estimate test error:


set.seed(1)
n   <- nrow(mtcars)
idx <- sample.int(n, floor(0.7*n))
train <- mtcars[idx, ]; test <- mtcars[-idx, ]

cands <- list(
  mA = lm(mpg ~ wt, data=train),
  mB = lm(mpg ~ wt + hp, data=train),
  mC = lm(mpg ~ wt + hp + drat, data=train)
)

mspe <- sapply(cands, function(fit){
  p <- predict(fit, newdata=test)
  mean((test$mpg - p)^2)
})
sort(mspe)

Interpretation: pick the model with the smallest estimated MSPE. Repeat random splits or use K-fold CV for stability.


# Simple K-fold CV (K=5) for one formula
K <- 5
fold_id <- sample(rep(1:K, length.out=n))
cv_formula <- mpg ~ wt + hp

cv_err <- sapply(1:K, function(k){
  tr <- mtcars[fold_id != k, ]
  te <- mtcars[fold_id == k, ]
  fit <- lm(cv_formula, data=tr)
  mean((te$mpg - predict(fit, te))^2)
})
mean(cv_err); sd(cv_err)

Video: Model Selection in R (stepwise, subsets)

Quick baseline: step() with AIC; exhaustive subsets with leaps::regsubsets().


# AIC stepwise (both directions)
full   <- lm(mpg ~ ., data = mtcars)
null   <- lm(mpg ~ 1, data = mtcars)
step_fit <- step(null, scope = formula(full), direction = "both", trace = 0)
summary(step_fit)

# Exhaustive/forward/backward subsets (by adj. R2, Cp, BIC)
library(leaps)
sub <- regsubsets(mpg ~ ., data = mtcars, nbest=1, nvmax = 6, method="exhaustive")
sum_sub <- summary(sub)
which.max(sum_sub$adjr2)   # size of best model by adj R2
which.min(sum_sub$bic)     # size of best by BIC

Video: The Problem of Collinearity

Collinearity = predictors are strongly linearly related. Consequences: inflated \(SE(\hat\beta_j)\), unstable coefficient signs and magnitudes, and difficulty attributing effects to individual predictors, even when overall fit and predictions stay reasonable.


# Synthetic collinearity example
set.seed(2)
n <- 60
x1 <- rnorm(n)
x2 <- 0.95*x1 + rnorm(n, sd=0.1)  # highly collinear
y  <- 3 + 2*x1 - 2*x2 + rnorm(n, sd=1)
fit_col <- lm(y ~ x1 + x2)
summary(fit_col)  # large SEs, unstable signs

Video: Diagnosing Multicollinearity (VIF & friends)

Variance Inflation Factor (VIF) for predictor \(x_j\):

\[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \]

where \(R_j^2\) is from regressing \(x_j\) on all other predictors. Rules of thumb: VIF > 5 suggests moderate collinearity, VIF > 10 high.


# Using regclass or car for VIF
# install.packages("regclass")  # if needed
library(regclass)
VIF(fit_col)

# Also look at correlation matrix and condition indices
cor_m <- cor(cbind(x1,x2))
e <- eigen(t(model.matrix(fit_col)[,-1]) %*% model.matrix(fit_col)[,-1])
cond_index <- sqrt(max(e$values) / e$values)
cond_index
Interpretation
High VIF means \(x_j\) is predictable from other X’s, inflating var\((\hat\beta_j)\). Condition indices > 30 often flag severe collinearity across linear combinations.
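
VIF can also be rebuilt by hand from its definition (a sketch for x2 in the synthetic example above):

r2_x2 <- summary(lm(x2 ~ x1))$r.squared  # regress x2 on the other predictor(s)
1 / (1 - r2_x2)                          # matches the VIF reported for x2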

Video: The Problem of Multicollinearity — Solutions & R Implementation


# Centering shifts variables but does not change correlations, so it
# leaves VIF for raw correlated predictors essentially unchanged; it
# mainly helps when collinearity comes from polynomial/interaction terms.
df <- data.frame(y, x1, x2)
df$x1_c <- scale(df$x1, center=TRUE, scale=FALSE)
df$x2_c <- scale(df$x2, center=TRUE, scale=FALSE)
fit_c <- lm(y ~ x1_c + x2_c, data=df)
VIF(fit_c)   # compare with VIF(fit_col): little change here

# Dropping a redundant predictor (when theory permits)
fit_drop <- lm(y ~ x1, data=df)
# Compare MSPE via CV before committing

Practice — Quick Lab

  1. Compute AIC, BIC, adj. \(R^2\) for 3–5 candidate models on a training split; pick the top two by each metric.
  2. Estimate test MSPE via 5-fold CV for those finalists; select the best generalizer.
  3. Compute VIFs for the chosen model; if any > 10, try centering, dropping, or combining variables; re-check MSPE.


📘 One-Page Exam-Day Cheatsheet

Core Linear Model
\(\mathbf y = X\beta+\varepsilon\), \(\varepsilon\sim\mathcal N_n(\mathbf 0,\sigma^2 I_n)\); \(\hat\beta=(X^\top X)^{-1}X^\top\mathbf y\); \(\mathrm{Var}(\hat\beta)=\sigma^2(X^\top X)^{-1}\).

Decomposition & Fit
TSS = ESS + RSS; \(R^2 = 1 - RSS/TSS\); \(R^2_{adj} = 1-(1-R^2)\frac{n-1}{n-p}\); \(\hat\sigma^2 = RSS/(n-p)\).

Inference & Testing
\(t_j=\hat\beta_j/SE(\hat\beta_j)\) with \(n-p\) df; nested-model F-test for joint hypotheses; CI: \(\hat\beta_j \pm t_{n-p,1-\alpha/2}\,SE(\hat\beta_j)\).

Diagnostics
Assumption       | Check                          | Fix
Linearity        | Residual vs Fitted (curvature) | Add polynomial/interaction
Homoskedasticity | Fan shape in Scale–Location    | Log-transform Y or robust SE
Normality        | Q–Q plot tails                 | Transform Y (log/sqrt)
Independence     | Sequential residuals / DW test | Use AR/GLS/time terms

Prediction
Point: \(\hat y_0=x_0^\top\hat\beta\); CI targets the mean response at \(x_0\); PI targets a new observation and is always wider.

Model Selection
Prefer lower AIC/BIC (BIC is stricter), higher adjusted \(R^2\), and smallest cross-validated MSPE.

🧩 Practice Prompts & Expected Moves

Prompt 1 — Interpret Regression Output


Coefficients:
(Intercept)   wt      hp  
   37.23     -3.88   -0.03
---
Residual SE: 2.59  R² = 0.826,  Adj R² = 0.815

Prompt 2 — Diagnostic Identification


Residuals vs Fitted → U-shape pattern
Q–Q Plot → Normal
Scale–Location → constant spread

Prompt 3 — Compute Confidence Interval

\(\hat\beta_1 = 1.4, SE=0.5, n=40, k=3, \alpha=0.05\)
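
Expected moves (a sketch, assuming k = 3 counts the slopes, so df = n - k - 1 = 36):

qt(0.975, df = 36)                     # ≈ 2.03
1.4 + c(-1, 1) * qt(0.975, 36) * 0.5   # ≈ (0.39, 2.41)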

Prompt 4 — Choose Best Model


Model  |  Adj R²  |  AIC  |  BIC
-------------------------------
M1     |  0.78    | 120.5 | 128.3
M2     |  0.80    | 119.8 | 131.1
M3     |  0.79    | 121.1 | 127.4

Prompt 5 — Detect Multicollinearity


Predictors: x1, x2, x3
VIFs: 2.1, 12.5, 3.4

Prompt 6 — Forecast and Uncertainty


predict(fit, newdata=data.frame(wt=3.0, hp=120),
        interval="prediction", level=0.95)

Prompt 7 — Explain Assumption Impact

Prompt 8 — Bonus: Prediction vs Explanation

Quick Formula Recall

β̂ = (X'X)⁻¹X'y
Var(β̂) = σ²(X'X)⁻¹
CI: β̂ ± t * SE(β̂)
F = ((RSSr - RSSf)/q) / (RSSf/(n-k))
VIFj = 1/(1-Rj²)
AIC = n ln(RSS/n) + 2k
BIC = n ln(RSS/n) + k ln n
MSPE = mean((y_test - ŷ_test)²)

🧩 Interpretation & Visual Diagnostics — Mental Images for Exam Recall

Each section below shows how the diagnostic plot should look when assumptions hold (✅ Correct) and when they are violated (🚫 Violation). Mentally picture these patterns — that’s usually enough to identify issues in an exam setting.

Linearity

Plot: Residuals vs Fitted

✅ Correct (linear relationship captured)
resid
  |     • •  • • •
  |  • • • • • • •
  | • • • • • • • •
  +----------------------→ fitted
           evenly scattered

🚫 Violation (nonlinear pattern — model missing curvature)
resid
  |      •
  |   •     •
  | •         •
  |   •     •
  +----------------------→ fitted
         "U" or "∩" shape
  

Fix: add polynomial term (I(x^2)), interaction, or transform predictor (log(x)).

Independence (Autocorrelation)

Plot: Residuals vs Observation Order / Time

✅ Correct (independent residuals)
resid
  3 | •   • • •     •   •
  0 |   •     • • •   •
 -3 | •     •     •
     +--------------------→ observation index
        random scatter

🚫 Violation (autocorrelation — residuals follow a pattern)
resid
  3 |        /‾‾‾\__/‾‾‾\__
  0 |_______/___________\____ index →
 -3 |
     cyclical or trending
  

Fix: include lagged predictors, use GLS/AR(1), or add time trend.

Homoskedasticity (Constant Variance)

Plot: Scale–Location or Residuals vs Fitted

✅ Correct (equal variance)
resid
  |     •    •   •
  |  •    •    •
  | •   •   •    •
  +----------------→ fitted
       uniform spread

🚫 Violation (heteroskedastic — "funnel" shape)
resid
  |    •
  |   •  •
  |  •    •
  | •       •
  +----------------→ fitted
     variance increases with fit
  

Fix: log/√ transform Y, use weighted LS, or robust SEs (vcovHC()).

Normality of Errors

Plot: Q–Q Plot of residuals

✅ Correct (normally distributed residuals)
Quantiles →
  | •
  |   •
  |     •
  |       •
  |         •
  +----------------
       perfect 45° line

🚫 Violation (non-normal)
   • S-shape → skewness  
   • tails above/below line → heavy tails
Quantiles →
  | •      (left tail)
  |   • •
  |     •
  |       • • (right tail off-line)
  +----------------
  

Fix: transform Y, remove severe outliers, or rely on large n (CLT robustness).

Influence & Outliers

Plots: Cook’s Distance / Leverage vs Residuals

✅ Correct (no influential points)
Cook’s D
  |     • • • • •
  | • • • • • • •
  +----------------
        obs index →

🚫 Violation (outlier with high influence)
Cook’s D
  |           •   ← high D (>1)
  |  • • • • •
  +----------------
        obs index →
  

Fix: verify data, try fit without point, report sensitivity check.

Collinearity / Multicollinearity

Numeric and conceptual pattern

✅ Correct
VIFs ~ 1–3
β̂ signs stable
SEs reasonable

🚫 Violation
VIFs > 10
β̂ signs flip with minor data changes
Huge SE, nonsignificant t-tests despite high R²
  

Fix: center or standardize, drop/merge predictors, or use ridge/LASSO.

Model Fit Overview

Pattern → Symptom → Fix. If you see curvature, heteroskedasticity, or correlation, state the violation and propose the repair directly.

Quick Reference Table
Assumption       | “Bad” Visual Cue            | Repair Strategy
Linearity        | U-shape residuals           | Add x² / transform X
Independence     | Sequential wave pattern     | Add time terms / AR model
Homoskedasticity | Funnel pattern              | Log Y / robust SE
Normality        | Q–Q curve or tail deviation | Transform Y / ignore if n large
Influence        | Large Cook’s D spike        | Investigate / re-fit w/o obs
Collinearity     | High VIF / unstable β̂      | Center, drop, combine vars
⏱️ 2-Minute Final Review — What to Recall Instantly
Topic | Recall / Action
OLS formula | \(\hat\beta = (X'X)^{-1}X'Y\); unbiased if assumptions hold.
Key decomposition | TSS = ESS + RSS → \(R^2 = 1 - RSS/TSS\).
t vs F test | t → one coefficient; F → multiple coefficients. Reject if p < 0.05.
CI & PI | CI(mean): ±tσ√h₀; PI(new): ±tσ√(1+h₀).
Diagnostics | Linearity (U-shape) → add x²; funnel → log Y; Q–Q tail → transform; waves → AR(1).
Influence | Cook’s D > 1 → outlier; verify & re-fit.
Collinearity | VIF > 10 = unstable β̂; center or drop a predictor.
Model selection | AIC ↓ better (prediction), BIC ↓ stricter (parsimony), adj R² ↑ better.
Prediction vs Explanation | Prediction → minimize MSPE; Explanation → interpret β & control confounders.
Exam mindset | 1️⃣ Identify pattern → 2️⃣ Name assumption → 3️⃣ Apply fix → 4️⃣ Justify with equation or plot cue.

💡 *If nothing else: know the four OLS assumptions, be able to explain R², interpret lm() output, and spot which plot corresponds to which violation.*

Some of this guide was generated by AI; please double-check all work for accuracy.