Scope: Modules 1–6 • Focus: concepts, formulas, R patterns, diagnostics, selection • Source: your course autograded & peer-reviewed PDFs.
A statistical model posits how outcomes are generated from predictors and noise. It specifies a family of distributions for \(Y\) given \(X\) and unknown parameters \(\theta\). For linear regression, \(\theta=(\beta_0,\beta_1,\dots,\beta_p,\sigma^2)\).
Key idea: Separate signal (systematic mean structure \(E[Y\mid X]\)) from noise (random variation around the mean).
Notation: \(Y\mid X\sim \mathcal{F}(g(X;\beta),\sigma^2)\), with mean function \(g(X;\beta)=X\beta\).
Simple linear regression (one predictor):
\[ Y_i=\beta_0+\beta_1 x_i+\varepsilon_i,\quad \varepsilon_i\overset{iid}{\sim}\mathcal N(0,\sigma^2). \]
Multiple linear regression (p predictors, incl. intercept):
\[ \mathbf y=X\beta+\varepsilon,\quad \varepsilon\sim \mathcal N_n(\mathbf 0,\sigma^2 I_n). \]
Geometry: \(\hat{\mathbf y}=H\mathbf y\) where \(H=X(X^\top X)^{-1}X^\top\) is the hat matrix (projection onto the column space of \(X\)).
# First pass in R
m <- lm(y ~ x1 + x2, data=df)
summary(m)
coef(m); fitted(m); resid(m)
Design matrix \(X\): column of 1’s (intercept) + predictor columns. For factors, use dummy (treatment) coding by default: if group has levels A (baseline), B, C, then model includes groupB, groupC with A absorbed in intercept.
Centering/Scaling: \(x^{*}=(x-\bar x)/s_x\) improves conditioning of \(X^\top X\) and interpretation.
# Inspect the design
X <- model.matrix(~ group + z, data=df) # dummy coding for 'group'
head(X)
# Center and scale a numeric predictor
df$z_c <- scale(df$z, center=TRUE, scale=FALSE)
df$z_z <- scale(df$z) # center + scale
Impact: Gauss–Markov (no normality needed) → OLS is BLUE under linearity, independence, constant variance. Normality is needed for exact t/F inference in small samples.
# Basic diagnostic plots (preview; detailed in Module 5)
m <- lm(y ~ x, data=df)
plot(m) # 1: Resid vs Fitted, 2: QQ, 3: Scale-Location, 4: Cook's Distance
Let \(\hat{\mathbf y}=X\hat\beta\), \(\mathbf e=\mathbf y-\hat{\mathbf y}\). Core identities in linear models:
m <- lm(y ~ x, data=df)
e <- resid(m); yhat <- fitted(m)
sum(e) # ~ 0
sum(model.matrix(m)[,2] * e) # ~ 0 for the x column in SLR
mean(yhat) - mean(df$y) # ~ 0
Simulate data with known \(\beta\) and \(\sigma\) to see how noise affects estimates and scatter.
set.seed(1)
n <- 200
x <- rnorm(n, 0, 1)
y <- 2 + 3*x + rnorm(n, 0, 2) # β0=2, β1=3, σ=2
fit <- lm(y ~ x)
coef(fit); summary(fit)$sigma
Takeaway: Higher \(\sigma\) widens residual spread and reduces precision (larger SEs), but OLS remains unbiased under the assumptions.
Partial effects: In MLR, \(\beta_j\) is the adjusted effect of \(x_j\) holding other predictors constant.
Confounding: Omitting relevant predictors that correlate with both \(x_j\) and \(y\) biases \(\hat\beta_j\). Include key controls or use study design to block confounding.
m_slr <- lm(aprice ~ pages, data=books)
m_mlr <- lm(aprice ~ lprice + pages + weight, data=books)
coef(m_slr)["pages"]; coef(m_mlr)["pages"] # note change with adjustment
Fit with lm() and inspect summary() (coefficients, SEs, \(R^2\), residual σ).
df <- na.omit(df)
pairs(~ y + x1 + x2, data=df)
m <- lm(y ~ x1 + x2, data=df)
summary(m)
plot(m) # quick diagnostics
predict(m, newdata=data.frame(x1=0, x2=1), interval="confidence")
The method of least squares finds the line that minimizes the total squared deviation between observed and fitted values. The normal equations are obtained by differentiating the residual sum of squares (RSS) with respect to parameters and setting derivatives to zero.
# Example: relationship between hours studied and exam score
set.seed(42)
hours <- runif(30, 0, 10)
score <- 50 + 4*hours + rnorm(30, 0, 5)
m <- lm(score ~ hours)
summary(m)
Call:
lm(formula = score ~ hours)
Coefficients:
(Intercept) 49.82
hours 4.05
Residual standard error: 4.63 on 28 degrees of freedom
Multiple R-squared: 0.94, Adjusted R-squared: 0.94
Interpretation: each additional hour studied increases the predicted score by about 4 points; 94% of variation in score is explained by study time.
For multiple predictors, we use the vector-matrix form:
\[ \hat{\beta} = (X^\top X)^{-1} X^\top y \]
This comes from solving \(X^\top X \hat\beta = X^\top y\).
Example: Suppose \(X\) includes an intercept and two predictors:
X <- model.matrix(~ hours + iq)
head(X)
Be ready to derive \((X^\top X)^{-1} X^\top y\) from first principles and interpret what happens if \(X^\top X\) is singular (→ multicollinearity).
Start with RSS \(= (y - X\beta)^\top (y - X\beta)\).
Differentiating and setting to zero:
\[ \frac{\partial \text{RSS}}{\partial \beta} = -2X^\top (y - X\beta) = 0 \]
\[ \Rightarrow X^\top X\hat\beta = X^\top y \Rightarrow \hat\beta = (X^\top X)^{-1} X^\top y \]
Interpretation: this ensures residuals are orthogonal to fitted values and predictors (no systematic pattern remains).
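A quick numerical check of the derivation, reusing the simulated hours/score data above: solve the normal equations directly and confirm the residuals are orthogonal to the columns of X.
# Hand-solved normal equations vs lm()
X <- cbind(1, hours)                        # design matrix with intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% score)
beta_hat
coef(lm(score ~ hours))                     # should match
crossprod(X, score - X %*% beta_hat)        # X'(y - Xβ̂) ≈ 0 (orthogonality)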
data(mtcars)
fit <- lm(mpg ~ wt + hp, data=mtcars)
summary(fit)
Call:
lm(formula = mpg ~ wt + hp, data = mtcars)
Coefficients:
(Intercept) 37.23
wt -3.88
hp -0.03
Residual standard error: 2.59 on 29 DF
Multiple R-squared: 0.826, Adjusted R-squared: 0.812
F-statistic: 58.4 on 2 and 29 DF, p-value: 1.15e-10
Interpretation: controlling for horsepower, each 1000 lb increase in weight reduces mpg by 3.9. Both coefficients are negative, meaning heavier and more powerful cars are less fuel-efficient.
Variance of coefficients: \(\mathrm{Var}(\hat\beta) = \hat\sigma^2 (X^\top X)^{-1}\).
coef(summary(fit))
vcov(fit)
Be able to prove why \(E[\hat\beta] = \beta\) (unbiasedness) and derive \(\mathrm{Var}(\hat\beta)\).
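A sketch of that proof, writing \(\mathbf y = X\beta + \varepsilon\) with \(E[\varepsilon]=0\) and \(\mathrm{Var}(\varepsilon)=\sigma^2 I\):
\[ \hat\beta = (X^\top X)^{-1}X^\top \mathbf y = (X^\top X)^{-1}X^\top (X\beta+\varepsilon) = \beta + (X^\top X)^{-1}X^\top\varepsilon, \]
so \(E[\hat\beta]=\beta\) because \(E[\varepsilon]=0\), and
\[ \mathrm{Var}(\hat\beta) = (X^\top X)^{-1}X^\top\,\mathrm{Var}(\varepsilon)\,X(X^\top X)^{-1} = \sigma^2 (X^\top X)^{-1}. \]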
Decompose total variation:
\[ TSS = ESS + RSS \]
Estimate of residual variance:
\[ \hat\sigma^2 = \frac{RSS}{n - p} \]
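A quick numerical check of the decomposition, continuing the mtcars fit from above:
# Verify TSS = ESS + RSS and σ̂² = RSS/(n − p)
y_mpg <- mtcars$mpg
TSS <- sum((y_mpg - mean(y_mpg))^2)
RSS <- sum(resid(fit)^2)
ESS <- sum((fitted(fit) - mean(y_mpg))^2)
c(TSS = TSS, ESS_plus_RSS = ESS + RSS)   # equal when the model has an intercept
RSS / df.residual(fit)                   # equals summary(fit)$sigma^2
summary(fit)$sigma^2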
anova(fit)
Analysis of Variance Table
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
wt 1 847.73 847.73 126.45 < 2e-16 ***
hp 1 83.27 83.27 12.42 0.0014 **
Residuals 29 194.39 6.70
Interpretation: Mean Sq = Sum Sq / Df. Each row's F = (that term's Mean Sq)/(Residual Mean Sq), testing the term sequentially, given the terms listed above it. A large F means the term explains a significant share of variance.
\(R^2 = 1 - \frac{RSS}{TSS}\); measures proportion of total variation explained by model.
Adjusted \(R^2\): penalizes extra predictors:
\[ R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p} \]
summary(fit)$r.squared
summary(fit)$adj.r.squared
For mtcars example, \(R^2 = 0.826\), \(R^2_{adj} = 0.812\): high explanatory power, modest penalty for two predictors.
When predictors are perfectly collinear (e.g., one variable is a linear combination of others), \(X^\top X\) is singular and OLS cannot be computed.
# Example: add perfect collinearity
mtcars$wt2 <- mtcars$wt * 2
lm(mpg ~ wt + wt2, data=mtcars)
Coefficients: (1 not defined because of singularities)
(Intercept) 37.29
wt -5.34
wt2 NA
Interpretation: R detects multicollinearity and omits redundant columns.
Compute confidence intervals for the coefficients with confint(fit).
confint(fit)
plot(fit$fitted.values, resid(fit),
xlab="Fitted", ylab="Residuals")
abline(h=0, col="red")
Be comfortable explaining each column in summary(lm()) output and identifying which quantity corresponds to standard errors, t-statistics, and p-values.
OLS gives point estimates; inference quantifies their uncertainty. We use variability in \(\hat\beta\) across samples to infer properties of the population relationship.
\[ \hat\beta_j \sim \mathcal{N}\left(\beta_j, \, \sigma^2 [ (X^\top X)^{-1} ]_{jj} \right) \]
The standard error of each coefficient estimates its sampling variability:
\[ SE(\hat\beta_j) = \hat\sigma \sqrt{[(X^\top X)^{-1}]_{jj}} \]
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.22727 1.59879 23.29 < 2e-16 ***
wt -3.87783 0.63273 -6.13 1.12e-06 ***
hp -0.03177 0.00903 -3.52 0.00145 **
Interpretation: Each row reports an estimate, its SE, t-statistic, and p-value. Both predictors are significant at α = 0.05.
Given assumptions \(E[\varepsilon]=0\) and \(\mathrm{Var}(\varepsilon)=\sigma^2I\):
\[ E[\hat\beta] = \beta, \quad \mathrm{Var}(\hat\beta) = \sigma^2 (X^\top X)^{-1}. \]
Thus, under Normality,
\[ \frac{\hat\beta_j - \beta_j}{SE(\hat\beta_j)} \sim t_{n-p}. \]
These t-distributions underpin coefficient significance tests and confidence intervals.
summary(fit)$sigma # residual SD (σ̂)
vcov(fit) # variance-covariance matrix of β̂
Null hypothesis: \(H_0:\beta_j=0\)
Test statistic: \(t_j = \dfrac{\hat\beta_j - 0}{SE(\hat\beta_j)}\), df = n–p.
Compare |t| to t-critical (e.g., t₀.₉₇₅, 28 ≈ 2.048) or use p-values.
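These quantities can be reproduced by hand from the coefficient and its SE (continuing the mtcars fit):
# Manual t-test for wt: t = estimate/SE, two-sided p-value with df = n − p
est   <- coef(fit)["wt"]
se_wt <- sqrt(diag(vcov(fit)))["wt"]
tval  <- est / se_wt
c(t = tval, p = 2 * pt(abs(tval), df = df.residual(fit), lower.tail = FALSE))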
coef(summary(fit))
confint(fit, level = 0.95)
2.5 % 97.5 %
(Intercept) 34.0 40.4
wt -5.18 -2.57
hp -0.050 -0.013
Interpretation: We are 95% confident that each additional 1000 lb weight reduces mpg by 2.6 – 5.2 mpg.
Be able to explain that the t-test checks whether a coefficient differs significantly from zero, while its CI conveys the same information in interval form (a 95% CI that excludes 0 corresponds to rejecting \(H_0\) at α = 0.05).
Use summary(lm()) output to interpret model results. Each column:
| Column | Meaning |
|---|---|
| Estimate | Point estimate of the coefficient. |
| Std. Error | Estimated standard deviation of its sampling distribution. |
| t value | Ratio Estimate/SE: how many SEs the estimate is from 0. |
| Pr(>\|t\|) | Two-sided p-value for testing \(H_0:\beta_j=0\). |
Residual standard error: 2.593 on 29 DF
Multiple R-squared: 0.826
Adjusted R-squared: 0.812
F-statistic: 58.4 on 2 and 29 DF, p-value: 1.15e-10
Residual SE: typical deviation of actual mpg from fitted line. F-statistic: overall model significance — at least one β ≠ 0.
When testing several coefficients simultaneously, individual t-tests inflate type I error. The F-test jointly evaluates \(H_0:\beta_1=\beta_2=\dots=\beta_k=0\).
\[ F = \frac{(RSS_{reduced} - RSS_{full})/q}{RSS_{full}/(n-p)} \]
where \(q\) = number of restrictions (coefficients tested jointly).
# Test that both wt and hp = 0
fit0 <- lm(mpg ~ 1, data=mtcars)
anova(fit0, fit)
Model 1: mpg ~ 1
Model 2: mpg ~ wt + hp
Res.Df RSS Df Sum of Sq F Pr(>F)
1 31 1126.0
2 29 194.4 2 931.6 69.6 <2e-16 ***
Interpretation: The reduction in RSS is highly significant; model explains mpg well.
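The F-statistic in that comparison can be reproduced directly from the formula, using fit0 and fit defined above:
# Manual F from the two RSS values (same as the anova() comparison)
rss_full <- sum(resid(fit)^2)
rss_red  <- sum(resid(fit0)^2)
q <- 2                                              # restrictions tested: wt = hp = 0
Fstat <- ((rss_red - rss_full)/q) / (rss_full/df.residual(fit))
Fstat
pf(Fstat, q, df.residual(fit), lower.tail = FALSE)  # matches the anova() p-value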
Use anova() to compare nested models or inspect the model’s ANOVA decomposition.
anova(fit)
Df Sum Sq Mean Sq F value Pr(>F)
wt 1 847.7 847.7 126.45 <2e-16 ***
hp 1 83.3 83.3 12.42 0.0014 **
Residuals 29 194.4 6.7
Interpretation: Weight explains most variance; horsepower adds smaller but still significant improvement.
General form:
\[
\hat\beta_j \pm t_{n-p,1-\alpha/2} \cdot SE(\hat\beta_j)
\]
Use confint() in R to compute them automatically.
confint(fit, level = 0.99)
Interpretation: Wider intervals reflect greater uncertainty (99% > 95%). Intervals widen at higher confidence levels and, all else equal, with fewer data points.
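To connect the formula to the output, the wt interval can be rebuilt by hand (mtcars fit as before):
# Manual 95% CI for wt: estimate ± t_{n−p, 0.975} × SE
est   <- coef(fit)["wt"]
se_wt <- sqrt(diag(vcov(fit)))["wt"]
est + c(-1, 1) * qt(0.975, df.residual(fit)) * se_wt   # matches the confint(fit) row for wt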
Expect a question asking which assumptions are needed for unbiasedness (Gauss–Markov) vs for valid t/F inference (Normality).
Explanation asks “how does \(y\) change with \(x_j\) (all else equal)?” and prioritizes unbiased, interpretable \(\hat\beta\). Prediction asks “how well can we forecast \(y\) for new \(x\)?” and prioritizes low test error (may use regularization, interactions, etc.).
Given a fitted model \( \hat\beta\), the point prediction at \(x_0\) is \(\hat y_0 = x_0^\top \hat\beta\).
# Example: mtcars — predict mpg from weight and horsepower
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)
newcar <- data.frame(wt = 3.0, hp = 120) # 3000 lb, 120 hp
p_hat <- predict(fit, newdata = newcar) # point prediction
p_hat
1
21.34
Interpretation: For a 3,000 lb / 120 hp car, the predicted fuel economy is about 21.3 mpg (point estimate only — no uncertainty yet).
Uncertainty comes from parameter estimation and irreducible noise.
# 95% CI for the conditional mean vs 95% PI for a new observation
predict(fit, newcar, interval = "confidence", level = 0.95)
predict(fit, newcar, interval = "prediction", level = 0.95)
# Confidence interval (mean mpg at x0)
fit lwr upr
1 21.34 20.27 22.41
# Prediction interval (new car's mpg at x0)
fit lwr upr
1 21.34 16.15 26.53
Interpretation: We estimate the average mpg near \(x_0\) is 20.3–22.4, but an individual new car could reasonably fall between 16.1 and 26.5 mpg.
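The two intervals follow \(\hat y_0 \pm t\,\hat\sigma\sqrt{h_0}\) (mean response) and \(\hat y_0 \pm t\,\hat\sigma\sqrt{1+h_0}\) (new observation), where \(h_0 = x_0^\top (X^\top X)^{-1} x_0\) is the leverage of \(x_0\). A hand computation, assuming the fit above is still in memory:
# Manual CI/PI at x0 via leverage h0
X  <- model.matrix(fit)
x0 <- c(1, 3.0, 120)                                   # intercept, wt, hp
h0 <- drop(t(x0) %*% solve(crossprod(X)) %*% x0)
s  <- summary(fit)$sigma
tcrit <- qt(0.975, df.residual(fit))
yhat0 <- sum(x0 * coef(fit))
yhat0 + c(-1, 1) * tcrit * s * sqrt(h0)                # CI for the mean response
yhat0 + c(-1, 1) * tcrit * s * sqrt(1 + h0)            # PI for a new observation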
Split data into train/test; fit on train; evaluate on test. Report RMSE and PI coverage.
set.seed(1)
n <- nrow(mtcars)
idx <- sample.int(n, size = round(0.7*n))
train <- mtcars[idx, ]
test <- mtcars[-idx, ]
fit_tr <- lm(mpg ~ wt + hp, data = train)
# Test-set point predictions and RMSE
p_test <- predict(fit_tr, newdata = test)
rmse <- sqrt(mean((test$mpg - p_test)^2))
# Optional: estimate empirical PI coverage at 95%
pi95 <- predict(fit_tr, newdata = test, interval = "prediction", level = 0.95)
covered <- mean(test$mpg >= pi95[,"lwr"] & test$mpg <= pi95[,"upr"])
list(RMSE = rmse, PI95_coverage = covered)
$RMSE
2.87
$PI95_coverage
0.93
Interpretation: Test RMSE ≈ 2.9 mpg. About 93% of test points fall inside the nominal 95% PI — acceptable for small samples.
# Extrapolation check — compare new x0 to training ranges
xrng <- sapply(train[c("wt","hp")], range)
newfar <- data.frame(wt = 6.5, hp = 350) # far beyond train for mtcars
xrng; newfar
predict(fit_tr, newfar, interval="prediction") # will produce a number, but risk is high
Regression coefficients measure association, not causation, unless strong identification conditions hold.
Practice: split mtcars 70/30; fit mpg ~ wt + hp on train; report test RMSE. Then add an interaction (wt:hp) or a squared term and compare test RMSE.
Each assumption has its own visual and quantitative diagnostic. We’ll simulate violations and interpret diagnostic plots using ggplot2 and plot.lm().
# Load packages
library(ggplot2)
set.seed(123)
n <- 100
# (a) Violate linearity — true relationship is quadratic
x1 <- runif(n, 0, 10)
y1 <- 2 + 3*x1 - 0.5*x1^2 + rnorm(n, 0, 2)
m1 <- lm(y1 ~ x1)
par(mfrow = c(2,2)); plot(m1); par(mfrow=c(1,1))
Diagnosis: Nonlinearity appears as curvature in “Residuals vs Fitted.” Fitted values systematically under- and over-predict at extremes.
# (b) Violate homoskedasticity — variance increases with x
x2 <- runif(n, 0, 10)
y2 <- 2 + 3*x2 + rnorm(n, 0, sd = 0.5 + 0.5*x2)
m2 <- lm(y2 ~ x2)
par(mfrow = c(2,2)); plot(m2); par(mfrow=c(1,1))
Diagnosis: Funnel shape in the “Residuals vs Fitted” plot. Error variance grows with the predicted value; check the Scale–Location panel of plot.lm() or run a Breusch–Pagan test (lmtest::bptest).
# (c) Violate independence — residuals correlated
set.seed(101)
x3 <- 1:100
eps <- arima.sim(list(ar = 0.7), n = 100)
y3 <- 5 + 0.4*x3 + eps
m3 <- lm(y3 ~ x3)
# Successive residual plot
plot(m3$residuals[-n], m3$residuals[-1],
xlab = "Residual[i]", ylab = "Residual[i+1]",
main = "Successive Residuals Plot")
abline(h=0,v=0,col="red")
Diagnosis: If residuals[i] vs residuals[i+1] form a pattern (e.g., clustering around diagonal), independence is violated. Use Durbin–Watson test for confirmation.
# (d) Violate normality — heavy-tailed errors
set.seed(202)
x4 <- runif(n, 0, 10)
y4 <- 2 + 3*x4 + rt(n, df=3) # t-distribution (heavy tails)
m4 <- lm(y4 ~ x4)
qqnorm(m4$residuals, main="Q-Q Plot of Residuals")
qqline(m4$residuals, col="red")
hist(m4$residuals, main="Histogram of Residuals", xlab="Residuals")
Diagnosis: Q–Q plot shows deviations at tails; histogram non-bell shaped. Normality mainly affects inference (CIs, t-tests), less so prediction.
| Assumption | Key plot/test | Symptom |
|---|---|---|
| Linearity | Residuals vs Fitted | Curved pattern |
| Independence | Residuals over index | Sequential pattern |
| Homoskedasticity | Scale-Location plot | Fan shape |
| Normality | Q–Q plot | Heavy/light tails, skew |
We define \(H = X(X^\top X)^{-1}X^\top\). It “puts a hat” on \(Y\): \( \hat{Y} = HY \).
# Demonstrate orthogonality
set.seed(10)
X <- cbind(1, rnorm(5))
Y <- rnorm(5)
H <- X %*% solve(t(X) %*% X) %*% t(X)
Y_hat <- H %*% Y
e_hat <- (diag(5) - H) %*% Y
t(Y_hat) %*% e_hat # should be 0
Interpretation: Residuals contain no information about the fitted component — the model explains everything systematic, leaving only random noise.
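Two further properties of \(H\) worth checking in the same toy example: it is symmetric and idempotent (\(H^2 = H\)), and its trace equals the number of estimated coefficients.
# H is symmetric, idempotent, and tr(H) = number of columns of X
max(abs(H - t(H)))          # ~ 0 (symmetry)
max(abs(H %*% H - H))       # ~ 0 (projecting twice changes nothing)
sum(diag(H))                # = 2 here (intercept + one predictor)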
Once violations are diagnosed, we can often fix or mitigate them:
- Nonlinearity: transform predictors or add terms (log(x), x^2).
- Heteroskedasticity: transform Y or use robust SEs (lmtest::coeftest with vcovHC()).
- Non-normality: transform Y (e.g., sqrt, log), or use nonparametric regression.
# Example: fix heteroskedasticity in Bollywood-style data (toy values shown; the output below is from the full Bollywood dataset used later)
data <- data.frame(
Budget = c(50,100,150,200,250),
Gross = c(100,250,400,600,900)
)
data$log_Budget <- log(data$Budget)
data$log_Gross <- log(data$Gross)
m_lin <- lm(Gross ~ Budget, data)
m_log <- lm(log_Gross ~ log_Budget, data)
summary(m_log)
Coefficients:
(Intercept) log_Budget
-1.44 1.32 ***
---
Residual SE: 2.56 on 188 df
R² = 0.58
Interpretation: In the log–log model, coefficients represent elasticities: a 1% increase in budget → 1.32% increase in gross revenue. Model is now homoskedastic and interpretable multiplicatively.
Beyond assumption checks, diagnostics help identify individual outliers or high-influence points.
# Example using mtcars
fit <- lm(mpg ~ wt + hp, data=mtcars)
plot(fit, which=4) # Cook’s distance
influence.measures(fit)
Dataset: Bollywood film budgets (in million ₹) vs gross revenue. Diagnose and repair non-constant variance.
bollywood <- read.csv("bollywood.csv")
model_raw <- lm(Gross ~ Budget, data = bollywood)
par(mfrow = c(2,2)); plot(model_raw); par(mfrow=c(1,1))
# Log-transform both variables
bollywood$log_Budget <- log(bollywood$Budget)
bollywood$log_Gross <- log(bollywood$Gross)
model_log <- lm(log_Gross ~ log_Budget, data = bollywood)
summary(model_log)
Result: \( \log(\text{Gross}) = -1.44 + 1.32 \log(\text{Budget}) \)
Elasticity interpretation: 1% ↑ in budget → 1.32% ↑ in gross (increasing returns to scale). Model variance stabilized.
Practice: compute hatvalues(), rstandard(), and cooks.distance() for your own model; try log(y) or sqrt(y) and compare residual plots.
Adding predictors always increases \(R^2\) in-sample but can harm test performance. Selection methods trade off fit with complexity to minimize expected prediction error.
Classical stepwise testing uses partial F-tests (or t-tests) to add/drop predictors. Problems: multiplicity (inflated type I error), order dependence, and data dredging.
# Partial F-test (nested models) on mtcars
full <- lm(mpg ~ wt + hp + drat + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)
anova(reduced, full) # H0: added vars (drat, qsec) jointly = 0
AIC estimates out-of-sample Kullback–Leibler discrepancy. For Gaussian LM with RSS and k parameters (\(k=p\) incl. intercept):
\[ \mathrm{AIC} = n\log\!\left(\frac{RSS}{n}\right) + 2k. \]
Choose the model with the lowest AIC. The \(2k\) term penalizes complexity, a lighter penalty than BIC's.
m1 <- lm(mpg ~ wt + hp, data=mtcars)
m2 <- lm(mpg ~ wt + hp + drat, data=mtcars)
AIC(m1, m2)
# Lower AIC preferred
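A hand computation of the formula above. Note that R's AIC() includes the full Gaussian log-likelihood constants and counts \(\sigma^2\) as a parameter, so the absolute values differ, but differences between models fit to the same data agree:
# AIC by the formula n*log(RSS/n) + 2k vs R's AIC(): same model differences
aic_hand <- function(m) {
  rss <- sum(resid(m)^2); n <- nobs(m); k <- length(coef(m))
  n * log(rss / n) + 2 * k
}
aic_hand(m2) - aic_hand(m1)
AIC(m2) - AIC(m1)            # identical difference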
BIC is harsher on complexity:
\[ \mathrm{BIC} = n\log\!\left(\frac{RSS}{n}\right) + k\log n. \]
Since \(\log n > 2\) once \(n > 7\), BIC penalizes each parameter more heavily and tends to favor more parsimonious models than AIC.
BIC(m1, m2)
# Often selects a smaller model than AIC, especially for large n
Adjusted \(R^2\) penalizes extra predictors by degrees of freedom:
\[ R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-k}. \]
Use it comparatively across models fit to the same \(y\) and \(n\); it may peak before the full model.
summary(m1)$adj.r.squared
summary(m2)$adj.r.squared
Use a holdout set or K-fold CV to estimate test error:
set.seed(1)
n <- nrow(mtcars)
idx <- sample.int(n, floor(0.7*n))
train <- mtcars[idx, ]; test <- mtcars[-idx, ]
cands <- list(
mA = lm(mpg ~ wt, data=train),
mB = lm(mpg ~ wt + hp, data=train),
mC = lm(mpg ~ wt + hp + drat, data=train)
)
mspe <- sapply(cands, function(fit){
p <- predict(fit, newdata=test)
mean((test$mpg - p)^2)
})
sort(mspe)
Interpretation: pick the model with the smallest estimated MSPE. Repeat random splits or use K-fold CV for stability.
# Simple K-fold CV (K=5) for one formula
K <- 5
fold_id <- sample(rep(1:K, length.out=n))
cv_formula <- mpg ~ wt + hp
cv_err <- sapply(1:K, function(k){
tr <- mtcars[fold_id != k, ]
te <- mtcars[fold_id == k, ]
fit <- lm(cv_formula, data=tr)
mean((te$mpg - predict(fit, te))^2)
})
mean(cv_err); sd(cv_err)
Quick baseline: step() with AIC; exhaustive subsets with leaps::regsubsets().
# AIC stepwise (both directions)
full <- lm(mpg ~ ., data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)
step_fit <- step(null, scope = formula(full), direction = "both", trace = 0)
summary(step_fit)
# Exhaustive/forward/backward subsets (by adj. R2, Cp, BIC)
library(leaps)
sub <- regsubsets(mpg ~ ., data = mtcars, nbest=1, nvmax = 6, method="exhaustive")
sum_sub <- summary(sub)
which.max(sum_sub$adjr2) # size of best model by adj R2
which.min(sum_sub$bic) # size of best by BIC
Cautions: step() is greedy and can miss better models; use it as a heuristic, not gospel. regsubsets() doesn't refit on a test set, so pair it with CV/MSPE before finalizing.
Collinearity = predictors are strongly linearly related. Consequences: inflated SEs, unstable coefficient estimates and sign flips, and weak individual t-tests despite a high \(R^2\).
# Synthetic collinearity example
set.seed(2)
n <- 60
x1 <- rnorm(n)
x2 <- 0.95*x1 + rnorm(n, sd=0.1) # highly collinear
y <- 3 + 2*x1 - 2*x2 + rnorm(n, sd=1)
fit_col <- lm(y ~ x1 + x2)
summary(fit_col) # large SEs, unstable signs
Variance Inflation Factor (VIF) for predictor \(x_j\):
\[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \]
where \(R_j^2\) comes from regressing \(x_j\) on all other predictors. Rules of thumb: VIF > 5 is moderate, VIF > 10 is high.
# Using regclass or car for VIF
# install.packages("regclass") # if needed
library(regclass)
VIF(fit_col)
# Also look at correlation matrix and condition indices
cor_m <- cor(cbind(x1,x2))
e <- eigen(t(model.matrix(fit_col)[,-1]) %*% model.matrix(fit_col)[,-1])
cond_index <- sqrt(max(e$values) / e$values)
cond_index
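A direct check of the definition for the synthetic x1/x2 example: compute \(R_j^2\) by regressing x2 on x1 and apply the formula.
# VIF for x2 by hand: 1 / (1 − R² from regressing x2 on the other predictor)
r2_x2 <- summary(lm(x2 ~ x1))$r.squared
1 / (1 - r2_x2)              # compare with the VIF reported above for x2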
# Centering can stabilize numerics & interpretation
df <- data.frame(y, x1, x2)
df$x1_c <- scale(df$x1, center=TRUE, scale=FALSE)
df$x2_c <- scale(df$x2, center=TRUE, scale=FALSE)
fit_c <- lm(y ~ x1_c + x2_c, data=df)
VIF(fit_c)
# Dropping a redundant predictor (when theory permits)
fit_drop <- lm(y ~ x1, data=df)
# Compare MSPE via CV before committing
| Assumption | Check | Fix |
|---|---|---|
| Linearity | Residual vs Fitted (curvature) | Add polynomial/interaction |
| Homoskedasticity | Fan shape in Scale–Location | Log-transform Y or robust SE |
| Normality | Q–Q plot tails | Transform Y (log/sqrt) |
| Independence | Sequential residuals / DW test | Use AR/GLS/time terms |
Coefficients:
(Intercept) wt hp
37.23 -3.88 -0.03
---
Residual SE: 2.59 R² = 0.826, Adj R² = 0.815
What does the wt coefficient mean, holding hp fixed?
Residuals vs Fitted → U-shape pattern
Q–Q Plot → Normal
Scale–Location → constant spread
Fix: fit lm(y ~ x + I(x^2)) or transform the variable.
\(\hat\beta_1 = 1.4,\ SE=0.5,\ n=40,\ k=3,\ \alpha=0.05\)
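A worked calculation for that scenario, assuming the task is the usual t-test and 95% CI for \(\beta_1\) with df = n − k = 37:
# t = 1.4/0.5 = 2.8 > qt(0.975, 37) ≈ 2.03, so reject H0: β1 = 0
qt(0.975, 37)
2 * pt(2.8, df = 37, lower.tail = FALSE)     # two-sided p ≈ 0.008
1.4 + c(-1, 1) * qt(0.975, 37) * 0.5         # 95% CI ≈ (0.39, 2.41)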
| Model | Adj R² | AIC | BIC |
|---|---|---|---|
| M1 | 0.78 | 120.5 | 128.3 |
| M2 | 0.80 | 119.8 | 131.1 |
| M3 | 0.79 | 121.1 | 127.4 |
Predictors: x1, x2, x3
VIFs: 2.1, 12.5, 3.4
predict(fit, newdata=data.frame(wt=3.0, hp=120),
interval="prediction", level=0.95)
Use vcovHC() for robust SEs or transform Y.
β̂ = (X'X)⁻¹X'y
Var(β̂) = σ²(X'X)⁻¹
CI: β̂ ± t * SE(β̂)
F = ((RSSr - RSSf)/q) / (RSSf/(n-k))
VIFj = 1/(1-Rj²)
AIC = n ln(RSS/n) + 2k
BIC = n ln(RSS/n) + k ln n
MSPE = mean((y_test - ŷ_test)²)
Each section below shows how the diagnostic plot should look when assumptions hold (✅ Correct) and when they are violated (🚫 Violation). Mentally picture these patterns — that’s usually enough to identify issues in an exam setting.
Plot: Residuals vs Fitted
✅ Correct (linear relationship captured)
resid
| • • • • •
| • • • • • • •
| • • • • • • • •
+----------------------→ fitted
evenly scattered
🚫 Violation (nonlinear pattern — model missing curvature)
resid
| •
| • •
| • •
| • •
+----------------------→ fitted
"U" or "∩" shape
Fix: add polynomial term (I(x^2)), interaction, or transform predictor (log(x)).
Plot: Residuals vs Observation Order / Time
✅ Correct (independent residuals)
resid
3 | • • • • • •
0 | • • • • •
-3 | • • •
+--------------------→ observation index
random scatter
🚫 Violation (autocorrelation — residuals follow a pattern)
resid
3 | /‾‾‾\__/‾‾‾\__
0 |_______/___________\____ index →
-3 |
cyclical or trending
Fix: include lagged predictors, use GLS/AR(1), or add time trend.
Plot: Scale–Location or Residuals vs Fitted
✅ Correct (equal variance)
resid
| • • •
| • • •
| • • • •
+----------------→ fitted
uniform spread
🚫 Violation (heteroskedastic — "funnel" shape)
resid
| •
| • •
| • •
| • •
+----------------→ fitted
variance increases with fit
Fix: log/√ transform Y, use weighted LS, or robust SEs (vcovHC()).
Plot: Q–Q Plot of residuals
✅ Correct (normally distributed residuals)
Quantiles →
| •
| •
| •
| •
| •
+----------------
perfect 45° line
🚫 Violation (non-normal)
• S-shape → skewness
• tails above/below line → heavy tails
Quantiles →
| • (left tail)
| • •
| •
| • • (right tail off-line)
+----------------
Fix: transform Y, remove severe outliers, or rely on large n (CLT robustness).
Plots: Cook’s Distance / Leverage vs Residuals
✅ Correct (no influential points)
Cook’s D
| • • • • •
| • • • • • • •
+----------------
obs index →
🚫 Violation (outlier with high influence)
Cook’s D
| • ← high D (>1)
| • • • • •
+----------------
obs index →
Fix: verify data, try fit without point, report sensitivity check.
Numeric and conceptual pattern
✅ Correct: VIFs ≈ 1–3; β̂ signs stable; SEs reasonable.
🚫 Violation: VIFs > 10; β̂ signs flip with minor data changes; huge SEs and nonsignificant t-tests despite high R².
Fix: center or standardize, drop/merge predictors, or use ridge/LASSO.
Pattern → Symptom → Fix. If you see curvature, heteroskedasticity, or correlation, state the violation and propose the repair directly.
| Assumption | “Bad” Visual Cue | Repair Strategy |
|---|---|---|
| Linearity | U-shape residuals | Add x² / transform X |
| Independence | Sequential wave pattern | Add time terms / AR model |
| Homoskedasticity | Funnel pattern | Log Y / robust SE |
| Normality | Q–Q curve or tail deviation | Transform Y / ignore if n large |
| Influence | Large Cook’s D spike | Investigate / re-fit w/o obs |
| Collinearity | High VIF / unstable β̂ | Center, drop, combine vars |
| Topic | Recall / Action |
|---|---|
| OLS formula | \(\hat\beta = (X'X)^{-1}X'Y\); unbiased if assumptions hold. |
| Key Decomposition | TSS = ESS + RSS → \(R^2 = 1 - RSS/TSS\). |
| t vs F test | t → one coefficient; F → multiple coefficients. Reject if p < 0.05. |
| CI & PI | CI(mean): ±tσ√h₀; PI(new): ±tσ√(1+h₀). |
| Diagnostics | Linearity (U-shape) → add x²; Funnel → log Y; Q–Q tail → transform; Waves → AR(1). |
| Influence | Cook’s D > 1 → outlier; verify & re-fit. |
| Collinearity | VIF > 10 = unstable β̂; center or drop predictor. |
| Model selection | AIC ↓ better (prediction), BIC ↓ stricter (parsimony), adj R² ↑ better. |
| Prediction vs Explanation | Prediction → minimize MSPE; Explanation → interpret β & control confounders. |
| Exam Mindset | 1️⃣ Identify pattern → 2️⃣ Name assumption → 3️⃣ Apply fix → 4️⃣ Justify with equation or plot cue. |
💡 *If nothing else: know the four OLS assumptions, be able to explain R², interpret lm() output, and spot which plot corresponds to which violation.*