Linear regression is one of the simplest statistical methods for modeling and predicting the relationship between continuous variables.
Simple Linear Regression
Linear regression involving two variables: a dependent (response) variable $Y$ and an independent (explanatory) variable $X$.
Indications of a Linear Relationship
- Scatter Diagram
Plots the observed pairs $(x_i, y_i)$ to visualize the relationship.
- Correlation Coefficient
Measures the strength and direction of the linear relationship.
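As a rough illustration of the correlation coefficient, the sketch below computes the usual Pearson sample correlation $r$. The notes do not give a formula, so the choice of the Pearson coefficient, the function name `pearson_r`, and the small data set are assumptions made for the example.

```python
import math

def pearson_r(x, y):
    """Pearson sample correlation between two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up illustrative data, not taken from any study.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))  # 0.7746
```

A value near +1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or no linear relationship.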
Model
General population model:
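$$Y = \alpha + \beta X$$
For an individual observation:
$$y_i = \alpha + \beta x_i + \varepsilon_i$$
where
- $\alpha$ = intercept (value of $Y$ when $X = 0$)
- $\beta$ = slope (rate of change of $Y$ with respect to $X$)
- $\varepsilon_i$ = random error, assumed $\varepsilon_i \sim N(0, \sigma^2)$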
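To make the model concrete, the following sketch simulates observations from it with Python's standard library. The function name `simulate`, the parameter values, and the fixed seed are all hypothetical choices for illustration.

```python
import random

def simulate(alpha, beta, sigma, xs, seed=0):
    """Draw y_i = alpha + beta * x_i + eps_i with eps_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical parameter values chosen only for the example.
xs = [1, 2, 3, 4, 5]
ys = simulate(alpha=2.0, beta=0.5, sigma=1.0, xs=xs)
for x, y in zip(xs, ys):
    print(x, round(y, 3))
```

With `sigma=0` every simulated point falls exactly on the line $\alpha + \beta x$; larger values of `sigma` scatter the points further from it.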
Coefficient of Determination
The proportion of the variance in $Y$ that is explained by $X$, denoted by $R^2$.
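One common way to compute $R^2$ is $1 - \text{ESS}/\text{TSS}$, where ESS is the sum of squared residuals and TSS is the total sum of squares of $y$ about its mean. The sketch below assumes that definition; the function name `r_squared` and the data are illustrative only.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - ESS/TSS for observed values y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    ess = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    return 1.0 - ess / tss

# Made-up data; y_hat holds the least-squares fitted values 2.2 + 0.6 * x
# for x = 1, ..., 5.
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
print(round(r_squared(y, y_hat), 4))  # 0.6
```

Here 60% of the variation in the illustrative $y$ values is accounted for by the fitted line.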
Error Sum of Squares
Denoted by ESS.
$$\text{ESS} = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$
Estimation of Parameters
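Suppose the fitted regression line is $\hat{y} = \hat{\alpha} + \hat{\beta} x$. The goal is to find the $\hat{\alpha}, \hat{\beta}$ that minimize ESS, which is done by the method of least squares.
Setting the partial derivatives of ESS with respect to $\alpha$ and $\beta$ to zero gives the normal equations: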
$$\sum y_i = n\alpha + \beta \sum x_i$$
$$\sum x_i y_i = \alpha \sum x_i + \beta \sum x_i^2$$
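The step from ESS to the normal equations can be written out as follows (all sums run over $i = 1, \dots, n$):

$$
\begin{aligned}
\frac{\partial\,\text{ESS}}{\partial \alpha} &= -2 \sum (y_i - \alpha - \beta x_i) = 0
  &&\Rightarrow\quad \sum y_i = n\alpha + \beta \sum x_i \\
\frac{\partial\,\text{ESS}}{\partial \beta} &= -2 \sum x_i (y_i - \alpha - \beta x_i) = 0
  &&\Rightarrow\quad \sum x_i y_i = \alpha \sum x_i + \beta \sum x_i^2
\end{aligned}
$$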
Solving gives:
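$$\hat{\beta} = \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$
$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$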
Alternate form using deviations:
$$\hat{\beta} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
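The two expressions for the slope can be checked against each other numerically. The following is a minimal sketch in plain Python; the function names `fit_line` and `slope_from_deviations` and the data set are assumptions made for illustration.

```python
def fit_line(x, y):
    """Least-squares intercept and slope from the raw-sums formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    beta_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    alpha_hat = sum_y / n - beta_hat * sum_x / n  # y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

def slope_from_deviations(x, y):
    """The same slope computed from deviations about the means."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
alpha_hat, beta_hat = fit_line(x, y)
print(round(alpha_hat, 4), round(beta_hat, 4))  # 2.2 0.6
print(round(slope_from_deviations(x, y), 4))    # 0.6
```

Both forms give the same slope; the deviation form is usually less prone to rounding problems when the $x_i$ are large relative to their spread.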
Sampling Distribution of Beta
Under the normal error assumption $\varepsilon_i \sim N(0, \sigma^2)$:
$$\hat{\beta} \sim N\left(\beta, \frac{\sigma^2}{S_{xx}}\right)$$
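where $S_{xx} = \sum (x_i - \bar{x})^2 = n\sigma_x^2$, with $\sigma_x^2$ the variance of the $x_i$ computed with divisor $n$.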
When $\sigma^2$ is unknown, estimate it using:
$$s^2 = \frac{\text{ESS}}{n-2}$$
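A $100(1-\alpha)\%$ confidence interval for the true slope $\beta$ is (the $\alpha$ in the subscript is the significance level, not the intercept):
$$\hat{\beta} \pm t_{\alpha/2,\,n-2} \times \frac{s}{\sqrt{S_{xx}}}$$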
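A minimal sketch of the interval calculation follows. To stay within the standard library, the critical value is passed in as a number read from a t table rather than computed; the function name `slope_confidence_interval` and the data are illustrative assumptions.

```python
import math

def slope_confidence_interval(x, y, t_crit):
    """Interval beta_hat +/- t_crit * s / sqrt(S_xx) for the slope."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    # ESS evaluated at the fitted line, then s^2 = ESS / (n - 2).
    ess = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ess / (n - 2))
    half_width = t_crit * s / math.sqrt(s_xx)
    return beta_hat - half_width, beta_hat + half_width

# Made-up illustrative data with n = 5, so df = 3.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
# t_{0.025, 3} is about 3.182 in a standard t table (95% interval).
lower, upper = slope_confidence_interval(x, y, t_crit=3.182)
print(round(lower, 3), round(upper, 3))
```

For this data the interval is roughly $(-0.3, 1.5)$. Because it contains 0, these five points alone do not rule out a zero slope.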
Hypothesis Testing on the Regression Coefficient
To test whether X significantly predicts Y:
$$H_0: \beta = 0 \quad \text{and} \quad H_1: \beta \neq 0$$
Test statistic:
$$t = \frac{\hat{\beta} - 0}{s/\sqrt{S_{xx}}} \sim t_{n-2}$$
If $|t|$ exceeds the critical value $t_{\alpha/2,\,n-2}$, reject $H_0$; the linear relationship between $X$ and $Y$ is then statistically significant.
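The test statistic can be sketched in a few lines of plain Python. The function name `slope_t_statistic` and the data are illustrative, and the critical value is again taken from a t table rather than computed.

```python
import math

def slope_t_statistic(x, y):
    """t = beta_hat / (s / sqrt(S_xx)) for testing H0: beta = 0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    ess = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ess / (n - 2))  # estimate of sigma
    return beta_hat / (s / math.sqrt(s_xx))

# Made-up illustrative data with n = 5, so df = 3.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_stat = slope_t_statistic(x, y)
t_crit = 3.182  # t_{0.025, 3} from a standard t table (5% two-sided test)
print(round(t_stat, 4))      # 2.1213
print(abs(t_stat) > t_crit)  # False
```

Since $|t| \approx 2.12$ is below 3.182, $H_0$ is not rejected at the 5% level for this small example.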
Analysis of Variance (ANOVA) for Regression
Tests whether the regression line fits the data well.
| Source of Variation | Sum of Squares (SS) | df | Mean Square (MS) |
|---|---|---|---|
| Regression (RSS) | $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ | $1$ | RSS$/1$ |
| Error (ESS) | $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ | $n-2$ | ESS$/(n-2)$ |
| Total (TSS) | $\sum_{i=1}^{n} (y_i - \bar{y})^2$ | $n-1$ | |
Here:
- $y_i$: Actual observed value of the dependent variable for observation $i$
- $\hat{y}_i$: Predicted value of $y_i$ from the regression line
- $\bar{y}$: Mean of all observed $y_i$ values (overall average)
If RSS is large relative to ESS, the model fits well. The comparison is made with the F-ratio.
F-ratio
$$F_{\text{calc}} = \frac{\text{RSS}/1}{\text{ESS}/(n-2)} \sim F_{1,\,n-2}$$
Decision Rule
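$H_0$: the regression line does not fit the data ($\beta = 0$)
$H_1$: the regression line fits the data ($\beta \neq 0$)
Reject $H_0$ if $F_{\text{calc}} > F_{1,\,n-2,\,\alpha}$, where $\alpha$ here is the significance level.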
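The ANOVA decomposition and the F-ratio can be sketched as below. The function name `regression_anova`, the returned dictionary keys, and the data are assumptions for illustration; the critical value is read from an F table rather than computed.

```python
def regression_anova(x, y):
    """Sums of squares and F-ratio for a simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    y_hat = [alpha_hat + beta_hat * xi for xi in x]
    rss = sum((fi - y_bar) ** 2 for fi in y_hat)           # regression SS, df = 1
    ess = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error SS, df = n - 2
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total SS, df = n - 1
    f_calc = (rss / 1) / (ess / (n - 2))
    return {"RSS": rss, "ESS": ess, "TSS": tss, "F": f_calc}

# Made-up illustrative data with n = 5.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
table = regression_anova(x, y)
f_crit = 10.13  # F_{1, 3, 0.05} from a standard F table
print({k: round(v, 4) for k, v in table.items()})
print(table["F"] > f_crit)  # False
```

For this data RSS = 3.6, ESS = 2.4 and TSS = 6, so RSS + ESS = TSS as expected, and $F_{\text{calc}} = 4.5$ falls below the tabulated 10.13, so $H_0$ is not rejected at the 5% level.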