Linear Regression

One of the simplest statistical methods for modeling and predicting the relationship between continuous variables.

Simple Linear Regression

Linear regression involving two variables: a dependent (response) variable and an independent (explanatory) variable.

Indications of a Linear Relationship

  • Scatter Diagram
    Plots observed pairs (x_i, y_i) to visualize the relationship (see the sketch after this list).
  • Correlation Coefficient
    Measures the strength and direction of the linear relationship.
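
As a rough illustration of both checks, the sketch below uses NumPy and Matplotlib on a small made-up data set; the x and y values are placeholders, not from this text.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical example data (placeholders for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Scatter diagram: plot the observed pairs (x_i, y_i)
plt.scatter(x, y)
plt.xlabel("x (explanatory)")
plt.ylabel("y (response)")
plt.title("Scatter diagram")
plt.show()

# Correlation coefficient: strength and direction of the linear relationship
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation r = {r:.3f}")
```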

Model

General population model:

Y = \alpha + \beta X

For an individual observation:

y_i = \alpha + \beta x_i + \varepsilon_i

where

  • \alpha = intercept (value of Y when X = 0)
  • \beta = slope (rate of change of Y with respect to X)
  • \varepsilon_i = random error, assumed \varepsilon_i \sim N(0, \sigma^2)
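
To make the model concrete, the sketch below simulates observations from it; the parameter values \alpha = 2, \beta = 0.5, \sigma = 1 are arbitrary choices for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Arbitrary parameter values chosen only for demonstration
alpha, beta, sigma = 2.0, 0.5, 1.0

x = np.linspace(0, 10, 50)                  # explanatory variable
eps = rng.normal(0.0, sigma, size=x.size)   # random error, eps_i ~ N(0, sigma^2)
y = alpha + beta * x + eps                  # y_i = alpha + beta * x_i + eps_i

print(y[:5])
```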

Coefficient of Determination

Shows how much of the variance in Y is explained by X. Denoted by R^2.
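
In terms of the sums of squares introduced in the ANOVA table below (RSS, ESS, TSS), the standard relation is:

R^2 = \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\text{ESS}}{\text{TSS}}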

Error Sum of Squares

Denoted by ESS.

\text{ESS} = \sum (y_i - \alpha - \beta x_i)^2

Estimation of Parameters

Suppose the fitted regression line is \hat{y} = \hat{\alpha} + \hat{\beta}x. The goal is to find \hat{\alpha}, \hat{\beta} that minimize ESS; the method of least squares is used here.

Setting the partial derivatives of ESS with respect to \alpha and \beta to zero gives the normal equations:

\sum y_i = n\alpha + \beta \sum x_i

\sum x_i y_i = \alpha \sum x_i + \beta \sum x_i^2

Solving gives:

\hat{\beta} = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2}

\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}

Alternate form using deviations:

\hat{\beta} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
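
A minimal NumPy sketch of these estimators, using the deviation form for \hat{\beta}; the x and y arrays are placeholder data, not from this text.

```python
import numpy as np

# Placeholder example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

x_bar, y_bar = x.mean(), y.mean()

# Slope: deviation form of the least-squares estimator
S_xy = np.sum((x - x_bar) * (y - y_bar))
S_xx = np.sum((x - x_bar) ** 2)
beta_hat = S_xy / S_xx

# Intercept
alpha_hat = y_bar - beta_hat * x_bar

print(f"beta_hat = {beta_hat:.4f}, alpha_hat = {alpha_hat:.4f}")
```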

Sampling Distribution of Beta

Under the normal error assumption \varepsilon_i \sim N(0, \sigma^2):

\hat{\beta} \sim N\left(\beta, \frac{\sigma^2}{S_{xx}}\right)

where S_{xx} = \sum (x_i - \bar{x})^2 (n times the variance of the x-values).

When \sigma^2 is unknown, estimate it using:

s^2 = \frac{\text{ESS}}{n - 2}

A confidence interval for the true slope \beta is:

\hat{\beta} \pm t_{\alpha/2,\, n-2} \times \frac{s}{\sqrt{S_{xx}}}
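
A sketch of this interval at the 95% level, assuming SciPy is available and using the same kind of placeholder data as above.

```python
import numpy as np
from scipy import stats

# Placeholder example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

n = x.size
x_bar, y_bar = x.mean(), y.mean()
S_xx = np.sum((x - x_bar) ** 2)
beta_hat = np.sum((x - x_bar) * (y - y_bar)) / S_xx
alpha_hat = y_bar - beta_hat * x_bar

# s^2 = ESS / (n - 2)
ESS = np.sum((y - (alpha_hat + beta_hat * x)) ** 2)
s = np.sqrt(ESS / (n - 2))

# 95% confidence interval for the slope
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)
half_width = t_crit * s / np.sqrt(S_xx)
print(f"95% CI for beta: ({beta_hat - half_width:.4f}, {beta_hat + half_width:.4f})")
```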

Hypothesis Testing on the Regression Coefficient

To test whether X significantly predicts Y:

H_0: \beta = 0 \quad \text{and} \quad H_1: \beta \neq 0

Test statistic:

t = \frac{\hat{\beta} - 0}{s / \sqrt{S_{xx}}} \sim t_{n-2}

If |t| > \text{critical value}, reject H_0, which means the relationship between X and Y is statistically significant.
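
For comparison, scipy.stats.linregress carries out this two-sided t-test on the slope; the sketch below (again with placeholder data) prints the slope, the t statistic, and the p-value.

```python
import numpy as np
from scipy import stats

# Placeholder example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

res = stats.linregress(x, y)

# t statistic for H0: beta = 0 (slope divided by its standard error)
t_calc = res.slope / res.stderr
print(f"beta_hat = {res.slope:.4f}, t = {t_calc:.3f}, p-value = {res.pvalue:.4f}")

# Reject H0 at the 5% level if the p-value is below 0.05
print("Reject H0" if res.pvalue < 0.05 else "Fail to reject H0")
```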

Analysis of Variance (ANOVA) for Regression

Tests whether the regression line fits the data well.

| Source of Variation | Sum of Squares (SS) | df | Mean Square (MS) |
| --- | --- | --- | --- |
| Regression (RSS) | \sum_{i=1}^n (\hat{y_i} - \bar{y})^2 | 1 | \text{RSS} / 1 |
| Error (ESS) | \sum_{i=1}^n (y_i - \hat{y_i})^2 | n - 2 | \text{ESS} / (n - 2) |
| Total (TSS) | \sum_{i=1}^n (y_i - \bar{y})^2 | n - 1 |  |

Here:

  • y_i: Actual observed value of the dependent variable for observation i
  • \hat{y_i}: Predicted value of y_i from the regression line
  • \bar{y}: Mean of all observed y_i values (overall average)

If RSS is large relative to ESS, the model fits well; this comparison is formalized by the F-ratio.

F-ratio

F_\text{calc} = \frac{\text{RSS}/1}{\text{ESS}/(n-2)} \sim F_{1, n-2}

Decision Rule

H_0: \text{regression line does not fit the data}

H_1: \text{regression line fits the data}

Reject H_0 if F_{\text{calc}} > F_{1, n-2, \alpha}.
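
A sketch of the full ANOVA computation on placeholder data, with a 5% significance level assumed for the decision rule.

```python
import numpy as np
from scipy import stats

# Placeholder example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

n = x.size
x_bar, y_bar = x.mean(), y.mean()
beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
alpha_hat = y_bar - beta_hat * x_bar
y_hat = alpha_hat + beta_hat * x

# Sums of squares from the ANOVA table
RSS = np.sum((y_hat - y_bar) ** 2)   # regression
ESS = np.sum((y - y_hat) ** 2)       # error
TSS = np.sum((y - y_bar) ** 2)       # total (equals RSS + ESS)

# F-ratio and decision at alpha = 0.05
F_calc = (RSS / 1) / (ESS / (n - 2))
F_crit = stats.f.ppf(1 - 0.05, dfn=1, dfd=n - 2)
print(f"F = {F_calc:.3f}, critical value = {F_crit:.3f}")
print("Reject H0" if F_calc > F_crit else "Fail to reject H0")
```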