
Linear Regression

Formula, Gradient Descent & Worked Example — For Engineering Students

Last Updated: March 2026

Key Takeaways 📌

  • Definition: Linear Regression is a supervised ML algorithm that models the relationship between a dependent variable and one or more independent variables using a straight line (or hyperplane).
  • Formula: ŷ = β₀ + β₁x (simple)  |  ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ (multiple)
  • Cost Function: Mean Squared Error (MSE) — J(β) = (1/2m) × Σ(ŷᵢ − yᵢ)²
  • Optimisation: Gradient Descent or Ordinary Least Squares (OLS) closed-form solution.
  • Used for: Predicting continuous numerical values (price, temperature, marks, salary).

1. Definition & Analogy

Linear Regression is a supervised Machine Learning algorithm that models the linear relationship between a dependent variable (output) and one or more independent variables (inputs). The model finds the best-fit straight line through the data that minimises the total prediction error.

It is one of the oldest and most widely used statistical and ML methods: the method of least squares dates back to Legendre and Gauss in the early 19th century, and Francis Galton later coined the term "regression". It remains central to data science today.

Analogy — Marking Students on Study Hours

Imagine you have data on 50 students: how many hours each studied, and their exam scores. If you plot this (hours on the x-axis, score on the y-axis), you will see a trend — more hours generally means a higher score. Linear regression draws the best single straight line through this scatter of points. Once you have this line, you can predict a new student’s expected score just by knowing their study hours.

That line is the linear regression model. Its equation tells you: for every additional hour of study, how many marks does the score increase?

2. The Linear Regression Formula

Simple Linear Regression (one input feature)

ŷ = β₀ + β₁x

Symbol | Name             | Meaning
ŷ      | Predicted value  | The output the model predicts
x      | Input feature    | The independent variable (e.g., study hours)
β₀     | Intercept (bias) | Value of ŷ when x = 0; shifts the line up or down
β₁     | Slope (weight)   | Change in ŷ for every one-unit increase in x

Multiple Linear Regression (two or more features)

ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ

Here, x₁, x₂, …, xₙ are different input features (e.g., study hours, sleep hours, attendance percentage), and each βᵢ represents the effect of that feature on the prediction, holding all other features constant.

In matrix form: ŷ = Xβ, where X is the m × (n+1) feature matrix (a leading column of ones absorbs the intercept) and β is the (n+1) × 1 parameter vector.
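To make the matrix form concrete, here is a minimal NumPy sketch. The feature values and the β vector are made up purely for illustration:

```python
import numpy as np

# Illustrative feature matrix for 3 students: columns are
# [ones for the intercept, study hours, sleep hours]
X = np.array([
    [1.0, 2.0, 7.0],
    [1.0, 4.0, 6.0],
    [1.0, 5.0, 8.0],
])

# Illustrative parameter vector beta = [beta0, beta1, beta2]
beta = np.array([20.0, 8.0, 1.0])

# Matrix-form prediction: y_hat = X @ beta
y_hat = X @ beta
print(y_hat)  # [43. 58. 68.]
```

Each prediction is just the dot product of one row of X with β, which is exactly the expanded formula β₀ + β₁x₁ + β₂x₂.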

3. Cost Function — Mean Squared Error (MSE)

The standard cost function for linear regression is the Mean Squared Error (MSE):

J(β) = (1 / 2m) × Σᵢ₌₁ᵐ (ŷᵢ − yᵢ)²

Symbol | Meaning
J(β)   | Total cost (error) of the model with current parameters β
m      | Number of training examples
ŷᵢ     | Model's predicted output for example i
yᵢ     | Actual (true) output for example i
Σ      | Sum over all m training examples

The factor of 1/2 is a mathematical convenience: it cancels the factor of 2 produced when the square is differentiated during gradient descent. The goal is to find the values of β that minimise J(β).

Why square the errors? Squaring penalises large errors more heavily, ensures all error terms are positive (so they do not cancel each other out), and makes the cost function differentiable — essential for gradient descent.
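The cost formula above is a one-liner in NumPy. A small sketch with made-up true and predicted values:

```python
import numpy as np

# Made-up targets and model predictions for 3 examples
y_true = np.array([40.0, 50.0, 60.0])
y_pred = np.array([42.0, 49.0, 63.0])

m = len(y_true)
# J(beta) = (1 / 2m) * sum((y_hat_i - y_i)^2), the half-MSE from the text
cost = np.sum((y_pred - y_true) ** 2) / (2 * m)
print(cost)  # (4 + 1 + 9) / 6 ≈ 2.333
```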

4. Gradient Descent — How the Model Learns

Gradient Descent is the optimisation algorithm used to find the β values that minimise the cost function J(β). It works by iteratively nudging the parameters in the direction that reduces the cost the most.

Intuitively, imagine the cost function as a hilly landscape. Your current position is a set of β values, and your height is the cost. Gradient descent finds the downhill direction and takes a step. Repeat until you reach the valley (minimum cost).

Update Rules

β₀ := β₀ − α × (∂J / ∂β₀)

β₁ := β₁ − α × (∂J / ∂β₁)

After computing the partial derivatives:

β₀ := β₀ − (α/m) × Σ(ŷᵢ − yᵢ)

β₁ := β₁ − (α/m) × Σ(ŷᵢ − yᵢ) × xᵢ

Symbol    | Meaning
α (alpha) | Learning rate — controls step size. Too large: overshoots. Too small: slow convergence.
:=        | Update (assignment) — both parameters must be updated simultaneously
∂J/∂β     | Partial derivative of cost w.r.t. parameter — gives the slope of the cost function

These updates are repeated for a set number of iterations (epochs) or until the cost stops decreasing significantly.
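The update rules above can be sketched in a few lines of NumPy, here run on the study-hours data from the worked example. The zero initialisation, learning rate, and iteration count are illustrative choices, not prescriptions:

```python
import numpy as np

# Worked-example data: hours studied vs exam marks
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([40.0, 50.0, 60.0, 70.0, 80.0])
m = len(x)

b0, b1 = 0.0, 0.0   # start both parameters at zero
alpha = 0.05        # learning rate (illustrative; tune per problem)

for _ in range(20_000):
    y_hat = b0 + b1 * x
    error = y_hat - y
    # Compute both gradients first, then update simultaneously
    grad0 = error.sum() / m
    grad1 = (error * x).sum() / m
    b0 -= alpha * grad0
    b1 -= alpha * grad1

print(round(b0, 2), round(b1, 2))  # converges to ≈ 30.0 and 10.0
```

Note that both gradients are computed before either parameter is updated — updating β₀ first and then using the new β₀ to compute β₁'s gradient is a common bug.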

5. Ordinary Least Squares (OLS) — Closed Form Solution

For simple linear regression, you can solve for β directly using the Ordinary Least Squares (OLS) formula:

β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

β₀ = ȳ − β₁ × x̄

Where x̄ is the mean of x values and ȳ is the mean of y values.

In matrix form for multiple regression: β = (XᵀX)⁻¹ Xᵀy

OLS gives the exact solution in one step — no iterations needed. However, for very large datasets with many features, computing (XᵀX)⁻¹ is computationally expensive (O(n³)), which is why gradient descent is preferred for large-scale problems.
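A minimal sketch of the matrix-form solution on the worked-example data (using `np.linalg.solve` on the normal equations rather than explicitly inverting XᵀX, which is the numerically safer habit):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([40.0, 50.0, 60.0, 70.0, 80.0])

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Normal equation: (X^T X) beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # [30. 10.]  ->  beta0 = 30, beta1 = 10
```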

6. Worked Numerical Example

Problem: A student collected data on hours studied (x) and exam marks (y) for 5 students:

Student | Hours Studied (x) | Marks Scored (y)
1       | 1                 | 40
2       | 2                 | 50
3       | 3                 | 60
4       | 4                 | 70
5       | 5                 | 80

Step 1 — Compute means:

x̄ = (1+2+3+4+5)/5 = 15/5 = 3

ȳ = (40+50+60+70+80)/5 = 300/5 = 60

Step 2 — Compute β₁ (slope):

xᵢ  | yᵢ | (xᵢ − x̄) | (yᵢ − ȳ) | (xᵢ − x̄)(yᵢ − ȳ) | (xᵢ − x̄)²
1   | 40 | −2       | −20      | 40               | 4
2   | 50 | −1       | −10      | 10               | 1
3   | 60 | 0        | 0        | 0                | 0
4   | 70 | 1        | 10       | 10               | 1
5   | 80 | 2        | 20       | 40               | 4
Sum |    |          |          | 100              | 10

β₁ = 100 / 10 = 10

Step 3 — Compute β₀ (intercept):

β₀ = ȳ − β₁ × x̄ = 60 − 10 × 3 = 60 − 30 = 30

Step 4 — Final Model:

ŷ = 30 + 10x

Interpretation: For every additional hour of study, the predicted score increases by 10 marks. A student who studies 0 hours is predicted to score 30 (the base score).

Prediction: If a student studies 6 hours: ŷ = 30 + 10 × 6 = 90 marks
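The whole hand calculation can be checked with a few lines of plain Python:

```python
x = [1, 2, 3, 4, 5]
y = [40, 50, 60, 70, 80]

x_bar = sum(x) / len(x)  # 3.0
y_bar = sum(y) / len(y)  # 60.0

# Numerator and denominator of the beta1 formula
num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 100.0
den = sum((xi - x_bar) ** 2 for xi in x)                        # 10.0

b1 = num / den           # 10.0
b0 = y_bar - b1 * x_bar  # 30.0
print(b0 + b1 * 6)       # predicted marks for 6 hours -> 90.0
```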

7. Key Assumptions of Linear Regression

  • Linearity: The relationship between x and y must be approximately linear.
  • Independence: Each training example must be independent of the others (no autocorrelation).
  • Homoscedasticity: The variance of errors must be approximately constant across all values of x.
  • Normality of errors: The residuals should be approximately normally distributed.
  • No multicollinearity: In multiple linear regression, the input features should not be highly correlated with each other.

Violations do not always mean the model is useless, but they affect the reliability of predictions and statistical inference. Always check residual plots after fitting a model.

8. Python Code — Linear Regression with Scikit-learn

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Training data
X = np.array([[1], [2], [3], [4], [5]])  # Hours studied
y = np.array([40, 50, 60, 70, 80])       # Exam marks

# Create and train the model
model = LinearRegression()
model.fit(X, y)

# Model parameters
print(f"Intercept (β₀): {model.intercept_}")   # ≈ 30.0
print(f"Slope (β₁):     {model.coef_[0]}")     # ≈ 10.0

# Predict for a student who studied 6 hours
prediction = model.predict([[6]])
print(f"Predicted score for 6 hours: {prediction[0]}")  # ≈ 90.0

# Model evaluation on the training data
y_pred = model.predict(X)
print(f"MSE:  {mean_squared_error(y, y_pred)}")  # ≈ 0.0 for this perfectly linear data
print(f"R²:   {r2_score(y, y_pred)}")            # ≈ 1.0 (perfect fit)

9. Common Mistakes Students Make

  • Using linear regression for non-linear data: If your scatter plot shows a curve, linear regression will give poor results. Use polynomial regression or a non-linear model instead.
  • Not scaling features: When features have very different scales, gradient descent converges slowly. Always normalise or standardise features before training.
  • Ignoring outliers: Linear regression is sensitive to outliers because it squares the errors. A single extreme data point can pull the regression line significantly off.
  • Confusing correlation with causation: A high R² score means the model fits the data well, but it does not prove that x causes y.
  • Extrapolating too far: The model is only reliable within the range of training data.
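For the feature-scaling point above, a quick sketch with scikit-learn's StandardScaler (the feature values are made up; columns represent features on very different scales):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: study hours (0-10) vs family income (tens of thousands)
X = np.array([
    [1.0, 20000.0],
    [3.0, 45000.0],
    [5.0, 80000.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean ~0 and unit variance, so gradient descent
# takes comparably sized steps along every parameter direction
print(X_scaled.mean(axis=0))  # ≈ [0. 0.]
print(X_scaled.std(axis=0))   # ≈ [1. 1.]
```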

10. Frequently Asked Questions

What is the formula for linear regression?

Simple: ŷ = β₀ + β₁x. Multiple: ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ. The goal of training is to find the β values that best fit the data.

What is R² in linear regression?

R² (R-squared) is the coefficient of determination. It measures what proportion of the variance in y is explained by the model. R² = 1 means a perfect fit; R² = 0 means the model explains nothing.
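R² can be computed directly from the residuals. A short sketch with made-up predictions:

```python
import numpy as np

y_true = np.array([40.0, 50.0, 60.0, 70.0, 80.0])
y_pred = np.array([42.0, 48.0, 61.0, 69.0, 80.0])  # made-up model output

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares: 10
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares: 1000
r2 = 1 - ss_res / ss_tot
print(r2)  # 0.99 -> the model explains 99% of the variance in y
```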

What is the difference between linear regression and logistic regression?

Linear regression predicts a continuous numerical value. Logistic regression predicts the probability of a class — it is a classification algorithm that uses the sigmoid function to bound predictions between 0 and 1.

What is the learning rate in gradient descent?

The learning rate (α) controls the size of each step gradient descent takes toward the minimum. A large α can overshoot; a small α is slow. Typical starting values are 0.01 or 0.001.
