
Linear Regression — A Complete Deep Dive | ML Series Part 2

Linear Regression explained from scratch — math, types, implementation, evaluation metrics, regularization, and real-world projects in Python. Beginner to advanced.

1. Introduction — What is Linear Regression? {#introduction}

Let's start with a real-world scenario.

Imagine you're a real estate agent. You have the following data:

  • 1000 sqft → $200,000

  • 1500 sqft → $300,000

  • 2000 sqft → $400,000

Now someone asks: "What would a 2500 sqft house cost?"

Your brain immediately finds the pattern: "$200 per sqft" → 2500 × 200 = $500,000.

That's exactly what Linear Regression does — it predicts continuous numerical values based on a straight-line (linear) relationship.

Definition

Linear Regression = A mathematical model that finds the linear relationship between variables and uses it to predict future values.
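Before we get to the full sklearn workflow later in the post, here's a minimal sketch of that definition using NumPy's `polyfit` on the three houses from the scenario above (the data is the toy example, not a real dataset):

```python
import numpy as np

# The three houses from the intro: area (sqft) -> price ($)
areas  = np.array([1000, 1500, 2000])
prices = np.array([200_000, 300_000, 400_000])

# Fit a degree-1 polynomial: price = slope * area + intercept
slope, intercept = np.polyfit(areas, prices, deg=1)
print(f"Slope: ${slope:.0f}/sqft")

# Predict the 2500 sqft house
print(f"2500 sqft -> ${slope * 2500 + intercept:,.0f}")  # -> $500,000
```

The fitted slope is the "$200 per sqft" pattern your brain found instantly — the model just finds it numerically.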

Why Is It Important?

  • Simplest ML algorithm — easy to understand and explain

  • Highly interpretable — "What does increasing area by 1 sqft do to the price?" — you get a direct answer

  • Foundation — the base concept behind many other ML algorithms

  • Fast — trains quickly even on large datasets

  • Universal — used in Economics, Finance, Science, Engineering, and more

Real-World Applications

| Industry | Use Case | Predicts |
|---|---|---|
| 🏠 Real Estate | House pricing | Property value |
| 📈 Finance | Stock analysis | Future price |
| 🌡️ Science | Climate | Temperature trends |
| 🏭 Manufacturing | Quality control | Defect rate |
| 🏥 Healthcare | Dosage | Drug effectiveness |
| 🚗 Automotive | Fuel efficiency | MPG from weight |
| 📣 Marketing | Ad spend | Sales revenue |
| 🌾 Agriculture | Yield prediction | Crop output |


2. 🧮 The Math — Explained Simply {#math-behind}

Don't worry! We'll treat the math like a story.

Simple Linear Regression (1 Feature)

Remember the straight-line equation from school?

y = mx + c

Where:
y = Output (what we want to predict)
x = Input (the feature)
m = Slope (steepness of the line)
c = Intercept (where the line starts)

In ML, the same equation is written differently:

ŷ = β₀ + β₁x₁

Where:
ŷ  = Predicted value
β₀ = Intercept (bias)
β₁ = Coefficient (slope) for feature x₁
x₁ = Feature value

Real Example:

Price = β₀ + β₁ × Area

If the model learns:
β₀ = 50,000   (base price)
β₁ = 150      (price per sqft)

Then:
1000 sqft → Price = 50,000 + 150×1000 = $200,000
1500 sqft → Price = 50,000 + 150×1500 = $275,000
2500 sqft → Price = 50,000 + 150×2500 = $425,000

Multiple Linear Regression (Multiple Features)

ŷ = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ... + βₙxₙ

Example:
Price = β₀ + β₁×Area + β₂×Bedrooms + β₃×Age + β₄×Distance

If the model learns β₃ and β₄ as negative values (older, more distant homes cost less):
Price = 20,000 + 150×Area + 8,000×Bedrooms - 2,000×Age - 5,000×Distance
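To see the multi-feature equation in action, here's a quick sanity check with a hypothetical listing (the house's numbers below are made up for illustration):

```python
# Coefficients from the learned equation above (price in $)
beta_0, b_area, b_bed, b_age, b_dist = 20_000, 150, 8_000, -2_000, -5_000

# A hypothetical listing: 1800 sqft, 3 bedrooms, 10 years old, 4 km out
area, bedrooms, age, distance = 1800, 3, 10, 4

price = beta_0 + b_area * area + b_bed * bedrooms + b_age * age + b_dist * distance
print(f"Predicted price: ${price:,}")  # -> $274,000
```

Each term is just "coefficient × feature value", summed — that's all a multiple linear regression prediction is.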

Loss Function — How Does the Model Learn?

Learning = finding the best β₀ and β₁ that make predictions closest to actual values.

For this, we use MSE (Mean Squared Error):

MSE = (1/n) × Σ(yᵢ - ŷᵢ)²

Meaning: Average squared difference between actual and predicted values.

Minimize MSE → Find the best-fit line
Python
# Manually computing MSE
import numpy as np

y_actual    = np.array([200, 300, 400, 500, 600])   # Actual prices ($k)
y_predicted = np.array([210, 290, 390, 510, 590])   # Model predictions

errors         = y_actual - y_predicted
squared_errors = errors ** 2
mse            = squared_errors.mean()

print(f"Errors: {errors}")
print(f"Squared Errors: {squared_errors}")
print(f"MSE: {mse:.2f}")
# Lower MSE = better model!

Gradient Descent — The Optimization Magic

How does the model find β₀ and β₁? With Gradient Descent!

Analogy: You're on a dark hillside and need to reach the valley.
- At each step, feel which direction the ground slopes downward
- Take a small step in that direction
- Repeat until you reach the bottom (minimum loss)
Python
# Gradient Descent from scratch
import numpy as np

X = np.array([1000, 1500, 2000, 2500, 3000])  # Area
y = np.array([200, 300, 400, 500, 600])         # Price ($k)

# Initialise parameters
beta_0 = 0   # Intercept
beta_1 = 0   # Slope
lr     = 0.000001  # Learning rate (step size)
n      = len(X)
epochs = 10000

for epoch in range(epochs):
    y_pred = beta_0 + beta_1 * X

    errors = y - y_pred

    # Compute gradients
    d_beta_0 = (-2/n) * errors.sum()
    d_beta_1 = (-2/n) * (errors * X).sum()

    # Update parameters
    beta_0 -= lr * d_beta_0
    beta_1 -= lr * d_beta_1

print(f"Learned Intercept (β₀): {beta_0:.4f}")
print(f"Learned Slope (β₁): {beta_1:.6f}")
print(f"\nPrediction for 2500 sqft: ${beta_0 + beta_1 * 2500:.2f}k")

3. 🌿 Types of Linear Regression {#types}

Linear Regression
├── Simple Linear Regression      (1 feature)
├── Multiple Linear Regression    (multiple features)
├── Polynomial Regression         (curved relationship)
├── Ridge Regression              (L2 regularization)
├── Lasso Regression              (L1 regularization)
└── ElasticNet Regression         (L1 + L2 combined)

Type 1: Simple Linear Regression

Python
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1000], [1500], [2000], [2500], [3000]])  # Area
y = np.array([200, 300, 400, 500, 600])                  # Price ($k)

model = LinearRegression()
model.fit(X, y)

print(f"Intercept (β₀): {model.intercept_:.4f}")
print(f"Slope (β₁): {model.coef_[0]:.6f}")
print(f"Prediction for 2200 sqft: ${model.predict([[2200]])[0]:.2f}k")

Type 2: Multiple Linear Regression

Python
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({
    'area':       [1000, 1500, 2000, 2500, 3000],
    'bedrooms':   [2, 3, 3, 4, 5],
    'age_years':  [10, 5, 15, 2, 8],
    'price_k':    [200, 320, 360, 560, 620]
})

X = data[['area', 'bedrooms', 'age_years']]
y = data['price_k']

model = LinearRegression()
model.fit(X, y)

print("Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"  {feature}: {coef:.4f}")
print(f"  Intercept: {model.intercept_:.4f}")

# Interpretation (price_k is in $k, so coefficients are in $k too):
# area:      0.18 → each extra sqft adds ~$180 to the price
# bedrooms:  25   → each extra bedroom adds ~$25,000
# age_years: -5   → each year of age reduces the price by ~$5,000

Type 3: Polynomial Regression (Curved Relationship)

Python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Data that doesn't fit a straight line (curved relationship)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([2.5, 5.5, 11, 19, 30, 44, 60, 79, 101, 126])  # Quadratic!

# ❌ Simple linear — poor fit
linear = LinearRegression()
linear.fit(X, y)
print(f"Linear R²: {linear.score(X, y):.4f}")

# ✅ Polynomial — good fit
poly = PolynomialFeatures(degree=2)  # Creates x, x², etc.
X_poly = poly.fit_transform(X)

poly_model = LinearRegression()
poly_model.fit(X_poly, y)
print(f"Polynomial R²: {poly_model.score(X_poly, y):.4f}")

X_new = np.array([[11]])
X_new_poly = poly.transform(X_new)
print(f"Prediction for x=11: {poly_model.predict(X_new_poly)[0]:.2f}")

Type 4: Ridge Regression (L2 Regularization)

Python
from sklearn.linear_model import Ridge

# Ridge: Prevents coefficients from growing too large
# alpha = regularization strength (higher = more regularization)
ridge = Ridge(alpha=1.0)  # alpha=0 → same as LinearRegression
ridge.fit(X_train, y_train)  # assumes X_train, y_train from an earlier train_test_split

# When to use:
# - Features are highly correlated (multicollinearity)
# - Model is overfitting
# - All features seem important

Type 5: Lasso Regression (L1 Regularization)

Python
from sklearn.linear_model import Lasso

# Lasso: Sets some coefficients exactly to 0 (built-in feature selection!)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# When to use:
# - You need automatic feature selection
# - Many features exist, some are irrelevant
# - You want a sparse, interpretable model

# Check which features survived (feature_names = your training column names)
selected = [(f, c) for f, c in zip(feature_names, lasso.coef_) if c != 0]
print("Selected features:", selected)

Type 6: ElasticNet (Best of Both Worlds)

Python
from sklearn.linear_model import ElasticNet

# L1 + L2 combined
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio: 0=Ridge, 1=Lasso
elastic.fit(X_train, y_train)

# When to use:
# - When neither Ridge nor Lasso alone gives great results
# - When you want grouping behaviour for correlated features

4. ✅ Assumptions — When Does It Work? {#assumptions}

Linear Regression performs well only when these conditions hold.

Assumption 1: Linearity

There should be a linear relationship between X and y.

Python
import matplotlib.pyplot as plt

# Check: plot a scatter chart
plt.figure(figsize=(8, 5))
plt.scatter(X, y, alpha=0.5)
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Linearity Check')
plt.show()

# If a curved pattern appears → use Polynomial Regression

Assumption 2: No Multicollinearity

Features should not be highly correlated with each other.

Python
import seaborn as sns

# Correlation heatmap
corr_matrix = df[feature_columns].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix — Multicollinearity Check')
plt.show()

# VIF (Variance Inflation Factor)
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_data = pd.DataFrame()
vif_data['Feature'] = feature_columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(len(feature_columns))]
print(vif_data)
# VIF > 10 → Multicollinearity problem!

Assumption 3: Homoscedasticity

Residual variance should be constant (not funnel-shaped).

Python
y_pred    = model.predict(X)
residuals = y - y_pred

plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.show()

# Funnel shape → Heteroscedasticity → Try log or sqrt transform on y

Assumption 4: Normality of Residuals

Python
import scipy.stats as stats

stats.probplot(residuals, plot=plt)
plt.title('Q-Q Plot — Normality Check')
plt.show()

stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.4f}")
# p > 0.05 → Normal distribution ✅

5. 🛠️ Implementation — Step by Step {#implementation}

Python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# ── 1. Generate Data ───────────────────────────────────────
np.random.seed(42)
n = 500

df = pd.DataFrame({
    'area_sqft':     np.random.randint(500, 5000, n),
    'bedrooms':      np.random.randint(1, 6, n),
    'bathrooms':     np.random.randint(1, 4, n),
    'age_years':     np.random.randint(0, 40, n),
    'floor':         np.random.randint(1, 20, n),
    'parking':       np.random.randint(0, 3, n),
    'metro_dist_km': np.random.uniform(0.5, 15, n),
})

# Realistic pricing formula
df['price_k'] = (
    df['area_sqft'] * 150 +
    df['bedrooms'] * 15000 +
    df['bathrooms'] * 10000 -
    df['age_years'] * 2000 -
    df['metro_dist_km'] * 5000 +
    df['parking'] * 8000 +
    np.random.normal(0, 20000, n)
) / 1000

print("📦 Dataset Shape:", df.shape)
print("\n📊 Price Statistics (in $k):")
print(df['price_k'].describe())

# ── 2. Features & Target ───────────────────────────────────
feature_cols = ['area_sqft', 'bedrooms', 'bathrooms', 'age_years',
                'floor', 'parking', 'metro_dist_km']
X = df[feature_cols]
y = df['price_k']

# ── 3. Train-Test Split ────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"\n🔀 Train: {X_train.shape[0]} | Test: {X_test.shape[0]}")

# ── 4. Feature Scaling ─────────────────────────────────────
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

# ── 5. Train ───────────────────────────────────────────────
model = LinearRegression()
model.fit(X_train_sc, y_train)

# ── 6. Predict ─────────────────────────────────────────────
y_pred = model.predict(X_test_sc)

# ── 7. Evaluate ────────────────────────────────────────────
mae  = mean_absolute_error(y_test, y_pred)
mse  = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2   = r2_score(y_test, y_pred)

print("\n📊 Model Performance:")
print(f"  MAE  (Mean Absolute Error):  ${mae:.2f}k")
print(f"  MSE  (Mean Squared Error):   {mse:.4f}")
print(f"  RMSE (Root MSE):             ${rmse:.2f}k")
print(f"  R²   (R-squared):            {r2:.4f} ({r2*100:.2f}%)")

# ── 8. Feature Coefficients ────────────────────────────────
coef_df = pd.DataFrame({
    'Feature': feature_cols,
    'Coefficient': model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print("\n🔑 Feature Coefficients:")
print(coef_df.to_string(index=False))
print(f"\n  Intercept: {model.intercept_:.4f}")

6. 🔧 All Important Functions & Parameters {#functions}

LinearRegression() Parameters

Python
from sklearn.linear_model import LinearRegression

model = LinearRegression(
    fit_intercept=True,    # Calculate intercept? Default: True
    copy_X=True,           # Copy X during training? Default: True
    n_jobs=None,           # CPU cores to use (-1 = all)
    positive=False         # Force positive coefficients? Default: False
)

Key Attributes After Fitting

Python
model.fit(X_train, y_train)

print(model.coef_)             # [β₁, β₂, ...] — one per feature
print(model.intercept_)        # β₀ — the intercept
print(model.n_features_in_)    # Number of features used in training
print(model.feature_names_in_) # Feature names (if DataFrame was used)

Key Methods

Python
model.fit(X_train, y_train)          # Train the model
y_pred = model.predict(X_test)       # Make predictions
r2 = model.score(X_test, y_test)     # R² score
print(model.get_params())            # View current parameters
model.set_params(fit_intercept=False) # Change parameters

Ridge, Lasso, ElasticNet Key Parameters

Python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(
    alpha=1.0,           # Regularization strength
    fit_intercept=True,
    solver='auto',       # 'auto', 'svd', 'cholesky', 'lsqr', 'saga'
    max_iter=None,
    tol=1e-4,
    random_state=None
)

lasso = Lasso(
    alpha=0.1,
    fit_intercept=True,
    max_iter=1000,
    warm_start=False,
    selection='cyclic'   # 'cyclic' or 'random'
)

elastic = ElasticNet(
    alpha=0.1,
    l1_ratio=0.5,        # 0 = Ridge, 1 = Lasso
    fit_intercept=True,
    max_iter=1000
)

7. 📊 Evaluation Metrics {#evaluation}

Metric 1: MAE (Mean Absolute Error)

Python
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.4f}")

# Average absolute error
# Interpretable, robust to outliers
# Same units as y

Metric 2: MSE (Mean Squared Error)

Python
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse:.4f}")

# Large errors penalised heavily (squared)
# Units are y² — less interpretable
# Commonly used during optimisation

Metric 3: RMSE

Python
import numpy as np
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.4f}")

# Square root of MSE → same units as y
# Sensitive to outliers
# Most commonly reported metric

Metric 4: R² Score

Python
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.4f}")

# R² = 0.85 → Model explains 85% of variance
# R² = 1.0  → Perfect fit
# R² = 0.0  → No better than predicting the mean
# R² < 0    → Worse than predicting the mean!

# ⚠️ Limitation: Keeps increasing as you add features
# Solution: Use Adjusted R²

Metric 5: Adjusted R²

Python
def adjusted_r2(r2, n, p):
    """
    r2 = R² score
    n  = number of samples
    p  = number of features
    """
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = len(y_test)
p = X_test.shape[1]
adj_r2 = adjusted_r2(r2, n, p)
print(f"Adjusted R²: {adj_r2:.4f}")

# Penalises adding useless features
# Better metric for model comparison

Metrics Summary Table

| Metric | Formula | Range | Better When | Units |
|---|---|---|---|---|
| MAE | mean(\|y - ŷ\|) | [0, ∞) | Lower | Same as y |
| MSE | mean((y - ŷ)²) | [0, ∞) | Lower | Squared units of y |
| RMSE | √MSE | [0, ∞) | Lower | Same as y |
| R² | 1 - SS_res/SS_tot | (-∞, 1] | Closer to 1 | Unitless |
| Adj R² | Modified R² | (-∞, 1] | Closer to 1 | Unitless |


8. 🌊 Intermediate Concepts {#intermediate}

Hyperparameter Tuning with Cross-Validation

Python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge())
])

param_grid = {'model__alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"Best Alpha: {grid_search.best_params_}")
print(f"Best CV R²: {grid_search.best_score_:.4f}")
print(f"Test R²: {grid_search.score(X_test, y_test):.4f}")

Learning Curves — Diagnosing Overfitting/Underfitting

Python
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np

def plot_learning_curve(model, X, y, title="Learning Curve"):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5,
        scoring='r2',
        n_jobs=-1
    )

    train_mean = train_scores.mean(axis=1)
    val_mean   = val_scores.mean(axis=1)

    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
    plt.plot(train_sizes, val_mean,   'o-', color='red',  label='Validation Score')
    plt.xlabel('Training Size')
    plt.ylabel('R² Score')
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.show()

    final_gap = train_mean[-1] - val_mean[-1]
    print(f"\nDiagnosis:")
    print(f"Train R²: {train_mean[-1]:.4f}")
    print(f"Val   R²: {val_mean[-1]:.4f}")
    print(f"Gap:      {final_gap:.4f}")
    if final_gap > 0.1:
        print("⚠️ Possible OVERFITTING — Try regularization or more data")
    elif val_mean[-1] < 0.6:
        print("⚠️ Possible UNDERFITTING — Try more features or a complex model")
    else:
        print("✅ Good fit!")

from sklearn.linear_model import LinearRegression
plot_learning_curve(LinearRegression(), X, y)

Feature Engineering for Regression

Python
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({
    'area':      [1000, 1500, 2000, 2500, 3000],
    'bedrooms':  [2, 3, 3, 4, 4],
    'age':       [5, 10, 15, 3, 20],
    'metro_dist':[2, 5, 3, 1, 8],
})

# 1. Interaction features
df['area_per_bedroom'] = df['area'] / df['bedrooms']
df['area_x_metro']     = df['area'] * df['metro_dist']

# 2. Log transform (for skewed data)
df['log_area']  = np.log1p(df['area'])
df['log_metro'] = np.log1p(df['metro_dist'])

# 3. Binning
df['age_group'] = pd.cut(df['age'], bins=[0, 5, 15, 100],
                          labels=['New', 'Mid', 'Old'])

# 4. Automated polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['area', 'bedrooms']])
print("Polynomial Features:", poly.get_feature_names_out(['area', 'bedrooms']))
# ['area', 'bedrooms', 'area^2', 'area bedrooms', 'bedrooms^2']

9. 🚀 Advanced Concepts {#advanced}

Regularization — Deeper Look

Python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 200

# Introduce multicollinearity — X2 ≈ X1
X1 = np.random.randn(n)
X2 = X1 + np.random.randn(n) * 0.1  # Almost identical to X1!
X3 = np.random.randn(n)
X4 = np.random.randn(n)  # Noise
X5 = np.random.randn(n)  # Noise

X = np.column_stack([X1, X2, X3, X4, X5])
y = 3*X1 + 2*X3 + np.random.randn(n) * 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

models = {
    'Linear':         LinearRegression(),
    'Ridge(α=0.1)':   Ridge(alpha=0.1),
    'Ridge(α=10)':    Ridge(alpha=10),
    'Lasso(α=0.1)':   Lasso(alpha=0.1),
    'Lasso(α=1.0)':   Lasso(alpha=1.0),
    'ElasticNet':      ElasticNet(alpha=0.1, l1_ratio=0.5)
}

print(f"{'Model':<20} {'Train R²':>10} {'Test R²':>10} {'Coefficients':>40}")
print("-" * 85)

for name, m in models.items():
    m.fit(X_train_sc, y_train)
    train_r2 = m.score(X_train_sc, y_train)
    test_r2  = m.score(X_test_sc, y_test)
    coefs    = [f"{c:.3f}" for c in m.coef_]
    print(f"{name:<20} {train_r2:>10.4f} {test_r2:>10.4f} {str(coefs):>40}")

RidgeCV and LassoCV — Automatic Alpha Selection

Python
from sklearn.linear_model import RidgeCV, LassoCV
import numpy as np

alphas = np.logspace(-4, 4, 50)  # 0.0001 to 10000

ridge_cv = RidgeCV(alphas=alphas, cv=5, scoring='r2')
ridge_cv.fit(X_train_sc, y_train)
print(f"Ridge Best Alpha: {ridge_cv.alpha_:.4f}")
print(f"Ridge Test R²:   {ridge_cv.score(X_test_sc, y_test):.4f}")

lasso_cv = LassoCV(alphas=alphas, cv=5, n_jobs=-1, random_state=42)
lasso_cv.fit(X_train_sc, y_train)
print(f"\nLasso Best Alpha:     {lasso_cv.alpha_:.4f}")
print(f"Lasso Test R²:        {lasso_cv.score(X_test_sc, y_test):.4f}")
print(f"Non-zero features:    {(lasso_cv.coef_ != 0).sum()} / {X_train.shape[1]}")

Statsmodels — Statistical Analysis

Python
import statsmodels.api as sm

X_with_const = sm.add_constant(X_train_sc)

ols_model = sm.OLS(y_train, X_with_const)
results   = ols_model.fit()

print(results.summary())

# Key outputs:
print("\nCoefficients:")
print(results.params)

print("\nP-values (< 0.05 = statistically significant):")
print(results.pvalues)

print("\n95% Confidence Intervals:")
print(results.conf_int())

10. 🌐 Real-World Use Cases {#real-world}

Use Case 1: Sales Forecasting

Python
"""
Problem: Predict monthly sales based on ad spend and season
Business: FMCG company
"""

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

np.random.seed(42)
n = 120  # 10 years of monthly data

months = pd.date_range('2014-01', periods=n, freq='ME')
data = pd.DataFrame({
    'date':              months,
    'ad_spend_k':        np.random.uniform(50, 500, n),
    'season':            np.where(months.month.isin([10, 11, 12, 1, 2]), 'peak', 'normal'),
    'competitors_count': np.random.randint(2, 8, n),
    'discount_pct':      np.random.uniform(0, 30, n),
})

data['sales_k'] = (
    data['ad_spend_k'] * 0.8 +
    (data['season'] == 'peak').astype(int) * 200 -
    data['competitors_count'] * 30 +
    data['discount_pct'] * 5 +
    np.random.normal(0, 50, n)
)

numeric_cols     = ['ad_spend_k', 'competitors_count', 'discount_pct']
categorical_cols = ['season']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(drop='first'), categorical_cols)
])

pipeline = Pipeline([('prep', preprocessor), ('model', LinearRegression())])

X = data[numeric_cols + categorical_cols]
y = data['sales_k']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)

print(f"✅ Sales Forecast Model R²: {pipeline.score(X_test, y_test):.4f}")

# Predict next month
next_month = pd.DataFrame({
    'ad_spend_k': [350], 'competitors_count': [4],
    'discount_pct': [15], 'season': ['peak']
})
print(f"🎯 Predicted Next Month Sales: ${pipeline.predict(next_month)[0]:.2f}k")

Use Case 2: Student Performance Prediction

Python
"""
Problem: Predict a student's final exam score
Business: EdTech platform
"""

import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

np.random.seed(42)
n = 1000

df = pd.DataFrame({
    'daily_study_hours':  np.random.uniform(1, 10, n),
    'attendance_pct':     np.random.uniform(50, 100, n),
    'assignments_done':   np.random.uniform(0, 100, n),
    'mock_score_avg':     np.random.uniform(30, 95, n),
    'sleep_hours':        np.random.uniform(4, 9, n),
    'online_resources':   np.random.uniform(0, 5, n),
})

df['final_score'] = np.clip(
    df['daily_study_hours'] * 4 +
    df['attendance_pct'] * 0.3 +
    df['assignments_done'] * 0.2 +
    df['mock_score_avg'] * 0.5 +
    df['sleep_hours'] * 1.5 +
    df['online_resources'] * 2 +
    np.random.normal(0, 5, n),
    0, 100
)

X = df.drop('final_score', axis=1)
y = df['final_score']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([('scaler', StandardScaler()), ('model', Ridge(alpha=1.0))])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(f"📚 Student Score Predictor:")
print(f"  R²:  {r2_score(y_test, y_pred):.4f}")
print(f"  MAE: {mean_absolute_error(y_test, y_pred):.2f} marks")

11. 💻 Practical Examples — Complete Code {#practical}

Full Project: California Housing Price Prediction

Python
# ============================================================
# HOUSE PRICE PREDICTION — COMPLETE ML PROJECT
# ============================================================

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# ── Load Data ──────────────────────────────────────────────
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['Price'] = housing.target  # Median house price (in $100k)

print("🏠 California Housing Dataset")
print(f"Shape: {df.shape}")
print(f"Features: {list(housing.feature_names)}")
print(f"\nPrice Range: ${df['Price'].min()*100:.0f}k — ${df['Price'].max()*100:.0f}k")  # target is in $100k units

# ── Feature Engineering ────────────────────────────────────
df['RoomsPerHouse']     = df['AveRooms'] / df['AveOccup']
df['BedroomRatio']      = df['AveBedrms'] / df['AveRooms']
df['PopulationDensity'] = df['Population'] / df['AveOccup']

# ── Train-Test Split ───────────────────────────────────────
feature_cols = [c for c in df.columns if c != 'Price']
X = df[feature_cols]
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ── Compare Models ─────────────────────────────────────────
models = {
    'Linear Regression': LinearRegression(),
    'Ridge (α=1)':       Ridge(alpha=1),
    'Ridge (α=10)':      Ridge(alpha=10),
    'Lasso (α=0.01)':    Lasso(alpha=0.01),
}

results = []
for name, m in models.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('model', m)])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)

    results.append({
        'Model':    name,
        'Train R²': pipe.score(X_train, y_train),
        'Test R²':  pipe.score(X_test, y_test),
        'RMSE':     np.sqrt(mean_squared_error(y_test, y_pred)),
        'MAE':      mean_absolute_error(y_test, y_pred)
    })

results_df = pd.DataFrame(results).round(4)
print("\n🏆 Model Comparison:")
print(results_df.to_string(index=False))

# ── Best Model Analysis ────────────────────────────────────
best_pipe = Pipeline([('scaler', StandardScaler()), ('model', Ridge(alpha=1))])
best_pipe.fit(X_train, y_train)

y_pred = best_pipe.predict(X_test)
r2  = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f"\n✅ Best Model: Ridge (α=1)")
print(f"  R² Score: {r2:.4f} — model explains {r2*100:.1f}% of variance")
print(f"  MAE: ${mae*100:.1f}k average prediction error")  # mae is in $100k units

# ── Feature Importance ─────────────────────────────────────
ridge_model = best_pipe.named_steps['model']
coef_df = pd.DataFrame({
    'Feature':     feature_cols,
    'Coefficient': ridge_model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print("\n🔑 Feature Importance:")
print(coef_df.to_string(index=False))

12. ⚠️ Edge Cases & Errors {#edge-cases}

Error 1: ValueError — Feature Count Mismatch

Python
# ❌ Error: different number of features at train vs predict time
model.predict(X_different_shape)
# ValueError: X has 5 features, but LinearRegression expects 7

# ✅ Fix: Always use the same features; use Pipelines to enforce this

Error 2: Perfect Multicollinearity

Python
# ❌ Problem: Two features are identical
# The matrix becomes singular → unreliable or failed solutions

# ✅ Fix 1: Remove the duplicate feature
# ✅ Fix 2: Use Ridge which handles multicollinearity gracefully
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=0.1)

Error 3: Skewed Target Variable

Python
import numpy as np

y_skewed = np.array([10000, 15000, 20000, 1000000, 1500000])

# ✅ Fix: Apply log transform
y_log = np.log1p(y_skewed)

model.fit(X_train, np.log1p(y_train))

y_pred_log = model.predict(X_test)
y_pred     = np.expm1(y_pred_log)  # Reverse transform

Error 4: Convergence Warning (Lasso/ElasticNet)

Python
# ❌ ConvergenceWarning: Objective did not converge

# ✅ Fix 1: Increase max_iter
lasso = Lasso(alpha=0.1, max_iter=10000, tol=1e-4)

# ✅ Fix 2: Scale the data first
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X_train)

13. 💡 Pro Tips {#pro-tips}

Python
# ✅ Tip 1: Always use a Pipeline
from sklearn.pipeline import Pipeline
pipe = Pipeline([('scaler', StandardScaler()), ('model', Ridge())])
# Prevents data leakage and keeps code clean

# ✅ Tip 2: Rely on cross-validation, not a single test score
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y, cv=10, scoring='r2')
print(f"CV R²: {scores.mean():.4f} ± {scores.std():.4f}")

# ✅ Tip 3: Permutation importance for model-agnostic feature ranking
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

# ✅ Tip 4: SHAP values for production explainability
import shap
explainer  = shap.LinearExplainer(model, X_train_sc)
shap_values = explainer.shap_values(X_test_sc)
shap.summary_plot(shap_values, X_test, feature_names=feature_cols)

# ✅ Tip 5: Save the complete pipeline
import joblib
joblib.dump(pipe, 'house_price_model_v1.pkl')
loaded = joblib.load('house_price_model_v1.pkl')

# ✅ Tip 6: Always inspect residuals
residuals = y_test - y_pred
# Mean should be ~0
# No pattern should exist in residuals vs fitted values

14. ⚖️ Comparison {#comparison}

Linear Regression vs Other Regression Models

| Model | Complexity | Interpretability | Handles Non-linearity | Overfitting Risk | Speed |
|---|---|---|---|---|---|
| Linear Regression | ⭐ Low | ⭐⭐⭐⭐⭐ Very High | ❌ No | Low | ⚡ Very Fast |
| Ridge/Lasso | ⭐ Low | ⭐⭐⭐⭐ High | ❌ No | Very Low | ⚡ Very Fast |
| Polynomial Reg | ⭐⭐ Medium | ⭐⭐⭐ Medium | ✅ Limited | Medium-High | ⚡ Fast |
| Decision Tree | ⭐⭐ Medium | ⭐⭐⭐ Medium | ✅ Yes | High | ⚡ Fast |
| Random Forest | ⭐⭐⭐ High | ⭐⭐ Low | ✅ Yes | Low | 🐢 Medium |
| XGBoost | ⭐⭐⭐⭐ High | ⭐ Very Low | ✅ Yes | Very Low | 🐢 Medium |
| Neural Network | ⭐⭐⭐⭐⭐ Very High | ⭐ Very Low | ✅ Yes | Variable | 🐌 Slow |

When to Choose Linear Regression?

✅ Use Linear Regression when:
  - You need a simple, interpretable model
  - The relationship is linear
  - You want to understand individual feature impact
  - Dataset is small to medium
  - Building a quick baseline
  - Regulatory / explainability requirements exist

❌ Avoid when:
  - Complex non-linear patterns exist
  - Many outliers are present
  - Data is very high-dimensional (without regularization)
  - Working with image or text data

15. 📊 Data Science Perspective {#data-science}

Python
# Answering a business question with Linear Regression
import statsmodels.api as sm
import pandas as pd
import numpy as np

# Business Question: "How much does ad spend increase our sales?"
np.random.seed(42)  # reproducibility
df = pd.DataFrame({'ad_spend': np.random.uniform(10, 100, 100)})
df['sales'] = df['ad_spend'] * 4.2 + np.random.normal(0, 20, 100)

X_sm    = sm.add_constant(df['ad_spend'])
sm_model = sm.OLS(df['sales'], X_sm).fit()

coef  = sm_model.params['ad_spend']
pval  = sm_model.pvalues['ad_spend']

print(f"💡 Business Insight:")
print(f"Every $1 increase in ad spend drives ${coef:.2f} in additional sales")
print(f"Statistical significance: p={pval:.4f} ({'Significant ✅' if pval < 0.05 else 'Not Significant ❌'})")
print(f"95% CI: {sm_model.conf_int().loc['ad_spend'].values}")

16. 🎤 Interview Questions {#interview-questions}

🟢 Basic

Q1: What does Linear Regression do?

Predicts continuous output values. Finds the best-fit line through the data by minimising MSE.

Q2: What does the slope (coefficient) mean?

The expected change in the target for a 1-unit increase in a feature, holding all other features constant.

Q3: What is the R² score?

The proportion of variance in the target that the model explains. R²=0.85 → 85% explained. 1.0 = perfect; 0 = no better than predicting the mean (and it can go negative for models worse than that).
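A quick way to internalise R² is to compute it by hand and compare against sklearn (toy numbers chosen purely for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))  # both give the same value
```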

Q4: MAE vs RMSE — when to use which?

MAE: interpretable (average absolute error) and robust to outliers. RMSE: penalises large errors more heavily, so use it when big misses are especially costly. When outliers are present and shouldn't dominate the score, prefer MAE.
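A tiny sketch shows the difference: one big miss barely moves MAE but inflates RMSE (synthetic numbers for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100, 102, 98, 101, 99])
y_pred = np.array([101, 103, 97, 100, 140])  # last prediction is a big miss

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"MAE:  {mae:.2f}")   # every error counted equally
print(f"RMSE: {rmse:.2f}")  # the single large miss dominates
```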

Q5: Why use regularization?

To prevent overfitting. Ridge shrinks coefficients. Lasso zeroes out some coefficients (feature selection).

🟡 Intermediate

Q6: What is multicollinearity and how do you fix it?

Plaintext
Problem:  Highly correlated features → unstable coefficients
Detect:   VIF > 10 or high correlation matrix values
Fix:
  - Remove one of the correlated features
  - Apply PCA
  - Use Ridge regression

Q7: What is Gradient Descent? Name the variants.

Plaintext
Gradient Descent: Algorithm that minimises loss iteratively
Types:
  - Batch GD:     Uses all data per step
  - Stochastic GD: Uses one sample per step
  - Mini-batch GD: Uses small batches (most common in practice)
Learning Rate: Too high → diverge; Too low → very slow
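The mini-batch variant above can be sketched in plain NumPy for simple linear regression (synthetic data with a true slope of 3 and intercept of 5; hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 0.5, size=200)

w, b = 0.0, 0.0   # parameters to learn
lr = 0.01         # learning rate: too high diverges, too low crawls
batch_size = 32

for epoch in range(300):
    idx = rng.permutation(len(X))            # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        err = (w * xb + b) - yb
        w -= lr * 2 * np.mean(err * xb)      # ∂MSE/∂w on this mini-batch
        b -= lr * 2 * np.mean(err)           # ∂MSE/∂b on this mini-batch

print(f"learned w={w:.2f}, b={b:.2f}")  # should land near w=3, b=5
```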

Q8: What is heteroscedasticity?

Residual variance is not constant across fitted values. Appears as a funnel shape in residual plots. Fix: log-transform y, or use Weighted Least Squares.

🔴 Advanced

Q9: How are Ridge and Lasso different mathematically?

Plaintext
Ridge: Minimise MSE + α × Σβᵢ²   (L2 norm — squares the coefficients)
Lasso: Minimise MSE + α × Σ|βᵢ|  (L1 norm — absolute value of coefficients)

L1 pushes some coefficients exactly to 0  → Sparse / feature selection
L2 shrinks all coefficients but keeps them non-zero → Dense solution
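You can see the sparsity difference directly (synthetic data; only the first 3 of 10 features carry signal, and the alpha values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(7)
n, p = 200, 10
X = rng.normal(size=(n, p))
# only the first 3 features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(0, 0.5, n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(np.abs(lasso.coef_) < 1e-8))
print("Ridge zero coefs:", n_zero_ridge)  # all shrunk, none removed
print("Lasso zero coefs:", n_zero_lasso)  # irrelevant features pushed to exactly 0
```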

Q10: Is Polynomial Regression still "Linear Regression"?

Yes! You create polynomial features (x, x², x³) and then fit a Linear Regression on them. The model is still linear in its coefficients — hence the name holds.
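A short sketch of exactly that: expand the feature space to [x, x²], then fit a plain LinearRegression on it (synthetic quadratic data):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * x[:, 0] ** 2 + x[:, 0] + rng.normal(0, 0.2, 200)  # quadratic target

# Still LinearRegression — only the input features are expanded
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)
print("R² on quadratic data:", poly_model.score(x, y))
```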

Q11: What is the difference between statistical significance and practical significance?

Plaintext
Statistical Significance: p < 0.05 → Feature is statistically significant
Practical Significance:   Is the coefficient large enough to matter in the real world?

With a large n, tiny and meaningless effects can become statistically significant.
Always check both.

🎭 Scenario-Based

Q12: You deployed a model six months ago and performance has dropped. What happened?

Plaintext
Possible causes:
1. Data Drift:    Input distribution has changed
2. Concept Drift: The relationship between X and y has changed
3. Seasonal effects

Solution:
- Monitor input distributions regularly
- Retrain periodically with fresh data
- Use drift detection tools (Evidently, alibi-detect)
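For a lightweight drift check without extra tooling, a two-sample Kolmogorov-Smirnov test on a single feature is a reasonable sketch (synthetic data; the shift is exaggerated for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(9)
train_feature = rng.normal(loc=50, scale=10, size=1000)  # distribution at training time
live_feature = rng.normal(loc=58, scale=10, size=1000)   # production distribution has shifted

stat, pval = ks_2samp(train_feature, live_feature)
if pval < 0.05:
    print(f"⚠️ Drift detected (KS p={pval:.2e}) — consider retraining")
else:
    print("No significant drift")
```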

17. 🎯 Conclusion {#conclusion}

Key Takeaways

| Topic | Key Learning |
|---|---|
| What | Predict continuous values using a linear relationship |
| Types | Simple, Multiple, Polynomial, Ridge, Lasso, ElasticNet |
| Math | ŷ = β₀ + β₁x₁ + ... + βₙxₙ |
| Loss | Minimise MSE via Gradient Descent or Normal Equation |
| Metrics | MAE, RMSE, R², Adjusted R² |
| Regularization | Ridge=L2, Lasso=L1, ElasticNet=Both |
| Best Practice | Pipeline + Cross-Validation + Feature Scaling |

Series Roadmap

Plaintext
✅ Part 1: ML Introduction
✅ Part 2: Linear Regression (This Blog!)
⏳ Part 3: Logistic Regression — Classification
⏳ Part 4: Decision Trees
⏳ Part 5: Random Forests & Ensemble Methods
⏳ Part 6: Support Vector Machines
⏳ Part 7: K-Means Clustering
⏳ Part 8: Neural Networks

Final Advice

"Master Linear Regression properly — it is the foundation for everything else. Understand regularization here and it will make Ridge/Lasso/dropout in deep learning trivial. Understand MSE here and you'll understand loss functions everywhere."

Do This Now:

  1. ✅ Run the California Housing project code above

  2. ✅ Compare Ridge, Lasso, and ElasticNet yourself

  3. ✅ Use Statsmodels for a full statistical summary

  4. ✅ Try the "House Prices" Kaggle competition


Part 2 of ML A to Z | Next: Logistic Regression!
