1. Introduction — What is Linear Regression? {#introduction}
Let's start with a real-world scenario.
Imagine you're a real estate agent. You have the following data:
1000 sqft → $200,000
1500 sqft → $300,000
2000 sqft → $400,000
Now someone asks: "What would a 2500 sqft house cost?"
Your brain immediately finds the pattern: "$200 per sqft" → 2500 × 200 = $500,000.
That's exactly what Linear Regression does — it predicts continuous numerical values based on a straight-line (linear) relationship.
Definition
Linear Regression = A mathematical model that finds the linear relationship between variables and uses it to predict future values.
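To make the definition concrete, here is a minimal sketch (using the hypothetical numbers from the scenario above) that recovers the $200-per-sqft pattern with NumPy's `polyfit`:

```python
import numpy as np

# The three observed houses from the scenario above
area = np.array([1000, 1500, 2000])         # sqft
price = np.array([200000, 300000, 400000])  # dollars

# Fit a straight line: price ≈ slope * area + intercept
slope, intercept = np.polyfit(area, price, deg=1)
prediction = slope * 2500 + intercept
print(f"Learned slope: ${slope:.2f} per sqft")
print(f"Prediction for 2500 sqft: ${prediction:,.0f}")
```

This is exactly the "find the pattern, then extrapolate" step your brain did, just written down.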
Why Is It Important?
Simplest ML algorithm — easy to understand and explain
Highly interpretable — "What does increasing area by 1 sqft do to the price?" — you get a direct answer
Foundation — the base concept behind many other ML algorithms
Fast — trains quickly even on large datasets
Universal — used in Economics, Finance, Science, Engineering, and more
Real-World Applications
Industry | Use Case | Predicts |
|---|---|---|
🏠 Real Estate | House pricing | Property value |
📈 Finance | Stock analysis | Future price |
🌡️ Science | Climate | Temperature trends |
🏭 Manufacturing | Quality control | Defect rate |
🏥 Healthcare | Dosage | Drug effectiveness |
🚗 Automotive | Fuel efficiency | MPG from weight |
📣 Marketing | Ad spend | Sales revenue |
🌾 Agriculture | Yield prediction | Crop output |
2. 🧮 The Math — Explained Simply {#math-behind}
Don't worry! We'll treat the math like a story.
Simple Linear Regression (1 Feature)
Remember the straight-line equation from school?
y = mx + c
Where:
y = Output (what we want to predict)
x = Input (the feature)
m = Slope (steepness of the line)
c = Intercept (where the line crosses the y-axis)

In ML, the same equation is written differently:
ŷ = β₀ + β₁x₁
Where:
ŷ = Predicted value
β₀ = Intercept (bias)
β₁ = Coefficient (slope) for feature x₁
x₁ = Feature value
Real Example:
Price = β₀ + β₁ × Area
If the model learns:
β₀ = 50,000 (base price)
β₁ = 150 (price per sqft)
Then:
1000 sqft → Price = 50,000 + 150×1000 = $200,000 ✅
1500 sqft → Price = 50,000 + 150×1500 = $275,000 ✅
2500 sqft → Price = 50,000 + 150×2500 = $425,000 ✅
Multiple Linear Regression (Multiple Features)
ŷ = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ... + βₙxₙ
Example:
Price = β₀ + β₁×Area + β₂×Bedrooms + β₃×Age + β₄×Distance
Price = 20,000 + 150×Area + 8,000×Bedrooms − 2,000×Age − 5,000×Distance
(Here the model learned β₃ = −2,000 and β₄ = −5,000: a negative coefficient means that feature pulls the price down.)
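Plugging one hypothetical house into that learned equation shows how each feature contributes to the final number:

```python
# Coefficients from the illustrative equation above
base, per_sqft, per_bedroom, per_year, per_km = 20_000, 150, 8_000, -2_000, -5_000

# A hypothetical 1500 sqft, 3-bedroom house, 10 years old, 2 km from the centre
area, bedrooms, age, distance = 1500, 3, 10, 2

price = (base + per_sqft * area + per_bedroom * bedrooms
         + per_year * age + per_km * distance)
print(f"Predicted price: ${price:,}")  # 20,000 + 225,000 + 24,000 − 20,000 − 10,000
```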
Loss Function — How Does the Model Learn?
Learning = finding the best β₀ and β₁ that make predictions closest to actual values.
For this, we use MSE (Mean Squared Error):
MSE = (1/n) × Σ(yᵢ - ŷᵢ)²
Meaning: Average squared difference between actual and predicted values.
Minimize MSE → Find the best-fit line
# Manually computing MSE
import numpy as np
y_actual = np.array([200, 300, 400, 500, 600]) # Actual prices ($k)
y_predicted = np.array([210, 290, 390, 510, 590]) # Model predictions
errors = y_actual - y_predicted
squared_errors = errors ** 2
mse = squared_errors.mean()
print(f"Errors: {errors}")
print(f"Squared Errors: {squared_errors}")
print(f"MSE: {mse:.2f}")
# Lower MSE = better model!
Gradient Descent — The Optimization Magic
How does the model find β₀ and β₁? With Gradient Descent!
Analogy: You're on a dark hillside and need to reach the valley.
- At each step, feel which direction the ground slopes downward
- Take a small step in that direction
- Repeat until you reach the bottom (minimum loss)
# Gradient Descent from scratch
import numpy as np
X = np.array([1000, 1500, 2000, 2500, 3000]) # Area
y = np.array([200, 300, 400, 500, 600]) # Price ($k)
# Initialise parameters
beta_0 = 0.0  # Intercept
beta_1 = 0.0  # Slope
lr = 0.0000001  # Learning rate: must be tiny here because X is in the thousands (larger values diverge)
n = len(X)
epochs = 10000
for epoch in range(epochs):
    y_pred = beta_0 + beta_1 * X
    errors = y - y_pred
    # Compute gradients of MSE w.r.t. each parameter
    d_beta_0 = (-2/n) * errors.sum()
    d_beta_1 = (-2/n) * (errors * X).sum()
    # Update parameters in the downhill direction
    beta_0 -= lr * d_beta_0
    beta_1 -= lr * d_beta_1
print(f"Learned Intercept (β₀): {beta_0:.4f}")
print(f"Learned Slope (β₁): {beta_1:.6f}")  # should approach 0.2 ($200/sqft, since y is in $k)
print(f"\nPrediction for 2500 sqft: ${beta_0 + beta_1 * 2500:.2f}k")
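Gradient Descent is not the only route. For plain linear regression, the same β values also drop out of the closed-form least-squares solution (the Normal Equation). A quick cross-check on the same data, via NumPy's least-squares solver:

```python
import numpy as np

X = np.array([1000, 1500, 2000, 2500, 3000])  # Area
y = np.array([200, 300, 400, 500, 600])       # Price ($k)

# Design matrix with a column of ones for the intercept
A = np.column_stack([np.ones_like(X, dtype=float), X])

# Least-squares solution of A @ [β₀, β₁] ≈ y
(beta_0, beta_1), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"β₀ = {beta_0:.4f}, β₁ = {beta_1:.4f}")
print(f"Prediction for 2500 sqft: {beta_0 + beta_1 * 2500:.1f}k")
```

No iterations, no learning rate: the solver finds the exact minimiser of the MSE in one shot.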
3. 🌿 Types of Linear Regression {#types}
Linear Regression
├── Simple Linear Regression (1 feature)
├── Multiple Linear Regression (multiple features)
├── Polynomial Regression (curved relationship)
├── Ridge Regression (L2 regularization)
├── Lasso Regression (L1 regularization)
└── ElasticNet Regression (L1 + L2 combined)
Type 1: Simple Linear Regression
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1000], [1500], [2000], [2500], [3000]]) # Area
y = np.array([200, 300, 400, 500, 600]) # Price ($k)
model = LinearRegression()
model.fit(X, y)
print(f"Intercept (β₀): {model.intercept_:.4f}")
print(f"Slope (β₁): {model.coef_[0]:.6f}")
print(f"Prediction for 2200 sqft: ${model.predict([[2200]])[0]:.2f}k")
Type 2: Multiple Linear Regression
import pandas as pd
from sklearn.linear_model import LinearRegression
data = pd.DataFrame({
'area': [1000, 1500, 2000, 2500, 3000],
'bedrooms': [2, 3, 3, 4, 5],
'age_years': [10, 5, 15, 2, 8],
'price_k': [200, 320, 360, 560, 620]
})
X = data[['area', 'bedrooms', 'age_years']]
y = data['price_k']
model = LinearRegression()
model.fit(X, y)
print("Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f" {feature}: {coef:.4f}")
print(f" Intercept: {model.intercept_:.4f}")
# Interpretation (illustrative values — note the target is in $k, so coefficients are too):
# area: 0.18 → Each extra sqft adds ~$180 to the price
# bedrooms: 25 → Each extra bedroom adds ~$25,000
# age_years: -5 → Each year of age reduces the price by ~$5,000
Type 3: Polynomial Regression (Curved Relationship)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Data that doesn't fit a straight line (curved relationship)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([2.5, 5.5, 11, 19, 30, 44, 60, 79, 101, 126]) # Quadratic!
# ❌ Simple linear — poor fit
linear = LinearRegression()
linear.fit(X, y)
print(f"Linear R²: {linear.score(X, y):.4f}")
# ✅ Polynomial — good fit
poly = PolynomialFeatures(degree=2) # Creates x, x², etc.
X_poly = poly.fit_transform(X)
poly_model = LinearRegression()
poly_model.fit(X_poly, y)
print(f"Polynomial R²: {poly_model.score(X_poly, y):.4f}")
X_new = np.array([[11]])
X_new_poly = poly.transform(X_new)
print(f"Prediction for x=11: {poly_model.predict(X_new_poly)[0]:.2f}")
Type 4: Ridge Regression (L2 Regularization)
from sklearn.linear_model import Ridge
# Ridge: Prevents coefficients from growing too large
# alpha = regularization strength (higher = more regularization)
ridge = Ridge(alpha=1.0) # alpha=0 → same as LinearRegression
ridge.fit(X_train, y_train)
# When to use:
# - Features are highly correlated (multicollinearity)
# - Model is overfitting
# - All features seem important
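The snippet above assumes an existing `X_train`/`y_train` split. Here is a self-contained sketch (synthetic data, illustrative alpha values) showing the shrinkage in action: the coefficient magnitudes are pulled toward zero as alpha grows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, 2.0, 1.0]) + rng.normal(scale=0.5, size=100)

for alpha in [0, 1, 100]:
    # alpha=0 is equivalent to plain LinearRegression
    m = Ridge(alpha=alpha) if alpha > 0 else LinearRegression()
    m.fit(X, y)
    print(f"alpha={alpha:>3}: coefs = {np.round(m.coef_, 3)}")
```

None of the coefficients hit exactly zero, but all of them shrink; that is the L2 effect.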
Type 5: Lasso Regression (L1 Regularization)
from sklearn.linear_model import Lasso
# Lasso: Sets some coefficients exactly to 0 (built-in feature selection!)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
# When to use:
# - You need automatic feature selection
# - Many features exist, some are irrelevant
# - You want a sparse, interpretable model
# Check which features survived
selected = [(f, c) for f, c in zip(feature_names, lasso.coef_) if c != 0]
print("Selected features:", selected)
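A minimal self-contained sketch (synthetic data, illustrative alpha) of that feature-selection behaviour: the coefficient on a pure-noise feature is driven exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
n = 200
signal = rng.normal(size=n)  # truly predictive feature
noise = rng.normal(size=n)   # irrelevant feature
X = np.column_stack([signal, noise])
y = 5 * signal + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=1.0)
lasso.fit(X, y)
print(f"Coefficients: {lasso.coef_}")  # the noise coefficient is exactly 0
```

Ridge would have shrunk the noise coefficient toward zero; Lasso removes it outright.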
Type 6: ElasticNet (Best of Both Worlds)
from sklearn.linear_model import ElasticNet
# L1 + L2 combined
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5) # l1_ratio: 0=Ridge, 1=Lasso
elastic.fit(X_train, y_train)
# When to use:
# - When neither Ridge nor Lasso alone gives great results
# - When you want grouping behaviour for correlated features
4. ✅ Assumptions — When Does It Work? {#assumptions}
Linear Regression performs well only when these conditions hold.
Assumption 1: Linearity
There should be a linear relationship between X and y.
import matplotlib.pyplot as plt
# Check: plot a scatter chart
plt.figure(figsize=(8, 5))
plt.scatter(X, y, alpha=0.5)
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Linearity Check')
plt.show()
# If a curved pattern appears → use Polynomial Regression
Assumption 2: No Multicollinearity
Features should not be highly correlated with each other.
import seaborn as sns
# Correlation heatmap
corr_matrix = df[feature_columns].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix — Multicollinearity Check')
plt.show()
# VIF (Variance Inflation Factor)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_data = pd.DataFrame()
vif_data['Feature'] = feature_columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(len(feature_columns))]
print(vif_data)
# VIF > 10 → Multicollinearity problem!
Assumption 3: Homoscedasticity
Residual variance should be constant (not funnel-shaped).
y_pred = model.predict(X)
residuals = y - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.show()
# Funnel shape → Heteroscedasticity → Try log or sqrt transform on y
Assumption 4: Normality of Residuals
import scipy.stats as stats
stats.probplot(residuals, plot=plt)
plt.title('Q-Q Plot — Normality Check')
plt.show()
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.4f}")
# p > 0.05 → cannot reject normality ✅ (residuals look normal)
5. 🛠️ Implementation — Step by Step {#implementation}
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import warnings
warnings.filterwarnings('ignore')
# ── 1. Generate Data ───────────────────────────────────────
np.random.seed(42)
n = 500
df = pd.DataFrame({
'area_sqft': np.random.randint(500, 5000, n),
'bedrooms': np.random.randint(1, 6, n),
'bathrooms': np.random.randint(1, 4, n),
'age_years': np.random.randint(0, 40, n),
'floor': np.random.randint(1, 20, n),
'parking': np.random.randint(0, 3, n),
'metro_dist_km': np.random.uniform(0.5, 15, n),
})
# Realistic pricing formula
df['price_k'] = (
df['area_sqft'] * 150 +
df['bedrooms'] * 15000 +
df['bathrooms'] * 10000 -
df['age_years'] * 2000 -
df['metro_dist_km'] * 5000 +
df['parking'] * 8000 +
np.random.normal(0, 20000, n)
) / 1000
print("📦 Dataset Shape:", df.shape)
print("\n📊 Price Statistics (in $k):")
print(df['price_k'].describe())
# ── 2. Features & Target ───────────────────────────────────
feature_cols = ['area_sqft', 'bedrooms', 'bathrooms', 'age_years',
'floor', 'parking', 'metro_dist_km']
X = df[feature_cols]
y = df['price_k']
# ── 3. Train-Test Split ────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"\n🔀 Train: {X_train.shape[0]} | Test: {X_test.shape[0]}")
# ── 4. Feature Scaling ─────────────────────────────────────
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
# ── 5. Train ───────────────────────────────────────────────
model = LinearRegression()
model.fit(X_train_sc, y_train)
# ── 6. Predict ─────────────────────────────────────────────
y_pred = model.predict(X_test_sc)
# ── 7. Evaluate ────────────────────────────────────────────
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("\n📊 Model Performance:")
print(f" MAE (Mean Absolute Error): ${mae:.2f}k")
print(f" MSE (Mean Squared Error): {mse:.4f}")
print(f" RMSE (Root MSE): ${rmse:.2f}k")
print(f" R² (R-squared): {r2:.4f} ({r2*100:.2f}%)")
# ── 8. Feature Coefficients ────────────────────────────────
coef_df = pd.DataFrame({
'Feature': feature_cols,
'Coefficient': model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print("\n🔑 Feature Coefficients:")
print(coef_df.to_string(index=False))
print(f"\n Intercept: {model.intercept_:.4f}")
6. 🔧 All Important Functions & Parameters {#functions}
LinearRegression() Parameters
from sklearn.linear_model import LinearRegression
model = LinearRegression(
fit_intercept=True, # Calculate intercept? Default: True
copy_X=True, # Copy X during training? Default: True
n_jobs=None, # CPU cores to use (-1 = all)
positive=False # Force positive coefficients? Default: False
)
Key Attributes After Fitting
model.fit(X_train, y_train)
print(model.coef_) # [β₁, β₂, ...] — one per feature
print(model.intercept_) # β₀ — the intercept
print(model.n_features_in_) # Number of features used in training
print(model.feature_names_in_) # Feature names (if DataFrame was used)
Key Methods
model.fit(X_train, y_train) # Train the model
y_pred = model.predict(X_test) # Make predictions
r2 = model.score(X_test, y_test) # R² score
print(model.get_params()) # View current parameters
model.set_params(fit_intercept=False) # Change parameters
Ridge, Lasso, ElasticNet Key Parameters
from sklearn.linear_model import Ridge, Lasso, ElasticNet
ridge = Ridge(
alpha=1.0, # Regularization strength
fit_intercept=True,
solver='auto', # 'auto', 'svd', 'cholesky', 'lsqr', 'saga'
max_iter=None,
tol=1e-4,
random_state=None
)
lasso = Lasso(
alpha=0.1,
fit_intercept=True,
max_iter=1000,
warm_start=False,
selection='cyclic' # 'cyclic' or 'random'
)
elastic = ElasticNet(
alpha=0.1,
l1_ratio=0.5, # 0 = Ridge, 1 = Lasso
fit_intercept=True,
max_iter=1000
)
7. 📊 Evaluation Metrics {#evaluation}
Metric 1: MAE (Mean Absolute Error)
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.4f}")
# Average absolute error
# Interpretable, robust to outliers
# Same units as y
Metric 2: MSE (Mean Squared Error)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse:.4f}")
# Large errors penalised heavily (squared)
# Units are y² — less interpretable
# Commonly used during optimisation
Metric 3: RMSE
import numpy as np
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.4f}")
# Square root of MSE → same units as y
# Sensitive to outliers
# Most commonly reported metric
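That outlier sensitivity is easy to see side by side: two error profiles with the same MAE but very different RMSE (toy numbers):

```python
import numpy as np

y_true = np.zeros(5)
pred_even = np.array([1, 1, 1, 1, 1])   # five small errors
pred_spike = np.array([0, 0, 0, 0, 5])  # one large error, same total

for name, pred in [("even", pred_even), ("spike", pred_spike)]:
    mae = np.abs(y_true - pred).mean()
    rmse = np.sqrt(((y_true - pred) ** 2).mean())
    print(f"{name}: MAE={mae:.2f}, RMSE={rmse:.2f}")
# Both have MAE = 1.00, but the spike's RMSE jumps to √5 ≈ 2.24
```

Squaring makes one big miss count far more than many small ones; that is the whole difference between the two metrics.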
Metric 4: R² Score
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.4f}")
# R² = 0.85 → Model explains 85% of variance
# R² = 1.0 → Perfect fit
# R² = 0.0 → No better than predicting the mean
# R² < 0 → Worse than predicting the mean!
# ⚠️ Limitation: Keeps increasing as you add features
# Solution: Use Adjusted R²
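Those boundary cases can be checked directly with `r2_score` (toy numbers):

```python
from sklearn.metrics import r2_score

y_true = [1, 2, 3, 4, 5]
print(r2_score(y_true, [1, 2, 3, 4, 5]))  # 1.0: perfect fit
print(r2_score(y_true, [3, 3, 3, 3, 3]))  # 0.0: just predicting the mean
print(r2_score(y_true, [5, 4, 3, 2, 1]))  # -3.0: worse than the mean
```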
Metric 5: Adjusted R²
def adjusted_r2(r2, n, p):
    """
    r2 = R² score
    n  = number of samples
    p  = number of features
    """
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
n = len(y_test)
p = X_test.shape[1]
adj_r2 = adjusted_r2(r2, n, p)
print(f"Adjusted R²: {adj_r2:.4f}")
# Penalises adding useless features
# Better metric for model comparison
Metrics Summary Table
Metric | Formula | Range | Better When | Units |
|---|---|---|---|---|
MAE | mean(abs(y − ŷ)) | [0, ∞) | Lower | Same as y |
MSE | mean((y - ŷ)²) | [0, ∞) | Lower | y² |
RMSE | √MSE | [0, ∞) | Lower | Same as y |
R² | 1 - SS_res/SS_tot | (-∞, 1] | Closer to 1 | Unitless |
Adj R² | Modified R² | (-∞, 1] | Closer to 1 | Unitless |
8. 🌊 Intermediate Concepts {#intermediate}
Hyperparameter Tuning with Cross-Validation
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', Ridge())
])
param_grid = {'model__alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}
grid_search = GridSearchCV(
pipeline,
param_grid,
cv=5,
scoring='r2',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best Alpha: {grid_search.best_params_}")
print(f"Best CV R²: {grid_search.best_score_:.4f}")
print(f"Test R²: {grid_search.score(X_test, y_test):.4f}")
Learning Curves — Diagnosing Overfitting/Underfitting
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np
def plot_learning_curve(model, X, y, title="Learning Curve"):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5,
        scoring='r2',
        n_jobs=-1
    )
    train_mean = train_scores.mean(axis=1)
    val_mean = val_scores.mean(axis=1)
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
    plt.plot(train_sizes, val_mean, 'o-', color='red', label='Validation Score')
    plt.xlabel('Training Size')
    plt.ylabel('R² Score')
    plt.title(title)
    plt.legend()
    plt.grid(True)
    plt.show()
    final_gap = train_mean[-1] - val_mean[-1]
    print("\nDiagnosis:")
    print(f"Train R²: {train_mean[-1]:.4f}")
    print(f"Val R²: {val_mean[-1]:.4f}")
    print(f"Gap: {final_gap:.4f}")
    if final_gap > 0.1:
        print("⚠️ Possible OVERFITTING — Try regularization or more data")
    elif val_mean[-1] < 0.6:
        print("⚠️ Possible UNDERFITTING — Try more features or a more complex model")
    else:
        print("✅ Good fit!")
from sklearn.linear_model import LinearRegression
plot_learning_curve(LinearRegression(), X, y)
Feature Engineering for Regression
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
df = pd.DataFrame({
'area': [1000, 1500, 2000, 2500, 3000],
'bedrooms': [2, 3, 3, 4, 4],
'age': [5, 10, 15, 3, 20],
'metro_dist':[2, 5, 3, 1, 8],
})
# 1. Interaction features
df['area_per_bedroom'] = df['area'] / df['bedrooms']
df['area_x_metro'] = df['area'] * df['metro_dist']
# 2. Log transform (for skewed data)
df['log_area'] = np.log1p(df['area'])
df['log_metro'] = np.log1p(df['metro_dist'])
# 3. Binning
df['age_group'] = pd.cut(df['age'], bins=[0, 5, 15, 100],
labels=['New', 'Mid', 'Old'])
# 4. Automated polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['area', 'bedrooms']])
print("Polynomial Features:", poly.get_feature_names_out(['area', 'bedrooms']))
# ['area', 'bedrooms', 'area^2', 'area bedrooms', 'bedrooms^2']
9. 🚀 Advanced Concepts {#advanced}
Regularization — Deeper Look
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
n = 200
# Introduce multicollinearity — X2 ≈ X1
X1 = np.random.randn(n)
X2 = X1 + np.random.randn(n) * 0.1 # Almost identical to X1!
X3 = np.random.randn(n)
X4 = np.random.randn(n) # Noise
X5 = np.random.randn(n) # Noise
X = np.column_stack([X1, X2, X3, X4, X5])
y = 3*X1 + 2*X3 + np.random.randn(n) * 0.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
models = {
'Linear': LinearRegression(),
'Ridge(α=0.1)': Ridge(alpha=0.1),
'Ridge(α=10)': Ridge(alpha=10),
'Lasso(α=0.1)': Lasso(alpha=0.1),
'Lasso(α=1.0)': Lasso(alpha=1.0),
'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5)
}
print(f"{'Model':<20} {'Train R²':>10} {'Test R²':>10} {'Coefficients':>40}")
print("-" * 85)
for name, m in models.items():
    m.fit(X_train_sc, y_train)
    train_r2 = m.score(X_train_sc, y_train)
    test_r2 = m.score(X_test_sc, y_test)
    coefs = [f"{c:.3f}" for c in m.coef_]
    print(f"{name:<20} {train_r2:>10.4f} {test_r2:>10.4f} {str(coefs):>40}")
RidgeCV and LassoCV — Automatic Alpha Selection
from sklearn.linear_model import RidgeCV, LassoCV
import numpy as np
alphas = np.logspace(-4, 4, 50) # 0.0001 to 10000
ridge_cv = RidgeCV(alphas=alphas, cv=5, scoring='r2')
ridge_cv.fit(X_train_sc, y_train)
print(f"Ridge Best Alpha: {ridge_cv.alpha_:.4f}")
print(f"Ridge Test R²: {ridge_cv.score(X_test_sc, y_test):.4f}")
lasso_cv = LassoCV(alphas=alphas, cv=5, n_jobs=-1, random_state=42)
lasso_cv.fit(X_train_sc, y_train)
print(f"\nLasso Best Alpha: {lasso_cv.alpha_:.4f}")
print(f"Lasso Test R²: {lasso_cv.score(X_test_sc, y_test):.4f}")
print(f"Non-zero features: {(lasso_cv.coef_ != 0).sum()} / {X_train.shape[1]}")
Statsmodels — Statistical Analysis
import statsmodels.api as sm
X_with_const = sm.add_constant(X_train_sc)
ols_model = sm.OLS(y_train, X_with_const)
results = ols_model.fit()
print(results.summary())
# Key outputs:
print("\nCoefficients:")
print(results.params)
print("\nP-values (< 0.05 = statistically significant):")
print(results.pvalues)
print("\n95% Confidence Intervals:")
print(results.conf_int())
10. 🌐 Real-World Use Cases {#real-world}
Use Case 1: Sales Forecasting
"""
Problem: Predict monthly sales based on ad spend and season
Business: FMCG company
"""
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
np.random.seed(42)
n = 120 # 10 years of monthly data
months = pd.date_range('2014-01', periods=n, freq='ME')
data = pd.DataFrame({
'date': months,
'ad_spend_k': np.random.uniform(50, 500, n),
'season': np.where(months.month.isin([10, 11, 12, 1, 2]), 'peak', 'normal'),
'competitors_count': np.random.randint(2, 8, n),
'discount_pct': np.random.uniform(0, 30, n),
})
data['sales_k'] = (
data['ad_spend_k'] * 0.8 +
(data['season'] == 'peak').astype(int) * 200 -
data['competitors_count'] * 30 +
data['discount_pct'] * 5 +
np.random.normal(0, 50, n)
)
numeric_cols = ['ad_spend_k', 'competitors_count', 'discount_pct']
categorical_cols = ['season']
preprocessor = ColumnTransformer([
('num', StandardScaler(), numeric_cols),
('cat', OneHotEncoder(drop='first'), categorical_cols)
])
pipeline = Pipeline([('prep', preprocessor), ('model', LinearRegression())])
X = data[numeric_cols + categorical_cols]
y = data['sales_k']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
print(f"✅ Sales Forecast Model R²: {pipeline.score(X_test, y_test):.4f}")
# Predict next month
next_month = pd.DataFrame({
'ad_spend_k': [350], 'competitors_count': [4],
'discount_pct': [15], 'season': ['peak']
})
print(f"🎯 Predicted Next Month Sales: ${pipeline.predict(next_month)[0]:.2f}k")
Use Case 2: Student Performance Prediction
"""
Problem: Predict a student's final exam score
Business: EdTech platform
"""
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
np.random.seed(42)
n = 1000
df = pd.DataFrame({
'daily_study_hours': np.random.uniform(1, 10, n),
'attendance_pct': np.random.uniform(50, 100, n),
'assignments_done': np.random.uniform(0, 100, n),
'mock_score_avg': np.random.uniform(30, 95, n),
'sleep_hours': np.random.uniform(4, 9, n),
'online_resources': np.random.uniform(0, 5, n),
})
df['final_score'] = np.clip(
df['daily_study_hours'] * 4 +
df['attendance_pct'] * 0.3 +
df['assignments_done'] * 0.2 +
df['mock_score_avg'] * 0.5 +
df['sleep_hours'] * 1.5 +
df['online_resources'] * 2 +
np.random.normal(0, 5, n),
0, 100
)
X = df.drop('final_score', axis=1)
y = df['final_score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline = Pipeline([('scaler', StandardScaler()), ('model', Ridge(alpha=1.0))])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(f"📚 Student Score Predictor:")
print(f" R²: {r2_score(y_test, y_pred):.4f}")
print(f" MAE: {mean_absolute_error(y_test, y_pred):.2f} marks")
11. 💻 Practical Examples — Complete Code {#practical}
Full Project: California Housing Price Prediction
# ============================================================
# HOUSE PRICE PREDICTION — COMPLETE ML PROJECT
# ============================================================
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')
# ── Load Data ──────────────────────────────────────────────
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['Price'] = housing.target # Median house price (in $100k)
print("🏠 California Housing Dataset")
print(f"Shape: {df.shape}")
print(f"Features: {list(housing.feature_names)}")
print(f"\nPrice Range: ${df['Price'].min()*100:.0f}k — ${df['Price'].max()*100:.0f}k")  # target is in $100k units
# ── Feature Engineering ────────────────────────────────────
df['RoomsPerHouse'] = df['AveRooms'] / df['AveOccup']
df['BedroomRatio'] = df['AveBedrms'] / df['AveRooms']
df['PopulationDensity'] = df['Population'] / df['AveOccup']
# ── Train-Test Split ───────────────────────────────────────
feature_cols = [c for c in df.columns if c != 'Price']
X = df[feature_cols]
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ── Compare Models ─────────────────────────────────────────
models = {
'Linear Regression': LinearRegression(),
'Ridge (α=1)': Ridge(alpha=1),
'Ridge (α=10)': Ridge(alpha=10),
'Lasso (α=0.01)': Lasso(alpha=0.01),
}
results = []
for name, m in models.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('model', m)])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    results.append({
        'Model': name,
        'Train R²': pipe.score(X_train, y_train),
        'Test R²': pipe.score(X_test, y_test),
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
        'MAE': mean_absolute_error(y_test, y_pred)
    })
results_df = pd.DataFrame(results).round(4)
print("\n🏆 Model Comparison:")
print(results_df.to_string(index=False))
# ── Best Model Analysis ────────────────────────────────────
best_pipe = Pipeline([('scaler', StandardScaler()), ('model', Ridge(alpha=1))])
best_pipe.fit(X_train, y_train)
y_pred = best_pipe.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"\n✅ Best Model: Ridge (α=1)")
print(f" R² Score: {r2:.4f} — model explains {r2*100:.1f}% of variance")
print(f" MAE: ${mae*100:.1f}k average prediction error")  # convert from $100k units to $k
# ── Feature Importance ─────────────────────────────────────
ridge_model = best_pipe.named_steps['model']
coef_df = pd.DataFrame({
'Feature': feature_cols,
'Coefficient': ridge_model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print("\n🔑 Feature Importance:")
print(coef_df.to_string(index=False))
12. ⚠️ Edge Cases & Errors {#edge-cases}
Error 1: ValueError — Feature Count Mismatch
# ❌ Error: different number of features at train vs predict time
model.predict(X_different_shape)
# ValueError: X has 5 features, but LinearRegression expects 7
# ✅ Fix: Always use the same features; use Pipelines to enforce this
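A minimal reproduction of the error, with a hypothetical two-feature model asked to predict on one feature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(np.array([[1, 2], [3, 4], [5, 6]]), [1, 2, 3])  # trained on 2 features

try:
    model.predict(np.array([[1], [2]]))  # only 1 feature at predict time
except ValueError as e:
    print(f"ValueError: {e}")
```

A Pipeline avoids this class of bug because train and predict always pass through the same preprocessing steps.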
Error 2: Perfect Multicollinearity
# ❌ Problem: Two features are identical
# The matrix becomes singular → unreliable or failed solutions
# ✅ Fix 1: Remove the duplicate feature
# ✅ Fix 2: Use Ridge which handles multicollinearity gracefully
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=0.1)
Error 3: Skewed Target Variable
import numpy as np
y_skewed = np.array([10000, 15000, 20000, 1000000, 1500000])
# ✅ Fix: Apply log transform
y_log = np.log1p(y_skewed)
model.fit(X_train, np.log1p(y_train))
y_pred_log = model.predict(X_test)
y_pred = np.expm1(y_pred_log) # Reverse transform
Error 4: Convergence Warning (Lasso/ElasticNet)
# ❌ ConvergenceWarning: Objective did not converge
# ✅ Fix 1: Increase max_iter
lasso = Lasso(alpha=0.1, max_iter=10000, tol=1e-4)
# ✅ Fix 2: Scale the data first
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X_train)
13. 💡 Pro Tips {#pro-tips}
# ✅ Tip 1: Always use a Pipeline
from sklearn.pipeline import Pipeline
pipe = Pipeline([('scaler', StandardScaler()), ('model', Ridge())])
# Prevents data leakage and keeps code clean
# ✅ Tip 2: Rely on cross-validation, not a single test score
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y, cv=10, scoring='r2')
print(f"CV R²: {scores.mean():.4f} ± {scores.std():.4f}")
# ✅ Tip 3: Permutation importance for model-agnostic feature ranking
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
# ✅ Tip 4: SHAP values for production explainability
import shap
explainer = shap.LinearExplainer(model, X_train_sc)
shap_values = explainer.shap_values(X_test_sc)
shap.summary_plot(shap_values, X_test, feature_names=feature_cols)
# ✅ Tip 5: Save the complete pipeline
import joblib
joblib.dump(pipe, 'house_price_model_v1.pkl')
loaded = joblib.load('house_price_model_v1.pkl')
# ✅ Tip 6: Always inspect residuals
residuals = y_test - y_pred
# Mean should be ~0
# No pattern should exist in residuals vs fitted values
14. ⚖️ Comparison {#comparison}
Linear Regression vs Other Regression Models
Model | Complexity | Interpretability | Handles Non-linearity | Overfitting Risk | Speed |
|---|---|---|---|---|---|
Linear Regression | ⭐ Low | ⭐⭐⭐⭐⭐ Very High | ❌ No | Low | ⚡ Very Fast |
Ridge/Lasso | ⭐ Low | ⭐⭐⭐⭐ High | ❌ No | Very Low | ⚡ Very Fast |
Polynomial Reg | ⭐⭐ Medium | ⭐⭐⭐ Medium | ✅ Limited | Medium-High | ⚡ Fast |
Decision Tree | ⭐⭐ Medium | ⭐⭐⭐ Medium | ✅ Yes | High | ⚡ Fast |
Random Forest | ⭐⭐⭐ High | ⭐⭐ Low | ✅ Yes | Low | 🐢 Medium |
XGBoost | ⭐⭐⭐⭐ High | ⭐ Very Low | ✅ Yes | Very Low | 🐢 Medium |
Neural Network | ⭐⭐⭐⭐⭐ Very High | ⭐ Very Low | ✅ Yes | Variable | 🐌 Slow |
When to Choose Linear Regression?
✅ Use Linear Regression when:
- You need a simple, interpretable model
- The relationship is linear
- You want to understand individual feature impact
- Dataset is small to medium
- Building a quick baseline
- Regulatory / explainability requirements exist
❌ Avoid when:
- Complex non-linear patterns exist
- Many outliers are present
- Data is very high-dimensional (without regularization)
- Working with image or text data
15. 📊 Data Science Perspective {#data-science}
# Answering a business question with Linear Regression
import statsmodels.api as sm
import pandas as pd
import numpy as np
# Business Question: "How much does ad spend increase our sales?"
df = pd.DataFrame({
'ad_spend': np.random.uniform(10, 100, 100),
'sales': np.random.uniform(50, 500, 100)
})
df['sales'] = df['ad_spend'] * 4.2 + np.random.normal(0, 20, 100)
X_sm = sm.add_constant(df['ad_spend'])
sm_model = sm.OLS(df['sales'], X_sm).fit()
coef = sm_model.params['ad_spend']
pval = sm_model.pvalues['ad_spend']
print(f"💡 Business Insight:")
print(f"Every $1 increase in ad spend drives ${coef:.2f} in additional sales")
print(f"Statistical significance: p={pval:.4f} ({'Significant ✅' if pval < 0.05 else 'Not Significant ❌'})")
print(f"95% CI: {sm_model.conf_int().loc['ad_spend'].values}")
16. 🎤 Interview Questions {#interview-questions}
🟢 Basic
Q1: What does Linear Regression do?
Predicts continuous output values. Finds the best-fit line through the data by minimising MSE.
Q2: What does the slope (coefficient) mean?
The expected change in the target for a 1-unit increase in a feature, holding all other features constant.
Q3: What is the R² score?
The proportion of variance in the target that the model explains. R²=0.85 → 85% explained. 1.0 = perfect; 0 = useless.
Q4: MAE vs RMSE — when to use which?
MAE: Interpretable, robust to outliers. RMSE: Penalises large errors more. When outliers exist, prefer MAE.
Q5: Why use regularization?
To prevent overfitting. Ridge shrinks coefficients. Lasso zeroes out some coefficients (feature selection).
🟡 Intermediate
Q6: What is multicollinearity and how do you fix it?
Problem: Highly correlated features → unstable coefficients
Detect: VIF > 10 or high correlation matrix values
Fix:
- Remove one of the correlated features
- Apply PCA
- Use Ridge regression
Q7: What is Gradient Descent? Name the variants.
Gradient Descent: Algorithm that minimises loss iteratively
Types:
- Batch GD: Uses all data per step
- Stochastic GD: Uses one sample per step
- Mini-batch GD: Uses small batches (most common in practice)
Learning Rate: Too high → diverge; Too low → very slow
Q8: What is heteroscedasticity?
Residual variance is not constant across fitted values. Appears as a funnel shape in residual plots. Fix: log-transform y, or use Weighted Least Squares.
🔴 Advanced
Q9: How are Ridge and Lasso different mathematically?
Ridge: Minimise MSE + α × Σβᵢ² (L2 norm — squares the coefficients)
Lasso: Minimise MSE + α × Σ|βᵢ| (L1 norm — absolute value of coefficients)
L1 pushes some coefficients exactly to 0 → Sparse / feature selection
L2 shrinks all coefficients but keeps them non-zero → Dense solution
Q10: Is Polynomial Regression still "Linear Regression"?
Yes! You create polynomial features (x, x², x³) and then fit a Linear Regression on them. The model is still linear in its coefficients — hence the name holds.
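A quick check of that claim: generate exact quadratic data, expand it with `PolynomialFeatures`, and the plain linear model fits it perfectly, because it is linear in the coefficients even though the fitted curve is not a line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.arange(1, 11).reshape(-1, 1)
y = 3 * x.ravel() ** 2 + 2 * x.ravel() + 1  # exact quadratic

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)  # still an ordinary linear fit
print(f"R² on quadratic data: {model.score(X_poly, y):.4f}")  # 1.0000
```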
Q11: What is the difference between statistical significance and practical significance?
Statistical Significance: p < 0.05 → Feature is statistically significant
Practical Significance: Is the coefficient large enough to matter in the real world?
With a large n, tiny and meaningless effects can become statistically significant.
Always check both.
🎭 Scenario-Based
Q12: You deployed a model six months ago and performance has dropped. What happened?
Possible causes:
1. Data Drift: Input distribution has changed
2. Concept Drift: The relationship between X and y has changed
3. Seasonal effects
Solution:
- Monitor input distributions regularly
- Retrain periodically with fresh data
- Use drift detection tools (Evidently, alibi-detect)
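As a sketch of the monitoring idea (not a full drift-detection tool; the threshold is illustrative), one simple check compares feature means between training data and live data, measured in training standard deviations:

```python
import numpy as np

def mean_shift_in_sd(train_col, live_col):
    """Shift of the live mean from the training mean, in training SDs."""
    return abs(live_col.mean() - train_col.mean()) / train_col.std()

rng = np.random.default_rng(0)
train_area = rng.normal(1500, 300, size=1000)  # distribution at training time
live_area = rng.normal(2100, 300, size=1000)   # market moved upward → drift

shift = mean_shift_in_sd(train_area, live_area)
print(f"Mean shift: {shift:.2f} SDs")
if shift > 0.5:  # illustrative alert threshold
    print("⚠️ Possible data drift — investigate and consider retraining")
```

Dedicated tools compare whole distributions, not just means, but this captures the core idea of monitoring inputs over time.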
17. 🎯 Conclusion {#conclusion}
Key Takeaways
Topic | Key Learning |
|---|---|
What | Predict continuous values using a linear relationship |
Types | Simple, Multiple, Polynomial, Ridge, Lasso, ElasticNet |
Math | ŷ = β₀ + β₁x₁ + ... + βₙxₙ |
Loss | Minimise MSE via Gradient Descent or Normal Equation |
Metrics | MAE, RMSE, R², Adjusted R² |
Regularization | Ridge=L2, Lasso=L1, ElasticNet=Both |
Best Practice | Pipeline + Cross-Validation + Feature Scaling |
Series Roadmap
✅ Part 1: ML Introduction
✅ Part 2: Linear Regression (This Blog!)
⏳ Part 3: Logistic Regression — Classification
⏳ Part 4: Decision Trees
⏳ Part 5: Random Forests & Ensemble Methods
⏳ Part 6: Support Vector Machines
⏳ Part 7: K-Means Clustering
⏳ Part 8: Neural Networks
Final Advice
"Master Linear Regression properly — it is the foundation for everything else. Understand regularization here and it will make Ridge/Lasso/dropout in deep learning trivial. Understand MSE here and you'll understand loss functions everywhere."
Do This Now:
✅ Run the California Housing project code above
✅ Compare Ridge, Lasso, and ElasticNet yourself
✅ Use Statsmodels for a full statistical summary
✅ Try the "House Prices" Kaggle competition
Part 2 of ML A to Z | Next: Logistic Regression!