Data Science

Logistic Regression — The Superhero of Classification | ML Series Part 3

Logistic Regression explained completely — sigmoid function, evaluation metrics, class imbalance, threshold tuning, and real-world projects in Python.

1. Introduction — What is Logistic Regression?

The name says "Regression" but this is actually a Classification algorithm! Confusing? Let's clear that up with a simple example.

Real-Life Problem

A doctor needs to decide: Does this patient have diabetes or not?

  • Blood sugar: 180 mg/dL

  • BMI: 32

  • Age: 45 years

The doctor calculates a probability: "There's a 78% chance this patient has diabetes." If probability > 50% → Diabetes is diagnosed.

That is exactly what Logistic Regression does!

Logistic Regression = A classification algorithm that predicts the probability that a data point belongs to a particular class.

How Is It Different from Linear Regression?

Text
Linear Regression:    Price = $200k, $300k, $1.2M  → Continuous numbers
Logistic Regression:  Email = Spam(1) / Not Spam(0) → Probability between 0 and 1
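That contrast shows up directly in code. A tiny sketch (made-up one-feature data) comparing the two model outputs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up one-feature data: 0 = Ham, 1 = Spam
X = np.array([[1], [2], [3], [8], [9], [10]])
y = np.array([0, 0, 0, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression(max_iter=1000).fit(X, y)

print(lin.predict([[12]]))              # exceeds 1: not a valid probability
print(log.predict_proba([[12]])[:, 1])  # always strictly inside (0, 1)
```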

Why Is It Called "Regression"?

Logistic Regression runs a regression on the log-odds (logit) of the target — that's where the "regression" name comes from. But since the output is a probability used for classification, it is a classification algorithm in practice.
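A quick numeric check of that inverse relationship, using plain NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

p = 0.78                        # e.g. the doctor's diabetes probability
log_odds = np.log(p / (1 - p))  # logit: the quantity the linear part models

print(f"log-odds:  {log_odds:.4f}")
print(f"recovered: {sigmoid(log_odds):.2f}")  # 0.78: sigmoid inverts the logit
```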

Why Does It Matter?

  • Simple and fast — works on large datasets

  • Interpretable — coefficients have clear, real-world meaning

  • Probability output — "78% chance" is far more actionable than a hard 0/1 label

  • The go-to baseline — starting point for every classification project

  • Everywhere in production — Medical, Finance, HR, Marketing

Real-World Applications

Domain         | Problem                  | Output
─────────────────────────────────────────────────────
🏥 Medical     | Has disease or not?      | 0/1
💳 Finance     | Will customer default?   | Probability
📧 Email       | Is it spam?              | 0/1
🛒 E-commerce  | Will customer buy?       | Probability
📱 App         | Will user churn?         | Risk score
🏦 Banking     | Is this fraud?           | Flag
🎓 Education   | Will student pass?       | Probability
🚗 Insurance   | Will they file a claim?  | Risk


2. 🧮 The Math — Sigmoid Function {#math}

The Problem with Linear Regression for Classification

Text
Linear Regression: ŷ = β₀ + β₁x

Problem: Output can range from -∞ to +∞!
But a probability must be between 0 and 1.

Example:
  Linear Regression might predict: 1.5 or -0.3
  → These are meaningless as probabilities!

Solution: The Sigmoid Function 🧠

Text
Sigmoid Function:

σ(z) = 1 / (1 + e^(-z))

Where:
z    = β₀ + β₁x₁ + β₂x₂ + ... (linear combination)
e    = Euler's number (2.718...)
σ(z) = output in range (0, 1) — a valid probability!
Python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Squashes any number into the range (0, 1)"""
    return 1 / (1 + np.exp(-z))

z_values       = np.linspace(-10, 10, 100)
sigmoid_values = sigmoid(z_values)

print(f"sigmoid(-10) = {sigmoid(-10):.6f}")  # ~0 (almost impossible)
print(f"sigmoid(-2)  = {sigmoid(-2):.4f}")   # 0.12
print(f"sigmoid(0)   = {sigmoid(0):.4f}")    # 0.5 (50/50)
print(f"sigmoid(2)   = {sigmoid(2):.4f}")    # 0.88
print(f"sigmoid(10)  = {sigmoid(10):.6f}")   # ~1 (almost certain)

plt.figure(figsize=(10, 5))
plt.plot(z_values, sigmoid_values, 'b-', linewidth=2)
plt.axhline(y=0.5, color='red', linestyle='--', label='Threshold = 0.5')
plt.axvline(x=0,   color='gray', linestyle='--')
plt.xlabel('z (Linear combination of features)')
plt.ylabel('σ(z) (Probability)')
plt.title('Sigmoid Function')
plt.legend()
plt.grid(True)
plt.show()

The Complete Model

Text
Step 1: Compute the linear combination
z = β₀ + β₁x₁ + β₂x₂ + ...

Step 2: Apply the sigmoid
P(y=1 | X) = σ(z) = 1 / (1 + e^(-z))

Step 3: Apply the decision rule
If P ≥ 0.5 → Predict Class 1
If P <  0.5 → Predict Class 0
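The same three steps written out in NumPy. The coefficients here are invented for illustration, not fitted values:

```python
import numpy as np

# Hypothetical coefficients for [blood_sugar, bmi] (not fitted values)
beta0 = -8.0
beta  = np.array([0.04, 0.05])
x     = np.array([180, 32])      # one patient: blood sugar 180, BMI 32

z    = beta0 + beta @ x          # Step 1: linear combination
p    = 1 / (1 + np.exp(-z))      # Step 2: sigmoid
pred = int(p >= 0.5)             # Step 3: decision rule

print(f"z = {z:.2f}  P(y=1) = {p:.3f}  class = {pred}")
# z = 0.80, p ≈ 0.690 → predicted class 1
```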

Log-Odds (Logit) — The Full Picture

Text
Odds = P(event) / P(no event) = p / (1 - p)
Log-Odds = log(p / (1 - p))

Logistic Regression fits a linear model to the log-odds:
log(p / (1-p)) = β₀ + β₁x₁ + β₂x₂

Coefficient β₁ means:
  A 1-unit increase in x₁ increases the log-odds by β₁
  = The odds are multiplied by exp(β₁)
Python
import numpy as np

# Interpretation example
# Coefficient for blood_sugar = 0.05

coef       = 0.05
odds_ratio = np.exp(coef)
print(f"For each 1-unit increase in blood sugar:")
print(f"Odds Ratio: {odds_ratio:.4f}")
print(f"Diabetes odds increase by {(odds_ratio - 1) * 100:.1f}%")

Loss Function — Log Loss (Binary Cross-Entropy)

Text
L = -(1/n) × Σ [yᵢ log(ŷᵢ) + (1 - yᵢ) log(1 - ŷᵢ)]

Where:
yᵢ = Actual class (0 or 1)
ŷᵢ = Predicted probability

Minimising Log Loss → best parameters → best model
Python
from sklearn.metrics import log_loss
import numpy as np

y_true    = np.array([1, 0, 1, 1, 0])
y_pred_prob = np.array([0.9, 0.1, 0.8, 0.7, 0.3])

logloss = log_loss(y_true, y_pred_prob)
print(f"Log Loss: {logloss:.4f}")
# Closer to 0 = better model

3. 🌿 Types of Logistic Regression {#types}

Text
Logistic Regression
├── Binary Classification     (2 classes: 0/1, Yes/No)
├── Multiclass Classification (3+ classes)
│   ├── One-vs-Rest (OvR)
│   └── Multinomial (Softmax)
└── Multilabel Classification (multiple labels per sample)

Binary Classification

Python
from sklearn.linear_model import LogisticRegression
import numpy as np

# Spam detection example
X = np.array([[0.1, 2], [0.5, 3], [0.9, 5], [0.2, 1], [0.8, 4]])
y = np.array([0, 0, 1, 0, 1])  # 0=Ham, 1=Spam

model = LogisticRegression()
model.fit(X, y)

probs = model.predict_proba(X)
print("Probabilities:")
for i, p in enumerate(probs):
    print(f"  Sample {i+1}: Ham={p[0]:.3f}, Spam={p[1]:.3f}")

preds = model.predict(X)
print(f"\nPredictions: {preds}")

Multiclass Classification (3+ Classes)

Python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

model = LogisticRegression(
    multi_class='auto',  # Automatically picks the best strategy
    solver='lbfgs',
    max_iter=200,
    random_state=42
)
model.fit(X, y)

probs = model.predict_proba(X[:3])
print("Class Probabilities for first 3 flowers:")
print(f"{'Sample':<10} {'Setosa':>10} {'Versicolor':>12} {'Virginica':>10}")
for i, p in enumerate(probs):
    print(f"{i+1:<10} {p[0]:>10.3f} {p[1]:>12.3f} {p[2]:>10.3f}")

4. 🎯 The Decision Boundary {#decision-boundary}

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=200, n_features=2, n_redundant=0,
    n_clusters_per_class=1, random_state=42
)

model = LogisticRegression()
model.fit(X, y)

h    = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='black', linewidth=0.5)
plt.title('Logistic Regression — Decision Boundary')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar()
plt.show()

print(f"Coefficients: {model.coef_[0]}")
print(f"Intercept:    {model.intercept_[0]:.4f}")

5. 🔧 All Functions & Parameters {#functions}

Python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    # ─── Regularization ───────────────────────────────────
    penalty='l2',        # 'l1', 'l2', 'elasticnet', None
                         # l2: Default (Ridge-like behaviour)
                         # l1: Lasso-like (can zero out features)
                         # elasticnet: Combined (set l1_ratio too)
                         # None: No regularization

    C=1.0,               # Inverse of regularization strength
                         # C < 1 → Stronger regularization
                         # C > 1 → Weaker regularization
                         # Think: C = 1/alpha (opposite convention to Ridge!)

    # ─── Solver ────────────────────────────────────────────
    solver='lbfgs',      # 'lbfgs'     : Default, small-medium, L2/None
                         # 'liblinear' : Small data, L1/L2
                         # 'saga'      : L1/L2/ElasticNet, large sparse data
                         # 'sag'       : Large data, L2 only
                         # 'newton-cg' : L2/None, medium data

    # ─── Multiclass ────────────────────────────────────────
    multi_class='auto',  # 'auto'         : Automatically selects
                         # 'ovr'          : One-vs-Rest
                         # 'multinomial'  : Softmax (true multiclass)
                         # NOTE: deprecated in scikit-learn >= 1.5;
                         # multinomial is the default behaviour there

    # ─── Convergence ───────────────────────────────────────
    max_iter=100,        # Maximum number of solver iterations
    tol=1e-4,            # Convergence tolerance

    # ─── Class Imbalance ───────────────────────────────────
    class_weight=None,   # None      : All classes equal weight
                         # 'balanced': Auto-weight by class frequency
                         # {0:1, 1:10}: Custom per-class weights

    # ─── ElasticNet only ───────────────────────────────────
    l1_ratio=None,       # 0 = L2, 1 = L1, 0.5 = equal mix

    # ─── Other ─────────────────────────────────────────────
    fit_intercept=True,
    random_state=None,
    n_jobs=None,         # -1 = use all CPU cores
    warm_start=False
)

Solver Selection Guide

Solver    | Penalties Supported | Dataset Size | Notes
──────────────────────────────────────────────────────────────
lbfgs     | l2, None            | Small–Medium | Default, fast
liblinear | l1, l2              | Small        | Good for small data
saga      | l1, l2, elasticnet  | Large        | Great for sparse data
sag       | l2                  | Large        | Faster than lbfgs at scale
newton-cg | l2, None            | Medium       | Memory intensive

Important Methods

Python
model.fit(X_train, y_train)            # Train

predictions = model.predict(X_test)   # Hard class predictions (0 or 1)

probabilities = model.predict_proba(X_test)
# Returns: [[P(class0), P(class1)], ...]  — most useful output!

log_probs = model.predict_log_proba(X_test)  # Log-scale probabilities

accuracy = model.score(X_test, y_test)

# Learned attributes
print(model.classes_)        # [0, 1] or ['cat', 'dog']
print(model.coef_)           # Feature coefficients
print(model.intercept_)      # Intercept term
print(model.n_iter_)         # Actual iterations used

6. 📊 Evaluation Metrics for Classification {#metrics}

Confusion Matrix — The Foundation

Text
Actual \ Predicted | Positive (1) | Negative (0)
───────────────────────────────────────────────────
Positive (1)       | TP           | FN
Negative (0)       | FP           | TN

TP = True Positive  (correct: predicted 1, actual 1)
TN = True Negative  (correct: predicted 0, actual 0)
FP = False Positive (wrong: predicted 1, actual 0) — Type I Error
FN = False Negative (wrong: predicted 0, actual 1) — Type II Error
Python
from sklearn.metrics import (
    confusion_matrix, classification_report,
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, ConfusionMatrixDisplay
)
import matplotlib.pyplot as plt

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.8, 0.4, 0.2, 0.7, 0.6, 0.3, 0.85, 0.15]

cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)

fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive']).plot(ax=ax, cmap='Blues')
plt.title('Confusion Matrix')
plt.show()
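Those four counts can be unpacked with .ravel() and plugged straight into the Precision and Recall formulas by hand:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# sklearn convention: rows = actual, cols = predicted, labels sorted [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")   # TP=4 TN=4 FP=1 FN=1

precision = tp / (tp + fp)   # TP / (TP + FP)
recall    = tp / (tp + fn)   # TP / (TP + FN)
print(f"Precision = {precision:.2f}")   # 0.80
print(f"Recall    = {recall:.2f}")      # 0.80
```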

All Key Metrics

Python
# 1. Accuracy — Overall correct predictions
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy:  {accuracy:.4f}")
# = (TP + TN) / Total
# ⚠️ Misleading on imbalanced datasets!

# 2. Precision — Of all predicted positives, how many are actually positive?
precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.4f}")
# = TP / (TP + FP)
# High Precision: "When I say Positive, I'm nearly always right"
# Use: Spam detection (avoid flagging legitimate email)

# 3. Recall (Sensitivity) — Of all actual positives, how many did we catch?
recall = recall_score(y_true, y_pred)
print(f"Recall:    {recall:.4f}")
# = TP / (TP + FN)
# High Recall: "I don't miss any real positive cases"
# Use: Cancer detection (never miss a sick patient)

# 4. F1 Score — Harmonic mean of Precision and Recall
f1 = f1_score(y_true, y_pred)
print(f"F1 Score:  {f1:.4f}")
# = 2 × (Precision × Recall) / (Precision + Recall)
# Best single metric for imbalanced data

# 5. AUC-ROC — Overall discrimination ability of the model
auc = roc_auc_score(y_true, y_prob)
print(f"AUC-ROC:   {auc:.4f}")
# 0.5 = random guessing
# 1.0 = perfect classifier
# > 0.8 = good model

# 6. Full Report
print("\n📋 Full Classification Report:")
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))

ROC Curve

Python
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC Curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR / Recall)')
plt.title('ROC Curve')
plt.legend()
plt.grid(True)
plt.show()

Precision-Recall Curve (Better for Imbalanced Data!)

Python
from sklearn.metrics import precision_recall_curve, average_precision_score

precision_arr, recall_arr, _ = precision_recall_curve(y_true, y_prob)
avg_precision = average_precision_score(y_true, y_prob)

plt.figure(figsize=(8, 6))
plt.plot(recall_arr, precision_arr, 'b-', linewidth=2,
         label=f'PR Curve (AP = {avg_precision:.4f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)
plt.show()

# On highly imbalanced datasets:
# AUC-ROC can be overly optimistic
# PR Curve gives a more honest picture

Metric Selection Guide

Scenario                      | Recommended Metric | Reason
──────────────────────────────────────────────────────────────────────
Balanced classes              | Accuracy           | Simple and reliable
Imbalanced, both errors equal | F1 Score           | Balances Precision & Recall
FP costly (spam filter)       | Precision          | Don't block legitimate email
FN costly (cancer detection)  | Recall             | Never miss a sick patient
Overall model discrimination  | AUC-ROC            | Threshold-independent
Highly imbalanced data        | AUC-PR             | More informative than ROC


7. 🌊 Intermediate Concepts {#intermediate}

Threshold Tuning — 0.5 Is Not Always Optimal!

Python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import pandas as pd

# Imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2,
                            weights=[0.9, 0.1], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]

# Try different thresholds
thresholds = np.arange(0.1, 0.9, 0.05)
results = []
for thresh in thresholds:
    preds = (probs >= thresh).astype(int)
    results.append({
        'Threshold': thresh,
        'Precision': precision_score(y_test, preds, zero_division=0),
        'Recall':    recall_score(y_test, preds, zero_division=0),
        'F1':        f1_score(y_test, preds, zero_division=0)
    })

results_df = pd.DataFrame(results)
print("Threshold Analysis:")
print(results_df.round(4).to_string(index=False))

# Find the best threshold for F1
best_thresh = results_df.loc[results_df['F1'].idxmax(), 'Threshold']
print(f"\n✅ Best Threshold: {best_thresh:.2f}")
print(f"Best F1 Score:    {results_df['F1'].max():.4f}")

Handling Class Imbalance

Python
from sklearn.linear_model import LogisticRegression

# Method 1: class_weight='balanced' — simplest fix
model = LogisticRegression(class_weight='balanced', random_state=42)

# Method 2: SMOTE — generate synthetic minority samples
# pip install imbalanced-learn
import numpy as np
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

print(f"Original distribution:    {dict(zip(*np.unique(y_train, return_counts=True)))}")
print(f"After SMOTE:              {dict(zip(*np.unique(y_resampled, return_counts=True)))}")

# Method 3: Full Pipeline with SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.preprocessing import StandardScaler

pipeline = ImbPipeline([
    ('scaler', StandardScaler()),         # scale first: SMOTE is kNN-based,
    ('smote',  SMOTE(random_state=42)),   # so distances should use scaled features
    ('model',  LogisticRegression(random_state=42))
])

pipeline.fit(X_train, y_train)

8. 🚀 Advanced Concepts {#advanced}

Regularization — The C Parameter in Depth

Python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

C_values = np.logspace(-4, 4, 20)  # 0.0001 to 10000

results = []
for C in C_values:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('model', LogisticRegression(C=C, max_iter=1000, random_state=42))
    ])
    cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='f1')
    results.append({'C': C, 'Mean F1': cv_scores.mean(), 'Std F1': cv_scores.std()})

results_df = pd.DataFrame(results)
best_C = results_df.loc[results_df['Mean F1'].idxmax(), 'C']
print(f"Best C:       {best_C:.6f}")
print(f"Best Mean F1: {results_df['Mean F1'].max():.4f}")

LogisticRegressionCV — Automatic C Selection

Python
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score

model_cv = LogisticRegressionCV(
    Cs=20,           # Number of C values to try
    cv=5,
    scoring='f1',
    penalty='l2',
    solver='lbfgs',
    max_iter=1000,
    random_state=42,
    n_jobs=-1
)

pipe = Pipeline([('scaler', StandardScaler()), ('model', model_cv)])
pipe.fit(X_train, y_train)

print(f"Best C: {model_cv.C_[0]:.6f}")
print(f"Test F1: {f1_score(y_test, pipe.predict(X_test)):.4f}")

Multi-class Strategies

Python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Strategy 1: One-vs-Rest (OvR)
# Trains one binary classifier per class
ovr_model = LogisticRegression(multi_class='ovr', solver='liblinear', random_state=42)
ovr_model.fit(X_train, y_train)
print(f"OvR Accuracy: {ovr_model.score(X_test, y_test):.4f}")

# Strategy 2: Multinomial (Softmax)
# One model that handles all classes simultaneously
mnm_model = LogisticRegression(multi_class='multinomial', solver='lbfgs',
                                max_iter=200, random_state=42)
mnm_model.fit(X_train, y_train)
print(f"Multinomial Accuracy: {mnm_model.score(X_test, y_test):.4f}")

# Probabilities sum to 1 for each sample
probs = mnm_model.predict_proba(X_test[:3])
print("\nProbabilities (should sum to 1):")
for p in probs:
    print(f"  Setosa: {p[0]:.3f} | Versicolor: {p[1]:.3f} | Virginica: {p[2]:.3f} | Sum: {sum(p):.3f}")

9. 🌐 Real-World Use Cases {#real-world}

Use Case 1: Loan Default Prediction

Python
"""
Problem: Will this customer default on their loan?
Business: Bank / Lending Institution
Impact: Risk management, credit scoring
"""

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

np.random.seed(42)
n = 5000

df = pd.DataFrame({
    'age':               np.random.randint(22, 65, n),
    'annual_income':     np.random.randint(20000, 200000, n),
    'loan_amount':       np.random.randint(5000, 200000, n),
    'loan_term_months':  np.random.choice([12, 24, 36, 48, 60], n),
    'credit_score':      np.random.randint(300, 850, n),
    'employment_type':   np.random.choice(['Salaried', 'Self-Employed', 'Business'], n),
    'existing_loans':    np.random.randint(0, 5, n),
    'monthly_expenses':  np.random.randint(1000, 10000, n),
})

default_prob = (
    - (df['credit_score'] - 600) * 0.005
    + (df['loan_amount'] / df['annual_income']) * 0.3
    + df['existing_loans'] * 0.1
    + (df['monthly_expenses'] / (df['annual_income'] / 12)) * 0.2
    + np.random.normal(0, 0.1, n)
)

df['defaulted'] = (default_prob > 0.3).astype(int)
print(f"Default Rate: {df['defaulted'].mean():.2%}")

num_features = ['age', 'annual_income', 'loan_amount', 'loan_term_months',
                 'credit_score', 'existing_loans', 'monthly_expenses']
cat_features = ['employment_type']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_features),
    ('cat', OneHotEncoder(drop='first'), cat_features)
])

X = df.drop('defaulted', axis=1)
y = df['defaulted']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ('prep',  preprocessor),
    ('model', LogisticRegression(
        C=0.1, class_weight='balanced',
        solver='lbfgs', max_iter=1000, random_state=42
    ))
])

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]

print(f"\n📊 Loan Default Model Performance:")
print(classification_report(y_test, y_pred, target_names=['No Default', 'Default']))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")

# Score a new applicant
new_applicant = pd.DataFrame({
    'age': [35], 'annual_income': [60000], 'loan_amount': [50000],
    'loan_term_months': [36], 'credit_score': [650],
    'employment_type': ['Salaried'], 'existing_loans': [1],
    'monthly_expenses': [3000]
})

risk_prob    = pipeline.predict_proba(new_applicant)[0, 1]
risk_category = 'High' if risk_prob > 0.6 else 'Medium' if risk_prob > 0.3 else 'Low'

print(f"\n🏦 Applicant Risk Assessment:")
print(f"  Default Probability: {risk_prob:.2%}")
print(f"  Risk Category:       {risk_category}")
recommendation = 'Reject' if risk_category == 'High' else 'Manual Review' if risk_category == 'Medium' else 'Approve'
print(f"  Recommendation:      {recommendation}")

10. 💻 Practical Examples — Complete Code {#practical}

Customer Churn Prediction — Full ML Pipeline

Python
# ============================================================
# CUSTOMER CHURN PREDICTION — COMPLETE ML PROJECT
# ============================================================

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, classification_report
)
import warnings
warnings.filterwarnings('ignore')

# ── 1. Generate Data ──────────────────────────────────────
np.random.seed(42)
n = 3000

df = pd.DataFrame({
    'tenure_months':    np.random.randint(1, 72, n),
    'monthly_charges':  np.random.uniform(20, 120, n),
    'total_charges':    np.random.uniform(100, 8000, n),
    'num_products':     np.random.randint(1, 5, n),
    'support_calls':    np.random.randint(0, 10, n),
    'payment_delays':   np.random.randint(0, 5, n),
    'has_premium':      np.random.randint(0, 2, n),
    'avg_session_days': np.random.uniform(1, 30, n),
})

churn_prob = (
    - df['tenure_months'] * 0.02
    + df['support_calls'] * 0.15
    + df['payment_delays'] * 0.25
    - df['num_products'] * 0.1
    - df['has_premium'] * 0.3
    + df['monthly_charges'] * 0.005
    + np.random.normal(0, 0.5, n)
)

df['churned'] = (churn_prob > 0).astype(int)
print(f"📊 Dataset: {n} customers")
print(f"📊 Churn Rate: {df['churned'].mean():.2%} ({df['churned'].sum()} churned)")

# ── 2. Features & Target ──────────────────────────────────
feature_cols = [c for c in df.columns if c != 'churned']
X = df[feature_cols]
y = df['churned']

# ── 3. Stratified Train-Test Split ────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\n🔀 Train: {len(X_train)} | Test: {len(X_test)}")
print(f"Train Churn Rate: {y_train.mean():.2%}")
print(f"Test  Churn Rate: {y_test.mean():.2%}")

# ── 4. Compare Models ─────────────────────────────────────
configs = {
    'LR (Default)':   LogisticRegression(random_state=42, max_iter=1000),
    'LR (C=0.1)':     LogisticRegression(C=0.1, random_state=42, max_iter=1000),
    'LR (C=10)':      LogisticRegression(C=10, random_state=42, max_iter=1000),
    'LR (Balanced)':  LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000),
    'LR (L1)':        LogisticRegression(penalty='l1', solver='liblinear', random_state=42),
}

print("\n📊 Model Comparison (5-fold CV):")
print(f"{'Model':<25} {'Accuracy':>10} {'F1':>10} {'AUC':>10}")
print("-" * 60)

best_model = None
best_score = 0

for name, m in configs.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('model', m)])
    acc = cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy').mean()
    f1  = cross_val_score(pipe, X_train, y_train, cv=5, scoring='f1').mean()
    auc = cross_val_score(pipe, X_train, y_train, cv=5, scoring='roc_auc').mean()
    print(f"{name:<25} {acc:>10.4f} {f1:>10.4f} {auc:>10.4f}")

    if f1 > best_score:
        best_score = f1
        best_model = (name, m)

print(f"\n🏆 Best Model: {best_model[0]}")

# ── 5. Train Final Model ──────────────────────────────────
final_pipe = Pipeline([('scaler', StandardScaler()), ('model', best_model[1])])
final_pipe.fit(X_train, y_train)

# ── 6. Optimise Threshold ─────────────────────────────────
# (In production, tune the threshold on a separate validation split;
#  tuning on the test set leaks information into the evaluation.)
probs_test = final_pipe.predict_proba(X_test)[:, 1]

best_thresh = 0.5
best_f1     = 0
for thresh in np.arange(0.2, 0.8, 0.05):
    preds = (probs_test >= thresh).astype(int)
    f1 = f1_score(y_test, preds, zero_division=0)
    if f1 > best_f1:
        best_f1     = f1
        best_thresh = thresh

print(f"🎯 Optimal Threshold: {best_thresh:.2f}")

y_pred_final = (probs_test >= best_thresh).astype(int)

# ── 7. Final Evaluation ───────────────────────────────────
print("\n📊 Final Model Performance:")
print(f"  Accuracy:  {accuracy_score(y_test, y_pred_final):.4f}")
print(f"  Precision: {precision_score(y_test, y_pred_final):.4f}")
print(f"  Recall:    {recall_score(y_test, y_pred_final):.4f}")
print(f"  F1 Score:  {f1_score(y_test, y_pred_final):.4f}")
print(f"  AUC-ROC:   {roc_auc_score(y_test, probs_test):.4f}")

print("\n📋 Classification Report:")
print(classification_report(y_test, y_pred_final, target_names=['Retained', 'Churned']))

# ── 8. Feature Importance ────────────────────────────────
lr_model = final_pipe.named_steps['model']
coef_df  = pd.DataFrame({
    'Feature':     feature_cols,
    'Coefficient': lr_model.coef_[0],
    'Odds_Ratio':  np.exp(lr_model.coef_[0])
}).sort_values('Coefficient', key=abs, ascending=False)

print("\n🔑 Feature Importance:")
print(coef_df.to_string(index=False))

# ── 9. Score a New Customer ──────────────────────────────
new_customer = pd.DataFrame({
    'tenure_months': [6], 'monthly_charges': [85], 'total_charges': [510],
    'num_products': [1], 'support_calls': [5], 'payment_delays': [2],
    'has_premium': [0], 'avg_session_days': [5],
})

churn_prob_val = final_pipe.predict_proba(new_customer)[0, 1]
risk_level     = 'High 🔴' if churn_prob_val > 0.6 else 'Medium 🟡' if churn_prob_val > 0.3 else 'Low 🟢'

print(f"\n⚠️ Customer Churn Risk:")
print(f"  Probability: {churn_prob_val:.2%}")
print(f"  Risk Level:  {risk_level}")

action = (
    "Immediate retention offer required!" if churn_prob_val > 0.6 else
    "Monitor and send a personalised offer" if churn_prob_val > 0.3 else
    "Maintain regular engagement"
)
print(f"  Action:      {action}")

11. ⚠️ Edge Cases & Errors {#edge-cases}

Error 1: ConvergenceWarning

Python
# ❌ Warning: Solver did not converge

# ✅ Fix 1: Increase max_iter
model = LogisticRegression(max_iter=10000)

# ✅ Fix 2: Scale the data
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X_train)

# ✅ Fix 3: Try a different solver
model = LogisticRegression(solver='saga', max_iter=1000)

Error 2: UndefinedMetricWarning — Zero Division

Python
# ❌ Warning: Precision is ill-defined (no positive predictions made)

# ✅ Fix: Use zero_division parameter
from sklearn.metrics import precision_score, f1_score

precision = precision_score(y_test, y_pred, zero_division=0)
f1        = f1_score(y_test, y_pred, zero_division=0)

Error 3: Penalty-Solver Incompatibility

Python
# ❌ Error: Solver lbfgs does not support penalty='l1'
model = LogisticRegression(penalty='l1', solver='lbfgs')  # ERROR!

# ✅ Valid combinations:
# penalty='l1'        → solver='liblinear' or 'saga'
# penalty='l2'        → any solver
# penalty='elasticnet'→ solver='saga' only
# penalty=None        → 'lbfgs', 'saga', 'sag', 'newton-cg'

model = LogisticRegression(penalty='l1', solver='liblinear')  # ✅

Debugging Checklist

Python
# 1. Check class distribution
print("Class Distribution:", dict(zip(*np.unique(y, return_counts=True))))

# 2. Verify model classes
print("Model Classes:", model.classes_)

# 3. Probability sanity check
probs = model.predict_proba(X_test)
print("Sum to 1:", np.allclose(probs.sum(axis=1), 1.0))
print("Min prob:", probs.min(), "| Max prob:", probs.max())

# 4. Check for NaN in coefficients
print("Any NaN in coef:", np.isnan(model.coef_).any())

# 5. Train vs test score (overfitting check)
print("Train Score:", model.score(X_train_sc, y_train))
print("Test Score:",  model.score(X_test_sc, y_test))

12. 💡 Pro Tips {#pro-tips}

Python
# ✅ Tip 1: ALWAYS use predict_proba, not just predict
# Probabilities are far more useful for real business decisions
probs = model.predict_proba(X_test)[:, 1]
# Then tune the threshold for your specific use case

# ✅ Tip 2: Use stratified splits
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2  # Maintains class proportions!
)

# ✅ Tip 3: Calibrate probabilities for trustworthy confidence scores
from sklearn.calibration import CalibratedClassifierCV
calibrated = CalibratedClassifierCV(LogisticRegression(), cv=5, method='sigmoid')
calibrated.fit(X_train, y_train)

# ✅ Tip 4: SHAP values for production-grade explainability
import shap
explainer   = shap.LinearExplainer(model, X_train_sc)
shap_values = explainer.shap_values(X_test_sc)
shap.summary_plot(shap_values, X_test_sc, feature_names=feature_cols)

# ✅ Tip 5: Use StratifiedKFold for cross-validation
from sklearn.model_selection import StratifiedKFold, cross_val_score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')

# ✅ Tip 6: Save the pipeline with metadata
import joblib, json
metadata = {
    'model_version':      'v1.0',
    'features':           feature_cols,
    'optimal_threshold':  best_thresh,
    'metrics': {'f1': best_f1, 'auc': float(roc_auc_score(y_test, probs_test))}
}
joblib.dump(final_pipe, 'churn_model.pkl')
with open('model_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

13. ⚖️ Logistic Regression vs Other Classifiers {#comparison}

Aspect              | Logistic Reg | Decision Tree | Random Forest | SVM                    | Neural Net
───────────────────────────────────────────────────────────────────────────────────────────────────────
Interpretability    | ⭐⭐⭐⭐⭐   | ⭐⭐⭐⭐     | ⭐⭐          | ⭐⭐                   | ⭐
Training Speed      | ⚡ Very Fast | ⚡ Fast       | 🐢 Medium     | 🐌 Slow                | 🐌 Very Slow
Non-linear patterns | ❌ (base)    | ✅            | ✅            | ✅ (kernels)           | ✅
Calibrated Probs    | ✅           | ❌            | ✅ (approx)   | ❌ (needs calibration) | ✅ (approx)
Feature Scaling     | Required     | Not needed    | Not needed    | Required               | Required
Overfitting Risk    | Low          | High          | Low           | Medium                 | High
Data Required       | Small OK     | Small OK      | Medium        | Medium                 | Large

When to Choose Logistic Regression?

SQL
✅ Choose Logistic Regression when:
  - A linear decision boundary is sufficient
  - Interpretability is critical (medical, finance, legal)
  - Calibrated probability output is needed
  - Working with a small to medium dataset
  - Building a quick, defensible baseline
  - Regulatory or audit requirements exist

❌ Avoid when:
  - Complex non-linear patterns exist
  - Working with images, text (without engineered features)
  - Features are highly multicollinear (use Ridge instead)
  - Very large sparse data (use SGD-based solvers)

14. 📊 Data Science Perspective {#data-science}

Python
# Logistic Regression in A/B Testing Analysis
import numpy as np
import pandas as pd

np.random.seed(42)
n_per_group = 1000

control   = pd.DataFrame({
    'group': 0, 'converted': np.random.binomial(1, 0.05, n_per_group),
    'session_duration': np.random.exponential(120, n_per_group),
    'device': np.random.choice([0, 1], n_per_group)
})
treatment = pd.DataFrame({
    'group': 1, 'converted': np.random.binomial(1, 0.08, n_per_group),
    'session_duration': np.random.exponential(150, n_per_group),
    'device': np.random.choice([0, 1], n_per_group)
})

data = pd.concat([control, treatment], ignore_index=True)

import statsmodels.api as sm
X_sm     = sm.add_constant(data[['group', 'session_duration', 'device']])
sm_model = sm.Logit(data['converted'], X_sm).fit(disp=0)

print("A/B Test Analysis:")
print(sm_model.summary2().tables[1])

odds_ratio = np.exp(sm_model.params['group'])
print(f"\nTreatment Odds Ratio: {odds_ratio:.4f}")
print(f"Treatment increases the odds of conversion by {(odds_ratio - 1) * 100:.1f}%")
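One subtlety worth stressing: an odds ratio lifts the *odds*, not the conversion rate itself. A quick arithmetic sketch (with hypothetical round numbers, not the fitted values above) shows how to translate an odds ratio into an implied conversion rate:

```python
# Convert an odds ratio into an implied treatment conversion rate.
# Hypothetical numbers for illustration only.
baseline_rate = 0.05
odds_ratio    = 1.65

baseline_odds  = baseline_rate / (1 - baseline_rate)   # p / (1 - p)
treatment_odds = baseline_odds * odds_ratio            # odds ratio acts on odds
treatment_rate = treatment_odds / (1 + treatment_odds) # back to a probability

print(f"Baseline conversion:          {baseline_rate:.1%}")
print(f"Implied treatment conversion: {treatment_rate:.1%}")
```

Note the implied conversion rate rises from 5.0% to roughly 8.0% — a smaller relative jump than the 65% lift in odds.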

15. 🎤 Interview Questions {#interview}

🟢 Basic

Q1: What is Logistic Regression?

A classification algorithm that uses the sigmoid function to convert a linear combination of features into a probability between 0 and 1.

Q2: Why do we use the sigmoid function?

A linear model's output can be any real number. The sigmoid squashes it into the range (0, 1), making it a valid probability.
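The squashing is easy to verify numerically — a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Large negative → near 0, zero → exactly 0.5, large positive → near 1
print(sigmoid(np.array([-10, 0, 10])))
```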

Q3: Precision vs Recall — when to choose which?

Precision: When FP is costly (e.g. spam — don't block legitimate email). Recall: When FN is costly (e.g. cancer detection — never miss a sick patient).

Q4: Explain the Confusion Matrix.

A 2×2 table showing TP, TN, FP, FN. It compares actual vs predicted classes. All classification metrics (Accuracy, Precision, Recall, F1) are derived from it.
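A tiny worked example (toy labels, not from the blog's dataset) shows how every metric in Q3 and Q4 falls out of the same four counts:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy actual vs predicted labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))
```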

Q5: What does the C parameter control?

It is the inverse of regularization strength. Small C → Strong regularization (simpler model). Large C → Weak regularization (more complex model). Default is 1.
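You can see the effect directly on the coefficient magnitudes — a sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Smaller C → stronger regularization → coefficients shrink toward zero
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    print(f"C={C:>6}: mean |coef| = {np.abs(model.coef_).mean():.3f}")
```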

🟡 Intermediate

Q6: What is AUC-ROC?

SQL
ROC Curve: Plots TPR (Recall) vs FPR at various decision thresholds
AUC:       Area Under the ROC Curve
0.5 = Random classifier, 1.0 = Perfect classifier
Preferred over accuracy for comparing models on imbalanced data
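To make this concrete, here is a minimal sketch computing both the curve and the AUC on a synthetic dataset (all data here is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

# One (FPR, TPR) point per candidate decision threshold
fpr, tpr, thresholds = roc_curve(y_te, probs)
print("AUC:", roc_auc_score(y_te, probs))  # 0.5 = random, 1.0 = perfect
```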

Q7: How do you handle class imbalance?

SQL
1. class_weight='balanced'  → Simple and effective
2. SMOTE                    → Generate synthetic minority samples
3. Threshold tuning         → Lower the decision threshold below 0.5
4. Metrics                  → Use F1 / AUC instead of accuracy
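Option 1 is a one-line change, and its effect on minority-class recall is easy to demonstrate — a sketch on a deliberately imbalanced synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# ~95% negatives, ~5% positives
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

plain    = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)

# class_weight='balanced' reweights the loss, trading precision for recall
print("Minority recall (plain):   ", recall_score(y_te, plain.predict(X_te)))
print("Minority recall (balanced):", recall_score(y_te, balanced.predict(X_te)))
```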

Q8: What is the difference between L1 and L2 regularization?

SQL
L1 (penalty='l1'): Sets some coefficients exactly to 0 → Feature selection
L2 (penalty='l2'): Shrinks all coefficients, none to zero → Default
ElasticNet:        Combines both behaviours
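The feature-selection effect of L1 is easy to observe when most features are noise — a minimal sketch (synthetic data, illustrative hyperparameters):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 20 features but only 4 informative → L1 should zero out the noise
X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           random_state=42)

l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.1).fit(X, y)

print("Zero coefs with L1:", np.sum(l1.coef_ == 0))  # sparse → feature selection
print("Zero coefs with L2:", np.sum(l2.coef_ == 0))  # shrunk, but not exactly zero
```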

🔴 Advanced

Q9: When is the Logistic Regression decision boundary non-linear?

The base model always produces a linear boundary. To get a non-linear boundary, add polynomial or interaction features — but the model itself remains linear in its parameters.
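A classic demonstration uses concentric circles, which no straight line can separate — a sketch with synthetic data:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric rings: hopeless for a plain linear boundary
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=42)

linear = LogisticRegression().fit(X, y)
# degree=2 adds x², y², xy — the model is still linear in its parameters
poly = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression()).fit(X, y)

print("Linear only:", linear.score(X, y))  # near chance level
print("With poly:  ", poly.score(X, y))    # near perfect
```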

Q10: Why minimise Log Loss and not MSE for Logistic Regression?

SQL
MSE applied to probabilities: Non-convex → Multiple local minima → Gradient Descent can fail
Log Loss:                     Convex → Single global minimum → Gradient Descent works reliably

Log Loss also severely penalises confident wrong predictions, which is desirable behaviour.
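That penalty behaviour is easy to see numerically — a sketch of the per-sample loss for a true positive as the prediction gets more confidently wrong:

```python
import numpy as np

def log_loss_single(y_true, p):
    # Per-sample log loss: -[y·log(p) + (1 - y)·log(1 - p)]
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# True class = 1; the penalty explodes as confidence in the wrong answer grows
for p in [0.9, 0.5, 0.1, 0.01]:
    print(f"Predicted p={p:<5} → loss = {log_loss_single(1, p):.3f}")
```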

Q11: What is probability calibration?

A model's predicted probabilities should match actual observed frequencies. If it predicts 70% probability, then 70% of those cases should actually be positive. Use CalibratedClassifierCV when calibration matters.
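You can check this directly with `calibration_curve`, which buckets predictions and compares the predicted probability against the observed positive rate per bucket — a sketch on synthetic data:

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

# For a well-calibrated model, the two columns should roughly match
frac_pos, mean_pred = calibration_curve(y_te, probs, n_bins=5)
for fp_, mp_ in zip(frac_pos, mean_pred):
    print(f"Predicted ~{mp_:.2f} → actually positive {fp_:.2f}")
```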

🎭 Scenario-Based

Q12: Your model has 99% accuracy but is completely useless. Why?

The dataset is heavily imbalanced! If 99% of samples belong to class 0 and the model predicts class 0 for everything, it achieves 99% accuracy while identifying zero positives. Fix: Use F1-Score and AUC-ROC as your metrics.
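The trap takes two lines to reproduce — a sketch with a dummy "always predict negative" model:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 990 negatives, 10 positives; the "model" predicts negative for everything
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.99 — looks great
print("F1:      ", f1_score(y_true, y_pred))        # 0.0  — exposes the failure
```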


16. 🎯 Conclusion {#conclusion}

Key Takeaways

| Topic | Key Learning |
| --- | --- |
| What | Classification via sigmoid function |
| Output | Probability (0–1), then apply a threshold |
| Types | Binary, Multiclass (OvR / Multinomial) |
| Key Params | C (inverse regularization), penalty, solver |
| Metrics | Precision, Recall, F1, AUC-ROC |
| Imbalance | class_weight='balanced', SMOTE, threshold tuning |
| Best Practice | Pipeline + Stratified CV + Threshold optimisation |

Series Roadmap

SQL
✅ Part 1: ML Introduction
✅ Part 2: Linear Regression
✅ Part 3: Logistic Regression (This Blog!)
⏳ Part 4: Decision Trees
⏳ Part 5: Random Forests & Ensemble Methods
⏳ Part 6: Support Vector Machines
⏳ Part 7: K-Means Clustering
⏳ Part 8: Neural Networks

Final Advice

"Master Logistic Regression thoroughly. Every classification metric, class imbalance technique, and threshold optimisation concept you learn here carries forward to every advanced model you will ever use."

Do This Now:

  1. ✅ Implement on the Titanic dataset (Kaggle)

  2. ✅ Experiment with different decision thresholds

  3. ✅ Try SMOTE on an imbalanced dataset

  4. ✅ Plot and interpret the ROC and PR curves


Part 3 of ML A to Z | Next: Decision Trees! 🚀
