1. Introduction — What is Logistic Regression?
The name says "Regression" but this is actually a Classification algorithm! Confusing? Let's clear that up with a simple example.
Real-Life Problem
A doctor needs to decide: Does this patient have diabetes or not?
Blood sugar: 180 mg/dL
BMI: 32
Age: 45 years
The doctor calculates a probability: "There's a 78% chance this patient has diabetes." If probability > 50% → Diabetes is diagnosed.
That is exactly what Logistic Regression does!
Logistic Regression = A classification algorithm that predicts the probability that a data point belongs to a particular class.
How Is It Different from Linear Regression?
Linear Regression: Price = $200k, $300k, $1.2M → Continuous numbers
Logistic Regression: Email = Spam(1) / Not Spam(0) → Probability between 0 and 1
Why Is It Called "Regression"?
Logistic Regression runs a regression on the log-odds (logit) of the target — that's where the "regression" name comes from. But since the output is a probability used for classification, it is a classification algorithm in practice.
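You can see this numerically: push a linear score through the sigmoid, then take the logit of the resulting probability, and the linear score comes back unchanged. A quick sketch — the coefficients β₀ = -2.0 and β₁ = 0.5 are made up purely for illustration, not fitted to any data:

```python
import numpy as np

# Illustrative coefficients (not fitted to any data)
beta0, beta1 = -2.0, 0.5

def predict_proba(x):
    z = beta0 + beta1 * x        # linear score
    return 1 / (1 + np.exp(-z))  # sigmoid → probability

for x in [0, 1, 2]:
    p = predict_proba(x)
    log_odds = np.log(p / (1 - p))  # logit recovers the linear score
    print(f"x={x}: p={p:.3f}, log-odds={log_odds:.3f}")
```

The log-odds column is exactly β₀ + β₁x — the "regression" hiding inside the classifier.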
Why Does It Matter?
Simple and fast — works on large datasets
Interpretable — coefficients have clear, real-world meaning
Probability output — "78% chance" is far more actionable than a hard 0/1 label
The go-to baseline — starting point for every classification project
Everywhere in production — Medical, Finance, HR, Marketing
Real-World Applications
| Domain | Problem | Output |
|---|---|---|
| 🏥 Medical | Has disease or not? | 0/1 |
| 💳 Finance | Will customer default? | Probability |
| 📧 Email | Is it spam? | 0/1 |
| 🛒 E-commerce | Will customer buy? | Probability |
| 📱 App | Will user churn? | Risk score |
| 🏦 Banking | Is this fraud? | Flag |
| 🎓 Education | Will student pass? | Probability |
| 🚗 Insurance | Will they file a claim? | Risk |
2. 🧮 The Math — Sigmoid Function {#math}
The Problem with Linear Regression for Classification
Linear Regression: ŷ = β₀ + β₁x
Problem: Output can range from -∞ to +∞!
But a probability must be between 0 and 1.
Example:
Linear Regression might predict: 1.5 or -0.3
→ These are meaningless as probabilities!
Solution: The Sigmoid Function 🧠
Sigmoid Function:
σ(z) = 1 / (1 + e^(-z))
Where:
z = β₀ + β₁x₁ + β₂x₂ + ... (linear combination)
e = Euler's number (2.718...)
σ(z) = output in range (0, 1) — a valid probability!
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(z):
"""Squashes any number into the range (0, 1)"""
return 1 / (1 + np.exp(-z))
z_values = np.linspace(-10, 10, 100)
sigmoid_values = sigmoid(z_values)
print(f"sigmoid(-10) = {sigmoid(-10):.6f}") # ~0 (almost impossible)
print(f"sigmoid(-2) = {sigmoid(-2):.4f}") # 0.12
print(f"sigmoid(0) = {sigmoid(0):.4f}") # 0.5 (50/50)
print(f"sigmoid(2) = {sigmoid(2):.4f}") # 0.88
print(f"sigmoid(10) = {sigmoid(10):.6f}") # ~1 (almost certain)
plt.figure(figsize=(10, 5))
plt.plot(z_values, sigmoid_values, 'b-', linewidth=2)
plt.axhline(y=0.5, color='red', linestyle='--', label='Threshold = 0.5')
plt.axvline(x=0, color='gray', linestyle='--')
plt.xlabel('z (Linear combination of features)')
plt.ylabel('σ(z) (Probability)')
plt.title('Sigmoid Function')
plt.legend()
plt.grid(True)
plt.show()
The Complete Model
Step 1: Compute the linear combination
z = β₀ + β₁x₁ + β₂x₂ + ...
Step 2: Apply the sigmoid
P(y=1 | X) = σ(z) = 1 / (1 + e^(-z))
Step 3: Apply the decision rule
If P ≥ 0.5 → Predict Class 1
If P < 0.5 → Predict Class 0
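The three steps above can be sketched in a few lines of NumPy — the coefficients here are arbitrary, chosen only to show the mechanics:

```python
import numpy as np

beta0 = -0.2                  # intercept β₀ (arbitrary)
beta = np.array([0.8, -0.5])  # weights β₁, β₂ (arbitrary)
x = np.array([1.5, 2.0])      # one sample with two features

z = beta0 + x @ beta          # Step 1: linear combination
p = 1 / (1 + np.exp(-z))      # Step 2: sigmoid → P(y=1 | x)
pred = int(p >= 0.5)          # Step 3: threshold at 0.5
print(f"z = {z:.2f}, P(y=1) = {p:.2f}, predicted class = {pred}")
```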
Log-Odds (Logit) — The Full Picture
Odds = P(event) / P(no event) = p / (1 - p)
Log-Odds = log(p / (1 - p))
Logistic Regression fits a linear model to the log-odds:
log(p / (1-p)) = β₀ + β₁x₁ + β₂x₂
Coefficient β₁ means:
A 1-unit increase in x₁ increases the log-odds by β₁
= The odds are multiplied by exp(β₁)
import numpy as np
# Interpretation example
# Coefficient for blood_sugar = 0.05
coef = 0.05
odds_ratio = np.exp(coef)
print(f"For each 1-unit increase in blood sugar:")
print(f"Odds Ratio: {odds_ratio:.4f}")
print(f"Diabetes odds increase by {(odds_ratio - 1) * 100:.1f}%")
Loss Function — Log Loss (Binary Cross-Entropy)
L = -(1/n) × Σ [yᵢ log(ŷᵢ) + (1 - yᵢ) log(1 - ŷᵢ)]
Where:
yᵢ = Actual class (0 or 1)
ŷᵢ = Predicted probability
Minimising Log Loss → best parameters → best model
from sklearn.metrics import log_loss
import numpy as np
y_true = np.array([1, 0, 1, 1, 0])
y_pred_prob = np.array([0.9, 0.1, 0.8, 0.7, 0.3])
logloss = log_loss(y_true, y_pred_prob)
print(f"Log Loss: {logloss:.4f}")
# Closer to 0 = better model
3. 🌿 Types of Logistic Regression {#types}
Logistic Regression
├── Binary Classification (2 classes: 0/1, Yes/No)
├── Multiclass Classification (3+ classes)
│ ├── One-vs-Rest (OvR)
│ └── Multinomial (Softmax)
└── Multilabel Classification (multiple labels per sample)
Binary Classification
from sklearn.linear_model import LogisticRegression
import numpy as np
# Spam detection example
X = np.array([[0.1, 2], [0.5, 3], [0.9, 5], [0.2, 1], [0.8, 4]])
y = np.array([0, 0, 1, 0, 1]) # 0=Ham, 1=Spam
model = LogisticRegression()
model.fit(X, y)
probs = model.predict_proba(X)
print("Probabilities:")
for i, p in enumerate(probs):
print(f" Sample {i+1}: Ham={p[0]:.3f}, Spam={p[1]:.3f}")
preds = model.predict(X)
print(f"\nPredictions: {preds}")
Multiclass Classification (3+ Classes)
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
model = LogisticRegression(
multi_class='auto', # Automatically picks the best strategy
solver='lbfgs',
max_iter=200,
random_state=42
)
model.fit(X, y)
probs = model.predict_proba(X[:3])
print("Class Probabilities for first 3 flowers:")
print(f"{'Sample':<10} {'Setosa':>10} {'Versicolor':>12} {'Virginica':>10}")
for i, p in enumerate(probs):
print(f"{i+1:<10} {p[0]:>10.3f} {p[1]:>12.3f} {p[2]:>10.3f}")
4. 🎯 The Decision Boundary {#decision-boundary}
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
X, y = make_classification(
n_samples=200, n_features=2, n_redundant=0,
n_clusters_per_class=1, random_state=42
)
model = LogisticRegression()
model.fit(X, y)
h = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='black', linewidth=0.5)
plt.title('Logistic Regression — Decision Boundary')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar()
plt.show()
print(f"Coefficients: {model.coef_[0]}")
print(f"Intercept: {model.intercept_[0]:.4f}")
5. 🔧 All Functions & Parameters {#functions}
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(
# ─── Regularization ───────────────────────────────────
penalty='l2', # 'l1', 'l2', 'elasticnet', None
# l2: Default (Ridge-like behaviour)
# l1: Lasso-like (can zero out features)
# elasticnet: Combined (set l1_ratio too)
# None: No regularization
C=1.0, # Inverse of regularization strength
# C < 1 → Stronger regularization
# C > 1 → Weaker regularization
# Think: C = 1/alpha (opposite convention to Ridge!)
# ─── Solver ────────────────────────────────────────────
solver='lbfgs', # 'lbfgs' : Default, small-medium, L2/None
# 'liblinear' : Small data, L1/L2
# 'saga' : L1/L2/ElasticNet, large sparse data
# 'sag' : Large data, L2 only
# 'newton-cg' : L2/None, medium data
# ─── Multiclass ────────────────────────────────────────
multi_class='auto', # 'auto' : Automatically selects
# 'ovr' : One-vs-Rest
# 'multinomial' : Softmax (true multiclass)
# Note: deprecated since scikit-learn 1.5; newer versions always use multinomial
# ─── Convergence ───────────────────────────────────────
max_iter=100, # Maximum number of solver iterations
tol=1e-4, # Convergence tolerance
# ─── Class Imbalance ───────────────────────────────────
class_weight=None, # None : All classes equal weight
# 'balanced': Auto-weight by class frequency
# {0:1, 1:10}: Custom per-class weights
# ─── ElasticNet only ───────────────────────────────────
l1_ratio=None, # 0 = L2, 1 = L1, 0.5 = equal mix
# ─── Other ─────────────────────────────────────────────
fit_intercept=True,
random_state=None,
n_jobs=None, # -1 = use all CPU cores
warm_start=False
)
Solver Selection Guide
| Solver | Penalties Supported | Dataset Size | Notes |
|---|---|---|---|
| lbfgs | l2, None | Small–Medium | Default, fast |
| liblinear | l1, l2 | Small | Good for small data |
| saga | l1, l2, elasticnet | Large | Great for sparse data |
| sag | l2 | Large | Faster than lbfgs at scale |
| newton-cg | l2, None | Medium | Memory intensive |
Important Methods
model.fit(X_train, y_train) # Train
predictions = model.predict(X_test) # Hard class predictions (0 or 1)
probabilities = model.predict_proba(X_test)
# Returns: [[P(class0), P(class1)], ...] — most useful output!
log_probs = model.predict_log_proba(X_test) # Log-scale probabilities
accuracy = model.score(X_test, y_test)
# Learned attributes
print(model.classes_) # [0, 1] or ['cat', 'dog']
print(model.coef_) # Feature coefficients
print(model.intercept_) # Intercept term
print(model.n_iter_) # Actual iterations used
6. 📊 Evaluation Metrics for Classification {#metrics}
Confusion Matrix — The Foundation
Actual \ Predicted | Positive (1) | Negative (0)
───────────────────────────────────────────────────
Positive (1) | TP | FN
Negative (0) | FP | TN
TP = True Positive (correct: predicted 1, actual 1)
TN = True Negative (correct: predicted 0, actual 0)
FP = False Positive (wrong: predicted 1, actual 0) — Type I Error
FN = False Negative (wrong: predicted 0, actual 1) — Type II Error
from sklearn.metrics import (
confusion_matrix, classification_report,
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, ConfusionMatrixDisplay
)
import matplotlib.pyplot as plt
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.8, 0.4, 0.2, 0.7, 0.6, 0.3, 0.85, 0.15]
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive']).plot(ax=ax, cmap='Blues')
plt.title('Confusion Matrix')
plt.show()
All Key Metrics
# 1. Accuracy — Overall correct predictions
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.4f}")
# = (TP + TN) / Total
# ⚠️ Misleading on imbalanced datasets!
# 2. Precision — Of all predicted positives, how many are actually positive?
precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.4f}")
# = TP / (TP + FP)
# High Precision: "When I say Positive, I'm nearly always right"
# Use: Spam detection (avoid flagging legitimate email)
# 3. Recall (Sensitivity) — Of all actual positives, how many did we catch?
recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.4f}")
# = TP / (TP + FN)
# High Recall: "I don't miss any real positive cases"
# Use: Cancer detection (never miss a sick patient)
# 4. F1 Score — Harmonic mean of Precision and Recall
f1 = f1_score(y_true, y_pred)
print(f"F1 Score: {f1:.4f}")
# = 2 × (Precision × Recall) / (Precision + Recall)
# Best single metric for imbalanced data
# 5. AUC-ROC — Overall discrimination ability of the model
auc = roc_auc_score(y_true, y_prob)
print(f"AUC-ROC: {auc:.4f}")
# 0.5 = random guessing
# 1.0 = perfect classifier
# > 0.8 = good model
# 6. Full Report
print("\n📋 Full Classification Report:")
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))
ROC Curve
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC Curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR / Recall)')
plt.title('ROC Curve')
plt.legend()
plt.grid(True)
plt.show()
Precision-Recall Curve (Better for Imbalanced Data!)
from sklearn.metrics import precision_recall_curve, average_precision_score
precision_arr, recall_arr, _ = precision_recall_curve(y_true, y_prob)
avg_precision = average_precision_score(y_true, y_prob)
plt.figure(figsize=(8, 6))
plt.plot(recall_arr, precision_arr, 'b-', linewidth=2,
label=f'PR Curve (AP = {avg_precision:.4f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)
plt.show()
# On highly imbalanced datasets:
# AUC-ROC can be overly optimistic
# PR Curve gives a more honest picture
Metric Selection Guide
| Scenario | Recommended Metric | Reason |
|---|---|---|
| Balanced classes | Accuracy | Simple and reliable |
| Imbalanced, both errors equal | F1 Score | Balances Precision & Recall |
| FP costly (spam filter) | Precision | Don't block legitimate email |
| FN costly (cancer detection) | Recall | Never miss a sick patient |
| Overall model discrimination | AUC-ROC | Threshold-independent |
| Highly imbalanced data | AUC-PR | More informative than ROC |
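The first two rows of the guide are easy to demonstrate: on imbalanced data, a dummy model that always predicts the majority class scores high on accuracy while being useless — F1 exposes it. A small sketch:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 95 + [1] * 5)   # 95% negatives
y_pred = np.zeros(100, dtype=int)       # "always predict 0" dummy model

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(f"Accuracy: {acc:.2f}")  # 0.95 — looks impressive
print(f"F1 Score: {f1:.2f}")   # 0.00 — the model caught zero positives
```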
7. 🌊 Intermediate Concepts {#intermediate}
Threshold Tuning — 0.5 Is Not Always Optimal!
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import pandas as pd
# Imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2,
weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
# Try different thresholds
thresholds = np.arange(0.1, 0.9, 0.05)
results = []
for thresh in thresholds:
preds = (probs >= thresh).astype(int)
results.append({
'Threshold': thresh,
'Precision': precision_score(y_test, preds, zero_division=0),
'Recall': recall_score(y_test, preds, zero_division=0),
'F1': f1_score(y_test, preds, zero_division=0)
})
results_df = pd.DataFrame(results)
print("Threshold Analysis:")
print(results_df.round(4).to_string(index=False))
# Find the best threshold for F1
best_thresh = results_df.loc[results_df['F1'].idxmax(), 'Threshold']
print(f"\n✅ Best Threshold: {best_thresh:.2f}")
print(f"Best F1 Score: {results_df['F1'].max():.4f}")
Handling Class Imbalance
from sklearn.linear_model import LogisticRegression
# Method 1: class_weight='balanced' — simplest fix
model = LogisticRegression(class_weight='balanced', random_state=42)
# Method 2: SMOTE — generate synthetic minority samples
# pip install imbalanced-learn
import numpy as np
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f"Original distribution: {dict(zip(*np.unique(y_train, return_counts=True)))}")
print(f"After SMOTE: {dict(zip(*np.unique(y_resampled, return_counts=True)))}")
# Method 3: Full Pipeline with SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.preprocessing import StandardScaler
pipeline = ImbPipeline([
('smote', SMOTE(random_state=42)),
('scaler', StandardScaler()),
('model', LogisticRegression(random_state=42))
])
pipeline.fit(X_train, y_train)
8. 🚀 Advanced Concepts {#advanced}
Regularization — The C Parameter in Depth
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
C_values = np.logspace(-4, 4, 20) # 0.0001 to 10000
results = []
for C in C_values:
pipe = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression(C=C, max_iter=1000, random_state=42))
])
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='f1')
results.append({'C': C, 'Mean F1': cv_scores.mean(), 'Std F1': cv_scores.std()})
results_df = pd.DataFrame(results)
best_C = results_df.loc[results_df['Mean F1'].idxmax(), 'C']
print(f"Best C: {best_C:.6f}")
print(f"Best Mean F1: {results_df['Mean F1'].max():.4f}")
LogisticRegressionCV — Automatic C Selection
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
model_cv = LogisticRegressionCV(
Cs=20, # Number of C values to try
cv=5,
scoring='f1',
penalty='l2',
solver='lbfgs',
max_iter=1000,
random_state=42,
n_jobs=-1
)
pipe = Pipeline([('scaler', StandardScaler()), ('model', model_cv)])
pipe.fit(X_train, y_train)
print(f"Best C: {model_cv.C_[0]:.6f}")
print(f"Test F1: {f1_score(y_test, pipe.predict(X_test)):.4f}")
Multi-class Strategies
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
# Strategy 1: One-vs-Rest (OvR)
# Trains one binary classifier per class
ovr_model = LogisticRegression(multi_class='ovr', solver='liblinear', random_state=42)
ovr_model.fit(X_train, y_train)
print(f"OvR Accuracy: {ovr_model.score(X_test, y_test):.4f}")
# Strategy 2: Multinomial (Softmax)
# One model that handles all classes simultaneously
mnm_model = LogisticRegression(multi_class='multinomial', solver='lbfgs',
max_iter=200, random_state=42)
mnm_model.fit(X_train, y_train)
print(f"Multinomial Accuracy: {mnm_model.score(X_test, y_test):.4f}")
# Probabilities sum to 1 for each sample
probs = mnm_model.predict_proba(X_test[:3])
print("\nProbabilities (should sum to 1):")
for p in probs:
print(f" Setosa: {p[0]:.3f} | Versicolor: {p[1]:.3f} | Virginica: {p[2]:.3f} | Sum: {sum(p):.3f}")
9. 🌐 Real-World Use Cases {#real-world}
Use Case 1: Loan Default Prediction
"""
Problem: Will this customer default on their loan?
Business: Bank / Lending Institution
Impact: Risk management, credit scoring
"""
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
np.random.seed(42)
n = 5000
df = pd.DataFrame({
'age': np.random.randint(22, 65, n),
'annual_income': np.random.randint(20000, 200000, n),
'loan_amount': np.random.randint(5000, 200000, n),
'loan_term_months': np.random.choice([12, 24, 36, 48, 60], n),
'credit_score': np.random.randint(300, 850, n),
'employment_type': np.random.choice(['Salaried', 'Self-Employed', 'Business'], n),
'existing_loans': np.random.randint(0, 5, n),
'monthly_expenses': np.random.randint(1000, 10000, n),
})
default_prob = (
- (df['credit_score'] - 600) * 0.005
+ (df['loan_amount'] / df['annual_income']) * 0.3
+ df['existing_loans'] * 0.1
+ (df['monthly_expenses'] / (df['annual_income'] / 12)) * 0.2
+ np.random.normal(0, 0.1, n)
)
df['defaulted'] = (default_prob > 0.3).astype(int)
print(f"Default Rate: {df['defaulted'].mean():.2%}")
num_features = ['age', 'annual_income', 'loan_amount', 'loan_term_months',
'credit_score', 'existing_loans', 'monthly_expenses']
cat_features = ['employment_type']
preprocessor = ColumnTransformer([
('num', StandardScaler(), num_features),
('cat', OneHotEncoder(drop='first'), cat_features)
])
X = df.drop('defaulted', axis=1)
y = df['defaulted']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
pipeline = Pipeline([
('prep', preprocessor),
('model', LogisticRegression(
C=0.1, class_weight='balanced',
solver='lbfgs', max_iter=1000, random_state=42
))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]
print(f"\n📊 Loan Default Model Performance:")
print(classification_report(y_test, y_pred, target_names=['No Default', 'Default']))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
# Score a new applicant
new_applicant = pd.DataFrame({
'age': [35], 'annual_income': [60000], 'loan_amount': [50000],
'loan_term_months': [36], 'credit_score': [650],
'employment_type': ['Salaried'], 'existing_loans': [1],
'monthly_expenses': [3000]
})
risk_prob = pipeline.predict_proba(new_applicant)[0, 1]
risk_category = 'High' if risk_prob > 0.6 else 'Medium' if risk_prob > 0.3 else 'Low'
print(f"\n🏦 Applicant Risk Assessment:")
print(f" Default Probability: {risk_prob:.2%}")
print(f" Risk Category: {risk_category}")
recommendation = 'Reject' if risk_category == 'High' else 'Manual Review' if risk_category == 'Medium' else 'Approve'
print(f" Recommendation: {recommendation}")
10. 💻 Practical Examples — Complete Code {#practical}
Customer Churn Prediction — Full ML Pipeline
# ============================================================
# CUSTOMER CHURN PREDICTION — COMPLETE ML PROJECT
# ============================================================
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
roc_auc_score, classification_report
)
import warnings
warnings.filterwarnings('ignore')
# ── 1. Generate Data ──────────────────────────────────────
np.random.seed(42)
n = 3000
df = pd.DataFrame({
'tenure_months': np.random.randint(1, 72, n),
'monthly_charges': np.random.uniform(20, 120, n),
'total_charges': np.random.uniform(100, 8000, n),
'num_products': np.random.randint(1, 5, n),
'support_calls': np.random.randint(0, 10, n),
'payment_delays': np.random.randint(0, 5, n),
'has_premium': np.random.randint(0, 2, n),
'avg_session_days': np.random.uniform(1, 30, n),
})
churn_prob = (
- df['tenure_months'] * 0.02
+ df['support_calls'] * 0.15
+ df['payment_delays'] * 0.25
- df['num_products'] * 0.1
- df['has_premium'] * 0.3
+ df['monthly_charges'] * 0.005
+ np.random.normal(0, 0.5, n)
)
df['churned'] = (churn_prob > 0).astype(int)
print(f"📊 Dataset: {n} customers")
print(f"📊 Churn Rate: {df['churned'].mean():.2%} ({df['churned'].sum()} churned)")
# ── 2. Features & Target ──────────────────────────────────
feature_cols = [c for c in df.columns if c != 'churned']
X = df[feature_cols]
y = df['churned']
# ── 3. Stratified Train-Test Split ────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"\n🔀 Train: {len(X_train)} | Test: {len(X_test)}")
print(f"Train Churn Rate: {y_train.mean():.2%}")
print(f"Test Churn Rate: {y_test.mean():.2%}")
# ── 4. Compare Models ─────────────────────────────────────
configs = {
'LR (Default)': LogisticRegression(random_state=42, max_iter=1000),
'LR (C=0.1)': LogisticRegression(C=0.1, random_state=42, max_iter=1000),
'LR (C=10)': LogisticRegression(C=10, random_state=42, max_iter=1000),
'LR (Balanced)': LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000),
'LR (L1)': LogisticRegression(penalty='l1', solver='liblinear', random_state=42),
}
print("\n📊 Model Comparison (5-fold CV):")
print(f"{'Model':<25} {'Accuracy':>10} {'F1':>10} {'AUC':>10}")
print("-" * 60)
best_model = None
best_score = 0
for name, m in configs.items():
pipe = Pipeline([('scaler', StandardScaler()), ('model', m)])
acc = cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy').mean()
f1 = cross_val_score(pipe, X_train, y_train, cv=5, scoring='f1').mean()
auc = cross_val_score(pipe, X_train, y_train, cv=5, scoring='roc_auc').mean()
print(f"{name:<25} {acc:>10.4f} {f1:>10.4f} {auc:>10.4f}")
if f1 > best_score:
best_score = f1
best_model = (name, m)
print(f"\n🏆 Best Model: {best_model[0]}")
# ── 5. Train Final Model ──────────────────────────────────
final_pipe = Pipeline([('scaler', StandardScaler()), ('model', best_model[1])])
final_pipe.fit(X_train, y_train)
# ── 6. Optimise Threshold ─────────────────────────────────
probs_test = final_pipe.predict_proba(X_test)[:, 1]
best_thresh = 0.5
best_f1 = 0
for thresh in np.arange(0.2, 0.8, 0.05):
preds = (probs_test >= thresh).astype(int)
f1 = f1_score(y_test, preds, zero_division=0)
if f1 > best_f1:
best_f1 = f1
best_thresh = thresh
print(f"🎯 Optimal Threshold: {best_thresh:.2f}")
y_pred_final = (probs_test >= best_thresh).astype(int)
# ── 7. Final Evaluation ───────────────────────────────────
print("\n📊 Final Model Performance:")
print(f" Accuracy: {accuracy_score(y_test, y_pred_final):.4f}")
print(f" Precision: {precision_score(y_test, y_pred_final):.4f}")
print(f" Recall: {recall_score(y_test, y_pred_final):.4f}")
print(f" F1 Score: {f1_score(y_test, y_pred_final):.4f}")
print(f" AUC-ROC: {roc_auc_score(y_test, probs_test):.4f}")
print("\n📋 Classification Report:")
print(classification_report(y_test, y_pred_final, target_names=['Retained', 'Churned']))
# ── 8. Feature Importance ────────────────────────────────
lr_model = final_pipe.named_steps['model']
coef_df = pd.DataFrame({
'Feature': feature_cols,
'Coefficient': lr_model.coef_[0],
'Odds_Ratio': np.exp(lr_model.coef_[0])
}).sort_values('Coefficient', key=abs, ascending=False)
print("\n🔑 Feature Importance:")
print(coef_df.to_string(index=False))
# ── 9. Score a New Customer ──────────────────────────────
new_customer = pd.DataFrame({
'tenure_months': [6], 'monthly_charges': [85], 'total_charges': [510],
'num_products': [1], 'support_calls': [5], 'payment_delays': [2],
'has_premium': [0], 'avg_session_days': [5],
})
churn_prob_val = final_pipe.predict_proba(new_customer)[0, 1]
risk_level = 'High 🔴' if churn_prob_val > 0.6 else 'Medium 🟡' if churn_prob_val > 0.3 else 'Low 🟢'
print(f"\n⚠️ Customer Churn Risk:")
print(f" Probability: {churn_prob_val:.2%}")
print(f" Risk Level: {risk_level}")
action = (
"Immediate retention offer required!" if churn_prob_val > 0.6 else
"Monitor and send a personalised offer" if churn_prob_val > 0.3 else
"Maintain regular engagement"
)
print(f" Action: {action}")
11. ⚠️ Edge Cases & Errors {#edge-cases}
Error 1: ConvergenceWarning
# ❌ Warning: Solver did not converge
# ✅ Fix 1: Increase max_iter
model = LogisticRegression(max_iter=10000)
# ✅ Fix 2: Scale the data
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X_train)
# ✅ Fix 3: Try a different solver
model = LogisticRegression(solver='saga', max_iter=1000)
Error 2: UndefinedMetricWarning — Zero Division
# ❌ Warning: Precision is ill-defined (no positive predictions made)
# ✅ Fix: Use zero_division parameter
from sklearn.metrics import precision_score, f1_score
precision = precision_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
Error 3: Penalty-Solver Incompatibility
# ❌ Error: Solver lbfgs does not support penalty='l1'
model = LogisticRegression(penalty='l1', solver='lbfgs') # ERROR!
# ✅ Valid combinations:
# penalty='l1' → solver='liblinear' or 'saga'
# penalty='l2' → any solver
# penalty='elasticnet'→ solver='saga' only
# penalty=None → 'lbfgs', 'saga', 'sag', 'newton-cg'
model = LogisticRegression(penalty='l1', solver='liblinear') # ✅
Debugging Checklist
# 1. Check class distribution
print("Class Distribution:", dict(zip(*np.unique(y, return_counts=True))))
# 2. Verify model classes
print("Model Classes:", model.classes_)
# 3. Probability sanity check
probs = model.predict_proba(X_test)
print("Sum to 1:", np.allclose(probs.sum(axis=1), 1.0))
print("Min prob:", probs.min(), "| Max prob:", probs.max())
# 4. Check for NaN in coefficients
print("Any NaN in coef:", np.isnan(model.coef_).any())
# 5. Train vs test score (overfitting check)
print("Train Score:", model.score(X_train_sc, y_train))
print("Test Score:", model.score(X_test_sc, y_test))
12. 💡 Pro Tips {#pro-tips}
# ✅ Tip 1: ALWAYS use predict_proba, not just predict
# Probabilities are far more useful for real business decisions
probs = model.predict_proba(X_test)[:, 1]
# Then tune the threshold for your specific use case
# ✅ Tip 2: Use stratified splits
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, stratify=y, test_size=0.2 # Maintains class proportions!
)
# ✅ Tip 3: Calibrate probabilities for trustworthy confidence scores
from sklearn.calibration import CalibratedClassifierCV
calibrated = CalibratedClassifierCV(LogisticRegression(), cv=5, method='sigmoid')
calibrated.fit(X_train, y_train)
# ✅ Tip 4: SHAP values for production-grade explainability
import shap
explainer = shap.LinearExplainer(model, X_train_sc)
shap_values = explainer.shap_values(X_test_sc)
shap.summary_plot(shap_values, X_test, feature_names=feature_cols)
# ✅ Tip 5: Use StratifiedKFold for cross-validation
from sklearn.model_selection import StratifiedKFold, cross_val_score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
# ✅ Tip 6: Save the pipeline with metadata
import joblib, json
metadata = {
'model_version': 'v1.0',
'features': feature_cols,
'optimal_threshold': best_thresh,
'metrics': {'f1': best_f1, 'auc': float(roc_auc_score(y_test, probs_test))}
}
joblib.dump(final_pipe, 'churn_model.pkl')
with open('model_metadata.json', 'w') as f:
json.dump(metadata, f, indent=2)
13. ⚖️ Logistic Regression vs Other Classifiers {#comparison}
| Aspect | Logistic Reg | Decision Tree | Random Forest | SVM | Neural Net |
|---|---|---|---|---|---|
| Interpretability | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐ | ⭐ |
| Training Speed | ⚡ Very Fast | ⚡ Fast | 🐢 Medium | 🐌 Slow | 🐌 Very Slow |
| Non-linear patterns | ❌ (base) | ✅ | ✅ | ✅ | ✅ |
| Calibrated Probs | ✅ | ✅ (approx) | ✅ (approx) | ❌ (needs calibration) | ✅ |
| Feature Scaling | Required | Not needed | Not needed | Required | Required |
| Overfitting Risk | Low | High | Low | Medium | High |
| Data Required | Small OK | Small OK | Medium | Medium | Large |
When to Choose Logistic Regression?
✅ Choose Logistic Regression when:
- A linear decision boundary is sufficient
- Interpretability is critical (medical, finance, legal)
- Calibrated probability output is needed
- Working with a small to medium dataset
- Building a quick, defensible baseline
- Regulatory or audit requirements exist
❌ Avoid when:
- Complex non-linear patterns exist
- Working with images, text (without engineered features)
- Features are highly multicollinear (use Ridge instead)
- Very large sparse data (use SGD-based solvers)
14. 📊 Data Science Perspective {#data-science}
# Logistic Regression in A/B Testing Analysis
import scipy.stats as stats
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
np.random.seed(42)
n_per_group = 1000
control = pd.DataFrame({
'group': 0, 'converted': np.random.binomial(1, 0.05, n_per_group),
'session_duration': np.random.exponential(120, n_per_group),
'device': np.random.choice([0, 1], n_per_group)
})
treatment = pd.DataFrame({
'group': 1, 'converted': np.random.binomial(1, 0.08, n_per_group),
'session_duration': np.random.exponential(150, n_per_group),
'device': np.random.choice([0, 1], n_per_group)
})
data = pd.concat([control, treatment], ignore_index=True)
import statsmodels.api as sm
X_sm = sm.add_constant(data[['group', 'session_duration', 'device']])
sm_model = sm.Logit(data['converted'], X_sm).fit(disp=0)
print("A/B Test Analysis:")
print(sm_model.summary2().tables[1])
odds_ratio = np.exp(sm_model.params['group'])
print(f"\nTreatment Odds Ratio: {odds_ratio:.4f}")
print(f"Treatment lifts conversion by {(odds_ratio - 1) * 100:.1f}%")
15. 🎤 Interview Questions {#interview}
🟢 Basic
Q1: What is Logistic Regression?
A classification algorithm that uses the sigmoid function to convert a linear combination of features into a probability between 0 and 1.
Q2: Why do we use the sigmoid function?
A linear model's output can be any real number. The sigmoid squashes it into the range (0, 1), making it a valid probability.
Q3: Precision vs Recall — when to choose which?
Precision: When FP is costly (e.g. spam — don't block legitimate email). Recall: When FN is costly (e.g. cancer detection — never miss a sick patient).
Q4: Explain the Confusion Matrix.
A 2×2 table showing TP, TN, FP, FN. It compares actual vs predicted classes. All classification metrics (Accuracy, Precision, Recall, F1) are derived from it.
Q5: What does the C parameter control?
It is the inverse of regularization strength. Small C → Strong regularization (simpler model). Large C → Weak regularization (more complex model). Default is 1.
🟡 Intermediate
Q6: What is AUC-ROC?
ROC Curve: Plots TPR (Recall) vs FPR at various decision thresholds
AUC: Area Under the ROC Curve
0.5 = Random classifier, 1.0 = Perfect classifier
Preferred over accuracy for comparing models on imbalanced data
Q7: How do you handle class imbalance?
1. class_weight='balanced' → Simple and effective
2. SMOTE → Generate synthetic minority samples
3. Threshold tuning → Lower the decision threshold below 0.5
4. Metrics → Use F1 / AUC instead of accuracy
Q8: What is the difference between L1 and L2 regularization?
L1 (penalty='l1'): Sets some coefficients exactly to 0 → Feature selection
L2 (penalty='l2'): Shrinks all coefficients, none to zero → Default
ElasticNet: Combines both behaviours
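The sparsity difference is easy to see on data with many uninformative features — the dataset below is synthetic, and C=0.1 was picked just to make the effect visible:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=42)
X = StandardScaler().fit_transform(X)

l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
l2 = LogisticRegression(penalty='l2', C=0.1).fit(X, y)

print(f"L1 coefficients set to zero: {np.sum(l1.coef_[0] == 0)} / 20")
print(f"L2 coefficients set to zero: {np.sum(l2.coef_[0] == 0)} / 20")
```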
🔴 Advanced
Q9: When is the Logistic Regression decision boundary non-linear?
The base model always produces a linear boundary. To get a non-linear boundary, add polynomial or interaction features — but the model itself remains linear in its parameters.
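A concrete demonstration: on concentric circles no straight line separates the classes, but adding degree-2 polynomial features (x₁², x₂², x₁x₂) lets the same linear-in-parameters model draw a circular boundary:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=42)

linear = LogisticRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LogisticRegression()).fit(X, y)

print(f"Raw features accuracy: {linear.score(X, y):.3f}")       # near chance
print(f"Polynomial features accuracy: {poly.score(X, y):.3f}")  # near perfect
```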
Q10: Why minimise Log Loss and not MSE for Logistic Regression?
MSE applied to probabilities: Non-convex → Multiple local minima → Gradient Descent can fail
Log Loss: Convex → Single global minimum → Gradient Descent works reliably
Log Loss also severely penalises confident wrong predictions, which is desirable behaviour.
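The "severe penalty" claim is visible in the numbers — for a sample with y=1 the loss is -log(ŷ), which explodes as the predicted probability approaches 0:

```python
import numpy as np

# Log loss for a single sample with y=1 is -log(p_hat)
for p_hat in [0.9, 0.5, 0.1, 0.01]:
    print(f"y=1, predicted p={p_hat:<5} → log loss = {-np.log(p_hat):.3f}")
```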
Q11: What is probability calibration?
A model's predicted probabilities should match actual observed frequencies. If it predicts 70% probability, then 70% of those cases should actually be positive. Use CalibratedClassifierCV when calibration matters.
🎭 Scenario-Based
Q12: Your model has 99% accuracy but is completely useless. Why?
The dataset is heavily imbalanced! If 99% of samples belong to class 0 and the model predicts class 0 for everything, it achieves 99% accuracy while identifying zero positives. Fix: Use F1-Score and AUC-ROC as your metrics.
16. 🎯 Conclusion {#conclusion}
Key Takeaways
| Topic | Key Learning |
|---|---|
| What | Classification via sigmoid function |
| Output | Probability (0–1), then apply a threshold |
| Types | Binary, Multiclass (OvR / Multinomial) |
| Key Params | C (inverse regularization), penalty, solver |
| Metrics | Precision, Recall, F1, AUC-ROC |
| Imbalance | class_weight='balanced', SMOTE, threshold tuning |
| Best Practice | Pipeline + Stratified CV + Threshold optimisation |
Series Roadmap
✅ Part 1: ML Introduction
✅ Part 2: Linear Regression
✅ Part 3: Logistic Regression (This Blog!)
⏳ Part 4: Decision Trees
⏳ Part 5: Random Forests & Ensemble Methods
⏳ Part 6: Support Vector Machines
⏳ Part 7: K-Means Clustering
⏳ Part 8: Neural Networks
Final Advice
"Master Logistic Regression thoroughly. Every classification metric, class imbalance technique, and threshold optimisation concept you learn here carries forward to every advanced model you will ever use."
Do This Now:
✅ Implement on the Titanic dataset (Kaggle)
✅ Experiment with different decision thresholds
✅ Try SMOTE on an imbalanced dataset
✅ Plot and interpret the ROC and PR curves
Part 3 of ML A to Z | Next: Decision Trees! 🚀