1. Introduction — What is Logistic Regression?
The name says "Regression" but this is actually a Classification algorithm! Confusing? Let's clear that up with a simple example.
Real-Life Problem
A doctor needs to decide: Does this patient have diabetes or not?
Blood sugar: 180 mg/dL
BMI: 32
Age: 45 years
The doctor calculates a probability: "There's a 78% chance this patient has diabetes." If probability > 50% → Diabetes is diagnosed.
That is exactly what Logistic Regression does!
Logistic Regression = A classification algorithm that predicts the probability that a data point belongs to a particular class.
How Is It Different from Linear Regression?
Linear Regression: Price = $200k, $300k, $1.2M → Continuous numbers
Logistic Regression: Email = Spam(1) / Not Spam(0) → Probability between 0 and 1
Why Is It Called "Regression"?
Logistic Regression runs a regression on the log-odds (logit) of the target — that's where the "regression" name comes from. But since the output is a probability used for classification, it is a classification algorithm in practice.
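You can see this numerically: push a linear score through the sigmoid, then take the logit of the resulting probability, and the linear score comes back unchanged. A quick sketch — the coefficients β₀ = -2.0 and β₁ = 0.5 are made up purely for illustration, not fitted to any data:

```python
import numpy as np

# Illustrative coefficients (not fitted to any data)
beta0, beta1 = -2.0, 0.5

def predict_proba(x):
    z = beta0 + beta1 * x        # linear score
    return 1 / (1 + np.exp(-z))  # sigmoid → probability

for x in [0, 1, 2]:
    p = predict_proba(x)
    log_odds = np.log(p / (1 - p))  # logit recovers the linear score
    print(f"x={x}: p={p:.3f}, log-odds={log_odds:.3f}")
```

The log-odds column is exactly β₀ + β₁x — the "regression" hiding inside the classifier.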
Why Does It Matter?
Simple and fast — works on large datasets
Interpretable — coefficients have clear, real-world meaning
Probability output — "78% chance" is far more actionable than a hard 0/1 label
The go-to baseline — starting point for every classification project
Everywhere in production — Medical, Finance, HR, Marketing
Real-World Applications
| Domain | Problem | Output |
|---|---|---|
| 🏥 Medical | Has disease or not? | 0/1 |
| 💳 Finance | Will customer default? | Probability |
| 📧 Email | Is it spam? | 0/1 |
| 🛒 E-commerce | Will customer buy? | Probability |
| 📱 App | Will user churn? | Risk score |
| 🏦 Banking | Is this fraud? | Flag |
| 🎓 Education | Will student pass? | Probability |
| 🚗 Insurance | Will they file a claim? | Risk |
2. 🧮 The Math — Sigmoid Function {#math}
The Problem with Linear Regression for Classification
Linear Regression: ŷ = β₀ + β₁x
Problem: Output can range from -∞ to +∞!
But a probability must be between 0 and 1.
Example:
Linear Regression might predict: 1.5 or -0.3
→ These are meaningless as probabilities!
Solution: The Sigmoid Function 🧠
Sigmoid Function:
σ(z) = 1 / (1 + e^(-z))
Where:
z = β₀ + β₁x₁ + β₂x₂ + ... (linear combination)
e = Euler's number (2.718...)
σ(z) = output in range (0, 1) — a valid probability!
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(z):
"""Squashes any number into the range (0, 1)"""
return 1 / (1 + np.exp(-z))
z_values = np.linspace(-10, 10, 100)
sigmoid_values = sigmoid(z_values)
print(f"sigmoid(-10) = {sigmoid(-10):.6f}") # ~0 (almost impossible)
print(f"sigmoid(-2) = {sigmoid(-2):.4f}") # 0.12
print(f"sigmoid(0) = {sigmoid(0):.4f}") # 0.5 (50/50)
print(f"sigmoid(2) = {sigmoid(2):.4f}") # 0.88
print(f"sigmoid(10) = {sigmoid(10):.6f}") # ~1 (almost certain)
plt.figure(figsize=(10, 5))
plt.plot(z_values, sigmoid_values, 'b-', linewidth=2)
plt.axhline(y=0.5, color='red', linestyle='--', label='Threshold = 0.5')
plt.axvline(x=0, color='gray', linestyle='--')
plt.xlabel('z (Linear combination of features)')
plt.ylabel('σ(z) (Probability)')
plt.title('Sigmoid Function')
plt.legend()
plt.grid(True)
plt.show()
The Complete Model
Step 1: Compute the linear combination
z = β₀ + β₁x₁ + β₂x₂ + ...
Step 2: Apply the sigmoid
P(y=1 | X) = σ(z) = 1 / (1 + e^(-z))
Step 3: Apply the decision rule
If P ≥ 0.5 → Predict Class 1
If P < 0.5 → Predict Class 0
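The three steps above can be sketched in a few lines of NumPy — the coefficients here are arbitrary, chosen only to show the mechanics:

```python
import numpy as np

beta0 = -0.2                  # intercept β₀ (arbitrary)
beta = np.array([0.8, -0.5])  # weights β₁, β₂ (arbitrary)
x = np.array([1.5, 2.0])      # one sample with two features

z = beta0 + x @ beta          # Step 1: linear combination
p = 1 / (1 + np.exp(-z))      # Step 2: sigmoid → P(y=1 | x)
pred = int(p >= 0.5)          # Step 3: threshold at 0.5
print(f"z = {z:.2f}, P(y=1) = {p:.2f}, predicted class = {pred}")
```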
Log-Odds (Logit) — The Full Picture
Odds = P(event) / P(no event) = p / (1 - p)
Log-Odds = log(p / (1 - p))
Logistic Regression fits a linear model to the log-odds:
log(p / (1-p)) = β₀ + β₁x₁ + β₂x₂
Coefficient β₁ means:
A 1-unit increase in x₁ increases the log-odds by β₁
= The odds are multiplied by exp(β₁)
import numpy as np
# Interpretation example
# Coefficient for blood_sugar = 0.05
coef = 0.05
odds_ratio = np.exp(coef)
print(f"For each 1-unit increase in blood sugar:")
print(f"Odds Ratio: {odds_ratio:.4f}")
print(f"Diabetes odds increase by {(odds_ratio - 1) * 100:.1f}%")
Loss Function — Log Loss (Binary Cross-Entropy)
L = -(1/n) × Σ [yᵢ log(ŷᵢ) + (1 - yᵢ) log(1 - ŷᵢ)]
Where:
yᵢ = Actual class (0 or 1)
ŷᵢ = Predicted probability
Minimising Log Loss → best parameters → best model
from sklearn.metrics import log_loss
import numpy as np
y_true = np.array([1, 0, 1, 1, 0])
y_pred_prob = np.array([0.9, 0.1, 0.8, 0.7, 0.3])
logloss = log_loss(y_true, y_pred_prob)
print(f"Log Loss: {logloss:.4f}")
# Closer to 0 = better model
3. 🌿 Types of Logistic Regression {#types}
Logistic Regression
├── Binary Classification (2 classes: 0/1, Yes/No)
├── Multiclass Classification (3+ classes)
│ ├── One-vs-Rest (OvR)
│ └── Multinomial (Softmax)
└── Multilabel Classification (multiple labels per sample)
Binary Classification
from sklearn.linear_model import LogisticRegression
import numpy as np
# Spam detection example
X = np.array([[0.1, 2], [0.5, 3], [0.9, 5], [0.2, 1], [0.8, 4]])
y = np.array([0, 0, 1, 0, 1]) # 0=Ham, 1=Spam
model = LogisticRegression()
model.fit(X, y)
probs = model.predict_proba(X)
print("Probabilities:")
for i, p in enumerate(probs):
print(f" Sample {i+1}: Ham={p[0]:.3f}, Spam={p[1]:.3f}")
preds = model.predict(X)
print(f"\nPredictions: {preds}")
Multiclass Classification (3+ Classes)
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
model = LogisticRegression(
multi_class='auto', # Automatically picks the best strategy
solver='lbfgs',
max_iter=200,
random_state=42
)
model.fit(X, y)
probs = model.predict_proba(X[:3])
print("Class Probabilities for first 3 flowers:")
print(f"{'Sample':<10} {'Setosa':>10} {'Versicolor':>12} {'Virginica':>10}")
for i, p in enumerate(probs):
print(f"{i+1:<10} {p[0]:>10.3f} {p[1]:>12.3f} {p[2]:>10.3f}")
4. 🎯 The Decision Boundary {#decision-boundary}
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
X, y = make_classification(
n_samples=200, n_features=2, n_redundant=0,
n_clusters_per_class=1, random_state=42
)
model = LogisticRegression()
model.fit(X, y)
h = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='black', linewidth=0.5)
plt.title('Logistic Regression — Decision Boundary')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar()
plt.show()
print(f"Coefficients: {model.coef_[0]}")
print(f"Intercept: {model.intercept_[0]:.4f}")
5. 🔧 All Functions & Parameters {#functions}
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(
# ─── Regularization ───────────────────────────────────
penalty='l2', # 'l1', 'l2', 'elasticnet', None
# l2: Default (Ridge-like behaviour)
# l1: Lasso-like (can zero out features)
# elasticnet: Combined (set l1_ratio too)
# None: No regularization
C=1.0, # Inverse of regularization strength
# C < 1 → Stronger regularization
# C > 1 → Weaker regularization
# Think: C = 1/alpha (opposite convention to Ridge!)
# ─── Solver ────────────────────────────────────────────
solver='lbfgs', # 'lbfgs' : Default, small-medium, L2/None
# 'liblinear' : Small data, L1/L2
# 'saga' : L1/L2/ElasticNet, large sparse data
# 'sag' : Large data, L2 only
# 'newton-cg' : L2/None, medium data
# ─── Multiclass ────────────────────────────────────────
multi_class='auto', # 'auto' : Automatically selects
# 'ovr' : One-vs-Rest
# 'multinomial' : Softmax (true multiclass)
# Note: deprecated since scikit-learn 1.5; newer versions always use multinomial
# ─── Convergence ───────────────────────────────────────
max_iter=100, # Maximum number of solver iterations
tol=1e-4, # Convergence tolerance
# ─── Class Imbalance ───────────────────────────────────
class_weight=None, # None : All classes equal weight
# 'balanced': Auto-weight by class frequency
# {0:1, 1:10}: Custom per-class weights
# ─── ElasticNet only ───────────────────────────────────
l1_ratio=None, # 0 = L2, 1 = L1, 0.5 = equal mix
# ─── Other ─────────────────────────────────────────────
fit_intercept=True,
random_state=None,
n_jobs=None, # -1 = use all CPU cores
warm_start=False
)
Solver Selection Guide
| Solver | Penalties Supported | Dataset Size | Notes |
|---|---|---|---|
| lbfgs | l2, None | Small–Medium | Default, fast |
| liblinear | l1, l2 | Small | Good for small data |
| saga | l1, l2, elasticnet | Large | Great for sparse data |
| sag | l2 | Large | Faster than lbfgs at scale |
| newton-cg | l2, None | Medium | Memory intensive |
Important Methods
model.fit(X_train, y_train) # Train
predictions = model.predict(X_test) # Hard class predictions (0 or 1)
probabilities = model.predict_proba(X_test)
# Returns: [[P(class0), P(class1)], ...] — most useful output!
log_probs = model.predict_log_proba(X_test) # Log-scale probabilities
accuracy = model.score(X_test, y_test)
# Learned attributes
print(model.classes_) # [0, 1] or ['cat', 'dog']
print(model.coef_) # Feature coefficients
print(model.intercept_) # Intercept term
print(model.n_iter_) # Actual iterations used
6. 📊 Evaluation Metrics for Classification {#metrics}
Confusion Matrix — The Foundation
Actual \ Predicted | Positive (1) | Negative (0)
───────────────────────────────────────────────────
Positive (1) | TP | FN
Negative (0) | FP | TN
TP = True Positive (correct: predicted 1, actual 1)
TN = True Negative (correct: predicted 0, actual 0)
FP = False Positive (wrong: predicted 1, actual 0) — Type I Error
FN = False Negative (wrong: predicted 0, actual 1) — Type II Error
from sklearn.metrics import (
confusion_matrix, classification_report,
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, ConfusionMatrixDisplay
)
import matplotlib.pyplot as plt
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.8, 0.4, 0.2, 0.7, 0.6, 0.3, 0.85, 0.15]
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
fig, ax = plt.subplots(figsize=(6, 5))
ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive']).plot(ax=ax, cmap='Blues')
plt.title('Confusion Matrix')
plt.show()
All Key Metrics
# 1. Accuracy — Overall correct predictions
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.4f}")
# = (TP + TN) / Total
# ⚠️ Misleading on imbalanced datasets!
# 2. Precision — Of all predicted positives, how many are actually positive?
precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.4f}")
# = TP / (TP + FP)
# High Precision: "When I say Positive, I'm nearly always right"
# Use: Spam detection (avoid flagging legitimate email)
# 3. Recall (Sensitivity) — Of all actual positives, how many did we catch?
recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.4f}")
# = TP / (TP + FN)
# High Recall: "I don't miss any real positive cases"
# Use: Cancer detection (never miss a sick patient)
# 4. F1 Score — Harmonic mean of Precision and Recall
f1 = f1_score(y_true, y_pred)
print(f"F1 Score: {f1:.4f}")
# = 2 × (Precision × Recall) / (Precision + Recall)
# Best single metric for imbalanced data
# 5. AUC-ROC — Overall discrimination ability of the model
auc = roc_auc_score(y_true, y_prob)
print(f"AUC-ROC: {auc:.4f}")
# 0.5 = random guessing
# 1.0 = perfect classifier
# > 0.8 = good model
# 6. Full Report
print("\n📋 Full Classification Report:")
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))
ROC Curve
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC Curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR / Recall)')
plt.title('ROC Curve')
plt.legend()
plt.grid(True)
plt.show()
Precision-Recall Curve (Better for Imbalanced Data!)
from sklearn.metrics import precision_recall_curve, average_precision_score
precision_arr, recall_arr, _ = precision_recall_curve(y_true, y_prob)
avg_precision = average_precision_score(y_true, y_prob)
plt.figure(figsize=(8, 6))
plt.plot(recall_arr, precision_arr, 'b-', linewidth=2,
label=f'PR Curve (AP = {avg_precision:.4f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)
plt.show()
# On highly imbalanced datasets:
# AUC-ROC can be overly optimistic
# PR Curve gives a more honest picture
Metric Selection Guide
| Scenario | Recommended Metric | Reason |
|---|---|---|
| Balanced classes | Accuracy | Simple and reliable |
| Imbalanced, both errors equal | F1 Score | Balances Precision & Recall |
| FP costly (spam filter) | Precision | Don't block legitimate email |
| FN costly (cancer detection) | Recall | Never miss a sick patient |
| Overall model discrimination | AUC-ROC | Threshold-independent |
| Highly imbalanced data | AUC-PR | More informative than ROC |
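The first two rows of the guide are easy to demonstrate: on imbalanced data, a dummy model that always predicts the majority class scores high on accuracy while being useless — F1 exposes it. A small sketch:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 95 + [1] * 5)   # 95% negatives
y_pred = np.zeros(100, dtype=int)       # "always predict 0" dummy model

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(f"Accuracy: {acc:.2f}")  # 0.95 — looks impressive
print(f"F1 Score: {f1:.2f}")   # 0.00 — the model caught zero positives
```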
7. 🌊 Intermediate Concepts {#intermediate}
Threshold Tuning — 0.5 Is Not Always Optimal!
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import pandas as pd
# Imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2,
weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
# Try different thresholds
thresholds = np.arange(0.1, 0.9, 0.05)
results = []
for thresh in thresholds:
preds = (probs >= thresh).astype(int)
results.append({
'Threshold': thresh,
'Precision': precision_score(y_test, preds, zero_division=0),
'Recall': recall_score(y_test, preds, zero_division=0),
'F1': f1_score(y_test, preds, zero_division=0)
})
results_df = pd.DataFrame(results)
print("Threshold Analysis:")
print(results_df.round(4).to_string(index=False))
# Find the best threshold for F1
best_thresh = results_df.loc[results_df['F1'].idxmax(), 'Threshold']
print(f"\n✅ Best Threshold: {best_thresh:.2f}")
print(f"Best F1 Score: {results_df['F1'].max():.4f}")
Handling Class Imbalance
from sklearn.linear_model import LogisticRegression
# Method 1: class_weight='balanced' — simplest fix
model = LogisticRegression(class_weight='balanced', random_state=42)
# Method 2: SMOTE — generate synthetic minority samples
# pip install imbalanced-learn
import numpy as np
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f"Original distribution: {dict(zip(*np.unique(y_train, return_counts=True)))}")
print(f"After SMOTE: {dict(zip(*np.unique(y_resampled, return_counts=True)))}")
# Method 3: Full Pipeline with SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.preprocessing import StandardScaler
pipeline = ImbPipeline([
('smote', SMOTE(random_state=42)),
('scaler', StandardScaler()),
('model', LogisticRegression(random_state=42))
])
pipeline.fit(X_train, y_train)
8. 🚀 Advanced Concepts {#advanced}
Regularization — The C Parameter in Depth
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
C_values = np.logspace(-4, 4, 20) # 0.0001 to 10000
results = []
for C in C_values:
pipe = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression(C=C, max_iter=1000, random_state=42))
])
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='f1')
results.append({'C': C, 'Mean F1': cv_scores.mean(), 'Std F1': cv_scores.std()})
results_df = pd.DataFrame(results)
best_C = results_df.loc[results_df['Mean F1'].idxmax(), 'C']
print(f"Best C: {best_C:.6f}")
print(f"Best Mean F1: {results_df['Mean F1'].max():.4f}")
LogisticRegressionCV — Automatic C Selection
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
model_cv = LogisticRegressionCV(
Cs=20, # Number of C values to try
cv=5,
scoring='f1',
penalty='l2',
solver='lbfgs',
max_iter=1000,
random_state=42,
n_jobs=-1
)
pipe = Pipeline([('scaler', StandardScaler()), ('model', model_cv)])
pipe.fit(X_train, y_train)
print(f"Best C: {model_cv.C_[0]:.6f}")
print(f"Test F1: {f1_score(y_test, pipe.predict(X_test)):.4f}")
Multi-class Strategies
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
# Strategy 1: One-vs-Rest (OvR)
# Trains one binary classifier per class
ovr_model = LogisticRegression(multi_class='ovr', solver='liblinear', random_state=42)
ovr_model.fit(X_train, y_train)
print(f"OvR Accuracy: {ovr_model.score(X_test, y_test):.4f}")
# Strategy 2: Multinomial (Softmax)
# One model that handles all classes simultaneously
mnm_model = LogisticRegression(multi_class='multinomial', solver='lbfgs',
max_iter=200, random_state=42)
mnm_model.fit(X_train, y_train)
print(f"Multinomial Accuracy: {mnm_model.score(X_test, y_test):.4f}")
# Probabilities sum to 1 for each sample
probs = mnm_model.predict_proba(X_test[:3])
print("\nProbabilities (should sum to 1):")
for p in probs:
print(f" Setosa: {p[0]:.3f} | Versicolor: {p[1]:.3f} | Virginica: {p[2]:.3f} | Sum: {sum(p):.3f}")
9. 🌐 Real-World Use Cases {#real-world}
Use Case 1: Loan Default Prediction
"""
Problem: Will this customer default on their loan?
Business: Bank / Lending Institution
Impact: Risk management, credit scoring
"""
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
np.random.seed(42)
n = 5000
df = pd.DataFrame({
'age': np.random.randint(22, 65, n),
'annual_income': np.random.randint(20000, 200000, n),
'loan_amount': np.random.randint(5000, 200000, n),
'loan_term_months': np.random.choice([12, 24, 36, 48, 60], n),
'credit_score': np.random.randint(300, 850, n),
'employment_type': np.random.choice(['Salaried', 'Self-Employed', 'Business'], n),
'existing_loans': np.random.randint(0, 5, n),
'monthly_expenses': np.random.randint(1000, 10000, n),
})
default_prob = (
- (df['credit_score'] - 600) * 0.005
+ (df['loan_amount'] / df['annual_income']) * 0.3
+ df['existing_loans'] * 0.1
+ (df['monthly_expenses'] / (df['annual_income'] / 12)) * 0.2
+ np.random.normal(0, 0.1, n)
)
df['defaulted'] = (default_prob > 0.3).astype(int)
print(f"Default Rate: {df['defaulted'].mean():.2%}")
num_features = ['age', 'annual_income', 'loan_amount', 'loan_term_months',
'credit_score', 'existing_loans', 'monthly_expenses']
cat_features = ['employment_type']
preprocessor = ColumnTransformer([
('num', StandardScaler(), num_features),
('cat', OneHotEncoder(drop='first'), cat_features)
])
X = df.drop('defaulted', axis=1)
y = df['defaulted']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
pipeline = Pipeline([
('prep', preprocessor),
('model', LogisticRegression(
C=0.1, class_weight='balanced',
solver='lbfgs', max_iter=1000, random_state=42
))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]
print(f"\n📊 Loan Default Model Performance:")
print(classification_report(y_test, y_pred, target_names=['No Default', 'Default']))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
# Score a new applicant
new_applicant = pd.DataFrame({
'age': [35], 'annual_income': [60000], 'loan_amount': [50000],
'loan_term_months': [36], 'credit_score': [650],
'employment_type': ['Salaried'], 'existing_loans': [1],
'monthly_expenses': [3000]
})
risk_prob = pipeline.predict_proba(new_applicant)[0, 1]
risk_category = 'High' if risk_prob > 0.6 else 'Medium' if risk_prob > 0.3 else 'Low'
print(f"\n🏦 Applicant Risk Assessment:")
print(f" Default Probability: {risk_prob:.2%}")
print(f" Risk Category: {risk_category}")
recommendation = 'Reject' if risk_category == 'High' else 'Manual Review' if risk_category == 'Medium' else 'Approve'
print(f" Recommendation: {recommendation}")
10. 💻 Practical Examples — Complete Code {#practical}
Customer Churn Prediction — Full ML Pipeline
# ============================================================
# CUSTOMER CHURN PREDICTION — COMPLETE ML PROJECT
# ============================================================
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
roc_auc_score, classification_report
)
import warnings
warnings.filterwarnings('ignore')
# ── 1. Generate Data ──────────────────────────────────────
np.random.seed(42)
n = 3000
df = pd.DataFrame({
'tenure_months': np.random.randint(1, 72, n),
'monthly_charges': np.random.uniform(20, 120, n),
'total_charges': np.random.uniform(100, 8000, n),
'num_products': np.random.randint(1, 5, n),
'support_calls': np.random.randint(0, 10, n),
'payment_delays': np.random.randint(0, 5, n),
'has_premium': np.random.randint(0, 2, n),
'avg_session_days': np.random.uniform(1, 30, n),
})
churn_prob = (
- df['tenure_months'] * 0.02
+ df['support_calls'] * 0.15
+ df['payment_delays'] * 0.25
- df['num_products'] * 0.1
- df['has_premium'] * 0.3
+ df['monthly_charges'] * 0.005
+ np.random.normal(0, 0.5, n)
)
df['churned'] = (churn_prob > 0).astype(int)
print(f"📊 Dataset: {n} customers")
print(f"📊 Churn Rate: {df['churned'].mean():.2%} ({df['churned'].sum()} churned)")
# ── 2. Features & Target ──────────────────────────────────
feature_cols = [c for c in df.columns if c != 'churned']
X = df[feature_cols]
y = df['churned']
# ── 3. Stratified Train-Test Split ────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"\n🔀 Train: {len(X_train)} | Test: {len(X_test)}")
print(f"Train Churn Rate: {y_train.mean():.2%}")
print(f"Test Churn Rate: {y_test.mean():.2%}")
# ── 4. Compare Models ─────────────────────────────────────
configs = {
'LR (Default)': LogisticRegression(random_state=42, max_iter=1000),
'LR (C=0.1)': LogisticRegression(C=0.1, random_state=42, max_iter=1000),
'LR (C=10)': LogisticRegression(C=10, random_state=42, max_iter=1000),
'LR (Balanced)': LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000),
'LR (L1)': LogisticRegression(penalty='l1', solver='liblinear', random_state=42),
}
print("\n📊 Model Comparison (5-fold CV):")
print(f"{'Model':<25} {'Accuracy':>10} {'F1':>10} {'AUC':>10}")
print("-" * 60)
best_model = None
best_score = 0
for name, m in configs.items():
pipe = Pipeline([('scaler', StandardScaler()), ('model', m)])
acc = cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy').mean()
f1 = cross_val_score(pipe, X_train, y_train, cv=5, scoring='f1').mean()
auc = cross_val_score(pipe, X_train, y_train, cv=5, scoring='roc_auc').mean()
print(f"{name:<25} {acc:>10.4f} {f1:>10.4f} {auc:>10.4f}")
if f1 > best_score:
best_score = f1
best_model = (name, m)
print(f"\n🏆 Best Model: {best_model[0]}")
# ── 5. Train Final Model ──────────────────────────────────
final_pipe = Pipeline([('scaler', StandardScaler()), ('model', best_model[1])])
final_pipe.fit(X_train, y_train)
# ── 6. Optimise Threshold ─────────────────────────────────
probs_test = final_pipe.predict_proba(X_test)[:, 1]
best_thresh = 0.5
best_f1 = 0
for thresh in np.arange(0.2, 0.8, 0.05):
preds = (probs_test >= thresh).astype(int)
f1 = f1_score(y_test, preds, zero_division=0)
if f1 > best_f1:
best_f1 = f1
best_thresh = thresh
print(f"🎯 Optimal Threshold: {best_thresh:.2f}")
y_pred_final = (probs_test >= best_thresh).astype(int)
# ── 7. Final Evaluation ───────────────────────────────────
print("\n📊 Final Model Performance:")
print(f" Accuracy: {accuracy_score(y_test, y_pred_final):.4f}")
print(f" Precision: {precision_score(y_test, y_pred_final):.4f}")
print(f" Recall: {recall_score(y_test, y_pred_final):.4f}")
print(f" F1 Score: {f1_score(y_test, y_pred_final):.4f}")
print(f" AUC-ROC: {roc_auc_score(y_test, probs_test):.4f}")
print("\n📋 Classification Report:")
print(classification_report(y_test, y_pred_final, target_names=['Retained', 'Churned']))
# ── 8. Feature Importance ────────────────────────────────
lr_model = final_pipe.named_steps['model']
coef_df = pd.DataFrame({
'Feature': feature_cols,
'Coefficient': lr_model.coef_[0],
'Odds_Ratio': np.exp(lr_model.coef_[0])
}).sort_values('Coefficient', key=abs, ascending=False)
print("\n🔑 Feature Importance:")
print(coef_df.to_string(index=False))
# ── 9. Score a New Customer ──────────────────────────────
new_customer = pd.DataFrame({
'tenure_months': [6], 'monthly_charges': [85], 'total_charges': [510],
'num_products': [1], 'support_calls': [5], 'payment_delays': [2],
'has_premium': [0], 'avg_session_days': [5],
})
churn_prob_val = final_pipe.predict_proba(new_customer)[0, 1]
risk_level = 'High 🔴' if churn_prob_val > 0.6 else 'Medium 🟡' if churn_prob_val > 0.3 else 'Low 🟢'
print(f"\n⚠️ Customer Churn Risk:")
print(f" Probability: {churn_prob_val:.2%}")
print(f" Risk Level: {risk_level}")
action = (
"Immediate retention offer required!" if churn_prob_val > 0.6 else
"Monitor and send a personalised offer" if churn_prob_val > 0.3 else
"Maintain regular engagement"
)
print(f" Action: {action}")
11. ⚠️ Edge Cases & Errors {#edge-cases}
Error 1: ConvergenceWarning
# ❌ Warning: Solver did not converge
# ✅ Fix 1: Increase max_iter
model = LogisticRegression(max_iter=10000)
# ✅ Fix 2: Scale the data
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X_train)
# ✅ Fix 3: Try a different solver
model = LogisticRegression(solver='saga', max_iter=1000)
Error 2: UndefinedMetricWarning — Zero Division
# ❌ Warning: Precision is ill-defined (no positive predictions made)
# ✅ Fix: Use zero_division parameter
from sklearn.metrics import precision_score, f1_score
precision = precision_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
Error 3: Penalty-Solver Incompatibility
# ❌ Error: Solver lbfgs does not support penalty='l1'
model = LogisticRegression(penalty='l1', solver='lbfgs') # ERROR!
# ✅ Valid combinations:
# penalty='l1' → solver='liblinear' or 'saga'
# penalty='l2' → any solver
# penalty='elasticnet'→ solver='saga' only
# penalty=None → 'lbfgs', 'saga', 'sag', 'newton-cg'
model = LogisticRegression(penalty='l1', solver='liblinear') # ✅
Debugging Checklist
# 1. Check class distribution
print("Class Distribution:", dict(zip(*np.unique(y, return_counts=True))))
# 2. Verify model classes
print("Model Classes:", model.classes_)
# 3. Probability sanity check
probs = model.predict_proba(X_test)
print("Sum to 1:", np.allclose(probs.sum(axis=1), 1.0))
print("Min prob:", probs.min(), "| Max prob:", probs.max())
# 4. Check for NaN in coefficients
print("Any NaN in coef:", np.isnan(model.coef_).any())
# 5. Train vs test score (overfitting check)
print("Train Score:", model.score(X_train_sc, y_train))
print("Test Score:", model.score(X_test_sc, y_test))
12. 💡 Pro Tips {#pro-tips}
# ✅ Tip 1: ALWAYS use predict_proba, not just predict
# Probabilities are far more useful for real business decisions
probs = model.predict_proba(X_test)[:, 1]
# Then tune the threshold for your specific use case
# ✅ Tip 2: Use stratified splits
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, stratify=y, test_size=0.2 # Maintains class proportions!
)
# ✅ Tip 3: Calibrate probabilities for trustworthy confidence scores
from sklearn.calibration import CalibratedClassifierCV
calibrated = CalibratedClassifierCV(LogisticRegression(), cv=5, method='sigmoid')
calibrated.fit(X_train, y_train)
# ✅ Tip 4: SHAP values for production-grade explainability
import shap
explainer = shap.LinearExplainer(model, X_train_sc)
shap_values = explainer.shap_values(X_test_sc)
shap.summary_plot(shap_values, X_test, feature_names=feature_cols)
# ✅ Tip 5: Use StratifiedKFold for cross-validation
from sklearn.model_selection import StratifiedKFold, cross_val_score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
# ✅ Tip 6: Save the pipeline with metadata
import joblib, json
metadata = {
'model_version': 'v1.0',
'features': feature_cols,
'optimal_threshold': best_thresh,
'metrics': {'f1': best_f1, 'auc': float(roc_auc_score(y_test, probs_test))}
}
joblib.dump(final_pipe, 'churn_model.pkl')
with open('model_metadata.json', 'w') as f:
json.dump(metadata, f, indent=2)
13. ⚖️ Logistic Regression vs Other Classifiers {#comparison}
| Aspect | Logistic Reg | Decision Tree | Random Forest | SVM | Neural Net |
|---|---|---|---|---|---|
| Interpretability | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐ | ⭐ |
| Training Speed | ⚡ Very Fast | ⚡ Fast | 🐢 Medium | 🐌 Slow | 🐌 Very Slow |
| Non-linear patterns | ❌ (base) | ✅ | ✅ | ✅ | ✅ |
| Calibrated Probs | ✅ | ✅ (approx) | ✅ (approx) | ❌ (needs calibration) | ✅ |
| Feature Scaling | Required | Not needed | Not needed | Required | Required |
| Overfitting Risk | Low | High | Low | Medium | High |
| Data Required | Small OK | Small OK | Medium | Medium | Large |
When to Choose Logistic Regression?
✅ Choose Logistic Regression when:
- A linear decision boundary is sufficient
- Interpretability is critical (medical, finance, legal)
- Calibrated probability output is needed
- Working with a small to medium dataset
- Building a quick, defensible baseline
- Regulatory or audit requirements exist
❌ Avoid when:
- Complex non-linear patterns exist
- Working with images, text (without engineered features)
- Features are highly multicollinear (use Ridge instead)
- Very large sparse data (use SGD-based solvers)
14. 📊 Data Science Perspective {#data-science}
# Logistic Regression in A/B Testing Analysis
import scipy.stats as stats
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
np.random.seed(42)
n_per_group = 1000
control = pd.DataFrame({
'group': 0, 'converted': np.random.binomial(1, 0.05, n_per_group),
'session_duration': np.random.exponential(120, n_per_group),
'device': np.random.choice([0, 1], n_per_group)
})
treatment = pd.DataFrame({
'group': 1, 'converted': np.random.binomial(1, 0.08, n_per_group),
'session_duration': np.random.exponential(150, n_per_group),
'device': np.random.choice([0, 1], n_per_group)
})
data = pd.concat([control, treatment], ignore_index=True)
import statsmodels.api as sm
X_sm = sm.add_constant(data[['group', 'session_duration', 'device']])
sm_model = sm.Logit(data['converted'], X_sm).fit(disp=0)
print("A/B Test Analysis:")
print(sm_model.summary2().tables[1])
odds_ratio = np.exp(sm_model.params['group'])
print(f"\nTreatment Odds Ratio: {odds_ratio:.4f}")
print(f"Treatment lifts conversion by {(odds_ratio - 1) * 100:.1f}%")
15. 🎤 Interview Questions {#interview}
🟢 Basic
Q1: What is Logistic Regression?
A classification algorithm that uses the sigmoid function to convert a linear combination of features into a probability between 0 and 1.
Q2: Why do we use the sigmoid function?
A linear model's output can be any real number. The sigmoid squashes it into the range (0, 1), making it a valid probability.
Q3: Precision vs Recall — when to choose which?
Precision: When FP is costly (e.g. spam — don't block legitimate email). Recall: When FN is costly (e.g. cancer detection — never miss a sick patient).
Q4: Explain the Confusion Matrix.
A 2×2 table showing TP, TN, FP, FN. It compares actual vs predicted classes. All classification metrics (Accuracy, Precision, Recall, F1) are derived from it.
Q5: What does the C parameter control?
It is the inverse of regularization strength. Small C → Strong regularization (simpler model). Large C → Weak regularization (more complex model). Default is 1.
🟡 Intermediate
Q6: What is AUC-ROC?
ROC Curve: Plots TPR (Recall) vs FPR at various decision thresholds
AUC: Area Under the ROC Curve
0.5 = Random classifier, 1.0 = Perfect classifier
Preferred over accuracy for comparing models on imbalanced data
Q7: How do you handle class imbalance?
1. class_weight='balanced' → Simple and effective
2. SMOTE → Generate synthetic minority samples
3. Threshold tuning → Lower the decision threshold below 0.5
4. Metrics → Use F1 / AUC instead of accuracy
Q8: What is the difference between L1 and L2 regularization?
L1 (penalty='l1'): Sets some coefficients exactly to 0 → Feature selection
L2 (penalty='l2'): Shrinks all coefficients, none to zero → Default
ElasticNet: Combines both behaviours
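The sparsity difference is easy to see on data with many uninformative features — the dataset below is synthetic, and C=0.1 was picked just to make the effect visible:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=42)
X = StandardScaler().fit_transform(X)

l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
l2 = LogisticRegression(penalty='l2', C=0.1).fit(X, y)

print(f"L1 coefficients set to zero: {np.sum(l1.coef_[0] == 0)} / 20")
print(f"L2 coefficients set to zero: {np.sum(l2.coef_[0] == 0)} / 20")
```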
🔴 Advanced
Q9: When is the Logistic Regression decision boundary non-linear?
The base model always produces a linear boundary. To get a non-linear boundary, add polynomial or interaction features — but the model itself remains linear in its parameters.
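A concrete demonstration: on concentric circles no straight line separates the classes, but adding degree-2 polynomial features (x₁², x₂², x₁x₂) lets the same linear-in-parameters model draw a circular boundary:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=42)

linear = LogisticRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LogisticRegression()).fit(X, y)

print(f"Raw features accuracy: {linear.score(X, y):.3f}")       # near chance
print(f"Polynomial features accuracy: {poly.score(X, y):.3f}")  # near perfect
```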
Q10: Why minimise Log Loss and not MSE for Logistic Regression?
MSE applied to probabilities: Non-convex → Multiple local minima → Gradient Descent can fail
Log Loss: Convex → Single global minimum → Gradient Descent works reliably
Log Loss also severely penalises confident wrong predictions, which is desirable behaviour.
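The "severe penalty" claim is visible in the numbers — for a sample with y=1 the loss is -log(ŷ), which explodes as the predicted probability approaches 0:

```python
import numpy as np

# Log loss for a single sample with y=1 is -log(p_hat)
for p_hat in [0.9, 0.5, 0.1, 0.01]:
    print(f"y=1, predicted p={p_hat:<5} → log loss = {-np.log(p_hat):.3f}")
```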
Q11: What is probability calibration?
A model's predicted probabilities should match actual observed frequencies. If it predicts 70% probability, then 70% of those cases should actually be positive. Use CalibratedClassifierCV when calibration matters.
🎭 Scenario-Based
Q12: Your model has 99% accuracy but is completely useless. Why?
The dataset is heavily imbalanced! If 99% of samples belong to class 0 and the model predicts class 0 for everything, it achieves 99% accuracy while identifying zero positives. Fix: Use F1-Score and AUC-ROC as your metrics.
16. 🎯 Conclusion {#conclusion}
Key Takeaways
| Topic | Key Learning |
|---|---|
| What | Classification via sigmoid function |
| Output | Probability (0–1), then apply a threshold |
| Types | Binary, Multiclass (OvR / Multinomial) |
| Key Params | C (inverse regularization), penalty, solver |
| Metrics | Precision, Recall, F1, AUC-ROC |
| Imbalance | class_weight='balanced', SMOTE, threshold tuning |
| Best Practice | Pipeline + Stratified CV + Threshold optimisation |
Series Roadmap
✅ Part 1: ML Introduction
✅ Part 2: Linear Regression
✅ Part 3: Logistic Regression (This Blog!)
⏳ Part 4: Decision Trees
⏳ Part 5: Random Forests & Ensemble Methods
⏳ Part 6: Support Vector Machines
⏳ Part 7: K-Means Clustering
⏳ Part 8: Neural Networks
Final Advice
"Master Logistic Regression thoroughly. Every classification metric, class imbalance technique, and threshold optimisation concept you learn here carries forward to every advanced model you will ever use."
Do This Now:
✅ Implement on the Titanic dataset (Kaggle)
✅ Experiment with different decision thresholds
✅ Try SMOTE on an imbalanced dataset
✅ Plot and interpret the ROC and PR curves
Part 3 of ML A to Z | Next: Decision Trees! 🚀