"Ek doctor jo 100 patients mein se 95 ko healthy bata de aur 5 cancer patients miss kar de — kya woh 95% accurate doctor acha hai?" Yahi problem hai sirf accuracy pe rely karne ki. Aao sahi tarike se ML models evaluate karna seekhein.
1. Introduction — Why Evaluation Matters
Picture this — you built an ML model, training accuracy came out at 98%. You were thrilled. You deployed the model. In production, it got 70% of cases wrong. What happened?
Answer: You were looking at the wrong metric.
Real-World Scenario
Suppose your model predicts whether a transaction is fraudulent or not.
Out of 1000 transactions, 990 are normal and 10 are fraud.
If the model labels everything as normal — accuracy = 990/1000 = 99% 🎉
But it didn't catch a single fraud 😱
In other words — a 99% accurate model is completely useless in this case.
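This failure mode is easy to reproduce. A minimal sketch (the 990/10 split is made up to match the scenario above):

```python
from sklearn.metrics import accuracy_score, recall_score

# 990 normal (0) and 10 fraud (1) transactions
y_true = [0] * 990 + [1] * 10
# A "model" that labels everything as normal
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))                    # 0.99, looks great
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0: caught zero frauds
```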
Why Multiple Metrics?
| Situation | Best Metric |
|---|---|
| Balanced dataset | Accuracy |
| Fraud / Cancer detection | Recall, F1 |
| Email spam filter | Precision |
| Ranking / Recommendation | AUC-ROC |
| Price prediction | RMSE, MAE |
Today we'll build up every metric from zero — definition, formula, code, and real-world use.
2. Basic Concepts — Classification vs Regression
What are you predicting?
Classification → predicting a category
Spam or Not Spam?
Cancer or no cancer?
Regression → predicting a number
What will the house's price be?
What will tomorrow's temperature be?
How high will next month's sales be?
The two need different sets of metrics
Classification Metrics:
→ Accuracy, Precision, Recall, F1-Score, AUC-ROC
Regression Metrics:
→ MAE, MSE, RMSE, R², MAPE
3. Confusion Matrix — The Foundation
A Confusion Matrix is a 2x2 table that shows what the model predicted correctly and what it got wrong.
For Binary Classification
| | Predicted: YES | Predicted: NO |
|---|---|---|
| Actual: YES | True Positive (TP) | False Negative (FN) |
| Actual: NO | False Positive (FP) | True Negative (TN) |
Definitions (Plain Language)
| Term | Full Name | Meaning |
|---|---|---|
| TP | True Positive | Model said YES, it was actually YES ✅ |
| TN | True Negative | Model said NO, it was actually NO ✅ |
| FP | False Positive | Model said YES, it was actually NO ❌ (Type I Error) |
| FN | False Negative | Model said NO, it was actually YES ❌ (Type II Error) |
Real Example: COVID Test
TP → Test positive, patient actually COVID positive ✅
TN → Test negative, patient actually COVID negative ✅
FP → Test positive, patient actually COVID negative ❌ (False Alarm)
FN → Test negative, patient actually COVID positive ❌ (Dangerous miss!)
Code — Confusion Matrix in Python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import numpy as np
# Actual values (ground truth)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
# Model predictions
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
# Build the confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
# Print how many predictions were right and how many wrong
tn, fp, fn, tp = cm.ravel()
print(f"True Negatives: {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives: {tp}")
Output:
Confusion Matrix:
[[4 1]
[1 4]]
TN=4, FP=1
FN=1, TP=4
4. All Classification Metrics
4.1 Accuracy
Definition: Correct predictions / Total predictions
Accuracy = (TP + TN) / (TP + TN + FP + FN)
When to use: when the dataset is balanced (each class has roughly the same count)
When to avoid: on imbalanced datasets (fraud, cancer, etc.)
from sklearn.metrics import accuracy_score
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2%}") # 80.00%
4.2 Precision
Definition: Of all the cases the model labeled Positive, how many were actually Positive?
Precision = TP / (TP + FP)
Simple Analogy: The model flagged 10 emails as "spam" and 8 of them actually were → Precision = 80%
When it matters: when a False Positive is costly
Example: The email filter sent an important email to spam → Bad!
Example: An innocent person found guilty in court → Very Bad!
from sklearn.metrics import precision_score
precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.2%}") # 80.00%
4.3 Recall (Sensitivity / True Positive Rate)
Definition: Of all the cases that were actually Positive, how many did the model label Positive?
Recall = TP / (TP + FN)
Simple Analogy: Out of 100 COVID patients, how many did the model catch?
When it matters: when a False Negative is costly
Example: A cancer patient told they're "healthy" → Dangerous!
Example: A fraud transaction slipped through → Financial loss!
from sklearn.metrics import recall_score
recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.2%}") # 80.00%
Precision vs Recall Trade-off
These two live in constant tension:
High Precision → the model is very cautious, says YES only when it's confident
→ fewer FP, but FN can rise
High Recall → the model is aggressive, says YES to more cases
→ fewer FN, but FP can rise
Changing the threshold affects both — we'll explore this in the AUC section.
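To see the trade-off concretely, here is a small sketch with made-up probabilities: raising the threshold lifts Precision and drops Recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy ground truth and predicted probabilities (made up for illustration)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.6, 0.4, 0.55, 0.3, 0.2, 0.1, 0.05, 0.45])

for threshold in [0.3, 0.5, 0.7]:
    # Everything at or above the threshold becomes a YES
    y_pred = (y_prob >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    print(f"threshold={threshold:.1f} → precision={p:.2f}, recall={r:.2f}")
```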
4.4 F1-Score
Definition: the Harmonic Mean of Precision and Recall
F1 = 2 × (Precision × Recall) / (Precision + Recall)
When to use: when both Precision and Recall matter, especially on imbalanced datasets.
Why Harmonic Mean, not Simple Average?
If Precision = 1.0 and Recall = 0.0 → simple average = 0.5 (misleading!)
Harmonic mean = 0.0 (correct — the model is useless)
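A quick numeric check of this claim (toy values):

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Extreme case: perfect precision, zero recall
print((1.0 + 0.0) / 2)   # simple average: 0.5 (misleading)
print(f1(1.0, 0.0))      # harmonic mean: 0.0 (honest)

# Balanced case: harmonic mean agrees with both inputs
print(f1(0.8, 0.8))      # 0.8
```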
from sklearn.metrics import f1_score
f1 = f1_score(y_true, y_pred)
print(f"F1-Score: {f1:.2%}")
# Weighted F1 for multi-class
f1_weighted = f1_score(y_true, y_pred, average='weighted')
4.5 F-Beta Score
Definition: A generalized version of F1 that lets you weight Precision or Recall more heavily.
F_beta = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
β < 1 → weights Precision more
β > 1 → weights Recall more
β = 1 → F1 score (balanced)
from sklearn.metrics import fbeta_score
# Beta=2 means Recall 2x more important than Precision
f2 = fbeta_score(y_true, y_pred, beta=2)
print(f"F2-Score: {f2:.4f}")
# Beta=0.5 means Precision 2x more important
f05 = fbeta_score(y_true, y_pred, beta=0.5)
print(f"F0.5-Score: {f05:.4f}")
Real Use:
Medical diagnosis → β=2 (Recall matters more; don't miss cases)
Spam filter → β=0.5 (Precision matters more; don't block legitimate email)
4.6 Specificity (True Negative Rate)
Definition: Of all the cases that were actually Negative, how many did the model correctly label Negative?
Specificity = TN / (TN + FP)
The mirror image of Recall — it describes the Negative class.
# Manual calculation
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
print(f"Specificity: {specificity:.2%}")
4.7 Matthews Correlation Coefficient (MCC)
Definition: Arguably the most balanced single metric for binary classification, reliable even on imbalanced datasets.
MCC = (TP×TN - FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
Range: -1 (worst) to +1 (best), 0 = random guessing
from sklearn.metrics import matthews_corrcoef
mcc = matthews_corrcoef(y_true, y_pred)
print(f"MCC: {mcc:.4f}")
Pro Tip: On an imbalanced dataset F1 can mislead you; MCC won't.
4.8 Cohen's Kappa
Definition: Compares the model's accuracy against chance-level accuracy.
Range: -1 to +1; 0 = chance-level agreement, 1 = perfect agreement
from sklearn.metrics import cohen_kappa_score
kappa = cohen_kappa_score(y_true, y_pred)
print(f"Cohen's Kappa: {kappa:.4f}")
| Kappa Value | Interpretation |
|---|---|
| < 0 | Worse than chance |
| 0.0 – 0.2 | Slight agreement |
| 0.2 – 0.4 | Fair agreement |
| 0.4 – 0.6 | Moderate agreement |
| 0.6 – 0.8 | Substantial agreement |
| 0.8 – 1.0 | Almost perfect |
4.9 Log Loss (Cross-Entropy Loss)
Definition: Penalizes the model's probability predictions — the more confident and wrong, the bigger the penalty.
Log Loss = -1/N × Σ [y×log(p) + (1-y)×log(1-p)]
When to use: when the model outputs probabilities, not just 0/1.
from sklearn.metrics import log_loss
# y_prob = probability of class 1
y_prob = [0.9, 0.1, 0.8, 0.3, 0.2, 0.95, 0.7, 0.05, 0.85, 0.15]
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
loss = log_loss(y_true, y_prob)
print(f"Log Loss: {loss:.4f}") # Lower is better
Lower Log Loss = Better model
4.10 Hamming Loss
Definition: Used for multi-label classification — the fraction of labels predicted wrong.
from sklearn.metrics import hamming_loss
# Multi-label example
y_true_ml = [[1, 0, 1], [0, 1, 1]]
y_pred_ml = [[1, 0, 0], [0, 1, 1]]
h_loss = hamming_loss(y_true_ml, y_pred_ml)
print(f"Hamming Loss: {h_loss:.4f}") # Lower is better
5. 📈 All Regression Metrics
5.1 MAE — Mean Absolute Error
Definition: Average of absolute differences between predicted and actual values.
MAE = (1/n) × Σ |y_actual - y_predicted|
Simple Analogy: You predicted a house price of 50 lakhs; it was actually 55 lakhs → error = 5 lakhs. MAE is the average of all such errors.
Properties:
Not strongly affected by outliers (robust)
Expressed in the same unit as the target variable
from sklearn.metrics import mean_absolute_error
y_true_reg = [100, 200, 300, 400, 500]
y_pred_reg = [110, 190, 320, 380, 510]
mae = mean_absolute_error(y_true_reg, y_pred_reg)
print(f"MAE: {mae:.2f}") # 18.0
5.2 MSE — Mean Squared Error
Definition: Average of squared differences.
MSE = (1/n) × Σ (y_actual - y_predicted)²
Properties:
Penalizes outliers very heavily (squared!)
The unit gets squared (price² → weird)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_true_reg, y_pred_reg)
print(f"MSE: {mse:.2f}") # 460.0
5.3 RMSE — Root Mean Squared Error
Definition: The square root of MSE — the result comes back in the original unit.
RMSE = √MSE
Properties:
More interpretable than MSE
Sensitive to outliers
import numpy as np
rmse = np.sqrt(mse)
# OR (sklearn < 1.6; newer versions provide root_mean_squared_error instead of squared=False)
rmse = mean_squared_error(y_true_reg, y_pred_reg, squared=False)
print(f"RMSE: {rmse:.2f}") # 14.83
5.4 R² Score (Coefficient of Determination)
Definition: How much of the target variable's variation the model explains — typically on a 0–1 scale, though it can go negative.
R² = 1 - (SS_residual / SS_total)
SS_residual = Σ(y - ŷ)²
SS_total = Σ(y - ȳ)²
Interpretation:
R² = 1 → perfect model
R² = 0 → the model is just predicting the mean (useless)
R² < 0 → the model is worse than predicting the mean (very bad!)
from sklearn.metrics import r2_score
r2 = r2_score(y_true_reg, y_pred_reg)
print(f"R² Score: {r2:.4f}")
5.5 Adjusted R²
Definition: An improved version of R² that penalizes adding extra features.
Adjusted R² = 1 - [(1-R²)(n-1) / (n-k-1)]
where n = number of samples, k = number of features
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)
n = len(y_true_reg)
k = 3 # number of features
adj_r2 = adjusted_r2(r2, n, k)
print(f"Adjusted R²: {adj_r2:.4f}")
5.6 MAPE — Mean Absolute Percentage Error
Definition: Reports the error as a percentage.
MAPE = (1/n) × Σ |((y_actual - y_predicted) / y_actual)| × 100
Limitation: if an actual value is zero, you get a divide-by-zero error.
def mape(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
result = mape(y_true_reg, y_pred_reg)
print(f"MAPE: {result:.2f}%")
# Sklearn version (available from sklearn 0.24+)
from sklearn.metrics import mean_absolute_percentage_error
mape_sk = mean_absolute_percentage_error(y_true_reg, y_pred_reg) * 100
print(f"MAPE (sklearn): {mape_sk:.2f}%")
5.7 Huber Loss
Definition: A blend of MAE and MSE — MSE-like for small errors, MAE-like for large ones.
from sklearn.linear_model import HuberRegressor
import numpy as np
# Huber loss for training (outlier-robust regression)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.5, 2.1, 2.9, 4.2, 5.1])
model = HuberRegressor()
model.fit(X, y)
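HuberRegressor applies this loss internally during fitting. To see the piecewise shape itself, here is a hedged manual sketch (delta marks the cut-over between the quadratic and linear regimes):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    # Quadratic for residuals below delta, linear above: robust to outliers
    residual = np.abs(np.array(y_true, dtype=float) - np.array(y_pred, dtype=float))
    quadratic = 0.5 * residual**2
    linear = delta * residual - 0.5 * delta**2
    return np.mean(np.where(residual <= delta, quadratic, linear))

# Small error behaves like MSE; large error grows like MAE
print(huber_loss([1.0], [1.5]))   # 0.125 (quadratic region)
print(huber_loss([1.0], [11.0]))  # 9.5 (linear region; plain MSE would give 50)
```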
6. ⚖️ Intermediate Usage — Imbalanced Datasets
Problem: Imbalanced Data
In the real world, data is almost never balanced:
Fraud detection: 0.1% fraud, 99.9% normal
Medical diagnosis: 1% disease, 99% healthy
Churn prediction: 5% churn, 95% no churn
Solutions and Metrics
6.1 Class Weight
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Build an imbalanced dataset
X, y = make_classification(
n_samples=1000,
n_features=10,
weights=[0.95, 0.05], # 95% class 0, 5% class 1
random_state=42
)
# class_weight='balanced' adjusts the class weights automatically
model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X, y)
6.2 SMOTE — Synthetic Minority Over-sampling
# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from collections import Counter
print(f"Before SMOTE: {Counter(y)}")
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print(f"After SMOTE: {Counter(y_resampled)}")
6.3 Balanced Accuracy Score
from sklearn.metrics import balanced_accuracy_score
# Regular accuracy is misleading on imbalanced data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
y_pred_imb = model.predict(X_test)
regular_acc = accuracy_score(y_test, y_pred_imb)
balanced_acc = balanced_accuracy_score(y_test, y_pred_imb)
print(f"Regular Accuracy: {regular_acc:.2%}")
print(f"Balanced Accuracy: {balanced_acc:.2%}")
6.4 Classification Report — Everything in One Place
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_imb, target_names=['Normal', 'Fraud']))
Output:
precision recall f1-score support
Normal 0.97 0.98 0.98 190
Fraud 0.67 0.60 0.63 10
accuracy 0.96 200
macro avg 0.82 0.79 0.80 200
weighted avg 0.96 0.96 0.96 200
Macro vs Weighted Average:
Macro Avg → every class gets equal weight
Weighted Avg → weighted by class size
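The difference shows up clearly on a toy imbalanced example (labels made up for illustration):

```python
from sklearn.metrics import f1_score

# 8 samples of class 0 and 2 of class 1; the model nails class 0, struggles on class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

print(f1_score(y_true, y_pred, average='macro'))     # equal weight per class
print(f1_score(y_true, y_pred, average='weighted'))  # dominated by the big class 0
```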
7. 🔬 Advanced Concepts — AUC-ROC, PR Curve, Multi-class
7.1 ROC Curve — Receiver Operating Characteristic
Definition: The ROC curve shows how the True Positive Rate (Recall) and the False Positive Rate change as we vary the classification threshold.
TPR (Y-axis) = Recall = TP / (TP + FN)
FPR (X-axis) = FP / (FP + TN) [1 - Specificity]
What is a threshold?
The model outputs a probability (e.g., 0.73)
If threshold = 0.5 → 0.73 > 0.5 → Positive
Raise the threshold → stricter (fewer FP, more FN)
Lower the threshold → more lenient (more FP, fewer FN)
from sklearn.metrics import roc_curve, auc, roc_auc_score
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Models
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
# Probabilities
y_prob_lr = lr.predict_proba(X_test)[:, 1]
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob_lr)
roc_auc = auc(fpr, tpr)
# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"AUC-ROC Score: {roc_auc:.4f}")
7.2 AUC — Area Under the Curve
Definition: The area under the ROC curve, ranging from 0 to 1.
| AUC Value | Interpretation |
|---|---|
| 1.0 | Perfect model |
| 0.9 – 1.0 | Excellent |
| 0.8 – 0.9 | Good |
| 0.7 – 0.8 | Fair |
| 0.6 – 0.7 | Poor |
| 0.5 | Random guessing |
| < 0.5 | Worse than random |
Physical meaning: AUC = the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example.
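You can verify this interpretation by brute force: enumerate every (positive, negative) pair and count how often the positive scores higher (a sketch with toy probabilities).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores (made up for illustration)
y_true = np.array([1, 1, 1, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1])

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
rank_prob = np.mean([1.0 if p > n else 0.5 if p == n else 0.0
                     for p in pos for n in neg])

print(rank_prob, roc_auc_score(y_true, y_prob))  # the two numbers match
```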
7.3 Precision-Recall Curve
Definition: Similar to ROC, but it plots Precision against Recall instead of using FPR.
When to prefer it over ROC: on highly imbalanced datasets, the PR curve is more informative.
from sklearn.metrics import precision_recall_curve, average_precision_score
precision_vals, recall_vals, pr_thresholds = precision_recall_curve(y_test, y_prob_lr)
avg_precision = average_precision_score(y_test, y_prob_lr)
plt.figure(figsize=(8, 6))
plt.plot(recall_vals, precision_vals, color='green', lw=2,
label=f'PR Curve (AP = {avg_precision:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
7.4 Finding Optimal Threshold
# Youden's J Statistic: Best threshold on ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob_lr)
j_scores = tpr - fpr
best_idx = np.argmax(j_scores)
best_threshold = thresholds[best_idx]
print(f"Optimal Threshold (Youden's J): {best_threshold:.4f}")
print(f"At this threshold → TPR: {tpr[best_idx]:.4f}, FPR: {fpr[best_idx]:.4f}")
# Apply optimal threshold
y_pred_optimal = (y_prob_lr >= best_threshold).astype(int)
print(f"\nWith optimal threshold:")
print(classification_report(y_test, y_pred_optimal))
7.5 Multi-Class Metrics
When there are 3+ classes (e.g., Cat/Dog/Bird):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
roc_auc_score, accuracy_score)
from sklearn.preprocessing import label_binarize
# Multi-class dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_mc = rf.predict(X_test)
y_prob_mc = rf.predict_proba(X_test)
# Classification Report
print(classification_report(y_test, y_pred_mc, target_names=iris.target_names))
# Multi-class AUC (One-vs-Rest)
auc_ovr = roc_auc_score(y_test, y_prob_mc, multi_class='ovr', average='macro')
print(f"AUC-ROC (OvR, Macro): {auc_ovr:.4f}")
# Multi-class AUC (One-vs-One)
auc_ovo = roc_auc_score(y_test, y_prob_mc, multi_class='ovo', average='macro')
print(f"AUC-ROC (OvO, Macro): {auc_ovo:.4f}")
8. 🌍 Real-World Use Cases
8.1 Medical Diagnosis (Cancer Detection)
Priority: High Recall (no patient should be missed). Acceptable trade-off: lower Precision (some healthy people get flagged for extra checks).
# Cancer detection: Recall matters most
# Even if Precision suffers, we want to catch ALL cancer cases
from sklearn.svm import SVC
svc = SVC(probability=True, class_weight='balanced', random_state=42)
svc.fit(X_train, y_train)
y_prob_svc = svc.predict_proba(X_test)[:, 1]
# Lower threshold → Higher Recall
y_pred_conservative = (y_prob_svc >= 0.3).astype(int) # Threshold = 0.3
print("Conservative threshold (0.3) - High Recall:")
print(classification_report(y_test, y_pred_conservative))
8.2 Email Spam Filter
Priority: High Precision (legitimate email must not land in spam). Acceptable trade-off: lower Recall (some spam may reach the inbox).
# Higher threshold → Higher Precision
y_pred_strict = (y_prob_lr >= 0.7).astype(int) # Threshold = 0.7
print("Strict threshold (0.7) - High Precision:")
print(classification_report(y_test, y_pred_strict))
8.3 Fraud Detection
Priority: F1-Score + AUC-ROC + Balanced Accuracy
# Complete fraud detection evaluation
from sklearn.metrics import balanced_accuracy_score, precision_score, recall_score

def fraud_model_report(y_true, y_pred, y_prob, model_name="Model"):
    print(f"\n{'='*50}")
    print(f" {model_name} — Complete Evaluation")
    print(f"{'='*50}")
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"TP: {tp}, TN: {tn}, FP: {fp}, FN: {fn}")
    print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
    print(f"Balanced Accuracy: {balanced_accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall: {recall_score(y_true, y_pred):.4f}")
    print(f"F1-Score: {f1_score(y_true, y_pred):.4f}")
    print(f"F2-Score: {fbeta_score(y_true, y_pred, beta=2):.4f}")
    print(f"AUC-ROC: {roc_auc_score(y_true, y_prob):.4f}")
    print(f"MCC: {matthews_corrcoef(y_true, y_pred):.4f}")
    print(f"Log Loss: {log_loss(y_true, y_prob):.4f}")

# Usage
fraud_model_report(y_test, y_pred_optimal, y_prob_lr, "Logistic Regression")
8.4 House Price Prediction (Regression)
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X_h, y_h = housing.data, housing.target
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
X_h, y_h, test_size=0.2, random_state=42
)
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)
gbr.fit(X_train_h, y_train_h)
y_pred_h = gbr.predict(X_test_h)
# Complete regression evaluation
mae_h = mean_absolute_error(y_test_h, y_pred_h)
rmse_h = np.sqrt(mean_squared_error(y_test_h, y_pred_h))
r2_h = r2_score(y_test_h, y_pred_h)
mape_h = mean_absolute_percentage_error(y_test_h, y_pred_h) * 100
print(f"MAE: {mae_h:.4f}")
print(f"RMSE: {rmse_h:.4f}")
print(f"R²: {r2_h:.4f}")
print(f"MAPE: {mape_h:.2f}%")
9. 💻 Practical Examples with Code
Example 1: Complete Model Comparison Pipeline
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, matthews_corrcoef,
balanced_accuracy_score, log_loss)
import warnings
warnings.filterwarnings('ignore')
# Build the dataset
X, y = make_classification(
n_samples=2000, n_features=20, n_informative=15,
n_redundant=3, weights=[0.85, 0.15], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Models
models = {
'Logistic Regression': LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000),
'Decision Tree': DecisionTreeClassifier(class_weight='balanced', random_state=42),
'Random Forest': RandomForestClassifier(class_weight='balanced', random_state=42),
'Gradient Boosting': GradientBoostingClassifier(random_state=42),
'SVM': SVC(probability=True, class_weight='balanced', random_state=42),
'KNN': KNeighborsClassifier(),
'Naive Bayes': GaussianNB(),
}
# Evaluate all models
results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred_m = model.predict(X_test)
    y_prob_m = model.predict_proba(X_test)[:, 1]
    results.append({
        'Model': name,
        'Accuracy': round(accuracy_score(y_test, y_pred_m), 4),
        'Bal. Accuracy': round(balanced_accuracy_score(y_test, y_pred_m), 4),
        'Precision': round(precision_score(y_test, y_pred_m), 4),
        'Recall': round(recall_score(y_test, y_pred_m), 4),
        'F1-Score': round(f1_score(y_test, y_pred_m), 4),
        'AUC-ROC': round(roc_auc_score(y_test, y_prob_m), 4),
        'MCC': round(matthews_corrcoef(y_test, y_pred_m), 4),
        'Log Loss': round(log_loss(y_test, y_prob_m), 4),
    })
# Display as DataFrame
df_results = pd.DataFrame(results).set_index('Model')
print(df_results.sort_values('AUC-ROC', ascending=False).to_string())
Example 2: Cross-Validation with Multiple Metrics
from sklearn.model_selection import cross_validate
scoring = {
'accuracy': 'accuracy',
'precision': 'precision',
'recall': 'recall',
'f1': 'f1',
'roc_auc': 'roc_auc',
'balanced_accuracy': 'balanced_accuracy',
'neg_log_loss': 'neg_log_loss',
}
rf_model = RandomForestClassifier(class_weight='balanced', n_estimators=100, random_state=42)
cv_results = cross_validate(
rf_model, X, y,
cv=5,
scoring=scoring,
return_train_score=True
)
print("5-Fold Cross-Validation Results:")
print("-" * 40)
for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
    test_scores = cv_results[f'test_{metric}']
    print(f"{metric.capitalize():20s}: {test_scores.mean():.4f} ± {test_scores.std():.4f}")
Example 3: Custom Evaluation Dashboard
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (confusion_matrix, roc_curve, precision_recall_curve,
                             ConfusionMatrixDisplay, auc, average_precision_score,
                             accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score)

def full_evaluation_dashboard(model, X_test, y_test, model_name="Model"):
    """
    Complete visual evaluation dashboard for a binary classifier.
    """
    y_pred_d = model.predict(X_test)
    y_prob_d = model.predict_proba(X_test)[:, 1]
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    fig.suptitle(f'{model_name} — Complete Evaluation Dashboard', fontsize=16, fontweight='bold')

    # 1. Confusion Matrix
    cm_d = confusion_matrix(y_test, y_pred_d)
    ConfusionMatrixDisplay(cm_d).plot(ax=axes[0, 0], cmap='Blues')
    axes[0, 0].set_title('Confusion Matrix')

    # 2. ROC Curve
    fpr_d, tpr_d, _ = roc_curve(y_test, y_prob_d)
    roc_auc_d = auc(fpr_d, tpr_d)
    axes[0, 1].plot(fpr_d, tpr_d, 'darkorange', lw=2, label=f'AUC = {roc_auc_d:.3f}')
    axes[0, 1].plot([0, 1], [0, 1], 'navy', linestyle='--')
    axes[0, 1].set_xlabel('FPR'); axes[0, 1].set_ylabel('TPR')
    axes[0, 1].set_title('ROC Curve'); axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # 3. Precision-Recall Curve
    prec_d, rec_d, _ = precision_recall_curve(y_test, y_prob_d)
    ap_d = average_precision_score(y_test, y_prob_d)
    axes[0, 2].plot(rec_d, prec_d, 'green', lw=2, label=f'AP = {ap_d:.3f}')
    axes[0, 2].set_xlabel('Recall'); axes[0, 2].set_ylabel('Precision')
    axes[0, 2].set_title('Precision-Recall Curve'); axes[0, 2].legend()
    axes[0, 2].grid(True, alpha=0.3)

    # 4. Metrics Bar Chart
    metrics_d = {
        'Accuracy': accuracy_score(y_test, y_pred_d),
        'Precision': precision_score(y_test, y_pred_d),
        'Recall': recall_score(y_test, y_pred_d),
        'F1-Score': f1_score(y_test, y_pred_d),
        'Bal. Acc': balanced_accuracy_score(y_test, y_pred_d),
        'AUC-ROC': roc_auc_d,
    }
    colors_d = ['#2196F3', '#4CAF50', '#FF9800', '#9C27B0', '#F44336', '#00BCD4']
    bars = axes[1, 0].bar(metrics_d.keys(), metrics_d.values(), color=colors_d, alpha=0.8)
    axes[1, 0].set_ylim(0, 1.1)
    axes[1, 0].set_title('All Metrics Overview')
    for bar_d, val_d in zip(bars, metrics_d.values()):
        axes[1, 0].text(bar_d.get_x() + bar_d.get_width()/2., bar_d.get_height() + 0.01,
                        f'{val_d:.2f}', ha='center', va='bottom', fontsize=9)

    # 5. Probability Distribution
    axes[1, 1].hist(y_prob_d[y_test == 0], bins=30, alpha=0.6, color='blue', label='Class 0 (Negative)')
    axes[1, 1].hist(y_prob_d[y_test == 1], bins=30, alpha=0.6, color='red', label='Class 1 (Positive)')
    axes[1, 1].set_xlabel('Predicted Probability')
    axes[1, 1].set_ylabel('Count')
    axes[1, 1].set_title('Probability Distribution')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

    # 6. Threshold Analysis
    thresholds_range = np.arange(0.1, 0.9, 0.05)
    precisions_t, recalls_t, f1s_t = [], [], []
    for thresh in thresholds_range:
        y_pred_t = (y_prob_d >= thresh).astype(int)
        precisions_t.append(precision_score(y_test, y_pred_t, zero_division=0))
        recalls_t.append(recall_score(y_test, y_pred_t, zero_division=0))
        f1s_t.append(f1_score(y_test, y_pred_t, zero_division=0))
    axes[1, 2].plot(thresholds_range, precisions_t, 'b-', label='Precision')
    axes[1, 2].plot(thresholds_range, recalls_t, 'r-', label='Recall')
    axes[1, 2].plot(thresholds_range, f1s_t, 'g-', label='F1-Score')
    axes[1, 2].set_xlabel('Threshold')
    axes[1, 2].set_ylabel('Score')
    axes[1, 2].set_title('Threshold vs Metrics')
    axes[1, 2].legend()
    axes[1, 2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()
    return metrics_d

# Usage
rf_final = RandomForestClassifier(class_weight='balanced', n_estimators=100, random_state=42)
rf_final.fit(X_train, y_train)
metrics = full_evaluation_dashboard(rf_final, X_test, y_test, "Random Forest")
10. ⚠️ Edge Cases and Errors
Error 1: UndefinedMetricWarning — Zero Division
# Problem: when a class is never predicted at all
from sklearn.exceptions import UndefinedMetricWarning
import warnings
y_pred_all_zero = np.zeros(len(y_test), dtype=int)  # Predict everything as 0
# This triggers the warning
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")  # undo any earlier filterwarnings('ignore')
    p = precision_score(y_test, y_pred_all_zero)
    if w:
        print(f"Warning: {w[0].message}")
# Fix: use the zero_division parameter
p_safe = precision_score(y_test, y_pred_all_zero, zero_division=0)
print(f"Precision (safe): {p_safe}")
Error 2: Wrong Average for Multi-class
# Problem: calling a binary-default metric function on multi-class labels
y_mc_true = [0, 1, 2, 0, 1, 2]
y_mc_pred = [0, 2, 1, 0, 1, 2]
# WRONG — binary default
# precision_score(y_mc_true, y_mc_pred) # ValueError!
# CORRECT — specify average
print(precision_score(y_mc_true, y_mc_pred, average='macro'))
print(precision_score(y_mc_true, y_mc_pred, average='micro'))
print(precision_score(y_mc_true, y_mc_pred, average='weighted'))
Average Types:
| Type | Meaning |
|---|---|
| macro | Simple average over all classes (can mislead with imbalanced classes) |
| micro | Compute the metric from global TP/FP/FN counts |
| weighted | Average weighted by class support |
| None | Return a separate score for each class |
Error 3: Data Leakage in Evaluation
# WRONG — fitting on the test data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# DO NOT DO THIS:
X_scaled_wrong = scaler.fit_transform(X) # Fitted on the entire dataset!
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X_scaled_wrong, y, test_size=0.2)
# CORRECT — use a Pipeline
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(class_weight='balanced', random_state=42))
])
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train_c, y_train_c)
y_pred_pipe = pipeline.predict(X_test_c)
print(f"F1-Score (No Leakage): {f1_score(y_test_c, y_pred_pipe):.4f}")
Error 4: Stratified Split Ignore Karna
# WRONG — Imbalanced data mein regular split
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X, y, test_size=0.2)
# CORRECT — Stratified split
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y # stratify=y is key!
)
from collections import Counter
print(f"Test set class distribution (Stratified): {Counter(y_test_s)}")
Error 5: Overfitting — Train vs Test Metrics
# Always check both the train and test metrics
rf_check = RandomForestClassifier(n_estimators=100, random_state=42)
rf_check.fit(X_train, y_train)
train_f1 = f1_score(y_train, rf_check.predict(X_train))
test_f1 = f1_score(y_test, rf_check.predict(X_test))
print(f"Train F1: {train_f1:.4f}")
print(f"Test F1: {test_f1:.4f}")
print(f"Gap: {train_f1 - test_f1:.4f}")
if train_f1 - test_f1 > 0.1:
    print("⚠️ Warning: Model is OVERFITTING!")
11. 🏆 Pro Developer Insights
Best Practices
1. Metric Selection Framework:
Step 1: Is the dataset balanced?
→ Yes → use Accuracy, F1
→ No → use Balanced Accuracy, MCC
Step 2: What do the errors cost?
→ FN costly (missed cancer/fraud) → focus on Recall
→ FP costly (false alarms) → focus on Precision
→ Both costly → F1 / F-beta
Step 3: Do you need probability output?
→ Yes → use AUC-ROC, Log Loss
→ No → threshold-based metrics
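The three steps above can be sketched as a tiny helper; the function name and rules are my own simplification, not a standard API:

```python
def suggest_metrics(balanced, costly_error, needs_probabilities):
    """Rough metric picker following the framework above.

    costly_error: 'fn', 'fp', or 'both'
    """
    # Step 1: balanced vs imbalanced dataset
    metrics = ['Accuracy', 'F1'] if balanced else ['Balanced Accuracy', 'MCC']
    # Step 2: which error is expensive
    metrics.append({'fn': 'Recall', 'fp': 'Precision', 'both': 'F-beta'}[costly_error])
    # Step 3: probability output available
    if needs_probabilities:
        metrics += ['AUC-ROC', 'Log Loss']
    return metrics

# Fraud detection: imbalanced, missed fraud is expensive, probabilities available
print(suggest_metrics(balanced=False, costly_error='fn', needs_probabilities=True))
```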
2. Multiple Metrics Always:
# Never rely on single metric
# Always evaluate: Accuracy + F1 + AUC + Confusion Matrix
3. Business Metric Alignment:
# Connect the ML metric to business impact
# Example: Fraud Detection
# 1 FN (missed fraud) = Rs. 50,000 loss
# 1 FP (false alarm) = Rs. 500 investigation cost
def business_cost(y_true, y_pred, fn_cost=50000, fp_cost=500):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    total_cost = (fn * fn_cost) + (fp * fp_cost)
    return total_cost, fn, fp
cost, fn_count, fp_count = business_cost(y_test, y_pred_optimal)
print(f"Total Business Cost: Rs. {cost:,}")
print(f"Missed Frauds: {fn_count}, False Alarms: {fp_count}")
4. Statistical Significance:
from scipy import stats
# Compare the scores of two different models
scores_model1 = cross_val_score(
RandomForestClassifier(random_state=42), X, y, cv=10, scoring='f1'
)
scores_model2 = cross_val_score(
GradientBoostingClassifier(random_state=42), X, y, cv=10, scoring='f1'
)
t_stat, p_value = stats.ttest_rel(scores_model1, scores_model2)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Difference is STATISTICALLY SIGNIFICANT")
else:
    print("Difference is NOT statistically significant")
5. Calibration — Probability Reliability:
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
# Model ki probabilities reliable hain?
prob_true, prob_pred = calibration_curve(y_test, y_prob_lr, n_bins=10)
plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, marker='o', linewidth=1, label='Logistic Regression')
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly Calibrated')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration Curve (Reliability Diagram)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
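`CalibratedClassifierCV` upar import hua hai. Agar curve diagonal se door ho, to isse probabilities calibrate ki ja sakti hain. Ek minimal self-contained sketch, jisme `LinearSVC` aur synthetic data meri assumptions hain:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=2000, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# LinearSVC raw probability nahi deta; calibration wrapper se milti hai
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)
probs = calibrated.predict_proba(X_te)[:, 1]
print(f"Log Loss (calibrated): {log_loss(y_te, probs):.4f}")
```

`method="sigmoid"` (Platt scaling) chhote datasets pe safer hai; zyada data ho to `method="isotonic"` try kar sakte ho.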
12. 📊 Comparison Table {#comparison-table}
Classification Metrics Comparison
| Metric | Range | Higher=Better | Imbalanced OK? | Probability Based | Best For |
|---|---|---|---|---|---|
| Accuracy | 0–1 | ✅ | ❌ | ❌ | Balanced datasets |
| Precision | 0–1 | ✅ | Partial | ❌ | Minimize false alarms |
| Recall | 0–1 | ✅ | Partial | ❌ | Minimize missed cases |
| F1-Score | 0–1 | ✅ | Partial | ❌ | Imbalanced, both matter |
| F-Beta | 0–1 | ✅ | ✅ | ❌ | Custom precision/recall weight |
| AUC-ROC | 0–1 | ✅ | ✅ | ✅ | Ranking, probability models |
| PR-AUC | 0–1 | ✅ | ✅ | ✅ | Very imbalanced data |
| MCC | -1 to 1 | ✅ | ✅ | ❌ | Imbalanced, gold standard |
| Log Loss | 0–∞ | ❌ | ✅ | ✅ | Probability calibration |
| Cohen's Kappa | -1 to 1 | ✅ | ✅ | ❌ | vs. chance comparison |
Regression Metrics Comparison
| Metric | Unit | Outlier Sensitive | Interpretable | Best For |
|---|---|---|---|---|
| MAE | Same as target | ❌ | ✅ | General purpose |
| MSE | Target² | ✅ | ❌ | Penalize large errors |
| RMSE | Same as target | ✅ | ✅ | Most common |
| R² | Unitless | ✅ | ✅ | Model comparison |
| Adjusted R² | Unitless | ✅ | ✅ | Feature selection |
| MAPE | % | Partial | ✅ | Business reporting |
| Huber Loss | Same as target | ❌ | Partial | Robust regression |
13. 🧪 Data Science Perspective {#data-science-perspective}
ML Pipeline mein Evaluation kahan aata hai?
Data Collection
↓
Data Preprocessing
↓
Feature Engineering
↓
Model Training
↓
Model Evaluation ← YE STEP HAI HAMARA
↓
Hyperparameter Tuning
↓
Final Evaluation on Test Set
↓
Model Deployment
↓
Monitoring (Production Evaluation)
Hyperparameter Tuning with Custom Scoring
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 7, None],
'min_samples_split': [2, 5, 10],
}
# AUC-ROC pe optimize karo
grid_search = GridSearchCV(
RandomForestClassifier(class_weight='balanced', random_state=42),
param_grid,
cv=5,
scoring='roc_auc', # Ya 'f1', 'recall', 'precision'
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best Params: {grid_search.best_params_}")
print(f"Best AUC-ROC: {grid_search.best_score_:.4f}")
Custom Scorer Banana
from sklearn.metrics import make_scorer
# Business-specific metric: Rs. mein loss minimize karo
def business_loss_scorer(y_true, y_pred, fn_cost=50000, fp_cost=500):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    total_cost = -(fn * fn_cost + fp * fp_cost)  # Negative kyunki GridSearch maximize karta hai
    return total_cost
custom_scorer = make_scorer(business_loss_scorer)
grid_search_custom = GridSearchCV(
RandomForestClassifier(class_weight='balanced', random_state=42),
{'n_estimators': [50, 100], 'max_depth': [3, 5]},
cv=5,
scoring=custom_scorer,
n_jobs=-1
)
grid_search_custom.fit(X_train, y_train)
print(f"Best Params for min. business loss: {grid_search_custom.best_params_}")
Production Monitoring — Concept
# Production mein model drift detect karna
def monitor_model_performance(y_true_prod, y_pred_prod, baseline_f1=0.80):
    current_f1 = f1_score(y_true_prod, y_pred_prod)
    print(f"Baseline F1: {baseline_f1:.4f}")
    print(f"Current F1: {current_f1:.4f}")
    print(f"Drift: {baseline_f1 - current_f1:.4f}")
    if baseline_f1 - current_f1 > 0.05:
        print("🚨 ALERT: Significant model drift detected! Retraining needed.")
    else:
        print("✅ Model performance stable.")
# Simulate production predictions
y_prod_true = np.random.choice([0, 1], size=500, p=[0.85, 0.15])
y_prod_pred = np.random.choice([0, 1], size=500, p=[0.85, 0.15])
monitor_model_performance(y_prod_true, y_prod_pred)
14. 🎯 Interview Questions {#interview-questions}
Basic Level
Q1. Accuracy reliable metric kyun nahi hai imbalanced datasets mein?
Agar 95% data class 0 ka hai aur model sab ko class 0 predict kare, accuracy 95% hogi — lekin model completely useless hai. Isliye Recall, F1, AUC use karte hain.
Q2. Precision aur Recall mein difference batao.
Precision = "Jo positive bola uska kitna sahi?" → TP/(TP+FP)
Recall = "Jo actually positive tha uska kitna pakda?" → TP/(TP+FN)
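Ek chhota, verify karne layak example (numbers mere illustrative assumptions hain):

```python
from sklearn.metrics import precision_score, recall_score

# 10 actual positives; model 8 ko positive bolta hai, jisme 6 sahi hain
y_true = [1]*10 + [0]*90
y_pred = [1]*6 + [0]*4 + [1]*2 + [0]*88  # 6 TP, 4 FN, 2 FP, 88 TN

print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 6/(6+2) = 0.75
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 6/(6+4) = 0.60
```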
Q3. F1-Score kab use karte hain?
Jab Precision aur Recall dono important hon, aur dataset imbalanced ho. Yeh dono ka harmonic mean hai.
Q4. Confusion Matrix ke 4 elements kya hain?
TP (sahi positive), TN (sahi negative), FP (galat positive / Type I Error), FN (galat negative / Type II Error)
Q5. ROC curve kya hai?
Alag-alag thresholds pe TPR vs FPR ka graph. AUC is curve ke neeche ka area hai — jitna zyada utna better.
Intermediate Level
Q6. Type I aur Type II Error mein difference?
Type I Error = False Positive (Innocent ko guilty bola)
Type II Error = False Negative (Guilty ko innocent bola)
Medical domain mein Type II zyada dangerous hota hai.
Q7. ROC-AUC vs PR-AUC — kab kya use karein?
ROC-AUC better hai jab negative class majority ho aur false positives count karna ho. PR-AUC better hai jab positive class minority ho (highly imbalanced) — fraud, rare disease detection.
Q8. Macro vs Weighted vs Micro average kya hai?
Macro: Simple average of per-class metrics
Weighted: Class size se weighted average
Micro: Global TP/FP/FN se calculate
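Teeno averages ka difference ek toy multi-class example pe dekha ja sakta hai (data mera assumption hai):

```python
from sklearn.metrics import f1_score

# 3-class toy data: class 0 majority (6), classes 1 aur 2 minority (2-2)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 2, 0]

for avg in ["macro", "weighted", "micro"]:
    print(f"{avg:>8}: {f1_score(y_true, y_pred, average=avg):.4f}")
```

Macro har class ko equal weight deta hai, isliye minority class ki galti usme sabse zyada dikhti hai.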
Q9. R² negative kyun ho sakta hai?
Jab model ek simple mean prediction se bhi worse perform kare. Usually severe underfitting ya wrong model type use karne pe hota hai.
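Isko do line mein verify kiya ja sakta hai: predictions jaanbujh ke ulte trend mein rakho (toy numbers mere hain):

```python
from sklearn.metrics import r2_score

y_true = [10, 20, 30, 40, 50]
y_pred = [50, 40, 30, 20, 10]  # ulta trend: mean prediction se bhi worse

print(f"R²: {r2_score(y_true, y_pred):.2f}")
# → R²: -3.00
```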
Q10. Log Loss kya penalize karta hai?
Confident aur galat predictions ko bahut zyada penalize karta hai. Isliye yeh probability calibration ke liye important hai.
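Demo: same galat prediction, lekin confidence badhao to penalty explode karti hai (numbers mere assumptions hain):

```python
from sklearn.metrics import log_loss

y_true = [1, 1]  # dono actually positive

# Unsure-galat (prob 0.4) vs confident-galat (prob 0.01)
print(f"{log_loss(y_true, [0.4, 0.4], labels=[0, 1]):.4f}")    # ≈ 0.92
print(f"{log_loss(y_true, [0.01, 0.01], labels=[0, 1]):.4f}")  # ≈ 4.61, penalty 5x
```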
Advanced Level
Q11. MCC (Matthews Correlation Coefficient) F1 se better kyun hai imbalanced data mein?
F1 sirf Positive class pe focus karta hai (TP, FP, FN) — TN ignore karta hai. MCC saari four values (TP, TN, FP, FN) use karta hai, isliye zyada balanced aur trustworthy hai.
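Quick demo: 90% positive data pe "sab positive" bolne wale model ka F1 high dikhta hai, lekin MCC zero nikalta hai (toy data mera assumption hai):

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# 90% positive data, model sabko positive bolta hai (zero skill)
y_true = [1]*90 + [0]*10
y_pred = [1]*100

print(f"F1:  {f1_score(y_true, y_pred):.4f}")             # ≈ 0.9474, high lagta hai
print(f"MCC: {matthews_corrcoef(y_true, y_pred):.4f}")    # 0.0, no skill expose
```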
Q12. Optimal threshold kaise decide karein?
Youden's J statistic: ROC curve se max(TPR - FPR)
PR curve ka F1-maximizing point
Business cost function minimize karke
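Youden's J ko code mein aise nikala ja sakta hai. Ye ek sketch hai, jisme synthetic data aur `LogisticRegression` meri assumptions hain:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, probs)
j = tpr - fpr                    # Youden's J har threshold pe
best = thresholds[np.argmax(j)]  # J-maximizing threshold
print(f"Optimal threshold (Youden's J): {best:.3f}")
```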
Q13. AUC-ROC ka physical interpretation kya hai?
Probability ki randomly chosen positive sample ka score randomly chosen negative sample se higher hoga. Ranking ability measure karta hai.
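Is interpretation ko directly verify kiya ja sakta hai: har (positive, negative) pair compare karo aur sklearn se match karo (toy scores mere assumptions hain):

```python
import numpy as np
from itertools import product
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# AUC = P(random positive ka score > random negative ka score); tie = 0.5 credit
pos = scores[y_true == 1]
neg = scores[y_true == 0]
manual_auc = np.mean([1.0 if p > n else 0.5 if p == n else 0.0
                      for p, n in product(pos, neg)])

print(f"Pairwise AUC: {manual_auc:.4f}")
print(f"sklearn AUC:  {roc_auc_score(y_true, scores):.4f}")  # same value
```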
Q14. Cross-validation mein konsa scoring metric use karein?
Business objective ke hisaab se:
Imbalanced → 'f1', 'roc_auc', 'balanced_accuracy'
Probability needed → 'roc_auc', 'neg_log_loss'
Regression → 'neg_root_mean_squared_error', 'r2'
Q15. Model calibration kya hota hai? Kab important hai?
Model ki predicted probabilities ki accuracy. Agar model 0.8 probability bolta hai, to 80% cases mein actually positive hona chahiye. Risk scoring, insurance, medical diagnosis mein critical hai.
Scenario-Based
Q16. Tumhare fraud detection model ki accuracy 99% hai. Client khush hai. Kya tum bhi khush ho?
Nahi! Dataset likely imbalanced hai. Shayad model sab ko "not fraud" bol raha ho. Recall check karo — kitne actual frauds pakde? AUC-ROC aur F1 check karo fraud class ke liye.
Q17. Cancer detection model: Precision 90%, Recall 60% hai. Theek hai?
Nahi — Recall 60% bahut low hai. 40% cancer patients miss ho rahe hain. Medical diagnosis mein Recall maximize karna chahiye, chahe Precision thodi kam ho jaaye.
Q18. Do models ka F1 same hai (0.82), kaise decide karein konsa better hai?
AUC-ROC compare karo (threshold-independent)
PR-AUC compare karo (imbalanced data mein)
Cross-validation scores ki stability dekho (mean ± std)
Statistical significance test karo
Business cost function pe evaluate karo
15. 🎓 Conclusion {#conclusion}
Bhai, aaj humne ML Evaluation Metrics ka poora safar kiya —
Key Learnings
| Topic | Key Takeaway |
|---|---|
| Accuracy | Sirf balanced datasets mein useful |
| Precision | "False alarms" minimize karne ke liye |
| Recall | "Miss kiye cases" minimize karne ke liye |
| F1-Score | Dono ka balance, imbalanced data ke liye go-to |
| AUC-ROC | Threshold-independent, ranking ability |
| PR-AUC | Highly imbalanced data ka AUC |
| MCC | Imbalanced data ka best single metric |
| Log Loss | Probability quality check |
| MAE/RMSE | Regression evaluation |
| R² | Model fit quality for regression |
Quick Decision Guide
Mera dataset balanced hai?
├── YES → Accuracy + F1-Score
└── NO → MCC + Balanced Accuracy + AUC-ROC
FN miss karna costly hai?
├── YES → Recall maximize karo (Cancer, Fraud)
└── NO → Precision focus karo (Spam filter)
Model probability output karta hai?
├── YES → AUC-ROC + Log Loss + Calibration
└── NO → Threshold-based: F1, Precision, Recall
Regression problem hai?
├── Outliers hain? → Huber Loss / MAE
├── Large errors costly? → RMSE
└── Model comparison? → R² / Adjusted R²
Final Advice
Ek metric pe mat ruko — hamesha 3-4 metrics dekho
Business impact se align karo — Rs. mein loss calculate karo
Threshold tune karo — default 0.5 hamesha best nahi hota
Cross-validation use karo — single train-test split misleading hota hai
Production mein monitor karo — model drift real hai
"The goal is not to have the most accurate model, but to have the most useful one."
Happy Learning! Agar koi doubt ho toh comment section mein poochho!
SQLDataDev— Making Data Science Simple for Indian Developers 🇮🇳