"Ek doctor jo 100 patients mein se 95 ko healthy bata de aur 5 cancer patients miss kar de — kya woh 95% accurate doctor acha hai?" Yahi problem hai sirf accuracy pe rely karne ki. Aao sahi tarike se ML models evaluate karna seekhein.
1. Introduction — Why Evaluation Matters
Picture this — you built an ML model, training accuracy came out at 98%. You were thrilled. You deployed the model. In production, it got 70% of cases wrong. What happened?
Answer: You were looking at the wrong metric.
Real-World Scenario
Suppose your model predicts whether a transaction is fraudulent or not.
Out of 1000 transactions, 990 are normal and 10 are fraud.
If the model labels everything as normal — accuracy = 990/1000 = 99% 🎉
But it didn't catch a single fraud 😱
In other words — a 99% accurate model is completely useless in this case.
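This failure mode is easy to reproduce. A minimal sketch (the 990/10 split is made up to match the scenario above):

```python
from sklearn.metrics import accuracy_score, recall_score

# 990 normal (0) and 10 fraud (1) transactions
y_true = [0] * 990 + [1] * 10
# A "model" that labels everything as normal
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))                    # 0.99, looks great
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0: caught zero frauds
```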
Why Multiple Metrics?
| Situation | Best Metric |
|---|---|
| Balanced dataset | Accuracy |
| Fraud / Cancer detection | Recall, F1 |
| Email spam filter | Precision |
| Ranking / Recommendation | AUC-ROC |
| Price prediction | RMSE, MAE |
Today we'll build up every metric from zero — definition, formula, code, and real-world use.
2. Basic Concepts — Classification vs Regression
What are you predicting?
Classification → predicting a category
Spam or Not Spam?
Cancer or no cancer?
Regression → predicting a number
What will the house's price be?
What will tomorrow's temperature be?
How high will next month's sales be?
The two need different sets of metrics
Classification Metrics:
→ Accuracy, Precision, Recall, F1-Score, AUC-ROC
Regression Metrics:
→ MAE, MSE, RMSE, R², MAPE
3. Confusion Matrix — The Foundation
A Confusion Matrix is a 2x2 table that shows what the model predicted correctly and what it got wrong.
For Binary Classification
| | Predicted: YES | Predicted: NO |
|---|---|---|
| Actual: YES | True Positive (TP) | False Negative (FN) |
| Actual: NO | False Positive (FP) | True Negative (TN) |
Definitions (Plain Language)
| Term | Full Name | Meaning |
|---|---|---|
| TP | True Positive | Model said YES, it was actually YES ✅ |
| TN | True Negative | Model said NO, it was actually NO ✅ |
| FP | False Positive | Model said YES, it was actually NO ❌ (Type I Error) |
| FN | False Negative | Model said NO, it was actually YES ❌ (Type II Error) |
Real Example: COVID Test
TP → Test positive, patient actually COVID positive ✅
TN → Test negative, patient actually COVID negative ✅
FP → Test positive, patient actually COVID negative ❌ (False Alarm)
FN → Test negative, patient actually COVID positive ❌ (Dangerous miss!)
Code — Confusion Matrix in Python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import numpy as np
# Actual values (ground truth)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
# Model predictions
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
# Build the confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
# Print how many predictions were right and how many wrong
tn, fp, fn, tp = cm.ravel()
print(f"True Negatives: {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives: {tp}")
Output:
Confusion Matrix:
[[4 1]
[1 4]]
TN=4, FP=1
FN=1, TP=4
4. All Classification Metrics
4.1 Accuracy
Definition: Correct predictions / Total predictions
Accuracy = (TP + TN) / (TP + TN + FP + FN)
When to use: when the dataset is balanced (each class has roughly the same count)
When to avoid: on imbalanced datasets (fraud, cancer, etc.)
from sklearn.metrics import accuracy_score
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2%}") # 80.00%
4.2 Precision
Definition: Of all the cases the model labeled Positive, how many were actually Positive?
Precision = TP / (TP + FP)
Simple Analogy: The model flagged 10 emails as "spam" and 8 of them actually were → Precision = 80%
When it matters: when a False Positive is costly
Example: The email filter sent an important email to spam → Bad!
Example: An innocent person found guilty in court → Very Bad!
from sklearn.metrics import precision_score
precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.2%}") # 80.00%
4.3 Recall (Sensitivity / True Positive Rate)
Definition: Of all the cases that were actually Positive, how many did the model label Positive?
Recall = TP / (TP + FN)
Simple Analogy: Out of 100 COVID patients, how many did the model catch?
When it matters: when a False Negative is costly
Example: A cancer patient told they're "healthy" → Dangerous!
Example: A fraud transaction slipped through → Financial loss!
from sklearn.metrics import recall_score
recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.2%}") # 80.00%
Precision vs Recall Trade-off
These two live in constant tension:
High Precision → the model is very cautious, says YES only when it's confident
→ fewer FP, but FN can rise
High Recall → the model is aggressive, says YES to more cases
→ fewer FN, but FP can rise
Changing the threshold affects both — we'll explore this in the AUC section.
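To see the trade-off concretely, here is a small sketch with made-up probabilities: raising the threshold lifts Precision and drops Recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy ground truth and predicted probabilities (made up for illustration)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.6, 0.4, 0.55, 0.3, 0.2, 0.1, 0.05, 0.45])

for threshold in [0.3, 0.5, 0.7]:
    # Everything at or above the threshold becomes a YES
    y_pred = (y_prob >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    print(f"threshold={threshold:.1f} → precision={p:.2f}, recall={r:.2f}")
```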
4.4 F1-Score
Definition: the Harmonic Mean of Precision and Recall
F1 = 2 × (Precision × Recall) / (Precision + Recall)
When to use: when both Precision and Recall matter, especially on imbalanced datasets.
Why Harmonic Mean, not Simple Average?
If Precision = 1.0 and Recall = 0.0 → simple average = 0.5 (misleading!)
Harmonic mean = 0.0 (correct — the model is useless)
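A quick numeric check of this claim (toy values):

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Extreme case: perfect precision, zero recall
print((1.0 + 0.0) / 2)   # simple average: 0.5 (misleading)
print(f1(1.0, 0.0))      # harmonic mean: 0.0 (honest)

# Balanced case: harmonic mean agrees with both inputs
print(f1(0.8, 0.8))      # 0.8
```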
from sklearn.metrics import f1_score
f1 = f1_score(y_true, y_pred)
print(f"F1-Score: {f1:.2%}")
# Weighted F1 for multi-class
f1_weighted = f1_score(y_true, y_pred, average='weighted')
4.5 F-Beta Score
Definition: A generalized version of F1 that lets you weight Precision or Recall more heavily.
F_beta = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
β < 1 → weights Precision more
β > 1 → weights Recall more
β = 1 → F1 score (balanced)
from sklearn.metrics import fbeta_score
# Beta=2 means Recall 2x more important than Precision
f2 = fbeta_score(y_true, y_pred, beta=2)
print(f"F2-Score: {f2:.4f}")
# Beta=0.5 means Precision 2x more important
f05 = fbeta_score(y_true, y_pred, beta=0.5)
print(f"F0.5-Score: {f05:.4f}")
Real Use:
Medical diagnosis → β=2 (Recall matters more; don't miss cases)
Spam filter → β=0.5 (Precision matters more; don't block legitimate email)
4.6 Specificity (True Negative Rate)
Definition: Of all the cases that were actually Negative, how many did the model correctly label Negative?
Specificity = TN / (TN + FP)
The mirror image of Recall — it describes the Negative class.
# Manual calculation
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
print(f"Specificity: {specificity:.2%}")
4.7 Matthews Correlation Coefficient (MCC)
Definition: Arguably the most balanced single metric for binary classification, reliable even on imbalanced datasets.
MCC = (TP×TN - FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
Range: -1 (worst) to +1 (best), 0 = random guessing
from sklearn.metrics import matthews_corrcoef
mcc = matthews_corrcoef(y_true, y_pred)
print(f"MCC: {mcc:.4f}")
Pro Tip: On an imbalanced dataset F1 can mislead you; MCC won't.
4.8 Cohen's Kappa
Definition: Compares the model's accuracy against chance-level accuracy.
Range: -1 to +1; 0 = chance-level agreement, 1 = perfect agreement
from sklearn.metrics import cohen_kappa_score
kappa = cohen_kappa_score(y_true, y_pred)
print(f"Cohen's Kappa: {kappa:.4f}")
| Kappa Value | Interpretation |
|---|---|
| < 0 | Worse than chance |
| 0.0 – 0.2 | Slight agreement |
| 0.2 – 0.4 | Fair agreement |
| 0.4 – 0.6 | Moderate agreement |
| 0.6 – 0.8 | Substantial agreement |
| 0.8 – 1.0 | Almost perfect |
4.9 Log Loss (Cross-Entropy Loss)
Definition: Penalizes the model's probability predictions — the more confident and wrong, the bigger the penalty.
Log Loss = -1/N × Σ [y×log(p) + (1-y)×log(1-p)]
When to use: when the model outputs probabilities, not just 0/1.
from sklearn.metrics import log_loss
# y_prob = probability of class 1
y_prob = [0.9, 0.1, 0.8, 0.3, 0.2, 0.95, 0.7, 0.05, 0.85, 0.15]
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
loss = log_loss(y_true, y_prob)
print(f"Log Loss: {loss:.4f}") # Lower is better
Lower Log Loss = Better model
4.10 Hamming Loss
Definition: Used for multi-label classification — the fraction of labels predicted wrong.
from sklearn.metrics import hamming_loss
# Multi-label example
y_true_ml = [[1, 0, 1], [0, 1, 1]]
y_pred_ml = [[1, 0, 0], [0, 1, 1]]
h_loss = hamming_loss(y_true_ml, y_pred_ml)
print(f"Hamming Loss: {h_loss:.4f}") # Lower is better
5. 📈 All Regression Metrics
5.1 MAE — Mean Absolute Error
Definition: Average of absolute differences between predicted and actual values.
MAE = (1/n) × Σ |y_actual - y_predicted|
Simple Analogy: You predicted a house price of 50 lakhs; it was actually 55 lakhs → error = 5 lakhs. MAE is the average of all such errors.
Properties:
Not strongly affected by outliers (robust)
Expressed in the same unit as the target variable
from sklearn.metrics import mean_absolute_error
y_true_reg = [100, 200, 300, 400, 500]
y_pred_reg = [110, 190, 320, 380, 510]
mae = mean_absolute_error(y_true_reg, y_pred_reg)
print(f"MAE: {mae:.2f}") # 18.0
5.2 MSE — Mean Squared Error
Definition: Average of squared differences.
MSE = (1/n) × Σ (y_actual - y_predicted)²
Properties:
Penalizes outliers very heavily (squared!)
The unit gets squared (price² → weird)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_true_reg, y_pred_reg)
print(f"MSE: {mse:.2f}") # 460.0
5.3 RMSE — Root Mean Squared Error
Definition: The square root of MSE — the result comes back in the original unit.
RMSE = √MSE
Properties:
More interpretable than MSE
Sensitive to outliers
import numpy as np
rmse = np.sqrt(mse)
# OR (sklearn < 1.6; newer versions provide root_mean_squared_error instead of squared=False)
rmse = mean_squared_error(y_true_reg, y_pred_reg, squared=False)
print(f"RMSE: {rmse:.2f}") # 14.83
5.4 R² Score (Coefficient of Determination)
Definition: How much of the target variable's variation the model explains — typically on a 0–1 scale, though it can go negative.
R² = 1 - (SS_residual / SS_total)
SS_residual = Σ(y - ŷ)²
SS_total = Σ(y - ȳ)²
Interpretation:
R² = 1 → perfect model
R² = 0 → the model is just predicting the mean (useless)
R² < 0 → the model is worse than predicting the mean (very bad!)
from sklearn.metrics import r2_score
r2 = r2_score(y_true_reg, y_pred_reg)
print(f"R² Score: {r2:.4f}")
5.5 Adjusted R²
Definition: An improved version of R² that penalizes adding extra features.
Adjusted R² = 1 - [(1-R²)(n-1) / (n-k-1)]
where n = number of samples, k = number of features
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)
n = len(y_true_reg)
k = 3 # number of features
adj_r2 = adjusted_r2(r2, n, k)
print(f"Adjusted R²: {adj_r2:.4f}")
5.6 MAPE — Mean Absolute Percentage Error
Definition: Reports the error as a percentage.
MAPE = (1/n) × Σ |((y_actual - y_predicted) / y_actual)| × 100
Limitation: if an actual value is zero, you get a divide-by-zero error.
def mape(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
result = mape(y_true_reg, y_pred_reg)
print(f"MAPE: {result:.2f}%")
# Sklearn version (available from sklearn 0.24+)
from sklearn.metrics import mean_absolute_percentage_error
mape_sk = mean_absolute_percentage_error(y_true_reg, y_pred_reg) * 100
print(f"MAPE (sklearn): {mape_sk:.2f}%")
5.7 Huber Loss
Definition: A blend of MAE and MSE — MSE-like for small errors, MAE-like for large ones.
from sklearn.linear_model import HuberRegressor
import numpy as np
# Huber loss for training (outlier-robust regression)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.5, 2.1, 2.9, 4.2, 5.1])
model = HuberRegressor()
model.fit(X, y)
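HuberRegressor applies this loss internally during fitting. To see the piecewise shape itself, here is a hedged manual sketch (delta marks the cut-over between the quadratic and linear regimes):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    # Quadratic for residuals below delta, linear above: robust to outliers
    residual = np.abs(np.array(y_true, dtype=float) - np.array(y_pred, dtype=float))
    quadratic = 0.5 * residual**2
    linear = delta * residual - 0.5 * delta**2
    return np.mean(np.where(residual <= delta, quadratic, linear))

# Small error behaves like MSE; large error grows like MAE
print(huber_loss([1.0], [1.5]))   # 0.125 (quadratic region)
print(huber_loss([1.0], [11.0]))  # 9.5 (linear region; plain MSE would give 50)
```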
6. ⚖️ Intermediate Usage — Imbalanced Datasets
Problem: Imbalanced Data
In the real world, data is almost never balanced:
Fraud detection: 0.1% fraud, 99.9% normal
Medical diagnosis: 1% disease, 99% healthy
Churn prediction: 5% churn, 95% no churn
Solutions and Metrics
6.1 Class Weight
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Build an imbalanced dataset
X, y = make_classification(
n_samples=1000,
n_features=10,
weights=[0.95, 0.05], # 95% class 0, 5% class 1
random_state=42
)
# class_weight='balanced' adjusts the class weights automatically
model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X, y)
6.2 SMOTE — Synthetic Minority Over-sampling
# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from collections import Counter
print(f"Before SMOTE: {Counter(y)}")
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print(f"After SMOTE: {Counter(y_resampled)}")
6.3 Balanced Accuracy Score
from sklearn.metrics import balanced_accuracy_score
# Regular accuracy is misleading on imbalanced data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
y_pred_imb = model.predict(X_test)
regular_acc = accuracy_score(y_test, y_pred_imb)
balanced_acc = balanced_accuracy_score(y_test, y_pred_imb)
print(f"Regular Accuracy: {regular_acc:.2%}")
print(f"Balanced Accuracy: {balanced_acc:.2%}")
6.4 Classification Report — Everything in One Place
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_imb, target_names=['Normal', 'Fraud']))
Output:
precision recall f1-score support
Normal 0.97 0.98 0.98 190
Fraud 0.67 0.60 0.63 10
accuracy 0.96 200
macro avg 0.82 0.79 0.80 200
weighted avg 0.96 0.96 0.96 200
Macro vs Weighted Average:
Macro Avg → every class gets equal weight
Weighted Avg → weighted by class size
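The difference shows up clearly on a toy imbalanced example (labels made up for illustration):

```python
from sklearn.metrics import f1_score

# 8 samples of class 0 and 2 of class 1; the model nails class 0, struggles on class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

print(f1_score(y_true, y_pred, average='macro'))     # equal weight per class
print(f1_score(y_true, y_pred, average='weighted'))  # dominated by the big class 0
```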
7. 🔬 Advanced Concepts — AUC-ROC, PR Curve, Multi-class
7.1 ROC Curve — Receiver Operating Characteristic
Definition: The ROC curve shows how the True Positive Rate (Recall) and the False Positive Rate change as we vary the classification threshold.
TPR (Y-axis) = Recall = TP / (TP + FN)
FPR (X-axis) = FP / (FP + TN) [1 - Specificity]
What is a threshold?
The model outputs a probability (e.g., 0.73)
If threshold = 0.5 → 0.73 > 0.5 → Positive
Raise the threshold → stricter (fewer FP, more FN)
Lower the threshold → more lenient (more FP, fewer FN)
from sklearn.metrics import roc_curve, auc, roc_auc_score
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Models
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
# Probabilities
y_prob_lr = lr.predict_proba(X_test)[:, 1]
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob_lr)
roc_auc = auc(fpr, tpr)
# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"AUC-ROC Score: {roc_auc:.4f}")
7.2 AUC — Area Under the Curve
Definition: The area under the ROC curve, ranging from 0 to 1.
| AUC Value | Interpretation |
|---|---|
| 1.0 | Perfect model |
| 0.9 – 1.0 | Excellent |
| 0.8 – 0.9 | Good |
| 0.7 – 0.8 | Fair |
| 0.6 – 0.7 | Poor |
| 0.5 | Random guessing |
| < 0.5 | Worse than random |
Physical meaning: AUC = the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example.
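You can verify this interpretation by brute force: enumerate every (positive, negative) pair and count how often the positive scores higher (a sketch with toy probabilities).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores (made up for illustration)
y_true = np.array([1, 1, 1, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1])

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
rank_prob = np.mean([1.0 if p > n else 0.5 if p == n else 0.0
                     for p in pos for n in neg])

print(rank_prob, roc_auc_score(y_true, y_prob))  # the two numbers match
```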
7.3 Precision-Recall Curve
Definition: Similar to ROC, but it plots Precision against Recall instead of using FPR.
When to prefer it over ROC: on highly imbalanced datasets, the PR curve is more informative.
from sklearn.metrics import precision_recall_curve, average_precision_score
precision_vals, recall_vals, pr_thresholds = precision_recall_curve(y_test, y_prob_lr)
avg_precision = average_precision_score(y_test, y_prob_lr)
plt.figure(figsize=(8, 6))
plt.plot(recall_vals, precision_vals, color='green', lw=2,
label=f'PR Curve (AP = {avg_precision:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
7.4 Finding Optimal Threshold
# Youden's J Statistic: Best threshold on ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob_lr)
j_scores = tpr - fpr
best_idx = np.argmax(j_scores)
best_threshold = thresholds[best_idx]
print(f"Optimal Threshold (Youden's J): {best_threshold:.4f}")
print(f"At this threshold → TPR: {tpr[best_idx]:.4f}, FPR: {fpr[best_idx]:.4f}")
# Apply optimal threshold
y_pred_optimal = (y_prob_lr >= best_threshold).astype(int)
print(f"\nWith optimal threshold:")
print(classification_report(y_test, y_pred_optimal))
7.5 Multi-Class Metrics
When there are 3+ classes (e.g., Cat/Dog/Bird):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
roc_auc_score, accuracy_score)
from sklearn.preprocessing import label_binarize
# Multi-class dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_mc = rf.predict(X_test)
y_prob_mc = rf.predict_proba(X_test)
# Classification Report
print(classification_report(y_test, y_pred_mc, target_names=iris.target_names))
# Multi-class AUC (One-vs-Rest)
auc_ovr = roc_auc_score(y_test, y_prob_mc, multi_class='ovr', average='macro')
print(f"AUC-ROC (OvR, Macro): {auc_ovr:.4f}")
# Multi-class AUC (One-vs-One)
auc_ovo = roc_auc_score(y_test, y_prob_mc, multi_class='ovo', average='macro')
print(f"AUC-ROC (OvO, Macro): {auc_ovo:.4f}")
8. 🌍 Real-World Use Cases
8.1 Medical Diagnosis (Cancer Detection)
Priority: High Recall (no patient should be missed). Acceptable trade-off: lower Precision (some healthy people get flagged for extra checks).
# Cancer detection: Recall matters most
# Even if Precision suffers, we want to catch ALL cancer cases
from sklearn.svm import SVC
svc = SVC(probability=True, class_weight='balanced', random_state=42)
svc.fit(X_train, y_train)
y_prob_svc = svc.predict_proba(X_test)[:, 1]
# Lower threshold → Higher Recall
y_pred_conservative = (y_prob_svc >= 0.3).astype(int) # Threshold = 0.3
print("Conservative threshold (0.3) - High Recall:")
print(classification_report(y_test, y_pred_conservative))
8.2 Email Spam Filter
Priority: High Precision (legitimate email must not land in spam). Acceptable trade-off: lower Recall (some spam may reach the inbox).
# Higher threshold → Higher Precision
y_pred_strict = (y_prob_lr >= 0.7).astype(int) # Threshold = 0.7
print("Strict threshold (0.7) - High Precision:")
print(classification_report(y_test, y_pred_strict))
8.3 Fraud Detection
Priority: F1-Score + AUC-ROC + Balanced Accuracy
# Complete fraud detection evaluation
from sklearn.metrics import balanced_accuracy_score, precision_score, recall_score

def fraud_model_report(y_true, y_pred, y_prob, model_name="Model"):
    print(f"\n{'='*50}")
    print(f" {model_name} — Complete Evaluation")
    print(f"{'='*50}")
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"TP: {tp}, TN: {tn}, FP: {fp}, FN: {fn}")
    print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
    print(f"Balanced Accuracy: {balanced_accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall: {recall_score(y_true, y_pred):.4f}")
    print(f"F1-Score: {f1_score(y_true, y_pred):.4f}")
    print(f"F2-Score: {fbeta_score(y_true, y_pred, beta=2):.4f}")
    print(f"AUC-ROC: {roc_auc_score(y_true, y_prob):.4f}")
    print(f"MCC: {matthews_corrcoef(y_true, y_pred):.4f}")
    print(f"Log Loss: {log_loss(y_true, y_prob):.4f}")

# Usage
fraud_model_report(y_test, y_pred_optimal, y_prob_lr, "Logistic Regression")
8.4 House Price Prediction (Regression)
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X_h, y_h = housing.data, housing.target
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
X_h, y_h, test_size=0.2, random_state=42
)
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)
gbr.fit(X_train_h, y_train_h)
y_pred_h = gbr.predict(X_test_h)
# Complete regression evaluation
mae_h = mean_absolute_error(y_test_h, y_pred_h)
rmse_h = np.sqrt(mean_squared_error(y_test_h, y_pred_h))
r2_h = r2_score(y_test_h, y_pred_h)
mape_h = mean_absolute_percentage_error(y_test_h, y_pred_h) * 100
print(f"MAE: {mae_h:.4f}")
print(f"RMSE: {rmse_h:.4f}")
print(f"R²: {r2_h:.4f}")
print(f"MAPE: {mape_h:.2f}%")
9. 💻 Practical Examples with Code
Example 1: Complete Model Comparison Pipeline
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, matthews_corrcoef,
balanced_accuracy_score, log_loss)
import warnings
warnings.filterwarnings('ignore')
# Build the dataset
X, y = make_classification(
n_samples=2000, n_features=20, n_informative=15,
n_redundant=3, weights=[0.85, 0.15], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Models
models = {
'Logistic Regression': LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000),
'Decision Tree': DecisionTreeClassifier(class_weight='balanced', random_state=42),
'Random Forest': RandomForestClassifier(class_weight='balanced', random_state=42),
'Gradient Boosting': GradientBoostingClassifier(random_state=42),
'SVM': SVC(probability=True, class_weight='balanced', random_state=42),
'KNN': KNeighborsClassifier(),
'Naive Bayes': GaussianNB(),
}
# Evaluate all models
results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred_m = model.predict(X_test)
    y_prob_m = model.predict_proba(X_test)[:, 1]
    results.append({
        'Model': name,
        'Accuracy': round(accuracy_score(y_test, y_pred_m), 4),
        'Bal. Accuracy': round(balanced_accuracy_score(y_test, y_pred_m), 4),
        'Precision': round(precision_score(y_test, y_pred_m), 4),
        'Recall': round(recall_score(y_test, y_pred_m), 4),
        'F1-Score': round(f1_score(y_test, y_pred_m), 4),
        'AUC-ROC': round(roc_auc_score(y_test, y_prob_m), 4),
        'MCC': round(matthews_corrcoef(y_test, y_pred_m), 4),
        'Log Loss': round(log_loss(y_test, y_prob_m), 4),
    })
# Display as DataFrame
df_results = pd.DataFrame(results).set_index('Model')
print(df_results.sort_values('AUC-ROC', ascending=False).to_string())
Example 2: Cross-Validation with Multiple Metrics
from sklearn.model_selection import cross_validate
scoring = {
'accuracy': 'accuracy',
'precision': 'precision',
'recall': 'recall',
'f1': 'f1',
'roc_auc': 'roc_auc',
'balanced_accuracy': 'balanced_accuracy',
'neg_log_loss': 'neg_log_loss',
}
rf_model = RandomForestClassifier(class_weight='balanced', n_estimators=100, random_state=42)
cv_results = cross_validate(
rf_model, X, y,
cv=5,
scoring=scoring,
return_train_score=True
)
print("5-Fold Cross-Validation Results:")
print("-" * 40)
for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
    test_scores = cv_results[f'test_{metric}']
    print(f"{metric.capitalize():20s}: {test_scores.mean():.4f} ± {test_scores.std():.4f}")
Example 3: Custom Evaluation Dashboard
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (confusion_matrix, roc_curve, precision_recall_curve,
                             ConfusionMatrixDisplay, auc, average_precision_score,
                             accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score)

def full_evaluation_dashboard(model, X_test, y_test, model_name="Model"):
    """
    Complete visual evaluation dashboard for a binary classifier.
    """
    y_pred_d = model.predict(X_test)
    y_prob_d = model.predict_proba(X_test)[:, 1]
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    fig.suptitle(f'{model_name} — Complete Evaluation Dashboard', fontsize=16, fontweight='bold')

    # 1. Confusion Matrix
    cm_d = confusion_matrix(y_test, y_pred_d)
    ConfusionMatrixDisplay(cm_d).plot(ax=axes[0, 0], cmap='Blues')
    axes[0, 0].set_title('Confusion Matrix')

    # 2. ROC Curve
    fpr_d, tpr_d, _ = roc_curve(y_test, y_prob_d)
    roc_auc_d = auc(fpr_d, tpr_d)
    axes[0, 1].plot(fpr_d, tpr_d, 'darkorange', lw=2, label=f'AUC = {roc_auc_d:.3f}')
    axes[0, 1].plot([0, 1], [0, 1], 'navy', linestyle='--')
    axes[0, 1].set_xlabel('FPR'); axes[0, 1].set_ylabel('TPR')
    axes[0, 1].set_title('ROC Curve'); axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # 3. Precision-Recall Curve
    prec_d, rec_d, _ = precision_recall_curve(y_test, y_prob_d)
    ap_d = average_precision_score(y_test, y_prob_d)
    axes[0, 2].plot(rec_d, prec_d, 'green', lw=2, label=f'AP = {ap_d:.3f}')
    axes[0, 2].set_xlabel('Recall'); axes[0, 2].set_ylabel('Precision')
    axes[0, 2].set_title('Precision-Recall Curve'); axes[0, 2].legend()
    axes[0, 2].grid(True, alpha=0.3)

    # 4. Metrics Bar Chart
    metrics_d = {
        'Accuracy': accuracy_score(y_test, y_pred_d),
        'Precision': precision_score(y_test, y_pred_d),
        'Recall': recall_score(y_test, y_pred_d),
        'F1-Score': f1_score(y_test, y_pred_d),
        'Bal. Acc': balanced_accuracy_score(y_test, y_pred_d),
        'AUC-ROC': roc_auc_d,
    }
    colors_d = ['#2196F3', '#4CAF50', '#FF9800', '#9C27B0', '#F44336', '#00BCD4']
    bars = axes[1, 0].bar(metrics_d.keys(), metrics_d.values(), color=colors_d, alpha=0.8)
    axes[1, 0].set_ylim(0, 1.1)
    axes[1, 0].set_title('All Metrics Overview')
    for bar_d, val_d in zip(bars, metrics_d.values()):
        axes[1, 0].text(bar_d.get_x() + bar_d.get_width()/2., bar_d.get_height() + 0.01,
                        f'{val_d:.2f}', ha='center', va='bottom', fontsize=9)

    # 5. Probability Distribution
    axes[1, 1].hist(y_prob_d[y_test == 0], bins=30, alpha=0.6, color='blue', label='Class 0 (Negative)')
    axes[1, 1].hist(y_prob_d[y_test == 1], bins=30, alpha=0.6, color='red', label='Class 1 (Positive)')
    axes[1, 1].set_xlabel('Predicted Probability')
    axes[1, 1].set_ylabel('Count')
    axes[1, 1].set_title('Probability Distribution')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

    # 6. Threshold Analysis
    thresholds_range = np.arange(0.1, 0.9, 0.05)
    precisions_t, recalls_t, f1s_t = [], [], []
    for thresh in thresholds_range:
        y_pred_t = (y_prob_d >= thresh).astype(int)
        precisions_t.append(precision_score(y_test, y_pred_t, zero_division=0))
        recalls_t.append(recall_score(y_test, y_pred_t, zero_division=0))
        f1s_t.append(f1_score(y_test, y_pred_t, zero_division=0))
    axes[1, 2].plot(thresholds_range, precisions_t, 'b-', label='Precision')
    axes[1, 2].plot(thresholds_range, recalls_t, 'r-', label='Recall')
    axes[1, 2].plot(thresholds_range, f1s_t, 'g-', label='F1-Score')
    axes[1, 2].set_xlabel('Threshold')
    axes[1, 2].set_ylabel('Score')
    axes[1, 2].set_title('Threshold vs Metrics')
    axes[1, 2].legend()
    axes[1, 2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()
    return metrics_d

# Usage
rf_final = RandomForestClassifier(class_weight='balanced', n_estimators=100, random_state=42)
rf_final.fit(X_train, y_train)
metrics = full_evaluation_dashboard(rf_final, X_test, y_test, "Random Forest")
10. ⚠️ Edge Cases and Errors
Error 1: UndefinedMetricWarning — Zero Division
# Problem: when a class is never predicted at all
from sklearn.exceptions import UndefinedMetricWarning
import warnings
y_pred_all_zero = np.zeros(len(y_test), dtype=int)  # Predict everything as 0
# This triggers the warning
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")  # undo any earlier filterwarnings('ignore')
    p = precision_score(y_test, y_pred_all_zero)
    if w:
        print(f"Warning: {w[0].message}")
# Fix: use the zero_division parameter
p_safe = precision_score(y_test, y_pred_all_zero, zero_division=0)
print(f"Precision (safe): {p_safe}")
Error 2: Wrong Average for Multi-class
# Problem: calling a binary-default metric function on multi-class labels
y_mc_true = [0, 1, 2, 0, 1, 2]
y_mc_pred = [0, 2, 1, 0, 1, 2]
# WRONG — binary default
# precision_score(y_mc_true, y_mc_pred) # ValueError!
# CORRECT — specify average
print(precision_score(y_mc_true, y_mc_pred, average='macro'))
print(precision_score(y_mc_true, y_mc_pred, average='micro'))
print(precision_score(y_mc_true, y_mc_pred, average='weighted'))
Average Types:
| Type | Meaning |
|---|---|
| macro | Simple average over all classes (can mislead with imbalanced classes) |
| micro | Compute the metric from global TP/FP/FN counts |
| weighted | Average weighted by class support |
| None | Return a separate score for each class |
Error 3: Data Leakage in Evaluation
# WRONG — fitting on the test data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# DO NOT DO THIS:
X_scaled_wrong = scaler.fit_transform(X) # Fitted on the entire dataset!
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X_scaled_wrong, y, test_size=0.2)
# CORRECT — use a Pipeline
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(class_weight='balanced', random_state=42))
])
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train_c, y_train_c)
y_pred_pipe = pipeline.predict(X_test_c)
print(f"F1-Score (No Leakage): {f1_score(y_test_c, y_pred_pipe):.4f}")
Error 4: Stratified Split Ignore Karna
# WRONG — Imbalanced data mein regular split
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X, y, test_size=0.2)
# CORRECT — Stratified split
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y # stratify=y is key!
)
from collections import Counter
print(f"Test set class distribution (Stratified): {Counter(y_test_s)}")
Error 5: Overfitting — Train vs Test Metrics
# Always check both the train and test metrics
rf_check = RandomForestClassifier(n_estimators=100, random_state=42)
rf_check.fit(X_train, y_train)
train_f1 = f1_score(y_train, rf_check.predict(X_train))
test_f1 = f1_score(y_test, rf_check.predict(X_test))
print(f"Train F1: {train_f1:.4f}")
print(f"Test F1: {test_f1:.4f}")
print(f"Gap: {train_f1 - test_f1:.4f}")
if train_f1 - test_f1 > 0.1:
    print("⚠️ Warning: Model is OVERFITTING!")
11. 🏆 Pro Developer Insights
Best Practices
1. Metric Selection Framework:
Step 1: Is the dataset balanced?
→ Yes → use Accuracy, F1
→ No → use Balanced Accuracy, MCC
Step 2: What do the errors cost?
→ FN costly (missed cancer/fraud) → focus on Recall
→ FP costly (false alarms) → focus on Precision
→ Both costly → F1 / F-beta
Step 3: Do you need probability output?
→ Yes → use AUC-ROC, Log Loss
→ No → threshold-based metrics
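The three steps above can be sketched as a tiny helper; the function name and rules are my own simplification, not a standard API:

```python
def suggest_metrics(balanced, costly_error, needs_probabilities):
    """Rough metric picker following the framework above.

    costly_error: 'fn', 'fp', or 'both'
    """
    # Step 1: balanced vs imbalanced dataset
    metrics = ['Accuracy', 'F1'] if balanced else ['Balanced Accuracy', 'MCC']
    # Step 2: which error is expensive
    metrics.append({'fn': 'Recall', 'fp': 'Precision', 'both': 'F-beta'}[costly_error])
    # Step 3: probability output available
    if needs_probabilities:
        metrics += ['AUC-ROC', 'Log Loss']
    return metrics

# Fraud detection: imbalanced, missed fraud is expensive, probabilities available
print(suggest_metrics(balanced=False, costly_error='fn', needs_probabilities=True))
```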
2. Multiple Metrics Always:
# Never rely on single metric
# Always evaluate: Accuracy + F1 + AUC + Confusion Matrix
3. Business Metric Alignment:
# Connect the ML metric to business impact
# Example: Fraud Detection
# 1 FN (missed fraud) = Rs. 50,000 loss
# 1 FP (false alarm) = Rs. 500 investigation cost
def business_cost(y_true, y_pred, fn_cost=50000, fp_cost=500):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    total_cost = (fn * fn_cost) + (fp * fp_cost)
    return total_cost, fn, fp
cost, fn_count, fp_count = business_cost(y_test, y_pred_optimal)
print(f"Total Business Cost: Rs. {cost:,}")
print(f"Missed Frauds: {fn_count}, False Alarms: {fp_count}")
4. Statistical Significance:
from scipy import stats
# Compare the scores of two different models
scores_model1 = cross_val_score(
RandomForestClassifier(random_state=42), X, y, cv=10, scoring='f1'
)
scores_model2 = cross_val_score(
GradientBoostingClassifier(random_state=42), X, y, cv=10, scoring='f1'
)
t_stat, p_value = stats.ttest_rel(scores_model1, scores_model2)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Difference is STATISTICALLY SIGNIFICANT")
else:
    print("Difference is NOT statistically significant")
5. Calibration — Probability Reliability:
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
# Model ki probabilities reliable hain?
prob_true, prob_pred = calibration_curve(y_test, y_prob_lr, n_bins=10)
plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, marker='o', linewidth=1, label='Logistic Regression')
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly Calibrated')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration Curve (Reliability Diagram)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
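`CalibratedClassifierCV` upar import hua hai. Agar curve diagonal se door ho, to isse probabilities calibrate ki ja sakti hain. Ek minimal self-contained sketch, jisme `LinearSVC` aur synthetic data meri assumptions hain:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=2000, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# LinearSVC raw probability nahi deta; calibration wrapper se milti hai
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)
probs = calibrated.predict_proba(X_te)[:, 1]
print(f"Log Loss (calibrated): {log_loss(y_te, probs):.4f}")
```

`method="sigmoid"` (Platt scaling) chhote datasets pe safer hai; zyada data ho to `method="isotonic"` try kar sakte ho.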
12. 📊 Comparison Table {#comparison-table}
Classification Metrics Comparison
| Metric | Range | Higher=Better | Imbalanced OK? | Probability Based | Best For |
|---|---|---|---|---|---|
| Accuracy | 0–1 | ✅ | ❌ | ❌ | Balanced datasets |
| Precision | 0–1 | ✅ | Partial | ❌ | Minimize false alarms |
| Recall | 0–1 | ✅ | Partial | ❌ | Minimize missed cases |
| F1-Score | 0–1 | ✅ | Partial | ❌ | Imbalanced, both matter |
| F-Beta | 0–1 | ✅ | ✅ | ❌ | Custom precision/recall weight |
| AUC-ROC | 0–1 | ✅ | ✅ | ✅ | Ranking, probability models |
| PR-AUC | 0–1 | ✅ | ✅ | ✅ | Very imbalanced data |
| MCC | -1 to 1 | ✅ | ✅ | ❌ | Imbalanced, gold standard |
| Log Loss | 0–∞ | ❌ | ✅ | ✅ | Probability calibration |
| Cohen's Kappa | -1 to 1 | ✅ | ✅ | ❌ | vs. chance comparison |
Regression Metrics Comparison
| Metric | Unit | Outlier Sensitive | Interpretable | Best For |
|---|---|---|---|---|
| MAE | Same as target | ❌ | ✅ | General purpose |
| MSE | Target² | ✅ | ❌ | Penalize large errors |
| RMSE | Same as target | ✅ | ✅ | Most common |
| R² | Unitless | ✅ | ✅ | Model comparison |
| Adjusted R² | Unitless | ✅ | ✅ | Feature selection |
| MAPE | % | Partial | ✅ | Business reporting |
| Huber Loss | Same as target | ❌ | Partial | Robust regression |
13. 🧪 Data Science Perspective {#data-science-perspective}
ML Pipeline mein Evaluation kahan aata hai?
Data Collection
↓
Data Preprocessing
↓
Feature Engineering
↓
Model Training
↓
Model Evaluation ← YE STEP HAI HAMARA
↓
Hyperparameter Tuning
↓
Final Evaluation on Test Set
↓
Model Deployment
↓
Monitoring (Production Evaluation)
Hyperparameter Tuning with Custom Scoring
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 7, None],
'min_samples_split': [2, 5, 10],
}
# AUC-ROC pe optimize karo
grid_search = GridSearchCV(
RandomForestClassifier(class_weight='balanced', random_state=42),
param_grid,
cv=5,
scoring='roc_auc', # Ya 'f1', 'recall', 'precision'
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best Params: {grid_search.best_params_}")
print(f"Best AUC-ROC: {grid_search.best_score_:.4f}")
Custom Scorer Banana
from sklearn.metrics import make_scorer
# Business-specific metric: Rs. mein loss minimize karo
def business_loss_scorer(y_true, y_pred, fn_cost=50000, fp_cost=500):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    total_cost = -(fn * fn_cost + fp * fp_cost)  # Negative kyunki GridSearch maximize karta hai
    return total_cost
custom_scorer = make_scorer(business_loss_scorer)
grid_search_custom = GridSearchCV(
RandomForestClassifier(class_weight='balanced', random_state=42),
{'n_estimators': [50, 100], 'max_depth': [3, 5]},
cv=5,
scoring=custom_scorer,
n_jobs=-1
)
grid_search_custom.fit(X_train, y_train)
print(f"Best Params for min. business loss: {grid_search_custom.best_params_}")
Production Monitoring — Concept
# Production mein model drift detect karna
def monitor_model_performance(y_true_prod, y_pred_prod, baseline_f1=0.80):
    current_f1 = f1_score(y_true_prod, y_pred_prod)
    print(f"Baseline F1: {baseline_f1:.4f}")
    print(f"Current F1: {current_f1:.4f}")
    print(f"Drift: {baseline_f1 - current_f1:.4f}")
    if baseline_f1 - current_f1 > 0.05:
        print("🚨 ALERT: Significant model drift detected! Retraining needed.")
    else:
        print("✅ Model performance stable.")
# Simulate production predictions
y_prod_true = np.random.choice([0, 1], size=500, p=[0.85, 0.15])
y_prod_pred = np.random.choice([0, 1], size=500, p=[0.85, 0.15])
monitor_model_performance(y_prod_true, y_prod_pred)
14. 🎯 Interview Questions {#interview-questions}
Basic Level
Q1. Accuracy reliable metric kyun nahi hai imbalanced datasets mein?
Agar 95% data class 0 ka hai aur model sab ko class 0 predict kare, accuracy 95% hogi — lekin model completely useless hai. Isliye Recall, F1, AUC use karte hain.
Q2. Precision aur Recall mein difference batao.
Precision = "Jo positive bola uska kitna sahi?" → TP/(TP+FP)
Recall = "Jo actually positive tha uska kitna pakda?" → TP/(TP+FN)
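Ek chhota, verify karne layak example (numbers mere illustrative assumptions hain):

```python
from sklearn.metrics import precision_score, recall_score

# 10 actual positives; model 8 ko positive bolta hai, jisme 6 sahi hain
y_true = [1]*10 + [0]*90
y_pred = [1]*6 + [0]*4 + [1]*2 + [0]*88  # 6 TP, 4 FN, 2 FP, 88 TN

print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 6/(6+2) = 0.75
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 6/(6+4) = 0.60
```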
Q3. F1-Score kab use karte hain?
Jab Precision aur Recall dono important hon, aur dataset imbalanced ho. Yeh dono ka harmonic mean hai.
Q4. Confusion Matrix ke 4 elements kya hain?
TP (sahi positive), TN (sahi negative), FP (galat positive / Type I Error), FN (galat negative / Type II Error)
Q5. ROC curve kya hai?
Alag-alag thresholds pe TPR vs FPR ka graph. AUC is curve ke neeche ka area hai — jitna zyada utna better.
Intermediate Level
Q6. Type I aur Type II Error mein difference?
Type I Error = False Positive (Innocent ko guilty bola)
Type II Error = False Negative (Guilty ko innocent bola)
Medical domain mein Type II zyada dangerous hota hai.
Q7. ROC-AUC vs PR-AUC — kab kya use karein?
ROC-AUC better hai jab negative class majority ho aur false positives count karna ho. PR-AUC better hai jab positive class minority ho (highly imbalanced) — fraud, rare disease detection.
Q8. Macro vs Weighted vs Micro average kya hai?
Macro: Simple average of per-class metrics
Weighted: Class size se weighted average
Micro: Global TP/FP/FN se calculate
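Teeno averages ka difference ek toy multi-class example pe dekha ja sakta hai (data mera assumption hai):

```python
from sklearn.metrics import f1_score

# 3-class toy data: class 0 majority (6), classes 1 aur 2 minority (2-2)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 2, 0]

for avg in ["macro", "weighted", "micro"]:
    print(f"{avg:>8}: {f1_score(y_true, y_pred, average=avg):.4f}")
```

Macro har class ko equal weight deta hai, isliye minority class ki galti usme sabse zyada dikhti hai.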
Q9. R² negative kyun ho sakta hai?
Jab model ek simple mean prediction se bhi worse perform kare. Usually severe underfitting ya wrong model type use karne pe hota hai.
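Isko do line mein verify kiya ja sakta hai: predictions jaanbujh ke ulte trend mein rakho (toy numbers mere hain):

```python
from sklearn.metrics import r2_score

y_true = [10, 20, 30, 40, 50]
y_pred = [50, 40, 30, 20, 10]  # ulta trend: mean prediction se bhi worse

print(f"R²: {r2_score(y_true, y_pred):.2f}")
# → R²: -3.00
```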
Q10. Log Loss kya penalize karta hai?
Confident aur galat predictions ko bahut zyada penalize karta hai. Isliye yeh probability calibration ke liye important hai.
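Demo: same galat prediction, lekin confidence badhao to penalty explode karti hai (numbers mere assumptions hain):

```python
from sklearn.metrics import log_loss

y_true = [1, 1]  # dono actually positive

# Unsure-galat (prob 0.4) vs confident-galat (prob 0.01)
print(f"{log_loss(y_true, [0.4, 0.4], labels=[0, 1]):.4f}")    # ≈ 0.92
print(f"{log_loss(y_true, [0.01, 0.01], labels=[0, 1]):.4f}")  # ≈ 4.61, penalty 5x
```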
Advanced Level
Q11. MCC (Matthews Correlation Coefficient) F1 se better kyun hai imbalanced data mein?
F1 sirf Positive class pe focus karta hai (TP, FP, FN) — TN ignore karta hai. MCC saari four values (TP, TN, FP, FN) use karta hai, isliye zyada balanced aur trustworthy hai.
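Quick demo: 90% positive data pe "sab positive" bolne wale model ka F1 high dikhta hai, lekin MCC zero nikalta hai (toy data mera assumption hai):

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# 90% positive data, model sabko positive bolta hai (zero skill)
y_true = [1]*90 + [0]*10
y_pred = [1]*100

print(f"F1:  {f1_score(y_true, y_pred):.4f}")             # ≈ 0.9474, high lagta hai
print(f"MCC: {matthews_corrcoef(y_true, y_pred):.4f}")    # 0.0, no skill expose
```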
Q12. Optimal threshold kaise decide karein?
Youden's J statistic: ROC curve se max(TPR - FPR)
PR curve ka F1-maximizing point
Business cost function minimize karke
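Youden's J ko code mein aise nikala ja sakta hai. Ye ek sketch hai, jisme synthetic data aur `LogisticRegression` meri assumptions hain:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, probs)
j = tpr - fpr                    # Youden's J har threshold pe
best = thresholds[np.argmax(j)]  # J-maximizing threshold
print(f"Optimal threshold (Youden's J): {best:.3f}")
```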
Q13. AUC-ROC ka physical interpretation kya hai?
Probability ki randomly chosen positive sample ka score randomly chosen negative sample se higher hoga. Ranking ability measure karta hai.
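Is interpretation ko directly verify kiya ja sakta hai: har (positive, negative) pair compare karo aur sklearn se match karo (toy scores mere assumptions hain):

```python
import numpy as np
from itertools import product
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# AUC = P(random positive ka score > random negative ka score); tie = 0.5 credit
pos = scores[y_true == 1]
neg = scores[y_true == 0]
manual_auc = np.mean([1.0 if p > n else 0.5 if p == n else 0.0
                      for p, n in product(pos, neg)])

print(f"Pairwise AUC: {manual_auc:.4f}")
print(f"sklearn AUC:  {roc_auc_score(y_true, scores):.4f}")  # same value
```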
Q14. Cross-validation mein konsa scoring metric use karein?
Business objective ke hisaab se:
Imbalanced → 'f1', 'roc_auc', 'balanced_accuracy'
Probability needed → 'roc_auc', 'neg_log_loss'
Regression → 'neg_root_mean_squared_error', 'r2'
Q15. Model calibration kya hota hai? Kab important hai?
Model ki predicted probabilities ki accuracy. Agar model 0.8 probability bolta hai, to 80% cases mein actually positive hona chahiye. Risk scoring, insurance, medical diagnosis mein critical hai.
Scenario-Based
Q16. Tumhare fraud detection model ki accuracy 99% hai. Client khush hai. Kya tum bhi khush ho?
Nahi! Dataset likely imbalanced hai. Shayad model sab ko "not fraud" bol raha ho. Recall check karo — kitne actual frauds pakde? AUC-ROC aur F1 check karo fraud class ke liye.
Q17. Cancer detection model: Precision 90%, Recall 60% hai. Theek hai?
Nahi — Recall 60% bahut low hai. 40% cancer patients miss ho rahe hain. Medical diagnosis mein Recall maximize karna chahiye, chahe Precision thodi kam ho jaaye.
Q18. Do models ka F1 same hai (0.82), kaise decide karein konsa better hai?
AUC-ROC compare karo (threshold-independent)
PR-AUC compare karo (imbalanced data mein)
Cross-validation scores ki stability dekho (mean ± std)
Statistical significance test karo
Business cost function pe evaluate karo
15. 🎓 Conclusion {#conclusion}
Bhai, aaj humne ML Evaluation Metrics ka poora safar kiya —
Key Learnings
| Topic | Key Takeaway |
|---|---|
| Accuracy | Sirf balanced datasets mein useful |
| Precision | "False alarms" minimize karne ke liye |
| Recall | "Miss kiye cases" minimize karne ke liye |
| F1-Score | Dono ka balance, imbalanced data ke liye go-to |
| AUC-ROC | Threshold-independent, ranking ability |
| PR-AUC | Highly imbalanced data ka AUC |
| MCC | Imbalanced data ka best single metric |
| Log Loss | Probability quality check |
| MAE/RMSE | Regression evaluation |
| R² | Model fit quality for regression |
Quick Decision Guide
Mera dataset balanced hai?
├── YES → Accuracy + F1-Score
└── NO → MCC + Balanced Accuracy + AUC-ROC
FN miss karna costly hai?
├── YES → Recall maximize karo (Cancer, Fraud)
└── NO → Precision focus karo (Spam filter)
Model probability output karta hai?
├── YES → AUC-ROC + Log Loss + Calibration
└── NO → Threshold-based: F1, Precision, Recall
Regression problem hai?
├── Outliers hain? → Huber Loss / MAE
├── Large errors costly? → RMSE
└── Model comparison? → R² / Adjusted R²
Final Advice
Ek metric pe mat ruko — hamesha 3-4 metrics dekho
Business impact se align karo — Rs. mein loss calculate karo
Threshold tune karo — default 0.5 hamesha best nahi hota
Cross-validation use karo — single train-test split misleading hota hai
Production mein monitor karo — model drift real hai
"The goal is not to have the most accurate model, but to have the most useful one."
Happy Learning! Agar koi doubt ho toh comment section mein poochho!
SQLDataDev— Making Data Science Simple for Indian Developers 🇮🇳