
Categorical Data Handling in Machine Learning (Pandas + Sklearn) – Complete Practical Guide

Learn categorical data encoding end-to-end — Label, Ordinal, One-Hot, Target, Binary, Frequency encoding with Pandas & Sklearn. Beginner to advanced.

"Garbage in, garbage out." ML models understand only numbers: if your categorical data is not encoded correctly, even the most powerful model will produce garbage output.


1. Introduction

What is Categorical Data?

Suppose you have an HR dataset with columns like Department, City, Gender, Education_Level, Job_Grade. These are all text values: "IT", "HR", "Mumbai", "Male", "Bachelor's".

Machine learning algorithms (Linear Regression, Random Forest, XGBoost) work only with numbers, so these text values must be converted into numbers. That conversion is what categorical data handling means.
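
To see the problem concretely, try fitting an estimator directly on raw strings. A minimal sketch with made-up toy data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'City': ['Mumbai', 'Delhi', 'Pune'], 'Salary': [70, 50, 60]})

# Fitting directly on a text column fails: sklearn expects numeric arrays
error = None
try:
    LinearRegression().fit(df[['City']], df['Salary'])
except ValueError as e:
    error = e
print("Fit failed:", error)
```

Every encoding technique in this guide exists to get past exactly this error.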

Why does it matter?

Problem                              Impact
──────────────────────────────────────────────────────────────────
Wrong encoding                       The model learns wrong patterns
High cardinality ignored             Memory explosion + overfitting
Data leakage (in Target Encoding)    The model fails in the real world
Ordinal data treated as nominal      Order relationships are destroyed

Real-World Usage

  • E-commerce: Product category, Brand, Color → price prediction

  • Banking: Loan type, Branch, Customer segment → default prediction

  • HR Analytics: Department, Grade, City → attrition prediction

  • Healthcare: Blood group, Disease type, Ward → readmission risk

  • Credit Scoring: Employment type, Education → creditworthiness


2. Basic Concepts

Types of Categorical Data

Text
Categorical Data
│
├── Nominal (No Order)
│   ├── Gender: Male, Female
│   ├── City: Mumbai, Delhi, Pune
│   └── Color: Red, Blue, Green
│
└── Ordinal (Order Matters)
    ├── Education: School < Graduate < Post-Graduate
    ├── Job Grade: Junior < Mid < Senior < Lead
    └── Rating: Poor < Average < Good < Excellent
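
This distinction maps directly onto pandas' own Categorical dtype. A quick sketch (values are made up for illustration):

```python
import pandas as pd

# Ordinal: declare an explicit order, then comparisons become meaningful
edu = pd.Categorical(
    ['Graduate', 'School', 'Post-Graduate'],
    categories=['School', 'Graduate', 'Post-Graduate'],
    ordered=True
)
print(edu.codes)         # codes follow the declared order: [1 0 2]
print(edu > 'School')    # elementwise order comparison: [ True False  True]

# Nominal: no order declared; codes default to alphabetical categories
city = pd.Categorical(['Mumbai', 'Delhi', 'Pune'])
print(city.codes)        # Delhi=0, Mumbai=1, Pune=2
```

Declaring `ordered=True` up front also documents the intended order for anyone encoding the column later.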

What is Cardinality?

Cardinality = the number of unique values in a column.

Python
import pandas as pd

df = pd.read_csv('employee.csv')

# Check the cardinality of each object column
for col in df.select_dtypes(include='object').columns:
    print(f"{col}: {df[col].nunique()} unique values")
Text
Department    :   8   → Low cardinality  ✅ (OHE safe)
City          :  42   → Medium cardinality ⚠️
Employee_ID   : 5000  → High cardinality  ❌ (avoid OHE)

Overview of Encoding Techniques

Text
Encoding Techniques
│
├── Classical
│   ├── Label Encoding        → for ordinal data
│   ├── One-Hot Encoding      → Nominal, low cardinality
│   └── Ordinal Encoding      → Ordered categories
│
├── Statistical
│   ├── Target Encoding       → High cardinality + supervised
│   ├── Frequency Encoding    → High cardinality + unsupervised
│   └── Binary Encoding       → Medium-high cardinality
│
└── Advanced
    ├── Helmert Encoding
    ├── Sum Encoding
    ├── WOE Encoding          → Binary classification
    └── Embedding (Deep Learning)

3. All Encoding Functions & Features

3.1 Label Encoding — Pandas .map() / Sklearn LabelEncoder

When to use: with tree-based models (Random Forest, XGBoost), when order does not matter and cardinality is medium to high.

When to avoid: with linear models, which would assume a false order (3 > 2 > 1).

Python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'City': ['Mumbai', 'Delhi', 'Pune', 'Mumbai', 'Delhi']
})

# Method 1: Pandas .map()
city_map = {'Mumbai': 0, 'Delhi': 1, 'Pune': 2}
df['City_Label'] = df['City'].map(city_map)

# Method 2: Sklearn LabelEncoder
le = LabelEncoder()
df['City_LE'] = le.fit_transform(df['City'])

print(df)
print("Classes:", le.classes_)  # ['Delhi', 'Mumbai', 'Pune']

# Inverse transform: map numbers back to labels
print(le.inverse_transform([0, 1, 2]))  # ['Delhi', 'Mumbai', 'Pune']

Output:

Text
     City  City_Label  City_LE
0  Mumbai           0        1
1   Delhi           1        0
2    Pune           2        2
3  Mumbai           0        1
4   Delhi           1        0

⚠️ Note: LabelEncoder sorts classes alphabetically, which is why Mumbai=1 and Delhi=0. If you need a custom order, use .map() instead.
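
The "false order" problem from the warning above can be seen directly: fit a linear model on label codes where the true group means are not monotone (toy numbers, purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Labels: Delhi=0, Mumbai=1, Pune=2; true group means are 105, 15, 100
X = np.array([[0], [0], [1], [1], [2], [2]])
y = np.array([100, 110, 10, 20, 95, 105])

model = LinearRegression().fit(X, y)
preds = model.predict(np.array([[0], [1], [2]]))
print(preds)
# A single slope can only produce a monotone trend across the codes,
# so the model cannot recover the high-low-high pattern in the data
```

A tree-based model, by contrast, can split on each code separately, which is why label encoding is tolerable there.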


3.2 One-Hot Encoding (OHE) — Pandas get_dummies() / Sklearn OneHotEncoder

When to use: nominal data (no order), low cardinality (< 10-15 unique values), linear models.

Dummy Variable Trap: for K categories keep only K-1 columns (drop_first=True).

Python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import numpy as np

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'Finance', 'IT', 'HR'],
    'Salary': [70000, 50000, 60000, 80000, 55000]
})

# ─── Method 1: Pandas get_dummies ───────────────────────────
ohe_pandas = pd.get_dummies(
    df,
    columns=['Department'],     # Columns to encode
    drop_first=True,            # Avoid the dummy trap (K-1 columns)
    prefix='Dept',              # Column name prefix
    dtype=int                   # 0/1 as integer (not bool)
)
print("Pandas OHE:\n", ohe_pandas)

# ─── Method 2: Sklearn OneHotEncoder ───────────────────────
enc = OneHotEncoder(
    drop='first',               # Avoid the dummy trap
    sparse_output=False,        # Return a dense array
    handle_unknown='ignore'     # Unseen categories become all zeros
)

dept_encoded = enc.fit_transform(df[['Department']])

# Get the transformed feature names
feature_names = enc.get_feature_names_out(['Department'])
ohe_df = pd.DataFrame(dept_encoded, columns=feature_names)

print("\nSklearn OHE:\n", ohe_df)
print("Categories:", enc.categories_)

Output:

Text
Pandas OHE:
   Salary  Dept_HR  Dept_IT
0   70000        0        1
1   50000        1        0
2   60000        0        0
3   80000        0        1
4   55000        1        0
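
Why drop_first matters: without it, the K dummy columns always sum to 1 for every row, making them perfectly collinear with a linear model's intercept. A quick check:

```python
import pandas as pd

df = pd.DataFrame({'Department': ['IT', 'HR', 'Finance', 'IT', 'HR']})

full = pd.get_dummies(df['Department'], dtype=int)                    # all K columns
reduced = pd.get_dummies(df['Department'], dtype=int, drop_first=True)  # K-1 columns

print(full.sum(axis=1).tolist())   # every row sums to 1 -> perfect collinearity
print(reduced.columns.tolist())    # the first (alphabetical) category is dropped
```

The dropped category becomes the baseline: its effect is absorbed into the intercept, and the remaining coefficients are interpreted relative to it.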

3.3 Ordinal Encoding — Sklearn OrdinalEncoder

When to use: when the categories have a natural order (Education, Job Grade, Rating).

Python
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

df = pd.DataFrame({
    'Education': ['School', 'Graduate', 'Post-Graduate', 'School', 'Graduate'],
    'Job_Grade': ['Junior', 'Senior', 'Lead', 'Mid', 'Senior']
})

# Define the custom order: this is the single most important step
enc = OrdinalEncoder(
    categories=[
        ['School', 'Graduate', 'Post-Graduate'],  # Education order
        ['Junior', 'Mid', 'Senior', 'Lead']        # Job Grade order
    ]
)

df[['Education_Enc', 'Grade_Enc']] = enc.fit_transform(
    df[['Education', 'Job_Grade']]
)

print(df)

Output:

Text
       Education Job_Grade  Education_Enc  Grade_Enc
0         School    Junior            0.0        0.0
1       Graduate    Senior            1.0        2.0
2  Post-Graduate      Lead            2.0        3.0
3         School       Mid            0.0        1.0
4       Graduate    Senior            1.0        2.0

✅ Now the model understands: Post-Graduate (2) > Graduate (1) > School (0), exactly as intended.


3.4 Target Encoding (Mean Encoding)

When to use: high-cardinality columns in supervised problems (binary classification / regression).

Risk: data leakage! Target values from the training data can leak into the encoded feature.

Solution: cross-validated fold-based encoding, or smoothing.

Python
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'City': ['Mumbai', 'Delhi', 'Pune', 'Mumbai', 'Delhi',
             'Pune', 'Mumbai', 'Delhi', 'Pune', 'Mumbai'],
    'Purchased': [1, 0, 1, 1, 0, 0, 1, 1, 0, 1]
})

# ─── Simple Target Encoding (Leakage risk!) ─────────────────
target_mean = df.groupby('City')['Purchased'].mean()
df['City_TE_Simple'] = df['City'].map(target_mean)

print("Simple TE (with leakage risk):\n", df[['City', 'Purchased', 'City_TE_Simple']])

# ─── Smoothed Target Encoding (Production-safe) ─────────────
def smooth_target_encode(df, col, target, alpha=10):
    """
    alpha = smoothing parameter
    High alpha = lean more on the global mean (safer for small groups)
    Low alpha  = lean more on the group mean
    """
    global_mean = df[target].mean()
    group_stats = df.groupby(col)[target].agg(['mean', 'count'])
    
    # Smoothing formula: (n * group_mean + alpha * global_mean) / (n + alpha)
    smoothed = (
        (group_stats['count'] * group_stats['mean'] + alpha * global_mean)
        / (group_stats['count'] + alpha)
    )
    return df[col].map(smoothed)

df['City_TE_Smooth'] = smooth_target_encode(df, 'City', 'Purchased', alpha=5)
print("\nSmoothed TE:\n", df[['City', 'Purchased', 'City_TE_Smooth']])

# ─── Cross-Validated Target Encoding (Best Practice) ────────
def cv_target_encode(df, col, target, n_splits=5):
    """
    KFold: learn the category means on the training folds, then apply
    them to the held-out fold. This eliminates leakage completely.
    """
    df = df.copy()
    df['encoded'] = np.nan
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    
    for train_idx, val_idx in kf.split(df):
        train_df = df.iloc[train_idx]
        mean_map = train_df.groupby(col)[target].mean()
        df.loc[df.index[val_idx], 'encoded'] = df.iloc[val_idx][col].map(mean_map)
    
    # Fill NaN with the global mean (covers unseen categories)
    df['encoded'] = df['encoded'].fillna(df[target].mean())
    return df['encoded']

df['City_TE_CV'] = cv_target_encode(df, 'City', 'Purchased')
print("\nCV Target Encoding:\n", df[['City', 'Purchased', 'City_TE_CV']])

3.5 Frequency / Count Encoding

When to use: high cardinality, unsupervised scenarios, tree-based models.

Logic: rare categories receive a low frequency, so the model automatically treats them as less important.

Python
import pandas as pd

df = pd.DataFrame({
    'City': ['Mumbai', 'Delhi', 'Pune', 'Mumbai', 'Mumbai',
             'Delhi', 'Bangalore', 'Pune', 'Mumbai', 'Delhi']
})

# Frequency Encoding
freq_map = df['City'].value_counts(normalize=True)  # Proportion
count_map = df['City'].value_counts()                # Raw count

df['City_Freq'] = df['City'].map(freq_map)
df['City_Count'] = df['City'].map(count_map)

print(df)

Output:

Text
        City  City_Freq  City_Count
0     Mumbai        0.4           4
1      Delhi        0.3           3
2       Pune        0.2           2
3     Mumbai        0.4           4
4     Mumbai        0.4           4
5      Delhi        0.3           3
6  Bangalore        0.1           1
7       Pune        0.2           2
8     Mumbai        0.4           4
9      Delhi        0.3           3

3.6 Binary Encoding — category_encoders library

When to use: medium-to-high cardinality (15-100 unique values). Fewer columns than OHE, more information than Label Encoding.

Logic: label-encode, convert each label to binary, and each bit becomes its own column.

Python
# pip install category-encoders
import category_encoders as ce
import pandas as pd

df = pd.DataFrame({
    'City': ['Mumbai', 'Delhi', 'Pune', 'Bangalore', 'Chennai',
             'Kolkata', 'Hyderabad', 'Jaipur', 'Lucknow', 'Mumbai']
})

# Binary Encoding
enc = ce.BinaryEncoder(cols=['City'])
df_encoded = enc.fit_transform(df)
print(df_encoded)

# 10 cities → only 4 binary columns (ceil(log2(10)) = 4)
# vs OHE → 9-10 columns

Column comparison:

Text
Cities: 10 unique values

OHE    → 9-10 columns
Binary → 4 columns   ← Winner for high cardinality
Label  → 1 column (but false order)
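
The bit expansion behind this table can be hand-rolled in a few lines. This is a sketch of the idea only; category_encoders' actual implementation differs in details such as 1-based ordinals and unknown-value handling:

```python
# 10 categories need ceil(log2(10)) = 4 bits
cities = [f'City_{i}' for i in range(10)]
n_bits = (len(cities) - 1).bit_length()   # bits needed for the largest label

for label, city in enumerate(cities):
    bits = format(label, f'0{n_bits}b')   # label as fixed-width binary
    print(f'{city}: label={label} -> bits={bits}')
```

Each bit position then becomes one output column, which is how 10 categories fit into 4 columns.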

3.7 Helmert & Sum Encoding (Statistical)

When to use: statistical models, ANOVA-style analysis, when you need comparisons against a baseline.

Python
import category_encoders as ce
import pandas as pd

df = pd.DataFrame({
    'Grade': ['A', 'B', 'C', 'A', 'B', 'C', 'A'],
    'Score': [90, 75, 60, 85, 70, 55, 95]
})

# Helmert: compare each category with the mean of the previous categories
helmert_enc = ce.HelmertEncoder(cols=['Grade'])
df_helmert = helmert_enc.fit_transform(df[['Grade']])
print("Helmert Encoding:\n", df_helmert)

# Sum Encoding: compare each category with the grand mean
sum_enc = ce.SumEncoder(cols=['Grade'])
df_sum = sum_enc.fit_transform(df[['Grade']])
print("\nSum Encoding:\n", df_sum)

3.8 Weight of Evidence (WOE) Encoding

When to use: binary classification, especially credit scoring and fraud detection.

Python
import category_encoders as ce
import pandas as pd

df = pd.DataFrame({
    'Employment_Type': ['Salaried', 'Self-Employed', 'Salaried', 'Unemployed',
                        'Salaried', 'Self-Employed', 'Unemployed', 'Salaried'],
    'Default': [0, 1, 0, 1, 0, 1, 1, 0]  # 1 = defaulted
})

woe_enc = ce.WOEEncoder(cols=['Employment_Type'])
df_woe = woe_enc.fit_transform(df[['Employment_Type']], df['Default'])
print(df_woe)
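
The formula behind WOE is not shown above. Under the classic credit-scoring convention it is WOE = ln(%non-events / %events) per category; sign conventions and smoothing vary between libraries, so this hand computation (with a small additive smoothing to avoid log(0)) is a sketch of the idea rather than a reproduction of WOEEncoder's exact output:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Employment_Type': ['Salaried', 'Self-Employed', 'Salaried', 'Unemployed',
                        'Salaried', 'Self-Employed', 'Unemployed', 'Salaried'],
    'Default': [0, 1, 0, 1, 0, 1, 1, 0]
})

stats = df.groupby('Employment_Type')['Default'].agg(['sum', 'count'])
events = stats['sum'] + 0.5                        # defaults, smoothed
non_events = stats['count'] - stats['sum'] + 0.5   # non-defaults, smoothed

woe = np.log((non_events / non_events.sum()) / (events / events.sum()))
print(woe)
# Positive WOE = safer than average (Salaried); negative = riskier
```

Because WOE is on a log-odds scale, it pairs naturally with logistic regression, which is why it is the industry standard in credit scorecards.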

4. Intermediate Usage

4.1 Mixed Dataset Pipeline — Multiple Columns, Different Encodings

In real production work a single dataset contains columns of different types, and they all have to be handled together.

Python
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    OneHotEncoder, OrdinalEncoder, StandardScaler, LabelEncoder
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# ─── Sample HR Dataset ───────────────────────────────────────
np.random.seed(42)
n = 500

df = pd.DataFrame({
    'Age': np.random.randint(22, 60, n),
    'Salary': np.random.randint(30000, 150000, n),
    'Department': np.random.choice(['IT', 'HR', 'Finance', 'Sales'], n),
    'City': np.random.choice(['Mumbai', 'Delhi', 'Pune', 'Bangalore'], n),
    'Education': np.random.choice(['School', 'Graduate', 'Post-Graduate'], n),
    'Job_Grade': np.random.choice(['Junior', 'Mid', 'Senior', 'Lead'], n),
    'Attrition': np.random.choice([0, 1], n, p=[0.7, 0.3])
})

print("Dataset shape:", df.shape)
print(df.dtypes)

# ─── Column Categories ───────────────────────────────────────
numeric_features = ['Age', 'Salary']

# OHE — nominal, low cardinality
nominal_features = ['Department', 'City']

# Ordinal — ordered categories
ordinal_features = ['Education', 'Job_Grade']
education_order = ['School', 'Graduate', 'Post-Graduate']
grade_order = ['Junior', 'Mid', 'Senior', 'Lead']

# ─── ColumnTransformer: everything in one place ────────────
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    
    ('ohe', OneHotEncoder(
        drop='first',
        handle_unknown='ignore',
        sparse_output=False
    ), nominal_features),
    
    ('ordinal', OrdinalEncoder(
        categories=[education_order, grade_order]
    ), ordinal_features)
])

# ─── Full ML Pipeline ────────────────────────────────────────
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train-Test Split
X = df.drop('Attrition', axis=1)
y = df['Attrition']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the pipeline
pipeline.fit(X_train, y_train)

# Evaluate
from sklearn.metrics import classification_report
y_pred = pipeline.predict(X_test)
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Feature names after transformation
ohe_features = pipeline.named_steps['preprocessor']\
    .named_transformers_['ohe'].get_feature_names_out(nominal_features)
all_features = (
    numeric_features +
    list(ohe_features) +
    ordinal_features
)
print("\nAll transformed features:", all_features)

4.2 High-Cardinality Handling: a City column with 100+ unique values

Python
import pandas as pd
import numpy as np

np.random.seed(42)
cities = ['Mumbai', 'Delhi', 'Pune'] + [f'City_{i}' for i in range(100)]
n = 1000

df = pd.DataFrame({
    'City': np.random.choice(cities, n, p=[0.3, 0.2, 0.1] + [0.004]*100),
    'Revenue': np.random.randint(1000, 50000, n)
})

print(f"Unique cities: {df['City'].nunique()}")

# Strategy 1: Top-N + "Other" grouping
def topN_encode(series, top_n=10, other_label='Other'):
    top_categories = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top_categories), other=other_label)

df['City_TopN'] = topN_encode(df['City'], top_n=5)
print("\nAfter TopN grouping:\n", df['City_TopN'].value_counts())

# Strategy 2: Frequency threshold
def freq_threshold_encode(series, min_freq=10):
    counts = series.value_counts()
    rare = counts[counts < min_freq].index
    return series.replace(rare, 'Rare')

df['City_FreqThresh'] = freq_threshold_encode(df['City'], min_freq=30)
print("\nAfter Frequency Threshold:\n", df['City_FreqThresh'].value_counts())

# Strategy 3: Binary Encoding for remaining
# import category_encoders as ce
# enc = ce.BinaryEncoder(cols=['City_FreqThresh'])
# df_encoded = enc.fit_transform(df)

4.3 Handling Missing Values in Categorical Columns

Python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    'Department': ['IT', 'HR', np.nan, 'Finance', np.nan, 'IT'],
    'City': ['Mumbai', np.nan, 'Delhi', 'Pune', 'Mumbai', np.nan],
    'Salary': [70000, 50000, 60000, 80000, 55000, 75000]
})

print("Missing values:\n", df.isnull().sum())

# ─── Strategy 1: Mode Imputation ───────────────────────────
imputer = SimpleImputer(strategy='most_frequent')
df[['Department', 'City']] = imputer.fit_transform(df[['Department', 'City']])
print("\nAfter mode imputation:\n", df)

# ─── Strategy 2: Constant Fill ─────────────────────────────
df2 = pd.DataFrame({
    'Department': ['IT', 'HR', np.nan, 'Finance', np.nan, 'IT']
})
df2['Department'] = df2['Department'].fillna('Unknown')
print("\nConstant fill:\n", df2)

# ─── Pipeline: Impute → Encode ──────────────────────────────
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False))
])
# Use this pipeline directly inside a ColumnTransformer

5. Advanced Concepts

5.1 Target Encoding with Cross-Validation using sklearn v1.3+ TargetEncoder

Sklearn 1.3+ ships a built-in TargetEncoder that handles cross-validation automatically.

Python
# sklearn >= 1.3 required
from sklearn.preprocessing import TargetEncoder
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np

np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'City': np.random.choice(['Mumbai', 'Delhi', 'Pune', 'Bangalore',
                               'Chennai', 'Kolkata', 'Hyderabad'], n),
    'Department': np.random.choice(['IT', 'HR', 'Finance', 'Sales', 'Legal'], n),
    'Age': np.random.randint(22, 55, n),
    'Target': np.random.choice([0, 1], n, p=[0.6, 0.4])
})

X = df[['City', 'Department', 'Age']]
y = df['Target']

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

preprocessor = ColumnTransformer([
    ('target_enc', TargetEncoder(
        target_type='binary',     # binary classification
        smooth='auto',            # auto smoothing
        cv=5                      # 5-fold CV internally
    ), ['City', 'Department']),
    ('scaler', StandardScaler(), ['Age'])
])

pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', GradientBoostingClassifier())
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
print(f"ROC-AUC: {scores.mean():.4f} ± {scores.std():.4f}")

5.2 Custom Encoding Transformer (Production-Grade)

Custom transformers banao jo Pipeline ke saath seamlessly kaam kare.

Python
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """
    Custom frequency encoder that works inside an sklearn Pipeline.
    fit() learns the frequency map.
    transform() applies it.
    """
    
    def __init__(self, cols=None, normalize=True):
        self.cols = cols
        self.normalize = normalize
        self.freq_maps_ = {}
    
    def fit(self, X, y=None):
        X = pd.DataFrame(X) if not isinstance(X, pd.DataFrame) else X.copy()
        cols = self.cols or X.select_dtypes(include='object').columns.tolist()
        
        for col in cols:
            self.freq_maps_[col] = X[col].value_counts(
                normalize=self.normalize
            ).to_dict()
        return self
    
    def transform(self, X, y=None):
        X = pd.DataFrame(X) if not isinstance(X, pd.DataFrame) else X.copy()
        
        for col, freq_map in self.freq_maps_.items():
            # Map unknown categories to 0
            X[col] = X[col].map(freq_map).fillna(0)
        
        return X
    
    def get_feature_names_out(self, input_features=None):
        return self.cols or list(self.freq_maps_.keys())


class RareCategoryGrouper(BaseEstimator, TransformerMixin):
    """
    Group rare categories into a 'Rare' bucket based on a frequency threshold.
    """
    
    def __init__(self, cols=None, min_freq=0.01, rare_label='Rare'):
        self.cols = cols
        self.min_freq = min_freq
        self.rare_label = rare_label
        self.frequent_cats_ = {}
    
    def fit(self, X, y=None):
        X = pd.DataFrame(X) if not isinstance(X, pd.DataFrame) else X.copy()
        cols = self.cols or X.select_dtypes(include='object').columns.tolist()
        
        for col in cols:
            freq = X[col].value_counts(normalize=True)
            self.frequent_cats_[col] = freq[freq >= self.min_freq].index.tolist()
        return self
    
    def transform(self, X, y=None):
        X = pd.DataFrame(X) if not isinstance(X, pd.DataFrame) else X.copy()
        
        for col, frequent in self.frequent_cats_.items():
            X[col] = X[col].where(X[col].isin(frequent), other=self.rare_label)
        
        return X


# ─── Usage in Pipeline ──────────────────────────────────────
np.random.seed(42)
n = 500
df = pd.DataFrame({
    'City': np.random.choice(
        ['Mumbai'] * 200 + ['Delhi'] * 150 + ['Pune'] * 100 +
        [f'City_{i}' for i in range(50)],
        n
    ),
    'Age': np.random.randint(22, 55, n),
    'Target': np.random.choice([0, 1], n)
})

X = df[['City', 'Age']]
y = df['Target']

# Pipeline with custom transformers
custom_pipeline = Pipeline([
    ('rare_grouper', RareCategoryGrouper(cols=['City'], min_freq=0.05)),
    ('freq_encoder', FrequencyEncoder(cols=['City'])),
])

X_transformed = custom_pipeline.fit_transform(X)
print("Transformed:\n", X_transformed.head(10))

5.3 Encoding with category_encoders — Full Power Unlock

Python
import category_encoders as ce
import pandas as pd
import numpy as np

np.random.seed(42)
n = 300
df = pd.DataFrame({
    'City': np.random.choice(['Mumbai', 'Delhi', 'Pune', 'Bangalore', 'Chennai'], n),
    'Employment': np.random.choice(['Salaried', 'Self-Employed', 'Unemployed'], n),
    'Grade': np.random.choice(['A', 'B', 'C', 'D'], n),
    'Default': np.random.choice([0, 1], n, p=[0.7, 0.3])
})

X = df[['City', 'Employment', 'Grade']]
y = df['Default']

# 1. James-Stein Encoder (Advanced Target Encoding)
js_enc = ce.JamesSteinEncoder(cols=['City', 'Employment'])
X_js = js_enc.fit_transform(X, y)
print("James-Stein:\n", X_js.head())

# 2. CatBoost Encoder (No leakage, ordered target encoding)
cb_enc = ce.CatBoostEncoder(cols=['City', 'Employment'])
X_cb = cb_enc.fit_transform(X, y)
print("\nCatBoost Enc:\n", X_cb.head())

# 3. Leave-One-Out Encoding
loo_enc = ce.LeaveOneOutEncoder(cols=['City'])
X_loo = loo_enc.fit_transform(X, y)
print("\nLeave-One-Out:\n", X_loo.head())

5.4 Encoding Persistence: Save/Load for Production Deployment

Python
import joblib
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd

# Fit on the training data
preprocessor = ColumnTransformer([
    ('ohe', OneHotEncoder(drop='first', handle_unknown='ignore'), ['Department']),
    ('ordinal', OrdinalEncoder(categories=[['Junior', 'Mid', 'Senior', 'Lead']]), ['Grade'])
])

train_data = pd.DataFrame({
    'Department': ['IT', 'HR', 'Finance'],
    'Grade': ['Junior', 'Senior', 'Lead']
})

preprocessor.fit(train_data)

# ─── Save ──────────────────────────────────────────────────
joblib.dump(preprocessor, 'preprocessor.joblib')
print("Saved!")

# ─── Load (production / inference time) ────────────────────
loaded_preprocessor = joblib.load('preprocessor.joblib')

new_data = pd.DataFrame({
    'Department': ['Sales', 'IT'],  # 'Sales' is unseen!
    'Grade': ['Mid', 'Senior']
})

result = loaded_preprocessor.transform(new_data)
print("Inference result:\n", result)
# 'Sales' → handle_unknown='ignore' → all zeros in OHE

6. Real-World Use Cases

6.1 E-Commerce Product Price Prediction

Python
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import category_encoders as ce

# Sample e-commerce dataset
np.random.seed(42)
n = 2000

df = pd.DataFrame({
    'Brand': np.random.choice(
        ['Samsung', 'Apple', 'OnePlus', 'Xiaomi', 'Realme'] +
        [f'Brand_{i}' for i in range(30)], n   # High cardinality brand
    ),
    'Category': np.random.choice(['Mobile', 'Laptop', 'Tablet', 'Earphone'], n),
    'Condition': np.random.choice(['New', 'Like New', 'Good', 'Fair'], n),
    'Storage_GB': np.random.choice([32, 64, 128, 256, 512], n),
    'RAM_GB': np.random.choice([4, 6, 8, 12, 16], n),
    'Price': np.random.randint(5000, 150000, n)
})

X = df.drop('Price', axis=1)
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessor

# Brand has high cardinality → Target Encoding
# Category has low cardinality → OHE
# Condition is ordinal → OrdinalEncoder

condition_order = ['Fair', 'Good', 'Like New', 'New']

preprocessor = ColumnTransformer([
    ('ohe', OneHotEncoder(drop='first', handle_unknown='ignore'), ['Category']),
    ('ordinal', OrdinalEncoder(categories=[condition_order]), ['Condition']),
    ('num', StandardScaler(), ['Storage_GB', 'RAM_GB']),
    # For Brand: group rare values first, then frequency-encode
], remainder='drop')

# Handle Brand separately with a custom transformer in the pipeline
from sklearn.base import BaseEstimator, TransformerMixin

class TopNThenOHE(BaseEstimator, TransformerMixin):
    def __init__(self, col, top_n=5):
        self.col = col
        self.top_n = top_n
        self.top_cats_ = None
        self.ohe_ = None
    
    def fit(self, X, y=None):
        X = pd.DataFrame(X) if not isinstance(X, pd.DataFrame) else X
        self.top_cats_ = X[self.col].value_counts().nlargest(self.top_n).index
        return self
    
    def transform(self, X, y=None):
        X = pd.DataFrame(X) if not isinstance(X, pd.DataFrame) else X.copy()
        X[self.col] = X[self.col].where(X[self.col].isin(self.top_cats_), 'Other')
        return pd.get_dummies(X[[self.col]], prefix=self.col).astype(int)

print("E-commerce pipeline ready!")
print(f"Training samples: {len(X_train)}, Test samples: {len(X_test)}")

6.2 Credit Risk Scoring — WOE + Logistic Regression

Python
import pandas as pd
import numpy as np
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

np.random.seed(42)
n = 1000

df = pd.DataFrame({
    'Employment_Type': np.random.choice(
        ['Salaried', 'Self-Employed', 'Unemployed', 'Retired'], n,
        p=[0.5, 0.3, 0.15, 0.05]
    ),
    'Loan_Purpose': np.random.choice(
        ['Home', 'Education', 'Vehicle', 'Personal', 'Business'], n
    ),
    'Credit_History': np.random.choice(['Excellent', 'Good', 'Fair', 'Poor'], n),
    'Annual_Income': np.random.randint(200000, 2000000, n),
    'Default': np.random.choice([0, 1], n, p=[0.75, 0.25])
})

X = df.drop('Default', axis=1)
y = df['Default']

# WOE Encoding for credit scoring (industry standard)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

woe_cols = ['Employment_Type', 'Loan_Purpose', 'Credit_History']
num_cols = ['Annual_Income']

woe_pipeline = Pipeline([
    ('woe', ce.WOEEncoder(cols=woe_cols, regularization=1.0)),
    ('clf', LogisticRegression(random_state=42))
])

scores = cross_val_score(woe_pipeline, X, y, cv=5, scoring='roc_auc')
print(f"Credit Scoring ROC-AUC: {scores.mean():.4f} ± {scores.std():.4f}")

7. Practical Examples

Complete End-to-End Example: Telecom Churn Prediction

Python
"""
Telecom Customer Churn Prediction
- The dataset contains a mix of nominal, ordinal, and numeric columns
- Complete pipeline: EDA → Encoding → Model → Evaluation
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    OneHotEncoder, OrdinalEncoder, StandardScaler
)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import (
    classification_report, roc_auc_score,
    confusion_matrix, ConfusionMatrixDisplay
)
import warnings
warnings.filterwarnings('ignore')

# ═══════════════════════════════════════════════════════════
# STEP 1: Create the dataset
# ═══════════════════════════════════════════════════════════
np.random.seed(42)
n = 2000

df = pd.DataFrame({
    # Nominal features
    'Gender': np.random.choice(['Male', 'Female'], n),
    'Contract': np.random.choice(['Month-to-Month', 'One Year', 'Two Year'], n,
                                  p=[0.55, 0.25, 0.20]),
    'InternetService': np.random.choice(['DSL', 'Fiber', 'None'], n,
                                         p=[0.35, 0.45, 0.20]),
    'PaymentMethod': np.random.choice(
        ['Credit Card', 'Bank Transfer', 'Cheque', 'Electronic Check'], n
    ),
    
    # Ordinal features
    'SatisfactionScore': np.random.choice(
        ['Very Low', 'Low', 'Medium', 'High', 'Very High'], n
    ),
    'TechSupportQuality': np.random.choice(['Poor', 'Average', 'Good', 'Excellent'], n),
    
    # Numeric features
    'Tenure_Months': np.random.randint(1, 72, n),
    'Monthly_Charges': np.random.uniform(20, 120, n).round(2),
    'Total_Charges': np.random.uniform(100, 8000, n).round(2),
    'Num_Services': np.random.randint(1, 8, n),
    
    # Target
    'Churned': np.random.choice([0, 1], n, p=[0.73, 0.27])
})

print("=" * 50)
print("DATASET OVERVIEW")
print("=" * 50)
print(f"Shape: {df.shape}")
print(f"\nData Types:\n{df.dtypes}")
print(f"\nMissing Values:\n{df.isnull().sum()}")
print(f"\nChurn Rate: {df['Churned'].mean():.2%}")

# ═══════════════════════════════════════════════════════════
# STEP 2: EDA — Cardinality Check
# ═══════════════════════════════════════════════════════════
print("\n" + "=" * 50)
print("CARDINALITY ANALYSIS")
print("=" * 50)
cat_cols = df.select_dtypes(include='object').columns
for col in cat_cols:
    print(f"{col:25s}: {df[col].nunique():3d} unique | {df[col].value_counts().index.tolist()}")

# ═══════════════════════════════════════════════════════════
# STEP 3: Feature Engineering
# ═══════════════════════════════════════════════════════════

# Classify the columns
nominal_cols = ['Gender', 'Contract', 'InternetService', 'PaymentMethod']
ordinal_cols = ['SatisfactionScore', 'TechSupportQuality']
numeric_cols = ['Tenure_Months', 'Monthly_Charges', 'Total_Charges', 'Num_Services']

# Define the ordinal orders
satisfaction_order = ['Very Low', 'Low', 'Medium', 'High', 'Very High']
support_order = ['Poor', 'Average', 'Good', 'Excellent']

# ═══════════════════════════════════════════════════════════
# STEP 4: Build the Preprocessor
# ═══════════════════════════════════════════════════════════
preprocessor = ColumnTransformer(transformers=[
    
    # Nominal → One-Hot Encoding
    ('ohe', 
     OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False),
     nominal_cols),
    
    # Ordinal → Ordinal Encoding (custom order)
    ('ordinal',
     OrdinalEncoder(categories=[satisfaction_order, support_order]),
     ordinal_cols),
    
    # Numeric → Standard Scaling
    ('scaler', StandardScaler(), numeric_cols)
    
], verbose_feature_names_out=False)

# ═══════════════════════════════════════════════════════════
# STEP 5: Model Pipelines
# ═══════════════════════════════════════════════════════════
models = {
    'Random Forest': Pipeline([
        ('prep', preprocessor),
        ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
    ]),
    'Gradient Boosting': Pipeline([
        ('prep', preprocessor),
        ('clf', GradientBoostingClassifier(n_estimators=100, random_state=42))
    ])
}

# ═══════════════════════════════════════════════════════════
# STEP 6: Train & Evaluate
# ═══════════════════════════════════════════════════════════
X = df.drop('Churned', axis=1)
y = df['Churned']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("\n" + "=" * 50)
print("MODEL EVALUATION")
print("=" * 50)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, pipeline in models.items():
    # Cross-validation
    cv_scores = cross_val_score(pipeline, X_train, y_train,
                                 cv=cv, scoring='roc_auc')
    
    # Final fit & test evaluation
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    y_proba = pipeline.predict_proba(X_test)[:, 1]
    
    print(f"\n{'─'*40}")
    print(f"Model: {name}")
    print(f"CV ROC-AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
    print(f"Test ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))

# ═══════════════════════════════════════════════════════════
# STEP 7: Feature Importance
# ═══════════════════════════════════════════════════════════
rf_pipeline = models['Random Forest']

# Feature names after transformation
ohe_features = (rf_pipeline.named_steps['prep']
                .named_transformers_['ohe']
                .get_feature_names_out(nominal_cols))

all_feature_names = list(ohe_features) + ordinal_cols + numeric_cols

importances = rf_pipeline.named_steps['clf'].feature_importances_
feat_imp_df = pd.DataFrame({
    'Feature': all_feature_names,
    'Importance': importances
}).sort_values('Importance', ascending=False)

print("\nTop 10 Feature Importances:")
print(feat_imp_df.head(10).to_string(index=False))

8. Edge Cases and Errors {#edge-cases-and-errors}

8.1 Unseen Categories at Inference Time

Problem: Training saw "Mumbai", "Delhi", "Pune" — then "Kochi" shows up at test time — error!

Python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

train = pd.DataFrame({'City': ['Mumbai', 'Delhi', 'Pune']})
test = pd.DataFrame({'City': ['Mumbai', 'Kochi']})  # 'Kochi' unseen!

# ❌ BAD — Default behavior
enc_bad = OneHotEncoder()
enc_bad.fit(train[['City']])
try:
    enc_bad.transform(test[['City']])
except ValueError as e:
    print(f"Error: {e}")

# ✅ GOOD — handle_unknown='ignore'
enc_good = OneHotEncoder(handle_unknown='ignore')
enc_good.fit(train[['City']])
result = enc_good.transform(test[['City']]).toarray()
print("Result:\n", result)
# Kochi → all zeros (safely ignored)

# ✅ ALSO GOOD — handle_unknown='infrequent_if_exist' (sklearn >= 1.1)
enc_infreq = OneHotEncoder(
    handle_unknown='infrequent_if_exist',
    min_frequency=2  # categories seen fewer than 2 times → 'infrequent' bucket
)
enc_infreq.fit(train[['City']])
# Note: in this toy train set every city appears only once, so all of them
# land in the infrequent bucket — tune min_frequency to your data

8.2 Dummy Variable Trap

Python
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue']})

# ❌ BAD — Multicollinearity!
# Red + Blue + Green = 1 always → perfect multicollinearity
bad_ohe = pd.get_dummies(df, columns=['Color'], dtype=int)
print("BAD (3 columns - trap!):\n", bad_ohe)

# ✅ GOOD — drop_first=True
good_ohe = pd.get_dummies(df, columns=['Color'], drop_first=True, dtype=int)
print("\nGOOD (2 columns - no trap):\n", good_ohe)
# 'Blue' (first alphabetically) is dropped and becomes the implicit baseline

8.3 Data Leakage in Target Encoding

Python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'City': ['Mumbai', 'Delhi', 'Pune'] * 100,
    'Target': np.random.choice([0, 1], 300)
})

# ❌ BAD — Leakage! Each row's own target value feeds its encoding
leaky_map = df.groupby('City')['Target'].mean()
df['City_TE_LEAKY'] = df['City'].map(leaky_map)

# ✅ GOOD — use leave-one-out or CV-based encoding,
# or sklearn 1.3+'s TargetEncoder (it cross-fits internally)

from sklearn.preprocessing import TargetEncoder
enc = TargetEncoder(cv=5, smooth='auto')
df['City_TE_SAFE'] = enc.fit_transform(
    df[['City']], df['Target']
).ravel()  # fit_transform returns a 2-D array — flatten it for the column

8.4 OrdinalEncoder — Unknown Category Error

Python
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np

train = pd.DataFrame({'Grade': ['Junior', 'Mid', 'Senior']})
test = pd.DataFrame({'Grade': ['Lead']})  # Unseen!

# ❌ Error
enc = OrdinalEncoder(categories=[['Junior', 'Mid', 'Senior']])
enc.fit(train)
try:
    enc.transform(test)
except ValueError as e:
    print(f"Error: {e}")

# ✅ Fix: handle_unknown='use_encoded_value'
enc_safe = OrdinalEncoder(
    categories=[['Junior', 'Mid', 'Senior']],
    handle_unknown='use_encoded_value',
    unknown_value=-1  # map unseen categories to -1
)
enc_safe.fit(train)
print(enc_safe.transform(test))  # [[-1.]]

8.5 Common Mistakes Summary

| Mistake | Problem | Fix |
|---|---|---|
| Label Encoding with a linear model | False ordinal relationship | Use OHE |
| OHE with drop_first=False | Dummy variable trap | drop_first=True |
| Target Encoding without CV | Data leakage | Use CV-based / TargetEncoder |
| Ordinal without a custom order | Wrong relationships | Define the categories parameter |
| No handle_unknown in the encoder | Production crash | Set handle_unknown='ignore' |
| Fit on full data, then split | Data leakage | Split first, then fit only on train |
| High cardinality with OHE | Memory explosion | Use Binary/Frequency/Target encoding |


9. Pro Developer Insights {#pro-developer-insights}

9.1 Encoding Best Practices Checklist

Text
Pre-Encoding Checklist:
────────────────────────
□ Do the train-test split first, encode afterwards
□ Check cardinality (nunique())
□ Handle missing values BEFORE encoding
□ Define a custom order for ordinal columns
□ ALWAYS set the handle_unknown parameter
□ Save the encoder object (joblib) — needed at inference time

Model-wise Encoding Guide:
──────────────────────────
Linear/Logistic Regression  → OHE + StandardScaler
Ridge/Lasso                 → OHE + StandardScaler
Decision Tree               → Label/Ordinal (OHE optional)
Random Forest               → Label/Ordinal/Frequency
XGBoost/LightGBM            → Label/Ordinal/Target (native support too)
Neural Network              → OHE / Embedding layers
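For the tree-model rows in this guide, pandas' `category` dtype gives a one-line label encoding via its integer codes — a quick sketch (note the codes follow alphabetical category order, not any semantic order):

```python
import pandas as pd

df = pd.DataFrame({'Department': ['IT', 'HR', 'Finance', 'IT', 'HR']})

# For tree models, integer codes are usually enough — the category dtype
# provides them directly (Finance=0, HR=1, IT=2 here, alphabetical)
df['Department'] = df['Department'].astype('category')
df['Dept_Code'] = df['Department'].cat.codes

print(df)
```

Unlike `LabelEncoder`, this also keeps the original labels around (`df['Department'].cat.categories`) for reverse lookup.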

9.2 Memory Optimization

Python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'Finance'] * 100000,
    'City': ['Mumbai', 'Delhi'] * 150000
})

# ❌ Before optimization
print(f"Before: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# ✅ After: use the category dtype
df['Department'] = df['Department'].astype('category')
df['City'] = df['City'].astype('category')
print(f"After: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# pd.get_dummies is also faster with the category dtype!
ohe = pd.get_dummies(df, columns=['Department', 'City'], dtype=np.int8)
print(f"After OHE: {ohe.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

9.3 Reproducibility — Always Set a Random State

Python
from sklearn.preprocessing import TargetEncoder
import numpy as np

# Fix the seed — results stay reproducible
np.random.seed(42)
enc = TargetEncoder(random_state=42)

9.4 Production Deployment Pattern

Python
"""
Production Pattern:
1. Training time → fit_transform()
2. Save encoder → joblib.dump()
3. Inference time → load → transform() only (NEVER fit again!)
"""

import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Training time (X_train, y_train come from your own split)
pipeline = Pipeline([
    ('enc', OneHotEncoder(handle_unknown='ignore')),
    ('clf', RandomForestClassifier())
])
pipeline.fit(X_train, y_train)

# Save complete pipeline (encoder + model)
joblib.dump(pipeline, 'production_model_v1.joblib')

# Inference (use inside your API endpoint) — new_data is the incoming features
loaded = joblib.load('production_model_v1.joblib')
prediction = loaded.predict(new_data)

10. Comparison {#comparison}

Encoding Techniques Mega Comparison Table

| Encoding | Cardinality | Model Type | Leakage Risk | Dimensionality | Handles Missing? | Preserves Order? |
|---|---|---|---|---|---|---|
| Label Encoding | Any | Tree-based | None | Same | ❌ (Error) | ❌ (Alphabetical) |
| One-Hot Encoding | Low (<15) | All | None | High (K-1 cols) | ❌ (Error) | ❌ |
| Ordinal Encoding | Any (ordered) | All | None | Same | ❌ (Error) | ✅ Custom |
| Target Encoding | High | All | ⚠️ High | Same | — | ❌ |
| Frequency Encoding | High | Tree-based | None | Same | ✅ (→ 0) | ❌ |
| Binary Encoding | Medium-High | All | None | Low (log2 K) | — | ❌ |
| WOE Encoding | Any | Linear | ⚠️ Medium | Same | — | ❌ |
| Hash Encoding | Very High | All | None | Fixed (set) | — | ❌ |
| Embedding | Very High | Neural Net | None | Dense low-D | — | ❌ |

When to Use What — Decision Tree

Text
Got a Categorical Column?
│
├── Missing values present?
│   └── Yes → Impute first (SimpleImputer)
│
├── Ordinal (order matters)?
│   └── Yes → OrdinalEncoder (define custom categories)
│
├── Cardinality?
│   ├── Low (< 10) → OneHotEncoder
│   │   ├── Linear Model → ✅ OHE
│   │   └── Tree Model  → Label Encoding is also fine
│   │
│   ├── Medium (10-50) → BinaryEncoder / FrequencyEncoder
│   │
│   └── High (50+) →
│       ├── Supervised → TargetEncoder (CV-based)
│       ├── Unsupervised → FrequencyEncoder
│       └── Credit/Risk → WOE Encoder
│
└── Model Type?
    ├── Linear/Logistic → OHE preferred
    ├── Tree-based → Label/Ordinal/Frequency
    └── Neural Net → Embeddings

11. Data Science Perspective {#data-science-perspective}

11.1 Feature Importance After Encoding

Python
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'Gender': np.random.choice(['Male', 'Female'], n),
    'Department': np.random.choice(['IT', 'HR', 'Finance', 'Sales'], n),
    'Education': np.random.choice(['School', 'Graduate', 'Post-Graduate'], n),
    'Age': np.random.randint(22, 55, n),
    'Attrition': np.random.choice([0, 1], n, p=[0.7, 0.3])
})

X = df.drop('Attrition', axis=1)
y = df['Attrition']

preprocessor = ColumnTransformer([
    ('ohe', OneHotEncoder(drop='first', sparse_output=False), ['Gender', 'Department']),
    ('ord', OrdinalEncoder(categories=[['School', 'Graduate', 'Post-Graduate']]), ['Education']),
], remainder='passthrough')

pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

pipeline.fit(X, y)

# Feature names
ohe_features = pipeline.named_steps['prep']\
    .named_transformers_['ohe'].get_feature_names_out(['Gender', 'Department'])
all_features = list(ohe_features) + ['Education', 'Age']

importances = pipeline.named_steps['clf'].feature_importances_
feat_df = pd.DataFrame({'Feature': all_features, 'Importance': importances})\
    .sort_values('Importance', ascending=True)

# Plot
feat_df.plot(kind='barh', x='Feature', y='Importance', figsize=(10, 6))
plt.title('Feature Importances After Categorical Encoding')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=150)
plt.show()
print("Feature Importance Plot saved!")

11.2 Encoding Impact on Model Performance

Python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# A full comparison builds one pipeline per encoding strategy (reusing the
# preprocessors from the earlier examples) and cross-validates each, e.g.:
#   pipe = Pipeline([('prep', preprocessor), ('clf', LogisticRegression())])
#   cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')

# Illustrative result pattern (actual numbers vary by dataset):
results_df = pd.DataFrame({
    'Encoding': ['Label Only', 'OHE', 'Ordinal', 'Target', 'Frequency'],
    'Linear_AUC': [0.61, 0.74, 0.72, 0.76, 0.65],
    'RF_AUC': [0.79, 0.78, 0.80, 0.82, 0.78],
    'XGB_AUC': [0.81, 0.80, 0.81, 0.84, 0.80]
})

print("\nEncoding Performance Comparison (illustrative):")
print(results_df.to_string(index=False))

11.3 Embeddings — For Neural Networks (Keras)

Python
"""
High cardinality categories (City: 500+) ke liye
Neural Network mein Embedding layers use karo
"""
import numpy as np
# TensorFlow/Keras example (conceptual)

# Assume City has 500 unique values
n_cities = 500
embedding_dim = 10  # 500 categories → 10 dimensional dense vector

# Keras Model (pseudo-code)
"""
from tensorflow import keras

city_input = keras.Input(shape=(1,), name='city')
city_emb = keras.layers.Embedding(
    input_dim=n_cities + 1,    # +1 for unknown
    output_dim=embedding_dim,
    name='city_embedding'
)(city_input)
city_flat = keras.layers.Flatten()(city_emb)

# Combine with other features
numeric_input = keras.Input(shape=(5,), name='numeric')
combined = keras.layers.Concatenate()([city_flat, numeric_input])
output = keras.layers.Dense(1, activation='sigmoid')(combined)

model = keras.Model(inputs=[city_input, numeric_input], outputs=output)
"""
print("Embedding approach: 500 cities → 10-dim dense vectors")
print("Memory: 500 * 10 = 5000 params (vs OHE: 500 columns)")

12. Interview Questions {#interview-questions}

Basic Level

Q1: What is the difference between One-Hot Encoding and Label Encoding?

Label Encoding assigns integers in a single column (arbitrary order). OHE creates a new binary column for each category. OHE is better for linear models because it doesn't create a false ordinal relationship.

Q2: What is the Dummy Variable Trap?

When OHE keeps K columns for K categories (drop_first=False), each column becomes a linear combination of the rest — perfect multicollinearity. Fix: drop_first=True.

Q3: When do you use Ordinal Encoding?

When categories have a natural order — Education (School < Graduate < Post-Graduate), Rating (Low < Medium < High). Always define a custom categories parameter.

Q4: What is cardinality?

The count of unique values in a column. Low (<10), Medium (10-50), High (50+). With high cardinality, OHE causes a memory explosion.

Intermediate Level

Q5: How does data leakage happen in Target Encoding?

When a row's own target value is used to compute the category mean, the model effectively trains on test-time information. Fix: cross-validation-based encoding or Leave-One-Out.

Q6: How do you handle unseen categories in production?

handle_unknown='ignore' (an all-zero vector in OHE), or unknown_value=-1 (in OrdinalEncoder). Save the encoder at training time; at inference call transform() only.

Q7: What is the best approach for a high-cardinality column?

Supervised task → TargetEncoder (CV-based) or CatBoost Encoding. Unsupervised → FrequencyEncoder or BinaryEncoder. Neural Network → Embedding layers.

Q8: Why put encoding directly inside a sklearn.pipeline.Pipeline?

It prevents data leakage — fit() runs only on training data, transform() on test data. cross_val_score also handles it correctly.

Advanced Level

Q9: How does CatBoost Encoding differ from Target Encoding?

CatBoost Encoding uses an ordered/temporal scheme — each sample is encoded using only the mean of the samples that came before it, so leakage is zero. Target Encoding uses the mean over all samples (leakage risk unless you use CV).
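A toy sketch of that ordered idea in pandas (not CatBoost's actual implementation) — each row sees only the target means of earlier rows of the same category, with the global mean as the prior:

```python
import pandas as pd

df = pd.DataFrame({
    'City':   ['Mumbai', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
    'Target': [1, 0, 1, 1, 0]
})

prior = df['Target'].mean()  # global mean used before any history exists

# Expanding mean of the target per category, shifted by one row —
# a row's own target never contributes to its own encoding
df['City_OrderedTE'] = (
    df.groupby('City')['Target']
      .transform(lambda s: s.shift(1).expanding().mean())
      .fillna(prior)
)
print(df)
```

The first occurrence of every city gets the prior (0.6 here); later occurrences use only past history — that is why the ordering of rows matters in this scheme.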

Q10: When do you use WOE Encoding, and what is the formula?

In binary classification, especially Credit Scoring. WOE = ln(Distribution of Events / Distribution of Non-Events).
A high WOE means the category has a high event (default) probability.
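Computing that formula by hand on a tiny, made-up default dataset makes the sign convention concrete:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Employment': ['Salaried'] * 6 + ['Self-Employed'] * 4,
    'Default':    [1, 0, 0, 0, 0, 0,  1, 1, 1, 0]
})

# Per-category event (default=1) and non-event counts
grp = df.groupby('Employment')['Default'].agg(events='sum', total='count')
grp['non_events'] = grp['total'] - grp['events']

# WOE = ln( share of all events in this category / share of all non-events )
woe = np.log(
    (grp['events'] / grp['events'].sum()) /
    (grp['non_events'] / grp['non_events'].sum())
)
print(woe)
```

Self-Employed gets a positive WOE (defaults concentrate there), Salaried a negative one — exactly the "high WOE = high event probability" reading above.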

Q11: Embedding vs OHE — what are the tradeoffs?

OHE: sparse, high-dimensional, learns no relationships. Embedding: dense, low-dimensional, can learn semantic relationships. 500 cities with OHE → 499 columns; with an Embedding → 10-20 dimensions.

Q12: How do you build a custom Sklearn Transformer?

Inherit from BaseEstimator + TransformerMixin. Learn statistics in fit(), apply them in transform(). Pipeline compatibility comes for free.
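A minimal sketch of that pattern — a frequency encoder written as a custom transformer (illustrative class, not a sklearn built-in):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """Replace each category with its relative frequency learned in fit()."""

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        # Learn per-column frequency maps from the training data only
        self.freq_maps_ = {
            col: X[col].value_counts(normalize=True) for col in X.columns
        }
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        for col, freq in self.freq_maps_.items():
            # Unseen categories map to 0 instead of crashing
            X[col] = X[col].map(freq).fillna(0.0)
        return X

train = pd.DataFrame({'City': ['Mumbai', 'Mumbai', 'Delhi', 'Pune']})
test = pd.DataFrame({'City': ['Delhi', 'Kochi']})  # 'Kochi' is unseen

enc = FrequencyEncoder().fit(train)
out = enc.transform(test)
print(out)
```

Because it follows the fit/transform contract, it drops straight into a Pipeline or ColumnTransformer like any built-in encoder.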

Scenario-Based

Q13: You have a City column with 800 unique values and an XGBoost model. What is your approach?

  1. Group rare categories (frequency < 1% → 'Other')

  2. Use FrequencyEncoder or CV-based Target Encoding

  3. Avoid OHE for XGBoost — 800 sparse columns are inefficient

  4. Save the encoder with joblib for production
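Step 1 of that answer in code — a hedged sketch (the 10% cutoff here is just for the toy data; the 1% rule above is for large datasets):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
cities = ['Mumbai'] * 60 + ['Delhi'] * 30 + ['Nashik'] * 5 + ['Kochi'] * 5
df = pd.DataFrame({'City': rng.permutation(cities)})

# Categories below a frequency threshold → collapsed into 'Other'
freq = df['City'].value_counts(normalize=True)
rare = freq[freq < 0.10].index          # threshold is a judgment call
df['City_Grouped'] = df['City'].where(~df['City'].isin(rare), 'Other')

print(df['City_Grouped'].value_counts())
```

Save the `rare` index along with the encoder, so production applies exactly the same grouping before encoding.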

Q14: "Pune" was in training, but "Nashik" arrives in production — what happens, and how do you handle it?

Without handling → ValueError. Fix: handle_unknown='ignore' (OHE) or .fillna(0) in a FrequencyEncoder. Also set up monitoring — track the unknown-category rate (if it exceeds ~10%, consider retraining the model).
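That monitoring idea fits in a few lines — here `known_cities` stands in for the categories your saved encoder was fit on (hypothetical names, for illustration):

```python
import pandas as pd

# Categories the encoder was fit on at training time
known_cities = {'Mumbai', 'Delhi', 'Pune'}

# A batch arriving at the production endpoint
batch = pd.Series(['Mumbai', 'Nashik', 'Delhi', 'Nashik', 'Pune'])

unknown_rate = (~batch.isin(known_cities)).mean()
print(f"Unknown category rate: {unknown_rate:.0%}")

# ~10% is the alert threshold suggested in the answer above
if unknown_rate > 0.10:
    print("ALERT: consider retraining / refreshing the encoder")
```

Log this rate per batch; a sudden jump usually means upstream data changed before your model did.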


13. Conclusion {#conclusion}

Key Learnings Summary

Text
Categorical Encoding — Golden Rules:

1. Split first, encode later           → Prevent data leakage
2. Check cardinality                   → Choose the right encoding
3. Model type matters                  → Linear → OHE, Tree → Label/Freq
4. Define the ordinal order            → Custom categories parameter
5. ALWAYS set handle_unknown           → Avoid production crashes
6. Use CV in Target Encoding           → Leakage-free encoding
7. Save the encoder (joblib)           → Reuse it at inference
8. Use a Pipeline                      → Clean, reproducible, leakage-free
9. High cardinality → Binary/Freq/TE   → Optimize memory
10. Group rare categories              → Reduce noise

Quick Decision Guide

| Situation | Recommended Encoding |
|---|---|
| Gender, binary column | Label Encoding / OHE (1 col) |
| Department (3-10 values) + linear model | One-Hot Encoding |
| Education, Grade (ordered) | Ordinal Encoding |
| City (50+ values) + tree model | Frequency / Binary Encoding |
| City (50+ values) + linear model | Target Encoding (CV) |
| Credit scoring | WOE Encoding |
| Neural network + high cardinality | Embedding Layers |
| Production API | Full Pipeline + joblib |

Final Advice

Encoding isn't just a technical step — it is the core of feature engineering. Bad encoding can ruin even a great model, while the right encoding can make a simple model perform surprisingly well.

Start with OrdinalEncoder and OneHotEncoder. When you face high cardinality, try TargetEncoder (sklearn 1.3+). For production, always use Pipeline + joblib. And remember — fit only on train, transform on both.


Happy Encoding! 🚀 If you have any doubts, drop a comment below. More practical ML guides are on the way at HelpInData.


© HelpInData | helpindata.web.app | Written for Indian Developers — Beginner to Pro
