"Garbage in, garbage out." — ML models sirf numbers samajhte hain. Agar tumhara categorical data sahi se encode nahi hua, toh tumhara model chahe kitna bhi powerful ho — output bakwaas aayega.*
## 1. Introduction

### What is Categorical Data?

Imagine you have an HR dataset — Department, City, Gender, Education_Level, Job_Grade. All of these are text values — "IT", "HR", "Mumbai", "Male", "Bachelor's".

Machine learning algorithms — Linear Regression, Random Forest, XGBoost — only work with numbers. So these text values have to be converted into numbers. That is what Categorical Data Handling means.
### Why does it matter?

| Problem | Impact |
|---|---|
| Wrong encoding | The model learns wrong patterns |
| Ignoring high cardinality | Memory explosion + overfitting |
| Data leakage (in Target Encoding) | The model fails in the real world |
| Treating ordinal data as nominal | Mathematical relationships get destroyed |
### Real-World Usage

- E-commerce: Product category, Brand, Color → price prediction
- Banking: Loan type, Branch, Customer segment → default prediction
- HR Analytics: Department, Grade, City → attrition prediction
- Healthcare: Blood group, Disease type, Ward → readmission risk
- Credit Scoring: Employment type, Education → creditworthiness
## 2. Basic Concepts

### Types of Categorical Data
Categorical Data
│
├── Nominal (No Order)
│ ├── Gender: Male, Female
│ ├── City: Mumbai, Delhi, Pune
│ └── Color: Red, Blue, Green
│
└── Ordinal (Order Matters)
├── Education: School < Graduate < Post-Graduate
├── Job Grade: Junior < Mid < Senior < Lead
└── Rating: Poor < Average < Good < Excellent
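
Incidentally, pandas can represent this nominal/ordinal distinction directly in a column's dtype. A minimal sketch (the column values are illustrative):

import pandas as pd

# Nominal: unordered categories
city = pd.Series(['Mumbai', 'Delhi', 'Pune'], dtype='category')

# Ordinal: pandas can carry the order in the dtype itself
education = pd.Series(
    ['School', 'Post-Graduate', 'Graduate'],
    dtype=pd.CategoricalDtype(
        categories=['School', 'Graduate', 'Post-Graduate'], ordered=True
    )
)
print(education < 'Post-Graduate')  # order-aware comparison works
print(education.cat.codes)          # [0, 2, 1] — codes follow the defined order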
### What is Cardinality?

Cardinality = the number of unique values in a column.
import pandas as pd

df = pd.read_csv('employee.csv')

# Check the cardinality of every categorical column
for col in df.select_dtypes(include='object').columns:
    print(f"{col}: {df[col].nunique()} unique values")
Department  : 8    → Low cardinality ✅ (OHE is safe)
City        : 42   → Medium cardinality ⚠️
Employee_ID : 5000 → High cardinality ❌ (do NOT one-hot encode)
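
If you want this check to also suggest an encoding, here is a small helper sketch — the thresholds are illustrative rules of thumb, not hard cutoffs:

import pandas as pd

def suggest_encoding(df: pd.DataFrame) -> None:
    """Print a rough encoding suggestion per categorical column,
    based purely on cardinality (illustrative thresholds)."""
    for col in df.select_dtypes(include='object').columns:
        n = df[col].nunique()
        if n <= 15:
            hint = 'One-Hot Encoding'
        elif n <= 50:
            hint = 'Binary / Frequency Encoding'
        else:
            hint = 'Target / Frequency Encoding (CV-based if supervised)'
        print(f"{col:15s} {n:5d} unique → {hint}")

df = pd.DataFrame({
    'Department': ['IT', 'HR'] * 50,
    'City': [f'City_{i}' for i in range(100)],
})
suggest_encoding(df)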
### Overview of Encoding Techniques
Encoding Techniques
│
├── Classical
│ ├── Label Encoding → for ordinal data
│ ├── One-Hot Encoding → Nominal, low cardinality
│ └── Ordinal Encoding → Ordered categories
│
├── Statistical
│ ├── Target Encoding → High cardinality + supervised
│ ├── Frequency Encoding → High cardinality + unsupervised
│ └── Binary Encoding → Medium-high cardinality
│
└── Advanced
├── Helmert Encoding
├── Sum Encoding
├── WOE Encoding → Binary classification
└── Embedding (Deep Learning)
## 3. All Encoding Functions & Features

### 3.1 Label Encoding — Pandas .map() / Sklearn LabelEncoder

When to use: with tree-based models (Random Forest, XGBoost), when order doesn't matter and cardinality is medium to high.
When to avoid: in linear models — the model will assume 3 > 2 > 1 (a false order).
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'City': ['Mumbai', 'Delhi', 'Pune', 'Mumbai', 'Delhi']
})

# Method 1: Pandas .map()
city_map = {'Mumbai': 0, 'Delhi': 1, 'Pune': 2}
df['City_Label'] = df['City'].map(city_map)

# Method 2: Sklearn LabelEncoder
le = LabelEncoder()
df['City_LE'] = le.fit_transform(df['City'])

print(df)
print("Classes:", le.classes_)  # ['Delhi', 'Mumbai', 'Pune']

# Inverse transform — map numbers back to labels
print(le.inverse_transform([0, 1, 2]))  # ['Delhi', 'Mumbai', 'Pune']
Output:
City City_Label City_LE
0 Mumbai 0 1
1 Delhi 1 0
2 Pune 2 2
3 Mumbai 0 1
4 Delhi 1 0
⚠️ Note: LabelEncoder sorts classes alphabetically — that's why Delhi=0, Mumbai=1. If you need a custom order, use .map().
### 3.2 One-Hot Encoding (OHE) — Pandas get_dummies() / Sklearn OneHotEncoder

When to use: nominal data (no order), low cardinality (< 10-15 unique values), linear models.
Dummy Variable Trap: for K categories keep only K-1 columns (drop_first=True).
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import numpy as np

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'Finance', 'IT', 'HR'],
    'Salary': [70000, 50000, 60000, 80000, 55000]
})

# ─── Method 1: Pandas get_dummies ───────────────────────────
ohe_pandas = pd.get_dummies(
    df,
    columns=['Department'],   # which columns to encode
    drop_first=True,          # avoid the dummy trap (K-1 columns)
    prefix='Dept',            # column-name prefix
    dtype=int                 # 0/1 as integer (not bool)
)
print("Pandas OHE:\n", ohe_pandas)

# ─── Method 2: Sklearn OneHotEncoder ───────────────────────
enc = OneHotEncoder(
    drop='first',             # avoid the dummy trap
    sparse_output=False,      # return a dense array
    handle_unknown='ignore'   # unseen categories become all zeros
)
dept_encoded = enc.fit_transform(df[['Department']])

# Get the generated feature names
feature_names = enc.get_feature_names_out(['Department'])
ohe_df = pd.DataFrame(dept_encoded, columns=feature_names)
print("\nSklearn OHE:\n", ohe_df)
print("Categories:", enc.categories_)
Output:
Pandas OHE:
Salary Dept_HR Dept_IT
0 70000 0 1
1 50000 1 0
2 60000 0 0
3 80000 0 1
4 55000 1 0
### 3.3 Ordinal Encoding — Sklearn OrdinalEncoder

When to use: when the categories have a natural order (Education, Job Grade, Rating).
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

df = pd.DataFrame({
    'Education': ['School', 'Graduate', 'Post-Graduate', 'School', 'Graduate'],
    'Job_Grade': ['Junior', 'Senior', 'Lead', 'Mid', 'Senior']
})

# Define the custom order — this is the single most important step
enc = OrdinalEncoder(
    categories=[
        ['School', 'Graduate', 'Post-Graduate'],  # Education order
        ['Junior', 'Mid', 'Senior', 'Lead']       # Job Grade order
    ]
)
df[['Education_Enc', 'Grade_Enc']] = enc.fit_transform(
    df[['Education', 'Job_Grade']]
)
print(df)
Output:
Education Job_Grade Education_Enc Grade_Enc
0 School Junior 0.0 0.0
1 Graduate Senior 1.0 2.0
2 Post-Graduate Lead 2.0 3.0
3 School Mid 0.0 1.0
4 Graduate Senior 1.0 2.0
✅ Now the model understands: Post-Graduate (2) > Graduate (1) > School (0) — exactly right!
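
To see why the categories parameter is the most important step: if you omit it, OrdinalEncoder falls back to alphabetical order and scrambles the semantics. A minimal sketch:

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

df = pd.DataFrame({'Education': ['School', 'Graduate', 'Post-Graduate']})

# No categories passed → alphabetical order is used
enc_default = OrdinalEncoder()
print(enc_default.fit_transform(df))   # [[2.] [0.] [1.]]
print(enc_default.categories_)
# [array(['Graduate', 'Post-Graduate', 'School'], dtype=object)]
# 'Graduate'=0, 'Post-Graduate'=1, 'School'=2 — School ends up "highest"!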
### 3.4 Target Encoding (Mean Encoding)

When to use: high-cardinality columns, supervised problems (binary classification/regression).
Risk: data leakage! A row's own target value can leak into its encoding and inflate validation scores.
Solution: cross-validated, fold-based encoding, or smoothing.
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'City': ['Mumbai', 'Delhi', 'Pune', 'Mumbai', 'Delhi',
             'Pune', 'Mumbai', 'Delhi', 'Pune', 'Mumbai'],
    'Purchased': [1, 0, 1, 1, 0, 0, 1, 1, 0, 1]
})

# ─── Simple Target Encoding (leakage risk!) ─────────────────
target_mean = df.groupby('City')['Purchased'].mean()
df['City_TE_Simple'] = df['City'].map(target_mean)
print("Simple TE (with leakage risk):\n", df[['City', 'Purchased', 'City_TE_Simple']])

# ─── Smoothed Target Encoding (production-safe) ─────────────
def smooth_target_encode(df, col, target, alpha=10):
    """
    alpha = smoothing parameter.
    High alpha = lean more on the global mean (safer for small groups).
    Low alpha  = lean more on the group mean.
    """
    global_mean = df[target].mean()
    group_stats = df.groupby(col)[target].agg(['mean', 'count'])
    # Smoothing formula: (n * group_mean + alpha * global_mean) / (n + alpha)
    smoothed = (
        (group_stats['count'] * group_stats['mean'] + alpha * global_mean)
        / (group_stats['count'] + alpha)
    )
    return df[col].map(smoothed)

df['City_TE_Smooth'] = smooth_target_encode(df, 'City', 'Purchased', alpha=5)
print("\nSmoothed TE:\n", df[['City', 'Purchased', 'City_TE_Smooth']])

# ─── Cross-Validated Target Encoding (best practice) ────────
def cv_target_encode(df, col, target, n_splits=5):
    """
    KFold: learn the means on the train folds, apply them to the
    held-out fold — a row's own target never enters its encoding.
    """
    df = df.copy()
    df['encoded'] = np.nan
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        train_df = df.iloc[train_idx]
        mean_map = train_df.groupby(col)[target].mean()
        df.loc[df.index[val_idx], 'encoded'] = df.iloc[val_idx][col].map(mean_map)
    # Fill NaN with the global mean (categories unseen in the train folds)
    df['encoded'] = df['encoded'].fillna(df[target].mean())
    return df['encoded']

df['City_TE_CV'] = cv_target_encode(df, 'City', 'Purchased')
print("\nCV Target Encoding:\n", df[['City', 'Purchased', 'City_TE_CV']])
### 3.5 Frequency / Count Encoding

When to use: high cardinality, unsupervised scenarios, tree-based models.
Logic: rare categories get a low frequency — the model automatically treats them as less important.
import pandas as pd

df = pd.DataFrame({
    'City': ['Mumbai', 'Delhi', 'Pune', 'Mumbai', 'Mumbai',
             'Delhi', 'Bangalore', 'Pune', 'Mumbai', 'Delhi']
})

# Frequency Encoding
freq_map = df['City'].value_counts(normalize=True)   # proportion
count_map = df['City'].value_counts()                # raw count
df['City_Freq'] = df['City'].map(freq_map)
df['City_Count'] = df['City'].map(count_map)
print(df)
Output:
City City_Freq City_Count
0 Mumbai 0.4 4
1 Delhi 0.3 3
2 Pune 0.2 2
3 Mumbai 0.4 4
4 Mumbai 0.4 4
5 Delhi 0.3 3
6 Bangalore 0.1 1
7 Pune 0.2 2
8 Mumbai 0.4 4
9 Delhi 0.3 3
### 3.6 Binary Encoding — category_encoders library

When to use: medium-to-high cardinality (15-100 unique values). Fewer columns than OHE, more information than Label Encoding.
Logic: label-encode → convert each integer to binary → each bit becomes a column.
# pip install category-encoders
import category_encoders as ce
import pandas as pd

df = pd.DataFrame({
    'City': ['Mumbai', 'Delhi', 'Pune', 'Bangalore', 'Chennai',
             'Kolkata', 'Hyderabad', 'Jaipur', 'Lucknow', 'Mumbai']
})

# Binary Encoding
enc = ce.BinaryEncoder(cols=['City'])
df_encoded = enc.fit_transform(df)
print(df_encoded)
# 9 unique cities → only 4 binary columns (2^4 = 16 ≥ 9)
# vs OHE → 8-9 columns
Column comparison:

Cities: 10 unique values
OHE    → 9-10 columns
Binary → 4 columns   ← winner for high cardinality
Label  → 1 column (but implies a false order)
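
You can verify these column counts yourself — a quick sketch comparing output widths on one synthetic column (the city names are made up):

import pandas as pd
import category_encoders as ce

cities = pd.DataFrame({'City': [f'City_{i}' for i in range(10)]})

ohe = pd.get_dummies(cities, columns=['City'], drop_first=True)
binary = ce.BinaryEncoder(cols=['City']).fit_transform(cities)
label = ce.OrdinalEncoder(cols=['City']).fit_transform(cities)

print(f"OHE    : {ohe.shape[1]} columns")     # 9
print(f"Binary : {binary.shape[1]} columns")  # 4
print(f"Label  : {label.shape[1]} columns")   # 1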
### 3.7 Helmert & Sum Encoding (Statistical)

When to use: statistical models, ANOVA-style analysis, when you need comparisons against a baseline.
import category_encoders as ce
import pandas as pd

df = pd.DataFrame({
    'Grade': ['A', 'B', 'C', 'A', 'B', 'C', 'A'],
    'Score': [90, 75, 60, 85, 70, 55, 95]
})

# Helmert: compare each category against the mean of the preceding categories
helmert_enc = ce.HelmertEncoder(cols=['Grade'])
df_helmert = helmert_enc.fit_transform(df[['Grade']])
print("Helmert Encoding:\n", df_helmert)

# Sum Encoding: compare each category against the grand mean
sum_enc = ce.SumEncoder(cols=['Grade'])
df_sum = sum_enc.fit_transform(df[['Grade']])
print("\nSum Encoding:\n", df_sum)
### 3.8 Weight of Evidence (WOE) Encoding

When to use: binary classification, especially credit scoring and fraud detection.
import category_encoders as ce
import pandas as pd

df = pd.DataFrame({
    'Employment_Type': ['Salaried', 'Self-Employed', 'Salaried', 'Unemployed',
                        'Salaried', 'Self-Employed', 'Unemployed', 'Salaried'],
    'Default': [0, 1, 0, 1, 0, 1, 1, 0]  # 1 = defaulted
})

woe_enc = ce.WOEEncoder(cols=['Employment_Type'])
df_woe = woe_enc.fit_transform(df[['Employment_Type']], df['Default'])
print(df_woe)
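
To make the encoded numbers interpretable, here is a hand computation of WOE per category, using the ln(event share / non-event share) form stated later in this guide; library implementations add regularization, so their exact values will differ slightly:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Employment_Type': ['Salaried', 'Self-Employed', 'Salaried', 'Unemployed',
                        'Salaried', 'Self-Employed', 'Unemployed', 'Salaried'],
    'Default': [0, 1, 0, 1, 0, 1, 1, 0]
})

# Count events (Default=1) and non-events (Default=0) per category
counts = pd.crosstab(df['Employment_Type'], df['Default'])
eps = 0.5  # smoothing so pure categories don't divide by zero

dist_event = (counts[1] + eps) / (counts[1].sum() + eps)
dist_non_event = (counts[0] + eps) / (counts[0].sum() + eps)
woe = np.log(dist_event / dist_non_event)
print(woe.sort_values())
# Negative WOE → safer category, positive WOE → more defaults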
## 4. Intermediate Usage {#intermediate-usage}

### 4.1 Mixed-Dataset Pipeline — Multiple Columns, Different Encodings

In real production work a single dataset contains columns of different types, and they all have to be handled together.
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    OneHotEncoder, OrdinalEncoder, StandardScaler
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# ─── Sample HR Dataset ───────────────────────────────────────
np.random.seed(42)
n = 500
df = pd.DataFrame({
    'Age': np.random.randint(22, 60, n),
    'Salary': np.random.randint(30000, 150000, n),
    'Department': np.random.choice(['IT', 'HR', 'Finance', 'Sales'], n),
    'City': np.random.choice(['Mumbai', 'Delhi', 'Pune', 'Bangalore'], n),
    'Education': np.random.choice(['School', 'Graduate', 'Post-Graduate'], n),
    'Job_Grade': np.random.choice(['Junior', 'Mid', 'Senior', 'Lead'], n),
    'Attrition': np.random.choice([0, 1], n, p=[0.7, 0.3])
})
print("Dataset shape:", df.shape)
print(df.dtypes)

# ─── Column Categories ───────────────────────────────────────
numeric_features = ['Age', 'Salary']

# OHE — nominal, low cardinality
nominal_features = ['Department', 'City']

# Ordinal — ordered categories
ordinal_features = ['Education', 'Job_Grade']
education_order = ['School', 'Graduate', 'Post-Graduate']
grade_order = ['Junior', 'Mid', 'Senior', 'Lead']

# ─── ColumnTransformer — everything in one place ────────────
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('ohe', OneHotEncoder(
        drop='first',
        handle_unknown='ignore',
        sparse_output=False
    ), nominal_features),
    ('ordinal', OrdinalEncoder(
        categories=[education_order, grade_order]
    ), ordinal_features)
])

# ─── Full ML Pipeline ────────────────────────────────────────
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train-test split
X = df.drop('Attrition', axis=1)
y = df['Attrition']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train
pipeline.fit(X_train, y_train)

# Evaluate
from sklearn.metrics import classification_report
y_pred = pipeline.predict(X_test)
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Feature names after transformation
ohe_features = pipeline.named_steps['preprocessor']\
    .named_transformers_['ohe'].get_feature_names_out(nominal_features)
all_features = (
    numeric_features +
    list(ohe_features) +
    ordinal_features
)
print("\nAll transformed features:", all_features)
### 4.2 High-Cardinality Handling — a City column with 100+ unique values
import pandas as pd
import numpy as np

np.random.seed(42)
cities = ['Mumbai', 'Delhi', 'Pune'] + [f'City_{i}' for i in range(100)]
n = 1000
df = pd.DataFrame({
    'City': np.random.choice(cities, n, p=[0.3, 0.2, 0.1] + [0.004] * 100),
    'Revenue': np.random.randint(1000, 50000, n)
})
print(f"Unique cities: {df['City'].nunique()}")

# Strategy 1: Top-N + "Other" grouping
def topN_encode(series, top_n=10, other_label='Other'):
    top_categories = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top_categories), other=other_label)

df['City_TopN'] = topN_encode(df['City'], top_n=5)
print("\nAfter Top-N grouping:\n", df['City_TopN'].value_counts())

# Strategy 2: Frequency threshold
def freq_threshold_encode(series, min_freq=10):
    counts = series.value_counts()
    rare = counts[counts < min_freq].index
    return series.replace(rare, 'Rare')

df['City_FreqThresh'] = freq_threshold_encode(df['City'], min_freq=30)
print("\nAfter frequency threshold:\n", df['City_FreqThresh'].value_counts())

# Strategy 3: Binary Encoding for whatever remains
# import category_encoders as ce
# enc = ce.BinaryEncoder(cols=['City_FreqThresh'])
# df_encoded = enc.fit_transform(df)
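
A fourth option — Hash Encoding from the comparison table later in this guide — maps categories into a fixed number of columns no matter the cardinality. A minimal sketch with sklearn's FeatureHasher (n_features=8 is an arbitrary choice; hash collisions are the trade-off):

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({'City': ['Mumbai', 'Delhi', 'Pune', 'Mumbai', 'City_42']})

# Each row becomes a list with one string feature; the hasher maps it
# into a fixed-width vector — unseen categories never crash this.
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform([[city] for city in df['City']]).toarray()
print(hashed.shape)  # (5, 8) — width is fixed no matter how many cities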
### 4.3 Handling Missing Values in Categorical Columns
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    'Department': ['IT', 'HR', np.nan, 'Finance', np.nan, 'IT'],
    'City': ['Mumbai', np.nan, 'Delhi', 'Pune', 'Mumbai', np.nan],
    'Salary': [70000, 50000, 60000, 80000, 55000, 75000]
})
print("Missing values:\n", df.isnull().sum())

# ─── Strategy 1: Mode Imputation ───────────────────────────
imputer = SimpleImputer(strategy='most_frequent')
df[['Department', 'City']] = imputer.fit_transform(df[['Department', 'City']])
print("\nAfter mode imputation:\n", df)

# ─── Strategy 2: Constant Fill ─────────────────────────────
df2 = pd.DataFrame({
    'Department': ['IT', 'HR', np.nan, 'Finance', np.nan, 'IT']
})
df2['Department'] = df2['Department'].fillna('Unknown')
print("\nConstant fill:\n", df2)

# ─── Pipeline: Impute → Encode ──────────────────────────────
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False))
])
# Use this pipeline directly inside a ColumnTransformer — sketched below
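
For completeness, a minimal sketch of that last step — dropping cat_pipeline into a ColumnTransformer next to the numeric column (column names follow the toy df above):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

full_preprocessor = ColumnTransformer([
    ('cat', cat_pipeline, ['Department', 'City']),  # impute → OHE
    ('num', StandardScaler(), ['Salary'])
])

# fit_transform learns imputation modes, OHE categories and scaling
# statistics in one call, in the right order
X_ready = full_preprocessor.fit_transform(df)
print(X_ready.shape)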
## 5. Advanced Concepts {#advanced-concepts}

### 5.1 Target Encoding with Cross-Validation — sklearn 1.3+ TargetEncoder

Sklearn 1.3+ ships a built-in TargetEncoder that handles the cross-validation automatically.
# sklearn >= 1.3 required
from sklearn.preprocessing import TargetEncoder, StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import pandas as pd
import numpy as np

np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'City': np.random.choice(['Mumbai', 'Delhi', 'Pune', 'Bangalore',
                              'Chennai', 'Kolkata', 'Hyderabad'], n),
    'Department': np.random.choice(['IT', 'HR', 'Finance', 'Sales', 'Legal'], n),
    'Age': np.random.randint(22, 55, n),
    'Target': np.random.choice([0, 1], n, p=[0.6, 0.4])
})
X = df[['City', 'Department', 'Age']]
y = df['Target']

preprocessor = ColumnTransformer([
    ('target_enc', TargetEncoder(
        target_type='binary',   # binary classification
        smooth='auto',          # automatic smoothing
        cv=5                    # 5-fold CV internally
    ), ['City', 'Department']),
    ('scaler', StandardScaler(), ['Age'])
])

pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', GradientBoostingClassifier())
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
print(f"ROC-AUC: {scores.mean():.4f} ± {scores.std():.4f}")
### 5.2 Custom Encoding Transformers (Production-Grade)

Build custom transformers that work seamlessly inside a sklearn Pipeline.
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """
    Custom frequency encoder that plays nicely with sklearn Pipelines.
    fit() learns the frequency map; transform() applies it.
    """
    def __init__(self, cols=None, normalize=True):
        self.cols = cols
        self.normalize = normalize
        self.freq_maps_ = {}

    def fit(self, X, y=None):
        X = pd.DataFrame(X) if not isinstance(X, pd.DataFrame) else X.copy()
        cols = self.cols or X.select_dtypes(include='object').columns.tolist()
        for col in cols:
            self.freq_maps_[col] = X[col].value_counts(
                normalize=self.normalize
            ).to_dict()
        return self

    def transform(self, X, y=None):
        X = pd.DataFrame(X) if not isinstance(X, pd.DataFrame) else X.copy()
        for col, freq_map in self.freq_maps_.items():
            # Unknown categories get 0
            X[col] = X[col].map(freq_map).fillna(0)
        return X

    def get_feature_names_out(self, input_features=None):
        return self.cols or list(self.freq_maps_.keys())


class RareCategoryGrouper(BaseEstimator, TransformerMixin):
    """
    Group rare categories into a 'Rare' bucket based on a frequency threshold.
    """
    def __init__(self, cols=None, min_freq=0.01, rare_label='Rare'):
        self.cols = cols
        self.min_freq = min_freq
        self.rare_label = rare_label
        self.frequent_cats_ = {}

    def fit(self, X, y=None):
        X = pd.DataFrame(X) if not isinstance(X, pd.DataFrame) else X.copy()
        cols = self.cols or X.select_dtypes(include='object').columns.tolist()
        for col in cols:
            freq = X[col].value_counts(normalize=True)
            self.frequent_cats_[col] = freq[freq >= self.min_freq].index.tolist()
        return self

    def transform(self, X, y=None):
        X = pd.DataFrame(X) if not isinstance(X, pd.DataFrame) else X.copy()
        for col, frequent in self.frequent_cats_.items():
            X[col] = X[col].where(X[col].isin(frequent), other=self.rare_label)
        return X


# ─── Usage in a Pipeline ────────────────────────────────────
np.random.seed(42)
n = 500
df = pd.DataFrame({
    'City': np.random.choice(
        ['Mumbai'] * 200 + ['Delhi'] * 150 + ['Pune'] * 100 +
        [f'City_{i}' for i in range(50)],
        n
    ),
    'Age': np.random.randint(22, 55, n),
    'Target': np.random.choice([0, 1], n)
})
X = df[['City', 'Age']]
y = df['Target']

# Pipeline with the custom transformers
custom_pipeline = Pipeline([
    ('rare_grouper', RareCategoryGrouper(cols=['City'], min_freq=0.05)),
    ('freq_encoder', FrequencyEncoder(cols=['City'])),
])
X_transformed = custom_pipeline.fit_transform(X)
print("Transformed:\n", X_transformed.head(10))
### 5.3 Encoding with category_encoders — Full Power Unlocked
import category_encoders as ce
import pandas as pd
import numpy as np

np.random.seed(42)
n = 300
df = pd.DataFrame({
    'City': np.random.choice(['Mumbai', 'Delhi', 'Pune', 'Bangalore', 'Chennai'], n),
    'Employment': np.random.choice(['Salaried', 'Self-Employed', 'Unemployed'], n),
    'Grade': np.random.choice(['A', 'B', 'C', 'D'], n),
    'Default': np.random.choice([0, 1], n, p=[0.7, 0.3])
})
X = df[['City', 'Employment', 'Grade']]
y = df['Default']

# 1. James-Stein Encoder (advanced target encoding)
js_enc = ce.JamesSteinEncoder(cols=['City', 'Employment'])
X_js = js_enc.fit_transform(X, y)
print("James-Stein:\n", X_js.head())

# 2. CatBoost Encoder (ordered target encoding, no leakage)
cb_enc = ce.CatBoostEncoder(cols=['City', 'Employment'])
X_cb = cb_enc.fit_transform(X, y)
print("\nCatBoost Enc:\n", X_cb.head())

# 3. Leave-One-Out Encoding
loo_enc = ce.LeaveOneOutEncoder(cols=['City'])
X_loo = loo_enc.fit_transform(X, y)
print("\nLeave-One-Out:\n", X_loo.head())
### 5.4 Encoding Persistence — Save/Load for Production Deployment
import joblib
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd

# Fit on the training data
preprocessor = ColumnTransformer([
    ('ohe', OneHotEncoder(drop='first', handle_unknown='ignore'), ['Department']),
    ('ordinal', OrdinalEncoder(categories=[['Junior', 'Mid', 'Senior', 'Lead']]), ['Grade'])
])
train_data = pd.DataFrame({
    'Department': ['IT', 'HR', 'Finance'],
    'Grade': ['Junior', 'Senior', 'Lead']
})
preprocessor.fit(train_data)

# ─── Save ───────────────────────────────────────────────────
joblib.dump(preprocessor, 'preprocessor.joblib')
print("Saved!")

# ─── Load (at production/inference time) ────────────────────
loaded_preprocessor = joblib.load('preprocessor.joblib')
new_data = pd.DataFrame({
    'Department': ['Sales', 'IT'],  # 'Sales' is unseen!
    'Grade': ['Mid', 'Senior']
})
result = loaded_preprocessor.transform(new_data)
print("Inference result:\n", result)
# 'Sales' → handle_unknown='ignore' → all zeros in the OHE columns
## 6. Real-World Use Cases {#real-world-use-cases}

### 6.1 E-Commerce Product Price Prediction
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Sample e-commerce dataset
np.random.seed(42)
n = 2000
df = pd.DataFrame({
    'Brand': np.random.choice(
        ['Samsung', 'Apple', 'OnePlus', 'Xiaomi', 'Realme'] +
        [f'Brand_{i}' for i in range(30)], n   # high-cardinality brand
    ),
    'Category': np.random.choice(['Mobile', 'Laptop', 'Tablet', 'Earphone'], n),
    'Condition': np.random.choice(['New', 'Like New', 'Good', 'Fair'], n),
    'Storage_GB': np.random.choice([32, 64, 128, 256, 512], n),
    'RAM_GB': np.random.choice([4, 6, 8, 12, 16], n),
    'Price': np.random.randint(5000, 150000, n)
})
X = df.drop('Price', axis=1)
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Encoding plan:
#   Brand     → high cardinality → group to Top-N first, then encode
#   Category  → low cardinality  → OHE
#   Condition → ordinal          → OrdinalEncoder
condition_order = ['Fair', 'Good', 'Like New', 'New']
preprocessor = ColumnTransformer([
    ('ohe', OneHotEncoder(drop='first', handle_unknown='ignore'), ['Category']),
    ('ordinal', OrdinalEncoder(categories=[condition_order]), ['Condition']),
    ('num', StandardScaler(), ['Storage_GB', 'RAM_GB']),
    # Brand is handled separately by the custom transformer below
], remainder='drop')

# Handle Brand separately with a custom Top-N + OHE transformer
from sklearn.base import BaseEstimator, TransformerMixin

class TopNThenOHE(BaseEstimator, TransformerMixin):
    def __init__(self, col, top_n=5):
        self.col = col
        self.top_n = top_n
        self.top_cats_ = None
        self.columns_ = None

    def fit(self, X, y=None):
        X = pd.DataFrame(X) if not isinstance(X, pd.DataFrame) else X
        self.top_cats_ = X[self.col].value_counts().nlargest(self.top_n).index
        # Remember the dummy columns so transform always emits the same shape
        self.columns_ = [f"{self.col}_{c}" for c in self.top_cats_] + [f"{self.col}_Other"]
        return self

    def transform(self, X, y=None):
        X = pd.DataFrame(X) if not isinstance(X, pd.DataFrame) else X.copy()
        X[self.col] = X[self.col].where(X[self.col].isin(self.top_cats_), 'Other')
        dummies = pd.get_dummies(X[[self.col]], prefix=self.col).astype(int)
        return dummies.reindex(columns=self.columns_, fill_value=0)

print("E-commerce pipeline ready!")
print(f"Training samples: {len(X_train)}, Test samples: {len(X_test)}")
### 6.2 Credit Risk Scoring — WOE + Logistic Regression
import pandas as pd
import numpy as np
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'Employment_Type': np.random.choice(
        ['Salaried', 'Self-Employed', 'Unemployed', 'Retired'], n,
        p=[0.5, 0.3, 0.15, 0.05]
    ),
    'Loan_Purpose': np.random.choice(
        ['Home', 'Education', 'Vehicle', 'Personal', 'Business'], n
    ),
    'Credit_History': np.random.choice(['Excellent', 'Good', 'Fair', 'Poor'], n),
    'Annual_Income': np.random.randint(200000, 2000000, n),
    'Default': np.random.choice([0, 1], n, p=[0.75, 0.25])
})
X = df.drop('Default', axis=1)
y = df['Default']

# WOE encoding for credit scoring (the industry standard)
woe_cols = ['Employment_Type', 'Loan_Purpose', 'Credit_History']

woe_pipeline = Pipeline([
    ('woe', ce.WOEEncoder(cols=woe_cols, regularization=1.0)),
    # max_iter raised so the unscaled income column doesn't block convergence
    ('clf', LogisticRegression(random_state=42, max_iter=1000))
])
scores = cross_val_score(woe_pipeline, X, y, cv=5, scoring='roc_auc')
print(f"Credit Scoring ROC-AUC: {scores.mean():.4f} ± {scores.std():.4f}")
## 7. Practical Examples {#practical-examples}

### Complete End-to-End Example: Telecom Churn Prediction
"""
Telecom Customer Churn Prediction
- Dataset mein mix of nominal, ordinal, numeric columns hain
- Complete pipeline: EDA → Encoding → Model → Evaluation
"""
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    OneHotEncoder, OrdinalEncoder, StandardScaler
)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report, roc_auc_score
import warnings
warnings.filterwarnings('ignore')
# ═══════════════════════════════════════════════════════════
# STEP 1: Create the dataset
# ═══════════════════════════════════════════════════════════
np.random.seed(42)
n = 2000
df = pd.DataFrame({
    # Nominal features
    'Gender': np.random.choice(['Male', 'Female'], n),
    'Contract': np.random.choice(['Month-to-Month', 'One Year', 'Two Year'], n,
                                 p=[0.55, 0.25, 0.20]),
    'InternetService': np.random.choice(['DSL', 'Fiber', 'None'], n,
                                        p=[0.35, 0.45, 0.20]),
    'PaymentMethod': np.random.choice(
        ['Credit Card', 'Bank Transfer', 'Cheque', 'Electronic Check'], n
    ),
    # Ordinal features
    'SatisfactionScore': np.random.choice(
        ['Very Low', 'Low', 'Medium', 'High', 'Very High'], n
    ),
    'TechSupportQuality': np.random.choice(['Poor', 'Average', 'Good', 'Excellent'], n),
    # Numeric features
    'Tenure_Months': np.random.randint(1, 72, n),
    'Monthly_Charges': np.random.uniform(20, 120, n).round(2),
    'Total_Charges': np.random.uniform(100, 8000, n).round(2),
    'Num_Services': np.random.randint(1, 8, n),
    # Target
    'Churned': np.random.choice([0, 1], n, p=[0.73, 0.27])
})
print("=" * 50)
print("DATASET OVERVIEW")
print("=" * 50)
print(f"Shape: {df.shape}")
print(f"\nData Types:\n{df.dtypes}")
print(f"\nMissing Values:\n{df.isnull().sum()}")
print(f"\nChurn Rate: {df['Churned'].mean():.2%}")
# ═══════════════════════════════════════════════════════════
# STEP 2: EDA — Cardinality Check
# ═══════════════════════════════════════════════════════════
print("\n" + "=" * 50)
print("CARDINALITY ANALYSIS")
print("=" * 50)
cat_cols = df.select_dtypes(include='object').columns
for col in cat_cols:
    print(f"{col:25s}: {df[col].nunique():3d} unique | {df[col].value_counts().index.tolist()}")
# ═══════════════════════════════════════════════════════════
# STEP 3: Feature Engineering
# ═══════════════════════════════════════════════════════════
# Classify the columns
nominal_cols = ['Gender', 'Contract', 'InternetService', 'PaymentMethod']
ordinal_cols = ['SatisfactionScore', 'TechSupportQuality']
numeric_cols = ['Tenure_Months', 'Monthly_Charges', 'Total_Charges', 'Num_Services']
# Define the ordinal orders
satisfaction_order = ['Very Low', 'Low', 'Medium', 'High', 'Very High']
support_order = ['Poor', 'Average', 'Good', 'Excellent']
# ═══════════════════════════════════════════════════════════
# STEP 4: Build the preprocessor
# ═══════════════════════════════════════════════════════════
preprocessor = ColumnTransformer(transformers=[
    # Nominal → One-Hot Encoding
    ('ohe',
     OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False),
     nominal_cols),
    # Ordinal → Ordinal Encoding (custom order)
    ('ordinal',
     OrdinalEncoder(categories=[satisfaction_order, support_order]),
     ordinal_cols),
    # Numeric → Standard Scaling
    ('scaler', StandardScaler(), numeric_cols)
], verbose_feature_names_out=False)
# ═══════════════════════════════════════════════════════════
# STEP 5: Model Pipelines
# ═══════════════════════════════════════════════════════════
models = {
    'Random Forest': Pipeline([
        ('prep', preprocessor),
        ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
    ]),
    'Gradient Boosting': Pipeline([
        ('prep', preprocessor),
        ('clf', GradientBoostingClassifier(n_estimators=100, random_state=42))
    ])
}
# ═══════════════════════════════════════════════════════════
# STEP 6: Train & Evaluate
# ═══════════════════════════════════════════════════════════
X = df.drop('Churned', axis=1)
y = df['Churned']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("\n" + "=" * 50)
print("MODEL EVALUATION")
print("=" * 50)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, pipeline in models.items():
    # Cross-validation
    cv_scores = cross_val_score(pipeline, X_train, y_train,
                                cv=cv, scoring='roc_auc')
    # Final fit & test evaluation
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    y_proba = pipeline.predict_proba(X_test)[:, 1]

    print(f"\n{'─'*40}")
    print(f"Model: {name}")
    print(f"CV ROC-AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
    print(f"Test ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
# ═══════════════════════════════════════════════════════════
# STEP 7: Feature Importance
# ═══════════════════════════════════════════════════════════
rf_pipeline = models['Random Forest']

# Feature names after transformation
ohe_features = (rf_pipeline.named_steps['prep']
                .named_transformers_['ohe']
                .get_feature_names_out(nominal_cols))
all_feature_names = list(ohe_features) + ordinal_cols + numeric_cols

importances = rf_pipeline.named_steps['clf'].feature_importances_
feat_imp_df = pd.DataFrame({
    'Feature': all_feature_names,
    'Importance': importances
}).sort_values('Importance', ascending=False)
print("\nTop 10 Feature Importances:")
print(feat_imp_df.head(10).to_string(index=False))
## 8. Edge Cases and Errors {#edge-cases-and-errors}

### 8.1 Unseen Categories at Inference Time

Problem: the training data had "Mumbai", "Delhi", "Pune"; at inference "Kochi" shows up — error!
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

train = pd.DataFrame({'City': ['Mumbai', 'Delhi', 'Pune']})
test = pd.DataFrame({'City': ['Mumbai', 'Kochi']})  # 'Kochi' is unseen!

# ❌ BAD — default behavior
enc_bad = OneHotEncoder()
enc_bad.fit(train[['City']])
try:
    enc_bad.transform(test[['City']])
except ValueError as e:
    print(f"Error: {e}")

# ✅ GOOD — handle_unknown='ignore'
enc_good = OneHotEncoder(handle_unknown='ignore')
enc_good.fit(train[['City']])
result = enc_good.transform(test[['City']]).toarray()
print("Result:\n", result)
# Kochi → all zeros (safely ignored)

# ✅ ALSO GOOD — handle_unknown='infrequent_if_exist'
enc_infreq = OneHotEncoder(
    handle_unknown='infrequent_if_exist',
    min_frequency=2   # categories seen fewer than 2 times → infrequent bucket
)
enc_infreq.fit(train[['City']])
print(enc_infreq.transform(test[['City']]).toarray())
# Unseen 'Kochi' falls into the infrequent bucket instead of crashing
### 8.2 Dummy Variable Trap
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue']})

# ❌ BAD — multicollinearity!
# Red + Blue + Green = 1 always → perfect multicollinearity
bad_ohe = pd.get_dummies(df, columns=['Color'], dtype=int)
print("BAD (3 columns - trap!):\n", bad_ohe)

# ✅ GOOD — drop_first=True
good_ohe = pd.get_dummies(df, columns=['Color'], drop_first=True, dtype=int)
print("\nGOOD (2 columns - no trap):\n", good_ohe)
# Blue (first alphabetically, so dropped) becomes the implicit baseline
### 8.3 Data Leakage in Target Encoding
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'City': ['Mumbai', 'Delhi', 'Pune'] * 100,
    'Target': np.random.choice([0, 1], 300)
})

# ❌ BAD — leakage! Each row's own target feeds into its encoding
leaky_map = df.groupby('City')['Target'].mean()
df['City_TE_LEAKY'] = df['City'].map(leaky_map)

# ✅ GOOD — use leave-one-out or CV-based encoding,
# e.g. sklearn 1.3+ TargetEncoder (it cross-fits internally)
from sklearn.preprocessing import TargetEncoder
enc = TargetEncoder(cv=5, smooth='auto')
df['City_TE_SAFE'] = enc.fit_transform(
    df[['City']], df['Target']
).ravel()  # fit_transform returns a 2-D array
### 8.4 OrdinalEncoder — Unknown Category Error
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

train = pd.DataFrame({'Grade': ['Junior', 'Mid', 'Senior']})
test = pd.DataFrame({'Grade': ['Lead']})  # unseen!

# ❌ Error
enc = OrdinalEncoder(categories=[['Junior', 'Mid', 'Senior']])
enc.fit(train)
try:
    enc.transform(test)
except ValueError as e:
    print(f"Error: {e}")

# ✅ Fix: handle_unknown='use_encoded_value'
enc_safe = OrdinalEncoder(
    categories=[['Junior', 'Mid', 'Senior']],
    handle_unknown='use_encoded_value',
    unknown_value=-1   # unseen categories become -1
)
enc_safe.fit(train)
print(enc_safe.transform(test))  # [[-1.]]
### 8.5 Common Mistakes Summary

| Mistake | Problem | Fix |
|---|---|---|
| Label Encoding with a linear model | False ordinal relationship | Use OHE |
| OHE with drop_first=False | Dummy variable trap | drop_first=True |
| Target Encoding without CV | Data leakage | CV-based encoding / TargetEncoder |
| Ordinal without a custom order | Wrong relationships | Pass the categories parameter |
| No handle_unknown | Production crash | handle_unknown='ignore' / unknown_value |
| Fit on full data, then split | Data leakage | Split first, then fit only on train |
| High cardinality with OHE | Memory explosion | Binary / Frequency / Target encoding |
## 9. Pro Developer Insights {#pro-developer-insights}

### 9.1 Encoding Best Practices Checklist

Pre-Encoding Checklist:
────────────────────────
□ Do the train-test split first, encode afterwards
□ Check cardinality (nunique())
□ Handle missing values BEFORE encoding
□ Define a custom order for ordinal columns
□ ALWAYS set the handle_unknown parameter
□ Save the encoder object (joblib) — you'll need it at inference

Model-wise Encoding Guide:
──────────────────────────
Linear/Logistic Regression → OHE + StandardScaler
Ridge/Lasso                → OHE + StandardScaler
Decision Tree              → Label/Ordinal (OHE optional)
Random Forest              → Label/Ordinal/Frequency
XGBoost/LightGBM           → Label/Ordinal/Target (native categorical support too)
Neural Network             → OHE / Embedding layers
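
The first checklist item is the one that trips people up most, so here is a minimal sketch of the wrong vs right order of operations (synthetic data; TargetEncoder stands in for any supervised encoder):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import TargetEncoder

np.random.seed(42)
df = pd.DataFrame({
    'City': np.random.choice(['Mumbai', 'Delhi', 'Pune'], 200),
    'Target': np.random.choice([0, 1], 200),
})
X, y = df[['City']], df['Target']

# ❌ WRONG — the encoder sees the test rows' targets before the split
X_leaky = TargetEncoder(cv=5).fit_transform(X, y)

# ✅ RIGHT — split first, fit only on train, transform both
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
enc = TargetEncoder(cv=5)
X_train_enc = enc.fit_transform(X_train, y_train)
X_test_enc = enc.transform(X_test)   # transform only — never refit on test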
### 9.2 Memory Optimization
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'Finance'] * 100000,
    'City': ['Mumbai', 'Delhi'] * 150000
})

# ❌ Before optimization
print(f"Before: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# ✅ After: use the category dtype
df['Department'] = df['Department'].astype('category')
df['City'] = df['City'].astype('category')
print(f"After: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# pd.get_dummies is fast with the category dtype too!
ohe = pd.get_dummies(df, columns=['Department', 'City'], dtype=np.int8)
print(f"After OHE: {ohe.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
### 9.3 Reproducibility — Always Set a Random State

from sklearn.preprocessing import TargetEncoder
import numpy as np

# Fix the seed — results stay reproducible
np.random.seed(42)
enc = TargetEncoder(random_state=42)
### 9.4 Production Deployment Pattern

"""
Production pattern:
1. Training time  → fit_transform()
2. Save encoder   → joblib.dump()
3. Inference time → load → transform() only (NEVER fit again!)
"""
import joblib
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Toy training data (stand-in for your real X_train / y_train)
X_train = pd.DataFrame({'City': ['Mumbai', 'Delhi', 'Pune', 'Mumbai']})
y_train = [1, 0, 1, 0]

# Training time
pipeline = Pipeline([
    ('enc', OneHotEncoder(handle_unknown='ignore')),
    ('clf', RandomForestClassifier())
])
pipeline.fit(X_train, y_train)

# Save the complete pipeline (encoder + model)
joblib.dump(pipeline, 'production_model_v1.joblib')

# Inference (use this inside your API endpoint)
loaded = joblib.load('production_model_v1.joblib')
new_data = pd.DataFrame({'City': ['Delhi', 'Kochi']})  # 'Kochi' unseen
prediction = loaded.predict(new_data)
## 10. Comparison {#comparison}

### Encoding Techniques Mega Comparison Table

| Encoding | Cardinality | Model Type | Leakage Risk | Dimensionality | Handles Missing? | Preserves Order? |
|---|---|---|---|---|---|---|
| Label Encoding | Any | Tree-based | None | Same | ❌ (Error) | ❌ (Alphabetical) |
| One-Hot Encoding | Low (<15) | All | None | High (K-1 cols) | ❌ (Error) | ❌ |
| Ordinal Encoding | Any (ordered) | All | None | Same | ❌ (Error) | ✅ Custom |
| Target Encoding | High | All | ⚠️ High | Same | ❌ | ❌ |
| Frequency Encoding | High | Tree-based | None | Same | ✅ (→ 0) | ❌ |
| Binary Encoding | Medium-High | All | None | Low (log2 K) | ❌ | ❌ |
| WOE Encoding | Any | Linear | ⚠️ Medium | Same | ❌ | ❌ |
| Hash Encoding | Very High | All | None | Fixed (preset) | ✅ | ❌ |
| Embedding | Very High | Neural Net | None | Dense low-D | ✅ | ❌ |
### When to Use What — Decision Tree

Got a categorical column?
│
├── Missing values present?
│   └── Yes → Impute first (SimpleImputer)
│
├── Ordinal (order matters)?
│   └── Yes → OrdinalEncoder (define custom categories)
│
├── Cardinality?
│   ├── Low (< 10) → OneHotEncoder
│   │   ├── Linear Model → ✅ OHE
│   │   └── Tree Model → Label Encoding is fine too
│   │
│   ├── Medium (10-50) → BinaryEncoder / FrequencyEncoder
│   │
│   └── High (50+) →
│       ├── Supervised → TargetEncoder (CV-based)
│       ├── Unsupervised → FrequencyEncoder
│       └── Credit/Risk → WOE Encoder
│
└── Model type?
    ├── Linear/Logistic → OHE preferred
    ├── Tree-based → Label/Ordinal/Frequency
    └── Neural Net → Embeddings
## 11. Data Science Perspective {#data-science-perspective}

### 11.1 Feature Importance After Encoding
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'Gender': np.random.choice(['Male', 'Female'], n),
    'Department': np.random.choice(['IT', 'HR', 'Finance', 'Sales'], n),
    'Education': np.random.choice(['School', 'Graduate', 'Post-Graduate'], n),
    'Age': np.random.randint(22, 55, n),
    'Attrition': np.random.choice([0, 1], n, p=[0.7, 0.3])
})
X = df.drop('Attrition', axis=1)
y = df['Attrition']

preprocessor = ColumnTransformer([
    ('ohe', OneHotEncoder(drop='first', sparse_output=False), ['Gender', 'Department']),
    ('ord', OrdinalEncoder(categories=[['School', 'Graduate', 'Post-Graduate']]), ['Education']),
], remainder='passthrough')

pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])
pipeline.fit(X, y)

# Feature names
ohe_features = pipeline.named_steps['prep']\
    .named_transformers_['ohe'].get_feature_names_out(['Gender', 'Department'])
all_features = list(ohe_features) + ['Education', 'Age']
importances = pipeline.named_steps['clf'].feature_importances_

feat_df = pd.DataFrame({'Feature': all_features, 'Importance': importances})\
    .sort_values('Importance', ascending=True)

# Plot
feat_df.plot(kind='barh', x='Feature', y='Importance', figsize=(10, 6))
plt.title('Feature Importances After Categorical Encoding')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=150)
plt.show()
print("Feature importance plot saved!")
### 11.2 Encoding Impact on Model Performance

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
import pandas as pd

# Reuse X and y from the 11.1 example; keep only the categorical columns
X_cat = X[['Gender', 'Department', 'Education']]

encoders = {
    'Label/Ordinal': OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
    'One-Hot': OneHotEncoder(handle_unknown='ignore'),
}
models = {
    'Linear': LogisticRegression(max_iter=1000),
    'RF': RandomForestClassifier(n_estimators=100, random_state=42),
}

# Build a different preprocessor per strategy and compare by CV ROC-AUC
for enc_name, enc in encoders.items():
    for model_name, model in models.items():
        pipe = Pipeline([('enc', enc), ('clf', model)])
        auc = cross_val_score(pipe, X_cat, y, cv=5, scoring='roc_auc').mean()
        print(f"{enc_name:13s} + {model_name:6s} → ROC-AUC: {auc:.4f}")

# Representative results on a real dataset (illustrative numbers):
results_df = pd.DataFrame({
    'Encoding': ['Label Only', 'OHE', 'Ordinal', 'Target', 'Frequency'],
    'Linear_AUC': [0.61, 0.74, 0.72, 0.76, 0.65],
    'RF_AUC': [0.79, 0.78, 0.80, 0.82, 0.78],
    'XGB_AUC': [0.81, 0.80, 0.81, 0.84, 0.80]
})
print("\nEncoding Performance Comparison:")
print(results_df.to_string(index=False))
### 11.3 Embeddings — for Neural Networks (Keras)

"""
For high-cardinality categories (City: 500+ unique values),
use Embedding layers inside a neural network.
"""
import numpy as np

# TensorFlow/Keras example (conceptual)
# Assume City has 500 unique values
n_cities = 500
embedding_dim = 10  # 500 categories → 10-dimensional dense vector

# Keras model (pseudo-code)
"""
from tensorflow import keras

city_input = keras.Input(shape=(1,), name='city')
city_emb = keras.layers.Embedding(
    input_dim=n_cities + 1,   # +1 for unknown
    output_dim=embedding_dim,
    name='city_embedding'
)(city_input)
city_flat = keras.layers.Flatten()(city_emb)

# Combine with other features
numeric_input = keras.Input(shape=(5,), name='numeric')
combined = keras.layers.Concatenate()([city_flat, numeric_input])
output = keras.layers.Dense(1, activation='sigmoid')(combined)

model = keras.Model(inputs=[city_input, numeric_input], outputs=output)
"""
print("Embedding approach: 500 cities → 10-dim dense vectors")
print("Memory: 500 * 10 = 5000 params (vs OHE: 500 columns)")
## 12. Interview Questions {#interview-questions}

### Basic Level

Q1: What is the difference between One-Hot Encoding and Label Encoding?
Label Encoding assigns integers in a single column (arbitrary order). OHE creates a new binary column per category. For linear models OHE is better because it doesn't create a false ordinal relationship.

Q2: What is the Dummy Variable Trap?
When OHE keeps K columns for K categories (drop_first=False), each column becomes a linear combination of the others — perfect multicollinearity. Fix: drop_first=True.

Q3: When do you use Ordinal Encoding?
When the categories have a natural order — Education (School < Graduate < Post-Graduate), Rating (Low < Medium < High). Always define the custom categories parameter.

Q4: What is cardinality?
The number of unique values in a column. Low (<10), Medium (10-50), High (50+). With high cardinality, OHE causes a memory explosion.
### Intermediate Level

Q5: How does data leakage happen in Target Encoding?
When a category's mean is computed using each row's own target value — the model effectively consumes information it won't have at prediction time. Fix: cross-validation-based encoding or Leave-One-Out.

Q6: How do you handle unseen categories in production?
handle_unknown='ignore' (an all-zero vector in OHE), or unknown_value=-1 (in OrdinalEncoder). Save the encoder at training time; at inference call transform() only.

Q7: What is the best approach for a high-cardinality column?
Supervised task → TargetEncoder (CV-based) or CatBoost Encoding. Unsupervised → FrequencyEncoder or BinaryEncoder. Neural network → Embedding layers.

Q8: Why do encoding inside sklearn.pipeline.Pipeline?
It prevents data leakage — fit() happens only on training data and transform() on test data, and cross_val_score handles this correctly as well.
### Advanced Level

Q9: What is the difference between CatBoost Encoding and Target Encoding?
CatBoost Encoding uses an ordered/temporal scheme — each sample's encoding uses only the mean of the samples before it, so leakage is zero. Target Encoding uses the mean of all samples (leakage risk unless you use CV).

Q10: When do you use WOE Encoding, and what is the formula?
In binary classification, especially credit scoring.
WOE = ln(Distribution of Events / Distribution of Non-Events)
High WOE = the category has a high event (default) probability.

Q11: Embedding vs OHE — what are the tradeoffs?
OHE: sparse, high-dimensional, no relationships learned. Embedding: dense, low-dimensional, can learn semantic relationships. 500 cities with OHE → 499 columns; with Embedding → 10-20 dimensions.

Q12: How do you build a custom sklearn transformer?
Inherit from BaseEstimator + TransformerMixin. Learn statistics in fit(), apply them in transform(). Pipeline compatibility comes for free.
### Scenario-Based

Q13: You have a City column with 800 unique values and the model is XGBoost. What's your approach?
- Group rare categories (frequency < 1% → 'Other')
- Use FrequencyEncoder or CV-based Target Encoding
- Avoid OHE for XGBoost — 800 sparse columns are inefficient
- Save the encoder with joblib for production

Q14: Training had "Pune", production sends "Nashik" — what happens, and how do you handle it?
Without handling → ValueError. Fix: handle_unknown='ignore' (OHE) or .fillna(0) in a FrequencyEncoder. Also set up monitoring — track the unknown-category rate (if it goes above ~10%, consider retraining the model).
## 13. Conclusion {#conclusion}

### Key Learnings Summary

Categorical Encoding — Golden Rules:

1. Split first, encode later → prevent data leakage
2. Check cardinality → choose the right encoding
3. Model type matters → Linear → OHE, Tree → Label/Freq
4. Define the ordinal order → custom categories parameter
5. ALWAYS set handle_unknown → avoid production crashes
6. Use CV in Target Encoding → leakage-free encoding
7. Save the encoder (joblib) → reuse it at inference
8. Use Pipelines → clean, reproducible, leakage-free
9. High cardinality → Binary/Freq/TE → optimize memory
10. Group rare categories → reduce noise
### Quick Decision Guide

| Situation | Recommended Encoding |
|---|---|
| Gender, binary column | Label Encoding / OHE (1 col) |
| Department (3-10 values) + linear model | One-Hot Encoding |
| Education, Grade (ordered) | Ordinal Encoding |
| City (50+ values) + tree model | Frequency / Binary Encoding |
| City (50+ values) + linear model | Target Encoding (CV) |
| Credit scoring | WOE Encoding |
| Neural network + high cardinality | Embedding Layers |
| Production API | Full Pipeline + joblib |
### Final Advice

Encoding is not just a technical step — it is the core of feature engineering. Bad encoding can ruin even a good model, and the right encoding can make even a simple model perform remarkably well.

Start with OrdinalEncoder and OneHotEncoder. When you face high cardinality, try TargetEncoder (sklearn 1.3+). For production, always use Pipeline + joblib. And remember — fit only on train data, transform on both.

Happy Encoding! 🚀 If you have any doubts, drop a comment below. More practical ML guides are coming on HelpInData.
© HelpInData | helpindata.web.app | Written for Indian Developers — Beginner to Pro