
sklearn.preprocessing Complete Guide: Data Scaling, Encoding & Transformation in Python


Master every tool in scikit-learn's preprocessing arsenal, from beginner basics all the way to production-ready ML pipelines, in one place.


📋 Table of Contents

  1. Introduction

  2. Basic Concepts

  3. All Functions and Features

  4. Intermediate Usage

  5. Advanced Concepts

  6. Real-World Use Cases

  7. Practical Examples

  8. Edge Cases and Errors

  9. Pro Developer Insights

  10. Comparison

  11. Data Science Perspective

  12. Interview Questions

  13. Conclusion


1. Introduction

What exactly is sklearn.preprocessing?

Think Machine Learning is just about fitting a model? Wrong! The real work is getting the data ready for the model, and that is exactly what sklearn.preprocessing does.

sklearn.preprocessing is a Python module that ships as part of the scikit-learn library. Its job is to turn raw, messy, unscaled data into clean, transformed, normalized input for ML models.

Why does it matter?

Real data is never perfect:

  • A salary column has values ranging from 50,000 to 1,00,00,000

  • A city column contains strings like "Mumbai", "Delhi", "Bangalore"

  • Some columns contain NaN values

  • Some features are heavily skewed

If you feed this data straight into an ML model, the model:

  • Becomes biased toward high-magnitude columns (like salary)

  • Fails to process string categories at all

  • Converges slowly under gradient descent

sklearn.preprocessing solves all of these problems.
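
To see the scale problem concretely, here is a minimal sketch (with made-up numbers) of how an unscaled salary column dominates the Euclidean distance that KNN and K-Means rely on:

Python
import numpy as np

# Two customers as (age, salary): ages differ a lot, salaries only slightly
a = np.array([25, 50_000])
b = np.array([55, 51_000])

# Unscaled: the salary gap (1000) swamps the age gap (30)
print(np.linalg.norm(a - b))  # ~1000.45, the distance is essentially just salary

# After scaling both features to comparable ranges, age matters again
a_s = np.array([25 / 60, 50_000 / 100_000])
b_s = np.array([55 / 60, 51_000 / 100_000])
print(np.linalg.norm(a_s - b_s))  # ~0.50, now dominated by the age gap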

Real-World Usage

| Industry     | Problem               | Preprocessing Used                    |
|--------------|-----------------------|---------------------------------------|
| Banking      | Credit scoring        | StandardScaler, LabelEncoder          |
| E-commerce   | Recommendation system | OneHotEncoder, MinMaxScaler           |
| Healthcare   | Disease prediction    | RobustScaler, Normalizer              |
| NLP          | Text classification   | Binarizer, LabelBinarizer             |
| Finance      | Fraud detection       | QuantileTransformer, PowerTransformer |
| HR Analytics | Employee churn        | OrdinalEncoder, PolynomialFeatures    |


2. Basic Concepts

2.1 What Is Preprocessing?

Preprocessing means converting data into a usable form, like this:

Text
Raw Data → Clean Data → Transformed Data → ML Model

2.2 Core Problem Areas

These are the seven core problem areas (a two-line taste of the first two follows right after the list):

Text
1. Feature Scaling      → Equalize the ranges of numeric features
2. Encoding             → Convert strings into numbers
3. Normalization        → Bring rows to unit norm
4. Transformation       → Change the shape of a distribution
5. Binarization         → Convert continuous values into 0/1
6. Imputation           → Fill missing values (SimpleImputer)
7. Polynomial Features  → Generate new features
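
A minimal sketch of the first two areas (scaling and encoding) on toy data, so the pattern is visible before we go tool by tool:

Python
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

ages = np.array([[25], [35], [45]])
cities = np.array([['Mumbai'], ['Delhi'], ['Mumbai']])

# Scaling: numeric column to mean 0, std 1
print(StandardScaler().fit_transform(ages).ravel())
# [-1.22474487  0.          1.22474487]

# Encoding: string column to 0/1 indicator columns
print(OneHotEncoder(sparse_output=False).fit_transform(cities))
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]]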

2.3 Basic Import

Python
# Import the whole module
from sklearn import preprocessing

# Or import specific classes (recommended)
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    MaxAbsScaler,
    RobustScaler,
    Normalizer,
    Binarizer,
    LabelEncoder,
    OrdinalEncoder,
    OneHotEncoder,
    LabelBinarizer,
    MultiLabelBinarizer,
    PolynomialFeatures,
    FunctionTransformer,
    PowerTransformer,
    QuantileTransformer,
    KBinsDiscretizer,
)

2.4 The Fit-Transform Pattern (Most Important!)

Every sklearn preprocessor follows the same pattern:

Python
# Step 1: Create the object
scaler = StandardScaler()

# Step 2: Fit on the training data (this is where it learns statistics)
scaler.fit(X_train)

# Step 3: Transform
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)  # Use the same scaler!

# Or fit + transform in one call (on training data only)
X_train_scaled = scaler.fit_transform(X_train)

⚠️ Golden Rule: call fit() only on X_train. On X_test, call only transform(). No information from the test data may leak into training; that mistake is called Data Leakage.


3. All Functions and Features

3.1 Feature Scaling (Numerical Data)

🔹 StandardScaler — Z-Score Normalization

Formula: z = (x - mean) / std_dev

Result: Mean = 0, Std Dev = 1

Python
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[1000, 2], [2000, 4], [3000, 6], [4000, 8]])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

print("Original:\n", data)
print("\nScaled:\n", scaled)
print("\nMean:", scaler.mean_)
print("Std:", scaler.scale_)
Text
Original:
 [[1000    2]
  [2000    4]
  [3000    6]
  [4000    8]]

Scaled:
 [[-1.34164079 -1.34164079]
  [-0.4472136  -0.4472136 ]
  [ 0.4472136   0.4472136 ]
  [ 1.34164079  1.34164079]]

Mean: [2500.    5.]
Std:  [1118.03398875    2.23606798]

When to use:

  • Linear Regression, Logistic Regression, SVM, Neural Networks

  • When the data roughly follows a Gaussian distribution

  • When there are no significant outliers (it is sensitive to them)


🔹 MinMaxScaler — Range Scaling

Formula: x_scaled = (x - x_min) / (x_max - x_min)

Result: values in the [0, 1] range (or a custom range)

Python
from sklearn.preprocessing import MinMaxScaler

data = np.array([[100], [200], [300], [400], [500]])

scaler = MinMaxScaler(feature_range=(0, 1))  # Default
scaled = scaler.fit_transform(data)
print(scaled)
# [[0.  ], [0.25], [0.5 ], [0.75], [1.  ]]

# Custom range: -1 to 1
scaler2 = MinMaxScaler(feature_range=(-1, 1))
scaled2 = scaler2.fit_transform(data)
print(scaled2)
# [[-1. ], [-0.5], [0.  ], [0.5 ], [1.  ]]

When to use:

  • Neural networks (especially image pixel values, 0-255 → 0-1)

  • KNN, K-Means

  • When you need an exact [0, 1] range

  • When there are no significant outliers


🔹 MaxAbsScaler — Absolute Maximum Scaling

Formula: x_scaled = x / |x_max|

Result: values in the [-1, 1] range (the sign is preserved)

Python
from sklearn.preprocessing import MaxAbsScaler

data = np.array([[-10], [-5], [0], [5], [10]])

scaler = MaxAbsScaler()
scaled = scaler.fit_transform(data)
print(scaled)
# [[-1. ], [-0.5], [0. ], [0.5], [1. ]]

When to use:

  • Sparse data (mostly zeros): the zeros stay untouched

  • Text/NLP TF-IDF matrices

  • Data that is already centered (zero mean)


🔹 RobustScaler — Outlier-Resistant Scaling

Formula: x_scaled = (x - median) / IQR

IQR = Q3 - Q1 (Interquartile Range)

Python
from sklearn.preprocessing import RobustScaler

# Data with outliers
data = np.array([[1], [2], [3], [4], [100]])  # 100 is an outlier

scaler = RobustScaler()
scaled = scaler.fit_transform(data)
print(scaled)
# [[-1.  ], [-0.5 ], [0.   ], [0.5  ], [48.5 ]]

print("Median:", scaler.center_)   # [3.]
print("IQR:", scaler.scale_)       # [2.]

When to use:

  • The data contains outliers

  • Financial data (salaries, transactions)

  • Medical data

  • Real-world messy data


🔹 Normalizer — Row-wise Normalization

It normalizes rows (not columns!): it brings all the values of one sample onto a unit norm.

Python
from sklearn.preprocessing import Normalizer

data = np.array([[3, 4], [1, 2], [6, 8]])

# L2 norm (default): sqrt(sum of squares) = 1
norm_l2 = Normalizer(norm='l2')
print(norm_l2.fit_transform(data))
# [[0.6, 0.8], [0.447, 0.894], [0.6, 0.8]]

# L1 norm: sum of absolute values = 1
norm_l1 = Normalizer(norm='l1')
print(norm_l1.fit_transform(data))
# [[0.429, 0.571], [0.333, 0.667], [0.429, 0.571]]

# Max norm: max value = 1
norm_max = Normalizer(norm='max')
print(norm_max.fit_transform(data))
# [[0.75, 1.0], [0.5, 1.0], [0.75, 1.0]]

When to use:

  • Text classification (TF-IDF vectors)

  • When you need to compute cosine similarity

  • When each sample is independent


3.2 Encoding (Categorical Data)

🔹 LabelEncoder — Integer Encoding

Converts the categorical values of a single column into integers.

Python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# String labels
cities = ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai', 'Delhi']
encoded = le.fit_transform(cities)
print(encoded)       # [2 1 0 2 1]
print(le.classes_)   # ['Bangalore' 'Delhi' 'Mumbai']

# Inverse transform
print(le.inverse_transform([2, 1, 0]))  # ['Mumbai' 'Delhi' 'Bangalore']

When to use:

  • For encoding the target variable (y) ✅

  • For features of tree-based models (Random Forest, XGBoost) ✅

  • Do NOT use it for features of linear models ❌ (it implies an ordinal relationship; see the sketch below)
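
A quick sketch of why that matters: the integer codes imply Bangalore < Delhi < Mumbai, an ordering that city names do not have:

Python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Mumbai', 'Delhi', 'Bangalore', 'Delhi'])
print(codes)        # [2 1 0 1]
print(le.classes_)  # ['Bangalore' 'Delhi' 'Mumbai']
# To a linear model, code 2 (Mumbai) is literally "twice" code 1 (Delhi),
# a magnitude relationship that means nothing for city names.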


🔹 OrdinalEncoder — Ordered Encoding (Multiple Columns)

Encodes multiple columns at once, preserving a specified order.

Python
from sklearn.preprocessing import OrdinalEncoder

data = np.array([
    ['Low', 'Small'],
    ['Medium', 'Medium'],
    ['High', 'Large'],
    ['Low', 'Large']
])

encoder = OrdinalEncoder(
    categories=[
        ['Low', 'Medium', 'High'],    # Column 1 order
        ['Small', 'Medium', 'Large']   # Column 2 order
    ]
)

encoded = encoder.fit_transform(data)
print(encoded)
# [[0. 0.]
#  [1. 1.]
#  [2. 2.]
#  [0. 2.]]

When to use:

  • Education: 'School' < 'Bachelors' < 'Masters' < 'PhD'

  • Size: 'Small' < 'Medium' < 'Large'

  • Rating: 'Poor' < 'Average' < 'Good' < 'Excellent'


🔹 OneHotEncoder — Dummy Variable Encoding

Converts categorical values into binary columns (dummy variables).

Python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

data = np.array([['Mumbai'], ['Delhi'], ['Bangalore'], ['Mumbai']])

ohe = OneHotEncoder(
    sparse_output=False,    # return a dense numpy array
    drop='first',           # avoid the dummy variable trap
    handle_unknown='ignore' # ignore unseen categories
)

encoded = ohe.fit_transform(data)
print(encoded)
# [[0. 1.]    <- Mumbai
#  [1. 0.]    <- Delhi  
#  [0. 0.]    <- Bangalore (dropped - first)
#  [0. 1.]]   <- Mumbai

print(ohe.categories_)
print(ohe.get_feature_names_out(['city']))
# ['city_Delhi' 'city_Mumbai']

Key Parameters:

| Parameter      | Default | Description                       |
|----------------|---------|-----------------------------------|
| sparse_output  | True    | Return a sparse matrix            |
| drop           | None    | 'first' or 'if_binary'            |
| handle_unknown | 'error' | 'ignore' or 'infrequent_if_exist' |
| min_frequency  | None    | Group rare categories             |
| max_categories | None    | Cap the number of categories      |

When to use:

  • Categorical features for linear models ✅

  • Neural networks ✅

  • When the categories have no inherent order ✅


🔹 LabelBinarizer — Binary Matrix for Single Column

Python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
cities = ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai']

encoded = lb.fit_transform(cities)
print(encoded)
# [[0 0 1]   <- Mumbai
#  [0 1 0]   <- Delhi
#  [1 0 0]   <- Bangalore
#  [0 0 1]]  <- Mumbai

print(lb.classes_)  # ['Bangalore' 'Delhi' 'Mumbai']

🔹 MultiLabelBinarizer — Multiple Labels per Sample

Python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# One user likes multiple genres
genres = [
    ['Action', 'Drama'],
    ['Comedy'],
    ['Action', 'Comedy', 'Romance'],
    ['Drama', 'Romance']
]

encoded = mlb.fit_transform(genres)
print(mlb.classes_)  # ['Action' 'Comedy' 'Drama' 'Romance']
print(encoded)
# [[1 0 1 0]
#  [0 1 0 0]
#  [1 1 0 1]
#  [0 0 1 1]]

3.3 Binarization

🔹 Binarizer — Threshold-based 0/1

Python
from sklearn.preprocessing import Binarizer

data = np.array([[0.1, 0.5, 0.9], [0.2, 0.6, 0.4]])

# Default threshold = 0.0
binarizer = Binarizer(threshold=0.5)
result = binarizer.fit_transform(data)
print(result)
# [[0 0 1]
#  [0 1 0]]

When to use:

  • Converting probability scores into binary predictions

  • Image thresholding

  • Feature engineering


3.4 Distribution Transformation

🔹 PowerTransformer — Making the Distribution Gaussian

Python
from sklearn.preprocessing import PowerTransformer

data = np.array([[1], [2], [3], [4], [100], [1000]])  # Skewed data

# Yeo-Johnson: handles both positive and negative values
pt_yj = PowerTransformer(method='yeo-johnson')
transformed = pt_yj.fit_transform(data)
print("Yeo-Johnson:\n", transformed)

# Box-Cox: strictly positive values only
pt_bc = PowerTransformer(method='box-cox')
transformed_bc = pt_bc.fit_transform(data)
print("Box-Cox lambda:", pt_bc.lambdas_)

When to use:

  • Highly skewed numerical data (income, population)

  • Linear regression, which assumes normality

  • Before t-tests, ANOVA


🔹 QuantileTransformer — Percentile-based Transform

Python
from sklearn.preprocessing import QuantileTransformer

data = np.array([[1], [2], [3], [100], [1000], [50000]])

# Uniform distribution (0 to 1)
qt_uniform = QuantileTransformer(
    n_quantiles=6,  # must not exceed n_samples (here 6), or sklearn warns and clips it
    output_distribution='uniform',
    random_state=42
)

# Normal distribution
qt_normal = QuantileTransformer(
    n_quantiles=6,
    output_distribution='normal',
    random_state=42
)

print("Uniform:\n", qt_uniform.fit_transform(data))
print("Normal:\n", qt_normal.fit_transform(data))

Advantage: it is the most robust option against outliers; extreme values get pulled in toward the median.


🔹 FunctionTransformer — Custom Transformation

Python
from sklearn.preprocessing import FunctionTransformer

# Log transformation
log_transformer = FunctionTransformer(
    func=np.log1p,           # log(x + 1)
    inverse_func=np.expm1    # inverse: e^x - 1
)

data = np.array([[0], [1], [10], [100], [1000]])
transformed = log_transformer.fit_transform(data)
print(transformed)
# [[0.   ]
#  [0.693]
#  [2.398]
#  [4.615]
#  [6.909]]

# You can use any custom function
sqrt_transformer = FunctionTransformer(np.sqrt)

3.5 Discretization

🔹 KBinsDiscretizer — Continuous to Categorical

Python
from sklearn.preprocessing import KBinsDiscretizer

data = np.array([[0], [10], [20], [30], [40], [50], [100]])

# Equal-width bins
kbd_width = KBinsDiscretizer(
    n_bins=3,
    encode='ordinal',     # 'onehot', 'onehot-dense', 'ordinal'
    strategy='uniform'    # 'uniform', 'quantile', 'kmeans'
)

print("Uniform:\n", kbd_width.fit_transform(data))

# Equal-frequency bins
kbd_freq = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
print("Quantile:\n", kbd_freq.fit_transform(data))

print("Bin edges:", kbd_width.bin_edges_)

3.6 Polynomial Features

🔹 PolynomialFeatures — Non-linear Feature Engineering

Python
from sklearn.preprocessing import PolynomialFeatures

data = np.array([[2, 3]])  # 2 features: a=2, b=3

# Degree 2: 1, a, b, a^2, ab, b^2
poly = PolynomialFeatures(degree=2, include_bias=True)
result = poly.fit_transform(data)
print(result)
# [[ 1.  2.  3.  4.  6.  9.]]

print(poly.get_feature_names_out(['a', 'b']))
# ['1' 'a' 'b' 'a^2' 'a b' 'b^2']

# Interaction only (no powers)
poly_interact = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
result2 = poly_interact.fit_transform(data)
print(result2)
# [[2. 3. 6.]]  -> a, b, a*b

4. Intermediate Usage

4.1 Integration with Pipeline

In real projects, preprocessors are used inside a Pipeline:

Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
import pandas as pd

# Sample dataset
df = pd.DataFrame({
    'age':    [25, 35, 45, 28, 55],
    'salary': [30000, 60000, 90000, 40000, 120000],
    'city':   ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai', 'Delhi'],
    'gender': ['M', 'F', 'M', 'F', 'M'],
    'churn':  [0, 0, 1, 0, 1]
})

X = df.drop('churn', axis=1)
y = df['churn']

# Separate the numeric and categorical columns
numeric_features = ['age', 'salary']
categorical_features = ['city', 'gender']

# Column-specific preprocessors
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

# ColumnTransformer: different transformers for different columns
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Final pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# With a train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)
print("Accuracy:", pipeline.score(X_test, y_test))

4.2 ColumnTransformer — Handling Mixed Data Types

Python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

# Complex real-world scenario
preprocessor = ColumnTransformer(
    transformers=[
        ('scale_num', StandardScaler(), ['age', 'income', 'credit_score']),
        ('encode_cat', OneHotEncoder(handle_unknown='ignore'), ['city', 'occupation']),
        ('encode_ord', OrdinalEncoder(categories=[['Low', 'Medium', 'High']]), ['risk_level']),
    ],
    remainder='passthrough'  # Keep the remaining columns as-is
)

# Extract the feature names
X_transformed = preprocessor.fit_transform(X_train)
feature_names = preprocessor.get_feature_names_out()
print(feature_names)

4.3 Inverse Transform — Recovering the Original Data

Python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data = np.array([[1000, 2], [2000, 4], [3000, 6]])

scaled = scaler.fit_transform(data)
print("Scaled:\n", scaled)

# Convert back to the original values
original = scaler.inverse_transform(scaled)
print("Recovered:\n", original)
# Same as original data

4.4 Partial Fit — Incremental Learning (Big Data)

For when the data does not fit in memory all at once:

Python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit batch by batch
for batch in data_batches:  # data_batches: a generator or list of chunks
    scaler.partial_fit(batch)

# Now transform
X_scaled = scaler.transform(X_test)

StandardScaler, MinMaxScaler, and MaxAbsScaler support partial_fit.


4.5 Correct Usage with Cross-Validation

Python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# ✅ CORRECT: keep the scaler inside the Pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

scores = cross_val_score(pipe, X, y, cv=5)
# Automatically: fit on each train fold, transform on each validation fold


# ❌ WRONG: scaling first, then cross-validating
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Data leakage!
scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

5. Advanced Concepts

5.1 Building a Custom Transformer

Inherit from BaseEstimator and TransformerMixin to build your own preprocessor:

Python
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class LogTransformer(BaseEstimator, TransformerMixin):
    """Custom log(1+x) transformer with clipping."""
    
    def __init__(self, clip_min=0.0):
        self.clip_min = clip_min
    
    def fit(self, X, y=None):
        # Learn any needed statistics here
        self.n_features_in_ = X.shape[1] if hasattr(X, 'shape') else 1
        return self  # Always return self
    
    def transform(self, X, y=None):
        X = np.array(X)
        X_clipped = np.clip(X, self.clip_min, None)  # clip values below clip_min
        return np.log1p(X_clipped)
    
    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            return [f'log_feature_{i}' for i in range(self.n_features_in_)]
        return [f'log_{name}' for name in input_features]

# Use in pipeline
pipe = Pipeline([
    ('log', LogTransformer(clip_min=0)),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

5.2 TargetEncoder (sklearn 1.3+)

For high-cardinality categorical data:

Python
from sklearn.preprocessing import TargetEncoder

# The city column has 500+ unique values
te = TargetEncoder(
    target_type='continuous',  # or 'binary', 'multiclass'
    smooth='auto',              # Regularization
    cv=5                        # Cross-validation folds
)

X_train_enc = te.fit_transform(X_train[['city']], y_train)
X_test_enc = te.transform(X_test[['city']])

5.3 Internal Working of StandardScaler

Python
# What StandardScaler does internally:
import numpy as np
class MyStandardScaler:
    def fit(self, X):
        self.mean_ = X.mean(axis=0)     # Column-wise mean
        self.var_  = X.var(axis=0)       # Column-wise variance
        self.scale_ = np.sqrt(self.var_) # Standard deviation
        return self
    
    def transform(self, X):
        return (X - self.mean_) / self.scale_
    
    def inverse_transform(self, X_scaled):
        return X_scaled * self.scale_ + self.mean_

5.4 PolynomialFeatures — Combinatorial Explosion

Python
# Starting from n features, how many does PolynomialFeatures produce?
from sklearn.preprocessing import PolynomialFeatures
from math import comb
import numpy as np

def count_poly_features(n_features, degree, include_bias=True):
    # Monomials of total degree <= degree in n variables: C(n + degree, degree),
    # a count that already includes the constant (bias) term
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1

# n=10 features, degree=2 → 66 features!
# n=10 features, degree=3 → 286 features!
# n=100 features, degree=2 → 5151 features! (curse of dimensionality)

poly = PolynomialFeatures(degree=2)
print(f"100 features → {poly.fit(np.zeros((1, 100))).n_output_features_} features")

5.5 Sparse Matrix Optimization

Python
from scipy.sparse import issparse
from sklearn.preprocessing import MaxAbsScaler

# OneHotEncoder returns a sparse matrix by default
ohe = OneHotEncoder()  # sparse_output=True by default
X_sparse = ohe.fit_transform(categories_data)  # categories_data: your 2D categorical array

print("Is sparse:", issparse(X_sparse))  # True
print("Memory saved:", f"{X_sparse.nnz / X_sparse.size:.2%}")

# MaxAbsScaler works natively with sparse matrices
mas = MaxAbsScaler()
X_sparse_scaled = mas.fit_transform(X_sparse)

5.6 set_output API (sklearn 1.2+)

Python
from sklearn.preprocessing import StandardScaler
import pandas as pd

X = pd.DataFrame({'age': [25, 35, 45], 'salary': [30000, 60000, 90000]})

# We want a DataFrame back (not a numpy array)
scaler = StandardScaler()
scaler.set_output(transform='pandas')  # or 'default' for numpy

result = scaler.fit_transform(X)
print(type(result))   # <class 'pandas.core.frame.DataFrame'>
print(result.columns) # ['age', 'salary']

6. Real-World Use Cases

6.1 Customer Churn Prediction Pipeline

Python
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Telecom churn dataset simulation
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'tenure_months':  np.random.randint(1, 72, n),
    'monthly_charges': np.random.uniform(20, 120, n),
    'total_charges':   np.random.uniform(100, 8000, n),
    'contract':        np.random.choice(['Month-to-month', 'One year', 'Two year'], n),
    'payment_method':  np.random.choice(['Credit card', 'Bank transfer', 'Mailed check'], n),
    'internet_service':np.random.choice(['DSL', 'Fiber optic', 'No'], n),
    'tech_support':    np.random.choice(['Yes', 'No'], n),
    'churn':           np.random.choice([0, 1], n, p=[0.73, 0.27])
})

X = df.drop('churn', axis=1)
y = df['churn']

numeric_cols = ['tenure_months', 'monthly_charges', 'total_charges']
ordinal_cols = ['contract']
nominal_cols = ['payment_method', 'internet_service', 'tech_support']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('ord', OrdinalEncoder(categories=[['Month-to-month', 'One year', 'Two year']]), ordinal_cols),
    ('nom', OneHotEncoder(drop='first', handle_unknown='ignore'), nominal_cols),
])

pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier(random_state=42))
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)
print(f"Churn Prediction Accuracy: {pipeline.score(X_test, y_test):.4f}")

6.2 Property Price Prediction

Python
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

# Property pricing model
# Features: area_sqft, rooms, age, location_score
X = np.array([
    [500, 1, 10, 7],
    [1000, 2, 5, 8],
    [1500, 3, 2, 9],
    [800, 2, 15, 6],
])
y = np.array([25000, 60000, 95000, 40000])

pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])

pipeline.fit(X, y)
new_property = np.array([[1200, 2, 8, 7.5]])
predicted_price = pipeline.predict(new_property)
print(f"Predicted Price: ₹{predicted_price[0]:,.0f}")

6.3 NLP Text Classification Preprocessing

Python
from sklearn.preprocessing import LabelEncoder, normalize
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import issparse

texts = [
    "best product ever",
    "terrible quality waste of money",
    "average product nothing special",
    "loved it will buy again",
    "poor customer service",
]
labels = ['positive', 'negative', 'neutral', 'positive', 'negative']

# Label encoding
le = LabelEncoder()
y = le.fit_transform(labels)  # [1, 0, 2, 1, 0]

# TF-IDF + L2 normalization
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)  # Sparse matrix

# L2 normalize rows
X_normalized = normalize(X, norm='l2')

print("Classes:", le.classes_)
print("Feature shape:", X_normalized.shape)

6.4 Time Series Feature Engineering

Python
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
import pandas as pd
import numpy as np

# Sales data
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=365)
sales = np.random.exponential(scale=1000, size=365) + \
        500 * np.sin(np.linspace(0, 4*np.pi, 365))  # Seasonal pattern

df = pd.DataFrame({'date': dates, 'sales': sales})
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['quarter'] = df['date'].dt.quarter

# Bucket sales into bins (Low/Medium/High)
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
df['sales_bucket'] = discretizer.fit_transform(df[['sales']])

# Scale the continuous features
scaler = StandardScaler()
df['sales_scaled'] = scaler.fit_transform(df[['sales']])

print(df[['date', 'sales', 'sales_scaled', 'sales_bucket']].head(10))

7. Practical Examples

Example 1: Titanic Dataset — Complete Preprocessing

Python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

# Simulate Titanic-style data
titanic = pd.DataFrame({
    'Pclass':   [1, 2, 3, 1, 3, 2, 1, 3],
    'Sex':      ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
    'Age':      [22, 38, np.nan, 35, 28, np.nan, 54, 27],
    'SibSp':    [1, 1, 0, 1, 0, 0, 0, 0],
    'Fare':     [7.25, 71.28, 7.92, 53.1, 8.05, 13.0, 51.86, 7.92],
    'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q', 'S', None],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 1]
})

X = titanic.drop('Survived', axis=1)
y = titanic['Survived']

# Numeric features: Impute + Scale
numeric_features = ['Age', 'Fare', 'SibSp']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # Missing values
    ('scaler', StandardScaler())                     # Scale
])

# Categorical features: Impute + Encode
categorical_features = ['Sex', 'Embarked']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Missing values
    ('encoder', OneHotEncoder(handle_unknown='ignore'))    # Encode
])

# Ordinal feature (Pclass already numeric but ordinal)
ordinal_features = ['Pclass']
# Already numeric, so just scale it

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
    ('ord', StandardScaler(), ordinal_features),
])

# Full pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=100, random_state=42))
])

from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

pipeline.fit(X_train, y_train)

# Cross validation score
cv_scores = cross_val_score(pipeline, X, y, cv=3, scoring='accuracy')

print(f"Train Score: {pipeline.score(X_train, y_train):.4f}")
print(f"Test Score:  {pipeline.score(X_test, y_test):.4f}")
print(f"CV Score:    {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

Step-by-step explanation:

| Step | What it does |
|------|--------------|
| SimpleImputer(strategy='median') | Fill missing Age values with the median |
| StandardScaler() | Z-score scale the numeric features |
| SimpleImputer(strategy='most_frequent') | Fill missing Embarked values with the mode |
| OneHotEncoder(handle_unknown='ignore') | Convert Sex and Embarked into dummy variables |
| ColumnTransformer | Apply different transformers to different columns |
| Pipeline | Chain all steps in sequence |
| cross_val_score | Evaluate with a proper train-validation split |


Example 2: Salary Prediction — Handling Skewed Data

Python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

np.random.seed(42)
n = 500

df = pd.DataFrame({
    'experience_years': np.random.randint(0, 20, n),
    'education':        np.random.choice(['High School', 'Bachelors', 'Masters', 'PhD'], n),
    'department':       np.random.choice(['IT', 'Finance', 'HR', 'Marketing'], n),
    # Salary is log-normally distributed (highly skewed) - typical in real world
    'salary': np.random.lognormal(mean=11, sigma=0.5, size=n)
})

print("Salary statistics before transform:")
print(f"  Skewness: {df['salary'].skew():.3f}")

# Transform the target variable too!
pt = PowerTransformer(method='yeo-johnson')
y = pt.fit_transform(df[['salary']]).ravel()

print(f"  Skewness after transform: {pd.Series(y).skew():.3f}")

X = df.drop('salary', axis=1)

preprocessor = ColumnTransformer([
    ('scale', PowerTransformer(), ['experience_years']),
    ('ordinal', OrdinalEncoder(categories=[
        ['High School', 'Bachelors', 'Masters', 'PhD']
    ]), ['education']),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False), ['department']),
])

pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', LinearRegression())
])

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)
y_pred_transformed = pipeline.predict(X_test)

# Map predictions back to the original salary scale
y_pred_original = pt.inverse_transform(y_pred_transformed.reshape(-1, 1))
y_test_original = pt.inverse_transform(y_test.reshape(-1, 1))

from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test_original, y_pred_original)
print(f"\nMean Absolute Error: ₹{mae:,.0f}")
print(f"R² Score: {pipeline.score(X_test, y_test):.4f}")

Example 3: Multi-class Classification — Complete Workflow

Python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV

# Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Binarize the labels (one column per class), purely to demonstrate LabelBinarizer;
# the pipeline below still trains on the original integer y
lb = LabelBinarizer()
y_binary = lb.fit_transform(y)  # [0,1,2] → [[1,0,0],[0,1,0],[0,0,1]]
print("Classes:", lb.classes_)  # [0, 1, 2]

# Pipeline with hyperparameter tuning
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(probability=True, random_state=42))
])

param_grid = {
    'scaler__with_mean': [True, False],  # Scaler params
    'svm__C': [0.1, 1, 10],             # SVM params
    'svm__kernel': ['rbf', 'linear'],
}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best Params: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")
print(f"Test Score: {grid_search.score(X_test, y_test):.4f}")

8. Edge Cases and Errors

8.1 Data Leakage — The Most Common Mistake

Python
# ❌ WRONG — data leakage!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fitted on the entire dataset

X_train, X_test = train_test_split(X_scaled, ...)  # test data has leaked

# ✅ CORRECT
X_train, X_test = train_test_split(X, ...)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on train only
X_test_scaled  = scaler.transform(X_test)        # transform only on test

8.2 Unseen Categories in OneHotEncoder

Python
# ❌ Error: Unknown category at inference time
ohe = OneHotEncoder()
ohe.fit([['Mumbai'], ['Delhi']])
ohe.transform([['Chennai']])  # ValueError: Found unknown categories

# ✅ FIX: handle_unknown='ignore'
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit([['Mumbai'], ['Delhi']])
result = ohe.transform([['Chennai']])  # Silently becomes [0, 0]

# ✅ FIX: handle_unknown='infrequent_if_exist'
ohe = OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=2)

8.3 Negative Values with Box-Cox

Python
# ❌ Error: Box-Cox needs strictly positive values
pt = PowerTransformer(method='box-cox')
data_with_zeros = np.array([[0], [1], [2], [-1]])
pt.fit_transform(data_with_zeros)  # ValueError!

# ✅ FIX 1: use Yeo-Johnson instead
pt = PowerTransformer(method='yeo-johnson')
pt.fit_transform(data_with_zeros)  # Works!

# ✅ FIX 2: Shift the data before Box-Cox
from sklearn.preprocessing import FunctionTransformer
# Caveat: compute the shift from training data only, to keep train/test consistent
shift_transformer = FunctionTransformer(lambda x: x + abs(x.min()) + 1)

8.4 String Data in Numeric Transformer

Python
# ❌ Error: Strings in numeric column
data = pd.DataFrame({'salary': ['50000', '60000', 'N/A', '70000']})
scaler = StandardScaler()
scaler.fit_transform(data)  # ValueError: could not convert string to float

# ✅ FIX: Clean data first
data['salary'] = pd.to_numeric(data['salary'], errors='coerce')  # 'N/A' → NaN
# Then use SimpleImputer + StandardScaler in pipeline

8.5 Shape Mismatch

Python
# ❌ Common mistake: passing a 1D array
data_1d = np.array([1, 2, 3, 4, 5])
scaler = StandardScaler()
scaler.fit_transform(data_1d)  # ValueError: Expected 2D array

# ✅ FIX: reshape it
scaler.fit_transform(data_1d.reshape(-1, 1))  # (5,) → (5, 1)

# Or, for a pandas Series
scaler.fit_transform(pd.Series([1,2,3]).values.reshape(-1, 1))

8.6 Fitting on Wrong Data

Python
# ❌ WRONG: fitting on the test set
scaler = StandardScaler()
scaler.fit(X_test)  # the test set's statistics are being used!
X_train_scaled = scaler.transform(X_train)

# ✅ CORRECT: fit on the training set
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)  # Same scaler!

8.7 Debugging Toolkit

Python
# Inspect the scaler's learned statistics
scaler = StandardScaler()
scaler.fit(X_train)

print("Mean per feature:", scaler.mean_)
print("Std per feature:", scaler.scale_)
print("Variance:", scaler.var_)
print("N samples seen:", scaler.n_samples_seen_)

# Verify after transforming
X_scaled = scaler.transform(X_train)
print("Scaled mean (should be ~0):", X_scaled.mean(axis=0))
print("Scaled std (should be ~1):",  X_scaled.std(axis=0))

# Inspect the OneHotEncoder's categories
ohe = OneHotEncoder()
ohe.fit(X_cat)  # X_cat: your 2D array of categorical columns
print("Categories:", ohe.categories_)
print("Feature names:", ohe.get_feature_names_out())

9. Pro Developer Insights

9.1 Best Practices

Python
# ✅ 1. ALWAYS use a Pipeline
pipeline = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])

# ✅ 2. Fit only after the train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)

# ✅ 3. Keep DataFrame output via set_output
preprocessor.set_output(transform='pandas')

# ✅ 4. Check hyperparameters with get_params()
print(scaler.get_params())

# ✅ 5. Save the preprocessor (together with the model)
import joblib
joblib.dump(pipeline, 'model_pipeline.pkl')
loaded = joblib.load('model_pipeline.pkl')
# The same transformations are applied automatically at test time

9.2 Performance Tips

Python
# Tip 1: Use sparse matrices when possible
ohe = OneHotEncoder()  # sparse_output=True by default — use it!

# Tip 2: Parallelize where n_jobs is supported (e.g. ColumnTransformer, GridSearchCV)
from sklearn.compose import ColumnTransformer
# ColumnTransformer([...], n_jobs=-1) fits its transformers in parallel

# Tip 3: Partial fit for big data
for chunk in pd.read_csv('big_data.csv', chunksize=10000):
    scaler.partial_fit(chunk[numeric_cols])

# Tip 4: Save memory with float32
import numpy as np
X = X.astype(np.float32)

# Tip 5: DataFrame set_output API (sklearn 1.2+)
from sklearn import set_config
set_config(transform_output='pandas')  # Global setting

9.3 Security Tips

Python
# ❌ Never load untrusted pickle files
import pickle
model = pickle.load(open('untrusted_model.pkl', 'rb'))  # Dangerous!

# ✅ Use joblib and verify the source
import joblib
model = joblib.load('trusted_model.pkl')

# In production: save the preprocessing artifacts too
artifacts = {
    'preprocessor': preprocessor,
    'model': trained_model,
    'feature_names': list(X.columns),
    'version': '1.0.0'
}
joblib.dump(artifacts, 'production_artifacts.pkl')

9.4 Clean Code Patterns

Python
# Pattern 1: Keep constants separate
NUMERIC_FEATURES = ['age', 'salary', 'tenure']
CATEGORICAL_FEATURES = ['department', 'city']
ORDINAL_FEATURES = ['education']
ORDINAL_CATEGORIES = [['High School', 'Bachelors', 'Masters', 'PhD']]

# Pattern 2: Factory function
def build_preprocessor():
    return ColumnTransformer([
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), NUMERIC_FEATURES),
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]), CATEGORICAL_FEATURES),
    ])

# Pattern 3: Feature validation
def validate_features(df, required_features):
    missing = set(required_features) - set(df.columns)
    if missing:
        raise ValueError(f"Missing features: {missing}")

validate_features(X_test, NUMERIC_FEATURES + CATEGORICAL_FEATURES)

10. Comparison

10.1 Scalers Comparison

| Scaler | Formula | Range | Outlier Sensitive | Best For |
|--------|---------|-------|-------------------|----------|
| StandardScaler | (x-μ)/σ | (-∞, +∞) | Yes | Normal dist, SVM, LR |
| MinMaxScaler | (x-min)/(max-min) | [0,1] | Very | Neural nets, KNN |
| MaxAbsScaler | x / \|max\| | [-1,1] | Yes | Sparse data, TF-IDF |
| RobustScaler | (x-median)/IQR | Variable | No | Outlier-heavy data |
| Normalizer | x/‖x‖ | [-1,1] | No | Text, cosine sim |
| QuantileTransformer | Percentile-based | [0,1] or N(0,1) | Least of all | Any distribution |
| PowerTransformer | Yeo-Johnson/Box-Cox | Variable | No | Skewed data |


10.2 Encoders Comparison

| Encoder | Input | Output | Order? | Use Case |
|---------|-------|--------|--------|----------|
| LabelEncoder | 1D | 1D integers | No (arbitrary) | Target variable |
| OrdinalEncoder | 2D | 2D integers | Yes (custom) | Ordered categories |
| OneHotEncoder | 2D | 2D binary sparse | No | Nominal categories |
| LabelBinarizer | 1D | 2D binary | No | Multi-class targets |
| MultiLabelBinarizer | List of lists | 2D binary | No | Multi-label classification |
| TargetEncoder | 2D | 2D float | No | High-cardinality categories |


10.3 sklearn vs Alternatives

| Feature | sklearn.preprocessing | pandas | category_encoders | feature_engine |
|---------|----------------------|--------|-------------------|----------------|
| Pipeline integration | ✅ Native | ❌ Manual | ✅ Fit/transform pattern | ✅ Fit/transform pattern |
| Sparse matrix support | ✅ | ❌ | ⚠️ Partial | ❌ |
| Target encoding | ✅ (v1.3) | Manual mean encoding | ✅ | ✅ |
| Ease of use | Medium | Easy | Medium | Easy |
| Production-ready | ✅ | ⚠️ | ✅ | ✅ |


11. Data Science Perspective

11.1 EDA-Driven Preprocessing Decisions

Python
import pandas as pd
import numpy as np

def preprocessing_recommendations(df):
    """Data dekhke automatically preprocessing suggest karo."""
    
    print("=" * 60)
    print("PREPROCESSING RECOMMENDATIONS")
    print("=" * 60)
    
    for col in df.columns:
        dtype = df[col].dtype
        nunique = df[col].nunique()
        missing_pct = df[col].isna().mean() * 100
        
        print(f"\n📊 Column: {col}")
        print(f"   Type: {dtype}, Unique: {nunique}, Missing: {missing_pct:.1f}%")
        
        if dtype in ['object', 'category']:
            if nunique == 2:
                print("   ✅ Recommended: LabelEncoder (binary)")
            elif nunique <= 10:
                print("   ✅ Recommended: OneHotEncoder (nominal)")
            elif nunique > 10:
                print("   ✅ Recommended: TargetEncoder (high cardinality)")
        
        elif dtype in ['int64', 'float64']:
            skewness = df[col].skew()
            if abs(skewness) > 1:
                print(f"   📈 High skewness: {skewness:.2f}")
                print("   ✅ Recommended: PowerTransformer or QuantileTransformer")
            else:
                print("   ✅ Recommended: StandardScaler")
        
        if missing_pct > 0:
            print(f"   ⚠️  Missing values: Use SimpleImputer")

# Usage
preprocessing_recommendations(df)

11.2 Feature Importance After Preprocessing

Python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd

# Train the pipeline
pipeline.fit(X_train, y_train)

# Extract the feature names
feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()

# Feature importance
rf = pipeline.named_steps['model']
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print(importance_df.head(10))

11.3 Preprocessing in ML Workflow

Text
1. Data Collection
    ↓
2. EDA (Exploratory Data Analysis)
   - Distribution check → Scaler choice
   - Missing values check → Imputer strategy
   - Cardinality check → Encoder choice
   - Correlation analysis
    ↓
3. Feature Engineering
   - PolynomialFeatures
   - KBinsDiscretizer
   - FunctionTransformer
    ↓
4. Preprocessing Pipeline Build
   - ColumnTransformer
   - Pipeline
    ↓
5. Train/Test Split (fit ONLY after this!)
    ↓
6. Model Training + Hyperparameter Tuning
    ↓
7. Evaluation
    ↓
8. Save Pipeline (preprocessor + model)
    ↓
9. Production Deployment

11.4 Distribution Analysis for Scaler Selection

Python
import numpy as np
from scipy import stats

def select_scaler(data):
    """Distribution dekhke best scaler recommend karo."""
    
    # Normality test
    _, p_value = stats.normaltest(data)
    skewness = stats.skew(data)
    kurtosis = stats.kurtosis(data)
    
    # Outlier detection
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    outliers = np.sum((data < q1 - 1.5*iqr) | (data > q3 + 1.5*iqr))
    outlier_pct = outliers / len(data) * 100
    
    print(f"Skewness: {skewness:.3f}")
    print(f"Kurtosis: {kurtosis:.3f}")
    print(f"Outlier %: {outlier_pct:.1f}%")
    print(f"Normality p-value: {p_value:.4f}")
    
    if outlier_pct > 5:
        print("→ Recommendation: RobustScaler")
    elif abs(skewness) > 1:
        print("→ Recommendation: PowerTransformer or QuantileTransformer")
    elif p_value > 0.05:  # Normally distributed
        print("→ Recommendation: StandardScaler")
    else:
        print("→ Recommendation: QuantileTransformer (safe choice)")

# Test
salary_data = np.random.lognormal(10, 1, 1000)
select_scaler(salary_data)

12. Interview Questions

🟢 Basic Level

Q1. What is the difference between StandardScaler and MinMaxScaler?

StandardScaler produces mean=0, std=1 (Z-score). The range is not fixed. It is affected by outliers. Best for normally distributed data.

MinMaxScaler brings values into the [0,1] range. You get a fixed range. It is heavily affected by outliers. Best for neural networks.


Q2. Why should fit_transform() be used only on training data?

If you also call fit() on the test data, the test set's statistics (mean, std, min, max) leak into training. That is Data Leakage. The model scores unrealistically well during development and then fails in production.


Q3. When should you use LabelEncoder vs OneHotEncoder?

  • LabelEncoder: for the target variable (y), or for features of tree-based models

  • OneHotEncoder: for nominal categorical features of linear/distance-based models

  • LabelEncoder on features implies an ordinal relationship, which causes wrong predictions


Q4. What does Normalizer do and when is it used?

Normalizer normalizes rows (not columns). Every sample ends up with magnitude 1. It is used in text classification and cosine similarity calculations.


🟡 Intermediate Level

Q5. How does RobustScaler handle outliers?

RobustScaler uses the median and the IQR (Q3-Q1). Formula: (x - median) / IQR. Outliers barely move the median and IQR (unlike the mean and std), so it works better on outlier-heavy data.


Q6. What is ColumnTransformer for? What does the remainder parameter do?

ColumnTransformer applies different transformers to different columns in one go. remainder='passthrough' keeps the remaining columns as-is; remainder='drop' removes them. The default is 'drop'. A small demo follows below.
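
A minimal sketch of the remainder behavior on toy data (illustrative column names):

Python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'age': [25, 35, 45],
                   'salary': [30000, 60000, 90000],
                   'id': [1, 2, 3]})

# 'id' is not listed in any transformer, so remainder decides its fate
ct_drop = ColumnTransformer([('num', StandardScaler(), ['age', 'salary'])])
ct_pass = ColumnTransformer([('num', StandardScaler(), ['age', 'salary'])],
                            remainder='passthrough')

print(ct_drop.fit_transform(df).shape)  # (3, 2): 'id' dropped (the default)
print(ct_pass.fit_transform(df).shape)  # (3, 3): 'id' kept as-is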


Q7. What is a sparse matrix, and why is MaxAbsScaler preferred for sparse data?

A sparse matrix is mostly zeros (TF-IDF, one-hot encoded data). MaxAbsScaler does not disturb the zero values (zeros stay zero). StandardScaler subtracts the mean, which would turn the sparse matrix dense and waste memory; see the sketch below.
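
A quick sketch of the failure mode, assuming scipy is installed: centering a sparse matrix raises an error, while MaxAbsScaler handles it directly:

Python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler, MaxAbsScaler

X = sparse.csr_matrix(np.array([[0.0, 1.0], [0.0, 0.0], [2.0, 0.0]]))

try:
    StandardScaler().fit_transform(X)  # default with_mean=True cannot center sparse data
except ValueError as e:
    print("StandardScaler:", e)

print(MaxAbsScaler().fit_transform(X).toarray())
# [[0.  1. ]
#  [0.  0. ]
#  [1.  0. ]]   zeros stay exactly zero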


Q8. With degree=2 and n=10 features, how many features does PolynomialFeatures produce?

Formula: C(n+d, d) = C(12, 2) = 66 features (including the bias term). This is the curse of dimensionality: at higher degrees, the feature count grows combinatorially. The one-liner below verifies the count.
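
A quick check of that count against sklearn itself:

Python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

print(comb(12, 2))  # 66
poly = PolynomialFeatures(degree=2).fit(np.zeros((1, 10)))
print(poly.n_output_features_)  # 66, matches C(10 + 2, 2)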


🔴 Advanced Level

Q9. How do you build a custom transformer that works inside a Pipeline?

Inherit from BaseEstimator and TransformerMixin. Learn any statistics in fit() and return self. Do the actual transformation in transform(). Also implement get_feature_names_out() for Pipeline compatibility.


Q10. What is TargetEncoder, and when is it better than Label/OneHot encoding?

TargetEncoder replaces each categorical value with the conditional mean of the target variable. For high-cardinality columns (500+ unique values), OneHotEncoder becomes infeasible. TargetEncoder applies K-fold cross-validation internally to avoid target leakage.


Q11. When do you need partial_fit(), and which preprocessors support it?

When the data does not fit in memory (Big Data), you fit incrementally, batch by batch, with partial_fit(). StandardScaler, MinMaxScaler, and MaxAbsScaler support it. PowerTransformer and QuantileTransformer do not.


Q12. What is the set_output API and what is it good for?

Since sklearn 1.2, set_output(transform='pandas') makes transformers return DataFrames instead of numpy arrays. Column names are preserved, debugging gets easier, and feature names are tracked automatically.


🎯 Scenario-Based

Q13. What happens when an unseen category shows up in production? How do you handle it?

With OneHotEncoder(handle_unknown='ignore'), unseen categories become an all-zero vector. With handle_unknown='infrequent_if_exist', rare and unseen categories are grouped into one "infrequent" bin. Inside a Pipeline, this handling is applied automatically.


Q14. A salary column ranges from 0 to 10 crore and also contains outliers. Which scaler would you use, and why?

RobustScaler is the best choice because:

  1. The range is wide; StandardScaler/MinMaxScaler would be skewed by the outliers

  2. With outliers present, MinMaxScaler compresses the useful range

  3. RobustScaler uses the median/IQR, which are outlier-resistant (see the comparison sketch below)
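
A small comparison sketch on made-up salaries with one extreme value:

Python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Four ordinary salaries plus one 10-crore outlier
salary = np.array([[300_000.0], [500_000.0], [700_000.0],
                   [900_000.0], [100_000_000.0]])

print(MinMaxScaler().fit_transform(salary).ravel())
# The four ordinary salaries get crushed into [0, 0.006]; only the outlier reaches 1

print(RobustScaler().fit_transform(salary).ravel())
# [-1, -0.5, 0, 0.5, ~248]: the ordinary salaries keep a usable spread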


Q15. After the train/test split you also need a validation set, and a preprocessor must be applied. What is the correct sequence?

Python
# Correct sequence:
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)

# Fit only on train!
preprocessor.fit(X_train)
X_train_t = preprocessor.transform(X_train)
X_val_t   = preprocessor.transform(X_val)    # Same scaler!
X_test_t  = preprocessor.transform(X_test)   # Same scaler!

13. Conclusion

Key Learnings Recap

Text
✅ StandardScaler  → Normal data, SVM, LR, Neural Nets
✅ MinMaxScaler    → Image data, KNN, Neural Nets
✅ RobustScaler    → Outlier-heavy data (financial, medical)
✅ MaxAbsScaler    → Sparse data, NLP/TF-IDF
✅ QuantileTransformer → Any distribution, most robust
✅ PowerTransformer → Skewed data, linear model assumptions

✅ LabelEncoder    → Target variable, tree models
✅ OrdinalEncoder  → Ordered categories (size, education)
✅ OneHotEncoder   → Nominal categories, linear/distance models
✅ TargetEncoder   → High-cardinality categories

✅ PolynomialFeatures → Non-linear feature engineering
✅ KBinsDiscretizer   → Continuous → categorical
✅ FunctionTransformer → Custom any function

When to Use What — Decision Tree

Text
Got data → What type?

NUMERIC → Outliers?
   Yes → RobustScaler
   No  → Skewed?
        Yes → PowerTransformer / QuantileTransformer
        No  → StandardScaler (normal) / MinMaxScaler (neural nets)

CATEGORICAL → Is there an order?
   Yes → OrdinalEncoder
   No  → Cardinality?
         High (>50) → TargetEncoder
         Low  (<50) → OneHotEncoder

TARGET VARIABLE → LabelEncoder (multi-class) / as-is (regression)

Final Advice

sklearn.preprocessing is not just about scaling numbers; it is a data engineering toolkit that:

  1. Improves data quality

  2. Dramatically boosts model performance

  3. Speeds up training

  4. Improves generalization

In production, always use a Pipeline: preprocessing and model become a single unit, deployment gets easier, and data leakage is ruled out.

A good ML Engineer understands the data better than the model, and preprocessing mastery is essential for understanding data. 🚀


📌 Pro Tip: "Garbage in, garbage out." In Machine Learning, if the preprocessing is wrong, even the world's best model will fail. Invest time in preprocessing and the results will take care of themselves.


Happy Coding! 🐍

If this post was helpful, share it and tell me in the comments which concept you found most useful!
