
sklearn.preprocessing Complete Guide: Data Scaling, Encoding & Transformation in Python


Master every tool in scikit-learn's preprocessing arsenal, from beginner basics all the way to production-ready ML pipelines, in one place.


📋 Table of Contents

  1. Introduction

  2. Basic Concepts

  3. All Functions and Features

  4. Intermediate Usage

  5. Advanced Concepts

  6. Real-World Use Cases

  7. Practical Examples

  8. Edge Cases and Errors

  9. Pro Developer Insights

  10. Comparison

  11. Data Science Perspective

  12. Interview Questions

  13. Conclusion


1. Introduction

What exactly is sklearn.preprocessing?

Think Machine Learning is just about fitting a model? Wrong! The real work is getting the data ready for the model, and that is exactly what sklearn.preprocessing does.

sklearn.preprocessing is a Python module that ships as part of the scikit-learn library. Its job is to turn raw, messy, unscaled data into clean, transformed, normalized input for ML models.

Why does it matter?

Real data is never perfect:

  • A salary column has values ranging from 50,000 to 1,00,00,000

  • A city column contains strings like "Mumbai", "Delhi", "Bangalore"

  • Some columns contain NaN values

  • Some features are heavily skewed

If you feed this data straight into an ML model, the model:

  • Becomes biased toward high-magnitude columns (like salary)

  • Fails to process string categories at all

  • Converges slowly under gradient descent

sklearn.preprocessing solves all of these problems.
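
To see the scale problem concretely, here is a minimal sketch (with made-up numbers) of how an unscaled salary column dominates the Euclidean distance that KNN and K-Means rely on:

Python
import numpy as np

# Two customers as (age, salary): ages differ a lot, salaries only slightly
a = np.array([25, 50_000])
b = np.array([55, 51_000])

# Unscaled: the salary gap (1000) swamps the age gap (30)
print(np.linalg.norm(a - b))  # ~1000.45, the distance is essentially just salary

# After scaling both features to comparable ranges, age matters again
a_s = np.array([25 / 60, 50_000 / 100_000])
b_s = np.array([55 / 60, 51_000 / 100_000])
print(np.linalg.norm(a_s - b_s))  # ~0.50, now dominated by the age gap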

Real-World Usage

| Industry     | Problem               | Preprocessing Used                    |
|--------------|-----------------------|---------------------------------------|
| Banking      | Credit scoring        | StandardScaler, LabelEncoder          |
| E-commerce   | Recommendation system | OneHotEncoder, MinMaxScaler           |
| Healthcare   | Disease prediction    | RobustScaler, Normalizer              |
| NLP          | Text classification   | Binarizer, LabelBinarizer             |
| Finance      | Fraud detection       | QuantileTransformer, PowerTransformer |
| HR Analytics | Employee churn        | OrdinalEncoder, PolynomialFeatures    |


2. Basic Concepts

2.1 What Is Preprocessing?

Preprocessing means converting data into a usable form, like this:

Text
Raw Data → Clean Data → Transformed Data → ML Model

2.2 Core Problem Areas

These are the seven core problem areas (a two-line taste of the first two follows right after the list):

Text
1. Feature Scaling      → Equalize the ranges of numeric features
2. Encoding             → Convert strings into numbers
3. Normalization        → Bring rows to unit norm
4. Transformation       → Change the shape of a distribution
5. Binarization         → Convert continuous values into 0/1
6. Imputation           → Fill missing values (SimpleImputer)
7. Polynomial Features  → Generate new features
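
A minimal sketch of the first two areas (scaling and encoding) on toy data, so the pattern is visible before we go tool by tool:

Python
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

ages = np.array([[25], [35], [45]])
cities = np.array([['Mumbai'], ['Delhi'], ['Mumbai']])

# Scaling: numeric column to mean 0, std 1
print(StandardScaler().fit_transform(ages).ravel())
# [-1.22474487  0.          1.22474487]

# Encoding: string column to 0/1 indicator columns
print(OneHotEncoder(sparse_output=False).fit_transform(cities))
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]]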

2.3 Basic Import

Python
# Import the whole module
from sklearn import preprocessing

# Or import specific classes (recommended)
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    MaxAbsScaler,
    RobustScaler,
    Normalizer,
    Binarizer,
    LabelEncoder,
    OrdinalEncoder,
    OneHotEncoder,
    LabelBinarizer,
    MultiLabelBinarizer,
    PolynomialFeatures,
    FunctionTransformer,
    PowerTransformer,
    QuantileTransformer,
    KBinsDiscretizer,
)

2.4 The Fit-Transform Pattern (Most Important!)

Every sklearn preprocessor follows the same pattern:

Python
# Step 1: Create the object
scaler = StandardScaler()

# Step 2: Fit on the training data (this is where it learns statistics)
scaler.fit(X_train)

# Step 3: Transform
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)  # Use the same scaler!

# Or fit + transform in one call (on training data only)
X_train_scaled = scaler.fit_transform(X_train)

⚠️ Golden Rule: call fit() only on X_train. On X_test, call only transform(). No information from the test data may leak into training; that mistake is called Data Leakage.


3. All Functions and Features

3.1 Feature Scaling (Numerical Data)

🔹 StandardScaler — Z-Score Normalization

Formula: z = (x - mean) / std_dev

Result: Mean = 0, Std Dev = 1

Python
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[1000, 2], [2000, 4], [3000, 6], [4000, 8]])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

print("Original:\n", data)
print("\nScaled:\n", scaled)
print("\nMean:", scaler.mean_)
print("Std:", scaler.scale_)
Text
Original:
 [[1000    2]
  [2000    4]
  [3000    6]
  [4000    8]]

Scaled:
 [[-1.34164079 -1.34164079]
  [-0.4472136  -0.4472136 ]
  [ 0.4472136   0.4472136 ]
  [ 1.34164079  1.34164079]]

Mean: [2500.    5.]
Std:  [1118.03398875    2.23606798]

When to use:

  • Linear Regression, Logistic Regression, SVM, Neural Networks

  • When the data roughly follows a Gaussian distribution

  • When there are no significant outliers (it is sensitive to them)


🔹 MinMaxScaler — Range Scaling

Formula: x_scaled = (x - x_min) / (x_max - x_min)

Result: values in the [0, 1] range (or a custom range)

Python
from sklearn.preprocessing import MinMaxScaler

data = np.array([[100], [200], [300], [400], [500]])

scaler = MinMaxScaler(feature_range=(0, 1))  # Default
scaled = scaler.fit_transform(data)
print(scaled)
# [[0.  ], [0.25], [0.5 ], [0.75], [1.  ]]

# Custom range: -1 to 1
scaler2 = MinMaxScaler(feature_range=(-1, 1))
scaled2 = scaler2.fit_transform(data)
print(scaled2)
# [[-1. ], [-0.5], [0.  ], [0.5 ], [1.  ]]

When to use:

  • Neural networks (especially image pixel values, 0-255 → 0-1)

  • KNN, K-Means

  • When you need an exact [0, 1] range

  • When there are no significant outliers


🔹 MaxAbsScaler — Absolute Maximum Scaling

Formula: x_scaled = x / |x_max|

Result: values in the [-1, 1] range (the sign is preserved)

Python
from sklearn.preprocessing import MaxAbsScaler

data = np.array([[-10], [-5], [0], [5], [10]])

scaler = MaxAbsScaler()
scaled = scaler.fit_transform(data)
print(scaled)
# [[-1. ], [-0.5], [0. ], [0.5], [1. ]]

When to use:

  • Sparse data (mostly zeros): the zeros stay untouched

  • Text/NLP TF-IDF matrices

  • Data that is already centered (zero mean)


🔹 RobustScaler — Outlier-Resistant Scaling

Formula: x_scaled = (x - median) / IQR

IQR = Q3 - Q1 (Interquartile Range)

Python
from sklearn.preprocessing import RobustScaler

# Data with outliers
data = np.array([[1], [2], [3], [4], [100]])  # 100 is an outlier

scaler = RobustScaler()
scaled = scaler.fit_transform(data)
print(scaled)
# [[-1.  ], [-0.5 ], [0.   ], [0.5  ], [48.5 ]]

print("Median:", scaler.center_)   # [3.]
print("IQR:", scaler.scale_)       # [2.]

When to use:

  • The data contains outliers

  • Financial data (salaries, transactions)

  • Medical data

  • Real-world messy data


🔹 Normalizer — Row-wise Normalization

It normalizes rows (not columns!): it brings all the values of one sample onto a unit norm.

Python
from sklearn.preprocessing import Normalizer

data = np.array([[3, 4], [1, 2], [6, 8]])

# L2 norm (default): sqrt(sum of squares) = 1
norm_l2 = Normalizer(norm='l2')
print(norm_l2.fit_transform(data))
# [[0.6, 0.8], [0.447, 0.894], [0.6, 0.8]]

# L1 norm: sum of absolute values = 1
norm_l1 = Normalizer(norm='l1')
print(norm_l1.fit_transform(data))
# [[0.429, 0.571], [0.333, 0.667], [0.429, 0.571]]

# Max norm: max value = 1
norm_max = Normalizer(norm='max')
print(norm_max.fit_transform(data))
# [[0.75, 1.0], [0.5, 1.0], [0.75, 1.0]]

When to use:

  • Text classification (TF-IDF vectors)

  • When you need to compute cosine similarity

  • When each sample is independent


3.2 Encoding (Categorical Data)

🔹 LabelEncoder — Integer Encoding

Converts the categorical values of a single column into integers.

Python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# String labels
cities = ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai', 'Delhi']
encoded = le.fit_transform(cities)
print(encoded)       # [2 1 0 2 1]
print(le.classes_)   # ['Bangalore' 'Delhi' 'Mumbai']

# Inverse transform
print(le.inverse_transform([2, 1, 0]))  # ['Mumbai' 'Delhi' 'Bangalore']

When to use:

  • For encoding the target variable (y) ✅

  • For features of tree-based models (Random Forest, XGBoost) ✅

  • Do NOT use it for features of linear models ❌ (it implies an ordinal relationship; see the sketch below)
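
A quick sketch of why that matters: the integer codes imply Bangalore < Delhi < Mumbai, an ordering that city names do not have:

Python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Mumbai', 'Delhi', 'Bangalore', 'Delhi'])
print(codes)        # [2 1 0 1]
print(le.classes_)  # ['Bangalore' 'Delhi' 'Mumbai']
# To a linear model, code 2 (Mumbai) is literally "twice" code 1 (Delhi),
# a magnitude relationship that means nothing for city names.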


🔹 OrdinalEncoder — Ordered Encoding (Multiple Columns)

Encodes multiple columns at once, preserving a specified order.

Python
from sklearn.preprocessing import OrdinalEncoder

data = np.array([
    ['Low', 'Small'],
    ['Medium', 'Medium'],
    ['High', 'Large'],
    ['Low', 'Large']
])

encoder = OrdinalEncoder(
    categories=[
        ['Low', 'Medium', 'High'],    # Column 1 order
        ['Small', 'Medium', 'Large']   # Column 2 order
    ]
)

encoded = encoder.fit_transform(data)
print(encoded)
# [[0. 0.]
#  [1. 1.]
#  [2. 2.]
#  [0. 2.]]

When to use:

  • Education: 'School' < 'Bachelors' < 'Masters' < 'PhD'

  • Size: 'Small' < 'Medium' < 'Large'

  • Rating: 'Poor' < 'Average' < 'Good' < 'Excellent'


🔹 OneHotEncoder — Dummy Variable Encoding

Converts categorical values into binary columns (dummy variables).

Python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

data = np.array([['Mumbai'], ['Delhi'], ['Bangalore'], ['Mumbai']])

ohe = OneHotEncoder(
    sparse_output=False,    # return a dense numpy array
    drop='first',           # avoid the dummy variable trap
    handle_unknown='ignore' # ignore unseen categories
)

encoded = ohe.fit_transform(data)
print(encoded)
# [[0. 1.]    <- Mumbai
#  [1. 0.]    <- Delhi  
#  [0. 0.]    <- Bangalore (dropped - first)
#  [0. 1.]]   <- Mumbai

print(ohe.categories_)
print(ohe.get_feature_names_out(['city']))
# ['city_Delhi' 'city_Mumbai']

Key Parameters:

| Parameter      | Default | Description                       |
|----------------|---------|-----------------------------------|
| sparse_output  | True    | Return a sparse matrix            |
| drop           | None    | 'first' or 'if_binary'            |
| handle_unknown | 'error' | 'ignore' or 'infrequent_if_exist' |
| min_frequency  | None    | Group rare categories             |
| max_categories | None    | Cap the number of categories      |

When to use:

  • Categorical features for linear models ✅

  • Neural networks ✅

  • When the categories have no inherent order ✅


🔹 LabelBinarizer — Binary Matrix for Single Column

Python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
cities = ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai']

encoded = lb.fit_transform(cities)
print(encoded)
# [[0 0 1]   <- Mumbai
#  [0 1 0]   <- Delhi
#  [1 0 0]   <- Bangalore
#  [0 0 1]]  <- Mumbai

print(lb.classes_)  # ['Bangalore' 'Delhi' 'Mumbai']

🔹 MultiLabelBinarizer — Multiple Labels per Sample

Python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# One user likes multiple genres
genres = [
    ['Action', 'Drama'],
    ['Comedy'],
    ['Action', 'Comedy', 'Romance'],
    ['Drama', 'Romance']
]

encoded = mlb.fit_transform(genres)
print(mlb.classes_)  # ['Action' 'Comedy' 'Drama' 'Romance']
print(encoded)
# [[1 0 1 0]
#  [0 1 0 0]
#  [1 1 0 1]
#  [0 0 1 1]]

3.3 Binarization

🔹 Binarizer — Threshold-based 0/1

Python
from sklearn.preprocessing import Binarizer

data = np.array([[0.1, 0.5, 0.9], [0.2, 0.6, 0.4]])

# Default threshold = 0.0
binarizer = Binarizer(threshold=0.5)
result = binarizer.fit_transform(data)
print(result)
# [[0 0 1]
#  [0 1 0]]

When to use:

  • Converting probability scores into binary predictions

  • Image thresholding

  • Feature engineering


3.4 Distribution Transformation

🔹 PowerTransformer — Making the Distribution Gaussian

Python
from sklearn.preprocessing import PowerTransformer

data = np.array([[1], [2], [3], [4], [100], [1000]])  # Skewed data

# Yeo-Johnson: handles both positive and negative values
pt_yj = PowerTransformer(method='yeo-johnson')
transformed = pt_yj.fit_transform(data)
print("Yeo-Johnson:\n", transformed)

# Box-Cox: strictly positive values only
pt_bc = PowerTransformer(method='box-cox')
transformed_bc = pt_bc.fit_transform(data)
print("Box-Cox lambda:", pt_bc.lambdas_)

When to use:

  • Highly skewed numerical data (income, population)

  • Linear regression, which assumes normality

  • Before t-tests, ANOVA


🔹 QuantileTransformer — Percentile-based Transform

Python
from sklearn.preprocessing import QuantileTransformer

data = np.array([[1], [2], [3], [100], [1000], [50000]])

# Uniform distribution (0 to 1)
qt_uniform = QuantileTransformer(
    n_quantiles=6,  # must not exceed n_samples (here 6), or sklearn warns and clips it
    output_distribution='uniform',
    random_state=42
)

# Normal distribution
qt_normal = QuantileTransformer(
    n_quantiles=6,
    output_distribution='normal',
    random_state=42
)

print("Uniform:\n", qt_uniform.fit_transform(data))
print("Normal:\n", qt_normal.fit_transform(data))

Advantage: it is the most robust option against outliers; extreme values get pulled in toward the median.


🔹 FunctionTransformer — Custom Transformation

Python
from sklearn.preprocessing import FunctionTransformer

# Log transformation
log_transformer = FunctionTransformer(
    func=np.log1p,           # log(x + 1)
    inverse_func=np.expm1    # inverse: e^x - 1
)

data = np.array([[0], [1], [10], [100], [1000]])
transformed = log_transformer.fit_transform(data)
print(transformed)
# [[0.   ]
#  [0.693]
#  [2.398]
#  [4.615]
#  [6.909]]

# You can use any custom function
sqrt_transformer = FunctionTransformer(np.sqrt)

3.5 Discretization

🔹 KBinsDiscretizer — Continuous to Categorical

Python
from sklearn.preprocessing import KBinsDiscretizer

data = np.array([[0], [10], [20], [30], [40], [50], [100]])

# Equal-width bins
kbd_width = KBinsDiscretizer(
    n_bins=3,
    encode='ordinal',     # 'onehot', 'onehot-dense', 'ordinal'
    strategy='uniform'    # 'uniform', 'quantile', 'kmeans'
)

print("Uniform:\n", kbd_width.fit_transform(data))

# Equal-frequency bins
kbd_freq = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
print("Quantile:\n", kbd_freq.fit_transform(data))

print("Bin edges:", kbd_width.bin_edges_)

3.6 Polynomial Features

🔹 PolynomialFeatures — Non-linear Feature Engineering

Python
from sklearn.preprocessing import PolynomialFeatures

data = np.array([[2, 3]])  # 2 features: a=2, b=3

# Degree 2: 1, a, b, a^2, ab, b^2
poly = PolynomialFeatures(degree=2, include_bias=True)
result = poly.fit_transform(data)
print(result)
# [[ 1.  2.  3.  4.  6.  9.]]

print(poly.get_feature_names_out(['a', 'b']))
# ['1' 'a' 'b' 'a^2' 'a b' 'b^2']

# Interaction only (no powers)
poly_interact = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
result2 = poly_interact.fit_transform(data)
print(result2)
# [[2. 3. 6.]]  -> a, b, a*b

4. Intermediate Usage

4.1 Integration with Pipeline

In real projects, preprocessors are used inside a Pipeline:

Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
import pandas as pd

# Sample dataset
df = pd.DataFrame({
    'age':    [25, 35, 45, 28, 55],
    'salary': [30000, 60000, 90000, 40000, 120000],
    'city':   ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai', 'Delhi'],
    'gender': ['M', 'F', 'M', 'F', 'M'],
    'churn':  [0, 0, 1, 0, 1]
})

X = df.drop('churn', axis=1)
y = df['churn']

# Separate the numeric and categorical columns
numeric_features = ['age', 'salary']
categorical_features = ['city', 'gender']

# Column-specific preprocessors
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

# ColumnTransformer: different transformers for different columns
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Final pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# With a train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)
print("Accuracy:", pipeline.score(X_test, y_test))

4.2 ColumnTransformer — Handling Mixed Data Types

Python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

# Complex real-world scenario
preprocessor = ColumnTransformer(
    transformers=[
        ('scale_num', StandardScaler(), ['age', 'income', 'credit_score']),
        ('encode_cat', OneHotEncoder(handle_unknown='ignore'), ['city', 'occupation']),
        ('encode_ord', OrdinalEncoder(categories=[['Low', 'Medium', 'High']]), ['risk_level']),
    ],
    remainder='passthrough'  # Keep the remaining columns as-is
)

# Extract the feature names
X_transformed = preprocessor.fit_transform(X_train)
feature_names = preprocessor.get_feature_names_out()
print(feature_names)

4.3 Inverse Transform — Recovering the Original Data

Python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data = np.array([[1000, 2], [2000, 4], [3000, 6]])

scaled = scaler.fit_transform(data)
print("Scaled:\n", scaled)

# Convert back to the original values
original = scaler.inverse_transform(scaled)
print("Recovered:\n", original)
# Same as original data

4.4 Partial Fit — Incremental Learning (Big Data)

For when the data does not fit in memory all at once:

Python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit batch by batch
for batch in data_batches:  # data_batches: a generator or list of chunks
    scaler.partial_fit(batch)

# Now transform
X_scaled = scaler.transform(X_test)

StandardScaler, MinMaxScaler, and MaxAbsScaler support partial_fit.


4.5 Correct Usage with Cross-Validation

Python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# ✅ CORRECT: keep the scaler inside the Pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

scores = cross_val_score(pipe, X, y, cv=5)
# Automatically: fit on each train fold, transform on each validation fold


# ❌ WRONG: scaling first, then cross-validating
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Data leakage!
scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

5. Advanced Concepts

5.1 Building a Custom Transformer

Inherit from BaseEstimator and TransformerMixin to build your own preprocessor:

Python
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class LogTransformer(BaseEstimator, TransformerMixin):
    """Custom log(1+x) transformer with clipping."""
    
    def __init__(self, clip_min=0.0):
        self.clip_min = clip_min
    
    def fit(self, X, y=None):
        # Learn any needed statistics here
        self.n_features_in_ = X.shape[1] if hasattr(X, 'shape') else 1
        return self  # Always return self
    
    def transform(self, X, y=None):
        X = np.array(X)
        X_clipped = np.clip(X, self.clip_min, None)  # clip values below clip_min
        return np.log1p(X_clipped)
    
    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            return [f'log_feature_{i}' for i in range(self.n_features_in_)]
        return [f'log_{name}' for name in input_features]

# Use in pipeline
pipe = Pipeline([
    ('log', LogTransformer(clip_min=0)),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

5.2 TargetEncoder (sklearn 1.3+)

For high-cardinality categorical data:

Python
from sklearn.preprocessing import TargetEncoder

# The city column has 500+ unique values
te = TargetEncoder(
    target_type='continuous',  # or 'binary', 'multiclass'
    smooth='auto',              # Regularization
    cv=5                        # Cross-validation folds
)

X_train_enc = te.fit_transform(X_train[['city']], y_train)
X_test_enc = te.transform(X_test[['city']])

5.3 Internal Working of StandardScaler

Python
# What StandardScaler does internally:
import numpy as np
class MyStandardScaler:
    def fit(self, X):
        self.mean_ = X.mean(axis=0)     # Column-wise mean
        self.var_  = X.var(axis=0)       # Column-wise variance
        self.scale_ = np.sqrt(self.var_) # Standard deviation
        return self
    
    def transform(self, X):
        return (X - self.mean_) / self.scale_
    
    def inverse_transform(self, X_scaled):
        return X_scaled * self.scale_ + self.mean_

5.4 PolynomialFeatures — Combinatorial Explosion

Python
# Starting from n features, how many does PolynomialFeatures produce?
from sklearn.preprocessing import PolynomialFeatures
from math import comb
import numpy as np

def count_poly_features(n_features, degree, include_bias=True):
    # Monomials of total degree <= degree in n variables: C(n + degree, degree),
    # a count that already includes the constant (bias) term
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1

# n=10 features, degree=2 → 66 features!
# n=10 features, degree=3 → 286 features!
# n=100 features, degree=2 → 5151 features! (curse of dimensionality)

poly = PolynomialFeatures(degree=2)
print(f"100 features → {poly.fit(np.zeros((1, 100))).n_output_features_} features")

5.5 Sparse Matrix Optimization

Python
from scipy.sparse import issparse
from sklearn.preprocessing import MaxAbsScaler

# OneHotEncoder returns a sparse matrix by default
ohe = OneHotEncoder()  # sparse_output=True by default
X_sparse = ohe.fit_transform(categories_data)  # categories_data: your 2D categorical array

print("Is sparse:", issparse(X_sparse))  # True
print("Memory saved:", f"{X_sparse.nnz / X_sparse.size:.2%}")

# MaxAbsScaler works natively with sparse matrices
mas = MaxAbsScaler()
X_sparse_scaled = mas.fit_transform(X_sparse)

5.6 set_output API (sklearn 1.2+)

Python
from sklearn.preprocessing import StandardScaler
import pandas as pd

X = pd.DataFrame({'age': [25, 35, 45], 'salary': [30000, 60000, 90000]})

# We want a DataFrame back (not a numpy array)
scaler = StandardScaler()
scaler.set_output(transform='pandas')  # or 'default' for numpy

result = scaler.fit_transform(X)
print(type(result))   # <class 'pandas.core.frame.DataFrame'>
print(result.columns) # ['age', 'salary']

6. Real-World Use Cases

6.1 Customer Churn Prediction Pipeline

Python
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Telecom churn dataset simulation
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'tenure_months':  np.random.randint(1, 72, n),
    'monthly_charges': np.random.uniform(20, 120, n),
    'total_charges':   np.random.uniform(100, 8000, n),
    'contract':        np.random.choice(['Month-to-month', 'One year', 'Two year'], n),
    'payment_method':  np.random.choice(['Credit card', 'Bank transfer', 'Mailed check'], n),
    'internet_service':np.random.choice(['DSL', 'Fiber optic', 'No'], n),
    'tech_support':    np.random.choice(['Yes', 'No'], n),
    'churn':           np.random.choice([0, 1], n, p=[0.73, 0.27])
})

X = df.drop('churn', axis=1)
y = df['churn']

numeric_cols = ['tenure_months', 'monthly_charges', 'total_charges']
ordinal_cols = ['contract']
nominal_cols = ['payment_method', 'internet_service', 'tech_support']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('ord', OrdinalEncoder(categories=[['Month-to-month', 'One year', 'Two year']]), ordinal_cols),
    ('nom', OneHotEncoder(drop='first', handle_unknown='ignore'), nominal_cols),
])

pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier(random_state=42))
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)
print(f"Churn Prediction Accuracy: {pipeline.score(X_test, y_test):.4f}")

6.2 Property Price Prediction

Python
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

# Property pricing model
# Features: area_sqft, rooms, age, location_score
X = np.array([
    [500, 1, 10, 7],
    [1000, 2, 5, 8],
    [1500, 3, 2, 9],
    [800, 2, 15, 6],
])
y = np.array([25000, 60000, 95000, 40000])

pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])

pipeline.fit(X, y)
new_property = np.array([[1200, 2, 8, 7.5]])
predicted_price = pipeline.predict(new_property)
print(f"Predicted Price: ₹{predicted_price[0]:,.0f}")

6.3 NLP Text Classification Preprocessing

Python
from sklearn.preprocessing import LabelEncoder, normalize
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import issparse

texts = [
    "best product ever",
    "terrible quality waste of money",
    "average product nothing special",
    "loved it will buy again",
    "poor customer service",
]
labels = ['positive', 'negative', 'neutral', 'positive', 'negative']

# Label encoding
le = LabelEncoder()
y = le.fit_transform(labels)  # [1, 0, 2, 1, 0]

# TF-IDF + L2 normalization
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)  # Sparse matrix

# L2 normalize rows
X_normalized = normalize(X, norm='l2')

print("Classes:", le.classes_)
print("Feature shape:", X_normalized.shape)

6.4 Time Series Feature Engineering

Python
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
import pandas as pd
import numpy as np

# Sales data
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=365)
sales = np.random.exponential(scale=1000, size=365) + \
        500 * np.sin(np.linspace(0, 4*np.pi, 365))  # Seasonal pattern

df = pd.DataFrame({'date': dates, 'sales': sales})
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['quarter'] = df['date'].dt.quarter

# Bucket sales into bins (Low/Medium/High)
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
df['sales_bucket'] = discretizer.fit_transform(df[['sales']])

# Scale the continuous features
scaler = StandardScaler()
df['sales_scaled'] = scaler.fit_transform(df[['sales']])

print(df[['date', 'sales', 'sales_scaled', 'sales_bucket']].head(10))

7. Practical Examples

Example 1: Titanic Dataset — Complete Preprocessing

Python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

# Simulate Titanic-style data
titanic = pd.DataFrame({
    'Pclass':   [1, 2, 3, 1, 3, 2, 1, 3],
    'Sex':      ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
    'Age':      [22, 38, np.nan, 35, 28, np.nan, 54, 27],
    'SibSp':    [1, 1, 0, 1, 0, 0, 0, 0],
    'Fare':     [7.25, 71.28, 7.92, 53.1, 8.05, 13.0, 51.86, 7.92],
    'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q', 'S', None],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 1]
})

X = titanic.drop('Survived', axis=1)
y = titanic['Survived']

# Numeric features: Impute + Scale
numeric_features = ['Age', 'Fare', 'SibSp']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # Missing values
    ('scaler', StandardScaler())                     # Scale
])

# Categorical features: Impute + Encode
categorical_features = ['Sex', 'Embarked']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Missing values
    ('encoder', OneHotEncoder(handle_unknown='ignore'))    # Encode
])

# Ordinal feature (Pclass already numeric but ordinal)
ordinal_features = ['Pclass']
# Already numeric, so just scale it

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
    ('ord', StandardScaler(), ordinal_features),
])

# Full pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=100, random_state=42))
])

from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

pipeline.fit(X_train, y_train)

# Cross validation score
cv_scores = cross_val_score(pipeline, X, y, cv=3, scoring='accuracy')

print(f"Train Score: {pipeline.score(X_train, y_train):.4f}")
print(f"Test Score:  {pipeline.score(X_test, y_test):.4f}")
print(f"CV Score:    {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

Step-by-step explanation:

| Step | What it does |
|------|--------------|
| SimpleImputer(strategy='median') | Fill missing Age values with the median |
| StandardScaler() | Z-score scale the numeric features |
| SimpleImputer(strategy='most_frequent') | Fill missing Embarked values with the mode |
| OneHotEncoder(handle_unknown='ignore') | Convert Sex and Embarked into dummy variables |
| ColumnTransformer | Apply different transformers to different columns |
| Pipeline | Chain all steps in sequence |
| cross_val_score | Evaluate with a proper train-validation split |


Example 2: Salary Prediction — Handling Skewed Data

Python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

np.random.seed(42)
n = 500

df = pd.DataFrame({
    'experience_years': np.random.randint(0, 20, n),
    'education':        np.random.choice(['High School', 'Bachelors', 'Masters', 'PhD'], n),
    'department':       np.random.choice(['IT', 'Finance', 'HR', 'Marketing'], n),
    # Salary is log-normally distributed (highly skewed) - typical in real world
    'salary': np.random.lognormal(mean=11, sigma=0.5, size=n)
})

print("Salary statistics before transform:")
print(f"  Skewness: {df['salary'].skew():.3f}")

# Transform the target variable too!
pt = PowerTransformer(method='yeo-johnson')
y = pt.fit_transform(df[['salary']]).ravel()

print(f"  Skewness after transform: {pd.Series(y).skew():.3f}")

X = df.drop('salary', axis=1)

preprocessor = ColumnTransformer([
    ('scale', PowerTransformer(), ['experience_years']),
    ('ordinal', OrdinalEncoder(categories=[
        ['High School', 'Bachelors', 'Masters', 'PhD']
    ]), ['education']),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False), ['department']),
])

pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', LinearRegression())
])

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)
y_pred_transformed = pipeline.predict(X_test)

# Map predictions back to the original salary scale
y_pred_original = pt.inverse_transform(y_pred_transformed.reshape(-1, 1))
y_test_original = pt.inverse_transform(y_test.reshape(-1, 1))

from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test_original, y_pred_original)
print(f"\nMean Absolute Error: ₹{mae:,.0f}")
print(f"R² Score: {pipeline.score(X_test, y_test):.4f}")

Example 3: Multi-class Classification — Complete Workflow

Python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV

# Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Binarize the labels (one column per class), purely to demonstrate LabelBinarizer;
# the pipeline below still trains on the original integer y
lb = LabelBinarizer()
y_binary = lb.fit_transform(y)  # [0,1,2] → [[1,0,0],[0,1,0],[0,0,1]]
print("Classes:", lb.classes_)  # [0, 1, 2]

# Pipeline with hyperparameter tuning
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(probability=True, random_state=42))
])

param_grid = {
    'scaler__with_mean': [True, False],  # Scaler params
    'svm__C': [0.1, 1, 10],             # SVM params
    'svm__kernel': ['rbf', 'linear'],
}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best Params: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")
print(f"Test Score: {grid_search.score(X_test, y_test):.4f}")

8. Edge Cases and Errors

8.1 Data Leakage — The Most Common Mistake

Python
# ❌ WRONG — data leakage!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fitted on the entire dataset

X_train, X_test = train_test_split(X_scaled, ...)  # test data has leaked

# ✅ CORRECT
X_train, X_test = train_test_split(X, ...)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on train only
X_test_scaled  = scaler.transform(X_test)        # transform only on test

8.2 Unseen Categories in OneHotEncoder

Python
# ❌ Error: Unknown category at inference time
ohe = OneHotEncoder()
ohe.fit([['Mumbai'], ['Delhi']])
ohe.transform([['Chennai']])  # ValueError: Found unknown categories

# ✅ FIX: handle_unknown='ignore'
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit([['Mumbai'], ['Delhi']])
result = ohe.transform([['Chennai']])  # Silently becomes [0, 0]

# ✅ FIX: handle_unknown='infrequent_if_exist'
ohe = OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=2)

8.3 Negative Values with Box-Cox

Python
# ❌ Error: Box-Cox needs strictly positive values
pt = PowerTransformer(method='box-cox')
data_with_zeros = np.array([[0], [1], [2], [-1]])
pt.fit_transform(data_with_zeros)  # ValueError!

# ✅ FIX 1: use Yeo-Johnson instead
pt = PowerTransformer(method='yeo-johnson')
pt.fit_transform(data_with_zeros)  # Works!

# ✅ FIX 2: Shift the data before Box-Cox
from sklearn.preprocessing import FunctionTransformer
# Caveat: compute the shift from training data only, to keep train/test consistent
shift_transformer = FunctionTransformer(lambda x: x + abs(x.min()) + 1)

8.4 String Data in Numeric Transformer

Python
# ❌ Error: Strings in numeric column
data = pd.DataFrame({'salary': ['50000', '60000', 'N/A', '70000']})
scaler = StandardScaler()
scaler.fit_transform(data)  # ValueError: could not convert string to float

# ✅ FIX: Clean data first
data['salary'] = pd.to_numeric(data['salary'], errors='coerce')  # 'N/A' → NaN
# Then use SimpleImputer + StandardScaler in pipeline

8.5 Shape Mismatch

Python
# ❌ Common mistake: passing a 1D array
data_1d = np.array([1, 2, 3, 4, 5])
scaler = StandardScaler()
scaler.fit_transform(data_1d)  # ValueError: Expected 2D array

# ✅ FIX: reshape it
scaler.fit_transform(data_1d.reshape(-1, 1))  # (5,) → (5, 1)

# Or, for a pandas Series
scaler.fit_transform(pd.Series([1,2,3]).values.reshape(-1, 1))

8.6 Fitting on Wrong Data

Python
# ❌ WRONG: fitting on the test set
scaler = StandardScaler()
scaler.fit(X_test)  # the test set's statistics are being used!
X_train_scaled = scaler.transform(X_train)

# ✅ CORRECT: fit on the training set
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)  # Same scaler!

8.7 Debugging Toolkit

Python
# Inspect the scaler's learned statistics
scaler = StandardScaler()
scaler.fit(X_train)

print("Mean per feature:", scaler.mean_)
print("Std per feature:", scaler.scale_)
print("Variance:", scaler.var_)
print("N samples seen:", scaler.n_samples_seen_)

# Verify after transforming
X_scaled = scaler.transform(X_train)
print("Scaled mean (should be ~0):", X_scaled.mean(axis=0))
print("Scaled std (should be ~1):",  X_scaled.std(axis=0))

# Inspect the OneHotEncoder's categories
ohe = OneHotEncoder()
ohe.fit(X_cat)  # X_cat: your 2D array of categorical columns
print("Categories:", ohe.categories_)
print("Feature names:", ohe.get_feature_names_out())

9. Pro Developer Insights

9.1 Best Practices

Python
# ✅ 1. ALWAYS use a Pipeline
pipeline = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])

# ✅ 2. Fit only after the train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)

# ✅ 3. Keep DataFrame output via set_output
preprocessor.set_output(transform='pandas')

# ✅ 4. Check hyperparameters with get_params()
print(scaler.get_params())

# ✅ 5. Save the preprocessor (together with the model)
import joblib
joblib.dump(pipeline, 'model_pipeline.pkl')
loaded = joblib.load('model_pipeline.pkl')
# The same transformations are applied automatically at test time

9.2 Performance Tips

Python
# Tip 1: Use sparse matrices when possible
ohe = OneHotEncoder()  # sparse_output=True by default — use it!

# Tip 2: Parallelize where n_jobs is supported (e.g. ColumnTransformer, GridSearchCV)
from sklearn.compose import ColumnTransformer
# ColumnTransformer([...], n_jobs=-1) fits its transformers in parallel

# Tip 3: Partial fit for big data
for chunk in pd.read_csv('big_data.csv', chunksize=10000):
    scaler.partial_fit(chunk[numeric_cols])

# Tip 4: Save memory with float32
import numpy as np
X = X.astype(np.float32)

# Tip 5: DataFrame set_output API (sklearn 1.2+)
from sklearn import set_config
set_config(transform_output='pandas')  # Global setting

9.3 Security Tips

Python
# ❌ Never load untrusted pickle files
import pickle
model = pickle.load(open('untrusted_model.pkl', 'rb'))  # Dangerous!

# ✅ Use joblib and verify the source
import joblib
model = joblib.load('trusted_model.pkl')

# In production: save the preprocessing artifacts too
artifacts = {
    'preprocessor': preprocessor,
    'model': trained_model,
    'feature_names': list(X.columns),
    'version': '1.0.0'
}
joblib.dump(artifacts, 'production_artifacts.pkl')

9.4 Clean Code Patterns

Python
# Pattern 1: Keep constants separate
NUMERIC_FEATURES = ['age', 'salary', 'tenure']
CATEGORICAL_FEATURES = ['department', 'city']
ORDINAL_FEATURES = ['education']
ORDINAL_CATEGORIES = [['High School', 'Bachelors', 'Masters', 'PhD']]

# Pattern 2: Factory function
def build_preprocessor():
    return ColumnTransformer([
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), NUMERIC_FEATURES),
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]), CATEGORICAL_FEATURES),
    ])

# Pattern 3: Feature validation
def validate_features(df, required_features):
    missing = set(required_features) - set(df.columns)
    if missing:
        raise ValueError(f"Missing features: {missing}")

validate_features(X_test, NUMERIC_FEATURES + CATEGORICAL_FEATURES)

10. Comparison

10.1 Scalers Comparison

| Scaler | Formula | Range | Outlier Sensitive | Best For |
|--------|---------|-------|-------------------|----------|
| StandardScaler | (x-μ)/σ | (-∞, +∞) | Yes | Normal dist, SVM, LR |
| MinMaxScaler | (x-min)/(max-min) | [0,1] | Very | Neural nets, KNN |
| MaxAbsScaler | x / \|max\| | [-1,1] | Yes | Sparse data, TF-IDF |
| RobustScaler | (x-median)/IQR | Variable | No | Outlier-heavy data |
| Normalizer | x/‖x‖ | [-1,1] | No | Text, cosine sim |
| QuantileTransformer | Percentile-based | [0,1] or N(0,1) | Least of all | Any distribution |
| PowerTransformer | Yeo-Johnson/Box-Cox | Variable | No | Skewed data |


10.2 Encoders Comparison

| Encoder | Input | Output | Order? | Use Case |
|---------|-------|--------|--------|----------|
| LabelEncoder | 1D | 1D integers | No (arbitrary) | Target variable |
| OrdinalEncoder | 2D | 2D integers | Yes (custom) | Ordered categories |
| OneHotEncoder | 2D | 2D binary sparse | No | Nominal categories |
| LabelBinarizer | 1D | 2D binary | No | Multi-class targets |
| MultiLabelBinarizer | List of lists | 2D binary | No | Multi-label classification |
| TargetEncoder | 2D | 2D float | No | High-cardinality categories |


10.3 sklearn vs Alternatives

| Feature | sklearn.preprocessing | pandas | category_encoders | feature_engine |
|---------|----------------------|--------|-------------------|----------------|
| Pipeline integration | ✅ Native | ❌ Manual | ✅ Fit/transform pattern | ✅ Fit/transform pattern |
| Sparse matrix support | ✅ | ❌ | ⚠️ Partial | ❌ |
| Target encoding | ✅ (v1.3) | Manual mean encoding | ✅ | ✅ |
| Ease of use | Medium | Easy | Medium | Easy |
| Production-ready | ✅ | ⚠️ | ✅ | ✅ |


11. Data Science Perspective

11.1 EDA-Driven Preprocessing Decisions

Python
import pandas as pd
import numpy as np

def preprocessing_recommendations(df):
    """Data dekhke automatically preprocessing suggest karo."""
    
    print("=" * 60)
    print("PREPROCESSING RECOMMENDATIONS")
    print("=" * 60)
    
    for col in df.columns:
        dtype = df[col].dtype
        nunique = df[col].nunique()
        missing_pct = df[col].isna().mean() * 100
        
        print(f"\n📊 Column: {col}")
        print(f"   Type: {dtype}, Unique: {nunique}, Missing: {missing_pct:.1f}%")
        
        if dtype in ['object', 'category']:
            if nunique == 2:
                print("   ✅ Recommended: LabelEncoder (binary)")
            elif nunique <= 10:
                print("   ✅ Recommended: OneHotEncoder (nominal)")
            elif nunique > 10:
                print("   ✅ Recommended: TargetEncoder (high cardinality)")
        
        elif dtype in ['int64', 'float64']:
            skewness = df[col].skew()
            if abs(skewness) > 1:
                print(f"   📈 High skewness: {skewness:.2f}")
                print("   ✅ Recommended: PowerTransformer or QuantileTransformer")
            else:
                print("   ✅ Recommended: StandardScaler")
        
        if missing_pct > 0:
            print(f"   ⚠️  Missing values: Use SimpleImputer")

# Usage
preprocessing_recommendations(df)

11.2 Feature Importance After Preprocessing

Python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd

# Train the pipeline
pipeline.fit(X_train, y_train)

# Extract the feature names
feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()

# Feature importance
rf = pipeline.named_steps['model']
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print(importance_df.head(10))

11.3 Preprocessing in ML Workflow

Text
1. Data Collection
    ↓
2. EDA (Exploratory Data Analysis)
   - Distribution check → Scaler choice
   - Missing values check → Imputer strategy
   - Cardinality check → Encoder choice
   - Correlation analysis
    ↓
3. Feature Engineering
   - PolynomialFeatures
   - KBinsDiscretizer
   - FunctionTransformer
    ↓
4. Preprocessing Pipeline Build
   - ColumnTransformer
   - Pipeline
    ↓
5. Train/Test Split (fit ONLY after this!)
    ↓
6. Model Training + Hyperparameter Tuning
    ↓
7. Evaluation
    ↓
8. Save Pipeline (preprocessor + model)
    ↓
9. Production Deployment

11.4 Distribution Analysis for Scaler Selection

Python
import numpy as np
from scipy import stats

def select_scaler(data):
    """Distribution dekhke best scaler recommend karo."""
    
    # Normality test
    _, p_value = stats.normaltest(data)
    skewness = stats.skew(data)
    kurtosis = stats.kurtosis(data)
    
    # Outlier detection
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    outliers = np.sum((data < q1 - 1.5*iqr) | (data > q3 + 1.5*iqr))
    outlier_pct = outliers / len(data) * 100
    
    print(f"Skewness: {skewness:.3f}")
    print(f"Kurtosis: {kurtosis:.3f}")
    print(f"Outlier %: {outlier_pct:.1f}%")
    print(f"Normality p-value: {p_value:.4f}")
    
    if outlier_pct > 5:
        print("→ Recommendation: RobustScaler")
    elif abs(skewness) > 1:
        print("→ Recommendation: PowerTransformer or QuantileTransformer")
    elif p_value > 0.05:  # Normally distributed
        print("→ Recommendation: StandardScaler")
    else:
        print("→ Recommendation: QuantileTransformer (safe choice)")

# Test
salary_data = np.random.lognormal(10, 1, 1000)
select_scaler(salary_data)

12. Interview Questions

🟢 Basic Level

Q1. What is the difference between StandardScaler and MinMaxScaler?

StandardScaler produces mean=0, std=1 (Z-score). The range is not fixed. It is affected by outliers. Best for normally distributed data.

MinMaxScaler brings values into the [0,1] range. You get a fixed range. It is heavily affected by outliers. Best for neural networks.


Q2. Why should fit_transform() be used only on training data?

If you also call fit() on the test data, the test set's statistics (mean, std, min, max) leak into training. That is Data Leakage. The model scores unrealistically well during development and then fails in production.


Q3. When should you use LabelEncoder vs OneHotEncoder?

  • LabelEncoder: for the target variable (y), or for features of tree-based models

  • OneHotEncoder: for nominal categorical features of linear/distance-based models

  • LabelEncoder on features implies an ordinal relationship, which causes wrong predictions


Q4. What does Normalizer do and when is it used?

Normalizer normalizes rows (not columns). Every sample ends up with magnitude 1. It is used in text classification and cosine similarity calculations.


🟡 Intermediate Level

Q5. How does RobustScaler handle outliers?

RobustScaler uses the median and the IQR (Q3-Q1). Formula: (x - median) / IQR. Outliers barely move the median and IQR (unlike the mean and std), so it works better on outlier-heavy data.


Q6. What is ColumnTransformer for? What does the remainder parameter do?

ColumnTransformer applies different transformers to different columns in one go. remainder='passthrough' keeps the remaining columns as-is; remainder='drop' removes them. The default is 'drop'. A small demo follows below.
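
A minimal sketch of the remainder behavior on toy data (illustrative column names):

Python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'age': [25, 35, 45],
                   'salary': [30000, 60000, 90000],
                   'id': [1, 2, 3]})

# 'id' is not listed in any transformer, so remainder decides its fate
ct_drop = ColumnTransformer([('num', StandardScaler(), ['age', 'salary'])])
ct_pass = ColumnTransformer([('num', StandardScaler(), ['age', 'salary'])],
                            remainder='passthrough')

print(ct_drop.fit_transform(df).shape)  # (3, 2): 'id' dropped (the default)
print(ct_pass.fit_transform(df).shape)  # (3, 3): 'id' kept as-is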


Q7. What is a sparse matrix, and why is MaxAbsScaler preferred for sparse data?

A sparse matrix is mostly zeros (TF-IDF, one-hot encoded data). MaxAbsScaler does not disturb the zero values (zeros stay zero). StandardScaler subtracts the mean, which would turn the sparse matrix dense and waste memory; see the sketch below.
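
A quick sketch of the failure mode, assuming scipy is installed: centering a sparse matrix raises an error, while MaxAbsScaler handles it directly:

Python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler, MaxAbsScaler

X = sparse.csr_matrix(np.array([[0.0, 1.0], [0.0, 0.0], [2.0, 0.0]]))

try:
    StandardScaler().fit_transform(X)  # default with_mean=True cannot center sparse data
except ValueError as e:
    print("StandardScaler:", e)

print(MaxAbsScaler().fit_transform(X).toarray())
# [[0.  1. ]
#  [0.  0. ]
#  [1.  0. ]]   zeros stay exactly zero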


Q8. With degree=2 and n=10 features, how many features does PolynomialFeatures produce?

Formula: C(n+d, d) = C(12, 2) = 66 features (including the bias term). This is the curse of dimensionality: at higher degrees, the feature count grows combinatorially. The one-liner below verifies the count.
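
A quick check of that count against sklearn itself:

Python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

print(comb(12, 2))  # 66
poly = PolynomialFeatures(degree=2).fit(np.zeros((1, 10)))
print(poly.n_output_features_)  # 66, matches C(10 + 2, 2)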


🔴 Advanced Level

Q9. How do you build a custom transformer that works inside a Pipeline?

Inherit from BaseEstimator and TransformerMixin. Learn any statistics in fit() and return self. Do the actual transformation in transform(). Also implement get_feature_names_out() for Pipeline compatibility.


Q10. What is TargetEncoder, and when is it better than Label/OneHot encoding?

TargetEncoder replaces each categorical value with the conditional mean of the target variable. For high-cardinality columns (500+ unique values), OneHotEncoder becomes infeasible. TargetEncoder applies K-fold cross-validation internally to avoid target leakage.


Q11. When do you need partial_fit(), and which preprocessors support it?

When the data does not fit in memory (Big Data), you fit incrementally, batch by batch, with partial_fit(). StandardScaler, MinMaxScaler, and MaxAbsScaler support it. PowerTransformer and QuantileTransformer do not.


Q12. What is the set_output API and what is it good for?

Since sklearn 1.2, set_output(transform='pandas') makes transformers return DataFrames instead of numpy arrays. Column names are preserved, debugging gets easier, and feature names are tracked automatically.


🎯 Scenario-Based

Q13. What happens when an unseen category shows up in production? How do you handle it?

With OneHotEncoder(handle_unknown='ignore'), unseen categories become an all-zero vector. With handle_unknown='infrequent_if_exist', rare and unseen categories are grouped into one "infrequent" bin. Inside a Pipeline, this handling is applied automatically.


Q14. A salary column ranges from 0 to 10 crore and also contains outliers. Which scaler would you use, and why?

RobustScaler is the best choice because:

  1. The range is wide; StandardScaler/MinMaxScaler would be skewed by the outliers

  2. With outliers present, MinMaxScaler compresses the useful range

  3. RobustScaler uses the median/IQR, which are outlier-resistant (see the comparison sketch below)
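
A small comparison sketch on made-up salaries with one extreme value:

Python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Four ordinary salaries plus one 10-crore outlier
salary = np.array([[300_000.0], [500_000.0], [700_000.0],
                   [900_000.0], [100_000_000.0]])

print(MinMaxScaler().fit_transform(salary).ravel())
# The four ordinary salaries get crushed into [0, 0.006]; only the outlier reaches 1

print(RobustScaler().fit_transform(salary).ravel())
# [-1, -0.5, 0, 0.5, ~248]: the ordinary salaries keep a usable spread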


Q15. After the train/test split you also need a validation set, and a preprocessor must be applied. What is the correct sequence?

Python
# Correct sequence:
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)

# Fit only on train!
preprocessor.fit(X_train)
X_train_t = preprocessor.transform(X_train)
X_val_t   = preprocessor.transform(X_val)    # Same scaler!
X_test_t  = preprocessor.transform(X_test)   # Same scaler!

13. Conclusion

Key Learnings Recap

Text
✅ StandardScaler  → Normal data, SVM, LR, Neural Nets
✅ MinMaxScaler    → Image data, KNN, Neural Nets
✅ RobustScaler    → Outlier-heavy data (financial, medical)
✅ MaxAbsScaler    → Sparse data, NLP/TF-IDF
✅ QuantileTransformer → Any distribution, most robust
✅ PowerTransformer → Skewed data, linear model assumptions

✅ LabelEncoder    → Target variable, tree models
✅ OrdinalEncoder  → Ordered categories (size, education)
✅ OneHotEncoder   → Nominal categories, linear/distance models
✅ TargetEncoder   → High-cardinality categories

✅ PolynomialFeatures → Non-linear feature engineering
✅ KBinsDiscretizer   → Continuous → categorical
✅ FunctionTransformer → Custom any function

When to Use What — Decision Tree

Text
Got data → What type?

NUMERIC → Outliers?
   Yes → RobustScaler
   No  → Skewed?
        Yes → PowerTransformer / QuantileTransformer
        No  → StandardScaler (normal) / MinMaxScaler (neural nets)

CATEGORICAL → Is there an order?
   Yes → OrdinalEncoder
   No  → Cardinality?
         High (>50) → TargetEncoder
         Low  (<50) → OneHotEncoder

TARGET VARIABLE → LabelEncoder (multi-class) / as-is (regression)

Final Advice

sklearn.preprocessing is not just about scaling numbers; it is a data engineering toolkit that:

  1. Improves data quality

  2. Dramatically boosts model performance

  3. Speeds up training

  4. Improves generalization

In production, always use a Pipeline: preprocessing and model become a single unit, deployment gets easier, and data leakage is ruled out.

A good ML Engineer understands the data better than the model, and preprocessing mastery is essential for understanding data. 🚀


📌 Pro Tip: "Garbage in, garbage out." In Machine Learning, if the preprocessing is wrong, even the world's best model will fail. Invest time in preprocessing and the results will take care of themselves.


Happy Coding! 🐍

If this post was helpful, share it and tell me in the comments which concept you found most useful!
