Master every tool in scikit-learn's preprocessing arsenal — everything from beginner basics to production-ready ML pipelines, all in one place.
📋 Table of Contents
1. Introduction
What Exactly Is sklearn.preprocessing?
Think Machine Learning is just about fitting a model? Wrong! The real work is getting the data ready for the model — and that is exactly what sklearn.preprocessing does.
sklearn.preprocessing is a Python module that is part of the scikit-learn library. Its job is to turn raw, messy, unscaled data into clean, transformed, and normalized input for ML models.
Why Is It Needed?
Real data is never perfect:
The Salary column has values ranging from 50,000 to 1,00,00,000
The City column contains strings like "Mumbai", "Delhi", "Bangalore"
Some columns contain NaN values
Some features are skewed
If you feed this data straight into an ML model, the model:
Becomes biased toward high-value columns (like salary)
Fails to process string categories
Converges slowly under gradient descent
sklearn.preprocessing solves all of these problems.
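A minimal sketch of the scale-bias problem (toy numbers, nothing here comes from a real dataset): the salary column dominates any distance computation until both features are scaled.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two customers: (age, salary); salary is vastly larger than age
X = np.array([[25, 50_000], [45, 1_00_00_000]], dtype=float)

# Euclidean distance before scaling is driven almost entirely by salary
raw_dist = np.linalg.norm(X[0] - X[1])

# After scaling, both features contribute equally
X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(f"raw distance:    {raw_dist:,.0f}")
print(f"scaled distance: {scaled_dist:.3f}")
```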
Real-World Usage
| Industry | Problem | Preprocessing Used |
|---|---|---|
| Banking | Credit scoring | StandardScaler, LabelEncoder |
| E-commerce | Recommendation system | OneHotEncoder, MinMaxScaler |
| Healthcare | Disease prediction | RobustScaler, Normalizer |
| NLP | Text classification | Binarizer, LabelBinarizer |
| Finance | Fraud detection | QuantileTransformer, PowerTransformer |
| HR Analytics | Employee churn | OrdinalEncoder, PolynomialFeatures |
2. Basic Concepts
2.1 What Is Preprocessing?
Preprocessing means converting data into a usable form:
Raw Data → Clean Data → Transformed Data → ML Model
2.2 Core Problem Areas
1. Feature Scaling → equalize the range of numeric features
2. Encoding → convert strings to numbers
3. Normalization → bring rows to unit norm
4. Transformation → change the shape of a distribution
5. Binarization → convert continuous values to 0/1
6. Imputation → fill missing values (SimpleImputer)
7. Polynomial Features → generate new features
2.3 Basic Import
# Import the whole module
from sklearn import preprocessing
# Or import specific classes (recommended)
from sklearn.preprocessing import (
StandardScaler,
MinMaxScaler,
MaxAbsScaler,
RobustScaler,
Normalizer,
Binarizer,
LabelEncoder,
OrdinalEncoder,
OneHotEncoder,
LabelBinarizer,
MultiLabelBinarizer,
PolynomialFeatures,
FunctionTransformer,
PowerTransformer,
QuantileTransformer,
KBinsDiscretizer,
)
2.4 The Fit-Transform Pattern (Most Important!)
Every sklearn preprocessor follows the same pattern:
# Step 1: Create the object
scaler = StandardScaler()
# Step 2: Fit on the training data (learns the statistics)
scaler.fit(X_train)
# Step 3: Transform
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use the same scaler!
# Or fit + transform in one step (only on the training data)
X_train_scaled = scaler.fit_transform(X_train)
⚠️ Golden Rule:
Call fit() only on X_train. On X_test, call only transform(). Information from the test data must never leak into training — this is called Data Leakage.
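A quick sanity check of the rule (toy data): the scaler learns its statistics from the training split only, so a test value outside the training range simply gets a large z-score and nothing leaks.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0]])  # deliberately outside the training range

scaler = StandardScaler().fit(X_train)  # statistics come from train only
print(scaler.mean_)                     # [2.5]; the test value is ignored

X_test_scaled = scaler.transform(X_test)  # reuses the train statistics
print(X_test_scaled)                      # large z-score, no leakage
```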
3. All Functions and Features
3.1 Feature Scaling (Numerical Data)
🔹 StandardScaler — Z-Score Normalization
Formula: z = (x - mean) / std_dev
Result: Mean = 0, Std Dev = 1
from sklearn.preprocessing import StandardScaler
import numpy as np
data = np.array([[1000, 2], [2000, 4], [3000, 6], [4000, 8]])
scaler = StandardScaler()
scaled = scaler.fit_transform(data)
print("Original:\n", data)
print("\nScaled:\n", scaled)
print("\nMean:", scaler.mean_)
print("Std:", scaler.scale_)
Original:
[[1000 2]
[2000 4]
[3000 6]
[4000 8]]
Scaled:
[[-1.34164079 -1.34164079]
[-0.4472136 -0.4472136 ]
[ 0.4472136 0.4472136 ]
[ 1.34164079 1.34164079]]
Mean: [2500. 5.]
Std: [1118.03398875 2.23606798]
When to use:
Linear Regression, Logistic Regression, SVM, Neural Networks
When the data roughly follows a Gaussian distribution
When there are no strong outliers (StandardScaler is sensitive to them)
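To see the outlier sensitivity mentioned above (toy data): one extreme value inflates the mean and standard deviation, squashing the normal points into a narrow band.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

clean = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
dirty = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one outlier

z_clean = StandardScaler().fit_transform(clean).ravel()
z_dirty = StandardScaler().fit_transform(dirty).ravel()

# Without the outlier, 1..5 spread across roughly [-1.41, 1.41]
print(z_clean)
# With it, 1..4 collapse into a narrow band near -0.5: the outlier dominates
print(z_dirty[:4])
```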
🔹 MinMaxScaler — Range Scaling
Formula: x_scaled = (x - x_min) / (x_max - x_min)
Result: Values in the [0, 1] range (or a custom range)
from sklearn.preprocessing import MinMaxScaler
data = np.array([[100], [200], [300], [400], [500]])
scaler = MinMaxScaler(feature_range=(0, 1)) # Default
scaled = scaler.fit_transform(data)
print(scaled)
# [[0. ], [0.25], [0.5 ], [0.75], [1. ]]
# Custom range: -1 to 1
scaler2 = MinMaxScaler(feature_range=(-1, 1))
scaled2 = scaler2.fit_transform(data)
print(scaled2)
# [[-1. ], [-0.5], [0. ], [0.5 ], [1. ]]
When to use:
Neural Networks (especially image pixel values, 0-255 → 0-1)
KNN, K-Means
When you need an exact [0, 1] range
When there are no strong outliers
🔹 MaxAbsScaler — Absolute Maximum Scaling
Formula: x_scaled = x / |x_max|
Result: Values in the [-1, 1] range (the sign is preserved)
from sklearn.preprocessing import MaxAbsScaler
data = np.array([[-10], [-5], [0], [5], [10]])
scaler = MaxAbsScaler()
scaled = scaler.fit_transform(data)
print(scaled)
# [[-1. ], [-0.5], [0. ], [0.5], [1. ]]
When to use:
Sparse data (mostly zeros) — the zeros stay untouched
Text/NLP TF-IDF matrices
Data that is already centered (zero mean)
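A small check of the sparsity claim (toy matrix): MaxAbsScaler only divides each column by a constant, so zero entries stay exactly zero and the sparse structure survives.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

X = csr_matrix(np.array([[0.0, 4.0], [0.0, 0.0], [3.0, -8.0]]))

X_scaled = MaxAbsScaler().fit_transform(X)  # output is still sparse
print(X_scaled.nnz)        # same number of stored non-zeros: 3
print(X_scaled.toarray())  # columns divided by max |value|: 3 and 8
```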
🔹 RobustScaler — Outlier-Resistant Scaling
Formula: x_scaled = (x - median) / IQR
IQR = Q3 - Q1 (Interquartile Range)
from sklearn.preprocessing import RobustScaler
# Data with outliers
data = np.array([[1], [2], [3], [4], [100]])  # 100 is an outlier
scaler = RobustScaler()
scaled = scaler.fit_transform(data)
print(scaled)
# [[-1. ], [-0.5 ], [0. ], [0.5 ], [48.5 ]]
print("Median:", scaler.center_) # [3.]
print("IQR:", scaler.scale_) # [2.]
When to use:
The data contains outliers
Financial data (salaries, transactions)
Medical data
Real-world messy data
🔹 Normalizer — Row-wise Normalization
It normalizes rows (not columns!) — it brings all the values of a single sample to a unit norm.
from sklearn.preprocessing import Normalizer
data = np.array([[3, 4], [1, 2], [6, 8]])
# L2 norm (default): sqrt(sum of squares) = 1
norm_l2 = Normalizer(norm='l2')
print(norm_l2.fit_transform(data))
# [[0.6, 0.8], [0.447, 0.894], [0.6, 0.8]]
# L1 norm: sum of absolute values = 1
norm_l1 = Normalizer(norm='l1')
print(norm_l1.fit_transform(data))
# [[0.429, 0.571], [0.333, 0.667], [0.429, 0.571]]
# Max norm: max value = 1
norm_max = Normalizer(norm='max')
print(norm_max.fit_transform(data))
# [[0.75, 1.0], [0.5, 1.0], [0.75, 1.0]]
When to use:
Text classification (TF-IDF vectors)
When you need to compute cosine similarity
When each sample is independent
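Why L2 normalization pairs with cosine similarity (a small sketch): once every row has unit norm, the plain dot product of two rows equals their cosine similarity.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

A = np.array([[3.0, 4.0], [6.0, 8.0], [4.0, -3.0]])
A_unit = Normalizer(norm='l2').fit_transform(A)

# cosine(a, b) = a.b / (|a||b|); with unit rows this is just a.b
cos_01 = float(A_unit[0] @ A_unit[1])  # parallel rows give 1.0
cos_02 = float(A_unit[0] @ A_unit[2])  # orthogonal rows give 0.0
print(cos_01, cos_02)
```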
3.2 Encoding (Categorical Data)
🔹 LabelEncoder — Integer Encoding
Converts the categorical values of a single column into integers.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# String labels
cities = ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai', 'Delhi']
encoded = le.fit_transform(cities)
print(encoded) # [2 1 0 2 1]
print(le.classes_) # ['Bangalore' 'Delhi' 'Mumbai']
# Inverse transform
print(le.inverse_transform([2, 1, 0])) # ['Mumbai' 'Delhi' 'Bangalore']
When to use:
Encoding the target variable (y) ✅
Features for tree-based models (Random Forest, XGBoost) ✅
Do not use for features of linear models ❌ (it implies an ordinal relationship)
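Why label-encoded features mislead linear models (illustrative): the integers come from alphabetical order, so a single coefficient would treat 'Mumbai' (2) as twice 'Delhi' (1), a relationship that does not exist.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Mumbai', 'Delhi', 'Bangalore'])
print(dict(zip(le.classes_, [0, 1, 2])))
# {'Bangalore': 0, 'Delhi': 1, 'Mumbai': 2}: purely alphabetical

# A linear model computes w * code, implying Bangalore < Delhi < Mumbai
# on some hidden scale. Use OneHotEncoder for such nominal features.
print(codes)
```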
🔹 OrdinalEncoder — Ordered Encoding (Multiple Columns)
Encodes multiple columns at once while preserving a specified order.
from sklearn.preprocessing import OrdinalEncoder
data = np.array([
['Low', 'Small'],
['Medium', 'Medium'],
['High', 'Large'],
['Low', 'Large']
])
encoder = OrdinalEncoder(
categories=[
['Low', 'Medium', 'High'], # Column 1 order
['Small', 'Medium', 'Large'] # Column 2 order
]
)
encoded = encoder.fit_transform(data)
print(encoded)
# [[0. 0.]
# [1. 1.]
# [2. 2.]
# [0. 2.]]
When to use:
Education: 'School' < 'Bachelors' < 'Masters' < 'PhD'
Size: 'Small' < 'Medium' < 'Large'
Rating: 'Poor' < 'Average' < 'Good' < 'Excellent'
🔹 OneHotEncoder — Dummy Variable Encoding
Converts categorical values into binary columns (dummy variables).
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
data = np.array([['Mumbai'], ['Delhi'], ['Bangalore'], ['Mumbai']])
ohe = OneHotEncoder(
sparse_output=False,  # return a numpy array
drop='first',  # avoid the dummy variable trap
handle_unknown='ignore'  # ignore unseen categories
)
encoded = ohe.fit_transform(data)
print(encoded)
# [[0. 1.] <- Mumbai
# [1. 0.] <- Delhi
# [0. 0.] <- Bangalore (dropped - first)
# [0. 1.]] <- Mumbai
print(ohe.categories_)
print(ohe.get_feature_names_out(['city']))
# ['city_Delhi' 'city_Mumbai']
Key Parameters:
| Parameter | Default | Description |
|---|---|---|
| sparse_output | True | Return a sparse matrix |
| drop | None | 'first' or 'if_binary' |
| handle_unknown | 'error' | 'ignore' or 'infrequent_if_exist' |
| min_frequency | None | Group rare categories as infrequent |
| max_categories | None | Cap the number of output categories |
When to use:
Categorical features for linear models ✅
Neural networks ✅
When the categories have no inherent order ✅
🔹 LabelBinarizer — Binary Matrix for Single Column
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
cities = ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai']
encoded = lb.fit_transform(cities)
print(encoded)
# [[0 0 1] <- Mumbai
# [0 1 0] <- Delhi
# [1 0 0] <- Bangalore
# [0 0 1]] <- Mumbai
print(lb.classes_) # ['Bangalore' 'Delhi' 'Mumbai']
🔹 MultiLabelBinarizer — Multiple Labels per Sample
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# One user likes multiple genres
genres = [
['Action', 'Drama'],
['Comedy'],
['Action', 'Comedy', 'Romance'],
['Drama', 'Romance']
]
encoded = mlb.fit_transform(genres)
print(mlb.classes_) # ['Action' 'Comedy' 'Drama' 'Romance']
print(encoded)
# [[1 0 1 0]
# [0 1 0 0]
# [1 1 0 1]
# [0 0 1 1]]
3.3 Binarization
🔹 Binarizer — Threshold-based 0/1
from sklearn.preprocessing import Binarizer
data = np.array([[0.1, 0.5, 0.9], [0.2, 0.6, 0.4]])
# Default threshold = 0.0
binarizer = Binarizer(threshold=0.5)
result = binarizer.fit_transform(data)
print(result)
# [[0 0 1]
# [0 1 0]]
When to use:
Converting probability scores into binary predictions
Image thresholding
Feature engineering
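The probability-to-prediction use case as a sketch (made-up scores): Binarizer applies one global cut-off, e.g. a stricter 0.7 instead of the usual 0.5.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

proba = np.array([[0.10, 0.65, 0.71, 0.95]])  # hypothetical model scores

# Stricter threshold than 0.5: fewer positives, typically higher precision
preds = Binarizer(threshold=0.7).fit_transform(proba)
print(preds)
```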
3.4 Distribution Transformation
🔹 PowerTransformer — Making Data More Gaussian
from sklearn.preprocessing import PowerTransformer
data = np.array([[1], [2], [3], [4], [100], [1000]]) # Skewed data
# Yeo-Johnson: handles both positive and negative values
pt_yj = PowerTransformer(method='yeo-johnson')
transformed = pt_yj.fit_transform(data)
print("Yeo-Johnson:\n", transformed)
# Box-Cox: strictly positive values only
pt_bc = PowerTransformer(method='box-cox')
transformed_bc = pt_bc.fit_transform(data)
print("Box-Cox lambda:", pt_bc.lambdas_)
When to use:
Highly skewed numerical data (income, population)
Linear regression assumes normality of the residuals
Before t-tests, ANOVA
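A quick check of the skew claim (synthetic log-normal data, similar to the salary example later in this guide): PowerTransformer drives the sample skewness close to zero.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # heavily right-skewed

before = skew(x.ravel())
after = skew(PowerTransformer(method='yeo-johnson').fit_transform(x).ravel())
print(f"skew before: {before:.2f}, after: {after:.2f}")
```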
🔹 QuantileTransformer — Percentile-based Transform
from sklearn.preprocessing import QuantileTransformer
data = np.array([[1], [2], [3], [100], [1000], [50000]])
# Uniform distribution (0 to 1)
qt_uniform = QuantileTransformer(
    n_quantiles=6,  # must be <= n_samples; larger values are clipped with a warning
    output_distribution='uniform',
    random_state=42
)
# Normal distribution
qt_normal = QuantileTransformer(
    n_quantiles=6,
    output_distribution='normal',
    random_state=42
)
print("Uniform:\n", qt_uniform.fit_transform(data))
print("Normal:\n", qt_normal.fit_transform(data))
Advantage: It is the most robust option against outliers — extreme values are pulled in toward the median.
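The robustness comes from using ranks rather than magnitudes (tiny example): multiplying the largest value by 1000 does not change the transformed output at all.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

a = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
b = np.array([[1.0], [2.0], [3.0], [4.0], [5000.0]])  # extreme outlier

qt = QuantileTransformer(n_quantiles=5, output_distribution='uniform')
out_a = qt.fit_transform(a).ravel()
out_b = qt.fit_transform(b).ravel()

print(out_a)  # evenly spaced ranks from 0 to 1
print(out_b)  # identical: only the ordering of the values matters
```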
🔹 FunctionTransformer — Custom Transformation
from sklearn.preprocessing import FunctionTransformer
# Log transformation
log_transformer = FunctionTransformer(
func=np.log1p, # log(x + 1)
inverse_func=np.expm1 # inverse: e^x - 1
)
data = np.array([[0], [1], [10], [100], [1000]])
transformed = log_transformer.fit_transform(data)
print(transformed)
# [[0. ]
# [0.693]
# [2.398]
# [4.615]
# [6.909]]
# You can use any custom function
sqrt_transformer = FunctionTransformer(np.sqrt)
3.5 Discretization
🔹 KBinsDiscretizer — Continuous to Categorical
from sklearn.preprocessing import KBinsDiscretizer
data = np.array([[0], [10], [20], [30], [40], [50], [100]])
# Equal-width bins
kbd_width = KBinsDiscretizer(
n_bins=3,
encode='ordinal', # 'onehot', 'onehot-dense', 'ordinal'
strategy='uniform' # 'uniform', 'quantile', 'kmeans'
)
print("Uniform:\n", kbd_width.fit_transform(data))
# Equal-frequency bins
kbd_freq = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
print("Quantile:\n", kbd_freq.fit_transform(data))
print("Bin edges:", kbd_width.bin_edges_)
3.6 Polynomial Features
🔹 PolynomialFeatures — Non-linear Feature Engineering
from sklearn.preprocessing import PolynomialFeatures
data = np.array([[2, 3]]) # 2 features: a=2, b=3
# Degree 2: 1, a, b, a^2, ab, b^2
poly = PolynomialFeatures(degree=2, include_bias=True)
result = poly.fit_transform(data)
print(result)
# [[ 1. 2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out(['a', 'b']))
# ['1' 'a' 'b' 'a^2' 'a b' 'b^2']
# Interaction only (no powers)
poly_interact = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
result2 = poly_interact.fit_transform(data)
print(result2)
# [[2. 3. 6.]] -> a, b, a*b
4. Intermediate Usage
4.1 Integration with Pipeline
In real projects, preprocessors are used inside a Pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
import pandas as pd
# Sample dataset
df = pd.DataFrame({
'age': [25, 35, 45, 28, 55],
'salary': [30000, 60000, 90000, 40000, 120000],
'city': ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai', 'Delhi'],
'gender': ['M', 'F', 'M', 'F', 'M'],
'churn': [0, 0, 1, 0, 1]
})
X = df.drop('churn', axis=1)
y = df['churn']
# Separate the numeric and categorical columns
numeric_features = ['age', 'salary']
categorical_features = ['city', 'gender']
# Column-specific preprocessors
numeric_transformer = Pipeline(steps=[
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])
# ColumnTransformer: different transformers for different columns
preprocessor = ColumnTransformer(transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Final pipeline
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', LogisticRegression())
])
# With a train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
print("Accuracy:", pipeline.score(X_test, y_test))
4.2 ColumnTransformer — Handling Mixed Data Types
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
# Complex real-world scenario
preprocessor = ColumnTransformer(
transformers=[
('scale_num', StandardScaler(), ['age', 'income', 'credit_score']),
('encode_cat', OneHotEncoder(handle_unknown='ignore'), ['city', 'occupation']),
('encode_ord', OrdinalEncoder(categories=[['Low', 'Medium', 'High']]), ['risk_level']),
],
remainder='passthrough'  # keep the remaining columns as-is
)
# Extract the feature names
X_transformed = preprocessor.fit_transform(X_train)
feature_names = preprocessor.get_feature_names_out()
print(feature_names)
4.3 Inverse Transform — Recovering the Original Data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data = np.array([[1000, 2], [2000, 4], [3000, 6]])
scaled = scaler.fit_transform(data)
print("Scaled:\n", scaled)
# Convert back to the original values
original = scaler.inverse_transform(scaled)
print("Recovered:\n", original)
# Same as original data
4.4 Partial Fit — Incremental Learning (Big Data)
When the data does not fit into memory all at once:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit batch by batch
for batch in data_batches:  # a generator or a list of chunks
    scaler.partial_fit(batch)
# Now transform
X_scaled = scaler.transform(X_test)
StandardScaler, MinMaxScaler, and MaxAbsScaler support partial_fit.
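A small check (toy batches): streaming chunks through partial_fit reproduces the same statistics as a single fit on the full array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.arange(12, dtype=float).reshape(6, 2)
batches = [X[:3], X[3:]]  # e.g. chunks read from disk

incremental = StandardScaler()
for batch in batches:
    incremental.partial_fit(batch)  # updates the running mean/variance

full = StandardScaler().fit(X)
print(incremental.mean_, full.mean_)  # identical statistics
```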
4.5 Correct Usage with Cross-Validation
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# ✅ CORRECT: keep the scaler inside the Pipeline
pipe = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression())
])
scores = cross_val_score(pipe, X, y, cv=5)
# Automatically: fit on each train fold, transform on the validation fold
# ❌ WRONG: scaling first, then cross-validating
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Data leakage!
scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)
5. Advanced Concepts
5.1 Building a Custom Transformer
Use BaseEstimator and TransformerMixin to build a custom preprocessor:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
class LogTransformer(BaseEstimator, TransformerMixin):
    """Custom log(1+x) transformer with clipping."""
    def __init__(self, clip_min=0.0):
        self.clip_min = clip_min
    def fit(self, X, y=None):
        # Learn any statistics here
        self.n_features_in_ = X.shape[1] if hasattr(X, 'shape') else 1
        return self  # Always return self
    def transform(self, X, y=None):
        X = np.array(X)
        X_clipped = np.clip(X, self.clip_min, None)  # clip values below clip_min
        return np.log1p(X_clipped)
    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            return [f'log_feature_{i}' for i in range(self.n_features_in_)]
        return [f'log_{name}' for name in input_features]
# Use in pipeline
pipe = Pipeline([
('log', LogTransformer(clip_min=0)),
('scaler', StandardScaler()),
('model', LogisticRegression())
])
5.2 TargetEncoder (sklearn 1.3+)
For high-cardinality categorical data:
from sklearn.preprocessing import TargetEncoder
# The city column has 500+ unique values
te = TargetEncoder(
    target_type='continuous',  # or 'binary', 'multiclass'
    smooth='auto',             # regularization
    cv=5                       # cross-validation folds
)
X_train_enc = te.fit_transform(X_train[['city']], y_train)
X_test_enc = te.transform(X_test[['city']])
5.3 Internal Working of StandardScaler
# What StandardScaler does internally:
class MyStandardScaler:
    def fit(self, X):
        self.mean_ = X.mean(axis=0)       # Column-wise mean
        self.var_ = X.var(axis=0)         # Column-wise variance
        self.scale_ = np.sqrt(self.var_)  # Standard deviation
        return self
    def transform(self, X):
        return (X - self.mean_) / self.scale_
    def inverse_transform(self, X_scaled):
        return X_scaled * self.scale_ + self.mean_
5.4 PolynomialFeatures — Combinatorial Explosion
# How many features does a given degree produce from n features?
from sklearn.preprocessing import PolynomialFeatures
from math import comb

def count_poly_features(n_features, degree, include_bias=True):
    # Monomials of degree exactly d in n variables: C(n + d - 1, d)
    total = sum(comb(n_features + d - 1, d) for d in range(1, degree + 1))
    return total + (1 if include_bias else 0)
# n=10 features, degree=2 → 66 features!
# n=10 features, degree=3 → 286 features!
# n=100 features, degree=2 → 5151 features! (curse of dimensionality)
poly = PolynomialFeatures(degree=2)
print(f"100 features → {poly.fit(np.zeros((1, 100))).n_output_features_} features")
5.5 Sparse Matrix Optimization
from scipy.sparse import issparse
from sklearn.preprocessing import MaxAbsScaler
# OneHotEncoder returns sparse matrix by default
ohe = OneHotEncoder() # sparse_output=True by default
X_sparse = ohe.fit_transform(categories_data)
print("Is sparse:", issparse(X_sparse)) # True
print("Density:", f"{X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1]):.2%}")  # fraction of non-zero entries
# MaxAbsScaler works with sparse matrices
mas = MaxAbsScaler()
X_sparse_scaled = mas.fit_transform(X_sparse)
5.6 set_output API (sklearn 1.2+)
from sklearn.preprocessing import StandardScaler
import pandas as pd
X = pd.DataFrame({'age': [25, 35, 45], 'salary': [30000, 60000, 90000]})
# Get a DataFrame back (instead of a numpy array)
scaler = StandardScaler()
scaler.set_output(transform='pandas') # or 'default' for numpy
result = scaler.fit_transform(X)
print(type(result)) # <class 'pandas.core.frame.DataFrame'>
print(result.columns) # ['age', 'salary']
6. Real-World Use Cases
6.1 Customer Churn Prediction Pipeline
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
# Telecom churn dataset simulation
np.random.seed(42)
n = 1000
df = pd.DataFrame({
'tenure_months': np.random.randint(1, 72, n),
'monthly_charges': np.random.uniform(20, 120, n),
'total_charges': np.random.uniform(100, 8000, n),
'contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n),
'payment_method': np.random.choice(['Credit card', 'Bank transfer', 'Mailed check'], n),
'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'], n),
'tech_support': np.random.choice(['Yes', 'No'], n),
'churn': np.random.choice([0, 1], n, p=[0.73, 0.27])
})
X = df.drop('churn', axis=1)
y = df['churn']
numeric_cols = ['tenure_months', 'monthly_charges', 'total_charges']
ordinal_cols = ['contract']
nominal_cols = ['payment_method', 'internet_service', 'tech_support']
preprocessor = ColumnTransformer([
('num', StandardScaler(), numeric_cols),
('ord', OrdinalEncoder(categories=[['Month-to-month', 'One year', 'Two year']]), ordinal_cols),
('nom', OneHotEncoder(drop='first', handle_unknown='ignore'), nominal_cols),
])
pipeline = Pipeline([
('prep', preprocessor),
('model', RandomForestClassifier(random_state=42))
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
print(f"Churn Prediction Accuracy: {pipeline.score(X_test, y_test):.4f}")
6.2 Property Price Prediction
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
# Property pricing model
# Features: area_sqft, rooms, age, location_score
X = np.array([
[500, 1, 10, 7],
[1000, 2, 5, 8],
[1500, 3, 2, 9],
[800, 2, 15, 6],
])
y = np.array([25000, 60000, 95000, 40000])
pipeline = Pipeline([
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('scaler', StandardScaler()),
('model', Ridge(alpha=1.0))
])
pipeline.fit(X, y)
new_property = np.array([[1200, 2, 8, 7.5]])
predicted_price = pipeline.predict(new_property)
print(f"Predicted Price: ₹{predicted_price[0]:,.0f}")
6.3 NLP Text Classification Preprocessing
from sklearn.preprocessing import LabelEncoder, normalize
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import issparse
texts = [
"best product ever",
"terrible quality waste of money",
"average product nothing special",
"loved it will buy again",
"poor customer service",
]
labels = ['positive', 'negative', 'neutral', 'positive', 'negative']
# Label encoding
le = LabelEncoder()
y = le.fit_transform(labels) # [1, 0, 2, 1, 0]
# TF-IDF + L2 normalization
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts) # Sparse matrix
# L2-normalize the rows (normalize was already imported above)
X_normalized = normalize(X, norm='l2')
print("Classes:", le.classes_)
print("Feature shape:", X_normalized.shape)
6.4 Time Series Feature Engineering
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
import pandas as pd
import numpy as np
# Sales data
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=365)
sales = np.random.exponential(scale=1000, size=365) + \
500 * np.sin(np.linspace(0, 4*np.pi, 365)) # Seasonal pattern
df = pd.DataFrame({'date': dates, 'sales': sales})
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['quarter'] = df['date'].dt.quarter
# Divide sales into bins (Low/Medium/High)
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
df['sales_bucket'] = discretizer.fit_transform(df[['sales']])
# Scale the continuous feature
scaler = StandardScaler()
df['sales_scaled'] = scaler.fit_transform(df[['sales']])
print(df[['date', 'sales', 'sales_scaled', 'sales_bucket']].head(10))
7. Practical Examples
Example 1: Titanic Dataset — Complete Preprocessing
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
# Simulate Titanic-style data
titanic = pd.DataFrame({
'Pclass': [1, 2, 3, 1, 3, 2, 1, 3],
'Sex': ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
'Age': [22, 38, np.nan, 35, 28, np.nan, 54, 27],
'SibSp': [1, 1, 0, 1, 0, 0, 0, 0],
'Fare': [7.25, 71.28, 7.92, 53.1, 8.05, 13.0, 51.86, 7.92],
'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q', 'S', None],
'Survived': [0, 1, 1, 1, 0, 0, 0, 1]
})
X = titanic.drop('Survived', axis=1)
y = titanic['Survived']
# Numeric features: Impute + Scale
numeric_features = ['Age', 'Fare', 'SibSp']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')), # Missing values
('scaler', StandardScaler()) # Scale
])
# Categorical features: Impute + Encode
categorical_features = ['Sex', 'Embarked']
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')), # Missing values
('encoder', OneHotEncoder(handle_unknown='ignore')) # Encode
])
# Ordinal feature (Pclass is already numeric but ordinal)
ordinal_features = ['Pclass']
# Already numeric, just apply a scaler
preprocessor = ColumnTransformer(transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features),
('ord', StandardScaler(), ordinal_features),
])
# Full pipeline
pipeline = Pipeline([
('preprocessor', preprocessor),
('model', GradientBoostingClassifier(n_estimators=100, random_state=42))
])
from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
pipeline.fit(X_train, y_train)
# Cross validation score
cv_scores = cross_val_score(pipeline, X, y, cv=3, scoring='accuracy')
print(f"Train Score: {pipeline.score(X_train, y_train):.4f}")
print(f"Test Score: {pipeline.score(X_test, y_test):.4f}")
print(f"CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
Line-by-line explanation:
| Step | What it does |
|---|---|
| SimpleImputer(strategy='median') | Fills missing Age values with the median |
| StandardScaler() | Scales numeric features to z-scores |
| SimpleImputer(strategy='most_frequent') | Fills missing Embarked values with the mode |
| OneHotEncoder(handle_unknown='ignore') | Converts Sex and Embarked to dummy variables |
| ColumnTransformer | Applies different transformers to different columns |
| Pipeline | Chains all the steps in sequence |
| train_test_split / cross_val_score | Evaluates with a proper train-validation split |
Example 2: Salary Prediction — Handling Skewed Data
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
np.random.seed(42)
n = 500
df = pd.DataFrame({
'experience_years': np.random.randint(0, 20, n),
'education': np.random.choice(['High School', 'Bachelors', 'Masters', 'PhD'], n),
'department': np.random.choice(['IT', 'Finance', 'HR', 'Marketing'], n),
# Salary is log-normally distributed (highly skewed) - typical in real world
'salary': np.random.lognormal(mean=11, sigma=0.5, size=n)
})
print("Salary statistics before transform:")
print(f" Skewness: {df['salary'].skew():.3f}")
# Transform the target variable too!
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
y = pt.fit_transform(df[['salary']]).ravel()
print(f" Skewness after transform: {pd.Series(y).skew():.3f}")
X = df.drop('salary', axis=1)
preprocessor = ColumnTransformer([
('scale', PowerTransformer(), ['experience_years']),
('ordinal', OrdinalEncoder(categories=[
    ['High School', 'Bachelors', 'Masters', 'PhD']
]), ['education']),
('onehot', OneHotEncoder(drop='first', sparse_output=False), ['department']),
])
pipeline = Pipeline([
('prep', preprocessor),
('model', LinearRegression())
])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
y_pred_transformed = pipeline.predict(X_test)
# Map predictions back to the original salary scale
y_pred_original = pt.inverse_transform(y_pred_transformed.reshape(-1, 1))
y_test_original = pt.inverse_transform(y_test.reshape(-1, 1))
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test_original, y_pred_original)
print(f"\nMean Absolute Error: ₹{mae:,.0f}")
print(f"R² Score: {pipeline.score(X_test, y_test):.4f}")
Example 3: Multi-class Classification — Complete Workflow
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
# Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# One-hot encode the labels (for illustration; the SVC below uses the integer labels directly)
lb = LabelBinarizer()
y_binary = lb.fit_transform(y) # [0,1,2] → [[1,0,0],[0,1,0],[0,0,1]]
print("Classes:", lb.classes_) # [0, 1, 2]
# Pipeline with hyperparameter tuning
pipeline = Pipeline([
('scaler', StandardScaler()),
('svm', SVC(probability=True, random_state=42))
])
param_grid = {
'scaler__with_mean': [True, False], # Scaler params
'svm__C': [0.1, 1, 10], # SVM params
'svm__kernel': ['rbf', 'linear'],
}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
print(f"Best Params: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")
print(f"Test Score: {grid_search.score(X_test, y_test):.4f}")
8. Edge Cases and Errors
8.1 Data Leakage — The Most Common Mistake
# ❌ WRONG — Data leakage!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # fitted on the entire dataset
X_train, X_test = train_test_split(X_scaled, ...) # test data has leaked
# ✅ CORRECT
X_train, X_test = train_test_split(X, ...)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit on train only
X_test_scaled = scaler.transform(X_test) # only transform on test
8.2 Unseen Categories in OneHotEncoder
# ❌ Error: Unknown category at inference time
ohe = OneHotEncoder()
ohe.fit([['Mumbai'], ['Delhi']])
ohe.transform([['Chennai']]) # ValueError: Found unknown categories
# ✅ FIX: handle_unknown='ignore'
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit([['Mumbai'], ['Delhi']])
result = ohe.transform([['Chennai']]) # Silently becomes [0, 0]
# ✅ FIX: handle_unknown='infrequent_if_exist'
ohe = OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=2)
8.3 Negative Values with Box-Cox
# ❌ Error: Box-Cox needs strictly positive values
pt = PowerTransformer(method='box-cox')
data_with_zeros = np.array([[0], [1], [2], [-1]])
pt.fit_transform(data_with_zeros) # ValueError!
# ✅ FIX 1: use Yeo-Johnson
pt = PowerTransformer(method='yeo-johnson')
pt.fit_transform(data_with_zeros) # Works!
# ✅ FIX 2: Shift data before box-cox
from sklearn.preprocessing import FunctionTransformer
shift_transformer = FunctionTransformer(lambda x: x + abs(x.min()) + 1)
8.4 String Data in Numeric Transformer
# ❌ Error: Strings in numeric column
data = pd.DataFrame({'salary': ['50000', '60000', 'N/A', '70000']})
scaler = StandardScaler()
scaler.fit_transform(data) # ValueError: could not convert string to float
# ✅ FIX: Clean data first
data['salary'] = pd.to_numeric(data['salary'], errors='coerce') # 'N/A' → NaN
# Then use SimpleImputer + StandardScaler in pipeline
8.5 Shape Mismatch
# ❌ Common mistake: passing a 1D array
data_1d = np.array([1, 2, 3, 4, 5])
scaler = StandardScaler()
scaler.fit_transform(data_1d) # ValueError: Expected 2D array
# ✅ FIX: Reshape karo
scaler.fit_transform(data_1d.reshape(-1, 1)) # (5,) → (5, 1)
# Ya pandas Series ke liye
scaler.fit_transform(pd.Series([1,2,3]).values.reshape(-1, 1))
8.6 Fitting on Wrong Data
# ❌ WRONG: fitting on the test set
scaler = StandardScaler()
scaler.fit(X_test)  # the test set's statistics are being used!
X_train_scaled = scaler.transform(X_train)
# ✅ CORRECT: fit on the training set
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # same scaler!
8.7 Debugging Toolkit
# Inspect the scaler's learned statistics
scaler = StandardScaler()
scaler.fit(X_train)
print("Mean per feature:", scaler.mean_)
print("Std per feature:", scaler.scale_)
print("Variance:", scaler.var_)
print("N samples seen:", scaler.n_samples_seen_)
# Verify the output after transforming
X_scaled = scaler.transform(X_train)
print("Scaled mean (should be ~0):", X_scaled.mean(axis=0))
print("Scaled std (should be ~1):", X_scaled.std(axis=0))
# Inspect the fitted OneHotEncoder's categories
ohe = OneHotEncoder()
ohe.fit(X_cat)
print("Categories:", ohe.categories_)
print("Feature names:", ohe.get_feature_names_out())
9. Pro Developer Insights
9.1 Best Practices
# ✅ 1. ALWAYS use a Pipeline
pipeline = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
# ✅ 2. Fit only after the train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)
# ✅ 3. Keep DataFrame output with set_output
preprocessor.set_output(transform='pandas')
# ✅ 4. Check hyperparameters with get_params()
print(scaler.get_params())
# ✅ 5. Save the preprocessor together with the model
import joblib
joblib.dump(pipeline, 'model_pipeline.pkl')
loaded = joblib.load('model_pipeline.pkl')
# At inference time the same transformations are applied automatically
9.2 Performance Tips
# Tip 1: keep sparse matrices when possible
ohe = OneHotEncoder()  # sparse_output=True by default; leave it on
# Tip 2: parallelize per-column transforms with n_jobs=-1 (ColumnTransformer supports it)
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(transformers, n_jobs=-1)
# Tip 3: partial_fit for bigger-than-memory data
for chunk in pd.read_csv('big_data.csv', chunksize=10000):
    scaler.partial_fit(chunk[numeric_cols])
# Tip 4: memory-efficient: use float32
import numpy as np
X = X.astype(np.float32)
# Tip 5: DataFrame set_output API (sklearn 1.2+)
from sklearn import set_config
set_config(transform_output='pandas') # Global setting
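Tip 3 can be sanity-checked on synthetic in-memory data: chunked partial_fit learns the same statistics as a single full fit (a sketch, not tied to any real CSV):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(50, 10, size=(1000, 3))

# One full fit vs. ten chunked partial_fit calls
full = StandardScaler().fit(X)
chunked = StandardScaler()
for start in range(0, len(X), 100):
    chunked.partial_fit(X[start:start + 100])

# The incremental statistics match the full-fit statistics
print(np.allclose(full.mean_, chunked.mean_))    # True
print(np.allclose(full.scale_, chunked.scale_))  # True
```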
9.3 Security Tips
# ❌ Never load untrusted pickle files
import pickle
model = pickle.load(open('untrusted_model.pkl', 'rb'))  # dangerous: pickle can run arbitrary code
# ✅ Use joblib AND verify the source (joblib is pickle-based too, so trust still matters)
import joblib
model = joblib.load('trusted_model.pkl')
# In production: save the preprocessing artifacts as well
artifacts = {
    'preprocessor': preprocessor,
    'model': trained_model,
    'feature_names': list(X.columns),
    'version': '1.0.0'
}
joblib.dump(artifacts, 'production_artifacts.pkl')
9.4 Clean Code Patterns
# Pattern 1: keep feature constants in one place
NUMERIC_FEATURES = ['age', 'salary', 'tenure']
CATEGORICAL_FEATURES = ['department', 'city']
ORDINAL_FEATURES = ['education']
ORDINAL_CATEGORIES = [['High School', 'Bachelors', 'Masters', 'PhD']]
# Pattern 2: Factory function
def build_preprocessor():
    return ColumnTransformer([
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), NUMERIC_FEATURES),
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]), CATEGORICAL_FEATURES),
    ])
# Pattern 3: Feature validation
def validate_features(df, required_features):
    missing = set(required_features) - set(df.columns)
    if missing:
        raise ValueError(f"Missing features: {missing}")

validate_features(X_test, NUMERIC_FEATURES + CATEGORICAL_FEATURES)
10. Comparison
10.1 Scalers Comparison
Scaler | Formula | Range | Outlier Sensitive | Best For |
|---|---|---|---|---|
StandardScaler | (x-μ)/σ | (-∞, +∞) | Yes | Normal dist, SVM, LR |
MinMaxScaler | (x-min)/(max-min) | [0, 1] | Very | Neural nets, KNN |
MaxAbsScaler | x/max(abs(x)) | [-1, 1] | Yes | Sparse data |
RobustScaler | (x-median)/IQR | Variable | No | Outlier-heavy data |
Normalizer | x/‖x‖ | [-1, 1] | No | Text, cosine sim |
QuantileTransformer | Percentile-based | [0,1] or N(0,1) | No | Any distribution |
PowerTransformer | Yeo-Johnson/Box-Cox | Variable | No | Skewed data |
10.2 Encoders Comparison
Encoder | Input | Output | Order? | Use Case |
|---|---|---|---|---|
LabelEncoder | 1D | 1D integers | No (arbitrary) | Target variable |
OrdinalEncoder | 2D | 2D integers | Yes (custom) | Ordered categories |
OneHotEncoder | 2D | 2D binary sparse | No | Nominal categories |
LabelBinarizer | 1D | 2D binary | No | Multi-class targets |
MultiLabelBinarizer | List of lists | 2D binary | No | Multi-label classification |
TargetEncoder | 2D | 2D float | No | High-cardinality categories |
10.3 sklearn vs Alternatives
Feature | sklearn | pandas (get_dummies) | category_encoders | feature-engine |
|---|---|---|---|---|
Pipeline integration | ✅ Native | ❌ Manual | ✅ | ✅ |
Fit/transform pattern | ✅ | ❌ | ✅ | ✅ |
Sparse matrix support | ✅ | ❌ | ❌ | ❌ |
Target encoding | ✅ (v1.3) | ❌ | ✅ | ✅ |
Mean encoding | ❌ | Manual | ✅ | ✅ |
Ease of use | Medium | Easy | Medium | Easy |
Production-ready | ✅ | ⚠️ | ✅ | ✅ |
11. Data Science Perspective
11.1 EDA-Driven Preprocessing Decisions
import pandas as pd
import numpy as np
def preprocessing_recommendations(df):
    """Suggest preprocessing automatically by inspecting the data."""
    print("=" * 60)
    print("PREPROCESSING RECOMMENDATIONS")
    print("=" * 60)
    for col in df.columns:
        dtype = df[col].dtype
        nunique = df[col].nunique()
        missing_pct = df[col].isna().mean() * 100
        print(f"\n📊 Column: {col}")
        print(f"   Type: {dtype}, Unique: {nunique}, Missing: {missing_pct:.1f}%")
        if dtype in ['object', 'category']:
            if nunique == 2:
                print("   ✅ Recommended: LabelEncoder (binary)")
            elif nunique <= 10:
                print("   ✅ Recommended: OneHotEncoder (nominal)")
            else:
                print("   ✅ Recommended: TargetEncoder (high cardinality)")
        elif dtype in ['int64', 'float64']:
            skewness = df[col].skew()
            if abs(skewness) > 1:
                print(f"   📈 High skewness: {skewness:.2f}")
                print("   ✅ Recommended: PowerTransformer or QuantileTransformer")
            else:
                print("   ✅ Recommended: StandardScaler")
        if missing_pct > 0:
            print("   ⚠️ Missing values: use SimpleImputer")

# Usage
preprocessing_recommendations(df)
11.2 Feature Importance After Preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd
# Train the pipeline (a Pipeline with steps named 'preprocessor' and 'model')
pipeline.fit(X_train, y_train)
# Extract the feature names after transformation
feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
# Feature importances from the fitted model
rf = pipeline.named_steps['model']
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print(importance_df.head(10))
11.3 Preprocessing in ML Workflow
1. Data Collection
↓
2. EDA (Exploratory Data Analysis)
- Distribution check → Scaler choice
- Missing values check → Imputer strategy
- Cardinality check → Encoder choice
- Correlation analysis
↓
3. Feature Engineering
- PolynomialFeatures
- KBinsDiscretizer
- FunctionTransformer
↓
4. Preprocessing Pipeline Build
- ColumnTransformer
- Pipeline
↓
5. Train/Test Split (fit the preprocessor only AFTER this!)
↓
6. Model Training + Hyperparameter Tuning
↓
7. Evaluation
↓
8. Save Pipeline (preprocessor + model)
↓
9. Production Deployment
11.4 Distribution Analysis for Scaler Selection
import numpy as np
from scipy import stats
def select_scaler(data):
    """Recommend the best scaler based on the distribution."""
    # Normality test
    _, p_value = stats.normaltest(data)
    skewness = stats.skew(data)
    kurtosis = stats.kurtosis(data)
    # Outlier detection via the IQR rule
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    outliers = np.sum((data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr))
    outlier_pct = outliers / len(data) * 100
    print(f"Skewness: {skewness:.3f}")
    print(f"Kurtosis: {kurtosis:.3f}")
    print(f"Outlier %: {outlier_pct:.1f}%")
    print(f"Normality p-value: {p_value:.4f}")
    if outlier_pct > 5:
        print("→ Recommendation: RobustScaler")
    elif abs(skewness) > 1:
        print("→ Recommendation: PowerTransformer or QuantileTransformer")
    elif p_value > 0.05:  # consistent with a normal distribution
        print("→ Recommendation: StandardScaler")
    else:
        print("→ Recommendation: QuantileTransformer (safe choice)")
# Test
salary_data = np.random.lognormal(10, 1, 1000)
select_scaler(salary_data)
12. Interview Questions
🟢 Basic Level
Q1. What is the difference between StandardScaler and MinMaxScaler?
StandardScaler produces mean=0 and std=1 (Z-scores). The output range is not fixed, and it is affected by outliers. Best for normally distributed data.
MinMaxScaler maps values into the [0, 1] range. You get a fixed range, but outliers distort it heavily. Best for neural networks.
Q2. Why should fit_transform() be used only on the training data?
If you also call fit() on the test data, the test set's statistics (mean, std, min, max) leak into training. That is Data Leakage: the model scores unrealistically well during evaluation but fails in production.
Q3. When should you use LabelEncoder vs OneHotEncoder?
LabelEncoder: for the target variable (y), or for features of tree-based models
OneHotEncoder: for nominal categorical features used by linear/distance-based models
LabelEncoder applied to features implies an ordinal relationship that does not exist, which can cause wrong predictions
Q4. What does Normalizer do, and when do you use it?
Normalizer normalizes rows (not columns), so every sample ends up with magnitude 1. It is used in text classification and cosine-similarity calculations.
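A tiny runnable illustration of the row-wise behavior (sample values chosen to make the arithmetic obvious):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Each ROW is scaled to unit L2 norm; columns are untouched
X = np.array([[3.0, 4.0], [1.0, 0.0]])
X_norm = Normalizer(norm='l2').fit_transform(X)

print(X_norm)                           # [[0.6 0.8] [1.  0. ]]
print(np.linalg.norm(X_norm, axis=1))   # every row now has length 1
```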
🟡 Intermediate Level
Q5. How does RobustScaler handle outliers?
RobustScaler uses the median and the IQR (Q3 - Q1). Formula: (x - median) / IQR. Outliers barely move the median and IQR (unlike the mean and std), so it works much better on outlier-heavy data.
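The difference is easy to see side by side. The sample data below is made up (four ordinary values plus one extreme outlier): StandardScaler crushes the normal points into a tiny band, while RobustScaler keeps their spread interpretable.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four ordinary salaries (in thousands) plus one extreme outlier
X = np.array([[30.0], [35.0], [40.0], [45.0], [10000.0]])

std = StandardScaler().fit_transform(X)
rob = RobustScaler().fit_transform(X)

# RobustScaler: median=40, IQR=10, so the normal points stay spread out
print(std[:4].ravel())  # all squashed near -0.5
print(rob[:4].ravel())  # [-1.  -0.5  0.   0.5]
```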
Q6. What does ColumnTransformer do? What does the remainder parameter do?
ColumnTransformer applies different transformers to different columns in a single step. remainder='passthrough' keeps the remaining columns as-is; remainder='drop' removes them. The default is 'drop'.
Q7. What is a sparse matrix, and why is MaxAbsScaler preferred for sparse data?
A sparse matrix is mostly zeros (TF-IDF, one-hot-encoded data). MaxAbsScaler leaves zero values untouched (zeros stay zeros). StandardScaler subtracts the mean, which turns a sparse matrix dense and wastes memory.
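A short check of this claim: MaxAbsScaler accepts a sparse input and returns a sparse output, with zeros preserved (the toy matrix below is just for illustration):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# A mostly-zero matrix stored sparse (like TF-IDF or one-hot output)
X = sparse.csr_matrix([[0.0, 2.0], [0.0, -4.0], [1.0, 0.0]])

X_scaled = MaxAbsScaler().fit_transform(X)
print(sparse.issparse(X_scaled))  # True: zeros stayed zero, matrix stayed sparse
print(X_scaled.toarray())
```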
Q8. With degree=2 and n=10 features, how many features does PolynomialFeatures produce?
Formula: C(n+d, d) = C(12, 2) = 66 features (including the bias term). This is the curse of dimensionality at work: the feature count grows combinatorially with the degree.
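The count can be confirmed directly from the fitted transformer's n_output_features_ attribute:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)  # include_bias=True by default
poly.fit(np.zeros((1, 10)))          # n = 10 input features

# 1 bias + 10 linear + 55 quadratic terms = 66 = C(12, 2)
print(poly.n_output_features_)  # 66
```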
🔴 Advanced Level
Q9. How do you build a custom Transformer that works with a Pipeline?
Inherit from BaseEstimator and TransformerMixin. Learn any needed statistics in fit() and return self. Do the actual transformation in transform(). Also implement get_feature_names_out() for Pipeline compatibility.
Q10. What is TargetEncoder, and when is it better than Label/OneHot encoding?
TargetEncoder replaces each categorical value with the conditional mean of the target for that category. For high-cardinality columns (500+ unique values), OneHotEncoder becomes infeasible. TargetEncoder applies internal K-fold cross-fitting to avoid target leakage.
Q11. When do you need partial_fit(), and which preprocessors support it?
When the data does not fit in memory (big data), you fit incrementally, batch by batch, with partial_fit(). StandardScaler, MinMaxScaler, and MaxAbsScaler support it; PowerTransformer and QuantileTransformer do not.
Q12. What is the set_output API and what is its benefit?
In sklearn 1.2+, set_output(transform='pandas') makes transformers return a DataFrame instead of a numpy array. Column names are preserved, debugging becomes easier, and feature names are tracked automatically.
🎯 Scenario-Based
Q13. What happens in production when a new (unseen) category appears? How do you handle it?
With OneHotEncoder(handle_unknown='ignore'), unseen categories become a zero vector. With handle_unknown='infrequent_if_exist', rare and unseen categories are grouped into a single "infrequent" bin. Inside a Pipeline this handling happens automatically.
Q14. A salary column ranges from 0 to 10 crore and also has outliers. Which scaler do you use and why?
RobustScaler is the best choice because:
The range is wide: StandardScaler/MinMaxScaler would be skewed by the extremes
With outliers present, MinMaxScaler compresses the useful range
RobustScaler uses the median/IQR, which are outlier-resistant
Q15. After the train/test split you need to create a validation set and apply the preprocessor. What is the correct sequence?
# Correct sequence:
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)
# Fit only on train!
preprocessor.fit(X_train)
X_train_t = preprocessor.transform(X_train)
X_val_t = preprocessor.transform(X_val) # Same scaler!
X_test_t = preprocessor.transform(X_test) # Same scaler!
13. Conclusion
Key Learnings Recap
✅ StandardScaler → Normal data, SVM, LR, Neural Nets
✅ MinMaxScaler → Image data, KNN, Neural Nets
✅ RobustScaler → Outlier-heavy data (financial, medical)
✅ MaxAbsScaler → Sparse data, NLP/TF-IDF
✅ QuantileTransformer → Any distribution, most robust
✅ PowerTransformer → Skewed data, linear model assumptions
✅ LabelEncoder → Target variable, tree models
✅ OrdinalEncoder → Ordered categories (size, education)
✅ OneHotEncoder → Nominal categories, linear/distance models
✅ TargetEncoder → High-cardinality categories
✅ PolynomialFeatures → Non-linear feature engineering
✅ KBinsDiscretizer → Continuous → categorical
✅ FunctionTransformer → Apply any custom function
When to Use What: Decision Tree
What type of data is it?
NUMERIC → Are there outliers?
  Yes → RobustScaler
  No → Skewed?
    Yes → PowerTransformer / QuantileTransformer
    No → StandardScaler (normal) / MinMaxScaler (neural nets)
CATEGORICAL → Is there an order?
  Yes → OrdinalEncoder
  No → Cardinality?
    High (>50) → TargetEncoder
    Low (<50) → OneHotEncoder
TARGET VARIABLE → LabelEncoder (multi-class) / as-is (regression)
Final Advice
Look, sklearn.preprocessing is not just about scaling numbers: it is a data engineering toolkit that:
Improves data quality
Dramatically boosts model performance
Speeds up training convergence
Improves generalization
In production, always use a Pipeline: preprocessing and the model become a single unit, deployment gets easier, and data leakage is ruled out.
A good ML Engineer is one who understands the data better than the model, and preprocessing mastery is essential for understanding data. 🚀
📌 Pro Tip: "Garbage in, garbage out": in Machine Learning, if the preprocessing is wrong, even the world's best model will fail. Invest time in preprocessing and good results will follow on their own.
Happy Coding! 🐍
If this blog was helpful, share it and let me know in the comments which concept you found most useful!