1. Introduction
What is sklearn.preprocessing?
If you think machine learning is only about fitting a model, think again. Most of the real work is getting the data ready for the model, and that is the job of sklearn.preprocessing.
sklearn.preprocessing is a Python module in the scikit-learn library. It turns raw, messy, unscaled data into clean, transformed, normalized input for ML models.
Why does it matter?
Real data is never perfect:
Salary column values range from 50,000 to 1,00,00,000
The City column holds strings such as "Mumbai", "Delhi", "Bangalore"
Some columns contain NaN values
Some features are skewed
If you feed this data straight into an ML model, the model:
Becomes biased toward high-value columns (such as salary)
Fails to process string categories
Converges slowly under gradient descent
sklearn.preprocessing addresses these problems; missing values are the one exception, handled by the companion sklearn.impute module.
2. Basic Concepts
2.1 What Is Preprocessing?
Preprocessing means converting data into a form a model can use:
Raw Data → Clean Data → Transformed Data → ML Model
2.2 Core Problem Areas
1. Feature Scaling → Bring numeric features onto comparable ranges
2. Encoding → Convert strings to numbers
3. Normalization → Rescale each row to unit norm
4. Transformation → Change the shape of a distribution
5. Binarization → Convert continuous values to 0/1
6. Imputation → Fill in missing values (SimpleImputer, which lives in sklearn.impute rather than sklearn.preprocessing)
7. Polynomial Features → Generate new features
2.3 Basic Import
# Import the whole module
from sklearn import preprocessing
# Or import specific classes (recommended)
from sklearn.preprocessing import (
StandardScaler,
MinMaxScaler,
MaxAbsScaler,
RobustScaler,
Normalizer,
Binarizer,
LabelEncoder,
OrdinalEncoder,
OneHotEncoder,
LabelBinarizer,
MultiLabelBinarizer,
PolynomialFeatures,
FunctionTransformer,
PowerTransformer,
QuantileTransformer,
KBinsDiscretizer,
)
2.4 Fit-Transform Pattern (Most Important!)
Every sklearn preprocessor follows the same pattern:
# Step 1: Create the object
scaler = StandardScaler()
# Step 2: Fit on the training data (learns the statistics)
scaler.fit(X_train)
# Step 3: Transform
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Reuse the same fitted scaler!
# Or fit and transform in one call (training data only)
X_train_scaled = scaler.fit_transform(X_train)
⚠️ Golden Rule:
Call fit() only on X_train. On X_test, call only transform(). Information from the test data must not leak into training; this is known as data leakage.
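To make the Golden Rule concrete, here is a minimal sketch on made-up toy data (the arrays are assumptions for illustration only). It compares the mean a scaler learns from the training rows alone with the mean it would learn if the test rows were mixed in.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, six training rows, two test rows
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
X_test = np.array([[10.0], [20.0]])

# Correct: statistics come from the training rows only
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)
train_mean = scaler.mean_[0]

# Leaky variant, shown only for comparison: test rows shift the learned mean
leaky_mean = StandardScaler().fit(np.vstack([X_train, X_test])).mean_[0]

print("Mean learned from train only:", train_mean)
print("Mean learned from train + test:", leaky_mean)
```

The two means differ, which is exactly the information that would have leaked from the test set into training.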
3. All Functions and Features
3.1 Feature Scaling (Numerical Data)
🔹 StandardScaler — Z-Score Normalization
Formula: z = (x - mean) / std_dev
Result: Mean = 0, Std Dev = 1
from sklearn.preprocessing import StandardScaler
import numpy as np
data = np.array([[1000, 2], [2000, 4], [3000, 6], [4000, 8]])
scaler = StandardScaler()
scaled = scaler.fit_transform(data)
print("Original:\n", data)
print("\nScaled:\n", scaled)
print("\nMean:", scaler.mean_)
print("Std:", scaler.scale_)
Original:
[[1000 2]
[2000 4]
[3000 6]
[4000 8]]
Scaled:
[[-1.34164079 -1.34164079]
[-0.4472136 -0.4472136 ]
[ 0.4472136 0.4472136 ]
[ 1.34164079 1.34164079]]
Mean: [2500. 5.]
Std: [1118.03398875 2.23606798]
When to use:
Linear Regression, Logistic Regression, SVM, Neural Networks
Data roughly follows a Gaussian distribution
Data has no significant outliers (the mean and standard deviation are sensitive to them)
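The z-score formula can be checked by hand with plain NumPy. This is a sketch of the arithmetic, not the library's implementation; the one detail worth noting is that StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np

data = np.array([[1000, 2], [2000, 4], [3000, 6], [4000, 8]], dtype=float)

mean = data.mean(axis=0)   # column-wise mean
std = data.std(axis=0)     # population standard deviation (ddof=0)
z = (data - mean) / std    # z = (x - mean) / std_dev

print(z)
print("Column means after scaling:", z.mean(axis=0))
print("Column stds after scaling:", z.std(axis=0))
```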
🔹 MinMaxScaler — Range Scaling
Formula: x_scaled = (x - x_min) / (x_max - x_min)
Result: Values in the [0, 1] range (or a custom range)
from sklearn.preprocessing import MinMaxScaler
data = np.array([[100], [200], [300], [400], [500]])
scaler = MinMaxScaler(feature_range=(0, 1)) # Default
scaled = scaler.fit_transform(data)
print(scaled)
# [[0. ], [0.25], [0.5 ], [0.75], [1. ]]
# Custom range: -1 to 1
scaler2 = MinMaxScaler(feature_range=(-1, 1))
scaled2 = scaler2.fit_transform(data)
print(scaled2)
# [[-1. ], [-0.5], [0. ], [0.5 ], [1. ]]
When to use:
Neural Networks (especially image pixel values 0-255 → 0-1)
KNN, K-Means
When you need an exact [0, 1] range
Data has no significant outliers (a single extreme value squeezes everything else into a narrow band)
🔹 MaxAbsScaler — Absolute Maximum Scaling
Formula: x_scaled = x / max(|x|)
Result: Values in the [-1, 1] range (the sign is preserved)
from sklearn.preprocessing import MaxAbsScaler
data = np.array([[-10], [-5], [0], [5], [10]])
scaler = MaxAbsScaler()
scaled = scaler.fit_transform(data)
print(scaled)
# [[-1. ], [-0.5], [0. ], [0.5], [1. ]]
When to use:
Sparse data (mostly zeros): zeros stay zeros, so sparsity is preserved
Text/NLP TF-IDF matrices
Data that is already centered (zero mean)
🔹 RobustScaler — Outlier-Resistant Scaling
Formula: x_scaled = (x - median) / IQR
IQR = Q3 - Q1 (Interquartile Range)
from sklearn.preprocessing import RobustScaler
# Data with an outlier
data = np.array([[1], [2], [3], [4], [100]])  # 100 is the outlier
scaler = RobustScaler()
scaled = scaler.fit_transform(data)
print(scaled)
# [[-1. ], [-0.5 ], [0. ], [0.5 ], [48.5 ]]
print("Median:", scaler.center_) # [3.]
print("IQR:", scaler.scale_) # [2.]
When to use:
Data contains outliers
Financial data (salary, transactions)
Medical data
Real-world messy data
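To see why the median and IQR help, here is a hand-rolled NumPy sketch (an illustration of the formula, not the library code) that scales the same outlier-heavy column both ways.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100 is the outlier

# Robust scaling: center on the median, divide by the interquartile range
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

# Z-score scaling for comparison: the outlier inflates both mean and std
zscore = (x - x.mean()) / x.std()

print("Robust :", robust)
print("Z-score:", np.round(zscore, 3))
```

With z-scores the four ordinary values end up squeezed into a tiny interval, while robust scaling keeps them evenly spread and leaves only the outlier far away.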
🔹 Normalizer — Row-wise Normalization
It normalizes rows, not columns: each sample's values are rescaled so that the row has unit norm.
from sklearn.preprocessing import Normalizer
data = np.array([[3, 4], [1, 2], [6, 8]])
# L2 norm (default): sqrt(sum of squares) = 1
norm_l2 = Normalizer(norm='l2')
print(norm_l2.fit_transform(data))
# [[0.6, 0.8], [0.447, 0.894], [0.6, 0.8]]
# L1 norm: sum of absolute values = 1
norm_l1 = Normalizer(norm='l1')
print(norm_l1.fit_transform(data))
# [[0.429, 0.571], [0.333, 0.667], [0.429, 0.571]]
# Max norm: max value = 1
norm_max = Normalizer(norm='max')
print(norm_max.fit_transform(data))
# [[0.75, 1.0], [0.5, 1.0], [0.75, 1.0]]
When to use:
Text classification (TF-IDF vectors)
Computing cosine similarity
When each sample should be treated independently
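The link to cosine similarity is worth a quick sketch: once every row has unit L2 norm, a plain dot product between two rows is their cosine similarity. The example below uses only NumPy and the same toy rows as above.

```python
import numpy as np

X = np.array([[3.0, 4.0], [1.0, 2.0], [6.0, 8.0]])

# L2-normalize each row by its Euclidean length
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / norms

# Dot products of unit vectors are cosine similarities
cosine = X_unit @ X_unit.T
print(np.round(cosine, 4))
```

Rows 0 and 2 point in the same direction ([6, 8] is just 2 × [3, 4]), so their similarity is 1 even though their magnitudes differ.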
3.2 Encoding (Categorical Data)
🔹 LabelEncoder — Integer Encoding
Converts the categorical values of a single column into integers.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# String labels
cities = ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai', 'Delhi']
encoded = le.fit_transform(cities)
print(encoded) # [2 1 0 2 1]
print(le.classes_) # ['Bangalore' 'Delhi' 'Mumbai']
# Inverse transform
print(le.inverse_transform([2, 1, 0])) # ['Mumbai' 'Delhi' 'Bangalore']
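When to use:
Encoding the target variable (y) ✅
Encoding features for tree-based models (Random Forest, XGBoost): it works, but scikit-learn intends LabelEncoder for targets, so prefer OrdinalEncoder for feature columns ⚠️
Do not use it on features for linear models ❌ (the integers imply an order that does not exist)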
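One practical difference between the two encoders shows up with categories that were not present during fit. The sketch below uses a made-up unseen city ('Pune') and the sentinel value -1, both chosen only for illustration.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

train_cities = ['Mumbai', 'Delhi', 'Bangalore']

# LabelEncoder: an unseen label at transform time raises ValueError
le = LabelEncoder().fit(train_cities)
try:
    le.transform(['Pune'])
    label_encoder_failed = False
except ValueError:
    label_encoder_failed = True

# OrdinalEncoder: unseen categories can be mapped to a sentinel value
oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
oe.fit([[city] for city in train_cities])
encoded_unknown = oe.transform([['Pune']])

print("LabelEncoder raised:", label_encoder_failed)
print("OrdinalEncoder output:", encoded_unknown)
```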
🔹 OrdinalEncoder — Ordered Encoding (Multiple Columns)
Encodes several columns at once and preserves the category order you specify.
from sklearn.preprocessing import OrdinalEncoder
data = np.array([
['Low', 'Small'],
['Medium', 'Medium'],
['High', 'Large'],
['Low', 'Large']
])
encoder = OrdinalEncoder(
categories=[
['Low', 'Medium', 'High'], # Column 1 order
['Small', 'Medium', 'Large'] # Column 2 order
]
)
encoded = encoder.fit_transform(data)
print(encoded)
# [[0. 0.]
# [1. 1.]
# [2. 2.]
# [0. 2.]]
When to use:
Education: 'School' < 'Bachelors' < 'Masters' < 'PhD'
Size: 'Small' < 'Medium' < 'Large'
Rating: 'Poor' < 'Average' < 'Good' < 'Excellent'
🔹 OneHotEncoder — Dummy Variable Encoding
Converts categorical values into binary columns (dummy variables).
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
data = np.array([['Mumbai'], ['Delhi'], ['Bangalore'], ['Mumbai']])
ohe = OneHotEncoder(
    sparse_output=False,      # Return a dense NumPy array
    drop='first',             # Avoid the dummy variable trap
    handle_unknown='ignore'   # Encode unseen categories as all zeros
)
encoded = ohe.fit_transform(data)
print(encoded)
# [[0. 1.] <- Mumbai
# [1. 0.] <- Delhi
# [0. 0.] <- Bangalore (dropped - first)
# [0. 1.]] <- Mumbai
print(ohe.categories_)
print(ohe.get_feature_names_out(['city']))
# ['city_Delhi' 'city_Mumbai']
Key Parameters:
| Parameter | Default | Description |
|---|---|---|
| sparse_output | True | Return a sparse matrix |
| drop | None | 'first' or 'if_binary' |
| handle_unknown | 'error' | 'ignore' or 'infrequent_if_exist' |
| min_frequency | None | Group rare categories together |
| max_categories | None | Cap the number of categories |
When to use:
Categorical features for linear models ✅
Neural networks ✅
When the categories have no natural order ✅
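Here is a small sketch of what handle_unknown='ignore' does in practice. 'Pune' is a made-up category that never appears during fit; the encoder is left on its default sparse output and converted to dense only for printing.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([['Mumbai'], ['Delhi'], ['Bangalore']])
new = np.array([['Delhi'], ['Pune']])  # 'Pune' was never seen during fit

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train)

# Default output is a SciPy sparse matrix; convert to dense for printing
encoded = ohe.transform(new).toarray()
print(ohe.categories_[0])
print(encoded)
```

The unseen category becomes a row of zeros instead of raising an error, which is usually what you want at prediction time.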
🔹 LabelBinarizer — Binary Matrix for Single Column
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
cities = ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai']
encoded = lb.fit_transform(cities)
print(encoded)
# [[0 0 1] <- Mumbai
# [0 1 0] <- Delhi
# [1 0 0] <- Bangalore
# [0 0 1]] <- Mumbai
print(lb.classes_) # ['Bangalore' 'Delhi' 'Mumbai']
🔹 MultiLabelBinarizer — Multiple Labels per Sample
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# Each user can like several genres
genres = [
['Action', 'Drama'],
['Comedy'],
['Action', 'Comedy', 'Romance'],
['Drama', 'Romance']
]
encoded = mlb.fit_transform(genres)
print(mlb.classes_) # ['Action' 'Comedy' 'Drama' 'Romance']
print(encoded)
# [[1 0 1 0]
# [0 1 0 0]
# [1 1 0 1]
# [0 0 1 1]]
3.3 Binarization
🔹 Binarizer — Threshold-based 0/1
from sklearn.preprocessing import Binarizer
data = np.array([[0.1, 0.5, 0.9], [0.2, 0.6, 0.4]])
# Default threshold is 0.0; here we use 0.5
binarizer = Binarizer(threshold=0.5)
result = binarizer.fit_transform(data)
print(result)
# Values strictly greater than the threshold become 1, so 0.5 maps to 0
# [[0. 0. 1.]
# [0. 1. 0.]]
When to use:
Converting probability scores into binary predictions
Image thresholding
Feature engineering
3.4 Distribution Transformation
🔹 PowerTransformer — Making Data More Gaussian
from sklearn.preprocessing import PowerTransformer
data = np.array([[1], [2], [3], [4], [100], [1000]]) # Skewed data
# Yeo-Johnson: handles both positive and negative values
pt_yj = PowerTransformer(method='yeo-johnson')
transformed = pt_yj.fit_transform(data)
print("Yeo-Johnson:\n", transformed)
# Box-Cox: strictly positive values only
pt_bc = PowerTransformer(method='box-cox')
transformed_bc = pt_bc.fit_transform(data)
print("Box-Cox lambda:", pt_bc.lambdas_)
When to use:
Highly skewed numerical data (income, population)
Linear regression, whose inference assumes normally distributed errors
Before t-tests and ANOVA, which assume approximately normal data
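A quick way to check whether the transform helped is to compare skewness before and after. The sketch below uses synthetic log-normal data and a small hand-written skewness helper; both are assumptions made for this illustration.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def skewness(values):
    """Skewness as the third standardized moment."""
    values = np.asarray(values, dtype=float).ravel()
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed data (log-normal), generated only for illustration
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=500).reshape(-1, 1)

pt = PowerTransformer(method='yeo-johnson')
transformed = pt.fit_transform(data)

skew_before = skewness(data)
skew_after = skewness(transformed)
print("Skewness before:", round(skew_before, 3))
print("Skewness after: ", round(skew_after, 3))
```

A skewness closer to zero after the transform suggests the distribution has become more symmetric.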
🔹 QuantileTransformer — Percentile-based Transform
from sklearn.preprocessing import QuantileTransformer
data = np.array([[1], [2], [3], [100], [1000], [50000]])
# n_quantiles above the sample count (6 here) only triggers a warning and gets clipped
# Uniform distribution (0 to 1)
qt_uniform = QuantileTransformer(
    n_quantiles=6,
    output_distribution='uniform',
    random_state=42
)
# Normal distribution
qt_normal = QuantileTransformer(
    n_quantiles=6,
    output_distribution='normal',
    random_state=42
)
print("Uniform:\n", qt_uniform.fit_transform(data))
print("Normal:\n", qt_normal.fit_transform(data))
Advantage: Highly robust to outliers. The transform depends on ranks, so an extreme value lands at the edge of the output range instead of stretching the scale.
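The rank-based behavior is easy to see on the six-value example. In this sketch, n_quantiles is set to the number of samples so that every data point is its own quantile.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

data = np.array([[1], [2], [3], [100], [1000], [50000]], dtype=float)

# n_quantiles matches the number of samples (6) for this tiny dataset
qt = QuantileTransformer(n_quantiles=6, output_distribution='uniform')
ranks = qt.fit_transform(data).ravel()
print(ranks)
```

The outputs are evenly spaced between 0 and 1: the gap between 1000 and 50000 counts for no more than the gap between 1 and 2, because only the ordering matters.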
🔹 FunctionTransformer — Custom Transformation
from sklearn.preprocessing import FunctionTransformer
# Log transformation
log_transformer = FunctionTransformer(
func=np.log1p, # log(x + 1)
inverse_func=np.expm1 # inverse: e^x - 1
)
data = np.array([[0], [1], [10], [100], [1000]])
transformed = log_transformer.fit_transform(data)
print(transformed)
# [[0. ]
# [0.693]
# [2.398]
# [4.615]
# [6.909]]
# Any custom function works the same way
sqrt_transformer = FunctionTransformer(np.sqrt)
3.5 Discretization
🔹 KBinsDiscretizer — Continuous to Categorical
from sklearn.preprocessing import KBinsDiscretizer
data = np.array([[0], [10], [20], [30], [40], [50], [100]])
# Equal-width bins
kbd_width = KBinsDiscretizer(
n_bins=3,
encode='ordinal', # 'onehot', 'onehot-dense', 'ordinal'
strategy='uniform' # 'uniform', 'quantile', 'kmeans'
)
print("Uniform:\n", kbd_width.fit_transform(data))
# Equal-frequency bins
kbd_freq = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
print("Quantile:\n", kbd_freq.fit_transform(data))
print("Bin edges:", kbd_width.bin_edges_)
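For intuition about strategy='uniform', the equal-width binning can be reproduced with NumPy alone. This is a sketch of the idea under the assumption that bins are equal-width between the observed minimum and maximum; it is not the library's code.

```python
import numpy as np

x = np.array([0, 10, 20, 30, 40, 50, 100], dtype=float)
n_bins = 3

# Equal-width edges between the observed min and max
edges = np.linspace(x.min(), x.max(), n_bins + 1)

# Assign bins using the interior edges only, so the maximum value
# falls in the last bin instead of opening a new one
bin_ids = np.digitize(x, edges[1:-1])

print("Edges:", edges)
print("Bins :", bin_ids)
```

Because the single large value (100) stretches the range, four of the seven points share the first bin. That imbalance is the usual reason to consider strategy='quantile' instead.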
3.6 Polynomial Features
🔹 PolynomialFeatures — Non-linear Feature Engineering
from sklearn.preprocessing import PolynomialFeatures
data = np.array([[2, 3]]) # 2 features: a=2, b=3
# Degree 2: 1, a, b, a^2, ab, b^2
poly = PolynomialFeatures(degree=2, include_bias=True)
result = poly.fit_transform(data)
print(result)
# [[ 1. 2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out(['a', 'b']))
# ['1' 'a' 'b' 'a^2' 'a b' 'b^2']
# Interaction only (no powers)
poly_interact = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
result2 = poly_interact.fit_transform(data)
print(result2)
# [[2. 3. 6.]] -> a, b, a*b
4. Intermediate Usage
4.1 Integration with Pipeline
In real projects, preprocessors are used inside a Pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
import pandas as pd
# Sample dataset
df = pd.DataFrame({
'age': [25, 35, 45, 28, 55],
'salary': [30000, 60000, 90000, 40000, 120000],
'city': ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai', 'Delhi'],
'gender': ['M', 'F', 'M', 'F', 'M'],
'churn': [0, 0, 1, 0, 1]
})
X = df.drop('churn', axis=1)
y = df['churn']
# Separate the numeric and categorical columns
numeric_features = ['age', 'salary']
categorical_features = ['city', 'gender']
# Column-specific preprocessors
numeric_transformer = Pipeline(steps=[
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])
# ColumnTransformer: apply different transformers to different columns
preprocessor = ColumnTransformer(transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Final pipeline
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', LogisticRegression())
])
# With a train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
print("Accuracy:", pipeline.score(X_test, y_test))
4.2 ColumnTransformer — Handling Mixed Data Types
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
# Complex real-world scenario
preprocessor = ColumnTransformer(
transformers=[
('scale_num', StandardScaler(), ['age', 'income', 'credit_score']),
('encode_cat', OneHotEncoder(handle_unknown='ignore'), ['city', 'occupation']),
('encode_ord', OrdinalEncoder(categories=[['Low', 'Medium', 'High']]), ['risk_level']),
],
    remainder='passthrough'  # Keep the remaining columns as they are
)
# Extract the feature names
X_transformed = preprocessor.fit_transform(X_train)
feature_names = preprocessor.get_feature_names_out()
print(feature_names)
4.3 Inverse Transform — Recovering the Original Data
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data = np.array([[1000, 2], [2000, 4], [3000, 6]])
scaled = scaler.fit_transform(data)
print("Scaled:\n", scaled)
# Convert back to the original values
original = scaler.inverse_transform(scaled)
print("Recovered:\n", original)
# Same as original data
4.4 Partial Fit — Incremental Learning (Big Data)
For data that does not fit in memory all at once:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit batch by batch; data_batches is assumed to be a generator or list of array chunks
for batch in data_batches:
    scaler.partial_fit(batch)
# Now transform
X_scaled = scaler.transform(X_test)
StandardScaler, MinMaxScaler, and MaxAbsScaler all support partial_fit.
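A runnable sketch of the same idea, using synthetic data split into chunks with np.array_split (the data and chunk count are assumptions for illustration). It also checks that the incremental statistics match a one-shot fit.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a dataset too large to load at once
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))

# Incremental fit over 10 chunks
incremental = StandardScaler()
for batch in np.array_split(X, 10):
    incremental.partial_fit(batch)

# One-shot fit on the full array, for comparison
full = StandardScaler().fit(X)

means_match = np.allclose(incremental.mean_, full.mean_)
vars_match = np.allclose(incremental.var_, full.var_)
print("Means match:", means_match)
print("Variances match:", vars_match)
```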
4.5 Correct Usage with Cross-Validation
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# ✅ CORRECT: keep the scaler inside the Pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
scores = cross_val_score(pipe, X, y, cv=5)
# In each split: fit on the training folds, transform the validation fold
# ❌ WRONG: scale first, then cross-validate
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Data leakage!
scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)
5. Advanced Concepts
5.1 Building a Custom Transformer
Custom preprocessors are built on BaseEstimator and TransformerMixin:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
class LogTransformer(BaseEstimator, TransformerMixin):
    """Custom log(1+x) transformer with clipping."""
    def __init__(self, clip_min=0.0):
        self.clip_min = clip_min
    def fit(self, X, y=None):
        # Learn any statistics here; this transformer only records the feature count
        X = np.asarray(X)
        self.n_features_in_ = X.shape[1] if X.ndim > 1 else 1
        return self  # Always return self
    def transform(self, X, y=None):
        X = np.asarray(X)
        X_clipped = np.clip(X, self.clip_min, None)  # Clip values below clip_min
        return np.log1p(X_clipped)
    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            return np.array([f'log_feature_{i}' for i in range(self.n_features_in_)])
        return np.array([f'log_{name}' for name in input_features])
# Use in pipeline
pipe = Pipeline([
    ('log', LogTransformer(clip_min=0)),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
5.2 TargetEncoder (sklearn 1.3+)
Useful for high-cardinality categorical data:
from sklearn.preprocessing import TargetEncoder
# Suppose the city column has 500+ unique values
te = TargetEncoder(
    target_type='continuous',  # or 'binary', 'multiclass'
    smooth='auto',             # Amount of smoothing toward the global target mean
    cv=5                       # Cross-fitting folds used by fit_transform
)
X_train_enc = te.fit_transform(X_train[['city']], y_train)
X_test_enc = te.transform(X_test[['city']])
5.3 Internal Working of StandardScaler
# What StandardScaler does internally (simplified)
import numpy as np
class MyStandardScaler:
    def fit(self, X):
        self.mean_ = X.mean(axis=0)       # Column-wise mean
        self.var_ = X.var(axis=0)         # Column-wise variance
        self.scale_ = np.sqrt(self.var_)  # Standard deviation
        # The real class also replaces a zero scale with 1 to avoid division by zero
        return self
    def transform(self, X):
        return (X - self.mean_) / self.scale_
    def inverse_transform(self, X_scaled):
        return X_scaled * self.scale_ + self.mean_
5.4 PolynomialFeatures — Combinatorial Explosion
# How many features do degree 2 and degree 3 produce?
from math import comb
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
def count_poly_features(n_features, degree, include_bias=True):
    # The number of monomials of exact degree d in n variables is C(n + d - 1, d)
    total = sum(comb(n_features + d - 1, d) for d in range(1, degree + 1))
    return total + (1 if include_bias else 0)
# n=10 features, degree=2 → 66 features!
# n=10 features, degree=3 → 286 features!
# n=100 features, degree=2 → 5151 features! (curse of dimensionality)
poly = PolynomialFeatures(degree=2)
print(f"100 features → {poly.fit(np.zeros((1, 100))).n_output_features_} features")
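The feature count also has a closed form: with the bias column included, the number of monomials of total degree at most d in n variables is C(n + d, d). The helper below is a hypothetical name used only for this sketch, and it needs nothing beyond the standard library.

```python
from math import comb

def n_poly_features(n_features, degree, include_bias=True):
    """Count monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)  # includes the constant term
    return total if include_bias else total - 1

for n, d in [(2, 2), (10, 2), (10, 3), (100, 2)]:
    print(f"n={n}, degree={d} -> {n_poly_features(n, d)} features")
```

The growth is polynomial in n but gets steep quickly with the degree, so it is worth computing this number before fitting PolynomialFeatures on a wide dataset.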
5.5 Sparse Matrix Optimization
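import numpy as np
from scipy.sparse import issparse
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder
# Illustrative input; any 2D array of categories works
categories_data = np.array([['Mumbai'], ['Delhi'], ['Bangalore'], ['Mumbai']])
# OneHotEncoder returns a sparse matrix by default
ohe = OneHotEncoder()  # sparse_output=True by default
X_sparse = ohe.fit_transform(categories_data)
print("Is sparse:", issparse(X_sparse))  # True
# Density = stored non-zeros / total cells (X_sparse.size would just equal nnz)
n_rows, n_cols = X_sparse.shape
print("Density:", f"{X_sparse.nnz / (n_rows * n_cols):.2%}")
# MaxAbsScaler works with sparse matrices
mas = MaxAbsScaler()
X_sparse_scaled = mas.fit_transform(X_sparse)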
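Why MaxAbsScaler and not StandardScaler for sparse input? Centering would turn every zero into a non-zero value and destroy the sparsity. The sketch below, on a small made-up matrix, shows that MaxAbsScaler keeps the number of stored values unchanged, while StandardScaler with its default centering refuses sparse input.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Small hypothetical sparse matrix: mostly zeros
X_sparse = csr_matrix(np.array([
    [0.0, 4.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, -5.0],
    [1.0, 0.0, 0.0],
]))

# MaxAbsScaler only divides by a per-column constant, so zeros stay zeros
scaled = MaxAbsScaler().fit_transform(X_sparse)
nnz_before, nnz_after = X_sparse.nnz, scaled.nnz

# StandardScaler with centering rejects sparse input unless with_mean=False
try:
    StandardScaler().fit(X_sparse)
    centering_rejected = False
except ValueError:
    centering_rejected = True

print("Stored values before/after:", nnz_before, nnz_after)
print("Centering rejected on sparse input:", centering_rejected)
```

If you do need variance scaling on sparse data, StandardScaler(with_mean=False) is the option that leaves zeros untouched.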
5.6 set_output API (sklearn 1.2+)
from sklearn.preprocessing import StandardScaler
import pandas as pd
X = pd.DataFrame({'age': [25, 35, 45], 'salary': [30000, 60000, 90000]})
# Ask for DataFrame output instead of a NumPy array
scaler = StandardScaler()
scaler.set_output(transform='pandas') # or 'default' for numpy
result = scaler.fit_transform(X)
print(type(result)) # <class 'pandas.core.frame.DataFrame'>
print(result.columns) # ['age', 'salary']
6. Real-World Use Cases
6.1 Customer Churn Prediction Pipeline
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
# Telecom churn dataset simulation
np.random.seed(42)
n = 1000
df = pd.DataFrame({
'tenure_months': np.random.randint(1, 72, n),
'monthly_charges': np.random.uniform(20, 120, n),
'total_charges': np.random.uniform(100, 8000, n),
'contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n),
'payment_method': np.random.choice(['Credit card', 'Bank transfer', 'Mailed check'], n),
'internet_service':np.random.choice(['DSL', 'Fiber optic', 'No'], n),
'tech_support': np.random.choice(['Yes', 'No'], n),
'churn': np.random.choice([0, 1], n, p=[0.73, 0.27])
})
X = df.drop('churn', axis=1)
y = df['churn']
numeric_cols = ['tenure_months', 'monthly_charges', 'total_charges']
ordinal_cols = ['contract']
nominal_cols = ['payment_method', 'internet_service', 'tech_support']
preprocessor = ColumnTransformer([
('num', StandardScaler(), numeric_cols),
('ord', OrdinalEncoder(categories=[['Month-to-month', 'One year', 'Two year']]), ordinal_cols),
('nom', OneHotEncoder(drop='first', handle_unknown='ignore'), nominal_cols),
])
pipeline = Pipeline([
('prep', preprocessor),
('model', RandomForestClassifier(random_state=42))
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
print(f"Churn Prediction Accuracy: {pipeline.score(X_test, y_test):.4f}")
6.2 E-commerce Price Prediction
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
# Pricing model for property listings
# Features: area_sqft, rooms, age, location_score
X = np.array([
[500, 1, 10, 7],
[1000, 2, 5, 8],
[1500, 3, 2, 9],
[800, 2, 15, 6],
])
y = np.array([25000, 60000, 95000, 40000])
pipeline = Pipeline([
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('scaler', StandardScaler()),
('model', Ridge(alpha=1.0))
])
pipeline.fit(X, y)
new_property = np.array([[1200, 2, 8, 7.5]])
predicted_price = pipeline.predict(new_property)
print(f"Predicted Price: ₹{predicted_price[0]:,.0f}")
6.3 NLP Text Classification Preprocessing
from sklearn.preprocessing import LabelEncoder, normalize
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import issparse
texts = [
"best product ever",
"terrible quality waste of money",
"average product nothing special",
"loved it will buy again",
"poor customer service",
]
labels = ['positive', 'negative', 'neutral', 'positive', 'negative']
# Label encoding
le = LabelEncoder()
y = le.fit_transform(labels)  # [2 0 1 2 0]; classes sort as negative, neutral, positive
# TF-IDF features
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)  # Sparse matrix
# L2-normalize rows (TfidfVectorizer already does this by default with norm='l2',
# so this step matters only if that default was changed)
X_normalized = normalize(X, norm='l2')
print("Classes:", le.classes_)
print("Feature shape:", X_normalized.shape)
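It is easy to verify that TfidfVectorizer already returns unit-length rows under its default norm='l2'. The sketch below computes the row norms of the sparse matrix directly, using a shortened copy of the texts above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "best product ever",
    "terrible quality waste of money",
    "average product nothing special",
]

X = TfidfVectorizer().fit_transform(texts)  # default norm='l2'

# Row-wise Euclidean norms of the sparse TF-IDF matrix
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1))).ravel()
print(row_norms)
```

Every norm comes out as 1, so an extra normalize call is only needed when the vectorizer was created with norm=None or norm='l1'.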
6.4 Time Series Feature Engineering
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
import pandas as pd
import numpy as np
# Sales data
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=365)
sales = np.random.exponential(scale=1000, size=365) + \
500 * np.sin(np.linspace(0, 4*np.pi, 365)) # Seasonal pattern
df = pd.DataFrame({'date': dates, 'sales': sales})
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['quarter'] = df['date'].dt.quarter
# Split sales into three buckets (low/medium/high)
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
df['sales_bucket'] = discretizer.fit_transform(df[['sales']]).ravel()
# Scale the continuous feature
scaler = StandardScaler()
df['sales_scaled'] = scaler.fit_transform(df[['sales']]).ravel()
print(df[['date', 'sales', 'sales_scaled', 'sales_bucket']].head(10))
Final Advice
sklearn.preprocessing is about more than scaling numbers. It is a data engineering toolkit that:
Improves data quality
Can substantially improve model performance
Speeds up training
Improves generalization
In production, always use a Pipeline: preprocessing and the model become a single unit, deployment gets easier, and data leakage is avoided.
A good ML engineer understands the data better than the model, and mastering preprocessing is a big part of understanding data.
📌 Pro Tip: "Garbage in, garbage out." If the preprocessing is wrong, even the best model in the world will fail. Invest time in preprocessing and the results will follow.
Happy Coding! 🐍