
sklearn.preprocessing Complete Guide: Data Scaling, Encoding & Transformation in Python

Master sklearn.preprocessing from scratch to advanced level. Learn StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder, and 30+ transformers with real-...

1. Introduction

What is sklearn.preprocessing?

Think machine learning is just fitting a model? Not quite. Most of the real work is getting the data ready for the model, and that is exactly the job of sklearn.preprocessing.

sklearn.preprocessing is a Python module in the scikit-learn library. It turns raw, messy, unscaled data into clean, transformed, normalized features that ML models can use.

Why does it matter?

Real data is never perfect:

  • A salary column ranges from 50,000 to 1,00,00,000 (10 million)

  • A city column holds strings such as "Mumbai", "Delhi", "Bangalore"

  • Some columns contain NaN values

  • Some features are skewed

Feed this data straight into an ML model and the model:

  • Becomes biased toward large-magnitude columns (such as salary), especially if it is distance-based or regularized

  • Fails on string categories, because most estimators expect numeric input

  • Converges slowly under gradient descent

sklearn.preprocessing addresses these problems. Missing values are the one exception: they are handled by the companion module sklearn.impute.
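
To see why a large-magnitude column dominates, here is a minimal sketch that uses only the standard library. The two customers and the min/max ranges are made-up values for illustration, not taken from any real dataset.

```python
import math

# Two hypothetical customers: (age in years, salary in rupees)
customer_a = (25, 50_000)
customer_b = (55, 52_000)

def salary_share(p, q):
    """Fraction of the squared Euclidean distance contributed by the salary axis."""
    age_gap, salary_gap = p[0] - q[0], p[1] - q[1]
    return salary_gap ** 2 / (age_gap ** 2 + salary_gap ** 2)

def min_max(value, low, high):
    """Rescale a value to [0, 1] given an assumed feature range."""
    return (value - low) / (high - low)

def rescale(point):
    # Assumed ranges: age 20-60, salary 50,000-10,000,000 (illustrative only)
    return (min_max(point[0], 20, 60), min_max(point[1], 50_000, 10_000_000))

raw_share = salary_share(customer_a, customer_b)
scaled_share = salary_share(rescale(customer_a), rescale(customer_b))

print(f"Raw distance: {math.dist(customer_a, customer_b):.1f}")
print(f"Salary share of squared distance, raw:    {raw_share:.4f}")
print(f"Salary share of squared distance, scaled: {scaled_share:.6f}")
```

Before scaling, a 30-year age gap is almost invisible next to a 2,000-rupee salary gap. After both features are put on a [0, 1] range, the age gap is what separates the two customers.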


2. Basic Concepts

2.1 What Is Preprocessing?

Preprocessing means converting data into a form a model can use:

Text
Raw Data → Clean Data → Transformed Data → ML Model

2.2 Core Problem Areas

Text
1. Feature Scaling      → Bring numeric features onto comparable ranges
2. Encoding             → Convert strings to numbers
3. Normalization        → Rescale each row to unit norm
4. Transformation       → Change the shape of a distribution
5. Binarization         → Threshold continuous values to 0/1
6. Imputation           → Fill missing values (SimpleImputer, which lives in sklearn.impute)
7. Polynomial Features  → Generate new features

2.3 Basic Import

Python
# Import the whole module
from sklearn import preprocessing

# Or import specific classes (recommended)
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    MaxAbsScaler,
    RobustScaler,
    Normalizer,
    Binarizer,
    LabelEncoder,
    OrdinalEncoder,
    OneHotEncoder,
    LabelBinarizer,
    MultiLabelBinarizer,
    PolynomialFeatures,
    FunctionTransformer,
    PowerTransformer,
    QuantileTransformer,
    KBinsDiscretizer,
)

2.4 Fit-Transform Pattern (Most Important!)

Every sklearn preprocessor follows the same pattern:

Python
# Step 1: Create the object
scaler = StandardScaler()

# Step 2: Fit on the training data (this is where it learns the statistics)
scaler.fit(X_train)

# Step 3: Transform
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)  # Reuse the same fitted scaler!

# Or fit and transform in one call (training data only)
X_train_scaled = scaler.fit_transform(X_train)

⚠️ Golden Rule: call fit() only on X_train, and only transform() on X_test. Information from the test data must not leak into training; that is called data leakage.
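
A small sketch of what the leak looks like in numbers. The arrays are hypothetical: the test set contains one value far outside the training range, so you can see how much the learned mean shifts when test rows are included in fit().

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

# Correct: statistics come from the training rows only
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

# Leaky: statistics are computed from train and test together
leaky = StandardScaler().fit(np.vstack([X_train, X_test]))

print("Train-only mean:", scaler.mean_)  # mean of 1, 2, 3
print("Leaky mean:     ", leaky.mean_)   # pulled upward by the test value
print("Test value scaled with train statistics:", X_test_scaled.ravel())
```

The leaky scaler has already "seen" the test value, so any evaluation built on it is more optimistic than it should be.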


3. All Functions and Features

3.1 Feature Scaling (Numerical Data)

🔹 StandardScaler — Z-Score Normalization

Formula: z = (x - mean) / std_dev

Result: Mean = 0, Std Dev = 1

Python
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[1000, 2], [2000, 4], [3000, 6], [4000, 8]])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

print("Original:\n", data)
print("\nScaled:\n", scaled)
print("\nMean:", scaler.mean_)
print("Std:", scaler.scale_)
Output
Original:
 [[1000    2]
 [2000    4]
 [3000    6]
 [4000    8]]

Scaled:
 [[-1.34164079 -1.34164079]
 [-0.4472136  -0.4472136 ]
 [ 0.4472136   0.4472136 ]
 [ 1.34164079  1.34164079]]

Mean: [2500.    5.]
Std: [1118.03398875    2.23606798]

When to use it:

  • Linear Regression, Logistic Regression, SVM, Neural Networks

  • The data is roughly Gaussian

  • The data has no large outliers (the mean and standard deviation are sensitive to them)


🔹 MinMaxScaler — Range Scaling

Formula: x_scaled = (x - x_min) / (x_max - x_min)

Result: values fall in [0, 1] (or a custom range)

Python
from sklearn.preprocessing import MinMaxScaler

data = np.array([[100], [200], [300], [400], [500]])

scaler = MinMaxScaler(feature_range=(0, 1))  # Default
scaled = scaler.fit_transform(data)
print(scaled)
# [[0.  ], [0.25], [0.5 ], [0.75], [1.  ]]

# Custom range: -1 to 1
scaler2 = MinMaxScaler(feature_range=(-1, 1))
scaled2 = scaler2.fit_transform(data)
print(scaled2)
# [[-1. ], [-0.5], [0.  ], [0.5 ], [1.  ]]

When to use it:

  • Neural Networks (especially image pixel values 0-255 → 0-1)

  • KNN, K-Means

  • When you need an exact [0, 1] range

  • The data has no large outliers (a single extreme value squeezes everything else into a narrow band)
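
One thing the [0, 1] guarantee does not cover is new data outside the range seen during fit. The sketch below uses hypothetical values (50 and 600) to show this, along with the clip parameter of MinMaxScaler, which caps transformed values to the feature range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[100.0], [200.0], [300.0], [400.0], [500.0]])
X_new = np.array([[50.0], [600.0]])  # hypothetical values outside the training range

plain = MinMaxScaler().fit(X_train)
clipped = MinMaxScaler(clip=True).fit(X_train)

plain_out = plain.transform(X_new).ravel()
clipped_out = clipped.transform(X_new).ravel()

print(plain_out)    # [-0.125  1.25 ]
print(clipped_out)  # [0. 1.]
```

Whether clipping is the right call depends on the model; it hides how far out of range a value was.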


🔹 MaxAbsScaler — Absolute Maximum Scaling

Formula: x_scaled = x / max(|x|)

Result: values fall in [-1, 1] (the sign is preserved)

Python
from sklearn.preprocessing import MaxAbsScaler

data = np.array([[-10], [-5], [0], [5], [10]])

scaler = MaxAbsScaler()
scaled = scaler.fit_transform(data)
print(scaled)
# [[-1. ], [-0.5], [0. ], [0.5], [1. ]]

When to use it:

  • Sparse data (mostly zeros): nothing is shifted, so zeros stay zero

  • Text/NLP TF-IDF matrices

  • The data is already centered (zero mean)
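
Here is a rough illustration of the sparse-data point, using a tiny hand-made matrix (the values are arbitrary). It also shows that StandardScaler with default settings rejects sparse input, because centering would turn the zeros into non-zeros.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

X = sparse.csr_matrix(np.array([
    [0.0,  4.0, 0.0],
    [2.0,  0.0, 0.0],
    [0.0, -8.0, 5.0],
]))

X_scaled = MaxAbsScaler().fit_transform(X)
print("Still sparse:", sparse.issparse(X_scaled))
print("Same number of stored values:", X_scaled.nnz == X.nnz)
print(X_scaled.toarray())

# Centering is not possible without destroying sparsity
centering_error = None
try:
    StandardScaler().fit(X)
except ValueError as err:
    centering_error = err
print("StandardScaler on sparse input:", centering_error)
```

If you do need variance scaling on sparse data, StandardScaler(with_mean=False) skips the centering step.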


🔹 RobustScaler — Outlier-Resistant Scaling

Formula: x_scaled = (x - median) / IQR

IQR = Q3 - Q1 (Interquartile Range)

Python
from sklearn.preprocessing import RobustScaler

# Data with an outlier
data = np.array([[1], [2], [3], [4], [100]])  # 100 is the outlier

scaler = RobustScaler()
scaled = scaler.fit_transform(data)
print(scaled)
# [[-1.  ], [-0.5 ], [0.   ], [0.5  ], [48.5 ]]

print("Median:", scaler.center_)   # [3.]
print("IQR:", scaler.scale_)       # [2.]

When to use it:

  • The data contains outliers

  • Financial data (salary, transactions)

  • Medical data

  • Messy real-world data
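
To make "outlier-resistant" concrete, this sketch compares how much room the four ordinary points get after each scaler, using a small array with one outlier (100). It is a toy comparison, not a benchmark.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

data = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(data).ravel()
robust = RobustScaler().fit_transform(data).ravel()

# Spread of the four ordinary points (everything except the outlier)
standard_spread = standard[:4].max() - standard[:4].min()
robust_spread = robust[:4].max() - robust[:4].min()

print("StandardScaler inlier spread:", round(standard_spread, 4))
print("RobustScaler inlier spread:  ", round(robust_spread, 4))
```

With StandardScaler the outlier inflates the standard deviation, so the ordinary points end up squeezed together. RobustScaler keeps them well separated because the median and IQR barely notice the outlier.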


🔹 Normalizer — Row-wise Normalization

It normalizes rows, not columns: each sample's values are rescaled so that the row has unit norm.

Python
from sklearn.preprocessing import Normalizer

data = np.array([[3, 4], [1, 2], [6, 8]])

# L2 norm (default): sqrt(sum of squares) = 1
norm_l2 = Normalizer(norm='l2')
print(norm_l2.fit_transform(data))
# [[0.6, 0.8], [0.447, 0.894], [0.6, 0.8]]

# L1 norm: sum of absolute values = 1
norm_l1 = Normalizer(norm='l1')
print(norm_l1.fit_transform(data))
# [[0.429, 0.571], [0.333, 0.667], [0.429, 0.571]]

# Max norm: max value = 1
norm_max = Normalizer(norm='max')
print(norm_max.fit_transform(data))
# [[0.75, 1.0], [0.5, 1.0], [0.75, 1.0]]

When to use it:

  • Text classification (TF-IDF vectors)

  • You need to compute cosine similarity

  • Each sample can be rescaled independently of the others
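
The cosine-similarity connection can be sketched in a few lines: once every row has L2 norm 1, a plain dot product between two rows is their cosine similarity. The array is the same toy data used above.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

data = np.array([[3.0, 4.0], [1.0, 2.0], [6.0, 8.0]])
unit_rows = Normalizer(norm="l2").fit_transform(data)

# For unit-length rows, the dot product equals the cosine similarity
cosine = unit_rows @ unit_rows.T
print(np.round(cosine, 3))
```

Rows 0 and 2 point in the same direction ([6, 8] is just 2 × [3, 4]), so their similarity comes out as 1 even though their magnitudes differ.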


3.2 Encoding (Categorical Data)

🔹 LabelEncoder — Integer Encoding

Converts the categorical values of a single column into integers. Classes are sorted, so the codes follow alphabetical order.

Python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# String labels
cities = ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai', 'Delhi']
encoded = le.fit_transform(cities)
print(encoded)       # [2 1 0 2 1]
print(le.classes_)   # ['Bangalore' 'Delhi' 'Mumbai']

# Inverse transform
print(le.inverse_transform([2, 1, 0]))  # ['Mumbai' 'Delhi' 'Bangalore']

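When to use it:

  • Encoding the target variable (y) ✅

  • Features for tree-based models (Random Forest, XGBoost): integer codes work, but use OrdinalEncoder on X, since LabelEncoder is designed for 1-D targets

  • Do not use integer codes as features for linear models ❌ (they imply an order that does not exist)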


🔹 OrdinalEncoder — Ordered Encoding (Multiple Columns)

Encodes multiple columns at once. The order is preserved only when you pass it explicitly through categories; by default the categories are sorted alphabetically.

Python
from sklearn.preprocessing import OrdinalEncoder

data = np.array([
    ['Low', 'Small'],
    ['Medium', 'Medium'],
    ['High', 'Large'],
    ['Low', 'Large']
])

encoder = OrdinalEncoder(
    categories=[
        ['Low', 'Medium', 'High'],    # Column 1 order
        ['Small', 'Medium', 'Large']   # Column 2 order
    ]
)

encoded = encoder.fit_transform(data)
print(encoded)
# [[0. 0.]
#  [1. 1.]
#  [2. 2.]
#  [0. 2.]]

When to use it:

  • Education: 'School' < 'Bachelors' < 'Masters' < 'PhD'

  • Size: 'Small' < 'Medium' < 'Large'

  • Rating: 'Poor' < 'Average' < 'Good' < 'Excellent'
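
With a fixed order, a level that was never seen during fit raises an error by default. One way to handle it, sketched here, is handle_unknown="use_encoded_value" with a sentinel code. The level "Critical" and the sentinel -1 are assumptions chosen for the example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X_train = np.array([["Low"], ["Medium"], ["High"]])
X_new = np.array([["Medium"], ["Critical"]])  # "Critical" is a hypothetical unseen level

encoder = OrdinalEncoder(
    categories=[["Low", "Medium", "High"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # sentinel code for anything not in the list above
)
encoder.fit(X_train)

encoded_new = encoder.transform(X_new)
print(encoded_new.ravel())  # "Medium" keeps its code 1, "Critical" gets the sentinel -1
```

The sentinel must not collide with a real code, which is why -1 is a common choice.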


🔹 OneHotEncoder — Dummy Variable Encoding

Converts categorical values into binary columns (dummy variables).

Python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

data = np.array([['Mumbai'], ['Delhi'], ['Bangalore'], ['Mumbai']])

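ohe = OneHotEncoder(
    sparse_output=False,    # Return a dense numpy array (sklearn 1.2+; older versions use sparse=False)
    drop='first',           # Drop the first category to avoid the dummy variable trap
    handle_unknown='ignore' # Encode unseen categories as all zeros instead of raising
)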

encoded = ohe.fit_transform(data)
print(encoded)
# [[0. 1.]    <- Mumbai
#  [1. 0.]    <- Delhi  
#  [0. 0.]    <- Bangalore (dropped - first)
#  [0. 1.]]   <- Mumbai

print(ohe.categories_)
print(ohe.get_feature_names_out(['city']))
# ['city_Delhi' 'city_Mumbai']

Key Parameters:

Parameter        Default   Description
sparse_output    True      Return a sparse matrix
drop             None      'first' or 'if_binary'
handle_unknown   'error'   'ignore' or 'infrequent_if_exist'
min_frequency    None      Group rare categories together
max_categories   None      Cap the number of categories

When to use it:

  • Categorical features for linear models ✅

  • Neural networks ✅

  • Categories with no natural order ✅


🔹 LabelBinarizer — Binary Matrix for Single Column

Python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
cities = ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai']

encoded = lb.fit_transform(cities)
print(encoded)
# [[0 0 1]   <- Mumbai
#  [0 1 0]   <- Delhi
#  [1 0 0]   <- Bangalore
#  [0 0 1]]  <- Mumbai

print(lb.classes_)  # ['Bangalore' 'Delhi' 'Mumbai']

🔹 MultiLabelBinarizer — Multiple Labels per Sample

Python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# One user can like several genres
genres = [
    ['Action', 'Drama'],
    ['Comedy'],
    ['Action', 'Comedy', 'Romance'],
    ['Drama', 'Romance']
]

encoded = mlb.fit_transform(genres)
print(mlb.classes_)  # ['Action' 'Comedy' 'Drama' 'Romance']
print(encoded)
# [[1 0 1 0]
#  [0 1 0 0]
#  [1 1 0 1]
#  [0 0 1 1]]

3.3 Binarization

🔹 Binarizer — Threshold-based 0/1

Python
from sklearn.preprocessing import Binarizer

data = np.array([[0.1, 0.5, 0.9], [0.2, 0.6, 0.4]])

# Values strictly greater than the threshold become 1 (the default threshold is 0.0)
binarizer = Binarizer(threshold=0.5)
result = binarizer.fit_transform(data)
print(result)
# [[0. 0. 1.]
#  [0. 1. 0.]]

When to use it:

  • Converting probability scores into binary predictions

  • Image thresholding

  • Feature engineering


3.4 Distribution Transformation

🔹 PowerTransformer — Making Data More Gaussian

Python
from sklearn.preprocessing import PowerTransformer

data = np.array([[1], [2], [3], [4], [100], [1000]])  # Skewed data

# Yeo-Johnson: handles both positive and negative values
pt_yj = PowerTransformer(method='yeo-johnson')
transformed = pt_yj.fit_transform(data)
print("Yeo-Johnson:\n", transformed)

# Box-Cox: strictly positive values only
pt_bc = PowerTransformer(method='box-cox')
transformed_bc = pt_bc.fit_transform(data)
print("Box-Cox lambda:", pt_bc.lambdas_)

When to use it:

  • Highly skewed numerical data (income, population)

  • Linear models whose residuals look non-normal because a feature or the target is heavily skewed

  • Before t-tests or ANOVA, which assume approximately normal data
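
To check that the transform really reduces skew, the sketch below measures skewness before and after on synthetic log-normal data. The skewness helper is written by hand for the example and the data is randomly generated, so treat the printed numbers as illustrative.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def skewness(values):
    """Sample skewness: the mean of the cubed z-scores."""
    values = np.asarray(values, dtype=float).ravel()
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # synthetic, right-skewed, strictly positive

transformed = PowerTransformer(method="box-cox").fit_transform(skewed)

skew_before = skewness(skewed)
skew_after = skewness(transformed)
print("Skewness before:", round(skew_before, 2))
print("Skewness after: ", round(skew_after, 2))
```

Box-Cox is used here because the data is strictly positive; with zeros or negatives, switch to method="yeo-johnson".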


🔹 QuantileTransformer — Percentile-based Transform

Python
from sklearn.preprocessing import QuantileTransformer

data = np.array([[1], [2], [3], [100], [1000], [50000]])

# Uniform distribution (0 to 1)
# n_quantiles cannot exceed the number of samples; a larger value is reduced with a warning
qt_uniform = QuantileTransformer(
    n_quantiles=6,
    output_distribution='uniform',
    random_state=42
)

# Normal distribution
qt_normal = QuantileTransformer(
    n_quantiles=6,
    output_distribution='normal',
    random_state=42
)

print("Uniform:\n", qt_uniform.fit_transform(data))
print("Normal:\n", qt_normal.fit_transform(data))

Advantage: it is highly robust to outliers. The transform works on ranks, so an extreme value lands at the edge of the output range instead of stretching the whole scale.
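
The rank-based behaviour is easy to see on a tiny example. In this sketch n_quantiles is set to the number of samples, so each value maps to its rank position; the data is the same toy array with a huge outlier (50000).

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

data = np.array([[1.0], [2.0], [3.0], [100.0], [1000.0], [50000.0]])

qt = QuantileTransformer(n_quantiles=len(data), output_distribution="uniform")
ranks = qt.fit_transform(data).ravel()

print(np.round(ranks, 2))  # evenly spaced between 0 and 1
```

The gap between 1000 and 50000 is enormous in the raw data, yet after the transform it is the same size as the gap between 1 and 2.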


🔹 FunctionTransformer — Custom Transformation

Python
from sklearn.preprocessing import FunctionTransformer

# Log transformation
log_transformer = FunctionTransformer(
    func=np.log1p,           # log(x + 1)
    inverse_func=np.expm1    # inverse: e^x - 1
)

data = np.array([[0], [1], [10], [100], [1000]])
transformed = log_transformer.fit_transform(data)
print(transformed)
# [[0.   ]
#  [0.693]
#  [2.398]
#  [4.615]
#  [6.909]]

# Any custom function works
sqrt_transformer = FunctionTransformer(np.sqrt)
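
Supplying inverse_func lets the transformer undo itself. A minimal round-trip sketch with the same log1p/expm1 pair follows; check_inverse=True (the default) makes fit warn if the two functions are not actually inverses on the data.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(
    func=np.log1p,
    inverse_func=np.expm1,
    check_inverse=True,  # warn at fit time if inverse_func does not undo func
)

data = np.array([[0.0], [1.0], [10.0], [100.0], [1000.0]])
transformed = log_transformer.fit_transform(data)
recovered = log_transformer.inverse_transform(transformed)

print(np.allclose(recovered, data))  # True
```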

3.5 Discretization

🔹 KBinsDiscretizer — Continuous to Categorical

Python
from sklearn.preprocessing import KBinsDiscretizer

data = np.array([[0], [10], [20], [30], [40], [50], [100]])

# Equal-width bins
kbd_width = KBinsDiscretizer(
    n_bins=3,
    encode='ordinal',     # 'onehot', 'onehot-dense', 'ordinal'
    strategy='uniform'    # 'uniform', 'quantile', 'kmeans'
)

print("Uniform:\n", kbd_width.fit_transform(data))

# Equal-frequency bins
kbd_freq = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
print("Quantile:\n", kbd_freq.fit_transform(data))

print("Bin edges:", kbd_width.bin_edges_)

3.6 Polynomial Features

🔹 PolynomialFeatures — Non-linear Feature Engineering

Python
from sklearn.preprocessing import PolynomialFeatures

data = np.array([[2, 3]])  # 2 features: a=2, b=3

# Degree 2: 1, a, b, a^2, ab, b^2
poly = PolynomialFeatures(degree=2, include_bias=True)
result = poly.fit_transform(data)
print(result)
# [[ 1.  2.  3.  4.  6.  9.]]

print(poly.get_feature_names_out(['a', 'b']))
# ['1' 'a' 'b' 'a^2' 'a b' 'b^2']

# Interaction only (no powers)
poly_interact = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
result2 = poly_interact.fit_transform(data)
print(result2)
# [[2. 3. 6.]]  -> a, b, a*b

4. Intermediate Usage

4.1 Integration with Pipeline

In real projects, preprocessors live inside a Pipeline:

Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
import pandas as pd

# Sample dataset
df = pd.DataFrame({
    'age':    [25, 35, 45, 28, 55],
    'salary': [30000, 60000, 90000, 40000, 120000],
    'city':   ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai', 'Delhi'],
    'gender': ['M', 'F', 'M', 'F', 'M'],
    'churn':  [0, 0, 1, 0, 1]
})

X = df.drop('churn', axis=1)
y = df['churn']

# Separate the numeric and categorical columns
numeric_features = ['age', 'salary']
categorical_features = ['city', 'gender']

# Column-specific preprocessors
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

# ColumnTransformer: apply different transformers to different columns
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Final pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# With a train/test split (five rows is only enough to show the mechanics)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)
print("Accuracy:", pipeline.score(X_test, y_test))

4.2 ColumnTransformer — Handling Mixed Data Types

Python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

# Complex real-world scenario
preprocessor = ColumnTransformer(
    transformers=[
        ('scale_num', StandardScaler(), ['age', 'income', 'credit_score']),
        ('encode_cat', OneHotEncoder(handle_unknown='ignore'), ['city', 'occupation']),
        ('encode_ord', OrdinalEncoder(categories=[['Low', 'Medium', 'High']]), ['risk_level']),
    ],
    remainder='passthrough'  # Keep the remaining columns as they are
)

# Extract the feature names (X_train is assumed to be a DataFrame with the columns named above)
X_transformed = preprocessor.fit_transform(X_train)
feature_names = preprocessor.get_feature_names_out()
print(feature_names)

4.3 Inverse Transform — Recovering the Original Data

Python
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data = np.array([[1000, 2], [2000, 4], [3000, 6]])

scaled = scaler.fit_transform(data)
print("Scaled:\n", scaled)

# Convert back to the original values
original = scaler.inverse_transform(scaled)
print("Recovered:\n", original)
# Matches the original data (as floats)

4.4 Partial Fit — Incremental Learning (Big Data)

When the data does not fit in memory all at once:

Python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit batch by batch; data_batches is assumed to be a generator or list of array chunks
for batch in data_batches:
    scaler.partial_fit(batch)

# Now transform
X_scaled = scaler.transform(X_test)

StandardScaler, MinMaxScaler, and MaxAbsScaler support partial_fit.
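
A runnable version of the idea, with synthetic data split into ten chunks to stand in for batches read from disk. It checks that the incrementally fitted statistics match a single fit on the full array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))  # synthetic data

# Pretend the data arrives in 10 chunks
incremental = StandardScaler()
for batch in np.array_split(X, 10):
    incremental.partial_fit(batch)

full = StandardScaler().fit(X)

means_match = np.allclose(incremental.mean_, full.mean_)
variances_match = np.allclose(incremental.var_, full.var_)
print("Means match:", means_match)
print("Variances match:", variances_match)
```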


4.5 Correct Usage with Cross-Validation

Python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ✅ CORRECT: keep the scaler inside the Pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

scores = cross_val_score(pipe, X, y, cv=5)
# Each split fits the scaler on the training folds only, then transforms the validation fold


# ❌ WRONG: scaling first, then running CV
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Data leakage: validation rows influence the statistics
scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

5. Advanced Concepts

5.1 Building a Custom Transformer

Build a custom preprocessor by subclassing BaseEstimator and TransformerMixin:

Python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

class LogTransformer(BaseEstimator, TransformerMixin):
    """Custom log(1+x) transformer with clipping."""
    
    def __init__(self, clip_min=0.0):
        self.clip_min = clip_min
    
    def fit(self, X, y=None):
        # Learn any statistics here; this transformer only records the feature count
        self.n_features_in_ = X.shape[1] if hasattr(X, 'shape') else 1
        return self  # Always return self
    
    def transform(self, X, y=None):
        X = np.array(X)
        X_clipped = np.clip(X, self.clip_min, None)  # Clip values below clip_min
        return np.log1p(X_clipped)
    
    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            return [f'log_feature_{i}' for i in range(self.n_features_in_)]
        return [f'log_{name}' for name in input_features]

# Use in pipeline
pipe = Pipeline([
    ('log', LogTransformer(clip_min=0)),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

5.2 TargetEncoder (sklearn 1.3+)

For high-cardinality categorical data:

Python
from sklearn.preprocessing import TargetEncoder

# Suppose the city column has 500+ unique values
te = TargetEncoder(
    target_type='continuous',  # or 'binary'; 'multiclass' needs sklearn 1.4+
    smooth='auto',              # Amount of smoothing toward the global target mean
    cv=5                        # Folds used for cross fitting inside fit_transform
)

# fit_transform uses cross fitting, so it is not the same as fit(...).transform(...)
X_train_enc = te.fit_transform(X_train[['city']], y_train)
X_test_enc = te.transform(X_test[['city']])

5.3 Internal Working of StandardScaler

Python
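# Roughly what StandardScaler does internally
# (the real class also handles zero-variance columns, sparse input, and partial_fit)
import numpy as np
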
class MyStandardScaler:
    def fit(self, X):
        self.mean_ = X.mean(axis=0)     # Column-wise mean
        self.var_  = X.var(axis=0)       # Column-wise variance
        self.scale_ = np.sqrt(self.var_) # Standard deviation
        return self
    
    def transform(self, X):
        return (X - self.mean_) / self.scale_
    
    def inverse_transform(self, X_scaled):
        return X_scaled * self.scale_ + self.mean_
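
One detail the simplified class above leaves out is a column with zero variance, where dividing by the standard deviation would produce NaNs. The short sketch below, on a made-up array with one constant column, shows how the real StandardScaler behaves in that case.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 7.0], [2.0, 7.0], [3.0, 7.0]])  # second column is constant

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

print("scale_:", scaler.scale_)  # the constant column gets a scale of 1.0, not 0.0
print(X_scaled)                  # the constant column becomes zeros, with no NaNs
```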

5.4 PolynomialFeatures — Combinatorial Explosion

Python
# How many output features does PolynomialFeatures produce?
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

def count_poly_features(n_features, degree, include_bias=True):
    # Monomials of exact degree d in n variables: C(n + d - 1, d)
    total = sum(comb(n_features + d - 1, d) for d in range(1, degree + 1))
    return total + (1 if include_bias else 0)

# n=10 features, degree=2 → 66 features!
# n=10 features, degree=3 → 286 features!
# n=100 features, degree=2 → 5151 features! (curse of dimensionality)

poly = PolynomialFeatures(degree=2)
print(f"100 features → {poly.fit(np.zeros((1, 100))).n_output_features_} features")
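
The counts quoted above can be cross-checked with a closed form: with the bias column included, the number of polynomial terms up to a given degree in n variables is C(n + degree, degree). The sketch below compares that against a degree-by-degree sum, using only the standard library. The function names are made up for this example.

```python
from math import comb

def count_by_degree(n_features, degree):
    """Sum the monomials of each exact degree from 0 (the bias) up to `degree`."""
    return sum(comb(n_features + d - 1, d) for d in range(degree + 1))

def count_closed_form(n_features, degree):
    """Closed form for the same count, bias column included."""
    return comb(n_features + degree, degree)

for n, d in [(2, 2), (10, 2), (10, 3), (100, 2)]:
    print(f"n={n}, degree={d}: {count_by_degree(n, d)} == {count_closed_form(n, d)}")
```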

5.5 Sparse Matrix Optimization

Python
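import numpy as np
from scipy.sparse import issparse
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder

# Example input (illustrative); any 2-D array of categories works
categories_data = np.array([['Mumbai'], ['Delhi'], ['Bangalore'], ['Mumbai']])

# OneHotEncoder returns a sparse matrix by default
ohe = OneHotEncoder()  # sparse_output=True by default
X_sparse = ohe.fit_transform(categories_data)

print("Is sparse:", issparse(X_sparse))  # True
# Density = stored non-zeros / total cells (X_sparse.size would just repeat nnz)
density = X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1])
print("Density:", f"{density:.2%}")

# MaxAbsScaler works with sparse matrices
mas = MaxAbsScaler()
X_sparse_scaled = mas.fit_transform(X_sparse)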

5.6 set_output API (sklearn 1.2+)

Python
from sklearn.preprocessing import StandardScaler
import pandas as pd

X = pd.DataFrame({'age': [25, 35, 45], 'salary': [30000, 60000, 90000]})

# Ask for a DataFrame instead of a numpy array
scaler = StandardScaler()
scaler.set_output(transform='pandas')  # or 'default' for numpy

result = scaler.fit_transform(X)
print(type(result))   # <class 'pandas.core.frame.DataFrame'>
print(result.columns) # Index(['age', 'salary'], dtype='object')

6. Real-World Use Cases

6.1 Customer Churn Prediction Pipeline

Python
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Simulated telecom churn dataset; the labels are random, so expect accuracy near the majority-class rate
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'tenure_months':  np.random.randint(1, 72, n),
    'monthly_charges': np.random.uniform(20, 120, n),
    'total_charges':   np.random.uniform(100, 8000, n),
    'contract':        np.random.choice(['Month-to-month', 'One year', 'Two year'], n),
    'payment_method':  np.random.choice(['Credit card', 'Bank transfer', 'Mailed check'], n),
    'internet_service':np.random.choice(['DSL', 'Fiber optic', 'No'], n),
    'tech_support':    np.random.choice(['Yes', 'No'], n),
    'churn':           np.random.choice([0, 1], n, p=[0.73, 0.27])
})

X = df.drop('churn', axis=1)
y = df['churn']

numeric_cols = ['tenure_months', 'monthly_charges', 'total_charges']
ordinal_cols = ['contract']
nominal_cols = ['payment_method', 'internet_service', 'tech_support']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('ord', OrdinalEncoder(categories=[['Month-to-month', 'One year', 'Two year']]), ordinal_cols),
    ('nom', OneHotEncoder(drop='first', handle_unknown='ignore'), nominal_cols),
])

pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier(random_state=42))
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)
print(f"Churn Prediction Accuracy: {pipeline.score(X_test, y_test):.4f}")

6.2 Property Price Prediction

Python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

# Property pricing model
# Features: area_sqft, rooms, age, location_score
X = np.array([
    [500, 1, 10, 7],
    [1000, 2, 5, 8],
    [1500, 3, 2, 9],
    [800, 2, 15, 6],
])
y = np.array([25000, 60000, 95000, 40000])

pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])

pipeline.fit(X, y)
new_property = np.array([[1200, 2, 8, 7.5]])
predicted_price = pipeline.predict(new_property)
print(f"Predicted Price: ₹{predicted_price[0]:,.0f}")

6.3 NLP Text Classification Preprocessing

Python
from sklearn.preprocessing import LabelEncoder, normalize
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "best product ever",
    "terrible quality waste of money",
    "average product nothing special",
    "loved it will buy again",
    "poor customer service",
]
labels = ['positive', 'negative', 'neutral', 'positive', 'negative']

# Label encoding (classes are sorted: negative=0, neutral=1, positive=2)
le = LabelEncoder()
y = le.fit_transform(labels)  # [2, 0, 1, 2, 0]

# TF-IDF (TfidfVectorizer already L2-normalizes each row by default)
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)  # Sparse matrix

# Explicit L2 row normalization, shown for illustration;
# it changes nothing after the default TF-IDF settings
X_normalized = normalize(X, norm='l2')

print("Classes:", le.classes_)
print("Feature shape:", X_normalized.shape)

6.4 Time Series Feature Engineering

Python
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
import pandas as pd
import numpy as np

# Sales data
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=365)
sales = np.random.exponential(scale=1000, size=365) + \
        500 * np.sin(np.linspace(0, 4*np.pi, 365))  # Seasonal pattern

df = pd.DataFrame({'date': dates, 'sales': sales})
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['quarter'] = df['date'].dt.quarter

# Split sales into three buckets (0 = low, 1 = medium, 2 = high)
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
df['sales_bucket'] = discretizer.fit_transform(df[['sales']]).ravel()

# Scale the continuous feature
scaler = StandardScaler()
df['sales_scaled'] = scaler.fit_transform(df[['sales']]).ravel()

print(df[['date', 'sales', 'sales_scaled', 'sales_bucket']].head(10))

Final Advice

sklearn.preprocessing is not just about scaling numbers. It is a data engineering toolkit that:

  1. Improves data quality

  2. Can substantially improve model performance, especially for scale-sensitive models

  3. Speeds up training for gradient-based models

  4. Helps generalization

In production, always use a Pipeline: preprocessing and the model become a single unit, deployment gets easier, and the preprocessing steps cannot leak test data.
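
As a rough sketch of the "single unit" idea, the fitted pipeline below is serialized and restored with the standard library's pickle, and the restored copy scales and predicts in one call. The dataset is a tiny made-up array, and in practice the bytes would be written to a file and loaded by a matching scikit-learn version.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic dataset (age, salary), for illustration only
X = np.array([[25, 30000], [35, 60000], [45, 90000],
              [28, 40000], [55, 120000], [50, 110000]], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1])

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X, y)

# Serialize the fitted pipeline, scaler statistics included
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

new_rows = np.array([[30, 45000], [52, 100000]], dtype=float)
print(restored.predict(new_rows))  # raw rows in, predictions out
```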

A good ML engineer understands the data better than the model, and understanding the data takes preprocessing mastery.


📌 Pro Tip: "Garbage in, garbage out". If the preprocessing is wrong, even the best model in the world will fail. Spend time on preprocessing and the results will follow.


Happy Coding! 🐍

If this post helped, share it and tell us in the comments which concept you found most useful!
