Introduction — What's the Problem?
Imagine you have data on thousands of California houses (their location, age, rooms, residents' income) and your job is to predict the median price of a new house.
This is a classic Regression problem. In classification we predict categories (spam/not spam), but in regression we predict a continuous number, such as a house's price.
🎯 Goal
Train a Random Forest Regressor on the California Housing dataset that predicts median_house_value from the given features.
In this post we cover two approaches: first manual preprocessing, then the same work done professionally with a Sklearn Pipeline. At the end we clearly explain how the two differ and which one is better.
Dataset ka Overview
We are using the California Housing dataset. It is based on real-world California census data and has 20,640 rows and 10 columns.
File: housing.csv
Column List:
num_columns = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
cat_columns = ['ocean_proximity']
Categorical Feature (Text)
ocean_proximity
Values: NEAR BAY, INLAND, <1H OCEAN, NEAR OCEAN
🎯 Target Variable (What We Want to Predict)
median_house_value
⚠️
Important: the total_bedrooms column has 207 null values. These must be handled, or the model will throw an error!
Part 1 — Manual Preprocessing + Model Training
First we'll prepare the data manually, step by step. This approach is easier for beginners to understand because each step is clearly separate.
1
Load the Data
Load the CSV file with Pandas and look at the first 2 rows to see what the data looks like.
Python
import pandas as pd
data = pd.read_csv('housing.csv')
data.head(2)
2
Encode the Categorical Column — OneHotEncoding
Machine Learning models understand numbers, not text. So the ocean_proximity (text) column must be converted to numbers, and that is exactly what OneHotEncoder does.
💡 What Is OneHotEncoding?
If ocean_proximity has 4 categories (NEAR BAY, INLAND, etc.), OneHotEncoding creates 4 new columns, one per category. A house gets a 1 in its category's column and 0 everywhere else.
Example: "NEAR BAY" → [1, 0, 0, 0] | "INLAND" → [0, 1, 0, 0]
Python
# Define the categorical columns
cat_columns = ['ocean_proximity']
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoded_cat = encoder.fit_transform(data[cat_columns])
# Convert the encoded (sparse) output to a DataFrame
encoded_cat_df = pd.DataFrame(
encoded_cat.toarray(),
columns=encoder.get_feature_names_out(cat_columns)
)
# Drop the original column, add the encoded columns
data = pd.concat([data.drop(columns=cat_columns), encoded_cat_df], axis=1)
3
Handle Null Values
total_bedrooms has missing values. We fill them with the median, a safe and common approach.
💡 Median vs Mean — Which Is Better?
The mean (average) is affected by outliers. If some block has 10,000 bedrooms (an outlier), the mean shoots up.
The median is the middle value and is not affected by outliers. Outliers are common in housing data, so the median is preferred.
Python
# Fill the null values with the median
data['total_bedrooms'] = data['total_bedrooms'].fillna(
data['total_bedrooms'].median()
)
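To make the difference concrete, here is a small standalone check with made-up bedroom counts: one outlier block drags the mean far away, while the median barely notices.

```python
import numpy as np

# Made-up bedroom counts for a few blocks, with one extreme outlier
bedrooms = np.array([120, 135, 128, 140, 132, 10_000])

print(f"Mean:   {bedrooms.mean():.1f}")      # 1775.8, dragged up by the outlier
print(f"Median: {np.median(bedrooms):.1f}")  # 133.5, still a typical value
```

This is exactly why the median is the standard choice for filling total_bedrooms.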
4
Scale the Numerical Features — StandardScaler
Some features live in very large numbers (population: 10,000+) and some in small ones (median_income: 1–15). This imbalance can confuse the model. StandardScaler brings all values onto the same scale.
💡 What Does StandardScaler Do?
It gives each feature a mean of 0 and a standard deviation of 1.
Formula: z = (x - mean) / std_dev
The benefit: distance-based and gradient-based algorithms work much better when all features are on the same scale. (Tree-based models like Random Forest are largely scale-insensitive, but scaling does no harm and keeps the workflow uniform.)
Python
from sklearn.preprocessing import StandardScaler
num_columns = ['longitude', 'latitude', 'housing_median_age',
'total_rooms', 'total_bedrooms', 'population',
'households', 'median_income']
scaler = StandardScaler()
data[num_columns] = scaler.fit_transform(data[num_columns])
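To convince yourself that z = (x - mean) / std is really what StandardScaler applies, this small sketch (with toy numbers) compares its output to the formula computed by hand. Note that sklearn uses the population standard deviation (ddof=0):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

z_sklearn = StandardScaler().fit_transform(X)
z_manual = (X - X.mean()) / X.std()  # np.std defaults to ddof=0, matching sklearn

print(np.allclose(z_sklearn, z_manual))  # True
```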
5
Train-Test Split Karo
Data ko 80% training aur 20% testing mein baanto. Model training data se seekhega aur testing data par evaluate hoga.
Python
from sklearn.model_selection import train_test_split
X = data.drop(columns=['median_house_value']) # Features
y = data['median_house_value'] # Target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
6
Train the Random Forest Model
Random Forest is a powerful algorithm that builds many decision trees and averages them. This guards against overfitting and gives good accuracy.
🌲 Random Forest — A Simple Explanation
A single decision tree sometimes makes a wrong decision, like a lone judge.
A Random Forest has 100 trees (n_estimators=100): 100 judges. Each tree sees slightly different data and features. The final prediction is their average, like democratic voting!
Real-world analogy: instead of one doctor, getting a second opinion from 100 doctors.
Python
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
📈 Model Evaluation — MAE, MSE, R² Score
Once the model is trained, the most important question is: "How accurate is it?" For regression we use three metrics:
MAE
Mean Absolute Error
The average absolute difference between predicted and actual values. Lower = better. Easy to understand!
MSE
Mean Squared Error
Squares the errors, then averages them. Punishes large errors more heavily. Lower = better.
R² Score
Up to 1 (can be negative)
How much of the target's variance the model explained. 1 = perfect, 0 = no better than predicting the mean. Higher = better.
⚠️ Common Mistake — Using Accuracy
Many beginners try to use accuracy_score() on a regression problem. That is completely wrong!
Accuracy is for classification (spam/not spam). In regression we predict an exact number, so use MAE, MSE, and R².
Python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'MAE: {mae:.2f}')
print(f'MSE: {mse:.2f}')
print(f'R² Score: {r2:.4f}')
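If the metric definitions above feel abstract, here is a quick standalone check (with made-up prices) that computes each metric from its formula and confirms it matches sklearn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([200_000, 300_000, 250_000, 400_000])  # made-up actual prices
y_pred = np.array([210_000, 290_000, 270_000, 380_000])  # made-up predictions

mae = np.mean(np.abs(y_true - y_pred))    # average absolute error
mse = np.mean((y_true - y_pred) ** 2)     # squaring punishes big misses
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                  # share of variance explained

assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
print(mae, mse, r2)  # 15000.0 250000000.0 0.954...
```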
🧠 How to Interpret the R² Score
R² = 0.80 means the model explained 80% of the target variable's variance.
R² = 0.50: an average model. R² = 0.85+: a good model for housing data.
R² = 1.0: perfect (suspicious! check for overfitting).
Negative R²: the model is worse than simply predicting the average.
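The last two points are easy to demonstrate with toy numbers: always predicting the mean gives an R² of exactly 0, and predictions worse than the mean push R² below zero:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100, 200, 300, 400])

# Baseline: always predict the mean of y_true (250) -> R² is exactly 0
print(r2_score(y_true, np.full_like(y_true, 250)))  # 0.0

# Predictions even worse than the mean -> negative R²
print(r2_score(y_true, np.array([400, 100, 400, 100])))  # -3.0
```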
🔮 Prediction on a New Data Point
We're given data for a new house the model has never seen, and now we'll predict its price.
But here's the problem: the new data is raw. We have to manually apply every preprocessing step we used during training.
Python
# Define the new data point
new_data = {
'longitude': -122.23, 'latitude': 37.88,
'housing_median_age': 41,
'total_rooms': 880, 'total_bedrooms': 129,
'population': 322, 'households': 126,
'median_income': 8.3252,
'ocean_proximity': 'NEAR BAY'
}
new_data_df = pd.DataFrame([new_data])
# Step 1: Encode the categorical column (use the SAME encoder!)
encoded_new_cat = encoder.transform(new_data_df[cat_columns])
encoded_new_cat_df = pd.DataFrame(
encoded_new_cat.toarray(),
columns=encoder.get_feature_names_out(cat_columns)
)
# Step 2: Drop the original column, concat the encoded ones
new_data_df = pd.concat(
[new_data_df.drop(columns=cat_columns), encoded_new_cat_df], axis=1
)
# Step 3: Scale (use the SAME scaler!)
new_data_df[num_columns] = scaler.transform(new_data_df[num_columns])
# Step 4: Predict!
predicted_value = model.predict(new_data_df)
print(f'Predicted Price: ${predicted_value[0]:,.0f}')  # dataset prices are in USD
🚨 The Big Problem with the Manual Approach
⚠️
There is a serious problem here!
Look: to predict on new data, we performed 4 separate manual steps: encoder.transform → concat → scaler.transform → model.predict
If you deploy this code to production and miss even one step, the model will give wrong results without raising any error. That is a debugging nightmare!
😰 Problems with the Manual Approach
Human error: a step can get missed (especially scaling or encoding)
Order and alignment matter: the columns must line up exactly the same way at prediction time as they did at training time, or you get wrong results
Objects must be stored separately: the encoder, scaler, and model all have to be saved and loaded individually
Hard to share with a team: another developer may not know all the steps
Repetitive code: the same steps have to be written in both training and prediction
The solution? → Sklearn Pipeline! 🚀
🔧 Part 2 — Sklearn Pipeline (The Professional Way)
A Pipeline is a container that packs all the preprocessing steps and the model into one single object. Fit once, predict once; everything in between happens automatically.
🔢
Imputer
Null → Median
→
📐
Scaler
StandardScaler
→
🔤
Encoder
OneHotEncoder
→
🌲
Model
RandomForest
→
💰
Output
Price Predicted!
Step-by-Step Pipeline Code:
Python — Full Pipeline
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
data = pd.read_csv('housing.csv')
cat_columns = ['ocean_proximity']
num_columns = ['longitude', 'latitude', 'housing_median_age',
'total_rooms', 'total_bedrooms', 'population',
'households', 'median_income']
# Numerical pipeline: null fill → scale
num_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Categorical pipeline: encode
cat_transformer = Pipeline(steps=[
('encoder', OneHotEncoder(handle_unknown='ignore'))
])
# Combine both transformers
preprocessor = ColumnTransformer(transformers=[
('num', num_transformer, num_columns),
('cat', cat_transformer, cat_columns)
])
# Final pipeline: preprocessor + model
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('model', RandomForestRegressor(n_estimators=100, random_state=42))
])
# Train
X = data.drop(columns=['median_house_value'])
y = data['median_house_value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
Evaluation — Same Metrics:
Python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
y_pred = pipeline.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, y_pred):.2f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.2f}")
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
Predicting on New Data — So Easy! ✨
With a Pipeline, predicting on new data is just a 3-step job:
Python
new_data = {
'longitude': -122.23, 'latitude': 37.88,
'housing_median_age': 41,
'total_rooms': 880, 'total_bedrooms': 129,
'population': 322, 'households': 126,
'median_income': 8.3252,
'ocean_proximity': 'NEAR BAY'
}
new_data_df = pd.DataFrame([new_data])
# THAT'S IT! The pipeline handles everything automatically ✨
predicted_value = pipeline.predict(new_data_df)
print(f'Predicted Price: ${predicted_value[0]:,.0f}')
✅ The Pipeline's Magic
We handed it raw data → the pipeline filled the nulls itself → did the encoding → did the scaling → got the prediction from the model.
No manual steps. No human error. Clean code.
⚖️ Manual vs Pipeline — Side-by-Side Comparison
| Feature | Manual Approach | Pipeline Approach |
|---|---|---|
| Code Length | More code, repetitive | Clean, minimal code |
| New Data Prediction | 4+ manual steps, error-prone | 1 line: pipeline.predict() |
| Data Leakage Risk | High (fitting on all data is possible) | Low (safe with CV) |
| Model Save/Load | 3 objects to save | One pipeline object, done! |
| Team Collaboration | Steps must be documented | The pipeline is self-documenting |
| Cross Validation | Complex, handled manually | cross_val_score works directly |
| Production Ready | Risky, needs careful handling | Easy to deploy to production |
| For Beginners | Easy to understand | Needs a little setup, but worth it |
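The cross-validation row deserves a quick demo: with a Pipeline, cross_val_score re-fits the entire chain (scaler and model) inside every fold, so no fold ever sees another fold's statistics. A minimal sketch on synthetic data; the feature names and numbers here are invented purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for housing features (invented for this sketch)
rng = np.random.default_rng(42)
X = pd.DataFrame({'income': rng.uniform(1, 15, 200),
                  'rooms': rng.integers(2, 10, 200)})
y = X['income'] * 50_000 + rng.normal(0, 10_000, 200)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=42))
])

# Each fold fits scaler + model on its own training split: no leakage
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(f"Mean R² across 5 folds: {scores.mean():.3f}")
```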
🛠️ Pro Tips & Common Mistakes
✅ Best Practices
Always use a Pipeline; avoid the manual approach in production or any real project.
Set handle_unknown='ignore' on the OneHotEncoder so an unseen category does not raise an error.
Set random_state=42 for reproducible results (the same output every run).
Save the Pipeline with joblib:
joblib.dump(pipeline, 'model.pkl') puts everything in one file.
Report MAE alongside the R² score; R² alone does not give the full picture.
❌ Common Mistakes
Using fit_transform on test data: this is data leakage! On test data, use only transform().
Fitting a fresh scaler for the test set: a newly fitted scaler will have a different mean/std than the one learned during training.
Using accuracy for regression: completely wrong; use MAE/MSE/R² instead.
Keeping n_estimators too low: 10-20 trees can underfit. Try 100-500.
Ignoring null values: sklearn estimators reject NaNs, so you will hit errors at fit or prediction time.
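The first two mistakes come down to a single rule: fit the preprocessing on the training split only, then reuse those learned statistics on the test split. A minimal sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(100, dtype=float).reshape(-1, 1)  # toy feature column
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train ONLY
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on test

# Wrong would be: scaler.fit_transform(X_test), which learns test statistics (leakage)
print(np.isclose(scaler.mean_[0], X_train.mean()))  # True: stats come from train
```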
💾 Pipeline Save/Load — Production Tip
import joblib
# Save it
joblib.dump(pipeline, 'housing_pipeline.pkl')
# Load it on any machine
loaded_pipeline = joblib.load('housing_pipeline.pkl')
result = loaded_pipeline.predict(new_data_df)
📝 Summary — What Did We Learn?
✓
Data Preprocessing
OneHotEncoding (text → numbers), null filling (median), StandardScaler (scaling): these three are the foundation of any ML project.
✓
Random Forest Regressor
An ensemble of many trees: powerful, accurate, and resistant to overfitting. An excellent choice for housing price prediction.
✓
Regression Metrics
MAE (average error), MSE (punishes large errors), R² (overall fit): together, the three give the full picture of a model.
✓
Sklearn Pipeline
The professional upgrade to the manual approach. Clean code, error-free prediction, easy to save and deploy. Always use a Pipeline in real projects.
🎯 Final Takeaway
For learning, the manual approach is best: each step is clearly visible and understood.
For real projects, production, and team work, the Pipeline is best: safer, cleaner, and more professional.
🏷️ Tags
machine-learning · python · sklearn · random-forest · regression · california-housing · pipeline · onehotencoding · standardscaler · data-preprocessing · pandas · mae-mse-r2 · hinglish · data-science-hindi · ml-beginners · housing-price-prediction · columntransformer · simpleimputer · feature-engineering · train-test-split
Written in Hinglish for Indian Data Science learners 🇮🇳 · Python · Scikit-Learn · Pandas