Scikit-learn Complete Guide: Beginner To Advanced

A complete guide to scikit-learn, from installation to deployment. Classification, regression, clustering, pipelines, hyperparameter tuning, everything in Hindi-Engl...

Introduction

Scikit-learn is a Python library for machine learning. It gives you ready-made algorithms so you can build ML models easily.

What is it

  • Python ML library

  • Built on NumPy and SciPy; works well alongside pandas

  • Simple and consistent API

Why important

  • Fast prototyping

  • Easy to use

  • Widely used in industry

Real-world usage

  • Fraud detection

  • Customer churn prediction

  • Sales forecasting

  • Recommendation systems


Basic Concepts

Machine Learning Types

  • Supervised Learning → labeled data (Regression, Classification)

  • Unsupervised Learning → no labels (Clustering)


Core Components

Dataset

Python
from sklearn.datasets import load_iris

data = load_iris()
X = data.data
y = data.target

Train-Test Split

Python
from sklearn.model_selection import train_test_split

# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Model Training

Python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=200)  # extra iterations so the solver converges on unscaled features
model.fit(X_train, y_train)

Prediction

Python
y_pred = model.predict(X_test)

Evaluation

Python
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred))

Complete Example

Python
from sklearn.datasets import load_iris

data = load_iris()
X = data.data
y = data.target

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy*100:.2f}%")

All Functions and Features

Preprocessing

StandardScaler

StandardScaler standardizes the data so that all features end up on the same scale.
After scaling, each feature has mean 0 and standard deviation 1, so no feature dominates the model just because its raw values are larger.
It is mainly used with distance-based and regression models.

Before vs After Scaling

Before Scaling (Original Data)

Python
Age     Salary
18      20000
25      30000
30      50000
35      80000
40      100000

👉 Problem:

  • Salary is on a much larger scale

  • Age is on a much smaller scale


After StandardScaler

Python
Age     Salary
-1.51   -1.20
-0.60   -0.86
 0.05   -0.20
 0.71    0.80
 1.36    1.46
Python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Use: feature scaling
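
To see what StandardScaler actually computes, here is a minimal sketch that reproduces it by hand with NumPy on the Age/Salary table above. It assumes the scaler's default settings, under which each column is centered on its mean and divided by its population standard deviation. The name X_demo is just an illustrative variable.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary values from the table above
X_demo = np.array([
    [18, 20000],
    [25, 30000],
    [30, 50000],
    [35, 80000],
    [40, 100000],
], dtype=float)

# Manual standardization: z = (x - mean) / std, computed per column
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same transformation done by scikit-learn
scaled = StandardScaler().fit_transform(X_demo)

print(np.round(scaled, 2))
print(np.allclose(manual, scaled))  # True when both approaches agree
```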


MinMaxScaler

MinMaxScaler rescales the data to a fixed range (0 to 1 by default).
The minimum value becomes 0 and the maximum becomes 1; the remaining values are scaled proportionally in between.
This puts all features in the same range, which helps models learn better.

Python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
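
The rescaling can also be checked by hand. The sketch below applies the formula (x - min) / (max - min) to the Age values used elsewhere in this post and compares the result with MinMaxScaler, assuming the default feature range of 0 to 1. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[18.0], [25.0], [30.0], [35.0], [40.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual_mm = (ages - ages.min()) / (ages.max() - ages.min())

# The same transformation done by scikit-learn
scaled_mm = MinMaxScaler().fit_transform(ages)

print(np.round(scaled_mm.ravel(), 3))
print(np.allclose(manual_mm, scaled_mm))  # True when both approaches agree
```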

Python
# Create graphs comparing original, StandardScaler, and MinMaxScaler

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import matplotlib.pyplot as plt

# Original data
data = pd.DataFrame({
    'Age': [18, 25, 30, 35, 40],
    'Salary': [20000, 30000, 50000, 80000, 100000]
})

# StandardScaler
std_scaler = StandardScaler()
std_scaled = std_scaler.fit_transform(data)
std_df = pd.DataFrame(std_scaled, columns=data.columns)

# MinMaxScaler
mm_scaler = MinMaxScaler()
mm_scaled = mm_scaler.fit_transform(data)
mm_df = pd.DataFrame(mm_scaled, columns=data.columns)

# Plot Original Data
# plt.scatter(data['Age'], data['Salary'])
# plt.xlabel('Age')
# plt.ylabel('Salary')
# plt.title('Age vs Salary')
# plt.show()

# # Plot Standard Scaled
# plt.scatter(std_df['Age'], std_df['Salary'])
# plt.xlabel('Age')
# plt.ylabel('Salary')
# plt.title('Age vs Salary')
# plt.show()


# # Plot MinMax Scaled
# plt.scatter(mm_df['Age'], mm_df['Salary'])
# plt.xlabel('Age')
# plt.ylabel('Salary')
# plt.title('Age vs Salary')
# plt.show()


# All three plots side by side
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.scatter(data['Age'], data['Salary'])
plt.xlabel('Age')
plt.ylabel('Salary')
plt.title('Original Data')
plt.subplot(1, 3, 2)
plt.scatter(std_df['Age'], std_df['Salary'])
plt.xlabel('Age')
plt.ylabel('Salary')
plt.title('Standard Scaled')
plt.subplot(1, 3, 3)
plt.scatter(mm_df['Age'], mm_df['Salary'])
plt.xlabel('Age')
plt.ylabel('Salary')
plt.title('MinMax Scaled')
plt.tight_layout()
plt.show()

LabelEncoder

Python
from sklearn.preprocessing import LabelEncoder

y = ['Low', 'Medium', 'High', 'Medium', 'Low']

le = LabelEncoder()
y_encoded = le.fit_transform(y)  # classes are sorted alphabetically: High=0, Low=1, Medium=2
print(y_encoded) # Output: [1 2 0 2 1]
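
LabelEncoder assigns each distinct label an integer code based on sorted order, not on meaning, so 'High' gets 0 here rather than the largest value. A small sketch of how to inspect that mapping and reverse it, using illustrative variable names:

```python
from sklearn.preprocessing import LabelEncoder

labels = ['Low', 'Medium', 'High', 'Medium', 'Low']

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; a label's position is its code
print(encoder.classes_.tolist())

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(codes)
print(decoded.tolist())
```

Because the codes carry no real ordering, LabelEncoder is best suited to target labels.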

OneHotEncoder

OneHotEncoder converts categorical values into binary (0/1) columns. It creates a separate column for each category, so the model gets a clear signal and no artificial order is introduced.

Example (Before Encoding)

Python
import numpy as np
X = [['Red'], ['Blue'], ['Green'], ['Red']]
X = np.array(X)
type(X)
X
# output
# <class 'numpy.ndarray'>
# array([['Red'],
#        ['Blue'],
#        ['Green'],
#        ['Red']], dtype='<U5')

A model cannot understand text directly, so encoding is necessary.

Python
X = [['Red'], ['Blue'], ['Green'], ['Red']]
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')

X_encoded = encoder.fit_transform(X)

print(X_encoded.toarray())
#df = pd.DataFrame(X_encoded.toarray(), columns=encoder.get_feature_names_out())
#df

After Encoding

Python
   x0_Blue  x0_Green  x0_Red
0      0.0       0.0     1.0
1      1.0       0.0     0.0
2      0.0       1.0     0.0
3      0.0       0.0     1.0

Here each color has become its own column, with values of 0 or 1.

handle_unknown='ignore' means that if a new category (such as 'Yellow') appears in the test data, no error is raised; all the encoded columns for that row are simply set to 0.

Use OneHotEncoder when the data has no natural order (for example color, city, or category).

In one line: OneHotEncoder converts categorical data into multiple binary columns without implying any order, which helps the model learn better.


SimpleImputer

SimpleImputer is used to fill in missing values (NaN).
Here strategy='mean' means that the missing values in each column are replaced with that column's average (mean).

Consider an example:

Python
import numpy as np

X = [
    [10],
    [20],
    [np.nan],
    [30]
]

Here one value is missing (NaN).

Code:

Python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X)

print(X_filled)

Output:

Python
[[10.]
 [20.]
 [20.]
 [30.]]

What we learned:

  • Mean = (10 + 20 + 30) / 3 = 20

  • The missing value was replaced with 20

Key points:

  • fit() calculates the mean

  • transform() fills in the missing values

  • fit_transform() does both in one step
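
The split between fit() and transform() matters most when there is separate training and test data. In this minimal sketch (the arrays are made up for illustration), the mean is learned from the training column only and then reused on the test column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[10.0], [20.0], [np.nan], [30.0]])
test = np.array([[np.nan], [100.0]])

imp = SimpleImputer(strategy='mean')
imp.fit(train)            # learns the column mean from the training data only
print(imp.statistics_)    # the learned mean for each column

# transform() reuses the training mean; it does not recompute it from the test data
test_filled = imp.transform(test)
print(test_filled)
```

Calling fit_transform() on the test data instead would recompute the mean from the test set, which is usually not what you want.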

When to use it:

  • When the data has missing values

  • For numeric columns (typically with mean or median)

In one line:
SimpleImputer fills missing values using a rule (such as the mean) so that the model can use the data properly.


Models

1. Regression (Continuous Output)

👉 Used when the output is a number (price, salary, marks)

Example:

  • Predicting a house price

  • Estimating a salary

Code:

Python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

👉 It finds a straight-line relationship between the input and the output
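
The fitted line can be inspected through the coef_ and intercept_ attributes. The sketch below uses made-up toy data that follows y = 2x + 1 exactly, so the model should recover a slope near 2 and an intercept near 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that follows y = 2x + 1 exactly (made up for illustration)
X_line = np.array([[1.0], [2.0], [3.0], [4.0]])
y_line = np.array([3.0, 5.0, 7.0, 9.0])

reg = LinearRegression()
reg.fit(X_line, y_line)

print(reg.coef_[0])     # slope of the fitted line
print(reg.intercept_)   # intercept of the fitted line

# Prediction for an unseen input x = 5
pred = reg.predict(np.array([[5.0]]))
print(pred)
```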


2. Classification (Categories)

👉 Used when the output is a category (Yes/No, 0/1)

Example:

  • Spam vs Not Spam

  • Whether a disease is present or not


Decision Tree Classifier

👉 Makes decisions through a tree structure (if-else logic)

Code:

Python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

👉 Easy to understand, but it can overfit
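
The if-else structure of a fitted tree can be printed with export_text. This sketch uses the iris dataset and caps the tree at max_depth=2. That value is an illustrative choice, not a recommendation, and limiting depth is one common way to reduce overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth=2 keeps the tree small and easy to read
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned rules as indented if-else style text
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```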


Random Forest Classifier

👉 A group (ensemble) of multiple decision trees

Code:

Python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

👉 Usually more accurate and stable than a single tree
👉 Less prone to overfitting


3. Clustering (Unsupervised Learning)

👉 Used when the data has no labels

Example:

  • Grouping customers

  • Identifying similar users


KMeans

👉 Divides the data into K groups (clusters)

Code:

Python
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, n_init=10, random_state=42)  # fixed seed for reproducible clusters
model.fit(X)
labels = model.predict(X)

👉 Similar data points end up in the same cluster
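
A small sketch of this grouping on made-up points: two well-separated blobs are passed to KMeans with n_clusters=2, and each blob should receive a single cluster id. Which blob gets id 0 and which gets id 1 is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of points (made up for illustration)
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
    [8.0, 8.0], [8.5, 9.0], [9.0, 8.0],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = km.fit_predict(points)  # fits the model and returns a cluster id per point

print(cluster_ids)          # first three points share one id, last three share the other
print(km.cluster_centers_)  # one centre per cluster
```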


Final Summary (1 line each)

👉 LinearRegression → predicts a number
👉 DecisionTree → classifies using rules
👉 RandomForest → better predictions from multiple trees
👉 KMeans → groups similar data together


Pipeline

Let's understand Pipeline through a real example, step by step, the way you would use it in a project.

Suppose you have data where you need to predict from Age and Salary whether a user will buy a product or not.

Example data:

Python
import pandas as pd

data = pd.DataFrame({
    'Age': [18, 25, 30, 35, 40],
    'Salary': [20000, 30000, 50000, 80000, 100000],
    'Buy': [0, 0, 1, 1, 1]
})

X = data[['Age', 'Salary']]
y = data['Buy']

Normally you would:

  1. First scale the features

  2. Then train the model

With a Pipeline, both steps run automatically.

Let's build the pipeline:

Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

Now training and prediction:

Python
pipeline.fit(X, y)
y_pred = pipeline.predict(X)

print(y_pred)

What happened:

  • Step 1: StandardScaler scaled Age and Salary

  • Step 2: LogisticRegression trained the model

  • Step 3: At prediction time, the same scaling was applied automatically

You did not have to write the scaling step by hand.
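
To see that the scaling really is reused at prediction time, the sketch below fits the same kind of pipeline and then predicts for two new, unscaled users. The new_users rows are made up for illustration. The fitted scaler is reachable through named_steps.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

train_df = pd.DataFrame({
    'Age': [18, 25, 30, 35, 40],
    'Salary': [20000, 30000, 50000, 80000, 100000],
    'Buy': [0, 0, 1, 1, 1],
})

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
pipe.fit(train_df[['Age', 'Salary']], train_df['Buy'])

# The scaler inside the pipeline remembers the training mean of each column
print(pipe.named_steps['scaler'].mean_)

# New, unscaled rows go straight in; the pipeline scales them before predicting
new_users = pd.DataFrame({'Age': [20, 38], 'Salary': [22000, 90000]})
new_pred = pipe.predict(new_users)
print(new_pred)
```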

Without a pipeline, you would have to write all of this yourself:

Python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = LogisticRegression()
model.fit(X_scaled, y)

y_pred = model.predict(X_scaled)

Compare the two:

  • Pipeline → clean, safe, production ready

  • Manual → repetitive, error-prone

The key idea:

A Pipeline is a fixed machine: you put data in and get the output, and all the steps in between are handled automatically.


Intermediate Usage

Combining Steps

Python
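from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('imputer', SimpleImputer()),   # fills missing values (default strategy: mean)
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])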

Feature Selection

SelectKBest picks the K most useful features from the data so that the model works better and faster.

Python
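from sklearn.feature_selection import SelectKBest, f_classif

# k must not exceed the number of features in X (the iris data has only 4)
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)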
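
To find out which features were kept, get_support() returns a boolean mask over the original columns. This is a minimal sketch on the iris dataset, assuming the ANOVA F-test (f_classif) as the scoring function and k=2 as an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()

# Keep the 2 features with the highest F-score against the target
selector = SelectKBest(score_func=f_classif, k=2)
X_top = selector.fit_transform(iris.data, iris.target)

# get_support() marks the selected columns with True
mask = selector.get_support()
chosen = [name for name, keep in zip(iris.feature_names, mask) if keep]

print(chosen)
print(X_top.shape)
```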

Advanced Concepts

Hyperparameter Tuning

GridSearchCV tries multiple hyperparameter combinations and automatically selects the best model.

Python
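from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

params = {'n_estimators': [50, 100], 'max_depth': [3, 5, 10]}

grid = GridSearchCV(RandomForestClassifier(), params, cv=5)  # 5-fold CV for each of the 6 combinations
grid.fit(X_train, y_train)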
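
Once the search has been fitted, the winning combination and its score are available as attributes. The sketch below is self-contained on the iris dataset; the random_state values are illustrative and only there to make runs repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)           # best combination found across the folds
print(round(search.best_score_, 3))  # mean cross-validated accuracy of that combination

# best_estimator_ is refit on all of X_tr, so it can score the held-out test set
test_accuracy = search.best_estimator_.score(X_te, y_te)
print(test_accuracy)
```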

Cross Validation

Cross-validation tests the model on multiple splits to give a more realistic and reliable estimate of its performance.

Python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
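
cross_val_score returns one score per fold, so the usual way to report it is the mean together with the spread. This is a minimal, self-contained sketch on the iris dataset; the choice of LogisticRegression and max_iter=1000 is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_cv, y_cv = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(clf, X_cv, y_cv, cv=5)  # one accuracy value per fold

print(cv_scores)
print(f"Mean accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```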

Interview Questions

Basic

  • What is scikit-learn?

  • fit vs transform

Intermediate

  • What is a Pipeline?

  • What is cross-validation used for?

Advanced

  • How does GridSearch work?

  • How do you avoid overfitting?

Scenario

The dataset has missing values:

  • Use an imputer

  • Build a pipeline


Conclusion

Key Learnings

  • Scikit-learn is a complete ML toolkit

  • Pipeline is the most important concept

  • Evaluation is critical

When to Use

  • Structured data

  • Fast ML models

  • Production pipelines

Final Advice

  • Practice

  • Use real datasets

  • Build end-to-end ML projects
