Introduction
Scikit-learn is a Python library for Machine Learning. It provides ready-made algorithms so you can easily build ML models.
What is it
Python ML library
Built on NumPy and SciPy
Simple and consistent API
Why important
Fast prototyping
Easy to use
Widely used in industry
Real-world usage
Fraud detection
Customer churn prediction
Sales forecasting
Recommendation systems
Basic Concepts
Machine Learning Types
Supervised Learning → labeled data (Regression, Classification)
Unsupervised Learning → no labels (Clustering)
Core Components
Dataset
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
y = data.target
Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Model Training
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
Prediction
y_pred = model.predict(X_test)
Evaluation
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
Complete Example
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
y = data.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy*100:.2f}%")
All Functions and Features
Preprocessing
StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Use: feature scaling (zero mean, unit variance)
MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
X_encoded = encoder.fit_transform(X)  # X should contain categorical columns
SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)
Models
Regression
from sklearn.linear_model import LinearRegression
Classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
Clustering
from sklearn.cluster import KMeans
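All of these estimators share the same fit/predict interface. A minimal sketch contrasting a classifier (which needs y) with a clusterer (which does not); n_clusters=3 is an assumption matching iris's three species:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Classification: learns a mapping from features to labels
clf = RandomForestClassifier().fit(X, y)
print(clf.predict(X[:3]))

# Clustering: groups samples without using y at all
km = KMeans(n_clusters=3, n_init=10)
print(km.fit_predict(X)[:3])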
Evaluation Metrics
| Metric | Use |
|---|---|
| accuracy_score | Overall classification accuracy |
| precision_score | Imbalanced data; when false positives are costly |
| recall_score | When missing positives (false negatives) is costly |
| f1_score | Balance of precision and recall |
| mean_squared_error | Regression error |
| r2_score | Variance explained by a regression model |
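A quick sketch of computing a few of these metrics on made-up labels; average='macro' is one common choice for multi-class problems:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1, 0]  # hypothetical true labels
y_pred = [0, 2, 2, 2, 1, 0]  # hypothetical predictions

# For multi-class data, precision/recall/f1 need an averaging strategy
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred, average='macro'))
print(recall_score(y_true, y_pred, average='macro'))
print(f1_score(y_true, y_pred, average='macro'))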
Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)
Use:
Clean workflow
Avoid data leakage
Production ready
Intermediate Usage
Combining Steps
pipeline = Pipeline([
('imputer', SimpleImputer()),
('scaler', StandardScaler()),
('model', RandomForestClassifier())
])
Feature Selection
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(k=2)  # k must not exceed the number of features (iris has 4)
X_new = selector.fit_transform(X, y)
Edge Cases
Missing values
Outliers
Imbalanced data (see the sketch after this list)
High dimensional data
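For the imbalanced-data case, one common mitigation is class weighting. A minimal sketch on synthetic data; class_weight='balanced' is a standard option on many scikit-learn classifiers:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# class_weight='balanced' reweights classes inversely to their frequency
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X, y)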
Advanced Concepts
Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
params = {'n_estimators': [50, 100], 'max_depth': [3, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(), params)
grid.fit(X_train, y_train)
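Once fitted, the search results can be inspected (continuing the grid object above; best_params_, best_score_, and best_estimator_ are standard GridSearchCV attributes):
# Best hyperparameters and their mean cross-validated score
print(grid.best_params_)
print(grid.best_score_)

# grid.best_estimator_ is the best model refit on the full training set
best_model = grid.best_estimator_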
Cross Validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
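scores is a NumPy array with one score per fold, so it can be summarized directly (continuing the code above):
# Mean and spread across the 5 folds
print(scores.mean(), scores.std())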
Internal Working
fit() → learns model parameters from training data
predict() → produces outputs for new inputs
transform() → applies a learned transformation to data
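The key discipline: fit on training data only, then reuse the learned parameters everywhere else. A minimal sketch with StandardScaler and toy arrays:
from sklearn.preprocessing import StandardScaler
import numpy as np

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit() learns mean/std, then transform() applies them
X_test_scaled = scaler.transform(X_test)        # reuses training statistics; never refit on test data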
Design Pattern
Estimator pattern
Pipeline pattern
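Custom components follow the same estimator pattern, which is what lets them plug into Pipelines. A minimal sketch; the ClipTransformer class and its behavior are made up for illustration:
from sklearn.base import BaseEstimator, TransformerMixin

class ClipTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clips values into a fixed range."""

    def __init__(self, low=0.0, high=1.0):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        # Nothing to learn here; real transformers compute statistics in fit()
        return self

    def transform(self, X):
        return X.clip(self.low, self.high)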
Real-World Use Cases
API
Flask + sklearn model
JSON input → prediction
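A minimal sketch of such an API, assuming a model trained and saved beforehand as model.joblib (the file name and route are assumptions, not a fixed convention):
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical pre-trained model file

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run()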
Data Pipeline
SQL Server → ETL → ML → output
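The read step of such a pipeline might look like this rough sketch; sqlite3 stands in for SQL Server to keep it self-contained, and the table and column names are made up:
import sqlite3
import pandas as pd

# Hypothetical database and table; in production this would be a SQL Server connection
conn = sqlite3.connect("sales.db")
df = pd.read_sql("SELECT feature_1, feature_2, target FROM training_data", conn)

X = df[["feature_1", "feature_2"]]
y = df["target"]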
Production
Batch prediction jobs
Scheduled retraining
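A minimal sketch of a batch prediction job; the file names are assumptions, and in practice a scheduler like cron would run such a script:
import joblib
import pandas as pd

# Load the trained model and the new batch of records (hypothetical files)
model = joblib.load("model.joblib")
batch = pd.read_csv("new_records.csv")

# Score the batch and write results for downstream systems
batch["prediction"] = model.predict(batch)
batch.to_csv("predictions.csv", index=False)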
Practical Example
Problem: House Price Prediction
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
Line-by-line Explanation
Loaded the dataset
Defined the features and target
Split the data into train and test sets
Trained the model
Generated predictions
Real-world mapping:
Real estate companies use this kind of model for price prediction
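The example stops at prediction; a hedged extension to quantify performance, reusing y_test and y_pred from the code above:
from sklearn.metrics import mean_squared_error, r2_score

# Lower MSE and higher R^2 indicate a better fit
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))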
Edge Cases and Errors
Unknown Categories
OneHotEncoder(handle_unknown='ignore')
Overfitting
Cause: model too complex for the data
Fix: regularization, cross-validation
Data Leakage
Information from the test set leaking into training (e.g., fitting a scaler on all data before splitting)
Fix: use a Pipeline and split the data before any fitting
Pro Developer Insights
Best Practices
Always use a Pipeline
Do the train-test split before any preprocessing
Feature engineering matters as much as model choice
Performance Tips
Use vectorized operations
NumPy arrays are faster than Python lists
Clean Code
Wrap repeated steps in functions
Keep the code modular
Security
Only load pickle files from trusted sources (unpickling can execute arbitrary code)
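joblib is the usual way to persist scikit-learn models; a quick sketch, assuming model is a fitted estimator:
import joblib

# Persist a fitted model to disk
joblib.dump(model, "model.joblib")

# Later: load it back, but only from files you produced or trust
model = joblib.load("model.joblib")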
Comparison
| Tool | Use Case |
|---|---|
| Scikit-learn | Traditional ML on structured data |
| TensorFlow | Deep Learning |
| PyTorch | Deep learning research |
| XGBoost | High-performance gradient boosting |
Data Science Perspective
ML Pipeline
Data collection
Cleaning
Feature engineering
Model training
Evaluation
Deployment
Usage
Prediction
Classification
Clustering
Interview Questions
Basic
What is Scikit-learn?
What is the difference between fit() and transform()?
Intermediate
What is a Pipeline?
What is cross-validation used for?
Advanced
How does GridSearchCV work?
How do you avoid overfitting?
Scenario
The dataset has missing values:
Use an imputer (e.g., SimpleImputer)
Build it into a Pipeline
Conclusion
Key Learnings
Scikit-learn is a complete toolkit for traditional ML
Pipelines are the most important concept to master
Evaluation is critical
When to Use
Structured data
Rapid prototyping of ML models
Production pipelines
Final Advice
Practice regularly
Work with real datasets
Build end-to-end ML projects