Series Overview:
Part 1: Fundamentals — Series, DataFrames, I/O, Basic Operations ✅
Part 2 (This Post): Data Manipulation — Indexing, Filtering, GroupBy, Merge, Reshape
Part 3: Advanced Pandas — Performance, Time Series, ML Pipelines, Interview Questions
1. Indexing and Selection — The Full Picture
Pandas has two primary indexers for selecting data:
.loc — label-based (use index labels and column names)
.iloc — integer position-based (use 0, 1, 2... positions)
This is one of the most confusing topics in Pandas. Let's master it completely.
Setting Up Our Example DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({
"name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
"dept": ["HR", "IT", "Finance", "IT", "HR"],
"age": [25, 30, 35, 28, 32],
"salary": [50000, 65000, 75000, 60000, 55000],
"score": [88.5, 92.0, 78.3, 95.1, 84.7]
}, index=["E001", "E002", "E003", "E004", "E005"])
print(df)
Output:
name dept age salary score
E001 Alice HR 25 50000 88.5
E002 Bob IT 30 65000 92.0
E003 Charlie Finance 35 75000 78.3
E004 Diana IT 28 60000 95.1
E005 Eve HR 32 55000 84.7
.loc — Label-Based Indexing
# ── Single row by label ───────────────────────────────────────────────
df.loc["E002"] # returns Series (one row)
df.loc[["E002"]] # returns DataFrame (one-row DataFrame)
# ── Multiple rows by label ────────────────────────────────────────────
df.loc[["E001", "E003", "E005"]]
# ── Row range (INCLUSIVE on both ends!) ──────────────────────────────
df.loc["E001":"E003"] # E001, E002, E003 — includes E003!
# ── Row + Column ──────────────────────────────────────────────────────
df.loc["E002", "salary"] # single value → scalar
df.loc["E002", ["name", "salary"]] # row + multiple cols → Series
df.loc[["E002", "E004"], ["name", "salary"]] # multi rows + cols → DataFrame
# ── Column range ──────────────────────────────────────────────────────
df.loc[:, "age":"salary"] # all rows, columns from age to salary (inclusive)
df.loc[:, ["name", "score"]] # specific columns
# ── Boolean indexing with .loc ────────────────────────────────────────
df.loc[df["age"] > 30]
df.loc[df["dept"] == "IT", "salary"] # IT employees' salaries only
df.loc[(df["dept"] == "IT") & (df["age"] < 32), :] # complex condition
# ── Setting values with .loc ──────────────────────────────────────────
df.loc["E001", "salary"] = 52000 # update single cell
df.loc[df["dept"] == "HR", "salary"] = df.loc[df["dept"] == "HR", "salary"] * 1.05
.iloc — Integer Position-Based Indexing
# ── Single row by position ────────────────────────────────────────────
df.iloc[0] # first row
df.iloc[-1] # last row
df.iloc[[0]] # first row as DataFrame
# ── Multiple rows ─────────────────────────────────────────────────────
df.iloc[[0, 2, 4]] # rows at positions 0, 2, 4
df.iloc[1:4] # rows 1, 2, 3 (EXCLUSIVE end, like Python slicing)
# ── Row + Column ──────────────────────────────────────────────────────
df.iloc[0, 2] # row 0, column 2 → scalar
df.iloc[0, [0, 3]] # row 0, columns 0 and 3
df.iloc[:, 2] # all rows, column at position 2
# ── Slicing both axes ─────────────────────────────────────────────────
df.iloc[0:3, 1:4] # rows 0-2, columns 1-3
df.iloc[-3:, -2:] # last 3 rows, last 2 columns
# ── Setting values with .iloc ─────────────────────────────────────────
df.iloc[0, 3] = 52000 # set row 0, column 3 to 52000
.loc vs .iloc — Side-by-Side
| | .loc | .iloc |
|---|---|---|
| Based on | Labels (index/column names) | Integer positions (0, 1, 2...) |
| End in slice | Inclusive | Exclusive (like Python) |
| Works with | String/datetime/custom index | Always works |
| Use when | You know the labels | You know the positions |
# ⚠️ Key difference: slicing endpoint!
df.loc["E001":"E003"] # includes E003
df.iloc[0:3] # excludes row at position 3 (only 0,1,2)
Direct Column and Row Access (Quick Reference)
# Column access
df["salary"] # Series
df[["name", "salary"]] # DataFrame
# Row access (only works with default integer index!)
df[0:3] # first 3 rows (slicing only — no scalar!)
# ⚠️ df[0] is NOT valid for row access unless index is integer
# Always prefer df.iloc[0] or df.loc["E001"] for explicit row access
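To see why explicit indexers matter, here's a small sketch (hypothetical data) where the index is itself made of integers — the one case where labels and positions can silently disagree:

```python
import pandas as pd

# With an integer index, label 10 and position 0 point at the same row,
# but for different reasons — plain [] would be ambiguous here.
s = pd.Series(["a", "b", "c"], index=[10, 20, 30])

print(s.loc[10])   # label 10 → "a"
print(s.iloc[0])   # position 0 → "a"
# s[0] raises KeyError in pandas 2.x: 0 is not a label,
# and plain [] on a Series is label-based.
```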
at and iat — Fast Scalar Access
# When you need a SINGLE VALUE — use at/iat (faster than loc/iloc for scalars)
df.at["E002", "salary"] # label-based scalar → 65000
df.iat[1, 3] # position-based scalar → 65000
# Setting
df.at["E002", "salary"] = 67000
2. 🔎 Filtering and Boolean Indexing
Basic Boolean Filtering
# Single condition
df[df["age"] > 30]
df[df["dept"] == "IT"]
df[df["salary"] >= 60000]
df[df["name"].str.startswith("A")]
# Multiple conditions — use & (and), | (or), ~ (not)
# ⚠️ MUST use parentheses around each condition!
df[(df["dept"] == "IT") & (df["age"] > 25)] # both true
df[(df["dept"] == "IT") | (df["dept"] == "HR")] # either true
df[~(df["dept"] == "Finance")] # NOT Finance
df[(df["salary"] > 55000) & (df["score"] > 85)] # salary AND score
# Common mistake — Python 'and'/'or' won't work!
# ❌ df[df["age"] > 25 and df["salary"] > 60000] → ValueError
# ✅ df[(df["age"] > 25) & (df["salary"] > 60000)]
.isin() — Match Against a List
# Filter rows where value is in a list
df[df["dept"].isin(["IT", "HR"])]
# Negate with ~
df[~df["dept"].isin(["Finance"])]
# Works on any type
df[df["age"].isin([25, 30, 35])]
.between() — Range Filtering
# Inclusive on both ends by default
df[df["salary"].between(55000, 70000)]
df[df["score"].between(80, 95)]
# Exclusive bounds
df[df["age"].between(25, 32, inclusive="left")] # 25 <= age < 32
df[df["age"].between(25, 32, inclusive="neither")] # 25 < age < 32
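`.between()` is just sugar over two chained comparisons — a quick sketch (hypothetical ages) confirming the equivalence:

```python
import pandas as pd

ages = pd.Series([25, 28, 30, 32, 35])

# between(..., inclusive="both") is shorthand for two chained comparisons
via_between = ages.between(25, 32)
via_compare = (ages >= 25) & (ages <= 32)

print(via_between.equals(via_compare))  # True
```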
.query() — SQL-like Filtering (Readable!)
# query() evaluates a string expression — very readable
df.query("age > 30")
df.query("dept == 'IT'")
df.query("salary >= 60000 and score > 85")
df.query("dept in ['IT', 'HR'] and age < 32")
df.query("dept != 'Finance'")
# Reference external variables with @
min_salary = 60000
max_age = 32
df.query("salary >= @min_salary and age <= @max_age")
# Column names with spaces — use backticks
df.query("`annual salary` > 60000")
String Filtering with .str
s = df["name"]
# Contains — case=False makes it case-insensitive; na=False treats NaN as no match
df[df["name"].str.contains("li", case=False, na=False)]
# Starts/ends with
df[df["name"].str.startswith("A")]
df[df["name"].str.endswith("e")]
# Regex matching
df[df["name"].str.match(r"^[A-C]")] # names starting with A, B, or C
df[df["name"].str.fullmatch(r"[A-Za-z]{3,6}")] # 3-6 letter names
# Length
df[df["name"].str.len() > 5]
3. 🧹 Handling Missing Data — Complete Guide
Detecting Missing Values
df = pd.DataFrame({
"name": ["Alice", None, "Charlie", "Diana", "Eve"],
"age": [25, 30, np.nan, 28, 32],
"salary": [50000, np.nan, 75000, np.nan, 55000],
"dept": ["HR", "IT", "Finance", None, "HR"]
})
# ── Detection ─────────────────────────────────────────────────────────
df.isnull() # DataFrame of True/False
df.isna() # alias for isnull()
df.notnull() # opposite
df.isnull().sum() # count NaN per column
df.isnull().sum() / len(df) * 100 # percentage missing per column
df.isnull().any() # True if ANY NaN in column
df.isnull().any(axis=1) # True if ANY NaN in row
# Total missing
print(f"Total missing: {df.isnull().sum().sum()}")
print(f"Rows with any NaN: {df.isnull().any(axis=1).sum()}")
Dropping Missing Values
# ── Drop rows ─────────────────────────────────────────────────────────
df.dropna() # drop rows with ANY NaN
df.dropna(how="all") # drop rows where ALL values are NaN
df.dropna(subset=["salary"]) # drop rows where salary is NaN
df.dropna(subset=["salary", "age"]) # drop where salary OR age is NaN
df.dropna(thresh=3) # keep rows with at least 3 non-NaN values
# ── Drop columns ──────────────────────────────────────────────────────
df.dropna(axis=1) # drop columns with ANY NaN
df.dropna(axis=1, thresh=3) # drop columns with fewer than 3 non-NaN
Filling Missing Values
# ── Fill with a constant ──────────────────────────────────────────────
df.fillna(0) # fill all NaN with 0
df["salary"].fillna(0)
df.fillna({"salary": 0, "dept": "Unknown", "age": df["age"].median()})
# ── Forward fill (propagate last valid value forward) ─────────────────
df.ffill() # fill NaN with previous row's value
df["salary"].ffill()
# ── Backward fill ─────────────────────────────────────────────────────
df.bfill() # fill NaN with next row's value
# ── Fill with statistics ──────────────────────────────────────────────
df["salary"].fillna(df["salary"].mean())
df["salary"].fillna(df["salary"].median())
df["age"].fillna(df["age"].mode()[0]) # mode returns a Series, take first
# ── Fill within groups (advanced) ─────────────────────────────────────
# Fill NaN salary with the mean salary of that person's department
df["salary"] = df.groupby("dept")["salary"].transform(
lambda x: x.fillna(x.mean())
)
Interpolation (for Time-Series / Ordered Data)
s = pd.Series([1, np.nan, np.nan, 4, 5, np.nan, 7])
s.interpolate() # linear interpolation (default)
s.ffill()                                    # forward fill (interpolate(method="ffill") is deprecated)
s.interpolate(method="polynomial", order=2)  # polynomial (requires SciPy)
s.interpolate(limit=1) # fill at most 1 consecutive NaN
s.interpolate(limit_direction="both") # fill in both directions
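For the series above, linear interpolation fills each gap proportionally to its position between the surrounding known values — a quick check of the default behavior:

```python
import pandas as pd
import numpy as np

s = pd.Series([1, np.nan, np.nan, 4, 5, np.nan, 7])

# Linear interpolation spaces filled values evenly between known neighbors:
# the gap between 1 and 4 becomes 2, 3; the gap between 5 and 7 becomes 6
filled = s.interpolate()
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```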
Replacing Values
# Replace specific values
df.replace(np.nan, 0) # same as fillna(0)
df.replace("Unknown", np.nan) # treat "Unknown" as NaN
df["salary"].replace(0, np.nan) # replace 0 with NaN
# Replace multiple values at once
df.replace({"dept": {"HR": "Human Resources", "IT": "Information Technology"}})
# Replace with regex
df.replace(r"^\s+$", np.nan, regex=True) # whitespace-only strings → NaN
4. 📦 GroupBy — Split-Apply-Combine
GroupBy is one of the most powerful Pandas features. The concept:
Split — divide DataFrame into groups by one or more columns
Apply — apply a function to each group
Combine — combine results into a new DataFrame
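Spelled out by hand on a tiny hypothetical table, the three steps look like this — the sketch below does manually what `groupby().mean()` does in one line:

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["IT", "HR", "IT", "HR"],
    "salary": [65000, 50000, 72000, 55000],
})

# Split: one sub-DataFrame per department
pieces = {dept: grp for dept, grp in df.groupby("dept")}

# Apply: reduce each piece to a single number
means = {dept: grp["salary"].mean() for dept, grp in pieces.items()}

# Combine: collect results back into a Series
manual = pd.Series(means, name="salary")

# Same answer as the one-liner (up to sort order)
print(manual.sort_index().equals(df.groupby("dept")["salary"].mean()))  # True
```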
df = pd.DataFrame({
"dept": ["IT", "IT", "HR", "HR", "Finance", "Finance", "IT"],
"name": ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace"],
"salary": [65000, 72000, 50000, 55000, 75000, 80000, 68000],
"age": [28, 32, 25, 29, 35, 40, 27],
"score": [92, 88, 78, 85, 95, 82, 90]
})
Basic GroupBy
# Group by one column
grouped = df.groupby("dept")
# Inspect groups
grouped.groups # dict: {group_key: index_list}
grouped.ngroups # number of groups
grouped.size() # count per group
grouped.get_group("IT") # DataFrame for IT group only
# Basic aggregations
grouped["salary"].sum()
grouped["salary"].mean()
grouped["salary"].max()
grouped["salary"].min()
grouped["salary"].count()
grouped["salary"].std()
grouped["salary"].median()
.agg() — Multiple Aggregations at Once
# ── Apply same function to multiple columns ───────────────────────────
df.groupby("dept")[["salary", "score"]].mean()
# ── Apply different functions to different columns ─────────────────────
result = df.groupby("dept").agg(
total_salary=("salary", "sum"),
avg_salary=("salary", "mean"),
min_salary=("salary", "min"),
max_salary=("salary", "max"),
employee_count=("name", "count"),
avg_score=("score", "mean"),
oldest=("age", "max")
)
print(result)
total_salary avg_salary min_salary max_salary employee_count avg_score oldest
dept
Finance 155000 77500.0 75000 80000 2 88.5 40
HR 105000 52500.0 50000 55000 2 81.5 29
IT 205000 68333.3 65000 72000 3 90.0 32
# ── Apply multiple functions to one column ────────────────────────────
df.groupby("dept")["salary"].agg(["sum", "mean", "min", "max", "std"])
# ── Apply custom functions ─────────────────────────────────────────────
df.groupby("dept")["salary"].agg(
    mean="mean",                            # prefer the string name — np.mean is deprecated here
    range=lambda x: x.max() - x.min(),
    top_earner=lambda x: x.idxmax()
)
.transform() — Apply Group Stats Back to Original Shape
# transform() returns a Series with the SAME index as the original df
# Perfect for adding group-level stats as new columns
# Add mean salary of dept for each employee
df["dept_avg_salary"] = df.groupby("dept")["salary"].transform("mean")
# Add salary rank within department
df["salary_rank_in_dept"] = df.groupby("dept")["salary"].rank(ascending=False)
# Normalize salary within department (0-1)
df["salary_normalized"] = df.groupby("dept")["salary"].transform(
lambda x: (x - x.min()) / (x.max() - x.min())
)
# Check if salary is above dept average
df["above_dept_avg"] = df["salary"] > df.groupby("dept")["salary"].transform("mean")
print(df[["name", "dept", "salary", "dept_avg_salary", "above_dept_avg"]])
.filter() — Keep/Remove Entire Groups Based on Condition
# Keep only groups where mean salary > 60000
high_pay_depts = df.groupby("dept").filter(lambda x: x["salary"].mean() > 60000)
# Keep only groups with more than 2 employees
big_depts = df.groupby("dept").filter(lambda x: len(x) > 2)
# Keep groups where ALL employees have score > 80
all_high_score = df.groupby("dept").filter(lambda x: (x["score"] > 80).all())
.apply() — Most Flexible GroupBy Operation
# Apply a custom function to each group (returns anything)
def dept_summary(group):
    return pd.Series({
        "employees": len(group),
        "avg_salary": group["salary"].mean(),
        "top_earner": group.loc[group["salary"].idxmax(), "name"],
        "youngest": group.loc[group["age"].idxmin(), "name"]
    })
summary = df.groupby("dept").apply(dept_summary)
print(summary)
Multi-Column GroupBy
# Group by multiple columns
df2 = pd.DataFrame({
"dept": ["IT", "IT", "IT", "HR", "HR"],
"level": ["Senior", "Junior", "Senior", "Junior", "Senior"],
"salary": [80000, 55000, 75000, 45000, 60000]
})
df2.groupby(["dept", "level"])["salary"].mean()
df2.groupby(["dept", "level"]).agg(avg=("salary", "mean"), count=("salary", "count"))
# Reset multi-level index after groupby
result = df2.groupby(["dept", "level"])["salary"].mean().reset_index()
5. 🔗 Merge and Join — Like SQL Joins
The Four Types of Joins
employees = pd.DataFrame({
"emp_id": [1, 2, 3, 4, 5],
"name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
"dept_id": [10, 20, 10, 30, 20]
})
departments = pd.DataFrame({
"dept_id": [10, 20, 40],
"dept_name": ["IT", "HR", "Finance"]
})
# ── INNER JOIN — only matching rows from both ─────────────────────────
pd.merge(employees, departments, on="dept_id", how="inner")
# Result: Alice(IT), Bob(HR), Charlie(IT), Eve(HR) → Diana(dept=30) excluded
# ── LEFT JOIN — all rows from left, matching from right ───────────────
pd.merge(employees, departments, on="dept_id", how="left")
# Result: All 5 employees; Diana gets NaN for dept_name
# ── RIGHT JOIN — all rows from right, matching from left ──────────────
pd.merge(employees, departments, on="dept_id", how="right")
# Result: IT, HR, Finance rows; Finance has no employees → NaN
# ── OUTER JOIN — all rows from both ──────────────────────────────────
pd.merge(employees, departments, on="dept_id", how="outer")
# Result: All employees + Finance dept with NaN employees
# ── CROSS JOIN — every combination (Cartesian product) ────────────────
pd.merge(employees, departments, how="cross")
# Result: 5 × 3 = 15 rows
Advanced pd.merge() Parameters
# ── Different column names in each df ────────────────────────────────
pd.merge(employees, departments, left_on="dept_id", right_on="department_id")
# ── Merge on index ────────────────────────────────────────────────────
pd.merge(df1, df2, left_index=True, right_index=True)
pd.merge(df1, df2, left_on="id", right_index=True)
# ── Handle duplicate column names ────────────────────────────────────
# If both dfs have a "date" column:
pd.merge(df1, df2, on="id", suffixes=("_left", "_right"))
# ── Validate join (prevents silent duplicates) ─────────────────────────
pd.merge(employees, departments, on="dept_id",
validate="many_to_one") # "one_to_one", "one_to_many", "many_to_many"
# ── Indicator column — which df the row came from ─────────────────────
result = pd.merge(employees, departments, on="dept_id", how="outer", indicator=True)
print(result["_merge"].value_counts())
# both 4
# left_only 1
# right_only 1
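A common use of `indicator` is the anti-join — rows in one table with no match in the other. A sketch with the same employees/departments tables:

```python
import pandas as pd

employees = pd.DataFrame({
    "emp_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "dept_id": [10, 20, 10, 30, 20],
})
departments = pd.DataFrame({
    "dept_id": [10, 20, 40],
    "dept_name": ["IT", "HR", "Finance"],
})

# Anti-join: employees whose dept_id has no row in departments
merged = employees.merge(departments, on="dept_id", how="left", indicator=True)
orphans = merged[merged["_merge"] == "left_only"].drop(columns=["_merge", "dept_name"])
print(orphans["name"].tolist())  # ['Diana']
```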
df.join() — Index-Based Joining
# join() is a shortcut for merge on index
df1 = pd.DataFrame({"salary": [50000, 60000]}, index=["Alice", "Bob"])
df2 = pd.DataFrame({"dept": ["HR", "IT"]}, index=["Alice", "Bob"])
df1.join(df2) # left join on index (default is how="left", not inner!)
df1.join(df2, how="left")
df1.join(df2, how="outer")
pd.concat() — Stack DataFrames
df_jan = pd.DataFrame({"month": ["Jan"], "sales": [100000]})
df_feb = pd.DataFrame({"month": ["Feb"], "sales": [120000]})
df_mar = pd.DataFrame({"month": ["Mar"], "sales": [95000]})
# ── Stack vertically (rows) ───────────────────────────────────────────
result = pd.concat([df_jan, df_feb, df_mar], axis=0, ignore_index=True)
# ── Stack horizontally (columns) ──────────────────────────────────────
result = pd.concat([df1, df2], axis=1)
# ── Keep track of which df each row came from ─────────────────────────
result = pd.concat(
[df_jan, df_feb, df_mar],
keys=["Jan", "Feb", "Mar"], # creates MultiIndex
ignore_index=False
)
# ── Join behavior ─────────────────────────────────────────────────────
pd.concat([df1, df2], join="inner") # only columns common to both
pd.concat([df1, df2], join="outer") # all columns, NaN where missing (default)
Join Type Summary
| Operation | Use When |
|---|---|
| merge(how="inner") | Matching rows from BOTH tables |
| merge(how="left") | All left rows, fill NaN for unmatched right |
| merge(how="right") | All right rows, fill NaN for unmatched left |
| merge(how="outer") | All rows from both, NaN where no match |
| concat(axis=0) | Stacking rows of similar-schema DataFrames |
| concat(axis=1) | Adding new columns side-by-side |
| join() | Index-based joining shortcut |
6. 🔄 Reshaping Data — Pivot, Melt, Stack, Unstack
The Concept: Wide vs Long Format
Wide format — one row per entity, multiple columns for different measurements:
name Jan Feb Mar
Alice 100 200 150
Bob 80 120 110
Long format — one row per measurement:
name month value
Alice Jan 100
Alice Feb 200
Alice Mar 150
Bob Jan 80
...
Pandas can convert between both. Wide is good for display; long is good for analysis.
pivot() — Long to Wide (Simple)
long_df = pd.DataFrame({
"name": ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob"],
"month": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
"sales": [100, 200, 150, 80, 120, 110]
})
# Pivot: name = index, month = columns, sales = values
wide_df = long_df.pivot(index="name", columns="month", values="sales")
print(wide_df)
month Feb Jan Mar
name
Alice 200 100 150
Bob 120 80 110
# Flatten column names after pivot
wide_df.columns.name = None # remove "month" label from column axis
wide_df = wide_df.reset_index() # move "name" back to a column
pivot_table() — GroupBy + Pivot (Powerful!)
sales = pd.DataFrame({
"region": ["North", "North", "South", "South", "North", "South"],
"product": ["Laptop", "Phone", "Laptop", "Phone", "Laptop", "Laptop"],
"quarter": ["Q1", "Q1", "Q1", "Q1", "Q2", "Q2"],
"sales": [100, 200, 150, 120, 180, 90]
})
# pivot_table handles DUPLICATES by aggregating
pt = pd.pivot_table(
sales,
values="sales",
index="region", # rows
columns="product", # columns
aggfunc="sum", # what to do with duplicates (sum, mean, count, etc.)
fill_value=0, # replace NaN with 0
margins=True, # add "All" row and column (totals)
margins_name="Total"
)
print(pt)
product Laptop Phone Total
region
North 280 200 480
South 240 120 360
Total 520 320 840
# Multiple aggregations
pd.pivot_table(
sales,
values="sales",
index="region",
columns="product",
aggfunc={"sales": ["sum", "mean", "count"]}
)
melt() — Wide to Long (Unpivot)
wide_df = pd.DataFrame({
"name": ["Alice", "Bob"],
"Jan": [100, 80],
"Feb": [200, 120],
"Mar": [150, 110]
})
# Melt: convert month columns into rows
long_df = pd.melt(
wide_df,
id_vars=["name"], # columns to keep as-is
value_vars=["Jan", "Feb", "Mar"], # columns to melt
var_name="month", # name for the new "variable" column
value_name="sales" # name for the new "value" column
)
print(long_df)
name month sales
0 Alice Jan 100
1 Bob Jan 80
2 Alice Feb 200
3 Bob Feb 120
4 Alice Mar 150
5 Bob Mar 110
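melt() and pivot() are inverses — pivoting the melted frame back recovers the original wide table once column order is restored. A quick round-trip check on the same data:

```python
import pandas as pd

wide_df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "Jan": [100, 80],
    "Feb": [200, 120],
    "Mar": [150, 110],
})

long_df = wide_df.melt(id_vars="name", var_name="month", value_name="sales")

# Pivot back to wide; pivot sorts columns alphabetically, so reorder to match
back = (
    long_df.pivot(index="name", columns="month", values="sales")
    .reset_index()[["name", "Jan", "Feb", "Mar"]]
)
back.columns.name = None  # drop the leftover "month" axis label

print(back.equals(wide_df))  # True
```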
stack() and unstack() — MultiIndex Reshaping
df = pd.DataFrame(
{"Math": [90, 85], "Science": [80, 95], "English": [88, 78]},
index=["Alice", "Bob"]
)
# stack() — column labels become inner row index
stacked = df.stack()
print(stacked)
# Alice Math 90
# Science 80
# English 88
# Bob Math 85
# Science 95
# English 78
# unstack() — inner row index becomes column labels (reverse of stack)
unstacked = stacked.unstack()
print(unstacked) # back to original
pd.crosstab() — Frequency Tables
df = pd.DataFrame({
"dept": ["IT", "IT", "HR", "HR", "IT", "Finance"],
"level": ["Senior", "Junior", "Senior", "Junior", "Senior", "Junior"],
"gender": ["M", "F", "M", "M", "F", "F"]
})
# Count of each dept × level combination
pd.crosstab(df["dept"], df["level"])
# Percentages
pd.crosstab(df["dept"], df["level"], normalize="index") # row percentages
# With margins
pd.crosstab(df["dept"], df["gender"], margins=True)
# Aggregation
pd.crosstab(df["dept"], df["gender"], values=df["level"].map({"Senior": 1, "Junior": 0}), aggfunc="sum")
7. 🔧 Apply, Map, and Vectorized Operations
apply() — Apply a Function Along an Axis
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "dept": ["IT", "HR", "Finance"],   # used by the map() examples below
    "salary": [50000, 65000, 75000],
    "score": [88, 92, 78]
})
# ── Apply to a Series (column) ────────────────────────────────────────
df["salary"].apply(lambda x: x * 1.1) # 10% raise
df["salary"].apply(lambda x: f"${x:,.0f}") # format as currency
df["name"].apply(len) # length of each name
# Custom function
def grade(score):
    if score >= 90: return "A"
    elif score >= 80: return "B"
    elif score >= 70: return "C"
    else: return "F"
df["grade"] = df["score"].apply(grade)
# ── Apply to rows (axis=1) ────────────────────────────────────────────
# Each row is passed as a Series
df["bonus"] = df.apply(lambda row: row["salary"] * 0.15 if row["score"] >= 90 else row["salary"] * 0.10, axis=1)
# Cleaner with a function
def calculate_bonus(row):
    base = row["salary"]
    if row["score"] >= 90:
        return base * 0.20
    elif row["score"] >= 80:
        return base * 0.15
    else:
        return base * 0.10
df["bonus"] = df.apply(calculate_bonus, axis=1)
# ── Apply to entire DataFrame ─────────────────────────────────────────
df[["salary", "score"]].apply(lambda col: (col - col.mean()) / col.std()) # standardize
map() — Element-wise Transformation on Series
# ── With a dictionary ─────────────────────────────────────────────────
dept_codes = {"IT": 1, "HR": 2, "Finance": 3}
df["dept_code"] = df["dept"].map(dept_codes)
# ── With a function ───────────────────────────────────────────────────
df["salary"].map(lambda x: x * 1.1) # same as apply for Series
# ── With a Series ─────────────────────────────────────────────────────
# map from one column to another using a lookup Series
lookup = pd.Series({"IT": "Technology", "HR": "People Ops", "Finance": "Money"})
df["dept_full"] = df["dept"].map(lookup)
applymap() / map() on DataFrame — Element-wise
# pandas < 2.1: applymap()
# pandas >= 2.1: DataFrame.map() (applymap deprecated)
# Apply a function to EVERY SINGLE ELEMENT in the DataFrame
df_nums = pd.DataFrame({"a": [1.234, 2.567], "b": [3.891, 4.123]})
df_nums.map(lambda x: round(x, 1))
df_nums.map(lambda x: f"{x:.2f}") # format all numbers
⚡ Vectorized Operations — Always Prefer These Over apply()
# These are MUCH faster than apply/lambda
# apply() loops in Python — vectorized uses C code under the hood
# ❌ Slow
df["salary"].apply(lambda x: x * 1.1)
# ✅ Fast (vectorized)
df["salary"] * 1.1
# ❌ Slow
df["score"].apply(lambda x: x if x >= 80 else 0)
# ✅ Fast (numpy where)
np.where(df["score"] >= 80, df["score"], 0)
# ❌ Slow (multiple apply calls)
df["result"] = df.apply(lambda r: r["salary"] if r["score"] > 85 else 0, axis=1)
# ✅ Fast
df["result"] = np.where(df["score"] > 85, df["salary"], 0)
np.where() and np.select() — Fast Conditional Assignment
import numpy as np
# np.where: if condition, value_if_true, value_if_false
df["status"] = np.where(df["score"] >= 90, "High", "Normal")
# np.select: multiple conditions (like CASE WHEN in SQL)
conditions = [
df["score"] >= 90,
df["score"] >= 80,
df["score"] >= 70,
]
choices = ["A", "B", "C"]
df["grade"] = np.select(conditions, choices, default="F")
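np.select reproduces the if/elif `grade()` function from earlier in this section without a Python-level loop — a quick equivalence check on hypothetical scores:

```python
import numpy as np
import pandas as pd

scores = pd.Series([95, 88, 72, 65, 90])

def grade(score):
    if score >= 90: return "A"
    elif score >= 80: return "B"
    elif score >= 70: return "C"
    else: return "F"

# Conditions are checked in order, exactly like the if/elif chain above
conditions = [scores >= 90, scores >= 80, scores >= 70]
choices = ["A", "B", "C"]
vectorized = np.select(conditions, choices, default="F")

print(list(vectorized))                           # ['A', 'B', 'C', 'F', 'A']
print(bool((vectorized == scores.apply(grade)).all()))  # True
```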
8. 📝 String Operations with .str
Pandas has a powerful .str accessor for string Series.
s = pd.Series([" Alice Smith ", "bob jones", "CHARLIE DOE", "Diana O'Brien", np.nan])
# ── Case operations ───────────────────────────────────────────────────
s.str.lower() # "alice smith"
s.str.upper() # "ALICE SMITH"
s.str.title() # "Alice Smith"
s.str.capitalize() # "Alice smith" (only first char)
s.str.swapcase() # swap case of each character
# ── Whitespace ────────────────────────────────────────────────────────
s.str.strip() # strip both sides
s.str.lstrip() # strip left
s.str.rstrip() # strip right
# ── Split and extract ─────────────────────────────────────────────────
names = pd.Series(["Alice Smith", "Bob Jones", "Charlie Doe"])
names.str.split() # split on whitespace → list
names.str.split(" ", expand=True) # split into separate columns
names.str.split(" ").str[0] # first word (first name)
names.str.split(" ").str[-1] # last word (last name)
# ── Contains and matching ─────────────────────────────────────────────
s.str.contains("Smith", na=False)
s.str.startswith("A")
s.str.endswith("n")
s.str.match(r"^\w+") # regex match from beginning
s.str.fullmatch(r"\w+ \w+") # full string regex match
# ── Replace ───────────────────────────────────────────────────────────
s.str.replace("Smith", "Johnson")
s.str.replace(r"\s+", " ", regex=True) # normalize multiple spaces
s.str.replace(r"[^a-zA-Z\s]", "", regex=True) # remove non-alpha chars
# ── Extract with regex ────────────────────────────────────────────────
emails = pd.Series(["alice@gmail.com", "bob@yahoo.com", "charlie@company.org"])
emails.str.extract(r"(\w+)@(\w+)\.(\w+)") # 3 capture groups → 3 columns
emails.str.extract(r"(?P<user>\w+)@(?P<domain>\w+)\.(?P<tld>\w+)") # named groups
# ── Length ────────────────────────────────────────────────────────────
s.str.len()
# ── Count occurrences ─────────────────────────────────────────────────
s.str.count("l") # count 'l' in each string
# ── Pad and format ────────────────────────────────────────────────────
pd.Series(["1", "22", "333"]).str.zfill(5) # "00001", "00022", "00333"
pd.Series(["hi"]).str.center(10, "*") # "****hi****"
pd.Series(["hi"]).str.ljust(10, ".") # "hi........"
pd.Series(["hi"]).str.rjust(10, ".") # "........hi"
# ── Join strings ──────────────────────────────────────────────────────
pd.Series([["Alice", "Bob"], ["Charlie", "Diana"]]).str.join(", ")
9. 🗓️ Datetime Operations
Creating and Parsing Dates
import pandas as pd
# ── Parse dates ───────────────────────────────────────────────────────
df["date"] = pd.to_datetime(df["date_str"])
df["date"] = pd.to_datetime(df["date_str"], format="%d/%m/%Y") # custom format
df["date"] = pd.to_datetime(df["date_str"], errors="coerce") # NaT on failure
# Note: infer_datetime_format=True is deprecated since pandas 2.0 — format is now inferred automatically
# ── Create date ranges ────────────────────────────────────────────────
pd.date_range("2024-01-01", periods=12, freq="MS") # 12 month-starts
pd.date_range("2024-01-01", "2024-12-31", freq="D") # every day in 2024
pd.date_range("09:00", periods=8, freq="h") # hourly
# Frequency aliases: D=day, W=week, MS=month-start, ME=month-end,
# QS=quarter-start, YS=year-start, h=hour, min=minute, s=second
Datetime Accessor .dt
df["date"] = pd.to_datetime(["2024-01-15", "2024-06-22", "2024-11-30"])
# Extract components
df["date"].dt.year # 2024
df["date"].dt.month # 1, 6, 11
df["date"].dt.day # 15, 22, 30
df["date"].dt.hour
df["date"].dt.minute
df["date"].dt.second
df["date"].dt.weekday # 0=Monday, 6=Sunday
df["date"].dt.day_name() # "Monday", etc.
df["date"].dt.month_name() # "January", etc.
df["date"].dt.quarter # 1, 2, 4
df["date"].dt.dayofyear # day number within year (1-365)
df["date"].dt.isocalendar().week # ISO week number (.dt.weekofyear was removed in pandas 2.0)
# Is it...?
df["date"].dt.is_month_start
df["date"].dt.is_month_end
df["date"].dt.is_quarter_start
df["date"].dt.is_year_start
# Format
df["date"].dt.strftime("%B %d, %Y") # "January 15, 2024"
df["date"].dt.strftime("%Y-%m") # "2024-01"
# Rounding/truncating
df["date"].dt.floor("h") # round down to hour
df["date"].dt.ceil("D") # round up to day
df["date"].dt.round("min") # round to nearest minute
# Timedelta
df["days_since"] = (pd.Timestamp.now() - df["date"]).dt.days
10. 🛠️ Practical Examples
Example 1: HR Analytics Pipeline
import pandas as pd
import numpy as np
# ── Create sample data ────────────────────────────────────────────────
np.random.seed(42)
n = 200
df = pd.DataFrame({
"emp_id": [f"EMP{i:04d}" for i in range(1, n+1)],
"name": [f"Employee {i}" for i in range(1, n+1)],
"dept": np.random.choice(["IT", "HR", "Finance", "Marketing", "Operations"], n),
"level": np.random.choice(["Junior", "Mid", "Senior", "Lead"], n),
"salary": np.random.randint(35000, 120000, n),
"age": np.random.randint(22, 60, n),
"score": np.random.uniform(60, 100, n).round(1),
"hire_date": pd.date_range("2015-01-01", periods=n, freq="10D"),
"gender": np.random.choice(["M", "F", "Other"], n, p=[0.48, 0.48, 0.04]),
})
# Introduce some missing values
df.loc[np.random.choice(df.index, 10), "salary"] = np.nan
df.loc[np.random.choice(df.index, 5), "score"] = np.nan
print(f"Dataset: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"Missing: {df.isnull().sum().to_dict()}")
# ── Step 1: Data Cleaning ──────────────────────────────────────────────
# Fill missing salary with median salary for that dept+level combo
df["salary"] = df.groupby(["dept", "level"])["salary"].transform(
lambda x: x.fillna(x.median())
)
# Fill missing score with dept median
df["score"] = df.groupby("dept")["score"].transform(
lambda x: x.fillna(x.median())
)
print(f"After cleaning — missing: {df.isnull().sum().sum()}")
# ── Step 2: Feature Engineering ───────────────────────────────────────
today = pd.Timestamp("2024-01-01")
df["tenure_years"] = ((today - df["hire_date"]).dt.days / 365.25).round(1)
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 40, 50, 100],
labels=["<30", "30-40", "40-50", "50+"])
df["experience_level"] = np.select(
[df["tenure_years"] < 2, df["tenure_years"] < 5, df["tenure_years"] < 10],
["New", "Growing", "Experienced"],
default="Veteran"
)
# ── Step 3: Analysis ───────────────────────────────────────────────────
print("\n" + "="*60)
print("🏢 DEPARTMENT SUMMARY")
print("="*60)
dept_summary = df.groupby("dept").agg(
headcount=("emp_id", "count"),
avg_salary=("salary", "mean"),
median_salary=("salary", "median"),
avg_score=("score", "mean"),
avg_tenure=("tenure_years", "mean"),
avg_age=("age", "mean")
).round(1).sort_values("avg_salary", ascending=False)
print(dept_summary.to_string())
print("\n" + "="*60)
print("💰 SALARY BY DEPT × LEVEL")
print("="*60)
salary_pivot = df.pivot_table(
values="salary",
index="dept",
columns="level",
aggfunc="mean",
fill_value=0
).round(0)
print(salary_pivot.to_string())
print("\n" + "="*60)
print("🚨 HIGH POTENTIAL EMPLOYEES (Score > 90, Salary < Dept Median)")
print("="*60)
# Add dept median salary for comparison
df["dept_median_salary"] = df.groupby("dept")["salary"].transform("median")
high_potential = df[
(df["score"] > 90) & (df["salary"] < df["dept_median_salary"])
][["emp_id", "name", "dept", "level", "salary", "score"]].sort_values("score", ascending=False)
print(f"Found {len(high_potential)} high-potential underpaid employees")
print(high_potential.head(10).to_string(index=False))
# ── Step 4: Export ─────────────────────────────────────────────────────
with pd.ExcelWriter("hr_analytics.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Raw Data", index=False)
    dept_summary.to_excel(writer, sheet_name="Dept Summary")
    salary_pivot.to_excel(writer, sheet_name="Salary by Level")
    high_potential.to_excel(writer, sheet_name="High Potential", index=False)
print("\n✅ Report saved to hr_analytics.xlsx")
Example 2: Sales Merge and Analysis Pipeline
import pandas as pd
# ── Tables (like database relations) ──────────────────────────────────
orders = pd.DataFrame({
"order_id": [1001, 1002, 1003, 1004, 1005],
"customer_id": [201, 202, 201, 203, 204],
"product_id": [301, 302, 303, 301, 304],
"quantity": [2, 1, 3, 1, 2],
"order_date": pd.to_datetime(["2024-01-05", "2024-01-12", "2024-02-03",
"2024-02-18", "2024-03-01"])
})
customers = pd.DataFrame({
"customer_id": [201, 202, 203, 204, 205],
"name": ["Alice Corp", "Bob Ltd", "Charlie Inc", "Diana LLC", "Eve Co"],
"region": ["North", "South", "East", "West", "North"],
"tier": ["Gold", "Silver", "Gold", "Bronze", "Silver"]
})
products = pd.DataFrame({
"product_id": [301, 302, 303, 304, 305],
"product_name": ["Laptop", "Phone", "Tablet", "Watch", "Earbuds"],
"category": ["Computing", "Mobile", "Computing", "Wearable", "Audio"],
"unit_price": [999.99, 599.99, 449.99, 299.99, 149.99]
})
# ── Step 1: JOIN all three tables ─────────────────────────────────────
full = (
orders
.merge(customers, on="customer_id", how="left")
.merge(products, on="product_id", how="left")
)
# ── Step 2: Calculate revenue ─────────────────────────────────────────
full["revenue"] = full["quantity"] * full["unit_price"]
full["month"] = full["order_date"].dt.strftime("%Y-%m")
print("Merged Dataset:")
print(full[["order_id", "name", "product_name", "quantity", "unit_price", "revenue"]].to_string(index=False))
# ── Step 3: Analysis ───────────────────────────────────────────────────
print("\n📊 Revenue by Region:")
print(full.groupby("region")["revenue"].sum().sort_values(ascending=False).to_frame())
print("\n📦 Revenue by Category:")
print(full.groupby("category")["revenue"].sum().sort_values(ascending=False).to_frame())
print("\n🥇 Customer Tier Analysis:")
print(full.groupby("tier").agg(
orders=("order_id", "count"),
revenue=("revenue", "sum"),
avg_order=("revenue", "mean")
).round(2).to_string())
# ── Step 4: Pivot — monthly revenue by region ─────────────────────────
monthly_region = pd.pivot_table(
full, values="revenue", index="month",
columns="region", aggfunc="sum", fill_value=0
)
print("\n📅 Monthly Revenue by Region:")
print(monthly_region.to_string())
11. ⚠️ Edge Cases and Common Errors
Error 1: groupby().agg() — "No numeric types to aggregate"
# ❌ Aggregating every column, including non-numeric ones
df.groupby("dept").agg("sum") # older pandas raises DataError; pandas 2.x may raise TypeError or silently concatenate strings
# ✅ Select numeric columns first
df.groupby("dept")[["salary", "score"]].agg("sum")
# ✅ Or use numeric_only parameter
df.groupby("dept").sum(numeric_only=True)
Error 2: merge() — Duplicate Rows After Join
# This happens when the join key is not unique in one or both tables
# Solution: check before merging
print("Orders unique order_id?", orders["order_id"].is_unique)
print("Customers unique customer_id?", customers["customer_id"].is_unique)
# Use validate to enforce uniqueness
pd.merge(orders, customers, on="customer_id", validate="many_to_one")
# Raises MergeError if customer_id is not unique in customers
Error 3: pivot() — "Index contains duplicate entries"
# ❌ pivot() requires unique index+column combinations
# If a combination appears twice, use pivot_table() instead
df.pivot(index="name", columns="month", values="sales") # ValueError if duplicates
# ✅ Use pivot_table() with an aggfunc
df.pivot_table(index="name", columns="month", values="sales", aggfunc="sum")
Error 4: str.contains() — TypeError on NaN
# ❌ NaN in string column causes issues
df[df["name"].str.contains("Ali")] # NaN propagates into the mask → indexing with NA raises an error
# ✅ Always use na=False
df[df["name"].str.contains("Ali", na=False)]
12. 💡 Pro Developer Insights
Use Method Chaining for Readable Pipelines
# ❌ Hard to read — creates many intermediate variables
df1 = df.drop_duplicates()
df2 = df1.dropna(subset=["salary"])
df3 = df2[df2["salary"] > 50000]
df4 = df3.assign(bonus=df3["salary"] * 0.10)
df5 = df4.groupby("dept")["bonus"].sum()
# ✅ Clean chained pipeline
result = (
df
.drop_duplicates()
.dropna(subset=["salary"])
.query("salary > 50000")
.assign(bonus=lambda x: x["salary"] * 0.10)
.groupby("dept")["bonus"].sum()
)
GroupBy Performance Tips
# ✅ sort=False speeds up groupby (result isn't sorted by group key)
df.groupby("dept", sort=False)["salary"].mean()
# ✅ Use as_index=False to get flat DataFrame (avoids reset_index())
df.groupby("dept", as_index=False)["salary"].mean()
# ✅ observed=True for categorical groupby (avoid empty groups)
df["dept"] = df["dept"].astype("category")
df.groupby("dept", observed=True)["salary"].sum()
Avoid Repeated Computation with transform()
# ❌ Computing group stats twice
df["dept_avg"] = df.groupby("dept")["salary"].mean().reindex(df["dept"]).values
df["above_avg"] = df["salary"] > df.groupby("dept")["salary"].mean().reindex(df["dept"]).values
# ✅ Use transform (computes once, broadcasts back)
dept_avg = df.groupby("dept")["salary"].transform("mean")
df["dept_avg"] = dept_avg
df["above_avg"] = df["salary"] > dept_avg
13. 🎤 Interview Questions — Part 2
Basic Level
Q1: What is the difference between merge() and join()?
merge() is more flexible — you can join on any column(s). join() is a shortcut that joins on the index by default. Internally, join() calls merge().
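A minimal sketch of the contrast, using two toy frames (names are illustrative):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"y": [10, 20]}, index=["a", "b"])

# merge(): join on any column(s) — here, a column against an index
m = left.merge(right, left_on="key", right_index=True)

# join(): joins on the index by default, so align the indexes first
j = left.set_index("key").join(right)
```

Both produce the same matched rows; `join()` just saves a keyword when your keys already live in the index.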
Q2: How does groupby().transform() differ from groupby().agg()?
agg() reduces each group to one row. transform() returns values with the same shape as the input — every original row gets its group's computed value. Use transform() to add group stats as new columns.
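A quick side-by-side on a toy frame (hypothetical data) makes the shape difference concrete:

```python
import pandas as pd

df = pd.DataFrame({"dept": ["HR", "HR", "IT"], "salary": [100, 200, 300]})

# agg(): one row per group
agg_result = df.groupby("dept")["salary"].agg("mean")       # length 2

# transform(): one value per ORIGINAL row — broadcastable back
trans_result = df.groupby("dept")["salary"].transform("mean")  # length 3

# which is why transform() can be assigned directly as a column
df["dept_avg"] = trans_result
```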
Q3: What is pd.pivot_table() and when would you use it over pivot()?
pivot() requires unique index/column combinations. pivot_table() handles duplicates by aggregating (like GroupBy + pivot). Use pivot_table() whenever the data has duplicate combinations.
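A small sketch of the failure and the fix, with a deliberately duplicated (name, month) pair:

```python
import pandas as pd

sales = pd.DataFrame({
    "name": ["Alice", "Alice", "Bob"],
    "month": ["Jan", "Jan", "Jan"],   # (Alice, Jan) appears twice
    "sales": [10, 5, 7],
})

# pivot() cannot decide which of the two Alice/Jan values to keep
try:
    sales.pivot(index="name", columns="month", values="sales")
except ValueError:
    pass  # "Index contains duplicate entries, cannot reshape"

# pivot_table() aggregates the duplicates instead
pt = sales.pivot_table(index="name", columns="month",
                       values="sales", aggfunc="sum")
```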
Intermediate Level
Q4: How do you find rows where values from two DataFrames DON'T match (anti-join)?
# Left anti-join (rows in df1 NOT in df2)
merged = pd.merge(df1, df2, on="key", how="left", indicator=True)
anti_join = merged[merged["_merge"] == "left_only"].drop("_merge", axis=1)
Q5: How would you efficiently apply a function that depends on multiple columns?
# Option 1: apply with axis=1 (slow but simple)
df.apply(lambda row: f(row["col1"], row["col2"]), axis=1)
# Option 2: np.vectorize (faster)
np.vectorize(f)(df["col1"], df["col2"])
# Option 3: vectorized numpy operations (fastest)
np.where(df["col1"] > df["col2"], df["col1"] * 2, df["col2"] * 3)
Advanced Level
Q6: How does Pandas handle categorical data in GroupBy, and why does it matter?
When a column has category dtype and you group by it, Pandas shows all categories in the result — even those with no rows — unless you pass observed=True. In pandas 2.2+, leaving observed unspecified triggers a FutureWarning (the default flips to True in pandas 3.0). Always pass observed=True explicitly for predictable results.
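A minimal sketch, assuming a category with an unused level ("Finance" has no rows):

```python
import pandas as pd

df = pd.DataFrame({"dept": ["HR", "IT"], "salary": [100, 200]})
df["dept"] = df["dept"].astype(pd.CategoricalDtype(["HR", "IT", "Finance"]))

# observed=False: empty categories appear in the result (Finance → 0 for sum)
all_cats = df.groupby("dept", observed=False)["salary"].sum()

# observed=True: only categories that actually occur in the data
seen = df.groupby("dept", observed=True)["salary"].sum()
```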
Q7: What happens when you concat two DataFrames with different columns?
By default, concat(join="outer") keeps all columns from both frames and fills missing values with NaN. With join="inner", only the common columns are kept.
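A two-line demonstration with frames that share only column "y":

```python
import pandas as pd

a = pd.DataFrame({"x": [1], "y": [2]})
b = pd.DataFrame({"y": [3], "z": [4]})

outer = pd.concat([a, b])               # union of columns; NaN where missing
inner = pd.concat([a, b], join="inner")  # intersection: only 'y' survives
```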
14. ✅ Conclusion — Part 2
In Part 2, you mastered:
.loc and .iloc — label vs position indexing; the key differences
Boolean filtering — single and multi-condition, .isin(), .between(), .query()
Missing data — dropna(), fillna(), interpolate(), group-based imputation
GroupBy — split-apply-combine, agg(), transform(), filter(), apply()
Merge and Join — inner/left/right/outer/cross joins, pd.concat()
Reshaping — pivot(), pivot_table(), melt(), stack(), unstack()
String operations — the full .str accessor
Datetime operations — the .dt accessor, date ranges, time arithmetic
Apply, map, vectorize — and why to prefer vectorized ops
🔜 What's Coming in Part 3?
In Part 3: Advanced Pandas, we'll cover:
Performance Optimization — profiling, memory reduction, faster alternatives to apply
MultiIndex — hierarchical indexing in depth
Time Series — resampling, rolling windows, shift, lag features
Pipe and Method Chaining — building production-grade pipelines
Pandas + Machine Learning — feature engineering, scikit-learn integration
Real-World Data Pipeline — end-to-end ETL with Pandas
Pandas 2.x New Features — Copy-on-Write, new string dtype, Arrow backend
Complete Interview Questions — 30+ questions with answers
📌 Save this. The GroupBy + Merge sections alone will save you hours every week.
🔗 Next: Part 3 — Advanced Pandas: Performance, Time Series & ML Pipelines