
Pandas Data Manipulation: The Complete Guide (Part 2 — Indexing, GroupBy, Merge & Reshape)

Master Pandas data manipulation — loc/iloc, boolean filtering, GroupBy, merge/join, pivot tables, melt, string ops, and apply functions with real examples.

Series Overview:

  • Part 1: Fundamentals — Series, DataFrames, I/O, Basic Operations ✅

  • Part 2 (This Post): Data Manipulation — Indexing, Filtering, GroupBy, Merge, Reshape

  • Part 3: Advanced Pandas — Performance, Time Series, ML Pipelines, Interview Questions


1. Indexing and Selection — The Full Picture

Pandas has two primary indexers for selecting data:

  • .loc — label-based (use index labels and column names)

  • .iloc — integer position-based (use 0, 1, 2... positions)

This is one of the most commonly confused topics in Pandas, so let's master it completely.

Setting Up Our Example DataFrame

Python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name":   ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "dept":   ["HR", "IT", "Finance", "IT", "HR"],
    "age":    [25, 30, 35, 28, 32],
    "salary": [50000, 65000, 75000, 60000, 55000],
    "score":  [88.5, 92.0, 78.3, 95.1, 84.7]
}, index=["E001", "E002", "E003", "E004", "E005"])

print(df)

Output:

         name    dept  age  salary  score
E001    Alice      HR   25   50000   88.5
E002      Bob      IT   30   65000   92.0
E003  Charlie  Finance   35   75000   78.3
E004    Diana      IT   28   60000   95.1
E005      Eve      HR   32   55000   84.7

.loc — Label-Based Indexing

Python
# ── Single row by label ───────────────────────────────────────────────
df.loc["E002"]             # returns Series (one row)
df.loc[["E002"]]           # returns DataFrame (one-row DataFrame)

# ── Multiple rows by label ────────────────────────────────────────────
df.loc[["E001", "E003", "E005"]]

# ── Row range (INCLUSIVE on both ends!) ──────────────────────────────
df.loc["E001":"E003"]      # E001, E002, E003 — includes E003!

# ── Row + Column ──────────────────────────────────────────────────────
df.loc["E002", "salary"]           # single value → scalar
df.loc["E002", ["name", "salary"]] # row + multiple cols → Series
df.loc[["E002", "E004"], ["name", "salary"]]  # multi rows + cols → DataFrame

# ── Column range ──────────────────────────────────────────────────────
df.loc[:, "age":"salary"]          # all rows, columns from age to salary (inclusive)
df.loc[:, ["name", "score"]]       # specific columns

# ── Boolean indexing with .loc ────────────────────────────────────────
df.loc[df["age"] > 30]
df.loc[df["dept"] == "IT", "salary"]  # IT employees' salaries only
df.loc[(df["dept"] == "IT") & (df["age"] < 32), :]  # complex condition

# ── Setting values with .loc ──────────────────────────────────────────
df.loc["E001", "salary"] = 52000   # update single cell
df.loc[df["dept"] == "HR", "salary"] = df.loc[df["dept"] == "HR", "salary"] * 1.05

.iloc — Integer Position-Based Indexing

Python
# ── Single row by position ────────────────────────────────────────────
df.iloc[0]          # first row
df.iloc[-1]         # last row
df.iloc[[0]]        # first row as DataFrame

# ── Multiple rows ─────────────────────────────────────────────────────
df.iloc[[0, 2, 4]]             # rows at positions 0, 2, 4
df.iloc[1:4]                   # rows 1, 2, 3 (EXCLUSIVE end, like Python slicing)

# ── Row + Column ──────────────────────────────────────────────────────
df.iloc[0, 2]                  # row 0, column 2 → scalar
df.iloc[0, [0, 3]]             # row 0, columns 0 and 3
df.iloc[:, 2]                  # all rows, column at position 2

# ── Slicing both axes ─────────────────────────────────────────────────
df.iloc[0:3, 1:4]              # rows 0-2, columns 1-3
df.iloc[-3:, -2:]              # last 3 rows, last 2 columns

# ── Setting values with .iloc ─────────────────────────────────────────
df.iloc[0, 3] = 52000          # set row 0, column 3 to 52000

.loc vs .iloc — Side-by-Side

Aspect         .loc                            .iloc
Based on       Labels (index/column names)     Integer positions (0, 1, 2...)
End in slice   Inclusive                       Exclusive (like Python)
Works with     String/datetime/custom index    Any index type (always works)
Use when       You know the labels             You know the positions

Python
# ⚠️ Key difference: slicing endpoint!
df.loc["E001":"E003"]    # includes E003
df.iloc[0:3]             # excludes row at position 3 (only 0,1,2)

Direct Column and Row Access (Quick Reference)

Python
# Column access
df["salary"]                  # Series
df[["name", "salary"]]        # DataFrame

# Row access via slicing (positional — works with ANY index type)
df[0:3]                       # first 3 rows (slicing only — no scalar!)

# ⚠️ df[0] is COLUMN lookup, not row access — KeyError unless a column is named 0
# Always prefer df.iloc[0] or df.loc["E001"] for explicit row access
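A minimal sketch of why `df[0]` fails when the index is not integer-based (the names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]}, index=["a", "b", "c"])

first_three = df[0:3]   # slicing is positional row access — works with any index

try:
    df[0]               # scalar [] is COLUMN lookup — no column named 0 → KeyError
    raised = False
except KeyError:
    raised = True

row = df.iloc[0]        # explicit positional row access
print(row["x"])         # → 10
```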

at and iat — Fast Scalar Access

Python
# When you need a SINGLE VALUE — use at/iat (faster than loc/iloc for scalars)
df.at["E002", "salary"]       # label-based scalar → 65000
df.iat[1, 3]                  # position-based scalar → 65000

# Setting
df.at["E002", "salary"] = 67000

2. 🔎 Filtering and Boolean Indexing

Basic Boolean Filtering

Python
# Single condition
df[df["age"] > 30]
df[df["dept"] == "IT"]
df[df["salary"] >= 60000]
df[df["name"].str.startswith("A")]

# Multiple conditions — use & (and), | (or), ~ (not)
# ⚠️ MUST use parentheses around each condition!
df[(df["dept"] == "IT") & (df["age"] > 25)]          # both true
df[(df["dept"] == "IT") | (df["dept"] == "HR")]       # either true
df[~(df["dept"] == "Finance")]                         # NOT Finance
df[(df["salary"] > 55000) & (df["score"] > 85)]       # salary AND score

# Common mistake — Python 'and'/'or' won't work!
# ❌ df[df["age"] > 25 and df["salary"] > 60000]  → ValueError
# ✅ df[(df["age"] > 25) & (df["salary"] > 60000)]

.isin() — Match Against a List

Python
# Filter rows where value is in a list
df[df["dept"].isin(["IT", "HR"])]

# Negate with ~
df[~df["dept"].isin(["Finance"])]

# Works on any type
df[df["age"].isin([25, 30, 35])]

.between() — Range Filtering

Python
# Inclusive on both ends by default
df[df["salary"].between(55000, 70000)]
df[df["score"].between(80, 95)]

# Exclusive bounds
df[df["age"].between(25, 32, inclusive="left")]   # 25 <= age < 32
df[df["age"].between(25, 32, inclusive="neither")] # 25 < age < 32

.query() — SQL-like Filtering (Readable!)

Python
# query() evaluates a string expression — very readable
df.query("age > 30")
df.query("dept == 'IT'")
df.query("salary >= 60000 and score > 85")
df.query("dept in ['IT', 'HR'] and age < 32")
df.query("dept != 'Finance'")

# Reference external variables with @
min_salary = 60000
max_age = 32
df.query("salary >= @min_salary and age <= @max_age")

# Column names with spaces — use backticks
df.query("`annual salary` > 60000")

String Filtering with .str

Python
s = df["name"]

# Contains (case-insensitive with na=False)
df[df["name"].str.contains("li", case=False, na=False)]

# Starts/ends with
df[df["name"].str.startswith("A")]
df[df["name"].str.endswith("e")]

# Regex matching
df[df["name"].str.match(r"^[A-C]")]   # names starting with A, B, or C
df[df["name"].str.fullmatch(r"[A-Za-z]{3,6}")]  # 3-6 letter names

# Length
df[df["name"].str.len() > 5]

3. 🧹 Handling Missing Data — Complete Guide

Detecting Missing Values

Python
df = pd.DataFrame({
    "name":   ["Alice", None, "Charlie", "Diana", "Eve"],
    "age":    [25, 30, np.nan, 28, 32],
    "salary": [50000, np.nan, 75000, np.nan, 55000],
    "dept":   ["HR", "IT", "Finance", None, "HR"]
})

# ── Detection ─────────────────────────────────────────────────────────
df.isnull()              # DataFrame of True/False
df.isna()                # alias for isnull()
df.notnull()             # opposite
df.isnull().sum()        # count NaN per column
df.isnull().sum() / len(df) * 100  # percentage missing per column
df.isnull().any()        # True if ANY NaN in column
df.isnull().any(axis=1)  # True if ANY NaN in row

# Total missing
print(f"Total missing: {df.isnull().sum().sum()}")
print(f"Rows with any NaN: {df.isnull().any(axis=1).sum()}")

Dropping Missing Values

Python
# ── Drop rows ─────────────────────────────────────────────────────────
df.dropna()                           # drop rows with ANY NaN
df.dropna(how="all")                  # drop rows where ALL values are NaN
df.dropna(subset=["salary"])          # drop rows where salary is NaN
df.dropna(subset=["salary", "age"])   # drop where salary OR age is NaN
df.dropna(thresh=3)                   # keep rows with at least 3 non-NaN values

# ── Drop columns ──────────────────────────────────────────────────────
df.dropna(axis=1)                     # drop columns with ANY NaN
df.dropna(axis=1, thresh=3)           # drop columns with fewer than 3 non-NaN

Filling Missing Values

Python
# ── Fill with a constant ──────────────────────────────────────────────
df.fillna(0)                          # fill all NaN with 0
df["salary"].fillna(0)
df.fillna({"salary": 0, "dept": "Unknown", "age": df["age"].median()})

# ── Forward fill (propagate last valid value forward) ─────────────────
df.ffill()                            # fill NaN with previous row's value
df["salary"].ffill()

# ── Backward fill ─────────────────────────────────────────────────────
df.bfill()                            # fill NaN with next row's value

# ── Fill with statistics ──────────────────────────────────────────────
df["salary"].fillna(df["salary"].mean())
df["salary"].fillna(df["salary"].median())
df["age"].fillna(df["age"].mode()[0])  # mode returns a Series, take first

# ── Fill within groups (advanced) ─────────────────────────────────────
# Fill NaN salary with the mean salary of that person's department
df["salary"] = df.groupby("dept")["salary"].transform(
    lambda x: x.fillna(x.mean())
)

Interpolation (for Time-Series / Ordered Data)

Python
s = pd.Series([1, np.nan, np.nan, 4, 5, np.nan, 7])

s.interpolate()                        # linear interpolation (default)
s.interpolate(method="ffill")          # forward fill (deprecated since pandas 2.1 — prefer s.ffill())
s.interpolate(method="polynomial", order=2)  # polynomial (requires SciPy)
s.interpolate(limit=1)                 # fill at most 1 consecutive NaN
s.interpolate(limit_direction="both")  # fill in both directions

Replacing Values

Python
# Replace specific values
df.replace(np.nan, 0)                    # same as fillna(0)
df.replace("Unknown", np.nan)            # treat "Unknown" as NaN
df["salary"].replace(0, np.nan)          # replace 0 with NaN

# Replace multiple values at once
df.replace({"dept": {"HR": "Human Resources", "IT": "Information Technology"}})

# Replace with regex
df.replace(r"^\s+$", np.nan, regex=True)  # whitespace-only strings → NaN

4. 📦 GroupBy — Split-Apply-Combine

GroupBy is one of the most powerful Pandas features. The concept:

  1. Split — divide DataFrame into groups by one or more columns

  2. Apply — apply a function to each group

  3. Combine — combine results into a new DataFrame
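The three steps can be seen by doing them by hand on a toy frame — a sketch of what groupby does conceptually (only much faster):

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["IT", "IT", "HR"],
    "salary": [65000, 72000, 50000],
})

# Split: iterate groups; Apply: reduce each one; Combine: collect the results
pieces = {key: group["salary"].mean() for key, group in df.groupby("dept")}
manual = pd.Series(pieces)

# The one-liner equivalent
auto = df.groupby("dept")["salary"].mean()
print(manual)   # HR → 50000.0, IT → 68500.0
```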

Python
df = pd.DataFrame({
    "dept":   ["IT", "IT", "HR", "HR", "Finance", "Finance", "IT"],
    "name":   ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace"],
    "salary": [65000, 72000, 50000, 55000, 75000, 80000, 68000],
    "age":    [28, 32, 25, 29, 35, 40, 27],
    "score":  [92, 88, 78, 85, 95, 82, 90]
})

Basic GroupBy

Python
# Group by one column
grouped = df.groupby("dept")

# Inspect groups
grouped.groups          # dict: {group_key: index_list}
grouped.ngroups         # number of groups
grouped.size()          # count per group
grouped.get_group("IT") # DataFrame for IT group only

# Basic aggregations
grouped["salary"].sum()
grouped["salary"].mean()
grouped["salary"].max()
grouped["salary"].min()
grouped["salary"].count()
grouped["salary"].std()
grouped["salary"].median()

.agg() — Multiple Aggregations at Once

Python
# ── Apply same function to multiple columns ───────────────────────────
df.groupby("dept")[["salary", "score"]].mean()

# ── Apply different functions to different columns ─────────────────────
result = df.groupby("dept").agg(
    total_salary=("salary", "sum"),
    avg_salary=("salary", "mean"),
    min_salary=("salary", "min"),
    max_salary=("salary", "max"),
    employee_count=("name", "count"),
    avg_score=("score", "mean"),
    oldest=("age", "max")
)
print(result)

Output:

          total_salary  avg_salary  min_salary  max_salary  employee_count  avg_score  oldest
dept                                                                                         
Finance        155000    77500.0       75000       80000               2       88.5      40
HR             105000    52500.0       50000       55000               2       81.5      29
IT             205000    68333.3       65000       72000               3       90.0      32

Python
# ── Apply multiple functions to one column ────────────────────────────
df.groupby("dept")["salary"].agg(["sum", "mean", "min", "max", "std"])

# ── Apply custom functions ─────────────────────────────────────────────
df.groupby("dept")["salary"].agg(
    mean="mean",                          # string name (passing np.mean is deprecated)
    range=lambda x: x.max() - x.min(),
    top_earner=lambda x: x.idxmax()       # index label of the top-paid row
)

.transform() — Apply Group Stats Back to Original Shape

Python
# transform() returns a Series with the SAME index as the original df
# Perfect for adding group-level stats as new columns

# Add mean salary of dept for each employee
df["dept_avg_salary"] = df.groupby("dept")["salary"].transform("mean")

# Add salary rank within department
df["salary_rank_in_dept"] = df.groupby("dept")["salary"].rank(ascending=False)

# Normalize salary within department (0-1)
df["salary_normalized"] = df.groupby("dept")["salary"].transform(
    lambda x: (x - x.min()) / (x.max() - x.min())
)

# Check if salary is above dept average
df["above_dept_avg"] = df["salary"] > df.groupby("dept")["salary"].transform("mean")

print(df[["name", "dept", "salary", "dept_avg_salary", "above_dept_avg"]])

.filter() — Keep/Remove Entire Groups Based on Condition

Python
# Keep only groups where mean salary > 60000
high_pay_depts = df.groupby("dept").filter(lambda x: x["salary"].mean() > 60000)

# Keep only groups with more than 2 employees
big_depts = df.groupby("dept").filter(lambda x: len(x) > 2)

# Keep groups where ALL employees have score > 80
all_high_score = df.groupby("dept").filter(lambda x: (x["score"] > 80).all())

.apply() — Most Flexible GroupBy Operation

Python
# Apply a custom function to each group (returns anything)
def dept_summary(group):
    return pd.Series({
        "employees": len(group),
        "avg_salary": group["salary"].mean(),
        "top_earner": group.loc[group["salary"].idxmax(), "name"],
        "youngest": group.loc[group["age"].idxmin(), "name"]
    })

summary = df.groupby("dept").apply(dept_summary)
print(summary)

Multi-Column GroupBy

Python
# Group by multiple columns
df2 = pd.DataFrame({
    "dept":   ["IT", "IT", "IT", "HR", "HR"],
    "level":  ["Senior", "Junior", "Senior", "Junior", "Senior"],
    "salary": [80000, 55000, 75000, 45000, 60000]
})

df2.groupby(["dept", "level"])["salary"].mean()
df2.groupby(["dept", "level"]).agg(avg=("salary", "mean"), count=("salary", "count"))

# Reset multi-level index after groupby
result = df2.groupby(["dept", "level"])["salary"].mean().reset_index()

5. 🔗 Merge and Join — Like SQL Joins

The Four Types of Joins

Python
employees = pd.DataFrame({
    "emp_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "dept_id": [10, 20, 10, 30, 20]
})

departments = pd.DataFrame({
    "dept_id": [10, 20, 40],
    "dept_name": ["IT", "HR", "Finance"]
})
Python
# ── INNER JOIN — only matching rows from both ─────────────────────────
pd.merge(employees, departments, on="dept_id", how="inner")
# Result: Alice(IT), Bob(HR), Charlie(IT), Eve(HR) → Diana(dept=30) excluded

# ── LEFT JOIN — all rows from left, matching from right ───────────────
pd.merge(employees, departments, on="dept_id", how="left")
# Result: All 5 employees; Diana gets NaN for dept_name

# ── RIGHT JOIN — all rows from right, matching from left ──────────────
pd.merge(employees, departments, on="dept_id", how="right")
# Result: IT, HR, Finance rows; Finance has no employees → NaN

# ── OUTER JOIN — all rows from both ──────────────────────────────────
pd.merge(employees, departments, on="dept_id", how="outer")
# Result: All employees + Finance dept with NaN employees

# ── CROSS JOIN — every combination (Cartesian product) ────────────────
pd.merge(employees, departments, how="cross")
# Result: 5 × 3 = 15 rows

Advanced pd.merge() Parameters

Python
# ── Different column names in each df ────────────────────────────────
pd.merge(employees, departments, left_on="dept_id", right_on="department_id")

# ── Merge on index ────────────────────────────────────────────────────
pd.merge(df1, df2, left_index=True, right_index=True)
pd.merge(df1, df2, left_on="id", right_index=True)

# ── Handle duplicate column names ────────────────────────────────────
# If both dfs have a "date" column:
pd.merge(df1, df2, on="id", suffixes=("_left", "_right"))

# ── Validate join (prevents silent duplicates) ─────────────────────────
pd.merge(employees, departments, on="dept_id",
         validate="many_to_one")  # "one_to_one", "one_to_many", "many_to_many"

# ── Indicator column — which df the row came from ─────────────────────
result = pd.merge(employees, departments, on="dept_id", how="outer", indicator=True)
print(result["_merge"].value_counts())
# both          4
# left_only     1
# right_only    1

df.join() — Index-Based Joining

Python
# join() is a shortcut for merge on index
df1 = pd.DataFrame({"salary": [50000, 60000]}, index=["Alice", "Bob"])
df2 = pd.DataFrame({"dept": ["HR", "IT"]}, index=["Alice", "Bob"])

df1.join(df2)                   # LEFT join on index (the default — not inner!)
df1.join(df2, how="left")
df1.join(df2, how="outer")

pd.concat() — Stack DataFrames

Python
df_jan = pd.DataFrame({"month": ["Jan"], "sales": [100000]})
df_feb = pd.DataFrame({"month": ["Feb"], "sales": [120000]})
df_mar = pd.DataFrame({"month": ["Mar"], "sales": [95000]})

# ── Stack vertically (rows) ───────────────────────────────────────────
result = pd.concat([df_jan, df_feb, df_mar], axis=0, ignore_index=True)

# ── Stack horizontally (columns) ──────────────────────────────────────
result = pd.concat([df1, df2], axis=1)

# ── Keep track of which df each row came from ─────────────────────────
result = pd.concat(
    [df_jan, df_feb, df_mar],
    keys=["Jan", "Feb", "Mar"],  # creates MultiIndex
    ignore_index=False
)

# ── Join behavior ─────────────────────────────────────────────────────
pd.concat([df1, df2], join="inner")   # only columns common to both
pd.concat([df1, df2], join="outer")   # all columns, NaN where missing (default)

Join Type Summary

Operation                Use When
pd.merge(how="inner")    Matching rows from BOTH tables
pd.merge(how="left")     All left rows, fill NaN for unmatched right
pd.merge(how="right")    All right rows, fill NaN for unmatched left
pd.merge(how="outer")    All rows from both, NaN where no match
pd.concat(axis=0)        Stacking rows of similar-schema DataFrames
pd.concat(axis=1)        Adding new columns side-by-side
.join()                  Index-based joining shortcut


6. 🔄 Reshaping Data — Pivot, Melt, Stack, Unstack

The Concept: Wide vs Long Format

Wide format — one row per entity, multiple columns for different measurements:

name     Jan    Feb    Mar
Alice    100    200    150
Bob       80    120    110

Long format — one row per measurement:

name   month   value
Alice  Jan     100
Alice  Feb     200
Alice  Mar     150
Bob    Jan      80
...

Pandas can convert between both. Wide is good for display; long is good for analysis.
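The two formats carry the same information, so the conversion is a lossless round trip — melt one way, pivot back (a small sketch):

```python
import pandas as pd

wide = pd.DataFrame({"name": ["Alice", "Bob"],
                     "Jan": [100, 80], "Feb": [200, 120]})

# Wide → long: one row per (name, month) measurement
long = wide.melt(id_vars="name", var_name="month", value_name="value")

# Long → wide: back to where we started (modulo column order)
back = long.pivot(index="name", columns="month", values="value").reset_index()
back.columns.name = None

print(long.shape)   # → (4, 3)
```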

pivot() — Long to Wide (Simple)

Python
long_df = pd.DataFrame({
    "name":  ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob"],
    "month": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "sales": [100, 200, 150, 80, 120, 110]
})

# Pivot: name = index, month = columns, sales = values
wide_df = long_df.pivot(index="name", columns="month", values="sales")
print(wide_df)

Output:

month   Feb  Jan  Mar
name                 
Alice   200  100  150
Bob     120   80  110

Python
# Flatten column names after pivot
wide_df.columns.name = None           # remove "month" label from column axis
wide_df = wide_df.reset_index()       # move "name" back to a column

pivot_table() — GroupBy + Pivot (Powerful!)

Python
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North", "South"],
    "product": ["Laptop", "Phone", "Laptop", "Phone", "Laptop", "Laptop"],
    "quarter": ["Q1", "Q1", "Q1", "Q1", "Q2", "Q2"],
    "sales":  [100, 200, 150, 120, 180, 90]
})

# pivot_table handles DUPLICATES by aggregating
pt = pd.pivot_table(
    sales,
    values="sales",
    index="region",          # rows
    columns="product",       # columns
    aggfunc="sum",           # what to do with duplicates (sum, mean, count, etc.)
    fill_value=0,            # replace NaN with 0
    margins=True,            # add "All" row and column (totals)
    margins_name="Total"
)
print(pt)

Output:

product  Laptop  Phone  Total
region                        
North       280    200    480
South       240    120    360
Total       520    320    840

Python
# Multiple aggregations
pd.pivot_table(
    sales,
    values="sales",
    index="region",
    columns="product",
    aggfunc={"sales": ["sum", "mean", "count"]}
)

melt() — Wide to Long (Unpivot)

Python
wide_df = pd.DataFrame({
    "name":  ["Alice", "Bob"],
    "Jan":   [100, 80],
    "Feb":   [200, 120],
    "Mar":   [150, 110]
})

# Melt: convert month columns into rows
long_df = pd.melt(
    wide_df,
    id_vars=["name"],           # columns to keep as-is
    value_vars=["Jan", "Feb", "Mar"],  # columns to melt
    var_name="month",           # name for the new "variable" column
    value_name="sales"          # name for the new "value" column
)
print(long_df)

Output:

    name month  sales
0  Alice   Jan    100
1    Bob   Jan     80
2  Alice   Feb    200
3    Bob   Feb    120
4  Alice   Mar    150
5    Bob   Mar    110

stack() and unstack() — MultiIndex Reshaping

Python
df = pd.DataFrame(
    {"Math": [90, 85], "Science": [80, 95], "English": [88, 78]},
    index=["Alice", "Bob"]
)

# stack() — column labels become inner row index
stacked = df.stack()
print(stacked)
# Alice  Math       90
#        Science    80
#        English    88
# Bob    Math       85
#        Science    95
#        English    78

# unstack() — inner row index becomes column labels (reverse of stack)
unstacked = stacked.unstack()
print(unstacked)  # back to original

pd.crosstab() — Frequency Tables

Python
df = pd.DataFrame({
    "dept": ["IT", "IT", "HR", "HR", "IT", "Finance"],
    "level": ["Senior", "Junior", "Senior", "Junior", "Senior", "Junior"],
    "gender": ["M", "F", "M", "M", "F", "F"]
})

# Count of each dept × level combination
pd.crosstab(df["dept"], df["level"])

# Percentages
pd.crosstab(df["dept"], df["level"], normalize="index")  # row percentages

# With margins
pd.crosstab(df["dept"], df["gender"], margins=True)

# Aggregation
pd.crosstab(
    df["dept"], df["gender"],
    values=df["level"].map({"Senior": 1, "Junior": 0}),
    aggfunc="sum"
)  # number of Senior employees per dept × gender

7. 🔧 Apply, Map, and Vectorized Operations

apply() — Apply a Function Along an Axis

Python
df = pd.DataFrame({
    "name":   ["Alice", "Bob", "Charlie"],
    "salary": [50000, 65000, 75000],
    "score":  [88, 92, 78]
})

# ── Apply to a Series (column) ────────────────────────────────────────
df["salary"].apply(lambda x: x * 1.1)         # 10% raise
df["salary"].apply(lambda x: f"${x:,.0f}")    # format as currency
df["name"].apply(len)                           # length of each name

# Custom function
def grade(score):
    if score >= 90: return "A"
    elif score >= 80: return "B"
    elif score >= 70: return "C"
    else: return "F"

df["grade"] = df["score"].apply(grade)

# ── Apply to rows (axis=1) ────────────────────────────────────────────
# Each row is passed as a Series
df["bonus"] = df.apply(lambda row: row["salary"] * 0.15 if row["score"] >= 90 else row["salary"] * 0.10, axis=1)

# Cleaner with a function
def calculate_bonus(row):
    base = row["salary"]
    if row["score"] >= 90:
        return base * 0.20
    elif row["score"] >= 80:
        return base * 0.15
    else:
        return base * 0.10

df["bonus"] = df.apply(calculate_bonus, axis=1)

# ── Apply to entire DataFrame ─────────────────────────────────────────
df[["salary", "score"]].apply(lambda col: (col - col.mean()) / col.std())  # standardize

map() — Element-wise Transformation on Series

Python
# ── With a dictionary ─────────────────────────────────────────────────
dept_codes = {"IT": 1, "HR": 2, "Finance": 3}
df["dept_code"] = df["dept"].map(dept_codes)

# ── With a function ───────────────────────────────────────────────────
df["salary"].map(lambda x: x * 1.1)  # same as apply for Series

# ── With a Series ─────────────────────────────────────────────────────
# map from one column to another using a lookup Series
lookup = pd.Series({"IT": "Technology", "HR": "People Ops", "Finance": "Money"})
df["dept_full"] = df["dept"].map(lookup)

applymap() / map() on DataFrame — Element-wise

Python
# pandas < 2.1: applymap()
# pandas >= 2.1: DataFrame.map() (applymap deprecated)

# Apply a function to EVERY SINGLE ELEMENT in the DataFrame
df_nums = pd.DataFrame({"a": [1.234, 2.567], "b": [3.891, 4.123]})
df_nums.map(lambda x: round(x, 1))
df_nums.map(lambda x: f"{x:.2f}")  # format all numbers

⚡ Vectorized Operations — Always Prefer These Over apply()

Python
# These are MUCH faster than apply/lambda
# apply() loops in Python — vectorized uses C code under the hood

# ❌ Slow
df["salary"].apply(lambda x: x * 1.1)

# ✅ Fast (vectorized)
df["salary"] * 1.1

# ❌ Slow
df["score"].apply(lambda x: x if x >= 80 else 0)

# ✅ Fast (numpy where)
np.where(df["score"] >= 80, df["score"], 0)

# ❌ Slow (multiple apply calls)
df["result"] = df.apply(lambda r: r["salary"] if r["score"] > 85 else 0, axis=1)

# ✅ Fast
df["result"] = np.where(df["score"] > 85, df["salary"], 0)
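A rough timing sketch of the gap (exact numbers vary by machine, but vectorization is consistently one to two orders of magnitude faster):

```python
import time
import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(1_000_000))

t0 = time.perf_counter()
slow = s.apply(lambda x: x * 1.1)   # Python-level loop over a million elements
t_apply = time.perf_counter() - t0

t0 = time.perf_counter()
fast = s * 1.1                      # single vectorized C operation
t_vec = time.perf_counter() - t0

print(f"apply: {t_apply:.4f}s  vectorized: {t_vec:.4f}s")
```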

np.where() and np.select() — Fast Conditional Assignment

Python
import numpy as np

# np.where: if condition, value_if_true, value_if_false
df["status"] = np.where(df["score"] >= 90, "High", "Normal")

# np.select: multiple conditions (like CASE WHEN in SQL)
conditions = [
    df["score"] >= 90,
    df["score"] >= 80,
    df["score"] >= 70,
]
choices = ["A", "B", "C"]
df["grade"] = np.select(conditions, choices, default="F")

8. 📝 String Operations with .str

Pandas has a powerful .str accessor for string Series.

Python
s = pd.Series(["  Alice Smith  ", "bob jones", "CHARLIE DOE", "Diana O'Brien", np.nan])

# ── Case operations ───────────────────────────────────────────────────
s.str.lower()            # "alice smith"
s.str.upper()            # "ALICE SMITH"
s.str.title()            # "Alice Smith"
s.str.capitalize()       # "Alice smith" (only first char)
s.str.swapcase()         # swap case of each character

# ── Whitespace ────────────────────────────────────────────────────────
s.str.strip()            # strip both sides
s.str.lstrip()           # strip left
s.str.rstrip()           # strip right

# ── Split and extract ─────────────────────────────────────────────────
names = pd.Series(["Alice Smith", "Bob Jones", "Charlie Doe"])
names.str.split()                  # split on whitespace → list
names.str.split(" ", expand=True)  # split into separate columns
names.str.split(" ").str[0]        # first word (first name)
names.str.split(" ").str[-1]       # last word (last name)

# ── Contains and matching ─────────────────────────────────────────────
s.str.contains("Smith", na=False)
s.str.startswith("A")
s.str.endswith("n")
s.str.match(r"^\w+")   # regex match from beginning
s.str.fullmatch(r"\w+ \w+")  # full string regex match

# ── Replace ───────────────────────────────────────────────────────────
s.str.replace("Smith", "Johnson")
s.str.replace(r"\s+", " ", regex=True)  # normalize multiple spaces
s.str.replace(r"[^a-zA-Z\s]", "", regex=True)  # remove non-alpha chars

# ── Extract with regex ────────────────────────────────────────────────
emails = pd.Series(["alice@gmail.com", "bob@yahoo.com", "charlie@company.org"])
emails.str.extract(r"(\w+)@(\w+)\.(\w+)")  # 3 capture groups → 3 columns
emails.str.extract(r"(?P<user>\w+)@(?P<domain>\w+)\.(?P<tld>\w+)")  # named groups

# ── Length ────────────────────────────────────────────────────────────
s.str.len()

# ── Count occurrences ─────────────────────────────────────────────────
s.str.count("l")  # count 'l' in each string

# ── Pad and format ────────────────────────────────────────────────────
pd.Series(["1", "22", "333"]).str.zfill(5)    # "00001", "00022", "00333"
pd.Series(["hi"]).str.center(10, "*")         # "****hi****"
pd.Series(["hi"]).str.ljust(10, ".")          # "hi........"
pd.Series(["hi"]).str.rjust(10, ".")          # "........hi"

# ── Join strings ──────────────────────────────────────────────────────
pd.Series([["Alice", "Bob"], ["Charlie", "Diana"]]).str.join(", ")

9. 🗓️ Datetime Operations

Creating and Parsing Dates

Python
import pandas as pd

# ── Parse dates ───────────────────────────────────────────────────────
df["date"] = pd.to_datetime(df["date_str"])
df["date"] = pd.to_datetime(df["date_str"], format="%d/%m/%Y")  # custom format
df["date"] = pd.to_datetime(df["date_str"], errors="coerce")    # NaT on failure
# (infer_datetime_format was deprecated in pandas 2.0 — format inference is now the default)

# ── Create date ranges ────────────────────────────────────────────────
pd.date_range("2024-01-01", periods=12, freq="MS")   # 12 month-starts
pd.date_range("2024-01-01", "2024-12-31", freq="D")  # every day in 2024
pd.date_range("09:00", periods=8, freq="h")           # hourly

# Frequency aliases: D=day, W=week, MS=month-start, ME=month-end,
# QS=quarter-start, YS=year-start, h=hour, min=minute, s=second

Datetime Accessor .dt

Python
df["date"] = pd.to_datetime(["2024-01-15", "2024-06-22", "2024-11-30"])

# Extract components
df["date"].dt.year          # 2024
df["date"].dt.month         # 1, 6, 11
df["date"].dt.day           # 15, 22, 30
df["date"].dt.hour
df["date"].dt.minute
df["date"].dt.second
df["date"].dt.weekday       # 0=Monday, 6=Sunday
df["date"].dt.day_name()    # "Monday", etc.
df["date"].dt.month_name()  # "January", etc.
df["date"].dt.quarter       # 1, 2, 4
df["date"].dt.dayofyear     # day number within year (1-365, or 366 in leap years)
df["date"].dt.isocalendar().week  # ISO week number (.weekofyear was removed in pandas 2.0)

# Is it...?
df["date"].dt.is_month_start
df["date"].dt.is_month_end
df["date"].dt.is_quarter_start
df["date"].dt.is_year_start

# Format
df["date"].dt.strftime("%B %d, %Y")  # "January 15, 2024"
df["date"].dt.strftime("%Y-%m")       # "2024-01"

# Rounding/truncating
df["date"].dt.floor("h")       # round down to hour
df["date"].dt.ceil("D")        # round up to day
df["date"].dt.round("min")     # round to nearest minute

# Timedelta
df["days_since"] = (pd.Timestamp.now() - df["date"]).dt.days

10. 🛠️ Practical Examples

Example 1: HR Analytics Pipeline

Python
import pandas as pd
import numpy as np

# ── Create sample data ────────────────────────────────────────────────
np.random.seed(42)
n = 200

df = pd.DataFrame({
    "emp_id": [f"EMP{i:04d}" for i in range(1, n+1)],
    "name": [f"Employee {i}" for i in range(1, n+1)],
    "dept": np.random.choice(["IT", "HR", "Finance", "Marketing", "Operations"], n),
    "level": np.random.choice(["Junior", "Mid", "Senior", "Lead"], n),
    "salary": np.random.randint(35000, 120000, n),
    "age": np.random.randint(22, 60, n),
    "score": np.random.uniform(60, 100, n).round(1),
    "hire_date": pd.date_range("2015-01-01", periods=n, freq="10D"),
    "gender": np.random.choice(["M", "F", "Other"], n, p=[0.48, 0.48, 0.04]),
})

# Introduce some missing values
df.loc[np.random.choice(df.index, 10, replace=False), "salary"] = np.nan
df.loc[np.random.choice(df.index, 5, replace=False), "score"] = np.nan

print(f"Dataset: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"Missing: {df.isnull().sum().to_dict()}")

# ── Step 1: Data Cleaning ──────────────────────────────────────────────
# Fill missing salary with median salary for that dept+level combo
df["salary"] = df.groupby(["dept", "level"])["salary"].transform(
    lambda x: x.fillna(x.median())
)

# Fill missing score with dept median
df["score"] = df.groupby("dept")["score"].transform(
    lambda x: x.fillna(x.median())
)

print(f"After cleaning — missing: {df.isnull().sum().sum()}")

# ── Step 2: Feature Engineering ───────────────────────────────────────
today = pd.Timestamp("2024-01-01")
df["tenure_years"] = ((today - df["hire_date"]).dt.days / 365.25).round(1)
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 40, 50, 100],
                          labels=["<30", "30-40", "40-50", "50+"])
df["experience_level"] = np.select(
    [df["tenure_years"] < 2, df["tenure_years"] < 5, df["tenure_years"] < 10],
    ["New", "Growing", "Experienced"],
    default="Veteran"
)

# ── Step 3: Analysis ───────────────────────────────────────────────────
print("\n" + "="*60)
print("🏢 DEPARTMENT SUMMARY")
print("="*60)

dept_summary = df.groupby("dept").agg(
    headcount=("emp_id", "count"),
    avg_salary=("salary", "mean"),
    median_salary=("salary", "median"),
    avg_score=("score", "mean"),
    avg_tenure=("tenure_years", "mean"),
    avg_age=("age", "mean")
).round(1).sort_values("avg_salary", ascending=False)

print(dept_summary.to_string())

print("\n" + "="*60)
print("💰 SALARY BY DEPT × LEVEL")
print("="*60)

salary_pivot = df.pivot_table(
    values="salary",
    index="dept",
    columns="level",
    aggfunc="mean",
    fill_value=0
).round(0)

print(salary_pivot.to_string())

print("\n" + "="*60)
print("🚨 HIGH POTENTIAL EMPLOYEES (Score > 90, Salary < Dept Median)")
print("="*60)

# Add dept median salary for comparison
df["dept_median_salary"] = df.groupby("dept")["salary"].transform("median")
high_potential = df[
    (df["score"] > 90) & (df["salary"] < df["dept_median_salary"])
][["emp_id", "name", "dept", "level", "salary", "score"]].sort_values("score", ascending=False)

print(f"Found {len(high_potential)} high-potential underpaid employees")
print(high_potential.head(10).to_string(index=False))

# ── Step 4: Export ─────────────────────────────────────────────────────
with pd.ExcelWriter("hr_analytics.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Raw Data", index=False)
    dept_summary.to_excel(writer, sheet_name="Dept Summary")
    salary_pivot.to_excel(writer, sheet_name="Salary by Level")
    high_potential.to_excel(writer, sheet_name="High Potential", index=False)

print("\n✅ Report saved to hr_analytics.xlsx")

Example 2: Sales Merge and Analysis Pipeline

Python
import pandas as pd

# ── Tables (like database relations) ──────────────────────────────────
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1003, 1004, 1005],
    "customer_id": [201, 202, 201, 203, 204],
    "product_id": [301, 302, 303, 301, 304],
    "quantity": [2, 1, 3, 1, 2],
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-12", "2024-02-03",
                                    "2024-02-18", "2024-03-01"])
})

customers = pd.DataFrame({
    "customer_id": [201, 202, 203, 204, 205],
    "name": ["Alice Corp", "Bob Ltd", "Charlie Inc", "Diana LLC", "Eve Co"],
    "region": ["North", "South", "East", "West", "North"],
    "tier": ["Gold", "Silver", "Gold", "Bronze", "Silver"]
})

products = pd.DataFrame({
    "product_id": [301, 302, 303, 304, 305],
    "product_name": ["Laptop", "Phone", "Tablet", "Watch", "Earbuds"],
    "category": ["Computing", "Mobile", "Computing", "Wearable", "Audio"],
    "unit_price": [999.99, 599.99, 449.99, 299.99, 149.99]
})

# ── Step 1: JOIN all three tables ─────────────────────────────────────
full = (
    orders
    .merge(customers, on="customer_id", how="left")
    .merge(products, on="product_id", how="left")
)

# ── Step 2: Calculate revenue ─────────────────────────────────────────
full["revenue"] = full["quantity"] * full["unit_price"]
full["month"] = full["order_date"].dt.strftime("%Y-%m")

print("Merged Dataset:")
print(full[["order_id", "name", "product_name", "quantity", "unit_price", "revenue"]].to_string(index=False))

# ── Step 3: Analysis ───────────────────────────────────────────────────
print("\n📊 Revenue by Region:")
print(full.groupby("region")["revenue"].sum().sort_values(ascending=False).to_frame())

print("\n📦 Revenue by Category:")
print(full.groupby("category")["revenue"].sum().sort_values(ascending=False).to_frame())

print("\n🥇 Customer Tier Analysis:")
print(full.groupby("tier").agg(
    orders=("order_id", "count"),
    revenue=("revenue", "sum"),
    avg_order=("revenue", "mean")
).round(2).to_string())

# ── Step 4: Pivot — monthly revenue by region ─────────────────────────
monthly_region = pd.pivot_table(
    full, values="revenue", index="month",
    columns="region", aggfunc="sum", fill_value=0
)
print("\n📅 Monthly Revenue by Region:")
print(monthly_region.to_string())

11. ⚠️ Edge Cases and Common Errors

Error 1: groupby().agg() — "No numeric types to aggregate"

Python
# ❌ Trying to aggregate non-numeric columns
df.groupby("dept").agg("mean")  # TypeError if 'name' (string) is included
#    (older pandas raised DataError: No numeric types to aggregate)

# ✅ Select numeric columns first
df.groupby("dept")[["salary", "score"]].agg("mean")

# ✅ Or use the numeric_only parameter
df.groupby("dept").mean(numeric_only=True)

Error 2: merge() — Duplicate Rows After Join

Python
# This happens when the join key is not unique in one or both tables
# Solution: check before merging
print("Orders unique order_id?", orders["order_id"].is_unique)
print("Customers unique customer_id?", customers["customer_id"].is_unique)

# Use validate to enforce uniqueness
pd.merge(orders, customers, on="customer_id", validate="many_to_one")
# Raises MergeError if customer_id is not unique in customers

Error 3: pivot() — "Index contains duplicate entries"

Python
# ❌ pivot() requires unique index+column combinations
# If a combination appears twice, use pivot_table() instead
df.pivot(index="name", columns="month", values="sales")  # ValueError if duplicates

# ✅ Use pivot_table() with an aggfunc
df.pivot_table(index="name", columns="month", values="sales", aggfunc="sum")

Error 4: str.contains() — TypeError on NaN

Python
# ❌ NaN in a string column puts NaN in the boolean mask
df[df["name"].str.contains("Ali")]   # ValueError: cannot mask with non-boolean array containing NA

# ✅ Always use na=False
df[df["name"].str.contains("Ali", na=False)]

12. 💡 Pro Developer Insights

Use Method Chaining for Readable Pipelines

Python
# ❌ Hard to read — creates many intermediate variables
df1 = df.drop_duplicates()
df2 = df1.dropna(subset=["salary"])
df3 = df2[df2["salary"] > 50000]
df4 = df3.assign(bonus=df3["salary"] * 0.10)
df5 = df4.groupby("dept")["bonus"].sum()

# ✅ Clean chained pipeline
result = (
    df
    .drop_duplicates()
    .dropna(subset=["salary"])
    .query("salary > 50000")
    .assign(bonus=lambda x: x["salary"] * 0.10)
    .groupby("dept")["bonus"].sum()
)

GroupBy Performance Tips

Python
# ✅ Sort=False speeds up groupby (doesn't sort result by key)
df.groupby("dept", sort=False)["salary"].mean()

# ✅ Use as_index=False to get flat DataFrame (avoids reset_index())
df.groupby("dept", as_index=False)["salary"].mean()

# ✅ observed=True for categorical groupby (avoid empty groups)
df["dept"] = df["dept"].astype("category")
df.groupby("dept", observed=True)["salary"].sum()

Avoid Repeated Computation with transform()

Python
# ❌ Computing group stats twice
df["dept_avg"] = df.groupby("dept")["salary"].mean().reindex(df["dept"]).values
df["above_avg"] = df["salary"] > df.groupby("dept")["salary"].mean().reindex(df["dept"]).values

# ✅ Use transform (computes once, broadcasts back)
dept_avg = df.groupby("dept")["salary"].transform("mean")
df["dept_avg"] = dept_avg
df["above_avg"] = df["salary"] > dept_avg

13. 🎤 Interview Questions — Part 2

Basic Level

Q1: What is the difference between merge() and join()?

merge() is more flexible — join on any column(s). join() is a shortcut that joins on the index by default. Internally, join() calls merge().
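A minimal sketch of the equivalence (toy frames, not from the post):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]}).set_index("key")
right = pd.DataFrame({"key": ["a", "b"], "y": [10, 20]}).set_index("key")

# join(): index-on-index by default
joined = left.join(right)

# merge(): same result, but the keys are explicit
merged = left.merge(right, left_index=True, right_index=True)

print(joined.equals(merged))  # True — join() is sugar over merge()
```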

Q2: How does groupby().transform() differ from groupby().agg()?

agg() reduces each group to one row. transform() returns values with the same shape as input — every original row gets its group's computed value. Use transform() to add group stats as new columns.
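The shape difference in three lines (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({"dept": ["IT", "IT", "HR"], "salary": [100, 200, 50]})

agg_result = df.groupby("dept")["salary"].mean()               # one row per group
trans_result = df.groupby("dept")["salary"].transform("mean")  # one row per input row

print(len(agg_result))    # 2 — one value per group
print(len(trans_result))  # 3 — same length as df, ready to assign as a column
```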

Q3: What is pd.pivot_table() and when would you use it over pivot()?

pivot() requires unique index/column combinations. pivot_table() handles duplicates by aggregating (like GroupBy + pivot). Use pivot_table() whenever data has duplicate combinations.
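A quick demo with a deliberately duplicated (name, month) pair — pivot() would raise here, pivot_table() aggregates:

```python
import pandas as pd

sales = pd.DataFrame({
    "name":  ["Alice", "Alice", "Bob"],   # Alice/Jan appears twice
    "month": ["Jan", "Jan", "Jan"],
    "sales": [100, 50, 70],
})

# pivot_table() sums the duplicate combination instead of failing
table = sales.pivot_table(index="name", columns="month",
                          values="sales", aggfunc="sum")
print(table.loc["Alice", "Jan"])  # 150
```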

Intermediate Level

Q4: How do you find rows where values from two DataFrames DON'T match (anti-join)?

Python
# Left anti-join (rows in df1 NOT in df2)
merged = pd.merge(df1, df2, on="key", how="left", indicator=True)
anti_join = merged[merged["_merge"] == "left_only"].drop("_merge", axis=1)

Q5: How would you efficiently apply a function that depends on multiple columns?

Python
# Option 1: apply with axis=1 (slow but simple)
df.apply(lambda row: f(row["col1"], row["col2"]), axis=1)

# Option 2: np.vectorize (often faster than row-wise apply, but still a Python loop)
np.vectorize(f)(df["col1"], df["col2"])

# Option 3: vectorized numpy operations (fastest)
np.where(df["col1"] > df["col2"], df["col1"] * 2, df["col2"] * 3)

Advanced Level

Q6: How does Pandas handle categorical data in GroupBy, and why does it matter?

When a column is category dtype and you groupby it, Pandas shows all categories in the result — even those with no rows — unless you pass observed=True. Since pandas 2.1, relying on the default observed=False triggers a FutureWarning, and the default flips to True in pandas 3.0. Pass observed explicitly for predictable results.
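A small sketch of the behavior, using a category ("Finance") that has no rows:

```python
import pandas as pd

df = pd.DataFrame({"dept": ["IT", "HR"], "salary": [100, 50]})
df["dept"] = df["dept"].astype(pd.CategoricalDtype(["IT", "HR", "Finance"]))

# observed=False keeps the empty "Finance" group (sum = 0)
all_groups = df.groupby("dept", observed=False)["salary"].sum()
# observed=True keeps only categories that actually appear in the data
seen_groups = df.groupby("dept", observed=True)["salary"].sum()

print(len(all_groups))   # 3 — includes the empty Finance group
print(len(seen_groups))  # 2
```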

Q7: What happens when you concat two DataFrames with different columns?

By default, concat(join="outer") keeps all columns from both and fills missing with NaN. With join="inner", only common columns are kept.
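For example, with toy frames that share only column y:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"y": [5], "z": [6]})

outer = pd.concat([a, b])                # default join="outer": x, y, z with NaN gaps
inner = pd.concat([a, b], join="inner")  # only the shared column: y

print(sorted(outer.columns))  # ['x', 'y', 'z']
print(list(inner.columns))    # ['y']
```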


14. ✅ Conclusion — Part 2

In Part 2, you mastered:

  • .loc and .iloc — label vs position indexing; the key differences

  • Boolean filtering — single, multi-condition, .isin(), .between(), .query()

  • Missing data — dropna(), fillna(), interpolate(), group-based imputation

  • GroupBy — split-apply-combine, agg(), transform(), filter(), apply()

  • Merge and Join — inner/left/right/outer/cross joins, pd.concat()

  • Reshaping — pivot(), pivot_table(), melt(), stack(), unstack()

  • String operations — the full .str accessor

  • Datetime operations — .dt accessor, date ranges, time arithmetic

  • Apply, map, vectorize — and why to prefer vectorized ops


🔜 What's Coming in Part 3?

In Part 3: Advanced Pandas, we'll cover:

  • Performance Optimization — profiling, memory reduction, faster alternatives to apply

  • MultiIndex — hierarchical indexing in depth

  • Time Series — resampling, rolling windows, shift, lag features

  • Pipe and Method Chaining — building production-grade pipelines

  • Pandas + Machine Learning — feature engineering, scikit-learn integration

  • Real-World Data Pipeline — end-to-end ETL with Pandas

  • Pandas 2.x New Features — Copy-on-Write, new string dtype, Arrow backend

  • Complete Interview Questions — 30+ questions with answers


📌 Save this. The GroupBy + Merge sections alone will save you hours every week.

🔗 Next: Part 3 — Advanced Pandas: Performance, Time Series & ML Pipelines
