
Pandas for Python Developers: The Complete Guide (Part 1 — Fundamentals)


Series Overview:

  • Part 1 (This Post): Fundamentals — Series, DataFrames, I/O, Basic Operations

  • Part 2: Data Manipulation — Indexing, Filtering, GroupBy, Merge/Join, Reshaping

  • Part 3: Advanced Pandas — Performance, Time Series, Real-World Pipelines, ML & Interview Questions


1. Introduction

What is Pandas?

Pandas is the most popular data manipulation and analysis library in Python. The name comes from "Panel Data" — a term used in econometrics for multidimensional structured datasets.

At its core, Pandas gives you two powerful data structures:

  • Series — a one-dimensional labeled array

  • DataFrame — a two-dimensional labeled table (like an Excel sheet or SQL table)

Think of Pandas as Excel + SQL + Python combined into one blazing-fast library.
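A minimal first taste of both structures (the drink data here is made up for illustration):

```python
import pandas as pd

# A Series: one labeled column of data
prices = pd.Series([3.50, 2.25, 4.00], index=["latte", "espresso", "mocha"])

# A DataFrame: a labeled table — several Series sharing one index
drinks = pd.DataFrame({"price": prices, "calories": [190, 5, 360]})

print(prices["latte"])                  # 3.5
print(drinks.loc["mocha", "calories"])  # 360
```

Both structures are covered in depth in sections 3 and 4 below.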

Why is Pandas Important?

  • 🚀 Speed — Built on top of NumPy, extremely fast for tabular data

  • 🧹 Data Cleaning — Handle missing data, duplicates, wrong types effortlessly

  • 🔗 Integration — Works with CSV, Excel, JSON, SQL, Parquet, APIs, and more

  • 📊 Analysis — Grouping, aggregation, pivoting, filtering — all in a few lines

  • 🤖 ML-Ready — Direct input to scikit-learn, TensorFlow, and PyTorch pipelines

Real-World Usage

| Industry | Use Case |
| --- | --- |
| Finance | Stock price analysis, portfolio management |
| Healthcare | Patient data cleaning, EHR analysis |
| E-commerce | Sales reports, customer segmentation |
| Data Engineering | ETL pipelines, data validation |
| Machine Learning | Feature engineering, dataset preparation |

Installing Pandas

Bash
pip install pandas

# With optional dependencies (recommended)
pip install pandas openpyxl xlrd sqlalchemy pyarrow
Python
import pandas as pd
import numpy as np

print(pd.__version__)  # e.g., 2.2.0

2. 🧱 Basic Concepts

2.1 The Pandas Data Model

Pandas is built on top of NumPy arrays. Every column in a DataFrame is essentially a NumPy array under the hood — which is why Pandas is so fast.

Text
DataFrame
│
├── Column A (Series) → NumPy Array
├── Column B (Series) → NumPy Array
└── Column C (Series) → NumPy Array
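You can see this directly by pulling a column out and asking for its NumPy representation (a small sketch with made-up data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

col = df["A"]           # a Series
arr = col.to_numpy()    # the column's values as a plain NumPy array
print(type(arr))        # <class 'numpy.ndarray'>
print(arr)              # [1 2 3]
```

Because the values live in NumPy arrays, column-wise operations run as fast compiled loops rather than Python-level iteration.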

2.2 The Index — The Heart of Pandas

Every Pandas object has an Index — labels for rows (and columns). This is what makes Pandas unique compared to plain arrays.

Python
import pandas as pd

# Default integer index (0, 1, 2...)
s = pd.Series([10, 20, 30])
print(s.index)   # RangeIndex(start=0, stop=3, step=1)

# Custom string index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])    # 20

3. 📊 The Series — Deep Dive

What is a Series?

A Series is a 1D array with labels. Think of it as a single column in a spreadsheet.

Python
# Structure: pd.Series(data, index, dtype, name)
s = pd.Series(
    data=[100, 200, 300, 400],
    index=["Jan", "Feb", "Mar", "Apr"],
    name="Monthly Revenue"
)
print(s)

Output:

Text
Jan    100
Feb    200
Mar    300
Apr    400
Name: Monthly Revenue, dtype: int64

Creating a Series — All Methods

Python
import pandas as pd
import numpy as np

# 1. From a Python list
s1 = pd.Series([1, 2, 3, 4, 5])

# 2. From a NumPy array
s2 = pd.Series(np.array([10.5, 20.1, 30.8]))

# 3. From a Python dictionary (keys become index)
s3 = pd.Series({"Alice": 95, "Bob": 87, "Charlie": 92})

# 4. From a scalar (broadcasts to all index positions)
s4 = pd.Series(42, index=["a", "b", "c", "d"])
print(s4)
# a    42
# b    42
# c    42
# d    42

# 5. With explicit dtype
s5 = pd.Series([1, 2, 3], dtype="float64")

# 6. With name
s6 = pd.Series([100, 200], name="sales")

# 7. Inspecting a Series
data = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])
print(data)
# a    1
# b    2
# c    3
# d    4
# e    5
# dtype: int64

data.index      # Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
data.values     # array([1, 2, 3, 4, 5])
data.to_dict()  # {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
data.to_numpy() # array([1, 2, 3, 4, 5])
data.to_list()  # [1, 2, 3, 4, 5]

Essential Series Attributes

Python
s = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

s.values        # numpy array: [10 20 30 40 50]
s.index         # Index(['a', 'b', 'c', 'd', 'e'])
s.dtype         # dtype('int64')
s.name          # None (or set name)
s.shape         # (5,)
s.size          # 5
s.ndim          # 1
s.nbytes        # 40 (memory in bytes)
s.is_unique     # True (all values distinct?)
s.hasnans       # False

Essential Series Methods

Python
s = pd.Series([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])

# --- Statistics ---
s.sum()          # 39
s.mean()         # 3.9
s.median()       # 3.5
s.std()          # 2.47 (sample std, ddof=1 by default)
s.var()          # 6.10
s.min()          # 1
s.max()          # 9
s.count()        # 10 (non-NaN count)
s.describe()     # summary stats

# --- Sorting ---
s.sort_values()            # sort by values
s.sort_values(ascending=False)
s.sort_index()             # sort by index

# --- Value Counts ---
s.value_counts()           # frequency of each value
s.unique()                 # array of unique values
s.nunique()                # count of unique values

# --- Math Operations ---
s + 10          # add 10 to all
s * 2           # multiply all by 2
s ** 2          # square all
np.sqrt(s)      # square root (NumPy works directly)

# --- Type Conversion ---
s.astype(float)
s.astype(str)

# --- Missing Data ---
s.isnull()      # boolean mask of NaN
s.notnull()
s.dropna()      # remove NaN
s.fillna(0)     # fill NaN with 0

# --- Apply custom function ---
s.apply(lambda x: x * 2 if x > 3 else x)

# --- String methods (if string Series) ---
names = pd.Series(["Alice", "Bob", "Charlie"])
names.str.lower()
names.str.upper()
names.str.len()
names.str.contains("li")
names.str.replace("a", "@")

Series Alignment (Super Important!)

When you do math between two Series, Pandas aligns by index — not by position.

Python
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

result = s1 + s2
print(result)
# a     NaN   (a not in s2)
# b    12.0   (2 + 10)
# c    23.0   (3 + 20)
# d     NaN   (d not in s1)

This is called index alignment and it's one of Pandas' most powerful features.
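If the NaNs produced by non-overlapping labels aren't what you want, the operator methods (`add`, `sub`, `mul`, `div`) accept a `fill_value`:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Missing side of each label is treated as 0 instead of producing NaN
result = s1.add(s2, fill_value=0)
print(result)
# a     1.0
# b    12.0
# c    23.0
# d    30.0
```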


4. 📋 The DataFrame — Deep Dive

What is a DataFrame?

A DataFrame is a 2D labeled data structure — like a spreadsheet or SQL table. It has rows and columns, each with labels (index and column names).

Text
             name   age  salary
0           Alice    25   50000
1             Bob    30   60000
2         Charlie    35   75000

Creating a DataFrame — All Methods

Python
import pandas as pd
import numpy as np

# ── Method 1: From a dictionary of lists ──────────────────────────────
df1 = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "salary": [50000, 60000, 75000]
})

# ── Method 2: From a list of dictionaries (most common in APIs) ────────
df2 = pd.DataFrame([
    {"name": "Alice", "age": 25, "salary": 50000},
    {"name": "Bob", "age": 30, "salary": 60000},
    {"name": "Charlie", "age": 35, "salary": 75000}
])

# ── Method 3: From a NumPy array ──────────────────────────────────────
df3 = pd.DataFrame(
    np.random.randint(0, 100, size=(4, 3)),
    columns=["A", "B", "C"]
)

# ── Method 4: From a list of lists ────────────────────────────────────
df4 = pd.DataFrame(
    [["Alice", 25], ["Bob", 30]],
    columns=["name", "age"]
)

# ── Method 5: From a Series ───────────────────────────────────────────
s = pd.Series([1, 2, 3], name="values")
df5 = pd.DataFrame(s)

# ── Method 6: Empty DataFrame with schema ─────────────────────────────
df6 = pd.DataFrame(columns=["id", "name", "score"])

# ── Method 7: With custom index ───────────────────────────────────────
df7 = pd.DataFrame(
    {"score": [85, 90, 78]},
    index=["Alice", "Bob", "Charlie"]
)

Essential DataFrame Attributes

Python
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "salary": [50000, 60000, 75000, 55000],
    "dept": ["HR", "IT", "Finance", "IT"]
})

df.shape         # (4, 4) → rows, columns
df.ndim          # 2
df.size          # 16 (total elements)
df.dtypes        # dtype of each column
df.columns       # Index(['name', 'age', 'salary', 'dept'])
df.index         # RangeIndex(start=0, stop=4, step=1)
df.values        # NumPy array of all data
df.axes          # [row index, column index]
df.memory_usage()# memory per column in bytes

First Look at Your Data

Python
df.head()        # first 5 rows (default)
df.head(10)      # first 10 rows
df.tail()        # last 5 rows
df.tail(3)       # last 3 rows
df.sample(3)     # 3 random rows
df.sample(frac=0.1)  # 10% random sample

df.info()        # column names, dtypes, non-null counts, memory
df.describe()    # statistical summary (numeric cols)
df.describe(include="all")    # include object/categorical cols
df.describe(include="object") # string/categorical only

df.info() Output Explained

Python
df.info()
Text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    4 non-null      object
 1   age     4 non-null      int64 
 2   salary  4 non-null      int64 
 3   dept    4 non-null      object
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes

This tells you:

  • Total rows and columns

  • Which columns have missing values

  • Data types (important for performance and operations)

  • Memory usage


5. 📂 Reading and Writing Data (I/O Operations)

This is one of the most important practical skills: a large share of your day-to-day Pandas time goes into just reading and writing files.

5.1 CSV Files

Python
# ── Reading ────────────────────────────────────────────────────────────
df = pd.read_csv("data.csv")

# Common parameters:
df = pd.read_csv(
    "data.csv",
    sep=",",             # delimiter (use sep="\t" for TSV)
    header=0,            # row to use as column names (0 = first row)
    index_col="id",      # set column as index
    usecols=["name", "age", "salary"],  # read only specific columns
    nrows=1000,          # read only first 1000 rows
    skiprows=2,          # skip first 2 rows
    dtype={"age": int, "salary": float},  # specify dtypes
    na_values=["N/A", "NULL", "none", ""],  # treat as NaN
    parse_dates=["date_col"],  # auto-parse date columns
    encoding="utf-8",    # encoding (try "latin-1" if utf-8 fails)
    low_memory=False,    # avoid mixed-type warnings
    chunksize=10000,     # read in chunks — NOTE: returns an iterator, not a DataFrame
)

# ── Writing ────────────────────────────────────────────────────────────
df.to_csv("output.csv")

df.to_csv(
    "output.csv",
    index=False,         # don't write row numbers
    sep=",",
    encoding="utf-8",
    columns=["name", "salary"],  # write only specific columns
    header=True,         # write column names
    mode="a",            # append instead of overwrite
    na_rep="NULL",       # represent NaN as "NULL"
    float_format="%.2f", # format floats to 2 decimal places
    date_format="%Y-%m-%d",
)

Chunked Reading (for Big Files)

Python
# When a CSV is too large to fit in memory, read in chunks
chunk_list = []
for chunk in pd.read_csv("huge_file.csv", chunksize=10000):
    # Process each chunk
    filtered = chunk[chunk["age"] > 30]
    chunk_list.append(filtered)

df = pd.concat(chunk_list, ignore_index=True)
print(f"Final shape: {df.shape}")

5.2 Excel Files

Python
# ── Reading ────────────────────────────────────────────────────────────
# pip install openpyxl (for .xlsx)  xlrd (for .xls)
df = pd.read_excel("data.xlsx")

df = pd.read_excel(
    "data.xlsx",
    sheet_name="Sheet1",     # specify sheet name or index (0-based)
    header=0,
    usecols="A:D",            # Excel column range
    skiprows=2,
    dtype={"id": str},
    na_values=["", "N/A"],
)

# Read all sheets at once
all_sheets = pd.read_excel("data.xlsx", sheet_name=None)
# Returns dict: {"Sheet1": df1, "Sheet2": df2, ...}

# ── Writing ────────────────────────────────────────────────────────────
df.to_excel("output.xlsx", index=False, sheet_name="Results")

# Multiple sheets in one file
with pd.ExcelWriter("multi_sheet.xlsx", engine="openpyxl") as writer:
    df1.to_excel(writer, sheet_name="Sales", index=False)
    df2.to_excel(writer, sheet_name="HR", index=False)
    df3.to_excel(writer, sheet_name="Finance", index=False)

5.3 JSON Files

Python
# ── Reading ────────────────────────────────────────────────────────────
# records format: [{"col": val}, {"col": val}, ...]
df = pd.read_json("data.json")

df = pd.read_json(
    "data.json",
    orient="records",    # "records", "split", "index", "columns", "values"
    lines=True,          # for JSON Lines format (one JSON per line)
    dtype={"id": str},
    encoding="utf-8",
)

# Reading from a URL (API response)
import json, urllib.request
url = "https://jsonplaceholder.typicode.com/users"
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read())
df = pd.DataFrame(data)

# ── Writing ────────────────────────────────────────────────────────────
df.to_json("output.json", orient="records", indent=2)
df.to_json("output.json", orient="records", lines=True)  # JSON Lines

5.4 SQL Databases

Python
# pip install sqlalchemy
from sqlalchemy import create_engine

# ── Connect ────────────────────────────────────────────────────────────
# SQLite
engine = create_engine("sqlite:///mydb.db")

# PostgreSQL
engine = create_engine("postgresql+psycopg2://user:password@host:5432/dbname")

# MySQL
engine = create_engine("mysql+pymysql://user:password@host:3306/dbname")

# SQL Server
engine = create_engine("mssql+pyodbc://user:password@host/dbname?driver=ODBC+Driver+17+for+SQL+Server")

# ── Reading ────────────────────────────────────────────────────────────
# Read an entire table
df = pd.read_sql_table("employees", con=engine)

# Read with SQL query
df = pd.read_sql_query(
    "SELECT name, age, salary FROM employees WHERE dept = 'IT'",
    con=engine
)

# Universal (works for both table name and SQL string)
df = pd.read_sql("SELECT * FROM employees LIMIT 100", con=engine)

# With parameters (safe — avoids SQL injection); wrap named params in text()
from sqlalchemy import text
df = pd.read_sql(
    text("SELECT * FROM employees WHERE dept = :dept AND age > :age"),
    con=engine,
    params={"dept": "IT", "age": 30},
)

# ── Writing ────────────────────────────────────────────────────────────
df.to_sql(
    "employees_backup",
    con=engine,
    if_exists="replace",    # "fail", "replace", "append"
    index=False,
    chunksize=1000,         # write in chunks
    method="multi",         # use multi-row INSERT (faster)
)

5.5 Parquet Files (Modern Big Data Format)

Python
# pip install pyarrow  (or fastparquet)

# ── Writing ────────────────────────────────────────────────────────────
df.to_parquet("data.parquet", index=False, engine="pyarrow", compression="snappy")

# ── Reading ────────────────────────────────────────────────────────────
df = pd.read_parquet("data.parquet", engine="pyarrow")

# Read specific columns only (much faster for wide tables)
df = pd.read_parquet("data.parquet", columns=["name", "salary"])

💡 Parquet vs CSV: Parquet files are often 5-10x smaller and 10-100x faster to read. Prefer Parquet for large datasets in production.

5.6 Other Formats

Python
# ── Clipboard ─────────────────────────────────────────────────────────
df = pd.read_clipboard()  # paste from Excel/browser table
df.to_clipboard(index=False)

# ── HTML (scraping tables) ─────────────────────────────────────────────
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP")
df = tables[0]  # first table on the page

# ── Pickle (Python objects, preserves dtypes perfectly) ───────────────
df.to_pickle("data.pkl")
df = pd.read_pickle("data.pkl")

# ── HDF5 (for very large datasets) ────────────────────────────────────
df.to_hdf("data.h5", key="employees", mode="w")
df = pd.read_hdf("data.h5", key="employees")

I/O Format Comparison

| Format | Speed | File Size | Human Readable | Best For |
| --- | --- | --- | --- | --- |
| CSV | Slow | Large | ✅ Yes | Small files, sharing |
| JSON | Medium | Large | ✅ Yes | APIs, configs |
| Parquet | Fast | Small | ❌ No | Big data, production |
| Excel | Slow | Medium | ✅ Yes | Business reports |
| Pickle | Fast | Medium | ❌ No | Python-only caching |
| HDF5 | Very Fast | Variable | ❌ No | Scientific data |
| SQL | Varies | Varies | Via query | Transactional data |


6. 🔧 Essential DataFrame Operations

6.1 Selecting Columns

Python
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "salary": [50000, 60000, 75000],
    "dept": ["HR", "IT", "Finance"]
})

# ── Single column → returns Series ────────────────────────────────────
df["name"]         # bracket notation (always works)
df.name            # attribute notation (AVOID — breaks if col name has spaces/conflicts)

# ── Multiple columns → returns DataFrame ──────────────────────────────
df[["name", "salary"]]
df[["name", "age", "dept"]]

# ── All columns except one ────────────────────────────────────────────
df.drop(columns=["salary"])

# ── Select by dtype ───────────────────────────────────────────────────
df.select_dtypes(include="number")    # numeric columns only
df.select_dtypes(include="object")    # string columns only
df.select_dtypes(exclude="object")    # exclude strings

6.2 Adding and Removing Columns

Python
# ── Adding new columns ────────────────────────────────────────────────
df["annual_bonus"] = df["salary"] * 0.10
df["senior"] = df["age"] >= 30   # boolean column
df["full_info"] = df["name"] + " (" + df["dept"] + ")"

# Using assign() — returns a new df, doesn't modify original
df_new = df.assign(
    tax=df["salary"] * 0.30,
    net_salary=df["salary"] * 0.70,
    age_group=lambda x: pd.cut(x["age"], bins=[0, 30, 50, 100], labels=["Young", "Mid", "Senior"])
)

# Insert at specific position
df.insert(1, "employee_id", range(1, len(df) + 1))

# ── Removing columns ──────────────────────────────────────────────────
df.drop("annual_bonus", axis=1, inplace=True)              # single column
df.drop(["salary", "dept"], axis=1, inplace=True)          # multiple columns
df.drop(columns=["salary", "dept"])                        # cleaner syntax

# ── Rename columns ────────────────────────────────────────────────────
df.rename(columns={"name": "full_name", "age": "years"})

# Make all column names lowercase and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(" ", "_")

6.3 Adding and Removing Rows

Python
# ── Adding rows ───────────────────────────────────────────────────────
new_row = pd.DataFrame([{"name": "Diana", "age": 28, "salary": 55000, "dept": "Marketing"}])
df = pd.concat([df, new_row], ignore_index=True)

# ── Removing rows ─────────────────────────────────────────────────────
df.drop(0, axis=0)                 # drop row with index 0
df.drop([0, 2, 4])                 # drop multiple rows by index label
df.drop(df.index[df["age"] < 25]) # drop rows where condition is True
df[df["age"] >= 25]               # filter (preferred over dropping)

# ── Reset index after drops ───────────────────────────────────────────
df = df.drop(0).reset_index(drop=True)

6.4 Data Types and Conversion

Python
df = pd.DataFrame({
    "id": ["1", "2", "3"],
    "salary": ["50000", "60000", "75000"],
    "hire_date": ["2020-01-15", "2019-03-22", "2021-07-01"],
    "active": ["True", "False", "True"]
})

# Check dtypes
print(df.dtypes)

# ── Convert types ─────────────────────────────────────────────────────
df["id"] = df["id"].astype(int)
df["salary"] = df["salary"].astype(float)
df["hire_date"] = pd.to_datetime(df["hire_date"])
df["active"] = df["active"].map({"True": True, "False": False})

# Safe numeric conversion (won't crash on errors)
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")  # NaN on failure
# Note: errors="ignore" (keep original on failure) is deprecated in pandas 2.2+

# Convert to categorical (saves memory for low-cardinality string cols)
df["dept"] = pd.Series(["HR", "IT", "HR"])   # example column for illustration
df["dept"] = df["dept"].astype("category")

# Full conversion with convert_dtypes() — auto-detects best types
df = df.convert_dtypes()

6.5 Viewing, Sorting, Reindexing

Python
# ── Sorting ───────────────────────────────────────────────────────────
df.sort_values("salary")                           # ascending
df.sort_values("salary", ascending=False)          # descending
df.sort_values(["dept", "salary"], ascending=[True, False])  # multi-col sort
df.sort_index()                                    # sort by index

# ── Reindexing ────────────────────────────────────────────────────────
df.reset_index(drop=True)             # reset to 0,1,2...
df.set_index("name")                  # use 'name' column as index
df.set_index("name", drop=True)       # remove 'name' from columns

# ── Transposing ───────────────────────────────────────────────────────
df.T   # swap rows and columns

7. 🔍 Understanding dtypes in Depth

One of the most common sources of bugs and slowness in Pandas is wrong data types.

All Pandas dtypes

| dtype | Description | Example |
| --- | --- | --- |
| int64 | 64-bit integer | 1, 42, -7 |
| int32 | 32-bit integer (less memory) | |
| float64 | 64-bit float | 3.14, 2.718 |
| float32 | 32-bit float (less memory) | |
| object | Python string (or mixed) | "hello" |
| string | Pandas string type (new) | "hello" |
| bool | Boolean | True/False |
| datetime64[ns] | Timestamps | 2024-01-01 |
| timedelta64[ns] | Time differences | 3 days |
| category | Categorical (low cardinality) | "M"/"F" |
| complex128 | Complex numbers | 3+4j |

Python
# Memory optimization example
df = pd.DataFrame({
    "user_id": np.random.randint(1, 100000, 1_000_000),
    "score": np.random.uniform(0, 100, 1_000_000),
    "grade": np.random.choice(["A", "B", "C", "D"], 1_000_000)
})

print(f"Before: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

# Optimize types
df["user_id"] = df["user_id"].astype("int32")       # int64 → int32 = 50% less
df["score"] = df["score"].astype("float32")          # float64 → float32 = 50% less
df["grade"] = df["grade"].astype("category")         # object → category = huge savings

print(f"After:  {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
# Exact numbers vary by platform and Pandas version, but expect a large
# reduction — the object → category conversion typically saves the most.

8. 🛠️ Practical Examples

Example 1: Analyzing Employee Data (Beginner)

Python
import pandas as pd

# Step 1: Load data
df = pd.read_csv("employees.csv")

# Step 2: First look
print("Shape:", df.shape)          # How many rows and columns?
print("\nColumn types:")
print(df.dtypes)                   # What are the data types?
print("\nSummary:")
print(df.describe())               # Statistical summary
print("\nMissing values:")
print(df.isnull().sum())           # Check for missing data

# Step 3: Data cleaning
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")
df["hire_date"] = pd.to_datetime(df["hire_date"], errors="coerce")
df["dept"] = df["dept"].str.strip().str.title()

# Step 4: Basic analysis
print("\nEmployee count by department:")
print(df["dept"].value_counts())

print("\nAverage salary by department:")
print(df.groupby("dept")["salary"].mean().sort_values(ascending=False))

# Step 5: Save cleaned data
df.to_csv("employees_clean.csv", index=False)
print("\n✅ Cleaned data saved!")

Line-by-line explanation:

  • pd.read_csv() — loads the CSV into a DataFrame

  • df.shape — tells us rows × columns so we know the dataset size

  • df.dtypes — shows data types; crucial to check before analysis

  • df.describe() — quick stats: count, mean, std, min, max, quartiles

  • df.isnull().sum() — counts missing values per column

  • str.strip().str.lower() — cleans column names (remove spaces, lowercase)

  • pd.to_numeric(..., errors="coerce") — safely converts to number; puts NaN if conversion fails

  • pd.to_datetime(...) — converts string dates to proper datetime

  • value_counts() — counts how many employees per department

  • groupby().mean() — average salary per department


Example 2: Sales Dashboard Data Prep (Intermediate)

Python
import pandas as pd
import numpy as np

# Create realistic sample data
np.random.seed(42)
n = 500

df = pd.DataFrame({
    "order_id": range(1001, 1001 + n),
    "customer_id": np.random.randint(100, 200, n),
    "product": np.random.choice(["Laptop", "Phone", "Tablet", "Watch", "Earbuds"], n),
    "category": np.random.choice(["Electronics", "Accessories"], n),
    "quantity": np.random.randint(1, 10, n),
    "unit_price": np.random.uniform(50, 2000, n).round(2),
    "discount": np.random.choice([0, 0.05, 0.10, 0.20], n),
    "order_date": pd.date_range("2023-01-01", periods=n, freq="12H")
})

# ── Step 1: Calculate derived columns ─────────────────────────────────
df["gross_revenue"] = df["quantity"] * df["unit_price"]
df["discount_amount"] = df["gross_revenue"] * df["discount"]
df["net_revenue"] = df["gross_revenue"] - df["discount_amount"]

# ── Step 2: Extract date parts for time-series analysis ───────────────
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["month_name"] = df["order_date"].dt.strftime("%B")  # "January", etc.
df["quarter"] = df["order_date"].dt.quarter
df["day_of_week"] = df["order_date"].dt.day_name()

# ── Step 3: Quick KPIs ────────────────────────────────────────────────
print("=" * 50)
print("📊 SALES DASHBOARD SUMMARY")
print("=" * 50)
print(f"Total Orders:        {len(df):,}")
print(f"Total Gross Revenue: ${df['gross_revenue'].sum():,.2f}")
print(f"Total Discounts:     ${df['discount_amount'].sum():,.2f}")
print(f"Total Net Revenue:   ${df['net_revenue'].sum():,.2f}")
print(f"Average Order Value: ${df['net_revenue'].mean():,.2f}")
print(f"Unique Customers:    {df['customer_id'].nunique():,}")

# ── Step 4: Top products ──────────────────────────────────────────────
top_products = (
    df.groupby("product")
    .agg(
        orders=("order_id", "count"),
        total_revenue=("net_revenue", "sum"),
        avg_price=("unit_price", "mean")
    )
    .sort_values("total_revenue", ascending=False)
    .round(2)
)
print("\n🏆 Top Products:")
print(top_products.to_string())

# ── Step 5: Monthly revenue trend ────────────────────────────────────
monthly = (
    df.groupby(["year", "month"])["net_revenue"]
    .sum()
    .reset_index()
    .rename(columns={"net_revenue": "monthly_revenue"})
)
print("\n📅 Monthly Revenue:")
print(monthly.head(6).to_string(index=False))

# ── Step 6: Save to Excel with multiple sheets ────────────────────────
with pd.ExcelWriter("sales_dashboard.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Raw Data", index=False)
    top_products.to_excel(writer, sheet_name="Product Summary")
    monthly.to_excel(writer, sheet_name="Monthly Trend", index=False)

print("\n✅ Dashboard exported to sales_dashboard.xlsx")

9. ⚠️ Edge Cases and Common Errors

Error 1: SettingWithCopyWarning

Python
# ❌ Wrong — modifying a copy, not the original
subset = df[df["age"] > 30]
subset["salary"] = subset["salary"] * 1.1  # ⚠️ SettingWithCopyWarning!

# ✅ Correct — use .copy() explicitly
subset = df[df["age"] > 30].copy()
subset["salary"] = subset["salary"] * 1.1  # No warning

# ✅ Or use .loc on the original
df.loc[df["age"] > 30, "salary"] = df.loc[df["age"] > 30, "salary"] * 1.1

Error 2: Mixed Types in object Columns

Python
# ❌ This column has mixed types — causes silent bugs
s = pd.Series([1, "two", 3, None, 5.0])
print(s.dtype)   # object — pandas treats mixed as Python objects
print(s.sum())   # ERROR or wrong result

# ✅ Always clean before operating
s = pd.to_numeric(s, errors="coerce")  # converts "two" → NaN
print(s.sum())   # 9.0 (skips NaN)

Error 3: Index Misalignment

Python
df1 = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({"b": [10, 20, 30]}, index=[1, 2, 3])

# ❌ Arithmetic aligns on index — might not be what you want
result = df1["a"] + df2["b"]
print(result)
# 0    NaN  (index 0 not in df2)
# 1    12.0
# 2    23.0
# 3    NaN  (index 3 not in df1)

# ✅ Reset index first if you want positional alignment
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
result = df1["a"] + df2["b"]

Error 4: Chained Assignment

Python
# ❌ Chained indexing — unpredictable behavior (may silently modify a copy)
df[df["dept"] == "IT"]["dept"] = "Information Technology"

# ✅ Always use .loc or .iloc
df.loc[df["dept"] == "IT", "dept"] = "Information Technology"

Error 5: read_csv with Wrong Encoding

Python
# UnicodeDecodeError: 'utf-8' codec can't decode byte...
df = pd.read_csv("data.csv")  # might crash

# ✅ Try different encodings
df = pd.read_csv("data.csv", encoding="latin-1")   # common for European data
df = pd.read_csv("data.csv", encoding="utf-8-sig")  # for Excel-generated CSVs
df = pd.read_csv("data.csv", encoding="cp1252")     # Windows default

Error 6: Memory Error on Large Files

Python
# ❌ Crashes on 10GB file
df = pd.read_csv("huge_10gb_file.csv")

# ✅ Read in chunks
results = []
for chunk in pd.read_csv("huge_10gb_file.csv", chunksize=50000):
    # Do your processing per chunk
    summary = chunk.groupby("category")["revenue"].sum()
    results.append(summary)

final = pd.concat(results).groupby(level=0).sum()

10. 💡 Pro Developer Insights

Best Practices

  1. Never use .iterrows() — it's 100-1000x slower than vectorized operations

  2. Use .copy() when slicing — avoids SettingWithCopyWarning

  3. Set index on the column you filter/join most — speeds up lookups

  4. Use categorical dtype for low-cardinality string columns — saves 80%+ memory

  5. Prefer .loc over chained indexing — more explicit and safe

  6. Always check df.info() and df.isnull().sum() first — understand your data before touching it

  7. Use pd.read_parquet instead of CSV for production data pipelines

  8. Don't use inplace=True — it rarely saves memory, and the Pandas team discourages it

Python
# ❌ Anti-pattern: iterrows (very slow)
for index, row in df.iterrows():
    df.at[index, "tax"] = row["salary"] * 0.30

# ✅ Vectorized (fast)
df["tax"] = df["salary"] * 0.30

# ❌ Anti-pattern: growing a DataFrame row by row
df = pd.DataFrame(columns=["x", "y"])
for i in range(10000):
    df = pd.concat([df, pd.DataFrame([{"x": i, "y": i**2}])])  # copies everything each iteration

# ✅ Collect values first and build once — or build directly from arrays
x = np.arange(10000)
df = pd.DataFrame({"x": x, "y": x**2})  # fast

Performance Tips

Python
# Tip 1: Use query() for readable filtering
df.query("age > 30 and dept == 'IT'")  # instead of boolean indexing

# Tip 2: Use eval() for complex expressions
df.eval("net = salary - salary * 0.30", inplace=True)

# Tip 3: Downcasting numerics
df["age"] = pd.to_numeric(df["age"], downcast="integer")   # int64 → int8/16/32
df["score"] = pd.to_numeric(df["score"], downcast="float") # float64 → float32

# Tip 4: pipe() for readable chained operations
def clean_names(df):
    df.columns = df.columns.str.lower().str.replace(" ", "_")
    return df

def drop_duplicates_keep_last(df):
    return df.drop_duplicates(keep="last")

result = (df
    .pipe(clean_names)
    .pipe(drop_duplicates_keep_last)
    .dropna(subset=["salary"])
    .reset_index(drop=True)
)

11. 📊 Comparison: Pandas vs Alternatives

| Feature | Pandas | Polars | Dask | Spark | NumPy |
| --- | --- | --- | --- | --- | --- |
| Primary Use | Tabular data | Tabular data (fast) | Big data | Distributed big data | Numeric arrays |
| Speed | Good | Excellent | Good (parallel) | Excellent (cluster) | Very fast |
| Memory | RAM-limited | Efficient | Larger than RAM | Cluster | Efficient |
| Learning Curve | Moderate | Low-Medium | Moderate | High | Low |
| Ecosystem | Huge | Growing | Good | Huge | Huge |
| Best For | <10M rows | <100M rows | 10M-100B rows | Petabyte scale | Math operations |
| SQL Support | Via sqlalchemy | Native SQL | Partial | Full | No |

When to use Pandas:

  • ✅ Datasets that fit in RAM (< 10-20 GB)

  • ✅ Exploratory data analysis

  • ✅ Data cleaning and transformation pipelines

  • ✅ Integration with scikit-learn, seaborn, statsmodels

  • ✅ Business reporting and dashboards


12. 🎤 Interview Questions — Part 1

Basic Level

Q1: What is the difference between a Series and a DataFrame?

A Series is a 1D labeled array (like a single column). A DataFrame is a 2D labeled table made of multiple Series sharing the same index.
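A quick demonstration of that relationship (sample data made up):

```python
import pandas as pd

s = pd.Series([25, 30, 35], name="age")  # 1D: a single labeled column
df = pd.DataFrame({"age": s, "salary": [50000, 60000, 75000]})  # 2D: a table

# Pulling one column back out of the DataFrame gives a Series again
print(type(df["age"]))  # <class 'pandas.core.series.Series'>
```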

Q2: How do you check for missing values in a DataFrame?

Python
df.isnull().sum()          # count of NaN per column
df.isnull().any()          # True/False per column
df.isnull().values.any()   # True if any NaN anywhere

Q3: What's the difference between df.loc and df.iloc?

.loc uses label-based indexing (index names). .iloc uses integer position-based indexing (0, 1, 2...). We'll cover this in depth in Part 2.
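A tiny example of the distinction (made-up data):

```python
import pandas as pd

df = pd.DataFrame({"score": [85, 90, 78]}, index=["Alice", "Bob", "Charlie"])

print(df.loc["Bob", "score"])  # label-based lookup → 90
print(df.iloc[1, 0])           # position-based lookup (row 1, column 0) → 90
```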

Q4: How do you select multiple columns from a DataFrame?

Python
df[["col1", "col2", "col3"]]  # double brackets → DataFrame

Q5: How do you read only specific columns from a CSV?

Python
df = pd.read_csv("file.csv", usecols=["name", "salary", "dept"])

Intermediate Level

Q6: What is SettingWithCopyWarning and how do you fix it?

It happens when you try to modify a slice of a DataFrame. Fix: Use .copy() when creating the slice, or use .loc to modify the original.

Q7: What is the difference between read_csv with dtype parameter vs calling astype() after?

Using dtype in read_csv is faster and more memory efficient because it sets the type during reading (no conversion needed). astype() after reading first loads as one type then converts.
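Both routes end at the same dtypes; the difference is whether conversion happens during parsing or in a second pass. A sketch using an in-memory CSV:

```python
import io
import pandas as pd

csv = "id,score\n1,85\n2,90\n3,78\n"

# Types set while parsing — no intermediate representation to convert
df1 = pd.read_csv(io.StringIO(csv), dtype={"id": "int32", "score": "float32"})

# Parsed with default types first, then converted afterwards
df2 = pd.read_csv(io.StringIO(csv)).astype({"id": "int32", "score": "float32"})

print(df1.dtypes.equals(df2.dtypes))  # True — same result, different cost
```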

Q8: How do you efficiently read a CSV file that doesn't fit in memory?

Python
for chunk in pd.read_csv("huge.csv", chunksize=50000):
    process(chunk)

Advanced Level

Q9: How does Pandas handle index alignment in arithmetic operations?

Pandas automatically aligns on the index before performing arithmetic. If indices don't match, the result is NaN for mismatched positions. This prevents silent bugs from position-based operations.

Q10: What's the memory layout difference between object dtype and string dtype in Pandas 2.x?

object dtype stores Python string objects in a pointer-based array — high memory and slow. string dtype (new in Pandas 1.0+) uses Apache Arrow memory layout or Pandas nullable arrays — more efficient and consistent.
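A small sketch of the two dtypes side by side (the PyArrow-backed variant, `dtype="string[pyarrow]"`, additionally requires pyarrow to be installed):

```python
import pandas as pd

s_obj = pd.Series(["a", "b", None])                  # classic object dtype
s_str = pd.Series(["a", "b", None], dtype="string")  # nullable string dtype

print(s_obj.dtype)         # object
print(s_str.dtype)         # string
print(s_str.isna().sum())  # 1 — missing values become pd.NA, handled consistently
```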


13. ✅ Conclusion — Part 1

Here's what you learned in Part 1:

  • Pandas fundamentals — why it exists and what problems it solves

  • Series — 1D labeled arrays, all creation methods, key methods and alignment

  • DataFrames — 2D labeled tables, all creation methods, essential attributes

  • I/O Operations — reading and writing CSV, Excel, JSON, SQL, Parquet, and more

  • Essential operations — selecting, adding, removing columns/rows, type conversion

  • dtypes — memory optimization through smart type selection

  • Real-world examples — employee analysis, sales dashboard

  • Common errors — SettingWithCopyWarning, index misalignment, encoding issues

  • Best practices — avoid iterrows, use vectorized operations, use .loc


🔜 What's Coming in Part 2?

In Part 2: Data Manipulation, we'll cover:

  • Indexing and Slicing — .loc, .iloc, Boolean indexing, MultiIndex

  • Filtering — query(), complex conditions, isin(), between()

  • Handling Missing Data — dropna(), fillna(), interpolate()

  • GroupBy — split-apply-combine, agg(), transform(), filter()

  • Merge and Join — inner, outer, left, right joins like SQL

  • Reshaping — pivot(), pivot_table(), melt(), stack(), unstack()

  • String Operations — str.split(), str.extract(), regex in Pandas

  • Apply, Map, and Lambda — when and how to use each


💾 Save this blog. Share it with your team. You'll refer back to the I/O and dtype sections constantly in your day-to-day work.

📌 Next: Pandas Data Manipulation: The Complete Guide (Part 2 — Indexing, GroupBy, Merge & Reshape)
