Series Overview:
Part 1 (This Post): Fundamentals — Series, DataFrames, I/O, Basic Operations
Part 2: Data Manipulation — Indexing, Filtering, GroupBy, Merge/Join, Reshaping
Part 3: Advanced Pandas — Performance, Time Series, Real-World Pipelines, ML & Interview Questions
1. Introduction
What is Pandas?
Pandas is the most popular data manipulation and analysis library in Python. The name comes from "Panel Data" — a term used in econometrics for multidimensional structured datasets.
At its core, Pandas gives you two powerful data structures:
Series — a one-dimensional labeled array
DataFrame — a two-dimensional labeled table (like an Excel sheet or SQL table)
Think of Pandas as Excel + SQL + Python combined into one blazing-fast library.
Why is Pandas Important?
🚀 Speed — Built on top of NumPy, extremely fast for tabular data
🧹 Data Cleaning — Handle missing data, duplicates, wrong types effortlessly
🔗 Integration — Works with CSV, Excel, JSON, SQL, Parquet, APIs, and more
📊 Analysis — Grouping, aggregation, pivoting, filtering — all in a few lines
🤖 ML-Ready — Direct input to scikit-learn, TensorFlow, and PyTorch pipelines
Real-World Usage
| Industry | Use Case |
|---|---|
| Finance | Stock price analysis, portfolio management |
| Healthcare | Patient data cleaning, EHR analysis |
| E-commerce | Sales reports, customer segmentation |
| Data Engineering | ETL pipelines, data validation |
| Machine Learning | Feature engineering, dataset preparation |
Installing Pandas
pip install pandas
# With optional dependencies (recommended)
pip install pandas openpyxl xlrd sqlalchemy pyarrow
import pandas as pd
import numpy as np
print(pd.__version__) # e.g., 2.2.0
2. 🧱 Basic Concepts
2.1 The Pandas Data Model
Pandas is built on top of NumPy arrays. Every column in a DataFrame is essentially a NumPy array under the hood — which is why Pandas is so fast.
DataFrame
│
├── Column A (Series) → NumPy Array
├── Column B (Series) → NumPy Array
└── Column C (Series) → NumPy Array
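You can verify this claim directly: selecting a column out of a DataFrame gives a Series, and its backing store is a NumPy array (the column names below are arbitrary):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

col = df["A"]         # selecting a column gives a Series
arr = col.to_numpy()  # backed by a NumPy array

print(type(col).__name__)  # Series
print(type(arr).__name__)  # ndarray
```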
2.2 The Index — The Heart of Pandas
Every Pandas object has an Index — labels for rows (and columns). This is what makes Pandas unique compared to plain arrays.
import pandas as pd
# Default integer index (0, 1, 2...)
s = pd.Series([10, 20, 30])
print(s.index) # RangeIndex(start=0, stop=3, step=1)
# Custom string index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"]) # 20
3. 📊 The Series — Deep Dive
What is a Series?
A Series is a 1D array with labels. Think of it as a single column in a spreadsheet.
# Structure: pd.Series(data, index, dtype, name)
s = pd.Series(
    data=[100, 200, 300, 400],
    index=["Jan", "Feb", "Mar", "Apr"],
    name="Monthly Revenue"
)
print(s)
Output:
Jan    100
Feb    200
Mar    300
Apr    400
Name: Monthly Revenue, dtype: int64
Creating a Series — All Methods
import pandas as pd
import numpy as np
# 1. From a Python list
s1 = pd.Series([1, 2, 3, 4, 5])
# 2. From a NumPy array
s2 = pd.Series(np.array([10.5, 20.1, 30.8]))
# 3. From a Python dictionary (keys become index)
s3 = pd.Series({"Alice": 95, "Bob": 87, "Charlie": 92})
# 4. From a scalar (broadcasts to all index positions)
s4 = pd.Series(42, index=["a", "b", "c", "d"])
print(s4)
# a 42
# b 42
# c 42
# d 42
# 5. With explicit dtype
s5 = pd.Series([1, 2, 3], dtype="float64")
# 6. With name
s6 = pd.Series([100, 200], name="sales")
# 7. Inspecting and exporting a Series
data = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])
data
# a    1
# b    2
# c    3
# d    4
# e    5
# dtype: int64
data.index      # Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
data.values     # array([1, 2, 3, 4, 5])
data.to_dict()  # {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
data.to_numpy() # array([1, 2, 3, 4, 5])
data.to_list()  # [1, 2, 3, 4, 5]
Essential Series Attributes
s = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])
s.values # numpy array: [10 20 30 40 50]
s.index # Index(['a', 'b', 'c', 'd', 'e'])
s.dtype # dtype('int64')
s.name # None (or set name)
s.shape # (5,)
s.size # 5
s.ndim # 1
s.nbytes # 40 (memory in bytes)
s.is_unique # True (all values distinct?)
s.hasnans # False
Essential Series Methods
s = pd.Series([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
# --- Statistics ---
s.sum() # 39
s.mean() # 3.9
s.median() # 3.5
s.std() # 2.47 (sample std, ddof=1)
s.var() # 6.10
s.min() # 1
s.max() # 9
s.count() # 10 (non-NaN count)
s.describe() # summary stats
# --- Sorting ---
s.sort_values() # sort by values
s.sort_values(ascending=False)
s.sort_index() # sort by index
# --- Value Counts ---
s.value_counts() # frequency of each value
s.unique() # array of unique values
s.nunique() # count of unique values
# --- Math Operations ---
s + 10 # add 10 to all
s * 2 # multiply all by 2
s ** 2 # square all
np.sqrt(s) # square root (NumPy works directly)
# --- Type Conversion ---
s.astype(float)
s.astype(str)
# --- Missing Data ---
s.isnull() # boolean mask of NaN
s.notnull()
s.dropna() # remove NaN
s.fillna(0) # fill NaN with 0
# --- Apply custom function ---
s.apply(lambda x: x * 2 if x > 3 else x)
# --- String methods (if string Series) ---
names = pd.Series(["Alice", "Bob", "Charlie"])
names.str.lower()
names.str.upper()
names.str.len()
names.str.contains("li")
names.str.replace("a", "@")
Series Alignment (Super Important!)
When you do math between two Series, Pandas aligns by index — not by position.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
result = s1 + s2
print(result)
# a NaN (a not in s2)
# b 12.0 (2 + 10)
# c 23.0 (3 + 20)
# d NaN (d not in s1)
This is called index alignment and it's one of Pandas' most powerful features.
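If NaN for unmatched labels isn't what you want, the operator methods (`.add()`, `.sub()`, and friends) accept a `fill_value` that stands in for missing labels. A small sketch:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Missing labels are treated as 0 instead of producing NaN
result = s1.add(s2, fill_value=0)
print(result)
# a     1.0
# b    12.0
# c    23.0
# d    30.0
```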
4. 📋 The DataFrame — Deep Dive
What is a DataFrame?
A DataFrame is a 2D labeled data structure — like a spreadsheet or SQL table. It has rows and columns, each with labels (index and column names).
      name  age  salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   75000
Creating a DataFrame — All Methods
import pandas as pd
import numpy as np
# ── Method 1: From a dictionary of lists ──────────────────────────────
df1 = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "salary": [50000, 60000, 75000]
})
# ── Method 2: From a list of dictionaries (most common in APIs) ────────
df2 = pd.DataFrame([
    {"name": "Alice", "age": 25, "salary": 50000},
    {"name": "Bob", "age": 30, "salary": 60000},
    {"name": "Charlie", "age": 35, "salary": 75000}
])
# ── Method 3: From a NumPy array ──────────────────────────────────────
df3 = pd.DataFrame(
    np.random.randint(0, 100, size=(4, 3)),
    columns=["A", "B", "C"]
)
# ── Method 4: From a list of lists ────────────────────────────────────
df4 = pd.DataFrame(
    [["Alice", 25], ["Bob", 30]],
    columns=["name", "age"]
)
# ── Method 5: From a Series ───────────────────────────────────────────
s = pd.Series([1, 2, 3], name="values")
df5 = pd.DataFrame(s)
# ── Method 6: Empty DataFrame with schema ─────────────────────────────
df6 = pd.DataFrame(columns=["id", "name", "score"])
# ── Method 7: With custom index ───────────────────────────────────────
df7 = pd.DataFrame(
    {"score": [85, 90, 78]},
    index=["Alice", "Bob", "Charlie"]
)
Essential DataFrame Attributes
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "salary": [50000, 60000, 75000, 55000],
    "dept": ["HR", "IT", "Finance", "IT"]
})
df.shape # (4, 4) → rows, columns
df.ndim # 2
df.size # 16 (total elements)
df.dtypes # dtype of each column
df.columns # Index(['name', 'age', 'salary', 'dept'])
df.index # RangeIndex(start=0, stop=4, step=1)
df.values # NumPy array of all data
df.axes # [row index, column index]
df.memory_usage() # memory per column in bytes
First Look at Your Data
df.head() # first 5 rows (default)
df.head(10) # first 10 rows
df.tail() # last 5 rows
df.tail(3) # last 3 rows
df.sample(3) # 3 random rows
df.sample(frac=0.1) # 10% random sample
df.info() # column names, dtypes, non-null counts, memory
df.describe() # statistical summary (numeric cols)
df.describe(include="all") # include object/categorical cols
df.describe(include="object") # string/categorical only
df.info() Output Explained
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 4 non-null object
1 age 4 non-null int64
2 salary 4 non-null int64
3 dept 4 non-null object
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes
This tells you:
Total rows and columns
Which columns have missing values
Data types (important for performance and operations)
Memory usage
5. 📂 Reading and Writing Data (I/O Operations)
This is one of the most important practical skills. You'll spend a surprisingly large share of your Pandas time just reading and writing files.
5.1 CSV Files
# ── Reading ────────────────────────────────────────────────────────────
df = pd.read_csv("data.csv")
# Common parameters:
df = pd.read_csv(
    "data.csv",
    sep=",",            # delimiter (use sep="\t" for TSV)
    header=0,           # row to use as column names (0 = first row)
    index_col="id",     # set column as index
    usecols=["name", "age", "salary"],      # read only specific columns
    nrows=1000,         # read only first 1000 rows
    skiprows=2,         # skip first 2 rows
    dtype={"age": int, "salary": float},    # specify dtypes
    na_values=["N/A", "NULL", "none", ""],  # treat as NaN
    parse_dates=["date_col"],               # auto-parse date columns
    encoding="utf-8",   # encoding (try "latin-1" if utf-8 fails)
    low_memory=False,   # avoid mixed-type warnings
    # chunksize=10000,  # careful: with chunksize, read_csv returns an iterator,
    #                   # not a DataFrame — see "Chunked Reading" below
)
# ── Writing ────────────────────────────────────────────────────────────
df.to_csv("output.csv")
df.to_csv(
    "output.csv",
    index=False,        # don't write row numbers
    sep=",",
    encoding="utf-8",
    columns=["name", "salary"],  # write only specific columns
    header=True,        # write column names
    mode="a",           # append instead of overwrite
    na_rep="NULL",      # represent NaN as "NULL"
    float_format="%.2f",  # format floats to 2 decimal places
    date_format="%Y-%m-%d",
)
Chunked Reading (for Big Files)
# When a CSV is too large to fit in memory, read in chunks
chunk_list = []
for chunk in pd.read_csv("huge_file.csv", chunksize=10000):
    # Process each chunk
    filtered = chunk[chunk["age"] > 30]
    chunk_list.append(filtered)
df = pd.concat(chunk_list, ignore_index=True)
print(f"Final shape: {df.shape}")
5.2 Excel Files
# ── Reading ────────────────────────────────────────────────────────────
# pip install openpyxl (for .xlsx) xlrd (for .xls)
df = pd.read_excel("data.xlsx")
df = pd.read_excel(
    "data.xlsx",
    sheet_name="Sheet1",  # specify sheet name or index (0-based)
    header=0,
    usecols="A:D",        # Excel column range
    skiprows=2,
    dtype={"id": str},
    na_values=["", "N/A"],
)
# Read all sheets at once
all_sheets = pd.read_excel("data.xlsx", sheet_name=None)
# Returns dict: {"Sheet1": df1, "Sheet2": df2, ...}
# ── Writing ────────────────────────────────────────────────────────────
df.to_excel("output.xlsx", index=False, sheet_name="Results")
# Multiple sheets in one file
with pd.ExcelWriter("multi_sheet.xlsx", engine="openpyxl") as writer:
    df1.to_excel(writer, sheet_name="Sales", index=False)
    df2.to_excel(writer, sheet_name="HR", index=False)
    df3.to_excel(writer, sheet_name="Finance", index=False)
5.3 JSON Files
# ── Reading ────────────────────────────────────────────────────────────
# records format: [{"col": val}, {"col": val}, ...]
df = pd.read_json("data.json")
df = pd.read_json(
    "data.json",
    orient="records",  # "records", "split", "index", "columns", "values"
    lines=True,        # for JSON Lines format (one JSON object per line)
    dtype={"id": str},
    encoding="utf-8",
)
# Reading from a URL (API response)
import json, urllib.request
url = "https://jsonplaceholder.typicode.com/users"
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read())
df = pd.DataFrame(data)
# ── Writing ────────────────────────────────────────────────────────────
df.to_json("output.json", orient="records", indent=2)
df.to_json("output.json", orient="records", lines=True) # JSON Lines
5.4 SQL Databases
# pip install sqlalchemy
from sqlalchemy import create_engine
# ── Connect ────────────────────────────────────────────────────────────
# SQLite
engine = create_engine("sqlite:///mydb.db")
# PostgreSQL
engine = create_engine("postgresql+psycopg2://user:password@host:5432/dbname")
# MySQL
engine = create_engine("mysql+pymysql://user:password@host:3306/dbname")
# SQL Server
engine = create_engine("mssql+pyodbc://user:password@host/dbname?driver=ODBC+Driver+17+for+SQL+Server")
# ── Reading ────────────────────────────────────────────────────────────
# Read an entire table
df = pd.read_sql_table("employees", con=engine)
# Read with SQL query
df = pd.read_sql_query(
    "SELECT name, age, salary FROM employees WHERE dept = 'IT'",
    con=engine
)
# Universal (works for both table name and SQL string)
df = pd.read_sql("SELECT * FROM employees LIMIT 100", con=engine)
# With parameters (safe — avoids SQL injection)
from sqlalchemy import text
df = pd.read_sql(
    text("SELECT * FROM employees WHERE dept = :dept AND age > :age"),
    con=engine,
    params={"dept": "IT", "age": 30}
)
# Note: with SQLAlchemy 2.0, named :params require wrapping the query in text()
# ── Writing ────────────────────────────────────────────────────────────
df.to_sql(
    "employees_backup",
    con=engine,
    if_exists="replace",  # "fail", "replace", "append"
    index=False,
    chunksize=1000,       # write in chunks
    method="multi",       # use multi-row INSERT (faster)
)
5.5 Parquet Files (Modern Big Data Format)
# pip install pyarrow (or fastparquet)
# ── Writing ────────────────────────────────────────────────────────────
df.to_parquet("data.parquet", index=False, engine="pyarrow", compression="snappy")
# ── Reading ────────────────────────────────────────────────────────────
df = pd.read_parquet("data.parquet", engine="pyarrow")
# Read specific columns only (much faster for wide tables)
df = pd.read_parquet("data.parquet", columns=["name", "salary"])
💡 Parquet vs CSV: Parquet is 5-10x smaller and 10-100x faster to read. Always use Parquet for large datasets in production.
5.6 Other Formats
# ── Clipboard ─────────────────────────────────────────────────────────
df = pd.read_clipboard() # paste from Excel/browser table
df.to_clipboard(index=False)
# ── HTML (scraping tables) ─────────────────────────────────────────────
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP")
df = tables[0] # first table on the page
# ── Pickle (Python objects, preserves dtypes perfectly) ───────────────
df.to_pickle("data.pkl")
df = pd.read_pickle("data.pkl")
# ── HDF5 (for very large datasets) ────────────────────────────────────
df.to_hdf("data.h5", key="employees", mode="w")
df = pd.read_hdf("data.h5", key="employees")
I/O Format Comparison
| Format | Speed | File Size | Human Readable | Best For |
|---|---|---|---|---|
| CSV | Slow | Large | ✅ Yes | Small files, sharing |
| JSON | Medium | Large | ✅ Yes | APIs, configs |
| Parquet | Fast | Small | ❌ No | Big data, production |
| Excel | Slow | Medium | ✅ Yes | Business reports |
| Pickle | Fast | Medium | ❌ No | Python-only caching |
| HDF5 | Very Fast | Variable | ❌ No | Scientific data |
| SQL | Varies | Varies | Via query | Transactional data |
6. 🔧 Essential DataFrame Operations
6.1 Selecting Columns
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "salary": [50000, 60000, 75000],
    "dept": ["HR", "IT", "Finance"]
})
# ── Single column → returns Series ────────────────────────────────────
df["name"] # bracket notation (always works)
df.name # attribute notation (AVOID — breaks if col name has spaces/conflicts)
# ── Multiple columns → returns DataFrame ──────────────────────────────
df[["name", "salary"]]
df[["name", "age", "dept"]]
# ── All columns except one ────────────────────────────────────────────
df.drop(columns=["salary"])
# ── Select by dtype ───────────────────────────────────────────────────
df.select_dtypes(include="number") # numeric columns only
df.select_dtypes(include="object") # string columns only
df.select_dtypes(exclude="object") # exclude strings
6.2 Adding and Removing Columns
# ── Adding new columns ────────────────────────────────────────────────
df["annual_bonus"] = df["salary"] * 0.10
df["senior"] = df["age"] >= 30 # boolean column
df["full_info"] = df["name"] + " (" + df["dept"] + ")"
# Using assign() — returns a new df, doesn't modify original
df_new = df.assign(
    tax=df["salary"] * 0.30,
    net_salary=df["salary"] * 0.70,
    age_group=lambda x: pd.cut(x["age"], bins=[0, 30, 50, 100], labels=["Young", "Mid", "Senior"])
)
# Insert at specific position
df.insert(1, "employee_id", range(1, len(df) + 1))
# ── Removing columns ──────────────────────────────────────────────────
df.drop("annual_bonus", axis=1, inplace=True) # single column
df.drop(["salary", "dept"], axis=1, inplace=True) # multiple columns
df.drop(columns=["salary", "dept"]) # cleaner syntax
# ── Rename columns ────────────────────────────────────────────────────
df.rename(columns={"name": "full_name", "age": "years"})
# Make all column names lowercase and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(" ", "_")
6.3 Adding and Removing Rows
# ── Adding rows ───────────────────────────────────────────────────────
new_row = pd.DataFrame([{"name": "Diana", "age": 28, "salary": 55000, "dept": "Marketing"}])
df = pd.concat([df, new_row], ignore_index=True)
# ── Removing rows ─────────────────────────────────────────────────────
df.drop(0, axis=0) # drop row with index 0
df.drop([0, 2, 4]) # drop multiple rows by index label
df.drop(df.index[df["age"] < 25]) # drop rows where condition is True
df[df["age"] >= 25] # filter (preferred over dropping)
# ── Reset index after drops ───────────────────────────────────────────
df = df.drop(0).reset_index(drop=True)
6.4 Data Types and Conversion
df = pd.DataFrame({
    "id": ["1", "2", "3"],
    "salary": ["50000", "60000", "75000"],
    "hire_date": ["2020-01-15", "2019-03-22", "2021-07-01"],
    "active": ["True", "False", "True"]
})
# Check dtypes
print(df.dtypes)
# ── Convert types ─────────────────────────────────────────────────────
df["id"] = df["id"].astype(int)
df["salary"] = df["salary"].astype(float)
df["hire_date"] = pd.to_datetime(df["hire_date"])
df["active"] = df["active"].map({"True": True, "False": False})
# Safe numeric conversion (won't crash on errors)
df["salary"] = pd.to_numeric(df["salary"], errors="coerce") # NaN on failure
# errors="ignore" (keep original on failure) is deprecated since pandas 2.1 — prefer "coerce"
# Convert to categorical (saves memory for low-cardinality string cols)
df["dept"] = df["dept"].astype("category") # assumes the df has a 'dept' column
# Full conversion with convert_dtypes() — auto-detects best types
df = df.convert_dtypes()
6.5 Viewing, Sorting, Reindexing
# ── Sorting ───────────────────────────────────────────────────────────
df.sort_values("salary") # ascending
df.sort_values("salary", ascending=False) # descending
df.sort_values(["dept", "salary"], ascending=[True, False]) # multi-col sort
df.sort_index() # sort by index
# ── Reindexing ────────────────────────────────────────────────────────
df.reset_index(drop=True) # reset to 0,1,2...
df.set_index("name") # use 'name' column as index ('name' is dropped from columns by default)
df.set_index("name", drop=False) # keep 'name' as a regular column too
# ── Transposing ───────────────────────────────────────────────────────
df.T # swap rows and columns
7. 🔍 Understanding dtypes in Depth
One of the most common sources of bugs and slowness in Pandas is wrong data types.
All Pandas dtypes
| dtype | Description | Example |
|---|---|---|
| int64 | 64-bit integer | 1, 42, -7 |
| int32 | 32-bit integer (less memory) | — |
| float64 | 64-bit float | 3.14, 2.718 |
| float32 | 32-bit float (less memory) | — |
| object | Python objects (strings or mixed) | "hello" |
| string | Pandas string type (newer) | "hello" |
| bool | Boolean | True/False |
| datetime64[ns] | Timestamps | 2024-01-01 |
| timedelta64[ns] | Time differences | 3 days |
| category | Categorical (low cardinality) | "M"/"F" |
| complex128 | Complex numbers | 3+4j |
# Memory optimization example
df = pd.DataFrame({
    "user_id": np.random.randint(1, 100000, 1_000_000),
    "score": np.random.uniform(0, 100, 1_000_000),
    "grade": np.random.choice(["A", "B", "C", "D"], 1_000_000)
})
print(f"Before: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
# Optimize types
df["user_id"] = df["user_id"].astype("int32") # int64 → int32 = 50% less
df["score"] = df["score"].astype("float32") # float64 → float32 = 50% less
df["grade"] = df["grade"].astype("category") # object → category = huge savings
print(f"After: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
# Exact numbers vary by platform, but expect the total to drop by well over half —
# the object → category conversion alone shrinks the string column dramatically
8. 🛠️ Practical Examples
Example 1: Analyzing Employee Data (Beginner)
import pandas as pd
# Step 1: Load data
df = pd.read_csv("employees.csv")
# Step 2: First look
print("Shape:", df.shape) # How many rows and columns?
print("\nColumn types:")
print(df.dtypes) # What are the data types?
print("\nSummary:")
print(df.describe()) # Statistical summary
print("\nMissing values:")
print(df.isnull().sum()) # Check for missing data
# Step 3: Data cleaning
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")
df["hire_date"] = pd.to_datetime(df["hire_date"], errors="coerce")
df["dept"] = df["dept"].str.strip().str.title()
# Step 4: Basic analysis
print("\nEmployee count by department:")
print(df["dept"].value_counts())
print("\nAverage salary by department:")
print(df.groupby("dept")["salary"].mean().sort_values(ascending=False))
# Step 5: Save cleaned data
df.to_csv("employees_clean.csv", index=False)
print("\n✅ Cleaned data saved!")
Line-by-line explanation:
pd.read_csv() — loads the CSV into a DataFrame
df.shape — tells us rows × columns so we know the dataset size
df.dtypes — shows data types; crucial to check before analysis
df.describe() — quick stats: count, mean, std, min, max, quartiles
df.isnull().sum() — counts missing values per column
str.strip().str.lower() — cleans column names (remove spaces, lowercase)
pd.to_numeric(..., errors="coerce") — safely converts to number; puts NaN if conversion fails
pd.to_datetime(...) — converts string dates to proper datetime
value_counts() — counts how many employees per department
groupby().mean() — average salary per department
Example 2: Sales Dashboard Data Prep (Intermediate)
import pandas as pd
import numpy as np
# Create realistic sample data
np.random.seed(42)
n = 500
df = pd.DataFrame({
    "order_id": range(1001, 1001 + n),
    "customer_id": np.random.randint(100, 200, n),
    "product": np.random.choice(["Laptop", "Phone", "Tablet", "Watch", "Earbuds"], n),
    "category": np.random.choice(["Electronics", "Accessories"], n),
    "quantity": np.random.randint(1, 10, n),
    "unit_price": np.random.uniform(50, 2000, n).round(2),
    "discount": np.random.choice([0, 0.05, 0.10, 0.20], n),
    "order_date": pd.date_range("2023-01-01", periods=n, freq="12h")  # lowercase "h" — "12H" is deprecated
})
# ── Step 1: Calculate derived columns ─────────────────────────────────
df["gross_revenue"] = df["quantity"] * df["unit_price"]
df["discount_amount"] = df["gross_revenue"] * df["discount"]
df["net_revenue"] = df["gross_revenue"] - df["discount_amount"]
# ── Step 2: Extract date parts for time-series analysis ───────────────
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["month_name"] = df["order_date"].dt.strftime("%B") # "January", etc.
df["quarter"] = df["order_date"].dt.quarter
df["day_of_week"] = df["order_date"].dt.day_name()
# ── Step 3: Quick KPIs ────────────────────────────────────────────────
print("=" * 50)
print("📊 SALES DASHBOARD SUMMARY")
print("=" * 50)
print(f"Total Orders: {len(df):,}")
print(f"Total Gross Revenue: ${df['gross_revenue'].sum():,.2f}")
print(f"Total Discounts: ${df['discount_amount'].sum():,.2f}")
print(f"Total Net Revenue: ${df['net_revenue'].sum():,.2f}")
print(f"Average Order Value: ${df['net_revenue'].mean():,.2f}")
print(f"Unique Customers: {df['customer_id'].nunique():,}")
# ── Step 4: Top products ──────────────────────────────────────────────
top_products = (
    df.groupby("product")
    .agg(
        orders=("order_id", "count"),
        total_revenue=("net_revenue", "sum"),
        avg_price=("unit_price", "mean")
    )
    .sort_values("total_revenue", ascending=False)
    .round(2)
)
print("\n🏆 Top Products:")
print(top_products.to_string())
# ── Step 5: Monthly revenue trend ────────────────────────────────────
monthly = (
    df.groupby(["year", "month"])["net_revenue"]
    .sum()
    .reset_index()
    .rename(columns={"net_revenue": "monthly_revenue"})
)
print("\n📅 Monthly Revenue:")
print(monthly.head(6).to_string(index=False))
# ── Step 6: Save to Excel with multiple sheets ────────────────────────
with pd.ExcelWriter("sales_dashboard.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Raw Data", index=False)
    top_products.to_excel(writer, sheet_name="Product Summary")
    monthly.to_excel(writer, sheet_name="Monthly Trend", index=False)
print("\n✅ Dashboard exported to sales_dashboard.xlsx")
9. ⚠️ Edge Cases and Common Errors
Error 1: SettingWithCopyWarning
# ❌ Wrong — modifying a copy, not the original
subset = df[df["age"] > 30]
subset["salary"] = subset["salary"] * 1.1 # ⚠️ SettingWithCopyWarning!
# ✅ Correct — use .copy() explicitly
subset = df[df["age"] > 30].copy()
subset["salary"] = subset["salary"] * 1.1 # No warning
# ✅ Or use .loc on the original
df.loc[df["age"] > 30, "salary"] = df.loc[df["age"] > 30, "salary"] * 1.1
Error 2: Mixed Types in object Columns
# ❌ This column has mixed types — causes silent bugs
s = pd.Series([1, "two", 3, None, 5.0])
print(s.dtype) # object — pandas treats mixed as Python objects
print(s.sum()) # ERROR or wrong result
# ✅ Always clean before operating
s = pd.to_numeric(s, errors="coerce") # converts "two" → NaN
print(s.sum()) # 9.0 (skips NaN)
Error 3: Index Misalignment
df1 = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({"b": [10, 20, 30]}, index=[1, 2, 3])
# ❌ Arithmetic aligns on index — might not be what you want
result = df1["a"] + df2["b"]
print(result)
# 0 NaN (index 0 not in df2)
# 1 12.0
# 2 23.0
# 3 NaN (index 3 not in df1)
# ✅ Reset index first if you want positional alignment
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
result = df1["a"] + df2["b"]
Error 4: Chained Assignment
# ❌ Chained indexing — unpredictable behavior
df["dept"]["IT"] = "Information Technology"
# ✅ Always use .loc or .iloc
df.loc[df["dept"] == "IT", "dept"] = "Information Technology"
Error 5: read_csv with Wrong Encoding
# UnicodeDecodeError: 'utf-8' codec can't decode byte...
df = pd.read_csv("data.csv") # might crash
# ✅ Try different encodings
df = pd.read_csv("data.csv", encoding="latin-1") # common for European data
df = pd.read_csv("data.csv", encoding="utf-8-sig") # for Excel-generated CSVs
df = pd.read_csv("data.csv", encoding="cp1252") # Windows default
Error 6: Memory Error on Large Files
# ❌ Crashes on 10GB file
df = pd.read_csv("huge_10gb_file.csv")
# ✅ Read in chunks
results = []
for chunk in pd.read_csv("huge_10gb_file.csv", chunksize=50000):
    # Do your processing per chunk
    summary = chunk.groupby("category")["revenue"].sum()
    results.append(summary)
final = pd.concat(results).groupby(level=0).sum()
10. 💡 Pro Developer Insights
Best Practices
Never use .iterrows() — it's 100-1000x slower than vectorized operations
Use .copy() when slicing — avoids SettingWithCopyWarning
Set index on the column you filter/join most — speeds up lookups
Use categorical dtype for low-cardinality string columns — saves 80%+ memory
Prefer .loc over chained indexing — more explicit and safe
Always check df.info() and df.isnull().sum() first — understand your data before touching it
Use pd.read_parquet instead of CSV for production data pipelines
Don't use inplace=True — it doesn't save memory and is being deprecated
# ❌ Anti-pattern: iterrows (very slow)
for index, row in df.iterrows():
    df.at[index, "tax"] = row["salary"] * 0.30
# ✅ Vectorized (fast)
df["tax"] = df["salary"] * 0.30
# ❌ Anti-pattern: building DataFrames with loops
rows = []
for i in range(10000):
    rows.append({"x": i, "y": i**2})
df = pd.DataFrame(rows) # slow
# ✅ Build from arrays
x = np.arange(10000)
df = pd.DataFrame({"x": x, "y": x**2}) # fast
Performance Tips
# Tip 1: Use query() for readable filtering
df.query("age > 30 and dept == 'IT'") # instead of boolean indexing
# Tip 2: Use eval() for complex expressions
df.eval("net = salary - salary * 0.30", inplace=True)
# Tip 3: Downcasting numerics
df["age"] = pd.to_numeric(df["age"], downcast="integer") # int64 → int8/16/32
df["score"] = pd.to_numeric(df["score"], downcast="float") # float64 → float32
# Tip 4: pipe() for readable chained operations
def clean_names(df):
    df.columns = df.columns.str.lower().str.replace(" ", "_")
    return df

def drop_duplicates_keep_last(df):
    return df.drop_duplicates(keep="last")

result = (df
    .pipe(clean_names)
    .pipe(drop_duplicates_keep_last)
    .dropna(subset=["salary"])
    .reset_index(drop=True)
)
11. 📊 Comparison: Pandas vs Alternatives
| Feature | Pandas | Polars | Dask | Spark | NumPy |
|---|---|---|---|---|---|
| Primary Use | Tabular data | Tabular data (fast) | Big data | Distributed big data | Numeric arrays |
| Speed | Good | Excellent | Good (parallel) | Excellent (cluster) | Very fast |
| Memory | RAM-limited | Efficient | Larger than RAM | Cluster | Efficient |
| Learning Curve | Moderate | Low-Medium | Moderate | High | Low |
| Ecosystem | Huge | Growing | Good | Huge | Huge |
| Best For | <10M rows | <100M rows | 10M-100B rows | Petabyte scale | Math operations |
| SQL Support | Via sqlalchemy | Native SQL | Partial | Full | No |
When to use Pandas:
✅ Datasets that fit in RAM (< 10-20 GB)
✅ Exploratory data analysis
✅ Data cleaning and transformation pipelines
✅ Integration with scikit-learn, seaborn, statsmodels
✅ Business reporting and dashboards
12. 🎤 Interview Questions — Part 1
Basic Level
Q1: What is the difference between a Series and a DataFrame?
A Series is a 1D labeled array (like a single column). A DataFrame is a 2D labeled table made of multiple Series sharing the same index.
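You can see the relationship directly: each column of a DataFrame is a Series sharing the frame's index.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
col = df["a"]

print(type(df).__name__)           # DataFrame
print(type(col).__name__)          # Series
print(df.index.equals(col.index))  # True — the column shares the frame's index
```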
Q2: How do you check for missing values in a DataFrame?
df.isnull().sum() # count of NaN per column
df.isnull().any() # True/False per column
df.isnull().values.any() # True if any NaN anywhere
Q3: What's the difference between df.loc and df.iloc?
.loc uses label-based indexing (index labels). .iloc uses integer position-based indexing (0, 1, 2, ...). We'll cover this in depth in Part 2.
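A minimal preview of the difference (sample data is made up):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["x", "y", "z"])
print(s.loc["y"])   # 20 — by label
print(s.iloc[1])    # 20 — by position

# With a non-default integer index the distinction really matters
df = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 20, 30])
print(df.loc[10, "a"])  # 1 — the row labelled 10
print(df.iloc[0]["a"])  # 1 — the first row by position
```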
Q4: How do you select multiple columns from a DataFrame?
df[["col1", "col2", "col3"]] # double brackets → DataFrame
Q5: How do you read only specific columns from a CSV?
df = pd.read_csv("file.csv", usecols=["name", "salary", "dept"])
Intermediate Level
Q6: What is SettingWithCopyWarning and how do you fix it?
It happens when you try to modify a slice of a DataFrame. Fix: use .copy() when creating the slice, or use .loc to modify the original.
Q7: What is the difference between read_csv with dtype parameter vs calling astype() after?
Using dtype in read_csv is faster and more memory efficient because the type is set while parsing (no conversion pass needed). Calling astype() after reading first loads the column as an inferred type and then converts it.
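Both approaches side by side, using an in-memory CSV so the snippet is self-contained (the column names are invented for illustration):

```python
import io
import pandas as pd

csv_data = "id,salary\n1,50000\n2,60000\n"

# dtype set while parsing — no intermediate representation
df_a = pd.read_csv(io.StringIO(csv_data), dtype={"salary": "float64"})

# parsed first (inferred as int64 here), then converted in a second pass
df_b = pd.read_csv(io.StringIO(csv_data))
df_b["salary"] = df_b["salary"].astype("float64")

print(df_a["salary"].dtype)  # float64
print(df_a.equals(df_b))     # True — same result, different cost
```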
Q8: How do you efficiently read a CSV file that doesn't fit in memory?
for chunk in pd.read_csv("huge.csv", chunksize=50000):
    process(chunk)
Advanced Level
Q9: How does Pandas handle index alignment in arithmetic operations?
Pandas automatically aligns on the index before performing arithmetic. If indices don't match, the result is NaN for mismatched positions. This prevents silent bugs from position-based operations.
Q10: What's the memory layout difference between object dtype and string dtype in Pandas 2.x?
object dtype stores Python string objects in a pointer-based array — high memory use and slow. string dtype (introduced in Pandas 1.0) uses Pandas nullable arrays or, when backed by pyarrow, the Apache Arrow memory layout — more efficient and consistent.
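A quick way to see the two dtypes side by side (the big memory wins come from the pyarrow-backed variant, which needs pyarrow installed; here we just compare the dtype labels):

```python
import pandas as pd

words = ["alpha", "beta", "gamma"] * 10_000

s_obj = pd.Series(words)                  # classic object dtype
s_str = pd.Series(words, dtype="string")  # pandas nullable string dtype

print(s_obj.dtype)  # object
print(s_str.dtype)  # string
```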
13. ✅ Conclusion — Part 1
Here's what you learned in Part 1:
Pandas fundamentals — why it exists and what problems it solves
Series — 1D labeled arrays, all creation methods, key methods and alignment
DataFrames — 2D labeled tables, all creation methods, essential attributes
I/O Operations — reading and writing CSV, Excel, JSON, SQL, Parquet, and more
Essential operations — selecting, adding, removing columns/rows, type conversion
dtypes — memory optimization through smart type selection
Real-world examples — employee analysis, sales dashboard
Common errors — SettingWithCopyWarning, index misalignment, encoding issues
Best practices — avoid iterrows, use vectorized operations, prefer .loc
🔜 What's Coming in Part 2?
In Part 2: Data Manipulation, we'll cover:
Indexing and Slicing — .loc, .iloc, Boolean indexing, MultiIndex
Filtering — query(), complex conditions, isin(), between()
Handling Missing Data — dropna(), fillna(), interpolate()
GroupBy — split-apply-combine, agg(), transform(), filter()
Merge and Join — inner, outer, left, right joins like SQL
Reshaping — pivot(), pivot_table(), melt(), stack(), unstack()
String Operations — str.split(), str.extract(), regex in Pandas
Apply, Map, and Lambda — when and how to use each
💾 Save this blog. Share it with your team. You'll refer back to the I/O and dtype sections constantly in your day-to-day work.
📌 Next: Pandas Data Manipulation: The Complete Guide (Part 2 — Indexing, GroupBy, Merge & Reshape)