
Pandas for Python Developers: The Complete Guide (Part 1 — Fundamentals)


Series Overview:

  • Part 1 (This Post): Fundamentals — Series, DataFrames, I/O, Basic Operations

  • Part 2: Data Manipulation — Indexing, Filtering, GroupBy, Merge/Join, Reshaping

  • Part 3: Advanced Pandas — Performance, Time Series, Real-World Pipelines, ML & Interview Questions


1. Introduction

What is Pandas?

Pandas is the most popular data manipulation and analysis library in Python. The name comes from "Panel Data" — a term used in econometrics for multidimensional structured datasets.

At its core, Pandas gives you two powerful data structures:

  • Series — a one-dimensional labeled array

  • DataFrame — a two-dimensional labeled table (like an Excel sheet or SQL table)

Think of Pandas as Excel + SQL + Python combined into one blazing-fast library.
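A minimal first taste of both structures (the drink data here is made up for illustration):

```python
import pandas as pd

# A Series: one labeled column of data
prices = pd.Series([3.50, 2.25, 4.00], index=["latte", "espresso", "mocha"])

# A DataFrame: a labeled table — several Series sharing one index
drinks = pd.DataFrame({"price": prices, "calories": [190, 5, 360]})

print(prices["latte"])                  # 3.5
print(drinks.loc["mocha", "calories"])  # 360
```

Both structures are covered in depth in sections 3 and 4 below.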

Why is Pandas Important?

  • 🚀 Speed — Built on top of NumPy, extremely fast for tabular data

  • 🧹 Data Cleaning — Handle missing data, duplicates, wrong types effortlessly

  • 🔗 Integration — Works with CSV, Excel, JSON, SQL, Parquet, APIs, and more

  • 📊 Analysis — Grouping, aggregation, pivoting, filtering — all in a few lines

  • 🤖 ML-Ready — Direct input to scikit-learn, TensorFlow, and PyTorch pipelines

Real-World Usage

| Industry | Use Case |
| --- | --- |
| Finance | Stock price analysis, portfolio management |
| Healthcare | Patient data cleaning, EHR analysis |
| E-commerce | Sales reports, customer segmentation |
| Data Engineering | ETL pipelines, data validation |
| Machine Learning | Feature engineering, dataset preparation |

Installing Pandas

Bash
pip install pandas

# With optional dependencies (recommended)
pip install pandas openpyxl xlrd sqlalchemy pyarrow
Python
import pandas as pd
import numpy as np

print(pd.__version__)  # e.g., 2.2.0

2. 🧱 Basic Concepts

2.1 The Pandas Data Model

Pandas is built on top of NumPy arrays. Every column in a DataFrame is essentially a NumPy array under the hood — which is why Pandas is so fast.

Text
DataFrame
│
├── Column A (Series) → NumPy Array
├── Column B (Series) → NumPy Array
└── Column C (Series) → NumPy Array
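You can see this directly by pulling a column out and asking for its NumPy representation (a small sketch with made-up data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

col = df["A"]           # a Series
arr = col.to_numpy()    # the column's values as a plain NumPy array
print(type(arr))        # <class 'numpy.ndarray'>
print(arr)              # [1 2 3]
```

Because the values live in NumPy arrays, column-wise operations run as fast compiled loops rather than Python-level iteration.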

2.2 The Index — The Heart of Pandas

Every Pandas object has an Index — labels for rows (and columns). This is what makes Pandas unique compared to plain arrays.

Python
import pandas as pd

# Default integer index (0, 1, 2...)
s = pd.Series([10, 20, 30])
print(s.index)   # RangeIndex(start=0, stop=3, step=1)

# Custom string index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])    # 20

3. 📊 The Series — Deep Dive

What is a Series?

A Series is a 1D array with labels. Think of it as a single column in a spreadsheet.

Python
# Structure: pd.Series(data, index, dtype, name)
s = pd.Series(
    data=[100, 200, 300, 400],
    index=["Jan", "Feb", "Mar", "Apr"],
    name="Monthly Revenue"
)
print(s)

Output:

Text
Jan    100
Feb    200
Mar    300
Apr    400
Name: Monthly Revenue, dtype: int64

Creating a Series — All Methods

Python
import pandas as pd
import numpy as np

# 1. From a Python list
s1 = pd.Series([1, 2, 3, 4, 5])

# 2. From a NumPy array
s2 = pd.Series(np.array([10.5, 20.1, 30.8]))

# 3. From a Python dictionary (keys become index)
s3 = pd.Series({"Alice": 95, "Bob": 87, "Charlie": 92})

# 4. From a scalar (broadcasts to all index positions)
s4 = pd.Series(42, index=["a", "b", "c", "d"])
print(s4)
# a    42
# b    42
# c    42
# d    42

# 5. With explicit dtype
s5 = pd.Series([1, 2, 3], dtype="float64")

# 6. With name
s6 = pd.Series([100, 200], name="sales")

# 7. Inspecting a Series
data = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])
print(data)
# a    1
# b    2
# c    3
# d    4
# e    5
# dtype: int64

data.index      # Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
data.values     # array([1, 2, 3, 4, 5])
data.to_dict()  # {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
data.to_numpy() # array([1, 2, 3, 4, 5])
data.to_list()  # [1, 2, 3, 4, 5]

Essential Series Attributes

Python
s = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

s.values        # numpy array: [10 20 30 40 50]
s.index         # Index(['a', 'b', 'c', 'd', 'e'])
s.dtype         # dtype('int64')
s.name          # None (or set name)
s.shape         # (5,)
s.size          # 5
s.ndim          # 1
s.nbytes        # 40 (memory in bytes)
s.is_unique     # True (all values distinct?)
s.hasnans       # False

Essential Series Methods

Python
s = pd.Series([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])

# --- Statistics ---
s.sum()          # 39
s.mean()         # 3.9
s.median()       # 3.5
s.std()          # 2.47 (sample std, ddof=1 by default)
s.var()          # 6.10
s.min()          # 1
s.max()          # 9
s.count()        # 10 (non-NaN count)
s.describe()     # summary stats

# --- Sorting ---
s.sort_values()            # sort by values
s.sort_values(ascending=False)
s.sort_index()             # sort by index

# --- Value Counts ---
s.value_counts()           # frequency of each value
s.unique()                 # array of unique values
s.nunique()                # count of unique values

# --- Math Operations ---
s + 10          # add 10 to all
s * 2           # multiply all by 2
s ** 2          # square all
np.sqrt(s)      # square root (NumPy works directly)

# --- Type Conversion ---
s.astype(float)
s.astype(str)

# --- Missing Data ---
s.isnull()      # boolean mask of NaN
s.notnull()
s.dropna()      # remove NaN
s.fillna(0)     # fill NaN with 0

# --- Apply custom function ---
s.apply(lambda x: x * 2 if x > 3 else x)

# --- String methods (if string Series) ---
names = pd.Series(["Alice", "Bob", "Charlie"])
names.str.lower()
names.str.upper()
names.str.len()
names.str.contains("li")
names.str.replace("a", "@")

Series Alignment (Super Important!)

When you do math between two Series, Pandas aligns by index — not by position.

Python
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

result = s1 + s2
print(result)
# a     NaN   (a not in s2)
# b    12.0   (2 + 10)
# c    23.0   (3 + 20)
# d     NaN   (d not in s1)

This is called index alignment and it's one of Pandas' most powerful features.
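If the NaNs produced by non-overlapping labels aren't what you want, the operator methods (`add`, `sub`, `mul`, `div`) accept a `fill_value`:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Missing side of each label is treated as 0 instead of producing NaN
result = s1.add(s2, fill_value=0)
print(result)
# a     1.0
# b    12.0
# c    23.0
# d    30.0
```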


4. 📋 The DataFrame — Deep Dive

What is a DataFrame?

A DataFrame is a 2D labeled data structure — like a spreadsheet or SQL table. It has rows and columns, each with labels (index and column names).

Text
             name   age  salary
0           Alice    25   50000
1             Bob    30   60000
2         Charlie    35   75000

Creating a DataFrame — All Methods

Python
import pandas as pd
import numpy as np

# ── Method 1: From a dictionary of lists ──────────────────────────────
df1 = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "salary": [50000, 60000, 75000]
})

# ── Method 2: From a list of dictionaries (most common in APIs) ────────
df2 = pd.DataFrame([
    {"name": "Alice", "age": 25, "salary": 50000},
    {"name": "Bob", "age": 30, "salary": 60000},
    {"name": "Charlie", "age": 35, "salary": 75000}
])

# ── Method 3: From a NumPy array ──────────────────────────────────────
df3 = pd.DataFrame(
    np.random.randint(0, 100, size=(4, 3)),
    columns=["A", "B", "C"]
)

# ── Method 4: From a list of lists ────────────────────────────────────
df4 = pd.DataFrame(
    [["Alice", 25], ["Bob", 30]],
    columns=["name", "age"]
)

# ── Method 5: From a Series ───────────────────────────────────────────
s = pd.Series([1, 2, 3], name="values")
df5 = pd.DataFrame(s)

# ── Method 6: Empty DataFrame with schema ─────────────────────────────
df6 = pd.DataFrame(columns=["id", "name", "score"])

# ── Method 7: With custom index ───────────────────────────────────────
df7 = pd.DataFrame(
    {"score": [85, 90, 78]},
    index=["Alice", "Bob", "Charlie"]
)

Essential DataFrame Attributes

Python
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "salary": [50000, 60000, 75000, 55000],
    "dept": ["HR", "IT", "Finance", "IT"]
})

df.shape         # (4, 4) → rows, columns
df.ndim          # 2
df.size          # 16 (total elements)
df.dtypes        # dtype of each column
df.columns       # Index(['name', 'age', 'salary', 'dept'])
df.index         # RangeIndex(start=0, stop=4, step=1)
df.values        # NumPy array of all data
df.axes          # [row index, column index]
df.memory_usage()# memory per column in bytes

First Look at Your Data

Python
df.head()        # first 5 rows (default)
df.head(10)      # first 10 rows
df.tail()        # last 5 rows
df.tail(3)       # last 3 rows
df.sample(3)     # 3 random rows
df.sample(frac=0.1)  # 10% random sample

df.info()        # column names, dtypes, non-null counts, memory
df.describe()    # statistical summary (numeric cols)
df.describe(include="all")    # include object/categorical cols
df.describe(include="object") # string/categorical only

df.info() Output Explained

Python
df.info()
Text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    4 non-null      object
 1   age     4 non-null      int64 
 2   salary  4 non-null      int64 
 3   dept    4 non-null      object
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes

This tells you:

  • Total rows and columns

  • Which columns have missing values

  • Data types (important for performance and operations)

  • Memory usage


5. 📂 Reading and Writing Data (I/O Operations)

This is one of the most important practical skills: a large share of your day-to-day Pandas time goes into just reading and writing files.

5.1 CSV Files

Python
# ── Reading ────────────────────────────────────────────────────────────
df = pd.read_csv("data.csv")

# Common parameters:
df = pd.read_csv(
    "data.csv",
    sep=",",             # delimiter (use sep="\t" for TSV)
    header=0,            # row to use as column names (0 = first row)
    index_col="id",      # set column as index
    usecols=["name", "age", "salary"],  # read only specific columns
    nrows=1000,          # read only first 1000 rows
    skiprows=2,          # skip first 2 rows
    dtype={"age": int, "salary": float},  # specify dtypes
    na_values=["N/A", "NULL", "none", ""],  # treat as NaN
    parse_dates=["date_col"],  # auto-parse date columns
    encoding="utf-8",    # encoding (try "latin-1" if utf-8 fails)
    low_memory=False,    # avoid mixed-type warnings
    chunksize=10000,     # read in chunks — NOTE: returns an iterator, not a DataFrame
)

# ── Writing ────────────────────────────────────────────────────────────
df.to_csv("output.csv")

df.to_csv(
    "output.csv",
    index=False,         # don't write row numbers
    sep=",",
    encoding="utf-8",
    columns=["name", "salary"],  # write only specific columns
    header=True,         # write column names
    mode="a",            # append instead of overwrite
    na_rep="NULL",       # represent NaN as "NULL"
    float_format="%.2f", # format floats to 2 decimal places
    date_format="%Y-%m-%d",
)

Chunked Reading (for Big Files)

Python
# When a CSV is too large to fit in memory, read in chunks
chunk_list = []
for chunk in pd.read_csv("huge_file.csv", chunksize=10000):
    # Process each chunk
    filtered = chunk[chunk["age"] > 30]
    chunk_list.append(filtered)

df = pd.concat(chunk_list, ignore_index=True)
print(f"Final shape: {df.shape}")

5.2 Excel Files

Python
# ── Reading ────────────────────────────────────────────────────────────
# pip install openpyxl (for .xlsx)  xlrd (for .xls)
df = pd.read_excel("data.xlsx")

df = pd.read_excel(
    "data.xlsx",
    sheet_name="Sheet1",     # specify sheet name or index (0-based)
    header=0,
    usecols="A:D",            # Excel column range
    skiprows=2,
    dtype={"id": str},
    na_values=["", "N/A"],
)

# Read all sheets at once
all_sheets = pd.read_excel("data.xlsx", sheet_name=None)
# Returns dict: {"Sheet1": df1, "Sheet2": df2, ...}

# ── Writing ────────────────────────────────────────────────────────────
df.to_excel("output.xlsx", index=False, sheet_name="Results")

# Multiple sheets in one file
with pd.ExcelWriter("multi_sheet.xlsx", engine="openpyxl") as writer:
    df1.to_excel(writer, sheet_name="Sales", index=False)
    df2.to_excel(writer, sheet_name="HR", index=False)
    df3.to_excel(writer, sheet_name="Finance", index=False)

5.3 JSON Files

Python
# ── Reading ────────────────────────────────────────────────────────────
# records format: [{"col": val}, {"col": val}, ...]
df = pd.read_json("data.json")

df = pd.read_json(
    "data.json",
    orient="records",    # "records", "split", "index", "columns", "values"
    lines=True,          # for JSON Lines format (one JSON per line)
    dtype={"id": str},
    encoding="utf-8",
)

# Reading from a URL (API response)
import json, urllib.request
url = "https://jsonplaceholder.typicode.com/users"
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read())
df = pd.DataFrame(data)

# ── Writing ────────────────────────────────────────────────────────────
df.to_json("output.json", orient="records", indent=2)
df.to_json("output.json", orient="records", lines=True)  # JSON Lines

5.4 SQL Databases

Python
# pip install sqlalchemy
from sqlalchemy import create_engine

# ── Connect ────────────────────────────────────────────────────────────
# SQLite
engine = create_engine("sqlite:///mydb.db")

# PostgreSQL
engine = create_engine("postgresql+psycopg2://user:password@host:5432/dbname")

# MySQL
engine = create_engine("mysql+pymysql://user:password@host:3306/dbname")

# SQL Server
engine = create_engine("mssql+pyodbc://user:password@host/dbname?driver=ODBC+Driver+17+for+SQL+Server")

# ── Reading ────────────────────────────────────────────────────────────
# Read an entire table
df = pd.read_sql_table("employees", con=engine)

# Read with SQL query
df = pd.read_sql_query(
    "SELECT name, age, salary FROM employees WHERE dept = 'IT'",
    con=engine
)

# Universal (works for both table name and SQL string)
df = pd.read_sql("SELECT * FROM employees LIMIT 100", con=engine)

# With parameters (safe — avoids SQL injection); wrap named params in text()
from sqlalchemy import text
df = pd.read_sql(
    text("SELECT * FROM employees WHERE dept = :dept AND age > :age"),
    con=engine,
    params={"dept": "IT", "age": 30},
)

# ── Writing ────────────────────────────────────────────────────────────
df.to_sql(
    "employees_backup",
    con=engine,
    if_exists="replace",    # "fail", "replace", "append"
    index=False,
    chunksize=1000,         # write in chunks
    method="multi",         # use multi-row INSERT (faster)
)

5.5 Parquet Files (Modern Big Data Format)

Python
# pip install pyarrow  (or fastparquet)

# ── Writing ────────────────────────────────────────────────────────────
df.to_parquet("data.parquet", index=False, engine="pyarrow", compression="snappy")

# ── Reading ────────────────────────────────────────────────────────────
df = pd.read_parquet("data.parquet", engine="pyarrow")

# Read specific columns only (much faster for wide tables)
df = pd.read_parquet("data.parquet", columns=["name", "salary"])

💡 Parquet vs CSV: Parquet files are often 5-10x smaller and 10-100x faster to read. Prefer Parquet for large datasets in production.

5.6 Other Formats

Python
# ── Clipboard ─────────────────────────────────────────────────────────
df = pd.read_clipboard()  # paste from Excel/browser table
df.to_clipboard(index=False)

# ── HTML (scraping tables) ─────────────────────────────────────────────
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP")
df = tables[0]  # first table on the page

# ── Pickle (Python objects, preserves dtypes perfectly) ───────────────
df.to_pickle("data.pkl")
df = pd.read_pickle("data.pkl")

# ── HDF5 (for very large datasets) ────────────────────────────────────
df.to_hdf("data.h5", key="employees", mode="w")
df = pd.read_hdf("data.h5", key="employees")

I/O Format Comparison

| Format | Speed | File Size | Human Readable | Best For |
| --- | --- | --- | --- | --- |
| CSV | Slow | Large | ✅ Yes | Small files, sharing |
| JSON | Medium | Large | ✅ Yes | APIs, configs |
| Parquet | Fast | Small | ❌ No | Big data, production |
| Excel | Slow | Medium | ✅ Yes | Business reports |
| Pickle | Fast | Medium | ❌ No | Python-only caching |
| HDF5 | Very Fast | Variable | ❌ No | Scientific data |
| SQL | Varies | Varies | Via query | Transactional data |


6. 🔧 Essential DataFrame Operations

6.1 Selecting Columns

Python
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "salary": [50000, 60000, 75000],
    "dept": ["HR", "IT", "Finance"]
})

# ── Single column → returns Series ────────────────────────────────────
df["name"]         # bracket notation (always works)
df.name            # attribute notation (AVOID — breaks if col name has spaces/conflicts)

# ── Multiple columns → returns DataFrame ──────────────────────────────
df[["name", "salary"]]
df[["name", "age", "dept"]]

# ── All columns except one ────────────────────────────────────────────
df.drop(columns=["salary"])

# ── Select by dtype ───────────────────────────────────────────────────
df.select_dtypes(include="number")    # numeric columns only
df.select_dtypes(include="object")    # string columns only
df.select_dtypes(exclude="object")    # exclude strings

6.2 Adding and Removing Columns

Python
# ── Adding new columns ────────────────────────────────────────────────
df["annual_bonus"] = df["salary"] * 0.10
df["senior"] = df["age"] >= 30   # boolean column
df["full_info"] = df["name"] + " (" + df["dept"] + ")"

# Using assign() — returns a new df, doesn't modify original
df_new = df.assign(
    tax=df["salary"] * 0.30,
    net_salary=df["salary"] * 0.70,
    age_group=lambda x: pd.cut(x["age"], bins=[0, 30, 50, 100], labels=["Young", "Mid", "Senior"])
)

# Insert at specific position
df.insert(1, "employee_id", range(1, len(df) + 1))

# ── Removing columns ──────────────────────────────────────────────────
df.drop("annual_bonus", axis=1, inplace=True)              # single column
df.drop(["salary", "dept"], axis=1, inplace=True)          # multiple columns
df.drop(columns=["salary", "dept"])                        # cleaner syntax

# ── Rename columns ────────────────────────────────────────────────────
df.rename(columns={"name": "full_name", "age": "years"})

# Make all column names lowercase and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(" ", "_")

6.3 Adding and Removing Rows

Python
# ── Adding rows ───────────────────────────────────────────────────────
new_row = pd.DataFrame([{"name": "Diana", "age": 28, "salary": 55000, "dept": "Marketing"}])
df = pd.concat([df, new_row], ignore_index=True)

# ── Removing rows ─────────────────────────────────────────────────────
df.drop(0, axis=0)                 # drop row with index 0
df.drop([0, 2, 4])                 # drop multiple rows by index label
df.drop(df.index[df["age"] < 25]) # drop rows where condition is True
df[df["age"] >= 25]               # filter (preferred over dropping)

# ── Reset index after drops ───────────────────────────────────────────
df = df.drop(0).reset_index(drop=True)

6.4 Data Types and Conversion

Python
df = pd.DataFrame({
    "id": ["1", "2", "3"],
    "salary": ["50000", "60000", "75000"],
    "hire_date": ["2020-01-15", "2019-03-22", "2021-07-01"],
    "active": ["True", "False", "True"]
})

# Check dtypes
print(df.dtypes)

# ── Convert types ─────────────────────────────────────────────────────
df["id"] = df["id"].astype(int)
df["salary"] = df["salary"].astype(float)
df["hire_date"] = pd.to_datetime(df["hire_date"])
df["active"] = df["active"].map({"True": True, "False": False})

# Safe numeric conversion (won't crash on errors)
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")  # NaN on failure
# Note: errors="ignore" (keep original on failure) is deprecated in pandas 2.2+

# Convert to categorical (saves memory for low-cardinality string cols)
df["dept"] = pd.Series(["HR", "IT", "HR"])   # example column for illustration
df["dept"] = df["dept"].astype("category")

# Full conversion with convert_dtypes() — auto-detects best types
df = df.convert_dtypes()

6.5 Viewing, Sorting, Reindexing

Python
# ── Sorting ───────────────────────────────────────────────────────────
df.sort_values("salary")                           # ascending
df.sort_values("salary", ascending=False)          # descending
df.sort_values(["dept", "salary"], ascending=[True, False])  # multi-col sort
df.sort_index()                                    # sort by index

# ── Reindexing ────────────────────────────────────────────────────────
df.reset_index(drop=True)             # reset to 0,1,2...
df.set_index("name")                  # use 'name' column as index
df.set_index("name", drop=True)       # remove 'name' from columns

# ── Transposing ───────────────────────────────────────────────────────
df.T   # swap rows and columns

7. 🔍 Understanding dtypes in Depth

One of the most common sources of bugs and slowness in Pandas is wrong data types.

All Pandas dtypes

| dtype | Description | Example |
| --- | --- | --- |
| int64 | 64-bit integer | 1, 42, -7 |
| int32 | 32-bit integer (less memory) | |
| float64 | 64-bit float | 3.14, 2.718 |
| float32 | 32-bit float (less memory) | |
| object | Python string (or mixed) | "hello" |
| string | Pandas string type (new) | "hello" |
| bool | Boolean | True/False |
| datetime64[ns] | Timestamps | 2024-01-01 |
| timedelta64[ns] | Time differences | 3 days |
| category | Categorical (low cardinality) | "M"/"F" |
| complex128 | Complex numbers | 3+4j |

Python
# Memory optimization example
df = pd.DataFrame({
    "user_id": np.random.randint(1, 100000, 1_000_000),
    "score": np.random.uniform(0, 100, 1_000_000),
    "grade": np.random.choice(["A", "B", "C", "D"], 1_000_000)
})

print(f"Before: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

# Optimize types
df["user_id"] = df["user_id"].astype("int32")       # int64 → int32 = 50% less
df["score"] = df["score"].astype("float32")          # float64 → float32 = 50% less
df["grade"] = df["grade"].astype("category")         # object → category = huge savings

print(f"After:  {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
# Exact numbers vary by platform and Pandas version, but expect a large
# reduction — the object → category conversion typically saves the most.

8. 🛠️ Practical Examples

Example 1: Analyzing Employee Data (Beginner)

Python
import pandas as pd

# Step 1: Load data
df = pd.read_csv("employees.csv")

# Step 2: First look
print("Shape:", df.shape)          # How many rows and columns?
print("\nColumn types:")
print(df.dtypes)                   # What are the data types?
print("\nSummary:")
print(df.describe())               # Statistical summary
print("\nMissing values:")
print(df.isnull().sum())           # Check for missing data

# Step 3: Data cleaning
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")
df["hire_date"] = pd.to_datetime(df["hire_date"], errors="coerce")
df["dept"] = df["dept"].str.strip().str.title()

# Step 4: Basic analysis
print("\nEmployee count by department:")
print(df["dept"].value_counts())

print("\nAverage salary by department:")
print(df.groupby("dept")["salary"].mean().sort_values(ascending=False))

# Step 5: Save cleaned data
df.to_csv("employees_clean.csv", index=False)
print("\n✅ Cleaned data saved!")

Line-by-line explanation:

  • pd.read_csv() — loads the CSV into a DataFrame

  • df.shape — tells us rows × columns so we know the dataset size

  • df.dtypes — shows data types; crucial to check before analysis

  • df.describe() — quick stats: count, mean, std, min, max, quartiles

  • df.isnull().sum() — counts missing values per column

  • str.strip().str.lower() — cleans column names (remove spaces, lowercase)

  • pd.to_numeric(..., errors="coerce") — safely converts to number; puts NaN if conversion fails

  • pd.to_datetime(...) — converts string dates to proper datetime

  • value_counts() — counts how many employees per department

  • groupby().mean() — average salary per department


Example 2: Sales Dashboard Data Prep (Intermediate)

Python
import pandas as pd
import numpy as np

# Create realistic sample data
np.random.seed(42)
n = 500

df = pd.DataFrame({
    "order_id": range(1001, 1001 + n),
    "customer_id": np.random.randint(100, 200, n),
    "product": np.random.choice(["Laptop", "Phone", "Tablet", "Watch", "Earbuds"], n),
    "category": np.random.choice(["Electronics", "Accessories"], n),
    "quantity": np.random.randint(1, 10, n),
    "unit_price": np.random.uniform(50, 2000, n).round(2),
    "discount": np.random.choice([0, 0.05, 0.10, 0.20], n),
    "order_date": pd.date_range("2023-01-01", periods=n, freq="12H")
})

# ── Step 1: Calculate derived columns ─────────────────────────────────
df["gross_revenue"] = df["quantity"] * df["unit_price"]
df["discount_amount"] = df["gross_revenue"] * df["discount"]
df["net_revenue"] = df["gross_revenue"] - df["discount_amount"]

# ── Step 2: Extract date parts for time-series analysis ───────────────
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["month_name"] = df["order_date"].dt.strftime("%B")  # "January", etc.
df["quarter"] = df["order_date"].dt.quarter
df["day_of_week"] = df["order_date"].dt.day_name()

# ── Step 3: Quick KPIs ────────────────────────────────────────────────
print("=" * 50)
print("📊 SALES DASHBOARD SUMMARY")
print("=" * 50)
print(f"Total Orders:        {len(df):,}")
print(f"Total Gross Revenue: ${df['gross_revenue'].sum():,.2f}")
print(f"Total Discounts:     ${df['discount_amount'].sum():,.2f}")
print(f"Total Net Revenue:   ${df['net_revenue'].sum():,.2f}")
print(f"Average Order Value: ${df['net_revenue'].mean():,.2f}")
print(f"Unique Customers:    {df['customer_id'].nunique():,}")

# ── Step 4: Top products ──────────────────────────────────────────────
top_products = (
    df.groupby("product")
    .agg(
        orders=("order_id", "count"),
        total_revenue=("net_revenue", "sum"),
        avg_price=("unit_price", "mean")
    )
    .sort_values("total_revenue", ascending=False)
    .round(2)
)
print("\n🏆 Top Products:")
print(top_products.to_string())

# ── Step 5: Monthly revenue trend ────────────────────────────────────
monthly = (
    df.groupby(["year", "month"])["net_revenue"]
    .sum()
    .reset_index()
    .rename(columns={"net_revenue": "monthly_revenue"})
)
print("\n📅 Monthly Revenue:")
print(monthly.head(6).to_string(index=False))

# ── Step 6: Save to Excel with multiple sheets ────────────────────────
with pd.ExcelWriter("sales_dashboard.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Raw Data", index=False)
    top_products.to_excel(writer, sheet_name="Product Summary")
    monthly.to_excel(writer, sheet_name="Monthly Trend", index=False)

print("\n✅ Dashboard exported to sales_dashboard.xlsx")

9. ⚠️ Edge Cases and Common Errors

Error 1: SettingWithCopyWarning

Python
# ❌ Wrong — modifying a copy, not the original
subset = df[df["age"] > 30]
subset["salary"] = subset["salary"] * 1.1  # ⚠️ SettingWithCopyWarning!

# ✅ Correct — use .copy() explicitly
subset = df[df["age"] > 30].copy()
subset["salary"] = subset["salary"] * 1.1  # No warning

# ✅ Or use .loc on the original
df.loc[df["age"] > 30, "salary"] = df.loc[df["age"] > 30, "salary"] * 1.1

Error 2: Mixed Types in object Columns

Python
# ❌ This column has mixed types — causes silent bugs
s = pd.Series([1, "two", 3, None, 5.0])
print(s.dtype)   # object — pandas treats mixed as Python objects
print(s.sum())   # ERROR or wrong result

# ✅ Always clean before operating
s = pd.to_numeric(s, errors="coerce")  # converts "two" → NaN
print(s.sum())   # 9.0 (skips NaN)

Error 3: Index Misalignment

Python
df1 = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({"b": [10, 20, 30]}, index=[1, 2, 3])

# ❌ Arithmetic aligns on index — might not be what you want
result = df1["a"] + df2["b"]
print(result)
# 0    NaN  (index 0 not in df2)
# 1    12.0
# 2    23.0
# 3    NaN  (index 3 not in df1)

# ✅ Reset index first if you want positional alignment
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
result = df1["a"] + df2["b"]

Error 4: Chained Assignment

Python
# ❌ Chained indexing — unpredictable behavior (may silently modify a copy)
df[df["dept"] == "IT"]["dept"] = "Information Technology"

# ✅ Always use .loc or .iloc
df.loc[df["dept"] == "IT", "dept"] = "Information Technology"

Error 5: read_csv with Wrong Encoding

Python
# UnicodeDecodeError: 'utf-8' codec can't decode byte...
df = pd.read_csv("data.csv")  # might crash

# ✅ Try different encodings
df = pd.read_csv("data.csv", encoding="latin-1")   # common for European data
df = pd.read_csv("data.csv", encoding="utf-8-sig")  # for Excel-generated CSVs
df = pd.read_csv("data.csv", encoding="cp1252")     # Windows default

Error 6: Memory Error on Large Files

Python
# ❌ Crashes on 10GB file
df = pd.read_csv("huge_10gb_file.csv")

# ✅ Read in chunks
results = []
for chunk in pd.read_csv("huge_10gb_file.csv", chunksize=50000):
    # Do your processing per chunk
    summary = chunk.groupby("category")["revenue"].sum()
    results.append(summary)

final = pd.concat(results).groupby(level=0).sum()

10. 💡 Pro Developer Insights

Best Practices

  1. Never use .iterrows() — it's 100-1000x slower than vectorized operations

  2. Use .copy() when slicing — avoids SettingWithCopyWarning

  3. Set index on the column you filter/join most — speeds up lookups

  4. Use categorical dtype for low-cardinality string columns — saves 80%+ memory

  5. Prefer .loc over chained indexing — more explicit and safe

  6. Always check df.info() and df.isnull().sum() first — understand your data before touching it

  7. Use pd.read_parquet instead of CSV for production data pipelines

  8. Don't use inplace=True — it rarely saves memory, and the Pandas team discourages it

Python
# ❌ Anti-pattern: iterrows (very slow)
for index, row in df.iterrows():
    df.at[index, "tax"] = row["salary"] * 0.30

# ✅ Vectorized (fast)
df["tax"] = df["salary"] * 0.30

# ❌ Anti-pattern: growing a DataFrame row by row
df = pd.DataFrame(columns=["x", "y"])
for i in range(10000):
    df = pd.concat([df, pd.DataFrame([{"x": i, "y": i**2}])])  # copies everything each iteration

# ✅ Collect values first and build once — or build directly from arrays
x = np.arange(10000)
df = pd.DataFrame({"x": x, "y": x**2})  # fast

Performance Tips

Python
# Tip 1: Use query() for readable filtering
df.query("age > 30 and dept == 'IT'")  # instead of boolean indexing

# Tip 2: Use eval() for complex expressions
df.eval("net = salary - salary * 0.30", inplace=True)

# Tip 3: Downcasting numerics
df["age"] = pd.to_numeric(df["age"], downcast="integer")   # int64 → int8/16/32
df["score"] = pd.to_numeric(df["score"], downcast="float") # float64 → float32

# Tip 4: pipe() for readable chained operations
def clean_names(df):
    df.columns = df.columns.str.lower().str.replace(" ", "_")
    return df

def drop_duplicates_keep_last(df):
    return df.drop_duplicates(keep="last")

result = (df
    .pipe(clean_names)
    .pipe(drop_duplicates_keep_last)
    .dropna(subset=["salary"])
    .reset_index(drop=True)
)

11. 📊 Comparison: Pandas vs Alternatives

| Feature | Pandas | Polars | Dask | Spark | NumPy |
| --- | --- | --- | --- | --- | --- |
| Primary Use | Tabular data | Tabular data (fast) | Big data | Distributed big data | Numeric arrays |
| Speed | Good | Excellent | Good (parallel) | Excellent (cluster) | Very fast |
| Memory | RAM-limited | Efficient | Larger than RAM | Cluster | Efficient |
| Learning Curve | Moderate | Low-Medium | Moderate | High | Low |
| Ecosystem | Huge | Growing | Good | Huge | Huge |
| Best For | <10M rows | <100M rows | 10M-100B rows | Petabyte scale | Math operations |
| SQL Support | Via sqlalchemy | Native SQL | Partial | Full | No |

When to use Pandas:

  • ✅ Datasets that fit in RAM (< 10-20 GB)

  • ✅ Exploratory data analysis

  • ✅ Data cleaning and transformation pipelines

  • ✅ Integration with scikit-learn, seaborn, statsmodels

  • ✅ Business reporting and dashboards


12. 🎤 Interview Questions — Part 1

Basic Level

Q1: What is the difference between a Series and a DataFrame?

A Series is a 1D labeled array (like a single column). A DataFrame is a 2D labeled table made of multiple Series sharing the same index.
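A quick demonstration of that relationship (sample data made up):

```python
import pandas as pd

s = pd.Series([25, 30, 35], name="age")  # 1D: a single labeled column
df = pd.DataFrame({"age": s, "salary": [50000, 60000, 75000]})  # 2D: a table

# Pulling one column back out of the DataFrame gives a Series again
print(type(df["age"]))  # <class 'pandas.core.series.Series'>
```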

Q2: How do you check for missing values in a DataFrame?

Python
df.isnull().sum()          # count of NaN per column
df.isnull().any()          # True/False per column
df.isnull().values.any()   # True if any NaN anywhere

Q3: What's the difference between df.loc and df.iloc?

.loc uses label-based indexing (index names). .iloc uses integer position-based indexing (0, 1, 2...). We'll cover this in depth in Part 2.
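A tiny example of the distinction (made-up data):

```python
import pandas as pd

df = pd.DataFrame({"score": [85, 90, 78]}, index=["Alice", "Bob", "Charlie"])

print(df.loc["Bob", "score"])  # label-based lookup → 90
print(df.iloc[1, 0])           # position-based lookup (row 1, column 0) → 90
```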

Q4: How do you select multiple columns from a DataFrame?

Python
df[["col1", "col2", "col3"]]  # double brackets → DataFrame

Q5: How do you read only specific columns from a CSV?

Python
df = pd.read_csv("file.csv", usecols=["name", "salary", "dept"])

Intermediate Level

Q6: What is SettingWithCopyWarning and how do you fix it?

It happens when you try to modify a slice of a DataFrame. Fix: Use .copy() when creating the slice, or use .loc to modify the original.

Q7: What is the difference between read_csv with dtype parameter vs calling astype() after?

Using dtype in read_csv is faster and more memory efficient because it sets the type during reading (no conversion needed). astype() after reading first loads as one type then converts.
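Both routes end at the same dtypes; the difference is whether conversion happens during parsing or in a second pass. A sketch using an in-memory CSV:

```python
import io
import pandas as pd

csv = "id,score\n1,85\n2,90\n3,78\n"

# Types set while parsing — no intermediate representation to convert
df1 = pd.read_csv(io.StringIO(csv), dtype={"id": "int32", "score": "float32"})

# Parsed with default types first, then converted afterwards
df2 = pd.read_csv(io.StringIO(csv)).astype({"id": "int32", "score": "float32"})

print(df1.dtypes.equals(df2.dtypes))  # True — same result, different cost
```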

Q8: How do you efficiently read a CSV file that doesn't fit in memory?

Python
for chunk in pd.read_csv("huge.csv", chunksize=50000):
    process(chunk)

Advanced Level

Q9: How does Pandas handle index alignment in arithmetic operations?

Pandas automatically aligns on the index before performing arithmetic. If indices don't match, the result is NaN for mismatched positions. This prevents silent bugs from position-based operations.

Q10: What's the memory layout difference between object dtype and string dtype in Pandas 2.x?

object dtype stores Python string objects in a pointer-based array — high memory and slow. string dtype (new in Pandas 1.0+) uses Apache Arrow memory layout or Pandas nullable arrays — more efficient and consistent.
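A small sketch of the two dtypes side by side (the PyArrow-backed variant, `dtype="string[pyarrow]"`, additionally requires pyarrow to be installed):

```python
import pandas as pd

s_obj = pd.Series(["a", "b", None])                  # classic object dtype
s_str = pd.Series(["a", "b", None], dtype="string")  # nullable string dtype

print(s_obj.dtype)         # object
print(s_str.dtype)         # string
print(s_str.isna().sum())  # 1 — missing values become pd.NA, handled consistently
```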


13. ✅ Conclusion — Part 1

Here's what you learned in Part 1:

  • Pandas fundamentals — why it exists and what problems it solves

  • Series — 1D labeled arrays, all creation methods, key methods and alignment

  • DataFrames — 2D labeled tables, all creation methods, essential attributes

  • I/O Operations — reading and writing CSV, Excel, JSON, SQL, Parquet, and more

  • Essential operations — selecting, adding, removing columns/rows, type conversion

  • dtypes — memory optimization through smart type selection

  • Real-world examples — employee analysis, sales dashboard

  • Common errors — SettingWithCopyWarning, index misalignment, encoding issues

  • Best practices — avoid iterrows, use vectorized operations, use .loc


🔜 What's Coming in Part 2?

In Part 2: Data Manipulation, we'll cover:

  • Indexing and Slicing — .loc, .iloc, Boolean indexing, MultiIndex

  • Filtering — query(), complex conditions, isin(), between()

  • Handling Missing Data — dropna(), fillna(), interpolate()

  • GroupBy — split-apply-combine, agg(), transform(), filter()

  • Merge and Join — inner, outer, left, right joins like SQL

  • Reshaping — pivot(), pivot_table(), melt(), stack(), unstack()

  • String Operations — str.split(), str.extract(), regex in Pandas

  • Apply, Map, and Lambda — when and how to use each


💾 Save this blog. Share it with your team. You'll refer back to the I/O and dtype sections constantly in your day-to-day work.

📌 Next: Pandas Data Manipulation: The Complete Guide (Part 2 — Indexing, GroupBy, Merge & Reshape)
