✅ What are Pivot and Unpivot?
- Pivot = Convert rows into columns
- Unpivot = Convert columns into rows
🌀 Sample DataFrame
Let’s start with a small DataFrame to work with:
from pyspark.sql.functions import first, expr

data = [("Himanshu", "Math", 85), ("Himanshu", "Science", 90),
        ("Amit", "Math", 78), ("Amit", "Science", 83),
        ("Neha", "Math", 92), ("Neha", "Science", 88)]
columns = ["Name", "Subject", "Score"]
df = spark.createDataFrame(data, columns)
df.show()
🔍 Output:
| Name | Subject | Score |
|---|---|---|
| Himanshu | Math | 85 |
| Himanshu | Science | 90 |
| Amit | Math | 78 |
| Amit | Science | 83 |
| Neha | Math | 92 |
| Neha | Science | 88 |
🔄 Pivot in PySpark
Goal: Turn Subjects into columns and fill with Score.
pivotDF = df.groupBy("Name").pivot("Subject").agg(first("Score"))
pivotDF.show()
🧾 Output:
| Name | Math | Science |
|---|---|---|
| Himanshu | 85 | 90 |
| Amit | 78 | 83 |
| Neha | 92 | 88 |
✅ Explanation:
- groupBy("Name"): groups the rows by student name.
- pivot("Subject"): turns the distinct values in "Subject" into column headers.
- agg(first("Score")): picks the score (there is only one score per student–subject pair).
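Conceptually, the same reshaping can be sketched in plain Python (an illustration of the logic only, not how Spark executes it):

```python
# Plain-Python sketch of groupBy("Name").pivot("Subject").agg(first("Score"))
rows = [("Himanshu", "Math", 85), ("Himanshu", "Science", 90),
        ("Amit", "Math", 78), ("Amit", "Science", 83),
        ("Neha", "Math", 92), ("Neha", "Science", 88)]

pivoted = {}
for name, subject, score in rows:
    # setdefault keeps the first score seen, mirroring first("Score")
    pivoted.setdefault(name, {}).setdefault(subject, score)

print(pivoted["Himanshu"])  # {'Math': 85, 'Science': 90}
```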
🔁 Unpivot in PySpark
PySpark has no built-in unpivot() before Spark 3.4 (which added DataFrame.unpivot()), but we can simulate it using selectExpr with SQL's stack function.
Let’s unpivot the above pivotDF back to its original format.
unpivotDF = pivotDF.select(
"Name",
expr("stack(2, 'Math', Math, 'Science', Science) as (Subject, Score)")
)
unpivotDF.show()
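SQL's stack(n, k1, v1, k2, v2, …) emits n rows per input row, pairing each label with its value. In plain Python the idea looks like this (an illustration, not the Spark implementation):

```python
def stack_rows(n, *pairs):
    """Mimic SQL stack(): split the flat key/value arguments into n (key, value) rows."""
    return [(pairs[2 * i], pairs[2 * i + 1]) for i in range(n)]

# One pivoted row: Name='Himanshu', Math=85, Science=90
print(stack_rows(2, "Math", 85, "Science", 90))
# [('Math', 85), ('Science', 90)]
```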
⚡ Dynamic way (no hardcoded column names):
# 🔹 Step 1: Dynamically get all columns to unpivot (except the id column)
id_column = "Name"
value_columns = [c for c in pivotDF.columns if c != id_column]
# 🔹 Step 2: Create stack expression
stack_expr = "stack({0}, {1}) as (Subject, Score)".format(
    len(value_columns),
    ', '.join("'{0}', `{0}`".format(c) for c in value_columns)
)
# 🔹 Step 3: Use selectExpr with dynamic stack
unpivotDF = pivotDF.selectExpr(id_column, stack_expr)
# 🔹 Step 4: Show result
unpivotDF.show(truncate=False)
🧾 Output:
| Name | Subject | Score |
|---|---|---|
| Himanshu | Math | 85 |
| Himanshu | Science | 90 |
| Amit | Math | 78 |
| Amit | Science | 83 |
| Neha | Math | 92 |
| Neha | Science | 88 |
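Step 2 above is plain string building, so the generated expression can be inspected on its own (assuming the Math and Science columns from the pivoted DataFrame):

```python
value_columns = ["Math", "Science"]

# Build "'col', `col`" pairs; backticks protect column names containing spaces or dots
pairs = ", ".join("'{0}', `{0}`".format(c) for c in value_columns)
stack_expr = "stack({0}, {1}) as (Subject, Score)".format(len(value_columns), pairs)

print(stack_expr)
# stack(2, 'Math', `Math`, 'Science', `Science`) as (Subject, Score)
```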
🔁 Summary
| Operation | What It Does | PySpark Code Example |
|---|---|---|
| Pivot | Rows ➡ Columns | df.groupBy().pivot() |
| Unpivot | Columns ➡ Rows (using stack) | df.select(expr("stack(...)")) |
🎯 Real-world use case:
Scenario: HR wants to see attendance status by day.
| Employee | Day | Status |
|---|---|---|
| Himanshu | Monday | Present |
| Himanshu | Tuesday | Absent |
Pivoted:
| Employee | Monday | Tuesday |
|---|---|---|
| Himanshu | Present | Absent |
And vice versa (Unpivoted).
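The attendance round trip, sketched in plain Python to show the mechanics (in PySpark it would follow the same groupBy/pivot and stack pattern shown above):

```python
attendance = [("Himanshu", "Monday", "Present"), ("Himanshu", "Tuesday", "Absent")]

# Pivot: Day values become columns
pivoted = {}
for emp, day, status in attendance:
    pivoted.setdefault(emp, {})[day] = status

# Unpivot: columns become (Employee, Day, Status) rows again
unpivoted = [(emp, day, status)
             for emp, days in pivoted.items()
             for day, status in days.items()]

print(pivoted)    # {'Himanshu': {'Monday': 'Present', 'Tuesday': 'Absent'}}
print(unpivoted)  # [('Himanshu', 'Monday', 'Present'), ('Himanshu', 'Tuesday', 'Absent')]
```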
# Sample pivoted data (illustrative sales figures)
data = [("Laptop", 1000, 1500),
        ("Mobile", 800, 1200)]
columns = ["Product", "USA", "India"]
pivotDF = spark.createDataFrame(data, columns)
# 🔹 Step 1: Dynamically get all columns to unpivot (except the id column)
id_column = "Product"
value_columns = [c for c in pivotDF.columns if c != id_column]
# 🔹 Step 2: Create stack expression
stack_expr = "stack({0}, {1}) as (Country, Amount)".format(
    len(value_columns),
    ', '.join("'{0}', `{0}`".format(c) for c in value_columns)
)
# 🔹 Step 3: Use selectExpr with dynamic stack
unpivotDF = pivotDF.selectExpr(id_column, stack_expr)
# 🔹 Step 4: Show result
unpivotDF.show(truncate=False)
