
PySpark Pivot and Unpivot DataFrame


✅ What is Pivot and Unpivot?

  • Pivot = Convert rows into columns
  • Unpivot = Convert columns into rows

🌀 Sample DataFrame

Let’s start with a small DataFrame to work with:

Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, first

spark = SparkSession.builder.appName("PivotUnpivot").getOrCreate()

data = [
    ("Himanshu", "Math", 85),
    ("Himanshu", "Science", 90),
    ("Amit", "Math", 78),
    ("Amit", "Science", 83),
    ("Neha", "Math", 92),
    ("Neha", "Science", 88),
]

columns = ["Name", "Subject", "Score"]

df = spark.createDataFrame(data, columns)
df.show()

🔍 Output:

+--------+-------+-----+
|    Name|Subject|Score|
+--------+-------+-----+
|Himanshu|   Math|   85|
|Himanshu|Science|   90|
|    Amit|   Math|   78|
|    Amit|Science|   83|
|    Neha|   Math|   92|
|    Neha|Science|   88|
+--------+-------+-----+

🔄 Pivot in PySpark

Goal: Turn Subjects into columns and fill with Score.

Python
pivotDF = df.groupBy("Name").pivot("Subject").agg(first("Score"))
pivotDF.show()

🧾 Output:

+--------+----+-------+
|    Name|Math|Science|
+--------+----+-------+
|Himanshu|  85|     90|
|    Amit|  78|     83|
|    Neha|  92|     88|
+--------+----+-------+

✅ Explanation:

  • groupBy("Name"): Group by the student name.
  • pivot("Subject"): Turn values in "Subject" into column headers.
  • agg(first("Score")): Take the single score for each name/subject pair (an aggregate is required even though there is only one value per cell).

🔁 Unpivot in PySpark

PySpark only gained a built-in DataFrame.unpivot() (alias melt()) in Spark 3.4; on earlier versions we can simulate it using selectExpr with the SQL stack function.

Let’s unpivot the above pivotDF back to its original format.

Python
unpivotDF = pivotDF.select(
    "Name",
    expr("stack(2, 'Math', Math, 'Science', Science) as (Subject, Score)")
)
unpivotDF.show()

🔁 Dynamic approach

Python
# 🔹 Step 1: Dynamically get all columns to unpivot (except the id column)
id_column = "Name"
value_columns = [c for c in pivotDF.columns if c != id_column]

# 🔹 Step 2: Create stack expression (backticks protect column names with spaces)
stack_expr = "stack({0}, {1}) as (Subject, Score)".format(
    len(value_columns),
    ', '.join("'{0}', `{0}`".format(c) for c in value_columns)
)

# 🔹 Step 3: Use selectExpr with dynamic stack
unpivotDF = pivotDF.selectExpr(id_column, stack_expr)

# 🔹 Step 4: Show result
unpivotDF.show(truncate=False)

🧾 Output:

+--------+-------+-----+
|Name    |Subject|Score|
+--------+-------+-----+
|Himanshu|Math   |85   |
|Himanshu|Science|90   |
|Amit    |Math   |78   |
|Amit    |Science|83   |
|Neha    |Math   |92   |
|Neha    |Science|88   |
+--------+-------+-----+

🔁 Summary

Operation   What It Does                    PySpark Code Example
Pivot       Rows ➡ Columns                  df.groupBy().pivot()
Unpivot     Columns ➡ Rows (using stack)    df.select(expr("stack(...)"))

🎯 Real-world use case:

Scenario: HR wants to see attendance status by day.

Employee    Day       Status
Himanshu    Monday    Present
Himanshu    Tuesday   Absent

Pivoted:

Employee    Monday    Tuesday
Himanshu    Present   Absent

And vice versa (Unpivoted).


Python
# Sample pivoted data (illustrative values)
data = [
    ("Banana", 1000, 400),
    ("Carrots", 1500, 1200),
]
columns = ["Product", "USA", "India"]

pivotDF = spark.createDataFrame(data, columns)

# 🔹 Step 1: Dynamically get all columns to unpivot (except the id column)
id_column = "Product"
value_columns = [c for c in pivotDF.columns if c != id_column]

# 🔹 Step 2: Create stack expression
stack_expr = "stack({0}, {1}) as (Country, Amount)".format(
    len(value_columns),
    ', '.join("'{0}', `{0}`".format(c) for c in value_columns)
)

# 🔹 Step 3: Use selectExpr with dynamic stack
unpivotDF = pivotDF.selectExpr(id_column, stack_expr)

# 🔹 Step 4: Show result
unpivotDF.show(truncate=False)
