Back to all posts

Groupby in Pyspark

Function Description Example count() Counts the number of rows per group df.groupBy("col").count() mean() Returns the average value per group df.groupBy("c…

FunctionDescriptionExample
count()Counts the number of rows per groupdf.groupBy("col").count()
mean()Returns the average value per groupdf.groupBy("col").mean("value")
max()Returns the maximum value per groupdf.groupBy("col").max("value")
min()Returns the minimum value per groupdf.groupBy("col").min("value")
sum()Returns the total sum per groupdf.groupBy("col").sum("value")
avg()Returns the average value per groupdf.groupBy("col").avg("value")
agg()Performs multiple aggregations at oncedf.groupBy("col").agg(sum("val"), avg("val"))
pivot()Converts unique values in a column into new columnsdf.groupBy("col").pivot("col2").sum("val")
SQL
# Sample Data
data = 

# Creating DataFrame
columns = 
df = spark.createDataFrame(data, columns)

# Show original DataFrame
print("Original DataFrame:")
df.show()

# 1. Count the number of employees in each department
print("Count of employees per department:")
df.groupBy("Department").count().show()

# 2. Get the mean (average) salary per department
print("Mean (Average) salary per department:")
df.groupBy("Department").mean("Salary").show()

# 3. Get the maximum salary per department
print("Maximum salary per department:")
df.groupBy("Department").max("Salary").show()

# 4. Get the minimum salary per department
print("Minimum salary per department:")
df.groupBy("Department").min("Salary").show()

# 5. Get the total (sum) salary per department
print("Total salary per department:")
df.groupBy("Department").sum("Salary").show()

# 6. Get the average salary per department (same as mean)
print("Average salary per department:")
df.groupBy("Department").avg("Salary").show()

# 7. Perform multiple aggregations at once using agg()
print("Multiple aggregations per department:")
df.groupBy("Department").agg(
    count("Name").alias("Employee_Count"),
    sum("Salary").alias("Total_Salary"),
    avg("Salary").alias("Avg_Salary"),
    max("Salary").alias("Max_Salary"),
    min("Salary").alias("Min_Salary")
).show()

# 8. Pivot: Convert unique values in 'Name' column into new columns
print("Pivot DataFrame based on Employee Names:")
df.groupBy("Department").pivot("Name").sum("Salary").show()

Keep building your data skillset

Explore more SQL, Python, analytics, and engineering tutorials.