These functions are commonly used with groupBy(), agg(), or select() to compute values such as sum, average, max, min, and count. PySpark's built-in functions live in the pyspark.sql.functions module, which covers aggregation, date/time, string, null-handling, and many other operations.
🔹 1. Aggregation Functions
These are used to perform calculations on a group of rows.
| Function | Description | Example |
|---|---|---|
| count() | Count number of rows | df.select(count("*")) |
| sum() | Sum of column values | df.select(sum("salary")) |
| avg() | Average of column values | df.select(avg("salary")) |
| max() | Maximum value | df.select(max("salary")) |
| min() | Minimum value | df.select(min("salary")) |
| mean() | Alias for avg() | df.select(mean("salary")) |
from pyspark.sql.functions import count, sum, avg, max, min
df.select(count("*"), sum("salary"), avg("salary")).show()
🔹 2. String Functions
Manipulate string columns.
| Function | Description | Example |
|---|---|---|
| lower() | Convert to lowercase | df.select(lower("name")) |
| upper() | Convert to uppercase | df.select(upper("name")) |
| length() | String length | df.select(length("name")) |
| substring() | Extract substring | df.select(substring("name", 1, 3)) |
| concat() | Concatenate strings | df.select(concat(col("fname"), col("lname"))) |
| trim() | Remove surrounding spaces | df.select(trim("name")) |
| lpad() / rpad() | Pad strings to a fixed length | df.select(lpad("id", 5, "0")) |
from pyspark.sql.functions import lower, upper, length, concat, lit, col
df.select(lower("name"), upper("name"), length("name")).show()
df.select(concat(col("fname"), lit(" "), col("lname"))).show()
🔹 3. Date and Time Functions
| Function | Description | Example |
|---|---|---|
| current_date() | Current date | df.select(current_date()) |
| current_timestamp() | Current timestamp | df.select(current_timestamp()) |
| date_add() | Add days to a date | df.select(date_add(col("start_date"), 10)) |
| date_sub() | Subtract days | df.select(date_sub(col("start_date"), 5)) |
| datediff() | Difference in days | df.select(datediff(col("end_date"), col("start_date"))) |
| year(), month(), dayofmonth() | Extract parts of a date | df.select(year("date")) |
| to_date() | Convert string to date | df.select(to_date("date_string")) |
from pyspark.sql.functions import current_date, datediff, year, col
df.select(current_date(), datediff(col("end_date"), col("start_date")), year("start_date")).show()
🔹 4. Null Handling Functions
| Function | Description | Example |
|---|---|---|
| isNull() | Check for null (Column method) | df.filter(col("salary").isNull()) |
| fillna() | Replace nulls (DataFrame method) | df.fillna(0) |
| na.drop() | Drop rows with nulls | df.na.drop() |
| coalesce() | First non-null value | df.select(coalesce(col("col1"), col("col2"))) |
from pyspark.sql.functions import coalesce, col
df.fillna({'salary': 0}).show()
df.select(coalesce(col("bonus"), col("salary"))).show()
🔹 5. Conditional Functions
| Function | Description | Example |
|---|---|---|
| when() | Like SQL CASE WHEN | df.select(when(col("age") > 18, "Adult").otherwise("Child")) |
| lit() | Add a constant value | df.select(lit("Hello")) |
from pyspark.sql.functions import when, lit, col
df.select(when(col("age") >= 18, "Adult").otherwise("Minor")).show()
🔹 6. Window Functions (Used with Window spec)
| Function | Description | Example |
|---|---|---|
| row_number() | Row number in window | row_number().over(Window.partitionBy("dept").orderBy("salary")) |
| rank() | Rank within window (gaps after ties) | rank().over(Window.partitionBy("dept").orderBy("salary")) |
| dense_rank() | Rank without gaps | dense_rank().over(Window.partitionBy("dept").orderBy("salary")) |
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("row_num", row_number().over(windowSpec)).show()
🔹 7. Collection Functions
| Function | Description | Example |
|---|---|---|
| array() | Create an array from columns | df.select(array("col1", "col2")) |
| explode() | Explode array into rows | df.select(explode("hobbies")) |
| size() | Get array size | df.select(size("hobbies")) |
from pyspark.sql.functions import array, explode, size
df.select(array("col1", "col2")).show()
df.select(explode("hobbies")).show()
🔹 8. JSON Functions
| Function | Description | Example |
|---|---|---|
| get_json_object() | Extract a JSON field | get_json_object(col("json_col"), "$.field") |
| from_json() | Parse JSON string to struct | from_json(col("json_col"), schema) |
| to_json() | Convert struct to JSON string | to_json(struct("col1", "col2")) |
🔹 9. Others
| Function | Description | Example |
|---|---|---|
| col() | Reference a column | col("name") |
| expr() | Evaluate a SQL expression | df.select(expr("salary * 0.1")) |
| monotonically_increasing_id() | Generate a unique ID | df.withColumn("id", monotonically_increasing_id()) |
# Build a sample DataFrame (the original data was omitted; these values are illustrative)
simpleData = [("James", "Sales", 3000),
              ("Michael", "Sales", 4600),
              ("Robert", "IT", 4100)]
schema = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=simpleData, schema=schema)
df.printSchema()
df.show(truncate=False)