
PySpark Built-in Functions

These functions are commonly used with groupBy(), agg(), or select() to compute things like sum, average, max, min, count, etc. PySpark functions come from pyspark.sql.functions, which includes a wide variety of operations like aggregation, date/time, string, and more.


🔹 1. Aggregation Functions

These are used to perform calculations on a group of rows.

| Function | Description | Example |
| --- | --- | --- |
| count() | Count number of rows | df.select(count("*")) |
| sum() | Sum of column values | df.select(sum("salary")) |
| avg() | Average of column values | df.select(avg("salary")) |
| max() | Maximum value | df.select(max("salary")) |
| min() | Minimum value | df.select(min("salary")) |
| mean() | Alias for avg() | df.select(mean("salary")) |
Python
from pyspark.sql.functions import count, sum, avg, max, min

df.select(count("*"), sum("salary"), avg("salary")).show()

🔹 2. String Functions

Manipulate string columns.

| Function | Description | Example |
| --- | --- | --- |
| lower() | Convert to lowercase | df.select(lower("name")) |
| upper() | Convert to uppercase | df.select(upper("name")) |
| length() | String length | df.select(length("name")) |
| substring() | Extract a substring | df.select(substring("name", 1, 3)) |
| concat() | Concatenate strings | df.select(concat(col("fname"), col("lname"))) |
| trim() | Remove leading and trailing spaces | df.select(trim("name")) |
| lpad() / rpad() | Pad strings to a fixed length | df.select(lpad("id", 5, "0")) |
Python
from pyspark.sql.functions import lower, upper, length, concat, lit, col

df.select(lower("name"), upper("name"), length("name")).show()
df.select(concat(col("fname"), lit(" "), col("lname"))).show()

🔹 3. Date and Time Functions

| Function | Description | Example |
| --- | --- | --- |
| current_date() | Current date | df.select(current_date()) |
| current_timestamp() | Current timestamp | df.select(current_timestamp()) |
| date_add() | Add days to a date | df.select(date_add(col("start_date"), 10)) |
| date_sub() | Subtract days from a date | df.select(date_sub(col("start_date"), 5)) |
| datediff() | Difference in days | df.select(datediff(col("end_date"), col("start_date"))) |
| year(), month(), dayofmonth() | Extract parts of a date | df.select(year("date")) |
| to_date() | Convert string to date | df.select(to_date("date_string")) |
Python
from pyspark.sql.functions import current_date, datediff, year, col

df.select(current_date(), datediff(col("end_date"), col("start_date")), year("start_date")).show()

🔹 4. Null Handling Functions

| Function | Description | Example |
| --- | --- | --- |
| isNull() | Check for null | df.filter(col("salary").isNull()) |
| fillna() | Replace nulls | df.fillna(0) |
| na.drop() | Drop rows with nulls | df.na.drop() |
| coalesce() | First non-null value | df.select(coalesce(col("col1"), col("col2"))) |
Python
from pyspark.sql.functions import coalesce, col

df.fillna({'salary': 0}).show()
df.select(coalesce(col("bonus"), col("salary"))).show()

🔹 5. Conditional Functions

| Function | Description | Example |
| --- | --- | --- |
| when() | Like SQL CASE WHEN | df.select(when(col("age") > 18, "Adult").otherwise("Child")) |
| lit() | Add a constant value | df.select(lit("Hello")) |
Python
from pyspark.sql.functions import when, lit, col

df.select(when(col("age") >= 18, "Adult").otherwise("Minor")).show()

🔹 6. Window Functions (Used with Window spec)

| Function | Description | Example |
| --- | --- | --- |
| row_number() | Row number within window | row_number().over(Window.partitionBy("dept").orderBy("salary")) |
| rank() | Rank within window (ties leave gaps) | rank().over(Window.partitionBy("dept").orderBy("salary")) |
| dense_rank() | Rank without gaps | dense_rank().over(Window.partitionBy("dept").orderBy("salary")) |

Note that these ranking functions require an ordered window, so the Window spec needs an orderBy().
Python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("row_num", row_number().over(windowSpec)).show()

🔹 7. Collection Functions

| Function | Description | Example |
| --- | --- | --- |
| array() | Create an array column | df.select(array("col1", "col2")) |
| explode() | Explode an array into rows | df.select(explode("hobbies")) |
| size() | Get array size | df.select(size("hobbies")) |
Python
from pyspark.sql.functions import array, explode, size

df.select(array("col1", "col2")).show()
df.select(explode("hobbies")).show()

🔹 8. JSON Functions

| Function | Description | Example |
| --- | --- | --- |
| get_json_object() | Extract a JSON field | get_json_object(col("json_col"), "$.field") |
| from_json() | Parse JSON string to struct | from_json(col("json_col"), schema) |
| to_json() | Struct to JSON string | to_json(struct("col1", "col2")) |

🔹 9. Others

| Function | Description | Example |
| --- | --- | --- |
| col() | Reference a column | col("name") |
| expr() | Evaluate a SQL expression string | df.select(expr("salary * 0.1")) |
| monotonically_increasing_id() | Generate unique IDs | df.withColumn("id", monotonically_increasing_id()) |

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data and schema (illustrative values)
simpleData = [("James", "Sales", 3000), ("Anna", "HR", 4000)]
schema = ["name", "department", "salary"]

df = spark.createDataFrame(data=simpleData, schema=schema)
df.printSchema()
df.show(truncate=False)
