These functions are commonly used with groupBy(), agg(), or select() to compute values such as sum, average, max, min, and count. PySpark's built-in functions live in the pyspark.sql.functions module, which covers aggregation, date/time, string, null-handling, and many other operations.
🔹 1. Aggregation Functions
These are used to perform calculations on a group of rows.
| Function | Description | Example |
|---|---|---|
| count() | Count number of rows | df.select(count("*")) |
| sum() | Sum of column values | df.select(sum("salary")) |
| avg() | Average of column values | df.select(avg("salary")) |
| max() | Maximum value | df.select(max("salary")) |
| min() | Minimum value | df.select(min("salary")) |
| mean() | Alias for avg() | df.select(mean("salary")) |
from pyspark.sql.functions import count, sum, avg, max, min
df.select(count("*"), sum("salary"), avg("salary")).show()
🔹 2. String Functions
Manipulate string columns.
| Function | Description | Example |
|---|---|---|
| lower() | Convert to lowercase | df.select(lower("name")) |
| upper() | Convert to uppercase | df.select(upper("name")) |
| length() | String length | df.select(length("name")) |
| substring() | Extract substring | df.select(substring("name", 1, 3)) |
| concat() | Concatenate strings | df.select(concat(col("fname"), col("lname"))) |
| trim() | Remove surrounding spaces | df.select(trim("name")) |
| lpad() / rpad() | Pad strings to a fixed length | df.select(lpad("id", 5, "0")) |
from pyspark.sql.functions import lower, upper, length, concat, lit, col
df.select(lower("name"), upper("name"), length("name")).show()
df.select(concat(col("fname"), lit(" "), col("lname"))).show()
🔹 3. Date and Time Functions
| Function | Description | Example |
|---|---|---|
| current_date() | Current date | df.select(current_date()) |
| current_timestamp() | Current timestamp | df.select(current_timestamp()) |
| date_add() | Add days to a date | df.select(date_add(col("start_date"), 10)) |
| date_sub() | Subtract days | df.select(date_sub(col("start_date"), 5)) |
| datediff() | Difference in days | df.select(datediff(col("end_date"), col("start_date"))) |
| year(), month(), dayofmonth() | Extract parts of a date | df.select(year("date")) |
| to_date() | Convert string to date | df.select(to_date("date_string")) |
from pyspark.sql.functions import current_date, datediff, year, col
df.select(current_date(), datediff(col("end_date"), col("start_date")), year("start_date")).show()
🔹 4. Null Handling Functions
| Function | Description | Example |
|---|---|---|
| isNull() | Check for null (Column method) | df.filter(col("salary").isNull()) |
| fillna() | Replace nulls (DataFrame method) | df.fillna(0) |
| na.drop() | Drop rows with nulls | df.na.drop() |
| coalesce() | First non-null value | df.select(coalesce(col("col1"), col("col2"))) |
from pyspark.sql.functions import coalesce, col
df.fillna({'salary': 0}).show()
df.select(coalesce(col("bonus"), col("salary"))).show()
🔹 5. Conditional Functions
| Function | Description | Example |
|---|---|---|
| when() | Like SQL CASE WHEN | df.select(when(col("age") > 18, "Adult").otherwise("Child")) |
| lit() | Add a constant value | df.select(lit("Hello")) |
from pyspark.sql.functions import when, lit, col
df.select(when(col("age") >= 18, "Adult").otherwise("Minor")).show()
🔹 6. Window Functions (Used with Window spec)
| Function | Description | Example |
|---|---|---|
| row_number() | Row number in window | row_number().over(Window.partitionBy("dept").orderBy("salary")) |
| rank() | Rank within window (gaps after ties) | rank().over(Window.partitionBy("dept").orderBy("salary")) |
| dense_rank() | Rank without gaps | dense_rank().over(Window.partitionBy("dept").orderBy("salary")) |
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("row_num", row_number().over(windowSpec)).show()
🔹 7. Collection Functions
| Function | Description | Example |
|---|---|---|
| array() | Create an array from columns | df.select(array("col1", "col2")) |
| explode() | Explode array into rows | df.select(explode("hobbies")) |
| size() | Get array size | df.select(size("hobbies")) |
from pyspark.sql.functions import array, explode, size
df.select(array("col1", "col2")).show()
df.select(explode("hobbies")).show()
🔹 8. JSON Functions
| Function | Description | Example |
|---|---|---|
| get_json_object() | Extract a JSON field | get_json_object(col("json_col"), "$.field") |
| from_json() | Parse JSON string to struct | from_json(col("json_col"), schema) |
| to_json() | Convert struct to JSON string | to_json(struct("col1", "col2")) |
🔹 9. Others
| Function | Description | Example |
|---|---|---|
| col() | Reference a column | col("name") |
| expr() | Evaluate a SQL expression | df.select(expr("salary * 0.1")) |
| monotonically_increasing_id() | Generate a unique ID | df.withColumn("id", monotonically_increasing_id()) |
# Build a sample DataFrame (the original data was omitted; these values are illustrative)
simpleData = [("James", "Sales", 3000),
              ("Michael", "Sales", 4600),
              ("Robert", "IT", 4100)]
schema = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=simpleData, schema=schema)
df.printSchema()
df.show(truncate=False)