📌 What is substring()?
The substring() function in PySpark is used to extract a portion of a string from a column in a DataFrame.
It is part of the PySpark SQL functions module and is typically used when you need to extract specific segments (like year, month, or day) from a date string or structured text column.
🧪 Syntax
substring(column, start_position, length)
Parameters:
column: The column name or a column expression (as a string orcol()object)start_position: The position (1-based index) to start the substring fromlength: How many characters to extract
✅ Note: Unlike Python indexing (which starts at 0), substring in PySpark starts from 1.
✅ Example Use Case
Suppose we have a date column in string format: "20230430" (which means 30th April 2023). We want to extract:
- Year: first 4 characters →
"2023" - Month: next 2 characters →
"04" - Day: last 2 characters →
"30"
PySpark Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import substring
spark = SparkSession.builder.appName("substring_example").getOrCreate()
data =
df = spark.createDataFrame(data, )
df_transformed = df.withColumn("year", substring("date", 1, 4))\
.withColumn("month", substring("date", 5, 2))\
.withColumn("day", substring("date", 7, 2))
df_transformed.show()
Output:
+--------+----+-----+----+
| date |year|month| day|
+--------+----+-----+----+
|20230430|2023| 04 | 30 |
|20221215|2022| 12 | 15 |
+--------+----+-----+----+
📊 Real-Time Scenario: Banking Logs
Imagine a bank stores transaction IDs like this:
TX20240125XYZ9876
Where:
"2024"is the year of transaction"0125"is the MMDD (month-day)"XYZ"is the transaction type"9876"is a transaction number
You can use substring() to extract each part for reporting or classification.
🔁 Common Use Cases
| Use Case | Example |
|---|---|
| Extracting year/month/day | substring("20230430", 1, 4) → "2023" |
| Parsing log codes | substring("TX20240125XYZ9876", 3, 4) → "2024" |
| Splitting concatenated values | Extract area code from "IN-DEL-01123" |
| Data cleaning | Remove unwanted suffixes/prefixes |
🆚 substring() vs substr()
Both are similar but differ slightly in how they're used:
substring()is a function:substring("column", start, length)substr()is a method on column:col("column").substr(start, length)
Both start counting from 1.
🧾 Summary
substring()is a powerful string function in PySpark used to extract parts of a string column.- It's useful in parsing date strings, IDs, codes, and other structured text.
- It is part of the
pyspark.sql.functionsmodule.