Back to all posts

substring() in PySpark

📌 What is substring() ? The substring() function in PySpark is used to extract a portion of a string from a column in a DataFrame. It is part of the PySpa…

📌 What is substring()?

The substring() function in PySpark is used to extract a portion of a string from a column in a DataFrame.

It is part of the PySpark SQL functions module and is typically used when you need to extract specific segments (like year, month, or day) from a date string or structured text column.


🧪 Syntax

SQL
substring(column, start_position, length)

Parameters:

  • column: The column name or a column expression (as a string or col() object)
  • start_position: The position (1-based index) to start the substring from
  • length: How many characters to extract

✅ Note: Unlike Python indexing (which starts at 0), substring in PySpark starts from 1.


✅ Example Use Case

Suppose we have a date column in string format: "20230430" (which means 30th April 2023). We want to extract:

  • Year: first 4 characters → "2023"
  • Month: next 2 characters → "04"
  • Day: last 2 characters → "30"

PySpark Code:

SQL
from pyspark.sql import SparkSession
from pyspark.sql.functions import substring

spark = SparkSession.builder.appName("substring_example").getOrCreate()

data = 
df = spark.createDataFrame(data, )

df_transformed = df.withColumn("year", substring("date", 1, 4))\
                   .withColumn("month", substring("date", 5, 2))\
                   .withColumn("day", substring("date", 7, 2))

df_transformed.show()

Output:

SQL
+--------+----+-----+----+
|   date |year|month| day|
+--------+----+-----+----+
|20230430|2023|  04 | 30 |
|20221215|2022|  12 | 15 |
+--------+----+-----+----+

📊 Real-Time Scenario: Banking Logs

Imagine a bank stores transaction IDs like this:

Plain Text
TX20240125XYZ9876

Where:

  • "2024" is the year of transaction
  • "0125" is the MMDD (month-day)
  • "XYZ" is the transaction type
  • "9876" is a transaction number

You can use substring() to extract each part for reporting or classification.


🔁 Common Use Cases

Use CaseExample
Extracting year/month/daysubstring("20230430", 1, 4) → "2023"
Parsing log codessubstring("TX20240125XYZ9876", 3, 4) → "2024"
Splitting concatenated valuesExtract area code from "IN-DEL-01123"
Data cleaningRemove unwanted suffixes/prefixes

🆚 substring() vs substr()

Both are similar but differ slightly in how they're used:

  • substring() is a function: substring("column", start, length)
  • substr() is a method on column: col("column").substr(start, length)

Both start counting from 1.


🧾 Summary

  • substring() is a powerful string function in PySpark used to extract parts of a string column.
  • It's useful in parsing date strings, IDs, codes, and other structured text.
  • It is part of the pyspark.sql.functions module.

Keep building your data skillset

Explore more SQL, Python, analytics, and engineering tutorials.