Applying Functions in PySpark

PySpark, the Python API for Apache Spark, provides multiple ways to apply functions to DataFrame columns. This flexibility allows data engineers and analysts to perform transformations efficiently.

In PySpark, function application is used to perform various kinds of transformations on DataFrame columns. It lets us transform data either with built-in functions (such as upper(), lower(), trim(), and so on) or with custom user-defined functions (UDFs).

Creating a Spark DataFrame

First, let's create a sample DataFrame to demonstrate function application:

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
columns = ["Seqno", "Name"]  # column names used throughout the examples below
data = [("1", "john jones"),  # sample rows; any lowercase names work here
        ("2", "tracey smith"),
        ("3", "amy sanders")]

df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)

Applying Built-in Functions

1. Using withColumn()

The withColumn() function applies a transformation to a column and returns a new DataFrame with the result stored as a new column (or replacing an existing one of the same name).

Python
from pyspark.sql.functions import upper

df.withColumn("Upper_Name", upper(df.Name)).show()

2. Using select()

Functions can also be applied within select(), returning a DataFrame with only the specified columns.

Python
df.select("Seqno", "Name", upper(df.Name)).show()

3. Using SQL Expressions

Transformations can also be written as Spark SQL queries. To use this approach, first register the DataFrame as a temporary view.

Python
df.createOrReplaceTempView("TAB")
spark.sql("SELECT Seqno, Name, UPPER(Name) FROM TAB").show()

Creating and Using User-Defined Functions (UDFs)

When the built-in functions are not enough, PySpark lets you define custom transformation logic as user-defined functions (UDFs). Note that Python UDFs run slower than built-in functions, since row data must be serialized between the JVM and the Python interpreter, so prefer built-ins where possible.

1. Defining a Custom Function

Python
def upperCase(text):  # avoid naming the parameter "str", which shadows the builtin
    return text.upper()
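One caveat worth noting: if the Name column contains nulls, Spark passes None into the function and calling .upper() on it raises an AttributeError. A null-safe variant (the name upper_case here is illustrative) can guard against that:

```python
def upper_case(value):
    # Guard against null column values: Spark passes None for SQL NULL,
    # and calling .upper() on None would raise an AttributeError
    if value is None:
        return None
    return value.upper()
```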

2. Converting Function to UDF

Python
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

upperCaseUDF = udf(upperCase, StringType())  # pass the function directly; no lambda needed

3. Applying UDF Using withColumn()

Python
df.withColumn("Curated_Name", upperCaseUDF(col("Name"))).show(truncate=False)

4. Applying UDF Using select()

Python
df.select(col("Seqno"), upperCaseUDF(col("Name")).alias("Upper_Name")).show(truncate=False)

5. Registering UDF for SQL Queries

Python
spark.udf.register("upperCaseUDF", upperCaseUDF)
df.createOrReplaceTempView("TAB")
spark.sql("SELECT Seqno, Name, upperCaseUDF(Name) FROM TAB").show()
