PySpark, the Python API for Apache Spark, provides multiple ways to apply functions to DataFrame columns. This flexibility allows data engineers and analysts to perform transformations efficiently.
In PySpark, apply-style functions are used to perform various kinds of transformations on DataFrame columns. They let us modify data through built-in functions (such as upper(), lower(), trim(), etc.) or through custom user-defined functions (UDFs).
Creating a Spark DataFrame
First, let's create a sample DataFrame to demonstrate function application:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
columns = ["Seqno", "Name"]
data = [("1", "john jones"),
        ("2", "tracey smith"),
        ("3", "amy sanders")]
df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)
Applying Functions in PySpark
1. Using withColumn()
The withColumn() function applies a transformation and returns a new DataFrame with the result as a new column (or replacing an existing column of the same name).
from pyspark.sql.functions import upper
df.withColumn("Upper_Name", upper(df.Name)).show()
2. Using select()
Functions can also be applied within select(), returning a DataFrame with only the specified columns.
df.select("Seqno", "Name", upper(df.Name)).show()
3. Using SQL Expressions
SQL expressions can be leveraged using Spark SQL queries. To use this approach, we first register the DataFrame as a temporary view.
df.createOrReplaceTempView("TAB")
spark.sql("SELECT Seqno, Name, UPPER(Name) FROM TAB").show()
Creating and Using User-Defined Functions (UDFs)
PySpark allows defining custom functions using UDFs (User-Defined Functions) when built-in functions are insufficient.
1. Defining a Custom Function
def upperCase(s):
    return s.upper()
2. Converting Function to UDF
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
upperCaseUDF = udf(upperCase, StringType())  # wrapping in a lambda is unnecessary
3. Applying UDF Using withColumn()
df.withColumn("Curated_Name", upperCaseUDF(col("Name"))).show(truncate=False)
4. Applying UDF Using select()
df.select(col("Seqno"), upperCaseUDF(col("Name")).alias("Upper_Name")).show(truncate=False)
5. Registering UDF for SQL Queries
spark.udf.register("upperCaseUDF", upperCaseUDF)
df.createOrReplaceTempView("TAB")
spark.sql("SELECT Seqno, Name, upperCaseUDF(Name) FROM TAB").show()