Here's the corrected explanation of union(), unionAll(), and unionByName() in PySpark along with appropriate examples.
1. union()
- The
union()method is used to merge two DataFrames with the same schema (same columns in the same order). - It retains duplicates (does NOT remove them).
- If the schemas do not match, it throws an error.
Example:
from pyspark.sql import SparkSession, Row
# Initialize Spark session
spark = SparkSession.builder.appName("UnionExample").getOrCreate()
# Create first DataFrame
df1 = spark.createDataFrame()
# Create second DataFrame
df2 = spark.createDataFrame()
# Perform union
df_union = df1.union(df2)
df_union.show()
Output:
+---+-------+
| id| name |
+---+-------+
| 1| Alice |
| 2| Bob |
| 2| Bob |
| 3|Charlie|
+---+-------+
- The duplicate row
(2, Bob)is not removed.
2. unionAll()
unionAll()was used in older versions of Spark (< 2.0.0).- In Spark 2.0 and later,
unionAll()is deprecated and replaced byunion(), which works the same way. - It retains duplicate records just like
union().
So, instead of using unionAll(), use union() in modern PySpark.
3. unionByName()
unionByName()is used to merge two DataFrames by matching column names instead of column order.- It allows merging DataFrames with different column orders or DataFrames with missing columns.
- If there are missing columns, they will be filled with
null. - To handle missing columns, you must set
allowMissingColumns=True.
Example:
# Create a DataFrame with different column order and an extra column
df3 = spark.createDataFrame()
df1.show()
df3.show()
# Perform unionByName with allowMissingColumns=True
df_unionByName = df1.unionByName(df3, allowMissingColumns=True)
df_unionByName.show()
Output:
+---+-------+----+
| id| name |age |
+---+-------+----+
| 1| Alice |null|
| 2| Bob |null|
| 4| David | 30|
| 5| Eve | 25|
+---+-------+----+
df1did not have anagecolumn, so the missing column is filled withnull.
Key Differences:
| Function | Retains Duplicates? | Matches by Column Name? | Handles Different Column Order? | Fills Missing Columns? |
|---|---|---|---|---|
union() | ✅ Yes | ❌ No | ❌ No | ❌ No |
unionAll() | ✅ Yes (Deprecated) | ❌ No | ❌ No | ❌ No |
unionByName() | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes (if allowMissingColumns=True) |
Which One to Use?
- Use
union()if the schemas match exactly and you just need to merge them. - Use
unionByName()if column orders are different or some columns are missing. - Avoid
unionAll()because it's deprecated (useunion()instead).