Back to all posts

union(), unionAll(), and unionByName() in PySpark

Here's the corrected explanation of union() , unionAll() , and unionByName() in PySpark along with appropriate examples. 1. union() The union() method is u…

Here's the corrected explanation of union(), unionAll(), and unionByName() in PySpark along with appropriate examples.


1. union()

  • The union() method is used to merge two DataFrames with the same schema (same columns in the same order).
  • It retains duplicates (does NOT remove them).
  • If the schemas do not match, it throws an error.

Example:

SQL
from pyspark.sql import SparkSession, Row

# Initialize Spark session
spark = SparkSession.builder.appName("UnionExample").getOrCreate()

# Create first DataFrame
df1 = spark.createDataFrame()

# Create second DataFrame
df2 = spark.createDataFrame()

# Perform union
df_union = df1.union(df2)

df_union.show()

Output:

SQL
+---+-------+
| id|  name |
+---+-------+
|  1| Alice |
|  2|   Bob |
|  2|   Bob |
|  3|Charlie|
+---+-------+
  • The duplicate row (2, Bob) is not removed.

2. unionAll()

  • unionAll() was used in older versions of Spark (< 2.0.0).
  • In Spark 2.0 and later, unionAll() is deprecated and replaced by union(), which works the same way.
  • It retains duplicate records just like union().

So, instead of using unionAll(), use union() in modern PySpark.


3. unionByName()

  • unionByName() is used to merge two DataFrames by matching column names instead of column order.
  • It allows merging DataFrames with different column orders or DataFrames with missing columns.
  • If there are missing columns, they will be filled with null.
  • To handle missing columns, you must set allowMissingColumns=True.

Example:

SQL
# Create a DataFrame with different column order and an extra column
df3 = spark.createDataFrame()

df1.show()
df3.show()

# Perform unionByName with allowMissingColumns=True
df_unionByName = df1.unionByName(df3, allowMissingColumns=True)

df_unionByName.show()

Output:

SQL
+---+-------+----+
| id|  name |age |
+---+-------+----+
|  1| Alice |null|
|  2|   Bob |null|
|  4| David |  30|
|  5|   Eve |  25|
+---+-------+----+
  • df1 did not have an age column, so the missing column is filled with null.

Key Differences:

FunctionRetains Duplicates?Matches by Column Name?Handles Different Column Order?Fills Missing Columns?
union()✅ Yes❌ No❌ No❌ No
unionAll()✅ Yes (Deprecated)❌ No❌ No❌ No
unionByName()✅ Yes✅ Yes✅ Yes✅ Yes (if allowMissingColumns=True)

Which One to Use?

  • Use union() if the schemas match exactly and you just need to merge them.
  • Use unionByName() if column orders are different or some columns are missing.
  • Avoid unionAll() because it's deprecated (use union() instead).

Keep building your data skillset

Explore more SQL, Python, analytics, and engineering tutorials.