Joins allow us to combine two or more DataFrames based on a common column, enabling efficient data processing and analysis.
1. PySpark Join Types
Below are the different Join Types PySpark supports.
| Join String | Equivalent SQL Join |
| --- | --- |
| inner | INNER JOIN |
| outer, full, fullouter, full_outer | FULL OUTER JOIN |
| left, leftouter, left_outer | LEFT JOIN |
| right, rightouter, right_outer | RIGHT JOIN |
| cross | CROSS JOIN |
| anti, leftanti, left_anti | LEFT ANTI JOIN |
| semi, leftsemi, left_semi | LEFT SEMI JOIN |
- Inner Join: Returns only the rows with matching keys in both DataFrames.
- Left Join: Returns all rows from the left DataFrame and matching rows from the right DataFrame.
- Right Join: Returns all rows from the right DataFrame and matching rows from the left DataFrame.
- Full Outer Join: Returns all rows from both DataFrames, including matching and non-matching rows.
- Left Semi Join: Returns all rows from the left DataFrame where there is a match in the right DataFrame.
- Left Anti Join: Returns all rows from the left DataFrame where there is no match in the right DataFrame.
2. Creating Sample DataFrames
Before performing joins, we need to create sample DataFrames to work with:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("joins").getOrCreate()

# Creating Employee DataFrame (sample data for illustration)
data1 = [(1, "Alice", "HR"), (2, "Bob", "IT"), (3, "Charlie", "Sales"), (4, "David", "IT")]
columns1 = ["emp_id", "name", "dept"]
df1 = spark.createDataFrame(data1, columns1)
# Creating Department DataFrame (sample data for illustration)
data2 = [("HR", "Human Resources"), ("IT", "Information Technology"), ("Finance", "Finance")]
columns2 = ["dept", "dept_name"]
df2 = spark.createDataFrame(data2, columns2)
df1.show()
df2.show()

3. Different Types of Joins in PySpark
3.1 Inner Join
An Inner Join returns only the matching rows from both DataFrames.
inner_join_df = df1.join(df2, on="dept", how="inner")
inner_join_df.show()
3.2 Left Join (Left Outer Join)
A Left Join returns all records from the left DataFrame and the matched records from the right DataFrame. Unmatched rows will have NULL values.
left_join_df = df1.join(df2, on="dept", how="left")
left_join_df.show()
3.3 Right Join (Right Outer Join)
A Right Join returns all records from the right DataFrame and the matched records from the left DataFrame.
right_join_df = df1.join(df2, on="dept", how="right")
right_join_df.show()
3.4 Full Outer Join
A Full Outer Join returns all records from both DataFrames. Unmatched records will have NULL values.
full_outer_join_df = df1.join(df2, on="dept", how="outer")
full_outer_join_df.show()
3.5 Cross Join
A Cross Join returns the Cartesian product of both DataFrames.
cross_join_df = df1.crossJoin(df2)
cross_join_df.show()
3.6 Left Semi Join
A Left Semi Join returns only the rows from the left DataFrame where there is a match in the right DataFrame. The result contains only the left DataFrame's columns.
semi_join_df = df1.join(df2, on="dept", how="left_semi")
semi_join_df.show()
3.7 Left Anti Join
A Left Anti Join returns only the rows from the left DataFrame that have no match in the right DataFrame. Like a semi join, the result contains only the left DataFrame's columns.
anti_join_df = df1.join(df2, on="dept", how="left_anti")
anti_join_df.show()
4. PySpark Self Join
No discussion of joins is complete without a self join. PySpark has no dedicated self-join type; instead, alias the same DataFrame twice and join it to itself.
from pyspark.sql.functions import col

# Sample employee data with a superior_emp_id column (for illustration)
empDF = spark.createDataFrame(
    [(1, "Smith", -1), (2, "Rose", 1), (3, "Williams", 1), (4, "Jones", 2)],
    ["emp_id", "name", "superior_emp_id"])

# Self join: alias the same DataFrame twice and match employee to superior
empDF.alias("emp1").join(empDF.alias("emp2"), \
        col("emp1.superior_emp_id") == col("emp2.emp_id"), "inner") \
    .select(col("emp1.emp_id"), col("emp1.name"), \
            col("emp2.emp_id").alias("superior_emp_id"), \
            col("emp2.name").alias("superior_emp_name")) \
    .show(truncate=False)
5. Multiple Joins
# Creating Employee DataFrame (sample data for illustration)
data1 = [(1, "Alice", "HR"), (2, "Bob", "IT"), (3, "Charlie", "Sales")]
columns1 = ["emp_id", "name", "dept"]
df1 = spark.createDataFrame(data1, columns1)
# Creating Department DataFrame (sample data for illustration)
data2 = [("HR", "Human Resources"), ("IT", "Information Technology")]
columns2 = ["dept", "dept_name"]
df2 = spark.createDataFrame(data2, columns2)
# Creating Salary DataFrame (sample data for illustration)
data3 = [(1, 50000), (2, 60000), (3, 55000)]
columns3 = ["emp_id", "salary"]
df3 = spark.createDataFrame(data3, columns3)
multi_join_df = df1.join(df2, on="dept", how="left") \
.join(df3, on="emp_id", how="left")
multi_join_df.show()