Joins in PySpark

Joins allow us to combine two or more DataFrames based on a common column, enabling efficient data processing and analysis.

1. PySpark Join Types

Below are the different Join Types PySpark supports.

Join String                          Equivalent SQL Join
inner                                INNER JOIN
outer, full, fullouter, full_outer   FULL OUTER JOIN
left, leftouter, left_outer          LEFT JOIN
right, rightouter, right_outer       RIGHT JOIN
cross                                CROSS JOIN
anti, leftanti, left_anti            LEFT ANTI JOIN
semi, leftsemi, left_semi            LEFT SEMI JOIN
  • Inner Join: Returns only the rows with matching keys in both DataFrames.
  • Left Join: Returns all rows from the left DataFrame and matching rows from the right DataFrame.
  • Right Join: Returns all rows from the right DataFrame and matching rows from the left DataFrame.
  • Full Outer Join: Returns all rows from both DataFrames, including matching and non-matching rows.
  • Left Semi Join: Returns all rows from the left DataFrame where there is a match in the right DataFrame.
  • Left Anti Join: Returns all rows from the left DataFrame where there is no match in the right DataFrame.

2. Creating Sample DataFrames

Before performing joins, we need to create sample DataFrames to work with:

Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Creating Employee DataFrame (illustrative sample data)
data1 = [(1, "Alice", "HR"), (2, "Bob", "IT"), (3, "Carol", "Finance"), (4, "Dave", "IT")]
columns1 = ["emp_id", "name", "dept"]
df1 = spark.createDataFrame(data1, columns1)

# Creating Department DataFrame (illustrative sample data)
data2 = [("HR", "Human Resources"), ("IT", "Information Technology"), ("Marketing", "Marketing")]
columns2 = ["dept", "dept_name"]
df2 = spark.createDataFrame(data2, columns2)

df1.show()
df2.show()

3. Different Types of Joins in PySpark

3.1 Inner Join

An Inner Join returns only the matching rows from both DataFrames.

Python
inner_join_df = df1.join(df2, on="dept", how="inner")
inner_join_df.show()

3.2 Left Join (Left Outer Join)

A Left Join returns all records from the left DataFrame and the matched records from the right DataFrame. Unmatched rows will have NULL values.

Python
left_join_df = df1.join(df2, on="dept", how="left")
left_join_df.show()

3.3 Right Join (Right Outer Join)

A Right Join returns all records from the right DataFrame and the matched records from the left DataFrame.

Python
right_join_df = df1.join(df2, on="dept", how="right")
right_join_df.show()

3.4 Full Outer Join

A Full Outer Join returns all records from both DataFrames. Unmatched records will have NULL values.

Python
full_outer_join_df = df1.join(df2, on="dept", how="outer")
full_outer_join_df.show()

3.5 Cross Join

A Cross Join returns the Cartesian product of both DataFrames.

Python
cross_join_df = df1.crossJoin(df2)
cross_join_df.show()

3.6 Left Semi Join

A Left Semi Join returns only the rows from the left DataFrame where there is a match in the right DataFrame.

Python
semi_join_df = df1.join(df2, on="dept", how="left_semi")
semi_join_df.show()

3.7 Left Anti Join

A Left Anti Join returns only the rows from the left DataFrame that have no match in the right DataFrame.

Python
anti_join_df = df1.join(df2, on="dept", how="left_anti")
anti_join_df.show()

PySpark Self Join

No join discussion is complete without a self join. PySpark has no dedicated self-join type, but you can join a DataFrame with itself by giving each side an alias.

Python
# Self join: assumes empDF has emp_id, name, and superior_emp_id columns
empDF.alias("emp1").join(empDF.alias("emp2"), \
    col("emp1.superior_emp_id") == col("emp2.emp_id"), "inner") \
    .select(col("emp1.emp_id"),col("emp1.name"), \
      col("emp2.emp_id").alias("superior_emp_id"), \
      col("emp2.name").alias("superior_emp_name")) \
   .show(truncate=False)

Multiple Joins

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Creating Employee DataFrame (illustrative sample data)
data1 = [(1, "Alice", "HR"), (2, "Bob", "IT"), (3, "Carol", "Finance")]
columns1 = ["emp_id", "name", "dept"]
df1 = spark.createDataFrame(data1, columns1)

# Creating Department DataFrame (illustrative sample data)
data2 = [("HR", "Human Resources"), ("IT", "Information Technology")]
columns2 = ["dept", "dept_name"]
df2 = spark.createDataFrame(data2, columns2)

# Creating Salary DataFrame (illustrative sample data)
data3 = [(1, 50000), (2, 60000), (3, 55000)]
columns3 = ["emp_id", "salary"]
df3 = spark.createDataFrame(data3, columns3)

Python
multi_join_df = df1.join(df2, on="dept", how="left") \
                   .join(df3, on="emp_id", how="left")

multi_join_df.show()
