Joins in PySpark

Joins allow us to combine two or more DataFrames based on a common column, enabling efficient data processing and analysis.

1. PySpark Join Types

Below are the different Join Types PySpark supports.

Join String                          Equivalent SQL Join
inner                                INNER JOIN
outer, full, fullouter, full_outer   FULL OUTER JOIN
left, leftouter, left_outer          LEFT JOIN
right, rightouter, right_outer       RIGHT JOIN
cross                                CROSS JOIN
anti, leftanti, left_anti            LEFT ANTI JOIN
semi, leftsemi, left_semi            LEFT SEMI JOIN
  • Inner Join: Returns only the rows with matching keys in both DataFrames.
  • Left Join: Returns all rows from the left DataFrame and matching rows from the right DataFrame.
  • Right Join: Returns all rows from the right DataFrame and matching rows from the left DataFrame.
  • Full Outer Join: Returns all rows from both DataFrames, including matching and non-matching rows.
  • Left Semi Join: Returns all rows from the left DataFrame where there is a match in the right DataFrame.
  • Left Anti Join: Returns all rows from the left DataFrame where there is no match in the right DataFrame.

2. Creating Sample DataFrames

Before performing joins, we need to create sample DataFrames to work with:

Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Creating Employee DataFrame (illustrative sample data)
data1 = [(1, "Alice", "HR"), (2, "Bob", "IT"), (3, "Carol", "Finance"), (4, "Dave", "IT")]
columns1 = ["emp_id", "name", "dept"]
df1 = spark.createDataFrame(data1, columns1)

# Creating Department DataFrame (illustrative sample data)
data2 = [("HR", "Human Resources"), ("IT", "Information Technology"), ("Marketing", "Marketing")]
columns2 = ["dept", "dept_name"]
df2 = spark.createDataFrame(data2, columns2)

df1.show()
df2.show()

3. Different Types of Joins in PySpark

3.1 Inner Join

An Inner Join returns only the matching rows from both DataFrames.

Python
inner_join_df = df1.join(df2, on="dept", how="inner")
inner_join_df.show()

3.2 Left Join (Left Outer Join)

A Left Join returns all records from the left DataFrame and the matched records from the right DataFrame. Unmatched rows will have NULL values.

Python
left_join_df = df1.join(df2, on="dept", how="left")
left_join_df.show()

3.3 Right Join (Right Outer Join)

A Right Join returns all records from the right DataFrame and the matched records from the left DataFrame.

Python
right_join_df = df1.join(df2, on="dept", how="right")
right_join_df.show()

3.4 Full Outer Join

A Full Outer Join returns all records from both DataFrames. Unmatched records will have NULL values.

Python
full_outer_join_df = df1.join(df2, on="dept", how="outer")
full_outer_join_df.show()

3.5 Cross Join

A Cross Join returns the Cartesian product of both DataFrames.

Python
cross_join_df = df1.crossJoin(df2)
cross_join_df.show()

3.6 Left Semi Join

A Left Semi Join returns only the rows from the left DataFrame where there is a match in the right DataFrame.

Python
semi_join_df = df1.join(df2, on="dept", how="left_semi")
semi_join_df.show()

3.7 Left Anti Join

A Left Anti Join returns only the rows from the left DataFrame that have no match in the right DataFrame.

Python
anti_join_df = df1.join(df2, on="dept", how="left_anti")
anti_join_df.show()

PySpark Self Join

No join discussion is complete without a self join. PySpark has no dedicated self-join type, but you can join a DataFrame with itself by giving each side an alias.

Python
# Self join: assumes empDF has emp_id, name, and superior_emp_id columns
empDF.alias("emp1").join(empDF.alias("emp2"), \
    col("emp1.superior_emp_id") == col("emp2.emp_id"), "inner") \
    .select(col("emp1.emp_id"),col("emp1.name"), \
      col("emp2.emp_id").alias("superior_emp_id"), \
      col("emp2.name").alias("superior_emp_name")) \
   .show(truncate=False)

Multiple Joins

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Creating Employee DataFrame (illustrative sample data)
data1 = [(1, "Alice", "HR"), (2, "Bob", "IT"), (3, "Carol", "Finance")]
columns1 = ["emp_id", "name", "dept"]
df1 = spark.createDataFrame(data1, columns1)

# Creating Department DataFrame (illustrative sample data)
data2 = [("HR", "Human Resources"), ("IT", "Information Technology")]
columns2 = ["dept", "dept_name"]
df2 = spark.createDataFrame(data2, columns2)

# Creating Salary DataFrame (illustrative sample data)
data3 = [(1, 50000), (2, 60000), (3, 55000)]
columns3 = ["emp_id", "salary"]
df3 = spark.createDataFrame(data3, columns3)

Python
multi_join_df = df1.join(df2, on="dept", how="left") \
                   .join(df3, on="emp_id", how="left")

multi_join_df.show()
