
drop(), dropDuplicates(), and distinct() in PySpark


🔹 1. drop() – Removing Columns

The drop() function is used to remove one or more columns from a DataFrame.

👉 Example: Removing a Single Column

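Python
from pyspark.sql import Row, SparkSession

# Skip this line if your environment (notebook, pyspark shell) already provides `spark`
spark = SparkSession.builder.getOrCreate()

# Sample data (the original values were lost; these ages are illustrative)
data = [
    Row(id=1, name="Alice", age=25),
    Row(id=2, name="Bob", age=30),
    Row(id=3, name="Charlie", age=35),
]

df = spark.createDataFrame(data)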

# Drop the 'age' column
df_dropped = df.drop("age")
df_dropped.show()

Output:

SQL
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+

👉 Example: Removing Multiple Columns

Python
df_dropped_multi = df.drop("age", "name")
df_dropped_multi.show()

Output:

SQL
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+

🔹 2. dropDuplicates() – Removing Duplicate Rows

The dropDuplicates() function is used to remove duplicate rows, either from all columns or specific columns.

👉 Example: Remove Duplicates from All Columns

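Python
# Sample data with exact duplicate rows
# (the original values were lost; these rows are illustrative)
data = [
    (1, "Alice", 25),
    (1, "Alice", 25),
    (2, "Bob", 30),
    (3, "David", 40),
    (3, "David", 40),
]
columns = ["id", "name", "age"]
df_dup = spark.createDataFrame(data, columns)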

df_no_duplicates = df_dup.dropDuplicates()
df_no_duplicates.show()

Output:

SQL
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 25|
|  2|  Bob| 30|
|  3|David| 40|
+---+-----+---+

👉 Example: Remove Duplicates from Specific Columns

Python
# Keep one row per distinct value of "name"
df_no_duplicates = df_dup.dropDuplicates(["name"])
df_no_duplicates.show()

Output:

SQL
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 25|
|  2|  Bob| 30|
|  3|David| 40|
+---+-----+---+

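If multiple rows share the same "name", only one of them is kept. On a small, single-partition DataFrame this is usually the first row encountered, but Spark does not guarantee which row survives once the data is spread across partitions.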


🔹 3. distinct() – Removing Duplicate Rows Based on All Columns

The distinct() function removes all duplicate rows by checking all columns.

👉 Example:

Python
df_distinct = df_dup.distinct()
df_distinct.show()

The result contains the same rows as dropDuplicates() called without column arguments, although the row order may differ between runs.


🔸 Key Differences Summary

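| Function         | Purpose                              | What it affects              |
|------------------|--------------------------------------|------------------------------|
| drop()           | Removes one or more columns          | Changes column structure     |
| dropDuplicates() | Removes duplicate rows               | Can work on specific columns |
| distinct()       | Removes duplicate rows (all columns) | Checks all columns only      |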

✅ FAQs

🔹 What’s the difference between distinct() and dropDuplicates()?

  • distinct() checks all columns.
  • dropDuplicates() lets you choose specific columns to check for duplicates.

🔹 Can we use distinct() on selected columns only?

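Not directly, because distinct() takes no column arguments. To keep one full row per unique combination of selected columns, pass those columns to dropDuplicates():

Python
df.dropDuplicates(["column_name"])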

🔹 Does distinct() keep the original row order?

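No. distinct() shuffles the data, so the output order is not guaranteed. It cannot restore the original order, but you can impose a deterministic one by sorting explicitly:

Python
df.distinct().orderBy("column_name")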

🔹 How does distinct() handle NULLs?

It treats NULL values as equal to each other, so rows that match in every column (with NULLs in the same positions) collapse into a single row.


🔹 Can we apply distinct() to only a few rows?

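Not directly. distinct() always runs on the whole DataFrame it is called on, but you can filter down to the rows of interest first: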

Python
filtered_df = df.filter("condition").distinct()
