drop(), dropDuplicates(), and distinct() in PySpark

Mar 19, 2026 2 min read

🔹 1. `drop()` – Removing Columns

The drop() function is used to remove one or more columns from a DataFrame.

👉 Example: Removing a Single Column

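from pyspark.sql import Row, SparkSession

# Reuse the active session (or start one); notebooks usually provide `spark`.
spark = SparkSession.builder.getOrCreate()

# Sample data: ids and names match the output below; the ages are assumed values.
data = [
    Row(id=1, name="Alice", age=25),
    Row(id=2, name="Bob", age=30),
    Row(id=3, name="Charlie", age=35),
]

df = spark.createDataFrame(data)

# Drop the 'age' column
df_dropped = df.drop("age")
df_dropped.show()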

Output:

+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+

👉 Example: Removing Multiple Columns

df_dropped_multi = df.drop("age", "name")
df_dropped_multi.show()

Output:

+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+

🔹 2. `dropDuplicates()` – Removing Duplicate Rows

The dropDuplicates() function removes duplicate rows, comparing either all columns or only the columns you specify.

👉 Example: Remove Duplicates from All Columns

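# One row is repeated; which one is an assumption (the distinct rows match the output below).
data = [(1, "Alice", 25), (2, "Bob", 30), (1, "Alice", 25), (3, "David", 40)]
columns = ["id", "name", "age"]
df_dup = spark.createDataFrame(data, columns)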

df_no_duplicates = df_dup.dropDuplicates()
df_no_duplicates.show()

Output:

+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 25|
|  2|  Bob| 30|
|  3|David| 40|
+---+-----+---+

👉 Example: Remove Duplicates from Specific Columns

df_no_duplicates = df_dup.dropDuplicates(["name"])
df_no_duplicates.show()

Output:

+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 25|
|  2|  Bob| 30|
|  3|David| 40|
+---+-----+---+

If multiple rows have the same "name", only one of them is kept. Spark does not guarantee which one, so do not rely on it being the first occurrence.

🔹 3. `distinct()` – Removing Duplicate Rows Based on All Columns

The distinct() function removes all duplicate rows by checking all columns.

👉 Example:

df_distinct = df_dup.distinct()
df_distinct.show()

Output will be the same as dropDuplicates() without column arguments.

🔸 Key Differences Summary

| Function | Purpose | What it affects |
|---|---|---|
| `drop()` | Removes one or more columns | Columns (changes the schema) |
| `dropDuplicates()` | Removes duplicate rows | Rows, compared on all columns or a chosen subset |
| `distinct()` | Removes duplicate rows | Rows, always compared on all columns |

✅ FAQs

🔹 What’s the difference between `distinct()` and `dropDuplicates()`?

distinct() checks all columns.
dropDuplicates() lets you choose specific columns to check for duplicates.

🔹 Can we use `distinct()` on selected columns only?

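Not directly, because distinct() takes no arguments. To check for uniqueness based on selected columns, pass them to dropDuplicates() (the column names here are placeholders):

df.dropDuplicates(["col1", "col2"])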

🔹 Does `distinct()` keep the original row order?

No. distinct() shuffles the data, so the output order is not guaranteed. To control the order, sort explicitly:

df.distinct().orderBy("column_name")

🔹 How does `distinct()` handle NULLs?

It treats NULL values as equal to each other, so rows whose values all match (NULLs included) collapse into a single row.

🔹 Can we apply `distinct()` to only a few rows?

Not directly, but you can filter first and then deduplicate the result:

filtered_df = df.filter("condition").distinct()
