🔹 1. drop() – Removing Columns
The drop() method returns a new DataFrame without the specified column(s). The original DataFrame is not modified.
👉 Example: Removing a Single Column
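from pyspark.sql import Row, SparkSession
# Reuse the active session, or start one if none exists
spark = SparkSession.builder.getOrCreate()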
# Sample Data
data =
df = spark.createDataFrame(data)
# Drop the 'age' column
df_dropped = df.drop("age")
df_dropped.show()
Output:
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+
👉 Example: Removing Multiple Columns
df_dropped_multi = df.drop("age", "name")
df_dropped_multi.show()
Output:
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
🔹 2. dropDuplicates() – Removing Duplicate Rows
The dropDuplicates() method removes duplicate rows, comparing either all columns or only a chosen subset of columns.
👉 Example: Remove Duplicates from All Columns
data =
columns = ["id", "name", "age"]
df_dup = spark.createDataFrame(data, columns)
df_no_duplicates = df_dup.dropDuplicates()
df_no_duplicates.show()
Output:
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 25|
|  2|  Bob| 30|
|  3|David| 40|
+---+-----+---+
👉 Example: Remove Duplicates from Specific Columns
df_no_duplicates = df_dup.dropDuplicates(["name"])
df_no_duplicates.show()
Output:
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 25|
|  2|  Bob| 30|
|  3|David| 40|
+---+-----+---+
If multiple rows have the same "name", only one of them is kept. Spark does not guarantee which one, so do not rely on it being the first occurrence.
🔹 3. distinct() – Removing Duplicate Rows Based on All Columns
The distinct() function removes all duplicate rows by checking all columns.
👉 Example:
df_distinct = df_dup.distinct()
df_distinct.show()
Output will be the same as dropDuplicates() without column arguments.
🔸 Key Differences Summary
| Function | Purpose | What it affects |
|---|---|---|
| drop() | Removes one or more columns | Changes column structure |
| dropDuplicates() | Removes duplicate rows | Can work on specific columns |
| distinct() | Removes duplicate rows (all columns) | Checks all columns only |
✅ FAQs
🔹 What’s the difference between distinct() and dropDuplicates()?
distinct() checks all columns. dropDuplicates() lets you choose specific columns to check for duplicates.
🔹 Can we use distinct() on selected columns only?
No, distinct() takes no column arguments. To check for uniqueness based on selected columns while keeping the other columns, use:
df.dropDuplicates(["column_name"])
🔹 Does distinct() keep the original row order?
No. distinct() does not guarantee any row order. To get a deterministic order, sort the result explicitly:
df.distinct().orderBy("column_name")
🔹 How does distinct() handle NULLs?
It treats NULL values as equal to each other, so rows that match in every column (including the NULLs) are collapsed into a single row.
🔹 Can we apply distinct() to only a few rows?
Not directly. But you can:
filtered_df = df.filter("condition").distinct()