Working with NULL/None Values in PySpark

🔍 What's fillna() or fill() in PySpark?

In PySpark, both fillna() and fill() are used to replace null or missing values in a DataFrame.

Both fillna() and fill() work the same:

Python
df.fillna(0) == df.na.fill(0)
Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fillna-example").getOrCreate()

# Sample data (illustrative values; None marks a missing entry)
data = [("Alice", 25, None), ("Bob", None, 3000), (None, 30, 4000)]

# Creating DataFrame
columns = ["name", "age", "salary"]
df = spark.createDataFrame(data, columns)

# Show original DataFrame
print("Original DataFrame:")
df.show()

PySpark Drop Rows with NULL or None Values

The PySpark drop() function takes three optional parameters that control how rows with NULL values are removed: based on any column, all columns, or a chosen subset of columns.

Python
drop(how='any', thresh=None, subset=None)

All these parameters are optional.

  • how – Takes 'any' or 'all'. With 'any', a row is dropped if it contains a NULL in any column; with 'all', a row is dropped only if all columns are NULL. Default is 'any'.
  • thresh – Takes an int value. Drops rows that have fewer than thresh non-null values; when set, it overrides how. Default is None.
  • subset – Column names to consider when checking for NULL values. Default is None, meaning all columns are checked.

Alternatively, you can use the DataFrame.dropna() function, which is equivalent, to drop rows with null values.

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dropna-example").getOrCreate()

# Sample data with NULLs (illustrative values; the last row is all-NULL)
data = [("Alice", 25, None), ("Bob", None, 3000), (None, None, None)]

# Creating DataFrame
columns = ["name", "age", "salary"]
df = spark.createDataFrame(data, columns)

# Show original DataFrame
print("Original DataFrame:")
df.show()
