
Collect() in PySpark

PySpark collect() Function

The collect() function in PySpark retrieves all rows of a DataFrame (or RDD) from the distributed cluster back to the driver program as a Python list of Row objects.


🔍 Think of it like this:

  • Imagine your data is stored across multiple machines.
  • PySpark processes the data in parallel on each machine.
  • collect() pulls all that processed data from those machines into your main driver machine.

✅ When to Use collect():

  • When your dataset is small and fits in memory.
  • When you need to debug or inspect the actual data.
  • When converting the data into Python native objects for local operations.
  • When you need to pass the values to another non-distributed process (e.g., plotting with matplotlib).

❌ When NOT to Use:

  • ⚠️ If the dataset is too large, collect() can lead to:
    • Out of Memory (OOM) errors
    • Driver crashes
  • Use .show(), .limit(), or .take(n) for previewing data instead.

🧪 Example Dataset

Python
from pyspark.sql import SparkSession

# Start a local SparkSession (the tutorial assumes one named `spark`)
spark = SparkSession.builder.appName("collect-example").getOrCreate()

# Sample department data (illustrative values; the original listing was truncated)
dept = [("Finance", 10), ("Marketing", 20), ("Sales", 30), ("IT", 40)]
deptColumns = ["dept_name", "dept_id"]

deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

🔹 1. Collect all rows


🔹 2. Collect specific column(s)


🔹 3. Loop through collected data


🔹 4. Convert collected data to list of values


🔹 5. Find the max value from collected data (small datasets only)


🔹 6. Use in debugging / inspection

✅ Pro Tips:

  • ✅ Use .collect() only when you’re sure the dataset will fit in memory.
  • ✅ Always test with .limit(n).collect() before full .collect().
