PartitionBy() in PySpark

partitionBy() is a function used when writing a DataFrame to disk. It is part of the pyspark.sql.DataFrameWriter class.

👉 We use it to split large datasets into smaller files, usually based on one or more columns.

✅ Use Case:

If you are saving data to a Data Lake or a distributed file system, partitioning improves performance, especially when queries run over large data.

🧠 Important Point:

When you use partitionBy("state"):

  • Data is saved in separate folders, e.g. state=CA, state=NY, etc.
  • The data files themselves do not contain the state column; its value is recovered from the folder name (which saves space).

✅ Example:

Python
df.write.option("header", True) \
        .partitionBy("state") \
        .mode("overwrite") \
        .csv("/FileStore/tables/state")

Output folders:

Bash
/FileStore/tables/state/state=CA/
/FileStore/tables/state/state=NY/
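Spark recovers the partition column's value from folder names like these. A minimal pure-Python sketch of how that Hive-style `column=value` encoding works (`partition_value` is a hypothetical helper for illustration, not a PySpark API):

```python
# Sketch of how Hive-style partition folders encode column values.
# Spark derives the partition column's value from the directory name,
# so the data files themselves do not need to store that column.
def partition_value(path: str, column: str) -> str:
    """Extract a partition column's value from a Hive-style path."""
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        if sep and key == column:
            return value
    raise KeyError(f"{column} not found in {path}")

print(partition_value("/FileStore/tables/state/state=CA/part-00000.csv", "state"))  # CA
```

This is why a file read from the state=CA folder still shows a state column when Spark loads it: the value comes from the path, not the file contents.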

🔹 Multiple Columns Partition

Python
df.write.option("header", True) \
        .partitionBy("state", "City") \
        .mode("overwrite") \
        .csv("/FileStore/tables/state-city")

📁 Folder structure:

Bash
/FileStore/tables/state-city/state=CA/City=LosAngeles/
/FileStore/tables/state-city/state=CA/City=SanDiego/

🔹 Partitioning with Control (Data Skew Handling)

Sometimes a few partitions hold many records while others hold very few — this is called data skew.

You can use the maxRecordsPerFile option to control the maximum number of records each file should contain:

Python
df.write.option("header", True) \
        .option("maxRecordsPerFile", 2) \
        .partitionBy("state") \
        .mode("overwrite") \
        .csv("/FileStore/tables/zipcodes-state")

🆚 partitionBy() vs groupBy()

Feature   | partitionBy()                          | groupBy()
Use Case  | Partitioning on disk at write time     | Logical grouping inside the DataFrame
Affects   | File structure on disk                 | Aggregation logic inside the DataFrame

🔹 repartition() in PySpark

repartition() is a transformation that changes the number of partitions in memory.

Syntax:

Python
df.repartition(5)  # by number
df.repartition("state")  # by column
df.repartition(5, "state")  # by number & column

✅ Example:

Python
# Hypothetical sample data for illustration (the original values were not preserved)
simpleData = [("James", "Sales", "NY", 90000),
              ("Michael", "Sales", "NY", 86000),
              ("Robert", "Sales", "CA", 81000),
              ("Maria", "Finance", "CA", 90000)]

schema = ["employee_name", "department", "state", "salary"]
df = spark.createDataFrame(data=simpleData, schema=schema)
df.show()

df.repartition(3).write.mode("overwrite").csv("/FileStore/tables/RepartionData1")

📌 Note: repartition() works in memory, while partitionBy() affects the structure on disk.


🧠 Bonus Tip:

  • Partitioning works much like indexing does in SQL.
  • It helps with both query performance and data management.

❓ What is partitionBy() in PySpark?

partitionBy() is a function used when writing a DataFrame to disk. It divides the data into separate folders based on one or more columns.

Example:

Python
df.write.partitionBy("state").csv("path/")

The output folders will be:
/path/state=NY/, /path/state=CA/, etc.


❓ What is repartition()?

repartition() is a transformation that divides a DataFrame into multiple partitions in memory (RAM).
It improves processing performance, especially when you run a join, groupBy, or another heavy operation.

Example:

Python
df.repartition("state")

❓ Why use both together?

✅ If you want balanced partitioning at write time, combining the two works best.

Example:

Python
df.repartition("state") \
  .write.partitionBy("state") \
  .csv("path/")

  • repartition(): balances the data across partitions in memory
  • partitionBy(): organizes the data into folders on disk

❓ What is data skew, and how do you handle it?

✅ When some partitions hold far more records than others, that is called data skew.
To handle it:

  • At the memory level: use repartition()
  • At the file level: use the maxRecordsPerFile option

Example:

Python
df.write.option("maxRecordsPerFile", 1000) \
  .partitionBy("state") \
  .csv("path/")
