partitionBy() is a function used when writing a DataFrame to disk. It is part of the pyspark.sql.DataFrameWriter class.
👉 We use it to split large datasets into smaller pieces, usually based on one or more columns.
✅ Use Case:
If you are saving data to a Data Lake or a distributed file system, partitioning improves performance, especially when queries run over large data.
🧠 Important Point:
When you use partitionBy("state"):
- Data is saved into separate folders, e.g. state=CA, state=NY, etc.
- The state column itself is not stored inside the data files; its value is inferred from the folder name (which saves space).
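The folder layout above can be sketched in plain Python. This is a simplified illustration of Hive-style partitioning, not Spark's actual writer; the function name and sample rows are made up for the example:

```python
from collections import defaultdict

def partition_rows(rows, partition_col):
    """Illustrative sketch: group rows into folders named <col>=<value>
    and drop the partition column from the data kept inside each folder."""
    folders = defaultdict(list)
    for row in rows:
        folder = f"{partition_col}={row[partition_col]}"
        # The partition column lives only in the folder name, not the file
        stripped = {k: v for k, v in row.items() if k != partition_col}
        folders[folder].append(stripped)
    return dict(folders)

rows = [
    {"name": "James", "state": "CA"},
    {"name": "Anna", "state": "NY"},
    {"name": "Robert", "state": "CA"},
]
layout = partition_rows(rows, "state")
print(sorted(layout))       # ['state=CA', 'state=NY']
print(layout["state=CA"])   # [{'name': 'James'}, {'name': 'Robert'}]
```

Notice that the rows stored under state=CA no longer carry a state key — exactly the space saving described above.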
✅ Example:
df.write.option("header", True) \
.partitionBy("state") \
.mode("overwrite") \
.csv("/FileStore/tables/state")
Output folders:
/FileStore/tables/state/state=CA/
/FileStore/tables/state/state=NY/
🔹 Partitioning by Multiple Columns
df.write.option("header", True) \
.partitionBy("state", "City") \
.mode("overwrite") \
.csv("/FileStore/tables/state-city")
📁 Folder structure:
/FileStore/tables/state-city/state=CA/City=LosAngeles/
/FileStore/tables/state-city/state=CA/City=SanDiego/
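With multiple columns, the partition folders simply nest in the order the columns are listed. A minimal sketch of how such a nested path is derived (illustration only, not Spark internals):

```python
def partition_path(base, row, partition_cols):
    """Build a nested Hive-style path: base/<col1>=<v1>/<col2>=<v2>/..."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([base] + parts)

row = {"state": "CA", "City": "LosAngeles", "name": "James"}
path = partition_path("/FileStore/tables/state-city", row, ["state", "City"])
print(path)  # /FileStore/tables/state-city/state=CA/City=LosAngeles
```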
🔹 Partitioning with Control (Data Skew Handling)
Sometimes some partitions contain far more records than others, while some have very few — this is called data skew.
You can use the maxRecordsPerFile option to control the maximum number of records each file may hold:
df.write.option("header", True) \
.option("maxRecordsPerFile", 2) \
.partitionBy("state") \
.mode("overwrite") \
.csv("/FileStore/tables/zipcodes-state")
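The effect of maxRecordsPerFile is essentially chunking: one partition's records get spread across several files, each capped at N records. A hedged pure-Python sketch of that behavior (the function and sample records are made up):

```python
def split_into_files(records, max_records_per_file):
    """Split a partition's records into files of at most N records each."""
    return [records[i:i + max_records_per_file]
            for i in range(0, len(records), max_records_per_file)]

ca_records = ["r1", "r2", "r3", "r4", "r5"]
files = split_into_files(ca_records, 2)
print([len(f) for f in files])  # [2, 2, 1]
```

With maxRecordsPerFile set to 2, five records in the state=CA folder would land in three files instead of one oversized file.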
🆚 partitionBy() vs groupBy()
| Feature | partitionBy() | groupBy() |
|---|---|---|
| Use Case | Partitions data on disk at write time | Logical grouping of rows inside a DataFrame |
| Affects | File structure on disk | Aggregation logic inside the DataFrame |
🔹 repartition() in PySpark
repartition() is a transformation that changes the number of partitions held in memory.
Syntax:
df.repartition(5) # by number
df.repartition("state") # by column
df.repartition(5, "state") # by number & column
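When you repartition by a column, rows are routed to partitions by hashing the column value, so all rows sharing a value end up together. A simplified sketch of that idea (Spark uses its own hash function, not Python's built-in `hash`, and this helper is hypothetical):

```python
def assign_partition(value, num_partitions):
    """Route a row to a partition by hashing its key column value.
    Rows with equal values always map to the same partition."""
    return hash(value) % num_partitions

rows = [("James", "CA"), ("Anna", "NY"), ("Robert", "CA")]
partitions = [assign_partition(state, 3) for _, state in rows]
# Both "CA" rows land in the same partition
assert partitions[0] == partitions[2]
```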
✅ Example:
# Sample data for illustration — any small dataset with a "state" column works
simpleData = [("James", "CA", 3000), ("Anna", "NY", 4000), ("Robert", "CA", 3500)]
schema = ["name", "state", "salary"]
df = spark.createDataFrame(data=simpleData, schema=schema)
df.show()
df.repartition(3).write.mode("overwrite").csv("/FileStore/tables/RepartionData1")
📌 Note: repartition() works in memory, while partitionBy() shapes the directory structure on disk.
🧠 Bonus Tip:
- Partitioning works much like indexing in SQL.
- It helps with both query performance and data management.
🔹 Quick Q&A
❓ What is partitionBy() in PySpark?
✅ partitionBy() is a function used when writing a DataFrame to disk. It splits the data into separate folders based on one or more columns.
Example:
df.write.partitionBy("state").csv("path/")
The output folders will be: /path/state=NY/, /path/state=CA/, etc.
❓ What is repartition()?
✅ repartition() is a transformation that divides a DataFrame into multiple partitions in memory (RAM).
It improves processing performance, especially when you run joins, groupBy, or other heavy operations.
Example:
df.repartition("state")
❓ Why use both together?
✅ If you want balanced partitioning at write time, combining the two works best.
Example:
df.repartition("state") \
.write.partitionBy("state") \
.csv("path/")
- repartition(): balances the load in memory
- partitionBy(): organizes the data on disk
❓ What is data skew? And how do you handle it?
✅ When some partitions have far more records than others, that is called data skew.
To handle it:
- At the memory level: use repartition()
- At the file level: use the maxRecordsPerFile option
Example:
df.write.option("maxRecordsPerFile", 1000) \
.partitionBy("state") \
.csv("path/")
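Before applying either fix, it helps to confirm the skew exists. A minimal sketch for spotting skewed keys by counting records per partition value (the function, threshold, and sample rows are all made up for illustration):

```python
from collections import Counter

def detect_skew(rows, key_col, threshold=2.0):
    """Flag partition keys whose record count exceeds `threshold`
    times the average count — a rough signal of data skew."""
    counts = Counter(row[key_col] for row in rows)
    avg = sum(counts.values()) / len(counts)
    return [key for key, count in counts.items() if count > threshold * avg]

rows = [{"state": "CA"}] * 10 + [{"state": "NY"}] + [{"state": "TX"}]
print(detect_skew(rows, "state"))  # ['CA']
```

Here CA has 10 records against an average of 4 per key, so it is flagged — a candidate for repartition() or maxRecordsPerFile.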