Great! Let’s break down PySpark's complex data types—StructType, ArrayType, and MapType—in a simple and clear way. We'll go over:
- What they are
- When to use or not use them
- Simple code examples
- A combined example showing all three in action
🔹 1. StructType
✅ What is it?
StructType lets you define nested columns (like a structure inside a structure). It’s useful when your data has subfields, like a person having a first, middle, and last name.
📌 When to use:
- Use when your data is hierarchical or you want nested columns.
- Avoid if the structure is very shallow or adds unnecessary complexity.
🧾 Example:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField("name", StructType([
        StructField("firstname", StringType(), True),
        StructField("middlename", StringType(), True),
        StructField("lastname", StringType(), True)
    ]), True),
    StructField("age", StringType(), True)
])
data = [(("James", "", "Smith"), "30"), (("Anna", "Rose", ""), "25")]
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
df.show(truncate=False)

🔹 2. ArrayType
✅ What is it?
ArrayType is used when you want a list of values in a column (e.g., a person knows multiple languages).
📌 When to use:
- Use when a field has multiple values of the same type (like languages, hobbies).
- Avoid if the number of values is always one or if a separate row per value is better for analysis.
🧾 Example:
from pyspark.sql.types import StructType, StructField, ArrayType, StringType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("languages", ArrayType(StringType()), True)
])
data = [("James", ["Java", "Scala"]), ("Anna", ["Python", "SQL"])]
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
df.show(truncate=False)
🔹 3. MapType
✅ What is it?
MapType is like a Python dict—key-value pairs in a column.
📌 When to use:
- Use when values are associated with keys, like {"hair": "black", "eye": "brown"}.
- Avoid if keys are fixed and can just be separate columns.
🧾 Example:
from pyspark.sql.types import StructType, StructField, MapType, StringType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("attributes", MapType(StringType(), StringType()), True)
])
data = [("James", {"hair": "black", "eye": "brown"}), ("Anna", {"hair": "blonde", "eye": "blue"})]
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
df.show(truncate=False)

🔹 4. Combined Example: StructType + ArrayType + MapType
Let’s combine them all into one DataFrame:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, MapType
schema = StructType([
    StructField("person", StructType([
        StructField("firstname", StringType(), True),
        StructField("lastname", StringType(), True)
    ]), True),
    StructField("hobbies", ArrayType(StringType()), True),
    StructField("attributes", MapType(StringType(), StringType()), True),
    StructField("age", IntegerType(), True)
])
data = [
    (("James", "Smith"), ["reading", "cycling"], {"hair": "black", "eye": "brown"}, 30),
    (("Maria", "Jones"), ["painting"], {"hair": "blonde", "eye": "blue"}, 28)
]
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
df.show(truncate=False)

🔹 Summary Table: When to Use
| Data Type | Description | When to Use | Avoid When |
|---|---|---|---|
| StructType | Nested fields inside a column | When data has a sub-structure | If flat structure is enough |
| ArrayType | List of items | When you have multiple values | If only one value or can normalize by rows |
| MapType | Key-value pairs like a dict | When keys vary or are dynamic | If keys are fixed (use StructType instead) |
🔹 1. Accessing StructType Fields
If you have a column that's a StructType, you can access its subfields using dot notation or with the col() function.
✅ Example:
from pyspark.sql.functions import col
df.select(
col("person.firstname").alias("First Name"),
col("person.lastname").alias("Last Name"),
"age"
).show()
🔹 2. Accessing ArrayType Elements
You can access elements of an array by index or explode it into multiple rows.
✅ Example: Access by index
df.select(
"person.firstname",
col("hobbies")[0].alias("First Hobby")
).show()
✅ Example: Explode array into rows
from pyspark.sql.functions import explode
df.select(
"person.firstname",
explode("hobbies").alias("Each Hobby")
).show()

🔹 3. Accessing MapType Values
You can access map values by key.
✅ Example:
df.select(
"person.firstname",
df.attributes.hair.alias("Hair Color"),
col("attributes")["eye"].alias("Eye Color")
).show()

🔸 Bonus: Flatten All Columns in One Go
Here’s how you might pull all useful fields into a flat structure:

Yes! There are a few more key things you should know when working with StructType, ArrayType, and MapType in PySpark, especially as a data analyst or engineer.
Here’s a breakdown of advanced but very useful concepts that help you master these complex data types:
🔸 1. Nesting: You can combine them together!
PySpark allows you to nest these complex types inside each other:
✅ Struct inside Struct:
StructType([
    StructField("name", StructType([
        StructField("first", StringType()),
        StructField("last", StringType())
    ])),
    StructField("age", IntegerType())
])
✅ Struct inside Array:
ArrayType(StructType([
    StructField("language", StringType()),
    StructField("proficiency", StringType())
]))
✅ Map inside Struct or vice versa:
StructType([
    StructField("properties", MapType(StringType(), StringType()))
])
🔸 2. Create Your Own Schema
If you’re reading from a file or JSON column, define your custom schema using StructType:
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
🔸 3. Explode vs Inline
explode() is for arrays/maps: it expands each element into its own row. inline() is for arrays of StructType: it also produces one row per element, but additionally turns the struct's fields into top-level columns.
from pyspark.sql.functions import inline
data2 = [
    ("James", [("Java", "expert"), ("Python", "intermediate")]),
    ("Anna", [("Scala", "beginner")])
]
schema2 = "name string, skills array<struct<language:string, proficiency:string>>"
df2 = spark.createDataFrame(data2, schema=schema2)
df2.show(truncate=False)


from pyspark.sql.types import MapType, StringType, StructType, StructField
# Schema using MapType
schema = StructType([
    StructField("name", MapType(StringType(), StringType()), True)
])
# Data: each "name" is a map (i.e., dictionary)
data = [({"first": "James", "last": "Smith"},), ({"first": "Anna", "last": "Rose"},)]
df = spark.createDataFrame(data, schema=schema)
df.show(truncate=False)
