
Complex Data Types in PySpark: StructType, ArrayType, and MapType


Let's break down PySpark's complex data types (StructType, ArrayType, and MapType) in a simple and clear way. We'll go over:

  1. What they are
  2. When to use or not use them
  3. Simple code examples
  4. A combined example showing all three in action

🔹 1. StructType

✅ What is it?

StructType lets you define nested columns (like a structure inside a structure). It’s useful when your data has subfields, like a person having a first, middle, and last name.

📌 When to use:

  • Use when your data is hierarchical or you want nested columns.
  • Avoid if the structure is very shallow or adds unnecessary complexity.

🧾 Example:

Python
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("name", StructType([
        StructField("firstname", StringType(), True),
        StructField("lastname", StringType(), True)
    ]), True),
    StructField("age", StringType(), True)
])

data = [(("James", "Smith"), "30"), (("Anna", "Rose"), "41")]
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
df.show(truncate=False)

🔹 2. ArrayType

✅ What is it?

ArrayType is used when you want a list of values in a column (e.g., a person knows multiple languages).

📌 When to use:

  • Use when a field has multiple values of the same type (like languages, hobbies).
  • Avoid if the number of values is always one or if a separate row per value is better for analysis.

🧾 Example:

Python
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("languages", ArrayType(StringType()), True)
])

data = [("James", ["Java", "Scala"]), ("Anna", ["Python", "SQL"])]
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
df.show(truncate=False)

🔹 3. MapType

✅ What is it?

MapType is like a Python dict—key-value pairs in a column.

📌 When to use:

  • Use when values are associated with keys, like {"hair": "black", "eye": "brown"}.
  • Avoid if keys are fixed and can just be separate columns.

🧾 Example:

Python
from pyspark.sql.types import StructType, StructField, MapType, StringType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("attributes", MapType(StringType(), StringType()), True)
])

data = [("James", {"hair": "black", "eye": "brown"}),
        ("Anna", {"hair": "blonde", "eye": "blue"})]
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
df.show(truncate=False)

🔹 4. Combined Example: StructType + ArrayType + MapType

Let’s combine them all into one DataFrame:

Python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, MapType

schema = StructType([
    StructField("person", StructType([
        StructField("firstname", StringType(), True),
        StructField("lastname", StringType(), True)
    ]), True),
    StructField("hobbies", ArrayType(StringType()), True),
    StructField("attributes", MapType(StringType(), StringType()), True),
    StructField("age", IntegerType(), True)
])

data = [
    (("James", "Smith"), ["reading", "cycling"], {"hair": "black", "eye": "brown"}, 30),
    (("Maria", "Jones"), ["painting", "chess"], {"hair": "blonde", "eye": "blue"}, 28)
]

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
df.show(truncate=False)

🔹 Summary Table: When to Use

| Data Type  | Description                   | When to Use                   | Avoid When                                      |
|------------|-------------------------------|-------------------------------|-------------------------------------------------|
| StructType | Nested fields inside a column | When data has a sub-structure | If a flat structure is enough                   |
| ArrayType  | List of items                 | When you have multiple values | If only one value, or you can normalize by rows |
| MapType    | Key-value pairs like a dict   | When keys vary or are dynamic | If keys are fixed (use StructType instead)      |

🔹 1. Accessing StructType Fields

If you have a column that's a StructType, you can access its subfields using dot notation or with the col() function.

✅ Example:

Python
from pyspark.sql.functions import col

df.select(
    col("person.firstname").alias("First Name"),
    col("person.lastname").alias("Last Name"),
    "age"
).show()

🔹 2. Accessing ArrayType Elements

You can access elements of an array by index or explode it into multiple rows.

✅ Example: Access by index

Python
df.select(
    "person.firstname",
    col("hobbies")[0].alias("First Hobby")
).show()

✅ Example: Explode array into rows

Python
from pyspark.sql.functions import explode

df.select(
    "person.firstname",
    explode("hobbies").alias("Each Hobby")
).show()

🔹 3. Accessing MapType Values

You can access map values by key.

✅ Example:

Python
df.select(
    "person.firstname",
    df.attributes.hair.alias("Hair Color"),
    col("attributes")["eye"].alias("Eye Color")
).show()

🔸 Bonus: Flatten All Columns in One Go

Here’s how you might pull all useful fields into a flat structure:

There are a few more key things you should know when working with StructType, ArrayType, and MapType in PySpark, especially as a data analyst or engineer.

Here’s a breakdown of advanced but very useful concepts that help you master these complex data types:


🔸 1. Nesting: You can combine them together!

PySpark allows you to nest these complex types inside each other:

✅ Struct inside Struct:

Python
StructType([
    StructField("name", StructType([
        StructField("firstname", StringType()),
        StructField("lastname", StringType())
    ])),
    StructField("age", IntegerType())
])

✅ Struct inside Array:

Python
ArrayType(StructType([
    StructField("language", StringType()),
    StructField("proficiency", StringType())
]))

✅ Map inside Struct or vice versa:

Python
StructType([
    StructField("attributes", MapType(StringType(), StringType()))
])

🔸 2. Create Your Own Schema

If you’re reading from a file or JSON column, define your custom schema using StructType:

Python
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

🔸 3. Explode vs Inline

  • explode() is for arrays and maps: it expands each element (or key-value pair) into its own row.
  • inline() is for arrays of StructType: it expands each struct into its own row, with the struct's fields as columns.
Python
from pyspark.sql.functions import inline

data2 = [
    ("James", [("Java", "expert"), ("Scala", "intermediate")]),
    ("Anna",  [("Python", "expert")])
]

schema2 = "name string, skills array<struct<language:string, proficiency:string>>"
df2 = spark.createDataFrame(data2, schema=schema2)

df2.select("name", inline("skills")).show(truncate=False)


One more variant worth seeing: a MapType column can hold the whole record value, as when a name itself is stored as a dictionary.

Python
from pyspark.sql.types import MapType, StringType, StructType, StructField

# Schema using MapType
schema = StructType([
    StructField("name", MapType(StringType(), StringType()), True)
])

# Data: each "name" is a map (i.e., dictionary)
data = [({"first": "James", "last": "Smith"},),
        ({"first": "Anna", "last": "Rose"},)]

df = spark.createDataFrame(data, schema=schema)
df.show(truncate=False)
