Understanding DataFrames in PySpark

DataFrames are a core data structure in PySpark for working with structured and semi-structured data. They resemble database tables or Pandas DataFrames, but are distributed across a cluster, and they make it easy to query, filter, and transform data.

What is a DataFrame in PySpark?

A DataFrame in PySpark is a distributed collection of data organized into named columns. It is built on top of Resilient Distributed Datasets (RDDs) and supports SQL-like operations on large datasets.

Different Ways to Create DataFrames in PySpark

Below are some common ways to create a DataFrame in PySpark.

1. Creating DataFrame from a List of Tuples

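Python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name is arbitrary
spark = SparkSession.builder.appName("dataframe-examples").getOrCreate()

# Sample data (values match the output shown for this example)
data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 35)]

# Define column names
columns = ["ID", "Name", "Age"]

# Create DataFrame
df = spark.createDataFrame(data, schema=columns)

df.show()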

Output:

SQL
+---+-------+---+
| ID|   Name|Age|
+---+-------+---+
|  1|  Alice| 25|
|  2|    Bob| 30|
|  3|Charlie| 35|
+---+-------+---+

2. Creating DataFrame from a List of Row Objects

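Python
from pyspark.sql import Row

# Using Row: field names become the column names
rows = [Row(ID=1, Name="Alice", Age=25), Row(ID=2, Name="Bob", Age=30)]
df_row = spark.createDataFrame(rows)
df_row.show()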

Output:

SQL
+---+-----+---+
| ID| Name|Age|
+---+-----+---+
|  1|Alice| 25|
|  2|  Bob| 30|
+---+-----+---+

3. Creating DataFrame from an RDD

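Python
# Creating an RDD of tuples
rdd = spark.sparkContext.parallelize([(1, "Alice", 25), (2, "Bob", 30)])

# Pass column names to toDF(); without them the columns are named _1, _2, _3
df_rdd = rdd.toDF(["ID", "Name", "Age"])
df_rdd.show()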

Output:

SQL
+---+-----+---+
| ID| Name|Age|
+---+-----+---+
|  1|Alice| 25|
|  2|  Bob| 30|
+---+-----+---+

4. Creating DataFrame from a Pandas DataFrame

Python
import pandas as pd

# Create Pandas DataFrame
pdf = pd.DataFrame({"ID": [1, 2], "Name": ["Alice", "Bob"], "Age": [25, 30]})

# Convert Pandas DataFrame to PySpark DataFrame
df_pandas = spark.createDataFrame(pdf)
df_pandas.show()

Output:

SQL
+---+-----+---+
| ID| Name|Age|
+---+-----+---+
|  1|Alice| 25|
|  2|  Bob| 30|
+---+-----+---+

5. Creating DataFrame from a JSON File

Python
# Read a JSON file (by default, Spark expects one JSON object per line)
df_json = spark.read.json("sample.json")
df_json.show()

6. Creating DataFrame from a CSV File

Python
# Read CSV file
df_csv = spark.read.csv("sample.csv", header=True, inferSchema=True)
df_csv.show()
