DataFrames are a core data structure in PySpark for working with structured and semi-structured data efficiently. A DataFrame is similar to a table in a relational database or a Pandas DataFrame, and it supports querying, filtering, and transforming data with a concise API.
What is a DataFrame in PySpark?
A DataFrame in PySpark is a distributed collection of data organized into named columns. It is built on top of Resilient Distributed Datasets (RDDs) and supports SQL-like operations on large datasets.
Different Ways to Create DataFrames in PySpark
Below are some common ways to create a DataFrame in PySpark.
1. Creating DataFrame from a List of Tuples
from pyspark.sql import SparkSession

# Create a SparkSession, the entry point for DataFrame operations
spark = SparkSession.builder.appName("DataFrameExamples").getOrCreate()

# Sample data as a list of tuples
data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 35)]

# Define column names
columns = ["ID", "Name", "Age"]

# Create DataFrame
df = spark.createDataFrame(data, schema=columns)
df.show()
Output:
+---+-------+---+
| ID| Name |Age|
+---+-------+---+
| 1| Alice | 25|
| 2| Bob | 30|
| 3|Charlie| 35|
+---+-------+---+
2. Creating DataFrame from a List of Row Objects
from pyspark.sql import Row
# Using Row
rows = [Row(ID=1, Name="Alice", Age=25), Row(ID=2, Name="Bob", Age=30)]
df_row = spark.createDataFrame(rows)
df_row.show()
Output:
+---+-----+---+
| ID| Name|Age|
+---+-----+---+
| 1|Alice| 25|
| 2| Bob| 30|
+---+-----+---+
3. Creating DataFrame from an RDD
# Creating RDD
rdd = spark.sparkContext.parallelize([(1, "Alice", 25), (2, "Bob", 30)])
# toDF() on plain tuples would produce default names (_1, _2, ...), so pass column names
df_rdd = rdd.toDF(["ID", "Name", "Age"])
df_rdd.show()
Output:
+---+-----+---+
| ID| Name|Age|
+---+-----+---+
| 1|Alice| 25|
| 2| Bob| 30|
+---+-----+---+
4. Creating DataFrame from a Pandas DataFrame
import pandas as pd
# Create Pandas DataFrame
pdf = pd.DataFrame({"ID": [1, 2], "Name": ["Alice", "Bob"], "Age": [25, 30]})
# Convert Pandas DataFrame to PySpark DataFrame
df_pandas = spark.createDataFrame(pdf)
df_pandas.show()
Output:
+---+-----+---+
| ID| Name|Age|
+---+-----+---+
| 1|Alice| 25|
| 2| Bob| 30|
+---+-----+---+
5. Creating DataFrame from a JSON File
# Read JSON file
df_json = spark.read.json("sample.json")
df_json.show()
6. Creating DataFrame from a CSV File
# Read CSV file
df_csv = spark.read.csv("sample.csv", header=True, inferSchema=True)
df_csv.show()