SparkSession vs. SparkContext

In Apache Spark, SparkSession and SparkContext are both essential components, but they serve different purposes and have different scopes. Here's a detailed comparison of SparkSession and SparkContext:

SparkContext

  1. Definition:
    • SparkContext is the entry point to Spark's core functionality. It allows the Spark application to access Spark's cluster and interact with it to create RDDs, accumulators, and broadcast variables.
  2. Creation:
    • Typically created directly with a SparkConf in Spark 1.x; in Spark 2.x and later it is usually obtained from a SparkSession rather than constructed by hand.
Python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Example App").setMaster("local")
sc = SparkContext(conf=conf)

Responsibilities:

  • Manages the connection to the cluster manager.
  • Coordinates distributed data processing operations.
  • Provides the core API for working with RDDs.

Limitations:

  • Does not provide high-level APIs for working with structured data (like DataFrames and Datasets).
  • Lacks SQL functionalities, streaming, and other higher-level abstractions directly.

SparkSession

  1. Definition:
    • SparkSession is the unified entry point to Spark's higher-level APIs. Introduced in Spark 2.0, it combines functionalities of SQLContext and HiveContext along with the original SparkContext.
  2. Creation:
    • Typically created using a builder pattern. It automatically creates a SparkContext internally if one does not exist.
Python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Example App") \
    .getOrCreate()

Responsibilities:

  • Provides a single entry point to all of Spark's functionality, including working with structured and semi-structured data.
  • Manages the underlying SparkContext and other contexts like SQLContext and HiveContext.
  • Facilitates reading from and writing to a variety of data sources.
  • Provides APIs for DataFrames and Datasets, enabling SQL queries and advanced analytics.

Advantages:

  • Simplifies the initialization process by merging multiple contexts into one.
  • Provides high-level APIs for DataFrame and Dataset operations.
  • Supports SQL queries, Hive tables, and other structured data operations.
  • Enables interaction with Spark's Catalyst optimizer for query optimization.

Example Comparison

Using SparkContext (Pre-Spark 2.0)

Python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("Example App").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Creating RDD
data = [("Alice", 1), ("Bob", 2), ("Cathy", 3)]
rdd = sc.parallelize(data)

# Converting RDD to DataFrame
df = sqlContext.createDataFrame(rdd, ["name", "value"])
df.show()

Using SparkSession (Spark 2.0 and Later)

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example App").getOrCreate()

# Creating DataFrame directly
data = [("Alice", 1), ("Bob", 2), ("Cathy", 3)]
df = spark.createDataFrame(data, ["name", "value"])
df.show()

Summary

  • SparkContext:
    • Essential for low-level operations and the core Spark functionalities.
    • Directly used to create RDDs and manage the cluster connection.
    • Requires SQLContext or HiveContext for structured data operations.
  • SparkSession:
    • A unified entry point for all Spark operations introduced in Spark 2.0.
    • Simplifies application development by merging the functionalities of SparkContext, SQLContext, and HiveContext.
    • Provides high-level APIs for DataFrame, Dataset operations, and SQL queries.
    • Recommended for most applications due to its ease of use and comprehensive feature set.

In modern Spark applications, SparkSession is preferred because it offers a consolidated and more convenient way to access all of Spark's features.
