In Apache Spark, the execution of a program is broken down into multiple levels of granularity: applications, jobs, stages, and tasks. Understanding these concepts is crucial for optimizing and managing Spark workloads.
Application
Definition:
- A Spark application is a self-contained computation process that consists of a driver program and a set of executor processes on a cluster. It runs user code to perform a series of computations.
Components:
- Driver Program:
- The main program that coordinates all the workers and the overall computation.
- Contains the `SparkSession` or `SparkContext`.
- Translates user code into jobs, stages, and tasks.
- Executors:
- Worker nodes responsible for executing the tasks assigned by the driver.
- Each application has its own executors, which are not shared between applications.
Example:
- A Spark application might be a standalone application written in Python, Scala, or Java that performs data processing using Spark APIs.
Job
Definition:
- A job is a parallel computation consisting of multiple tasks that get spawned in response to an action (such as `save`, `count`, or `collect`) on an RDD, DataFrame, or Dataset.
Characteristics:
- Triggered by actions.
- Each action in the Spark program creates a new job.
- The job is divided into multiple stages.
Example:
- If you run an action like `df.count()`, it triggers a job to compute the number of rows in the DataFrame.
Stage
Definition:
- A stage is a set of tasks that can run in parallel without moving data between partitions. Spark places a stage boundary wherever a wide transformation requires a shuffle, so each stage corresponds to a run of narrow transformations up to the next shuffle.
Characteristics:
- Each stage represents a step in the job execution plan.
- Spark breaks a job into stages based on the transformations applied to the data.
- Narrow transformations (e.g., `map`, `filter`) are grouped into a single stage.
- Wide transformations (e.g., `reduceByKey`, `groupByKey`) cause a shuffle and trigger the start of a new stage.
Example:
- If a job involves both a `map` and a `reduceByKey` operation, Spark will divide the job into two stages: one for the `map` operation and one for the `reduceByKey` operation.
Task
Definition:
- A task is the smallest unit of work in Spark. It is a single operation (such as applying a function to a partition of an RDD) sent to an executor.
Characteristics:
- Each stage is divided into many tasks, where each task is executed on a partition of the data.
- Tasks within a stage are executed in parallel across the cluster.
Example:
- If you have an RDD with 10 partitions and a job that consists of a single stage, Spark will create 10 tasks, one for each partition.
Example Workflow
Here's an example to illustrate the hierarchy of these components:
```python
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Example App").getOrCreate()

# Load some data
data = [("Alice", 1), ("Bob", 2), ("Cathy", 3)]
df = spark.createDataFrame(data, ["name", "value"])

# Perform some transformations
df_filtered = df.filter(df.value > 1)
df_mapped = df_filtered.rdd.map(lambda x: (x[0], x[1] * 2))

# Trigger an action
result = df_mapped.collect()
print(result)

# Stop the SparkSession
spark.stop()
```
In this example:
- Application: The whole script is a Spark application.
- Job: The `collect()` action triggers a job.
- Stage: The job might be divided into stages based on the transformations applied (`filter` and `map`).
- Task: Each stage is split into tasks, one for each partition of the RDD resulting from the transformations.