In Apache Spark, the execution of a program is broken down into multiple levels of granularity: applications, jobs, stages, and tasks. Understanding these concepts is crucial for optimizing and managing Spark workloads.
Application
Definition:
- A Spark application is a self-contained computation process that consists of a driver program and a set of executor processes on a cluster. It runs user code to perform a series of computations.
Components:
- Driver Program:
- The main program that coordinates all the workers and the overall computation.
- Contains the `SparkSession` or `SparkContext`.
- Translates user code into jobs, stages, and tasks.
- Executors:
- Worker nodes responsible for executing the tasks assigned by the driver.
- Each application has its own executors, which are not shared between applications.
Example:
- A Spark application might be a standalone application written in Python, Scala, or Java that performs data processing using Spark APIs.
Job
Definition:
- A job is a parallel computation consisting of multiple tasks that get spawned in response to an action (such as `save`, `count`, or `collect`) on an RDD, DataFrame, or Dataset.
Characteristics:
- Triggered by actions.
- Each action in the Spark program creates a new job.
- The job is divided into multiple stages.
Example:
- If you run an action like `df.count()`, it triggers a job to compute the number of rows in the DataFrame.
Stage
Definition:
- A stage is a set of tasks that can run in parallel without moving data between partitions. Spark places a stage boundary wherever a wide transformation requires a shuffle, so each stage corresponds to a run of narrow transformations up to the next shuffle.
Characteristics:
- Each stage represents a step in the job execution plan.
- Spark breaks a job into stages based on the transformations applied to the data.
- Narrow transformations (e.g., `map`, `filter`) are grouped into a single stage.
- Wide transformations (e.g., `reduceByKey`, `groupByKey`) cause a shuffle and trigger the start of a new stage.
Example:
- If a job involves both a `map` and a `reduceByKey` operation, Spark will divide the job into two stages: one for the `map` operation and one for the `reduceByKey` operation.
Task
Definition:
- A task is the smallest unit of work in Spark. It is a single operation (such as applying a function to a partition of an RDD) sent to an executor.
Characteristics:
- Each stage is divided into many tasks, where each task is executed on a partition of the data.
- Tasks within a stage are executed in parallel across the cluster.
Example:
- If you have an RDD with 10 partitions and a job that consists of a single stage, Spark will create 10 tasks, one for each partition.
Example Workflow
Here's an example to illustrate the hierarchy of these components:
```python
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Example App").getOrCreate()

# Load some data
data = [("Alice", 1), ("Bob", 2), ("Cathy", 3)]
df = spark.createDataFrame(data, ["name", "value"])

# Perform some transformations
df_filtered = df.filter(df.value > 1)
df_mapped = df_filtered.rdd.map(lambda x: (x[0], x[1] * 2))

# Trigger an action
result = df_mapped.collect()
print(result)

# Stop the SparkSession
spark.stop()
```
In this example:
- Application: The whole script is a Spark application.
- Job: The `collect()` action triggers a job.
- Stage: The job might be divided into stages based on the transformations applied (`filter` and `map`).
- Task: Each stage is split into tasks, one for each partition of the RDD resulting from the transformations.