
Spark Transformations, Actions, Lazy Evaluation, and the DAG


Apache Spark RDD supports two types of Operations:

  • Transformations
  • Actions

A transformation is a function that produces a new RDD from an existing RDD, but the actual dataset is only computed when an action is performed.

Transformations in Spark

Transformations are operations on RDDs that create a new RDD from the existing RDDs. They are lazy operations, meaning they are not executed immediately. Instead, they are recorded as lineage information for the RDD, and the actual execution occurs when an action is performed.

Types of Transformations

  1. Narrow Transformations:
    • These do not require data to be shuffled across partitions.
    • Examples: map(), filter(), flatMap(), union().
    • Characteristics: Data from a single partition is only needed to compute the data for a single partition in the output.
  2. Wide Transformations:
    • These involve shuffling data between partitions.
    • Examples: groupByKey(), reduceByKey(), aggregateByKey(), join(), repartition().
    • Characteristics: Data from multiple partitions is needed to compute the data for a single partition in the output.
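The distinction can be sketched without Spark at all. The following is a plain-Python toy model (not Spark code; partitions are just lists of key-value pairs) showing why a narrow operation needs only one partition, while grouping by key must see all of them:

```python
from collections import defaultdict

# Toy model of partitioned data: two "partitions" of (key, value) pairs.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow: each output partition depends on exactly one input partition,
# so the work can run in place with no data movement.
mapped = [[(k, v * 10) for (k, v) in part] for part in partitions]

# Wide: grouping by key needs values from ALL partitions,
# which is why Spark must shuffle data across the cluster.
grouped = defaultdict(list)
for part in partitions:
    for k, v in part:
        grouped[k].append(v)
```

Here mapped[0] is computed entirely from partitions[0], while grouped["a"] had to pull the values 1 and 3 from two different partitions.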

What are Actions?

Actions are operations that trigger the execution of the transformations to produce a result. When an action is called, Spark examines the lineage graph, optimizes the execution plan, and performs the necessary computations.

  • Examples: collect(), show(), count(), first(), take(n).

Lazy Evaluation

Lazy evaluation is a powerful feature in Spark. It means that Spark will not execute the transformations immediately. Instead, it waits until an action is called. This allows Spark to optimize the execution plan by merging transformations and eliminating unnecessary computations.
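The same idea can be mimicked with Python generators (an analogy, not Spark code): building the pipeline merely records work, and nothing runs until a terminal call plays the role of an action.

```python
log = []

def trace(name, items):
    # Yield items lazily, recording when each step actually executes.
    for x in items:
        log.append(name)
        yield x

source = trace("read", range(3))
doubled = trace("map", (x * 2 for x in source))  # pipeline built, nothing run

assert log == []            # no work has happened yet (lazy)

result = list(doubled)      # the "action": pulls data through the pipeline
assert result == [0, 2, 4]
```

After the `list()` call, `log` shows each element flowing through "read" and then "map", one at a time, just as Spark only materializes data when an action forces it.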

Example: Transformations and Actions in Spark

Python
# Create a SparkSession (the entry point for DataFrame operations)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("TransformationsDemo").getOrCreate()

# Read the Parquet file
df = spark.read.parquet("employee")

# Narrow Transformations
df_sal = df.filter("Salary > 6000")  # Filter employees with salary > 6000
df_exp = df_sal.filter("Experience_Years > 4")  # Filter employees with experience > 4 years
df_age = df_exp.filter("Age < 55")  # Filter employees younger than 55
df_gen = df_age.filter("Gender == 'Male'")  # Filter male employees
df_mumbai = df.filter("Location == 'Mumbai'")  # Filter employees located in Mumbai

# Wide Transformations
df_ordered = df.orderBy("Salary")  # Order employees by salary
df_sal2 = df_ordered.filter("Salary > 6000")  # Filter employees with salary > 6000
df_exp2 = df_sal2.filter("Experience_Years > 3")  # Filter employees with experience > 3 years
df_age2 = df_exp2.filter("Age > 30")  # Filter employees older than 30
df_gen2 = df_age2.filter("Gender == 'Female'")  # Filter female employees
df_pune_ord = df_gen2.filter("Location == 'Pune'")  # Filter employees located in Pune
df_pune_grp = df_pune_ord.groupBy("Gender").sum("Salary")  # Group by gender and sum the salaries

# Actions
df_mumbai.show()  # Show employees located in Mumbai
df_pune_grp.show()  # Show grouped data for Pune employees

  1. Narrow Transformations:
    • These filters (filter()) do not require shuffling data between partitions.
    • Each transformation here results in a new RDD based on the previous one.
  2. Wide Transformations:
    • orderBy() and groupBy() require shuffling data across partitions, making these wide transformations.
    • These transformations are more resource-intensive due to the data movement involved.
  3. Actions:
    • show() is an action that triggers the execution of all preceding transformations.

Understanding the DAG

When an action is called, Spark creates a Directed Acyclic Graph (DAG) representing the sequence of transformations and actions. This DAG is used to optimize the execution plan.
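As a rough sketch of the idea (plain Python with a hypothetical class, not Spark's internals), each transformation can be recorded as a node pointing at its parent, and the action walks that lineage from the source forward to execute it:

```python
# Toy lineage graph: each node remembers its parent; nothing executes
# until collect() (the "action") replays the recorded functions in order.
class Node:
    def __init__(self, fn, parent=None):
        self.fn, self.parent = fn, parent

    def map(self, f):
        return Node(lambda xs: [f(x) for x in xs], self)

    def filter(self, p):
        return Node(lambda xs: [x for x in xs if p(x)], self)

    def collect(self, data):
        chain, node = [], self
        while node:                     # walk lineage back to the source
            chain.append(node.fn)
            node = node.parent
        for fn in reversed(chain):      # execute source -> action in order
            data = fn(data)
        return data

src = Node(lambda xs: xs)
out = src.map(lambda x: x + 1).filter(lambda x: x % 2 == 0)
assert out.collect([1, 2, 3, 4]) == [2, 4]
```

Until `collect()` is called, `out` is only a chain of recorded steps, which is exactly what makes a single optimization pass over the whole plan possible.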

Example DAG for Action 1

When df_mumbai.show() is called:

  • Stage 1:
    • Reads the Parquet file (df).
    • Applies the filter transformation (df_mumbai).

Example DAG for Action 2

When df_pune_grp.show() is called:

  • Stage 1:
    • Reads the Parquet file (df).
  • Stage 2:
    • Applies the filters up to df_pune_ord. Because orderBy is a wide transformation, it introduces its own shuffle boundary ahead of these filters.
  • Stage 3:
    • Applies the groupBy aggregation (df_pune_grp) after a second shuffle.

(The Catalyst optimizer may rearrange or drop an unnecessary sort, so the exact stage count you see in the Spark UI can differ.)

Understanding Jobs

In Apache Spark, a job represents a complete computation initiated by an action. It encompasses all the transformations leading up to the action and is divided into stages and tasks for efficient execution.

Every action call creates a job, and every job has its own DAG.
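A toy illustration of that point (plain Python, not Spark): each action call replays the whole lineage, i.e. runs its own "job". Real Spark behaves the same way unless you cache() the intermediate result.

```python
runs = []

def transformation(xs):
    runs.append("executed")        # record each execution of the lineage
    return [x * 2 for x in xs]

def action(xs):
    # The "action" forces the computation and returns a concrete result.
    return transformation(xs)

data = [1, 2, 3]
first = action(data)               # job 1
second = action(data)              # job 2
assert len(runs) == 2              # two actions -> the lineage ran twice
```

Calling two actions on the same uncached lineage means Spark recomputes it twice, which is why caching frequently reused intermediate results matters.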

Conclusion

Understanding transformations, actions, and lazy evaluation is crucial for efficient Spark programming.

Transformations define the computation, actions trigger the execution, and lazy evaluation allows Spark to optimize the process.
