What Are Managed and External Tables in Spark?

In Apache Spark, both managed and external tables are used to store data. However, they differ significantly in how Spark manages the data behind them:

Managed Tables

  1. Ownership and Management:
    • Spark manages both the metadata and the data for managed tables.
    • When you create a managed table, Spark takes care of storing the table data in a default location, typically within a predefined warehouse directory.
  2. Data Location:
    • By default, data for managed tables is stored in the warehouse directory specified by the spark.sql.warehouse.dir configuration.
  3. Lifecycle:
    • When a managed table is dropped, both the metadata and the data are deleted by Spark.
SQL
CREATE TABLE managed_table (
    id INT,
    name STRING
)
USING PARQUET;
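
The lifecycle difference is easiest to see by dropping the table. A minimal sketch (the inserted values are illustrative):

SQL
-- Spark writes the Parquet files under spark.sql.warehouse.dir.
INSERT INTO managed_table VALUES (1, 'alice');

-- Dropping the table removes the catalog entry AND the data files.
DROP TABLE managed_table;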

External Tables

  1. Ownership and Management:
    • Spark only manages the metadata for external tables. The data is managed externally, i.e., it resides outside the default Spark warehouse directory.
  2. Data Location:
    • You specify the location of the data when creating an external table. This data can be in HDFS, S3, or any other storage system supported by Spark.
  3. Lifecycle:
    • When an external table is dropped, Spark only deletes the metadata; the actual data remains intact.
SQL
CREATE TABLE external_table (
    id INT,
    name STRING
)
USING PARQUET
LOCATION '/path/to/external/data';
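
By contrast, dropping an external table leaves the files in place, so the same data can be re-registered later. A sketch, reusing the path from the definition above:

SQL
-- Only the catalog entry is removed; the Parquet files under
-- /path/to/external/data are left untouched.
DROP TABLE external_table;

-- A new table definition can point at the same location again.
CREATE TABLE external_table_restored (
    id INT,
    name STRING
)
USING PARQUET
LOCATION '/path/to/external/data';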
Comparison

Aspect          | Managed Table                                      | External Table
Data Storage    | Data is stored in the Spark warehouse directory.   | Data is stored at a user-specified location.
Data Management | Spark manages both data and metadata.              | Spark manages only the metadata; the user manages the data.
Data Deletion   | Dropping the table deletes both data and metadata. | Dropping the table deletes only the metadata; the data remains.
When to Use     | When Spark should handle the complete lifecycle of the table and its data; suitable for temporary data or data fully controlled by Spark. | When existing data at a specific location should be queried with Spark; ideal for integrating external data sources without moving the data.
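
To check which kind a given table is, DESCRIBE TABLE EXTENDED reports the table type along with its storage details (shown here for the managed table defined earlier):

SQL
-- The output includes a "Type" row (MANAGED or EXTERNAL) and
-- a "Location" row with the storage path.
DESCRIBE TABLE EXTENDED managed_table;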

Metadata Summary

Metadata is data that describes other data. In databases and data management, metadata includes information about:

  1. Schema Information:
    • Describes tables, columns, data types, constraints, and default values.
    • Includes information about views and virtual tables.
  2. Data Relationships:
    • Details how tables are related (e.g., one-to-many, many-to-many).
    • Information about indexes used for query performance.
  3. Permissions and Security:
    • Specifies user roles and access permissions.
    • Describes security policies, such as encryption and data masking.
  4. Data Origin and Lineage:
    • Provides information about data provenance and transformations.
    • Tracks data lineage for auditing and compliance.
  5. Storage Information:
    • For external tables, includes the file locations and storage formats.
    • Details about file formats like CSV, Parquet, ORC.
  6. Statistics and Performance Metrics:
    • Includes statistical information about the data, such as row counts and value distributions.
    • Provides performance metrics for query optimization.
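
Much of this metadata can be inspected or refreshed directly from Spark SQL, for example:

SQL
-- Schema metadata: column names, data types, and comments.
DESCRIBE TABLE external_table;

-- Statistics metadata (row counts, sizes) for the query optimizer.
ANALYZE TABLE external_table COMPUTE STATISTICS;

-- Storage metadata: the full CREATE statement, including format
-- and location.
SHOW CREATE TABLE external_table;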

Importance of Metadata

  • Data Management: Organizes and retrieves data efficiently.
  • Query Optimization: Helps in creating efficient query execution plans.
  • Data Governance: Ensures compliance with regulations and governance policies.
  • Data Integration: Facilitates integration across different data sources.
