Results for "Apache Spark"
9 / 184 posts
Databricks, Apache Spark, Data Engineering and Science etc.
Azure Databricks is a platform on Microsoft Azure that helps with big data analysis and machine learning. It lets you work with large datasets easily and c…
What is Managed and External table in Spark
In Apache Spark, both Managed and External tables are used to store the data. However, there are significant differences in how Spark manages the data for …
Spark Transformations, Actions and Lazy Evaluation and DAG.
Apache Spark RDD supports two types of Operations: Transformations Actions A Transformation is a function that produces new RDD from the existing RDDs but …
What is catalog in Spark
In Apache Spark, the catalog refers to the internal management system that keeps track of all the metadata related to tables, databases, functions, and oth…
What is Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets (RDDs) are a data structure of Apache Spark. They represent an immutable, distributed collection of objects that can be proc…
Spark session vs Spark context
In Apache Spark, SparkSession and SparkContext are both essential components, but they serve different purposes and have different scopes. Here's a detaile…
Application,Job,Stage,Task in Spark
In Apache Spark, the execution of a program is broken down into multiple levels of granularity: applications, jobs, stages, and tasks. Understanding these …
Applying Functions in PySpark
PySpark, the Python API for Apache Spark, provides multiple ways to apply functions to DataFrame columns. This flexibility allows data engineers and analys…
PySpark Built-in Functions
These functions are commonly used with groupBy() , agg() , or select() to compute things like sum, average, max, min, count, etc. PySpark functions come fr…