Results for "spark"
43 / 184 posts
Window Functions in PySpark
Window functions in PySpark allow you to perform operations across a set of rows that are somehow related to the current row. They are useful for tasks lik…
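As a quick illustration, here is a minimal sketch of a ranking window; the dept/salary columns and rows are invented, and an active SparkSession `spark` is assumed:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [("HR", 3000), ("HR", 4000), ("IT", 5000)], ["dept", "salary"])

# Rank rows within each department by salary, highest first
w = Window.partitionBy("dept").orderBy(F.desc("salary"))
df.withColumn("rank", F.rank().over(w)).show()
```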
Databricks, Apache Spark, Data Engineering and Science, etc.
Azure Databricks is a platform on Microsoft Azure that helps with big data analysis and machine learning. It lets you work with large datasets easily and c…
Markdown Cheat Sheet
Basic Syntax These are the elements outlined in John Gruber’s original design document. All Markdown applications support these elements. Element Markdown …
Azure Databricks command
Here are some common commands used in Databricks: %fs : Allows you to interact with the filesystem. For example, %fs ls lists the files in the current dire…
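For reference, a small sketch of the same filesystem listing done programmatically; this assumes a Databricks notebook, where dbutils is predefined:

```python
# %fs ls /databricks-datasets   <- magic-command form (notebook cell)
# Equivalent call through dbutils:
for f in dbutils.fs.ls("/databricks-datasets"):
    print(f.path)
```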
What is Data Ingestion and DataFrame API
Data ingestion: Data ingestion refers to the process of collecting and importing data from various sources into a system or storage environmen…
How to Read and Write file into DataFrame by using Pyspark
# dataframe reader API.... spark.read.format("") \ .option("key", "value") \ .schema(schemavariable) \ .load() # dataframe write API...... df.write.mode(…
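A runnable version of that reader/writer pattern might look like the sketch below; the format, options, schema, and paths are placeholders:

```python
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("name", StringType(), True)])  # assumed schema

# Reader API: format + options + schema + load
df = (spark.read.format("csv")
      .option("header", "true")
      .schema(schema)
      .load("/tmp/input"))          # assumed input path

# Writer API: mode + format + save
(df.write.mode("overwrite")
   .format("parquet")
   .save("/tmp/output"))            # assumed output path
```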
Databricks widgets
Input widgets allow you to add parameters to your notebooks and dashboards. You can add a widget from the Databricks UI or using the widget API. If y…
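A minimal notebook sketch of the widget API; the widget name and default value are illustrative, and dbutils exists only inside Databricks:

```python
# Create a text widget and read its current value
dbutils.widgets.text("table_name", "samples.tpch.customer", "Table")
table = dbutils.widgets.get("table_name")
df = spark.table(table)
```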
PySpark Data manipulation
Select Table into dataframe: df = spark.read.table(tableName="samples.tpch.customer").limit(5) df = spark.table(tableName="samples.tpch.customer").limit(5)…
How to Read and Write CSV file into DataFrame by using Pyspark
PySpark Read CSV File into DataFrame: reading CSV files from disk using PySpark offers a versatile and efficient approach to data ingestion and processing.…
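For example, a typical read with header and schema inference; the paths are placeholders and an active SparkSession is assumed:

```python
df = (spark.read
      .option("header", "true")       # first line holds column names
      .option("inferSchema", "true")  # sample the file to guess types
      .csv("/tmp/data.csv"))          # assumed input path

df.write.option("header", "true").mode("overwrite").csv("/tmp/out")
```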
Join in PySpark
PySpark Join is used to combine two DataFrames and by chaining these you can join multiple DataFrames. # Syntax join(self, other, on=None, how=None) …
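A minimal sketch of that syntax with two tiny made-up DataFrames:

```python
emp = spark.createDataFrame([(1, "Ana"), (2, "Raj")], ["id", "name"])
dept = spark.createDataFrame([(1, "HR"), (2, "IT")], ["id", "dept"])

# on = join column(s), how = join type ("inner", "left", "outer", ...)
emp.join(dept, on="id", how="inner").show()
```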
How to use Window Functions in PySpark
Let’s break down and explain each PySpark window function with examples using your code and dataset. I’ll categorize the functions into thre…
What is Data tables(Fact Tables) vs Lookup tables(Dimension Tables)
Data Tables(Fact Tables) Purpose : Store detailed, raw data. Structure : Multiple columns (attributes) and rows (records). Example : A table with order det…
Spark SQL useful command
Spark SQL provides a variety of commands for managing databases, tables, and performing SQL operations. CREATE DATABASE IF NOT EXISTS demo; SHOW DATABASES;…
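The same statements can also be issued from PySpark via spark.sql(); a short sketch:

```python
spark.sql("CREATE DATABASE IF NOT EXISTS demo")
spark.sql("SHOW DATABASES").show()
spark.sql("SHOW TABLES IN demo").show()
```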
What is Managed and External table in Spark
In Apache Spark, both Managed and External tables are used to store data. However, there are significant differences in how Spark manages the data for …
Schema and Handling Corrupt data in PySpark
A schema in PySpark (and generally in data processing) defines the structure of a DataFrame, including the names and data types of each column. It serves a…
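One common pattern, sketched below under the assumption of a CSV source at a placeholder path, is an explicit schema plus PERMISSIVE mode, with unparsable rows captured in a dedicated column:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True),  # catches bad rows
])

df = (spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .csv("/tmp/data.csv"))  # assumed path
```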
What is cluster in Spark
What is a cluster: In computing, a cluster refers to a collection of interconnected computers that work together as a single system. These computers, often…
What is Big Data
Big Data refers to extremely large datasets that are too complex and voluminous to be processed and analyzed using traditional data processing tools and te…
Spark Transformations, Actions, Lazy Evaluation, and DAG
Apache Spark RDD supports two types of Operations: Transformations Actions A Transformation is a function that produces new RDD from the existing RDDs but …
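A tiny sketch of the difference: the map() below is only recorded in the DAG, and nothing executes until the collect() action runs:

```python
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

doubled = rdd.map(lambda x: x * 2)  # transformation: builds lineage, runs nothing

print(doubled.collect())            # action: triggers execution -> [2, 4, 6, 8]
```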
What is catalog in Spark
In Apache Spark, the catalog refers to the internal management system that keeps track of all the metadata related to tables, databases, functions, and oth…
What is Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets (RDDs) are the fundamental data structure of Apache Spark. They represent an immutable, distributed collection of objects that can be proc…
Spark session vs Spark context
In Apache Spark, SparkSession and SparkContext are both essential components, but they serve different purposes and have different scopes. Here's a detaile…
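In code, the relationship looks roughly like this sketch: the session is the entry point, and the context lives inside it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()
sc = spark.sparkContext          # the SparkContext wrapped by the session

df = spark.range(3)              # DataFrame API via SparkSession
rdd = sc.parallelize([1, 2, 3])  # RDD API via SparkContext
```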
Application,Job,Stage,Task in Spark
In Apache Spark, the execution of a program is broken down into multiple levels of granularity: applications, jobs, stages, and tasks. Understanding these …
PartitionBy() in PySpark
partitionBy() is a function that is used when writing a DataFrame to disk. This function is part of the pyspark.sql.DataFrameWriter class. …
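A short sketch; the df, its country column, and the output path are assumptions:

```python
# One subdirectory per distinct country value, e.g. country=IN/, country=US/
(df.write
   .partitionBy("country")
   .mode("overwrite")
   .parquet("/tmp/output_by_country"))
```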
Understanding DataFrames in PySpark
DataFrames are an important data structure in PySpark. They help in handling structured and semi-structured data efficiently. DataFrames are like tables in…
Understanding show() in PySpark
In PySpark, the .show() function is used to display DataFrame content in a tabular format. Syntax of show() DataFrame.show(n=20, truncate=True, vertical=Fa…
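A quick illustration with a throwaway DataFrame:

```python
df = spark.createDataFrame([("Ana", 30), ("Raj", 25)], ["name", "age"])

df.show()                     # default: first 20 rows, long values truncated
df.show(n=1, truncate=False)  # first row only, full column values
df.show(vertical=True)        # one column-per-line block per row
```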
Complex Data(StructType, ArrayType, and MapType) Types in PySpark
Let’s break down PySpark's complex data types (StructType, ArrayType, and MapType) in a simple and clear way. We'll go over: What they are When to…
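As a compact sketch of all three, with field names and sample data invented for the example:

```python
from pyspark.sql.types import (StructType, StructField, StringType,
                               ArrayType, MapType)

schema = StructType([
    StructField("name", StringType()),                         # plain field
    StructField("tags", ArrayType(StringType())),              # list of strings
    StructField("props", MapType(StringType(), StringType())), # key/value pairs
])

df = spark.createDataFrame([("a", ["x", "y"], {"k": "v"})], schema)
df.printSchema()
```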
select() Function in PySpark
In PySpark, select() function is used to select single, multiple, column by index, all columns from the list and the nested columns from a DataFrame, PySpa…
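A few representative forms, assuming a df with name and age columns:

```python
from pyspark.sql import functions as F

df.select("name").show()                            # single column
df.select("name", "age").show()                     # multiple columns
df.select(df.columns[0]).show()                     # column by index
df.select((F.col("age") + 1).alias("age1")).show()  # derived column
```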
Collect() in PySpark
PySpark collect() Function – The collect() function in PySpark is used to retrieve all the rows of a DataFrame (or RDD) from the distributed cluster back t…
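A minimal sketch; since collect() pulls everything onto the driver, it is safest on small or pre-limited results:

```python
rows = df.limit(10).collect()   # list of Row objects on the driver
for row in rows:
    print(row["name"])          # assumes a "name" column
```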
withColumn() in Pyspark
PySpark withColumn() is a transformation function of DataFrame which is used to change the value, convert the datatype of an existing column, create a new …
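All three uses in one short sketch, assuming an age column:

```python
from pyspark.sql import functions as F

df = df.withColumn("age", F.col("age").cast("int"))  # convert datatype
df = df.withColumn("age", F.col("age") + 1)          # change the value
df = df.withColumn("is_adult", F.col("age") >= 18)   # create a new column
```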
where() & filter() in PySpark
The filter() function in PySpark is used to create a new DataFrame by selecting rows that meet a specified condition or SQL expression. Alternatively, the …
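Both spellings side by side, assuming an age column:

```python
from pyspark.sql import functions as F

df.filter(F.col("age") > 21).show()  # column-expression condition
df.where("age > 21").show()          # equivalent SQL-expression condition
```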
Groupby in Pyspark
count() counts the number of rows per group, e.g. df.groupBy("col").count(); mean() returns the average value per group, e.g. df.groupBy("c…
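A small sketch combining those aggregates; the columns are illustrative:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("HR", 3000), ("HR", 4000), ("IT", 5000)], ["dept", "salary"])

df.groupBy("dept").count().show()
df.groupBy("dept").agg(F.mean("salary").alias("avg_salary"),
                       F.max("salary").alias("max_salary")).show()
```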
drop(), dropDuplicates(), and distinct() in PySpark
🔹 1. drop() – Removing Columns The drop() function is used to remove one or more columns from a DataFrame. 👉 Example: Removing a Single Column from pyspa…
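All three in one sketch with invented data:

```python
df = spark.createDataFrame(
    [("Ana", 30), ("Ana", 30), ("Raj", 25)], ["name", "age"])

df.drop("age").show()               # remove a column
df.dropDuplicates(["name"]).show()  # de-duplicate on selected columns
df.distinct().show()                # keep fully distinct rows only
```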
Applying Functions in PySpark
PySpark, the Python API for Apache Spark, provides multiple ways to apply functions to DataFrame columns. This flexibility allows data engineers and analys…
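One of those ways is a Python UDF; a minimal sketch, with the function and column names invented:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def shout(s):
    # Upper-case the value, passing nulls through unchanged
    return s.upper() if s else s

df = spark.createDataFrame([("ana",), ("raj",)], ["name"])
df.withColumn("name_upper", shout("name")).show()
```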
Joins in PySpark
They allow us to combine two or more DataFrames based on a common column, enabling efficient data processing and analysis. 1. PySpark Join Types Below are …
orderBy() and sort() in PySpark
PySpark provides two functions, sort() and orderBy() , to arrange data in a structured manner. 1. Understanding sort() in PySpark from pyspark.sql.function…
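Both functions shown briefly, assuming an age column:

```python
from pyspark.sql import functions as F

df.sort("age").show()                   # ascending by default
df.orderBy(F.col("age").desc()).show()  # explicit descending order
```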
union(), unionAll(), and unionByName() in PySpark
Here's an explanation of union() , unionAll() , and unionByName() in PySpark along with appropriate examples. 1. union() The union() method is u…
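A compact sketch of the positional vs. by-name behavior:

```python
df1 = spark.createDataFrame([(1, "a")], ["id", "val"])
df2 = spark.createDataFrame([("b", 2)], ["val", "id"])  # same columns, swapped

df1.union(spark.createDataFrame([(2, "b")], ["id", "val"])).show()  # by position
df1.unionByName(df2).show()  # matches columns by name, so the swap is safe
```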
PySpark Built-in Functions
These functions are commonly used with groupBy() , agg() , or select() to compute things like sum, average, max, min, count, etc. PySpark functions come fr…
PySpark SQL Date and Timestamp Functions
🔧 Setup First (Optional for Reference) from pyspark.sql import functions as F from pyspark.sql import types as T data = … df = spark.createDataFrame(data, …)…
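A small self-contained sketch of a few of these functions, with an invented date value:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([("2024-01-15",)], ["d"])

df.select(
    F.to_date("d").alias("as_date"),
    F.current_date().alias("today"),
    F.datediff(F.current_date(), F.to_date("d")).alias("days_since"),
).show()
```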
PySpark Pivot and Unpivot DataFrame
✅ What is Pivot and Unpivot? Pivot = Convert rows into columns Unpivot = Convert columns into rows 🌀 Sample DataFrame Let’s start with a small DataFrame t…
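A minimal pivot sketch, with data invented for the example:

```python
from pyspark.sql import functions as F

data = [("2024", "US", 100), ("2024", "IN", 200), ("2025", "US", 150)]
df = spark.createDataFrame(data, ["year", "country", "sales"])

# Pivot: one output column per distinct country
df.groupBy("year").pivot("country").agg(F.sum("sales")).show()
```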
Working with NULL/None Values in PySpark
🔍 What's fillna() or fill() in PySpark? In PySpark, both fillna() and fill() are used to replace null or missing values in a DataFrame. Both fillna() and …
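Both spellings in a short sketch with invented nullable data:

```python
df = spark.createDataFrame(
    [("Ana", None), (None, 25)], "name string, age int")

df.fillna(0, subset=["age"]).show()               # one default for one column
df.na.fill({"age": 0, "name": "unknown"}).show()  # per-column defaults
```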
PySpark Convert String to Array Column
To convert a string column (StringType) to an array column (ArrayType) in PySpark, you can use the split() function from the pyspark.sql.…
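A short sketch of that conversion:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([("a,b,c",)], ["letters"])

# split(column, pattern) produces an ArrayType(StringType()) column
df.withColumn("letters_arr",
              F.split(F.col("letters"), ",")).show(truncate=False)
```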
concat() and concat_ws() in PySpark
In PySpark, both concat() and concat_ws() are used to combine multiple columns into a single string column. ✅ concat() – Combines columns without any delim…
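A small sketch showing both side by side:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([("John", "Doe")], ["first", "last"])

df.select(
    F.concat("first", "last").alias("no_delimiter"),    # "JohnDoe"
    F.concat_ws(" ", "first", "last").alias("spaced"),  # "John Doe"
).show()
```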
substring() in PySpark
📌 What is substring() ? The substring() function in PySpark is used to extract a portion of a string from a column in a DataFrame. It is part of the PySpa…
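A short sketch; note the 1-based start position:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([("20240115",)], ["yyyymmdd"])

# substring(column, pos, len) with pos starting at 1
df.select(F.substring("yyyymmdd", 1, 4).alias("year")).show()
```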