Results for "spark"

43 / 184 posts

Window Functions in PySpark

Window functions in PySpark allow you to perform operations across a set of rows that are related to the current row. They are useful for tasks lik…

Mar 19, 2026 2 min read

Databricks, Apache Spark, Data Engineering and Science etc.

Azure Databricks is a platform on Microsoft Azure that helps with big data analysis and machine learning. It lets you work with large datasets easily and c…

Analytics Apache Spark azure
Mar 19, 2026 3 min read

Markdown Cheat Sheet

Basic Syntax These are the elements outlined in John Gruber’s original design document. All Markdown applications support these elements. Element Markdown …

blogging markdown notes
Mar 19, 2026 1 min read

Azure Databricks command

Here are some common commands used in Databricks: %fs: Allows you to interact with the filesystem. For example, %fs ls lists the files in the current dire…

PySpark
Mar 19, 2026 1 min read

What is Data Ingestion and DataFrame API

Data ingestion: Data ingestion refers to the process of collecting and importing data from various sources into a system or storage environmen…

Data ingestion Data Load etl
Mar 19, 2026 4 min read

How to Read and Write file into DataFrame by using Pyspark

# DataFrame reader API spark.read.format("") \ .option("key", "value") \ .schema(schemavariable) \ .load() # DataFrame writer API df.write.mode(…

PySpark
Mar 19, 2026 3 min read

Databricks widgets

Input widgets allow you to add parameters to your notebooks and dashboards. You can add a widget from the Databricks UI or using the widget API. If y…

comma separate database Dropdown
Mar 19, 2026 1 min read

PySpark Data manipulation

Select Table into dataframe: df = spark.read.table(tableName="samples.tpch.customer").limit(5) df = spark.table(tableName="samples.tpch.customer").limit(5)…

PySpark
Mar 19, 2026 1 min read

How to Read and Write CSV file into DataFrame by using Pyspark

PySpark Read CSV File into DataFrame: reading CSV files from disk using PySpark offers a versatile and efficient approach to data ingestion and processing.…

csv data-science Pandas
Mar 19, 2026 2 min read

Join in PySpark

PySpark join() is used to combine two DataFrames; by chaining joins you can combine multiple DataFrames. # Syntax join(self, other, on=None, how=None) …

data-analysis data-science machine-learning
Mar 19, 2026 1 min read

How to use Window Functions in PySpark

This post explains each PySpark window function with examples based on a sample dataset. The functions are categorized into thre…

data-science finance machine-learning
Mar 19, 2026 3 min read

What is Data tables(Fact Tables) vs Lookup tables(Dimension Tables)

Data Tables (Fact Tables) Purpose: Store detailed, raw data. Structure: Multiple columns (attributes) and rows (records). Example: A table with order det…

DAX Power BI PySpark
Mar 19, 2026 1 min read

Spark SQL useful command

Spark SQL provides a variety of commands for managing databases, tables, and performing SQL operations. CREATE DATABASE IF NOT EXISTS demo; SHOW DATABASES;…

create table database spark sql
Mar 19, 2026 3 min read

What is Managed and External table in Spark

In Apache Spark, both Managed and External tables are used to store the data. However, there are significant differences in how Spark manages the data for …

azure data-engineering data-science
Mar 19, 2026 3 min read

Schema and Handling Corrupt data in PySpark

A schema in PySpark (and generally in data processing) defines the structure of a DataFrame, including the names and data types of each column. It serves a…

comma separate data-engineering database
Mar 19, 2026 4 min read

What is cluster in Spark

What is a cluster: In computing, a cluster is a collection of interconnected computers that work together as a single system. These computers, often…

Cluster SQL sql-server
Mar 19, 2026 3 min read

What is Big Data

Big Data refers to extremely large datasets that are too complex and voluminous to be processed and analyzed using traditional data processing tools and te…

Big Data PySpark Data World
Mar 19, 2026 2 min read

Spark Transformations, Actions and Lazy Evaluation and DAG.

Apache Spark RDDs support two types of operations: transformations and actions. A transformation is a function that produces a new RDD from existing RDDs but …

Apache Spark azure cloud
Mar 19, 2026 4 min read

What is catalog in Spark

In Apache Spark, the catalog refers to the internal management system that keeps track of all the metadata related to tables, databases, functions, and oth…

azure database databricks
Mar 19, 2026 2 min read

What is Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs) are a data structure of Apache Spark. They represent an immutable, distributed collection of objects that can be proc…

ai artificial-intelligence data-engineering
Mar 19, 2026 3 min read

Spark session vs Spark context

In Apache Spark, SparkSession and SparkContext are both essential components, but they serve different purposes and have different scopes. Here's a detaile…

data-science Pandas Python
Mar 19, 2026 3 min read

Application,Job,Stage,Task in Spark

In Apache Spark, the execution of a program is broken down into multiple levels of granularity: applications, jobs, stages, and tasks. Understanding these …

PySpark
Mar 19, 2026 3 min read

PartitionBy() in PySpark

partitionBy() is a function used when writing a DataFrame to disk. It is part of the pyspark.sql.DataFrameWriter class. …

create table Data Skew partitionBy
Mar 19, 2026 3 min read

Understanding DataFrames in PySpark

DataFrames are an important data structure in PySpark. They help in handling structured and semi-structured data efficiently. DataFrames are like tables in…

Create Dataframe PySpark
Mar 19, 2026 2 min read

Understanding show() in PySpark

In PySpark, the .show() function is used to display DataFrame content in a tabular format. Syntax of show() DataFrame.show(n=20, truncate=True, vertical=Fa…

show PySpark
Mar 19, 2026 2 min read

Complex Data(StructType, ArrayType, and MapType) Types in PySpark

This post breaks down PySpark's complex data types (StructType, ArrayType, and MapType) in a simple and clear way. It covers: what they are, when to…

Dataframe StructField StructType
Mar 19, 2026 4 min read

select() Function in PySpark

In PySpark, the select() function is used to select a single column, multiple columns, columns by index, all columns from a list, and nested columns from a DataFrame. PySpa…

column select Select select function in pyspark
Mar 19, 2026 1 min read

Collect() in PySpark

PySpark collect() Function – The collect() function in PySpark is used to retrieve all the rows of a DataFrame (or RDD) from the distributed cluster back t…

collect PySpark
Mar 19, 2026 2 min read

withColumn() in Pyspark

PySpark withColumn() is a transformation function of DataFrame which is used to change the value, convert the datatype of an existing column, create a new …

withColumn PySpark
Mar 19, 2026 1 min read

where() & filter() in PySpark

The filter() function in PySpark is used to create a new DataFrame by selecting rows that meet a specified condition or SQL expression. Alternatively, the …

arra_contain endwith Filter
Mar 19, 2026 1 min read

Groupby in Pyspark

Function | Description | Example
count() | Counts the number of rows per group | df.groupBy("col").count()
mean() | Returns the average value per group | df.groupBy("c…

Agg Groupby PySpark
Mar 19, 2026 1 min read

drop(), dropDuplicates(), and distinct() in PySpark

🔹 1. drop() – Removing Columns The drop() function is used to remove one or more columns from a DataFrame. 👉 Example: Removing a Single Column from pyspa…

distinct drop dropDuplicates
Mar 19, 2026 2 min read

Applying Functions in PySpark

PySpark, the Python API for Apache Spark, provides multiple ways to apply functions to DataFrame columns. This flexibility allows data engineers and analys…

apply Apply function lower
Mar 19, 2026 2 min read

Joins in PySpark

Joins allow us to combine two or more DataFrames based on a common column, enabling efficient data processing and analysis. 1. PySpark Join Types Below are …

cross join inner join Joins
Mar 19, 2026 3 min read

orderBy() and sort() in PySpark

PySpark provides two functions, sort() and orderBy(), to arrange data in a structured manner. 1. Understanding sort() in PySpark from pyspark.sql.function…

OrderBy Sort PySpark
Mar 19, 2026 1 min read

union(), unionAll(), and unionByName() in PySpark

This post explains union(), unionAll(), and unionByName() in PySpark, with examples. 1. union() The union() method is u…

UNION unionAll unionByName
Mar 19, 2026 2 min read

PySpark Built-in Functions

These functions are commonly used with groupBy(), agg(), or select() to compute things like sum, average, max, min, count, etc. PySpark functions come fr…

Aggregate apache spark for beginners big data tutorial
Mar 19, 2026 2 min read

PySpark SQL Date and Timestamp Functions

🔧 Setup First (Optional for Reference) from pyspark.sql import functions as F from pyspark.sql import types as T data = df = spark.createDataFrame(data, )…

Date Datetime PySpark SQL Date and Timestamp Functions
Mar 19, 2026 2 min read

PySpark Pivot and Unpivot DataFrame

✅ What is Pivot and Unpivot? Pivot = Convert rows into columns Unpivot = Convert columns into rows 🌀 Sample DataFrame Let’s start with a small DataFrame t…

pivot Unpivot PySpark
Mar 19, 2026 2 min read

Working with NULL/None Values in PySpark

🔍 What's fillna() or fill() in PySpark? In PySpark, both fillna() and fill() are used to replace null or missing values in a DataFrame. Both fillna() and …

dropna dropna() fill
Mar 19, 2026 1 min read

PySpark Convert String to Array Column

To convert a string column (StringType) to an array column (ArrayType) in PySpark, you can use the split() function from the pyspark.sql.…

PySpark Convert String to Array Column SPLIT PySpark
Mar 19, 2026 1 min read

concat() and concat_ws() in PySpark

In PySpark, both concat() and concat_ws() are used to combine multiple columns into a single string column. ✅ concat() – Combines columns without any delim…

Combines columns concat concat_ws
Mar 19, 2026 2 min read

substring() in PySpark

📌 What is substring()? The substring() function in PySpark is used to extract a portion of a string from a column in a DataFrame. It is part of the PySpa…

substr substring substring() vs substr()
Mar 19, 2026 2 min read