
PySpark Data Manipulation


Select a table into a DataFrame:

Python
# Through the DataFrameReader API
df = spark.read.table(tableName="samples.tpch.customer").limit(5)

# Shortcut directly on the session
df = spark.table(tableName="samples.tpch.customer").limit(5)

# Through a SQL query
df = spark.sql('''SELECT * FROM samples.tpch.customer''').limit(5)

%sql
SELECT * FROM samples.tpch.customer limit 5

How to select columns

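Python
from pyspark.sql import functions as F  # needed for F.col below

# Each line is an alternative, not a sequence of steps
df = df.selectExpr("*")                      # all columns
df = df.selectExpr("ColName1", "ColName2")   # by name, parsed as SQL expressions
df = df.select("ColName1", "ColName2")       # by name
df.select(df['patientid'], df['2018_hospitalid']).show(1)         # via DataFrame column references
df.select(F.col("patientid"), F.col("2018_hospitalid")).show(1)   # via F.col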

How to filter data: use where or filter; where is an alias for filter, so both behave the same

Python
df = df.filter((df.c_custkey == 412446) & (df.c_nationkey == 20))

df = df.filter((df["speciesname"] == "Dog") & (df["hospitalid"] == 153))

df = df.filter((F.col("speciesname") == "Dog") & (F.col("hospitalid") == 153))

table2019 = df.where('''speciesname="Dog" and hospitalid=153''')
