
select() Function in PySpark


In PySpark, the select() function is used to select a single column, multiple columns, columns by index, all columns from a list, and nested columns from a DataFrame. Because select() is a transformation, it returns a new DataFrame containing only the selected columns.

Python
# Create DataFrame with nested columns
# (sample rows for illustration)
data = [
    (("James", None, "Smith"), "OH", "M"),
    (("Anna", "Rose", ""), "NY", "F"),
    (("Julia", "", "Williams"), "OH", "F"),
]

from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('state', StringType(), True),
    StructField('gender', StringType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)  # shows all columns
Python
# Select nested columns in different ways
df.select("name.firstname", "name.lastname").show()
df.select(df.name.firstname, df.name.lastname).show()
df.select(df["name.firstname"], df["name.lastname"]).show()

# By using the col() function
from pyspark.sql.functions import col
df.select(col("name.firstname"), col("name").lastname).show()
Python
# Select columns by regular expression
df.select(df.colRegex("`^.*name*`")).show()
Python
# Select all columns
df.select('*').show()
Python
# Select all columns from the list of column names
df.select([col for col in df.columns]).show()

