
select() Function in PySpark


In PySpark, the select() function is used to select single, multiple, or nested columns, columns by index, or all columns from a list in a DataFrame. Because select() is a transformation, it returns a new DataFrame containing only the selected columns.

Python
# Create DataFrame with nested columns
# (sample rows with illustrative values)
data = [(("James", "Smith"), "OH", "M"),
        (("Anna", "Rose"), "NY", "F")]

from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
     StructField('name', StructType([
          StructField('firstname', StringType(), True),
          StructField('lastname', StringType(), True)
          ])),
     StructField('state', StringType(), True),
     StructField('gender', StringType(), True)
     ])
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)  # shows all columns without truncating values
Python
# Select columns by different ways
df.select("name.firstname","name.lastname").show()
df.select(df.name.firstname,df.name.lastname).show()
df.select(df["name.firstname"], df["name.lastname"]).show()

# By using col() function
from pyspark.sql.functions import col
df.select(col("name.firstname"),col("name").lastname).show()
Python
# Select columns by regular expression
df.select(df.colRegex("`^.*name*`")).show()
Python
df.select('*').show()
Python
# Select all columns from a list
df.select([col for col in df.columns]).show()
