Drop rows in PySpark DataFrame with condition

If you're working with a PySpark DataFrame and want to drop rows based on a specific condition, you can use the filter or where methods. In PySpark, where is an alias for filter, so both behave the same way; use whichever reads more naturally to you.

Here's how to drop rows based on a condition in a PySpark DataFrame:

  • First, ensure you've set up PySpark:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dropRowsExample").getOrCreate()
  • Let's create a sample DataFrame:
from pyspark.sql import Row

data = [
    Row(name="Alice", age=25),
    Row(name="Bob", age=30),
    Row(name="Charlie", age=35),
    Row(name="David", age=40)
]

df = spark.createDataFrame(data)
df.show()

This should display:

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
|  David| 40|
+-------+---+
  • Now, let's drop the rows where age is less than 30:

Using the filter method:

filtered_df = df.filter(df.age >= 30)
filtered_df.show()

Using the where method:

filtered_df = df.where(df.age >= 30)
filtered_df.show()

Both methods will produce:

+-------+---+
|   name|age|
+-------+---+
|    Bob| 30|
|Charlie| 35|
|  David| 40|
+-------+---+

Note that the row with Alice, whose age is 25, has been dropped based on the condition.
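The condition itself can be expressed in a few equivalent ways. As a minimal sketch (reusing the df from the example above), you can reference the column with col(), pass a SQL-style expression string, or negate and combine conditions with the ~, &, and | operators:

from pyspark.sql.functions import col

# Same result using col() to reference the column
filtered_df = df.filter(col("age") >= 30)

# Same result using a SQL-style expression string
filtered_df = df.filter("age >= 30")

# Negating a condition with ~ (drop rows where age is less than 30)
filtered_df = df.filter(~(col("age") < 30))

# Combining conditions with & (and) / | (or); wrap each condition in parentheses
filtered_df = df.filter((col("age") >= 30) & (col("name") != "David"))

filtered_df.show()

All of these return a new DataFrame; the original df is left unchanged, so assign the result to a variable if you want to keep working with the filtered data.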

