Drop rows in PySpark DataFrame with condition

If you're working with a PySpark DataFrame and want to drop rows based on a specific condition, you can use the filter or where methods. In PySpark, where is an alias for filter, so both behave the same way; use whichever reads more naturally to you.

Here's how to drop rows based on a condition in a PySpark DataFrame:

  • First, ensure you've set up PySpark:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dropRowsExample").getOrCreate()
  • Let's create a sample DataFrame:
from pyspark.sql import Row

data = [
    Row(name="Alice", age=25),
    Row(name="Bob", age=30),
    Row(name="Charlie", age=35),
    Row(name="David", age=40)
]

df = spark.createDataFrame(data)
df.show()

This should display:

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
|  David| 40|
+-------+---+
  • Now, let's drop the rows where age is less than 30:

Using the filter method:

filtered_df = df.filter(df.age >= 30)
filtered_df.show()

Using the where method:

filtered_df = df.where(df.age >= 30)
filtered_df.show()

Both methods will produce:

+-------+---+
|   name|age|
+-------+---+
|    Bob| 30|
|Charlie| 35|
|  David| 40|
+-------+---+

Note that the row with Alice, whose age is 25, has been dropped based on the condition.
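The condition itself can be expressed in a few equivalent ways. As a minimal sketch (reusing the df from the example above), you can reference the column with col(), pass a SQL-style expression string, or negate and combine conditions with the ~, &, and | operators:

from pyspark.sql.functions import col

# Same result using col() to reference the column
filtered_df = df.filter(col("age") >= 30)

# Same result using a SQL-style expression string
filtered_df = df.filter("age >= 30")

# Negating a condition with ~ (drop rows where age is less than 30)
filtered_df = df.filter(~(col("age") < 30))

# Combining conditions with & (and) / | (or); wrap each condition in parentheses
filtered_df = df.filter((col("age") >= 30) & (col("name") != "David"))

filtered_df.show()

All of these return a new DataFrame; the original df is left unchanged, so assign the result to a variable if you want to keep working with the filtered data.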

