How to check for a substring in a PySpark dataframe?

If you want to check for the presence of a substring within a column of a PySpark DataFrame, you can use the contains method of the Column class, together with the col function from the pyspark.sql.functions module.

Here's a step-by-step guide on how to do this:

  • First, ensure you've set up PySpark:
pip install pyspark 
  • Create a sample PySpark DataFrame and check for the presence of a substring:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark session
spark = SparkSession.builder.appName("substringCheck").getOrCreate()

# Sample DataFrame
data = [("John Doe",), ("Jane Smith",), ("Sam Brown",)]
df = spark.createDataFrame(data, ["name"])

# Check for the substring 'Smith' in the 'name' column
df_with_substring_check = df.withColumn("has_substring", col("name").contains("Smith"))
df_with_substring_check.show()

The resulting DataFrame will have a new column "has_substring", which will be true if the name contains "Smith" and false otherwise.

Output:

+----------+-------------+
|      name|has_substring|
+----------+-------------+
|  John Doe|        false|
|Jane Smith|         true|
| Sam Brown|        false|
+----------+-------------+

You can replace "Smith" with any substring you want to check for, and "name" with the name of the column you're interested in.
