Filtering a pyspark dataframe using isin by exclusion

In PySpark, you can use the ~ operator to exclude values when using the isin() function to filter a DataFrame. The ~ operator is used to negate a condition. Here's how you can filter a PySpark DataFrame using isin() by exclusion:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark session
spark = SparkSession.builder.appName("Example").getOrCreate()

# Create a sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 22), ("David", 28)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Values to exclude
excluded_names = ["Alice", "Charlie"]

# Filter DataFrame using isin with exclusion
filtered_df = df.filter(~col("Name").isin(excluded_names))

# Show the filtered DataFrame
filtered_df.show()

Output:

+-----+---+
| Name|Age|
+-----+---+
|  Bob| 30|
|David| 28|
+-----+---+

In this example, the ~ operator is used to negate the condition created by col("Name").isin(excluded_names), effectively excluding the names present in the excluded_names list.

Keep in mind that isin() is a column-level operation: it produces a Boolean column, and the ~ operator simply negates that column before filter() is applied.

Examples

  1. "Exclude specific values in PySpark DataFrame using isin"

    Description: This query indicates a search for excluding specific values from a PySpark DataFrame using the isin function.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Initialize SparkSession
    spark = SparkSession.builder \
        .appName("Exclude Values") \
        .getOrCreate()

    # Sample DataFrame
    data = [("A",), ("B",), ("C",), ("D",), ("E",)]
    df = spark.createDataFrame(data, ["col"])

    # Values to exclude
    exclude_values = ["B", "D"]

    # Filtering DataFrame to exclude values
    filtered_df = df.filter(~col("col").isin(exclude_values))
    filtered_df.show()
  2. "Exclude rows based on condition PySpark"

    Description: This query suggests looking for ways to exclude rows from a PySpark DataFrame based on specific conditions.

    from pyspark.sql import SparkSession

    # Initialize SparkSession
    spark = SparkSession.builder \
        .appName("Exclude Rows") \
        .getOrCreate()

    # Sample DataFrame
    data = [("A", 10), ("B", 20), ("C", 30), ("D", 40), ("E", 50)]
    df = spark.createDataFrame(data, ["col1", "col2"])

    # Condition to exclude rows
    condition = df.col2 < 30

    # Filtering DataFrame based on condition
    filtered_df = df.filter(~condition)
    filtered_df.show()
  3. "PySpark DataFrame filter not in list"

    Description: This query suggests looking for methods to filter a PySpark DataFrame where values are not present in a given list.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Initialize SparkSession
    spark = SparkSession.builder \
        .appName("Filter Not In List") \
        .getOrCreate()

    # Sample DataFrame
    data = [("A",), ("B",), ("C",), ("D",), ("E",)]
    df = spark.createDataFrame(data, ["col"])

    # Values not to include
    exclude_values = ["B", "D"]

    # Filtering DataFrame where values are not in the list
    filtered_df = df.filter(~col("col").isin(exclude_values))
    filtered_df.show()
  4. "PySpark DataFrame exclude values using isin function"

    Description: This query indicates a search for excluding specific values from a PySpark DataFrame utilizing the isin function.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Initialize SparkSession
    spark = SparkSession.builder \
        .appName("Exclude Values Using isin") \
        .getOrCreate()

    # Sample DataFrame
    data = [("A",), ("B",), ("C",), ("D",), ("E",)]
    df = spark.createDataFrame(data, ["col"])

    # Values to exclude
    exclude_values = ["B", "D"]

    # Filtering DataFrame to exclude values using isin function
    filtered_df = df.filter(~col("col").isin(exclude_values))
    filtered_df.show()
  5. "Filter PySpark DataFrame excluding specific values"

    Description: This query suggests searching for methods to filter a PySpark DataFrame while excluding specific values.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Initialize SparkSession
    spark = SparkSession.builder \
        .appName("Filter Excluding Values") \
        .getOrCreate()

    # Sample DataFrame
    data = [("A",), ("B",), ("C",), ("D",), ("E",)]
    df = spark.createDataFrame(data, ["col"])

    # Values to exclude
    exclude_values = ["B", "D"]

    # Filtering DataFrame to exclude values
    filtered_df = df.filter(~col("col").isin(exclude_values))
    filtered_df.show()
  6. "Exclude rows from PySpark DataFrame based on a list"

    Description: This query indicates a search for methods to exclude specific rows from a PySpark DataFrame based on a list of values.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Initialize SparkSession
    spark = SparkSession.builder \
        .appName("Exclude Rows Based on List") \
        .getOrCreate()

    # Sample DataFrame
    data = [("A",), ("B",), ("C",), ("D",), ("E",)]
    df = spark.createDataFrame(data, ["col"])

    # List of values to exclude
    exclude_list = ["B", "D"]

    # Filtering DataFrame to exclude rows based on list
    filtered_df = df.filter(~col("col").isin(exclude_list))
    filtered_df.show()
  7. "PySpark filter DataFrame exclude multiple values"

    Description: This query suggests looking for ways to filter a PySpark DataFrame while excluding multiple values from consideration.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Initialize SparkSession
    spark = SparkSession.builder \
        .appName("Filter Exclude Multiple Values") \
        .getOrCreate()

    # Sample DataFrame
    data = [("A",), ("B",), ("C",), ("D",), ("E",)]
    df = spark.createDataFrame(data, ["col"])

    # Values to exclude
    exclude_values = ["B", "D"]

    # Filtering DataFrame to exclude multiple values
    filtered_df = df.filter(~col("col").isin(exclude_values))
    filtered_df.show()
  8. "PySpark DataFrame exclude rows based on values"

    Description: This query indicates a search for methods to exclude specific rows from a PySpark DataFrame based on certain values.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Initialize SparkSession
    spark = SparkSession.builder \
        .appName("Exclude Rows Based on Values") \
        .getOrCreate()

    # Sample DataFrame
    data = [("A",), ("B",), ("C",), ("D",), ("E",)]
    df = spark.createDataFrame(data, ["col"])

    # Values to exclude
    exclude_values = ["B", "D"]

    # Filtering DataFrame to exclude rows based on values
    filtered_df = df.filter(~col("col").isin(exclude_values))
    filtered_df.show()
