Python - splitting a row into multiple rows in PySpark

In PySpark, you can split a row into multiple rows using the explode function along with split or other relevant transformations. Here's a basic example:

Let's assume you have a DataFrame with a column containing comma-separated values, and you want to split each row based on the commas:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample DataFrame
data = [("John", "apple,orange,banana"), ("Alice", "grape,pear"), ("Bob", "watermelon")]
columns = ["Name", "Fruits"]
df = spark.createDataFrame(data, columns)

# Split the "Fruits" column based on commas
df_split = df.withColumn("Fruit", split(df["Fruits"], ","))

# Explode the array to multiple rows
df_exploded = df_split.select("Name", explode("Fruit").alias("SingleFruit"))

# Show the result
df_exploded.show()

In this example:

  1. We create a sample DataFrame (df) with columns "Name" and "Fruits".
  2. We use the split function to split the "Fruits" column based on commas, resulting in an array column named "Fruit."
  3. We use the explode function to explode the array into multiple rows, creating a new DataFrame (df_exploded).

The output will be:

+-----+-----------+
| Name|SingleFruit|
+-----+-----------+
| John|      apple|
| John|     orange|
| John|     banana|
|Alice|      grape|
|Alice|       pear|
|  Bob| watermelon|
+-----+-----------+

Adjust the code to match your specific DataFrame structure and the delimiter you need to split on.
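Note that explode drops rows whose array is null or empty. If those rows should survive the transformation, explode_outer (available in pyspark.sql.functions since Spark 2.3) emits a single row with a null value instead. A minimal sketch of that variant:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode_outer

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("John", "apple,orange"), ("Alice", None)]
df = spark.createDataFrame(data, ["Name", "Fruits"])

# split returns a null array for Alice; explode_outer keeps her row
# with a null SingleFruit instead of dropping it
df_outer = df.withColumn("Fruit", split("Fruits", ",")) \
    .select("Name", explode_outer("Fruit").alias("SingleFruit"))
df_outer.show()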

Examples

  1. PySpark Split Row by Delimiter

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe",)]
    df = spark.createDataFrame(data, ["name"])

    # split turns the comma-separated string into an array column
    df_split = df.withColumn("name_split", split(df["name"], ","))

    Description: Splits a row by a comma delimiter using the split function in PySpark.

  2. PySpark Explode Split Rows

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, explode

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe",)]
    df = spark.createDataFrame(data, ["name"])

    # Split into an array, then explode into one row per element
    df_split = df.withColumn("name_split", split(df["name"], ",")) \
        .select("name", explode("name_split").alias("name_parts"))

    Description: Splits a row by a comma delimiter and explodes the resulting array into multiple rows using the explode function in PySpark.

  3. PySpark Split Row to Multiple Rows Using UDF

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, explode
    from pyspark.sql.types import ArrayType, StringType

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe",)]
    df = spark.createDataFrame(data, ["name"])

    # UDF that splits the string in plain Python; the return type
    # must be declared so Spark knows the column is an array
    split_udf = udf(lambda x: x.split(","), ArrayType(StringType()))

    df_split = df.withColumn("name_split", split_udf(df["name"])) \
        .select("name", explode("name_split").alias("name_parts"))

    Description: Uses a User-Defined Function (UDF) to split a row by a comma delimiter and explode the resulting array into multiple rows in PySpark. Prefer the built-in split function where possible: UDFs are opaque to the Catalyst optimizer and add serialization overhead.

  4. PySpark Split Row to Rows with Index

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, posexplode

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe",)]
    df = spark.createDataFrame(data, ["name"])

    # posexplode emits two columns: the element's position and its value
    df_split = df.withColumn("name_split", split(df["name"], ",")) \
        .select("name", posexplode("name_split").alias("index", "name_part"))

    Description: Splits a row by a comma delimiter and includes the index of each element using the posexplode function in PySpark.

  5. PySpark Split Row with Fixed Number of Columns

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe,30",)]
    df = spark.createDataFrame(data, ["info"])

    # With a known number of fields, index into the array directly
    df_split = df.withColumn("info_split", split(df["info"], ",")) \
        .selectExpr("info_split[0] as name", "info_split[1] as surname", "info_split[2] as age")

    Description: Splits a row with a fixed number of columns by a comma delimiter using the split function in PySpark.

  6. PySpark Split Row to Rows with Null Handling

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, explode, when

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe",), (None,)]
    df = spark.createDataFrame(data, ["name"])

    # split returns a null array for a null input; explode drops those rows
    df_split = df.withColumn("name_split", split(df["name"], ",")) \
        .select("name", explode("name_split").alias("name_part"))

    # when() without otherwise() maps anything non-matching to null
    df_split = df_split.withColumn(
        "name_part", when(df_split["name_part"].isNotNull(), df_split["name_part"])
    )

    Description: Shows how nulls propagate when splitting: split returns a null array for a null input, explode silently drops those rows, and the when guard (with no otherwise clause) keeps non-null parts and leaves everything else as null.

  7. PySpark Split Row with Regex Pattern

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, col

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John-Doe",)]
    df = spark.createDataFrame(data, ["name"])

    # split's pattern argument is a Java regex; the optional limit
    # (Spark 3.0+) caps the number of resulting parts at 2
    df_split = df.withColumn("name_split", split(df["name"], "-", 2)) \
        .select("name", col("name_split")[0].alias("first"), col("name_split")[1].alias("second"))

    Description: Splits a row using a regex pattern (a hyphen in this case), capping the result at two parts via split's optional limit argument (Spark 3.0+). Because the pattern is a regular expression, metacharacters need escaping; see the sketch after this list.

  8. PySpark Split Row to Rows with Replication

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, explode, lit

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe",)]
    df = spark.createDataFrame(data, ["name"])

    # explode yields one row per array element
    df_split = df.withColumn("name_split", split(df["name"], ",")) \
        .select("name", explode("name_split").alias("name_part"))

    # Tag each replicated row with a constant marker
    df_split = df_split.withColumn("replicated", lit(1))

    Description: explode already produces one row per element of the split array; the lit column tags each replicated row with a constant marker for downstream processing in PySpark.

  9. PySpark Split Row with Trim

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, trim, col

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [(" John , Doe ",)]
    df = spark.createDataFrame(data, ["name"])

    # trim removes the leading/trailing spaces left around each part
    df_split = df.withColumn("name_split", split(df["name"], ",")) \
        .select(
            "name",
            trim(col("name_split")[0]).alias("first_name"),
            trim(col("name_split")[1]).alias("last_name"),
        )

    Description: Splits a row and trims leading and trailing whitespaces using the trim function in PySpark.

  10. PySpark Split Row with Condition

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, when, size

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe",), ("Alice",)]
    df = spark.createDataFrame(data, ["name"])

    df_split = df.withColumn("name_split", split(df["name"], ","))

    # size() (from pyspark.sql.functions) guards the array accesses;
    # a Column object has no .size() method of its own
    df_split = df_split.withColumn(
        "first_name", when(size(df_split["name_split"]) > 0, df_split["name_split"][0])
    )
    df_split = df_split.withColumn(
        "last_name", when(size(df_split["name_split"]) > 1, df_split["name_split"][1])
    )

    Description: Uses size in a conditional statement to handle cases where the split array has fewer elements than expected in PySpark; a more compact alternative using element_at is sketched just below.
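As a more compact alternative to the size() guards in example 10, element_at (Spark 2.4+) reads an array element by 1-based index; with spark.sql.ansi.enabled left at its default of false, an out-of-range index yields null rather than an error. A sketch under that assumption:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, element_at

spark = SparkSession.builder.appName("SplitRow").getOrCreate()

data = [("John,Doe",), ("Alice",)]
df = spark.createDataFrame(data, ["name"])

# element_at is 1-based; index 2 is out of range for the "Alice" row,
# so last_name comes back null instead of raising an error
df_safe = df.withColumn("name_split", split(df["name"], ",")) \
    .select(
        element_at("name_split", 1).alias("first_name"),
        element_at("name_split", 2).alias("last_name"),
    )
df_safe.show()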

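Because split interprets its pattern as a Java regular expression (see example 7), delimiters that happen to be regex metacharacters, such as |, . or +, must be escaped. A small sketch:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode

spark = SparkSession.builder.appName("SplitRow").getOrCreate()

# "|" is a regex metacharacter; unescaped it matches the empty string
# between every character, so escape it (or wrap it in a character class)
data = [("red|green|blue",)]
df = spark.createDataFrame(data, ["colors"])
df.select(explode(split("colors", "\\|")).alias("color")).show()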

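Finally, the same split-and-explode pattern can be expressed in Spark SQL, which is convenient when pipelines are written as SQL strings. A minimal sketch using the fruits data from the first example (the view name "people" is just for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("John", "apple,orange,banana"), ("Alice", "grape,pear"), ("Bob", "watermelon")]
df = spark.createDataFrame(data, ["Name", "Fruits"])

# Register a temporary view so the DataFrame is visible to SQL
df.createOrReplaceTempView("people")

spark.sql("""
    SELECT Name, fruit AS SingleFruit
    FROM people
    LATERAL VIEW explode(split(Fruits, ',')) exploded AS fruit
""").show()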