Python - splitting a row into multiple rows in PySpark

In PySpark, you can split a row into multiple rows using the explode function along with split or other relevant transformations. Here's a basic example:

Let's assume you have a DataFrame with a column containing comma-separated values, and you want to split each row based on the commas:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample DataFrame
data = [("John", "apple,orange,banana"), ("Alice", "grape,pear"), ("Bob", "watermelon")]
columns = ["Name", "Fruits"]
df = spark.createDataFrame(data, columns)

# Split the "Fruits" column based on commas
df_split = df.withColumn("Fruit", split(df["Fruits"], ","))

# Explode the array to multiple rows
df_exploded = df_split.select("Name", explode("Fruit").alias("SingleFruit"))

# Show the result
df_exploded.show()

In this example:

  1. We create a sample DataFrame (df) with columns "Name" and "Fruits".
  2. We use the split function to split the "Fruits" column based on commas, resulting in an array column named "Fruit."
  3. We use the explode function to explode the array into multiple rows, creating a new DataFrame (df_exploded).

The output will be:

+-----+-----------+
| Name|SingleFruit|
+-----+-----------+
| John|      apple|
| John|     orange|
| John|     banana|
|Alice|      grape|
|Alice|       pear|
|  Bob| watermelon|
+-----+-----------+

Adjust the code to match your specific DataFrame structure and the delimiter you need to split on.
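Note that explode drops rows whose array is null or empty. If those rows should survive the transformation, explode_outer (available in pyspark.sql.functions since Spark 2.3) emits a single row with a null value instead. A minimal sketch of that variant:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode_outer

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("John", "apple,orange"), ("Alice", None)]
df = spark.createDataFrame(data, ["Name", "Fruits"])

# split returns a null array for Alice; explode_outer keeps her row
# with a null SingleFruit instead of dropping it
df_outer = df.withColumn("Fruit", split("Fruits", ",")) \
    .select("Name", explode_outer("Fruit").alias("SingleFruit"))
df_outer.show()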

Examples

  1. PySpark Split Row by Delimiter

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe",)]
    df = spark.createDataFrame(data, ["name"])

    # split turns the comma-separated string into an array column
    df_split = df.withColumn("name_split", split(df["name"], ","))

    Description: Splits a row by a comma delimiter using the split function in PySpark.

  2. PySpark Explode Split Rows

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, explode

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe",)]
    df = spark.createDataFrame(data, ["name"])

    # Split into an array, then explode into one row per element
    df_split = df.withColumn("name_split", split(df["name"], ",")) \
        .select("name", explode("name_split").alias("name_parts"))

    Description: Splits a row by a comma delimiter and explodes the resulting array into multiple rows using the explode function in PySpark.

  3. PySpark Split Row to Multiple Rows Using UDF

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, explode
    from pyspark.sql.types import ArrayType, StringType

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe",)]
    df = spark.createDataFrame(data, ["name"])

    # UDF that splits the string in plain Python; the return type
    # must be declared so Spark knows the column is an array
    split_udf = udf(lambda x: x.split(","), ArrayType(StringType()))

    df_split = df.withColumn("name_split", split_udf(df["name"])) \
        .select("name", explode("name_split").alias("name_parts"))

    Description: Uses a User-Defined Function (UDF) to split a row by a comma delimiter and explode the resulting array into multiple rows in PySpark. Prefer the built-in split function where possible: UDFs are opaque to the Catalyst optimizer and add serialization overhead.

  4. PySpark Split Row to Rows with Index

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, posexplode

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe",)]
    df = spark.createDataFrame(data, ["name"])

    # posexplode emits two columns: the element's position and its value
    df_split = df.withColumn("name_split", split(df["name"], ",")) \
        .select("name", posexplode("name_split").alias("index", "name_part"))

    Description: Splits a row by a comma delimiter and includes the index of each element using the posexplode function in PySpark.

  5. PySpark Split Row with Fixed Number of Columns

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe,30",)]
    df = spark.createDataFrame(data, ["info"])

    # With a known number of fields, index into the array directly
    df_split = df.withColumn("info_split", split(df["info"], ",")) \
        .selectExpr("info_split[0] as name", "info_split[1] as surname", "info_split[2] as age")

    Description: Splits a row with a fixed number of columns by a comma delimiter using the split function in PySpark.

  6. PySpark Split Row to Rows with Null Handling

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, explode, when

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe",), (None,)]
    df = spark.createDataFrame(data, ["name"])

    # split returns a null array for a null input; explode drops those rows
    df_split = df.withColumn("name_split", split(df["name"], ",")) \
        .select("name", explode("name_split").alias("name_part"))

    # when() without otherwise() maps anything non-matching to null
    df_split = df_split.withColumn(
        "name_part", when(df_split["name_part"].isNotNull(), df_split["name_part"])
    )

    Description: Shows how nulls propagate when splitting: split returns a null array for a null input, explode silently drops those rows, and the when guard (with no otherwise clause) keeps non-null parts and leaves everything else as null.

  7. PySpark Split Row with Regex Pattern

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, col

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John-Doe",)]
    df = spark.createDataFrame(data, ["name"])

    # split's pattern argument is a Java regex; the optional limit
    # (Spark 3.0+) caps the number of resulting parts at 2
    df_split = df.withColumn("name_split", split(df["name"], "-", 2)) \
        .select("name", col("name_split")[0].alias("first"), col("name_split")[1].alias("second"))

    Description: Splits a row using a regex pattern (a hyphen in this case), capping the result at two parts via split's optional limit argument (Spark 3.0+). Because the pattern is a regular expression, metacharacters need escaping; see the sketch after this list.

  8. PySpark Split Row to Rows with Replication

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, explode, lit

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe",)]
    df = spark.createDataFrame(data, ["name"])

    # explode yields one row per array element
    df_split = df.withColumn("name_split", split(df["name"], ",")) \
        .select("name", explode("name_split").alias("name_part"))

    # Tag each replicated row with a constant marker
    df_split = df_split.withColumn("replicated", lit(1))

    Description: explode already produces one row per element of the split array; the lit column tags each replicated row with a constant marker for downstream processing in PySpark.

  9. PySpark Split Row with Trim

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, trim, col

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [(" John , Doe ",)]
    df = spark.createDataFrame(data, ["name"])

    # trim removes the leading/trailing spaces left around each part
    df_split = df.withColumn("name_split", split(df["name"], ",")) \
        .select(
            "name",
            trim(col("name_split")[0]).alias("first_name"),
            trim(col("name_split")[1]).alias("last_name"),
        )

    Description: Splits a row and trims leading and trailing whitespaces using the trim function in PySpark.

  10. PySpark Split Row with Condition

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, when, size

    spark = SparkSession.builder.appName("SplitRow").getOrCreate()

    data = [("John,Doe",), ("Alice",)]
    df = spark.createDataFrame(data, ["name"])

    df_split = df.withColumn("name_split", split(df["name"], ","))

    # size() (from pyspark.sql.functions) guards the array accesses;
    # a Column object has no .size() method of its own
    df_split = df_split.withColumn(
        "first_name", when(size(df_split["name_split"]) > 0, df_split["name_split"][0])
    )
    df_split = df_split.withColumn(
        "last_name", when(size(df_split["name_split"]) > 1, df_split["name_split"][1])
    )

    Description: Uses size in a conditional statement to handle cases where the split array has fewer elements than expected in PySpark; a more compact alternative using element_at is sketched just below.
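As a more compact alternative to the size() guards in example 10, element_at (Spark 2.4+) reads an array element by 1-based index; with spark.sql.ansi.enabled left at its default of false, an out-of-range index yields null rather than an error. A sketch under that assumption:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, element_at

spark = SparkSession.builder.appName("SplitRow").getOrCreate()

data = [("John,Doe",), ("Alice",)]
df = spark.createDataFrame(data, ["name"])

# element_at is 1-based; index 2 is out of range for the "Alice" row,
# so last_name comes back null instead of raising an error
df_safe = df.withColumn("name_split", split(df["name"], ",")) \
    .select(
        element_at("name_split", 1).alias("first_name"),
        element_at("name_split", 2).alias("last_name"),
    )
df_safe.show()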

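Because split interprets its pattern as a Java regular expression (see example 7), delimiters that happen to be regex metacharacters, such as |, . or +, must be escaped. A small sketch:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode

spark = SparkSession.builder.appName("SplitRow").getOrCreate()

# "|" is a regex metacharacter; unescaped it matches the empty string
# between every character, so escape it (or wrap it in a character class)
data = [("red|green|blue",)]
df = spark.createDataFrame(data, ["colors"])
df.select(explode(split("colors", "\\|")).alias("color")).show()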

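Finally, the same split-and-explode pattern can be expressed in Spark SQL, which is convenient when pipelines are written as SQL strings. A minimal sketch using the fruits data from the first example (the view name "people" is just for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("John", "apple,orange,banana"), ("Alice", "grape,pear"), ("Bob", "watermelon")]
df = spark.createDataFrame(data, ["Name", "Fruits"])

# Register a temporary view so the DataFrame is visible to SQL
df.createOrReplaceTempView("people")

spark.sql("""
    SELECT Name, fruit AS SingleFruit
    FROM people
    LATERAL VIEW explode(split(Fruits, ',')) exploded AS fruit
""").show()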