Python - PySpark: how to duplicate a row n times in a DataFrame?

In PySpark, you can duplicate a row n times in a DataFrame by combining the array_repeat and explode functions: build an array of length n for every row, then explode it so that each element yields one copy of the row. Here's an example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

# Create a Spark session
spark = SparkSession.builder.appName("DuplicateRows").getOrCreate()

# Sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Number of times to duplicate each row
n = 3

# Duplicate rows n times by exploding an n-element array
duplicated_df = df.withColumn("duplicate", expr(f"explode(array_repeat(1, {n}))")).drop("duplicate")

# Display the original and duplicated DataFrames
print("Original DataFrame:")
df.show()

print(f"\nDataFrame after duplicating each row {n} times:")
duplicated_df.show()

# Stop the Spark session
spark.stop()

In this example:

  1. The array_repeat SQL function (called through expr) builds an array of length n filled with 1s for every row.
  2. explode turns that array into n output rows, i.e. n copies of each original row.
  3. Dropping the helper column leaves duplicated_df with each original row repeated n times.

Adjust n to control how many copies of each row you get; array_repeat and explode together do the actual duplication.
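If it helps to see the mechanism, here is a minimal sketch (reusing df and n from the example above) that keeps the helper column so you can inspect the array before it is exploded:

from pyspark.sql.functions import array_repeat, explode, lit

# Step 1: every row gains an n-element array column
with_array = df.withColumn("dup", array_repeat(lit(1), n))
with_array.show(truncate=False)  # e.g. Alice | 25 | [1, 1, 1]

# Step 2: exploding the array emits one output row per array element
with_array.withColumn("dup", explode("dup")).drop("dup").show()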

Keep in mind that duplicating rows in this way can lead to a substantial increase in the size of your DataFrame, so use it judiciously based on your specific requirements.
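Because the blow-up is exactly linear, a cheap sanity check is to compare row counts before and after (again reusing df, duplicated_df, and n from the example above):

original_count = df.count()
duplicated_count = duplicated_df.count()

# Every input row should appear exactly n times in the output
assert duplicated_count == original_count * n, (original_count, duplicated_count)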

Examples

  1. Pyspark DataFrame duplicate row n times:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import array, explode, lit

    spark = SparkSession.builder.appName("DuplicateRows").getOrCreate()
    df = spark.createDataFrame([(1, "data")], ["col1", "col2"])
    n = 3

    # Explode an array of n literal 1s to emit each row n times
    duplicated_df = df.withColumn("dummy", explode(array([lit(1)] * n))).drop("dummy")

    Description: Uses the explode function to duplicate a row n times by creating an array of n ones and then exploding it.

  2. Pyspark DataFrame replicate row multiple times:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DuplicateRows").getOrCreate()
    df = spark.createDataFrame([(1, "data")], ["col1", "col2"])
    n = 3

    # Cross join against an n-row DataFrame, then tidy up the columns
    duplicated_df = (df.crossJoin(spark.range(n))
                       .withColumnRenamed("col1", "new_col1")
                       .drop("id"))

    Description: Uses crossJoin against spark.range(n) to replicate each row n times (here also renaming col1 to new_col1).

  3. Pyspark duplicate rows in DataFrame:

    from functools import reduce
    from pyspark.sql import DataFrame, SparkSession

    spark = SparkSession.builder.appName("DuplicateRows").getOrCreate()
    df = spark.createDataFrame([(1, "data")], ["col1", "col2"])
    n = 3

    # Fold union over n copies of the DataFrame
    duplicated_df = reduce(DataFrame.union, [df] * n)

    Description: Uses the union operation (folded with reduce, since union takes a single DataFrame) to concatenate the DataFrame with itself n-1 times.

  4. Pyspark replicate row n times with RDD:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DuplicateRows").getOrCreate()
    df = spark.createDataFrame([(1, "data")], ["col1", "col2"])
    n = 3

    # Map each Row to a list of n identical Rows, then flatten
    rdd = df.rdd.flatMap(lambda x: [x] * n)
    duplicated_df = spark.createDataFrame(rdd, df.schema)

    Description: Uses RDD to replicate a row n times by flatMapping each row to a list of n identical rows.

  5. Pyspark DataFrame duplicate rows using crossJoin:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DuplicateRows").getOrCreate()
    df = spark.createDataFrame([(1, "data")], ["col1", "col2"])
    n = 3

    # spark.range(n) yields an n-row DataFrame with a single "id" column
    duplicated_df = df.crossJoin(spark.range(n)).drop("id")

    Description: Utilizes crossJoin with spark.range(n) to replicate rows n times.

  6. Pyspark DataFrame duplicate row with explode:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import array, explode, lit

    spark = SparkSession.builder.appName("DuplicateRows").getOrCreate()
    df = spark.createDataFrame([(1, "data")], ["col1", "col2"])
    n = 3

    # Explode an array of n literal 1s to emit each row n times
    duplicated_df = df.withColumn("dummy", explode(array([lit(1)] * n))).drop("dummy")

    Description: Similar to query 1, uses explode to duplicate a row n times; a reusable wrapper built on this approach is sketched after the list.

  7. Pyspark DataFrame repeat row n times:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DuplicateRows").getOrCreate()
    df = spark.createDataFrame([(1, "data")], ["col1", "col2"])
    n = 3

    # Without drop("id"), the result keeps the copy index from spark.range(n)
    duplicated_df = df.crossJoin(spark.range(n))

    Description: Uses crossJoin with spark.range(n) to replicate rows n times, keeping the id column so you can tell the copies apart.

  8. Pyspark DataFrame duplicate row with array_repeat:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import array_repeat, explode, lit

    spark = SparkSession.builder.appName("DuplicateRows").getOrCreate()
    df = spark.createDataFrame([(1, "data")], ["col1", "col2"])
    n = 3

    # array_repeat builds an n-element array; exploding it emits n rows
    duplicated_df = df.withColumn("dummy", explode(array_repeat(lit(1), n))).drop("dummy")

    Description: Utilizes array_repeat to build an n-element array per row and explode to turn it into n copies of that row.

  9. Pyspark replicate row with broadcast join:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("DuplicateRows").getOrCreate()
    df = spark.createDataFrame([(1, "data")], ["col1", "col2"])
    n = 3

    # Broadcast the small n-row DataFrame so the cross join avoids a shuffle
    duplicated_df = df.join(broadcast(spark.range(n)), how="cross").drop("id")

    Description: Uses a broadcast join to replicate rows n times.

  10. Pyspark DataFrame duplicate rows with selectExpr:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DuplicateRows").getOrCreate()
    df = spark.createDataFrame([(1, "data")], ["col1", "col2"])
    n = 3

    # stack(n, v1, v2, v1, v2, ...) splits 2n values into n rows of 2 columns
    duplicated_df = df.selectExpr(f"stack({n}, {', '.join(['col1, col2'] * n)}) as (col1, col2)")

    Description: Uses selectExpr with the stack function to duplicate rows n times.
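If you need this in more than one place, the explode-based approach from queries 1 and 6 wraps cleanly into a helper. The sketch below is one possible wrapper, not an established API: the duplicate_rows name, and the optional per-row count column, are assumptions for illustration.

from typing import Optional

from pyspark.sql import DataFrame
from pyspark.sql.functions import array_repeat, col, explode, lit

def duplicate_rows(df: DataFrame, n: Optional[int] = None,
                   count_col: Optional[str] = None) -> DataFrame:
    """Repeat each row n times, or count_col times per row (hypothetical helper)."""
    times = col(count_col) if count_col else lit(n)
    # Explode an array whose length is the desired number of copies
    return df.withColumn("_dup", explode(array_repeat(lit(1), times))).drop("_dup")

# Fixed count: every row three times
tripled = duplicate_rows(df, n=3)

# Per-row count: assumes df has an integer column "n" giving the repeat count
# varied = duplicate_rows(df, count_col="n")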

