In PySpark, you can create a new column in a DataFrame by mapping the values of an existing column through a Python dictionary. A common and efficient way to do this is to turn the dictionary into a literal map expression with create_map (together with lit and col from pyspark.sql.functions) and look each row's value up in that map inside withColumn. For small dictionaries a chain of when conditions (or an equivalent expr string) works as well, but the create_map lookup stays concise no matter how many keys the dictionary has.
Here's a step-by-step example to show you how to do this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, create_map, lit
from itertools import chain

# Initialize a SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()

# Suppose we have the following DataFrame
data = [("A",), ("B",), ("C",), ("D",)]
df = spark.createDataFrame(data, ["letter"])

# And a dictionary for mapping
mapping_dict = {"A": "Apple", "B": "Banana", "C": "Cherry"}

# Convert the dictionary to a column of key-value pairs
mapping_expr = create_map([lit(x) for x in chain(*mapping_dict.items())])

# Add a new column with the mapped values
df = df.withColumn("fruit", mapping_expr.getItem(col("letter")))

# Show the resulting DataFrame
df.show()

# Stop the SparkSession
spark.stop()
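If the mapping behaves as described, df.show() should print something close to the following (the exact rendering of missing values varies by Spark version, e.g. null vs NULL):

+------+------+
|letter| fruit|
+------+------+
|     A| Apple|
|     B|Banana|
|     C|Cherry|
|     D|  null|
+------+------+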
In this code snippet:

- chain(*mapping_dict.items()) flattens the dictionary into a list of alternating keys and values.
- create_map builds a map expression from that flattened list.
- getItem(col("letter")) looks up each value of the 'letter' column in the map.
- withColumn adds a new 'fruit' column to the DataFrame holding the mapped values.

Please note that if a value in the 'letter' column has no matching key in mapping_dict, the corresponding entry in the 'fruit' column will be null. If you want a default for unmapped keys, you can wrap the lookup in coalesce to substitute a fallback value for those null results, as sketched below.
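A minimal sketch of that fallback, continuing from the snippet above (run before spark.stop(); the default string "Unknown" is just an illustrative choice):

from pyspark.sql.functions import coalesce, lit

# Replace null lookup results with a fallback value ("Unknown" here is arbitrary)
df = df.withColumn(
    "fruit",
    coalesce(mapping_expr.getItem(col("letter")), lit("Unknown")),
)
df.show()

With this in place, the 'D' row would show "Unknown" instead of null.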