Posted on Apr 18, 2024

PySpark: missing values

#pyspark #python #dataengineering #bigdata

Drop

df.na.drop() vs. df.dropna()

DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other, so their performance should be equivalent.

Both accept the same how, thresh, and subset arguments, so either form can restrict the null check to specific columns.

Examples

# Drop any row that contains missing data
df.na.drop().show()

# Keep only rows with at least 2 non-null values; rows with fewer are dropped
df.na.drop(thresh=2).show()

# Drop only the rows with a null in the Sales column
df.dropna(how='any', subset=['Sales']).show()

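# how='any' (the default) drops a row if any column is null
df.na.drop(how='any').show()

# how='all' drops a row only if every column is null
df.na.drop(how='all').show()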

Fill

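We can also fill missing values with new values. If nulls occur across columns of several data types, Spark is smart enough to match the fill value to the column type: a string value fills only string columns, and a numeric value fills only numeric columns. For example: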

df.na.fill('NEW VALUE').show()

If you have multiple columns to fill, you can pass a dictionary that maps each column name to its replacement value.
