How to delete columns in PySpark dataframe?

In PySpark, you can delete or drop columns from a DataFrame using the drop() method. This method is quite straightforward and allows you to remove one or more columns from your DataFrame.

Here's a basic example to demonstrate how to delete columns in a PySpark DataFrame:

Example: Dropping Columns from a PySpark DataFrame

First, make sure you have PySpark installed. If not, you can install it using pip:

pip install pyspark 

Then, you can create a Spark session and use the drop() method as follows:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Example") \
    .getOrCreate()

# Sample data
data = [("John", 28, "New York"),
        ("Smith", 33, "Las Vegas"),
        ("Adam", 23, "San Francisco")]

# Create DataFrame
columns = ["Name", "Age", "City"]
df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show()

# Drop a single column
df = df.drop("Age")
print("DataFrame after dropping 'Age' column:")
df.show()

# Drop multiple columns
df = df.drop("Name", "City")
print("DataFrame after dropping 'Name' and 'City' columns:")
df.show()

In this example:

  • A Spark session is created.
  • A DataFrame df is created with sample data.
  • The drop() method is used to remove columns. You can pass the column names as individual arguments to drop multiple columns.
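When the names to remove are computed rather than typed out, a Python list can be unpacked into drop() with the * operator. The sketch below is a minimal illustration and not part of the original example: drop_by_prefix is a hypothetical helper name, and it assumes only that the DataFrame exposes the columns attribute and the drop() method.

```python
def drop_by_prefix(df, prefix):
    """Return a DataFrame without the columns whose names start with `prefix`.

    Hypothetical helper for illustration; `df` is expected to be a
    PySpark DataFrame (anything with `.columns` and `.drop(*names)`).
    """
    # df.columns is a plain Python list of column-name strings.
    to_drop = [name for name in df.columns if name.startswith(prefix)]
    if not to_drop:
        # Nothing matches, so hand back the DataFrame unchanged.
        return df
    # Unpack the list so each name is passed as a separate argument.
    return df.drop(*to_drop)
```

With the original three-column DataFrame from the example, drop_by_prefix(df, "A") would remove only the Age column and leave Name and City in place.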

Notes

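  • Check that the column names passed to drop() are spelled correctly. drop() silently ignores names that do not exist in the DataFrame, so a typo raises no error and simply leaves the intended column in place. Inspecting df.columns after the call confirms the result.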
  • The drop() operation does not modify the original DataFrame; it returns a new DataFrame. Hence, you need to assign the result to a variable (can be the same DataFrame variable or a new one).
  • The example above relies on PySpark 2.0 or later. drop() has been available since Spark 1.4, but the 1.x releases accept only one column per call, so dropping several columns there means chaining calls, for example df.drop("Name").drop("City").
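Because drop() does not raise when a name is missing from the schema, a typo can go unnoticed. One possible safeguard, sketched here under stated assumptions, is a thin wrapper that checks the requested names against df.columns first. drop_strict is a hypothetical helper, not a PySpark function, and its exact, case-sensitive comparison is a choice made for this sketch.

```python
def drop_strict(df, *names):
    """Drop the named columns, raising ValueError if any name is absent.

    Hypothetical helper: DataFrame.drop() itself ignores names that are
    not in the schema, so the check is done here against df.columns.
    The comparison is an exact, case-sensitive string match, which is an
    assumption of this sketch and may be stricter than Spark's own
    column resolution.
    """
    missing = [name for name in names if name not in df.columns]
    if missing:
        raise ValueError(
            f"Columns not found: {missing}; available: {list(df.columns)}"
        )
    # drop() returns a new DataFrame; the caller's df is left unchanged.
    return df.drop(*names)
```

A call such as trimmed = drop_strict(df, "Age") leaves df itself untouched and binds the reduced DataFrame to trimmed, while drop_strict(df, "Agee") fails immediately instead of returning the DataFrame unchanged.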

Conclusion

Deleting columns in a PySpark DataFrame is a simple and common operation, important for data preprocessing and manipulation tasks in big data processing. The drop() method provides a flexible way to remove one or more columns from a DataFrame.

