DEV Community

Cover image for Data Cleaning & Preparation: The Ultimate Guide for Any Dataset
Nivesh Bansal
Nivesh Bansal

Posted on

Data Cleaning & Preparation: The Ultimate Guide for Any Dataset

When we start working as a Data Analyst, the first step is not building dashboards or making charts.
The first step is always cleaning and preparing the dataset.

πŸ‘‰ If the dataset is not clean, the final analysis will be wrong or misleading.

In this post, we will go through every important task you should do with a fresh dataset (no matter if it is sales data, HR data, COVID data, banking data, or any other).

We will use Python (pandas) for examples, but the steps apply to any dataset.


Step 1: Import Libraries and Load Dataset

import pandas as pd ## Example for CSV file df = pd.read_csv("data.csv") ## Example for Excel file ## df = pd.read_excel("data.xlsx")  ## Example for SQL database ## import sqlite3 ## conn = sqlite3.connect("data.db") ## df = pd.read_sql("SELECT * FROM table_name", conn)  print(df.head()) 
Enter fullscreen mode Exit fullscreen mode

Always start by loading the dataset and looking at the first few rows.


Step 2: Check Dataset Shape

print(df.shape) 
Enter fullscreen mode Exit fullscreen mode

Example Output:

(10000, 15) 
Enter fullscreen mode Exit fullscreen mode
  • 10,000 rows (data entries)
  • 15 columns (variables)

Helps you know how big the dataset is.


Step 3: Get Dataset Info

print(df.info()) 
Enter fullscreen mode Exit fullscreen mode

This shows:

  • Column names
  • Data types (int, float, object, datetime)
  • Non-null counts

Example:

order_id 10000 non-null int64 order_date 9990 non-null object sales_amount 9800 non-null float64 customer_name 10000 non-null object 
Enter fullscreen mode Exit fullscreen mode

Step 4: Check First and Last Few Rows

print(df.head(10)) print(df.tail(10)) 
Enter fullscreen mode Exit fullscreen mode

Helps you:

  • Confirm column order
  • Check sorting (time, id, etc.)
  • Spot strange values

Step 5: Check Column Names

print(df.columns) 
Enter fullscreen mode Exit fullscreen mode

Example:

['Order ID', 'Order Date', 'Sales Amount', 'Customer Name'] 
Enter fullscreen mode Exit fullscreen mode

Sometimes names have spaces, symbols, or capital letters. Clean them:

df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_") 
Enter fullscreen mode Exit fullscreen mode

Now column names are:

['order_id', 'order_date', 'sales_amount', 'customer_name'] 
Enter fullscreen mode Exit fullscreen mode

Step 6: Handle Data Types

Always check if each column has the correct type:

print(df.dtypes) 
Enter fullscreen mode Exit fullscreen mode

Example fixes:

## Convert date column df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce') ## Convert numeric columns df['sales_amount'] = pd.to_numeric(df['sales_amount'], errors='coerce') ## Convert categorical columns df['customer_name'] = df['customer_name'].astype('category') 
Enter fullscreen mode Exit fullscreen mode

Step 7: Check Missing Values

print(df.isnull().sum()) 
Enter fullscreen mode Exit fullscreen mode

Example Output:

order_date 10 sales_amount 200 customer_name 0 
Enter fullscreen mode Exit fullscreen mode

Ways to handle missing values:

  1. Drop rows/columns
df.dropna(subset=['order_date'], inplace=True) 
Enter fullscreen mode Exit fullscreen mode
  1. Fill with mean/median/mode
df['sales_amount'].fillna(df['sales_amount'].median(), inplace=True) 
Enter fullscreen mode Exit fullscreen mode
  1. Fill forward/backward (time series)
df['sales_amount'].fillna(method='ffill', inplace=True) 
Enter fullscreen mode Exit fullscreen mode

Step 8: Check for Duplicates

print(df.duplicated().sum()) df = df.drop_duplicates() 
Enter fullscreen mode Exit fullscreen mode

Avoid double counting.


Step 9: Explore Unique Values

print(df['customer_name'].unique()) print(df['customer_name'].nunique()) 
Enter fullscreen mode Exit fullscreen mode

Helps find spelling mistakes like "John Smith" vs "Jon Smith".


Step 10: Descriptive Statistics

print(df.describe(include='all')) 
Enter fullscreen mode Exit fullscreen mode

Example:

sales_amount min: -50 max: 100000 mean: 2500 
Enter fullscreen mode Exit fullscreen mode

Helps detect:

  • Outliers
  • Negative values (not possible for sales)
  • Very large values

Step 11: Handle Outliers

Example: Negative sales amount should not exist.

df = df[df['sales_amount'] >= 0] 
Enter fullscreen mode Exit fullscreen mode

Or cap extreme values:

q1 = df['sales_amount'].quantile(0.25) q3 = df['sales_amount'].quantile(0.75) iqr = q3 - q1 lower = q1 - 1.5 * iqr upper = q3 + 1.5 * iqr df = df[(df['sales_amount'] >= lower) & (df['sales_amount'] <= upper)] 
Enter fullscreen mode Exit fullscreen mode

Step 12: Standardize Text Data

Example: customer names, product names, cities

df['customer_name'] = df['customer_name'].str.strip().str.title() 
Enter fullscreen mode Exit fullscreen mode

Removes extra spaces and standardizes case.


Step 13: Rename or Drop Columns

df.rename(columns={'sales_amount': 'revenue'}, inplace=True) df.drop(columns=['unnecessary_column'], inplace=True) 
Enter fullscreen mode Exit fullscreen mode

Step 14: Create New Features

Examples:

  • Profit Margin
  • Year, Month from date
  • Age group
df['year'] = df['order_date'].dt.year df['month'] = df['order_date'].dt.month 
Enter fullscreen mode Exit fullscreen mode

Step 15: Check Data Consistency

  • Dates should be in correct order
  • IDs should be unique
  • Categories should make sense
print(df['year'].unique()) print(df['order_id'].nunique(), df.shape[0]) 
Enter fullscreen mode Exit fullscreen mode

Step 16: Sort and Reset Index

df = df.sort_values(by=['order_date']) df = df.reset_index(drop=True) 
Enter fullscreen mode Exit fullscreen mode

Step 17: Final Save

df.to_csv("clean_data.csv", index=False) 
Enter fullscreen mode Exit fullscreen mode

Now dataset is ready for analysis or dashboard (Power BI, Tableau, etc.).


Final Checklist for Any Dataset

βœ… Import dataset
βœ… Check shape, info, head/tail
βœ… Clean column names
βœ… Fix data types
βœ… Handle missing values
βœ… Remove duplicates
βœ… Check unique values
βœ… Descriptive statistics
βœ… Handle outliers
βœ… Standardize text
βœ… Drop/rename columns
βœ… Create new features
βœ… Check consistency
βœ… Sort and reset index
βœ… Save clean dataset


Conclusion

No matter what dataset you are working on (Sales, HR, Finance, COVID, Marketing, Banking, Ecommerce),
these steps will always help you make the dataset clean and ready for analysis.

πŸ‘‰ Clean data = Better insights + Correct dashboards + Happy clients.

By Nivesh Bansal

Top comments (0)