Collection, Cleaning and Transformation
INTRODUCTION TO ESSENTIAL DATA SCIENCE STEPS
Agenda
- Data Collection
- Data Cleaning
- Data Transformation
Importance of Data Collection
•Why is data collection crucial?
Data collection is crucial because it forms the foundation for informed
decision-making in any field. By gathering accurate and relevant data,
organizations can identify trends, measure performance, and gain insights into
customer behavior, market dynamics, and operational efficiency.
•Impact of good data collection on analysis and results
Good data collection enhances the accuracy and reliability of analysis, leading
to more precise and actionable results. It ensures that insights are based on
solid evidence, reducing the risk of errors and improving decision-making
outcomes.
Types of Data
- Structured Data
- Unstructured Data
Common Data Sources
- Surveys and Questionnaires
- Databases and Data Warehouses
- Web Scraping
- APIs and Public Data Sets
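As a concrete illustration of API-based collection, here is a minimal Python sketch. The `fetch_json` helper and the sample payload are invented for the example; no real endpoint is assumed, so the parsing step is shown offline:

```python
import json
import urllib.request

def fetch_json(url):
    """Download and decode a JSON payload from a public API endpoint."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# Offline stand-in for an API response, so the parsing step runs
# without network access (the payload is made up).
sample_payload = '[{"city": "Oslo", "temp_c": 4}, {"city": "Lima", "temp_c": 21}]'
records = json.loads(sample_payload)

# Flatten the JSON records into tabular rows ready for analysis.
rows = [(r["city"], r["temp_c"]) for r in records]
print(rows)
```

In a real pipeline, `fetch_json` would be called with the provider's documented URL and the result fed into the same flattening step.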
Data Collection Methods
Manual Data Collection
◦ Pros
◦ Flexibility and Customization
◦ Human Insight
◦ Cost-Effective for Small-Scale Projects
◦ Cons
◦ Time Consuming
◦ Prone to Human Error
◦ Scalability Issues
Automated Data Collection
◦ Pros
◦ Speed and Efficiency
◦ Accuracy and Consistency
◦ Scalability
◦ Cons
◦ High Initial Costs
◦ Lack of Flexibility
◦ Technical Issues
Introduction to Data Cleaning
The necessity of cleaning data before analysis
◦ Data cleaning is essential to remove inaccuracies, inconsistencies, and errors
from datasets, ensuring the reliability of analysis. Clean data leads to more
accurate insights and better decision-making, preventing misleading
conclusions.
Brief overview of common issues in raw data
◦ Missing Data
◦ Duplicate Entries
◦ Inconsistent Formats
◦ Outliers
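A quick way to surface the first two issues is a couple of pandas checks; the tiny DataFrame below is made up for illustration:

```python
import pandas as pd

# Toy dataset exhibiting some of the issues listed above (values invented):
# a missing value, a duplicate row, and inconsistent date formats.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cleo"],
    "signup": ["2021-01-05", "05/01/2021", "05/01/2021", None],
    "score": [88, 92, 92, None],
})

missing_per_column = df.isna().sum()    # count missing data per column
duplicate_rows = df.duplicated().sum()  # count exact duplicate entries
print(missing_per_column["score"], duplicate_rows)
```

Inconsistent formats (such as the two date styles above) usually need a parsing pass, e.g. `pd.to_datetime` with an explicit format, before they can be detected mechanically.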
Handling Missing Values
Types of missing data
◦ Missing Completely at Random (MCAR)
◦ Missing at Random (MAR)
◦ Missing Not at Random (MNAR)
Techniques for handling missing values (e.g., removal, imputation)
◦ Deletion Methods
◦ Listwise Deletion
◦ Pairwise Deletion
◦ Imputation Methods
◦ Mean/Median/Mode Imputation
◦ Predictive Imputation
◦ Multiple Imputation
◦ Time Series Imputation
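A brief sketch of a few of these techniques in pandas; the numbers are invented, and the right choice in practice depends on why the data are missing (MCAR, MAR, or MNAR):

```python
import pandas as pd

# Small series with gaps (values are illustrative).
s = pd.Series([10.0, None, 14.0, None, 18.0])

dropped = s.dropna()              # deletion: discard rows with missing values
mean_filled = s.fillna(s.mean())  # mean imputation: fill gaps with the mean
interpolated = s.interpolate()    # simple time-series imputation (linear)

print(mean_filled.tolist())
print(interpolated.tolist())
```

Predictive and multiple imputation need a modeling library; scikit-learn's `IterativeImputer` is one commonly used option.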
Dealing with Outliers
Definition of outliers
◦ Outliers are data points that significantly deviate from the rest of the
dataset. They can be much higher or lower than the other values and can
skew or mislead statistical analyses.
Handling Outliers
◦ Identification
◦ Transformation
◦ Removal
◦ Imputation
◦ Segmentation
◦ Modeling
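One common identification rule, sketched here with made-up values, is the 1.5×IQR fence; removal and a simple capping transformation follow directly from it:

```python
import pandas as pd

s = pd.Series([12, 13, 12, 14, 13, 98])  # 98 is the obvious outlier

# Identification: flag points outside the 1.5 * IQR fences.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]   # identification
trimmed = s[(s >= lower) & (s <= upper)]  # removal
capped = s.clip(lower, upper)             # transformation (winsorizing)
print(outliers.tolist())
```

Whether to remove, cap, or model outliers is a judgment call; genuine extreme observations often carry signal and should not be discarded automatically.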
Data Transformation
Normalization vs. Standardization
◦ Definition: Normalization rescales data to a fixed range, usually [0, 1] or [-1, 1]; standardization transforms data to have a mean of 0 and a standard deviation of 1.
◦ Effect on Distribution: Normalization does not alter the shape of the distribution, only its scale; standardization centers the distribution around 0 and scales it by the standard deviation.
◦ Sensitivity to Outliers: Normalization is more sensitive to outliers, which can skew the range; standardization is less sensitive, since outliers remain but are scaled differently.
◦ Use Case: Normalization is commonly used where data must fit within a bounded range, e.g., image processing; standardization is preferred in statistical analyses and machine learning algorithms that assume normally distributed data, e.g., linear regression.
◦ Assumption: Normalization assumes data lies within a known, bounded range; standardization assumes data is approximately normally distributed and unbounded.
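The contrast can be seen numerically with scikit-learn's MinMaxScaler and StandardScaler on a toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One numeric feature, shaped (n_samples, 1) as scikit-learn expects.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

normalized = MinMaxScaler().fit_transform(x)      # rescaled to [0, 1]
standardized = StandardScaler().fit_transform(x)  # mean 0, std 1

print(normalized.ravel())
print(standardized.mean(), standardized.std())
```

Note that `StandardScaler` uses the population standard deviation, so the transformed column has exactly unit variance on the training data.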
Example Workflow
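A minimal sketch of how the deck's three stages might chain together in pandas and scikit-learn; the data and column names are invented for the example:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# 1. Collect: a made-up raw extract.
raw = pd.DataFrame({
    "customer": ["a", "b", "b", "c", "d"],
    "spend": [120.0, 80.0, 80.0, None, 95.0],
})

# 2. Clean: drop duplicate rows, impute missing spend with the median.
clean = raw.drop_duplicates().copy()
clean["spend"] = clean["spend"].fillna(clean["spend"].median())

# 3. Transform: standardize the numeric column for modeling.
clean["spend_z"] = StandardScaler().fit_transform(clean[["spend"]])
print(clean.shape)
```

In a real project each step would be driven by the diagnostics from the cleaning slides (missingness mechanism, outlier checks) rather than applied blindly.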
Tools for Data Cleaning and Preprocessing
•Python Libraries:
• Pandas
• NumPy
• SciPy
• Scikit-learn
•SQL-Based Tools:
• SQL
• Apache Hive
•Data Visualization Tools:
• Tableau Prep
• Power BI
Q&A
Questions?