Unit-1 AI ETC MS
Introduction to Artificial Intelligence
mangal.singh@sitpune.edu.in
https://www.linkedin.com/in/singhmangal
Outline
Introduction to AI:
❖ Strong AI, Weak AI, Applications
❖ Machine Learning
❖ Supervised Learning, Unsupervised Learning, Applications
❖ Training, Testing and Validation of data
❖ Data Wrangling, Data Visualization
❖ Exploratory Data Analysis
❖ Univariate, Bivariate and Multivariate Data Analysis.
What is Artificial Intelligence (AI)?
John McCarthy
(the father of Artificial Intelligence)
AI is a branch of computer science dealing with the simulation of intelligent behavior in computers.
AI is the study of how to make computers do things which, at the moment, people do better.
AI is the study and design of intelligent agents, where an intelligent agent is a system that perceives its environment and takes
actions.
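The agent definition above can be illustrated with a minimal perceive-act loop. This is only a sketch: the thermostat scenario, the `ThermostatAgent` class, and the target temperature are hypothetical choices and do not come from the slides.

```python
class ThermostatAgent:
    """A toy agent: it perceives a temperature and chooses an action."""

    def __init__(self, target=22.0, tolerance=1.0):
        self.target = target        # assumed comfort temperature (degrees Celsius)
        self.tolerance = tolerance  # how far from the target we accept

    def act(self, percept):
        # percept is the temperature read from the environment
        if percept < self.target - self.tolerance:
            return "heat"
        if percept > self.target + self.tolerance:
            return "cool"
        return "idle"


agent = ThermostatAgent()
for temperature in [18.0, 22.5, 27.0]:
    print(temperature, "->", agent.act(temperature))
```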
Dr. Mangal Singh #T7473 Unit 1 – Introduction to Artificial Intelligence Symbiosis Institute of Technology, Pune 3
What is Weak AI?
Weak AI, also known as Narrow AI, refers to AI systems that are designed and trained for a specific task or a narrow range of
tasks. These systems do not possess general intelligence or consciousness. Instead, they excel at performing particular
functions within predefined parameters.
Characteristics of Weak AI
▪Task-Specific: Weak AI systems are developed to handle specific tasks, such as language translation, facial recognition, or
playing chess.
▪Lack of Generalization: These AI systems cannot generalize their knowledge or skills to perform tasks outside their
designated domain.
▪No Consciousness: Weak AI lacks self-awareness, consciousness, or understanding. It operates based on programmed
algorithms and learned patterns.
▪Human Assistance: Often requires human intervention for maintenance, updates, and handling unexpected situations.
Examples of Weak AI
What is Strong AI?
Strong AI, also known as Artificial General Intelligence (AGI), refers to AI systems that possess general cognitive abilities.
These systems are capable of understanding, learning, and applying knowledge across a wide range of tasks, much like a
human being. Strong AI remains a theoretical concept and has not yet been achieved.
Characteristics of Strong AI
▪General Intelligence: Strong AI can understand, learn, and apply knowledge in different contexts, much like human
intelligence.
▪Consciousness and Self-Awareness: It possesses self-awareness and consciousness, allowing it to understand and reflect
on its existence.
▪Autonomy: Strong AI can operate independently, make decisions, and solve problems without human intervention.
▪Adaptability: Capable of adapting to new situations and learning from experiences in a way that mimics human cognitive
processes.
Theoretical Implications of Strong AI
Human-Like Understanding: Strong AI would understand natural language, emotions, and complex concepts at a
level comparable to humans.
Versatility: It could perform a wide range of tasks across different domains without being limited to specific
functions.
Ethical and Moral Reasoning: It would possess the ability to make ethical and moral decisions, taking into account the
implications of its actions.
Key Differences between Strong AI and Weak AI
Aspect | Weak AI | Strong AI
Scope and Functionality | Task-specific, narrow focus | General intelligence, wide range of tasks
Adaptability | Limited to specific functions, not easily adaptable | Highly adaptable, learns from experiences
Current Examples | Siri, Alexa, chatbots, recommendation systems | Currently theoretical, not yet achieved
Development Status | Widely used in various applications | Subject of ongoing research and development
AI Techniques
1. Search –
Provides a way of solving problems for which no more direct approach is available. It also provides a framework into which any direct techniques that are available can be embedded.
2. Use of knowledge –
Provides a way of solving complex problems by exploiting the structure of the objects that are involved.
3. Abstraction –
Provides a way of separating important features and variations from many unimportant ones that would
otherwise overwhelm any process.
Task Domains of AI
History of AI
Application Domains of AI
▪ Natural Language Processing
▪ Image Processing
▪ Speech Recognition
▪ Data Mining
▪ Expert System
▪ Robotics
▪ Scheduling
▪ Optimization
▪ Game Playing
▪ Virtual Agents: Chatbots, Conversational AI
▪ Personalized Recommender Systems
▪ Automated Control Systems
▪ Security
AI – ML – DL and Data Science
▪ Artificial Intelligence (AI): Technique that enables machines to mimic human behavior.
▪ Machine Learning (ML): Subset of AI which uses statistical methods to enable machines to learn and improve with time.
▪ Deep Learning (DL): Subset of ML that includes algorithms and enables a system to train itself.
▪ Data Science: appears in the figure alongside the three terms above.
What is Machine Learning?
Humans can learn from past experience and make decisions of their own.
[Figure sequence: labelled examples of CAR and BIKE are shown to a person ("show him"). Drawing on this past experience, the person identifies a new example: "It is a CAR".]
What about a Machine?
Machines follow instructions:
• Addition
• Multiplication
• Division
• Comparison
• Plotting a chart
What is Machine Learning?
Example tasks: predicting a price in 2025, recognizing faces.
[ What do we do? ]
[Figure: a dataset of examples]
[ This is what we call data, or the training dataset. ]
[ Then, devise algorithms and execute programs on the data. ]
Machine Learning Paradigms
[ As human beings, we solve various types of problems in our day-to-day life, and various decisions need to be taken.
Depending on the nature of the problem, machine learning tasks can be broadly divided into: ]
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
What is Supervised Learning?
[Figure: Samples + Labels (CAR, CAR, BIKE, BIKE) = Training Dataset; 𝑓(Training Dataset, new sample) = CAR]
[ In supervised learning, we need something called a labelled training dataset. ]
[ Given a labelled dataset, the task is to devise a function which takes the dataset and a new sample, and
produces an output value. ]
Classification
[ If the possible output values of the function are predefined and discrete/categorical, it is called
classification. ]
[ Predefined classes means that it will produce output only from the labels defined in the dataset. For example,
even if we input a bus, it will produce either CAR or BIKE. ]
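The function 𝑓(dataset, new sample) described in the notes can be sketched as a 1-nearest-neighbour classifier. This is one possible choice of classifier, not necessarily the one intended in the slides; the two features (number of wheels, weight in kg) and all values are hypothetical.

```python
import math

# Hypothetical labelled training dataset: (wheels, weight in kg) -> label
training_data = [
    ((4, 1200.0), "CAR"),
    ((4, 1500.0), "CAR"),
    ((2, 150.0), "BIKE"),
    ((2, 200.0), "BIKE"),
]


def classify(dataset, sample):
    """Return the label of the training sample closest to `sample`."""
    nearest = min(dataset, key=lambda pair: math.dist(pair[0], sample))
    return nearest[1]


print(classify(training_data, (4, 1300.0)))   # a new car-like sample
print(classify(training_data, (6, 12000.0)))  # a bus: the answer is still CAR or BIKE
```

Because the classifier can only return labels present in the training dataset, the bus is necessarily assigned to one of the two predefined classes.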
Classifier
[Figure: Dataset → Classifier → "Elephant"]
Regression
[Figure: 𝑓(Dataset, new sample) = 20500.50]
[ If the possible output values of the function are continuous real values, then it is called regression. ]
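A regression function returns a continuous value rather than a class. The sketch below fits a straight line by ordinary least squares and uses it to answer a "price in 2025?" style question. The year and price values are invented for illustration, and a straight line is only one of many possible regression models.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical dataset: year -> price
years = [2019, 2020, 2021, 2022, 2023]
prices = [10000.0, 12000.0, 14000.0, 16000.0, 18000.0]

slope, intercept = fit_line(years, prices)
predicted_2025 = slope * 2025 + intercept
print(round(predicted_2025, 2))
```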
Classification and Regression problems
[ Classification and regression problems are supervised because the decision depends on
the characteristics of the ground-truth labels or values present in the dataset, which we
define as experience. ]
What is Unsupervised Learning?
[Figure: a dataset of car and bike samples]
[ In unsupervised learning, we do not need to know the labels or ground-truth values. ]
Clustering
[ The task is to identify patterns, such as grouping similar objects together. ]
More Examples: Unsupervised Learning
[Figures: further examples, images only]
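Grouping similar objects without labels can be sketched with a tiny k-means routine on one-dimensional data. k-means is just one clustering algorithm, chosen here for brevity; the weights and the initial centres are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Tiny k-means on 1-D data; `centers` holds the initial guesses."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach every value to its nearest centre
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Update step: move each centre to the mean of its cluster
        for i, members in enumerate(clusters):
            if members:
                centers[i] = sum(members) / len(members)
    return centers, clusters


# Hypothetical unlabelled data: vehicle weights in kg
weights = [150.0, 180.0, 200.0, 1200.0, 1350.0, 1500.0]
centers, clusters = k_means_1d(weights, centers=[100.0, 1000.0])
print(centers)
print(clusters)
```

No labels are supplied: the two groups emerge only from how close the values are to each other.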
What is Reinforcement Learning?
[Figure sequence: a first example, images only]
Another Example
[Figure sequence: one outcome labelled Punishment, another labelled Reward]
The baby learns from trials and errors: this is reinforcement learning.
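Learning from reward and punishment by trial and error can be sketched as follows. This is a minimal epsilon-greedy value-learning loop and not a full reinforcement learning algorithm; the action names, reward values, and parameters are all hypothetical.

```python
import random


def learn_by_trial_and_error(rewards, episodes=50, epsilon=0.2, alpha=0.5, seed=0):
    """Estimate the value of each action from rewards and punishments."""
    rng = random.Random(seed)
    values = {action: 0.0 for action in rewards}
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.choice(list(values))      # explore: try something at random
        else:
            action = max(values, key=values.get)   # exploit: best action so far
        reward = rewards[action]                   # feedback from the environment
        values[action] += alpha * (reward - values[action])
    return values


# Hypothetical environment: one action is punished (-1), the other rewarded (+1)
values = learn_by_trial_and_error({"action_a": -1.0, "action_b": 1.0})
best_action = max(values, key=values.get)
print(values, best_action)
```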
Train vs. Validation vs. Test set
The Training Set
It is the set of data that is used to train and make the model learn the hidden features/patterns in the data.
In each epoch, the same training data is fed to the neural network architecture repeatedly, and the model continues
to learn the features of the data.
The training set should have a diversified set of inputs so that the model is trained in all scenarios and can predict
any unseen data sample that may appear in the future.
The Test Set
The test set is a separate set of data used to test the model after completing the training.
It provides an unbiased final model performance metric in terms of accuracy, precision, etc. To put it simply, it
answers the question of "How well does the model perform?"
The Validation Set
The validation set is a set of data, separate from the training set, that is used to validate our model performance
during training.
This validation process gives information that helps us tune the model’s hyperparameters and configurations
accordingly. It is like a critic telling us whether the training is moving in the right direction or not.
The model is trained on the training set, and, simultaneously, the model evaluation is performed on the validation
set after every epoch.
The main idea of setting aside part of the dataset as a validation set is to prevent our model from overfitting, i.e., the situation where the model
becomes really good at classifying the samples in the training set but cannot generalize and make accurate
classifications on the data it has not seen before.
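Evaluating on the validation set after every epoch can be sketched with a one-parameter model trained by gradient descent. The data, the model y = w * x, and the learning rate are hypothetical; the point is only that training uses the training set while the best parameter is chosen using the validation set.

```python
def mse(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


# Hypothetical data, roughly following y = 2x
train_set = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
val_set = [(1.5, 3.0), (2.5, 5.1)]

w, learning_rate = 0.0, 0.01
initial_val_loss = mse(w, val_set)
best_w, best_val_loss = w, initial_val_loss
history = []

for epoch in range(50):
    # One epoch: a gradient-descent step computed on the training set only
    grad = sum(2 * (w * x - y) * x for x, y in train_set) / len(train_set)
    w -= learning_rate * grad
    # Evaluate on the validation set after every epoch
    val_loss = mse(w, val_set)
    history.append((mse(w, train_set), val_loss))
    if val_loss < best_val_loss:
        best_w, best_val_loss = w, val_loss

print(round(best_w, 3), round(best_val_loss, 4))
```

If the validation loss started rising while the training loss kept falling, that would be the signal of overfitting; keeping `best_w` rather than the final `w` guards against it.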
How to split your Machine Learning data?
If there are several hyperparameters to tune, the machine learning model requires a larger validation set to
optimize the model performance. Conversely, if the model has few or no hyperparameters, a small validation
set is enough.
If the use case is one where a false prediction is very costly, such as falsely predicting cancer, it is better
to validate the model after each epoch so that the model learns varied scenarios.
As the dimension (number of features) of the data increases, the hyperparameters of the neural network also
increase, making the model more complex. In these scenarios, a large share of the data should be kept in the training set,
along with a validation set.
One has to come to a split percentage that suits the requirements and meets the model’s needs.
However, there are two major concerns while deciding on the optimum split:
▪If there is less training data, the machine learning model will show high variance in training.
▪With less testing data/validation data, your model evaluation/model performance statistic will have greater
variance.
Essentially, you need to come up with an optimum split that suits the need of the dataset/model.
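A three-way split can be sketched in a few lines. The 70/15/15 proportions and the fixed seed are assumptions made for illustration, not a recommendation from the slides; the right split depends on the dataset and model, as discussed above.

```python
import random


def split_dataset(samples, train=0.7, val=0.15, seed=42):
    """Shuffle and split into train/validation/test; the remainder is the test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    train_set = shuffled[:n_train]
    val_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]
    return train_set, val_set, test_set


train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```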
What is Data Wrangling?
It is the process of transforming and mapping data from one "raw" data form into another format to make it more
appropriate and valuable for downstream purposes such as analytics.
Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual
analysis of the data.
The process of data wrangling may include further munging, data visualization, data aggregation, training a
statistical model, and many other potential uses.
Data wrangling typically follows a set of general steps, which begin with extracting the raw data from the data
source, "munging" the raw data (e.g., sorting) or parsing the data into predefined data structures, and finally
depositing the resulting content into a data sink for storage and future use.
Wrangling the data is usually accompanied by Mapping. The term "Data Mapping" refers to the element of the
wrangling process that involves matching source data fields to their respective target data fields.
While Wrangling is dedicated to transforming data, Mapping is about connecting the dots between different
elements.
Importance of Data Wrangling
Making raw data usable. Accurately wrangled data guarantees that quality data is entered into the downstream analysis.
Getting all data from various sources into a centralized location so it can be used.
Piecing together raw data according to the required format and understanding the business context of data.
Automated data integration tools are used as data wrangling techniques that clean and convert source data into a standard
format that can be used repeatedly according to end requirements. Businesses use this standardized data to perform crucial,
cross-data set analytics.
Data wrangling acts as a preparation stage for the data mining process, which involves gathering data and making sense of it.
Data Wrangling Process
Data Wrangling Process-I
Discovery: Before starting the wrangling process, it is critical to think about what may lie beneath your data. It is crucial to think
critically about what results you anticipate from your data and what you will use it for once the wrangling process is
complete. Once you've determined your objectives, you can gather your data.
Organization: After you've gathered your raw data within a particular dataset, you must structure your data. Due to the variety
and complexity of data types and sources, raw data is often overwhelming at first glance.
Cleaning: When your data is organized, you can begin cleaning your data. Data cleaning involves removing outliers, formatting
nulls, and eliminating duplicate data. It is important to note that cleaning data collected from web scraping methods might be
more tedious than cleaning data collected from a database. Essentially, web data can be highly unstructured and require more
time than structured data from a database.
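The cleaning step (duplicates, nulls, outliers) can be sketched with the standard library alone. The rows are hypothetical, dropping rows with missing values is only one way to handle nulls, and the 1.5 * IQR rule is one common outlier convention rather than the only one.

```python
import statistics

# Hypothetical raw rows as (id, price); None marks a missing price
raw_rows = [
    (1, 98.0), (2, 99.0), (2, 99.0),   # the second (2, 99.0) is a duplicate
    (3, None),                         # missing value
    (4, 100.0), (5, 101.0), (6, 102.0),
    (7, 103.0), (8, 104.0), (9, 105.0),
    (10, 5000.0),                      # severe outlier
]


def clean(rows):
    # 1. Eliminate exact duplicates while keeping the original order
    unique = list(dict.fromkeys(rows))
    # 2. Drop rows whose price is missing (imputation would be an alternative)
    present = [row for row in unique if row[1] is not None]
    # 3. Remove outliers with the 1.5 * IQR rule
    q1, _, q3 = statistics.quantiles([row[1] for row in present], n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [row for row in present if low <= row[1] <= high]


cleaned = clean(raw_rows)
print([row[0] for row in cleaned])
```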
Data Wrangling Process-II
Data enrichment: This step requires that you take a step back from your data to determine if you have enough data to proceed.
Finishing the wrangling process without enough data may compromise insights gathered from further analysis. For example,
investors looking to analyze product review data will want a significant amount of data to portray the market and increase
investment intelligence.
Validation: After determining you gathered enough data, you will need to apply validation rules to your data. Validation rules,
performed in repetitive sequences, confirm that your data is consistent throughout your dataset. Validation rules will also
ensure quality as well as security. This step follows similar logic utilized in data normalization, a data standardization process
involving validation rules.
Publishing: The final step of the data munging process is data publishing. Data publishing involves preparing the data for future
use. This may include providing notes and documentation of your wrangling process and creating access for other users and
applications.
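Validation rules of the kind described above can be sketched as named checks applied to every record. The field names, the rules themselves, and the sample records are hypothetical.

```python
# Hypothetical validation rules: each maps a field name to a check
rules = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "email": lambda v: isinstance(v, str) and v.count("@") == 1,
}


def validate(record, rules):
    """Return the names of the fields that break a rule (empty list = valid)."""
    return [field for field, check in rules.items()
            if field not in record or not check(record[field])]


records = [
    {"age": 34, "email": "a@example.com"},
    {"age": -5, "email": "b@example.com"},
    {"age": 51, "email": "not-an-email"},
]
for record in records:
    print(record, validate(record, rules))
```

Running the same rules over every record is what makes the check repeatable and the dataset consistent.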
Benefits of Data Wrangling
Data Wrangling Formats
Transactional data: Transactional data refers to business operation transactions. This data type involves detailed subjective
information about particular transactions, including client documentation, client interactions, receipts, and notes regarding any
external transactions.
Analytical Base Table (ABT): Analytical Base Table data involves data within a table with unique entries for each attribute
column. ABT data is the most common business data type as it involves various data types that contribute to the most common
data sources. Even more notable is that ABT data is primarily used for AI and ML, which we will examine later.
Time-series: Time series data involves data that has been divided by a particular amount of time or data that has a relation with
time, particularly sequential time. For example, tracking data regarding an application's downloads over a year or tracking traffic
data over a month would be considered time series data.
Document library: Lastly, document library data is information that involves a large amount of textual data, particularly text
within a document. While document libraries contain rather massive amounts of data, automated data mining tools specifically
designed for text mining can help extract entire texts from documents for further analysis.
Data Wrangling Examples
Merging several data sources into one data set for analysis
Identifying gaps or empty cells in data and either filling or removing them
Identifying severe outliers in data and either explaining the inconsistencies or deleting them to facilitate analysis.
Businesses also use data wrangling tools to:
• Detect corporate fraud
• Support data security
• Ensure accurate and recurring data modeling results
• Ensure business compliance with industry standards
• Perform Customer Behavior Analysis
• Reduce time spent on preparing data for analysis
• Promptly recognize the business value of your data
• Find out data trends
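The first two examples in this slide, merging several sources and filling gaps, can be sketched as a keyed merge. The two sources, their field names, and the choice of filling a gap with the mean are all hypothetical.

```python
# Two hypothetical sources keyed by customer id
crm = {101: {"name": "Asha"}, 102: {"name": "Ravi"}, 103: {"name": "Meera"}}
billing = {101: {"spend": 250.0}, 103: {"spend": 410.0}}


def merge_sources(left, right, fill):
    """Left-join `right` onto `left` by key, filling fields absent from `right`."""
    merged = {}
    for key, row in left.items():
        # Later dictionaries win, so real values override the fill defaults
        merged[key] = {**fill, **row, **right.get(key, {})}
    return merged


# Fill the gap with the mean of the known values (one of several possible choices)
mean_spend = sum(r["spend"] for r in billing.values()) / len(billing)
dataset = merge_sources(crm, billing, fill={"spend": mean_spend})
print(dataset)
```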
What is Data Visualization?
Data visualization is a graphical representation of quantitative information and data by using visual elements like
graphs, charts, and maps.
Data visualization converts large and small data sets into visuals, which are easy for humans to understand and
process.
Data visualization tools provide accessible ways to understand outliers, patterns, and trends in the data.
In the world of Big Data, data visualization tools and technologies are required to analyze vast amounts of
information.
Data visualizations are common in everyday life, where they usually appear in the form of graphs and charts. A
combination of multiple visualizations and bits of information is referred to as an infographic.
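As a small illustration of turning data into a visual, the sketch below draws a text-only bar chart of category frequencies. It is a stand-in for a real plotting library, used here so that the example needs nothing beyond the standard library; the data and the `text_bar_chart` helper are hypothetical.

```python
from collections import Counter


def text_bar_chart(values, width=30):
    """Render category frequencies as horizontal bars made of '#' characters."""
    counts = Counter(values)
    longest = max(counts.values())
    lines = []
    for label, count in counts.most_common():
        bar = "#" * max(1, round(width * count / longest))
        lines.append(f"{label:<6}{bar} {count}")
    return "\n".join(lines)


# Hypothetical categorical data
vehicles = ["CAR", "BIKE", "CAR", "BUS", "CAR", "BIKE"]
print(text_bar_chart(vehicles))
```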
What makes Data Visualization Effective?
Using graphs and charts to visualize large, complex data sets is easier than studying
spreadsheets and reports.
Why Use Data Visualization?
To perform competitive analysis.
To improve insights.
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a crucial initial step in data science projects.
It involves analyzing and visualizing data sets to understand their key characteristics, uncover patterns, locate
outliers, and identify relationships between variables.
EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.
Exploratory Data Analysis: Key Aspects
Distribution of Data: Examining the distribution of data points to understand their range, central tendencies (mean, median), and
dispersion (variance, standard deviation).
Graphical Representations: Utilizing charts such as histograms, box plots, scatter plots, and bar charts to visualize relationships within
the data and distributions of variables.
Outlier Detection: Identifying unusual values that deviate from other data points. Outliers can influence statistical analyses and might
indicate data entry errors or unique cases.
Correlation Analysis: Checking the relationships between variables to understand how they might affect each other. This includes
computing correlation coefficients and creating correlation matrices.
Handling Missing Values: Detecting and deciding how to address missing data points, whether by imputation or removal, depending on
their impact and the amount of missing data.
Summary Statistics: Calculating key statistics that provide insight into data trends and nuances.
Testing Assumptions: Many statistical tests and models assume the data meet certain conditions (like normality or homoscedasticity).
EDA helps verify these assumptions.
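Several of the aspects above (missing values, summary statistics, outlier detection) can be sketched on a single numeric column. The values are hypothetical, and the 1.5 * IQR rule is one common convention for flagging outliers, not the only one.

```python
import statistics

# Hypothetical numeric column with one missing value (None)
column = [12.0, 15.0, 14.0, None, 13.0, 16.0, 15.0, 14.0, 95.0, 13.0]

# Handling missing values: count them, then work with the observed values
missing = sum(value is None for value in column)
values = [value for value in column if value is not None]

# Summary statistics: central tendency and dispersion
summary = {
    "count": len(values),
    "missing": missing,
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
}

# Outlier detection with the 1.5 * IQR rule
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(summary)
print(outliers)
```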
Why is Exploratory Data Analysis Important?
Understanding Data Structures: EDA helps in getting familiar with the dataset, understanding the number of features, the type of
data in each feature, and the distribution of data points. This understanding is crucial for selecting appropriate analysis or
prediction techniques.
Identifying Patterns and Relationships: Through visualizations and statistical summaries, EDA can reveal hidden patterns and
intrinsic relationships between variables. These insights can guide further analysis and enable more effective feature
engineering and model building.
Detecting Anomalies and Outliers: EDA is essential for identifying errors or unusual data points that may adversely affect the
results of your analysis. Detecting these early can prevent costly mistakes in predictive modeling and analysis.
Testing Assumptions: Many statistical models assume that data follow a certain distribution or that variables are independent.
EDA involves checking these assumptions. If the assumptions do not hold, the conclusions drawn from the model could be
invalid.
Informing Feature Selection and Engineering: Insights gained from EDA can inform which features are most relevant to include
in a model and how to transform them (scaling, encoding) to improve model performance.
Optimizing Model Design: By understanding the data’s characteristics, analysts can choose appropriate modeling techniques,
decide on the complexity of the model, and better tune model parameters.
Facilitating Data Cleaning: EDA helps in spotting missing values and errors in the data, which are critical to address before
further analysis to improve data quality and integrity.
Enhancing Communication: Visual and statistical summaries from EDA can make it easier to communicate findings and
convince others of the validity of your conclusions, particularly when explaining data-driven insights to stakeholders without
technical backgrounds.
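The feature transformations mentioned above (scaling and encoding) might look like the following. This is a rough sketch assuming NumPy; the `income` and `city` features are hypothetical examples, and z-score standardization and one-hot encoding are only one possible choice for each step.

```python
import numpy as np

# Hypothetical raw features: one numeric, one categorical.
income = np.array([30000.0, 45000.0, 60000.0, 75000.0])
city = ["Pune", "Mumbai", "Pune", "Delhi"]

# Scaling: standardize to zero mean and unit standard deviation (z-score).
income_scaled = (income - income.mean()) / income.std()

# Encoding: one-hot encode the categorical feature.
# Columns follow the sorted category order: Delhi, Mumbai, Pune.
categories = sorted(set(city))
one_hot = np.array([[1 if value == cat else 0 for cat in categories]
                    for value in city])

print(income_scaled)
print(categories)
print(one_hot)
```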
Types of Exploratory Data Analysis
EDA, or Exploratory Data Analysis, refers to the process of examining datasets to uncover patterns, identify relationships,
and gain insights.
Various EDA techniques can be employed, depending on the nature of the data and the goals of the analysis.
Depending on the number of columns being analyzed, EDA can be divided into three types:
• Univariate
• Bivariate
• Multivariate.
Univariate Data Analysis
It is primarily concerned with describing the data and finding patterns existing in a single feature.
This type of analysis focuses on examining individual variables in the dataset.
It involves summarizing and visualizing a single variable at a time to understand its distribution, central tendency, spread,
and other relevant statistics.
▪ Box plots: Useful for detecting outliers and understanding the spread and skewness of the data.
▪ Bar charts: Employed for categorical data to show the frequency of each category.
▪ Summary statistics: Calculations like mean, median, mode, variance, and standard deviation that describe the central
tendency and dispersion of the data.
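As a small illustration of the univariate tools listed above, the sketch below computes the summary statistics and applies the 1.5 × IQR rule that box plots conventionally use to flag outliers. It uses only Python's standard `statistics` module; the `scores` list is hypothetical data chosen so that one value is clearly extreme.

```python
import statistics

# Hypothetical single feature (e.g. exam scores); 95 is deliberately extreme.
scores = [52, 55, 55, 58, 60, 61, 63, 64, 95]

# Central tendency.
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Dispersion (sample variance and standard deviation, dividing by n - 1).
variance = statistics.variance(scores)
std_dev = statistics.stdev(scores)

# Box-plot rule: values beyond 1.5 * IQR from the quartiles are flagged.
q1, _, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower or s > upper]

print(f"mean={mean:.2f} median={median} mode={mode} std={std_dev:.2f}")
print(f"IQR bounds=({lower}, {upper}) outliers={outliers}")
```

Note how the single extreme score pulls the mean above the median, which is one reason both are usually reported together.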
Bivariate Data Analysis
Bivariate analysis is a crucial form of exploratory data analysis that examines the relationship between two variables.
▪ Scatter Plots: These are one of the most common tools used in bivariate analysis. A scatter plot helps visualize the relationship
between two continuous variables.
▪ Correlation Coefficient: This statistical measure (often Pearson’s correlation coefficient for linear relationships) quantifies the
degree to which two variables are related.
▪ Line Graphs: In the context of time series data, line graphs can be used to compare two variables over time. This helps in
identifying trends, cycles, or patterns that emerge in the interaction of the variables over the specified period.
▪ Covariance: Covariance is a measure used to determine how much two random variables change together. However, it is sensitive
to the scale of the variables, so it’s often supplemented by the correlation coefficient for a more standardized assessment of the
relationship.
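The scale sensitivity of covariance noted above can be demonstrated directly. The following is a minimal sketch assuming NumPy; the height and weight values are hypothetical, and `cov_and_corr` is a helper introduced here purely for illustration.

```python
import numpy as np

# Hypothetical paired observations: height in metres, weight in kilograms.
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([50.0, 58.0, 65.0, 77.0, 85.0])

def cov_and_corr(x, y):
    """Return the sample covariance and Pearson correlation of x and y."""
    cov = np.cov(x, y)[0, 1]
    corr = np.corrcoef(x, y)[0, 1]
    return cov, corr

# Same data, but height expressed in two different units.
cov_m, corr_m = cov_and_corr(height_m, weight_kg)
cov_cm, corr_cm = cov_and_corr(height_m * 100, weight_kg)

print(f"height in m : cov={cov_m:.3f}, corr={corr_m:.4f}")
print(f"height in cm: cov={cov_cm:.3f}, corr={corr_cm:.4f}")
```

Changing the unit multiplies the covariance by 100 while the correlation coefficient stays the same, which is why correlation is the more convenient measure for comparing relationships across variables.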
Multivariate Data Analysis
Multivariate analysis examines the relationships among three or more variables in the dataset.
It aims to understand how variables interact with one another, which is crucial for most statistical modeling techniques.
▪ Pair plots: Visualize relationships across several variables simultaneously to capture a comprehensive view of potential
interactions.
▪ Principal Component Analysis (PCA): A technique that reduces the dimensionality of large datasets while preserving as much
variance as possible.
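To make the idea behind PCA concrete, here is a minimal sketch that performs it with a singular value decomposition in NumPy. The synthetic dataset, the random seed, and the choice of `k = 2` components are assumptions made for illustration; in practice a library implementation would normally be used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 200 samples, 3 features, where the third feature
# is almost a linear combination of the first two (so it is redundant).
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.05 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Step 1: centre each feature at zero.
X_centered = X - X.mean(axis=0)

# Step 2: SVD; the rows of Vt are the principal directions,
# ordered from largest to smallest singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: variance explained by each component.
explained_variance = S ** 2 / (X.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Step 4: project onto the first k components.
k = 2
X_reduced = X_centered @ Vt[:k].T

print("explained variance ratio:", np.round(explained_ratio, 4))
print("reduced shape:", X_reduced.shape)
```

Because the third feature carries almost no independent information, the first two components retain nearly all of the variance, so the dataset can be reduced from three columns to two with little loss.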
Steps for Performing Exploratory Data Analysis