Data Science
Unit - I
Reference Book
• Data Science from Scratch: First Principles with Python, Joel Grus, O'Reilly, 2nd Edition
• Advancing into Analytics: From Excel to Python and R, George Mount, O'Reilly, First Edition
• Introduction to Machine Learning with Python, Andreas C. Müller, Sarah Guido, O'Reilly, First Edition
DS Unit I
Introduction to Data Science and Data Preprocessing
• What is Data Science?
• Data Types and Sources
• Data Preprocessing
• Data Wrangling and Feature Engineering
• Tools and Libraries
What is Data Science?
• Data science is the domain of study that deals with vast
volumes of data using modern tools and techniques to find
unseen patterns, derive meaningful information, and make
business decisions.
• Data science uses complex machine learning algorithms to build
predictive models.
• The data used for analysis can come from many different sources and be presented in various formats.
• Data science is the study of data to extract meaningful insights
for business.
• It is a multidisciplinary approach that combines principles and
practices from the fields of mathematics, statistics, artificial
intelligence, and computer engineering to analyze large
amounts of data.
• This analysis helps data scientists ask and answer questions such as what happened, why it happened, what will happen, and what can be done with the results.
Data Science Definition
• Data Science may be defined as a
multidisciplinary blend of data inference,
algorithm development, and technology to
solve complex data analysis issues.
• DS deals with identification, representation,
and extraction of meaningful information from
data sources to be used for business purposes.
• Data engineers are responsible for setting up
the database and storage to facilitate the
process of data mining, data munging and
other processes.
Scope of Data Science
• Machine Learning: Machine Learning has
become one of the most critical trends in data
science. It involves using algorithms to learn
from data and make predictions.
• Big Data: With the rise of digital data, Big Data
is becoming a significant trend in data science.
Companies are using Big Data to gain insights
from large sets of data.
• IoT and Edge Computing: IoT and Edge
Computing are becoming increasingly popular
as companies seek ways to make sense of data
from connected devices.
• Natural Language Processing: Natural Language
Processing is becoming a popular trend in data
science. It involves using algorithms to understand
human language and make predictions.
• Artificial Intelligence: Artificial Intelligence is
becoming more advanced and is being used to
automate many processes in data science.
• Cloud Computing: Cloud Computing is becoming
increasingly popular as companies look for ways to
store and process large amounts of data in the cloud.
• Data Visualisation: Data Visualisation is an essential
trend in data science. It involves creating visual
representations of data to make it easier to
understand.
Applications of Data Science
In Search Engines
• The most useful application of Data Science is
Search Engines.
• As we know, when we want to search for something on the internet, we mostly use search engines like Google, Bing, Yahoo, etc.
• Data Science is used to make these searches faster and more relevant.
• For example, when we search for something such as "Data Structures and Algorithms courses", the first links shown in the browser belong to particular websites.
• This happens because those particular websites are visited most often for that information.
• This analysis is done using Data Science, so the most-visited web links appear at the top.
In Transport
• Data Science has also entered real-time domains such as transport, for example driverless cars.
• With the help of driverless cars, it becomes easier to reduce the number of accidents.
• For example, in driverless cars the training data is fed into the algorithm, and with the help of Data Science techniques the data is analyzed: what the speed limit is on highways, busy streets, narrow roads, etc., and how to handle different situations while driving.
In Finance
• Data Science plays a key role in financial industries.
• Financial industries always face the issues of fraud and risk of losses.
• Thus, financial industries need to automate risk-of-loss analysis in order to carry out strategic decisions for the company.
• Financial industries also use Data Science analytics tools to predict the future.
• This allows companies to predict customer lifetime value and stock market moves.
• For example, Data Science is a central part of the stock market.
• In the stock market, Data Science is used to examine past behavior with past data, with the goal of estimating the future outcome.
• Data is analyzed in such a way that it becomes possible to predict future stock prices over a set time frame.
In E-Commerce
• E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user experience with personalized recommendations.
• For example, when we search for something on an e-commerce website, we get suggestions similar to our choices based on our past data, and we also get recommendations based on the most bought, most rated, and most searched products, etc. This is all done with the help of Data Science.
In Health Care
• In the healthcare industry, Data Science acts as a boon.
• Data Science is used for:
• Detecting Tumor.
• Drug discoveries.
• Medical Image Analysis.
• Virtual Medical Bots.
• Genetics and Genomics.
• Predictive Modeling for Diagnosis etc.
Image Recognition
• Currently, Data Science is also used in image recognition.
• For example, when we upload a photo with a friend on Facebook, Facebook gives suggestions for tagging who is in the picture.
• This is done with the help of machine learning and Data Science.
• When an image is recognized, analysis is done over one's Facebook friends, and if a face present in the picture matches someone's profile, Facebook suggests auto-tagging.
Targeting Recommendation
• Targeted recommendation is one of the most important applications of Data Science.
• Whatever a user searches for on the internet, related posts follow him/her everywhere.
• For example, suppose I want a mobile phone, so I search for it on Google, and after that I change my mind and decide to buy it offline.
• In the real world, Data Science helps the companies that pay for advertisements for that mobile phone.
• So everywhere on the internet, in social media, on websites, and in apps, I will see recommendations for the mobile phone I searched for.
• This nudges me to buy it online.
Airline Routing Planning
• With the help of Data Science, the airline sector is also growing; for example, it becomes easier to predict flight delays.
• It also helps decide whether to fly directly to the destination or to take a halt in between: a flight can have a direct route from Delhi to the U.S.A., or it can halt in between before reaching the destination.
Gaming
• In most games where a user plays against a computer opponent, Data Science concepts are used along with machine learning, so that with the help of past data the computer improves its performance.
• Many games, such as Chess and EA Sports titles, use Data Science concepts.
Medicine and Drug Development
• The process of creating a medicine is very difficult and time-consuming and has to be done with full discipline because it is a matter of someone's life.
• Without Data Science, developing a new medicine or drug takes a lot of time, resources, and finance; with the help of Data Science it becomes easier, because the success rate can be predicted based on biological data and factors.
• Algorithms based on Data Science can forecast how a compound will react in the human body without lab experiments.
In Delivery Logistics
• Various Logistics companies like DHL,
FedEx, etc. make use of Data
Science.
• Data Science helps these companies
to find the best route for the
Shipment of their Products, the best
time suited for delivery, the best
mode of transport to reach the
destination, etc.
Autocomplete
• The autocomplete feature is an important application of Data Science: the user types just a few letters or words and gets suggestions that complete the rest of the line.
• In Gmail, when we write a formal mail to someone, the Data Science concepts behind the autocomplete feature offer an efficient choice to complete the whole sentence.
• The autocomplete feature is also widely used in search engines, in social media, and in various apps.
Fraud and risk detection
• Banking and Financial services industry
has a separate segment for data analysis.
• Data Science was brought in to rescue these organisations from losses.
• It helped them segment customers on the basis of past expenditure, current credits, and other essential variables to analyse the probability of risk and default.
• It also helped them push their financial products based on a customer's financials.
Government
• Government is maintaining the records
of the citizens in their database
including the photographs, fingerprints,
addresses, phone numbers etc in order
to maintain law and order in the country.
• This data helps the government in
taxation, passing on financial benefits to
the needy, and even tracking down the
lost people.
Data Science vs Machine Learning
• Data Science helps with creating insights from data that deals with real-world complexities; Machine Learning helps in accurately predicting or classifying outcomes for new data points by learning patterns from historical data.
• Preferred skillset for Data Science: domain expertise, strong SQL, ETL and data profiling, NoSQL systems, standard reporting, visualization. Preferred skillset for Machine Learning: Python/R programming, strong mathematics knowledge, data wrangling, SQL, model-specific visualization.
• Data Science prefers horizontally scalable systems to handle massive data; Machine Learning prefers GPUs for intensive vector operations.
• Data Science needs components for handling unstructured raw data; in Machine Learning, the significant complexity lies in the algorithms and the mathematical concepts behind them.
• In Data Science, most of the input data is in a human-consumable form; in Machine Learning, input data is transformed specifically for the type of algorithms used.
Data Science vs Data Mining
• Data Science is an area; Data Mining is a technique.
• Data Science is about the collection, processing, analysis, and utilization of data in various operations, and is more conceptual; Data Mining is about extracting the vital and valuable information from the data.
• Data Science is a field of study, just like Computer Science, Applied Statistics, or Applied Mathematics; Data Mining is a technique that is part of the Knowledge Discovery in Databases (KDD) process.
• The goal of Data Science is to build data-dominant products for a venture; the goal of Data Mining is to make data more vital and usable by extracting only the important information.
• Data Science deals with all types of data: structured, unstructured, or semi-structured; Data Mining mainly deals with structured forms of data.
• Data Science is a superset of Data Mining, as it consists of data scraping, cleaning, visualization, statistics, and many more techniques; Data Mining is a subset of Data Science, as mining activities sit in the pipeline of Data Science.
• Data Science is mainly used for scientific purposes; Data Mining is mainly used for business purposes.
• Data Science broadly focuses on the science of the data; Data Mining is more involved with the processes.
Different types of data
• Structured
• Unstructured
• Semi-structured
Structured Data
• Structured data is generally stored in tables in the form of rows and
columns.
• Structured data in these tables can form relations with other tables.
• Humans and machines can easily retrieve information from structured
data.
• This data is meaningful and is used to develop data models.
• Structured data is used by many business organizations.
• Companies apply data visualization techniques on the structured data to
extract some meaningful insights from that data and develop data
models.
• Machine learning algorithms are applied on this data so that they can
predict the future outcomes based on this.
• Data present in a Relational Database is the best example for structured
data and this data can be accessed using a structured query language
(SQL).
• Structured data is highly secured and requires low storage space. About
20% of the data is structured.
Structured data advantages
• It is easy to search for data
• Less storage space is required
• More data analytics tools can be
used
• Data is highly secured
Structured data disadvantages
• Data is not flexible
• Its storage options are limited
Unstructured Data
• Unprocessed and unorganized data is known as unstructured data.
• This type of data has no predefined structure and is not directly used to develop data models.
• Unstructured data may be text, images, audio, videos, reviews, satellite
images, etc.
• Almost 80% of the data in this world is in the form of unstructured data.
• Unstructured data needs a lot of storage space.
• Here, data is not secured. It is difficult to search this data as it is not
organized properly.
• This data is stored in NoSQL databases as they can’t be managed using
relational databases.
• It is very difficult to get insights from this data.
• Text files, Emails, data from social media applications, IoT, media etc., are
examples of human generated unstructured data. Satellite images,
scientific data etc., are examples of machine generated unstructured data.
• Tools used on unstructured data are MongoDB, Hadoop, DynamoDB,
Azure, etc. Data visualization is best for analyzing unstructured data as they
show hidden meaning of that data.
Unstructured data advantages
• Data is flexible.
• This data can be used for a wide
range of purposes as it is in its
original form.
Unstructured data disadvantages
• It requires more storage space.
• There is no security for data.
• Searching for data is a difficult
process.
• There are limited tools available to
analyze this data.
Semi-Structured Data
• Semi structured data is organized up to some extent only and
the rest is unstructured.
• Hence, the level of organizing is less than that of Structured
Data and higher than that of Unstructured Data.
• Semi-structured data is partially organized by means of XML.
• In semi-structured data, transaction management is not supported by default but is adapted from the DBMS; however, there is no data concurrency.
• Data versioning is possible only over tuples or graphs, because semi-structured data supports database features only partially.
• Semi-structured data is more flexible than structured data but less flexible and scalable compared to unstructured data.
• With semi-structured data we can query only anonymous nodes, so its query performance is lower than that of structured data.
Structured vs semi-structured vs unstructured data
• Technology: structured data is based on relational database tables; semi-structured data is based on XML/RDF (Resource Description Framework); unstructured data is based on character and binary data.
• Transaction management: structured data has matured transactions and various concurrency techniques; in semi-structured data the transaction is adapted from the DBMS and is not matured; unstructured data has no transaction management and no concurrency.
• Version management: structured data supports versioning over tuples, rows, and tables; semi-structured data supports versioning over tuples or graphs; unstructured data is versioned as a whole.
• Flexibility: structured data is schema dependent and less flexible; semi-structured data is more flexible than structured data but less flexible than unstructured data; unstructured data is more flexible, with no schema at all.
• Scalability: it is very difficult to scale a structured DB schema; scaling semi-structured data is simpler than structured data; unstructured data is more scalable.
• Robustness: structured data is very robust; semi-structured data is a newer technology, not very widespread; robustness is not characterized for unstructured data.
• Query performance: structured queries allow complex joins; for semi-structured data, queries over anonymous nodes are possible; for unstructured data, only textual queries are possible.
Data sources
• A data source may be the initial location where data is
born or where physical information is first digitized,
however even the most refined data may serve as a
source, as long as another process accesses and utilizes it.
• Concretely, a data source may be a database, a flat file,
live measurements from physical devices, scraped web
data, or any of the myriad static and streaming data
services which abound across the internet.
• Example of a data source
• Imagine a fashion brand selling products online. To display
whether an item is out of stock, the website gets
information from an inventory database. In this case, the
inventory tables are a data source, accessed by the web
application which serves the website to customers.
Data sources - Databases
• Data science involves extracting value and insights from large volumes of data to
drive business decisions.
• It also involves building predictive models using historical data.
• Databases facilitate effective storage, management, retrieval, and analysis of
such large volumes of data.
• So, as a data scientist, you should understand the fundamentals of databases.
Because they enable the storage and management of large and complex
datasets, allowing for efficient data exploration, modeling, and deriving insights.
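As a minimal, self-contained sketch (the inventory table and its contents are made up for illustration), a database can be queried straight into a pandas DataFrame:

import sqlite3
import pandas as pd

# a small in-memory SQLite database stands in for a real inventory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product_id INTEGER, stock INTEGER)")
conn.executemany("INSERT INTO inventory VALUES (?, ?)", [(1, 12), (2, 0)])

# pull the table into a DataFrame with a SQL query
df = pd.read_sql_query("SELECT product_id, stock FROM inventory", conn)
print(df)
conn.close()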
Data sources - Files
Data sources - APIs
• An Application Programming Interface (API) allows
pieces of code to interact with one another.
• Developers use APIs to build their websites with
specific features, like a Google Maps interface,
instead of having to write code from scratch.
• Some may be open-source, while others charge a
fee for implementation.
• You typically need to register a developer account
or have some other means of authentication for
APIs.
• APIs are the essential building blocks for data
science.
• They provide key data sources and enable data
integration and visualization.
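For illustration only, a hedged sketch of calling a JSON API with the requests library; the URL and the city parameter are hypothetical, and a real API would normally require an API key or other authentication:

import requests

# hypothetical endpoint; real APIs typically require authentication
response = requests.get(
    "https://api.example.com/v1/weather",
    params={"city": "Delhi"},
    timeout=10,
)
response.raise_for_status()   # stop if the request failed
data = response.json()        # parse the JSON payload into a Python dict
print(data)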
Data Source - web scraping
• Web scraping is one of the most powerful
tools that data scientists can use to extract
data from the web.
• Web scraping is used to extract data from
web pages automatically.
• Web scraping can be used to extract data
from almost any type of website, including
blogs, news sites, forums, and social media
sites.
• Web scraping is used to extract data for
many different purposes, including data
mining, data aggregation, web indexing,
and information filtering.
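A minimal scraping sketch using requests and BeautifulSoup; the URL and the assumption that headlines sit in <h2> tags are illustrative, and a site's terms of service and robots.txt should always be checked before scraping:

import requests
from bs4 import BeautifulSoup

# hypothetical news page
html = requests.get("https://example.com/news", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# collect the text of every <h2> heading on the page
headlines = [h.get_text(strip=True) for h in soup.find_all("h2")]
print(headlines)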
Data sources - sensors
• Sensor data analytics is needed to gather and analyze
the data from sensor-equipped devices used in various
fields: manufacturing, healthcare, retail, BFSI, oil and
gas, automotive, energy, transportation, logistics,
agriculture, smart cities, and more.
• Sensor data is the output of a device that detects and
responds to some type of input from the physical
environment.
• The output may be used to provide information to an
end user or as input to another system or to guide a
process.
• Depending on your application, you might need to use
multiple types of sensors, or combine sensor data with
other sources of data, such as GPS, images, or text.
Data sources - social media
• Social media data is any type of data that can
be gathered through social media.
• Social media metrics - Engagement: Clicks,
comments, shares, etc., Reach, Impressions
and video views, Follower count and growth
over time, Profile visits, Brand sentiment,
Social share of voice
• Demographics - age, gender, location,
language, behaviors, etc.
Data Preprocessing
• Data preprocessing is an important step in the data
mining process.
• It refers to the cleaning, transforming, and integrating
of data in order to make it ready for analysis.
• The goal of data preprocessing is to improve the quality
of the data and to make it more suitable for the specific
data mining task.
• Real-world datasets are generally messy, raw, incomplete, inconsistent, and unusable as-is.
• They can contain manual entry errors, missing values, inconsistent schema, etc.
• Data Preprocessing is the process of converting raw
data into a format that is understandable and usable.
• It is a crucial step in any Data Science project to carry
out an efficient and accurate analysis. It ensures that
data quality is consistent before applying any Machine
Learning or Data Mining techniques.
Data Preprocessing
– Data cleaning
• Handling missing values
• Outliers
• Duplicates
– Data transformation
• Scaling
• Normalization
• Encoding categorical variables
– Feature selection
• Selecting relevant features/columns
– Data merging
• Combining multiple datasets
Data cleaning
• This involves identifying and correcting errors
or inconsistencies in the data, such as missing
values, outliers, and duplicates.
• Various techniques can be used for data
cleaning, such as imputation, removal, and
transformation.
• Data Cleaning uses methods to handle
incorrect, incomplete, inconsistent, or missing
values.
Handling missing values
• Input data can contain missing
or NULL values, which must be
handled before applying any Machine
Learning or Data Mining techniques.
• Missing values can be handled by
many techniques, such as removing
rows/columns containing NULL values
and imputing NULL values using
mean, mode, regression, etc.
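A small pandas sketch (the DataFrame and its columns are made up) showing the two common options, dropping rows with NULLs and imputing them:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 32, 40],
                   "city": ["Pune", "Delhi", None, "Delhi"]})

dropped = df.dropna()                                   # remove rows containing NULLs
df["age"] = df["age"].fillna(df["age"].mean())          # impute numeric column with the mean
df["city"] = df["city"].fillna(df["city"].mode()[0])    # impute categorical column with the mode
print(df)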
Handling outliers
• Outliers are data points that stand out from the rest.
• They’re unusual values that don’t follow the overall
pattern of your data.
• Identifying outliers in Data science is important because
they can skew results and mislead analyses.
• Once found, you have a few options for handling outliers:
1. Transform the data: apply log, square root, or other transformations to compress the range of values and reduce outlier impact.
2. Use robust statistics: choose statistical methods less influenced by outliers, such as the median, mode, and interquartile range, instead of the mean and standard deviation.
3. Impute replacement values: for outliers caused by missing or erroneous values, you can estimate replacements using the mean, median, or most frequent values.
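A hedged sketch (the numbers are made up) of flagging outliers with the interquartile range (IQR) rule and compressing their impact with a log transform:

import numpy as np
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 300])   # 300 is an obvious outlier

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
print(values[mask])             # the detected outliers

log_values = np.log1p(values)   # log transform compresses the outlier's influence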
Handling duplicates
• When you are working with large datasets,
working across multiple data sources, or have
not implemented any quality checks before
adding an entry, your data will likely show
duplicated values.
• These duplicated values add redundancy to
your data and can make your calculations go
wrong. Duplicate serial numbers of products in
a dataset will give you a higher count of
products than the actual numbers.
• Duplicate email IDs or mobile numbers might
cause your communication to look more like
spam. We take care of these duplicate records
by keeping just one occurrence of any unique
observation in our data.
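A minimal pandas sketch (the data is illustrative) of removing duplicates while keeping just one occurrence of each record:

import pandas as pd

df = pd.DataFrame({"email": ["a@x.com", "b@x.com", "a@x.com"],
                   "order": [101, 102, 101]})

print(df.duplicated().sum())             # count fully duplicated rows
df = df.drop_duplicates()                # keep the first occurrence of each row
df = df.drop_duplicates(subset="email")  # or deduplicate on one column only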
Data transformation
• This involves converting the data into a
suitable format for analysis.
• Common techniques used in data
transformation include normalization,
standardization, and discretization.
• Normalization is used to scale the data
to a common range, while
standardization is used to transform the
data to have zero mean and unit
variance.
• Discretization is used to convert
continuous data into discrete categories.
Data transformation - scaling
• Scaling is useful when you want to
compare two different variables on
equal grounds.
• This is especially useful with variables
which use distance measures.
• For example, models that use Euclidean distance are sensitive to the magnitude of the features, so scaling helps even out the weight of all the features.
• This is important because if one variable is weighted more heavily than another, it introduces bias into our analysis.
Data transformation - normalization
• Normalization is used to scale the data to
a common range, while standardization
is used to transform the data to have
zero mean and unit variance.
• This involves scaling the data to a
common range, such as between 0 and 1
or -1 and 1.
• Normalization is often used to handle
data with different units and scales.
• Common normalization techniques
include min-max normalization, z-score
normalization, and decimal scaling.
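A brief scikit-learn sketch of both approaches on made-up data: MinMaxScaler for min-max normalization to the [0, 1] range and StandardScaler for z-score standardization:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[180.0, 70.0],
              [160.0, 60.0],
              [175.0, 80.0]])               # e.g. height in cm, weight in kg

X_minmax = MinMaxScaler().fit_transform(X)   # each column scaled to [0, 1]
X_zscore = StandardScaler().fit_transform(X) # each column to mean 0, std 1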
Data transformation - encoding categorical variables
• The process of encoding categorical
data into numerical data is called
“categorical encoding.”
• It involves transforming categorical
variables into a numerical format
suitable for machine learning models.
• Encoding categorical data is a
process of converting categorical
data into integer format so that the
data with converted categorical
values can be provided to the
different models.
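As an illustrative sketch (the colour column is made up), scikit-learn's OneHotEncoder converts a categorical column into numeric indicator columns:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# one-hot encode: each category becomes its own 0/1 column
# (sparse_output=False needs scikit-learn >= 1.2; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[["colour"]])
print(encoder.get_feature_names_out())
print(encoded)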
Feature selection
• This involves selecting a subset of
relevant features from the dataset.
• Feature selection is often performed
to remove irrelevant or redundant
features from the dataset.
• It can be done using various
techniques such as correlation
analysis, mutual information, and
principal component analysis (PCA).
Feature selection - selecting relevant features/columns
• There are mainly two types of Feature
Selection techniques, which are:
• Supervised Feature Selection
technique
Supervised Feature selection techniques
consider the target variable and can be
used for the labelled dataset.
• Unsupervised Feature Selection
technique
Unsupervised Feature selection
techniques ignore the target variable and
can be used for the unlabelled dataset.
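As a sketch of supervised feature selection on scikit-learn's built-in iris dataset, SelectKBest with mutual information keeps only the columns most related to the target:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# keep the 2 features sharing the most mutual information with the target
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())   # boolean mask of the selected columns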
Data merging
• Data merging is the process of combining two or more datasets into a single dataset.
• It is a critical step in modern data
pipelines when working with data
from multiple sources or with
different formats that need to be
merged for analysis.
Data merging - combining multiple datasets
• Multiple datasets can be combined using
join(), concat() and merge() functions.
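A short pandas sketch of both styles on made-up tables: merge() joins datasets on a key column, while concat() stacks datasets:

import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
orders = pd.DataFrame({"cust_id": [1, 1, 2], "amount": [250, 120, 540]})

joined = customers.merge(orders, on="cust_id", how="inner")  # SQL-style join on a key
stacked = pd.concat([orders, orders], ignore_index=True)     # stack rows of datasets
print(joined)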
Data wrangling
• Data Wrangling is also referred to as data munging.
• It is the process of transforming and mapping data
from one "raw" data form into another format to
make it more appropriate and valuable for various
downstream purposes such as analytics.
• The goal of data wrangling is to assure quality and
useful data.
• The process of data wrangling may include further
munging, data visualization, data aggregation,
training a statistical model, and many other potential
uses.
• Data wrangling typically follows a set of general
steps, which begin with extracting the raw data from
the data source, "munging" the raw data (e.g.,
sorting) or parsing the data into predefined data
structures, and finally depositing the resulting
content into a data sink for storage and future use.
Data wrangling techniques - Reshaping
• Data Reshaping is about changing the
way data is organized into rows and
columns.
• It is easy to extract data from the rows and columns of a data frame, but there are situations when we need the data frame in a format that is different from the format in which we received it.
• There are many functions to split,
merge and change the rows to columns
and vice-versa in a data frame.
Data wrangling techniques - Pivoting
• Pivoting can be used to restructure a
DataFrame, such that the rows can be
converted into additional column headings
where a chosen column is displayed in these
new column headings.
• Pivoting aids data understanding and
presentation.
• Pivoting is where you take a long data file
(lots of rows, few columns) and make it wider.
Or where you take a wide data file (lots of
columns, few rows) and make it longer.
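A small pandas sketch on made-up sales data: pivot() reshapes long data to wide, and melt() goes back from wide to long:

import pandas as pd

long_df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "sales": [100, 150, 120, 130],
})

# long -> wide: each city becomes its own column heading
wide_df = long_df.pivot(index="month", columns="city", values="sales")

# wide -> long: city columns are melted back into rows
back_to_long = wide_df.reset_index().melt(id_vars="month", value_name="sales")
print(wide_df)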
Data wrangling techniques - aggregating
• Data aggregation is the process of collecting data to
present it in summary form.
• This information is then used to conduct statistical
analysis and can also help company executives make
more informed decisions about marketing strategies,
price settings, and structuring operations, among other
things.
• Aggregating data is a useful tool for data exploration.
• Aggregation is sometimes done to allow analysis to be completed at a higher level of the data. For example, if the size of school districts in a region is to be analyzed, the numbers of students from the schools within each district are summed (aggregated).
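Following the school-district example above, a minimal pandas sketch (made-up numbers) that aggregates school-level rows up to the district level with groupby():

import pandas as pd

schools = pd.DataFrame({
    "district": ["North", "North", "South"],
    "students": [420, 380, 510],
})

# sum student counts per district, and count schools per district
district_size = schools.groupby("district").agg(
    total_students=("students", "sum"),
    num_schools=("students", "count"),
)
print(district_size)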
Feature engineering
• Feature Engineering is the process of
creating new features or transforming
existing features to improve the
performance of a machine-learning model.
• It involves selecting relevant information
from raw data and transforming it into a
format that can be easily understood by a
model.
• The goal is to improve model accuracy by
providing more meaningful and relevant
information.
Feature engineering - Creating new features
• Feature Creation is the process of generating new
features based on domain knowledge or by
observing patterns in the data.
• It is a form of feature engineering that can
significantly improve the performance of a
machine-learning model.
• Types of Feature Creation:
1. Domain-Specific: creating new features based on domain knowledge, such as features based on business rules or industry standards.
2. Data-Driven: creating new features by observing patterns in the data, such as calculating aggregations or creating interaction features.
3. Synthetic: generating new features by combining existing features or synthesizing new data points.
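A small illustrative sketch of the first two types, domain-specific and data-driven features, on a made-up orders table:

import pandas as pd

orders = pd.DataFrame({
    "price": [200.0, 450.0, 120.0],
    "quantity": [2, 1, 5],
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-02"]),
})

# domain-specific feature: total order value from a simple business rule
orders["order_value"] = orders["price"] * orders["quantity"]

# data-driven features: extract patterns from existing columns
orders["order_month"] = orders["order_date"].dt.month
orders["is_bulk"] = (orders["quantity"] >= 3).astype(int)
print(orders)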
Dummification
• Most algorithms can’t deal with
categorical variables directly.
• So, a process
called dummification is used to turn
categorical variables into numerical
ones.
• This process is used to convert each
category into a binary numerical
variable.
Converting categorical variables into binary
indicators
• The pd.get_dummies() function is called on a DataFrame to convert its categorical variables into binary indicator values.
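A minimal sketch (the city column is made up) of dummification with pandas:

import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune"], "sales": [10, 20, 30]})

# each category becomes a 0/1 indicator column; drop_first avoids a redundant column
dummies = pd.get_dummies(df, columns=["city"], drop_first=True)
print(dummies)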
Feature scaling
• Feature Scaling is a technique to
standardize the independent features
present in the data in a fixed range.
• It is performed during the data pre-
processing to handle highly varying
magnitudes or values or units.
• If feature scaling is not done, then a machine learning algorithm tends to weigh greater values higher and treat smaller values as lower, regardless of the units of the values.
Normalization
• This method is more or less the same as the previous one, but here, instead of subtracting the minimum value, we subtract the mean value of the whole data from each entry and then divide the result by the difference between the maximum and the minimum value.
Standardization
• This method of scaling is basically based on the
central tendencies and variance of the data.
1. First, we should calculate the mean and standard
deviation of the data we would like to normalize.
2. Then we are supposed to subtract the mean
value from each entry and then divide the result
by the standard deviation.
• This gives the data a mean equal to zero and a standard deviation equal to 1 (a standard normal distribution, if the data was already normally distributed).
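A short NumPy sketch of both formulas described above, mean normalization and standardization, on a made-up column of values:

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# mean normalization: (x - mean) / (max - min)
x_norm = (x - x.mean()) / (x.max() - x.min())

# standardization (z-score): (x - mean) / standard deviation
x_std = (x - x.mean()) / x.std()

print(round(x_std.mean(), 6), x_std.std())   # approximately 0 and 1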
Libraries used for DS
• Numpy and Scipy
• Pandas
• Matplotlib
• Scikit-learn
• StatsModels
• Seaborn
Numpy and Scipy -
Fundamental Scientific
Computing
• NumPy stands for Numerical Python.
• The most powerful feature of NumPy is the n-dimensional array.
• This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities, and tools for integration with other low-level languages like Fortran, C, and C++.
• SciPy stands for Scientific Python.
• It is built on NumPy.
• SciPy is one of the most useful libraries for a variety of high-level science and engineering modules like discrete Fourier transforms, linear algebra, optimization, and sparse matrices.
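A tiny illustrative example of NumPy's n-dimensional arrays together with a SciPy linear-algebra routine:

import numpy as np
from scipy import linalg

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # a 2-D array (matrix)
b = np.array([9.0, 8.0])

x = linalg.solve(A, b)                    # solve the linear system A x = b
print(x, A @ x)                           # A @ x reproduces b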
Pandas - Data Manipulation and Analysis
• Pandas is used for structured data operations and manipulations.
• It is extensively used for data munging and preparation.
• Pandas was added relatively recently to Python and has been instrumental in boosting Python's usage in the data science community.
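A tiny illustrative example of pandas data manipulation on a made-up table:

import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ravi", "Meena"],
                   "marks": [82, 67, 91]})

# filter and sort rows, a typical preparation step
top = df[df["marks"] > 70].sort_values("marks", ascending=False)
print(top)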
Scikit-learn - Machine Learning
and Data Mining
• Scikit-learn is used for machine learning.
• Built on NumPy, SciPy, and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction.
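A short illustrative classification example on scikit-learn's built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit a simple classifier and report its accuracy on held-out data
model = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))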