CHAPTER TWO
Data Science
An Overview of Data Science
Data science is a multi-disciplinary field that uses scientific methods,
processes, algorithms, and systems to extract knowledge and insights
from structured, semi-structured and unstructured data.
It is the systematic study of raw data to make insightful observations.
Data acquisition, data cleaning, feature engineering, modeling, and
visualization are some of the major parts of this universe.
Data science is much more than simply analyzing data. It offers a range of
roles and requires a range of skills.
Overview of Data Science …
• Example:
• Consider data involved in buying a box of KitKat from the store or supermarket:
• Your data here is the planned purchase written down somewhere, e.g., in your notebook.
• When you get to the store, you use that piece of data to remind yourself about
what you need to buy and pick it up and put it in your cart.
• At checkout, the cashier scans the barcode on your box and the cash register
logs the price.
• Back in the warehouse, a computer informs the stock manager that it is time to
order this item from the distributor because your purchase took the last box in the
store.
• You may have a coupon for your purchase and the cashier scans that too, giving
you a predetermined discount.
Overview of Data Science …
• Example:
• At the end of the week, a report of all the scanned
manufacturer coupons gets uploaded to the KitKat company
so they can issue a reimbursement to the grocery store for all
of the coupon discounts they have handed out to customers.
• Finally, at the end of the month, a store manager looks at a
colorful collection of pie charts showing all the different kinds
of KitKat that were sold and, on the basis of strong sales of
KitKat, decides to offer more varieties of these on the store’s
limited shelf space next month.
• So, the small piece of information in your notebook ended up in many different places
• Notably on the desk of a manager as an aid to decision making.
• The data went through many transformations.
Overview of Data Science …
• In addition to the computers where the data might have stopped by or stayed on
for the long term, lots of other pieces of hardware, such as the barcode scanner,
were involved in collecting, manipulating, transmitting, and storing the data.
• In addition, many different pieces of software were used to organize,
aggregate, visualize, and present the data.
• Finally, many different human systems were involved in working with the data.
• People decided which systems to buy and install, who should get access to
what kinds of data, and what would happen to the data after its immediate
purpose was fulfilled.
• Data science has evolved into one of the most promising and in-demand career
paths.
• Professionals use advanced techniques for analyzing large volumes of
data.
Overview of Data Science …
• Skills important for data science:
• Statistics
• Linear algebra
• Programming knowledge with a focus on data warehousing,
data mining, and data modeling
What are data and information?
Data is the representation of facts, concepts, or instructions in a
formalized manner
It is unprocessed facts and figures.
It has no single meaning on its own, since it can be interpreted in multiple ways:
What does 'CHMS' mean?
What does '1992' mean?
Data is the level of conceptualization.
What are data and information?
Information is the processed data on which decisions and actions are
based.
Data is processed to form information.
Information is the level of contextualization.
It can answer the WH questions (what, who, when, where) except 'why'.
Information is interpreted data; created from organized, structured,
and processed data in a particular context.
Still, information alone is not enough for decision making … thus we go for
knowledge.
What are data and information?
Knowledge: an appropriate collection of information.
It is the level of patternization (creating relationships among concepts).
It is used to answer 'how' questions.
It is found through much experience and much information.
It comes through understanding patterns.
Wisdom: a collection of very deep knowledge.
It comes through understanding principles.
[Figure: hierarchical model of human competency]
Data vs. Information …
Data vs. Information Examples Chart
• Seeing examples of data and information side-by-side in a chart can help you
better understand the differences between the two terms.
Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by
people or machines to increase its usefulness and add value for
a particular purpose.
It is the activity of converting raw facts [data] into information.
Information is data that have been processed using the data
processing functions.
Data Processing Cycle
What is the ultimate purpose of storing and then analyzing/processing data?
It is to transform: Data → Information → Knowledge → Action
Data Processing Cycle
Data processing consists of the following basic steps - input, processing,
and output.
Input − in this step, the input data is prepared in some convenient form for
processing.
The form will depend on the processing machine.
Processing − in this step, the input data is changed to produce data in a
more useful form.
Output − at this stage, the result of the preceding processing step is
collected.
Input → Processing → Output
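As a minimal sketch (in Python, with made-up numbers), the same input → processing → output cycle might look like this:

```python
# Minimal sketch of the input -> processing -> output cycle.
# The values below are hypothetical example data.

# Input: prepare the raw data in some convenient form for processing.
raw_lines = ["12.50", "7.25", "30.00"]           # e.g., lines read from a file
prices = [float(line) for line in raw_lines]     # convert text into numbers

# Processing: change the input data into a more useful form.
total = sum(prices)
average = total / len(prices)

# Output: collect and present the result of the processing step.
print(f"Total sales: {total:.2f}")
print(f"Average sale: {average:.2f}")
```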
Data types and their representation
Data types can be described from diverse perspectives.
In computer science and computer programming, for instance,
A data type is an attribute of data that tells the compiler or
interpreter how the programmer intends to use the data.
• Almost all programming languages explicitly include the notion
of data type, though different languages may use different
terminology.
Common data types include:
• Integers: store whole numbers.
• Booleans: store one of two values: true or false.
• Characters: store a single character (numeric, alphabetic, symbol, …).
• Floating-point numbers: store real numbers.
• Alphanumeric strings: store a combination of characters.
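A minimal Python sketch of these common data types (the variable names and values are illustrative only):

```python
# Common data types, illustrated with Python literals.
quantity = 3            # integer: stores a whole number
in_stock = True         # Boolean: stores one of two values, True or False
grade = "A"             # character: Python uses a 1-character string for this
price = 2.99            # floating-point number: stores a real number
product_code = "KK-42"  # alphanumeric string: stores a combination of characters

print(type(quantity), type(in_stock), type(grade), type(price), type(product_code))
```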
Data types from Data Analytics perspective
From a data analytics point of view, it is important to understand that
there are three common types of data structures:
A. Structured
Structured data is data that adheres to a pre-defined data model and is
therefore straightforward to analyze.
Structured data conforms to a tabular format with a relationship
between the different rows and columns.
Common examples of structured data are Excel files or SQL databases.
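As a hedged illustration of structured data, the sketch below builds a tiny SQL table with Python's built-in sqlite3 module; the table name, columns, and rows are invented for the example:

```python
# Structured data: rows and columns that follow a pre-defined schema.
# The table and its rows are hypothetical example data.
import sqlite3

conn = sqlite3.connect(":memory:")   # in-memory SQL database
conn.execute("CREATE TABLE sales (item TEXT, quantity INTEGER, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("KitKat", 2, 1.50), ("Milk", 1, 0.90)])

# Straightforward to analyze, because the schema is known up front.
for row in conn.execute("SELECT item, quantity * price FROM sales"):
    print(row)
conn.close()
```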
Data types from Data Analytics perspective
B. Semi-structured
It is a form of structured data that does not conform to the formal structure of data
models associated with relational databases or other forms of data tables.
JSON and XML are common examples of semi-structured data.
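A minimal sketch of working with semi-structured data using Python's json module; the record below is invented for illustration:

```python
# Semi-structured data: self-describing fields, but no fixed relational schema.
import json

record = '{"item": "KitKat", "price": 1.50, "tags": ["snack", "chocolate"]}'
data = json.loads(record)        # parse the JSON text into a Python dict

print(data["item"], data["price"])
print(data.get("supplier", "unknown"))   # fields may be missing; the structure is flexible
```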
C. Unstructured
Unstructured data is information that either does not have a predefined data model or is
not organized in a pre-defined manner.
Unstructured information is typically text-heavy but may contain data such as dates,
numbers, and facts as well.
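A small sketch showing how dates and numbers can still be extracted from unstructured text; the sentence and regular expressions are illustrative only:

```python
# Unstructured data: free text with no pre-defined model,
# but dates and numbers can still be pulled out of it.
import re

note = "Sold 120 boxes of KitKat on 2023-05-14, revenue was 180.00."

dates = re.findall(r"\d{4}-\d{2}-\d{2}", note)   # ISO-style dates
numbers = re.findall(r"\d+(?:\.\d+)?", note)     # integers and decimals

print(dates)    # ['2023-05-14']
print(numbers)  # ['120', '2023', '05', '14', '180.00']
```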
Data types from Data Analytics perspective
The last category of data type is metadata.
Metadata is data about data.
It provides additional information about a specific set of data.
• Example: In a set of photographs, metadata could describe
when and where the photos were taken.
• The metadata then provides fields for dates and locations which,
by themselves, can be considered structured data.
• For this reason, metadata is frequently used by Big Data
solutions for initial analysis.
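A minimal sketch of photo metadata represented as structured fields in Python; the file name and values are hypothetical:

```python
# Metadata: data about data. The photo itself is unstructured (pixels),
# but its metadata fields are structured and easy to analyze.
photo_metadata = {
    "file": "IMG_0042.jpg",          # hypothetical file name
    "taken_at": "2023-05-14T09:30",  # when the photo was taken
    "location": "Addis Ababa",       # where it was taken
    "camera": "Phone 12MP",
}

# Initial analysis can be done on the metadata alone, without opening the image.
print(photo_metadata["taken_at"], photo_metadata["location"])
```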
Data Value Chain
The Data Value Chain is introduced to describe the information flow within a big
data system as a series of steps needed to generate value and useful insights from
data.
The Big Data Value Chain identifies the following key high-level activities:
Data Value Chain
A. Data Acquisition
It is the process of gathering, filtering, and cleaning data before it is put in
a data warehouse or any other storage solution on which data analysis can
be carried out.
B. Data Analysis
Data analysis involves exploring, transforming, and modeling data with the
goal of highlighting relevant data, synthesizing and extracting useful
hidden information with high potential from a business point of view.
Data Value Chain
C. Data Curation
It is the active management of data over its life cycle to ensure it meets the
necessary data quality requirements for its effective usage.
D. Data Storage
It is the persistence and management of data in a scalable way that satisfies
the needs of applications that require fast access to the data.
E. Data Usage
Data usage in business decision making can enhance competitiveness through
the reduction of costs, increased added value, or any other parameter that can
be measured against existing performance criteria.
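As a small illustration of the acquisition and analysis activities above, the sketch below cleans and summarizes a tiny invented dataset with pandas (assuming pandas is installed):

```python
# Minimal sketch of data acquisition (gathering/cleaning) and data analysis.
# The records are hypothetical; pandas must be installed (pip install pandas).
import pandas as pd

# Acquisition: gather raw records and clean them (drop rows with missing values).
raw = pd.DataFrame({
    "item": ["KitKat", "KitKat", "Milk", None],
    "units": [2, 3, 1, 5],
})
clean = raw.dropna()

# Analysis: transform and summarize to highlight relevant information.
summary = clean.groupby("item")["units"].sum()
print(summary)   # total units sold per item
```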
Basic concepts of big data
Big data is the term for a collection of data sets so large and complex that
it becomes difficult to process using on-hand database management tools or
traditional data processing applications.
The challenges include capture, storage, search, sharing, analysis, and
visualization.
“Large dataset” means a dataset too large to reasonably process or store
with traditional tooling or on a single computer.
Scale of big datasets is constantly shifting and may vary significantly
from organization to organization.
Characteristics of big data
Big data is a term that describes large, hard-to-manage
volumes of data – both structured and unstructured
It has four characteristics, known as the 4 Vs:
1. Volume: large amount of data (in zettabytes)
2. Velocity: data is live streaming or in motion
3. Variety: data comes in different forms from different sources
4. Veracity: can we trust the data? How accurate is it?
• Let's look at our smartphones: nowadays smartphones
generate a lot of data in the form of text, phone calls,
emails, photos, videos, searches, and music.
• Approximately 40 exabytes (10^18 bytes) of data are generated
every month by a single smartphone user; now consider
how much data will be generated by 5 billion smartphones.
• That is mind-blowing; in fact, this amount of data is quite a lot
for traditional computing systems to handle. This massive
amount of data is called big data.
• Now let's have a look at the data generated per
minute on the internet:
• 2.1M snaps are shared on Snapchat,
• 3.8M search queries are made on Google,
• 1M people log in to Facebook,
• 4.5M videos are watched on YouTube, and
• 188M emails are sent.
Big Data Solutions: Clustered Computing
• Individual computers are often inadequate for handling big data at
most stages.
• Clustered computing is used to better address the high storage
and computational needs of big data.
• Clustered computing is a form of computing in which a group of
computers (often called nodes) are connected through a LAN
(local area network) so that they behave like a single machine.
• The set of computers is called a cluster.
• The resources from these computers are pooled so that they appear as one
computer that is more powerful than any of the individual machines.
Clustered Computing
Big data clustering software combines the resources of many smaller machines,
seeking to provide a number of benefits:
I. Resource Pooling
Combining the available storage space to hold data is a clear benefit, but CPU
and memory pooling are also extremely important.
II. High Availability
Clusters can provide varying levels of fault tolerance and availability
guarantees to prevent hardware or software failures from affecting access to
data and processing.
Clustered Computing
III. Easy Scalability:
Clusters make it easy to scale horizontally by adding additional
machines to the group.
Cluster membership and resource allocation can be handled by software
like Hadoop’s YARN (which stands for Yet Another Resource
Negotiator).
The machines involved in the computing cluster are also typically
involved with the management of a distributed storage system.
Hadoop and its Ecosystem
Hadoop is an open-source framework intended to make interaction
with big data easier.
Hadoop allows users to store and process Big Data in a
fault-tolerant ecosystem.
It is a framework that allows for the distributed processing of large
datasets across clusters of computers using simple programming
models.
Characteristics of Hadoop
Economical: Its systems are highly economical as ordinary computers can
be used for data processing.
Reliable: It is reliable as it stores copies of the data on different machines
and is resistant to hardware failure.
Scalable: It is easily scalable, both horizontally and vertically. A few extra
nodes help in scaling up the framework.
Flexible: It is flexible, and you can store as much structured and
unstructured data as you need and decide how to use it later.
Hadoop and its Ecosystem
It comprises the following components and many others:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query-based processing of data services
HBase: NoSQL Database
Mahout, Spark MLlib: Machine Learning algorithm libraries
Solr, Lucene: Searching and Indexing
Zookeeper: Managing cluster
Oozie: Job Scheduling
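To give a feel for the MapReduce programming model listed above, here is a minimal word-count sketch in plain Python; it only imitates the map, shuffle, and reduce phases and does not use Hadoop itself:

```python
# Word count expressed in the MapReduce style, simulated in plain Python.
# In real Hadoop, the map and reduce functions run distributed across the cluster.
from collections import defaultdict

documents = ["big data needs big clusters", "data science uses big data"]

# Map phase: emit (key, value) pairs, here (word, 1).
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 3, 'data': 3, ...}
```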
Hadoop and its Ecosystem
Hadoop has an ecosystem that has evolved from its four core components:
data management, access, processing, and storage.
Big data life cycle with Hadoop
1. Ingesting data into the system
• The first stage of Big Data processing is to Ingest data into the
system.
• The data is ingested or transferred to Hadoop from various
sources such as relational databases, systems, or local files.
• Sqoop transfers data from RDBMS to HDFS, whereas Flume
transfers event data.
2. Processing the data in storage.
• The second stage is Processing.
• In this stage, the data is stored and processed.
• The data is stored in the distributed file system, HDFS, and the
NoSQL distributed data, HBase.
• Spark and MapReduce perform data processing.
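A hedged sketch of this processing stage using PySpark (assuming Spark is installed; the HDFS path and column names are hypothetical examples):

```python
# Minimal PySpark sketch of the processing stage.
# Assumes Spark is installed; the HDFS path below is a hypothetical example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SalesProcessing").getOrCreate()

# Read data that was ingested into HDFS (path is illustrative only).
df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

# Process: filter and aggregate in a distributed way across the cluster.
totals = df.groupBy("item").sum("quantity")
totals.show()

spark.stop()
```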
Big data life cycle with Hadoop
3. Computing and analyzing data
• The third stage is to Analyze Data
• Here, the data is analyzed by processing frameworks such
as Pig, Hive, and Impala.
• Pig converts the data using map and reduce operations and then
analyzes it.
• Hive is also based on map and reduce programming
and is most suitable for structured data.
4. Visualizing the results
• The fourth stage is access, which is performed by tools
such as Sqoop, Hive, Hue and Cloudera Search.
• In this stage, the analyzed data can be accessed by users.