Course Name: ETL Tools
Module 1: Foundations of Data Integration
Topic: Sources of data
Types of Digital Data
Structured
Sources of structured data
Ease with structured data
Semi-Structured
Sources of semi-structured data
Characteristics of semi-structured data
Unstructured
Sources of unstructured data
Issues with terminology
Dealing with unstructured data
About data
• Data sources: internal and external to the enterprise
• Data may come from homogeneous and heterogeneous sources
• Data processing requirement:
Data → Information
Information → Insight (Knowledge)
Classification of Digital Data
Digital data is classified into the following categories:
• Structured data
• Semi-structured data
• Unstructured data
Approximate Percentage Distribution of Digital Data
Source: https://www.researchgate.net/figure/Pie-of-big-data-percentages_fig4_336678115
Structured Data
Big Data and Analytics by Seema Acharya and Subhashini Chellappan
Copyright 2015, WILEY INDIA PVT. LTD.
Structured Data
• This is data in an organized form (e.g., rows and columns) that can be easily used by a computer program. The relational data model is the typical example, described by:
• Cardinality of a relation (number of rows)
• Degree of a relation (number of columns)
• Data types and constraints (Unique, Not Null)
• Relationships exist between entities of data, such as classes and their objects.
• Data stored in databases is an example of structured data.
• Example: Employee database
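The relational terms above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are assumptions made up for illustration; they do not come from the course material.

```python
import sqlite3

# In-memory database; the employee schema below is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,      -- Not Null constraint
        email  TEXT UNIQUE,        -- Unique constraint
        dept   TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO employee (emp_id, name, email, dept) VALUES (?, ?, ?, ?)",
    [
        (1, "Asha", "asha@example.com", "Finance"),
        (2, "Ravi", "ravi@example.com", "IT"),
        (3, "Meera", "meera@example.com", "IT"),
    ],
)

cursor = conn.execute("SELECT * FROM employee")
degree = len(cursor.description)      # number of columns (attributes)
cardinality = len(cursor.fetchall())  # number of rows (tuples)

# The Not Null constraint rejects a row whose name is missing.
try:
    conn.execute(
        "INSERT INTO employee (emp_id, name, email, dept) "
        "VALUES (4, NULL, 'x@example.com', 'HR')"
    )
    constraint_enforced = False
except sqlite3.IntegrityError:
    constraint_enforced = True

print(degree, cardinality, constraint_enforced)
```

Because the schema is declared up front, the database itself can reject malformed rows, which is one reason structured data is easy for programs to consume.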
Sources of Structured Data
• Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
• Spreadsheets
• OLTP systems
Ease with Structured Data
• Input / Update / Delete: DML operations
• Security: access control (tokens), encryption
• Indexing / Searching: speed up select operations at the cost of additional writes and storage space
• Scalability: scale up by increasing the horsepower (additional memory and processing capacity)
• Transaction processing: ACID properties
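Two of the points above, indexing and transaction processing, can be sketched with Python's built-in sqlite3 module. The account table, its CHECK constraint, and the transfer helper are hypothetical names chosen for illustration, not part of the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "acc_id INTEGER PRIMARY KEY, "
    "owner TEXT NOT NULL, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?, ?)", [(1, "A", 100), (2, "B", 50)])
conn.commit()

# Indexing: faster lookups by owner, paid for with extra writes and storage.
conn.execute("CREATE INDEX idx_account_owner ON account (owner)")


def transfer(conn, src, dst, amount):
    """Move amount from src to dst atomically; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE acc_id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE acc_id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30)        # succeeds
failed = transfer(conn, 1, 2, 500)   # violates CHECK, so the credit is rolled back too
balances = dict(conn.execute("SELECT acc_id, balance FROM account"))
print(ok, failed, balances)
```

The second transfer credits one account before the debit fails; the rollback undoes that credit, which is the atomicity part of ACID.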
Semi-structured Data
• This is the data which does not conform to a data model but has some
structure. However, it is not in a form which can be used easily by a
computer program.
• It uses tags to separate semantic elements and to enforce hierarchies
of records and fields within data.
• No separation between schema and data.
• Metadata for this data is available but is not sufficient.
• Example: emails, XML, markup languages like HTML, etc.
Sources of Semi-structured Data
• XML (eXtensible Markup Language)
<student>
  <name> xyz </name>
  <rollno> 125 </rollno>
</student>
• Other markup languages (HTML)
• JSON (JavaScript Object Notation)
{
  _id: 1,
  StudentName: "XYZ",
  RollNo: 125
}
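As a rough sketch of how a program might read the two student records, the snippet below parses them with Python's standard xml.etree.ElementTree and json modules. The JSON text is rewritten with quoted keys, since strict JSON parsers require them; that rewrite is an assumption about the intended record, not a change to the slide.

```python
import json
import xml.etree.ElementTree as ET

xml_text = "<student><name> xyz </name><rollno> 125 </rollno></student>"
json_text = '{"_id": 1, "StudentName": "XYZ", "RollNo": 125}'

# XML: tags name each field, so the record can be rebuilt without a schema.
root = ET.fromstring(xml_text)
xml_record = {child.tag: child.text.strip() for child in root}

# JSON: label/value pairs map directly onto a Python dict.
json_record = json.loads(json_text)

print(xml_record)
print(json_record)
```

Note that the XML values arrive as strings while JSON preserves the number type; in both cases the field names travel with the data rather than in a separate schema.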
Characteristics of Semi-structured Data
• Inconsistent structure
• Self-describing (label/value pairs)
• Schema information is often blended with data values
• Data objects may have different attributes that are not known beforehand
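The characteristics above can be seen in a small sketch: three hypothetical records (invented for illustration) carry different attributes, so the set of fields has to be discovered from the data before the records can be laid out in rows and columns.

```python
# Hypothetical semi-structured records: each one is self-describing,
# but the attributes differ from record to record.
records = [
    {"name": "xyz", "rollno": 125},
    {"name": "abc", "rollno": 126, "email": "abc@example.com"},
    {"name": "pqr", "hobbies": ["chess"]},
]

# Discover the fields from the data itself, since no schema was given.
all_fields = sorted({key for record in records for key in record})

# Flatten into uniform rows; attributes a record lacks become None.
rows = [{field: record.get(field) for field in all_fields} for record in records]

print(all_fields)
for row in rows:
    print(row)
```

An ETL job that loads such data into a relational table typically needs a step like this, along with a decision on how to treat the missing values.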
Unstructured Data
• This is data that does not conform to a data model or is not in a form that can be used easily by a computer program.
• About 80–90% of an organization's data is in this format.
• Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research, white papers, body of an email, etc.
Sources of Unstructured Data
• Web pages
• Images
• Free-form text
• Audio
• Videos
• Body of email
• Text messages
• Chats
• Social media data
• Word documents
Issues with terminology – Unstructured Data
• Structure can be implied despite not being formally defined.
• Data with some structure may still be labeled unstructured if the structure doesn't help with the processing task at hand.
• Data may have some structure or may even be highly structured in ways that are unanticipated or unannounced.
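A small sketch of implied structure: the memo text below is invented for illustration and counts as free-form text, yet the dates inside it follow a regular pattern that a program can recover with Python's standard re module.

```python
import re

# Hypothetical memo: free-form text with no declared schema.
memo = (
    "Meeting moved to 2024-03-15. Budget approved: 12000 USD. "
    "Next review on 2024-04-01."
)

# The date format is never formally defined, but it is implied by the text.
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", memo)

print(dates)
```

Whether this memo is treated as unstructured depends on the task: for date extraction the implied structure helps, while for judging the memo's tone it does not.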
Dealing with Unstructured Data
• Data mining
• Natural Language Processing (NLP)
• Text analytics
• Noisy text analytics
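As a toy illustration of text analytics, the sketch below counts the most frequent terms in a piece of free-form text using only the Python standard library. The review text, the stop-word list, and the top_terms helper are assumptions made up for this example; real text analytics pipelines are considerably more involved.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "and", "to", "of", "was"}


def top_terms(text, n=3):
    """Return the n most frequent non-stop-word terms with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(token for token in tokens if token not in STOP_WORDS)
    return counts.most_common(n)


reviews = (
    "The delivery was late. Late delivery and the package was damaged. "
    "Delivery support is slow."
)

print(top_terms(reviews, 2))
```

Turning free text into term counts gives it a tabular shape (term, frequency) that downstream tools can store and query like structured data.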
Questions
• In which category (structured, semi-structured, or unstructured) will you place a web page?
• In which category (structured, semi-structured, or unstructured) will you place a Word document?
• State a few examples of human-generated and machine-generated data.
Thank you