Sources of Digital Data

The document provides an overview of different types of digital data, categorizing them into structured, semi-structured, and unstructured data. It discusses the characteristics, sources, and processing requirements for each type, highlighting examples such as databases for structured data and emails for semi-structured data. Additionally, it addresses the challenges of dealing with unstructured data, which constitutes a significant portion of organizational data.

Uploaded by

Ramya Murugesan

Course Name: ETL Tools

Module 1: Foundations of Data Integration
Topic: Sources of data
Types of Digital Data

• Structured
  • Sources of structured data
  • Ease with structured data

• Semi-structured
  • Sources of semi-structured data
  • Characteristics of semi-structured data

• Unstructured
  • Sources of unstructured data
  • Issues with terminology
  • Dealing with unstructured data
About data

• Data source – Internal and External to the enterprise

• Data may come from homogeneous and heterogeneous sources

• Data processing requirement:

Data → Information
Information → Insight (Knowledge)
Classification of Digital Data

Digital data is classified into the following categories:

• Structured data

• Semi-structured data

• Unstructured data
Approximate Percentage Distribution of Digital Data

[Figure: approximate percentage distribution of digital data]

Source: https://www.researchgate.net/figure/Pie-of-big-data-percentages_fig4_336678115
Structured Data

Big Data and Analytics by Seema Acharya and Subhashini Chellappan


Copyright 2015, WILEY INDIA PVT. LTD.
Structured Data

• Structured data is in an organized form (e.g., rows and columns) and can be easily used by a computer program. The relational data model is the typical example.

• Cardinality of a relation: the number of rows (tuples).

• Degree of a relation: the number of columns (attributes).

• Data types and constraints (Unique, Not Null).

• Relationships exist between entities of data, such as classes and their objects.

• Data stored in databases is an example of structured data.

• Example: Employee database
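The terms above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The employee table, its column names, and the sample rows are assumptions made up for illustration; they do not come from the slides.

```python
import sqlite3

# Hypothetical Employee table; columns and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        email  TEXT UNIQUE,
        dept   TEXT
    )
    """
)
rows = [
    (1, "Asha", "asha@example.com", "Finance"),
    (2, "Ravi", "ravi@example.com", "IT"),
    (3, "Meena", "meena@example.com", "IT"),
]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)

cursor = conn.execute("SELECT * FROM employee")
degree = len(cursor.description)      # number of columns (attributes)
cardinality = len(cursor.fetchall())  # number of rows (tuples)

# The NOT NULL constraint rejects a row that has no name.
try:
    conn.execute("INSERT INTO employee VALUES (4, NULL, 'x@example.com', 'HR')")
    constraint_enforced = False
except sqlite3.IntegrityError:
    constraint_enforced = True

print(degree, cardinality, constraint_enforced)
```

Here the degree is the column count and the cardinality is the row count, and the database itself refuses data that violates the declared constraints.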




Sources of Structured Data

• Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.

• Spreadsheets

• OLTP systems
Ease with Structured Data

• Input / Update / Delete: DML operations

• Security: access control (tokens), encryption

• Indexing / Searching: speeds up select operations at the cost of additional writes and storage space

• Scalability: scale up by increasing the horsepower (additional memory and processing capacity)

• Transaction Processing: ACID properties
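Two of these conveniences, indexing and ACID transactions, can be sketched with sqlite3. The account table, the index name, and the transfer helper below are hypothetical and serve only to show the idea; a production OLTP system would differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO account VALUES (?, ?, ?)", [(1, "A", 100), (2, "B", 50)]
)
# An index speeds up lookups by owner but costs extra writes and storage.
conn.execute("CREATE INDEX idx_account_owner ON account (owner)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30)       # succeeds
failed = transfer(conn, 1, 2, 500)  # violates the CHECK, so it is rolled back
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(ok, failed, balances)
```

The second transfer credits one account before the debit fails, yet neither change persists, which is the atomicity part of ACID.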
Semi-structured Data

Semi-structured Data

• Semi-structured data does not conform to a data model but has some structure. However, it is not in a form that can be used easily by a computer program.

• It uses tags to separate semantic elements and to enforce hierarchies of records and fields within the data.

• There is no separation between schema and data.

• Metadata for this data is available but is not sufficient.

• Examples: emails, XML, and markup languages such as HTML.


Sources of Semi-structured Data

• XML (eXtensible Markup Language)

    <student>
      <name> xyz </name>
      <rollno> 125</rollno>
    </student>

• Other markup languages (HTML)

• JSON (JavaScript Object Notation)

    {
      "_id": 1,
      "StudentName": "XYZ",
      "RollNo": 125
    }
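As a rough sketch, the two student records above can be read with Python's standard library. The JSON text is written here as strict JSON with quoted keys, which is an assumption about the intended format; the variable names are illustrative.

```python
import json
import xml.etree.ElementTree as ET

xml_text = "<student><name> xyz </name><rollno> 125</rollno></student>"
json_text = '{"_id": 1, "StudentName": "XYZ", "RollNo": 125}'

# In XML the tags carry the field names, so the schema travels with the data.
student = ET.fromstring(xml_text)
xml_record = {child.tag: child.text.strip() for child in student}

# In JSON the keys play the same role as the tags.
json_record = json.loads(json_text)

print(xml_record)
print(json_record)
```

Note that the XML values arrive as strings, while JSON keeps 125 as a number; any further typing is left to the program that consumes the data.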
Characteristics of Semi-structured Data

• Inconsistent structure

• Self-describing (label/value pairs)

• Schema information is often blended with data values

• Data objects may have different attributes that are not known beforehand
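A small sketch of what these characteristics mean in practice: the records below are hypothetical label/value objects whose attributes differ from one record to the next, so the set of attributes has to be discovered from the data itself.

```python
# Hypothetical student records with inconsistent attributes.
records = [
    {"name": "xyz", "rollno": 125},
    {"name": "abc", "rollno": 126, "email": "abc@example.com"},
    {"name": "pqr", "hobbies": ["chess"]},
]

# Discover every attribute that appears in at least one record.
all_attributes = sorted({key for record in records for key in record})

# Align the records to that discovered schema, filling gaps with None.
flattened = [
    {attr: record.get(attr) for attr in all_attributes} for record in records
]

print(all_attributes)
print(flattened[0])
```

This is the kind of step an ETL job typically needs before semi-structured records can be loaded into rows and columns.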
Unstructured Data

Unstructured Data

• Unstructured data does not conform to a data model or is not in a form that can be used easily by a computer program.

• About 80–90% of an organization's data is in this format.

• Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research, white papers, the body of an email, etc.
Sources of Unstructured Data

• Web pages

• Images

• Free-form text

• Audio

• Videos

• Body of email

• Text messages

• Chats

• Social media data

• Word documents
Issues with Terminology – Unstructured Data

• Structure can be implied despite not being formally defined.

• Data with some structure may still be labeled unstructured if the structure doesn't help with the processing task at hand.

• Data may have some structure or may even be highly structured in ways that are unanticipated or unannounced.
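To illustrate implied structure, here is a minimal sketch in which a line of free-looking text (a made-up web server log entry, not taken from the slides) turns out to have fields that a regular expression can recover. The pattern and field names are assumptions chosen for this example.

```python
import re

# A made-up log line: plain text, but with an implied field layout.
line = (
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326'
)

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

match = pattern.match(line)
fields = match.groupdict() if match else {}
print(fields)
```

Whether such text counts as unstructured depends on the task: without the pattern it is just a string, and with it the line behaves like a row with named columns.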
Dealing with Unstructured Data

• Data mining

• Natural Language Processing (NLP)

• Text analytics

• Noisy text analytics
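As a very small taste of text analytics, the sketch below counts word frequencies in a made-up piece of free-form text. The sample sentence and the stop-word list are assumptions; real text analytics and NLP pipelines involve far more than this.

```python
import re
from collections import Counter

# Made-up free-form text, e.g. from a chat or support ticket.
text = "The server was slow today. Slow responses made the chat support slow too."

# A tiny, hypothetical stop-word list; real lists are much longer.
stop_words = {"the", "was", "made", "too"}

# Lowercase, split into alphabetic tokens, and drop stop words.
tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stop_words]

counts = Counter(tokens)
top = counts.most_common(1)
print(top)
```

Even this simple count turns unstructured text into a structured result (term, frequency) that can be stored and queried.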
Questions

• In which category (structured, semi-structured, or unstructured) would you place a web page?

• In which category (structured, semi-structured, or unstructured) would you place a Word document?

• State a few examples of human-generated and machine-generated data.
Thank you
