UNIT-I
Classification of Digital Data
1. Unstructured data:
• This is the data which does not conform to a
data model or is not in a form which can be
used easily by a computer program.
• About 80–90% of an organization's data is in
this format; for example, memos, chat-room
transcripts, PowerPoint presentations, images,
videos, letters, research reports, white papers,
the body of an email, etc.
2. Semi-structured data:
• This is the data which does not conform to a
data model but has some structure. However,
it is not in a form which can be used easily by
a computer program;
• for example, emails, XML, markup languages
like HTML, etc. Metadata for this data is
available but is not sufficient.
3. Structured data:
• This is the data which is in an organized form
(e.g., in rows and columns) and can be easily
used by a computer program. Relationships
exist between entities of data, such as classes
and their objects. Data stored in databases is
an example of structured data.
• We have grown comfortable working with
RDBMS – the storage, retrieval, and
management of data has been immensely
simplified. The data held in RDBMS is typically
structured data.
Sources of Structured Data
• If your data is highly structured, you can look at
leveraging any of the available RDBMSs [Oracle Corp. –
Oracle, IBM – DB2, Microsoft – Microsoft SQL Server,
EMC – Greenplum, Teradata – Teradata, MySQL (open
source), PostgreSQL (advanced open source), etc.] to
house it.
• These databases are typically used to hold
transaction/operational data generated and collected
by day-to-day business activities. In other words, the
data of the On-Line Transaction Processing (OLTP)
systems are generally quite structured.
Ease of Working with Structured Data
• Structured data is easy to work with. The ease is
with respect to the following:
• 1. Insert/update/delete:
The Data Manipulation Language (DML)
operations provide the required ease of data
input, storage, access, processing, analysis, etc.
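• As a minimal sketch of these DML operations, the snippet below uses Python's built-in sqlite3 module; the table name and columns are purely illustrative, and any RDBMS with a DB-API driver works the same way.

```python
import sqlite3

# An in-memory database with an illustrative table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# INSERT: add a row of structured data.
cur.execute("INSERT INTO employee (name, salary) VALUES (?, ?)", ("Asha", 50000.0))

# UPDATE: modify an existing row.
cur.execute("UPDATE employee SET salary = ? WHERE name = ?", (55000.0, "Asha"))

# SELECT: retrieve rows for processing and analysis.
print(cur.execute("SELECT id, name, salary FROM employee").fetchall())

# DELETE: remove a row.
cur.execute("DELETE FROM employee WHERE name = ?", ("Asha",))
conn.commit()
conn.close()
```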
2. Security:
• How does one ensure the security of
information?
• Robust encryption and tokenization solutions
are available to warrant the security of
information throughout its lifecycle.
• Organizations are able to retain control and
maintain compliance adherence by ensuring
that only authorized individuals are able to
decrypt and view sensitive information.
3. Indexing:
• An index is a data structure that speeds up the
data retrieval operations (primarily the
SELECT DML statement) at the cost of
additional writes and storage space, but the
benefits that ensue in search operations are
worth this extra cost.
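• The sketch below, again with sqlite3 and an illustrative table, creates an index and then asks the query planner how it resolves a lookup; the plan shows the index being used instead of a full table scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
cur.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                [(f"cust{i % 100}", i * 1.5) for i in range(10000)])

# The index is an extra data structure: it costs storage and slows writes
# slightly, but it lets the engine avoid scanning every row on this column.
cur.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# SQLite's EXPLAIN QUERY PLAN reports that the SELECT uses the index.
for row in cur.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'cust7'"):
    print(row)
conn.close()
```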
4. Scalability:
• The storage and processing capabilities of the
traditional RDBMS can be easily scaled up by
increasing the horsepower of the database
server (increasing the primary and secondary
or peripheral storage capacity, processing
capacity of the processor, etc.).
5. Transaction processing:
• RDBMS has support for the Atomicity,
Consistency, Isolation, and Durability (ACID)
properties of transactions. Given next is a quick
explanation of the ACID properties:
Atomicity:
• A transaction is atomic, meaning that either it
happens in its entirety or not at all.
Consistency:
• The database moves from one consistent state
to another consistent state. In other words, if
the same piece of information is stored at two
or more places, the copies are in complete
agreement.
Isolation:
• The resource allocation to the transaction
happens in such a way that the transaction gets
the impression that it is the only transaction
happening, in isolation.
Durability:
• All changes made to the database during a
transaction are permanent, and that accounts
for the durability of the transaction.
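• A minimal sketch of atomicity, assuming an illustrative two-account transfer in sqlite3: a failure between the debit and the credit rolls both back, so the database never shows a half-done transfer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100.0), ("B", 100.0)])
conn.commit()

try:
    with conn:  # the connection as a context manager wraps one transaction
        conn.execute("UPDATE account SET balance = balance - 50 WHERE name = 'A'")
        raise RuntimeError("simulated failure between debit and credit")
        conn.execute("UPDATE account SET balance = balance + 50 WHERE name = 'B'")
except RuntimeError:
    pass  # the context manager has already rolled the transaction back

# Both balances are still 100.0: the partial debit did not survive.
print(conn.execute("SELECT name, balance FROM account").fetchall())
```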
1.1.2 Semi-Structured Data
• Semi-structured data is also referred to as
having a self-describing structure.
• It has the following features:
1. It does not conform to the data models that
one typically associates with relational
databases or any other form of data tables.
2. It uses tags to segregate semantic elements.
3. Tags are also used to enforce hierarchies of
records and fields within data.
4. There is no separation between the data
and the schema. The amount of structure
used is dictated by the purpose at hand.
5. In semi-structured data, entities belonging to
the same class and grouped together need not
necessarily have the same set of attributes
(see the JSON sketch below).
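• A small sketch of feature 5, using hypothetical records: two entities of the same "employee" class carry different attribute sets, and the JSON keys double as self-describing tags.

```python
import json

# Two "employee" records in one collection with different attribute sets --
# legal in semi-structured data, unlike a fixed relational schema.
records = [
    {"name": "Asha", "dept": "Sales", "phone": "12345"},
    {"name": "Ravi", "dept": "HR", "skills": ["recruiting", "payroll"]},
]
doc = json.dumps(records, indent=2)
print(doc)

# The schema travels with the data: each record describes its own fields.
for rec in json.loads(doc):
    print(sorted(rec.keys()))
```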
Sources of Semi-Structured Data
• 1. XML: eXtensible Markup Language (XML)
has been hugely popularized by web services
developed utilizing the Simple Object Access
Protocol (SOAP) principles.
• 2. JSON: JavaScript Object Notation (JSON) is
used to transmit data between a server and a
web application.
• JSON has been popularized by web services
developed utilizing the Representational State
Transfer (REST) architectural style for
creating scalable web services.
• MongoDB (an open-source, distributed, NoSQL,
document-oriented database) and
Couchbase (originally known as Membase; an
open-source, distributed, NoSQL,
document-oriented database) store data
natively in JSON format.
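• A minimal round-trip sketch with a hypothetical payload: JSON text as received from a server is parsed into objects, and objects are serialized back to JSON text for a web application (document databases such as MongoDB store documents in this shape natively).

```python
import json

# A hypothetical REST-style response payload.
payload = '{"user": {"id": 42, "name": "Asha"}, "courses": ["Big Data", "NLP"]}'

# Parse the JSON text received from a server into Python objects...
data = json.loads(payload)
print(data["user"]["name"], data["courses"])

# ...and serialize an object back to JSON text to send onward.
print(json.dumps({"status": "ok", "count": len(data["courses"])}))
```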
Unstructured Data
Unstructured data does not conform to any
pre-defined data model; its structure is quite
unpredictable.
Dealing with Unstructured Data
The following techniques are used to find
patterns in, or to interpret, unstructured data:
• 1. Data mining
• 2. Text analytics or text mining
• 3. Natural language processing (NLP)
• 4. Noisy text analytics
• 5. Manual tagging with metadata
• 6. Part-of-speech tagging
• 7. Unstructured Information Management
Architecture (UIMA)
1. Data mining
• It is the analysis step of the “knowledge
discovery in databases” process.
• A few popular data mining algorithms are as
follows:
1. Association rule mining
2. Regression analysis
3. Collaborative filtering
a. Association rule mining
• It is also called “market basket analysis” or
“affinity analysis”.
• It is used to determine “What goes with
what?”
• It is about finding out, when you buy a product,
which other product you are likely to
purchase with it.
Example
• If you pick up bread from the grocery store, are
you likely to pick up eggs or cheese to go with it?
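• As a sketch of the idea, the snippet below computes support and confidence for the hypothetical rule "bread → eggs" over a few made-up baskets; real systems mine such rules at scale with algorithms like Apriori.

```python
from collections import Counter
from itertools import combinations

# Toy transaction data: each basket is one customer's purchase.
baskets = [
    {"bread", "eggs"}, {"bread", "cheese"}, {"bread", "eggs", "cheese"},
    {"eggs", "milk"}, {"bread", "eggs"},
]

item_counts, pair_counts = Counter(), Counter()
for b in baskets:
    item_counts.update(b)
    pair_counts.update(combinations(sorted(b), 2))

# Rule bread -> eggs: support = P(bread and eggs); confidence = P(eggs | bread).
n = len(baskets)
support = pair_counts[("bread", "eggs")] / n
confidence = pair_counts[("bread", "eggs")] / item_counts["bread"]
print(f"support={support:.2f}, confidence={confidence:.2f}")  # 0.60, 0.75
```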
b. Regression analysis
• It helps to predict the relationship between
two variables.
• The variable whose value needs to be
predicted is called the dependent variable and
the variables which are used to predict the
value are referred to as the independent
variables.
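• A minimal sketch of simple linear regression: one independent variable is fitted to one dependent variable by ordinary least squares, with made-up data points.

```python
# Hypothetical data: x = independent variable, y = dependent variable.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 6.2, 8.1, 9.9]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# slope = cov(x, y) / var(x); the line passes through the point of means.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

print(f"y = {slope:.2f} * x + {intercept:.2f}")
print("prediction for x = 6:", slope * 6 + intercept)
```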
c. Collaborative filtering
• It is about predicting a user’s preference or
preferences based on the preferences of a
group of users.
• For example, take a look at Table 1.5. We are
looking at predicting whether User 4 will
prefer to learn using videos or is a textual
learner depending on one or a couple of his
or her known preferences.
• We analyze the preferences of similar user
profiles and on the basis of it, predict that
User 4 will also like to learn using videos and is
not a textual learner.
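• Since Table 1.5 is not reproduced here, the sketch below substitutes hypothetical preference data: User 4's unknown "videos" preference is borrowed from the most similar user, which is the essence of user-based collaborative filtering.

```python
# 1 = likes, 0 = dislikes, None = unknown (to be predicted).
prefs = {
    "User 1": {"videos": 1, "text": 0, "quizzes": 1},
    "User 2": {"videos": 1, "text": 0, "quizzes": 0},
    "User 3": {"videos": 0, "text": 1, "quizzes": 1},
    "User 4": {"videos": None, "text": 0, "quizzes": 1},
}

def similarity(a, b):
    # Fraction of agreements over the items both users have rated.
    shared = [k for k in a if a[k] is not None and b[k] is not None]
    return sum(a[k] == b[k] for k in shared) / len(shared) if shared else 0.0

target = prefs["User 4"]
best = max((u for u in prefs if u != "User 4"),
           key=lambda u: similarity(target, prefs[u]))
print(best, "is most similar; predicted videos preference:",
      bool(prefs[best]["videos"]))
```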
2. Text analytics or text mining:
• Text mining is the process of gleaning
high-quality and meaningful information
(through the devising of patterns and trends by
means of statistical pattern learning) from text.
• It includes tasks such as text categorization,
text clustering, sentiment analysis,
concept/entity extraction, etc.
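• A toy text-categorization sketch: documents are scored against hand-made keyword lists. The categories and keywords are invented for illustration; real text-mining systems learn such patterns statistically.

```python
import re
from collections import Counter

categories = {
    "sports": {"match", "team", "score", "goal"},
    "finance": {"market", "stock", "price", "profit"},
}

def categorize(text):
    # Count words, then score each category by its keyword hits.
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {c: sum(words[w] for w in kws) for c, kws in categories.items()}
    return max(scores, key=scores.get), scores

print(categorize("The team celebrated after the winning goal in the match."))
print(categorize("The stock rallied as the market priced in higher profit."))
```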
3. Natural language processing (NLP)
• It is related to the area of human-computer
interaction.
• It is about enabling computers to understand
human or natural language input.
4. Noisy text analytics:
• It is the process of extracting structured or
semi-structured information from noisy
unstructured data such as chats, blogs, wikis,
emails, message-boards, text messages, etc.
The noisy unstructured data usually comprises
one or more of the following:
• spelling mistakes,
• abbreviations, acronyms,
• non-standard words, missing punctuation,
• missing letter case,
• filler words such as “uh”, “um”, etc.
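• A small normalization sketch for such noisy text; the abbreviation map and filler list below are illustrative, not a standard resource.

```python
# Illustrative mappings for cleaning noisy text.
ABBREV = {"u": "you", "r": "are", "gr8": "great", "thx": "thanks", "pls": "please"}
FILLERS = {"uh", "um", "hmm"}

def clean(noisy):
    tokens = [t for t in noisy.lower().split() if t not in FILLERS]
    tokens = [ABBREV.get(t, t) for t in tokens]  # expand known abbreviations
    text = " ".join(tokens)
    return text[:1].upper() + text[1:]  # restore sentence-initial case

print(clean("um thx u r gr8 pls reply"))
# -> "Thanks you are great please reply"
```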
5. Manual tagging with metadata
• This is about tagging manually with adequate
metadata to provide the requisite semantics
to understand unstructured data.
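• A tiny sketch of the idea, with invented field names: manually attached metadata gives an otherwise opaque asset the semantics a program can search and filter on.

```python
import json

# The video file itself is unstructured; the hand-written metadata is not.
asset = {
    "file": "lecture_01.mp4",
    "metadata": {
        "type": "video",
        "topic": "big data introduction",
        "language": "en",
        "tags": ["unit-1", "classification of data"],
    },
}
print(json.dumps(asset, indent=2))
```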
6. Part-of-speech tagging
• It is also called POS or POST or grammatical
tagging.
• It is the process of reading text and tagging
each word in the sentence as belonging to a
particular part of speech such as “noun”,
“verb”, “adjective”, etc.
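• A minimal sketch using the NLTK library (a common choice, though not the only one); NLTK's tokenizer and tagger models must be downloaded once, and the exact model names can vary by NLTK version.

```python
import nltk

# One-time downloads (names as of NLTK 3.x):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)

# Each word is tagged with a part of speech: DT = determiner, JJ = adjective,
# NN = noun, VBZ = verb (3rd person singular present), and so on.
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]
```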
7. Unstructured Information
Management Architecture (UIMA)
• It is an open source platform from IBM.
• It is used for real-time content analytics.
• It is about processing text and other
unstructured data to find latent meaning and
relevant relationships buried therein.
• Read up more on UIMA at:
http://www.ibm.com/developerworks/data/downloads/uima/
• 1. Why is an email placed in the “unstructured”
category?
Answer: Let us take a look at what we
can place in the body of the email.
We can have any one or more of the following:
• Hyperlink
• PDFs/DOCs/XLS/etc. attachments
• Emoticons
• Images
• Audio/video attachments
• Free-flowing text, etc.
The above are the reasons for placing the email in
the “unstructured” category.
• 2. Into which category will you place CCTV
footage?
• Answer: Unstructured
• 3. You have just got a book issued from the
library.
• What are the details about the book that can
be placed in an RDBMS table?
Answer:
• Title of the book
• Author of the book
• Publisher of the book
• Year of Publication
• No. of pages in the book
• Type of book such as whether hardbound or
paperback
• Price of the book
• ISBN No. of the book
• Attachments, such as with CD or without CD, etc.
• 4. In which category would you place consumer
complaints and feedback?
• Answer: Unstructured data
CHARACTERISTICS OF DATA
• Data has three key characteristics:
• 1. Composition
• 2. Condition
• 3. Context
1. Composition
• The composition of data deals with the
structure of data, that is, the sources of data,
the granularity, the types, and the nature of
data as to whether it is static or real-time
streaming.
2. Condition:
• The condition of data deals with the state of
data, that is, “Can one use this data as is for
analysis?” or “Does it require cleansing for
further enhancement and enrichment?”
3. Context
• The context of data deals with “Where has this
data been generated?” “Why was this data
generated?” “How sensitive is this data?”
“What are the events associated with this
data?” and so on.
2.2 EVOLUTION OF BIG DATA
• The 1970s and before were the era of mainframes.
The data was essentially primitive and
structured.
• Relational databases evolved in the 1980s and
1990s. That era was one of data-intensive
applications.
• The World Wide Web (WWW) and the
Internet of Things (IoT) have led to an
onslaught of structured, unstructured, and
multimedia data.
2.3 DEFINITION OF BIG DATA
Well, we will give you a few responses that we
have heard over time:
• 1. Anything beyond the human and technical
infrastructure needed to support storage,
processing, and analysis.
• 2. Today’s BIG may be tomorrow’s NORMAL.
• 3. Terabytes or petabytes or zettabytes of
data.
• 4. I think it is about 3 Vs.
Big data definition
• Big data is high-volume, high-velocity, and
high-variety information assets that demand
cost-effective, innovative forms of information
processing for enhanced insight and decision
making.
Part I of the definition
• “big data is high-volume, high-velocity, and
high-variety information assets” talks about
voluminous data (humongous data) that may
have great variety (a good mix of structured,
semi-structured, and unstructured data)
and will require a good speed/pace for storage,
preparation, processing, and analysis.
Part II of the definition
• “cost-effective, innovative forms of
information processing” talks about
embracing new techniques and technologies
to capture (ingest), store, process, persist,
integrate, and visualize the high-volume,
high-velocity, and high-variety data.
Part III of the definition
• “enhanced insight and decision making” talks
about deriving deeper, richer, and meaningful
insights and then using these insights to make
faster and better decisions to gain business
value and thus a competitive edge.
• Data → Information → Actionable
intelligence → Better decisions → Enhanced
business value