UNIT-I
Classification of Digital Data
1. Unstructured data:
• This is the data which does not conform to a
data model or is not in a form which can be
used easily by a computer program.
• About 80–90% of an organization's data is in
this format; for example, memos, chat-room
transcripts, PowerPoint presentations, images,
videos, letters, research reports, white papers,
the body of an email, etc.
2. Semi-structured data:
• This is the data which does not conform to a
data model but has some structure. However,
it is not in a form which can be used easily by
a computer program;
• for example, emails, XML, markup languages
like HTML, etc. Metadata for this data is
available but is not sufficient.
3. Structured data:
• This is the data which is in an organized form
(e.g., in rows and columns) and can be easily
used by a computer program. Relationships
exist between entities of data, such as classes
and their objects. Data stored in databases is
an example of structured data.
• We have grown comfortable working with
RDBMS – the storage, retrieval, and
management of data has been immensely
simplified. The data held in RDBMS is typically
structured data.
Sources of Structured Data
• If your data is highly structured, you can look at
leveraging any of the available RDBMSs [Oracle Corp. –
Oracle, IBM – DB2, Microsoft – Microsoft SQL Server,
EMC – Greenplum, Teradata – Teradata, MySQL (open
source), PostgreSQL (advanced open source), etc.] to
house it.
• These databases are typically used to hold
transaction/operational data generated and collected
by day-to-day business activities. In other words, the
data of the On-Line Transaction Processing (OLTP)
systems are generally quite structured.
Ease of Working with Structured Data
• Structured data is easy to work with. The ease is
with respect to the following:
• 1. Insert/update/delete:
The Data Manipulation Language (DML)
operations provide the required ease of data
input, storage, access, processing, analysis, etc.
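• As a minimal sketch of these DML operations, the snippet below uses Python's built-in sqlite3 module; the table name and columns are purely illustrative, and any RDBMS with a DB-API driver works the same way.

```python
import sqlite3

# An in-memory database with an illustrative table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# INSERT: add a row of structured data.
cur.execute("INSERT INTO employee (name, salary) VALUES (?, ?)", ("Asha", 50000.0))

# UPDATE: modify an existing row.
cur.execute("UPDATE employee SET salary = ? WHERE name = ?", (55000.0, "Asha"))

# SELECT: retrieve rows for processing and analysis.
print(cur.execute("SELECT id, name, salary FROM employee").fetchall())

# DELETE: remove a row.
cur.execute("DELETE FROM employee WHERE name = ?", ("Asha",))
conn.commit()
conn.close()
```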
2. Security:
• How does one ensure the security of
information?
• Robust encryption and tokenization solutions
are available to warrant the security of
information throughout its lifecycle.
• Organizations are able to retain control and
maintain compliance adherence by ensuring
that only authorized individuals are able to
decrypt and view sensitive information.
3. Indexing:
• An index is a data structure that speeds up the
data retrieval operations (primarily the
SELECT DML statement) at the cost of
additional writes and storage space, but the
benefits that ensue in search operations are
worth this extra cost.
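• The sketch below, again with sqlite3 and an illustrative table, creates an index and then asks the query planner how it resolves a lookup; the plan shows the index being used instead of a full table scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
cur.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                [(f"cust{i % 100}", i * 1.5) for i in range(10000)])

# The index is an extra data structure: it costs storage and slows writes
# slightly, but it lets the engine avoid scanning every row on this column.
cur.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# SQLite's EXPLAIN QUERY PLAN reports that the SELECT uses the index.
for row in cur.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'cust7'"):
    print(row)
conn.close()
```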
4. Scalability:
• The storage and processing capabilities of the
traditional RDBMS can be easily scaled up by
increasing the horsepower of the database
server (increasing the primary and secondary
or peripheral storage capacity, processing
capacity of the processor, etc.).
5. Transaction processing:
• RDBMS has support for the Atomicity,
Consistency, Isolation, and Durability (ACID)
properties of transactions. Given next is a quick
explanation of the ACID properties:
Atomicity:
• A transaction is atomic, meaning that either it
happens in its entirety or not at all.
Consistency:
• The database moves from one consistent state
to another consistent state. In other words, if
the same piece of information is stored at two
or more places, the copies are in complete
agreement.
Isolation:
• The resource allocation to the transaction
happens in such a way that the transaction gets
the impression that it is the only transaction
happening, in isolation.
Durability:
• All changes made to the database during a
transaction are permanent, and that accounts
for the durability of the transaction.
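• A minimal sketch of atomicity, assuming an illustrative two-account transfer in sqlite3: a failure between the debit and the credit rolls both back, so the database never shows a half-done transfer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100.0), ("B", 100.0)])
conn.commit()

try:
    with conn:  # the connection as a context manager wraps one transaction
        conn.execute("UPDATE account SET balance = balance - 50 WHERE name = 'A'")
        raise RuntimeError("simulated failure between debit and credit")
        conn.execute("UPDATE account SET balance = balance + 50 WHERE name = 'B'")
except RuntimeError:
    pass  # the context manager has already rolled the transaction back

# Both balances are still 100.0: the partial debit did not survive.
print(conn.execute("SELECT name, balance FROM account").fetchall())
```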
1.1.2 Semi-Structured Data
• Semi-structured data is also referred to as
having a self-describing structure.
• It has the following features:
1. It does not conform to the data models that
one typically associates with relational
databases or any other form of data tables.
2. It uses tags to segregate semantic elements.
3. Tags are also used to enforce hierarchies of
records and fields within data.
4. There is no separation between the data
and the schema. The amount of structure
used is dictated by the purpose at hand.
5. In semi-structured data, entities belonging to
the same class and grouped together need not
necessarily have the same set of attributes
(see the JSON sketch below).
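• A small sketch of feature 5, using hypothetical records: two entities of the same "employee" class carry different attribute sets, and the JSON keys double as self-describing tags.

```python
import json

# Two "employee" records in one collection with different attribute sets --
# legal in semi-structured data, unlike a fixed relational schema.
records = [
    {"name": "Asha", "dept": "Sales", "phone": "12345"},
    {"name": "Ravi", "dept": "HR", "skills": ["recruiting", "payroll"]},
]
doc = json.dumps(records, indent=2)
print(doc)

# The schema travels with the data: each record describes its own fields.
for rec in json.loads(doc):
    print(sorted(rec.keys()))
```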
Sources of Semi-Structured Data
• 1. XML: eXtensible Markup Language (XML)
has been hugely popularized by web services
developed utilizing the Simple Object Access
Protocol (SOAP) principles.
• 2. JSON: JavaScript Object Notation (JSON) is
used to transmit data between a server and a
web application.
• JSON has been popularized by web services
developed utilizing the Representational State
Transfer (REST) architectural style for
creating scalable web services.
• MongoDB (an open-source, distributed, NoSQL,
document-oriented database) and
Couchbase (originally known as Membase; an
open-source, distributed, NoSQL,
document-oriented database) store data
natively in JSON format.
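• A minimal round-trip sketch with a hypothetical payload: JSON text as received from a server is parsed into objects, and objects are serialized back to JSON text for a web application (document databases such as MongoDB store documents in this shape natively).

```python
import json

# A hypothetical REST-style response payload.
payload = '{"user": {"id": 42, "name": "Asha"}, "courses": ["Big Data", "NLP"]}'

# Parse the JSON text received from a server into Python objects...
data = json.loads(payload)
print(data["user"]["name"], data["courses"])

# ...and serialize an object back to JSON text to send onward.
print(json.dumps({"status": "ok", "count": len(data["courses"])}))
```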
Unstructured Data
Unstructured data does not conform to any
pre-defined data model; its structure is quite
unpredictable.
Dealing with Unstructured Data
The following techniques are used to find
patterns in, or to interpret, unstructured data:
• 1. Data mining
• 2. Text analytics or text mining
• 3. Natural language processing (NLP)
• 4. Noisy text analytics
• 5. Manual tagging with metadata
• 6. Part-of-speech tagging
• 7. Unstructured Information Management
Architecture (UIMA)
1. Data mining
• It is the analysis step of the “knowledge
discovery in databases” process.
• A few popular data mining algorithms are as
follows:
1. Association rule mining
2. Regression analysis
3. Collaborative filtering
a. Association rule mining
• It is also called “market basket analysis” or
“affinity analysis”.
• It is used to determine “What goes with
what?”
• It is about finding out, when you buy a product,
which other product you are likely to
purchase with it.
Example
• If you pick up bread from the grocery store, are
you likely to pick up eggs or cheese to go with it?
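• As a sketch of the idea, the snippet below computes support and confidence for the hypothetical rule "bread → eggs" over a few made-up baskets; real systems mine such rules at scale with algorithms like Apriori.

```python
from collections import Counter
from itertools import combinations

# Toy transaction data: each basket is one customer's purchase.
baskets = [
    {"bread", "eggs"}, {"bread", "cheese"}, {"bread", "eggs", "cheese"},
    {"eggs", "milk"}, {"bread", "eggs"},
]

item_counts, pair_counts = Counter(), Counter()
for b in baskets:
    item_counts.update(b)
    pair_counts.update(combinations(sorted(b), 2))

# Rule bread -> eggs: support = P(bread and eggs); confidence = P(eggs | bread).
n = len(baskets)
support = pair_counts[("bread", "eggs")] / n
confidence = pair_counts[("bread", "eggs")] / item_counts["bread"]
print(f"support={support:.2f}, confidence={confidence:.2f}")  # 0.60, 0.75
```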
b. Regression analysis
• It helps to predict the relationship between
two variables.
• The variable whose value needs to be
predicted is called the dependent variable and
the variables which are used to predict the
value are referred to as the independent
variables.
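• A minimal sketch of simple linear regression: one independent variable is fitted to one dependent variable by ordinary least squares, with made-up data points.

```python
# Hypothetical data: x = independent variable, y = dependent variable.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 6.2, 8.1, 9.9]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# slope = cov(x, y) / var(x); the line passes through the point of means.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

print(f"y = {slope:.2f} * x + {intercept:.2f}")
print("prediction for x = 6:", slope * 6 + intercept)
```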
c. Collaborative filtering
• It is about predicting a user’s preference or
preferences based on the preferences of a
group of users.
• For example, take a look at Table 1.5. We are
looking at predicting whether User 4 will
prefer to learn using videos or is a textual
learner depending on one or a couple of his
or her known preferences.
• We analyze the preferences of similar user
profiles and on the basis of it, predict that
User 4 will also like to learn using videos and is
not a textual learner.
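• Since Table 1.5 is not reproduced here, the sketch below substitutes hypothetical preference data: User 4's unknown "videos" preference is borrowed from the most similar user, which is the essence of user-based collaborative filtering.

```python
# 1 = likes, 0 = dislikes, None = unknown (to be predicted).
prefs = {
    "User 1": {"videos": 1, "text": 0, "quizzes": 1},
    "User 2": {"videos": 1, "text": 0, "quizzes": 0},
    "User 3": {"videos": 0, "text": 1, "quizzes": 1},
    "User 4": {"videos": None, "text": 0, "quizzes": 1},
}

def similarity(a, b):
    # Fraction of agreements over the items both users have rated.
    shared = [k for k in a if a[k] is not None and b[k] is not None]
    return sum(a[k] == b[k] for k in shared) / len(shared) if shared else 0.0

target = prefs["User 4"]
best = max((u for u in prefs if u != "User 4"),
           key=lambda u: similarity(target, prefs[u]))
print(best, "is most similar; predicted videos preference:",
      bool(prefs[best]["videos"]))
```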
2. Text analytics or text mining:
• Text mining is the process of gleaning
high-quality and meaningful information
(through the devising of patterns and trends by
means of statistical pattern learning) from text.
• It includes tasks such as text categorization,
text clustering, sentiment analysis,
concept/entity extraction, etc.
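• A toy text-categorization sketch: documents are scored against hand-made keyword lists. The categories and keywords are invented for illustration; real text-mining systems learn such patterns statistically.

```python
import re
from collections import Counter

categories = {
    "sports": {"match", "team", "score", "goal"},
    "finance": {"market", "stock", "price", "profit"},
}

def categorize(text):
    # Count words, then score each category by its keyword hits.
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {c: sum(words[w] for w in kws) for c, kws in categories.items()}
    return max(scores, key=scores.get), scores

print(categorize("The team celebrated after the winning goal in the match."))
print(categorize("The stock rallied as the market priced in higher profit."))
```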
3. Natural language processing (NLP)
• It is related to the area of human-computer
interaction.
• It is about enabling computers to understand
human or natural language input.
4. Noisy text analytics:
• It is the process of extracting structured or
semi-structured information from noisy
unstructured data such as chats, blogs, wikis,
emails, message-boards, text messages, etc.
The noisy unstructured data usually comprises
one or more of the following:
• spelling mistakes,
• abbreviations, acronyms,
• non-standard words, missing punctuation,
• missing letter case,
• filler words such as “uh”, “um”, etc.
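• A small normalization sketch for such noisy text; the abbreviation map and filler list below are illustrative, not a standard resource.

```python
# Illustrative mappings for cleaning noisy text.
ABBREV = {"u": "you", "r": "are", "gr8": "great", "thx": "thanks", "pls": "please"}
FILLERS = {"uh", "um", "hmm"}

def clean(noisy):
    tokens = [t for t in noisy.lower().split() if t not in FILLERS]
    tokens = [ABBREV.get(t, t) for t in tokens]  # expand known abbreviations
    text = " ".join(tokens)
    return text[:1].upper() + text[1:]  # restore sentence-initial case

print(clean("um thx u r gr8 pls reply"))
# -> "Thanks you are great please reply"
```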
5. Manual tagging with metadata
• This is about tagging manually with adequate
metadata to provide the requisite semantics
to understand unstructured data.
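• A tiny sketch of the idea, with invented field names: manually attached metadata gives an otherwise opaque asset the semantics a program can search and filter on.

```python
import json

# The video file itself is unstructured; the hand-written metadata is not.
asset = {
    "file": "lecture_01.mp4",
    "metadata": {
        "type": "video",
        "topic": "big data introduction",
        "language": "en",
        "tags": ["unit-1", "classification of data"],
    },
}
print(json.dumps(asset, indent=2))
```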
6. Part-of-speech tagging
• It is also called POS or POST or grammatical
tagging.
• It is the process of reading text and tagging
each word in the sentence as belonging to a
particular part of speech such as “noun”,
“verb”, “adjective”, etc.
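• A minimal sketch using the NLTK library (a common choice, though not the only one); NLTK's tokenizer and tagger models must be downloaded once, and the exact model names can vary by NLTK version.

```python
import nltk

# One-time downloads (names as of NLTK 3.x):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)

# Each word is tagged with a part of speech: DT = determiner, JJ = adjective,
# NN = noun, VBZ = verb (3rd person singular present), and so on.
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]
```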
7. Unstructured Information
Management Architecture (UIMA)
• It is an open source platform from IBM.
• It is used for real-time content analytics.
• It is about processing text and other
unstructured data to find latent meaning and
relevant relationships buried therein.
• Read up more on UIMA at:
http://www.ibm.com/developerworks/data/downloads/uima/
• 1. Why is an email placed in the “unstructured”
category?
Answer: Let us take a look at what we
can place in the body of the email.
We can have any one or more of the following:
• Hyperlink
• PDFs/DOCs/XLS/etc. attachments
• Emoticons
• Images
• Audio/video attachments
• Free-flowing text, etc.
The above are the reasons for placing the email in
the “unstructured” category.
• 2. Into which category will you place CCTV
footage?
• Answer: Unstructured
• 3. You have just got a book issued from the
library.
• What are the details about the book that can
be placed in an RDBMS table?
Answer:
• Title of the book
• Author of the book
• Publisher of the book
• Year of Publication
• No. of pages in the book
• Type of book such as whether hardbound or
paperback
• Price of the book
• ISBN No. of the book
• Attachments, such as with CD or without CD, etc.
• 4. In which category would you place consumer
complaints and feedback?
• Answer: Unstructured data
CHARACTERISTICS OF DATA
• Data has three key characteristics:
• 1. Composition
• 2. Condition
• 3. Context
1. Composition
• The composition of data deals with the
structure of data, that is, the sources of data,
the granularity, the types, and the nature of
data as to whether it is static or real-time
streaming.
2. Condition:
• The condition of data deals with the state of
data, that is, “Can one use this data as is for
analysis?” or “Does it require cleansing for
further enhancement and enrichment?”
3. Context
• The context of data deals with “Where has this
data been generated?” “Why was this data
generated?” “How sensitive is this data?”
“What are the events associated with this
data?” and so on.
2.2 EVOLUTION OF BIG DATA
• The 1970s and before were the era of mainframes.
The data was essentially primitive and
structured.
• Relational databases evolved in the 1980s and
1990s. That era was one of data-intensive
applications.
• The World Wide Web (WWW) and the
Internet of Things (IoT) have led to an
onslaught of structured, unstructured, and
multimedia data.
2.3 DEFINITION OF BIG DATA
Well, we will give you a few responses that we
have heard over time:
• 1. Anything beyond the human and technical
infrastructure needed to support storage,
processing, and analysis.
• 2. Today’s BIG may be tomorrow’s NORMAL.
• 3. Terabytes or petabytes or zettabytes of
data.
• 4. I think it is about 3 Vs.
Big data definition
• Big data is high-volume, high-velocity, and
high-variety information assets that demand
cost-effective, innovative forms of information
processing for enhanced insight and decision
making.
Part I of the definition
• “big data is high-volume, high-velocity, and
high-variety information assets” talks about
voluminous data (humongous data) that may
have great variety (a good mix of structured,
semi-structured, and unstructured data)
and will require a good speed/pace for storage,
preparation, processing, and analysis.
Part II of the definition
• “cost-effective, innovative forms of
information processing” talks about
embracing new techniques and technologies
to capture (ingest), store, process, persist,
integrate, and visualize the high-volume,
high-velocity, and high-variety data.
Part III of the definition
• “enhanced insight and decision making” talks
about deriving deeper, richer, and meaningful
insights and then using these insights to make
faster and better decisions to gain business
value and thus a competitive edge.
• Data → Information → Actionable
intelligence → Better decisions → Enhanced
business value