106 CHAPTER 4 Data and Knowledge Management
4.1 Managing Data
As we have seen throughout this textbook, IT applications require data. Data should be of
high quality, meaning that they should be accurate, complete, timely, consistent, accessible,
relevant, and concise. Unfortunately, however, the process of acquiring, keeping, and man-
aging data is becoming increasingly difcult.
The Difculties of Managing Data
Because data are processed in several stages and often in several places, they may be subject
to problems and difculties. Managing data in organizations is difcult for many reasons.
The amount of data increases exponentially with time. Much historical data must be kept for
a long time, and new data are added rapidly. For example, to support some 40 million people
who play fantasy football, Web sites such as ESPN.com, NFL.com, and CBSSportsLine.com
have to manage terabytes of sports data.
Data are scattered throughout organizations and are collected by many individuals using
various methods and devices. Data are frequently stored in numerous servers and loca-
tions and in different computing systems, databases, formats, and human and computer
languages.
Data come from internal sources (e.g., corporate databases), personal sources (e.g., per-
sonal thoughts, opinions, and experiences), and external sources (e.g., commercial data-
bases, government reports, and corporate Web sites). Data also come from the Web, in the
form of clickstream data. Clickstream data are those data that visitors and customers pro-
duce when they visit a Web site and click on hyperlinks (described in Chapter 5). Click-
stream data provide a trail of the users activities in the Web site, including user behavior
and browsing patterns.
New sources of data, such as blogs, podcasts, videocasts, and RFID tags and other wire-
less sensors are constantly being developed. Much of these new data are unstructured, mean-
ing that their content cannot be truly represented in a computer record. Examples of
unstructured data are digital images, digital video, voice packets, and musical notes in an
MP3 le.
Data decays over time. For example, customers move to new addresses or change their
names, companies go out of business or are bought, new products are developed, employees
are hired or red, companies expand into new countries, and so on.
Data security, quality, and integrity are critical, yet they are easily jeopardized. In addi-
tion, legal requirements relating to data differ among both countries and industries, and
they change frequently.
Because of these problems, data are difcult to manage. As a result, organizations are using
databases and data warehouses to manage their data more efciently and effectively. We dis-
cuss the data life cycle in the next section, which shows you how organizations process and
manage data to make decisions, generate knowledge, and utilize this knowledge in a variety
of applications.
The Data Life Cycle
Businesses run on data that have been processed into information and knowledge. Managers
then apply this knowledge to business problems and opportunities. Businesses transform data
into knowledge and solutions in several ways; the general process is illustrated in Figure 4.1. It
starts with the collection of data from various sources and the storage of data in a database(s).
Selected data from the organizations databases are then processed to t the format of a data
warehouse or data mart. Next, users access the data in the warehouse or data mart for analysis.
SECTION 4.2 The Database Approach 107
The analysis is done with data analysis tools, which look for patterns, and with intelligent sys-
tems, which support data interpretation. We discuss each of these concepts in this chapter.
These activities ultimately generate knowledge that can be used to support decision mak-
ing. Both the data (at various times during the process) and the knowledge (derived at the
end of the process) must be presented to users. This presentation can be accomplished by
using different visualization tools. The knowledge created can also be stored in an organiza-
tional knowledge base and then used, together with decision support tools, to provide solu-
tions to organizational problems. The remaining sections of this chapter will examine the
elements and the process shown in Figure 4.1 in greater detail.
1. What are some of the difculties involved in managing data?
2. Describe the data life cycle.
3. What are the various sources for data?
Before you go on . . .
Supply
Chain Management
Internal
Data
Data
Warehouse
Metadata
Data
Sources
External
Data
Personal
Data
Data
Visualization
Online Analytical Processing,
Queries,
Executive Dashboard,
Decision Support System
Decisions
Knowledge
Management
Data
Marts
Electronic Commerce
Strategy
Document Management
Others
Customer Relationship
Management
Data
Marts
Data Analysis
Result
Solutions
Data Mining
FIGURE 4.1 Data life cycle.
4.2 The Database Approach
Using databases eliminates many problems that arose from previous methods of storing and
accessing data. Databases are arranged so that one set of software programsthe database
management systemprovides all users with access to all the data. This system minimizes the
following problems:
Data redundancy: The same data are stored in many places.
Data isolation: Applications cannot access data associated with other applications.
Data inconsistency: Various copies of the data do not agree.
In addition, database systems maximize the following issues:
Data security.
Data integrity: Data meet certain constraints, such as no alphabetic characters in a Social
Security number eld.
Data independence: Applications and data are independent of one another (i.e., applica-
tions and data are not linked to each other, meaning that all applications are able to access
the same data).
Figure 4.2 illustrates a university database. Note that university applications from the
Registrars ofce, the Accounting department, and the Athletics department access data
through the database management system.
In the next section, we discuss the data hierarchy, after which we turn our attention to
how databases are designed.
The Data Hierarchy
Data are organized in a hierarchy that begins with bits and proceeds all the way to databases
(see Figure 4.3). A bit (binary digit) represents the smallest unit of data a computer can
process. The term binary means that a bit can consist only of a 0 or a 1. A group of eight bits,
called a byte, represents a single character. A byte can be a letter, a number, or a symbol. A log-
ical grouping of characters into a word, a small group of words, or an identication number is
called a eld. For example, a students name in a universitys computer les would appear in
the name eld, and her or his Social Security number would appear in the Social Security
number eld. Fields can also contain data other than text and numbers. A eld can contain
an image, or any other type of multimedia. For example, a motor vehicle departments licens-
ing database could contain a persons photograph. A logical grouping of related elds, such as
the students name, the courses taken, the date, and the grade, comprise a record. A logical
grouping of related records is called a le or table. For example, the records from a particular
course, consisting of course number, professor, and students grades, would constitute a data
le for that course. A logical grouping of related les would constitute a database. Using the
same example, the student course le could be grouped with les on students personal histo-
ries and nancial backgrounds to create a student database.
The next section discusses designing the database in todays organizations. We focus on
entity-relationship modeling and normalization procedures.
Designing the Database
Data must be organized so that users can retrieve, analyze, and understand them. A key to
effectively designing a database is the data model. A data model is a diagram that represents
entities in the database and their relationships. An entity is a person, place, thing, or eventsuch
as a customer, an employee, or a productabout which information is maintained. Entities can
typically be identied in the users work environment. A record generally describes an entity.
Each characteristic or quality of a particular entity is called an attribute. Using the above exam-
ples, we would consider customer name, employee number, and product color attributes.
108 CHAPTER 4 Data and Knowledge Management
Academic info
Team data
Employee data
Tuition data
Financial data
Student data
Course data
Registration data
Registrar's office Class programs
Accounting dept.
Accounts
programs
Athletics dept. Sports programs
Database
management
system
FIGURE 4.2 A database management system (DBMS) provide access to all data in the
database.
SECTION 4.2 The Database Approach 109
Every record in a le must contain at least one eld that uniquely identies that record so
that it can be retrieved, updated, and sorted. This identier eld is called the primary key.
For example, a student record in a U.S. college would probably use the Social Security num-
ber as its primary key. In some cases, locating a particular record requires the use of secondary
keys. Secondary keys are other elds that have some identifying information but typically do
not identify the le with complete accuracy. For example, the students major might be a sec-
ondary key if a user wanted to nd all students in a particular major eld of study. It should
not be the primary key, however, because many students can have the same major.
Entity-Relationship Modeling. Database designers plan the database design in a
process called entity-relationship (ER) modeling, using an entity-relationship diagram. ER
diagrams consist of entities, attributes, and relationships. Entities are pictured in boxes, and re-
lationships are shown in diamonds. The attributes for each entity are listed next to the entity,
and the primary key is underlined. Figures 4.4a and 4.4b show an entity-relationship diagram.
As dened earlier, an entity can be identied in the users work environment. For exam-
ple, consider student registration at a university. Students register for courses and register
their cars for parking permits. In this example, STUDENT, PARKING PERMIT, CLASS,
and PROFESSOR are entities, as shown in Figure 4.4.
Entities of a given type are grouped in entity classes. In our example, STUDENT, PARK-
ING PERMIT, CLASS, and PROFESSOR are entity classes. An instance of an entity class is
the representation of a particular entity. Therefore, a particular STUDENT (James Smythe,
145-89-7123) is an instance of the STUDENT entity class; a particular parking permit
(91778) is an instance of the PARKING PERMIT entity class; a particular class (76890) is
an instance of the CLASS entity class; and a particular professor (Margaret Wilson, 115-65-
7632) is an instance of the PROFESSOR entity class.
Entity instances have identiers, which are attributes that are unique to that entity instance.
For example, STUDENT instances can be identied with StudentIdenticationNumber;
PARKING PERMIT instances can be identied with PermitNumber; CLASS instances can be
identied with ClassNumber; and PROFESSOR instances can be identied with Professor
IdenticationNumber. These identiers (or primary keys) are underlined on ER diagrams, as in
part (b) of Figure 4.4.
Entities have attributes, or properties, that describe the entitys characteristics. In our
example, examples of attributes for STUDENT would be StudentIdenticationNumber,
StudentName, and StudentAddress. Examples of attributes for PARKING PERMIT would be
PermitNumber, StudentIdenticationNumber, and CarType. Examples of attributes for CLASS
would be ClassNumber, ClassName, ClassTime, and ClassPlace. Examples of attributes for
PROFESSOR would be ProfessorIdenticationNumber, ProfessorName, and ProfessorDepart-
ment. (Note that each course at this university has one professorno team teaching.)
Why is StudentIdenticationNumber an attribute of both the STUDENT and PARK-
ING PERMIT entity classes? That is, why do we need the PARKING PERMIT entity
FIGURE 4.3
Hierarchy of data for a
computer-based le.
File
Record
Field
Byte Byte
Bit Bit
Record
Field
Byte Byte
Bit Bit
Field Field
class? If you consider all interlinked university systems, the PARKING PERMIT entity class
is needed for other applications, such as fee payments, parking tickets, and external links to
the state Department of Motor Vehicles.
Entities are associated with one another in relationships, which can include many enti-
ties. (Remember that relationships are noted by diamonds on ER diagrams.) The number of
entities in a relationship is the degree of the relationship. Relationships between two items
110 CHAPTER 4 Data and Knowledge Management
Can
have
Relationships
Keyfield
Professor
Entities
Key
A student can have
many classes.
(a) ER diagram
(b) Entities, Attributes, and Identifiers
A class can have
many students.
A professor can have
many classes.
A class can have
only 1 professor.
A student can have
only 1 parking permit.
A parking permit can
have only 1 student.
1
M
1
1:1
M:M
1:M
Can
have
Can
have
M
M
1
Class
Student Parking Permit
Student Identification Number
STUDENT
Student Name
Student Address
Permit Number
PARKING PERMIT
Student Identification Number
Car Type
Class Number
CLASS
Class Name
Class Time
Class Place
Professor Identification Number
PROFESSOR
Professor Name
Professor Department
FIGURE 4.4 Entity-relationship diagram model.
SECTION 4.3 Database Management Systems 111
are called binary relationships. The three types of binary relationships are one-to-one, one-to-
many, and many-to-many.
In a one-to-one (1:1) relationship, a single-entity instance of one type is related to a single-
entity instance of another type. Figure 4.4 shows STUDENT-PARKING PERMIT as a 1:1 rela-
tionship that relates a single STUDENT with a single PARKING PERMIT. That is, no student
has more than one parking permit, and no parking permit is issued for more than one student.
The second type of relationship, one-to-many (1:M), is represented by the CLASS-
PROFESSOR relationship in Figure 4.4. This relationship means that a professor can have
many courses, but each course can have only one professor.
The third type of relationship, many-to-many (M:M), is represented by the STUDENT-
CLASS relationship. This M:M relationship means that a student can have many courses,
and a course can have many students.
Entity-relationship modeling is valuable because it allows database designers to talk with
users throughout the organization to ensure that all entities and the relationships among
them are represented. This process underscores the importance of taking all users into ac-
count in designing organizational databases. Notice that all entities and relationships in our
example are labeled in terms that users can understand. Now that we understand how a
database is designed, we turn our attention to database management systems.
1. What is a data model?
2. What is a primary key? a secondary key?
3. What is an entity? a relationship?
4.3 Database Management Systems
A database management system (DBMS) is a set of programs that provide users with tools
to add, delete, access, and analyze data stored in one location. An organization can access
the data by using query and reporting tools that are part of the DBMS or by using applica-
tion programs specically written to access the data. DBMSs also provide the mechanisms
for maintaining the integrity of stored data, managing security and user access, and recover-
ing information if the system fails. Because databases and DBMSs are essential to all areas of
business, they must be carefully managed.
There are a number of different database architectures, but we focus on the relational data-
base model because it is popular and easy to use. Other database models (e.g., the hierarchical
and network models) are the responsibility of the MIS function and are not used by organiza-
tional employees. Popular examples of relational databases are Microsoft Access and Oracle.
The Relational Database Model
Most business dataespecially accounting and nancial datatraditionally were organized
into simple tables consisting of columns and rows. Tables allow people to compare informa-
tion quickly by row or column. In addition, items are easy to retrieve by nding the point of
intersection of a particular row and column.
The relational database model is based on the concept of two-dimensional tables. A rela-
tional database is not always one big tableusually called a at lethat contains all of the
records and attributes. Such a design would entail far too much data redundancy. Instead,
a relational database is usually designed with a number of related tables. Each of these tables
contains records (listed in rows) and attributes (listed in columns).
These related tables can be joined when they contain common columns. The uniqueness
of the primary key tells the DBMS which records are joined with others in related tables.
Before you go on . . .
This feature allows users great exibility in the variety of queries they can make. Despite
these features, this model has some disadvantages. Because large-scale databases can be com-
posed of many interrelated tables, the overall design can be complex and therefore have slow
search and access times.
Consider the relational database example about students shown in Figure 4.5. The table
contains data about the entity called students. Attributes of the entity are name, undergrad-
uate major, and grade point average. The rows are the records on Sally Adams, John Jones,
Jane Lee, Kevin Durham, Juan Rodriguez, Stella Zubnicki, and Ben Jones. Of course, your
university keeps much more data on you than our example shows. In fact, your universitys
student database probably keeps hundreds of attributes on each student.
Query Languages. Requesting information from a database is the most commonly
performed operation. Structured query language (SQL) is the most popular query lan-
guage used to request information. It allows people to perform complicated searches by
using relatively simple statements or keywords. Typical keywords are SELECT (to specify a
desired attribute), FROM (to specify the table to be used), and WHERE (to specify condi-
tions to apply in the query).
To understand how SQL works, imagine that a university wants to know the names of stu-
dents who will graduate with honors in May 2009. The university IS staff would query the
student relational database with an SQL statement such as SELECT Student Name, FROM
Student Database, WHERE Grade Point Average 3.40 and Grade Point Average 3.59.
The SQL query would return John Jones and Juan Rodriguez.
Another way to nd information in a database is to use query by example (QBE). In
QBE, the user lls out a grid or template (also known as a form) to construct a sample or
description of the data he or she wants. Users can construct a query quickly and easily by
using drag-and-drop features in a DBMS such as Microsoft Access. Conducting queries in
this manner is simpler than keying in SQL commands.
112 CHAPTER 4 Data and Knowledge Management
FIGURE 4.5 Student
database example.
SECTION 4.3 Database Management Systems 113
Data Dictionary. When a relational model is created, the data dictionary denes the
format necessary to enter the data into the database. The data dictionary provides informa-
tion on each attribute, such as its name, whether it is a key or part of a key, the type of data
expected (alphanumeric, numeric, dates, and so on), and valid values. Data dictionaries can
also provide information on how often the attribute should be updated, why it is needed in
the database, and which business functions, applications, forms, and reports use the attribute.
Data dictionaries provide many advantages to the organization. Because they provide
names and standard denitions for all attributes, they reduce the chances that the same at-
tribute will be used in different applications but with a different name. In addition, data
dictionaries enable programmers to develop programs more quickly because they dont have
to create new data names.
Normalization. In order to use a relational database management system effectively, the
data must be analyzed to eliminate redundant data elements. Normalization is a method
for analyzing and reducing a relational database to its most streamlined form for minimum
redundancy, maximum data integrity, and best processing performance. When data are nor-
malized, attributes in the table depend only on the primary key.
As an example of normalization, consider an automotive repair garage. This business takes
orders from customers who want to have their cars repaired. In this example, ORDER, PART,
SUPPLIER, and CUSTOMER are entities. There can be many PARTS in an ORDER, but
each PART can come from only one SUPPLIER. In a nonnormalized relation called ORDER
(see Figure 4.6), each ORDER would have to repeat the name, description, and price of each
PART needed to complete the ORDER, as well as the name and address of each SUPPLIER.
This relation contains repeating groups and describes multiple entities.
The normalization process, illustrated in Figure 4.7, breaks down the relation, ORDER,
into smaller relations: ORDER, SUPPLIER, and CUSTOMER (Figure 4.7a) and
ORDERED-PARTS and PART (Figure 4.7b). Each of these relations describes a single en-
tity. This process is conceptually simpler, and it eliminates repeating groups. For example,
FIGURE 4.6
Nonnormalized
relation.
FIGURE 4.7 Smaller
relationships broken
down from the
nonnormal relations.
(a) Order, Supplier,
Customer. (b) Ordered
Parts, Part.
114 CHAPTER 4 Data and Knowledge Management
(a)
(b)
SECTION 4.3 Database Management Systems 115
consider an order at the automobile repair shop. The normalized relations can produce the
order in the following manner (see Figure 4.8).
1. The ORDER relation provides the Order Number (the primary key), Order Date, De-
livery Date, Order Total, and Customer Number.
2. The primary key of the ORDER relation (Order Number) provides a link to the OR-
DERED PARTS relation (the link numbered 1 in Figure 4.8).
3. The ORDERED PARTS relation supplies the Number of Parts information to
ORDER.
4. The primary key of the ORDERED PARTS relation (Part Number) provides a link to
the PART relation (the link numbered 2 in Figure 4.8).
5. The PART relation supplies the Part Description, Unit Price, and Supplier Number to
ORDER.
6. The Supplier Number in the PART relation provides a link to the SUPPLIER relation
(the link numbered 3 in Figure 4.8).
7. The SUPPLIER relation provides the Supplier Name and Supplier Address to ORDER.
8. The Customer Number in ORDER provides a link to the CUSTOMER relation (the
link numbered 4 in Figure 4.8).
9. The CUSTOMER relation supplies the Customer Name and Customer Address to
ORDER.
Databases in Action
It is safe to say that almost all organizations have one or more databases. Furthermore, there
are a large number of interesting database applications. ITs About Business 4.1 shows us
how databases can be used to catch plagiarists.
FIGURE 4.8 How
normalized relations
produce the order.