Distributed Database Systems
1
Outline
◼ Introduction
◼ Distributed Database Design
◼ Distributed Data Control
◼ Distributed Query Processing
◼ Distributed Transaction Processing
◼ Data Replication
◼ Database Integration – Multi-database Systems
◼ Web Data Management
2
Outline
◼ Introduction
❑ What is a distributed DBMS
❑ History
❑ Distributed DBMS promises
❑ Design issues
❑ Distributed DBMS architecture
3
Distributed Computing
◼ A number of autonomous processing elements (not
necessarily homogeneous) that are interconnected by a
computer network and that cooperate in performing their
assigned tasks.
◼ What is being distributed?
❑ Processing logic
❑ Function
❑ Data
❑ Control
4
Current Distribution – Geographically Distributed Data Centers
5
What is a Distributed Database System?
A distributed database (DDB) is a collection of multiple, logically
interrelated databases distributed over a computer network.
A distributed database management system (Distributed
DBMS) is the software that manages the DDB and provides
an access mechanism that makes this distribution
transparent to the users.
6
What is not a DDBS?
◼ A timesharing computer system
◼ A loosely or tightly coupled multiprocessor system
◼ A database system which resides at one of the nodes of
a network of computers - this is a centralized database
on a network node
7
Distributed DBMS Environment
8
Implicit Assumptions
◼ Data stored at a number of sites → each site logically
consists of a single processor
◼ Processors at different sites are interconnected by a
computer network → not a multiprocessor system
❑ Parallel database systems
◼ Distributed database is a database, not a collection of
files → data logically related as exhibited in the users’
access patterns
❑ Relational data model
◼ Distributed DBMS is a full-fledged DBMS
❑ Not remote file system, not a TP system
9
Important Point
Logically integrated
but
Physically distributed
10
Outline
◼ Introduction
❑ History
11
History – File Systems
12
History – Database Management
13
History – Early Distribution
Peer-to-Peer (P2P)
14
History – Client/Server
15
History – Data Integration
16
History – Cloud Computing
On-demand, reliable services provided over the Internet in
a cost-efficient manner
◼ Cost savings: no need to maintain dedicated compute
power
◼ Elasticity: better adaptivity to changing workload
17
Data Delivery Alternatives
◼ Delivery modes
❑ Pull-only
❑ Push-only
❑ Hybrid
◼ Frequency
❑ Periodic
❑ Conditional
❑ Ad-hoc or irregular
◼ Communication Methods
❑ Unicast
❑ One-to-many
◼ Note: not all combinations make sense
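The delivery modes above can be sketched in a few lines. This is a minimal illustration only: the `DataSource` class and its method names are hypothetical, and everything runs in one process rather than over a network. `pull` models client-initiated delivery, while `publish` models server-initiated, one-to-many delivery to registered subscribers.

```python
class DataSource:
    """A toy data source supporting pull and push delivery (hypothetical API)."""

    def __init__(self, value=None):
        self.value = value
        self.subscribers = []  # callbacks registered for push delivery

    def pull(self):
        """Pull-only: the client asks and the source answers."""
        return self.value

    def subscribe(self, callback):
        """Register a client callback for push delivery."""
        self.subscribers.append(callback)

    def publish(self, value):
        """Push: the source sends each new value to every subscriber (one-to-many)."""
        self.value = value
        for callback in self.subscribers:
            callback(value)


source = DataSource("v1")
inbox_a, inbox_b = [], []
source.subscribe(inbox_a.append)
source.subscribe(inbox_b.append)
source.publish("v2")  # both inboxes receive "v2" without asking
```

A hybrid mode would combine the two: clients pull on demand and also subscribe to selected updates.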
18
Outline
◼ Introduction
❑ Distributed DBMS promises
19
Distributed DBMS Promises
Transparent management of distributed, fragmented,
and replicated data
Improved reliability/availability through distributed
transactions
Improved performance
Easier and more economical system expansion
Transparency
◼ Transparency is the separation of the higher-level
semantics of a system from the lower level
implementation issues.
◼ The fundamental issue is to provide data independence in the
distributed environment
❑ Network (distribution) transparency
❑ Replication transparency
❑ Fragmentation transparency
◼ horizontal fragmentation: selection
◼ vertical fragmentation: projection
◼ hybrid
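As a rough illustration of the two fragmentation styles, the sketch below builds horizontal fragments by selection and vertical fragments by projection, then reconstructs the relation by union and by a join on the key. The EMP rows, the LOC attribute, and all function names are assumptions made for the example, not part of the slides.

```python
# Hypothetical EMP relation; the rows and the LOC attribute are illustrative only.
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe", "TITLE": "Elect. Eng.", "LOC": "Boston"},
    {"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Syst. Anal.", "LOC": "Paris"},
    {"ENO": "E3", "ENAME": "A. Lee", "TITLE": "Mech. Eng.", "LOC": "Boston"},
]


def horizontal_fragment(relation, predicate):
    """Selection: keep whole rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def vertical_fragment(relation, columns):
    """Projection: keep only the named columns (the key must be among them)."""
    return [{c: row[c] for c in columns} for row in relation]


def reconstruct_horizontal(*fragments):
    """Union of disjoint horizontal fragments, ordered by key."""
    rows = [row for fragment in fragments for row in fragment]
    return sorted(rows, key=lambda r: r["ENO"])


def reconstruct_vertical(left, right, key="ENO"):
    """Join two vertical fragments on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


emp_boston = horizontal_fragment(EMP, lambda r: r["LOC"] == "Boston")
emp_paris = horizontal_fragment(EMP, lambda r: r["LOC"] == "Paris")
emp_names = vertical_fragment(EMP, ["ENO", "ENAME"])
emp_jobs = vertical_fragment(EMP, ["ENO", "TITLE", "LOC"])
```

Fragmentation transparency means a user queries EMP as a whole and never has to name `emp_boston` or `emp_names`; the system performs the reconstruction. A hybrid scheme applies projection to a horizontal fragment, or the reverse.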
Example
22
Transparent Access
SELECT ENAME, SAL
FROM EMP, ASG, PAY
WHERE DUR > 12
AND EMP.ENO = ASG.ENO
AND PAY.TITLE = EMP.TITLE
[Figure: sites in Tokyo, Boston, Paris, Montreal, and New York connected by a communication network. Each site stores local fragments of the employees, projects, and assignments data (for example, Paris employees, Paris projects, Paris assignments), and some sites also hold data from other sites, including a fragment of projects with budget > 200000.]
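To make the idea of transparent access concrete, the sketch below runs the slide's query unchanged against data that is stored as per-site fragments. All rows, site contents, and the `run_global_query` helper are hypothetical. The helper simply gathers every fragment into one in-memory SQLite database before evaluating the query; a real distributed DBMS would instead decompose and optimize the query across sites, so this shows only what the user sees, not how the system works internally.

```python
import sqlite3

# Hypothetical fragments stored at each site: (ENO, ENAME, TITLE) and (ENO, PNO, DUR).
SITES = {
    "Boston": {"EMP": [("E1", "J. Doe", "Elect. Eng.")], "ASG": [("E1", "P1", 12)]},
    "Paris": {"EMP": [("E2", "M. Smith", "Syst. Anal.")], "ASG": [("E2", "P2", 24)]},
    "New York": {"EMP": [("E3", "A. Lee", "Mech. Eng.")], "ASG": [("E3", "P3", 18)]},
}
# Hypothetical PAY relation: (TITLE, SAL).
PAY = [("Elect. Eng.", 40000), ("Syst. Anal.", 34000), ("Mech. Eng.", 27000)]

# The user's query names no site, fragment, or copy.
GLOBAL_QUERY = """
    SELECT ENAME, SAL
    FROM EMP, ASG, PAY
    WHERE DUR > 12
      AND EMP.ENO = ASG.ENO
      AND PAY.TITLE = EMP.TITLE
"""


def run_global_query(sites, pay):
    """Naive evaluation: union all fragments in one place, then run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE EMP (ENO TEXT, ENAME TEXT, TITLE TEXT)")
    conn.execute("CREATE TABLE ASG (ENO TEXT, PNO TEXT, DUR INTEGER)")
    conn.execute("CREATE TABLE PAY (TITLE TEXT, SAL INTEGER)")
    for fragments in sites.values():
        conn.executemany("INSERT INTO EMP VALUES (?, ?, ?)", fragments["EMP"])
        conn.executemany("INSERT INTO ASG VALUES (?, ?, ?)", fragments["ASG"])
    conn.executemany("INSERT INTO PAY VALUES (?, ?)", pay)
    rows = conn.execute(GLOBAL_QUERY).fetchall()
    conn.close()
    return sorted(rows)  # sorted only to make the result order deterministic


result = run_global_query(SITES, PAY)
```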
23
Distributed Database - User View
[Figure: the user sees a single, logically integrated distributed database]
24
Distributed DBMS - Reality
[Figure: several sites, each running DBMS software and serving user queries or user applications, connected to one another through a communication subsystem]
25
Types of Transparency
◼ Data independence
◼ Network transparency (or distribution transparency)
❑ Location transparency
❑ Naming transparency
◼ Fragmentation transparency
◼ Replication transparency
26
Reliability Through Transactions
◼ Replicated components and data should make distributed
DBMS more reliable.
◼ Distributed transactions provide
❑ Concurrency transparency
❑ Failure atomicity
◼ Distributed transaction support requires implementation of
❑ Distributed concurrency control protocols
❑ Commit protocols
◼ Data replication
❑ Great for read-intensive workloads, problematic for updates
❑ Replication protocols
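The slides do not name a specific commit protocol. As one well-known example, the sketch below simulates two-phase commit (2PC) in a single process, which is how failure atomicity is commonly obtained: either every site commits or every site aborts. The `Participant` class and `two_phase_commit` function are hypothetical names, and the sketch deliberately omits logging, timeouts, and recovery from coordinator or participant crashes.

```python
class Participant:
    """A site taking part in a distributed transaction (toy model)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # whether local work can be made durable
        self.state = "INITIAL"

    def prepare(self):
        """Phase 1: vote on whether the local work can be committed."""
        self.state = "READY" if self.can_commit else "ABORTED"
        return self.can_commit

    def commit(self):
        self.state = "COMMITTED"

    def abort(self):
        self.state = "ABORTED"


def two_phase_commit(participants):
    """Coordinator logic: commit everywhere or abort everywhere."""
    # Phase 1 (voting): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit only if all voted yes, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "COMMIT"
    for p in participants:
        p.abort()
    return "ABORT"


sites_ok = [Participant("Boston"), Participant("Paris")]
sites_bad = [Participant("Boston"), Participant("Paris", can_commit=False)]
outcome_ok = two_phase_commit(sites_ok)
outcome_bad = two_phase_commit(sites_bad)
```

In the second run a single "no" vote forces every site to abort, including the one that had voted yes.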
27
Potentially Improved Performance
◼ Proximity of data to its points of use
❑ Requires some support for fragmentation and replication
◼ Parallelism in execution
❑ Inter-query parallelism
❑ Intra-query parallelism
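The difference between the two kinds of parallelism can be sketched with a thread pool. The partitioned salary data and the function names are assumptions for illustration. Threads are used here only to show the structure of the computation; in a real system the subqueries would run on separate processors or sites.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical salary values, partitioned across three sites.
PARTITIONS = [[40000, 34000], [27000, 24000], [34000]]


def local_sum(partition):
    """Subquery executed against one partition."""
    return sum(partition)


def intra_query_total(partitions):
    """Intra-query parallelism: one query split into per-partition subqueries."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(local_sum, partitions))
    return sum(partials)  # combine the partial results


def inter_query_results(queries, partitions):
    """Inter-query parallelism: several independent queries run at the same time."""
    values = [v for partition in partitions for v in partition]
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        futures = {name: pool.submit(fn, values) for name, fn in queries.items()}
        return {name: future.result() for name, future in futures.items()}


total = intra_query_total(PARTITIONS)
answers = inter_query_results({"max": max, "count": len}, PARTITIONS)
```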
28
Scalability
◼ Issue is database scaling and workload scaling
◼ Adding processing and storage power
❑ Scale-out: add more servers
❑ Scale-up: increase the capacity of one server → has limits
29
Outline
◼ Introduction
❑ Design issues
30
Distributed DBMS Issues
◼ Distributed database design
❑ How to distribute the database
❑ Replicated & non-replicated database distribution
❑ The related problem of directory management
◼ Distributed query processing
❑ Convert user transactions to data manipulation instructions
❑ Optimization problem
◼ min{cost = data transmission + local processing}
❑ General formulation is NP-hard
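The cost formula above can be illustrated by comparing a handful of candidate execution strategies. Everything in this sketch is hypothetical: the strategies, the tuple counts, and the per-tuple cost weights are made-up numbers chosen only to show how transmission and local processing trade off. Exhaustive comparison like this is feasible only for a small set of candidates, which is consistent with the general problem being NP-hard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    tuples_shipped: int    # tuples sent over the network
    tuples_processed: int  # tuples handled by local operators


def total_cost(strategy, transmit_cost=10.0, process_cost=1.0):
    """cost = data transmission + local processing (hypothetical unit costs)."""
    return (strategy.tuples_shipped * transmit_cost
            + strategy.tuples_processed * process_cost)


def cheapest(strategies, **weights):
    """Pick the strategy with minimum total cost among the given candidates."""
    return min(strategies, key=lambda s: total_cost(s, **weights))


# Made-up candidates for joining EMP and ASG stored at different sites.
STRATEGIES = [
    Strategy("ship EMP to ASG site", tuples_shipped=400, tuples_processed=1400),
    Strategy("ship ASG to EMP site", tuples_shipped=1000, tuples_processed=1400),
    Strategy("select at ASG site first, then ship result",
             tuples_shipped=100, tuples_processed=1500),
]

best = cheapest(STRATEGIES)
```

With these weights, reducing the data before shipping wins even though it does slightly more local work. If transmission were free (`transmit_cost=0.0`), the choice would shift to a strategy with less local processing.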
31
Distributed DBMS Issues
◼ Distributed concurrency control
❑ Synchronization of concurrent accesses
❑ Consistency and isolation of transactions' effects
❑ Deadlock management
◼ Reliability
❑ How to make the system resilient to failures
❑ Atomicity and durability
32
Distributed DBMS Issues
◼ Replication
❑ Mutual consistency
❑ Freshness of copies
❑ Eager vs lazy
❑ Centralized vs distributed
◼ Parallel DBMS
❑ Objectives: high scalability and performance
❑ Not geo-distributed
❑ Cluster computing
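Returning to the replication bullets above, the eager versus lazy distinction and its effect on mutual consistency can be sketched as follows. The `ReplicatedItem` class is a hypothetical toy model of one data item with several copies. The lazy variant shown updates a single primary copy first (a centralized scheme); other designs are possible.

```python
class ReplicatedItem:
    """One logical data item stored as several physical copies (toy model)."""

    def __init__(self, n_copies, value=0):
        self.copies = [value] * n_copies
        self.pending = []  # updates not yet propagated (lazy mode only)

    def write_eager(self, value):
        """Eager: update every copy before the write is considered done."""
        self.copies = [value] * len(self.copies)

    def write_lazy(self, value):
        """Lazy: update the primary copy now, propagate to the others later."""
        self.copies[0] = value
        self.pending.append(value)

    def propagate(self):
        """Apply queued updates to the secondary copies, in order."""
        for value in self.pending:
            for i in range(1, len(self.copies)):
                self.copies[i] = value
        self.pending.clear()

    def mutually_consistent(self):
        """True when all copies hold the same value."""
        return len(set(self.copies)) == 1


item = ReplicatedItem(n_copies=3)
item.write_eager(5)                        # all copies updated together
consistent_after_eager = item.mutually_consistent()
item.write_lazy(7)                         # secondaries are now stale
consistent_before_propagation = item.mutually_consistent()
item.propagate()                           # copies converge again
```

Between `write_lazy` and `propagate`, a read from a secondary copy returns a stale value, which is the freshness issue listed above.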
33
Related Issues
◼ Alternative distribution approaches
❑ Modern P2P
❑ World Wide Web (WWW or Web)
◼ Big data processing
❑ 4V: volume, variety, velocity, veracity
❑ MapReduce & Spark
❑ Stream data
❑ Graph analytics
❑ NoSQL
❑ NewSQL
❑ Polystores
34
Outline
◼ Introduction
❑ Distributed DBMS architecture
35
DBMS Implementation Alternatives
36
Dimensions of the Problem
◼ Distribution
❑ Whether the components of the system are located on the same machine or
not
◼ Heterogeneity
❑ Various levels (hardware, communications, operating system)
❑ DBMS heterogeneity is the most important
◼ Data model, query language, transaction management algorithms
◼ Autonomy
❑ Not well understood and most troublesome
❑ Various versions
◼ Design autonomy: Ability of a component DBMS to decide on issues related to its
own design.
◼ Communication autonomy: Ability of a component DBMS to decide whether and
how to communicate with other DBMSs.
◼ Execution autonomy: Ability of a component DBMS to execute local operations in
any manner it wants to.
37
Client/Server Architecture
38
Advantages of Client/Server Architectures
◼ More efficient division of labor
◼ Horizontal and vertical scaling of resources
◼ Better price/performance on client machines
◼ Ability to use familiar tools on client machines
◼ Client access to remote data (via standards)
◼ Full DBMS functionality provided to client workstations
◼ Overall better system price/performance
39
Database Server
40
Distributed Database Servers
41
Peer-to-Peer Component Architecture
42
MDBS Components & Execution
43
Mediator/Wrapper Architecture
44
Cloud Computing
On-demand, reliable services provided over the Internet in
a cost-efficient manner
◼ IaaS – Infrastructure-as-a-Service
◼ PaaS – Platform-as-a-Service
◼ SaaS – Software-as-a-Service
◼ DaaS – Database-as-a-Service
45
Simplified Cloud Architecture
46