Load Balancing in Parallel and Distributed Database

Load Balancing in Distributed Database Md. Shamsur Rahim 14-98181-3 Student, MScCS, AIUB AZM Ehtesham Chowdhury 15-98451-1 Student, MScCS, AIUB Saiful Akhter 15-98502-1 Student, MScCS, AIUB

Load Balancing:  Means distributing transaction and queries among different nodes.  The goal is to maximize the throughput.  Parallel Execution Problems  1. Initialization  2. Interference  3. Skew

Parallel Execution Problems : Initialization  Initialization is necessary before execution.  This sequential steps includes  Process/ Thread Creation and initialization  Communication Initialization etc.  The duration is proportional to the degree of parallelism  The degree of parallelism should be fixed according to query complexity.  Formula for finding response time for an Operator: 𝑅𝑒𝑠𝑝𝑜𝑛𝑠𝑒𝑇𝑖𝑚𝑒 = 𝑎 ∗ 𝑛 + 𝑐∗𝑁 𝑛  The equation can be further derived to obtain: 𝑁 = 𝑡𝑜𝑢𝑝𝑙𝑒𝑠, 𝑐 = 𝑎𝑣𝑔 𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑖𝑛𝑔 𝑡𝑖𝑚𝑒 n = No. Of Processors optimal number of processors to allocate (n) maximal achievable speedup (S) 𝑛 = √ 𝑐 ∗ 𝑁 𝑎 𝑆 = 𝑛 2

Parallel Execution Problems : Interferences  Parallel execution can be slowed down by interference.  Interference occurs when several processors simultaneously access the same resource,  Hardware  Solution: Duplicate Shared resource  Software.  Solution: Partition the shared resource into several independent resources

Parallel Execution Problems : Skew  Problem appears with intra- operator parallelism (variation in partition size) is known as data skew.  Classification of Skew:  Attribute Value Skew : inherent in the dataset  e.g., there are more citizens in Paris than in Waterloo  Tuple Placement Skew: introduced when the data are initially partitioned  e.g., with range partitioning  Selectivity Skew  introduced when there is variation in the selectivity of select predicates on each node  Redistribution Skew  occurs in the redistribution step between two operators.  Join Product Skew  occurs because the join selectivity may vary between nodes

Inter-Query Parallelism  Form of parallelism where many different Queries or Transactions are executed in parallel with one another on many processors.  Advantages:  Increases Transaction Throughput.  Scales up the Transaction processing system  Easy to implement in Shared Memory Parallel System.  Example: Oracle 8 & Oracle Rdb.

Intra-Query Parallelism  Form of parallelism where Single Query is executed in parallel on many processors.  2 Types.  Intra-operation parallelism  Inter-operation parallelism  Advantages:  speed up a single complex long running queries.  Best suited for complex scientific calculations (queries).  Example: Informix, Terradata.

Intra-operation parallelism  The process of speeding up a query through parallelizing the execution of individual operations.  The operations which can be parallelized are Sort, Join, Projection, Selection and so on.

Inter-operation parallelism  The process of speeding up a query through parallelizing various operations which are part of the query.  Example Step:  A query which involves join of 4 tables executed in two processors  Each processor shall join two relations locally and the result1 and result2 can be joined further to produce the final result.

Intra-Operator Load Balancing  Depends on  The degree of parallelism.  Allocation of processors for the operator.  The home of the operator (the set of processors where it is executed) must be carefully decided.  The skew problem makes it hard for a parallel query optimizer to make this decision statically.  Require a very accurate and detailed cost model.

 Two Solutions incorporated in a hybrid query optimizer.  Adaptive  Specialized

Adaptive Technique  The main idea is to statically decide on an initial allocation of the processors to the operator (using a cost model).  Adapt to skew using load reallocation.  Load reallocation is to detect the oversized partitions.  Partition them again onto several processors.

Adaptive Technique(Continued)  Advantage:  More dynamic adjustment of the degree of parallelism.  useful to improve intra-operator load balancing in all kinds of parallel architectures.  By reducing processor interference  Excellent load balancing for intra-operator parallelism

Adaptive Technique(Continued)  specific control operators.  Detect whether the static estimates for intermediate result sizes differ from the run-time values.  Relation redistribution in order to prevent join product skew and redistribution skew.  Depends on difference between the estimate and the real value is sufficiently high.

Specialized techniques  Two main techniques.  Range partitioning  Sampling  Avoid redistribution skew of the building relation.  Processors can get partitions of equal numbers of tuples, corresponding to different ranges of join attribute values.

Specialized techniques(Continued)  To deal with skew as follows:  Sample the building relation to determine the partitioning ranges.  Redistribute the building relation to the processors using the ranges. Each processor builds a hash table containing the incoming tuples.  Redistribute the probing relation using the same ranges to the processors. For each tuple received, each processor probes the hash table to perform the join.

Inter-Operator Load Balancing  Important to Choose for each operator  How many and which processors to assign for its execution.  Taking into account pipeline parallelism, which requires inter-operator communication.  Harder to achieve in shared-nothing for this Reasons:  Choice of the degree of parallelism cause to errors  Reason: Both processors and operators are discrete entities.

Inter-Operator Load Balancing(Continued)  Processors associated with the latest operators in a pipeline chain may remain idle a significant time.  Shared-memory allows the parallel execution of independent pipeline chains  It is known as Tasks.  Dynamically adjusting the degree of intra-operator parallelism of the tasks in order to reach maximum resource utilization.

Activations  Represents a sequential unit of work  Can be executed by any thread  Self-contained  Can only be executed in the same SM(shared memory)-node

Activation Queues Moving data activation along pipeline chains Also called table queues Threads have unrestricted access to the same SM-node queues Small number of queue results interference A thread a queue

Thread  Simple strategy for good load balancing if number of threads are higher than the processors  One thread per processor per query reduce the overhead of interference  Thread will consume activation as much as possible to limit thread interference

Reference:  M. Tamer Özsu • Patrick Valduriez, Principles of Distributed Database Systems, Third Edition

Load Balancing in Parallel and Distributed Database

More Related Content

What's hot

Viewers also liked

Similar to Load Balancing in Parallel and Distributed Database

More from Md. Shamsur Rahim

Recently uploaded

In this document

Load Balancing in Parallel and Distributed Database