Load Balancing in Distributed Database Md. Shamsur Rahim 14-98181-3 Student, MScCS, AIUB AZM Ehtesham Chowdhury 15-98451-1 Student, MScCS, AIUB Saiful Akhter 15-98502-1 Student, MScCS, AIUB
Load Balancing:  Means distributing transaction and queries among different nodes.  The goal is to maximize the throughput.  Parallel Execution Problems  1. Initialization  2. Interference  3. Skew
Parallel Execution Problems : Initialization  Initialization is necessary before execution.  This sequential steps includes  Process/ Thread Creation and initialization  Communication Initialization etc.  The duration is proportional to the degree of parallelism  The degree of parallelism should be fixed according to query complexity.  Formula for finding response time for an Operator: 𝑅𝑒𝑠𝑝𝑜𝑛𝑠𝑒𝑇𝑖𝑚𝑒 = 𝑎 ∗ 𝑛 + 𝑐∗𝑁 𝑛  The equation can be further derived to obtain: 𝑁 = 𝑡𝑜𝑢𝑝𝑙𝑒𝑠, 𝑐 = 𝑎𝑣𝑔 𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑖𝑛𝑔 𝑡𝑖𝑚𝑒 n = No. Of Processors optimal number of processors to allocate (n) maximal achievable speedup (S) 𝑛 = √ 𝑐 ∗ 𝑁 𝑎 𝑆 = 𝑛 2
Parallel Execution Problems : Interferences  Parallel execution can be slowed down by interference.  Interference occurs when several processors simultaneously access the same resource,  Hardware  Solution: Duplicate Shared resource  Software.  Solution: Partition the shared resource into several independent resources
Parallel Execution Problems : Skew  Problem appears with intra- operator parallelism (variation in partition size) is known as data skew.  Classification of Skew:  Attribute Value Skew : inherent in the dataset  e.g., there are more citizens in Paris than in Waterloo  Tuple Placement Skew: introduced when the data are initially partitioned  e.g., with range partitioning  Selectivity Skew  introduced when there is variation in the selectivity of select predicates on each node  Redistribution Skew  occurs in the redistribution step between two operators.  Join Product Skew  occurs because the join selectivity may vary between nodes
Inter-Query Parallelism  Form of parallelism where many different Queries or Transactions are executed in parallel with one another on many processors.  Advantages:  Increases Transaction Throughput.  Scales up the Transaction processing system  Easy to implement in Shared Memory Parallel System.  Example: Oracle 8 & Oracle Rdb.
Intra-Query Parallelism  Form of parallelism where Single Query is executed in parallel on many processors.  2 Types.  Intra-operation parallelism  Inter-operation parallelism  Advantages:  speed up a single complex long running queries.  Best suited for complex scientific calculations (queries).  Example: Informix, Terradata.
Intra-operation parallelism  The process of speeding up a query through parallelizing the execution of individual operations.  The operations which can be parallelized are Sort, Join, Projection, Selection and so on.
Inter-operation parallelism  The process of speeding up a query through parallelizing various operations which are part of the query.  Example Step:  A query which involves join of 4 tables executed in two processors  Each processor shall join two relations locally and the result1 and result2 can be joined further to produce the final result.
Intra-Operator Load Balancing  Depends on  The degree of parallelism.  Allocation of processors for the operator.  The home of the operator (the set of processors where it is executed) must be carefully decided.  The skew problem makes it hard for a parallel query optimizer to make this decision statically.  Require a very accurate and detailed cost model.
 Two Solutions incorporated in a hybrid query optimizer.  Adaptive  Specialized
Adaptive Technique  The main idea is to statically decide on an initial allocation of the processors to the operator (using a cost model).  Adapt to skew using load reallocation.  Load reallocation is to detect the oversized partitions.  Partition them again onto several processors.
Adaptive Technique(Continued)  Advantage:  More dynamic adjustment of the degree of parallelism.  useful to improve intra-operator load balancing in all kinds of parallel architectures.  By reducing processor interference  Excellent load balancing for intra-operator parallelism
Adaptive Technique(Continued)  specific control operators.  Detect whether the static estimates for intermediate result sizes differ from the run-time values.  Relation redistribution in order to prevent join product skew and redistribution skew.  Depends on difference between the estimate and the real value is sufficiently high.
Specialized techniques  Two main techniques.  Range partitioning  Sampling  Avoid redistribution skew of the building relation.  Processors can get partitions of equal numbers of tuples, corresponding to different ranges of join attribute values.
Specialized techniques(Continued)  To deal with skew as follows:  Sample the building relation to determine the partitioning ranges.  Redistribute the building relation to the processors using the ranges. Each processor builds a hash table containing the incoming tuples.  Redistribute the probing relation using the same ranges to the processors. For each tuple received, each processor probes the hash table to perform the join.
Inter-Operator Load Balancing  Important to Choose for each operator  How many and which processors to assign for its execution.  Taking into account pipeline parallelism, which requires inter-operator communication.  Harder to achieve in shared-nothing for this Reasons:  Choice of the degree of parallelism cause to errors  Reason: Both processors and operators are discrete entities.
Inter-Operator Load Balancing(Continued)  Processors associated with the latest operators in a pipeline chain may remain idle a significant time.  Shared-memory allows the parallel execution of independent pipeline chains  It is known as Tasks.  Dynamically adjusting the degree of intra-operator parallelism of the tasks in order to reach maximum resource utilization.
Activations  Represents a sequential unit of work  Can be executed by any thread  Self-contained  Can only be executed in the same SM(shared memory)-node
Activation Queues Moving data activation along pipeline chains Also called table queues Threads have unrestricted access to the same SM-node queues Small number of queue results interference A thread a queue
Thread  Simple strategy for good load balancing if number of threads are higher than the processors  One thread per processor per query reduce the overhead of interference  Thread will consume activation as much as possible to limit thread interference
THANK YOU
Reference:  M. Tamer Özsu • Patrick Valduriez, Principles of Distributed Database Systems, Third Edition

Load Balancing in Parallel and Distributed Database

  • 1.
    Load Balancing in DistributedDatabase Md. Shamsur Rahim 14-98181-3 Student, MScCS, AIUB AZM Ehtesham Chowdhury 15-98451-1 Student, MScCS, AIUB Saiful Akhter 15-98502-1 Student, MScCS, AIUB
  • 2.
    Load Balancing:  Meansdistributing transaction and queries among different nodes.  The goal is to maximize the throughput.  Parallel Execution Problems  1. Initialization  2. Interference  3. Skew
  • 3.
    Parallel Execution Problems: Initialization  Initialization is necessary before execution.  This sequential steps includes  Process/ Thread Creation and initialization  Communication Initialization etc.  The duration is proportional to the degree of parallelism  The degree of parallelism should be fixed according to query complexity.  Formula for finding response time for an Operator: 𝑅𝑒𝑠𝑝𝑜𝑛𝑠𝑒𝑇𝑖𝑚𝑒 = 𝑎 ∗ 𝑛 + 𝑐∗𝑁 𝑛  The equation can be further derived to obtain: 𝑁 = 𝑡𝑜𝑢𝑝𝑙𝑒𝑠, 𝑐 = 𝑎𝑣𝑔 𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑖𝑛𝑔 𝑡𝑖𝑚𝑒 n = No. Of Processors optimal number of processors to allocate (n) maximal achievable speedup (S) 𝑛 = √ 𝑐 ∗ 𝑁 𝑎 𝑆 = 𝑛 2
  • 4.
    Parallel Execution Problems: Interferences  Parallel execution can be slowed down by interference.  Interference occurs when several processors simultaneously access the same resource,  Hardware  Solution: Duplicate Shared resource  Software.  Solution: Partition the shared resource into several independent resources
  • 5.
    Parallel Execution Problems: Skew  Problem appears with intra- operator parallelism (variation in partition size) is known as data skew.  Classification of Skew:  Attribute Value Skew : inherent in the dataset  e.g., there are more citizens in Paris than in Waterloo  Tuple Placement Skew: introduced when the data are initially partitioned  e.g., with range partitioning  Selectivity Skew  introduced when there is variation in the selectivity of select predicates on each node  Redistribution Skew  occurs in the redistribution step between two operators.  Join Product Skew  occurs because the join selectivity may vary between nodes
  • 6.
    Inter-Query Parallelism  Formof parallelism where many different Queries or Transactions are executed in parallel with one another on many processors.  Advantages:  Increases Transaction Throughput.  Scales up the Transaction processing system  Easy to implement in Shared Memory Parallel System.  Example: Oracle 8 & Oracle Rdb.
  • 7.
    Intra-Query Parallelism  Formof parallelism where Single Query is executed in parallel on many processors.  2 Types.  Intra-operation parallelism  Inter-operation parallelism  Advantages:  speed up a single complex long running queries.  Best suited for complex scientific calculations (queries).  Example: Informix, Terradata.
  • 8.
    Intra-operation parallelism  Theprocess of speeding up a query through parallelizing the execution of individual operations.  The operations which can be parallelized are Sort, Join, Projection, Selection and so on.
  • 9.
    Inter-operation parallelism  Theprocess of speeding up a query through parallelizing various operations which are part of the query.  Example Step:  A query which involves join of 4 tables executed in two processors  Each processor shall join two relations locally and the result1 and result2 can be joined further to produce the final result.
  • 10.
    Intra-Operator Load Balancing Depends on  The degree of parallelism.  Allocation of processors for the operator.  The home of the operator (the set of processors where it is executed) must be carefully decided.  The skew problem makes it hard for a parallel query optimizer to make this decision statically.  Require a very accurate and detailed cost model.
  • 11.
     Two Solutionsincorporated in a hybrid query optimizer.  Adaptive  Specialized
  • 12.
    Adaptive Technique  Themain idea is to statically decide on an initial allocation of the processors to the operator (using a cost model).  Adapt to skew using load reallocation.  Load reallocation is to detect the oversized partitions.  Partition them again onto several processors.
  • 13.
    Adaptive Technique(Continued)  Advantage: More dynamic adjustment of the degree of parallelism.  useful to improve intra-operator load balancing in all kinds of parallel architectures.  By reducing processor interference  Excellent load balancing for intra-operator parallelism
  • 14.
    Adaptive Technique(Continued)  specificcontrol operators.  Detect whether the static estimates for intermediate result sizes differ from the run-time values.  Relation redistribution in order to prevent join product skew and redistribution skew.  Depends on difference between the estimate and the real value is sufficiently high.
  • 15.
    Specialized techniques  Twomain techniques.  Range partitioning  Sampling  Avoid redistribution skew of the building relation.  Processors can get partitions of equal numbers of tuples, corresponding to different ranges of join attribute values.
  • 16.
    Specialized techniques(Continued)  Todeal with skew as follows:  Sample the building relation to determine the partitioning ranges.  Redistribute the building relation to the processors using the ranges. Each processor builds a hash table containing the incoming tuples.  Redistribute the probing relation using the same ranges to the processors. For each tuple received, each processor probes the hash table to perform the join.
  • 17.
    Inter-Operator Load Balancing Important to Choose for each operator  How many and which processors to assign for its execution.  Taking into account pipeline parallelism, which requires inter-operator communication.  Harder to achieve in shared-nothing for this Reasons:  Choice of the degree of parallelism cause to errors  Reason: Both processors and operators are discrete entities.
  • 18.
    Inter-Operator Load Balancing(Continued)  Processorsassociated with the latest operators in a pipeline chain may remain idle a significant time.  Shared-memory allows the parallel execution of independent pipeline chains  It is known as Tasks.  Dynamically adjusting the degree of intra-operator parallelism of the tasks in order to reach maximum resource utilization.
  • 19.
    Activations  Represents asequential unit of work  Can be executed by any thread  Self-contained  Can only be executed in the same SM(shared memory)-node
  • 20.
    Activation Queues Moving dataactivation along pipeline chains Also called table queues Threads have unrestricted access to the same SM-node queues Small number of queue results interference A thread a queue
  • 21.
    Thread  Simple strategyfor good load balancing if number of threads are higher than the processors  One thread per processor per query reduce the overhead of interference  Thread will consume activation as much as possible to limit thread interference
  • 22.
  • 23.
    Reference:  M. TamerÖzsu • Patrick Valduriez, Principles of Distributed Database Systems, Third Edition