Query Processing and Optimization
Basic Concepts 2 • Query Processing – activities involved in retrieving data from the database: – SQL query translation into low-level language implementing relational algebra – Query execution • Query Optimization – selection of an efficient query execution plan
Phases of Query Processing 3
Relational Algebra • Relational algebra defines basic operations on relation instances • Results of operations are also relation instances 4
Basic Operations • Unary algebra operations: – Selection – Projection • Binary algebra operations: – Union – Set difference – Cross-product 5
Additional Operations • Can be expressed through 5 basic operations: – Join – Intersection – Division 6
Selection σcriterion (I) where criterion – selection condition, and I- an instance of a relation. • Result: – the same schema – A subset of tuples from the instance I • Criterion: conjunction (AND) and disjunction (OR) • Comparison operators: <,<=,=,≠,>=,> 7
Projection • Vertical subset of input relation instance • The schema of the result : – is determined by the list of desired fields – types of fields are inherited Πa1,a2,…,am (I), where a1,a2,…,am – desired fields from the relation with the instance I 8
Binary Operations • Union-compatible relations: – The same number of fields – Corresponding fields have the same domains • Union of 2 relations • Intersection of 2 relations • Set-difference • Cross-product – does not require union- compatibility Marina G. Erechtchoukova 9
Joins • Join is defined as cross-product followed by selections • Based on the conditions, joins are classified: – Theta-joins – Natural joins – Other… 10
Theta Join RCond S = σCond (R x S) Where Cond – refers to the attributes of both relations R and S in the form of comparison expressions with operators: <,<=,=,≠,>=,> 11
Relational Algebra Expressions • The result of a relational operation is a relation instance • Relational algebra expression combines relation instances using relational algebra operations • Relational algebra expression produces the result of a query 12
Simple SQL Query SELECT select-list  πselect-list FROM from-list  Cross Product WHERE qualification;  σqualification 13
Conceptual Evaluation Strategy for Simple Query • Compute the cross-product of tables in from- list • Delete those rows which fail the qualification condition • Delete all columns that do not appear in the select-list • If DISTINCT clause is specified, eliminate duplicate rows. 14
Nested Queries • Query block: – Single SELECT_FROM_WHERE expression – May include GROUP BY and HAVING • Query block – basic unit that is translated into RA expression and optimized • SQL query is decomposed into query blocks 15
Different Processing Strategies • Algorithms implementing basic relational algebra operations • Algorithms implementing additional relational algebra operations • Example: Find the students who have marks higher than 75 and are younger than 23 16
Query Decomposition • Analysis – Relational algebra tree • Normalization • Semantic analysis • Simplification • Query restructuring 17
Analysis • Analyze query using compiler techniques • Verify that relations and attributes exist • Verify that operations are appropriate for object type • Transform the query into some internal representation 18
Relational Algebra Tree • Leaf nodes are created for each base relation. • Non-leaf nodes are created for each intermediate relation produced by RA operation. • Root of the tree represents query result. • Sequence is directed from leaves to root. 19
Relational Algebra Tree (Cont…) 20 Root Intermediate operations Intermediate operations Leaves …
Criterion Normalization • Conjunctive normal form – a sequence of boolean expressions connected by conjunction (AND): – Each expression contains terms of comparison operators connected by disjunctions (OR) • Disjunctive normal form – a sequence of boolean expressions connected by disjunction (OR): – Each expression contains terms of comparison operators connected by conjunction (AND) 21
Criterion Normalization (Cont…) • Arbitrary complex qualification condition can be converted into one of the normal forms • Algorithms for computation: – CNF – only tuples that satisfy all expressions – DNF – tuples that are the result of union of tuples that satisfy the exprssions 22
Semantic Analysis • Applied to normalized queries • Rejects contradictory queries: – Qualification condition cannot be satisfied by any tuple • Rejects incorrectly formulated queries: – Condition components do not contribute to generation of the result. 23
Relation Connection Graph • Conjunctive queries without negation • Each node corresponds to a base relation and the result • An edge between two nodes is created: – If there a join – If a node is a source for projection. • If the graph is not connected, the query is incorrectly formulated 24
Simplification • Eliminates redundancy in qualification • Queries against views: – Access privileges – Redundancy in qualification • Transform query to equivalent efficiently computed form • Main tool – rules of boolean algebra 25
Queries against Views • View resolution: – View select-list is translated into corresponding select-list in the view defining query – From-list of the query is modified to hold the names of base tables – Qualifications from WHERE clause are combined – GROUP BY and HAVING clauses are modified 26
Rules of Boolean Algebra ptruep pfalsep falsefalsep ppp ppp ≡∧ ≡∨ ≡∧ ≡∨ ≡∧ )( )( pqpp pqpp truepp falsepp truetruep ≡∧∨ ≡∨∧ ≡¬∨ ≡¬∧ ≡∨ )( )( )( )( 27
Query Restructuring • Rewriting a query using relational algebra operations • Modifying relational algebra expression to provide more efficient implementation 28
Query Optimization • Optimization criteria: – Reduce total execution time of the query: • Minimize the sum of the execution times of all individual operations • Reduce the number of disk accesses – Reduce response time of the query: • Maximize parallel operations • Dynamic vs. static optimization 29
Heuristic Approach • Heuristic - problem-solving by experimental methods • Applying general rules to choose the most appropriate internal query representation • Based on transformation rules for relational algebra operations 30
Transformation Rules • Cascade of selection operations: • Commutativity of selection operations • Sequence of projection operations where )...( )(... NML R LNML ∩∩⊂ ∏=∏∏∏ )))((()( RR rqprqp σσσσ =∧∧ 31 ))(())(( RR pqqp σσσσ =
Transformation Rules (Cont…) • Commutativity of selection and projection where p involves only attributes from {A1,…,Am} • Commutativity of binary operations ; ; ; ))(())(( ,...,,..., 11 RR mm AAppAA ∏=∏ σσ 32 RSSR RSSR pp ×=× =  RSSR RSSR ∪=∪ ∩=∩
Transformation Rules (Cont…) • Commutativity of selection and theta join • Commutativity of projection and theta join Where A1contains only attributes from R and A2- only attributes from S SRRR rprp  ))(()( σσ = 33 )()()( 2121 SRSR ArArAA ∏∏=∏ ∪ 
Transformation Rules (Cont…) • Commutativity of projection and union • Associativity of binary operations 34 )()()( SRSR LLL ∏∪∏=∪∏ ).()( );()( );()( );()( TSRTSR TSRTSR TRSTRR TRSTSR ××=×× = ∩∩=∩∩ ∪∪=∪∪ 
Heirustic Rules • Perform selection as early as possible • Combine Cross product with a subsequent selection • Rearrange base relations so that the most restrictive selection is executed first. • Perform projection as early as possible • Compute common expressions once. 35
Cost Estimation Components • Cost of access to secondary storage • Storage cost – cost of storing intermediate results • Computation cost • Memory usage cost – usage of RAM buffers 36
Cost Estimation for Relational Algebra Expressions • Formulae for cost estimation of each operation • Estimation of relational algebra expression • Choosing the expression with the lowest cost 37
Cost Estimation in Query Optimization • Based on relational algebra tree • For each node in the tree the estimation is to be done for: – the cost of performing the operation; – the size of the result of the operation; – whether the result is sorted. 38
Database Statistics for a Relation • Cardinality of relation instance • Block (of tuples) – page • Number of blocks required to store a relation (data) • Blocking factor – number of tuples in one block • Number of blocks required to store an index 39
Database Statistics for an Attribute of a Relation • The number of distinct values • Possible minimum and maximum values • Selection cardinality of an attribute: – For equality condition on the attribute – For inequality condition on the attribute 40
Algorithms for Relational Algebra Operations Implementation • Linear search • Binary search • Sort-merge • External sorting • Hashing 41
File Organization • The physical arrangement of data in a file into records and blocks (pages) on secondary storage • Storing and retrieving data depends on the file organization 42
Heap Files • Unordered files • Records are placed in the file in the same order as they are inserted • If there is insufficient space in the last block, a new block is added. • Records are retrieved based on scan 43
Ordered Files • Files sorted on the values of the ordering fields • Ordering key – ordering fields with unique constraint • Under certain conditions records can be retrieved based on binary search 44
Hash Files • Records are randomly distributed across the available space • To store a record the address of the block (page) is calculated by Hash function • Blocks are kept at about 80% occupancy • To retrieve the data all blocks are scanned which is about 1.25 times more than for heap files 45
Indexes • A data structure that allows the DBMS to locate particular records • Index files are not required but very helpful • Index files can be ordered by the values of indexing fields 46
Retrieval Algorithms • Files without indexes: – Records are selected by scanning data files • Indexed files: – Matching selection condition – Records are selected by scanning index files and finding corresponding blocks in data files 47
Search Space • Collection of possible execution strategies for a query • Strategies can use: – Different join ordering – Different selection methods – Different join methods • Enumeration algorithm – an algorithm to determine an optimal strategy from the search space 48
Pipelining • Materialization - saving intermediate results in a temporary table • Pipelining – submitting the results of one operation to another operation without creating a temporary table • A pipeline is implemented for each join operation • Requires specific algorithms 49
Linear Trees • In a linear tree at least one child of a join node is a base relation • Left-deep tree – the right child of each join node is a base relation • Right-deep tree – the left child of each join node is a base relation • Bushy tree – non-linear tree 50
Left-Deep Tree • Supports fully pipelined strategies • Advantage: – Reduces search space • Disadvantage: – Excludes alternative strategies which may be of a lower cost 51
Query Optimization in Oracle • Rule-based optimizer – Specify the goal in init.ora file OPTIMIZER_MODE = RULE • Cost-based optimizer – Specify the goal in init.ora file OPTIMIZER_MODE = CHOOSE 52
Rule-Based Optimizer • 15 rules are ranked • RowID describes the physical location of the record • RowID is associated with table indeces • Access path for a table only chosen if statement contains a predicate or other construct that makes that access path available. 53
Cost-Based Optimizer • Statistics: – ANALYZE - command to generates statistics – PL/SQL package DBMS_STAT • Hints – To access full table – To use a rule – To use a certain index – … 54
Example • SELECT /*+ full(student) */ sname FROM student WHERE Y_of_B = 1983; 55

Query processing-and-optimization

  • 1.
  • 2.
    Basic Concepts 2 • QueryProcessing – activities involved in retrieving data from the database: – SQL query translation into low-level language implementing relational algebra – Query execution • Query Optimization – selection of an efficient query execution plan
  • 3.
    Phases of QueryProcessing 3
  • 4.
    Relational Algebra • Relationalalgebra defines basic operations on relation instances • Results of operations are also relation instances 4
  • 5.
    Basic Operations • Unaryalgebra operations: – Selection – Projection • Binary algebra operations: – Union – Set difference – Cross-product 5
  • 6.
    Additional Operations • Canbe expressed through 5 basic operations: – Join – Intersection – Division 6
  • 7.
    Selection σcriterion (I) where criterion –selection condition, and I- an instance of a relation. • Result: – the same schema – A subset of tuples from the instance I • Criterion: conjunction (AND) and disjunction (OR) • Comparison operators: <,<=,=,≠,>=,> 7
  • 8.
    Projection • Vertical subsetof input relation instance • The schema of the result : – is determined by the list of desired fields – types of fields are inherited Πa1,a2,…,am (I), where a1,a2,…,am – desired fields from the relation with the instance I 8
  • 9.
    Binary Operations • Union-compatiblerelations: – The same number of fields – Corresponding fields have the same domains • Union of 2 relations • Intersection of 2 relations • Set-difference • Cross-product – does not require union- compatibility Marina G. Erechtchoukova 9
  • 10.
    Joins • Join isdefined as cross-product followed by selections • Based on the conditions, joins are classified: – Theta-joins – Natural joins – Other… 10
  • 11.
    Theta Join RCond S =σCond (R x S) Where Cond – refers to the attributes of both relations R and S in the form of comparison expressions with operators: <,<=,=,≠,>=,> 11
  • 12.
    Relational Algebra Expressions •The result of a relational operation is a relation instance • Relational algebra expression combines relation instances using relational algebra operations • Relational algebra expression produces the result of a query 12
  • 13.
    Simple SQL Query SELECTselect-list  πselect-list FROM from-list  Cross Product WHERE qualification;  σqualification 13
  • 14.
    Conceptual Evaluation Strategyfor Simple Query • Compute the cross-product of tables in from- list • Delete those rows which fail the qualification condition • Delete all columns that do not appear in the select-list • If DISTINCT clause is specified, eliminate duplicate rows. 14
  • 15.
    Nested Queries • Queryblock: – Single SELECT_FROM_WHERE expression – May include GROUP BY and HAVING • Query block – basic unit that is translated into RA expression and optimized • SQL query is decomposed into query blocks 15
  • 16.
    Different Processing Strategies •Algorithms implementing basic relational algebra operations • Algorithms implementing additional relational algebra operations • Example: Find the students who have marks higher than 75 and are younger than 23 16
  • 17.
    Query Decomposition • Analysis –Relational algebra tree • Normalization • Semantic analysis • Simplification • Query restructuring 17
  • 18.
    Analysis • Analyze queryusing compiler techniques • Verify that relations and attributes exist • Verify that operations are appropriate for object type • Transform the query into some internal representation 18
  • 19.
    Relational Algebra Tree •Leaf nodes are created for each base relation. • Non-leaf nodes are created for each intermediate relation produced by RA operation. • Root of the tree represents query result. • Sequence is directed from leaves to root. 19
  • 20.
    Relational Algebra Tree(Cont…) 20 Root Intermediate operations Intermediate operations Leaves …
  • 21.
    Criterion Normalization • Conjunctivenormal form – a sequence of boolean expressions connected by conjunction (AND): – Each expression contains terms of comparison operators connected by disjunctions (OR) • Disjunctive normal form – a sequence of boolean expressions connected by disjunction (OR): – Each expression contains terms of comparison operators connected by conjunction (AND) 21
  • 22.
    Criterion Normalization (Cont…) •Arbitrary complex qualification condition can be converted into one of the normal forms • Algorithms for computation: – CNF – only tuples that satisfy all expressions – DNF – tuples that are the result of union of tuples that satisfy the exprssions 22
  • 23.
    Semantic Analysis • Appliedto normalized queries • Rejects contradictory queries: – Qualification condition cannot be satisfied by any tuple • Rejects incorrectly formulated queries: – Condition components do not contribute to generation of the result. 23
  • 24.
    Relation Connection Graph •Conjunctive queries without negation • Each node corresponds to a base relation and the result • An edge between two nodes is created: – If there a join – If a node is a source for projection. • If the graph is not connected, the query is incorrectly formulated 24
  • 25.
    Simplification • Eliminates redundancyin qualification • Queries against views: – Access privileges – Redundancy in qualification • Transform query to equivalent efficiently computed form • Main tool – rules of boolean algebra 25
  • 26.
    Queries against Views •View resolution: – View select-list is translated into corresponding select-list in the view defining query – From-list of the query is modified to hold the names of base tables – Qualifications from WHERE clause are combined – GROUP BY and HAVING clauses are modified 26
  • 27.
    Rules of BooleanAlgebra ptruep pfalsep falsefalsep ppp ppp ≡∧ ≡∨ ≡∧ ≡∨ ≡∧ )( )( pqpp pqpp truepp falsepp truetruep ≡∧∨ ≡∨∧ ≡¬∨ ≡¬∧ ≡∨ )( )( )( )( 27
  • 28.
    Query Restructuring • Rewritinga query using relational algebra operations • Modifying relational algebra expression to provide more efficient implementation 28
  • 29.
    Query Optimization • Optimizationcriteria: – Reduce total execution time of the query: • Minimize the sum of the execution times of all individual operations • Reduce the number of disk accesses – Reduce response time of the query: • Maximize parallel operations • Dynamic vs. static optimization 29
  • 30.
    Heuristic Approach • Heuristic- problem-solving by experimental methods • Applying general rules to choose the most appropriate internal query representation • Based on transformation rules for relational algebra operations 30
  • 31.
    Transformation Rules • Cascadeof selection operations: • Commutativity of selection operations • Sequence of projection operations where )...( )(... NML R LNML ∩∩⊂ ∏=∏∏∏ )))((()( RR rqprqp σσσσ =∧∧ 31 ))(())(( RR pqqp σσσσ =
  • 32.
    Transformation Rules (Cont…) •Commutativity of selection and projection where p involves only attributes from {A1,…,Am} • Commutativity of binary operations ; ; ; ))(())(( ,...,,..., 11 RR mm AAppAA ∏=∏ σσ 32 RSSR RSSR pp ×=× =  RSSR RSSR ∪=∪ ∩=∩
  • 33.
    Transformation Rules (Cont…) •Commutativity of selection and theta join • Commutativity of projection and theta join Where A1contains only attributes from R and A2- only attributes from S SRRR rprp  ))(()( σσ = 33 )()()( 2121 SRSR ArArAA ∏∏=∏ ∪ 
  • 34.
    Transformation Rules (Cont…) •Commutativity of projection and union • Associativity of binary operations 34 )()()( SRSR LLL ∏∪∏=∪∏ ).()( );()( );()( );()( TSRTSR TSRTSR TRSTRR TRSTSR ××=×× = ∩∩=∩∩ ∪∪=∪∪ 
  • 35.
    Heirustic Rules • Performselection as early as possible • Combine Cross product with a subsequent selection • Rearrange base relations so that the most restrictive selection is executed first. • Perform projection as early as possible • Compute common expressions once. 35
  • 36.
    Cost Estimation Components •Cost of access to secondary storage • Storage cost – cost of storing intermediate results • Computation cost • Memory usage cost – usage of RAM buffers 36
  • 37.
    Cost Estimation forRelational Algebra Expressions • Formulae for cost estimation of each operation • Estimation of relational algebra expression • Choosing the expression with the lowest cost 37
  • 38.
    Cost Estimation inQuery Optimization • Based on relational algebra tree • For each node in the tree the estimation is to be done for: – the cost of performing the operation; – the size of the result of the operation; – whether the result is sorted. 38
  • 39.
    Database Statistics fora Relation • Cardinality of relation instance • Block (of tuples) – page • Number of blocks required to store a relation (data) • Blocking factor – number of tuples in one block • Number of blocks required to store an index 39
  • 40.
    Database Statistics foran Attribute of a Relation • The number of distinct values • Possible minimum and maximum values • Selection cardinality of an attribute: – For equality condition on the attribute – For inequality condition on the attribute 40
  • 41.
    Algorithms for RelationalAlgebra Operations Implementation • Linear search • Binary search • Sort-merge • External sorting • Hashing 41
  • 42.
    File Organization • Thephysical arrangement of data in a file into records and blocks (pages) on secondary storage • Storing and retrieving data depends on the file organization 42
  • 43.
    Heap Files • Unorderedfiles • Records are placed in the file in the same order as they are inserted • If there is insufficient space in the last block, a new block is added. • Records are retrieved based on scan 43
  • 44.
    Ordered Files • Filessorted on the values of the ordering fields • Ordering key – ordering fields with unique constraint • Under certain conditions records can be retrieved based on binary search 44
  • 45.
    Hash Files • Recordsare randomly distributed across the available space • To store a record the address of the block (page) is calculated by Hash function • Blocks are kept at about 80% occupancy • To retrieve the data all blocks are scanned which is about 1.25 times more than for heap files 45
  • 46.
    Indexes • A datastructure that allows the DBMS to locate particular records • Index files are not required but very helpful • Index files can be ordered by the values of indexing fields 46
  • 47.
    Retrieval Algorithms • Fileswithout indexes: – Records are selected by scanning data files • Indexed files: – Matching selection condition – Records are selected by scanning index files and finding corresponding blocks in data files 47
  • 48.
    Search Space • Collectionof possible execution strategies for a query • Strategies can use: – Different join ordering – Different selection methods – Different join methods • Enumeration algorithm – an algorithm to determine an optimal strategy from the search space 48
  • 49.
    Pipelining • Materialization -saving intermediate results in a temporary table • Pipelining – submitting the results of one operation to another operation without creating a temporary table • A pipeline is implemented for each join operation • Requires specific algorithms 49
  • 50.
    Linear Trees • Ina linear tree at least one child of a join node is a base relation • Left-deep tree – the right child of each join node is a base relation • Right-deep tree – the left child of each join node is a base relation • Bushy tree – non-linear tree 50
  • 51.
    Left-Deep Tree • Supportsfully pipelined strategies • Advantage: – Reduces search space • Disadvantage: – Excludes alternative strategies which may be of a lower cost 51
  • 52.
    Query Optimization inOracle • Rule-based optimizer – Specify the goal in init.ora file OPTIMIZER_MODE = RULE • Cost-based optimizer – Specify the goal in init.ora file OPTIMIZER_MODE = CHOOSE 52
  • 53.
    Rule-Based Optimizer • 15rules are ranked • RowID describes the physical location of the record • RowID is associated with table indeces • Access path for a table only chosen if statement contains a predicate or other construct that makes that access path available. 53
  • 54.
    Cost-Based Optimizer • Statistics: –ANALYZE - command to generates statistics – PL/SQL package DBMS_STAT • Hints – To access full table – To use a rule – To use a certain index – … 54
  • 55.
    Example • SELECT /*+full(student) */ sname FROM student WHERE Y_of_B = 1983; 55