Query processing and Optimization in Database

QUERY PROCESSING AND OPTIMIZATION

2 QUERY PROCESSING CYCLE • Scanner: Breaks the query into meaningful tokens (keywords, table names, attribute names, symbols). • Parser: Checks the syntax of the query to ensure it adheres to the rules of the query language (SQL grammar). • Validator: Ensures that all tokens (attribute names and table names) are semantically valid and consistent with the database schema. • Together, these steps ensure that the query is syntactically correct and semantically meaningful before proceeding to further processing stages.

QUERY PROCESSING Query processing involves transforming a high-level SQL query into a correct and efficient execution strategy for retrieving results from a database. It includes decomposition, optimization, execution, and code generation.

1. QUERY DECOMPOSITION • Query decomposition involves breaking down a high- level SQL query into a relational algebra query while ensuring its correctness and efficiency. The process includes: a) Analysis: Checking syntax, semantics, and creating a query tree. b) Normalization: Converting the query into CNF or DNF. c) Semantic Analysis: Rejecting incorrect or contradictory queries. d) Simplification: Optimizing the query by eliminating redundancies and enforcing access permissions. • This step ensures that the query is clean, executable, and ready for the optimization phase. 4

BASIC ALGORITHMS FOR EXECUTING QUERY OPERATIONS… 5 Algorithm Purpose Key idea Efficient Nested Loop Iterates over rows in two tables. Compare each row with all rows. Simple but inefficient for large data. Sort-Merge Combines rows after sorting on join keys. Sort both tables, then merge. Efficient for large, sorted datasets. Hash Join Uses a hash table for matching rows. Build hash table, probe for matches. Fast for large datasets without sorting. Index Scan Quickly locate rows using an index. Use index to avoid full scans. Extremely efficient for selective queries.

QUERY OPTIMIZATION Query Optimization is the process of selecting the most efficient execution strategy to process a query from the many possible alternatives. The DBMS uses statistics, query structure, and available algorithms to determine this efficient plan. Although the chosen plan may not be the absolute best, it is one that minimizes resource usage, including CPU time, I/O operations, and response time.

7 KEY ASPECTS OF QUERY OPTIMIZATION 1. Relational Algebra Expression: Determine which equivalent algebraic expression leads to the most efficient execution plan. 2. Algorithm Choice: Select the algorithm to compute each algebraic operator (e.g., Hash Join, Sort-Merge Join). 3. Data Flow: Decide how operations pass data between main memory and disk buffers. 4. Resource Minimization: Ensure the chosen execution plan minimizes CPU time, I/O cost, and response time.

TECHNIQUES FOR QUERY OPTIMIZATION 1. Heuristic Optimization: • Use rules of thumb to reorder operations in a query to reduce intermediate results. Goal: • Narrow down the dataset as quickly as possible (e.g., apply SELECT and PROJECT before JOIN). 2. Selectivity in Query Optimization: 3. Cost-Based Optimization: • Systematically estimate the cost of various strategies and choose the one with the lowest cost. • Cost includes factors like disk I/O, CPU processing, and buffer space usage. 9

10 TRANSFORMATION RULES FOR RELATIONAL ALGEBRA 1. Cascade of SELECTION • Rule: Multiple selection conditions combined with logical AND can be written as a single selection operation. • Example: • Initial Query: σSalary > 50,000(σAge > 30(Employees)) • Optimized Query: σSalary > 50,000 Age > 30(Employees) ∧ • Explanation: Instead of performing two separate SELECTION operations, the conditions can be combined into a single SELECTION operation, reducing the number of operations.

11 TRANSFORMATION RULES FOR RELATIONAL ALGEBRA 2. Commutativity of SELECTION • Rule: The order of applying selection conditions does not matter. • Example: • Initial Query: σAge > 30(σDepartment = 'HR'(Employees)) • Equivalent Query: σDepartment = 'HR'(σAge > 30(Employees)) • Explanation: Whether you filter employees older than 30 first or those in the HR department, the result remains the same.

12 TRANSFORMATION RULES FOR RELATIONAL ALGEBRA 3. Cascade of Projection • Rule: In a sequence of PROJECTION (π) operations, only the last one is necessary. • Example: • Initial Query: πName, Age(πName, Age, Salary(Employees)) • Optimized Query: πName, Age(Employees) • Explanation: The intermediate PROJECTION operations are unnecessary. Only the final projection is needed to produce the desired attributes.

13 TRANSFORMATION RULES FOR RELATIONAL ALGEBRA 4. Commutativity of Selection with Projection • Rule: Selection and projection can be swapped if the selection condition involves only the attributes in the projection list. • Example: • Initial Query: πName, Age(σAge > 30(Employees)) • Equivalent Query: σAge > 30(πName, Age(Employees)) • Explanation: The result is the same whether you project the attributes first and then filter by age, or vice versa, as long as the SELECTION predicate involves only projected attributes

14 TRANSFORMATION RULES FOR RELATIONAL ALGEBRA 5. Commutativity of Theta Join or Cartesian Product • Rule: For both Cartesian Product (×) and Theta Join ( θ) ⋉ , the order of relations does not matter. • Explanation: This property enables flexibility in optimizing query execution plans. 6. Commutativity of SELECTION with THETA JOIN • Rule: If the SELECTION predicate involves only attributes of one of the relations being joined, the SELECTION and JOIN operations can be interchanged.  Case a: SELECTION Predicate Involves Only Attributes of One Relation  Case b: SELECTION Predicate Involves Attributes of Both Relations

15 TRANSFORMATION RULES FOR RELATIONAL ALGEBRA Case a: SELECTION Predicate Involves Only Attributes of One Relation • Explanation: If the predicate c1 involves only attributes of R, you can first filter R and then join with S. Case b: SELECTION Predicate Involves Attributes of Both Relations • Explanation: If c1 involves only attributes of R and c2 involves only attributes of S, you can filter each relation first and then join them.

16 TRANSFORMATION RULES FOR RELATIONAL ALGEBRA 7. Commutativity of PROJECTION and THETA JOIN • Rule: Project relevant attributes before joining to reduce intermediate data size. • Explanation: Instead of projecting after the join, you can project relevant attributes from each relation before performing the join. 8. Commutativity of UNION and INTERSECTION Rule: UNION ( ) and INTERSECTION ( ) operations are commutative, but ∪ ∩ SET DIFFERENCE ( ) is not. − Explanation: The order of UNION and INTERSECTION does not affect the result. However, for SET DIFFERENCE, the order matters because R S is not − the same as S R. −

17 TRANSFORMATION RULES FOR RELATIONAL ALGEBRA 9. Associativity of THETA JOIN, CARTESIAN PRODUCT, UNION, and INTERSECTION • Rule: These operations are associative, meaning the grouping of operations does not affect the final result. 10. Commuting SELECTION with SET OPERATIONS • Rule: SELECTION (σ) operations can commute with UNION ( ) and ∪ INTERSECTION ( ). ∩

18 TRANSFORMATION RULES FOR RELATIONAL ALGEBRA 11. Commuting Projection with Union • Rule: PROJECTION (π) operations can commute with UNION ( ). ∪ • Example: • Initial Query: πName, Age(R S) ∪ • Equivalent Query: (πName, Age(R)) (πName, Age(S)) ∪ • Explanation: Instead of projecting attributes after the UNION, you can project relevant attributes from each relation first.

19 SEQUENCE FOR APPLYING TRANSFORMATION RULES

20 PROCESS FOR HEURISTIC OPTIMIZATION Heuristic optimization in query processing involves using rule- based techniques to transform a query into a more efficient form. 1. Initial Internal Representation: A high-level query, like SQL, is translated into an initial internal representation, often as a relational algebra tree. This tree represents the logical steps for executing the query. 2. Applying Heuristic Rules: Heuristic rules are applied to optimize the internal representation. a) Selection Pushdown: Move selection (σ) operations as close to the base relations as possible. b) Projection Pushdown: Move projection (π) operations down the query tree to eliminate unnecessary columns early. c) Join Reordering: Rearrange join operations to minimize the size of intermediate results. d) Combining Operations: Merge adjacent operations that can be executed together more efficiently (e.g., combining consecutive projections).

21 PROCESS FOR HEURISTIC OPTIMIZATION Main Principle: Apply operations that reduce the size of intermediate results (e.g., SELECT and PROJECT) before performing binary operations like JOIN or UNION. 3. Generating a Query Execution Plan: After applying heuristic rules, the optimized internal representation is used to generate a query execution plan. Key Considerations for Execution Plans: • Access Paths: Determines the best method to retrieve data, such as using indexes or sequential scans. • Join Strategies: Chooses appropriate join methods, such as nested loop joins, hash joins, or sort-merge joins. • Optimization Goal: Minimize query execution time by selecting efficient paths and methods.

22 HEURISTIC APPROACH TO OPTIMIZATION Intermediate results are temporary data sets produced during query execution before arriving at the final result. These results exist only during query execution and discarded once the final result is produced. Components of the Heuristic Approach: • Properties of Operators: Utilize operator properties, such as commutativity and associativity, to reorganize query trees. • Association Between Operators: Leverage relationships between operators to combine or rearrange operations. • Query Tree: A graphical representation of operators, relations, attributes, predicates, and processing sequence.  Execution starts from the leaves (base relations), proceeds to intermediate nodes (operations), and ends at the root (final result).

CONTD. The query block is the basic unit of optimization. It contains: • A single SELECT-FROM-WHERE expression. • Optional GROUP BY and HAVING clauses. Nested queries within a query are treated as separate query blocks. Nested queries can be categorized into two types: 1. Uncorrelated Nested Queries: can be executed independently of the outer query. Example: The inner query is executed first, producing results used by the outer query. 23

CONTD. 2. Correlated Nested Queries: depend on values from the outer query for their execution. Example: The inner query is executed repeatedly, once for each row in the outer query. Query graph is a graphical representation of a database query that shows how different tables (relations) and conditions are related. • It is useful for visualizing joins, selections, and projections, and helps optimize queries by identifying efficient execution strategies. 24

25 CONTD. Key Elements of a Query Graph: • Nodes: • Single Circles: Represent relations (tables). • Double Circles or Ovals: Represent constant values, typically from selection conditions. • Edges: • Represent join or selection conditions between relations or attributes. • Attributes: • Displayed above each node to show the columns involved in projections.

26 CONTD. Initial Query Graph: Nodes: Represent PROJECT, DEPARTMENT, and EMPLOYEE relations. Edges: Represent join conditions (P.Dnum = D.Dnumber, D.Mgrssn = E.Ssn) and selection conditions (P.Plocation = ‘Stafford’). Optimized Query: A combined selection condition with P.Dnum = D.Dnumber, D.Mgr_ssn = E.Ssn, and P.Plocation = 'Stafford’ …..(combined selection more efficient) Detail example on query optimization Video < click here

27 USING SELECTIVITY IN QUERY OPTIMIZATION Selectivity is a measure of how many rows a condition filters from a table. It is usually expressed as a fraction: • Low Selectivity: Filters more rows, resulting in fewer records (more restrictive). • High Selectivity: Filters fewer rows, resulting in more records (less restrictive). The Goal: To optimize queries, apply low selectivity conditions first to reduce the size of intermediate results. This minimizes the processing cost for subsequent operations.

28 USING COST ESTIMATION IN QUERY OPTIMIZATION The Cost Estimation Approach the query optimizer estimates the cost of executing a query in different ways and selects the strategy with the lowest cost. The cost function is typically composed of • I/O Cost: Number of disk accesses (reads/writes). • CPU Cost: Cost of performing operations like sorting, searching, and merging. • Storage Cost: Cost of storing intermediate results in primary memory. • Network Cost / Communication: Data transfer cost over a network (for distributed queries). The optimizer uses these metrics to estimate the total cost and pick the optimal query plan.

29 KEY COST COMPONENTS FOR QUERY EXECUTION 1. Access Cost to Secondary Storage Queries require reading or writing data from the disk. Factors Influencing Access Cost: • File organization: Is the data stored sequentially or scattered? • Access method: E.g., using indexes reduces access time compared to a full table scan. Example: • A full table scan is costly as it reads all rows from the disk. • Using an index scan reduces disk access by reading only the relevant rows.

30 KEY COST COMPONENTS FOR QUERY EXECUTION 2. Storage Cost When processing a query, intermediate results are often generated before producing the final output. Storing these results in memory consumes space. Goal is to minimize the size of intermediate results. • Impact: Larger intermediate results increase memory usage and may require additional disk writes. 3. Computation Cost:- Queries often require computational operations, such as: • Searching for data. • Sorting rows (e.g., for ORDER BY). • Merging rows (e.g., during joins). • Performing calculations on column values. • Example: Sorting a table with millions of rows has a high computational cost compared to filtering data first.

31 KEY COST COMPONENTS FOR QUERY EXECUTION 4. Communication Cost In distributed database systems, data is often transported across the network. Communication cost refers to the data transfer cost between: • Database servers and clients. • Different servers storing parts of the data. • Optimization: Reduce data transfer by minimizing the amount of data being communicated.  Query execution plan is a set of steps the database engine follows to execute a query efficiently. It includes: • Relational Algebra Query Tree: Logical structure of the query. • Access Methods: How data is accessed (e.g., full scan, index scan). • Operator Execution Methods: Algorithms for join, sort, and filter operations. • The optimizer evaluates multiple execution plans and chooses the one with the lowest estimated cost.

32 USING SEMANTIC IN QUERY OPTIMIZATION Semantic Query Optimization improves performance by using the logical meaning of data to simplify queries, eliminate redundancies, and rewrite them for efficiency. Key Techniques in Semantic Query Optimization: 1. Integrity Constraints to simplify joins. 2. Redundant Data Elimination to remove unnecessary operations. Example: Query: If the Age column has a constraint that values are between 0 and 99, the condition Age < 100 is redundant. Optimized: Reduces computation by removing unnecessary checks.

33 CONTD. 3. Query Rewriting to transform queries into more efficient forms. Example: Query: In this query, the optimizer filters Classes after the join, which processes unnecessary rows. Optimized: Reduces the number of rows processed during the join, improving performance.

THANK YOU Yordanos Zewge yordanosjordan5@gmail.com

Query processing and Optimization in Database

More Related Content

Similar to Query processing and Optimization in Database

Recently uploaded

Query processing and Optimization in Database