Overview of Query Evaluation [R&G] Chapter 12 CS4320 1

Overview of Query Evaluation � Plan : Tree of R.A. ops, with choice of alg for each op. � Each operator typically implemented using a `pull’ interface: when an operator is `pulled’ for the next output tuples, it `pulls’ on its inputs and computes them. � Two main issues in query optimization: � For a given query, what plans are considered? • Algorithm to search plan space for cheapest (estimated) plan. � How is the cost of a plan estimated? � Ideally: Want to find best plan. Practically: Avoid worst plans! � We will study the System R approach. CS4320 2

Some Common Techniques � Algorithms for evaluating relational operators use some simple ideas extensively: � Indexing: Can use WHERE conditions to retrieve small set of tuples (selections, joins) � Iteration: Sometimes, faster to scan all tuples even if there is an index. (And sometimes, we can scan the data entries in an index instead of the table itself.) � Partitioning: By using sorting or hashing, we can partition the input tuples and replace an expensive operation by similar operations on smaller inputs. * Watch for these techniques as we discuss query evaluation! CS4320 3

Statistics and Catalogs � Need information about the relations and indexes involved. Catalogs typically contain at least: � # tuples (NTuples) and # pages (NPages) for each relation. � # distinct key values (NKeys) and NPages for each index. � Index height, low/high key values (Low/High) for each tree index. � Catalogs updated periodically. � Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency ok. � More detailed information (e.g., histograms of the values in some field) are sometimes stored. CS4320 4

Access Paths � An access path is a method of retrieving tuples: � File scan, or index that matches a selection (in the query) � A tree index matches (a conjunction of) terms that involve only attributes in a prefix of the search key. � E.g., Tree index on < a, b, c > matches the selection a=5 AND b=3 , and a=5 AND b>6 , but not b=3 . � A hash index matches (a conjunction of) terms that has a term attribute = value for every attribute in the search key of the index. � E.g., Hash index on < a, b, c > matches a=5 AND b=3 AND c=5 ; but it does not match b=3, or a=5 AND b=3, or a>5 AND b=3 AND c=5 . CS4320 5

A Note on Complex Selections (day<8/9/94 AND rname=‘Paul’) OR bid=5 OR sid=3 � Selection conditions are first converted to conjunctive normal form (CNF): (day<8/9/94 OR bid=5 OR sid=3 ) AND (rname=‘Paul’ OR bid=5 OR sid=3) � We only discuss case with no ORs; see text if you are curious about the general case. CS4320 6

One Approach to Selections � Find the most selective access path , retrieve tuples using it, and apply any remaining terms that don’t match the index: � Most selective access path: An index or file scan that we estimate will require the fewest page I/Os. � Terms that match this index reduce the number of tuples retrieved ; other terms are used to discard some retrieved tuples, but do not affect number of tuples/pages fetched. � Consider day<8/9/94 AND bid=5 AND sid=3. A B+ tree index on day can be used; then, bid=5 and sid=3 must be checked for each retrieved tuple. Similarly, a hash index on < bid, sid > could be used; day<8/9/94 must then be checked. CS4320 7

Using an Index for Selections � Cost depends on #qualifying tuples, and clustering. � Cost of finding qualifying data entries (typically small) plus cost of retrieving records (could be large w/o clustering). � In example, assuming uniform distribution of names, about 10% of tuples qualify (100 pages, 10000 tuples). With a clustered index, cost is little more than 100 I/Os; if unclustered, upto 10000 I/Os! SELECT * Reserves R FROM WHERE R.rname < ‘C%’ CS4320 8

SELECT DISTINCT Projection R.sid, R.bid Reserves R FROM � The expensive part is removing duplicates. � SQL systems don’t remove duplicates unless the keyword DISTINCT is specified in a query. � Sorting Approach: Sort on <sid, bid> and remove duplicates. (Can optimize this by dropping unwanted information while sorting.) � Hashing Approach: Hash on <sid, bid> to create partitions. Load partitions into memory one at a time, build in-memory hash structure, and eliminate duplicates. � If there is an index with both R.sid and R.bid in the search key, may be cheaper to sort data entries! CS4320 9

Join: Index Nested Loops foreach tuple r in R do foreach tuple s in S where r i == s j do add <r, s> to result � If there is an index on the join column of one relation (say S), can make it the inner and exploit the index. � Cost: M + ( (M*p R ) * cost of finding matching S tuples) � M=#pages of R, p R =# R tuples per page � For each R tuple, cost of probing S index is about 1.2 for hash index, 2-4 for B+ tree. Cost of then finding S tuples (assuming Alt. (2) or (3) for data entries) depends on clustering. � Clustered index: 1 I/O (typical), unclustered: upto 1 I/O per matching S tuple. CS4320 10

Examples of Index Nested Loops � Hash-index (Alt. 2) on sid of Sailors (as inner): � Scan Reserves: 1000 page I/Os, 100*1000 tuples. � For each Reserves tuple: 1.2 I/Os to get data entry in index, plus 1 I/O to get (the exactly one) matching Sailors tuple. Total: 220,000 I/Os. � Hash-index (Alt. 2) on sid of Reserves (as inner): � Scan Sailors: 500 page I/Os, 80*500 tuples. � For each Sailors tuple: 1.2 I/Os to find index page with data entries, plus cost of retrieving matching Reserves tuples. Assuming uniform distribution, 2.5 reservations per sailor (100,000 / 40,000). Cost of retrieving them is 1 or 2.5 I/Os depending on whether the index is clustered. CS4320 11

> < Join: Sort-Merge (R S) i=j � Sort R and S on the join column, then scan them to do a ``merge’’ (on join col.), and output result tuples. � Advance scan of R until current R-tuple >= current S tuple, then advance scan of S until current S-tuple >= current R tuple; do this until current R tuple = current S tuple. � At this point, all R tuples with same value in Ri ( current R group ) and all S tuples with same value in Sj ( current S group ) match ; output <r, s> for all pairs of such tuples. � Then resume scanning R and S. � R is scanned once; each S group is scanned once per matching R tuple. (Multiple scans of an S group are likely to find needed pages in buffer.) CS4320 12

Example of Sort-Merge Join sid bid day rname 28 103 12/4/96 guppy sid sname rating age 28 103 11/3/96 yuppy 22 dustin 7 45.0 31 101 10/10/96 dustin 28 yuppy 9 35.0 31 102 10/12/96 lubber 31 lubber 8 55.5 31 101 10/11/96 lubber 44 guppy 5 35.0 58 103 11/12/96 dustin 58 rusty 10 35.0 � Cost: M log M + N log N + (M+N) � The cost of scanning, M+N, could be M*N (very unlikely!) � With 35, 100 or 300 buffer pages, both Reserves and Sailors can be sorted in 2 passes; total join cost: 7500. CS4320 13

Highlights of System R Optimizer � Impact: � Most widely used currently; works well for < 10 joins. � Cost estimation: Approximate art at best. � Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes. � Considers combination of CPU and I/O costs. � Plan Space: Too large, must be pruned. � Only the space of left-deep plans is considered. • Left-deep plans allow output of each operator to be pipelined into the next operator without storing it in a temporary relation. � Cartesian products avoided. CS4320 14

Cost Estimation � For each plan considered, must estimate cost: � Must estimate cost of each operation in plan tree. • Depends on input cardinalities. • We’ve already discussed how to estimate the cost of operations (sequential scan, index scan, joins, etc.) � Must also estimate size of result for each operation in tree! • Use information about the input relations. • For selections and joins, assume independence of predicates. CS4320 15

Size Estimation and Reduction Factors SELECT attribute list FROM relation list � Consider a query block: WHERE term1 AND ... AND termk � Maximum # tuples in result is the product of the cardinalities of relations in the FROM clause. � Reduction factor (RF) associated with each term reflects the impact of the term in reducing result size. Result cardinality = Max # tuples * product of all RF’s. � Implicit assumption that terms are independent! � Term col=value has RF 1/NKeys(I), given index I on col � Term col1=col2 has RF 1/ MAX (NKeys(I1), NKeys(I2)) � Term col>value has RF (High(I)-value)/(High(I)-Low(I)) CS4320 16

Schema for Examples Sailors ( sid : integer, sname : string, rating : integer, age : real) Reserves ( sid : integer, bid : integer, day : dates, rname : string) � Similar to old schema; rname added for variations. � Reserves: � Each tuple is 40 bytes long, 100 tuples per page, 1000 pages. � Sailors: � Each tuple is 50 bytes long, 80 tuples per page, 500 pages. CS4320 17

Overview of Query Evaluation [R&G] Chapter 12 CS4320 1 - PowerPoint PPT Presentation

Overview of Query Evaluation [R&G] Chapter 12 CS4320 1 Overview of Query Evaluation Plan : Tree of R.A. ops, with choice of alg for each op. Each operator typically implemented using a `pull interface: when an operator is

Improve Query Performance with the Query Log Analyzer Kees Vegter Field Engineer Query Log

Query Execution 2 and Query Optimization Instructor: Matei Zaharia cs245.stanford.edu Query

Query Processing Relevance feedback; query expansion; Web Search 1 Overview Indexes Query

Query Understanding: A Manifesto Daniel Tunkelang queryunderstanding.com Overview What is

CAS CS 460/660 Introduction to Database Systems Query Evaluation II 1.1 Cost-based Query

Perfect Query FORMULA 5 critical sections in every successful query letter (c) 2019

Query Op)miza)on 1 Query op)miza)on Given an SQL query,

CS4224/CS5424 Lecture 9 Distributed Query Processing Query Processing Translates query into a

Module 13: Optimizing Query Performance Overview Introduction to the Query Optimizer

Trade-offs in Static and Dynamic Query Evaluation Ahmet Kara, Milos Nikolic Dan Olteanu, and

Chapter 12. Evaluation Research Chapter 12. Evaluation Research evaluation research? evaluation

User Interface Evaluation Empirical evaluation Heuristic evaluation 1 CS 349 - UI evaluation

Information Retrieval > Query Us User er Query Words Query Words Search Personalization

Chapter 3: Top-k Query Processing and Indexing 3.1 Top-k Algorithms 3.2 Approximate Top-k Query

Query Execuon Declarave Query (SQL) We start from

Database Systems II Query Compiler CMPT 454, Simon Fraser University, Fall 2009, Martin Ester

Previous Lecture Introduction to SMIL 2 Introduction to W3C and XML Introduction to SMIL

www.drupaleurope.org No photos please image Responsible disclosure, cross-project

PROGETTAZIONE E MANAGEMENT DEL MULTIMEDIA PER LA COMUNICAZIONE XVIII EDIZIONE 2014/2015

a11y and UX Junaid Masoodi Accessibility enables people with disabilities to perceive,

Social Media Platform Thomas Bauerei Armando Pesenti Gritti Andrei Popescu Franco Raimondi

Quantitative Text Analysis. Applications to Social Media Research Pablo Barber a London

US 77 Homestead Expressway I-80 Managing Expectations Communication is Key - Public

An Introduction to Social Mining Vladimir Gorovoy and Yana Volkovich @yvolkovich

Explore More Topics

Sambuz

Useful Links

Newsletter

Mail Us