why is this important
play

Why Is This Important? Now that we know about the benefits of - PDF document

Why Is This Important? Now that we know about the benefits of indexes, how does the DBMS know when to use them? Overview of Query Evaluation An SQL query can be implemented in many ways, but which one is best? Perform selection before


  1. Why Is This Important?  Now that we know about the benefits of indexes, how does the DBMS know when to use them? Overview of Query Evaluation  An SQL query can be implemented in many ways, but which one is best?  Perform selection before or after join etc. Chapter 12  Many ways of physically implementing a join (or other relational operator), how to choose the right one?  The DBMS does this automatically, but we need to understand it to know what performance to expect 1 2 Overview of Query Evaluation Some Common Techniques  SQL query is implemented by a query plan  Algorithms for evaluating relational operators use  Tree of relational operators some simple ideas extensively: • `Pull ’ interface: when an operator is `pulled’ for the next output tuples , it `pulls’ on its inputs and computes them.  Indexing: Can use WHERE conditions to retrieve small set • Can change structure of tree of tuples (selections, joins) • Can choose different operator implementations  Iteration: Sometimes, faster to scan all tuples even if there  Two main issues in query optimization:  For a given query, what plans are considered? is an index. (And sometimes, we can scan the data entries • Algorithm to search plan space for cheapest (estimated) plan. in an index instead of the table itself.)  How is the cost of a plan estimated?  Partitioning: By using sorting or hashing, we can partition  Ideally: Want to find best plan. the input tuples and replace an expensive operation by  Practically: Avoid worst plans! similar operations on smaller inputs.  We will study the System R approach. * Watch for these techniques as we discuss query evaluation! 3 4 Statistics and Catalogs Access Paths  Need information about the relations and indexes  Access path = way of retrieving tuples: involved. Catalog typically contains:  File scan, or index that matches a selection (in the query)  #tuples (NTuples) and #pages (NPages) for each relation.  Cost depends heavily on access path selected  #distinct key values (NKeys), INPages, and low/high key values  A tree index matches (a conjunction of) conditions that (ILow/IHigh) for each index. involve only attributes in a prefix of the search key.  Index height (IHeight) for each tree index.  E.g., Tree index on <a, b, c> matches “a=5 AND b=3” and “a=5  Catalog data stored in tables; can be queried AND b>6”, but not “b=3”.  Catalogs updated periodically.  A hash index matches (a conjunction of) conditions that  Updating whenever data changes is too expensive; costs are has a term attribute = value for every attribute in the approximate anyway, so slight inconsistency ok. search key of the index.  More detailed information (e.g., histograms of the values  E.g., Hash index on <a, b, c> matches “a=5 AND b=3 AND c=5”; in some field) sometimes stored. but not “b=3”, “a=5 AND b=3”, or “a>5 AND b=3 AND c=5”. 5 6

  2. A Note on Complex Selections Selectivity of Access Paths (day<8/9/94 AND rname =‘Paul’) OR bid=5 OR sid=3  Selectivity = #pages retrieved (index + data pages)  Find the most selective access path, retrieve tuples using it, and apply any remaining terms that don’t match the index:  Selection conditions are first converted to  Terms that match the index reduce the number of tuples conjunctive normal form (CNF): retrieved  E.g., (day<8/9/94 OR bid=5 OR sid=3 ) AND  Other terms are used to discard some retrieved tuples, but do (rname =‘Paul’ OR bid=5 OR sid=3) not affect number of tuples fetched.  We only discuss case with no ORs; see text if you are  Consider “day < 8/9/94 AND bid=5 AND sid =3”. curious about the general case. • Can use B+ tree index on day; then check bid=5 and sid=3 for each retrieved tuple • Could similarly use a hash index on <bid,sid>; then check day < 8/9/94 7 8 SELECT DISTINCT Using an Index for Selections Projection R.sid, R.bid Reserves R FROM  Without index on R.rname, have to scan  The expensive part is removing duplicates.  DBMS does not remove duplicates by default.  Index cost depends on #qualifying tuples and clustering.  Sorting Approach  Cost of finding qualifying data entries (small) plus cost of  Sort on <sid, bid> and remove duplicates: scan of Reserves retrieving records (could be large). (1000 pages), plus 2-3 more passes of projected data set (~1000  Data: 100K tuples on 1000 pages pages)  Assuming uniform distribution of names, about 10% of tuples  Hashing Approach qualify (100 pages, 10000 tuples).  Hash on <sid, bid> to create partitions.  Clustered index on rname: little more than 100 I/O  Load partitions into memory one at a time  Unclustered index on rname: up to 10,000 I/O! • Build in-memory hash structure, eliminate duplicates.  Scan of Reserves (1000 pages), plus write and read projected SELECT * data (~500 I/O); but could be more Reserves R  If there is an index with all selected attributes in the FROM search key, use index-only access on index leaves. WHERE R.rname < ‘C%’ 9 10 Join: Index Nested Loops Examples of Index Nested Loops foreach tuple r in R do foreach tuple s in S where r i == s j do  Join Sailors and Reserves on sid add <r, s> to result  Assumption: R has 100K tuples on 1000 pages; S has 40K tuples on 500 pages  Naïve implementation: scan S for each tuple in R  Hash-index (Alt. 2) on sid of Sailors (as inner):  #pages(R) + |R| * #pages(S) page accesses  Scan Reserves: 1000 page I/Os, 100K tuples.  Improved by block nested loops: foreach block of R, process  For each Reserves tuple: 1.2 I/Os to get data entry in hash index, plus each block from S 1 I/O to get (the exactly one) matching Sailors tuple. Total: 220,000  #pages(R) + #pages(R) * #pages(S) page accesses I/Os.  Can do even better with an index on the join column of one  Hash-index (Alt. 2) on sid of Reserves (as inner): relation (say S) by making it the inner.  Scan Sailors: 500 page I/Os, 40K tuples.  Cost: #pages(R) + ( |R| * costOfFindingMatchingStuples )  For each Sailors tuple: 1.2 I/Os to find index page with data entries,  For each R tuple, cost of probing S index is about 1.2 I/O for hash plus cost of retrieving matching Reserves tuples. Assuming uniform index, 2-4 for B+ tree. Cost of then finding S tuples (assuming Alt. (2) distribution, 2.5 reservations per sailor (100,000 / 40,000). Cost of or (3) for data entries) depends on clustering. retrieving them from heap file is 2.5 I/Os. Total: 148,000 I/Os. • Clustered index: 1 I/O (typical), unclustered: upto 1 I/O per matching S tuple. 11 12

Recommend


More recommend