storage and indexing continued
play

Storage and Indexing (continued) CMPSCI 645 Mar 4, 2008 Slides - PowerPoint PPT Presentation

Storage and Indexing (continued) CMPSCI 645 Mar 4, 2008 Slides Courtesy of R. Ramakrishnan and J. Gehrke 1 Today Index selection & performance tuning Storing data: disks and files 2 Cost of Operations Scan Equality Range


  1. Storage and Indexing (continued) CMPSCI 645 Mar 4, 2008 Slides Courtesy of R. Ramakrishnan and J. Gehrke 1

  2. Today  Index selection & performance tuning  Storing data: disks and files 2

  3. Cost of Operations Scan Equality Range Insert Delete Heap File BD .5BD BD 2D Search + D Sorted File BD Dlog 2 B D(log 2 B + Search + Search + #matching BD BD pages) Clustered 1.5BD Dlog F 1.5 D(log F 1.5B + Search + Search + #matching Tree Index B D D pages) Unclustered BD(R+. D(1+log F. D(log F .15B + Search + Search + #matching Tree Index 15) . 15B) 3D 3D recs ) Unclustered BD(R+. 2D BD 4D 4D Hash Index 125)  Several assumptions underlie these (rough) estimates!

  4. Index selection  For each query in the workload:  Which relations does it access?  Which attributes are retrieved?  Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?  For each update in the workload:  The type of update ( INSERT/DELETE/UPDATE ), and the attributes that are affected. 4

  5. Choice of Indexes  What indexes should we create?  Which relations should have indexes?  What field(s) should be the search key?  Should we build several indexes?  For each index, what kind of an index should it be?  Hash/tree?  Clustered? 5

  6. Choice of Indexes (Contd.)  One approach: Consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it.  We must understand how a DBMS evaluates queries and creates query evaluation plans!  For now, we discuss simple 1-table queries.  Before creating an index, must also consider the impact on updates in the workload!  Trade-off: Indexes can make queries go faster, updates slower. Require disk space, too. 6

  7. Index Selection Guidelines  Attributes in WHERE clause are candidates for index keys.  Exact match condition suggests hash index.  Range query suggests tree index. • Clustering is especially useful for range queries; can also help on equality queries if there are many duplicates.  Multi-attribute search keys should be considered when a WHERE clause contains several conditions.  Order of attributes is important for range queries.  Such indexes can sometimes enable index-only strategies for important queries. • For index-only strategies, clustering is not important!  Choose indexes that benefit as many queries as possible.  Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering. 7

  8. Examples of Clustered Indexes SELECT E.dno FROM Emp E WHERE E.age>40  B+ tree index on E.age can be used to get qualifying tuples.  How selective is the condition?  Is the index clustered? 8

  9. Examples of Clustered Indexes Compare index on < age >, index on < dno >. SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>10 GROUP BY E.dno  Consider the GROUP BY query.  If many tuples have E.age > 10, using E.age index and sorting the retrieved tuples by E.dno may be costly.  Clustered E.dno index may be better! 9

  10. Examples of Clustered Indexes SELECT E.dno FROM Emp E WHERE E.hobby=’Stamps’  Equality queries and duplicates:  Clustering on E.hobby helps! 10

  11. Index-Only Plans  Some queries can be answered without < E.dno > retrieving any tuples SELECT E.dno, COUNT (*) from one or more of FROM Emp E the relations involved, GROUP BY E.dno if a suitable index is available. 11

  12. Index-Only Plans Tree index < E.dno,E.sal > SELECT E.dno, MIN (E.sal) FROM Emp E GROUP BY E.dno What about < E.sal,E.dno > ? 12

  13. Index-Only Plans Tree index < E.age, E.sal > or < E.sal, E.age > SELECT AVG (E.sal) FROM Emp E WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000 13

  14. Index-Only Plans (Contd.)  Index-only plans SELECT E.dno, COUNT (*) are possible if we FROM Emp E have a tree index WHERE E.age=30 GROUP BY E.dno with key <dno,age> or <age,dno> SELECT E.dno, COUNT (*)  Which is better? FROM Emp E  What if we consider WHERE E.age>30 the second query? GROUP BY E.dno 14

  15. Creating indexes in SQL  SQL:1999 standard does not include any statement for creating or dropping index structures! CREATE INDEX age_index USING BTREE ON Emp(age) 15

  16. Summary  Understanding the nature of the workload for the application, and the performance goals, is essential to developing a good design.  What are the important queries and updates? What attributes/relations are involved?  Indexes must be chosen to speed up important queries (and perhaps some updates!).  Index maintenance overhead on updates to key fields.  Choose indexes that can help many queries, if possible.  Build indexes to support index-only strategies.  Clustering is an important decision; only one index on a given relation can be clustered!  Order of fields in composite index key can be important. 16

  17. Disks and Files 17

  18. Disks and DBMS Design  DBMS stores information on disks.  This has major implications for DBMS design!  READ: transfer data from disk to main memory (RAM) for data processing.  WRITE: transfer data from RAM to disk for persistent storage.  Both are high-cost operations, relative to in-memory operations, so must be planned carefully! 18

  19. Why Not Store Everything in Main Memory?  Main memory is volatile . We want data to be saved between runs. (Obviously!)  Costs too much . $100 will buy you either 1GB of RAM or 160GB of disk today.  32-bit addressing limitation .  2 32 bytes can be directly addressed in memory.  Number of objects cannot exceed this number. 19

  20. Arranging Pages on Disk  ` Next ’ block concept:  blocks on same track, followed by  blocks on same cylinder, followed by  blocks on adjacent cylinder  Blocks in a file should be arranged sequentially on disk (by `next’), to minimize seek and rotational delay.  For a sequential scan, pre-fetching several pages at a time is a big win! 20

  21. Buffer Management in a DBMS Page Requests from Higher Levels BUFFER POOL disk page free frame MAIN MEMORY DISK choice of frame dictated DB by replacement policy  Data must be in RAM for DBMS to operate on it!  Table of <frame#, pageid> pairs is maintained. 21

  22. More on Buffer Management  Requestor of page must unpin it, and indicate whether page has been modified:  dirty bit is used for this.  Page in pool may be requested many times,  a pin count is used. A page is a candidate for replacement iff pin count = 0.  CC & recovery may entail additional I/O when a frame is chosen for replacement. ( Write-Ahead Log protocol; more later.) 22

  23. When a Page is Requested ...  If requested page is not in pool:  Choose a frame for replacement  If frame is dirty, write it to disk  Read requested page into chosen frame  Pin the page and return its address.  If requests can be predicted (e.g., sequential scans) pages can be pre-fetched several pages at a time! 23

  24. Buffer Replacement Policy  Frame is chosen for replacement by a replacement policy:  Least-recently-used (LRU), Clock, MRU etc.  Policy can have big impact on # of I/O’s; depends on the access pattern .  Sequential flooding : Nasty situation caused by LRU + repeated sequential scans.  # buffer frames < # pages in file means each page request causes an I/O. MRU much better in this situation (but not in all situations, of course). 24

  25. DBMS vs. OS File System OS does disk space & buffer mgmt: why not let OS manage these tasks?  Differences in OS support: portability issues  Some limitations, e.g., files can’t span disks.  Buffer management in DBMS requires ability to:  pin a page in buffer pool, force a page to disk (important for implementing CC & recovery),  adjust replacement policy, and pre-fetch pages based on access patterns in typical DB operations. 25

  26. Files  Access method layer offers an abstraction of data on disk: a file of records residing on multiple pages  A number of fields are organized in a record  A collection of records are organized in a page  A collection of pages are organized in a file 26

  27. Files of Records  Page or block is OK when doing I/O, but higher levels of DBMS operate on records and files of records .  FILE : A collection of pages, each containing a collection of records. Must support:  insert/delete/modify record  read a particular record (specified using record id )  scan all records (possibly with some conditions on the records to be retrieved) 27

  28. Unordered (Heap) Files  Simplest file structure contains records in no particular order.  As file grows and shrinks, disk pages are allocated and de-allocated.  To support record level operations, we must:  keep track of the pages in a file  keep track of free space on pages  keep track of the records on a page  There are many alternatives for keeping track of this. 28

Recommend


More recommend