Storage and Indexing (continued) CMPSCI 645 Mar 4, 2008 Slides Courtesy of R. Ramakrishnan and J. Gehrke 1
Today Index selection & performance tuning Storing data: disks and files 2
Cost of Operations Scan Equality Range Insert Delete Heap File BD .5BD BD 2D Search + D Sorted File BD Dlog 2 B D(log 2 B + Search + Search + #matching BD BD pages) Clustered 1.5BD Dlog F 1.5 D(log F 1.5B + Search + Search + #matching Tree Index B D D pages) Unclustered BD(R+. D(1+log F. D(log F .15B + Search + Search + #matching Tree Index 15) . 15B) 3D 3D recs ) Unclustered BD(R+. 2D BD 4D 4D Hash Index 125) Several assumptions underlie these (rough) estimates!
Index selection For each query in the workload: Which relations does it access? Which attributes are retrieved? Which attributes are involved in selection/join conditions? How selective are these conditions likely to be? For each update in the workload: The type of update ( INSERT/DELETE/UPDATE ), and the attributes that are affected. 4
Choice of Indexes What indexes should we create? Which relations should have indexes? What field(s) should be the search key? Should we build several indexes? For each index, what kind of an index should it be? Hash/tree? Clustered? 5
Choice of Indexes (Contd.) One approach: Consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it. We must understand how a DBMS evaluates queries and creates query evaluation plans! For now, we discuss simple 1-table queries. Before creating an index, must also consider the impact on updates in the workload! Trade-off: Indexes can make queries go faster, updates slower. Require disk space, too. 6
Index Selection Guidelines Attributes in WHERE clause are candidates for index keys. Exact match condition suggests hash index. Range query suggests tree index. • Clustering is especially useful for range queries; can also help on equality queries if there are many duplicates. Multi-attribute search keys should be considered when a WHERE clause contains several conditions. Order of attributes is important for range queries. Such indexes can sometimes enable index-only strategies for important queries. • For index-only strategies, clustering is not important! Choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering. 7
Examples of Clustered Indexes SELECT E.dno FROM Emp E WHERE E.age>40 B+ tree index on E.age can be used to get qualifying tuples. How selective is the condition? Is the index clustered? 8
Examples of Clustered Indexes Compare index on < age >, index on < dno >. SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>10 GROUP BY E.dno Consider the GROUP BY query. If many tuples have E.age > 10, using E.age index and sorting the retrieved tuples by E.dno may be costly. Clustered E.dno index may be better! 9
Examples of Clustered Indexes SELECT E.dno FROM Emp E WHERE E.hobby=’Stamps’ Equality queries and duplicates: Clustering on E.hobby helps! 10
Index-Only Plans Some queries can be answered without < E.dno > retrieving any tuples SELECT E.dno, COUNT (*) from one or more of FROM Emp E the relations involved, GROUP BY E.dno if a suitable index is available. 11
Index-Only Plans Tree index < E.dno,E.sal > SELECT E.dno, MIN (E.sal) FROM Emp E GROUP BY E.dno What about < E.sal,E.dno > ? 12
Index-Only Plans Tree index < E.age, E.sal > or < E.sal, E.age > SELECT AVG (E.sal) FROM Emp E WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000 13
Index-Only Plans (Contd.) Index-only plans SELECT E.dno, COUNT (*) are possible if we FROM Emp E have a tree index WHERE E.age=30 GROUP BY E.dno with key <dno,age> or <age,dno> SELECT E.dno, COUNT (*) Which is better? FROM Emp E What if we consider WHERE E.age>30 the second query? GROUP BY E.dno 14
Creating indexes in SQL SQL:1999 standard does not include any statement for creating or dropping index structures! CREATE INDEX age_index USING BTREE ON Emp(age) 15
Summary Understanding the nature of the workload for the application, and the performance goals, is essential to developing a good design. What are the important queries and updates? What attributes/relations are involved? Indexes must be chosen to speed up important queries (and perhaps some updates!). Index maintenance overhead on updates to key fields. Choose indexes that can help many queries, if possible. Build indexes to support index-only strategies. Clustering is an important decision; only one index on a given relation can be clustered! Order of fields in composite index key can be important. 16
Disks and Files 17
Disks and DBMS Design DBMS stores information on disks. This has major implications for DBMS design! READ: transfer data from disk to main memory (RAM) for data processing. WRITE: transfer data from RAM to disk for persistent storage. Both are high-cost operations, relative to in-memory operations, so must be planned carefully! 18
Why Not Store Everything in Main Memory? Main memory is volatile . We want data to be saved between runs. (Obviously!) Costs too much . $100 will buy you either 1GB of RAM or 160GB of disk today. 32-bit addressing limitation . 2 32 bytes can be directly addressed in memory. Number of objects cannot exceed this number. 19
Arranging Pages on Disk ` Next ’ block concept: blocks on same track, followed by blocks on same cylinder, followed by blocks on adjacent cylinder Blocks in a file should be arranged sequentially on disk (by `next’), to minimize seek and rotational delay. For a sequential scan, pre-fetching several pages at a time is a big win! 20
Buffer Management in a DBMS Page Requests from Higher Levels BUFFER POOL disk page free frame MAIN MEMORY DISK choice of frame dictated DB by replacement policy Data must be in RAM for DBMS to operate on it! Table of <frame#, pageid> pairs is maintained. 21
More on Buffer Management Requestor of page must unpin it, and indicate whether page has been modified: dirty bit is used for this. Page in pool may be requested many times, a pin count is used. A page is a candidate for replacement iff pin count = 0. CC & recovery may entail additional I/O when a frame is chosen for replacement. ( Write-Ahead Log protocol; more later.) 22
When a Page is Requested ... If requested page is not in pool: Choose a frame for replacement If frame is dirty, write it to disk Read requested page into chosen frame Pin the page and return its address. If requests can be predicted (e.g., sequential scans) pages can be pre-fetched several pages at a time! 23
Buffer Replacement Policy Frame is chosen for replacement by a replacement policy: Least-recently-used (LRU), Clock, MRU etc. Policy can have big impact on # of I/O’s; depends on the access pattern . Sequential flooding : Nasty situation caused by LRU + repeated sequential scans. # buffer frames < # pages in file means each page request causes an I/O. MRU much better in this situation (but not in all situations, of course). 24
DBMS vs. OS File System OS does disk space & buffer mgmt: why not let OS manage these tasks? Differences in OS support: portability issues Some limitations, e.g., files can’t span disks. Buffer management in DBMS requires ability to: pin a page in buffer pool, force a page to disk (important for implementing CC & recovery), adjust replacement policy, and pre-fetch pages based on access patterns in typical DB operations. 25
Files Access method layer offers an abstraction of data on disk: a file of records residing on multiple pages A number of fields are organized in a record A collection of records are organized in a page A collection of pages are organized in a file 26
Files of Records Page or block is OK when doing I/O, but higher levels of DBMS operate on records and files of records . FILE : A collection of pages, each containing a collection of records. Must support: insert/delete/modify record read a particular record (specified using record id ) scan all records (possibly with some conditions on the records to be retrieved) 27
Unordered (Heap) Files Simplest file structure contains records in no particular order. As file grows and shrinks, disk pages are allocated and de-allocated. To support record level operations, we must: keep track of the pages in a file keep track of free space on pages keep track of the records on a page There are many alternatives for keeping track of this. 28
Recommend
More recommend