overview of storage and indexing
play

Overview of Storage and Indexing Chapter 8 Instructor: Vladimir - PDF document

Overview of Storage and Indexing Chapter 8 Instructor: Vladimir Zadorozhny vladimir@sis.pitt.edu Information Science Program School of Information Sciences, University of Pittsburgh 1 Database Management Systems, R. Ramakrishnan and J.


  1. Overview of Storage and Indexing Chapter 8 Instructor: Vladimir Zadorozhny vladimir@sis.pitt.edu Information Science Program School of Information Sciences, University of Pittsburgh 1 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny Data on External Storage  Disks: Can retrieve random page at fixed cost  But reading several consecutive pages is much cheaper than reading them in random order  File organization: Method of arranging a file of records on external storage.  Record id (rid) is sufficient to physically locate record  Indexes are data structures that allow us to find the record ids of records with given values in index search key fields  Architecture: Buffer manager stages pages from external storage to main memory buffer pool. File and index layers make calls to the buffer manager. 2 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

  2. Indexes  An index on a file speeds up selections on the search key fields for the index.  Any subset of the fields of a relation can be the search key for an index on the relation.  Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation).  An index supports efficient retrieval of all data entries with a given key value k . 3 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny Index Classification  Primary vs. secondary : If search key contains primary key, then called primary index.  Unique index: Search key contains a candidate key.  Clustered vs. unclustered : If order of data records is the same as, or `close to ’ , order of data entries, then called clustered index.  A file can be clustered on at most one search key.  Cost of retrieving data records through index varies greatly based on whether index is clustered or not!  Dense vs. sparse : If there is an index data entry for each data records then called dense index. 4 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

  3. B+ Tree Indexes Non-leaf Pages Leaf Pages  Leaf pages contain data entries , and are chained (prev & next)  Non-leaf pages contain index entries and direct searches: index entry P0 K 1 P 1 K 2 P m P 2 K m 5 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny B+ Tree  Insert/delete at log F N cost; keep tree height- balanced . (F = fanout, N = # leaf pages)  Minimum 50% occupancy (except for root). Each node contains d <= m <= 2 d entries. The parameter d is called the order of the tree.  Supports equality and range-searches efficiently. Index Entries (Direct search) Data Entries ("Sequence set") 6 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

  4. Queries on B + -Trees  Find all records with a search-key value of k.  Start with the root node • Examine the node for the smallest search-key value > k. • If such a value exists, assume it is K j . The follow P i to the child node • Otherwise k  K m – 1 , where there are m pointers in the node. Then follow P m to the child node.  If the node reached by following the pointer above is not a leaf node, repeat the above procedure on the node, and follow the corresponding pointer.  Eventually reach a leaf node. If key K i = k, follow pointer P i to the desired record or bucket. Else no record with search-key value k exists. 7 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny Example B+ Tree Root 17 Entries < 17 Entries >= 17 5 13 27 30 2* 3* 33* 34* 38* 39* 5* 7* 8* 22* 24* 27* 29* 14* 16*  Find 28*? 29*? All > 15* and < 30*  Insert/delete: Find data entry in leaf, then change it. Need to adjust parent sometimes.  And change sometimes bubbles up the tree 8 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

  5. B+ Trees in Practice  Typical order: 100. Typical fill-factor: 67%.  average fanout = 133  Typical capacities:  Height 4: 133 4 = 312,900,700 records  Height 3: 133 3 = 2,352,637 records  Can often hold top levels in buffer pool:  Level 1 = 1 page = 8 Kbytes  Level 2 = 133 pages = 1 Mbyte  Level 3 = 17,689 pages = 133 MBytes 9 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny Index Definition in SQL  Create an index create index <index-name> on <relation-name> (<attribute-list>) E.g.: create index b-index on branch(branch-name)  Use create unique index to indirectly specify and enforce the condition that the search key is a candidate keyis a candidate key.  To drop an index drop index <index-name> 10 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

  6. Multiple-Key Access  Use multiple indices for certain types of queries.  Example: select account-number from account where branch-name = “ Perryridge ” and balance = 1000  Possible strategies for processing query using indices on single attributes: 1. Use index on branch-name to find accounts with branch-name = “ Perryridge ” .; test balance = $1000. 2. Use index on balance to find accounts with balances of $1000; test branch-name = “ Perryridge ” . 3. Use branch-name index to find pointers to all records pertaining to the Perryridge branch. Similarly use index on balance . Take intersection of both sets of pointers obtained. 11 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny Indices on Multiple Attributes Suppose we have an index on combined search-key ( branch-name, balance ).  With the where clause where branch-name = “ Perryridge ” and balance = 1000 the index on the combined search-key will fetch only records that satisfy both conditions. Using separate indices in less efficient — we may fetch many records (or pointers) that satisfy only one of the conditions. 12 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

  7. Choice of Indexes  What indexes should we create?  Which relations should have indexes? What field(s) should be the search key? Should we build several indexes?  For each index, what kind of an index should it be?  Clustered? Hash/tree? 13 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny Hash-Based Indexes  Good for equality selections. • Index is a collection of buckets. Bucket = primary page plus zero or more overflow pages. • Hashing function h : h ( r ) = bucket in which record r belongs. h looks at the search key fields of r. 14 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

  8. Summary  Many alternative file organizations exist, each appropriate in some situation.  If selection queries are frequent, sorting the file or building an index is important.  Hash-based indexes only good for equality search.  Sorted files and tree-based indexes best for range search; also good for equality search. (Files rarely kept sorted in practice; B+ tree index is better.)  Index is a collection of data entries plus a way to quickly find entries with given key values. 15 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny Summary (Contd.)  Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs.  Choice orthogonal to indexing technique used to locate data entries with a given key value.  Can have several indexes on a given file of data records, each with a different search key.  Indexes can be classified as clustered vs. unclustered, primary vs. secondary, and dense vs. sparse. Differences have important consequences for utility/performance. 16 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

Recommend


More recommend