Comment Sparse indices are uncommon Because Only a primary index can be a sparse index This requires that the file is sorted on the search key Most commonly used indexes is: secondary index Very flexible: The file does not need to be sorted 30 / 104
Multi-level Index If primary index does not fit in memory, access becomes expensive. Solution: treat primary index kept on disk as a sequential file and construct a sparse index on it. outer index - a sparse index of primary index inner index - the primary index file If even outer index is too large to fit in main memory, yet another level of index can be created, and so on. Indices at all levels must be updated on insertion or deletion from the file. 31 / 104
Two-Level Index Example 32 / 104
Comment { FILE,INDEX } may be layed out on disk as either contiguous or blocks chained strategy 33 / 104
Question Can we (do we want to) build a dense, 2 nd level index for a dense index? 34 / 104
Question Can we (do we want to) build a dense 2 nd level index for a dense index? Second and higher level indexes must be sparse, otherwise no savings The reason is that a dense index on an index would have exactly as many key-pointer pairs as the first-level index, and therefore would take exactly as much space as the first-level index. 35 / 104
Notes on pointers 1) Block pointer (used in sparse index) can be smaller than record pointer (used in dense index) 2) If file is contiguous, then we can omit pointers (i.e., compute them) 36 / 104
Notes on pointers 37 / 104
Notes on pointers 38 / 104
Notes on pointers 39 / 104
Sparse vs. Dense Tradeoff Sparse uses much less space (Later: sparse better for insertions) Dense unlike sparse, can tell if any record exists without accessing file (Later: dense needed for secondary indexes) 40 / 104
Next Duplicate keys Deletion/Insertion Secondary indexes 41 / 104
Duplicate keys What if more than one record has a given search key value? Then the search key is not a key of the relation 42 / 104
Duplicate Search Keys with Dense Index Dense index, one way to implement? (Point to each value) one entry with key K for each record of the data file that has search key K To find all data records with search key K , follow all the pointers in the index with search key K 43 / 104
Duplicate Search Keys with Dense Index Dense index, better way? 44 / 104
Duplicate Search Keys with Dense Index keys Dense index, better way? Point to each distinct value! only keep record in index for first data record with each search key value (saves some space in the index) To find all data records with search key K , follow the one pointer in the index and then move forward in the data file 45 / 104
Duplicate search keys with sparse index Sparse index, one way? key-pointer pairs corresponding to the first search key on each block of the data file. Find all records with search key 20? Search key =10? 46 / 104
Duplicate search keys with sparse index To find all data records with search key K : find last entry ( E 1 ) in index with key ≤ K move towards front of index until either reaching entry ( E 2 ) with key < K or come to the 1st entry check data blocks pointed to by entries from E 2 to E 1 for records with search key K 47 / 104
Next Duplicate keys Deletion/Insertion Secondary indexes 48 / 104
Deletion from sparse index 49 / 104
Deletion from sparse index delete record 40 50 / 104
Deletion from sparse index delete record 40 51 / 104
Deletion from sparse index delete record 30 52 / 104
Deletion from sparse index delete record 30 53 / 104
Deletion from sparse index delete records 30 & 40 54 / 104
Deletion from sparse index delete records 30 & 40 55 / 104
Deletion from sparse index delete records 30 & 40 56 / 104
Deletion from dense index 57 / 104
Deletion from dense index delete record 30 58 / 104
Deletion from dense index delete record 30 59 / 104
Deletion from dense index delete record 30 60 / 104
Insertion, sparse index case 61 / 104
Insertion, sparse index case insert record 34 62 / 104
Insertion, sparse index case insert record 15 63 / 104
Insertion, sparse index case insert record 15 Illustrated: Immediate reorganization Variation insert new block (chained file) update index 64 / 104
Insertion, sparse index case insert record 25 65 / 104
Insertion, sparse index case insert record 25 66 / 104
Insertion, dense index case Similar Often more expensive . . . 67 / 104
Topics Conventional indexes Basic Ideas: sparse, dense, multi-level . . . Duplicate Keys Deletion/Insertion Secondary indexes B-Trees Hashing schemes 68 / 104
Secondary Indexes Sometimes we want multiple indexes on a relation. Ex: search Candies(name,manf) both by name and by manufacturer Typically the file would be sorted using the key (ex: name ) and the primary index would be on that field. The secondary index is on any other attribute (ex: manf ). Secondary index also facilitates finding records, but cannot rely on them being sorted 69 / 104
Recall: Secondary index Secondary index : an ordered index whose search key is NOT the sort key used for the sequential file 70 / 104
Sparse Secondary Index? 71 / 104
Sparse Secondary Index? 72 / 104
Sparse Secondary Index? 73 / 104
Sparse Secondary Index? No! Since records are not sorted on that key, cannot predict the location of a record from the location of any other record. Thus secondary indexes are always dense . 74 / 104
Design of Secondary Indexes Always dense , usually with duplicates Consists of key-pointer pairs (“key” means search key , not relation key) Entries in index file are sorted by key value Therefore second-level index is sparse (if we wish to place a second level of index) 75 / 104
Secondary indexes 76 / 104
Secondary indexes 77 / 104
Secondary indexes 78 / 104
Secondary indexes Lowest level is dense Other levels are sparse Also: Pointers are record pointers (not block pointers; not computed) 79 / 104
Secondary Index and Duplicate Keys Scheme in previous diagram wastes space in the present of duplicate keys If a search key value appears n times in the data file, then there are n entries for it in the index. 80 / 104
Secondary Index and Duplicate Keys one option 81 / 104
Secondary Index and Duplicate Keys one option 82 / 104
Secondary Index and Duplicate Keys one option 83 / 104
Secondary Index and Duplicate Keys another option 84 / 104
Secondary Index and Duplicate Keys another option 85 / 104
Secondary Index and Duplicate Keys another option 86 / 104
Secondary Index and Duplicate Keys another idea: Chain records with same key option 87 / 104
Secondary Index and Duplicate Keys another idea: Chain records with same key option 88 / 104
Buckets To avoid repeating values, use a level of indirection Put buckets between the secondary index file and the data file One entry in index for each search key K ; its pointer goes to a location in a “bucket file”, called the “bucket” for K Bucket holds pointers to all records with search key K 89 / 104
Secondary Index and Duplicate Keys 90 / 104
Why “bucket” idea is useful Saves space as long as search-keys are larger than pointers and average key appears at least twice We can use the pointers in the buckets to help answer queries without ever looking at most of the records in the data file. When there are several conditions to a query, and each condition has a secondary index to help it, find the bucket pointers that satisfy all the conditions by intersecting sets of pointers in memory, and retrieving only the records pointed to by the surviving pointers. Save the I/O cost of retrieving records that satisfy some, but not all, of the conditions 91 / 104
Why “bucket” idea is useful Consider the relation Emp (name, dept, floor) Suppose we have a primary index on name , secondary indexes with indirect buckets on both dept and floor . Query: SELECT name Emp FROM WHERE dept = ’Toy ’ AND f l o o r = 2; 92 / 104
Query: Get employees in (Toy Dept) & (2nd floor) Intersect Toy dept bucket and floor 2 bucket to get set of matching Emp ’s Retrieving the minimum possible number of data blocks. Saves disk I/O’s 93 / 104
Inverted files Inverted file defines index over non-unique, non-ordering search key of data set Index entries: < key value, block address > Block address refers to block containing record pointers or block pointers to all records with that particular key value Requires additional random block access to block with pointers to records Queries that involve multiple attribute types can be executed efficiently by taking the intersection of blocks with pointers 94 / 104
Inverted files 95 / 104
This idea used in text information retrieval 96 / 104
This idea used in text information retrieval inverted index consists of a set of word-pointer pairs; the words are in effect the search key for the index. 97 / 104
IR Queries Find articles with “cat” and “dog” Find articles with “cat” or “dog” Find articles with “cat” and not “dog” Find articles with “cat” in title Find articles with “cat” and “dog” within 5 words 98 / 104
Common technique: more info in inverted list 99 / 104
Summary so far Conventional index Basic Ideas: sparse, dense, multi-level Duplicate Keys Deletion/Insertion Secondary indexes 100 / 104
Recommend
More recommend