Hash-Based Indexes Database Management Systems, R. Ramakrishnan and J. Gehrke 1 Introduction � As for any index, 3 alternatives for data entries k* : � Data record with key value k � < k , rid of data record with search key value k > � < k , list of rids of data records with search key k > � Hash-based indexes are best for equality selections . – Provide constant-time searches – But cannot support range searches � Static and dynamic hashing techniques exist – Trade-offs similar to ISAM vs. B+ trees Database Management Systems, R. Ramakrishnan and J. Gehrke 2 Static Hashing � # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed. � h ( k ) mod N = bucket to which data entry with key k belongs . (N = # of buckets) 0 h(key) mod N 2 key h N-1 Primary bucket pages Overflow pages Database Management Systems, R. Ramakrishnan and J. Gehrke 3
Static Hashing (Contd.) � Buckets contain data entries . � Hash fn works on search key field of record r. Must distribute values over range 0 ... N-1. – h ( key ) = (a * key + b) usually works well. – a and b are constants; lots known about how to tune h . � Long overflow chains can develop and degrade performance – Extendible and Linear Hashing : Dynamic techniques to fix this problem. Database Management Systems, R. Ramakrishnan and J. Gehrke 4 Extendible Hashing � Main idea: If bucket (primary page) becomes full, why not re-organize file by doubling # of buckets? – Essentially “splitting” buckets � But reading and writing all buckets is expensive! – Idea : Use directory of pointers to buckets , – Double # of buckets by doubling the directory, splitting just the bucket that overflowed! – Directory much smaller than file, so doubling it is much cheaper. – No overflow pages ! Database Management Systems, R. Ramakrishnan and J. Gehrke 5 Insert h (r)=14 2 2 LOCAL DEPTH LOCAL DEPTH Bucket A Bucket A 32*16* 4* 12* 32*16* 4* 12* GLOBAL DEPTH GLOBAL DEPTH 2 2 2 2 Bucket B Bucket B 00 1* 5* 21*13* 00 1* 5* 21*13* 01 01 2 2 10 10 Bucket C Bucket C 10* 11 10* 11 14* 2 2 DIRECTORY DIRECTORY Bucket D Bucket D 15* 7* 19* 15* 7* 19* Database Management Systems, R. Ramakrishnan and J. Gehrke 6
Insert h (r)=20 2 2 LOCAL DEPTH LOCAL DEPTH Bucket A Bucket A 32* 16* 32* 16* GLOBAL DEPTH 4* 12* GLOBAL DEPTH 2 2 2 2 Bucket B Bucket B 00 1* 5* 21*13* 00 1* 5* 21*13* 01 01 2 10 10 2 Bucket C Bucket C 11 10* 11 10* 2 2 DIRECTORY Bucket D DIRECTORY Bucket D 15* 7* 19* 15* 7* 19* 2 Bucket A2 4* 12* 20* (`split image' of Bucket A) Database Management Systems, R. Ramakrishnan and J. Gehrke 7 Insert h (r)=20 2 LOCAL DEPTH 3 LOCAL DEPTH Bucket A 32* 16* GLOBAL DEPTH 32* 16* Bucket A GLOBAL DEPTH 2 2 3 2 Bucket B 00 1* 5* 21*13* 000 1* 5* 21*13* Bucket B 01 001 2 10 2 010 Bucket C 10* 11 10* 011 Bucket C 100 2 2 DIRECTORY 101 Bucket D 15* 7* 19* 15* 7* 19* Bucket D 110 111 2 3 Bucket A2 4* 12* 20* DIRECTORY (`split image' 4* 12* 20* Bucket A2 of Bucket A) (`split image' of Bucket A) Database Management Systems, R. Ramakrishnan and J. Gehrke 8 Insert h (r)=32 LOCAL DEPTH LOCAL DEPTH GLOBAL DEPTH GLOBAL DEPTH 1 Bucket A 1 4* 12* 10* 32* 0 0 Bucket A 0 1* 10* 4* 12* 1 1 DIRECTORY Bucket B 1* DIRECTORY Database Management Systems, R. Ramakrishnan and J. Gehrke 9
Insert h (r)=16 LOCAL DEPTH 2 LOCAL DEPTH Bucket A 32* 16* 4* 12* GLOBAL DEPTH GLOBAL DEPTH 1 1 2 Bucket A 1 4* 12* 10* 32* Bucket B 1* 00 0 01 1 1 2 10 Bucket C Bucket B 10* 11 1* DIRECTORY DIRECTORY Database Management Systems, R. Ramakrishnan and J. Gehrke 10 Insert h (r)=20 3 LOCAL DEPTH 32* 16* Bucket A GLOBAL DEPTH 2 LOCAL DEPTH 3 1 Bucket A 32* 16* 4* 12* GLOBAL DEPTH 000 1* Bucket B 001 1 2 2 010 Bucket B 1* 00 10* 011 Bucket C 01 100 2 10 101 Bucket C 10* 11 110 111 DIRECTORY 3 DIRECTORY 4* 12* 20* Bucket A2 (`split image' of Bucket A) Database Management Systems, R. Ramakrishnan and J. Gehrke 11 Insert h (r)=5, 15, 7, 19 3 3 LOCAL DEPTH LOCAL DEPTH 32* 16* Bucket A 32* 16* Bucket A GLOBAL DEPTH GLOBAL DEPTH 3 1 3 2 5* 15* 7* 000 1* Bucket B 000 1* 5* Bucket B 001 001 2 2 010 010 10* Bucket C 10* Bucket C 011 011 100 100 2 101 101 15* 7* 19* Bucket B2 110 110 (`split image' 111 111 of Bucket B) 3 3 DIRECTORY DIRECTORY 4* 12* 20* Bucket A2 4* 12* 20* Bucket A2 (`split image' (`split image' of Bucket A) of Bucket A) Database Management Systems, R. Ramakrishnan and J. Gehrke 12
Deletions � Inverse of insertion � If removal of data entry makes bucket empty, merge with ‘split image’ � If each directory element points to same bucket as its split image, can halve directory Database Management Systems, R. Ramakrishnan and J. Gehrke 13 Comments on Extendible Hashing � If directory fits in memory, equality search answered with one disk access; else two – 100MB file, 100 bytes/rec, 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that directory will fit in memory. � Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large – Multiple entries with same hash value cause problems! – When would this happen? Database Management Systems, R. Ramakrishnan and J. Gehrke 14 Linear Hashing � This is another dynamic hashing scheme, an alternative to Extendible Hashing � LH handles the problem of long overflow chains without using a directory, and handles duplicates � Main idea: split one bucket at a time in rounds Database Management Systems, R. Ramakrishnan and J. Gehrke 15
Inserting h(r) = 43 Level=2, N=4 Level=2 h PRIMARY h OVERFLOW h h PRIMARY 3 2 PAGES 3 2 Next=0 Next=0 PAGES PAGES 32* 44* 36* 32* 000 00 000 00 Data entry r 9* 25* 5* 9* 25* 5* with h(r)=5 001 01 001 01 14* 18*10*30* 14* 18*10*30* 010 10 Primary 010 10 bucket page 31*35* 7* 11* 31*35* 7* 11* 43* 011 11 011 11 ( This info (The actual contents is for illustration of the linear hashed only!) file) Database Management Systems, R. Ramakrishnan and J. Gehrke 16 Example (Inserting h(r) = 43) Level=2 Level=2 h h PRIMARY OVERFLOW h h OVERFLOW PRIMARY 3 2 PAGES PAGES 3 2 Next=0 PAGES PAGES 32* 32* 000 00 000 00 Next=1 9* 25* 5* 9* 25* 5* 001 01 001 01 14* 18*10*30* 14* 18*10*30* 010 10 010 10 31*35* 7* 11* 43* 31*35* 7* 11* 43* 011 11 011 11 100 00 44* 36* Database Management Systems, R. Ramakrishnan and J. Gehrke 17 Inserting h(r) = 50 (End of a Round) Level=3 PRIMARY OVERFLOW h3 h PAGES 2 PAGES Next=0 Level=2 000 00 32* PRIMARY OVERFLOW PAGES h3 h PAGES 2 001 01 9* 25* 32* 000 00 010 10 66* 18* 10* 34* 50* 9* 25* 001 01 011 11 43* 35* 11* 010 10 66* 18* 10* 34* Next=3 100 00 44* 36* 31* 35* 7* 11* 43* 011 11 101 11 5* 37* 29* 100 44* 36* 00 14* 30* 22* 5* 110 10 101 37*29* 01 14* 30* 22* 31*7* 111 11 110 10 Database Management Systems, R. Ramakrishnan and J. Gehrke 18
Overview of LH File � In the middle of a round. Buckets split in this round: Bucket to be split If h ( search key value ) Level Next is in this range, must use h Level+1 ( search key value ) Buckets that existed at the to decide if entry is in beginning of this round: `split image' bucket. this is the range of h Level `split image' buckets: created (through splitting of other buckets) in this round Database Management Systems, R. Ramakrishnan and J. Gehrke 19 Summary � Hash-based indexes: best for equality searches, cannot support range searches. � Static Hashing can lead to long overflow chains. � Extendible Hashing uses directory doubling to avoid overflow pages – Duplicates may require overflow pages � Linear hashing avoids directory by splitting in rounds – Naturally handles skew and duplicates – Uses overflow buckets (but not very long in practice) Database Management Systems, R. Ramakrishnan and J. Gehrke 20
Recommend
More recommend