Hash-Based Indexing Torsten Grust Chapter 6 Hash-Based Indexing Efficient Support for Equality Search Hash-Based Indexing Static Hashing Hash Functions Architecture and Implementation of Database Systems Extendible Hashing Summer 2016 Search Insertion Procedures Linear Hashing Insertion (Split, Rehashing) Running Example Procedures Torsten Grust Wilhelm-Schickard-Institut für Informatik Universität Tübingen 1
Hash-Based Indexing Hash-Based Indexing Torsten Grust • We now turn to a different family of index structures: hash indexes. • Hash indexes are “unbeatable” when it comes to support for Hash-Based Indexing equality selections : Static Hashing Hash Functions Extendible Hashing Search Equality selection Insertion Procedures 1 SELECT * Linear Hashing 2 FROM R Insertion (Split, Rehashing) 3 WHERE A = k Running Example Procedures • Further, other query operations internally generate a flood of equality tests ( e.g. , nested-loop join). (Non-)presence of hash index support can make a real difference in such scenarios. 2
Hashing vs. B + -trees Hash-Based Indexing Torsten Grust • Hash indexes provide no support for range queries , however (hash indexes are also known as scatter storage ). • In a B + -tree-world, to locate a record with key k means to Hash-Based Indexing compare k with other keys k ′ organized in a (tree-shaped) Static Hashing search data structure. Hash Functions • Hash indexes use the bits of k itself (independent of all Extendible Hashing Search Insertion other stored records) to find the location of the associated Procedures record. Linear Hashing • We will now briefly look into static hashing to illustrate the Insertion (Split, Rehashing) Running Example Procedures basics. • Static hashing does not handle updates well (much like ISAM). • Later, we introduce extendible hashing and linear hashing which refine the hashing principle and adapt well to record insertions and deletions. 3
Hash-Based Indexing Static Hashing Torsten Grust • To build a static hash index on attribute A : Build static hash index on column A Hash-Based Indexing Static Hashing 1 Allocate a fixed area of N (successive) disk pages, the Hash Functions so-called primary buckets . Extendible Hashing Search 2 In each bucket, install a pointer to a chain of overflow Insertion Procedures pages (initially set the pointer to null ). Linear Hashing Insertion (Split, Rehashing) 3 Define a hash function h with range [ 0 , . . . , N − 1 ] . The Running Example Procedures domain of h is the type of A , e.g. . h : INTEGER → [ 0 , . . . , N − 1 ] if A is of SQL type INTEGER . 4
Hash-Based Indexing Static Hashing Torsten Grust Static hash table hash table 0 ... Hash-Based Indexing 1 bucket Static Hashing 2 ... Hash Functions k h Extendible Hashing Search Insertion Procedures Linear Hashing Insertion (Split, Rehashing) N-1 ... Running Example Procedures overflow pages primary buckets • A primary bucket and its associated chain of overflow pages is referred to as a bucket ( above). • Each bucket contains index entries k ∗ (implemented using any of the variants A , B , C , see slide 2.22. 5
Hash-Based Indexing Static Hashing Torsten Grust • To perform hsearch( k ) (or hinsert( k ) / hdelete( k ) ) for a record with key A = k : Hash-Based Indexing Static hashing scheme Static Hashing Hash Functions 1 Apply hash function h to the key value, i.e. , compute h ( k ) . Extendible Hashing Search 2 Access the primary bucket page with number h ( k ) . Insertion Procedures 3 Search (insert/delete) subject record on this page or, if Linear Hashing Insertion (Split, Rehashing) required, access the overflow chain of bucket h ( k ) . Running Example Procedures • If the hashing scheme works well and overflow chain access is avoidable, • hsearch( k ) requires a single I/O operation , • hinsert( k ) / hdelete( k ) require two I/O operations . 6
Hash-Based Indexing Static Hashing: Collisions and Overflow Chains Torsten Grust • At least for static hashing, overflow chain management is important. • Generally, we do not want hash function h to avoid collisions , i.e. , Hash-Based Indexing Static Hashing Hash Functions h ( k ) = h ( k ′ ) even if k � = k ′ Extendible Hashing Search (otherwise we would need as many primary bucket pages as Insertion Procedures different key values in the data file). Linear Hashing • At the same time, we want h to scatter the key attribute Insertion (Split, Rehashing) Running Example domain evenly across [ 0 , . . . , N − 1 ] to avoid the Procedures development of long overflow chains for few buckets. This makes the hash tables’ I/O behavior non-uniform and unpredictable for a query optimizer. • Such “good” hash functions are hard to discover, unfortunately. 7
Hash-Based Indexing The Birthday Paradox (Need for Overflow Chain Management) Torsten Grust Example (The birthday paradox) Consider the people in a group as the domain and use their birthday as hash function h ( h : Person → [ 0 , . . . , 364 ] ). If the group has 23 or more members, chances are Hash-Based Indexing > 50 % that two people share the same birthday Static Hashing ( collision ). Hash Functions Extendible Hashing Search Check : Compute the probability that n people all have different Insertion birthdays : Procedures Linear Hashing 1 Function: different_birthday ( n ) Insertion (Split, Rehashing) Running Example 2 if n = 1 then Procedures return 1; 3 4 else 365 − ( n − 1 ) return different_birthday( n − 1 ) × ; 5 365 � �� � � �� � probability that n − 1 per- probability that n th per- sons have different birth- son has birthday different days from first n − 1 persons 8
Hash-Based Indexing Hash Functions Torsten Grust • Goal: Devise a mapping from keys k to hash values that scatters values better than a random function. Not easy, since value distributions in real-world tables are often skewed. • A good hash function h . . . Hash-Based Indexing Static Hashing • considers all bits of its input key k , Hash Functions • is sensitive to the change of any bit position (even if Extendible Hashing Search k and k ′ differ in bit only, h ( k ) and h ( k ′ ) differ greatly), Insertion Procedures • is sensitive to bit permutation , Linear Hashing • scatters input records evenly over the entire hash table. Insertion (Split, Rehashing) Running Example Procedures Hash functions based on the Golden Ratio Hash value computation based on the (inverse) Golden Ratio √ 5 + 1 ) ≈ 0 . 6180339887 shows particularly nice properties. 1 Z = 2 / ( Multiplicative hashing based on Z spreads outs evenly. PostgreSQL also builds on the random bit pattern of Z . 1 See D.E.Knuth, “ Sorting and Searching .” 9
Hash-Based Indexing Static Hashing and Dynamic Files Torsten Grust • For a static hashing scheme: • If the underlying data file grows , the development of overflow chains spoils the otherwise predictable behavior Hash-Based Indexing hash I/O behavior (1–2 I/O operations). Static Hashing Hash Functions • If the underlying data file shrinks , a significant fraction Extendible Hashing Search of primary hash buckets may be (almost) empty—a Insertion Procedures waste of page space. Linear Hashing • As in the ISAM case, however, static hashing has Insertion (Split, Rehashing) Running Example Procedures advantages when it comes to concurrent access. • We may perodicially rehash the data file to restore the ideal situation (20 % free space, no overflow chains). ⇒ Expensive and the index cannot be used while rehashing is in progress. 10
Hash-Based Indexing Extendible Hashing Torsten Grust • Extendible Hashing can adapt to growing (or shrinking) data files. • To keep track of the actual primary buckets that are part of the current hash table, we hash via an in-memory bucket Hash-Based Indexing directory : Static Hashing Hash Functions Extendible Hashing Example (Extendible hash table setup; ignore the 2 fields for now 2 ) Search Insertion 2 Procedures bucket A 4* 12* 32* 16* Linear Hashing Insertion (Split, Rehashing) 2 2 Running Example Procedures 00 1* 5* 21* bucket B 01 h 2 10 11 10* bucket C directory 2 bucket D 15* 7* 19* hash table 2 Note : This figure depicts the entries as h ( k ) ∗ , not k ∗ . 11
Hash-Based Indexing Extendible Hashing: Search Torsten Grust Search for a record with key k Hash-Based Indexing 1 Apply h , i.e. , compute h ( k ) . Static Hashing Hash Functions 2 Consider the last 2 bits of h ( k ) and follow the Extendible Hashing corresponding directory pointer to find the bucket. Search Insertion Procedures Linear Hashing Insertion (Split, Rehashing) Example (Search for a record) Running Example Procedures To find a record with key k such that h ( k ) = 5 = 101 2 , follow the second directory pointer (101 2 ∧ 11 2 = 01 2 ) to bucket B, then use entry 5 ∗ to access the wanted record. 12
Hash-Based Indexing Extendible Hashing: Global and Local Depth Torsten Grust Global and local depth annotations Hash-Based Indexing Static Hashing • Global depth ( n at hash directory): Hash Functions Extendible Hashing Use the last n bits of h ( k ) to lookup a bucket pointer in the Search directory (the directory size is 2 n ). Insertion Procedures Linear Hashing • Local depth ( d at individual buckets): Insertion (Split, Rehashing) Running Example The hash values h ( k ) of all entries in this bucket agree on Procedures their last d bits. 13
Recommend
More recommend