Hashing
1. Hashing: basic plan

Save items in a key-indexed table (the index is a function of the key).

Hash function. Method for computing an array index from a key.
Ex. hash("it") = 3, hash("times") = 3.

Issues.
・Computing the hash function.
・Equality test: method for checking whether two keys are equal.
・Collision resolution: algorithm and data structure to handle two keys that hash to the same array index.

Classic space-time tradeoff.
・No space limitation: trivial hash function with key as index.
・No time limitation: trivial collision resolution with sequential search.
・Space and time limitations: hashing (the real world).

Tyler Moore, CS 2123, The University of Tulsa. Some slides created by or adapted from Dr. Kevin Wayne. For more information see http://www.cs.princeton.edu/courses/archive/fall12/cos226/lectures.php

Computing the hash function

Idealistic goal. Scramble the keys uniformly to produce a table index.
・Efficiently computable.
・Each table index equally likely for each key.
(A thoroughly researched problem, still problematic in practical applications.)

Ex 1. Phone numbers.
・Bad: first three digits.
・Better: last three digits.

Ex 2. Social Security numbers.
・Bad: first three digits. (573 = California, 574 = Alaska; assigned in chronological order within geographic region.)
・Better: last three digits.

Practical challenge. Need a different approach for each key type.

Uniform hashing assumption

Uniform hashing assumption. Each key is equally likely to hash to an integer between 0 and M − 1.

Bins and balls. Throw balls uniformly at random into M bins.
・Birthday problem. Expect two balls in the same bin after ~ √(πM/2) tosses.
・Coupon collector. Expect every bin to have ≥ 1 ball after ~ M ln M tosses.
・Load balancing. After M tosses, expect the most loaded bin to have Θ(log M / log log M) balls.
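The phone-number example above can be sketched in a few lines of Python. This is an illustration, not code from the slides: `hash_first` and `hash_last` are assumed names, and 97 is an arbitrary prime table size.

```python
# Sketch: modular hashing on integer keys.
# Leading digits cluster (area codes, SSN regions), so hashing on the
# first three digits collides badly; the last digits spread better.
def hash_last(key, M):
    return key % M                      # "better": last digits

def hash_first(key, M):
    return int(str(key)[:3]) % M        # "bad": first three digits

phones = [9185551234, 9185559876, 9185550001, 2125550101]
M = 97                                  # arbitrary prime table size
print([hash_first(p, M) for p in phones])  # the 918 numbers all collide
print([hash_last(p, M) for p in phones])
```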

2. Collisions

Collision. Two distinct keys hashing to the same index.
・Birthday problem ⇒ can't avoid collisions unless you have a ridiculous (quadratic) amount of memory.
・Coupon collector + load balancing ⇒ collisions are evenly distributed.

Challenge. Deal with collisions efficiently.

Options for dealing with collisions
1. Open hashing, aka separate chaining: store collisions in a linked list.
2. Closed hashing, aka open addressing: keep keys in the table, shift to unused space.

Collision resolution policies:
・Linear probing
・Quadratic probing, aka quadratic residue search
・Double hashing

Separate chaining symbol table [H. P. Luhn, IBM 1953]

Use an array of M < N linked lists.
・Hash: map key to integer i between 0 and M − 1.
・Insert: put at front of the ith chain (if not already there).
・Search: need to search only the ith chain.
(Figure: the keys S E A R C H E X A M P L E hashed into an array st[] of chains.)

Analysis of separate chaining

Proposition. Under the uniform hashing assumption, the probability that the number of keys in a list is within a constant factor of N/M is extremely close to 1.

Pf sketch. The distribution of list sizes obeys a binomial distribution. (Figure: binomial distribution for N = 10^4, M = 10^3, α = 10.)

Consequence. The number of probes for search/insert is proportional to N/M (M times faster than sequential search).
・M too large ⇒ too many empty chains.
・M too small ⇒ chains too long.
・Typical choice: M ~ N/5 ⇒ constant-time ops.
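The separate-chaining operations above (hash to a chain, insert at front if absent, search only that chain) can be sketched as a minimal symbol table. The class name and M = 97 are assumptions for illustration, not the course's reference implementation.

```python
# Minimal separate-chaining symbol table sketch.
# Each of M slots holds a list of (key, value) pairs ("chains");
# Python lists stand in for the linked lists on the slides.
class ChainedHashST:
    def __init__(self, M=97):
        self.M = M
        self.chains = [[] for _ in range(M)]

    def _index(self, key):
        return hash(key) % self.M        # map key to 0 .. M-1

    def put(self, key, val):
        chain = self.chains[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                 # already present: replace value
                chain[i] = (key, val)
                return
        chain.insert(0, (key, val))      # otherwise put at front of chain

    def get(self, key):
        for k, v in self.chains[self._index(key)]:  # search only one chain
            if k == key:
                return v
        return None
```

With M ~ N/5, each chain stays a few items long on average, so `put` and `get` run in expected constant time.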

3. Closed hashing

Records are stored directly in a table of size M at hash index h(x) for key x.

When a collision occurs (a key hashes to an occupied home position), the record is stored in the first available slot based on a repeatable collision resolution policy. Formally, after i collisions the positions

    h_0(x), h_1(x), ..., h_i(x)

are tried in succession, where h_i(x) = (h(x) + f(i)) mod M.

Closed hashing: insert

    Hash(key) into table at position i
    Repeat up to the size of the table {
        If entry at position i in table is blank or marked as deleted
            then insert and exit
        Let i be the next position using the collision resolution function
    }

Closed hashing: search

    Hash(key) into table at position i
    Repeat up to the size of the table {
        If entry at position i in table matches key and is not marked as deleted
            then found and exit
        If entry at position i in table is blank
            then not found and exit
        Let i be the next position using the collision resolution function
    }
    Not found and exit

Closed hashing: delete

    Hash(key) into table at position i
    Repeat up to the size of the table {
        If entry at position i in table matches key
            then mark as deleted and exit
        If entry at position i in table is blank
            then not found and exit
        Let i be the next position using the collision resolution function
    }
    Not found and exit
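The three routines above can be rendered as one short Python sketch. The slides leave the resolution function f(i) abstract, so linear probing (f(i) = i) is assumed here for concreteness; the class name and the `DELETED` tombstone sentinel are likewise illustrative choices.

```python
# Closed hashing (open addressing) with a "deleted" tombstone marker,
# following the insert/search/delete pseudocode on the slides.
DELETED = object()                       # tombstone: lets search probe past removals

class ClosedHashTable:
    def __init__(self, M=11):
        self.M = M
        self.table = [None] * M          # None = blank

    def _probe(self, key):
        h = hash(key) % self.M
        for i in range(self.M):          # repeat up to the size of the table
            yield (h + i) % self.M       # h_i(x) = (h(x) + f(i)) mod M, f(i) = i

    def insert(self, key):
        for pos in self._probe(key):
            if self.table[pos] is None or self.table[pos] is DELETED:
                self.table[pos] = key    # blank or deleted: insert and exit
                return True
        return False                     # table full

    def search(self, key):
        for pos in self._probe(key):
            if self.table[pos] is None:  # blank: not found
                return False
            if self.table[pos] == key:   # matches key, not a tombstone
                return True
        return False

    def delete(self, key):
        for pos in self._probe(key):
            if self.table[pos] is None:
                return False
            if self.table[pos] == key:
                self.table[pos] = DELETED  # mark as deleted and exit
                return True
        return False
```

The tombstone matters: if delete simply blanked the cell, a later search could stop early at the hole and miss keys stored beyond it.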

4. Linear probing

Collision resolution function f(i) = i:  h_i(x) = (h(x) + i) mod M. (Work example.)

Clustering

Cluster. A contiguous block of items.
Observation. New keys are likely to hash into the middle of big clusters.

Knuth's parking problem

Model. Cars arrive at a one-way street with M parking spaces. Each desires a random space i: if space i is taken, try i + 1, i + 2, etc.

Q. What is the mean displacement of a car?
・Half-full. With M/2 cars, mean displacement is ~ 3/2.
・Full. With M cars, mean displacement is ~ √(πM/8).

Analysis of linear probing

Proposition. Under the uniform hashing assumption, the average number of probes in a linear probing hash table of size M that contains N = αM keys is:

    search hit: ~ (1/2)(1 + 1/(1 − α))
    search miss / insert: ~ (1/2)(1 + 1/(1 − α)²)

Parameters.
・M too large ⇒ too many empty array entries.
・M too small ⇒ search time blows up.
・Typical choice: α = N/M ~ 1/2 (# probes for a search hit is about 3/2; for a search miss, about 5/2).
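The two probe-count formulas above are easy to sanity-check numerically; this sketch (function names are assumptions, not from the slides) confirms the 3/2 and 5/2 figures at the typical load factor α = 1/2.

```python
# Average probe counts for linear probing under uniform hashing:
#   search hit:           ~ (1/2)(1 + 1/(1 - a))
#   search miss / insert: ~ (1/2)(1 + 1/(1 - a)^2)
def probes_hit(alpha):
    return 0.5 * (1 + 1 / (1 - alpha))

def probes_miss(alpha):
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)

# At the typical choice alpha = 1/2:
print(probes_hit(0.5), probes_miss(0.5))    # 1.5 2.5
# Near-full tables blow up quadratically for misses:
print(probes_hit(0.9), probes_miss(0.9))
```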

5. Performance comparison of search

Costs after n inserts (worst-case search/insert/delete; average-case search/insert/delete; supports ordered iteration?):
・Sequential search (unordered list): Θ(n)/Θ(n)/Θ(n); Θ(n)/Θ(n)/Θ(n); no
・Binary search (ordered array): Θ(log n)/Θ(n)/Θ(n); Θ(log n)/Θ(n)/Θ(n); yes
・BST: Θ(n)/Θ(n)/Θ(n); Θ(log n)/Θ(log n)/Θ(log n); yes
・AVL: Θ(log n) for all six; yes
・B-tree: Θ(log n) for all six; yes
・Hash table: Θ(n)/Θ(n)/Θ(n); Θ(1)/Θ(1)/Θ(1); no

Load factors and cost of probing

Q. What size hash table do we need, using linear probing with a load factor of α = 0.75 for closed hashing, to achieve a more efficient expected search time than a balanced binary search tree?

    Search hit: (1/2)(1 + 1/(1 − 3/4)) = 2.5
    Search miss/insert: (1/2)(1 + 1/(1 − 3/4)²) = 8.5

A balanced BST search costs about log₂ M comparisons, so the hash table wins once log₂ M ≥ 8.5, i.e. M ≥ 2^8.5 ≈ 362.

(Figure: expected # probes for insert/search miss, and the breakeven input size between hash table and BST, both plotted against load factor α; the curves blow up as α → 1, with α = 0.9 marked.)

Quadratic probing

Collision resolution function f(i) = ±i²:  h_i(x) = (h(x) ± i²) mod M for 1 ≤ i ≤ (M − 1)/2.
・M is a prime number of the form 4j + 3, which guarantees that the probe sequence is a permutation of the table address space.
・Eliminates primary clustering (when collisions group together, causing more collisions for keys that hash to different values).
(Work example.)
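The permutation guarantee for quadratic probing can be checked directly: for a prime M of the form 4j + 3, the positions h(x) ± i² mod M for 0 ≤ i ≤ (M − 1)/2 cover every table address exactly once. A sketch (function name assumed):

```python
# For prime M = 4j + 3, the +/- quadratic probe sequence
# h, h +/- 1^2, h +/- 2^2, ..., h +/- ((M-1)/2)^2  (mod M)
# is a permutation of 0 .. M-1.
def quadratic_probes(h, M):
    positions = [h % M]
    for i in range(1, (M - 1) // 2 + 1):
        positions.append((h + i * i) % M)
        positions.append((h - i * i) % M)
    return positions

M = 11                                    # prime of the form 4j + 3 (11 = 4*2 + 3)
seq = quadratic_probes(3, M)
print(sorted(set(seq)) == list(range(M)))  # True: every slot is reachable
```

With a non-4j+3 modulus the ± squares repeat and some slots are never probed, which is why the primality condition matters.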

6. Double hashing

With quadratic probing, secondary clustering remains: keys that collide must follow the sequence of prior collisions to find an open spot.

Double hashing reduces both primary and secondary clustering: the probe sequence depends on the original key, not just one hash value.

Collision resolution function f(i) = i · h_B(x):

    h_i(x) = (h_A(x) + i · h_B(x)) mod M

・Works best if M is prime.
・Our approach: h_A(x) = x mod M, h_B(x) = R − (x mod R), where R is a prime < M.

Rehashing

We have already seen how hash table performance falls rapidly as the table load factor approaches 1 (in practice, any load factor above 1/2 should be avoided).

To rehash: create a new table whose capacity M′ is the first prime more than twice as large as M. Scan through the old table and insert into the new table, ignoring cells marked as deleted.
・Running time Θ(M); a relatively expensive operation on its own.
・But good hash table implementations only rehash when the table is half full, then double in size, so the operation should be rare.
・The cost, amortized over the M/2 insertions, is a constant addition to each insertion.
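The double-hashing scheme above can be sketched for two keys that share a home position but get different step sizes, so their probe sequences diverge immediately. M = 13 and R = 7 are assumed small primes for illustration.

```python
# Double-hashing probe sequences with the slides' choice of hash pair:
#   h_A(x) = x mod M,   h_B(x) = R - (x mod R),   R prime < M.
M, R = 13, 7

def h_A(x):
    return x % M

def h_B(x):
    return R - (x % R)            # in 1..R, never 0, so probes always advance

def probe_sequence(x):
    # h_i(x) = (h_A(x) + i * h_B(x)) mod M for i = 0 .. M-1
    return [(h_A(x) + i * h_B(x)) % M for i in range(M)]

# 27 and 40 collide at h_A (both 1 mod 13) but have different steps
# (h_B(27) = 1, h_B(40) = 2), so their sequences split after slot 1:
print(probe_sequence(27))
print(probe_sequence(40))
```

Because M is prime and h_B(x) is never a multiple of M, each sequence visits all M slots, and two colliding keys no longer chase each other's probe path.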
