Hashing

Today's announcements:
◮ PA2 out, due Nov 1, 23:59
◮ MT2 Nov 7, 19:00-21:00, WOOD 2

Today's Plan
◮ Hashing
◮ Universal hash functions

[Figure: a hash table with slots 0 to 20; the key 12345 hashes to slot 6.]
Expected Value

Definition: The expected value of a number $X$ that depends on random events ($X$ is called a random variable) is:

$$E[X] = \sum_{x} \mathrm{Prob}[X = x] \cdot x$$

Example: $X$ is the sum of two six-sided dice. $E[X] = 7$.

        1   2   3   4   5   6
    1   2   3   4   5   6   7
    2   3   4   5   6   7   8
    3   4   5   6   7   8   9
    4   5   6   7   8   9  10
    5   6   7   8   9  10  11
    6   7   8   9  10  11  12

Linearity of Expectation
For any two random variables $X$ and $Y$, $E[X + Y] = E[X] + E[Y]$.
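The two-dice example can be checked numerically. The sketch below sums over all equally likely ordered pairs of rolls, which is the definition of $E[X]$ with outcomes grouped by roll rather than by value. The function name `expected_two_dice` is made up for this illustration and is not from the slides.

```cpp
#include <cassert>

// Expected value of the sum of two fair dice with `sides` faces each.
// Sums (first + second) * Prob[that ordered pair] over all ordered pairs.
// expected_two_dice is a hypothetical name used only for this sketch.
double expected_two_dice(int sides) {
    assert(sides >= 1);
    const double outcomes = static_cast<double>(sides) * sides;
    double expectation = 0.0;
    for (int first = 1; first <= sides; ++first) {
        for (int second = 1; second <= sides; ++second) {
            // Each ordered pair of rolls has probability 1 / sides^2.
            expectation += (first + second) / outcomes;
        }
    }
    return expectation;
}
```

By linearity of expectation the same value is simply twice the expected value of one die, so no enumeration is needed in practice.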
More Birthdays

What is the expected number of pairs of people who share a birthday in this room?

Let $X_{ij} = 1$ if persons $i$ and $j$ have the same birthday, and $X_{ij} = 0$ otherwise.
$X = \sum_{i<j} X_{ij}$ is the number of pairs who share a birthday.

$$E[X] = E\Bigl[\sum_{i<j} X_{ij}\Bigr] = \sum_{i<j} E[X_{ij}] = \sum_{i<j} \frac{1}{365}$$

(assuming all 365 birthdays are equally likely).

Generalized birthdays
If we randomly put $k$ people into $m$ bins, we expect $\frac{k(k-1)}{2} \cdot \frac{1}{m}$ pairs to share a bin, which is greater than 1 for $k = \sqrt{2m} + 1$.
Hash table approach

Choose a hash function to map keys to indices.

[Figure: the keys GNU, Linux, GNU's Not Unix, Multics, Unics, and Unix are mapped by a hash function into a hash table with slots 0 to $m - 1$; for example, hash("GNU") = 2.]
Hashing string keys with mod and Horner's Rule

    #include <string>

    // m is the table size, assumed to be declared elsewhere.
    int hash(std::string s) {
        int h = 0;
        // Horner's Rule, processing characters from last to first.
        for (int i = static_cast<int>(s.length()) - 1; i >= 0; i--) {
            h = (256 * h + s[i]) % m;
        }
        return h;
    }

Compare that to the hash function from yacc:

    #define TABLE_SIZE 1024  // must be a power of 2

    int hash(char *s) {
        int h = *s++;
        while (*s)
            h = (31 * h + *s++) & (TABLE_SIZE - 1);
        return h;
    }

What's different?
Fixed hash functions are dangerous!

Good hash table performance depends on few collisions. If a user knows your hash function, she can cause many elements to hash to the same slot. Why would she want to do that?

Yacc
For a string $s$ of length $k$ (with TABLE_SIZE = 1024):

$$h(s) = \left(31^{k-1} s[0] + 31^{k-2} s[1] + \cdots + 31^{0} s[k-1]\right) \bmod 1024$$

For example, $h(\text{"XY"}) = h(\text{"xy"})$: each lower-case letter is 32 larger than its upper-case counterpart, so before the mod the two sums differ by $31 \cdot 32 + 32 = 1024$. Can you find many strings that hash to the same slot?

Protection
◮ Use a cryptographically secure hash function (e.g. SHA-512).
◮ Choose a new hash function at random for every hash table.
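One possible answer to the "many strings" question, as a sketch: masking with TABLE_SIZE - 1 = 1023 is arithmetic mod 1024, and switching any adjacent pair of letters from upper to lower case changes the sum by $32 \cdot 31^{j+1} + 32 \cdot 31^{j} = 1024 \cdot 31^{j}$, which vanishes mod 1024. So every string built from the blocks "XY" and "xy" with the same number of blocks lands in the same slot. `yacc_hash` below restates the yacc arithmetic for `std::string`, and `colliding_strings` is a hypothetical helper; neither name comes from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

const int TABLE_SIZE = 1024;  // power of 2, as in the yacc snippet

// Same arithmetic as the yacc hash, restated for std::string.
// Assumes a non-empty string of ASCII characters.
int yacc_hash(const std::string& s) {
    assert(!s.empty());
    int h = s[0];
    for (std::size_t i = 1; i < s.size(); ++i) {
        h = (31 * h + s[i]) & (TABLE_SIZE - 1);
    }
    return h;
}

// Builds all 2^blocks strings made of `blocks` two-letter blocks, where each
// block is either "XY" or "xy". Hypothetical helper for this illustration.
std::vector<std::string> colliding_strings(int blocks) {
    std::vector<std::string> result{""};
    for (int b = 0; b < blocks; ++b) {
        std::vector<std::string> next;
        for (const std::string& prefix : result) {
            next.push_back(prefix + "XY");
            next.push_back(prefix + "xy");
        }
        result = next;
    }
    return result;
}
```

With `blocks = 20` this already yields about a million distinct keys for a single slot, which is why a fixed, publicly known hash function invites this kind of attack.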
Universal hash functions

A set $H$ of hash functions is universal if, for all $x \neq y$, the probability that hash($x$) = hash($y$) is at most $1/m$ when hash() is chosen at random from $H$.

Example: Let $p$ be a prime number larger than any key. Choose $a$ at random from $\{1, 2, \ldots, p-1\}$ and choose $b$ at random from $\{0, 1, \ldots, p-1\}$.

$$\mathrm{hash}(x) = ((a \cdot x + b) \bmod p) \bmod m$$
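A minimal sketch of the example family, under assumptions that are not in the slides: keys are integers in the range 0 to $p - 1$, and $p$ is fixed to the Mersenne prime $2^{31} - 1$ so that $a \cdot x + b$ fits in 64 bits without overflow. The type name `UniversalHash` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// One randomly chosen member of the family hash(x) = ((a*x + b) mod p) mod m.
// Hypothetical type for illustration; assumes every key x satisfies x < p.
struct UniversalHash {
    static constexpr std::uint64_t p = 2147483647ULL;  // 2^31 - 1, a prime
    std::uint64_t a, b, m;

    // Picks a from {1, ..., p-1} and b from {0, ..., p-1} using rng.
    UniversalHash(std::uint64_t table_size, std::mt19937_64& rng) : m(table_size) {
        assert(table_size > 0);
        std::uniform_int_distribution<std::uint64_t> pick_a(1, p - 1);
        std::uniform_int_distribution<std::uint64_t> pick_b(0, p - 1);
        a = pick_a(rng);
        b = pick_b(rng);
    }

    std::uint64_t operator()(std::uint64_t x) const {
        assert(x < p);                 // keys must be smaller than the prime
        return ((a * x + b) % p) % m;  // a*x + b < 2^62, so no overflow
    }
};
```

Constructing a fresh `UniversalHash` for every table, seeded from a source an adversary cannot predict, is one way to follow the "choose a new hash function at random" advice. `std::mt19937_64` is not cryptographically secure, so treat the generator choice here as a placeholder.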
Collisions

Birthday Paradox
With probability $> 1/2$, two people in a room of 23 have the same birthday. (Hash 23 people into $m = 365$ slots. Collision?)

General birthday paradox
If we randomly hash $\sqrt{2m}$ keys into $m$ slots, we get a collision with probability $> 1/2$.

Collision
Unless we know all the keys in advance and design a perfect hash function, we must handle collisions. What do we do when two keys hash to the same slot?
◮ separate chaining: store multiple items in each slot
◮ open addressing: pick a next slot to try
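The birthday numbers can be checked exactly. Assuming each key lands in a uniformly random slot, the probability of no collision among $k$ keys is $\prod_{i=0}^{k-1} \frac{m-i}{m}$, and the sketch below returns one minus that product. The function name `collision_probability` is made up for this illustration.

```cpp
#include <cassert>

// Probability that at least two of k keys share a slot when each key is
// placed uniformly at random into one of m slots.
// collision_probability is a hypothetical name used only for this sketch.
double collision_probability(int k, int m) {
    assert(k >= 0 && m >= 1);
    if (k > m) return 1.0;  // pigeonhole: more keys than slots
    double no_collision = 1.0;
    for (int i = 0; i < k; ++i) {
        // The (i+1)-th key must avoid the i slots already taken.
        no_collision *= static_cast<double>(m - i) / m;
    }
    return 1.0 - no_collision;
}
```

For $m = 365$, `collision_probability(22, 365)` is below one half and `collision_probability(23, 365)` is above it, which is the classic 23-person threshold.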
Hashing with Chaining

Store multiple items in each slot. How?
◮ Common choice is an unordered linked list (a chain).
◮ Could use any dictionary ADT implementation.

Result
◮ Can hash more than $m$ items into a table of size $m$.
◮ Performance depends on the length of the chains.
◮ Memory is allocated on each insertion.

[Figure: a table with slots 0 to 6; slot 1 holds the chain A, D, slot 3 holds the chain E, B, and C sits alone in another slot. hash(A) = hash(D) = 1, hash(E) = hash(B) = 3.]
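A minimal sketch of chaining, assuming non-negative integer keys and the placeholder hash function key mod $m$. The class name `ChainedSet` and its methods are hypothetical, and each slot uses a singly linked list as in the "common choice" above.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <vector>

// Hypothetical ChainedSet: a set of non-negative ints using separate chaining.
class ChainedSet {
public:
    explicit ChainedSet(std::size_t m) : slots_(m) { assert(m > 0); }

    // Adds key at the head of its chain; returns false if already present.
    bool insert(int key) {
        if (contains(key)) return false;
        slots_[slot(key)].push_front(key);  // allocates one list node
        ++size_;
        return true;
    }

    // Walks the chain of the key's slot, so cost grows with chain length.
    bool contains(int key) const {
        for (int stored : slots_[slot(key)]) {
            if (stored == key) return true;
        }
        return false;
    }

    std::size_t size() const { return size_; }

private:
    // Placeholder hash function: key mod m. Assumes key >= 0.
    std::size_t slot(int key) const {
        assert(key >= 0);
        return static_cast<std::size_t>(key) % slots_.size();
    }

    std::vector<std::forward_list<int>> slots_;
    std::size_t size_ = 0;
};
```

A `ChainedSet` built with 7 slots accepts 20 keys without complaint, illustrating that the number of items can exceed the table size; the price is longer chains to walk on every lookup.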
Access time for Chaining

Load factor: $\alpha = \frac{\#\text{ hashed items}}{\text{table size}} = \frac{n}{m}$

Assume we have a uniform hash function (every item hashes to each slot with equal probability).

Search cost
On average,
◮ an unsuccessful search examines $\alpha$ items.
◮ a successful search examines $1 + \frac{n-1}{2m} = 1 + \frac{\alpha}{2} - \frac{\alpha}{2n}$ items.

We want the load factor to be small.
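One way to see where the search costs come from is the standard textbook argument, which the slide does not spell out. It assumes that items are numbered in insertion order, that new items go at the head of their chain, and that each of the $n$ stored items is equally likely to be the one searched for. An unsuccessful search walks a whole chain, whose expected length is $\alpha$; a successful search for item $i$ examines item $i$ plus every later-inserted item in the same slot.

```latex
\begin{align*}
E[\text{items examined, unsuccessful}]
  &= \sum_{i=1}^{n} \mathrm{Prob}[\text{item } i \text{ is in the searched slot}]
   = \frac{n}{m} = \alpha \\
E[\text{items examined, successful}]
  &= \frac{1}{n} \sum_{i=1}^{n} \Bigl( 1 + \sum_{j=i+1}^{n} \frac{1}{m} \Bigr)
   = 1 + \frac{1}{nm} \sum_{i=1}^{n} (n - i) \\
  &= 1 + \frac{n-1}{2m}
   = 1 + \frac{\alpha}{2} - \frac{\alpha}{2n}
\end{align*}
```

The first line uses linearity of expectation; the last step uses $\sum_{i=1}^{n}(n-i) = \frac{n(n-1)}{2}$ and $\frac{1}{2m} = \frac{\alpha}{2n}$.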
Open Addressing

Allow only one item in each slot. The hash function specifies a sequence of slots to try.

Insert
If the first slot is occupied, try the next, then the next, ... until an empty slot is found.

Find
If the first slot doesn't match, try the next, then the next, ... until a match (found) or an empty slot (not found).

Result
◮ Cannot hash more than $m$ items into a table of size $m$. [Pigeonhole Principle]
◮ Hash table memory allocated once.
◮ Performance depends on number of tries.

[Figure: a table with slots 0 to 6 holding the items A, D, E, B, and C, one item per slot.]
Linear probing

Try location (hash($k$) + $i$) mod $m$ for $i = 0, 1, \ldots$
Here hash($k$) = $k$ % 7. Each column shows the table after the insert named at its top; "-" marks an empty slot.

              insert(76)  insert(93)  insert(40)  insert(47)  insert(10)  insert(55)
              76%7 = 6    93%7 = 2    40%7 = 5    47%7 = 5    10%7 = 3    55%7 = 6
    slot 0    -           -           -           47          47          47
    slot 1    -           -           -           -           -           55
    slot 2    -           93          93          93          93          93
    slot 3    -           -           -           -           10          10
    slot 4    -           -           -           -           -           -
    slot 5    -           -           40          40          40          40
    slot 6    76          76          76          76          76          76