Hashing and Birthdays

Today’s announcements:
◮ PA2 out, due Nov 1, 23:59
◮ MT2 Nov 7, 19:00-21:00 WOOD 2

Today’s Plan
◮ Hashing
◮ Birthdays and probability

Warm up: Thinking about AVL trees
◮ AVL trees are binary search trees that allow only slight imbalance
◮ Worst-case O(log n) time for find, insert, and remove
◮ Elements (even siblings) may be scattered in memory

Could we preserve optimal balance always?

[Figure: binary search tree on the keys 5, 3, 7, 2, 4, 6, listed level by level]
Dictionary ADT

key       value (data)
Multics   MULTiplexed Information and Computing Service
Unix      Uniplexed Multics
BSD       Berkeley Software Distribution
GNU       GNU’s Not Unix

Operations
◮ insert
◮ remove
◮ find

◮ insert(Linux, Linus Torvalds’s Unix)
◮ find(Unix) returns “Uniplexed Multics”
Hash Table Goal

We can do: a[2]=“GNU’s Not Unix”
We want to do: a[“GNU”]=“GNU’s Not Unix”

[Figure: array with slots 0, 1, 2, 3, ..., m − 1; slot 2 holds “GNU’s Not Unix”. Keys shown: Multics, Linux, GNU, Unix, Unics]
Hash table approach

Choose a hash function to map keys to indices.

[Figure: the keys GNU, Linux, Multics, Unics, Unix pass through the hash function into a hash table with slots 0, 1, 2, 3, ..., m − 1; slot 2 holds “GNU’s Not Unix”]

hash(“GNU”) = 2
Collisions

A collision occurs when two different keys x and y map to the same index (i.e. slot in the table): hash(x) = hash(y).

[Figure: the keys GNU, Linux, Multics, Unics, Unix, Mac OS X pass through the hash function into a hash table with slots 0, 1, 2, 3, ..., m − 1; slot 2 holds “GNU’s Not Unix”]

Can we prevent collisions?
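To make the definition concrete, here is a minimal sketch of a collision. The function toy_hash is hypothetical (it is not the hash function used on the slides): it adds up the character codes and reduces mod the table size m, so any two keys with the same characters in a different order land in the same slot.

```cpp
#include <string>

// Hypothetical toy hash (an assumption for illustration only):
// sum the character codes of the key, reduced mod the table size m.
// m must be positive.
int toy_hash(const std::string& key, int m) {
    int sum = 0;
    for (unsigned char c : key) {
        sum = (sum + c) % m;  // keep the running value in [0, m)
    }
    return sum;
}
```

With this function, toy_hash("ab", m) and toy_hash("ba", m) are equal for every m, since addition ignores order. More generally, once there are more possible keys than slots, some pair of keys must share a slot.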
Birthdays and Probability

Probability that someone in this room has a birthday today?
What if this was a birthday party?

Probability that two people in this room have the same birthday?
What if the room contained 366 people? 183?
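The questions above can be explored numerically. The sketch below assumes 365 equally likely, independent birthdays (no leap day), which is a simplification; the function names are illustrative, not from the slides.

```cpp
// Probability that at least two of n people share a birthday,
// assuming `days` equally likely and independent birthdays.
double prob_shared_birthday(int n, int days = 365) {
    if (n > days) return 1.0;  // pigeonhole: more people than days
    double all_distinct = 1.0;
    for (int i = 0; i < n; ++i) {
        // Person i must avoid the i birthdays already taken.
        all_distinct *= static_cast<double>(days - i) / days;
    }
    return 1.0 - all_distinct;
}

// Probability that at least one of n people has a birthday on one fixed day.
double prob_birthday_today(int n, int days = 365) {
    double nobody = 1.0;
    for (int i = 0; i < n; ++i) {
        nobody *= static_cast<double>(days - 1) / days;
    }
    return 1.0 - nobody;
}
```

Under these assumptions, prob_shared_birthday passes 1/2 at n = 23 and is exactly 1 for n = 366, while for 183 people it is already indistinguishable from 1 in practice.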
Expected Value

Definition: The expected value of a number X that depends on random events (X is called a random variable) is:

$E[X] = \sum_x \mathrm{Prob}[X = x] \cdot x$

X is the sum of two six-sided dice. E[X] = ?

       1   2   3   4   5   6
  1    2   3   4   5   6   7
  2    3   4   5   6   7   8
  3    4   5   6   7   8   9
  4    5   6   7   8   9  10
  5    6   7   8   9  10  11
  6    7   8   9  10  11  12

Linearity of Expectation
For any two random variables X and Y, E[X + Y] = E[X] + E[Y].
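As a check on the dice example, the sketch below computes the expectation straight from the definition, assuming fair dice so that each of the 36 table entries has probability 1/36. The function names are illustrative.

```cpp
// Expected value of one fair six-sided die: (1 + 2 + ... + 6) / 6.
double expected_one_die() {
    int total = 0;
    for (int a = 1; a <= 6; ++a) total += a;
    return total / 6.0;
}

// Expected value of the sum of two fair dice, summing all 36
// equally likely entries of the table and dividing by 36.
double expected_two_dice() {
    int total = 0;
    for (int a = 1; a <= 6; ++a) {
        for (int b = 1; b <= 6; ++b) {
            total += a + b;
        }
    }
    return total / 36.0;
}
```

The table entries add up to 252, so expected_two_dice() gives 252 / 36 = 7. Linearity of expectation reaches the same value with less work: 3.5 + 3.5 = 7.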
More Birthdays

What is the expected number of pairs of people who share a birthday in this room?

Let $X_{ij} = \begin{cases} 1 & \text{if persons } i \text{ and } j \text{ have the same birthday} \\ 0 & \text{otherwise} \end{cases}$

$X = \sum_{i<j} X_{ij}$ is the number of pairs who share a birthday.

$E[X] = E\left[\sum_{i<j} X_{ij}\right] = \sum_{i<j} E[X_{ij}] = \sum_{i<j} \mathrm{Prob}[X_{ij} = 1]$

Generalized birthdays
If we randomly put k people into m bins, we expect $\frac{k(k-1)}{2} \cdot \frac{1}{m}$ pairs to share a bin, which is greater than 1 for $k = \sqrt{2m} + 1$.
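The generalized formula is easy to evaluate. The sketch below assumes each person lands in each of the m bins with probability 1/m, independently, so a given pair shares a bin with probability 1/m; the function name is illustrative.

```cpp
// Expected number of pairs sharing a bin when k people are placed
// uniformly and independently into m bins:
// (number of pairs) * (probability a given pair collides)
//   = k(k-1)/2 * 1/m.
double expected_shared_pairs(int k, int m) {
    return static_cast<double>(k) * (k - 1) / (2.0 * m);
}
```

For birthdays (m = 365), the expectation first exceeds 1 at k = 28, which matches the threshold $\sqrt{2 \cdot 365} + 1 \approx 28.02$ up to rounding.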
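Hashing string keys with mod and Horner’s Rule

    #include <string>
    using std::string;

    // m is the table size, assumed to be defined elsewhere.
    int hash( string s ) {
      int h = 0;
      // Horner's Rule, from the last character to the first.
      for (int i = (int)s.length() - 1; i >= 0; i--) {
        // Cast so that characters above 127 cannot make h negative.
        h = (256 * h + (unsigned char)s[i]) % m;
      }
      return h;
    }

Compare that to the hash function from yacc:

    #define TABLE_SIZE 1024 // must be a power of 2

    int hash( char *s ) {
      int h = *s++;
      while( *s )
        h = (31 * h + *s++) & (TABLE_SIZE - 1);
      return h;
    }

What’s different?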
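Unrolling the first loop shows which polynomial Horner's Rule evaluates. The worked equation below is a sketch for a string of length k, writing s[i] for the character codes; it ignores the intermediate reductions, which do not change the final value mod m.

```latex
% Horner's Rule for k = 3, processing s[2], then s[1], then s[0]:
h = \bigl( (s[2] \cdot 256 + s[1]) \cdot 256 + s[0] \bigr) \bmod m

% In general, for a string of length k:
h(s) = \bigl( s[0] + 256\, s[1] + 256^{2}\, s[2] + \cdots + 256^{k-1}\, s[k-1] \bigr) \bmod m
```

Three differences are visible directly in the code: the first function scans from the last character to the first while the yacc function scans forward, the multiplier is 256 in one and 31 in the other, and the first reduces with % m while yacc masks with & (TABLE_SIZE - 1), which equals reduction mod 1024 for non-negative values because TABLE_SIZE is a power of 2.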
Fixed hash functions are dangerous!

Good hash table performance depends on few collisions.
If a user knows your hash function, she can cause many elements to hash to the same slot.
Why would she want to do that?

Yacc
$h(s) = \left(31^{k-1} s[0] + 31^{k-2} s[1] + \cdots + 31^{0} s[k-1]\right) \bmod 1024$
h(XY) = h(xy), because each lowercase letter's code is 32 larger than its uppercase form and 31 · 32 + 32 = 1024.
Find many strings that hash to the same slot?

Protection
◮ Use a cryptographically secure hash function (e.g. SHA-512).
◮ Choose a new hash function at random for every hash table.
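One way to answer the question about finding many colliding strings is sketched below. It restates the yacc-style hash for std::string (returning 0 for an empty string is an assumption of this sketch, since the original does not handle that case) and builds keys from the two-character blocks "XY" and "xy". The names yacc_hash and colliding_keys are illustrative.

```cpp
#include <cstddef>
#include <string>
#include <vector>

const int TABLE_SIZE = 1024;  // must be a power of 2, as on the slide

// yacc-style hash from the slide, adapted to std::string.
// Assumption of this sketch: an empty string hashes to 0.
int yacc_hash(const std::string& s) {
    if (s.empty()) return 0;
    int h = static_cast<unsigned char>(s[0]);
    for (std::size_t i = 1; i < s.size(); ++i) {
        h = (31 * h + static_cast<unsigned char>(s[i])) & (TABLE_SIZE - 1);
    }
    return h;
}

// All 2^blocks strings formed by concatenating `blocks` copies of
// either "XY" or "xy". For blocks >= 1 they all share one hash value.
std::vector<std::string> colliding_keys(int blocks) {
    std::vector<std::string> keys(1);  // start with one empty string
    for (int b = 0; b < blocks; ++b) {
        std::vector<std::string> next;
        for (const std::string& k : keys) {
            next.push_back(k + "XY");
            next.push_back(k + "xy");
        }
        keys = next;
    }
    return keys;
}
```

Because "XY" and "xy" have the same length and the same hash, swapping one block for the other anywhere in a key leaves the hash unchanged, so n blocks give 2^n distinct keys in a single slot. An attacker who inserts such keys forces every operation on that slot to wade through all of them.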