
Unit #6: Hash functions and the Pigeonhole principle (CPSC 221)



  1. Unit #6: Hash functions and the Pigeonhole principle. CPSC 221: Algorithms and Data Structures. Lars Kotthoff (larsko@cs.ubc.ca). With material from Will Evans, Steve Wolfman, Alan Hu, Ed Knorr, and Kim Voll.

  2. Unit Outline
     ▷ Constant-Time Dictionaries?
     ▷ Hash Table Outline
     ▷ Hash Functions
     ▷ Collisions and the Pigeonhole Principle
     ▷ Collision Resolution:
        ▷ Separate Chaining
        ▷ Open Addressing

  3. Learning Goals
     ▷ Provide examples of the types of problems that can benefit from a hash data structure.
     ▷ Identify the types of search problems that do not benefit from hashing (e.g. range searching) and explain why.
     ▷ Evaluate collision resolution policies.
     ▷ Compare and contrast open addressing and chaining.
     ▷ Describe the conditions under which find using a hash table takes Ω(n) time.
     ▷ Insert, delete, and find using various open addressing and chaining schemes.
     ▷ Define various forms of the pigeonhole principle; recognize and solve the specific types of counting and hashing problems to which they apply.

  4. Reminder: Dictionary ADT
     Dictionary operations
     ▷ create
     ▷ destroy
     ▷ insert
     ▷ find
     ▷ delete
     Example contents:
       key       value
       Multics   MULTiplexed Information and Computing Service
       Unics     single-user Multics
       Unix      multi-user Unics
       GNU       GNU’s Not Unix
     Example calls:
     ▷ insert(Linux, Linus Torvalds’ Unix)
     ▷ find(Unix)
     Stores values associated with user-specified keys
     ▷ values may be any type
     ▷ keys must be comparable

  5. Implementations so far
     Worst-case runtimes
                        insert      delete      find
       Unsorted list    O(1)        Θ(n)        Θ(n)
       Balanced Trees   Θ(log n)    Θ(log n)    Θ(log n)

  6. Implementations so far
     Worst-case runtimes
                                                  insert      delete      find
       Unsorted list                              O(1)        Θ(n)        Θ(n)
       Balanced Trees                             Θ(log n)    Θ(log n)    Θ(log n)
       Special case: keys in {0, 1, ..., m − 1}   O(1)        O(1)        O(1)
     Can we get O(1) insert/find/delete for any key type?
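The special case in the last table row can be sketched as a direct-address table: the key itself is the array index, so every operation is a single array access. This is a minimal illustration, not code from the course; the class name `DirectAddressTable` and its member names are made up for this sketch, and values are assumed to be strings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical direct-address table for keys in {0, 1, ..., m - 1}.
class DirectAddressTable {
public:
    explicit DirectAddressTable(std::size_t m) : values_(m), present_(m, false) {}

    // O(1): store value in the slot named by the key.
    void insert(std::size_t key, std::string value) {
        assert(key < values_.size());
        values_[key] = std::move(value);
        present_[key] = true;
    }

    // O(1): pointer to the stored value, or nullptr if the slot is empty.
    const std::string* find(std::size_t key) const {
        assert(key < values_.size());
        return present_[key] ? &values_[key] : nullptr;
    }

    // O(1): mark the slot as empty.
    void erase(std::size_t key) {
        assert(key < values_.size());
        present_[key] = false;
    }

private:
    std::vector<std::string> values_;
    std::vector<bool> present_;
};
```

The cost is memory proportional to m, the size of the key range, regardless of how many keys are actually stored.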

  7. Hash Table Goal
     We can do: a[2]=“GNU’s Not Unix”
     We want to do: a[“GNU”]=“GNU’s Not Unix”
     [Figure: an array with indices 0 to m − 1 next to the keys Multics, Linux, GNU, Unix, and Unics; the entry for GNU holds “GNU’s Not Unix”.]

  8. Hash table approach
     Use a hash function to map keys to indices.
     hash(“GNU”) = 2
     [Figure: the keys GNU, Linux, Multics, Unics, and Unix are passed through the hash function into a hash table with slots 0 to m − 1; slot 2 holds “GNU’s Not Unix”.]

  9. Collisions
     A collision occurs when two different keys x and y map to the same index, hash(x) = hash(y).
     [Figure: the keys GNU, Linux, Multics, Unics, Unix, and Mac OS X are passed through the hash function into a hash table with slots 0 to m − 1; slot 2 holds “GNU’s Not Unix”.]
     Can we prevent collisions?

  10. Hash table: find (first try)
        Value &find(Key &key) {
          int index = hash(key) % m;
          return HashTable[index];
        }
      What should the hash function, hash, be? What should the table size, m, be? What do we do about collisions?

  11. Good hash function properties
      Using knowledge of the kind and number of keys to be stored, we choose our hash function so that it:
      ▷ is fast to compute, and
      ▷ causes few collisions (we hope).
      Numeric keys: We might use hash(x) = x mod m with m a prime number larger than the number of keys we expect to store. Why a prime number?
      Example: hash(x) = x mod 7, with a table of m = 7 slots numbered 0 to 6:
      insert(4), insert(17), find(12), insert(9), delete(17)
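To make the mod-7 example concrete, and to hint at why a prime table size helps, here is a small sketch. The helper names `hash_mod` and `distinct_slots` are invented for this illustration, and keys are assumed to be non-negative.

```cpp
#include <cassert>
#include <set>
#include <vector>

// hash(x) = x mod m for non-negative integer keys.
int hash_mod(int x, int m) {
    assert(x >= 0 && m > 0);
    return x % m;
}

// Number of distinct slots that a collection of keys lands in.
int distinct_slots(const std::vector<int>& keys, int m) {
    std::set<int> slots;
    for (int k : keys) slots.insert(hash_mod(k, m));
    return static_cast<int>(slots.size());
}
```

With m = 7 the keys from the slide land in slots 4 (key 4), 3 (key 17), 5 (key 12), and 2 (key 9). For keys that share a factor with the table size the picture changes: the seven keys 0, 4, 8, ..., 24 use only 2 of the 8 slots when m = 8, but all 7 slots when m = 7.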

  12. Hashing strings
      One option: Let string $s = s_0 s_1 s_2 \ldots s_{k-1}$ where each $s_i$ is an 8-bit character.
      $\mathrm{hash}(s) = s_0 + 256\, s_1 + 256^2 s_2 + \cdots + 256^{k-1} s_{k-1}$
      The hash function treats the string as a base-256 number.

  13. Hashing strings
      One option: Let string $s = s_0 s_1 s_2 \ldots s_{k-1}$ where each $s_i$ is an 8-bit character.
      $\mathrm{hash}(s) = s_0 + 256\, s_1 + 256^2 s_2 + \cdots + 256^{k-1} s_{k-1}$
      The hash function treats the string as a base-256 number.
      Problems
      ▷ hash(“really, really big”) = well... something really, really big
      ▷ hash(“anything”) mod 256 = hash(“anything else”) mod 256 (every term except $s_0$ is a multiple of 256, so only the first character matters)

  14. Hashing strings with Horner’s Rule
        int hash(string s) {
          int h = 0;
          for (int i = s.length() - 1; i >= 0; i--) {
            // unsigned char keeps 8-bit characters in 0..255
            h = (256*h + (unsigned char)s[i]) % m;
          }
          return h;
        }
      Compare that to the hash function from yacc:
        #define TABLE_SIZE 1024 // must be power of 2
        int hash(char *s) {
          int h = *s++;
          while (*s)
            h = (31 * h + *s++) & (TABLE_SIZE - 1);
          return h;
        }
      What’s different?
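For experimenting with the two functions side by side, the following self-contained sketch may help. It is an adaptation, not the slide code: the table size is passed as a parameter instead of being read from a global `m`, the names `horner_hash` and `yacc_style_hash` are invented here, and the yacc-style variant starts from h = 0 rather than from the first character (which gives the same result for non-empty strings when the table has more than 255 slots).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Horner's rule for (s[0] + 256*s[1] + ... + 256^(k-1)*s[k-1]) mod m.
// Reducing mod m at every step keeps the intermediate value small.
unsigned horner_hash(const std::string& s, unsigned m) {
    assert(m > 0);
    unsigned long long h = 0;
    for (std::size_t i = s.size(); i-- > 0; ) {
        h = (256 * h + static_cast<unsigned char>(s[i])) % m;
    }
    return static_cast<unsigned>(h);
}

// yacc-style variant: multiplier 31, characters consumed front to back,
// and a power-of-two table size so "mod" becomes a bit mask.
unsigned yacc_style_hash(const std::string& s, unsigned table_size = 1024) {
    assert(table_size > 0 && (table_size & (table_size - 1)) == 0);
    unsigned h = 0;
    for (unsigned char c : s) {
        h = (31 * h + c) & (table_size - 1);
    }
    return h;
}
```

Calling `horner_hash` with m = 256 on “anything” and on “anything else” returns the code of the first character in both cases, which is the collision problem from the previous slide. With multiplier 31 every character still influences the low bits, which is one reason a power-of-two table size is workable in the yacc version.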

  15. Hash Function Summary
      Goals of a hash function
      ▷ Fast to compute
      ▷ Cause few collisions
      Sample hash functions
      ▷ For numeric keys x, hash(x) = x mod m
      ▷ hash(s) = string as base-256 number mod m
      ▷ Multiplicative hash: hash(k) = ⌊m · frac(ka)⌋ where frac(x) is the fractional part of x and a = 0.6180339887 (for example).
      ▷ Universal hash: hash(k) = (a · k + b) mod m where a and b were chosen at random from [1, m − 1] and m prime.
      ▷ Cryptographically secure hash (such as SHA-1)
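The multiplicative hash from the list above can be sketched in a few lines. The function name `multiplicative_hash` is made up for this example, and it uses double-precision arithmetic, so it is only meant for small keys and table sizes where rounding is not a concern.

```cpp
#include <cassert>
#include <cmath>

// hash(k) = floor(m * frac(k * a)), with a = 0.6180339887 as on the slide.
int multiplicative_hash(unsigned k, int m, double a = 0.6180339887) {
    assert(m > 0);
    double product = k * a;
    double frac = product - std::floor(product);  // fractional part, in [0, 1)
    return static_cast<int>(std::floor(m * frac));
}
```

For m = 10 the keys 1, 2, and 3 map to 6, 2, and 8: consecutive keys are scattered across the table rather than placed in neighbouring slots.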

  16. Universal hash functions
      A set H of hash functions is universal if, for any two distinct keys x and y, the probability that hash(x) = hash(y) is at most 1/m when hash() is chosen at random from H.
      Example: Suppose $m = 2^b$ and keys are r bits long. Choose a random 0/1 matrix A of size b × r. hash(x) = A · x, with arithmetic mod 2.
      [Figure: worked example of A · x = hash(x) for a small 0/1 matrix A and a bit vector x.]
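A minimal sketch of the matrix construction, assuming keys of at most 64 bits and at most 32 output bits. The struct name `MatrixHash` and the choice of `std::mt19937_64` as the source of random bits are assumptions of this example, not part of the slides. Each row of A is stored as a bitmask, and output bit i is the parity of (row i AND x), which is exactly the matrix-vector product mod 2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// One randomly chosen member of the family: a b x r matrix of 0/1 entries.
struct MatrixHash {
    std::vector<std::uint64_t> rows;  // rows[i] holds row i of A as r bits

    MatrixHash(int b, int r, std::mt19937_64& rng) {
        assert(b >= 1 && b <= 32 && r >= 1 && r <= 64);
        std::uint64_t mask = (r == 64) ? ~0ull : ((1ull << r) - 1);
        for (int i = 0; i < b; ++i) rows.push_back(rng() & mask);
    }

    // hash(x) = A * x with arithmetic mod 2; the result has b bits.
    std::uint32_t operator()(std::uint64_t x) const {
        std::uint32_t h = 0;
        for (std::size_t i = 0; i < rows.size(); ++i) {
            std::uint64_t v = rows[i] & x;
            unsigned parity = 0;
            while (v) {          // flip parity once per set bit
                parity ^= 1u;
                v &= v - 1;      // clear the lowest set bit
            }
            h |= static_cast<std::uint32_t>(parity) << i;
        }
        return h;
    }
};
```

Because the map is linear over bits, hash(x XOR y) equals hash(x) XOR hash(y), and hash(0) is always 0.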

  17. General form of hash functions
      1. Map key to a sequence of bytes.
         ▷ Two equal sequences iff two equal keys.
         ▷ Easy. The key probably is a sequence of bytes already.
      2. Map sequence of bytes to an integer x.
         ▷ Changing bytes should cause apparently random changes to x.
         ▷ Hard. May be expensive. Cryptographic hash.
      3. Map x to a table index using x mod m.

  18. Collisions
      Pigeonhole principle: If more than m pigeons fly into m pigeonholes then some pigeonhole contains at least two pigeons.
      Corollary: If we hash n > m keys into m slots, at least two keys must collide (and collisions can already occur with fewer keys!).

  19. The Pigeonhole Principle
      Let X and Y be finite sets where |X| > |Y|. If f : X → Y, then $f(x_1) = f(x_2)$ for some $x_1 \neq x_2$.
      [Figure: a mapping from set X to set Y.]

  20. The Pigeonhole Principle: Example #0 Image from Wikipedia.

  21. The Pigeonhole Principle: Example #1 Suppose we have 5 colours of Halloween candy, and that there’s lots of candy in a bag. How many pieces of candy do we have to pull out of the bag if we want to be sure to get 2 of the same colour? a. 2 b. 4 c. 6 d. 8 e. None of these

  22. The Pigeonhole Principle: Example #2 If there are 1000 pieces of each colour, how many do we need to pull to guarantee that we’ll get 2 purple pieces of candy (assuming that purple is one of the 5 colours)? a. 2 b. 4 c. 6 d. 8 e. None of these

  23. The Pigeonhole Principle: Example #3 If 5 points are placed in a 6 cm × 8 cm rectangle, argue that there are two points that are not more than 5 cm apart. Hint: How long is this diagonal?

  24. The Pigeonhole Principle: Example #4
      Consider n + 1 distinct positive integers, each ≤ 2n. Show that one of them must divide one of the others.
      For example, if n = 4, consider the following sets: {1, 2, 3, 7, 8}, {2, 3, 4, 7, 8}, {2, 3, 5, 7, 8}.
      Hint: Any positive integer can be written as $2^k \cdot q$ where k ≥ 0 is an integer and q is odd. E.g., $129 = 2^0 \cdot 129$; $60 = 2^2 \cdot 15$.
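The claim in Example #4 can be checked by brute force for small n before attempting the proof. This sketch enumerates every subset of {1, ..., 2n} as a bitmask and tests the ones with n + 1 elements; the function names are invented for this illustration, and it is a sanity check rather than a proof.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True if some element of nums divides a different element of nums.
bool has_dividing_pair(const std::vector<int>& nums) {
    for (std::size_t i = 0; i < nums.size(); ++i)
        for (std::size_t j = 0; j < nums.size(); ++j)
            if (i != j && nums[j] % nums[i] == 0) return true;
    return false;
}

// Does every (n+1)-element subset of {1, ..., 2n} contain a dividing pair?
bool claim_holds(int n) {
    assert(n >= 1 && n <= 10);  // 2^(2n) subsets, so keep n small
    for (unsigned mask = 0; mask < (1u << (2 * n)); ++mask) {
        std::vector<int> subset;
        for (int v = 1; v <= 2 * n; ++v)
            if (mask & (1u << (v - 1))) subset.push_back(v);
        if (static_cast<int>(subset.size()) == n + 1 && !has_dividing_pair(subset))
            return false;
    }
    return true;
}
```

The bound n + 1 is tight: for n = 4 the set {5, 6, 7, 8} has only n elements and contains no dividing pair.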

  25. General Pigeonhole Principle
      Let X and Y be finite sets with |X| = n, |Y| = m, and k = ⌈n/m⌉. If f : X → Y then there exist k distinct values $x_1, x_2, \ldots, x_k \in X$ such that $f(x_1) = f(x_2) = \cdots = f(x_k)$.
      Informally: If n pigeons fly into m holes, at least one hole contains at least k = ⌈n/m⌉ pigeons.
      Proof: Assume there’s no such hole. Then every hole contains at most ⌈n/m⌉ − 1 pigeons, so there are at most (⌈n/m⌉ − 1) · m < (n/m) · m = n pigeons, a contradiction.
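In hashing terms, the general principle says that the fullest slot always holds at least ⌈n/m⌉ keys, no matter which hash function is used. A small sketch, assuming non-negative integer keys and the hash x mod m; `ceil_div` and `max_load` are names invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// ceil(n / m) for positive m, using integer arithmetic only.
int ceil_div(int n, int m) {
    assert(n >= 0 && m > 0);
    return (n + m - 1) / m;
}

// Size of the fullest slot when non-negative keys are hashed with x mod m.
int max_load(const std::vector<int>& keys, int m) {
    assert(m > 0);
    std::vector<int> load(m, 0);
    for (int k : keys) ++load[k % m];
    return *std::max_element(load.begin(), load.end());
}
```

Hashing the ten keys 0 to 9 into m = 3 slots gives a fullest slot of exactly ⌈10/3⌉ = 4 keys, the best possible. Hashing ten multiples of 3 into the same table puts all ten keys into slot 0, which shows how far above the bound a bad match between keys and hash function can be.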

  26. Pigeonhole Principle: Example #5
      Show that in a group of 6 people, where each two people are either friends or enemies (i.e. they can’t be “neutral”), there must be either 3 pairwise friends or 3 pairwise enemies.
      Proof: Let A be one of the 6 people. A has at least 3 friends or at least 3 enemies by the general pigeonhole principle because ⌈5/2⌉ = 3 (5 people into 2 holes, friend or enemy).
      Suppose A has ≥ 3 friends (the enemies case is similar) and call three of them B, C, and D. If (B, C) or (C, D) or (B, D) are friends then we’re done, because those two friends together with A form a triple of friends. Otherwise (B, C) and (C, D) and (B, D) are all enemies, and B, C, D form a triple of enemies.
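Since 6 people have only 15 pairs, the statement in Example #5 can also be confirmed by checking all 2^15 friend/enemy assignments. The sketch below does that; the encoding (one bit per pair, set meaning friends) and the function names are choices made for this illustration and do not replace the proof.

```cpp
#include <cassert>

// mask assigns friend (bit set) or enemy (bit clear) to each of the 15 pairs.
// Returns true if some three people are pairwise friends or pairwise enemies.
bool has_uniform_triple(unsigned mask) {
    assert(mask < (1u << 15));
    bool friends[6][6] = {};
    int bit = 0;
    for (int i = 0; i < 6; ++i)
        for (int j = i + 1; j < 6; ++j, ++bit)
            friends[i][j] = friends[j][i] = ((mask >> bit) & 1u) != 0;
    for (int a = 0; a < 6; ++a)
        for (int b = a + 1; b < 6; ++b)
            for (int c = b + 1; c < 6; ++c)
                if (friends[a][b] == friends[b][c] && friends[b][c] == friends[a][c])
                    return true;
    return false;
}

// Checks every possible friend/enemy assignment among 6 people.
bool every_group_of_six_has_triple() {
    for (unsigned mask = 0; mask < (1u << 15); ++mask)
        if (!has_uniform_triple(mask)) return false;
    return true;
}
```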

  27. Collision Resolution
      Birthday Paradox: With probability > 1/2, two people in a room of 23 have the same birthday.
      General birthday paradox: Even if we randomly hash only $\sqrt{2m}$ keys into m slots, we get a collision with probability > 1/2.
      Collision: Unless we know all the keys in advance and design a perfect hash function, we must handle collisions. What do we do when two keys hash to the same entry?
      ▷ separate chaining: store multiple items in each entry
      ▷ open addressing: pick a next entry to try
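The birthday numbers can be reproduced directly. Assuming each key is hashed independently and uniformly at random, the probability of no collision among k keys is the product of (m − i)/m for i = 0, ..., k − 1. The function name `collision_probability` is invented for this sketch.

```cpp
#include <cassert>

// Probability that at least two of k keys, each hashed uniformly at random
// into one of m slots, land in the same slot.
double collision_probability(int k, int m) {
    assert(k >= 0 && m > 0);
    double no_collision = 1.0;
    for (int i = 0; i < k && no_collision > 0.0; ++i) {
        no_collision *= static_cast<double>(m - i) / m;  // key i avoids i used slots
    }
    return 1.0 - no_collision;
}
```

With m = 365 the probability first exceeds 1/2 at k = 23, and at k = 28 (the first integer above √730 ≈ 27.02) it is well above 1/2.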

  28. Hashing with Chaining
      Store multiple items in each entry. How?
      ▷ Common choice is an unordered linked list (a chain).
      ▷ Could use any dictionary ADT implementation.
      Result
      ▷ Can hash more than m items into a table of size m.
      ▷ Performance depends on the length of the chains.
      ▷ Memory is allocated on each insertion.
      [Figure: a table with slots 0 to 6; slot 1 holds the chain A → D, slot 3 holds the chain E → B, and slot 6 holds C.]
      hash(A) = hash(D) = 1
      hash(E) = hash(B) = 3
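A minimal sketch of separate chaining, assuming string keys and values, `std::hash` as the hash function, and one `std::list` per slot as the chain. The class name `ChainedTable` and its interface are made up for this illustration and are not taken from the course code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

class ChainedTable {
public:
    explicit ChainedTable(std::size_t m) : buckets_(m) { assert(m > 0); }

    // Insert a new key, or overwrite the value of an existing one.
    void insert(const std::string& key, const std::string& value) {
        auto& chain = buckets_[index(key)];
        for (auto& entry : chain)
            if (entry.first == key) { entry.second = value; return; }
        chain.emplace_back(key, value);  // allocates one new list node
    }

    // Pointer to the stored value, or nullptr if the key is absent.
    const std::string* find(const std::string& key) const {
        for (const auto& entry : buckets_[index(key)])
            if (entry.first == key) return &entry.second;
        return nullptr;
    }

    // Remove the key if present; returns true if something was removed.
    bool erase(const std::string& key) {
        auto& chain = buckets_[index(key)];
        for (auto it = chain.begin(); it != chain.end(); ++it)
            if (it->first == key) { chain.erase(it); return true; }
        return false;
    }

private:
    std::size_t index(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }
    std::vector<std::list<std::pair<std::string, std::string>>> buckets_;
};
```

A table built with m = 2 still accepts five keys, as the slide notes, but every find then walks a chain of two or more entries on average, so the cost of each operation grows with the chain length.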
