Lecture #2: Advanced Hashing and Concentration Bounds


  1. Outline
     Lecture #2: Advanced hashing and concentration bounds
     o Bloom filters
     o Cuckoo hashing
     o Load balancing
     o Tail bounds

  2. Bloom filters
     Idea: For the sake of efficiency, sometimes we allow our data structure to make mistakes.
     Bloom filter: a hash table that has only false positives (it may report that a key is present when it is not, but it always reports a key that is present). Very simple and fast.
     Example: Google Chrome uses a Bloom filter to maintain its list of potentially malicious web sites.
     - Most queried keys are not in the table.
     - If a key is in the table, it can be checked against a slower (errorless) hash table.
     Many applications in networking (see the survey by Broder and Mitzenmacher).

  3. Bloom filters
     Data structure: Universe 𝒰. Parameters k, M β‰₯ 1.
     Maintain an array A of M bits; initially A[0] = A[1] = β‹― = A[M βˆ’ 1] = 0.
     Choose k hash functions h_1, h_2, …, h_k : 𝒰 β†’ [M] (assume completely random functions for the sake of analysis).
     To add a key x ∈ 𝒰 to the dictionary S βŠ† 𝒰, set the bits A[h_1(x)] ≔ 1, A[h_2(x)] ≔ 1, …, A[h_k(x)] ≔ 1.
     To answer a query β€œx ∈ S?”, check whether A[h_i(x)] = 1 for all i = 1, 2, …, k. If yes, answer Yes; if no, answer No.
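To make this concrete, here is a minimal Python sketch of the data structure (not from the slides; the class name and the use of Python's seeded built-in hash to stand in for the completely random h_1, …, h_k are assumptions):

```python
import random

class BloomFilter:
    """Minimal sketch of a Bloom filter; names and structure are assumptions."""

    def __init__(self, M, k):
        self.M, self.k = M, k
        self.bits = [0] * M                      # the bit array A, initially all 0
        # salt Python's built-in hash with k fixed random seeds to play the
        # role of the k "completely random" hash functions h_1, ..., h_k
        self.seeds = [random.randrange(2**32) for _ in range(k)]

    def _locations(self, x):
        # h_1(x), ..., h_k(x), each a position in [M]
        return [hash((seed, x)) % self.M for seed in self.seeds]

    def add(self, x):
        for loc in self._locations(x):
            self.bits[loc] = 1                   # set A[h_i(x)] := 1

    def query(self, x):
        # "x in S?": Yes iff A[h_i(x)] = 1 for all i (false positives possible)
        return all(self.bits[loc] == 1 for loc in self._locations(x))
```

For example, after B = BloomFilter(8000, 6) and B.add("key"), B.query("key") always returns True, while a key that was never added returns True only with small probability.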

  4. Bloom filters
     No false negatives: clearly, if x ∈ S, we return Yes. But there is some chance that other keys have caused the bits in positions h_1(x), …, h_k(x) to be set even if x βˆ‰ S.
     Heuristic analysis: Let us assume that |S| = N, and compute β„™[A[β„“] = 0] for some location β„“ ∈ [M]:
     p(k, N) = (1 βˆ’ 1/M)^{kN} β‰ˆ e^{βˆ’kN/M}
     (Here we use the approximation (1 βˆ’ 1/M)^M β‰ˆ e^{βˆ’1} for M large enough.)
     If each location in A is 0 with probability p(k, N), then a false positive for x βˆ‰ S should happen with probability at most
     (1 βˆ’ p(k, N))^k β‰ˆ (1 βˆ’ e^{βˆ’kN/M})^k.
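To see the heuristic in action, here is a small experiment (a sketch reusing the BloomFilter class above; the parameter choices are illustrative assumptions) comparing the observed false positive rate against (1 βˆ’ e^{βˆ’kN/M})^k:

```python
import math

M, k, N = 8000, 6, 1000
bf = BloomFilter(M, k)                    # class from the sketch above
for i in range(N):
    bf.add(("member", i))                 # insert N keys, so |S| = N

trials = 100_000
false_pos = sum(bf.query(("other", i)) for i in range(trials))
print("observed :", false_pos / trials)
print("heuristic:", (1 - math.exp(-k * N / M)) ** k)   # about 0.022
```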

  5. Bloom filters
     But the actual fraction of 0's in the hash table is a random variable X_{k,N} with expectation 𝔼[X_{k,N}] = p(k, N). To get the analysis right, we need a concentration bound: we want to say that X_{k,N} is close to its expected value with high probability. [We will return to this in the 2nd half of the lecture.]
     If the heuristic analysis is correct, it gives nice estimates: for instance, if M = 8N, then choosing k = 6 (near the optimum k = (M/N) ln 2 β‰ˆ 5.5) gives a false positive rate of about 2%.
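As a quick sanity check on that estimate (a sketch, not part of the original slides), one can tabulate the heuristic bound (1 βˆ’ e^{βˆ’kN/M})^k for M = 8N:

```python
import math

M_over_N = 8                                   # table of M = 8N bits
for k in range(1, 11):
    fp = (1 - math.exp(-k / M_over_N)) ** k    # heuristic false positive rate
    print(k, f"{fp:.4f}")
# minimized near k = (M/N) ln 2 ~ 5.5; k = 5 or 6 gives about 2.2%
```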

  6. Outline
     Lecture #2: Advanced hashing and concentration bounds
     o Bloom filters
     o Cuckoo hashing
     o Load balancing
     o Tail bounds
     Cuckoo hashing is a hashing scheme with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches; analogously, inserting a new key into a cuckoo hash table may push an older key to a different location in the table.

  7. Cuckoo hashing
     Idea: simple hashing without errors.
     o Lookups take worst-case O(1) time.
     o Deletions take worst-case O(1) time.
     o Insertions take expected O(1) time.
     o Insertion time is O(1) with good probability [this will require a concentration bound].

  8. Cuckoo hashing
     Data structure: Two tables A_1 and A_2, both of size M = O(N), and two hash functions h_1, h_2 : 𝒰 β†’ [M] (we will assume the hash functions are fully random).
     When an element x ∈ S is inserted, if either A_1[h_1(x)] or A_2[h_2(x)] is empty, store x there.
     Bump: If both locations are occupied, then place x in A_1[h_1(x)] and bump the current occupant. Whenever an element z is bumped from A_i[h_i(z)], attempt to store it in the other location A_j[h_j(z)] (here (i, j) = (1, 2) or (2, 1)).
     Abort: After 6 log N consecutive bumps, stop the process and build a fresh hash table using new random hash functions h_1, h_2. (A code sketch of this procedure follows below.)
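Here is a possible Python sketch of this insert-with-bumping procedure (not from the slides; the class name, the seeded stand-in hash functions, and the rebuild details are assumptions):

```python
import math
import random

class CuckooHashTable:
    """Sketch of two-table cuckoo hashing; names and details are assumptions."""

    def __init__(self, M):
        self.M = M
        self.n = 0                                # number of keys stored
        self.tables = [[None] * M, [None] * M]    # A_1 and A_2
        self._fresh_hashes()

    def _fresh_hashes(self):
        # random seeds stand in for the fully random h_1, h_2 : U -> [M]
        self.seeds = [random.randrange(2**32), random.randrange(2**32)]

    def _h(self, j, x):
        return hash((self.seeds[j], x)) % self.M

    def lookup(self, x):
        # worst-case O(1): x can only live in one of its two locations
        return any(self.tables[j][self._h(j, x)] == x for j in (0, 1))

    def insert(self, x):
        self.n += 1
        # if either of x's two locations is empty, store it there
        for j in (0, 1):
            if self.tables[j][self._h(j, x)] is None:
                self.tables[j][self._h(j, x)] = x
                return
        # both occupied: place x in A_1[h_1(x)] and keep bumping;
        # abort after ~6 log N consecutive bumps
        j = 0
        for _ in range(int(6 * math.log(self.n)) + 1):
            loc = self._h(j, x)
            x, self.tables[j][loc] = self.tables[j][loc], x   # bump occupant
            j = 1 - j                                         # it tries its other table
            loc = self._h(j, x)
            if self.tables[j][loc] is None:
                self.tables[j][loc] = x
                return
        self._rebuild(x)

    def _rebuild(self, pending):
        # abort rule: rehash everything with fresh random hash functions
        items = [y for t in self.tables for y in t if y is not None] + [pending]
        self.tables = [[None] * self.M, [None] * self.M]
        self.n = 0
        self._fresh_hashes()
        for y in items:
            self.insert(y)
```

The rebuild on abort simply reinserts every stored key under fresh hash functions, matching the abort rule above.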

  9. Cuckoo hashing
     Alternately (as in the picture, not shown here), we can use a single table with 2M entries and two hash functions h_1, h_2 : 𝒰 β†’ [2M] (with the same β€œbumping” algorithm). Arrows represent the alternate location for each key. If we insert an item at the location of A, it will get bumped, thereby bumping B, and then we are done.
     Cycles are possible (where the insertion process never completes). What's an example? (For instance, if three keys all hash to the same two cells, then three keys compete for two cells and the bumping can never terminate.)

  10. Cuckoo hashing
     Recall the data structure: two tables A_1 and A_2, both of size M = O(N), and two fully random hash functions h_1, h_2 : 𝒰 β†’ [M].
     Theorem: The expected time to perform an insert operation is O(1) if M β‰₯ 4N.
     Pretty good… but only 25% memory utilization. One can actually get about 50% memory utilization. Experimentally, with 3 hash functions instead of 2, one can get β‰ˆ 90% utilization, but it is an open question to provide tight analyses for t hash functions when t β‰₯ 3.

  11. Outline
     Lecture #2: Advanced hashing and concentration bounds
     o Bloom filters
     o Cuckoo hashing
     o Load balancing
     o Tail bounds

  12. Load balancing
     Suppose we have N jobs to assign to N servers. Clearly we could achieve a load of one job per server, but this might result in an expensive or hard-to-parallelize allocation rule.
     Instead, we could hash the balls (jobs) into bins (servers). Let's again consider the case of a uniformly random hash function h : [N] β†’ [N].
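As a quick illustration (a sketch, not from the slides), one can simulate the maximum server load under such a uniformly random assignment; it grows very slowly with N, and the tail bounds later in the lecture are what make this precise:

```python
import random
from collections import Counter

def max_load(N):
    # assign each of N jobs to a uniformly random server in [N]
    loads = Counter(random.randrange(N) for _ in range(N))
    return max(loads.values())

for N in (10**2, 10**4, 10**6):
    print(N, max_load(N))   # the maximum load grows very slowly with N
```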
