outline Lecture #2: Advanced hashing and concentration bounds
o Bloom filters
o Cuckoo hashing
o Load balancing
o Tail bounds
Bloom filters
Idea: For the sake of efficiency, we sometimes allow our data structure to make mistakes.
Bloom filter: A hash table that has only false positives (it may report that a key is present when it is not, but it always reports a key that is present). Very simple and fast.
Example: Google Chrome uses a Bloom filter to maintain its list of potentially malicious web sites.
- Most queried keys are not in the table.
- If a key is in the table, the answer can be checked against a slower (errorless) hash table.
Many applications in networking (see the survey by Broder and Mitzenmacher).
Bloom filters
Data structure: Universe U. Parameters m, k >= 1.
Maintain an array B of m bits; initially B[0] = B[1] = ... = B[m-1] = 0.
Choose k hash functions h_1, h_2, ..., h_k : U -> [m] (assume completely random functions for the sake of analysis).
To add a key y in U to the dictionary S, set the bits B[h_1(y)] <- 1, B[h_2(y)] <- 1, ..., B[h_k(y)] <- 1.
To answer a query "y in S?": check whether B[h_i(y)] = 1 for all i = 1, 2, ..., k. If yes, answer Yes. If no, answer No.
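The add/query procedure above can be sketched in Python (a minimal illustration; the class name and the trick of simulating the k hash functions by salting SHA-256 are our own choices, not from the lecture):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k-bit fingerprint in an m-bit array."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m  # B[0] = ... = B[m-1] = 0

    def _positions(self, y):
        # Simulate k hash functions h_1, ..., h_k : U -> [m] by salting one hash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{y}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, y):
        for pos in self._positions(y):
            self.bits[pos] = 1  # set B[h_i(y)] <- 1

    def query(self, y):
        # Answer Yes iff B[h_i(y)] = 1 for all i.
        # May give a false positive, but never a false negative.
        return all(self.bits[pos] for pos in self._positions(y))
```

Note that deleting a key is not supported: clearing its bits could also clear bits shared with other keys.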
Bloom filters
No false negatives: Clearly if y in S, we return Yes. But there is some chance that other keys have caused the bits in positions h_1(y), ..., h_k(y) to be set even if y is not in S.
Heuristic analysis: Let us assume that |S| = n. Compute P[B[l] = 0] for some location l in [m]:
p(n, m) = (1 - 1/m)^{kn} ~ e^{-kn/m}
(Here we use the approximation (1 - 1/m)^m ~ e^{-1} for m large enough.)
If each location in B is 0 with probability p(n, m), then a false positive for y not in S should happen with probability at most
(1 - p(n, m))^k ~ (1 - e^{-kn/m})^k
Bloom filters
Heuristic analysis: If each location in B is 0 with probability p(n, m), then a false positive for y not in S should happen with probability at most
(1 - p(n, m))^k ~ (1 - e^{-kn/m})^k
But the actual fraction of 0's in the hash table is a random variable Z_{n,m} with expectation E[Z_{n,m}] = p(n, m).
To get the analysis right, we need a concentration bound: we want to say that Z_{n,m} is close to its expected value with high probability. [We will return to this in the 2nd half of the lecture.]
If the heuristic analysis is correct, it gives nice estimates: for instance, if m = 8n, then choosing the optimal value k = 6 gives a false positive rate of about 2%.
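As a sanity check on the heuristic (our own calculation, not part of the slides), we can evaluate the estimate (1 - e^{-kn/m})^k with m = 8n for each small k and see which one minimizes the false positive rate:

```python
import math

def bloom_fp_rate(n, m, k):
    """Heuristic false-positive rate (1 - e^{-kn/m})^k."""
    return (1 - math.exp(-k * n / m)) ** k

# With m = 8n bits per key, scan small values of k.
rates = {k: bloom_fp_rate(1, 8, k) for k in range(1, 11)}
best_k = min(rates, key=rates.get)
print(best_k, round(rates[best_k], 4))  # best k and its rate (~2%)
```

The continuous optimum is k = (m/n) ln 2 ~ 5.5, and among integers k = 6 edges out k = 5 (both give roughly 2.2%).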
outline Lecture #2: Advanced hashing and concentration bounds
o Bloom filters
o Cuckoo hashing
o Load balancing
o Tail bounds

Cuckoo hashing is a hashing scheme with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches; analogously, inserting a new key into a cuckoo hash table may push an older key to a different location in the table.
Cuckoo hashing
Idea: Simple hashing without errors.
- Lookups are worst-case O(1) time.
- Deletions are worst-case O(1) time.
- Insertions are expected O(1) time.
- Insertion time is O(1) with good probability [will require a concentration bound].
Cuckoo hashing
Data structure: Two tables B_1 and B_2, both of size m = O(n). Two hash functions h_1, h_2 : U -> [m] (we will assume the hash functions are fully random).
When an element y in U is inserted, if either B_1[h_1(y)] or B_2[h_2(y)] is empty, store y there.
Bump: If both locations are occupied, then place y in B_1[h_1(y)] and bump the current occupant. Whenever an element z is bumped from B_i[h_i(z)], attempt to store it in the other location B_j[h_j(z)] (here (i, j) = (1, 2) or (2, 1)).
Abort: After 6 log n consecutive bumps, stop the process and build a fresh hash table using new random hash functions h_1, h_2.
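The insert/bump/abort procedure can be sketched as follows (an illustrative Python sketch, not the lecture's code; the salted-hash stand-in for fully random functions and using 6 log m instead of 6 log n for the abort threshold are our simplifications):

```python
import random

class CuckooHash:
    """Two-table cuckoo hashing sketch with bump-and-abort insertion."""

    def __init__(self, m):
        self.m = m
        self.tables = [[None] * m, [None] * m]
        self._new_hashes()

    def _new_hashes(self):
        # Stand-in for "fully random" h_1, h_2: salted tuple hashing.
        seeds = [random.randrange(2**32) for _ in range(2)]
        self.h = [lambda y, s=s: hash((s, y)) % self.m for s in seeds]

    def lookup(self, y):
        # Worst-case O(1): y can only live in one of two locations.
        return self.tables[0][self.h[0](y)] == y or self.tables[1][self.h[1](y)] == y

    def insert(self, y):
        if self.lookup(y):
            return
        # If either location is empty, store y there.
        for i in (0, 1):
            if self.tables[i][self.h[i](y)] is None:
                self.tables[i][self.h[i](y)] = y
                return
        # Otherwise place y in B_1[h_1(y)] and bump occupants back and forth,
        # aborting after ~6 log m consecutive bumps.
        i = 0
        for _ in range(6 * self.m.bit_length()):
            pos = self.h[i](y)
            self.tables[i][pos], y = y, self.tables[i][pos]  # place y, bump occupant
            if y is None:
                return
            i = 1 - i  # the bumped key tries its slot in the other table
        self._rehash(y)

    def _rehash(self, pending):
        # Abort: rebuild everything with fresh random hash functions.
        old = [x for t in self.tables for x in t if x is not None] + [pending]
        self.tables = [[None] * self.m, [None] * self.m]
        self._new_hashes()
        for x in old:
            self.insert(x)
```

Deletion (not shown) is also worst-case O(1): simply clear whichever of the two locations holds the key.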
Cuckoo hashing
Alternately (as in the picture), we can use a single table with 2m entries and two hash functions h_1, h_2 : U -> [2m] (with the same "bumping" algorithm). Arrows represent the alternate location for each key. If we insert an item at the location of B, it will get bumped, thereby bumping C, and then we are done.
Cycles are possible (where the insertion process never completes). What's an example?
Cuckoo hashing
Data structure: Two tables B_1 and B_2, both of size m = O(n). Two hash functions h_1, h_2 : U -> [m] (we will assume the hash functions are fully random).
Theorem: The expected time to perform an insert operation is O(1) if m >= 4n.
Pretty good... but only 25% memory utilization. Can actually get about 50% memory utilization. Experimentally, with 3 hash functions instead of 2, one can get ~90% utilization, but it is an open question to provide tight analyses for k hash functions when k >= 3.
outline Lecture #2: Advanced hashing and concentration bounds
o Bloom filters
o Cuckoo hashing
o Load balancing
o Tail bounds
Load balancing
Suppose we have n jobs to assign to n servers. Clearly we could achieve a load of one job per server, but this might result in an expensive or hard-to-parallelize allocation rule.
Instead, we could hash the jobs into servers, like balls into bins. Let's again consider the case of a uniformly random hash function h : [n] -> [n].
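A quick balls-into-bins simulation (our own, with arbitrary parameters) shows what load this random assignment produces; the classical fact is that the maximum load is Theta(log n / log log n) with high probability, far above the average load of 1:

```python
import random
from collections import Counter

def max_load(n, trials=200, seed=0):
    """Empirical max bin load when n balls are hashed uniformly into n bins.

    Returns (average max load over trials, worst max load seen)."""
    rng = random.Random(seed)
    total, worst = 0, 0
    for _ in range(trials):
        loads = Counter(rng.randrange(n) for _ in range(n))
        m = max(loads.values())
        total += m
        worst = max(worst, m)
    return total / trials, worst

avg, worst = max_load(1000)
print(avg, worst)  # for n = 1000, the max load is typically around 5-7
```

This gap between average and maximum load is exactly what the tail bounds later in the lecture quantify.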