Bloom Filters (Simon S. Lam), 2/16/2017

References
A. Broder and M. Mitzenmacher, “Network applications of Bloom filters: A survey,” Internet Mathematics, vol. 1, no. 4, pp. 485-509, 2004.
Li Fan, Pei Cao, Jussara Almeida, and Andrei Broder, “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol,” IEEE/ACM Transactions on Networking, vol. 8, no. 3, June 2000.
  o Origin of counting Bloom filters

Origin and applications
Randomized data structure introduced by Burton Bloom [CACM 1970]
  o It represents a set for membership queries, with false positives
  o Probability of false positive can be controlled by design parameters
  o When space efficiency is important, a Bloom filter may be used if the effect of false positives can be mitigated.
First applications in dictionaries and databases

First application in networking: distributed cache (2000)
[Figure: three cooperating proxies. Proxy 1 holds Cache 1 plus Summary 2 and Summary 3; Proxy 2 holds Cache 2 plus Summary 1 and Summary 3; Proxy 3 holds Cache 3 plus Summary 1 and Summary 2.]
Numerous applications in networking since 2000

Standard Bloom Filter
A Bloom filter is an array of m bits representing a set $S = \{x_1, x_2, \ldots, x_n\}$ of n elements
  o Array set to 0 initially
k independent hash functions $h_1, \ldots, h_k$ with range $\{1, 2, \ldots, m\}$
  o Assume that each hash function maps each item in the universe to a random number uniformly over the range $\{1, 2, \ldots, m\}$
For each element x in S, the bit $h_i(x)$ in the array is set to 1, for $1 \le i \le k$
  o A bit in the array may be set to 1 multiple times for different elements

A Bloom filter example (three hash functions)
[Figure: $X_1$ and $X_2$ are inserted into the bit array; $Y_1$ and $Y_2$ are then checked for membership.]

Standard Bloom Filter (cont.)
To check membership of y in S, check whether the bits $h_i(y)$, $1 \le i \le k$, are all set to 1
  o If not, y is definitely not in S
  o Else, we conclude that y is in S, but sometimes this conclusion is wrong (false positive)
For many applications, false positives are acceptable as long as the probability of a false positive is small enough
We will assume that kn < m

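The insert and membership rules above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' implementation: the class name `BloomFilter`, the use of salted SHA-256 to stand in for k independent hash functions, and the 0-based bit positions (the slides use the range 1 to m) are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter with m bits and k hash functions (sketch)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # array set to 0 initially

    def _positions(self, item):
        # Assumption: the k hash functions are simulated by salting SHA-256
        # with the index i; positions are 0-based, in range(m).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set bit h_i(x) to 1 for every hash function.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True if all k bits are 1; this may be a false positive.
        # False means the item is definitely not in the set.
        return all(self.bits[pos] for pos in self._positions(item))
```

For example, after `bf = BloomFilter(64, 3)` and `bf.add("x1")`, the test `"x1" in bf` always succeeds, while a lookup of an item that was never added usually fails but can occasionally succeed.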
False positive probability
After all members of S have been hashed to a Bloom filter, the probability that a specific bit is still 0 is
$$p' = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} = p$$
For a non-member, it may be found to be a member of S (all of its k bits are nonzero) with false positive probability
$$(1 - p')^k \approx (1 - p)^k$$

False positive probability (cont.)
Define
$$f' = (1 - p')^k = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k$$
$$f = (1 - p)^k = \left(1 - e^{-kn/m}\right)^k$$
Two competing forces as k increases
  o Larger k -> $(1 - p')^k$ is smaller for a fixed p'
  o Larger k -> $p' = (1 - 1/m)^{kn}$ is smaller -> 1 - p' larger

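To see how close the approximation f is to the precise value f', both formulas can be evaluated directly. The function names and the sample parameters (m = 8000, n = 1000, k = 5) below are illustrative choices, not values from the slides.

```python
import math


def fp_exact(m, n, k):
    """f' = (1 - (1 - 1/m)^(kn))^k, the precise false positive probability."""
    p_prime = (1 - 1 / m) ** (k * n)
    return (1 - p_prime) ** k


def fp_approx(m, n, k):
    """f = (1 - e^(-kn/m))^k, the usual approximation."""
    p = math.exp(-k * n / m)
    return (1 - p) ** k


if __name__ == "__main__":
    # Hypothetical sizes: 8 bits per member, 5 hash functions.
    print(fp_exact(8000, 1000, 5), fp_approx(8000, 1000, 5))
```

For these sizes the two values agree to several decimal places, which is why the later slides work mostly with f.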
False positive rate vs. k
[Figure: false positive rate plotted against k, for m/n = 8 bits per member.]

Optimal number k from derivative
Rewrite f as
$$f = \exp\left(\ln\left(1 - e^{-kn/m}\right)^k\right) = \exp\left(k \ln\left(1 - e^{-kn/m}\right)\right)$$
Let $g = k \ln\left(1 - e^{-kn/m}\right)$. Minimizing g will minimize $f = \exp(g)$.
$$\frac{\partial g}{\partial k} = \ln\left(1 - e^{-kn/m}\right) + \frac{k}{1 - e^{-kn/m}} \cdot \frac{\partial \left(1 - e^{-kn/m}\right)}{\partial k}$$
$$= \ln\left(1 - e^{-kn/m}\right) + \frac{kn}{m} \cdot \frac{e^{-kn/m}}{1 - e^{-kn/m}}$$
$$= -\ln(2) + \ln(2) = 0 \quad \text{if we plug in } k = (m/n) \ln 2, \text{ which is optimal}$$
(It is in fact a global optimum)

Optimal k from symmetry
Alternatively, from $p = e^{-kn/m}$ we get
$$k = -\frac{m}{n} \ln(p)$$
From previous slide, we have
$$g = k \ln\left(1 - e^{-kn/m}\right) = -\frac{m}{n} \ln(p) \ln(1 - p)$$
From above, symmetry indicates that the minimum value for g occurs when p = 1/2. Thus
$$k_{opt} = -\frac{m}{n} \ln(1/2) = \frac{m}{n} \ln(2)$$

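The symmetry argument can be checked numerically by evaluating g over a grid of p values. This is only a sanity check under assumed values (m/n = 8 and a grid step of 0.001); the names `g`, `grid`, and `p_min` are introduced here for illustration.

```python
import math


def g(p, bits_per_entry):
    """g = -(m/n) ln(p) ln(1 - p), with bits_per_entry = m/n."""
    return -bits_per_entry * math.log(p) * math.log(1 - p)


# Scan p over (0, 1) and locate the minimum of g.
grid = [i / 1000 for i in range(1, 1000)]
p_min = min(grid, key=lambda p: g(p, 8))

# Recover k from p = e^(-kn/m), i.e. k = -(m/n) ln(p).
k_at_min = -8 * math.log(p_min)
```

On this grid the minimum falls at p = 0.5, and the corresponding k equals 8 ln 2, in line with the closed form.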
Optimal k from symmetry using the precise probability of false positive
$$f' = (1 - p')^k = \exp\left(k \ln(1 - p')\right)$$
From $p' = (1 - 1/m)^{kn}$, solving for k,
$$k = \frac{1}{n \ln(1 - 1/m)} \ln(p')$$
Let $g' = k \ln(1 - p')$ (in equation for f' above)
$$g' = \frac{1}{n \ln(1 - 1/m)} \ln(p') \ln(1 - p')$$

Using the precise probability of false positive to get optimal k (cont.)
From previous slide
$$g' = \frac{1}{n \ln(1 - 1/m)} \ln(p') \ln(1 - p')$$
By symmetry, g' (also f') is minimized at p' = 1/2
Optimal k is
$$k_{opt} = \frac{1}{n \ln(1 - 1/m)} \ln(p') = \frac{1}{n \ln(1 - 1/m)} \ln(1/2)$$

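As a quick comparison, the precise optimum can be evaluated next to the approximate one, (m/n) ln 2. The function names and the sample sizes (m = 8000, n = 1000) are assumptions for this sketch.

```python
import math


def k_opt_precise(m, n):
    """k_opt = ln(1/2) / (n ln(1 - 1/m)), from the precise probability p'."""
    # log1p(-1/m) computes ln(1 - 1/m) accurately for large m.
    return math.log(0.5) / (n * math.log1p(-1 / m))


def k_opt_approx(m, n):
    """k_opt = (m/n) ln 2, from the approximation p = e^(-kn/m)."""
    return (m / n) * math.log(2)


if __name__ == "__main__":
    print(k_opt_precise(8000, 1000), k_opt_approx(8000, 1000))
```

The two values differ only in the fourth decimal place for these sizes, so the simpler expression is adequate when m is large.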
Optimal number of hash functions
Using $k_{opt} = \frac{m}{n} \ln(2)$, the false positive rate is
$$(1 - p)^{\frac{m}{n} \ln(2)} = (0.5)^{\frac{m}{n} \ln(2)} = (0.6185)^{m/n}, \quad \text{where } \ln(2) \approx 0.6931$$
Here m/n denotes bits per entry.
In practice, k should be an integer. May choose an integer value smaller than $k_{opt}$ to reduce hashing overhead

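The choice of an integer k can be explored with a short script. It evaluates the approximate false positive rate for a given number of bits per entry and scans integer values of k; the helper names and the search limit `k_max` are hypothetical.

```python
import math


def false_positive(bits_per_entry, k):
    """Approximate rate (1 - e^(-k / (m/n)))^k for bits_per_entry = m/n."""
    return (1 - math.exp(-k / bits_per_entry)) ** k


def optimal_k(bits_per_entry):
    """Real-valued optimum k_opt = (m/n) ln 2."""
    return bits_per_entry * math.log(2)


def best_integer_k(bits_per_entry, k_max=32):
    """Integer k in 1..k_max with the smallest approximate rate."""
    return min(range(1, k_max + 1),
               key=lambda k: false_positive(bits_per_entry, k))


if __name__ == "__main__":
    for k in (4, 5, 6):
        print(k, false_positive(8, k))
    print(optimal_k(8), false_positive(8, optimal_k(8)))
```

With 8 bits per entry, the rates for k = 4, 5, and 6 are all between 2% and 2.5%, so picking a smaller integer k costs little accuracy while saving hash computations.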
False positive rate vs. bits per entry
[Figure: false positive rate plotted against m/n (bits per entry), with one curve for 4 hash functions and one for the optimal number of hash functions.]

Standard Bloom Filter tricks
Two Bloom filters representing sets $S_1$ and $S_2$ with the same number of bits and using the same hash functions.
  o A Bloom filter that represents the union of $S_1$ and $S_2$ can be obtained by taking the OR of the bit vectors
A Bloom filter can be halved in size. Suppose the size is a power of 2.
  o Just OR the first and second halves of the bit vector
  o When hashing to do a lookup, the highest order bit is masked
Notation: OR denotes bitwise or

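Both tricks operate directly on the bit vectors, as the following sketch shows. It assumes the filter is stored as a Python list of 0/1 values and that hash positions are 0-based (the slides use 1 to m); the function names are made up for the example.

```python
def union(bits_a, bits_b):
    """Bit vector for S1 union S2: bitwise OR of two equal-length vectors."""
    if len(bits_a) != len(bits_b):
        raise ValueError("filters must have the same number of bits")
    return [a | b for a, b in zip(bits_a, bits_b)]


def halve(bits):
    """Halve a filter whose size is a power of 2 by OR-ing its two halves."""
    m = len(bits)
    if m < 2 or m & (m - 1):
        raise ValueError("size must be a power of 2 and at least 2")
    half = m // 2
    return [bits[i] | bits[i + half] for i in range(half)]


def halved_position(pos, m):
    """Map a 0-based position in the original m-bit filter to the halved
    filter by masking off the highest-order bit."""
    return pos & (m // 2 - 1)
```

For instance, in an 8-bit filter a bit set at position 6 ends up at position 2 after halving, which is exactly where `halved_position(6, 8)` points a lookup.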
Counting Bloom filters
Proposed by Fan et al. [2000] for distributed caching
Every entry in a counting Bloom filter is a small counter (rather than a single bit).
  o When an item is inserted into the set, the corresponding counters are each incremented by 1
  o When an item is deleted from the set, the corresponding counters are each decremented by 1
To avoid counter overflow, its size must be sufficiently large. It was found that 4 bits per counter are enough.

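A counting Bloom filter can be sketched by replacing the bit array with an array of small counters. The class below is illustrative only: the salted SHA-256 hashing, the 0-based positions, and the decision to saturate a counter at its maximum rather than let it wrap are assumptions, not details taken from Fan et al.

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter with m counters and k hash functions (sketch)."""

    def __init__(self, m, k, counter_bits=4):
        self.m = m
        self.k = k
        self.max_count = (1 << counter_bits) - 1  # 15 for 4-bit counters
        self.counters = [0] * m

    def _positions(self, item):
        # Assumption: k hash functions simulated by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            # Assumption: saturate at max_count instead of overflowing.
            if self.counters[pos] < self.max_count:
                self.counters[pos] += 1

    def remove(self, item):
        # Only items that were previously added should be removed.
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))
```

Unlike the standard filter, this structure supports deletion: adding "a" and "b" and then removing "a" leaves "b" still reported as present.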
Counter overflow probability
Consider a set of n elements, k hash functions, and m counters
  o c(i) is the count for the i-th counter
$$P[c(i) = j] = \binom{nk}{j} \left(\frac{1}{m}\right)^j \left(1 - \frac{1}{m}\right)^{nk - j}$$
$$P[c(i) \ge j] \le \binom{nk}{j} \left(\frac{1}{m}\right)^j \le \left(\frac{e\,nk}{jm}\right)^j \quad \text{(a very loose upper bound)}$$

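The chain of bounds can be evaluated for concrete sizes to see how loose it is. The function names and the sample parameters (n = 1000, k = 5, m = 8000, j = 16) are assumptions for this sketch; `math.comb` requires Python 3.8 or later.

```python
import math


def count_pmf(n, k, m, j):
    """P[c(i) = j]: binomial probability with nk trials, success prob 1/m."""
    nk = n * k
    return math.comb(nk, j) / m**j * (1 - 1 / m) ** (nk - j)


def tail_bound(n, k, m, j):
    """Upper bound C(nk, j) (1/m)^j on P[c(i) >= j]."""
    return math.comb(n * k, j) / m**j


def loose_bound(n, k, m, j):
    """Looser closed-form bound (e n k / (j m))^j."""
    return (math.e * n * k / (j * m)) ** j


if __name__ == "__main__":
    args = (1000, 5, 8000, 16)
    print(count_pmf(*args), tail_bound(*args), loose_bound(*args))
```

Each value printed is no larger than the next, and even the loosest one is tiny for j = 16.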
Counter overflow probability (cont.)
Choose k such that $k \le (m/n) \ln 2$. Then
$$P[c(i) \ge j] \le \left(\frac{e\,nk}{jm}\right)^j \le \left(\frac{e \ln 2}{j}\right)^j$$
For some i to reach count j (union bound over the m counters),
$$P\left[\max_{1 \le i \le m} c(i) \ge j\right] \le m \left(\frac{e \ln 2}{j}\right)^j$$
Using 4 bits, each counter counts from 0 to 15
$$P\left[\max_{1 \le i \le m} c(i) \ge 16\right] \le 1.37 \times 10^{-15} \times m$$

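The constant in the last bound can be reproduced directly from the formula. The function name and the example value of m below are illustrative.

```python
import math


def overflow_bound(m, j=16):
    """Upper bound m * (e ln 2 / j)^j on P[max_i c(i) >= j],
    valid when k <= (m/n) ln 2."""
    return m * (math.e * math.log(2) / j) ** j


if __name__ == "__main__":
    print(overflow_bound(1))        # per-counter constant for 4-bit counters
    print(overflow_bound(10**6))    # hypothetical filter with a million counters
```

Even with a million counters, the bound stays on the order of 1e-9, which supports the observation that 4 bits per counter are enough.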