bloom filters



  1. Bloom Filters (Simon S. Lam), 2/16/2017
     References
     • A. Broder and M. Mitzenmacher, “Network applications of Bloom filters: A survey,” Internet Mathematics, vol. 1, no. 4, pp. 485-509, 2004.
     • Li Fan, Pei Cao, Jussara Almeida, Andrei Broder, “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol,” IEEE/ACM Transactions on Networking, vol. 8, no. 3, June 2000.
       o Origin of counting Bloom filters

  2. Origin and applications
     • Randomized data structure introduced by Burton Bloom [CACM 1970]
       o It represents a set for membership queries, with false positives
       o Probability of false positive can be controlled by design parameters
       o When space efficiency is important, a Bloom filter may be used if the effect of false positives can be mitigated.
     • First applications in dictionaries and databases

  3. First application in networking: distributed cache (2000)
     [Figure: three proxies, each holding its own cache plus summaries of the other two. Proxy 1: Cache 1, Summary 2, Summary 3. Proxy 2: Cache 2, Summary 1, Summary 3. Proxy 3: Cache 3, Summary 1, Summary 2.]
     • Numerous applications in networking since 2000

  4. Standard Bloom Filter
     • A Bloom filter is an array of m bits representing a set $S = \{x_1, x_2, \ldots, x_n\}$ of n elements
       o Array set to 0 initially
     • k independent hash functions $h_1, \ldots, h_k$ with range $\{1, 2, \ldots, m\}$
       o Assume that each hash function maps each item in the universe to a random number uniformly over the range $\{1, 2, \ldots, m\}$
     • For each element x in S, the bit $h_i(x)$ in the array is set to 1, for $1 \le i \le k$
       o A bit in the array may be set to 1 multiple times for different elements

  5. A Bloom filter example (three hash functions)
     [Figure: insert $x_1$ and $x_2$; check $y_1$ and $y_2$]

  6. Standard Bloom Filter (cont.)
     • To check membership of y in S, check whether the bits $h_i(y)$, $1 \le i \le k$, are all set to 1
       o If not, y is definitely not in S
       o Else, we conclude that y is in S, but sometimes this conclusion is wrong (false positive)
     • For many applications, false positives are acceptable as long as the probability of a false positive is small enough
     • We will assume that $kn < m$
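The insert and membership-check rules on the two "Standard Bloom Filter" slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `BloomFilter`, the use of salted SHA-256 to stand in for k independent uniform hash functions, and the 0-based bit positions (the slides use the range 1 to m) are choices made here, not part of the slides.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter sketch: m bits, k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # array set to 0 initially

    def _positions(self, item: str):
        # Assumption: the k hash functions are simulated by salting SHA-256
        # with the index i. Positions are 0-based, in the range [0, m).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1  # a bit may be set to 1 more than once

    def __contains__(self, item: str) -> bool:
        # Any bit equal to 0 means "definitely not in S";
        # all k bits equal to 1 means "in S, unless it is a false positive".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=64, k=3)
for x in ("x1", "x2"):
    bf.add(x)
print("x1" in bf, "x2" in bf)  # inserted items are always reported present
print("y1" in bf)              # a True here would be a false positive
```

Note that there is no delete operation: clearing a bit could also remove evidence of other elements that hash to the same position.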

  7. False positive probability
     • After all members of S have been hashed to a Bloom filter, the probability that a specific bit is still 0 is
       $p' = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} = p$
     • For a non-member, it may be found to be a member of S (all of its k bits are nonzero) with false positive probability
       $(1 - p')^k \approx (1 - p)^k$

  8. False positive probability (cont.)
     • Define
       $f' = (1 - p')^k = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k$
       $f = (1 - p)^k = \left(1 - e^{-kn/m}\right)^k$
     • Two competing forces as k increases
       o Larger k -> $(1 - p')^k$ is smaller for a fixed p'
       o Larger k -> $p' = (1 - 1/m)^{kn}$ is smaller -> $1 - p'$ larger
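To see the two competing forces numerically, the sketch below tabulates f' and f for increasing k. The function names and the values m = 8000, n = 1000 (8 bits per member) are illustrative assumptions, not taken from the slides.

```python
import math


def p_exact(m: int, n: int, k: int) -> float:
    """p' = (1 - 1/m)^(kn): probability that a given bit is still 0."""
    return (1 - 1 / m) ** (k * n)


def p_approx(m: int, n: int, k: int) -> float:
    """p = e^(-kn/m): the exponential approximation of p'."""
    return math.exp(-k * n / m)


def f_exact(m: int, n: int, k: int) -> float:
    """f' = (1 - p')^k."""
    return (1 - p_exact(m, n, k)) ** k


def f_approx(m: int, n: int, k: int) -> float:
    """f = (1 - p)^k."""
    return (1 - p_approx(m, n, k)) ** k


m, n = 8000, 1000  # assumed example: m/n = 8 bits per member
for k in range(1, 11):
    print(k, f_exact(m, n, k), f_approx(m, n, k))
```

The printed rate first falls and then rises again as k grows, and the two columns stay close to each other for these parameters.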

  9. False positive rate vs. k
     [Figure: false positive rate plotted against k, for $m/n = 8$ bits per member]

  10. Optimal number k from derivative
      Rewrite f as
      $f = \exp\left(\ln\left(1 - e^{-kn/m}\right)^k\right) = \exp\left(k \ln\left(1 - e^{-kn/m}\right)\right)$
      Let $g = k \ln\left(1 - e^{-kn/m}\right)$. Minimizing g will minimize $f = \exp(g)$.
      $\frac{\partial g}{\partial k} = \ln\left(1 - e^{-kn/m}\right) + \frac{k}{1 - e^{-kn/m}} \cdot \frac{\partial \left(1 - e^{-kn/m}\right)}{\partial k}$
      $= \ln\left(1 - e^{-kn/m}\right) + \frac{kn}{m} \cdot \frac{e^{-kn/m}}{1 - e^{-kn/m}}$
      $= -\ln(2) + \ln(2) = 0$ if we plug in $k = (m/n) \ln 2$, which is optimal
      (It is in fact a global optimum)
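The stationary point in the derivation above can be checked numerically. A minimal sketch, in which the function names and the values m = 8000, n = 1000 are assumptions chosen for illustration:

```python
import math


def g(k: float, m: int, n: int) -> float:
    """g = k * ln(1 - e^(-kn/m)); minimizing g minimizes f = exp(g)."""
    return k * math.log(1 - math.exp(-k * n / m))


def dg_dk(k: float, m: int, n: int) -> float:
    """Derivative of g with respect to k, as derived on the slide."""
    e = math.exp(-k * n / m)
    return math.log(1 - e) + (k * n / m) * e / (1 - e)


m, n = 8000, 1000  # assumed example values
k_opt = (m / n) * math.log(2)

print(k_opt, dg_dk(k_opt, m, n))  # the derivative should vanish at k_opt
print(g(k_opt - 0.5, m, n), g(k_opt, m, n), g(k_opt + 0.5, m, n))
```

At $k = (m/n) \ln 2$ the term $e^{-kn/m}$ equals 1/2, so the two parts of the derivative are $-\ln 2$ and $+\ln 2$ and cancel.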

  11. Optimal k from symmetry
      • Alternatively, from $p = e^{-kn/m}$ we get
        $k = -\frac{m}{n} \ln(p)$
        From previous slide, we have
        $g = k \ln\left(1 - e^{-kn/m}\right) = -\frac{m}{n} \ln(p) \ln(1 - p)$
      • From above, symmetry indicates that the minimum value for g occurs when $p = 1/2$. Thus
        $k_{opt} = -\frac{m}{n} \ln(1/2) = \frac{m}{n} \ln(2)$

  12. Optimal k from symmetry using the precise probability of false positive
      $f' = (1 - p')^k = \exp\left(k \ln(1 - p')\right)$
      From $p' = (1 - 1/m)^{kn}$, solving for k,
      $k = \frac{1}{n \ln(1 - 1/m)} \ln(p')$
      Let $g' = k \ln(1 - p')$ (in the equation for f' above)
      $g' = \frac{1}{n \ln(1 - 1/m)} \ln(p') \ln(1 - p')$

  13. Using the precise probability of false positive to get optimal k (cont.)
      • From previous slide
        $g' = \frac{1}{n \ln(1 - 1/m)} \ln(p') \ln(1 - p')$
      • By symmetry, g' (also f') is minimized at $p' = 1/2$
      • Optimal k is
        $k'_{opt} = \frac{1}{n \ln(1 - 1/m)} \ln(p') = \frac{1}{n \ln(1 - 1/m)} \ln(1/2)$
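As a rough check of how much the precise formula differs from $(m/n) \ln 2$, the two optima can be compared directly. The function names and the sample (m, n) pairs below are assumptions for illustration only.

```python
import math


def k_opt_approx(m: int, n: int) -> float:
    """k_opt = (m/n) ln 2, from the approximation p = e^(-kn/m)."""
    return (m / n) * math.log(2)


def k_opt_exact(m: int, n: int) -> float:
    """k'_opt = ln(1/2) / (n ln(1 - 1/m)), from p' = (1 - 1/m)^(kn)."""
    # log1p(-1/m) computes ln(1 - 1/m) accurately when m is large.
    return math.log(0.5) / (n * math.log1p(-1 / m))


for m, n in ((80, 10), (8000, 1000), (800000, 100000)):  # assumed examples
    print(m, n, k_opt_exact(m, n), k_opt_approx(m, n))
```

Because $|\ln(1 - 1/m)|$ is slightly larger than $1/m$, the precise optimum is slightly smaller than $(m/n) \ln 2$, and the gap shrinks as m grows.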

  14. Optimal number of hash functions
      • Using $k_{opt} = \frac{m}{n} \ln(2)$, the false positive rate is
        $(1 - p)^{(m/n) \ln(2)} = (0.5)^{(m/n) \ln(2)} \approx (0.6185)^{m/n}$, where $\ln(2) \approx 0.6931$
      • In practice, k should be an integer. May choose an integer value smaller than $k_{opt}$ to reduce hashing overhead
      (m/n denotes bits per entry)
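Since k must be an integer, one simple way to pick it is to compare the two integers on either side of $k_{opt}$. This is a sketch of that idea, not a procedure given in the slides; the function names and the sampled bits-per-entry values are assumptions.

```python
import math


def fp_rate(bits_per_entry: float, k: int) -> float:
    """f = (1 - e^(-k / (m/n)))^k for a given m/n and integer k."""
    return (1 - math.exp(-k / bits_per_entry)) ** k


def best_integer_k(bits_per_entry: float) -> int:
    """Pick whichever of floor(k_opt), ceil(k_opt) gives the lower rate."""
    k_opt = bits_per_entry * math.log(2)
    candidates = {max(1, math.floor(k_opt)), math.ceil(k_opt)}
    return min(candidates, key=lambda k: fp_rate(bits_per_entry, k))


for bits in (4, 8, 16):  # assumed example values of m/n
    k = best_integer_k(bits)
    # The last column is the slide's (0.6185)^(m/n), attained only at real-valued k_opt.
    print(bits, k, fp_rate(bits, k), 0.6185 ** bits)
```

The rate at the best integer k is slightly above $(0.6185)^{m/n}$, and a smaller k can be compared by calling `fp_rate` with it directly.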

  15. False positive rate vs. bits per entry
      [Figure: false positive rate plotted against m/n, with one curve for 4 hash functions and one for the optimal number of hash functions]

  16. Standard Bloom Filter tricks
      • Two Bloom filters representing sets $S_1$ and $S_2$ with the same number of bits and using the same hash functions.
        o A Bloom filter that represents the union of $S_1$ and $S_2$ can be obtained by taking the OR of the bit vectors
      • A Bloom filter can be halved in size. Suppose the size is a power of 2.
        o Just OR the first and second halves of the bit vector
        o When hashing to do a lookup, the highest order bit is masked
      Notation: OR denotes bitwise or
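Both tricks can be sketched with plain bit lists. The helper names, the salted SHA-256 hashing, and the 0-based positions are assumptions of this sketch. Reducing the hash modulo the halved size has the same effect as masking the highest-order bit of the original position, because the size is a power of 2.

```python
import hashlib


def positions(item: str, m: int, k: int) -> list[int]:
    """k positions in [0, m) from salted SHA-256 (assumed hash scheme)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big") % m
        for i in range(k)
    ]


def make_filter(items, m: int, k: int) -> list[int]:
    bits = [0] * m
    for item in items:
        for pos in positions(item, m, k):
            bits[pos] = 1
    return bits


def union(bits1: list[int], bits2: list[int]) -> list[int]:
    """Bitwise OR of two filters with the same size and hash functions."""
    return [a | b for a, b in zip(bits1, bits2)]


def halve(bits: list[int]) -> list[int]:
    """OR the first and second halves; len(bits) must be a power of 2."""
    half = len(bits) // 2
    return [bits[i] | bits[i + half] for i in range(half)]


def query(bits: list[int], item: str, k: int) -> bool:
    # For a halved filter, len(bits) is m/2, so the high bit is dropped here.
    return all(bits[pos] for pos in positions(item, len(bits), k))


s1, s2 = ["a", "b"], ["c", "d"]
both = union(make_filter(s1, 64, 3), make_filter(s2, 64, 3))
small = halve(both)
print(all(query(both, x, 3) for x in s1 + s2))
print(all(query(small, x, 3) for x in s1 + s2))
```

Halving keeps every member detectable but raises the false positive rate, since the same elements now share half as many bits.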

  17. Counting Bloom filters
      • Proposed by Fan et al. [2000] for distributed caching
      • Every entry in a counting Bloom filter is a small counter (rather than a single bit).
        o When an item is inserted into the set, the corresponding counters are each incremented by 1
        o When an item is deleted from the set, the corresponding counters are each decremented by 1
      • To avoid counter overflow, its size must be sufficiently large. It was found that 4 bits per counter are enough.
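A counting Bloom filter differs from the standard one only in what each entry stores. The sketch below is illustrative: the class name, the salted SHA-256 hashing, and the decision to let a counter stay at 15 once it gets there are assumptions made here, not details given in the slides.

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter sketch with 4-bit counters (values 0 to 15)."""

    MAX_COUNT = 15  # largest value a 4-bit counter can hold

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, item: str) -> list[int]:
        # Assumption: k hash functions simulated by salting SHA-256 with i.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            # Assumption: stop at MAX_COUNT instead of wrapping around.
            if self.counters[pos] < self.MAX_COUNT:
                self.counters[pos] += 1

    def remove(self, item: str) -> None:
        # Only call this for items that were previously added.
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=64, k=3)
cbf.add("a")
cbf.add("b")
cbf.remove("a")
print("b" in cbf)  # "b" is still present after "a" is deleted
```

Deleting an item that was never inserted would corrupt the counters of other items, which is why `remove` is documented as valid only for previously added items.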

  18. Counter overflow probability
      • Consider a set of n elements, k hash functions, and m counters
        o c(i) is the count for the i-th counter
      $P[c(i) = j] = \binom{nk}{j} \left(\frac{1}{m}\right)^j \left(1 - \frac{1}{m}\right)^{nk - j}$
      $P[c(i) \ge j] \le \binom{nk}{j} \left(\frac{1}{m}\right)^j \le \left(\frac{enk}{jm}\right)^j$ (a very loose upper bound)
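To get a feel for how loose the bound is, the exact binomial tail can be compared with both bounds for small parameters. The function names and the values n = 50, k = 4, m = 400 are illustrative assumptions.

```python
import math


def p_count_equals(j: int, n: int, k: int, m: int) -> float:
    """P[c(i) = j]: exactly j of the nk hash values land on counter i."""
    nk = n * k
    return math.comb(nk, j) * (1 / m) ** j * (1 - 1 / m) ** (nk - j)


def tail_exact(j: int, n: int, k: int, m: int) -> float:
    """P[c(i) >= j], summed term by term from the binomial distribution."""
    return sum(p_count_equals(t, n, k, m) for t in range(j, n * k + 1))


def union_bound(j: int, n: int, k: int, m: int) -> float:
    """C(nk, j) * (1/m)^j."""
    return math.comb(n * k, j) * (1 / m) ** j


def loose_bound(j: int, n: int, k: int, m: int) -> float:
    """(e * nk / (j * m))^j."""
    return (math.e * n * k / (j * m)) ** j


n, k, m = 50, 4, 400  # assumed small example with kn < m
for j in (2, 4, 8):
    print(j, tail_exact(j, n, k, m), union_bound(j, n, k, m), loose_bound(j, n, k, m))
```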

  19. Counter overflow probability (cont.)
      • Choose k such that $k \le (m/n) \ln 2$. Then
        $P[c(i) \ge j] \le \left(\frac{enk}{jm}\right)^j \le \left(\frac{e \ln 2}{j}\right)^j$
        $P\left[\max_{1 \le i \le m} c(i) \ge j\right] \le m \left(\frac{e \ln 2}{j}\right)^j$ (that is, $c(i) \ge j$ for some i)
      • Using 4 bits, each counter counts from 0 to 15
        $P\left[\max_{1 \le i \le m} c(i) \ge 16\right] \le 1.37 \times 10^{-15} \times m$
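The constant in the last bound follows from setting j = 16 in $m (e \ln 2 / j)^j$. A short sketch that evaluates it, with a function name and sample filter sizes chosen here for illustration:

```python
import math


def overflow_bound(j: int, m: int) -> float:
    """m * (e ln 2 / j)^j, an upper bound on P[max_i c(i) >= j],
    valid when k <= (m/n) ln 2."""
    return m * (math.e * math.log(2) / j) ** j


print(overflow_bound(16, 1))  # per-counter factor for 4-bit counters
for m in (10**3, 10**6, 10**9):  # assumed example filter sizes
    print(m, overflow_bound(16, m))
```

Even for a filter with a billion counters the bound stays far below 1, which is consistent with the slide's statement that 4 bits per counter are enough.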
