
Data Structures and Algorithms COMS21103: Bloom Filters
Raphaël Clifford (slides by Benjamin Sach and Ashley Montanaro)

Introduction: In this lecture we are interested in space-efficient data structures for storing a set S which support


  3. Bloom filters
     A Bloom filter is a randomised data structure for storing a set S which supports two operations:
     - INSERT(k) inserts the key k from U into S (it never does this incorrectly).
     - MEMBER(k) always returns ‘yes’ if k ∈ S; however, if k is not in S there is a small chance (say 1%) that it will still say ‘yes’.
     Why use a Bloom filter then? Both operations run in O(1) time and the space used is very good: it will use O(n) bits of space to store up to n keys. The exact number of bits will depend on the failure probability (we’ll come back to this at the end).

  10. Approach 1: build an array
     Before discussing Bloom filters, let’s consider a naive approach using an array. For simplicity, let us think of the universe U as containing the numbers 1, 2, 3, ..., |U|.
     We could maintain a bit string B of length |U|, where B[k] = 1 if k ∈ S and B[k] = 0 otherwise.
     Example (here |U| = 10 and S contains 3, 6 and 8):
         k:    1  2  3  4  5  6  7  8  9  10
         B[k]: 0  0  1  0  0  1  0  1  0  0
     While the operations take O(1) time, this array is |U| bits long! It certainly isn’t suitable for the application we have seen.
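
The array approach can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name `BitArraySet` and its method names are made up here, and a `bytearray` (one byte per position) stands in for a true bit string to keep the code short.

```python
class BitArraySet:
    """Approach 1: one position per element of the universe {1, ..., |U|}."""

    def __init__(self, universe_size):
        self.universe_size = universe_size
        # Index 0 is unused so that key k maps directly to position k.
        # A bytearray spends one byte per position; a real bit string would use one bit.
        self.bits = bytearray(universe_size + 1)

    def insert(self, k):
        self.bits[k] = 1

    def member(self, k):
        return self.bits[k] == 1


s = BitArraySet(10)
for key in (3, 6, 8):
    s.insert(key)

print(list(s.bits[1:]))                         # [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
print([k for k in range(1, 11) if s.member(k)])  # [3, 6, 8]
```

Both operations are a single array access, but the storage grows with |U| rather than with the number of keys actually stored.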

  21. Approach 2: build a hash table
     We could solve the problem by hashing. We now maintain a much shorter bit string B of some length m < |U| (to be determined later).
     Assume we have access to a hash function h which maps each key k ∈ U to an integer h(k) between 1 and m.
     - INSERT(k) sets B[h(k)] = 1.
     - MEMBER(k) returns ‘yes’ if B[h(k)] = 1 and ‘no’ if B[h(k)] = 0.
     Example: imagine that m = 3 and
         h(www.AwfulVirus.com) = 2
         h(www.VirusStore.com) = 3
         h(www.BBC.co.uk) = 3
     Starting from B = 0 0 0:
         INSERT(www.AwfulVirus.com) gives B = 0 1 0
         INSERT(www.VirusStore.com) gives B = 0 1 1
         MEMBER(www.BBC.co.uk) returns ‘yes’, although www.BBC.co.uk was never inserted
     This is called a collision.
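
The collision in this example can be reproduced with a minimal Python sketch. The class name `HashedBitString` is hypothetical, and the hash function is just a lookup table holding the three values given on the slide, not a real hash function.

```python
class HashedBitString:
    """Approach 2: a bit string of length m indexed through one hash function."""

    def __init__(self, m, h):
        self.m = m
        self.h = h                  # maps a key to a position between 1 and m
        self.bits = [0] * (m + 1)   # index 0 unused, positions are 1..m

    def insert(self, k):
        self.bits[self.h(k)] = 1

    def member(self, k):
        return self.bits[self.h(k)] == 1


# Hash values copied from the example (m = 3); a stand-in for a real hash function.
EXAMPLE_H = {
    "www.AwfulVirus.com": 2,
    "www.VirusStore.com": 3,
    "www.BBC.co.uk": 3,
}

table = HashedBitString(3, EXAMPLE_H.__getitem__)
table.insert("www.AwfulVirus.com")
table.insert("www.VirusStore.com")

print(table.bits[1:])                  # [0, 1, 1]
print(table.member("www.BBC.co.uk"))   # True: a false positive caused by the collision
```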

  26. Approach 2: build a hash table
     The problem with hashing is that if m < |U| then some keys will hash to the same position (these are called collisions).
     If we call MEMBER(k) for some key k not in S, but there is a key k′ ∈ S with h(k) = h(k′), we will incorrectly output ‘yes’.
     To make sure that the probability of an error is low for every operation sequence, we pick the hash function h at random.
     Important: h is chosen before any operations happen and never changes.
     For every key k ∈ U, the value of h(k) is chosen independently and uniformly at random: that is, the probability that h(k) = j is 1/m for all j between 1 and m (each position is equally likely).
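
One way to picture such a random hash function is the sketch below. It is a model for reasoning only, under assumptions: the class name `RandomHash` is invented, and values are drawn lazily on first use, which behaves the same as drawing every h(k) up front because each value is drawn once and never changes. Storing the drawn values takes far too much space to be a practical hash function.

```python
import random


class RandomHash:
    """Idealised hash function: each h(k) is uniform on 1..m and fixed once drawn."""

    def __init__(self, m, seed=None):
        self.m = m
        self._rng = random.Random(seed)
        self._values = {}   # remembers h(k) so repeated calls agree

    def __call__(self, k):
        if k not in self._values:
            # randint is inclusive at both ends, so this is uniform on 1..m.
            self._values[k] = self._rng.randint(1, self.m)
        return self._values[k]


h = RandomHash(m=3, seed=42)
print(h("www.BBC.co.uk") == h("www.BBC.co.uk"))   # True: the function never changes
```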

  35. What is the probability of an error?
     Assume we have already INSERTed n keys into the structure, and we have just called MEMBER(k) for some key k not in S (which will check whether B[h(k)] = 1).
     We want to know the probability that the answer returned is ‘yes’ (which would be bad).
     (Figure: the bit string B of length m with the 1s marked, and h(k) pointing at one position.)
     - The bit string B contains at most n 1s among the m positions.
     - By definition, h(k) is equally likely to be any position between 1 and m, and since k is not in S it was chosen independently of the positions already set.
     - Therefore the probability that B[h(k)] = 1 is at most n/m.
     If we choose m = 100n then we get a failure probability of at most 1%.
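
The n/m bound can be checked empirically with a small simulation. This is a sketch under the idealised assumption that every hash value is an independent uniform draw; the function name `false_positive_rate`, the choice n = 1000 and the number of trials are all arbitrary choices made here.

```python
import random


def false_positive_rate(n, m, trials, seed=0):
    """Insert n keys with uniform random hash values, then query fresh keys."""
    rng = random.Random(seed)
    bits = [0] * (m + 1)              # positions 1..m, index 0 unused
    for _ in range(n):
        bits[rng.randint(1, m)] = 1   # each INSERT sets one uniformly chosen position
    # A key not in S hashes to a fresh uniform position, independent of the inserts.
    hits = sum(bits[rng.randint(1, m)] for _ in range(trials))
    fraction_set = sum(bits) / m      # never more than n/m
    return hits / trials, fraction_set


n = 1000
m = 100 * n
rate, fraction_set = false_positive_rate(n, m, trials=200_000)
print(fraction_set <= n / m)   # True: at most n of the m bits are set
print(rate)                    # empirical false-positive rate, expected to be close to 1%
```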

  47. Approach 2: build a hash table
     We have developed a randomised data structure for storing a set S which supports two operations:
     - INSERT(k) inserts the key k from U into S (it never does this incorrectly).
     - Like in a Bloom filter, MEMBER(k) always returns ‘yes’ if k ∈ S; however, if k is not in S there is a small chance (at most 1%) that it will still say ‘yes’.
     Both operations run in O(1) time and the space used is 100n bits when storing up to n keys.
     - Neither the space nor the failure probability depends on |U|.
     - If we wanted a better probability, we could use more space.
     Why use a Bloom filter then? We will get much better space usage for the same probability.

  57. Approach 3: build a Bloom filter
     We still maintain a bit string B of some length m < |U|.
     Now we have r hash functions h_1, h_2, ..., h_r (we will choose r and m later). Each hash function h_i maps a key k to an integer h_i(k) between 1 and m.
     - INSERT(k) sets B[h_i(k)] = 1 for all i between 1 and r.
     - MEMBER(k) returns ‘yes’ if and only if B[h_i(k)] = 1 for all i between 1 and r.
     Example: imagine that m = 4, r = 2 and
         h_1(AwVi.com) = 2    h_2(AwVi.com) = 1
         h_1(ViSt.com) = 3    h_2(ViSt.com) = 2
         h_1(BBC.com) = 2     h_2(BBC.com) = 4
     Starting from B = 0 0 0 0:
         INSERT(AwVi.com) gives B = 1 1 0 0
         INSERT(ViSt.com) gives B = 1 1 1 0
         MEMBER(BBC.com) returns ‘no’, because B[h_2(BBC.com)] = B[4] = 0
     Much better! (Not convinced?)
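
The same example can be traced in Python. This is a minimal sketch, not a production Bloom filter: the class name `BloomFilter` and its interface are assumptions, and the two hash functions are lookup tables holding the values from the slide.

```python
class BloomFilter:
    """Approach 3: a bit string of length m with r hash functions h_1, ..., h_r."""

    def __init__(self, m, hash_functions):
        self.m = m
        self.hash_functions = list(hash_functions)
        self.bits = [0] * (m + 1)   # index 0 unused, positions are 1..m

    def insert(self, k):
        # Set B[h_i(k)] = 1 for every i between 1 and r.
        for h in self.hash_functions:
            self.bits[h(k)] = 1

    def member(self, k):
        # 'yes' if and only if all r probed positions hold a 1.
        return all(self.bits[h(k)] == 1 for h in self.hash_functions)


# Hash values copied from the example (m = 4, r = 2); stand-ins for real hash functions.
H1 = {"AwVi.com": 2, "ViSt.com": 3, "BBC.com": 2}
H2 = {"AwVi.com": 1, "ViSt.com": 2, "BBC.com": 4}

bf = BloomFilter(4, [H1.__getitem__, H2.__getitem__])
bf.insert("AwVi.com")
print(bf.bits[1:])             # [1, 1, 0, 0]
bf.insert("ViSt.com")
print(bf.bits[1:])             # [1, 1, 1, 0]
print(bf.member("BBC.com"))    # False: position 4 is still 0
print(bf.member("AwVi.com"))   # True: inserted keys are always reported
```

With a single hash function, BBC.com colliding at position 2 would already have produced a wrong ‘yes’; here the second probe at position 4 rules it out.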

  59. Approach 3: build a Bloom filter
     We still maintain a bit string B of some length m < |U| and r hash functions h_1, h_2, ..., h_r (we will choose r and m later), each mapping a key k to an integer h_i(k) between 1 and m.
     - INSERT(k) sets B[h_i(k)] = 1 for all i between 1 and r.
     - MEMBER(k) returns ‘yes’ if and only if B[h_i(k)] = 1 for all i between 1 and r.
     For every key k ∈ U, the value of each h_i(k) is chosen independently and uniformly at random: that is, the probability that h_i(k) = j is 1/m for all j between 1 and m (each position is equally likely).
     But what is the probability of a wrong answer?

  70. What is the probability of an error? Assume we have already INSERTed n keys into the Bloom filter and have just called MEMBER(k) for some key k not in S. This checks whether $B[h_j(k)] = 1$ for all $j = 1, 2, \ldots, r$, which is the same as checking whether r randomly chosen bits of B all equal 1. We will now show that there is only a small probability of this happening. As there are at most n keys in the filter, at most nr of the m bits of B are set to 1 (each INSERT sets at most r bits to 1). So the fraction of bits set to 1 is at most $\frac{nr}{m}$, and the probability that a randomly chosen bit is 1 is at most $\frac{nr}{m}$. Doing this independently r times, the probability that r randomly chosen bits all equal 1 is at most $\left(\frac{nr}{m}\right)^r$.
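The bound above can be checked empirically. The following is a minimal sketch, not the lecture's implementation: the class and function names are hypothetical, and salted SHA-256 is assumed as a stand-in for the r random hash functions $h_1, \ldots, h_r$.

```python
import hashlib


class BloomFilter:
    """Bit array B of length m with r hash functions.

    Assumption: salted SHA-256 stands in for the random hash
    functions h_1..h_r described in the slides.
    """

    def __init__(self, m: int, r: int) -> None:
        self.m = m
        self.r = r
        self.bits = [0] * m

    def _positions(self, key: str):
        # h_j(key) for j = 0..r-1, simulated by salting the key with j.
        for j in range(self.r):
            digest = hashlib.sha256(f"{j}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key: str) -> None:
        # INSERT sets at most r bits of B to 1.
        for p in self._positions(key):
            self.bits[p] = 1

    def member(self, key: str) -> bool:
        # MEMBER says 'yes' only if all r probed bits are 1.
        return all(self.bits[p] for p in self._positions(key))


def error_bound(n: int, m: int, r: int) -> float:
    """Upper bound (nr/m)^r on the false-positive probability."""
    return (n * r / m) ** r


def measured_error_rate(n: int, m: int, r: int, trials: int = 20000) -> float:
    """Insert n keys, then query `trials` keys that were never inserted."""
    bf = BloomFilter(m, r)
    for i in range(n):
        bf.insert(f"in-{i}")
    false_positives = sum(bf.member(f"out-{i}") for i in range(trials))
    return false_positives / trials
```

Calling `measured_error_rate(1000, 12520, 5)` and comparing it with `error_bound(1000, 12520, 5)` should show the measured rate sitting below the bound, since the bound assumes the worst case in which all nr set bits are distinct.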

  76. What is the probability of a collision? We now choose r to minimise this probability. By differentiating, we find that $\left(\frac{nr}{m}\right)^r$ is minimised by letting $r = m/(ne)$, where $e = 2.7183\ldots$ If we plug this in, we get that the probability of failure is at most $\left(\frac{1}{e}\right)^{m/(ne)} \approx (0.69)^{m/n}$. In particular, to achieve a 1% failure probability we can set $m \approx 12.52n$ bits. Neither the space nor the failure probability depends on |U|. If we wanted a smaller failure probability, we could use more space.
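The choice of parameters can be worked through numerically. This is a small sketch under the slide's bound only; the function names are hypothetical. Setting $\left(\frac{1}{e}\right)^{m/(ne)} = p$ gives $m/(ne) = \ln(1/p)$, so $m = ne\ln(1/p)$ and $r = \ln(1/p)$.

```python
import math


def bloom_parameters(n: int, p: float):
    """Return (m, r) for n keys and target failure probability p.

    Uses the bound (1/e)^(m/(ne)) <= p, so r = m/(ne) = ln(1/p).
    r is returned as a real number; a real filter needs an integer
    number of hash functions, and rounding r shifts the bound slightly.
    """
    r = math.log(1 / p)
    m = math.ceil(n * math.e * r)
    return m, r


def failure_bound(n: int, m: int, r: float) -> float:
    """The bound (nr/m)^r evaluated at the chosen parameters."""
    return (n * r / m) ** r


m, r = bloom_parameters(1000, 0.01)
print(m / 1000, r, failure_bound(1000, m, r))
```

For p = 0.01 this gives roughly 12.52 bits per key, matching the figure above, with r between 4 and 5 hash functions.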
