Bloom Filters
Anil Maheshwari, anil@scs.carleton.ca
School of Computer Science, Carleton University, Canada
Outline
1. Bloom Filter
2. Data Structure
3. Queries
4. False-Positives
5. Analysis
6. Summary
Bloom Filters
Problem Definition: Let $U$ be the universe.
Input: A subset $S \subseteq U$.
Query: For any $q \in U$, decide quickly whether $q \in S$.
Objective: Answer queries quickly and use very little extra space.
Spam Detection: $U$ = all possible email addresses; $S$ = my collection of non-junk email addresses. Query: given any $q \in U$, report whether $q \in S$.
History of Bloom Filters
Bloom: Space/Time Trade-offs in Hash Coding with Allowable Errors, Communications of the ACM, 1970.
A space-efficient probabilistic data structure for membership testing.
May have false positives.
Numerous variants: counting filters, dynamic filters with insertion/deletion of elements in $S$.
Applications: estimating the size of the union/intersection of sets, avoiding caching 'one-hit wonders', Google Bigtable, Chrome has used it to detect malicious URLs, ...
Refined analysis in 2008 by members of our school.
Bloom Filter Data Structure
Data Structure: An array $B$ consisting of $m$ bits and $k$ hash functions $h_1, h_2, \ldots, h_k$, where $h_i : U \to \{1, \ldots, m\}$.
Initialization: $B \leftarrow 0$. For all $x \in S$, set $B[h_1(x)] = B[h_2(x)] = \cdots = B[h_k(x)] = 1$.
An Illustration
[Figure not recovered from the slide.]
Queries
Answering Query: For any query $q \in U$, if $B[h_1(q)] = B[h_2(q)] = \cdots = B[h_k(q)] = 1$, report $q \in S$, else report $q \notin S$.
Observation: If $q \in S$, the query is answered correctly.
False Positives: Suppose $q \notin S$. If $B[h_1(q)] = B[h_2(q)] = \cdots = B[h_k(q)] = 1$, we will report that $q \in S$.
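The initialization and query rules can be sketched in Python as follows. This is a minimal sketch, not the slides' implementation: the class name `BloomFilter`, the method names, and the choice of salted SHA-256 to simulate $h_1, \ldots, h_k$ are all assumptions made for illustration, and positions are 0-based rather than $\{1, \ldots, m\}$.

```python
import hashlib


class BloomFilter:
    """Array B of m bits with k hash functions (0-based positions)."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        # One byte per bit for readability; corresponds to B <- 0.
        self.bits = bytearray(m)

    def _positions(self, item: str):
        # Assumption: h_i(item) is simulated by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set B[h_1(x)] = ... = B[h_k(x)] = 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True if all k bits are 1 (possibly a false positive);
        # False means the item is definitely not in S.
        return all(self.bits[pos] for pos in self._positions(item))


S = ["alice@example.com", "bob@example.com"]
bf = BloomFilter(m=64, k=3)
for x in S:
    bf.add(x)
```

Every element of `S` is reported as present, matching the observation that queries for $q \in S$ are always answered correctly; a `True` answer for any other string would be a false positive.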
Estimating Probability of False-Positives
Claim: Let $n = |S|$. After initializing a Bloom filter $B$ of size $m$ with $k$ hash functions for the elements of $S$, $\Pr(B[l] = 1) = p = 1 - \left(1 - \frac{1}{m}\right)^{nk}$, where $l \in \{1, \ldots, m\}$.
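A short justification of the claim, under the usual assumption (not stated on the slide) that the $nk$ hash values $h_i(x)$ are independent and uniformly distributed over $\{1, \ldots, m\}$:

```latex
% A single hash value misses a fixed position l with probability 1 - 1/m.
% Initialization performs nk independent hash evaluations, so
\[
  \Pr(B[l] = 0) = \left(1 - \frac{1}{m}\right)^{nk},
  \qquad
  \Pr(B[l] = 1) = p = 1 - \left(1 - \frac{1}{m}\right)^{nk}.
\]
```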
Estimating Probability of False-Positives
On a query $q \notin S$, for a false positive to occur, all of the $k$ specified locations $B[h_1(q)], \ldots, B[h_k(q)]$ must be "1".
Bloom70: $\Pr(B[h_1(q)] = B[h_2(q)] = \cdots = B[h_k(q)] = 1) = p^k$.
An Example
Let $n = 1$, $m = 2$, $k = 2$, $U = \{x, y\}$, $S = \{x\}$ and $q = y \neq x$.
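The example is small enough to check by brute force. The sketch below assumes each hash value is an independent uniform choice from $\{1, 2\}$ (an assumption of this illustration), enumerates all 16 outcomes for $(h_1(x), h_2(x), h_1(y), h_2(y))$, and compares the exact false-positive rate with the estimate $p^k$.

```python
from fractions import Fraction
from itertools import product

m, k = 2, 2  # n = 1: S = {x}, query q = y
positions = range(1, m + 1)

false_positives = 0
total = 0
# Enumerate every choice of (h1(x), h2(x)) and (h1(y), h2(y)).
for hx in product(positions, repeat=k):
    set_bits = set(hx)  # positions set to 1 after inserting x
    for hy in product(positions, repeat=k):
        total += 1
        if all(pos in set_bits for pos in hy):
            false_positives += 1

exact = Fraction(false_positives, total)   # 10/16 = 5/8
p = 1 - Fraction(m - 1, m) ** (1 * k)      # 3/4
bloom_estimate = p ** k                    # 9/16
```

Here the exact rate is $5/8$ while $p^k = 9/16$, so the estimate is slightly too small in this instance.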
Independence Assumption?
The implicit assumption that the event $B[h_2(q)] = 1$ is independent of the event $B[h_1(q)] = 1$ may not be true ...
A Possible Fix
We came up with a fairly technical proof and showed the following.
Theorem: Let $p_{k,n,m}$ be the false-positive rate for a Bloom filter that stores $n$ elements of a set $S$ in a bit-vector of size $m$ using $k$ hash functions.
1. We can express $p_{k,n,m}$ in terms of the Stirling numbers of the second kind as follows:
$$p_{k,n,m} = \frac{1}{m^{k(n+1)}} \sum_{i=1}^{m} i^k \, i! \binom{m}{i} \left\{ {kn \atop i} \right\}$$
2. Let $p = 1 - (1 - 1/m)^{kn}$, $k \ge 2$ and $\frac{k}{p} \sqrt{\frac{\ln m - 2k \ln p}{m}} \le c$ for some $c < 1$. Upper and lower bounds on $p_{k,n,m}$ are given by
$$p^k < p_{k,n,m} \le p^k \left( 1 + O\left( \frac{k}{p} \sqrt{\frac{\ln m - 2k \ln p}{m}} \right) \right)$$
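The exact expression in the theorem can be evaluated directly for small parameters. The following is a sketch with hypothetical helper names (`stirling2`, `exact_fp_rate`, `bloom_estimate`); it uses exact rational arithmetic so the result can be compared with $p^k$ without rounding error.

```python
from fractions import Fraction
from math import comb, factorial


def stirling2(n: int, i: int) -> int:
    """Stirling number of the second kind, via the standard recurrence."""
    # table[a][b] = number of partitions of an a-set into b non-empty blocks
    table = [[0] * (i + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for a in range(1, n + 1):
        for b in range(1, min(a, i) + 1):
            table[a][b] = b * table[a - 1][b] + table[a - 1][b - 1]
    return table[n][i]


def exact_fp_rate(k: int, n: int, m: int) -> Fraction:
    """Exact false-positive rate p_{k,n,m} from the theorem."""
    total = sum(
        i**k * factorial(i) * comb(m, i) * stirling2(k * n, i)
        for i in range(1, m + 1)
    )
    return Fraction(total, m ** (k * (n + 1)))


def bloom_estimate(k: int, n: int, m: int) -> Fraction:
    """The classical estimate p^k with p = 1 - (1 - 1/m)^(kn)."""
    p = 1 - Fraction(m - 1, m) ** (k * n)
    return p**k


small_exact = exact_fp_rate(k=2, n=1, m=2)
small_estimate = bloom_estimate(k=2, n=1, m=2)
```

For $k = 2$, $n = 1$, $m = 2$ this gives $5/8$ for the exact rate and $9/16$ for $p^k$. For $k = 1$ the two expressions coincide, since a single probed bit is set with probability exactly $p$.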
Summary of Bloom Filters
1. A simple scheme for testing membership. Has one-sided error, i.e., false positives.
2. How to find the right number of hash functions and the right size of the filter?
3. Implemented in various search engines, routers, spam filters, ...
4. Unpleasant analysis in our work (Reference: P. Bose, H. Guo, E. Kranakis, A. Maheshwari, P. Morin, J. Morrison, M. Smid, Y. Tang: On the false-positive rate of Bloom filters. Inf. Process. Letters 108(4): 210-213 (2008)).
5. Challenge: A nicer analysis. Hopefully, this will help with the analysis of variants of Bloom filters.
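On the question of choosing the number of hash functions and the filter size, a common rule of thumb (not derived in these slides) uses the approximation $p \approx 1 - e^{-kn/m}$ and minimizes $p^k$ over $k$, which gives $k \approx (m/n) \ln 2$ and $m \approx -n \ln \varepsilon / (\ln 2)^2$ for a target false-positive rate $\varepsilon$. The sketch below uses hypothetical function names and relies on that approximation, so its output is an estimate rather than an exact guarantee.

```python
import math


def optimal_parameters(n: int, target_fp: float) -> tuple[int, int]:
    """Approximate filter size m and hash count k for n items."""
    # Based on the approximation (1 - e^{-kn/m})^k, not the exact rate.
    m = math.ceil(-n * math.log(target_fp) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k


def approx_fp_rate(n: int, m: int, k: int) -> float:
    """Approximate false-positive rate (1 - e^{-kn/m})^k."""
    return (1 - math.exp(-k * n / m)) ** k


m, k = optimal_parameters(n=1000, target_fp=0.01)
rate = approx_fp_rate(1000, m, k)
```

Because $k$ must be rounded to an integer, the resulting approximate rate can land slightly above or below the target.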