Bloom Filters and their Applications

These slides were developed by, and used with permission from, Shengquan Wang. CPSC 662

Introduction
• Membership query: given a set $S = \{x_1, x_2, \ldots, x_n\}$ on a universe $U$, we want to answer queries of the form "Is $y \in S$?"
  – Example: spell check
• Data structure
  – Space
  – Search time
• $x_i$ can be a long string, and $n$ can be a very large number
• Hashing is one of the good candidates (randomized)
Hash Function
• Converts an input from a (typically) large domain into an output in a (typically) smaller range
[Figure: keys mapped by H(x) into table slots 0 to 7, with one slot marked "collision" and a query "y ∈ H(y)?" marked "false positive"]

Examples of Simple Hash Functions
• Truncation: if students have a 9-digit identification number, take the last 3 digits as the table position
  – e.g. 925371622 becomes 622
• Folding: split a 9-digit number into three 3-digit numbers and add them
  – e.g. 925371622 becomes 925 + 371 + 622 = 1918
• Modular arithmetic: if the table size is 1000, the truncation example always stays within the table range, but the folding example does not, so its result should be reduced mod 1000
  – e.g. 1918 mod 1000 = 918 (1918 % 1000)
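The three simple hash functions above can be sketched in a few lines of Python. The function names and the default table size of 1000 are illustrative assumptions, not part of the slides.

```python
def truncate(student_id: int, digits: int = 3) -> int:
    """Truncation: keep only the last `digits` decimal digits."""
    return student_id % 10**digits


def fold(student_id: int, chunk: int = 3) -> int:
    """Folding: split the number into `chunk`-digit groups and add them."""
    total = 0
    while student_id > 0:
        total += student_id % 10**chunk   # take the lowest group
        student_id //= 10**chunk          # drop it and continue
    return total


def fold_mod(student_id: int, table_size: int = 1000) -> int:
    """Folding followed by modular reduction so the result fits the table."""
    return fold(student_id) % table_size


if __name__ == "__main__":
    sid = 925371622
    print(truncate(sid), fold(sid), fold_mod(sid))
```

For the identifier 925371622 the groups are 925, 371, and 622, so folding gives 1918 and the reduced table position is 918.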
Hashing Performance
• Hash each element of the set to $b$ bits, with $b = 2 \log_2 n$
  – The probability that two given elements collide is $1/n^2$
  – False positive probability = $1/n$, since a query can collide with any of the $n$ stored hashes (asymptotically vanishing probability of error)
  – Binary search time = $O(\log_2 n)$
  – Space = $\Theta(n \log_2 n)$

Bloom Filters
• Generalized randomized data structure
• Invented by Burton Bloom in 1970
• Basic idea: use an $m$-bit array to represent a set with $n$ elements, using $k$ hash functions
• A Bloom filter provides an answer in
  – "Constant" search time (the time to hash)
  – A small amount of space
  – But with some probability of being wrong
B. Bloom, "Space/time tradeoffs in hash coding with allowable errors," CACM 13 (1970).
Example
• Start with an $m$-bit array $B$, filled with 0s
  B = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
• Hash each item $x_j \in S$ into $[1, \ldots, m]$, $k$ times. If $H_i(x_j) = a \in [1, \ldots, m]$, then set $B[a] = 1$
  B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
• To check whether $y \in S$, check whether all bits $B[H_i(y)]$ are 1
  B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
• False positive: all bits $B[H_i(y)]$ are 1, but $y$ is not in $S$
  B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

Example
[Figure: a filter with positions 1 to 12 (m = 12) and three hash functions h_1, h_2, h_3; inserted items X_1 and X_2, queries Y_1, Y_2, Y_3, one of them labeled "False Positive"]
• x_1 -> {2, 5, 9}
• x_2 -> {5, 7, 11}
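The insert and query steps above can be sketched as a small Python class. This is a minimal illustration, not the construction from Bloom's paper: deriving the k hash functions by salting SHA-256 with the index i is an assumption made here for convenience, positions are 0-based rather than the slides' [1,…,m], and one byte is used per bit to keep the code readable.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Minimal Bloom filter: an m-bit array probed by k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, all initially 0

    def _positions(self, item: str) -> Iterator[int]:
        # Assumed hash family: H_i(x) = SHA-256(i || x) mod m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        """Set B[H_i(item)] = 1 for every hash function."""
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        """True if all probed bits are 1 (may be a false positive)."""
        return all(self.bits[pos] for pos in self._positions(item))


if __name__ == "__main__":
    bf = BloomFilter(m=16, k=3)
    for word in ("x1", "x2"):
        bf.add(word)
    print("x1" in bf, "x2" in bf)   # inserted items are always reported present
    print("y1" in bf)               # usually False, occasionally a false positive
```

An inserted item is never reported absent, because its bits are only ever set, never cleared; only the "present" answer can be wrong.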
Probabilities
[Figure: example bit array 1 0 1 0 0 1 1 1 0 1 1 0]
• Notation:
  – $n$ = number of elements in the set to be represented
  – $m$ = size of the Bloom filter, in bits
  – $k$ = number of hash functions
• Probability that a bit is still zero after all elements are hashed into the Bloom filter: $p = (1 - 1/m)^{kn} \approx e^{-kn/m}$
• Probability of a false positive: $f = (1 - p)^k \approx (1 - e^{-kn/m})^k$

Determining the value of k
• Goal: choose the $k$ that minimizes the false positive rate
• Optimal result: $k = (\ln 2)\, m/n \Rightarrow f = (0.6185)^{m/n}$
  – $m$ = number of bits in the Bloom filter
  – $n$ = number of elements in the set
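One way to see where the optimal $k$ comes from is the standard derivation below. It is a sketch under the usual assumption, not stated on the slide, that the hash functions are independent and uniform over the $m$ positions; $p$ denotes the probability that a given bit is still zero.

```latex
\begin{align*}
p &= \left(1 - \tfrac{1}{m}\right)^{kn} \approx e^{-kn/m}
  \quad\Longrightarrow\quad k = -\frac{m}{n}\,\ln p, \\
f &= (1 - p)^k \approx \left(1 - e^{-kn/m}\right)^k, \\
\ln f &= k \ln(1 - p) = -\frac{m}{n}\,\ln p \,\ln(1 - p).
\end{align*}
% The product ln(p) ln(1-p) is symmetric under p <-> 1-p and is maximized at p = 1/2,
% which minimizes ln f. Substituting p = 1/2:
\begin{align*}
k_{\text{opt}} &= \frac{m}{n}\,\ln 2, \\
f_{\min} &= \left(\tfrac{1}{2}\right)^{k_{\text{opt}}}
          = \left(2^{-\ln 2}\right)^{m/n}
          \approx (0.6185)^{m/n}.
\end{align*}
```

At the optimum about half of the bits in the filter are set, which is a convenient sanity check on a deployed filter.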
Example
[Figure: false positive rate (0 to 0.1) versus number of hash functions (0 to 10) for m/n = 8; the optimum is at k = 8 ln 2 ≈ 5.545]

Tradeoffs
• Three parameters
  – Size $m/n$: bits per item
  – Time $k$: number of hash functions
  – Error $f$: false positive probability
• With $k$ kept near its optimum, the false positive probability decreases exponentially with a linear increase in the number of hash functions and in space
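The curve for m/n = 8 can be reproduced numerically from the approximation $f = (1 - e^{-kn/m})^k$. The sketch below assumes that approximation holds; the function names are illustrative.

```python
import math


def false_positive_rate(bits_per_item: float, k: int) -> float:
    """Approximate rate f = (1 - e^(-k n / m))^k, with bits_per_item = m/n."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


def optimal_k(bits_per_item: float) -> float:
    """Real-valued optimum k = (ln 2) * m/n."""
    return bits_per_item * math.log(2)


if __name__ == "__main__":
    bits_per_item = 8
    for k in range(1, 11):
        print(f"k={k:2d}  f={false_positive_rate(bits_per_item, k):.4f}")
    print(f"real-valued optimum k = {optimal_k(bits_per_item):.3f}")
```

Because $k$ must be an integer, the practical choice for m/n = 8 is 5 or 6; both give a false positive rate of about 0.022, close to the bound $(0.6185)^8 \approx 0.021$.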
Comparison
                       Hashing                  Bloom filters (m/n = 8)
Bits per element       $2 \log_2 n$             $m/n$
Space                  $\Theta(n \log_2 n)$     $n \cdot (m/n)$
False positive rate f  $1/n$                    $(1 - e^{-kn/m})^k$ (≈ 0.02)
Lookup time            $O(\log_2 n)$            $O(k)$
• k = 1
• Tradeoff between $m/n$ and $f$

Application: Distributed Caching
• Each cache sends the others a Bloom filter of the URLs it holds
• False positives do not hurt much
  – Errors arise from cache changes anyway
[Figure: Web Cache 1 through Web Cache 6]
L. Fan, P. Cao, J. Almeida, and A. Z. Broder, "Summary Cache: A scalable wide-area Web cache sharing protocol," IEEE/ACM Transactions on Networking, 2000.
Example
• http://www.perl.com/pub/a/2004/04/08/bloom_filters.html
• http://www.cs.wisc.edu/~cao/papers/summary-cache/node8.html
• http://www.flipcode.com/articles/article_bloomfilters.shtml
• http://loaf.cantbedone.org/about.htm
• http://www.cap-lore.com/code/BloomTheory.html
• http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf
• http://lemonodor.com/archives/000881.html
• http://citeseer.ist.psu.edu/mitzenmacher01compressed.html

Application: Set Reconciliation for Content Delivery
• Suppose two hosts A and B hold document sets $S_A$ and $S_B$
• A wants to know $S_A - S_B$ so that it can send B the documents that B does not have
• B sends A the Bloom filter corresponding to $S_B$
• A sends those of its documents that are not in that Bloom filter
• False positives: the result is approximate, since A may withhold a few documents that B actually lacks
J. Byers, J. Considine, M. Mitzenmacher, and S. Rost, "Informed Content Delivery Across Adaptive Overlay Networks," SIGCOMM 2002.
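The reconciliation exchange above can be sketched as follows. This is an illustration under stated assumptions, not the protocol from the cited paper: the filter is held in a Python integer used as a bit set, the parameters `M_BITS` and `K_HASHES` are arbitrary, and the hash functions are salted SHA-256 digests.

```python
import hashlib
from typing import Iterable, Iterator, Set

M_BITS = 1024    # assumed filter size in bits
K_HASHES = 5     # assumed number of hash functions


def bit_positions(item: str) -> Iterator[int]:
    """Yield the K_HASHES bit positions for an item (salted SHA-256)."""
    for i in range(K_HASHES):
        digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % M_BITS


def build_filter(items: Iterable[str]) -> int:
    """Host B: summarize its document set as a Bloom filter."""
    bits = 0
    for item in items:
        for pos in bit_positions(item):
            bits |= 1 << pos
    return bits


def maybe_contains(bits: int, item: str) -> bool:
    """True if every probed bit is set (possibly a false positive)."""
    return all((bits >> pos) & 1 for pos in bit_positions(item))


def documents_to_send(s_a: Set[str], filter_b: int) -> Set[str]:
    """Host A: keep only documents that B's filter says B lacks."""
    return {doc for doc in s_a if not maybe_contains(filter_b, doc)}


if __name__ == "__main__":
    s_a = {"doc1", "doc2", "doc3"}
    s_b = {"doc2", "doc3", "doc4"}
    print(documents_to_send(s_a, build_filter(s_b)))
```

A never sends a document B already holds, because the filter has no false negatives; a false positive only causes A to skip a document B is missing, which is why the result is an approximation of $S_A - S_B$.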
Application: Set Intersection for Keyword Search
• Let $H_A$, $H_B$ be the hosts responsible for keywords A and B respectively
• Suppose we want the documents containing both keywords A and B, i.e. find $S_A \cap S_B$
• Steps:
  – $H_A$ sends the Bloom filter corresponding to $S_A$ to $H_B$
  – $H_B$ computes an approximate $S_A \cap S_B$ and sends it back to $H_A$
• False positives: $H_A$ can detect and discard them, so they are not a problem
P. Reynolds and A. Vahdat, "Efficient Peer-to-peer keyword searching."

Application: Moderate-sized P2P networks
• Distributed hash tables for scalability
• For a moderate-sized P2P network
  – Per-node Bloom filter
  – Use 8 or 16 bits per object instead of 64-bit identifiers
  – False positives: not much of a problem
F. M. Cuenca-Acuna, C. Peery, R. P. Martin, and T. D. Nguyen, "PlanetP: Using gossiping to build content addressable peer-to-peer information sharing communities."
Application: Resource Routing
• The network has a tree topology
• B has Bloom filters for all of its child sub-trees collectively and also for each child sub-tree individually
[Figure: tree with nodes A through N; filters labeled S_b, S_f, S_g, S_h]
S. Rhea and J. Kubiatowicz, "Probabilistic Location and Routing," INFOCOM 2002.

Application: Multicast
• Typically, routers maintain a list of interfaces for each multicast address
• An efficient solution: keep a list of addresses for each interface and use a Bloom filter to represent these addresses
  – Parallelizable
• False positives: not bad, they just waste some resources
B. Gronvall, "Scalable Multicast Forwarding," SIGCOMM 2002.
Application: Detecting Routing Loops
• Current mechanism: TTL
• Each packet contains a small Bloom filter to track the nodes visited
  – If the filter does not change at a node, the packet may be in a loop
• False positives: problematic
A. Whitaker and D. Wetherall, "Forwarding without Loops in Icarus," OPENARCH 2002.

Application: IP Traceback
• Use Bloom filters to record the packets seen by each router
• False positives:
  – A router mistakenly identifies a packet as having been seen
  – Multiple possible paths
A. C. Snoeren, C. Partridge, L. A. Sanchez, C. E. Jones, F. Tchakountio, S. T. Kent, and W. T. Strayer, "Hash-based IP traceback," SIGCOMM 2001.
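The loop-detection idea from the Detecting Routing Loops slide can be sketched as below. This is a toy model, not the Icarus mechanism itself: the 64-bit filter size, the three bits per node, and the SHA-256-derived node masks are all assumptions made for illustration.

```python
import hashlib
from typing import Optional, Sequence

FILTER_BITS = 64   # assumed size of the in-packet Bloom filter
BITS_PER_NODE = 3  # assumed number of hash functions per node


def node_mask(node_id: str) -> int:
    """Bit mask a node ORs into the packet's filter (salted SHA-256)."""
    mask = 0
    for i in range(BITS_PER_NODE):
        digest = hashlib.sha256(f"{i}:{node_id}".encode("utf-8")).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % FILTER_BITS)
    return mask


def first_suspected_loop(path: Sequence[str]) -> Optional[int]:
    """Return the hop index at which the filter stops changing, else None.

    An unchanged filter means the node's bits were all set already: either
    the node was visited before (a real loop) or this is a false positive.
    """
    packet_filter = 0
    for hop, node in enumerate(path):
        mask = node_mask(node)
        if packet_filter | mask == packet_filter:
            return hop
        packet_filter |= mask
    return None


if __name__ == "__main__":
    print(first_suspected_loop(["A", "B", "C", "A"]))  # revisit of A is flagged
    print(first_suspected_loop(["A", "B", "C", "D"]))  # usually None
```

A revisited node is always flagged, since its bits are already in the filter. The cost is on the other side: a false positive flags a loop-free path, which is why the slide calls false positives problematic for this application.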
Summary
• The Bloom Filter Principle: wherever a list or set is used, and space is a consideration, a Bloom filter should be considered. When using a Bloom filter, consider the potential effects of false positives.

References
• Space/time tradeoffs in hash coding with allowable errors. B. Bloom. CACM 13 (1970).
• Network Applications of Bloom Filters: A Survey. A. Broder and M. Mitzenmacher. Allerton Conference 2002.
• Compressed Bloom Filters. M. Mitzenmacher. PODC 2001.
• Spectral Bloom Filters. S. Cohen and Y. Matias. SIGMOD 2003.
• The Bloomier Filter: An Efficient Data Structure for Static Support Lookup Tables. B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. SODA 2004.