120 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010

The Dynamic Bloom Filters

Deke Guo, Member, IEEE, Jie Wu, Fellow, IEEE, Honghui Chen, Ye Yuan, and Xueshan Luo

Abstract: A Bloom filter is an effective, space-efficient data structure for concisely representing a set and supporting approximate membership queries. Traditionally, the Bloom filter and its variants just focus on how to represent a static set and decrease the false positive probability to a sufficiently low level. By investigating mainstream applications based on the Bloom filter, we reveal that dynamic data sets are more common and important than static sets. However, existing variants of the Bloom filter cannot support dynamic data sets well. To address this issue, we propose dynamic Bloom filters to represent dynamic sets as well as static sets, and design the necessary item insertion, membership query, item deletion, and filter union algorithms. The dynamic Bloom filter can control the false positive probability at a low level by expanding its capacity as the set cardinality increases. Through comprehensive mathematical analysis, we show that the dynamic Bloom filter uses less expected memory than the Bloom filter when representing dynamic sets with an upper bound on set cardinality, and also that the dynamic Bloom filter is more stable than the Bloom filter due to infrequent reconstruction when addressing dynamic sets without an upper bound on set cardinality. Moreover, the analysis results hold in stand-alone applications, as well as distributed applications.

Index Terms: Bloom filters, dynamic Bloom filters, information representation.

D. Guo, H. Chen, and X. Luo are with the Key Laboratory of C4ISR Technology, National University of Defense Technology, Changsha 410073, China. E-mail: {guodeke, chh0808}@gmail.com, xsluo@nudt.edu.cn.
J. Wu is with the Department of Computer and Information Sciences, Temple University, 1805 N. Broad Street, Philadelphia, PA 19122. E-mail: jiewu@temple.edu.
Y. Yuan is with the Institute of Computer Systems, Northeastern University, 132#, Shen Yang City, Liao Ning Province 110004, China. E-mail: linuxyy@gmail.com.

Manuscript received 26 May 2007; revised 19 July 2008; accepted 10 Feb. 2009; published online 18 Feb. 2009.
Recommended for acceptance by D. Gunopulos.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2007-05-0239.
Digital Object Identifier no. 10.1109/TKDE.2009.57.
1041-4347/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.

1 INTRODUCTION

Information representation and processing of membership queries are two associated issues that encompass the core problems in many computer applications. Representation means organizing information based on a given format and mechanism such that information is operable by a corresponding method. The processing of membership queries involves making decisions based on whether an item with a specific attribute value belongs to a given set. A standard Bloom filter (SBF) is a space-efficient data structure for representing a set and answering membership queries within a constant delay [1]. The space efficiency is achieved at the cost of false positives in membership queries, and for many applications, the space savings outweigh this drawback when the probability of an error is sufficiently low.

The SBF has been extensively used in many database applications [2], for example, the Bloom join [3]. Recently, it has started receiving more widespread attention in networking literature [4]. An SBF can be used as a summarizing technique to aid global collaboration in peer-to-peer (P2P) networks [5], [6], [7], support probabilistic algorithms for routing and locating resources [8], [9], [10], [11], and share Web cache information [12]. In addition, SBFs have great potential for representing a set in main memory [13] in stand-alone applications. For example, SBFs have been used to provide a probabilistic approach for explicit state model checking of finite-state transition systems [13], to summarize the contents of stream data in memory [14], [15], to store the states of flows in the on-chip memory at networking devices [16], and to store the statistical values of tokens to speed up the statistical-based Bayesian filters [17].

The SBF has been modified and improved from different aspects for a variety of specific problems. The most important variations include compressed Bloom filters [18], counting Bloom filters [12], distance-sensitive Bloom filters [19], Bloom filters with two hash functions [20], space-code Bloom filters [21], spectral Bloom filters [22], generalized Bloom filters [23], Bloomier filters [24], and Bloom filters based on partitioned hashing [25]. Compressed Bloom filters can improve performance in terms of bandwidth saving when an SBF is passed on as a message. Counting Bloom filters deal mainly with the item deletion operation. Distance-sensitive Bloom filters, using locality-sensitive hash functions, can answer queries of the form, “Is x close to an item of S?” Bloom filters with two hash functions use a standard technique in hashing to simplify the implementation of SBFs significantly. Space-code Bloom filters and spectral Bloom filters focus on multisets, which support queries of the form, “How many occurrences of an item are there in a given multiset?” The SBF and its mainstream variations are suitable for representing static sets whose cardinality is known prior to design and deployment.

Although the SBF and its variations have found suitable applications in different fields, the following three obstacles still lack suitable and practical solutions:

1. For stand-alone applications that know the upper bound on set cardinality for a dynamic set in advance, a large number of bits are allocated for an SBF to represent all possible items of the dynamic set at the outset. This approach diminishes the space