Sampling and Reconstruction Using Bloom Filters
Neha Sengupta (1), Amitabha Bagchi (1), Srikanta Bedathur (2), Maya Ramanath (1)
(1) IIT Delhi, (2) IBM India Research Lab
Bloom Filters
● Compact storage of a set, e.g. S = { A, B, C }
● m bits, k hash functions
● Set membership query (e.g. for an element W)
● Union
● Intersection
● Sampling
● Reconstruction
[Figure: example bit array 0 1 1 1 0 1 0 1 0 1 1 0]
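The operations listed on this slide can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: the class name `BloomFilter`, the use of a Python integer as the bit array, and the salted SHA-256 hash family are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: m bits kept in a Python int, k salted hashes."""

    def __init__(self, m, k, items=()):
        self.m, self.k, self.bits = m, k, 0
        for x in items:
            self.add(x)

    def _positions(self, x):
        # Assumption: the k hash functions are SHA-256 salted with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, x):
        for p in self._positions(x):
            self.bits |= 1 << p

    def __contains__(self, x):
        # True for every inserted element; may also be True for others
        # (false positives), never False for an inserted element.
        return all((self.bits >> p) & 1 for p in self._positions(x))

    def union(self, other):
        # Bitwise OR gives exactly the filter of the union of the two sets.
        out = BloomFilter(self.m, self.k)
        out.bits = self.bits | other.bits
        return out

    def intersection(self, other):
        # Bitwise AND contains every common element, but may have extra bits
        # set compared with a filter built directly from the intersection.
        out = BloomFilter(self.m, self.k)
        out.bits = self.bits & other.bits
        return out


s = BloomFilter(12, 2, ["A", "B", "C"])
print("A" in s, format(s.bits, "012b"))
```

Union and intersection only make sense between filters that share m, k and the same hash functions.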
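Sampling from a Bloom filter
Two approaches:
● Dictionary Attack: test every element of the namespace for membership in the Bloom filter (e.g. 0100010001011110), collect the positives into a candidate set, and sample from it. Drawback: slow.
● HashInvert: take a set bit of the Bloom filter, invert the hash function to obtain candidate elements, check them for membership, and sample. Drawbacks: non-uniform sample, requires an invertible hash function.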
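A minimal sketch of the Dictionary Attack baseline, assuming a Bloom filter stored as a Python integer and a salted SHA-256 hash family. The helper names `positions`, `make_filter`, `member` and `dictionary_attack_sample` are hypothetical and chosen for this example only.

```python
import hashlib
import random


def positions(x, m, k):
    # Assumed hash family: SHA-256 salted with the hash index i.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


def make_filter(items, m, k):
    bits = 0
    for x in items:
        for p in positions(x, m, k):
            bits |= 1 << p
    return bits


def member(x, bits, m, k):
    return all((bits >> p) & 1 for p in positions(x, m, k))


def dictionary_attack_sample(bits, namespace, m, k, rng=random):
    # One membership test per namespace element: cost grows with the
    # namespace size M, not with the size of the stored set.
    candidates = [x for x in namespace if member(x, bits, m, k)]
    return rng.choice(candidates) if candidates else None


b = make_filter({3, 500, 999}, m=4096, k=3)
print(dictionary_attack_sample(b, range(1000), m=4096, k=3))
```

The sample is uniform over the candidate set, which contains the stored elements plus any false positives of the filter.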
BloomSampleTree: Sampling from Bloom filters
● Set S contains elements in the range [0, 15] (M = 16)
● S is stored in a Bloom filter b with 10 bits (m = 10)
● Bloom filters use 2 hash functions (k = 2)
● BloomSampleTree bT, created using (M = 16, m = 10, k = 2):
  Root: (0..15)
  Level 1: (0..7) = 1111111010, (8..15) = 1111111110
  Leaves: (0..3) = 1101100010, (4..7) = 0110011000, (8..11) = 1100110010, (12..15) = 0111011100
● Bloom filters b1, b2, b3 storing sets S1, S2, S3 respectively:
  b1 = 0100111000, S1 = {1, 12}
  b2 = 0111000000, S2 = {4, 6}
  b3 = 1000010100, S3 = {3, 7}
● bT can be used to sample from all 3 sets
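The tree on this slide can be approximated with a short recursive construction. This is a sketch under assumptions: each leaf stores a Bloom filter of every identifier in its range, each internal node stores the bitwise OR of its children, and the hash family (salted SHA-256) is a stand-in, so the bit patterns will not match the ones shown on the slide. `Node`, `positions` and `build_tree` are hypothetical names; ranges are half-open in the code, so the slide's (0..3) is `lo=0, hi=4`.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def positions(x, m, k):
    # Assumed hash family: SHA-256 salted with the hash index i.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


@dataclass
class Node:
    lo: int                      # inclusive lower end of the range
    hi: int                      # exclusive upper end of the range
    bits: int                    # Bloom filter of all identifiers in [lo, hi)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(lo, hi, m, k, leaf_size):
    if hi - lo <= leaf_size:
        bits = 0
        for x in range(lo, hi):
            for p in positions(x, m, k):
                bits |= 1 << p
        return Node(lo, hi, bits)
    mid = (lo + hi) // 2
    left = build_tree(lo, mid, m, k, leaf_size)
    right = build_tree(mid, hi, m, k, leaf_size)
    # An internal node covers the union of its children's ranges.
    return Node(lo, hi, left.bits | right.bits, left, right)


bT = build_tree(0, 16, m=10, k=2, leaf_size=4)
for leaf in (bT.left.left, bT.left.right, bT.right.left, bT.right.right):
    print(f"({leaf.lo}..{leaf.hi - 1})", format(leaf.bits, "010b"))
```

The tree depends only on (M, m, k) and the hash functions, not on any stored set, which is why one tree can serve b1, b2 and b3.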
BloomSampleTree Example: Sampling from Bloom filter b1 using BloomSampleTree bT
Original set S1 = {1, 12}, b1 = 0100111000
● Start at the root (0..15)
● Intersect b1 with the filter of each child:
  (0..7): 1111111010 AND 0100111000 = 0100111000
  (8..15): 1111111110 AND 0100111000 = 0100111000
● p_L = 0.5, p_R = 0.5
BloomSampleTree Example: Sampling from Bloom filter b1 using BloomSampleTree bT
Original set S1 = {1, 12}, b1 = 0100111000
● Chosen subtree: (0..7)
● Intersect b1 with the filter of each child:
  (0..3): 1101100010 AND 0100111000 = 0100100000
  (4..7): 0110011000 AND 0100111000 = 0100011000
● p_L = 0.52, p_R = 0.48
BloomSampleTree Example: Sampling from Bloom filter b1 using BloomSampleTree bT
Original set S1 = {1, 12}, b1 = 0100111000
● Chosen leaf: (4..7)
● Membership(4, b1) = false
● Membership(5, b1) = false
● Membership(6, b1) = false
● Membership(7, b1) = false
● This path was a false positive
BloomSampleTree Example: Sampling from Bloom filter b1 using BloomSampleTree bT
Original set S1 = {1, 12}, b1 = 0100111000
● Chosen leaf: (0..3)
● Membership(0, b1) = false
● Membership(1, b1) = true
● Membership(2, b1) = false
● Membership(3, b1) = false
● Sample = 1
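The walk in this example (intersect, pick a child at random, test the leaf, back off on a false positive path) can be sketched as follows. Several things here are assumptions rather than the authors' method: the child probabilities are taken proportional to the number of set bits in each intersection, which is only a crude stand-in for whatever intersection-size estimate the slides use; the hash family is salted SHA-256; and all function names are hypothetical.

```python
import hashlib
import random
from dataclasses import dataclass
from typing import Optional


def positions(x, m, k):
    # Assumed hash family: SHA-256 salted with the hash index i.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


def make_filter(items, m, k):
    bits = 0
    for x in items:
        for p in positions(x, m, k):
            bits |= 1 << p
    return bits


def member(x, bits, m, k):
    return all((bits >> p) & 1 for p in positions(x, m, k))


@dataclass
class Node:
    lo: int
    hi: int                      # exclusive
    bits: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(lo, hi, m, k, leaf_size):
    if hi - lo <= leaf_size:
        return Node(lo, hi, make_filter(range(lo, hi), m, k))
    mid = (lo + hi) // 2
    left = build_tree(lo, mid, m, k, leaf_size)
    right = build_tree(mid, hi, m, k, leaf_size)
    return Node(lo, hi, left.bits | right.bits, left, right)


def sample(node, bits, m, k, rng):
    if node.bits & bits == 0:
        return None                              # empty intersection: prune
    if node.left is None:                        # leaf: test each identifier
        hits = [x for x in range(node.lo, node.hi) if member(x, bits, m, k)]
        return rng.choice(hits) if hits else None    # None = false positive path
    w_left = bin(node.left.bits & bits).count("1")   # assumed weight: popcount
    w_right = bin(node.right.bits & bits).count("1")
    if rng.random() < w_left / (w_left + w_right):
        first, second = node.left, node.right
    else:
        first, second = node.right, node.left
    result = sample(first, bits, m, k, rng)
    if result is None:                           # back off and try the sibling
        result = sample(second, bits, m, k, rng)
    return result


bT = build_tree(0, 16, m=10, k=2, leaf_size=4)
b1 = make_filter({1, 12}, m=10, k=2)
print(sample(bT, b1, 10, 2, random.Random()))
```

With only 10 bits the filter has false positives, so the returned value may be an identifier other than 1 or 12 that also tests positive in b1.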
BloomSampleTree - Sampling
[Figure: search over a BloomSampleTree whose root covers the range (1 ... 10M). Legend: false positive path, empty intersection, potential path, true path, subtree pruned from search, subtree not visited at all.]
BloomSampleTree - Sampling
[Figures: plots of Sample Quality and Running Time for the experimental setting]
BloomSampleTree - Sampling
Algorithms:
● Dictionary Attack (DA)
● BloomSampleTree (BST)
Setting:
● M = 10^7
● k = 3
● m increases with desired accuracy
● Size of set S, |S|, varies from 100 to 50K
[Figure panels: Simple Hash Functions; MD5/Murmur Hash Functions]
BloomSampleTree - Reconstruction
● Similar to sampling, but follow all positive paths (multi-threaded)
● Example: BloomSampleTree bT, S = { 4, 6 }, Bloom filter b = 0111000000
● Challenge:
  − Bloom filter intersections are almost never empty
  − Every path looks like a “true path”
  − Follow a path only if the size of the intersection exceeds a threshold
● False positives of set S become part of the reconstructed set S’
● Reconstructed set: S’ = { 4, 6, 13 }
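Reconstruction with a threshold can be sketched as a traversal that descends into every child whose intersection with the query filter is large enough. The sketch below is single-threaded and measures "size of intersection" as the number of set bits in the bitwise AND; both are simplifying assumptions, as are the salted SHA-256 hashes and the names `reconstruct` and `min_bits`.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def positions(x, m, k):
    # Assumed hash family: SHA-256 salted with the hash index i.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


def make_filter(items, m, k):
    bits = 0
    for x in items:
        for p in positions(x, m, k):
            bits |= 1 << p
    return bits


def member(x, bits, m, k):
    return all((bits >> p) & 1 for p in positions(x, m, k))


@dataclass
class Node:
    lo: int
    hi: int                      # exclusive
    bits: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(lo, hi, m, k, leaf_size):
    if hi - lo <= leaf_size:
        return Node(lo, hi, make_filter(range(lo, hi), m, k))
    mid = (lo + hi) // 2
    left = build_tree(lo, mid, m, k, leaf_size)
    right = build_tree(mid, hi, m, k, leaf_size)
    return Node(lo, hi, left.bits | right.bits, left, right)


def reconstruct(node, bits, m, k, min_bits=1):
    # Follow a path only if the intersection has at least min_bits set bits
    # (popcount is an assumed stand-in for the intersection-size estimate).
    if bin(node.bits & bits).count("1") < min_bits:
        return set()
    if node.left is None:
        return {x for x in range(node.lo, node.hi) if member(x, bits, m, k)}
    return (reconstruct(node.left, bits, m, k, min_bits)
            | reconstruct(node.right, bits, m, k, min_bits))


bT = build_tree(0, 16, m=10, k=2, leaf_size=4)
b = make_filter({4, 6}, m=10, k=2)
print(sorted(reconstruct(bT, b, 10, 2)))
```

With `min_bits=1` the result equals what a dictionary attack would return, false positives included. Raising `min_bits` prunes more subtrees but can drop true elements.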
BloomSampleTree - Reconstruction
Algorithms:
● Dictionary Attack (DA): test each element in the namespace for membership
● HashInvert (HI): invert each set bit in the Bloom filter and prune using membership
● BloomSampleTree (BST)
Setting:
● M = 10^7
● k = 3
● m increases with desired precision
● Size of set S, |S|, varies from 100 to 50K
[Figure panels: Simple Hash Functions; MD5 Hash Functions]
Pruned BloomSampleTree
● The range of possible values is large, but the namespace actually used is much smaller
● Do not expand nodes corresponding to ‘unused’ regions
● Smaller BloomSampleTree, faster sampling
● Add nodes to the BloomSampleTree as the namespace changes
[Figure: the example tree over (0..15) with only two leaf filters shown, 0110011000 and 1100110010]
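One way to picture the pruning is a builder that returns no node at all for a range containing no identifier in use. This is a sketch under assumptions: a leaf that contains at least one used identifier still covers its whole range, the used identifiers are given as a sorted list, and the names `build_pruned` and `count_nodes` are hypothetical. Adding nodes as the namespace changes is not shown.

```python
import hashlib
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional


def positions(x, m, k):
    # Assumed hash family: SHA-256 salted with the hash index i.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


@dataclass
class Node:
    lo: int
    hi: int                      # exclusive
    bits: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_pruned(lo, hi, used, m, k, leaf_size):
    """Build the tree over [lo, hi), skipping ranges with no used identifier.

    `used` must be a sorted list of the identifiers actually in use.
    """
    i = bisect_left(used, lo)
    if i == len(used) or used[i] >= hi:
        return None                              # 'unused' region: no node
    if hi - lo <= leaf_size:
        bits = 0
        for x in range(lo, hi):
            for p in positions(x, m, k):
                bits |= 1 << p
        return Node(lo, hi, bits)
    mid = (lo + hi) // 2
    left = build_pruned(lo, mid, used, m, k, leaf_size)
    right = build_pruned(mid, hi, used, m, k, leaf_size)
    bits = (left.bits if left else 0) | (right.bits if right else 0)
    return Node(lo, hi, bits, left, right)


def count_nodes(node):
    return 0 if node is None else 1 + count_nodes(node.left) + count_nodes(node.right)


pruned = build_pruned(0, 16, [4, 5, 9], m=10, k=2, leaf_size=4)
print(count_nodes(pruned))  # 5 nodes instead of the 7 of the full tree
```

A sampling or reconstruction routine over such a tree has to treat a missing child as an empty subtree.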
Pruned BloomSampleTree
● Sampling for a smaller namespace with simple hash functions
● Namespace fraction: how large is the actually used section of the range?
● How are the actually used parts of the namespace distributed within the range: uniform or clustered?
● Both affect BST size and sampling time
Thank you!