Optimal Sampling from Distributed Streams
Qin Zhang
Joint work with Graham Cormode (AT&T), S. Muthukrishnan (Rutgers), Ke Yi (HKUST)
Sept. 17, 2010, MSRA

Reservoir sampling [Waterman ’??; Vitter ’85]
Problem: Maintain a uniform sample (without replacement) of size s from a stream of n items. Every subset of size s has equal probability of being the sample.
Solution: When the i-th item arrives:
• With probability s/i, use it to replace an item chosen uniformly at random from the current sample.
• With probability 1 − s/i, throw it away.
Correctness: intuitive
Cost: space O(s), time O(1) per item

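
The replacement rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `reservoir_sample` and the `rng` parameter are assumptions, and the first s items are simply kept, which matches "probability s/i" being at least 1 while i ≤ s.

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Keep a uniform sample (without replacement) of size s from an iterable."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            # The first s items always enter the sample (s/i >= 1).
            sample.append(item)
        elif rng.random() < s / i:
            # With probability s/i, overwrite a uniformly chosen slot.
            sample[rng.randrange(s)] = item
        # Otherwise (probability 1 - s/i) the item is thrown away.
    return sample
```

Only the s sample slots are stored, and each arrival costs a constant number of operations, in line with the O(s) space and O(1) time stated on the slide.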
Sampling from a sliding window [Babcock, Datar, Motwani, SODA’02; Gemulla, Lehner, SIGMOD’08; Braverman, Ostrovsky, Zaniolo, PODS’09]
[Figure: a sliding window of length w over the time axis]
Time-based window and sequence-based window
Space: Θ(s log w), where w is the number of items in the sliding window
Time: Θ(log w)

Sampling from distributed streams
Maintain a uniform sample (without replacement) of size s from k streams with a total of n items.
Primary goal: communication
Secondary goal: space/time at coordinator/site
[Figure: k sites S_1, S_2, S_3, ..., S_k, each observing a stream over time, all connected to a coordinator C]
Applications:
• Internet routers
• Sensor networks
• Distributed computing

Why existing solutions don’t work
When k = 1, reservoir sampling has communication Θ(s log n).
When k ≥ 2, it has cost O(n), because it is costly to track i.
Tracking i approximately? Then the sampling won’t be uniform.
Key observation: We don’t have to know the exact size of the population in order to sample!
[Figure: k sites S_1, S_2, S_3, ..., S_k, each observing a stream over time, all connected to a coordinator C]

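
The Θ(s log n) figure for k = 1 can be checked numerically. The sketch below assumes one message per item that enters the sample, so the expected communication is the sum of min(1, s/i) over i = 1..n, which is roughly s(1 + ln(n/s)). Both function names are illustrative and do not come from the talk.

```python
import math

def expected_reservoir_messages(n, s):
    """Expected number of items entering the sample among n arrivals.

    Assumption: every item that enters the sample costs one message.
    """
    return sum(min(1.0, s / i) for i in range(1, n + 1))

def approx_reservoir_messages(n, s):
    """Closed-form estimate s * (1 + ln(n / s)), meaningful for n >= s."""
    return s * (1 + math.log(n / s))
```

For s much smaller than n the two values stay close, and both grow logarithmically in n, which is the behaviour the slide summarizes as Θ(s log n).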
Previous results on distributed streaming
A lot of heuristics in the database/networking literature, but random sampling has not been studied, even heuristically.
• Threshold monitoring, frequency moments [Cormode, Muthukrishnan, Yi, SODA’08]
• Entropy [Arackaparambil, Brody, Chakrabarti, ICALP’08]
• Heavy hitters and quantiles [Yi, Zhang, PODS’09]
• Basic counting, heavy hitters, quantiles in sliding windows [Chan, Lam, Lee, Ting, STACS’10]
All of them are deterministic algorithms, or use randomized sketches as black boxes, and the tracking is “approximate”.

Our results on random sampling

window            upper bounds                    lower bounds
infinite          O(k log_{k/s} n + s log n)      Ω(k log_{k/s} n + s log n)
sequence-based    O(ks log(w/s))                  Ω(ks log(w/(ks)))
time-based        O((k + s) log w)                Ω(k + s log w)
(per window)

Applications
• Heavy hitters and quantiles can be tracked in Õ(k + 1/ε²). This beats the deterministic bound Θ̃(k/ε) for k ≫ 1/ε. Also for sliding windows.
• ε-approximations in bounded VC dimensions: Õ(k + 1/ε²)
• ε-nets: Õ(k + 1/ε)
• ...

ISWoR
[Figure: example with s = 4, showing l = 0, u = 1, and the midpoint m = (l + u)/2]
The protocol
Rank: for each arriving item, generate a random number in [0, 1] as its rank.
Site: always maintains an upper bound u (initialized to 1) and a lower bound l (initialized to 0), and only sends those items with rank in the range [l, u].
Coordinator: let m = (l + u)/2, and wait until either
• the number of items received in the range [l, m] becomes ≥ s, then update each site with u = m; or
• the number of items received in the range [m, u] becomes ≥ s, then update each site with l = m.
Report: subsample s items from all items in [l, u].
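
The protocol above can be simulated in a single process. This is a sketch under stated assumptions, not the authors' implementation: the name `simulate_iswor`, the round-robin arrival order, and the choice to check the two halves once per received item are all illustrative. A rank exactly equal to m is counted in the upper half.

```python
import random

def simulate_iswor(streams, s, rng):
    """Simulate the protocol on a non-empty list of per-site item lists."""
    l, u = 0.0, 1.0      # bounds known to every site
    pool = []            # (rank, item) pairs held by the coordinator
    sent = 0             # site-to-coordinator messages
    broadcasts = 0       # bound updates sent to all sites
    longest = max(len(items) for items in streams)
    for t in range(longest):              # round-robin arrivals (assumption)
        for items in streams:
            if t >= len(items):
                continue
            rank = rng.random()           # Rank: uniform random number
            if not (l <= rank <= u):      # Site: filter locally, send nothing
                continue
            sent += 1
            pool.append((rank, items[t]))
            m = (l + u) / 2
            low = sum(1 for r, _ in pool if r < m)
            if low >= s:                  # enough items in the lower half
                u = m
            elif len(pool) - low >= s:    # enough items in the upper half
                l = m
            else:
                continue
            broadcasts += 1
            pool = [(r, x) for r, x in pool if l <= r <= u]
    # Report: subsample s items from everything still in [l, u].
    picked = rng.sample(pool, min(s, len(pool)))
    return [x for _, x in picked], sent, broadcasts
```

A possible call is `simulate_iswor([list(range(i, 2000, 4)) for i in range(4)], 4, random.Random(1))`, which returns the sample together with the two message counters so that communication can be compared against the total number of items.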