
Optimal Sampling from Distributed Streams


  1. Optimal Sampling from Distributed Streams. Qin Zhang. Joint work with Graham Cormode (AT&T), S. Muthukrishnan (Rutgers), and Ke Yi (HKUST). Sept. 17, 2010, MSRA.

  5. Reservoir sampling [Waterman ’??; Vitter ’85]
     Problem: Maintain a uniform sample (without replacement) of size s from a stream of n items. Every subset of size s is equally likely to be the sample.
     Solution: When the i-th item arrives:
     • With probability s/i, use it to replace an item in the current sample, chosen uniformly at random.
     • With probability 1 − s/i, throw it away.
     Correctness: intuitive.
     Cost: space O(s), time O(1) per item.
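
The reservoir sampling rule above can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: the function name `reservoir_sample` and the optional `rng` parameter are assumptions, and the first s items are simply stored, because s/i ≥ 1 whenever i ≤ s.

```python
import random


def reservoir_sample(stream, s, rng=None):
    """Return a uniform sample (without replacement) of up to s items from stream."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            # The first s items fill the reservoir (acceptance probability s/i >= 1).
            sample.append(item)
        elif rng.random() < s / i:
            # Accept the i-th item with probability s/i and evict a
            # uniformly chosen item from the current sample.
            sample[rng.randrange(s)] = item
        # Otherwise (probability 1 - s/i) the item is thrown away.
    return sample
```

Calling `reservoir_sample(range(1000), 5)` returns five distinct integers from the stream; the function keeps only the s sampled items in memory and does constant work per arrival.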

  8. Sampling from a sliding window [Babcock, Datar, Motwani, SODA’02; Gemulla, Lehner, SIGMOD’08; Braverman, Ostrovsky, Zaniolo, PODS’09]
     [Figure: a window of length w on the time axis]
     Two variants: time-based windows and sequence-based windows.
     Space: Θ(s log w), where w is the number of items in the sliding window.
     Time: Θ(log w).

  10. Sampling from distributed streams
     Maintain a uniform sample (without replacement) of size s from k streams with a total of n items.
     Primary goal: communication.
     Secondary goal: space/time at the coordinator and at each site.
     Applications: Internet routers, sensor networks, distributed computing.
     [Figure: sites S_1, S_2, S_3, ..., S_k, each receiving a stream over time, all connected to a coordinator C]

  14. Why existing solutions don’t work
     • When k = 1, reservoir sampling has communication Θ(s log n).
     • When k ≥ 2, it has cost O(n), because it is costly to track i.
     • Tracking i approximately? Then the sampling won’t be uniform.
     Key observation: We don’t have to know the exact size of the population in order to sample!
     [Figure: sites S_1, S_2, S_3, ..., S_k over time, connected to a coordinator C]
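
The Θ(s log n) figure for k = 1 can be checked with a short calculation. The assumption here, which the slide does not spell out, is that the single site runs reservoir sampling locally and sends an item to the coordinator only when that item enters the sample. The first s items are always sent, and item i > s is sent with probability s/i, so with H_m denoting the m-th harmonic number:

```latex
\mathbb{E}[\text{items sent}]
  = s + \sum_{i=s+1}^{n} \frac{s}{i}
  = s \left( 1 + H_n - H_s \right)
  \approx s \left( 1 + \ln \frac{n}{s} \right)
```

For example, whenever s ≤ √n we have ln(n/s) ≥ (1/2) ln n, so the expected communication is Θ(s log n), matching the first bullet.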

  17. Previous results on distributed streaming
     A lot of heuristics in the database/networking literature, but random sampling has not been studied, even heuristically.
     • Threshold monitoring, frequency moments [Cormode, Muthukrishnan, Yi, SODA’08]
     • Entropy [Arackaparambil, Brody, Chakrabarti, ICALP’08]
     • Heavy hitters and quantiles [Yi, Zhang, PODS’09]
     • Basic counting, heavy hitters, quantiles in sliding windows [Chan, Lam, Lee, Ting, STACS’10]
     All of them are deterministic algorithms, or use randomized sketches as black boxes, and the tracking is “approximate”.

  20. Our results on random sampling
     window           upper bound                   lower bound
     infinite         O(k log_{k/s} n + s log n)    Ω(k log_{k/s} n + s log n)
     sequence-based   O(ks log(w/s))                Ω(ks log(w/ks))
     time-based       O((k + s) log w)              Ω(k + s log w)    (per window)
     Applications:
     • Heavy hitters and quantiles can be tracked in Õ(k + 1/ε²). This beats the deterministic bound Θ̃(k/ε) for k ≫ 1/ε. Also for sliding windows.
     • ε-approximations in bounded VC dimensions: Õ(k + 1/ε²)
     • ε-nets: Õ(k + 1/ε)
     • ...

  21. ISWoR: the protocol
     Site: always maintains an upper bound u (initialized to 1) and a lower bound l (initialized to 0), and only sends those items with rank in the range [l, u].
     Rank: for each arriving item, generate a random number in [0, 1] as its rank.

  22. ISWoR: the protocol
     Site: always maintains an upper bound u (initialized to 1) and a lower bound l (initialized to 0), and only sends those items with rank in the range [l, u].
     Coordinator: let m = (l + u)/2, and wait until either
     • the number of items received in the range [l, m] becomes ≥ s, then update each site with u = m; or
     • the number of items received in the range [m, u] becomes ≥ s, then update each site with l = m.
     Report: subsample s items from all items in [l, u].

  23. ISWoR: the protocol, animated example (slides 23 to 27). [Figure: the rank interval for s = 4, starting from l = 0, u = 1, m = (l + u)/2; later frames show the interval with l, m, and u marked.] The protocol text on these slides repeats slide 22.
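
The message flow of the protocol can be sketched as a small single-process simulation. This is a simplified variant under stated assumptions, not the authors' implementation: it applies only the first coordinator rule (lowering u to m when [l, m] holds at least s items) and leaves out the rule that raises l, so l stays at 0 and the threshold depends only on the set of ranks seen so far. The class names `Site` and `Coordinator`, the helper `run_iswor`, the `sent` and `halvings` counters, and the choice to deliver each item to a random site are all hypothetical.

```python
import random


class Coordinator:
    """Keeps every item with rank in [l, u]; halves u when [l, m] holds >= s items."""

    def __init__(self, s, rng):
        self.s, self.rng = s, rng
        self.l, self.u = 0.0, 1.0
        self.pool = []       # (rank, item) pairs with l <= rank <= u
        self.halvings = 0    # hypothetical counter: broadcasts of a new u

    def receive(self, rank, item):
        """Store one forwarded item; return True if u was lowered."""
        self.pool.append((rank, item))
        changed = False
        # Simplification: only the "u = m" rule is applied, repeatedly,
        # until the lower half [l, m] holds fewer than s items.
        while sum(r <= (self.l + self.u) / 2 for r, _ in self.pool) >= self.s:
            self.u = (self.l + self.u) / 2
            self.pool = [(r, x) for r, x in self.pool if r <= self.u]
            self.halvings += 1
            changed = True
        return changed

    def report(self):
        """Subsample s items uniformly from all items currently in [l, u]."""
        chosen = self.rng.sample(self.pool, min(self.s, len(self.pool)))
        return [item for _, item in chosen]


class Site:
    """Forwards an item only if its random rank falls in [l, u]."""

    def __init__(self, rng):
        self.rng = rng
        self.l, self.u = 0.0, 1.0
        self.sent = 0        # hypothetical counter: items forwarded by this site

    def observe(self, item):
        rank = 1.0 - self.rng.random()   # uniform in (0, 1], never exactly 0
        if self.l <= rank <= self.u:
            self.sent += 1
            return rank
        return None


def run_iswor(n, k, s, seed=0):
    """Feed items 0..n-1 to k sites and return the coordinator and the sites."""
    rng = random.Random(seed)
    coord = Coordinator(s, rng)
    sites = [Site(rng) for _ in range(k)]
    for item in range(n):
        site = rng.choice(sites)         # assumption: items arrive at random sites
        rank = site.observe(item)
        if rank is not None and coord.receive(rank, item):
            for other in sites:          # broadcast the new upper bound
                other.u = coord.u
    return coord, sites
```

After `coord, sites = run_iswor(20000, 8, 4)`, calling `coord.report()` gives four distinct items, and `sum(site.sent for site in sites) + len(sites) * coord.halvings` counts the messages exchanged in this model. No site ever needs to know the total number of items seen, which is the key observation from slide 14.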
