
Sublinear Algorithms for Big Data Part 4: Random Topics (Qin Zhang)



  1. Sublinear Algorithms for Big Data Part 4: Random Topics. Qin Zhang.

  2. Topic 3: Random sampling in distributed data streams (based on a paper with Cormode, Muthukrishnan and Yi, PODS'10, JACM'12).

  3. Distributed streaming. Motivated by database/networking applications: sensor networks, network monitoring, environment monitoring, cloud computing. Prior approaches: adaptive filters [Olston, Jiang, Widom, SIGMOD'03]; a generic geometric approach [Sharfman et al., SIGMOD'06]; prediction models [Cormode, Garofalakis, Muthukrishnan, Rastogi, SIGMOD'05].

  4–5. Reservoir sampling [Waterman '??; Vitter '85]. Maintain a (uniform) sample (without replacement) of size s from a stream of n items, so that every subset of size s has equal probability of being the sample. When the i-th item arrives: with probability s/i, use it to replace an item in the current sample chosen uniformly at random; with probability 1 − s/i, throw it away.
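The single-stream rule above is short enough to state in code. Below is a minimal Python sketch of reservoir sampling as described on the slide (the function name and the example stream are illustrative, not from the slides):

    import random

    def reservoir_sample(stream, s):
        # maintain a uniform sample (without replacement) of size s from `stream`
        sample = []
        for i, item in enumerate(stream, start=1):
            if i <= s:
                sample.append(item)                   # the first s items fill the reservoir
            elif random.random() < s / i:             # keep the i-th item with probability s/i ...
                sample[random.randrange(s)] = item    # ... replacing a uniformly chosen current item
            # otherwise the i-th item is thrown away
        return sample

    # example: a sample of size 3 from a stream of 10 items
    print(reservoir_sample(range(1, 11), 3))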

  6–8. Reservoir sampling from distributed streams. [Figure: k sites S1, ..., Sk, each observing a stream over time and communicating with a coordinator C.] When k = 1, reservoir sampling has cost Θ(s log n). When k ≥ 2, reservoir sampling has cost O(n), because it is costly to track the global count i. Tracking i only approximately? Then the sampling won't be uniform. Key observation: we don't have to know the size of the population in order to sample!

  9–12. Basic idea: binary Bernoulli sampling. [Figure: a row of 0/1 coin flips, one per item, marking which items are active.] Conditioned upon a row having ≥ s active items, we can draw a sample from the active items. The coordinator could maintain a Bernoulli sample of size between s and O(s).
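To make the idea concrete, here is a small offline Python sketch of binary Bernoulli sampling under this interpretation (all names are illustrative; the actual protocol maintains the levels online). Level 0 contains every item, and each deeper level keeps each item of the previous level with an independent fair coin flip, so level i contains each item with probability 2^(-i). Taking the deepest level that still has at least s active items and drawing s of them uniformly gives a uniform sample of size s.

    import random

    def binary_bernoulli_sample(items, s):
        # assumes the stream has at least s items
        # level 0 is all items; each deeper level keeps each item of the previous
        # level with an independent fair coin flip, so an item stays active at
        # level i with probability 2**(-i)
        levels = [list(items)]
        while len(levels[-1]) >= 2 * s:
            levels.append([x for x in levels[-1] if random.random() < 0.5])
        # take the deepest level ("row") that still has >= s active items ...
        pool = next(row for row in reversed(levels) if len(row) >= s)
        # ... and draw s of its active items uniformly at random
        return random.sample(pool, s)

    # example: a sample of size 5 from 100 items
    print(binary_bernoulli_sample(range(1, 101), 5))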

  13–17. Random sampling – the algorithm [with Cormode, Muthukrishnan and Yi, PODS '10, JACM '12]. [Figure: sites S1, ..., Sk sending items to the coordinator C, which keeps a lower and an upper sample.] Initialize i = 0. In epoch i, sites send in every item with probability 2^(-i). The coordinator maintains a lower sample and an upper sample; each received item goes to either one with equal probability (so each item is included in the lower sample with probability 2^(-(i+1))). When the lower sample reaches size s, the coordinator broadcasts to the k sites to advance to epoch i ← i + 1, discards the upper sample, and randomly splits the lower sample into a new lower and a new upper sample. Correctness: (1) in epoch i, each item is maintained at C with probability 2^(-i); (2) at least s items are always maintained at C.
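A minimal single-process simulation of this protocol is sketched below; the broadcast is modeled by the sites simply reading the coordinator's current epoch, and the class and function names are illustrative, not from the paper.

    import random

    class Coordinator:
        def __init__(self, s):
            self.s = s
            self.epoch = 0     # current epoch i
            self.lower = []    # lower sample: items kept with probability 2**-(i+1)
            self.upper = []    # upper sample

        def receive(self, item):
            # each received item goes to the lower or the upper sample with equal probability
            (self.lower if random.random() < 0.5 else self.upper).append(item)
            while len(self.lower) >= self.s:
                # lower sample reached size s: "broadcast" the epoch advance,
                # discard the upper sample, and randomly split the lower sample
                self.epoch += 1
                old_lower, self.lower, self.upper = self.lower, [], []
                for x in old_lower:
                    (self.lower if random.random() < 0.5 else self.upper).append(x)

    def site_forward(item, coord):
        # in epoch i, a site sends in every item with probability 2**(-i);
        # here the broadcast is modeled by reading the coordinator's epoch directly
        if random.random() < 2.0 ** (-coord.epoch):
            coord.receive(item)

    # example run: 100 items arriving across the sites, sample size s = 3
    coord = Coordinator(s=3)
    for item in range(1, 101):
        site_forward(item, coord)
    print(coord.epoch, sorted(coord.lower + coord.upper))

Drawing s items uniformly from the union of the lower and upper samples then yields the desired random sample, by the Bernoulli-sampling observation of slides 9–12.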

  18–27. A running example: maintain s = 3 samples. Epoch 0 (p = 1): items 1, 2, 3, 4, 5 arrive at the sites and are all forwarded to the coordinator, each going into the lower or the upper sample with equal probability. When the lower sample reaches size 3, the coordinator discards the upper sample, randomly splits the lower sample into a new lower and upper sample, and advances to Epoch 1.

  28–36. A running example (cont.). Epoch 1 (p = 1/2): items 6 and 9 are discarded locally at their sites; items 7, 8 and 10 are forwarded to the coordinator. When the lower sample again reaches size 3, the coordinator discards the upper sample, splits the lower sample, and advances to Epoch 2.

  37–38. A running example (cont.). Epoch 2 (p = 1/4): more items will be discarded locally. Intuition: maintain at each site a sampling probability p ≈ s/n (n: total # of items) without knowing n.
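As a quick empirical check of this intuition, one can feed n items through the Coordinator/site_forward simulation sketched after slides 13–17 and compare the final per-item sending probability 2^(-i) with s/n (an illustrative experiment reusing that earlier sketch, not something from the slides):

    import random

    random.seed(0)                       # fixed seed, for reproducibility only
    n, s = 10_000, 32
    coord = Coordinator(s=s)             # the Coordinator sketch from above
    for item in range(n):
        site_forward(item, coord)
    # the final per-item sending probability should be on the order of s/n,
    # even though n was never known in advance
    print(2.0 ** (-coord.epoch), s / n)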
