Sublinear Algorithms for Big Data
Part 4: Random Topics
Qin Zhang
Topic 3: Random sampling in distributed data streams
(based on a paper with Cormode, Muthukrishnan and Yi, PODS’10, JACM’12)
Distributed streaming
Motivated by database/networking applications
• Adaptive filters [Olston, Jiang, Widom, SIGMOD’03]
• A generic geometric approach [Sharfman et al., SIGMOD’06]
• Prediction models [Cormode, Garofalakis, Muthukrishnan, Rastogi, SIGMOD’05]
Application areas: sensor networks, network monitoring, environment monitoring, cloud computing
Reservoir sampling [Waterman ’??; Vitter ’85]
Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items
Every subset of size s has equal probability to be the sample
When the i-th item arrives (i > s; the first s items simply fill the sample)
• With probability s/i, use it to replace an item in the current sample chosen uniformly at random
• With probability 1 − s/i, throw it away
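The replacement rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides; the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for the example.

```python
import random

def reservoir_sample(stream, s, rng=None):
    """Return a uniform sample (without replacement) of size s from an iterable.

    If the stream has fewer than s items, all of them are returned.
    """
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            # The first s items fill the reservoir.
            sample.append(item)
        elif rng.random() < s / i:
            # Keep the i-th item with probability s/i and
            # evict a resident chosen uniformly at random.
            sample[rng.randrange(s)] = item
        # Otherwise (probability 1 - s/i) the item is thrown away.
    return sample

if __name__ == "__main__":
    print(reservoir_sample(range(1, 101), 5, random.Random(42)))
```

Note that the rule needs the exact arrival index i, which is what becomes expensive once the stream is spread over several sites.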
Reservoir sampling from distributed streams
[Figure: k sites S1, S2, S3, ..., Sk, each observing a stream over time, connected to a coordinator C]
• When k = 1, reservoir sampling has cost Θ(s log n)
• When k ≥ 2, reservoir sampling has cost O(n) because it’s costly to track i
• Tracking i approximately? Sampling won’t be uniform
Key observation: We don’t have to know the size of the population in order to sample!
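A short sketch of where the k = 1 bound comes from, assuming (this accounting is not spelled out on the slide) that the single site knows i exactly and forwards an item to the coordinator only when the item enters the sample:

```latex
\mathbb{E}[\#\text{messages}]
  \;=\; s + \sum_{i=s+1}^{n} \frac{s}{i}
  \;=\; s\,\bigl(1 + H_n - H_s\bigr)
  \;\approx\; s\left(1 + \ln\frac{n}{s}\right)
```

Here H_m denotes the m-th harmonic number. The expression is Θ(s log n) whenever s ≤ n^{1−ε} for some constant ε > 0.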
Basic idea: binary Bernoulli sampling
[Figure: array of random bits: 0 1 0 1 1 0 0 0 1 0 1 0 1 0 1 1 1 1 1 0 0 0 0 0 1]
• Conditioned upon a row having ≥ s active items, we can draw a sample from the active items
• The coordinator could maintain a Bernoulli sample of size between s and O(s)
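One possible reading of this idea, sketched in Python. The layout of the figure is not recoverable from the text, so the interpretation here is an assumption: row j is taken to be a Bernoulli sample at rate 2^{-j}, and an item is "active" in row j if its first j fair coin flips all came up heads. The names `geometric_level` and `sample_via_levels` are hypothetical.

```python
import random

def geometric_level(rng):
    """Count consecutive heads before the first tail, so P(level >= j) = 2**-j."""
    level = 0
    while rng.random() < 0.5:
        level += 1
    return level

def sample_via_levels(items, s, rng=None):
    """Draw s items uniformly without ever using the population size as a rate.

    Assumed reading of the slide: an item is active in row j if its level >= j.
    We pick the sparsest row that still has at least s active items and
    sample s of them uniformly.
    """
    rng = rng or random.Random()
    leveled = [(geometric_level(rng), x) for x in items]
    if len(leveled) < s:
        return [x for _, x in leveled]
    j = 0
    # Move to a sparser row while the next row still has >= s active items.
    while sum(1 for lvl, _ in leveled if lvl >= j + 1) >= s:
        j += 1
    active = [x for lvl, x in leveled if lvl >= j]
    return rng.sample(active, s)

if __name__ == "__main__":
    print(sample_via_levels(range(1000), 5, random.Random(3)))
```

Because the procedure treats all items symmetrically, every subset of size s is equally likely to be returned, even though no step depends on knowing n in advance.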
Random sampling – Algorithm [with Cormode, Muthu & Yi, PODS ’10, JACM ’12]
[Figure: coordinator C, holding an upper and a lower sample, connected to sites S1, S2, S3, ..., Sk]
Initialize i = 0
In epoch i:
• Sites send in every item w.pr. 2^{-i}
• Coordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. (Each item is included in lower sample w.pr. 2^{-(i+1)})
• When the lower sample reaches size s, the coordinator
  - broadcasts to the k sites to advance to epoch i ← i + 1
  - discards the upper sample
  - randomly splits the lower sample into a new lower and an upper sample
Correctness:
(1): In epoch i, each item is maintained in C w.pr. 2^{-i}
(2): Always ≥ s items are maintained in C (once at least s items have arrived)
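The protocol above can be simulated in a single process. The sketch below is an illustration under stated assumptions, not the authors' implementation: the class name `Coordinator`, the method names, and the message accounting (one message per forwarded item, k messages per broadcast) are all hypothetical, sites are assumed to learn the new epoch instantly, and the epoch is advanced again if a split happens to leave s items in the lower sample (the slides do not say how that case is handled). It requires s ≥ 1.

```python
import random

class Coordinator:
    """Single-process simulation of the epoch-based sampling protocol (s >= 1)."""

    def __init__(self, s, k, rng=None):
        self.s, self.k = s, k
        self.rng = rng or random.Random()
        self.epoch = 0
        self.lower, self.upper = [], []
        self.messages = 0  # assumed accounting: item messages + broadcasts

    def site_observes(self, item):
        """A site sees an item and forwards it w.pr. 2**-epoch."""
        if self.rng.random() < 2.0 ** -self.epoch:
            self.messages += 1
            self._place(item)
            # Assumption: repeat the epoch change if lower is still full.
            while len(self.lower) >= self.s:
                self._advance_epoch()

    def _place(self, item):
        """Send an item to the lower or upper sample with equal probability."""
        target = self.lower if self.rng.random() < 0.5 else self.upper
        target.append(item)

    def _advance_epoch(self):
        self.epoch += 1
        self.messages += self.k  # broadcast to all k sites
        old_lower = self.lower
        self.lower, self.upper = [], []  # the old upper sample is discarded
        for item in old_lower:           # random split of the old lower sample
            self._place(item)

    def sample(self):
        """Uniform sample of size s (or everything, if fewer items are held)."""
        pool = self.lower + self.upper
        return self.rng.sample(pool, min(self.s, len(pool)))

if __name__ == "__main__":
    c = Coordinator(s=3, k=4, rng=random.Random(1))
    for item in range(1, 1001):
        c.site_observes(item)
    print(c.epoch, c.messages, c.sample())
```

In this sketch the final sample is drawn from the union of the lower and upper samples, which by (1) is a Bernoulli sample of the whole stream at rate 2^{-i}.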
A running example
Maintain s = 3 samples; sites S1, S2, S3, S4 and coordinator C

Epoch 0 (p = 1)
• Items 1, 2, 3, 4, 5 arrive at the sites; all are sent to C
• C places 1, 4, 5 in the lower sample and 2, 3 in the upper sample
• Now |lower sample| = 3:
  - discard upper sample (2, 3)
  - split lower sample: upper = {4}, lower = {1, 5}
  - advance to Epoch 1

A running example (cont.)
Epoch 1 (p = 1/2)
• Item 6: discarded
• Item 7: sent to C, goes to upper
• Item 8: sent to C, goes to upper
• Item 9: discarded
• Item 10: sent to C, goes to lower
• State: upper = {4, 7, 8}, lower = {1, 5, 10}
• Again |lower sample| = 3:
  - discard upper sample (4, 7, 8)
  - split lower sample: upper = {1, 5}, lower = {10}
  - advance to Epoch 2

Epoch 2 (p = 1/4)
• More items will be discarded locally
Intuition: maintain a sample prob. p ≈ s/n at each site (n: total # items) without knowing n.
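A heuristic reading of this intuition, in expectation only (the epoch boundaries are random, so this is a rough guide rather than a guarantee): each of the n items seen so far is in the lower sample w.pr. 2^{-(i+1)}, so epoch i should end around the time the expected lower-sample size reaches s.

```latex
\mathbb{E}\bigl[\,|\text{lower}|\,\bigr] = n \cdot 2^{-(i+1)} \approx s
\;\Longrightarrow\; n \approx s \cdot 2^{\,i+1},
\qquad
\text{so during epoch } i:\quad
s \cdot 2^{\,i} \;\lesssim\; n \;\lesssim\; s \cdot 2^{\,i+1}
\;\Longrightarrow\;
\frac{s}{n} \;\lesssim\; p = 2^{-i} \;\lesssim\; \frac{2s}{n}
```

So the sending probability used by the sites stays within roughly a factor of 2 of s/n, even though no site ever learns n.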