Random Sampling on Big Data: Techniques and Applications
Ke Yi
Hong Kong University of Science and Technology
yike@ust.hk
“Big Data” in one slide
The 3 V's: Volume, Velocity, Variety
– Integers, real numbers
– Points in a multi-dimensional space
– Records in a relational database
– Graph-structured data
Dealing with Big Data
The first approach: scale up / out the computation
Many great technical innovations:
– Distributed/parallel systems
– Simpler programming models
• MapReduce, Pregel, Dremel, Spark, …
• BSP
– Failure tolerance and recovery
– Drop certain features: ACID, CAP, NoSQL
This talk is not about this approach!
Downsizing data
A second approach to computational scalability: scale down the data!
– A compact representation of a large data set
– There is too much redundancy in big data anyway
– What we finally want is small: human-readable analyses / decisions
– Necessarily gives up some accuracy: approximate answers
– Examples: samples, sketches, histograms, various transforms
• See the tutorial by Graham Cormode for other data summaries
Complementary to the first approach
– Can scale out computation and scale down data at the same time
– Algorithms need to work under new system architectures
• The good old RAM model no longer applies
Outline for the talk
Simple random sampling
– Sampling from a data stream
– Sampling from distributed streams
– Sampling for range queries
Not-so-simple sampling
– Importance sampling: frequency estimation on distributed data
– Paired sampling: medians and quantiles
– Random walk sampling: SQL queries (joins)
Will jump back and forth between theory and practice
Simple Random Sampling
Sampling without replacement:
– Randomly draw an element
– Don't put it back
– Repeat 𝑠 times
Sampling with replacement:
– Randomly draw an element
– Put it back
– Repeat 𝑠 times
The statistical difference between the two is very small when 𝑛 ≫ 𝑠
Trivial in the RAM model
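In the RAM model both flavors are indeed trivial; a sketch using Python's standard library (`random.sample` draws without replacement, `random.choices` with replacement):

```python
import random

population = list(range(1_000_000))   # n = 10^6

# Without replacement: 100 distinct elements, every size-100 subset
# equally likely.
without = random.sample(population, 100)

# With replacement: 100 independent uniform draws; duplicates possible,
# but for n >> s a duplicate is unlikely (birthday bound ~ s^2 / 2n),
# which is why the two distributions are statistically close.
with_repl = random.choices(population, k=100)
```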
Random Sampling from a Data Stream
A stream of elements coming in at high speed
Limited memory
Need to maintain the sample continuously
Applications:
– Data stored on disk
– Network traffic
Reservoir Sampling
Maintain a sample of size 𝑠 drawn (without replacement) from all elements in the stream so far
Keep the first 𝑠 elements in the stream, set 𝑛 ← 𝑠
Algorithm for a new element:
– 𝑛 ← 𝑛 + 1
– With probability 𝑠/𝑛, use it to replace an item in the current sample chosen uniformly at random
– With probability 1 − 𝑠/𝑛, throw it away
Perhaps the first “streaming” algorithm [Waterman ??; Knuth's book]
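The algorithm fits in a few lines (a sketch; the stream can be any iterable of unknown length):

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform without-replacement sample of size s over a
    stream of unknown length, using one pass and O(s) memory."""
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            sample.append(x)               # keep the first s elements
        elif rng.random() < s / n:         # with prob. s/n, the new element
            sample[rng.randrange(s)] = x   # evicts a uniformly chosen one
    return sample
```

Note that the stream length never appears in the code: the running count 𝑛 is all the algorithm needs.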
Correctness Proof
By induction on 𝑛
– 𝑛 = 𝑠: trivially correct
– Assume each element so far is sampled with probability 𝑠/𝑛
– Consider 𝑛 + 1:
• The new element is sampled with probability 𝑠/(𝑛+1)
• Any element in the current sample stays sampled with probability
𝑠/𝑛 · (1 − 𝑠/(𝑛+1) + 𝑠/(𝑛+1) · (𝑠−1)/𝑠) = 𝑠/(𝑛+1). Yeah!
But this is a wrong (incomplete) proof:
Each element being sampled with probability 𝑠/𝑛 is not a sufficient condition for random sampling
– Counterexample: divide the elements into groups of 𝑠 and pick one group uniformly at random
Reservoir Sampling Correctness Proof
Many “proofs” found online are actually wrong
– They only show that each item is sampled with probability 𝑠/𝑛
– Need to show that every subset of size 𝑠 has the same probability of being the sample
The correct proof relates reservoir sampling to the Fisher–Yates shuffle
(figure: for 𝑠 = 2, the evolution of the reservoir on the stream a, b, c, d mirrors the first two slots of a Fisher–Yates shuffle)
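The stronger, subset-level property is easy to check empirically: over many runs on the stream a, b, c, d with 𝑠 = 2, each of the C(4,2) = 6 possible subsets should come out as the sample about 1/6 of the time (a quick simulation sketch; `reservoir` re-implements the algorithm from the previous slides):

```python
import random
from collections import Counter

def reservoir(stream, s, rng):
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            sample.append(x)
        elif rng.random() < s / n:
            sample[rng.randrange(s)] = x
    return sample

rng = random.Random(0)
counts = Counter(frozenset(reservoir("abcd", 2, rng)) for _ in range(60_000))

# All 6 subsets occur, each with empirical frequency close to 1/6 --
# ruling out schemes like "pick one random group of size s", which
# would give each *element* probability s/n but only n/s subsets.
assert len(counts) == 6
assert all(abs(c / 60_000 - 1/6) < 0.02 for c in counts.values())
```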
Sampling from Distributed Streams
One coordinator and 𝑘 sites
Each site can communicate with the coordinator
Goal: maintain a random sample of size 𝑠 over the union of all streams with minimum communication
Difficulty: we don't know 𝑛, so we can't run the reservoir sampling algorithm
Key observation: we don't have to know 𝑛 in order to sample!
[Cormode, Muthukrishnan, Yi, Zhang, PODS'10, JACM'12]
[Woodruff, Tirthapura, DISC'11]
Reduction from Coin-Flip Sampling
Flip a fair coin for each element until we get “1”
An element is active on a level if it is “0” there
If a level has ≥ 𝑠 active elements, we can draw a sample from those active elements
Key: the coordinator does not want all the active elements, which are too many!
– Choose a level appropriately
The Algorithm
Initialize 𝑗 ← 0
In round 𝑗:
– Sites send in every item w.p. 2^(−𝑗) (this is a coin-flip sample with prob. 2^(−𝑗))
– The coordinator maintains a lower sample and a higher sample: each received item goes to either with equal prob. (the lower sample is a sample with prob. 2^(−(𝑗+1)))
– When the lower sample reaches size 𝑠, the coordinator broadcasts to advance to round 𝑗 ← 𝑗 + 1
– Discard the upper sample
– Split the lower sample into a new lower sample and a higher sample
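A single-process sketch of the protocol (a simulation for illustration only; in reality the 𝑘 sites and the coordinator are separate machines, and advancing a round is a broadcast — the function name and the interleaved-stream input are assumptions of this sketch):

```python
import random

def sample_union(interleaved_stream, s, rng):
    """Simulate the round-based protocol on one machine. Returns the
    coordinator's sample pool (lower + upper), the final round j, and
    the number of items 'sent' to the coordinator."""
    j, lower, upper, sent = 0, [], [], 0
    for x in interleaved_stream:
        if rng.random() < 2.0 ** -j:       # a site forwards x w.p. 2^-j
            sent += 1
            if rng.random() < 0.5:         # coordinator: to lower sample,
                lower.append(x)            # i.e. kept w.p. 2^-(j+1)
            else:
                upper.append(x)
            if len(lower) >= s:            # advance to round j+1:
                j += 1
                upper = []                 # discard the upper sample,
                old, lower = lower, []     # re-split the lower sample
                for y in old:
                    (lower if rng.random() < 0.5 else upper).append(y)
    return lower + upper, j, sent

pool, j, sent = sample_union(range(100_000), 50, random.Random(1))
final = random.Random(2).sample(pool, 50)  # a size-s sample of the union
```

The point of the simulation: `sent` stays far below 𝑛, matching the O((𝑘 + 𝑠) log 𝑛) communication analyzed on the next slide.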
Communication Cost of the Algorithm
Communication cost of each round: O(𝑘 + 𝑠)
– Expect to receive O(𝑠) sampled items before the round ends
– Broadcast to end the round: O(𝑘)
Number of rounds: O(log 𝑛)
– In each round, need Θ(𝑠) items to be sampled to end the round
– Each item has prob. 2^(−𝑗) to contribute: need Θ(2^𝑗 · 𝑠) items
Total communication: O((𝑘 + 𝑠) log 𝑛)
– Can be improved to O(𝑘 log_{𝑘/𝑠} 𝑛 + 𝑠 log 𝑛)
– A matching lower bound
Sliding windows
Random Sampling for Range Queries
[Christensen, Wang, Li, Yi, Tang, Villa, SIGMOD'15 Best Demo Award]
Online Range Sampling
Problem definition: preprocess a set of points in the plane, so that for any range query, we can return samples (with or without replacement) drawn from all points in the range, until user termination
Parameters:
– 𝑛: data size
– 𝑘: query size
– 𝑠: sample size (not known beforehand)
– 𝑛 ≫ 𝑘 ≫ 𝑠
Naïve solutions:
– Query then sample: O(𝑓(𝑛) + 𝑘)
– Sample then query: O(𝑠𝑛/𝑘) (store the data in random order)
New solution: O(𝑓(𝑛) + 𝑠)
𝑓(𝑥): # canonical nodes in a tree of size 𝑥, between log 𝑥 and 𝑥
[Wang, Christensen, Li, Yi, VLDB'16]
Indexing Spatial Data
Numerous spatial indexing structures in the literature
R-tree
RS-tree
Attach to each node 𝑣 a sample drawn from the leaves below 𝑣
– Total space: O(𝑛)
– Construction time: O(𝑛)
RS-tree: A 1D Example
(figure: a binary tree over the sorted keys 1–16; the query range makes the canonical nodes covering it “active”, and each node stores a pre-drawn sample of its subtree)
Samples are reported one at a time, each time picking among the active nodes with probability proportional to subtree size:
– Report 5
– Pick 7 or 14 with equal prob. → report 7
– Pick 3, 8, or 14 with prob. 1:1:2
– Pick 3, 8, or 12 with equal prob. → report 12
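The querying logic can be sketched in 1D (a simplified sketch assuming points in a sorted array and a standard segment-tree canonical decomposition; the real RS-tree pre-stores samples at the nodes so that each draw costs O(1) after the O(𝑓(𝑛)) decomposition):

```python
import random

def canonical_ranges(lo, hi, q_lo, q_hi, out):
    """Collect the canonical nodes of the implicit segment tree over
    [lo, hi) that exactly cover the query range [q_lo, q_hi)."""
    if q_hi <= lo or hi <= q_lo:
        return                           # node disjoint from the query
    if q_lo <= lo and hi <= q_hi:
        out.append((lo, hi))             # node fully inside the query
        return
    mid = (lo + hi) // 2
    canonical_ranges(lo, mid, q_lo, q_hi, out)
    canonical_ranges(mid, hi, q_lo, q_hi, out)

def range_samples(points, q_lo, q_hi, s, rng=random):
    """Draw s with-replacement samples from sorted points[q_lo:q_hi]:
    pick a canonical node with prob. proportional to its subtree size,
    then a uniform point below it -- uniform over the whole range."""
    nodes = []
    canonical_ranges(0, len(points), q_lo, q_hi, nodes)
    sizes = [hi - lo for lo, hi in nodes]
    picks = rng.choices(nodes, weights=sizes, k=s)
    return [points[rng.randrange(lo, hi)] for lo, hi in picks]
```

Picking a node proportionally to its size and then a uniform leaf below it is exactly the 1:1:2-style weighting in the example above.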
Not-So-Simple Random Sampling
When simple random sampling is not optimal / feasible
Frequency Estimation on Distributed Data
Given: a multiset 𝑆 of 𝑛 items drawn from the universe [𝑢]
– For example: IP addresses of network packets
𝑆 is partitioned arbitrarily and stored on 𝑘 nodes
– Local count 𝑥ᵢⱼ: frequency of item 𝑖 on node 𝑗
– Global count 𝑦ᵢ = Σⱼ 𝑥ᵢⱼ
Goal: estimate 𝑦ᵢ with additive error 𝜀𝑛 for all 𝑖
– Can't hope for relative error for all 𝑦ᵢ
– Heavy hitters are estimated well
[Huang, Yi, Liu, Chen, INFOCOM'11]
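The simplest baseline is uniform sampling of item occurrences: each site forwards each occurrence with some rate 𝑝 and the coordinator scales the sampled counts by 1/𝑝, giving additive error 𝜀𝑛 for a suitable 𝑝. A sketch of this baseline (the function name is hypothetical; the importance-sampling scheme of the cited paper refines this idea to reduce communication further):

```python
import random
from collections import Counter

def estimate_global_counts(site_streams, p, rng=random):
    """Each site forwards each item occurrence independently w.p. p;
    the coordinator scales sampled counts by 1/p. The estimate of y_i
    has standard deviation ~ sqrt(y_i / p), so heavy hitters (large
    y_i) have small *relative* error while rare items do not."""
    sampled = Counter()
    for stream in site_streams:          # one stream per node
        for item in stream:
            if rng.random() < p:
                sampled[item] += 1
    return {item: c / p for item, c in sampled.items()}

# Two nodes; item "a" is a heavy hitter with true global count 5000.
sites = [["a"] * 2500 + list(range(1000)), ["a"] * 2500]
est = estimate_global_counts(sites, 0.1, random.Random(7))
```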