

  1. Random Sampling on Big Data: Techniques and Applications
     Ke Yi
     Hong Kong University of Science and Technology
     yike@ust.hk

  2. “Big Data” in one slide
      The 3 V’s: Volume, Velocity, Variety
     – Integers, real numbers
     – Points in a multi-dimensional space
     – Records in a relational database
     – Graph-structured data

  3. Dealing with Big Data
      The first approach: scale up / out the computation
      Many great technical innovations:
     – Distributed/parallel systems
     – Simpler programming models
       • MapReduce, Pregel, Dremel, Spark…
       • BSP
     – Failure tolerance and recovery
     – Drop certain features: ACID, CAP, NoSQL
      This talk is not about this approach!

  4. Downsizing data
      A second approach to computational scalability: scale down the data!
     – A compact representation of a large data set
     – There is too much redundancy in big data anyway
     – What we finally want is small: human-readable analyses / decisions
     – Necessarily gives up some accuracy: approximate answers
     – Examples: samples, sketches, histograms, various transforms
       • See the tutorial by Graham Cormode for other data summaries
      Complementary to the first approach
     – Can scale out the computation and scale down the data at the same time
     – Algorithms need to work under new system architectures
       • The good old RAM model no longer applies

  5. Outline for the talk
      Simple random sampling
     – Sampling from a data stream
     – Sampling from distributed streams
     – Sampling for range queries
      Not-so-simple sampling
     – Importance sampling: frequency estimation on distributed data
     – Paired sampling: medians and quantiles
     – Random walk sampling: SQL queries (joins)
      Will jump back and forth between theory and practice

  6. Simple Random Sampling
      Sampling without replacement
     – Randomly draw an element
     – Don’t put it back
     – Repeat 𝑠 times
      Sampling with replacement
     – Randomly draw an element
     – Put it back
     – Repeat 𝑠 times
      The statistical difference between the two is very small for 𝑛 ≫ 𝑠
      Trivial in the RAM model (see the sketch below)
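
Both flavors are one-liners in Python's standard library; the population `data` and sample size `s` below are placeholders, not anything from the talk:

```python
import random

data = list(range(1_000_000))  # placeholder population
s = 100                        # sample size

# Without replacement: every element appears at most once.
without = random.sample(data, s)

# With replacement: elements may repeat; for n >> s the two
# resulting distributions are nearly indistinguishable.
withrep = random.choices(data, k=s)
```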

  7. Random Sampling from a Data Stream
      A stream of elements coming in at high speed
      Limited memory
      Need to maintain the sample continuously
      Applications
     – Data stored on disk
     – Network traffic

  8. [Figure slide]

  9. Reservoir Sampling
      Maintain a sample of size 𝑠 drawn (without replacement) from all elements in the stream so far
      Keep the first 𝑠 elements in the stream; set 𝑛 ← 𝑠
      Algorithm for a new element (see the sketch below)
     – 𝑛 ← 𝑛 + 1
     – With probability 𝑠/𝑛, use it to replace an item in the current sample chosen uniformly at random
     – With probability 1 − 𝑠/𝑛, throw it away
      Perhaps the first “streaming” algorithm [Waterman ??; Knuth’s book]
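
A direct transcription of the update rule into Python; `stream` can be any iterable and need not fit in memory:

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform without-replacement sample of size s
    from a stream of unknown length."""
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            sample.append(x)                 # keep the first s elements
        elif random.random() < s / n:        # with probability s/n,
            sample[random.randrange(s)] = x  # replace a uniform victim
        # otherwise throw the new element away
    return sample

print(reservoir_sample(range(10_000), 10))
```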

  10. Correctness Proof
      By induction on 𝑛
     – 𝑛 = 𝑠: trivially correct
     – Assume each element so far is sampled with probability 𝑠/𝑛
     – Consider 𝑛 + 1:
       • The new element is sampled with probability 𝑠/(𝑛+1)
       • Any element in the current sample stays sampled with probability
         𝑠/𝑛 ⋅ ((1 − 𝑠/(𝑛+1)) + 𝑠/(𝑛+1) ⋅ (𝑠−1)/𝑠) = 𝑠/(𝑛+1). Yeah!
      But this is a wrong (incomplete) proof
      Each element being sampled with probability 𝑠/𝑛 is not a sufficient condition for random sampling
     – Counterexample: divide the elements into groups of 𝑠 and pick one group at random

  11. [Figure slide]

  12. Reservoir Sampling Correctness Proof
      Many “proofs” found online are actually wrong
     – They only show that each item is sampled with probability 𝑠/𝑛
     – Need to show that every subset of size 𝑠 has the same probability of being the sample (checked empirically below)
      The correct proof relates to the Fisher-Yates shuffle
      [Figure: Fisher-Yates shuffle example with 𝑠 = 2 over the elements a, b, c, d]
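
The subset condition can at least be sanity-checked by Monte Carlo (an empirical check, not a proof), reusing the `reservoir_sample` sketch from above. For a 5-element stream and 𝑠 = 2, each of the C(5,2) = 10 subsets should appear about 10% of the time:

```python
from collections import Counter
import random

def reservoir_sample(stream, s):
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            sample.append(x)
        elif random.random() < s / n:
            sample[random.randrange(s)] = x
    return sample

trials = 100_000
counts = Counter(frozenset(reservoir_sample(range(5), 2)) for _ in range(trials))
for subset in sorted(counts, key=sorted):
    print(sorted(subset), counts[subset] / trials)   # each ratio ≈ 0.10
```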

  13. Sampling from Distributed Streams
      One coordinator and 𝑘 sites
      Each site can communicate with the coordinator
      Goal: maintain a random sample of size 𝑠 over the union of all streams with minimum communication
      Difficulty: we don’t know 𝑛, so we can’t run the reservoir sampling algorithm
      Key observation: we don’t have to know 𝑛 in order to sample!
      [Cormode, Muthukrishnan, Yi, Zhang, PODS’10, JACM’12]
      [Woodruff, Tirthapura, DISC’11]

  14. Reduction from Coin-Flip Sampling
      Flip a fair coin for each element until we get “1”
      An element is active at level 𝑖 if its first 𝑖 flips are all “0” (so about 𝑛/2^𝑖 elements are active at level 𝑖)
      If a level has ≥ 𝑠 active elements, we can draw a sample from those active elements
      Key: the coordinator does not want all the active elements, which are too many!
     – Choose a level appropriately (see the sketch below)
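
A small sketch of the level assignment, assuming nothing beyond fair coin flips; the active population roughly halves from one level to the next:

```python
import random

def top_level(_elem):
    """Number of '0' flips before the first '1'; the element is
    active at every level i up to and including this value."""
    i = 0
    while random.random() < 0.5:
        i += 1
    return i

n = 1_000_000
tops = [top_level(e) for e in range(n)]
for i in range(6):
    active = sum(1 for t in tops if t >= i)
    print(f"level {i}: {active:>7} active  (expected ~ {n >> i})")
```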

  15. The Algorithm
      Initialize 𝑖 ← 0
      In round 𝑖 (see the simulation sketch below):
     – Sites send in every item w.p. 2^−𝑖 (this is a coin-flip sample with prob. 2^−𝑖)
     – The coordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. (the lower sample is a sample with prob. 2^−(𝑖+1))
     – When the lower sample reaches size 𝑠, the coordinator broadcasts to advance to round 𝑖 ← 𝑖 + 1:
       • Discard the upper sample
       • Split the lower sample into a new lower sample and a new upper sample
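
A single-process simulation of the coordinator's state under these rules; a hedged sketch only, since in the real protocol the sites flip their own coins and only forwarded items cross the network:

```python
import random

def coordinator_sample(items, s):
    i = 0                                    # round: items arrive w.p. 2^-i
    lower, upper = [], []
    for x in items:
        if random.random() < 2.0 ** -i:      # site-side coin flips
            (lower if random.random() < 0.5 else upper).append(x)
            while len(lower) >= s:           # broadcast: advance the round
                i += 1
                old_lower, lower, upper = lower, [], []
                for y in old_lower:          # split lower; old upper is gone
                    (lower if random.random() < 0.5 else upper).append(y)
    # lower + upper is a coin-flip sample w.p. 2^-i of the whole stream;
    # a final sample of size s is drawn from it without replacement.
    pool = lower + upper
    return random.sample(pool, min(s, len(pool)))

print(coordinator_sample(range(1_000_000), 20))
```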

  16. Communication Cost of the Algorithm
      Communication cost of each round: 𝑂(𝑘 + 𝑠)
     – Expect to receive 𝑂(𝑠) sampled items before the round ends
     – Broadcast to end the round: 𝑂(𝑘)
      Number of rounds: 𝑂(log 𝑛)
     – In each round, need Θ(𝑠) items being sampled to end the round
     – Each item has prob. 2^−𝑖 of contributing: need Θ(2^𝑖 𝑠) items
      Total communication: 𝑂((𝑘 + 𝑠) log 𝑛)
     – Can be improved to 𝑂(𝑘 log_{𝑘/𝑠} 𝑛 + 𝑠 log 𝑛)
     – A matching lower bound
      Sliding windows

  17. Random Sampling for Range Queries
      [Christensen, Wang, Li, Yi, Tang, Villa, SIGMOD’15 Best Demo Award]

  18. Online Range Sampling
      Problem definition: preprocess a set of points in the plane, so that for any range query we can keep returning samples (with or without replacement) drawn from all points in the range, until the user terminates
      Parameters:
     – 𝑛: data size
     – 𝑞: query size
     – 𝑠: sample size (not known beforehand)
     – 𝑛 ≫ 𝑞 ≫ 𝑠
      Naïve solutions (a sketch of the second one follows below):
     – Query then sample: 𝑂(𝑓(𝑛) + 𝑞)
     – Sample then query: 𝑂(𝑠𝑛/𝑞) (store the data in random order)
      New solution: 𝑂(𝑓(𝑛) + 𝑠)
      𝑓(𝑥): # canonical nodes in a tree of size 𝑥, between log 𝑥 and 𝑥
      [Wang, Christensen, Li, Yi, VLDB’16]
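
For intuition, the "sample then query" baseline is easy to state in code (the box predicate and data below are made up for illustration). In a random permutation, in-range points arrive at rate 𝑞/𝑛, so collecting 𝑠 of them scans about 𝑠𝑛/𝑞 points in expectation:

```python
import random

def sample_then_query(points, in_range, s):
    """points must be stored in a (pre-shuffled) random order; the first
    s in-range points encountered then form a uniform sample (without
    replacement) of all points in the range."""
    out = []
    for p in points:
        if in_range(p):
            out.append(p)
            if len(out) == s:
                break
    return out

pts = [(random.random(), random.random()) for _ in range(100_000)]
random.shuffle(pts)   # store in random order once, at preprocessing time
box = lambda p: 0.2 <= p[0] <= 0.4 and 0.5 <= p[1] <= 0.7
print(sample_then_query(pts, box, 5))
```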

  19. Indexing Spatial Data
      Numerous spatial indexing structures in the literature
      [Figure: an R-tree]

  20. RS-tree
      Attach to each node 𝑢 a sample drawn from the leaves below 𝑢 (a code sketch follows after the example)
     – Total space: 𝑂(𝑛)
     – Construction time: 𝑂(𝑛)

  21.–26. RS-tree: A 1D Example
      [Figure sequence: a binary tree over the keys 1–16, with a sample stored at each node; the query repeatedly picks one of the active nodes, weighted by how much of the range it covers, and reports from its stored sample. Steps shown: report 5; pick 7 or 14 with equal prob., report 7; pick 3, 8, or 14 with prob. 1:1:2; pick 3, 8, or 12 with equal prob., report 12. Reported so far: 5, 7, 12.]
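
A minimal 1D sketch of this idea, under the assumption that each node stores a pre-shuffled sample of its leaves and that active (canonical) nodes are weighted by their subtree sizes; for simplicity it samples with replacement, whereas the RS-tree consumes each node's stored sample to sample without replacement:

```python
import random

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi       # key range covered by this node
        self.size = hi - lo + 1
        self.left = self.right = None
        # Pre-shuffled sample of the leaves below (O(n) total space if only
        # short prefixes are kept; full permutations here for simplicity).
        self.sample = random.sample(range(lo, hi + 1), self.size)

def build(lo, hi):
    v = Node(lo, hi)
    if lo < hi:
        mid = (lo + hi) // 2
        v.left, v.right = build(lo, mid), build(mid + 1, hi)
    return v

def canonical(v, a, b, out):
    """Collect the maximal nodes whose key ranges lie inside [a, b]."""
    if v is None or v.hi < a or b < v.lo:
        return
    if a <= v.lo and v.hi <= b:
        out.append(v)
    else:
        canonical(v.left, a, b, out)
        canonical(v.right, a, b, out)

def range_samples(root, a, b, s):
    """Return s samples (with replacement) from the keys in [a, b]."""
    nodes = []
    canonical(root, a, b, nodes)
    weights = [v.size for v in nodes]   # proportional to leaf counts
    return [random.choice(random.choices(nodes, weights)[0].sample)
            for _ in range(s)]

root = build(1, 16)
print(range_samples(root, 3, 14, 3))    # e.g. [5, 7, 12]
```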

  27. Not-So-Simple Random Sampling
      When simple random sampling is not optimal/feasible

  28. Frequency Estimation on Distributed Data
      Given: a multiset 𝑆 of 𝑛 items drawn from the universe [𝑢]
     – For example: IP addresses of network packets
      𝑆 is partitioned arbitrarily and stored on 𝑘 nodes
     – Local count 𝑥_𝑖𝑗: frequency of item 𝑖 on node 𝑗
     – Global count 𝑦_𝑖 = Σ_𝑗 𝑥_𝑖𝑗
      Goal: estimate 𝑦_𝑖 with additive error 𝜀𝑛 for all 𝑖
     – Can’t hope for relative error for all 𝑦_𝑖
     – Heavy hitters are estimated well
     – A uniform-sampling baseline is sketched below; the importance-sampling approach from the outline improves on it
      [Huang, Yi, Liu, Chen, INFOCOM’11]
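
This is not the protocol of the cited paper, just the baseline it improves on: a uniform sample of m = O(1/𝜀²) items already estimates every global count within additive error about 𝜀𝑛 (by a Chernoff bound); the constant 4 and the helper name are illustrative assumptions:

```python
import random
from collections import Counter

def estimate_counts(multiset, eps):
    """Estimate every item's global count to within ~ eps * n
    from a uniform sample of m = O(1/eps^2) items."""
    n = len(multiset)
    m = min(n, int(4 / eps ** 2))
    freq = Counter(random.sample(multiset, m))
    return {item: c * n / m for item, c in freq.items()}

data = [1] * 5000 + [2] * 3000 + list(range(3, 2003))  # heavy hitters + tail
random.shuffle(data)
est = estimate_counts(data, eps=0.05)
print(round(est.get(1, 0)), round(est.get(2, 0)))      # ≈ 5000 and 3000
```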
