
Mining Big Data Streams: Better Algorithms or Faster Systems?



  1. Mining Big Data Streams: Better Algorithms or Faster Systems?
 Gianmarco De Francisci Morales
 gdfm@acm.org
 QCRI

  2. Agenda
 API: SAMOA (Scalable Advanced Massive Online Analysis)
 Algorithm: VHT (Vertical Hoeffding Tree)
 System: PKG (Partial Key Grouping)

  3. Apache SAMOA: Scalable Advanced Massive Online Analysis
 G. De Francisci Morales, A. Bifet
 JMLR 2015

  4. Taxonomy
 Data Mining
 Non Distributed, Batch: R, WEKA, …
 Non Distributed, Stream: MOA
 Distributed, Batch: Hadoop, Mahout
 Distributed, Stream: Storm, S4, Samza, SAMOA

  5. Architecture
 (figure: SAMOA architecture diagram)

  11. Status
 https://samoa.incubator.apache.org
 Parallel algorithms: Classification (Vertical Hoeffding Tree), Clustering (CluStream), Regression (Adaptive Model Rules)
 Execution engines

  12. VHT: Vertical Hoeffding Tree
 A. Murdopo, A. Bifet, G. De Francisci Morales, N. Kourtellis
 BigData 2016

  13. Hoeffding Tree
 P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
 Sample of stream enough for near optimal decision
 Estimate merit of alternatives from prefix of stream
 Choose sample size based on statistical principles
 When to expand a leaf?
 Let $x_1$ be the most informative attribute, $x_2$ the second most informative one
 Hoeffding bound: split if $\Delta G(x_1, x_2) > \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
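
 The split rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, `value_range` stands for R (the range of the merit function G), `delta` for the allowed error probability, `n` for the number of instances seen at the leaf, and the gap is taken as G(x_1) - G(x_2).

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_merit: float, second_merit: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the merit gap between the two best attributes exceeds epsilon."""
    gap = best_merit - second_merit  # Delta G(x_1, x_2), assumed to be a difference
    return gap > hoeffding_bound(value_range, delta, n)
```

 Because epsilon shrinks as 1/sqrt(n), a leaf that cannot decide between two attributes after a short prefix of the stream may still split once more instances arrive.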

  19. Parallel Decision Trees
 Which kind of parallelism?
 Task
 Data: Horizontal (by instances)
 Data: Vertical (by attributes)

  29. Vertical Parallelism
 (figure labels: Stream, Model, Attributes, Stats, Splits)
 Single attribute tracked in single node

  30. Advantages of Vertical
 High number of attributes => high level of parallelism (e.g., documents)
 Vs task parallelism: parallelism observed immediately
 Vs horizontal parallelism: reduced memory usage (no model replication), parallelized split computation
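
 The idea that each attribute is tracked in a single node can be illustrated with a toy Python sketch. It is not SAMOA's API: `StatsWorker`, `route`, and `process` are hypothetical names, routing by attribute index modulo the number of workers is an assumption, and the sketch tracks statistics for a single leaf only.

```python
from collections import defaultdict


class StatsWorker:
    """Hypothetical worker holding class counts for the attributes it owns."""

    def __init__(self):
        # (attribute index, attribute value, class label) -> count
        self.counts = defaultdict(int)

    def update(self, attr, value, label):
        self.counts[(attr, value, label)] += 1


def route(attr: int, num_workers: int) -> int:
    """Assumed routing rule: every attribute maps to exactly one worker."""
    return attr % num_workers


def process(instance, label, workers):
    """Slice an instance by attribute and send each slice to its owner."""
    for attr, value in enumerate(instance):
        workers[route(attr, len(workers))].update(attr, value, label)


workers = [StatsWorker() for _ in range(2)]
stream = [((1, 0, 1, 1), "spam"), ((0, 0, 1, 0), "ham"), ((1, 1, 1, 0), "spam")]
for instance, label in stream:
    process(instance, label, workers)

# Attributes whose statistics live on each worker.
owned = [sorted({attr for attr, _, _ in w.counts}) for w in workers]
```

 No worker holds a copy of the whole model, and the statistics needed to evaluate a split on one attribute sit on one worker, so split candidates can be computed in parallel across attributes.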

  31. PKG: Partial Key Grouping
 M. A. U. Nasir, G. De Francisci Morales, D. Garcia-Soriano, N. Kourtellis, M. Serafini
 ICDE 2015, ICDE 2016

  32. Systems Challenges
 Skewed key distribution
 (figure: CCDF versus key frequency on log-log axes, for words in tweets and wikipedia links)

  33. Key Grouping and Skew
 (figure labels: Stream, two Sources, three Workers)

  34. Problem Statement
 Input stream of messages $m = \langle t, k, v \rangle$
 Load of worker $i \in W$: $L_i(t) = |\{ \langle \tau, k, v \rangle : P_\tau(k) = i \wedge \tau \le t \}|$
 Imbalance of the system: $I(t) = \max_i(L_i(t)) - \mathrm{avg}_i(L_i(t))$, for $i \in W$
 Goal: partitioning function $P_t : K \to \mathbb{N}$ that minimizes imbalance
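
 The two definitions on this slide translate directly into code. The Python sketch below is an illustration only: `loads` and `imbalance` are hypothetical names, messages are (t, k, v) tuples, and `partition(tau, key)` plays the role of P_tau(k).

```python
def loads(messages, partition, num_workers, t):
    """L_i(t): number of messages with timestamp <= t routed to worker i."""
    load = [0] * num_workers
    for tau, key, _value in messages:
        if tau <= t:
            load[partition(tau, key)] += 1
    return load


def imbalance(load):
    """I(t): maximum load minus average load over all workers."""
    return max(load) - sum(load) / len(load)


# Toy stream in which key 0 is three times as frequent as key 1.
messages = [(1, 0, "a"), (2, 0, "b"), (3, 0, "c"), (4, 1, "d")]


def by_key(tau, key):
    """A static key-based partitioner over two workers."""
    return key % 2


final_load = loads(messages, by_key, num_workers=2, t=4)
```

 With this toy stream the frequent key puts three of the four messages on one worker, so the imbalance at t = 4 is 3 - 2 = 1.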

  35. Shuffle Grouping
 (figure labels: Stream, two Sources, three Workers, Aggregator)

  36. Existing Stream Partitioning
 Key Grouping: memory and communication efficient :), load imbalance :(
 Shuffle Grouping: load balance :), additional memory and aggregation phase :(
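
 The trade-off can be seen on a small word count. The Python sketch below is illustrative and makes several assumptions: the function names are hypothetical, `zlib.crc32` stands in for whatever hash a real engine uses, and shuffle grouping is modeled as round-robin.

```python
import zlib
from collections import Counter
from itertools import cycle


def stable_hash(key: str) -> int:
    """Deterministic hash (the built-in hash() of str varies across runs)."""
    return zlib.crc32(key.encode("utf-8"))


def key_grouping(stream, num_workers):
    """Every occurrence of a key goes to the same worker."""
    workers = [Counter() for _ in range(num_workers)]
    for key in stream:
        workers[stable_hash(key) % num_workers][key] += 1
    return workers


def shuffle_grouping(stream, num_workers):
    """Messages are spread round-robin, regardless of key."""
    workers = [Counter() for _ in range(num_workers)]
    for key, w in zip(stream, cycle(range(num_workers))):
        workers[w][key] += 1
    return workers


def aggregate(workers):
    """Aggregation phase: merge the partial counts held by the workers."""
    total = Counter()
    for partial in workers:
        total.update(partial)
    return total


stream = ["the"] * 6 + ["cat", "dog", "the", "fox"]
kg = key_grouping(stream, 3)
sg = shuffle_grouping(stream, 3)
```

 Under key grouping the frequent word lands on one worker, which therefore carries most of the load but already holds the final count. Under shuffle grouping the load is even, but every worker may hold a partial count for the same word, so memory grows and `aggregate` has to merge them.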

  37. Solution: PKG
 Fully distributed adaptation of PoTC (power of two choices), handles skew
 Consensus and state to remember choice => Key splitting: assign each key independently with PoTC
 Load information in distributed system => Local load estimation: estimate worker load locally at each source
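
 A minimal Python sketch of these two ideas follows. It is not the authors' implementation: `PartialKeyGrouper` and `seeded_hash` are hypothetical names, seeded `zlib.crc32` is an assumed stand-in for two independent hash functions, and ties are broken in favor of the first candidate.

```python
import zlib


def seeded_hash(key: str, seed: int) -> int:
    """Assumed stand-in for one of two independent hash functions."""
    return zlib.crc32(f"{seed}:{key}".encode("utf-8"))


class PartialKeyGrouper:
    """Runs at one source and uses only that source's own load estimate."""

    def __init__(self, num_workers: int):
        self.num_workers = num_workers
        # Local load estimation: messages this source has sent to each worker.
        self.local_load = [0] * num_workers

    def candidates(self, key: str):
        """The two workers a key may be sent to."""
        return (seeded_hash(key, 1) % self.num_workers,
                seeded_hash(key, 2) % self.num_workers)

    def route(self, key: str) -> int:
        """Key splitting: choose the less loaded candidate for each message."""
        a, b = self.candidates(key)
        target = a if self.local_load[a] <= self.local_load[b] else b
        self.local_load[target] += 1
        return target


grouper = PartialKeyGrouper(num_workers=4)
stream = ["the"] * 100 + ["cat", "dog"] * 10
targets = [grouper.route(key) for key in stream]
```

 The choice is made per message rather than per key, so no state is needed to remember where a key went, and sources do not have to coordinate. Each key reaches at most two workers, so at most two partial results per key need to be merged.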

  38. Power of Both Choices
 (figure labels: Stream, two Sources, three Workers, Aggregator)

  39. Comparison
 Stream Grouping      | Pros                           | Cons
 Key Grouping         | Memory efficient               | Load imbalance
 Shuffle Grouping     | Load balance                   | Memory overhead, Aggregation O(W)
 Partial Key Grouping | Memory efficient, Load balance | Aggregation O(1)

  40. Graph Streams
 Betweenness centrality in evolving graphs (TKDE '15)
 Dynamic graph summarization (BigData '16)
 Top-k densest subgraph in evolving graphs (CIKM '17)
 Mining frequent patterns in evolving graphs (w.i.p.)
