Mining Big Data Streams: Better Algorithms or Faster Systems?
Gianmarco De Francisci Morales (gdfm@acm.org), QCRI
Agenda
- SAMOA (Scalable Advanced Massive Online Analysis): API
- VHT (Vertical Hoeffding Tree): Algorithm
- PKG (Partial Key Grouping): System
Apache SAMOA: Scalable Advanced Massive Online Analysis
G. De Francisci Morales, A. Bifet, JMLR 2015
Taxonomy
Data Mining
- Non Distributed
  - Batch: R, WEKA, …
  - Stream: MOA
- Distributed
  - Batch: Hadoop (Mahout)
  - Stream: Storm, S4, Samza (SAMOA)
Architecture
[Figure: SAMOA architecture]
Status
https://samoa.incubator.apache.org
- Parallel algorithms
  - Classification (Vertical Hoeffding Tree)
  - Clustering (CluStream)
  - Regression (Adaptive Model Rules)
- Execution engines
VHT: Vertical Hoeffding Tree
A. Murdopo, A. Bifet, G. De Francisci Morales, N. Kourtellis, BigData 2016
Hoeffding Tree
P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
- A sample of the stream is enough for a near-optimal decision
- Estimate the merit of alternatives from a prefix of the stream
- Choose the sample size based on statistical principles
- When to expand a leaf?
  - Let $x_1$ be the most informative attribute and $x_2$ the second most informative one
  - Hoeffding bound: split if $\Delta G(x_1, x_2) > \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
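The split rule above can be checked numerically. The following is a minimal sketch, assuming the usual reading of the symbols: `value_range` is R (the range of the merit function G), `delta` is the allowed error probability, and `n` is the number of instances seen at the leaf. The function names are illustrative and are not taken from SAMOA.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(gain_best: float, gain_second: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the gap between the two best attributes exceeds epsilon."""
    return (gain_best - gain_second) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # The same gap of 0.2 is not enough after 100 instances,
    # but it is enough after 1000, because epsilon shrinks with n.
    for n in (100, 1000):
        print(n, hoeffding_bound(1.0, 1e-7, n), should_split(0.30, 0.10, 1.0, 1e-7, n))
```

The bound shrinks as 1/sqrt(n), so a leaf that cannot yet separate its two best attributes only has to wait for more instances.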
Parallel Decision Trees
Which kind of parallelism?
- Task
- Data
  - Horizontal (instances)
  - Vertical (attributes)
Vertical Parallelism
[Diagram: the Stream feeds the Model, which sends Attributes to several Stats nodes; the Stats nodes return Splits to the Model]
Single attribute tracked in single node
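The idea that each attribute is tracked in a single node can be sketched as follows. This is an illustration under stated assumptions, not the SAMOA implementation: `StatsWorker`, `route_instance`, the modulo routing of attribute indices, and the use of information gain over discrete attribute values are all hypothetical choices made for the example.

```python
import math
from collections import defaultdict


def entropy(class_counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    counts = [c for c in class_counts if c > 0]
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts)


class StatsWorker:
    """Keeps the sufficient statistics of the attributes routed to it."""

    def __init__(self):
        # counts[(leaf_id, attr_index)][attr_value][label] -> occurrences
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def update(self, leaf_id, attr_index, attr_value, label):
        self.counts[(leaf_id, attr_index)][attr_value][label] += 1

    def local_best(self, leaf_id):
        """Return (gain, attr_index) of the best locally tracked attribute, or None."""
        best = None
        for (lid, attr_index), by_value in self.counts.items():
            if lid != leaf_id:
                continue
            class_totals = defaultdict(int)
            for label_counts in by_value.values():
                for label, c in label_counts.items():
                    class_totals[label] += c
            n = sum(class_totals.values())
            remainder = sum(sum(lc.values()) / n * entropy(lc.values())
                            for lc in by_value.values())
            gain = entropy(class_totals.values()) - remainder
            if best is None or gain > best[0]:
                best = (gain, attr_index)
        return best


def route_instance(workers, leaf_id, attributes, label):
    """Model side: split an instance into attributes, one owner worker each."""
    for attr_index, attr_value in enumerate(attributes):
        owner = workers[attr_index % len(workers)]
        owner.update(leaf_id, attr_index, attr_value, label)
```

In this sketch the model would collect `local_best` from every worker, take the two highest gains, and apply the Hoeffding test to their difference. The tree itself is never replicated on the workers.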
Advantages of Vertical
- High number of attributes => high level of parallelism (e.g., documents)
- Vs task parallelism
  - Parallelism observed immediately
- Vs horizontal parallelism
  - Reduced memory usage (no model replication)
  - Parallelized split computation
PKG: Partial Key Grouping
M. A. U. Nasir, G. De Francisci Morales, D. Garcia-Soriano, N. Kourtellis, M. Serafini, ICDE 2015, ICDE 2016
Systems Challenges
Skewed key distribution
[Figure: CCDF of key frequency on log-log axes, for words in tweets and wikipedia links]
Key Grouping and Skew
[Diagram: a Stream feeds two Sources, which route messages by key to three Workers]
Problem Statement
- Input stream of messages $m = \langle t, k, v \rangle$
- Load of worker $i \in W$: $L_i(t) = |\{ \langle \tau, k, v \rangle : P_\tau(k) = i \wedge \tau \le t \}|$
- Imbalance of the system: $I(t) = \max_i L_i(t) - \mathrm{avg}_i L_i(t)$, for $i \in W$
- Goal: a partitioning function $P_t : K \to \mathbb{N}$ that minimizes imbalance
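The imbalance definition above translates directly into code. The sketch below also shows why key grouping suffers under skew: every message with the same key lands on the same worker. The function names and the choice of CRC32 as the hash are assumptions made for the example.

```python
import zlib
from typing import Iterable, List


def imbalance(loads: List[int]) -> float:
    """I(t) = max_i L_i(t) - avg_i L_i(t) over all workers."""
    return max(loads) - sum(loads) / len(loads)


def key_grouping_loads(keys: Iterable[str], num_workers: int) -> List[int]:
    """Key grouping: a single hash decides the worker for each key."""
    loads = [0] * num_workers
    for key in keys:
        worker = zlib.crc32(key.encode("utf-8")) % num_workers
        loads[worker] += 1
    return loads


if __name__ == "__main__":
    # A stream dominated by one hot key: that key's worker takes almost all the load.
    stream = ["hot"] * 90 + ["a", "b", "c", "d", "e"] * 2
    loads = key_grouping_loads(stream, 3)
    print(loads, imbalance(loads))
```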
Shuffle Grouping
[Diagram: a Stream feeds two Sources, which spread messages over three Workers; an Aggr. node merges the Workers' partial results]
Existing Stream Partitioning
- Key Grouping
  - Memory and communication efficient :)
  - Load imbalance :(
- Shuffle Grouping
  - Load balance :)
  - Additional memory and aggregation phase :(
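The cost of shuffle grouping can be made concrete with a small sketch, assuming a per-key counting task and round-robin routing. The names `shuffle_grouping` and `aggregate` are illustrative. Load is perfectly balanced, but a key's state may exist on every worker and must be merged afterwards.

```python
from collections import Counter
from itertools import cycle
from typing import Iterable, List


def shuffle_grouping(keys: Iterable[str], num_workers: int) -> List[Counter]:
    """Round-robin routing: each worker keeps a partial count per key it sees."""
    partials = [Counter() for _ in range(num_workers)]
    for key, worker in zip(keys, cycle(range(num_workers))):
        partials[worker][key] += 1
    return partials


def aggregate(partials: List[Counter]) -> Counter:
    """Aggregation phase: merge up to W partial states per key."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

For a single key sent six times to three workers, each worker processes two messages, yet all three hold an entry for that key, which is the memory overhead noted above.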
Solution: PKG
Fully distributed adaptation of PoTC (power of two choices), handles skew
- Challenge: consensus and state to remember choice
  - Key splitting: assign each key independently with PoTC
- Challenge: load information in distributed system
  - Local load estimation: estimate worker load locally at each source
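Key splitting and local load estimation can be sketched together in a few lines. This is a minimal illustration, not the authors' implementation: the class name `PartialKeyGrouper`, the use of CRC32 with two different starting values as the two hash functions, and the tie-breaking rule are all assumptions.

```python
import zlib
from typing import List, Tuple


class PartialKeyGrouper:
    """One instance per source; no coordination with other sources."""

    def __init__(self, num_workers: int):
        self.num_workers = num_workers
        # Local load estimation: messages THIS source has sent to each worker.
        self.local_load: List[int] = [0] * num_workers

    def candidates(self, key: str) -> Tuple[int, int]:
        """Two candidate workers per key, from two hash functions."""
        data = key.encode("utf-8")
        first = zlib.crc32(data) % self.num_workers
        second = zlib.crc32(data, 0x9E3779B9) % self.num_workers
        return first, second

    def route(self, key: str) -> int:
        """Key splitting: choose the less loaded candidate for every message."""
        first, second = self.candidates(key)
        if self.local_load[first] <= self.local_load[second]:
            chosen = first
        else:
            chosen = second
        self.local_load[chosen] += 1
        return chosen
```

Because the choice is made per message and not remembered, no routing table or consensus is needed, and each key's state lives on at most two workers, so aggregation touches a constant number of partial results.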
Power of Both Choices
[Diagram: a Stream feeds two Sources, which send messages to three Workers; an Aggr. node merges the Workers' results]
Comparison
Stream Grouping      | Pros                           | Cons
Key Grouping         | Memory efficient               | Load imbalance
Shuffle Grouping     | Load balance                   | Memory overhead, Aggregation O(W)
Partial Key Grouping | Memory efficient, Load balance | Aggregation O(1)
Graph Streams
- Betweenness centrality in evolving graphs (TKDE '15)
- Dynamic graph summarization (BigData '16)
- Top-k densest subgraph in evolving graphs (CIKM '17)
- Mining frequent patterns in evolving graphs (w.i.p.)