Mining Big Data Streams: Better Algorithms or Faster Systems?

Gianmarco De Francisci Morales, gdfm@acm.org, QCRI

Agenda
- SAMOA (Scalable Advanced Massive Online Analysis): API
- VHT (Vertical Hoeffding Tree): Algorithm
- PKG (Partial Key Grouping): System

Apache SAMOA: Scalable Advanced Massive Online Analysis
G. De Francisci Morales, A. Bifet. JMLR 2015

Taxonomy
Data Mining
- Non Distributed
  - Batch: R, WEKA, …
  - Stream: MOA
- Distributed
  - Batch: Hadoop (Mahout)
  - Stream: Storm, S4, Samza (SAMOA)

Architecture
[Figure: SAMOA architecture diagram]

Status
https://samoa.incubator.apache.org
- Parallel algorithms
  - Classification (Vertical Hoeffding Tree)
  - Clustering (CluStream)
  - Regression (Adaptive Model Rules)
- Execution engines

VHT: Vertical Hoeffding Tree
A. Murdopo, A. Bifet, G. De Francisci Morales, N. Kourtellis. BigData 2016

Hoeffding Tree
P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
- A sample of the stream is enough for a near-optimal decision
- Estimate the merit of alternatives from a prefix of the stream
- Choose the sample size based on statistical principles
- When to expand a leaf? Let $x_1$ be the most informative attribute and $x_2$ the second most informative one
- Hoeffding bound: split if $\Delta G(x_1, x_2) > \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
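The split rule above can be sketched in a few lines of Python. This is a minimal illustration, not the SAMOA implementation: the function names `hoeffding_bound` and `should_split` are hypothetical, `value_range` stands for R (the range of the merit function G), `delta` for the allowed error probability, and `n` for the number of instances seen at the leaf.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(gain_best: float, gain_second: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the gap between the two best attributes exceeds epsilon."""
    return (gain_best - gain_second) > hoeffding_bound(value_range, delta, n)


# Illustrative values only (assumed, not taken from the slides).
epsilon = hoeffding_bound(value_range=1.0, delta=1e-7, n=1000)
clear_winner = should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000)
too_close = should_split(0.30, 0.25, value_range=1.0, delta=1e-7, n=1000)
```

Because epsilon shrinks as n grows, a leaf whose two best attributes are close in merit simply waits for more instances before the gap can exceed the bound.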

Parallel Decision Trees
Which kind of parallelism?
- Task
- Data
  - Horizontal (instances)
  - Vertical (attributes)

Vertical Parallelism
[Figure: the stream feeds the model, which sends attributes to several Stats nodes; the Stats nodes report splits back to the model]
- Single attribute tracked in single node

Advantages of Vertical
- High number of attributes => high level of parallelism (e.g., documents)
- Vs task parallelism: parallelism observed immediately
- Vs horizontal parallelism: reduced memory usage (no model replication), parallelized split computation
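As a rough illustration of vertical parallelism, the sketch below splits each instance by attribute and sends every attribute to exactly one statistics worker. The class `StatsWorker`, the function `route_instance`, and the modulo routing rule are assumptions made for this example; the slides do not specify how VHT assigns attributes to nodes, and a real tree would also track which leaf each instance reached.

```python
from collections import defaultdict
from typing import List, Sequence


class StatsWorker:
    """Hypothetical worker holding sufficient statistics for its attributes."""

    def __init__(self) -> None:
        # (attribute index, attribute value, class label) -> count
        self.counts = defaultdict(int)

    def update(self, attr: int, value: str, label: int) -> None:
        self.counts[(attr, value, label)] += 1


def route_instance(instance: Sequence[str], label: int,
                   workers: List[StatsWorker]) -> None:
    """Send each attribute of one instance to the single worker that owns it."""
    for attr, value in enumerate(instance):
        # Assumed routing rule: attribute index modulo number of workers.
        workers[attr % len(workers)].update(attr, value, label)


workers = [StatsWorker() for _ in range(3)]
route_instance(["a", "b", "c", "d"], label=1, workers=workers)
route_instance(["a", "x", "c", "d"], label=0, workers=workers)
```

Each worker only stores counts for its own attributes, so the model is not replicated, and every worker can evaluate candidate splits for its attributes independently.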

PKG: Partial Key Grouping
M. A. U. Nasir, G. De Francisci Morales, D. Garcia-Soriano, N. Kourtellis, M. Serafini. ICDE 2015, ICDE 2016

Systems Challenges
- Skewed key distribution
[Figure: CCDF versus key frequency on log-log axes, for words in tweets and Wikipedia links]

Key Grouping and Skew
[Figure: sources send the stream to workers, partitioned by key]

Problem Statement
- Input stream of messages: $m = \langle t, k, v \rangle$
- Load of worker $i \in W$: $L_i(t) = |\{\langle \tau, k, v \rangle : P_\tau(k) = i \wedge \tau \le t\}|$
- Imbalance of the system: $I(t) = \max_i(L_i(t)) - \mathrm{avg}_i(L_i(t))$, for $i \in W$
- Goal: a partitioning function $P_t : K \to \mathbb{N}$ that minimizes imbalance
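The load and imbalance definitions above translate directly into code. The following is a small sketch under simplifying assumptions: the helper names `worker_loads` and `imbalance` are hypothetical, keys are integers, and the partitioning function is fixed over time rather than time-dependent as $P_\tau$ allows.

```python
from typing import Callable, Iterable, List, Tuple

Message = Tuple[int, int, str]  # (timestamp t, key k, value v)


def worker_loads(messages: Iterable[Message],
                 partition: Callable[[int], int],
                 num_workers: int, t: int) -> List[int]:
    """L_i(t): number of messages with timestamp <= t routed to worker i."""
    loads = [0] * num_workers
    for tau, key, _value in messages:
        if tau <= t:
            loads[partition(key)] += 1
    return loads


def imbalance(loads: List[int]) -> float:
    """I(t) = max_i L_i(t) - avg_i L_i(t)."""
    return max(loads) - sum(loads) / len(loads)


# Toy skewed stream (assumed data): key 0 is much more frequent than the others.
stream = [(1, 0, "a"), (2, 0, "b"), (3, 0, "c"),
          (4, 0, "d"), (5, 1, "e"), (6, 2, "f")]
loads = worker_loads(stream, partition=lambda k: k % 3, num_workers=3, t=6)
skew = imbalance(loads)
```

With key grouping, all messages of the hot key land on one worker, so the maximum load pulls away from the average.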

Shuffle Grouping
[Figure: sources send the stream to all workers regardless of key; an aggregator combines the workers' partial results]

Existing Stream Partitioning
- Key Grouping
  - Memory and communication efficient :)
  - Load imbalance :(
- Shuffle Grouping
  - Load balance :)
  - Additional memory and aggregation phase :(

Solution: PKG
Fully distributed adaptation of the power of two choices (PoTC) that handles skew
- Problem: consensus and state are needed to remember each choice
  - Key splitting: assign each key independently with PoTC
- Problem: load information is hard to obtain in a distributed system
  - Local load estimation: estimate worker load locally at each source
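The two ideas above, key splitting and local load estimation, can be sketched as a per-source router. This is an illustrative sketch and not the authors' implementation: the class `PartialKeyGrouper`, the helper `_hash`, and the use of seeded MD5 digests as the two hash functions are all assumptions made for the example.

```python
import hashlib
from typing import List, Tuple


def _hash(key: str, seed: int, num_workers: int) -> int:
    """Deterministic seeded hash of a key onto a worker index (assumed scheme)."""
    digest = hashlib.md5(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


class PartialKeyGrouper:
    """Hypothetical per-source router: two candidate workers per key."""

    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers
        # Local load estimate: counts only messages sent by this source.
        self.local_load: List[int] = [0] * num_workers

    def candidates(self, key: str) -> Tuple[int, int]:
        return (_hash(key, 1, self.num_workers),
                _hash(key, 2, self.num_workers))

    def route(self, key: str) -> int:
        """Key splitting: pick the less loaded candidate for every message."""
        first, second = self.candidates(key)
        if self.local_load[first] <= self.local_load[second]:
            chosen = first
        else:
            chosen = second
        self.local_load[chosen] += 1
        return chosen


grouper = PartialKeyGrouper(num_workers=5)
routes = [grouper.route("hot-key") for _ in range(100)]
```

Because the choice is made per message, no state about earlier decisions has to be stored, and because each key can reach at most two workers, downstream aggregation per key touches a constant number of partial results.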

Power of Both Choices
[Figure: sources send each key to one of two candidate workers; an aggregator combines the workers' partial results]

Comparison
Stream Grouping | Pros | Cons
Key Grouping | Memory efficient | Load imbalance
Shuffle Grouping | Load balance | Memory overhead; Aggregation O(W)
Partial Key Grouping | Load balance; Memory efficient; Aggregation O(1) |

Graph Streams
- Betweenness centrality in evolving graphs (TKDE '15)
- Dynamic graph summarization (BigData '16)
- Top-k densest subgraph in evolving graphs (CIKM '17)
- Mining frequent patterns in evolving graphs (w.i.p.)
