
SAMOA: A Platform for Mining Big Data Streams, Nicolas Kourtellis


  1. SAMOA: A Platform for Mining Big Data Streams
     Nicolas Kourtellis, Associate Researcher, Telefonica I+D, Barcelona
     SAMOA: Scalable Advanced Massive Online Analysis, 30/09/15

  2. What is Big Data?
     - Search queries
     - Facebook posts
     - Emails
     - Tweets
     - Photo shares
     - Clicks on ads
     - …

  3. How BIG is your data?
     - Volume (+ Variety)
       - Too large for the RAM of a single commodity server
     - Velocity
       - Too fast for the CPU of a single commodity server

  4. What is the Streaming Paradigm?
     - High volume of data, high speed of arrival
     - Models updated in “real” time
     - Potentially infinite sequence of data
     - Data changes over time (concept drift)

  5. Mining Big Data Streams
     - Approximation algorithms:
       - Single pass, one data item at a time
       - Sub-linear space and time per data item
       - Small error with high probability
     - A platform solution:
       - Support different algorithms & processing engines
       - Distributed
       - Scalable
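The three properties of approximation algorithms listed on this slide can be illustrated with a Count-Min Sketch. The slides do not name this algorithm; it is chosen here only as a well-known example of single-pass, fixed-memory counting with a small error at high probability. The names `epsilon`, `delta`, and `seed` are assumptions of this sketch, and the simple hash mixing below is for illustration only, so the formal error bound does not strictly carry over.

```java
import java.util.Random;

/** Count-Min Sketch: approximate per-item counts in one pass with fixed memory. */
class CountMinSketch {
    private final int depth;       // number of hash rows (drives the failure probability)
    private final int width;       // counters per row (drives the error size)
    private final long[][] table;  // depth x width counters, independent of stream length
    private final int[] seeds;     // one hash seed per row

    CountMinSketch(double epsilon, double delta, long seed) {
        this.width = (int) Math.ceil(Math.E / epsilon);
        this.depth = Math.max(1, (int) Math.ceil(Math.log(1.0 / delta)));
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(seed);
        for (int r = 0; r < depth; r++) {
            seeds[r] = rnd.nextInt();
        }
    }

    // Illustrative hash: mixes the item's hash code with the row seed.
    private int bucket(String item, int row) {
        int h = item.hashCode() * 31 + seeds[row];
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    /** Processes one stream item: one counter update per row. */
    void add(String item) {
        for (int r = 0; r < depth; r++) {
            table[r][bucket(item, r)]++;
        }
    }

    /** Returns an estimate that is never below the true count of the item. */
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < depth; r++) {
            min = Math.min(min, table[r][bucket(item, r)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch cms = new CountMinSketch(0.01, 0.01, 42L);
        String[] stream = {"click", "tweet", "click", "email", "click"};
        for (String item : stream) {
            cms.add(item);  // single pass, one item at a time
        }
        System.out.println("estimated count of 'click': " + cms.estimate("click"));
    }
}
```

Memory use depends only on `epsilon` and `delta`, not on how many items arrive, which is the point of the "sub-linear space" bullet.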

  6. What is SAMOA?
     - Scalable Advanced Massive Online Analysis
     - A platform for mining big data streams
     - Framework for developing new distributed stream mining algorithms
     - Framework for deploying algorithms on new distributed stream processing engines

  7. Taxonomy
     - Machine Learning
       - Distributed
         - Batch (Hadoop): Mahout
         - Stream (S4, Storm): SAMOA
       - Non Distributed
         - Batch: R, WEKA, …
         - Stream: MOA

  8. SAMOA Architecture
     [diagram layers: Machine Learning Algorithms; SAMOA; Distributed Stream Processing Engines (Flink among them)]

  9. Why is SAMOA important?
     - Program once, run everywhere
     - Reuse existing infrastructure
     - Avoid deploy cycles
       - No system downtime
       - No complex backup/update process
       - No need to select update frequency

  10. ML Developer API
      [diagram labels: Processing Item, Processor, Stream]

  11. ML Developer API

          TopologyBuilder builder;  // assumed to be initialized elsewhere

          // First source and its output stream
          Processor sourceOne = new SourceProcessor();
          builder.addProcessor(sourceOne);
          Stream streamOne = builder.createStream(sourceOne);

          // Second source and its output stream
          Processor sourceTwo = new SourceProcessor();
          builder.addProcessor(sourceTwo);
          Stream streamTwo = builder.createStream(sourceTwo);

          // Join both streams: shuffle grouping on the first, key grouping on the second
          Processor join = new JoinProcessor();
          builder.addProcessor(join)
                 .connectInputShuffle(streamOne)
                 .connectInputKey(streamTwo);

  12. Deployment
      - Algorithm developer: depends only on the API (SAMOA-API.jar)
      - S4 bindings: SAMOA-S4.jar, packaged as samoa-s4-deployable.s4r, to S4 cluster
      - Storm bindings: SAMOA-Storm.jar, packaged as samoa-storm-deployable.jar, to Storm cluster

  13. Easy to get!

  14. Easy to get!

  15. Easy to get!

  16. Easy to test!

          bin/samoa storm target/SAMOA-Storm-0.3.0-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 -l (classifiers.trees.VerticalHoeffdingTree -p 4 -k) -s (generators.RandomTreeGenerator -r 1 -c 2 -o 10 -u 10)"
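The command above runs a prequential evaluation, meaning each instance is first used to test the current model and only then to train it. The following is a minimal standalone sketch of that loop and does not use SAMOA's classes. The majority-class learner, the synthetic label stream, and the parameter names are all assumptions made for illustration; `instances` and `sampleFrequency` are meant to play roles similar to what the `-i` and `-f` options appear to control.

```java
import java.util.Random;

/** Prequential (interleaved test-then-train) evaluation on a synthetic label stream. */
class PrequentialSketch {

    /** Toy learner (hypothetical): always predicts the class seen most often so far. */
    static class MajorityClassLearner {
        private final long[] counts;

        MajorityClassLearner(int numClasses) {
            counts = new long[numClasses];
        }

        int predict() {
            int best = 0;
            for (int c = 1; c < counts.length; c++) {
                if (counts[c] > counts[best]) {
                    best = c;
                }
            }
            return best;
        }

        void train(int label) {
            counts[label]++;
        }
    }

    /** Runs test-then-train over the stream and returns the final accuracy. */
    static double run(int instances, int sampleFrequency, double probClassOne, long seed) {
        Random rnd = new Random(seed);
        MajorityClassLearner learner = new MajorityClassLearner(2);
        long correct = 0;
        for (int i = 1; i <= instances; i++) {
            int label = rnd.nextDouble() < probClassOne ? 1 : 0;
            if (learner.predict() == label) {
                correct++;            // 1) test on the not-yet-seen instance
            }
            learner.train(label);     // 2) then train on the same instance
            if (i % sampleFrequency == 0) {
                // Periodic report: instances seen so far and running accuracy
                System.out.printf("%d,%.4f%n", i, (double) correct / i);
            }
        }
        return (double) correct / instances;
    }

    public static void main(String[] args) {
        run(100_000, 10_000, 0.7, 1L);
    }
}
```

Because every instance is tested before it is learned, no separate hold-out set is needed, which suits potentially endless streams.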

  17. Case study: Decision Trees
      - VHT: Vertical Hoeffding Tree
      - [diagram: task parallelism]
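As background for the case study, a Hoeffding tree decides when to split a leaf by using the Hoeffding bound: after n observations of a variable with range R, the observed mean is within ε = sqrt(R² ln(1/δ) / (2n)) of the true mean with probability at least 1 − δ. The slides do not show this computation, so the sketch below is a generic illustration rather than SAMOA's implementation. The method names and example values are assumptions, and the tie-breaking threshold that practical implementations usually add is omitted.

```java
/** Hoeffding bound and the resulting split test, as used by Hoeffding trees in general. */
class HoeffdingBoundSketch {

    /**
     * epsilon = sqrt(range^2 * ln(1/delta) / (2n)).
     * range: value range of the split criterion (1.0 for information gain with two classes)
     * delta: allowed probability that the bound fails
     * n:     number of instances observed at the leaf
     */
    static double hoeffdingBound(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /** Split when the best attribute beats the runner-up by more than epsilon. */
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n) {
        return bestGain - secondBestGain > hoeffdingBound(range, delta, n);
    }

    public static void main(String[] args) {
        double delta = 1e-7;  // example confidence parameter, chosen for illustration
        for (long n : new long[] {100, 1_000, 10_000, 100_000}) {
            System.out.printf("n=%d epsilon=%.4f%n", n, hoeffdingBound(1.0, delta, n));
        }
        System.out.println("split? " + shouldSplit(0.30, 0.15, 1.0, delta, 1_000));
    }
}
```

The bound shrinks as more instances arrive, so a leaf that cannot yet separate its two best attributes simply waits for more data.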

  18. Case study: VHT, horizontal parallelism
      [diagram labels: Stream, Instances, Model, Stats (×3), Histograms, Model Updates]

  19. Case study: VHT, vertical parallelism
      [diagram labels: Stream, Attributes, Model, Stats (×3), Splits]

  20. Benefits of Vertical Parallelism
      - High number of attributes:
        - high level of parallelism (e.g., documents)
      - vs. task parallelism:
        - obvious parallelism observed
      - vs. horizontal parallelism:
        - reduced memory usage (no model replication)
        - parallelized split computation

  21. Vertical Hoeffding Tree
      [diagram: topology with Source (n), Model (n), Stats (n), Evaluator (1); labels: Stream, Instance, Control, Split, Result, Shuffle Grouping, Key Grouping, All Grouping]
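The topology on this slide names three ways of routing messages to parallel processor instances: shuffle, key, and all grouping. The sketch below shows what each grouping means in general terms. It is not SAMOA code: the class and method names are hypothetical, and round-robin is only one possible way to realize shuffle grouping.

```java
import java.util.Arrays;

/** Illustrative routing rules for shuffle, key, and all grouping. */
class GroupingSketch {

    /** Shuffle grouping: spread messages evenly; round-robin is used here as one option. */
    static int[] shuffle(long messageIndex, int parallelism) {
        return new int[] { (int) Math.floorMod(messageIndex, (long) parallelism) };
    }

    /** Key grouping: messages with the same key always reach the same instance. */
    static int[] key(Object key, int parallelism) {
        return new int[] { Math.floorMod(key.hashCode(), parallelism) };
    }

    /** All grouping: every instance receives a copy of the message (broadcast). */
    static int[] all(int parallelism) {
        int[] destinations = new int[parallelism];
        for (int i = 0; i < parallelism; i++) {
            destinations[i] = i;
        }
        return destinations;
    }

    public static void main(String[] args) {
        int parallelism = 4;
        System.out.println("shuffle, message 5: " + Arrays.toString(shuffle(5, parallelism)));
        System.out.println("key 'attribute-7':  " + Arrays.toString(key("attribute-7", parallelism)));
        System.out.println("all:                " + Arrays.toString(all(parallelism)));
    }
}
```

Key grouping is what makes vertical parallelism workable in a design like this: if the key identifies an attribute, all statistics for that attribute accumulate on a single instance, so no instance needs a full copy of the model.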

  22. Preliminary results: Tweets
      - Zipf skew: 1.5
      - Bag of words: 100, 1000, 10000 (attributes)
      - Size of tweet: ~15 words
      - Instances: 1,000,000
      - Class: positive or negative (Gaussian random variable)
      - 10 runs
      - Local vs. Storm virtual cluster

  23. Results: Accuracy
      [figure: classification accuracy vs. parallelism level and number of attributes; y-axis: correct classification (%), 0 to 100; x-axis: parallelism level (4, 8, 16, local); series: 100, 1000, 10000 words]

  24. Results: Speedup
      [figure: speedup vs. parallelism level and number of attributes; y-axis: speedup, 0 to 5; x-axis: parallelism level (4, 8, 16); series: 100, 1000, 10000 words]

  25. Is SAMOA for you?
      - Are you dealing with:
        - Big fast data?
        - Possibly endless streams of data?
        - Evolving data?
      - Do you need models updated in real time?
      - Do you want to test an algorithm on different DSPEs (distributed stream processing engines)?

  26. SAMOA Team
      - Albert Bifet
      - Matthieu Morel
      - Gianmarco De Francisci Morales
      - Arinto Murdopo
      - Olivier Van Laere
      - Nicolas Kourtellis

  27. Status
      - Apache Incubator
      - Released version 0.3.0 in July
      - Execution engines: [logos omitted]; Heron?
      - Input:
        - Local FS
        - HDFS
        - Kafka [pending]

  28. Algorithms in SAMOA
      - Existing:
        - Vertical Hoeffding Tree (classification)
        - CluStream (clustering)
        - Adaptive Model Rules (regression)
      - Pending (looking for contributors!):
        - Distributed Naïve Bayes
        - Stochastic Gradient Descent
        - Adaptive + Boosting VHT
        - Parallelized Gradient Boosted Decision Tree
        - PARMA (frequent pattern mining)
        - …
      - Check the SAMOA Roadmap for more

  29. SAMOA: A Platform for Mining Big Data Streams
      - @ApacheSAMOA
      - http://samoa.incubator.apache.org/
      - https://github.com/apache/incubator-samoa
      - Nicolas Kourtellis, @kourtellis, nicolas.kourtellis@telefonica.com
