
Big Data Learning in Practice — Isaac Triguero, School of Computer Science (PowerPoint PPT presentation)



  1. Big Data Learning in Practice — Isaac Triguero, School of Computer Science, University of Nottingham, United Kingdom. Isaac.Triguero@nottingham.ac.uk — http://www.cs.nott.ac.uk/~pszit/benelearn.html — 12th September 2016

  2. Outline • What is Big Data? • How to deal with data-intensive applications? • Big Data Analytics • A demo with MLlib • Conclusions

  3. What is Big Data? There is no standard definition! "Big Data" involves data whose volume, diversity and complexity require new techniques, algorithms and analyses to extract valuable (hidden) knowledge. Data-intensive applications.

  4. What is Big Data? The 5 V's definition

  5. Big Data has many faces

  6. Outline • What is Big Data? • How to deal with data-intensive applications? • Big Data Analytics • A demo with MLlib • Conclusions

  7. How to deal with data-intensive applications? • Problem statement: scalability to big data sets. • Example: exploring 100 TB with 1 node @ 50 MB/s ≈ 23 days; exploring it with a cluster of 1,000 nodes ≈ 33 minutes. • Solution → divide and conquer. What happens if we have to manage 1,000 or 10,000 TB?
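The slide's scan-time figures can be checked with a quick back-of-the-envelope script (assuming decimal units, i.e. 1 TB = 10^12 bytes, and perfectly even work distribution):

```python
# Back-of-the-envelope check of the slide's scan-time example.
TB = 10**12  # bytes (decimal terabyte, as assumed here)

def scan_time_seconds(data_bytes, nodes, mb_per_sec=50):
    """Time to scan the data when each node reads its share at mb_per_sec MB/s."""
    per_node = data_bytes / nodes
    return per_node / (mb_per_sec * 10**6)

one_node = scan_time_seconds(100 * TB, nodes=1)
cluster = scan_time_seconds(100 * TB, nodes=1000)

print(f"1 node:     {one_node / 86400:.1f} days")   # ~23.1 days
print(f"1000 nodes: {cluster / 60:.1f} minutes")    # ~33.3 minutes
```

The same arithmetic shows why 1,000 or 10,000 TB forces a cluster: a single node would need roughly 8 months or 6 years, respectively.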

  8. MapReduce • A parallel programming model • Divide-and-conquer strategy: – divide: partition the dataset into smaller, independent chunks to be processed in parallel (map) – conquer: combine, merge or otherwise aggregate the results from the previous step (reduce) • Based on simplicity and transparency for the programmer, and assumes data locality. • Became popular thanks to the open-source project Hadoop! (Used by Google, Facebook, Amazon, ...)
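The divide/conquer phases can be sketched with the classic word-count example. This is a single-machine illustration only; a real Hadoop job runs the same three phases distributed across nodes:

```python
from collections import defaultdict

def map_phase(chunk):
    # divide: emit <word, 1> pairs for one independent input chunk
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # group all values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # conquer: aggregate the values for one key
    return key, sum(values)

chunks = ["big data big", "data locality"]
pairs = [p for chunk in chunks for p in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'locality': 1}
```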

  9. Traditional HPC way of doing things — [diagram: many worker nodes linked by a fast communication network (Infiniband); lots of communication, lots of computation; limited I/O over a separate network to relatively small input data on central storage]. Source: Jan Fostier, Introduction to MapReduce and its Application to Post-Sequencing Analysis

  10. Data-intensive jobs — [diagram: the same cluster running a data-intensive job: limited communication and low compute intensity, but lots of I/O to a large input dataset on central storage — this doesn't scale]

  11. Data-intensive jobs — [diagram: the input data chunks spread across the local disks of the compute nodes; limited communication, low compute intensity]. Solution: store data on the local disks of the nodes that perform computations on that data ("data locality")

  12. Hadoop • Hadoop is: – an open-source framework written in Java – distributed storage of very large data sets (Big Data) – distributed processing of very large data sets • The framework consists of a number of modules: – Hadoop Common – Hadoop Distributed File System (HDFS) – Hadoop YARN (resource manager) – Hadoop MapReduce (programming model). http://hadoop.apache.org/

  13. Hadoop MapReduce: Main Characteristics • Automatic parallelization: – depending on the size of the input data → there will be multiple MAP tasks! – depending on the number of <key, value> keys → there will be multiple REDUCE tasks! • Scalability: – it can run on any data center or cluster of computers. • Transparent for the programmer: – fault-tolerance mechanism – automatic communication among computers

  14. Data Sharing in Hadoop MapReduce — [diagram: iterative jobs read from and write to HDFS between every iteration (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → ...), and repeated queries each re-read the input from HDFS]. Slow due to replication, serialization, and disk I/O

  15. Paradigms that do not fit with Hadoop MapReduce • Directed Acyclic Graph (DAG) model: – the DAG defines the dataflow of the application, and the vertices of the graph define the operations on the data. • Graph model: – more complex graph models that better represent the dataflow of the application – cyclic models → iterativity. • Iterative MapReduce model: – an extended programming model that supports iterative MapReduce computations efficiently.

  16. New platforms to overcome Hadoop's limitations
  – GIRAPH (Apache project), iterative graph processing — http://giraph.apache.org/
  – Twister (Indiana University), private clusters — http://www.iterativemapreduce.org/
  – GPS, A Graph Processing System (Stanford), Amazon's EC2 — http://infolab.stanford.edu/gps/
  – PrIter (University of Massachusetts Amherst, Northeastern University, China), private cluster and Amazon EC2 cloud — http://code.google.com/p/priter/
  – HaLoop (University of Washington), Amazon's EC2 — http://code.google.com/p/haloop/ — http://clue.cs.washington.edu/node/14
  – Distributed GraphLab (Carnegie Mellon Univ.), Amazon's EC2 — https://github.com/graphlab-code/graphlab
  – Spark (UC Berkeley) — http://spark.incubator.apache.org/
  – GPU-based platforms: Mars, Grex

  17. Big data technologies 17

  18. What is Spark? A fast and expressive cluster computing engine compatible with Apache Hadoop. Up to 10× faster than Hadoop on disk, 100× in memory, with 2-5× less code. • Efficient: general execution graphs, in-memory storage. • Usable: rich APIs in Java, Scala, Python; interactive shell.

  19. Spark Goal • Provide distributed memory abstractions for clusters to support apps with working sets • Retain the attractive properties of MapReduce: – fault tolerance (for crashes & stragglers) – data locality – scalability. Initial solution: augment the data flow model with "resilient distributed datasets" (RDDs)

  20. RDDs in Detail • An RDD is a fault-tolerant collection of elements that can be operated on in parallel. • There are two ways to create RDDs: – parallelizing an existing collection in your driver program – referencing a dataset in an external storage system, such as a shared filesystem, HDFS or HBase. • Can be cached for future reuse

  21. Operations with RDDs • Transformations (e.g. map, filter, groupBy, join): lazy operations that build RDDs from other RDDs • Actions (e.g. count, collect, save): return a result or write it to storage. Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, ... Parallel operations (return a result to the driver): reduce, collect, count, save, lookupKey, ...
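The lazy-transformation idea can be illustrated on a single machine with plain Python generators (an analogy only, not Spark: a generator pipeline, like a chain of RDD transformations, does no work until an "action" consumes it):

```python
data = range(1, 11)

# "Transformations": build the pipeline; nothing has been computed yet.
squares = (x * x for x in data)             # analogous to rdd.map(lambda x: x * x)
evens = (x for x in squares if x % 2 == 0)  # analogous to .filter(...)

# "Action": forcing the pipeline triggers the whole computation at once.
result = sum(evens)  # analogous to an action such as reduce
print(result)  # 4 + 16 + 36 + 64 + 100 = 220
```

Laziness lets Spark see the whole chain before running it, so it can schedule work near the data and recompute lost partitions for fault tolerance.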

  22. Spark vs. Hadoop: K-Means — [chart: iteration time (s) vs. number of machines (25, 50, 100): Hadoop 274/157/106, HadoopBinMem 197/121/87, Spark 143/61/33]. Lines of code for K-Means: Spark ~90 lines; Hadoop ~4 files, >300 lines. [Zaharia et al., NSDI'12]

  23. Apache Spark: new collections • DataFrame (Spark 1.3+): – equivalent to a table in a relational database (a data frame in R/Python) – avoids the Java serialization performed by RDDs – API natural for developers who are familiar with building query plans (e.g. SQL expressions). • Datasets (Spark 1.6+): – best of both DataFrames and RDDs – functional transformations (map, flatMap, filter, etc.) – Spark SQL's optimised execution engine.
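The "table in a relational database" analogy can be made concrete with SQLite from the Python standard library. This is not Spark code: it only illustrates the declarative, SQL-expression style of query that a DataFrame exposes and that Spark SQL's engine is free to optimise (in Spark you would write `spark.sql(...)` or chain DataFrame operations instead):

```python
import sqlite3

# A plain relational table, standing in for a Spark DataFrame.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (label TEXT, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [("a", 1.0, 2.0), ("b", 3.0, 4.0), ("a", 5.0, 6.0)],
)

# A declarative query: the engine builds and optimises the plan,
# the same idea behind Spark SQL's optimised execution engine.
rows = conn.execute(
    "SELECT label, AVG(x) FROM points GROUP BY label ORDER BY label"
).fetchall()
print(rows)  # [('a', 3.0), ('b', 3.0)]
```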

  24. Flink — https://flink.apache.org/

  25. Big Data: Technology and Chronology (2001-2010 and 2010-2016)
  – 2001: the 3 V's (Gartner, Doug Laney)
  – 2004: MapReduce (Google, Jeffrey Dean)
  – 2008: Hadoop (Yahoo!, Doug Cutting)
  – 2009-2013: Flink (TU Berlin, Volker Markl); Apache Flink since Dec. 2014
  – 2010: Spark (UC Berkeley, Matei Zaharia); Apache Spark since Feb. 2014
  – 2010-2016: Big Data Analytics (Mahout, MLlib, ...), the Hadoop ecosystem, Big Data applications, new technology

  26. Outline • What is Big Data? • How to deal with data-intensive applications? • Big Data Analytics • A demo with MLlib • Conclusions

  27. Big Data Analytics: potential scenarios — real-time analytics / Big Data streams, clustering, classification, association, social media mining, recommendation systems, social Big Data

  28. Big Data Analytics: a three-generation view

  29. Mahout (Samsara) • The first ML library, initially based on Hadoop MapReduce. • Abandoned the MapReduce implementations from version 0.9. • Nowadays focused on a new math environment called Samsara. • Integrated with Spark, Flink and H2O. • Main algorithms: – Stochastic Singular Value Decomposition (ssvd, dssvd) – Stochastic Principal Component Analysis (spca, dspca) – Distributed Cholesky QR (thinQR) – Distributed regularized Alternating Least Squares (dals) – Collaborative filtering: item and row similarity – Naive Bayes classification. http://mahout.apache.org/
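For reference, the factorisation that Mahout's ssvd/dssvd routines compute at scale is the ordinary singular value decomposition; here it is on a tiny matrix with NumPy (single-machine, nothing to do with Mahout's distributed implementation):

```python
import numpy as np

# A small matrix standing in for a large distributed one.
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# SVD: A = U * diag(s) * Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The original matrix is recovered exactly from the three factors.
reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(A, reconstructed))  # True
```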

  30. Spark Libraries — https://spark.apache.org/mllib/

  31. As of Spark 2.0

  32. FlinkML — https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/ml/

  33. Outline • What is Big Data? • How to deal with data-intensive applications? • Big Data Analytics • A demo with MLlib • Conclusions

  34. Demo • In this demo I will show two ways of working with Apache Spark: – interactive mode with Spark Notebook – standalone mode with Scala IDE. • All the code used in this presentation is available at http://www.cs.nott.ac.uk/~pszit/benelearn.html

  35. DEMO with Spark Notebook in local mode — http://spark-notebook.io/

  [Slides 36-42: screenshots of the Spark Notebook demo session]
