
Big Data Analytics beyond Map/Reduce, 17.7. – 22.7.2011, Prof. Dr.

German-French Summer University for Young Researchers 2011 (Deutsch-Französische Sommeruniversität / Université d'été franco-allemande) – Cloud Computing: Challenges and Opportunities


  1. Asymmetric Fragment-and-Replicate Join
■ We can do better if relation S is much smaller than R.
■ Idea: reuse the existing partitioning of R and replicate the whole relation S to each node.
■ Cost: p * B(S) for transport; local join cost: ???
■ The asymmetric fragment-and-replicate join is a special case of the symmetric algorithm with m = p and n = 1.
■ The asymmetric fragment-and-replicate join is also called a broadcast join.

  2. Broadcast Join
■ Equi-join: L(A,X) ⋈ R(X,C); assumption: |L| << |R|
■ Idea: broadcast L to each node completely before the map phase begins
  □ via utilities such as Hadoop's distributed cache
  □ or the mappers read L from the cluster file system at startup
■ Mapper runs only over R (see the sketch below)
  □ step 1: read the assigned input split of R into a hash table (build phase)
  □ step 2: scan the local copy of L and find matching R tuples (probe phase)
  □ step 3: emit each such pair
  □ alternatively, read L into the hash table, then read R and probe
■ No partition / sort / reduce processing is needed; the mapper outputs the final join result
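
A minimal sketch of such a map-side broadcast join, assuming the small relation L has been shipped to every worker via Hadoop's distributed cache and both relations are stored as tab-separated text; the class name, field positions, and file layout are illustrative assumptions, not code from the deck. This variant builds the hash table from L and probes with R, and the job would run with zero reducers.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-side (broadcast) join: the small relation L is loaded into an in-memory
    // hash table in setup(), every R tuple is probed against it in map().
    public class BroadcastJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, String> lTable = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            URI[] cacheFiles = context.getCacheFiles();      // L was broadcast to every worker
            if (cacheFiles == null) return;
            for (URI cached : cacheFiles) {
                String localName = new Path(cached.getPath()).getName();
                try (BufferedReader in = new BufferedReader(new FileReader(new File(localName)))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] f = line.split("\t");       // L(A, X): f[0] = A, f[1] = X
                        lTable.put(f[1], f[0]);              // key the hash table by the join key X
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable offset, Text rTuple, Context context)
                throws IOException, InterruptedException {
            String[] f = rTuple.toString().split("\t");      // R(X, C): f[0] = X, f[1] = C
            String a = lTable.get(f[0]);                     // probe on the join key X
            if (a != null) {
                // emit the joined tuple directly; no shuffle or reduce phase is required
                context.write(new Text(f[0]), new Text(a + "\t" + f[1]));
            }
        }
    }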

  3. Repartition Join
■ Equi-join: L(A,X) ⋈ R(X,C); assumption: |L| < |R|
■ Mapper
  □ identical processing logic for L and R
  □ emits each tuple once
  □ the intermediate key is a pair of
   the value of the actual join key X
   an annotation identifying to which relation the tuple belongs (L or R)
■ Partition and sort
  □ partition by the hash value of the join key
  □ the reducer input is ordered first on the join key, then on the relation name
  □ output: a sequence of L(i), R(i) blocks of tuples for ascending join key i
■ Reduce (a simplified sketch follows below)
  □ collect all L-tuples of the current L(i) block in a hash map
  □ combine them with each R-tuple of the corresponding R(i) block
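
A simplified sketch of the reduce side of such a repartition join. To keep it short, the relation tag ("L" or "R") travels in the value rather than in a composite key with secondary sort, so both sides of a join key are buffered here; with the composite-key scheme from the slide only the L block would need buffering. Class name and tab-separated value layout are illustrative assumptions.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reduce-side (repartition) join: all tuples with the same join key arrive at
    // one reduce call; L- and R-tuples are separated by their tag and combined.
    public class RepartitionJoinReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text joinKey, Iterable<Text> taggedTuples, Context context)
                throws IOException, InterruptedException {
            List<String> lTuples = new ArrayList<>();
            List<String> rTuples = new ArrayList<>();
            for (Text t : taggedTuples) {
                String v = t.toString();                 // format: "L\t<payload>" or "R\t<payload>"
                if (v.startsWith("L\t")) {
                    lTuples.add(v.substring(2));
                } else {
                    rTuples.add(v.substring(2));
                }
            }
            // build all combinations of matching L- and R-tuples for this join key
            for (String l : lTuples) {
                for (String r : rTuples) {
                    context.write(joinKey, new Text(l + "\t" + r));
                }
            }
        }
    }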

  4. Multi-Dimensional Partitioned Join
■ Equi-join over D1(A,X), D2(B,Y), and F(C,X,Y)
  □ star schema with fact table F and dimension tables Di
■ Fragment
  □ D1 and D2 are partitioned independently
  □ the partitions for F are defined as D1 x D2
■ Replicate (see the partitioning sketch below)
  □ for an F-tuple f the partition is uniquely defined as (hash(f.x), hash(f.y))
  □ for a D1-tuple d1 there is one degree of freedom (d1.y is undefined)
   D1-tuples are therefore replicated for each possible y value
  □ symmetric for D2
■ Reduce
  □ find and emit (f, d1, d2) result tuples
  □ depending on the input sorting, different join strategies are possible
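
A small sketch of the partition assignment just described, assuming a p1 x p2 grid of reducers; the class and method names are illustrative and not tied to any framework. A fact tuple is sent to exactly one grid cell, while a dimension tuple is replicated along the dimension it leaves unconstrained.

    // Multi-dimensional partition assignment for a star join:
    // fact tuples map to one cell of the reducer grid, dimension tuples are
    // replicated across all cells of their free dimension.
    public final class StarJoinPartitioning {

        private static int cell(int xBucket, int yBucket, int p2) {
            return xBucket * p2 + yBucket;                // linearize the (x, y) grid
        }

        /** A fact tuple f(c, x, y) goes to exactly one partition. */
        public static int factPartition(String x, String y, int p1, int p2) {
            return cell(Math.floorMod(x.hashCode(), p1),
                        Math.floorMod(y.hashCode(), p2), p2);
        }

        /** A D1 tuple d1(a, x) leaves y open and is replicated to every y bucket. */
        public static int[] d1Partitions(String x, int p1, int p2) {
            int xBucket = Math.floorMod(x.hashCode(), p1);
            int[] targets = new int[p2];
            for (int yBucket = 0; yBucket < p2; yBucket++) {
                targets[yBucket] = cell(xBucket, yBucket, p2);
            }
            return targets;
        }
    }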

  5. Joins in Hadoop
■ Chart comparing the join strategies: axes show number of nodes, execution time, and selectivity; the asymmetric join and the multi-dimensional partitioned join are highlighted

  6. Parallel DBMS vs. Map/Reduce

                         Parallel DBMS                    Map/Reduce
    Schema support       yes                              no
    Indexing             yes                              no
    Programming model    declarative (SQL):               procedural (C/C++, Java, ...):
                         stating what you want            presenting an algorithm
    Optimization         yes                              no
    Scaling              1 – 500 nodes                    10 – 5,000 nodes
    Fault tolerance      limited                          good
    Execution            pipelines results                materializes results
                         between operators                between phases

  7. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

  8. Map-Reduce-Merge
■ Motivation
  □ Map/Reduce does not directly support processing multiple related, heterogeneous datasets
  □ this leads to difficulties and/or inefficiency when one must implement relational operators such as joins
■ Map-Reduce-Merge
  □ adds a merge phase whose goal is to efficiently merge data that is already partitioned and sorted (or hashed)
  □ Map-Reduce-Merge workflows are comparable to RDBMS execution plans
  □ parallel join algorithms can be implemented more easily
  □ phase signatures (an interface sketch follows below):
      map:    (k1, v1)                 -> [(k2, v2)]
      reduce: (k2, [v2])               -> (k2, [v3])
      merge:  ((k2, [v3]), (k3, [v4])) -> [(k4, v5)]
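
A hedged rendering of these three signatures as plain Java interfaces, just to make the types concrete; the interfaces are illustrative and are not the API of any actual Map-Reduce-Merge implementation.

    import java.util.List;
    import java.util.Map;

    // map: (k1, v1) -> [(k2, v2)]
    interface MrmMapper<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    // reduce: (k2, [v2]) -> (k2, [v3])
    interface MrmReducer<K2, V2, V3> {
        List<V3> reduce(K2 key, List<V2> values);
    }

    // merge: ((k2, [v3]), (k3, [v4])) -> [(k4, v5)]
    // combines two reduced outputs, e.g. the two partitioned, sorted sides of a join
    interface MrmMerger<K2, V3, K3, V4, K4, V5> {
        List<Map.Entry<K4, V5>> merge(K2 leftKey, List<V3> leftValues,
                                      K3 rightKey, List<V4> rightValues);
    }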

  9. Introducing … the Cloud

  10. In the Cloud …

  11. "The interesting thing about cloud computing is that we've redefined cloud computing to include everything that we already do. I can't think of anything that isn't cloud computing with all of these announcements. The computer industry is the only industry that is more fashion-driven than women's fashion. Maybe I'm an idiot, but I have no idea what anyone is talking about. What is it? It's complete gibberish. It's insane. When is this idiocy going to stop? "We'll make cloud computing announcements. I'm not going to fight this thing. But I don't understand what we would do differently in the light of cloud." 7/25/2011 DIMA – TU Berlin 37

  12. Steve Ballmer’s Vision of Cloud Computing

  13. What does Hadoop have to do with Cloud?
A few months back, Hamid Pirahesh and I were doing a roundtable with a customer of ours on cloud and data. We got into a set of standard issues, data security being the primary one, but when the dialog turned to Hadoop, a person raised his hand and asked: "What has Hadoop got to do with cloud?" I responded, somewhat quickly perhaps, "Nothing specific, and I am willing to have a dialog with you on Hadoop in and out of the cloud context", but it got me thinking. Is there a relationship, or not?

  14. Re-inventing the wheel - or not?

  15. Stratosphere: Parallel Analytics in the Cloud beyond Map/Reduce

  16. The Stratosphere Project*
■ Explore the power of cloud computing for complex information management applications
■ Database-inspired approach
■ Analyze, aggregate, and query textual and (semi-)structured data
■ Research and prototype a web-scale data analytics infrastructure
■ Use cases: scientific data, life sciences, linked data
■ Stack: the Stratosphere query processor ("Above the Clouds") running on Infrastructure as a Service
* FOR 1306: DFG-funded collaborative project among TU Berlin, HU Berlin, and HPI Potsdam

  17. Example: Climate Data Analysis
■ Climate data sets with up to 200 parameters, for example:
  PS,1,1,0,Pa,surface pressure
  T_2M,11,105,0,K,air_temperature
  TMAX_2M,15,105,2,K,2m maximum temperature
  TMIN_2M,16,105,2,K,2m minimum temperature
  U,33,110,0,ms-1,U-component of wind
  V,34,110,0,ms-1,V-component of wind
  QV_2M,51,105,0,kgkg-1,2m specific humidity
  CLCT,71,1,0,1,total cloud cover
■ Analysis tasks on climate data sets
  □ validate climate models
  □ locate "hot spots" in climate models (monsoon, drought, flooding)
  □ compare climate models based on different parameter settings
■ Necessary data processing operations
  □ filter
  □ aggregation (sliding window)
  □ join
  □ multi-dimensional sliding-window operations
  □ geospatial/temporal joins
  □ uncertainty
■ Data volume: about 10 TB for a region of roughly 1100 km x 950 km at 2 km resolution

  18. Further Use-Cases
■ Text mining in the biosciences
■ Cleansing of linked open data

  19. Outline
■ Architecture of the Stratosphere System
■ The PACT Programming Model
■ The Nephele Execution Engine
■ Parallelizing PACT Programs

  20. Architecture Overview
■ Three comparable software stacks:
  □ Hadoop stack: higher-level languages (JAQL, Pig, Hive), the Map/Reduce programming model, the Hadoop execution engine
  □ Dryad stack: higher-level languages (Scope, DryadLINQ), the Dryad execution engine
  □ Stratosphere stack: the PACT programming model on the Nephele execution engine; higher-level languages (JAQL?, Pig?, Hive?) are candidates for the top layer

  21. Data-Centric Parallel Programming
■ Map/Reduce
  □ schema free
  □ many semantics are hidden inside the user code (tricks required to push operations into map/reduce)
  □ single default way of parallelization
■ Relational databases (plans of operators such as selection, projection, join, grouping)
  □ schema bound (relational model)
  □ well-defined properties and requirements for parallelization
  □ flexible and optimizable
■ Goal: advance the map/reduce programming model

  22. Stratosphere in a Nutshell
■ PACT programming model
  □ Parallelization Contracts (PACTs)
  □ declarative definition of data parallelism
  □ centered around second-order functions
  □ generalization of map/reduce
■ Nephele
  □ Dryad-style execution engine
  □ evaluates dataflow graphs in parallel
  □ data is read from a distributed file system
  □ flexible engine for complex jobs
■ Stratosphere = Nephele + PACT
  □ the PACT compiler translates PACT programs into Nephele dataflow graphs
  □ combines a parallelization abstraction with flexible execution
  □ the choice of execution strategies gives optimization potential

  23. Overview
■ Parallelization Contracts (PACTs)
■ The Nephele Execution Engine
■ Compiling/Optimizing Programs
■ Related Work

  24. Intuition for Parallelization Contracts
■ Map and reduce are second-order functions
  □ they call first-order functions (the user code)
  □ they provide the first-order functions with subsets of the input data
■ Contracts define dependencies between the records (key/value pairs) that must be obeyed when splitting them into subsets
  □ cp. required partition properties
■ Map
  □ all records are independently processable
■ Reduce
  □ records with identical key must be processed together

  25. Contracts beyond Map and Reduce
■ Cross
  □ two inputs
  □ each combination of records from the two inputs is built and is independently processable
■ Match
  □ two inputs; each combination of records with equal key from the two inputs is built
  □ each pair is independently processable (see the sketch below)
■ CoGroup
  □ multiple inputs
  □ key/value pairs with identical key are grouped for each input
  □ groups of all inputs with identical key are processed together
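
To make the Match contract concrete, here is an illustrative sketch of the second-order function itself in plain Java; this is not the actual Stratosphere/PACT API, and the in-memory maps stand in for the partitioned inputs that the runtime would really provide.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.function.BiFunction;

    // "Match" as a second-order function: build every pair of records with equal
    // key from the two inputs and hand each pair to the user's first-order function.
    final class MatchContract<K, L, R, O> {

        List<O> apply(Map<K, List<L>> leftInput, Map<K, List<R>> rightInput,
                      BiFunction<L, R, O> userFunction) {
            List<O> output = new ArrayList<>();
            for (Map.Entry<K, List<L>> group : leftInput.entrySet()) {
                List<R> matches = rightInput.getOrDefault(group.getKey(), Collections.emptyList());
                for (L left : group.getValue()) {
                    for (R right : matches) {
                        // each pair is independently processable -> can run in parallel
                        output.add(userFunction.apply(left, right));
                    }
                }
            }
            return output;
        }
    }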

  26. Parallelization Contracts (PACTs)
■ A PACT is a second-order function that defines properties on the input and output data of its associated first-order function: input contract -> user code (first-order function) -> output contract
■ Input contract
  □ specifies dependencies between records (a.k.a. "what must be processed together?")
  □ generalization of map/reduce
  □ logically abstracts a (set of) communication pattern(s)
   for "reduce": repartition-by-key
   for "match": broadcast-one or repartition-by-key
■ Output contract
  □ generic properties preserved or produced by the user code (key property, sort order, partitioning, etc.)
  □ relevant to the parallelization of succeeding functions

  27. Optimizing PACT Programs
■ For certain PACTs, several distribution patterns exist that fulfill the contract
  □ the choice of the best one is up to the system
■ Created properties (like a partitioning) may be reused for later operators
  □ we need a way to find out whether they still hold after the user code
  □ output contracts are a simple way to specify that
  □ example output contracts: Same-Key, Super-Key, Unique-Key
■ Using these properties, optimization across multiple PACTs is possible
  □ a simple System-R-style optimizer approach is possible

  28. From PACT Programs to Data Flows
■ PACT code: the user function of a Match contract

    function match(Key k, Tuple val1, Tuple val2) -> (Key, Tuple)
    {
        Tuple res = val1.concat(val2);
        res = res.project(...);
        Key key = res.getColumn(1);
        return (key, res);
    }

■ Nephele code: runtime logic generated for the Match contract (grouping and communication)

    invoke():
        while (!input2.eof)
            KVPair p = input2.next();
            hashTable.put(p.key, p.value);
        while (!input1.eof)
            KVPair p = input1.next();
            KVPair t = hashTable.get(p.key);
            if (t != null)
                KVPair[] result = UF.match(p.key, p.value, t.value);
                output.write(result);
        end

■ Diagram: the PACT program (two map contracts UF1/UF2, a match UF3, a reduce UF4) is compiled into a Nephele DAG (vertices V1–V4) and then spanned into a parallel data flow, with in-memory channels and network channels between the subtasks

  29. Nephele Execution Engine

  30. Nephele Execution Engine
■ Executes Nephele schedules compiled from PACT programs
■ Design goals
  □ exploit the scalability/flexibility of clouds
  □ provide predictable performance
  □ efficient execution on 1000+ cores
  □ flexible fault-tolerance mechanisms
■ Inherently designed to run on top of an IaaS cloud
  □ heterogeneity through different types of VMs
  □ knows the cloud's pricing model (VM allocation and de-allocation)
  □ network topology inference

  31. Nephele Architecture
■ Standard master/worker pattern
■ Workers can be allocated on demand as the workload changes over time
■ Diagram: the client connects over the public network (Internet) to the compute cloud, which contains the cloud controller, the master, persistent storage, and the workers on a private/virtualized network

  32. Structure of a Nephele Schedule
■ A Nephele schedule is represented as a DAG (a structural sketch follows below)
  □ vertices represent tasks
  □ edges denote communication channels
  □ example: Input 1 (LineReaderTask.program, input s3://user:key@storage/input), Task 1 (MyTask.program), Output 1 (LineWriterTask.program, output s3://user:key@storage/outp)
■ Mandatory information for each vertex
  □ task program
  □ input/output data location (I/O vertices only)
■ Optional information for each vertex
  □ number of subtasks (degree of parallelism)
  □ number of subtasks per virtual machine
  □ type of virtual machine (#CPU cores, RAM, …)
  □ channel types
  □ sharing of virtual machines among tasks
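
A tiny structural sketch of such a schedule as plain Java objects, carrying the mandatory and optional per-vertex information listed above; this is illustrative only and not the actual Nephele API. The example wires the three vertices from the slide (the output data location is left unset because it is truncated on the slide).

    import java.util.LinkedHashMap;
    import java.util.Map;

    // One vertex of a Nephele-style schedule DAG; edges carry a channel type.
    final class ScheduleVertex {
        enum ChannelType { NETWORK, IN_MEMORY, FILE }

        final String taskProgram;                 // mandatory: the task program
        String dataLocation;                      // mandatory for I/O vertices only
        int numberOfSubtasks = 1;                 // optional: degree of parallelism
        int subtasksPerVm = 1;                    // optional
        String vmType = "default";                // optional: e.g. "m1.small"
        final Map<ScheduleVertex, ChannelType> outgoing = new LinkedHashMap<>();

        ScheduleVertex(String taskProgram) { this.taskProgram = taskProgram; }

        void connectTo(ScheduleVertex target, ChannelType channel) {
            outgoing.put(target, channel);
        }
    }

    final class ScheduleExample {
        public static void main(String[] args) {
            ScheduleVertex input = new ScheduleVertex("LineReaderTask.program");
            input.dataLocation = "s3://user:key@storage/input";
            ScheduleVertex task = new ScheduleVertex("MyTask.program");
            task.numberOfSubtasks = 2;
            ScheduleVertex output = new ScheduleVertex("LineWriterTask.program");
            input.connectTo(task, ScheduleVertex.ChannelType.NETWORK);
            task.connectTo(output, ScheduleVertex.ChannelType.IN_MEMORY);
        }
    }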

  33. Internal Schedule Representation
■ The Nephele schedule is converted into an internal representation
■ Explicit parallelization
  □ the parallelization range (mpl) is derived from the PACT
  □ the wiring of subtasks is derived from the PACT
■ Explicit assignment to virtual machines
  □ specified by ID and type (e.g. ID: 1, Type: m1.small; ID: 2, Type: m1.large)
  □ the type refers to a hardware profile

  34. Execution Stages
■ Issues with on-demand allocation:
  □ When to allocate virtual machines?
  □ When to deallocate virtual machines?
  □ No guarantee of resource availability!
■ Stages ensure three properties:
  □ the VMs of the upcoming stage are available
  □ all workers are set up and ready
  □ the data of previous stages is stored in a persistent manner

  35. Channel Types
■ Network channels (pipelined)
  □ vertices must be in the same stage
■ In-memory channels (pipelined)
  □ vertices must run on the same VM
  □ vertices must be in the same stage
■ File channels
  □ vertices must run on the same VM
  □ vertices must be in different stages

  36. Some Evaluation (1/2)
■ Demonstrates the benefits of dynamic resource allocation
■ Challenge: sort and aggregate
  □ sort 100 GB of integer numbers (from the GraySort benchmark)
  □ aggregate the top 20% of these numbers (exact result!)
■ First execution as map/reduce jobs with Hadoop
  □ three map/reduce jobs on 6 VMs (each with 8 CPU cores, 24 GB RAM)
  □ TeraSort code used for sorting
  □ custom code for aggregation
■ Second execution as map/reduce jobs with Nephele
  □ a map/reduce compatibility layer allows running Hadoop M/R programs
  □ Nephele controls the resource allocation
  □ idea: adapt the allocated resources to the required processing power

  37. First Evaluation (2/2)
■ Charts: average instance utilization [%] (USR / SYS / WAIT) and average network traffic among instances [MBit/s] over time [minutes], comparing the M/R jobs on Nephele with the M/R jobs on Hadoop
■ With Nephele, virtual machines are deallocated automatically once they are no longer needed; with Hadoop's static allocation the resource utilization is poor

  38. References
■ [WK09] D. Warneke, O. Kao: Nephele: Efficient Parallel Data Processing in the Cloud. SC-MTAGS 2009.
■ [BEH+10] D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, D. Warneke: Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing. SoCC 2010: 119-130.
■ [ABE+10] A. Alexandrov, D. Battré, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, D. Warneke: Massively Parallel Data Analysis with PACTs on Nephele. PVLDB 3(2): 1625-1628 (2010).
■ [AEH+11] A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, et al.: MapReduce and PACT - Comparing Data Parallel Programming Models. To appear at BTW 2011.

  39. Ongoing Work
■ Adaptive fault tolerance (Odej Kao)
■ Robust query optimization (Volker Markl)
■ Parallelization of the PACT programming model (Volker Markl)
■ Continuous re-optimization (Johann-Christoph Freytag)
■ Validating climate simulations with Stratosphere (Volker Markl)
■ Text analysis with Stratosphere (Ulf Leser)
■ Data cleansing with Stratosphere (Felix Naumann)
■ JAQL on Stratosphere: student project at TUB
■ Open source release: Nephele + PACT (TUB, HPI, HU)

  40. Overview
■ Introduction
■ Big Data Analytics
■ Map/Reduce/Merge
■ Introducing … the Cloud
■ Stratosphere (PACT and Nephele)
■ Demo (Thomas Bodner, Matthias Ringwald)
■ Mahout and Scalable Data Mining (Sebastian Schelter)

  41. The Information Revolution
■ http://mediatedcultures.net/ksudigg/?p=120

  42. Demo Screenshots: Weblog Analysis Query

  43. Weblog Query and Plan

    SELECT r.url, r.rank, r.avg_duration
    FROM Documents d JOIN Rankings r ON r.url = d.url
    WHERE CONTAINS(d.text, [keywords])
      AND r.rank > [rank]
      AND NOT EXISTS (SELECT *
                      FROM Visits v
                      WHERE v.url = d.url
                        AND v.date < [date]);

  44. Weblog Query – Job Preview

  45. Weblog Query – Optimized Plan

  46. Weblog Query – Nephele Schedule in Execution

  47. Demo Screenshots: Enumerating Triangles for Social Network Mining

  48. Enumerating Triangles – Graph and Job

  49. Enumerating Triangles – Job Preview

  50. Enumerating Triangles – Optimized Plan

  51. Enumerating Triangles – Nephele Schedule in Execution

  52. Apache Mahout: Scalable Data Mining (Sebastian Schelter)

  53. Apache Mahout: Overview
■ What is Apache Mahout?
  □ an Apache Software Foundation project aiming to create scalable machine learning libraries under the Apache License
  □ focus on scalability, not a competitor for R or Weka
  □ in use at Adobe, Amazon, AOL, Foursquare, Mendeley, Twitter, Yahoo
■ Scalability
  □ runtime t is proportional to problem size P over resource size R: t ∝ P / R
  □ does not imply Hadoop or parallelism, although the majority of implementations use Map/Reduce

  54. Apache Mahout: Clustering
■ Clustering
  □ unsupervised learning: assign a set of data points into subsets (called clusters) so that points in the same cluster are similar in some sense
■ Algorithms
  □ K-Means
  □ Fuzzy K-Means
  □ Canopy
  □ Mean Shift
  □ Dirichlet Process
  □ Spectral Clustering

  55. Apache Mahout: Classification
■ Classification
  □ supervised learning: learn a decision function that predicts labels y on data points x, given a set of training samples {(x, y)}
■ Algorithms
  □ Logistic Regression (sequential but fast)
  □ Naive Bayes / Complementary Naive Bayes
  □ Random Forests

  56. Apache Mahout: Collaborative Filtering
■ Collaborative filtering
  □ approach to recommendation mining: given a user's preferences for items, guess which other items would be highly preferred
■ Algorithms
  □ neighborhood methods: item-based collaborative filtering (a scoring sketch follows below)
  □ latent factor models: matrix factorization using "Alternating Least Squares"
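
A minimal sketch of the item-based neighborhood method named above: a user's unknown preference for an item is estimated as the similarity-weighted average of that user's existing ratings for similar items. The data structures and precomputed similarities are illustrative assumptions, not Mahout code.

    import java.util.Collections;
    import java.util.Map;

    // Item-based collaborative filtering: score one (user, item) pair from the
    // user's ratings and a precomputed item-item similarity table.
    final class ItemBasedScorer {

        /** itemSimilarity.get(i).get(j) = similarity between items i and j. */
        static double predict(int item,
                              Map<Integer, Double> userRatings,
                              Map<Integer, Map<Integer, Double>> itemSimilarity) {
            Map<Integer, Double> sims = itemSimilarity.getOrDefault(item, Collections.emptyMap());
            double weightedSum = 0.0;
            double similaritySum = 0.0;
            for (Map.Entry<Integer, Double> rated : userRatings.entrySet()) {
                Double sim = sims.get(rated.getKey());
                if (sim != null) {
                    weightedSum += sim * rated.getValue();   // weight the rating by item similarity
                    similaritySum += Math.abs(sim);
                }
            }
            return similaritySum == 0.0 ? Double.NaN : weightedSum / similaritySum;
        }
    }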

  57. Apache Mahout: Singular Value Decomposition
■ Singular Value Decomposition
  □ matrix decomposition technique used to create an optimal low-rank approximation of a matrix
  □ used for dimensionality reduction, unsupervised feature selection, "Latent Semantic Indexing"
■ Algorithms
  □ Lanczos algorithm
  □ Stochastic SVD

  58. Scalable Data Mining: Comparing Implementations of Data Mining Algorithms in Hadoop/Mahout and Nephele/PACT

  59. Problem Description: Pairwise Row Similarity Computation
■ Computes the pairwise similarities of the rows (or columns) of a sparse matrix using a predefined similarity function (a small sketch follows below)
  □ used for computing document similarities in large corpora
  □ used to precompute item-item similarities for recommendations (collaborative filtering)
  □ the similarity function can be cosine, Pearson correlation, log-likelihood ratio, Jaccard coefficient, …
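
A minimal sketch of the computation itself, using cosine similarity on sparse rows; a Mahout or PACT job would distribute this over the cluster, and the quadratic all-pairs loop is only for illustration. The row layout (row id -> {column -> value}) and the class name are assumptions.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Pairwise row similarity on a sparse matrix with the cosine measure.
    final class RowSimilarity {

        static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (Map.Entry<Integer, Double> e : a.entrySet()) {
                Double other = b.get(e.getKey());
                if (other != null) dot += e.getValue() * other;   // only co-occurring columns contribute
                normA += e.getValue() * e.getValue();
            }
            for (double v : b.values()) normB += v * v;
            return dot == 0.0 ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        /** All row pairs with similarity >= minSimilarity (O(n^2) over the rows). */
        static Map<String, Double> allPairs(Map<Integer, Map<Integer, Double>> rows,
                                            double minSimilarity) {
            Map<String, Double> result = new HashMap<>();
            List<Integer> ids = new ArrayList<>(rows.keySet());
            for (int i = 0; i < ids.size(); i++) {
                for (int j = i + 1; j < ids.size(); j++) {
                    double s = cosine(rows.get(ids.get(i)), rows.get(ids.get(j)));
                    if (s >= minSimilarity) {
                        result.put(ids.get(i) + "," + ids.get(j), s);
                    }
                }
            }
            return result;
        }
    }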

  60. Map/Reduce
■ Map/Reduce – step 1
  □ compute similarity-specific row weights
  □ transpose the matrix, thereby creating an inverted index
■ Map/Reduce – step 2
  □ map out all pairs of co-occurring values
  □ collect all co-occurring values per row pair and compute the similarity value
■ Map/Reduce – step 3
  □ use secondary sort to keep only the k most similar rows
■ PACT: the corresponding PACT data flow is shown on the slide

  61. Comparison
■ Equivalent implementations in Mahout and PACT
  □ the problem maps relatively well to the Map/Reduce paradigm
  □ insight: standard Map/Reduce code can be ported to Nephele/PACT with very little effort
  □ output contracts and memory forwards offer hooks for performance improvements (unfortunately not applicable in this particular use case)

  62. Problem Description: K-Means
■ Simple iterative clustering algorithm
  □ uses a predefined number of clusters (k)
  □ starts with a random selection of cluster centers
  □ assigns points to the nearest cluster
  □ recomputes the cluster centers and iterates until convergence

  63. Mahout
■ Initialization
  □ generate k random cluster centers from the data points (optional)
  □ put the centers into the distributed cache
■ Map
  □ find the nearest cluster for each data point
  □ emit (cluster id, data point)
■ Combine
  □ partially aggregate the distances per cluster
■ Reduce
  □ compute the new centroid for each cluster
■ Repeat until convergence; output the converged cluster centers or the centers after n iterations
■ Optionally output the clustered data points
(A compact sketch of one iteration follows below.)
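
A compact sketch of one K-Means iteration as described above, in plain Java rather than as a Mahout job: the assignment loop corresponds to the map (plus combine, via partial sums per cluster) and the centroid recomputation to the reduce. The point representation and class name are illustrative.

    import java.util.List;

    // One K-Means iteration: assign points to the nearest center, then recompute centroids.
    final class KMeansIteration {

        static double squaredDistance(double[] p, double[] c) {
            double d = 0.0;
            for (int i = 0; i < p.length; i++) d += (p[i] - c[i]) * (p[i] - c[i]);
            return d;
        }

        static int nearestCenter(double[] point, double[][] centers) {
            int best = 0;
            for (int k = 1; k < centers.length; k++) {
                if (squaredDistance(point, centers[k]) < squaredDistance(point, centers[best])) {
                    best = k;
                }
            }
            return best;
        }

        /** Returns the recomputed centroids after one pass over the points. */
        static double[][] step(List<double[]> points, double[][] centers) {
            double[][] sums = new double[centers.length][centers[0].length];
            int[] counts = new int[centers.length];
            for (double[] p : points) {                  // "map" + "combine": partial sums per cluster
                int k = nearestCenter(p, centers);
                counts[k]++;
                for (int i = 0; i < p.length; i++) sums[k][i] += p[i];
            }
            for (int k = 0; k < centers.length; k++) {   // "reduce": new centroid per cluster
                if (counts[k] > 0) {
                    for (int i = 0; i < sums[k].length; i++) sums[k][i] /= counts[k];
                } else {
                    sums[k] = centers[k];                // keep the old center for an empty cluster
                }
            }
            return sums;
        }
    }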

  64. Stratosphere Implementation
■ Source: www.stratosphere.eu

  65. Code Analysis: Comparison of the Implementations
■ The actual execution plans in the underlying distributed systems are nearly equivalent
■ The Stratosphere implementation is more intuitive and closer to the mathematical formulation of the algorithm

  66. Problem Description: Naïve Bayes
■ Simple classification algorithm based on Bayes' theorem
■ General Naïve Bayes
  □ assumes feature independence
  □ often yields good results even if this assumption does not hold
■ Mahout's version of Naïve Bayes
  □ specialized approach for document classification
  □ based on the tf-idf weight metric

  67. M/R Overview
■ Classification
  □ straightforward approach: simply reads the complete model into memory
  □ classification is done in the mapper; the reducer only sums up statistics for the confusion matrix
■ Trainer
  □ much higher complexity
  □ needs to count documents, features, features per document, and features per corpus (these counts feed the tf-idf weights sketched below)
  □ Mahout's implementation is optimized by exploiting Hadoop-specific features such as secondary sort and reading results into memory from the cluster file system
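
A minimal sketch of the tf-idf weights that Mahout's Naïve Bayes trainer is described as being based on; this is not Mahout code, and the bag-of-terms document representation is an illustrative assumption.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;

    // Tf-idf weights for a small in-memory corpus.
    final class TfIdf {

        /** tf-idf weight for every (document, term) pair of the corpus. */
        static Map<Integer, Map<String, Double>> weights(List<List<String>> docs) {
            // document frequency: in how many documents does each term occur?
            Map<String, Integer> df = new HashMap<>();
            for (List<String> doc : docs) {
                for (String term : new HashSet<>(doc)) {
                    df.merge(term, 1, Integer::sum);
                }
            }

            Map<Integer, Map<String, Double>> tfidf = new HashMap<>();
            for (int d = 0; d < docs.size(); d++) {
                Map<String, Integer> tf = new HashMap<>();       // term frequency inside document d
                for (String term : docs.get(d)) tf.merge(term, 1, Integer::sum);

                Map<String, Double> w = new HashMap<>();
                for (Map.Entry<String, Integer> e : tf.entrySet()) {
                    double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                    w.put(e.getKey(), e.getValue() * idf);       // tf * idf
                }
                tfidf.put(d, w);
            }
            return tfidf;
        }
    }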

  68. M/R Trainer Overview
■ Diagram: the training data passes through a Feature Extractor and a series of counting jobs (TermDoc Counter, Word Frequency Counter, Doc Counter, Feature Counter, Vocab Counter), whose outputs (termDocC, wordFreq, docC, featureC, vocabC) feed the Tf-Idf calculation, the Weight Summer, and the Theta Normalizer (tfIdf, thetaNorm)

  69. PACT Trainer Overview
■ The PACT implementation looks even more complex, but
  □ PACTs can be combined in a much more fine-grained manner
  □ as PACT offers the ability to use local memory forwards, more and higher-level functions such as Cross and Match can be used
  □ fewer framework-specific tweaks are necessary for an efficient implementation
  □ the visualized execution plan is much more similar to the algorithmic formulation of computing several counts and combining them into a model in the end
  □ subcalculations can be seen and unit-tested in isolation

  70. PACT Trainer Overview

  71. Hot Path
■ Diagram: the hot path of the plan, annotated with intermediate data volumes of 7.4 GB, 8 kB, 3.53 GB, 84 kB, 5.89 GB, 14.8 GB, 5.89 GB, and 5 kB

  72. PACT Trainer Overview
■ Future work: the PACT implementation can still be tuned by
  □ sampling the input data
  □ more flexible memory management in Stratosphere
  □ employing the context concept of PACTs for simpler distribution of computed parameters

  73. Thank You
■ Danke (German), Merci (French), Gracias (Spanish), Grazie (Italian), Obrigado (Brazilian Portuguese), and thanks in English, Arabic, Hindi, Tamil, Thai, Russian, Japanese, Korean, Simplified and Traditional Chinese

  74. Parallel Data Flow Languages: Programming in a More Abstract Way

  75. Introduction
■ The MapReduce paradigm is too low-level
  □ only two declarative primitives (map + reduce)
  □ extremely rigid (one input, two-stage data flow)
  □ custom code needed, e.g., for projection and filtering
   code is difficult to reuse and maintain
  □ impedes optimizations
■ Combination of high-level declarative querying and low-level programming with MapReduce
■ Dataflow programming languages
  □ Hive
  □ JAQL
  □ Pig

  76. Hive
■ Data warehouse infrastructure built on top of Hadoop, providing:
  □ data summarization
  □ ad hoc querying
■ Simple query language: Hive QL (based on SQL)
■ Extensible via custom mappers and reducers
■ Subproject of Hadoop
■ No special "Hive format"
■ http://hadoop.apache.org/hive/
