stratosphere for hadoop users
play

Stratosphere for Hadoop Users Potsdam, January 03, 2012 Arvid Heise - PowerPoint PPT Presentation

Stratosphere for Hadoop Users Potsdam, January 03, 2012 Arvid Heise Outline 2 1 Overview over Stratosphere 2 Dataflow Orientation 3 Tuple-based Data Model 4 Other Differences 5 Seminar Organization Arvid Heise | Scalable Data Analysis


  1. Stratosphere for Hadoop Users Potsdam, January 03, 2012 Arvid Heise

  2. Outline 2 1 Overview over Stratosphere 2 Dataflow Orientation 3 Tuple-based Data Model 4 Other Differences 5 Seminar Organization Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  3. Stratosphere Stack 3 Simple Script Simple Parser Higher-Level Language SOPREMO Compiler SOPREMO Plan Programming PACT Optimizer PACT Program Model Execution Execution Graph Nephele Scheduler Engine Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  4. Pact (PArallelization ConTracts) 4 Parallel programming model Implementation and generalization of Map/Reduce Similar interface as Hadoop Defines the parallelization semantics of tasks Pact plan is dataflow-oriented Pact optimizes plans and compiles them to execution graphs for Nephele Alexandrov et al. 2010. MapReduce and Pact - Comparing Data Parallel Programming Models. Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  5. Hadoop and Stratosphere Job 5 SELECT ∗ FROM Documents d JOIN Rankings r ON r . u r l = d . u r l WHERE CONTAINS( d . text , [ keywords ] ) AND r . rank > [ rank ] AND NOT EXISTS ( SELECT ∗ FROM V i s i t s v WHERE v . u r l = d . u r l AND v . v i s i t D a t e = CURDATE( ) ) ; d r d r v U N Y * + I Q E - K E U * + § MAP * :CONTAINS [keywords]: * MAP MAP + :rank > [rank]: * CONTAINS [keywords] rank > [rank] * * SAME-KEY REDUCE * * IF : * * MATCH MAP * IF : * * visitDate = CURDATE() SAME-KEY j v * * * § COGROUP MAP IF NONE : * * § * :visitDate = CURDATE(): * * : SAME-KEY * * * REDUCE IF none : * * sink * sink + § * * * ip_addr (ip_addr, url, url (url, content) rank (rank, url, url () url (rank, url, avg_duration) date, ad_revenue, ...) avg_duration) Battré et al. 2010. Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  6. Hadoop and Stratosphere Job 5 d r d r v U Y N E I K * + Q U E - + § * MAP * :CONTAINS [keywords]: * MAP MAP + * :rank > [rank]: CONTAINS [keywords] rank > [rank] * * SAME-KEY REDUCE * * * * IF : MATCH MAP * * * IF : visitDate = CURDATE() SAME-KEY v j * * * § COGROUP MAP * IF NONE : * :visitDate = CURDATE(): § * * * : SAME-KEY * * * REDUCE sink * * IF none : * sink + § * * * ip_addr (ip_addr, url, url () url (rank, url, rank (rank, url, url (url, content) date, ad_revenue, ...) avg_duration) avg_duration) Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  7. Nephele 6 Executes an Execution Graph Decides for each task how many instances are appropriate Assigns task instances to computation units Manages fault tolerance and adapts to changes Daniel Warneke and Odej Kao. 2009. Nephele: Efficient Parallel Data Processing in the Cloud Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  8. Sopremo and Simple 7 High level language layer Simple = query language Sopremo = semi-structured data model (JSON) and operators Extensible operators for several use cases Text Mining, Data Cleansing, Data Mining Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  9. Outline 8 1 Overview over Stratosphere 2 Dataflow Orientation 3 Tuple-based Data Model 4 Other Differences 5 Seminar Organization Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  10. Data Analysis Program 9 Hadoop Driver program + multiple jobs Driver program manually connects input and outputs d r d r v U N I E K Y * + Q U E - Fixed pipeline * + § MAP * * :CONTAINS [keywords]: MAP MAP + * :rank > [rank]: * CONTAINS [keywords] rank > [rank] * SAME-KEY 1 Job = 1 Map + 1 Reduce REDUCE * * IF : * * MATCH MAP * IF : * * visitDate = CURDATE() SAME-KEY j v * * * § COGROUP MAP IF NONE : * * :visitDate = CURDATE(): § * : * * SAME-KEY * * Stratosphere * REDUCE IF none : * * sink * sink Directed acyclic graph of arbitrary Pacts + § * * * rank (rank, url, ip_addr (ip_addr, url, url () url (rank, url, url (url, content) avg_duration) date, ad_revenue, ...) avg_duration) Explicit data sources and sinks Pact also support two inputs (for join-like operations) Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  11. Map/Reduce in Stratosphere 10 Works same as Hadoop But the semantics are interpreted differently Each Pact defines data dependencies Map: each tuple can be treated separately Reduce: tuple with same key are grouped and guaranteed to be processed by same reducer Key Value Key Value Input Independent Input Independent Subsets Subsets Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  12. Two Input Pacts 11 Currently three additional Pacts for two inputs Cross: make all possible pairs Match: find all matching pairs CoGroup: group all matching tuples All pairs/groups are treated independently Input B Input B Input B Input A Independent Input A Independent Input A Independent Subsets Subsets Subsets Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  13. Comparison 12 Hadoop Often workarounds necessary in a complex M/R program Error-prone, re-occurring manual data partitioning Pattern evolved for joins etc. (Data mining book) Fixed pipeline allows fine-grain tweaks (exploits) Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  14. Comparison 12 T Data Load Phase L T 1 PPartblock T 6 Replicate Replicate Replicate Replicate Replicate Replicate H1 Fetch Fetch Fetch H2 H3 Fetch Fetch Fetch H4 Store Store Store . . . Store Store Store . . . Scan Scan Scan PPartsplit PPartsplit PPartsplit T 1 T 5 T 2 T 4 T 3 T 6 M1 Union M2 M3 M4 RecReaditemize RecReaditemize MMapmap MMapmap PPartmem Map Phase LPartsh LPartsh LPartsh LPartsh Sortcmp Sortcmp Sortcmp Sortcmp SortGrpgrp SortGrpgrp SortGrpgrp SortGrpgrp MMapcombine MMapcombine MMapcombine . . . MMapcombine . . . Store Store Store Mergecmp SortGrpgrp MMapcombine Store Store PPartsh PPartsh Shu ffl e Phase Fetch Fetch Fetch Fetch Buffer Buffer Buffer Buffer Store Store Merge Store Mergecmp . . . Reduce Phase SortGrpgrp MMapreduce R1 Store R2 T ′ T ′ 1 2 Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  15. Comparison 12 Hadoop Often workarounds necessary in a complex M/R program Error-prone, re-occurring manual data partitioning Pattern evolved for joins etc. (Data mining book) Fixed pipeline allows fine-grain tweaks (exploits) Stratosphere More expressiveness for complex data operations Maintains dataflow semantics to some degree Allows optimization and different shipping strategy Less hooks than Hadoop Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  16. Optimization 13 Inspired by traditional query optimization Choose best join/shipping strategy for plan Example Task: Alternative 1 Example Task: Alternative 2 orders lineitem orders lineitem Y U E Y N K U E I Q U E - N I K Q U E - Local Forward Local Forward Local Forward Local Forward MAP MAP MAP MAP none none none none SAME-KEY SAME-KEY SAME-KEY SAME-KEY Repartition Repartition Broadcast Local Forward MATCH MATCH sort-merge sort-merge SUPER-KEY SUPER-KEY COMBINE Local Forward combining-sort REDUCE Repartition combining-sort REDUCE Local Forward combining-sort sink Local Forward sink Reorder Pacts Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  17. Outline 14 1 Overview over Stratosphere 2 Dataflow Orientation 3 Tuple-based Data Model 4 Other Differences 5 Seminar Organization Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  18. Switch to Tuple-based Model 15 Bleeding Edge! Check pact-examples subproject Typical Hadoop job: Map: transforms/filters data, sets key Reduce: combines data, unsets key Preceding maps limit reordering Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  19. Switch to Tuple-based Model 15 Bleeding Edge! Check pact-examples subproject Typical Hadoop job: Map: transforms/filters data, sets key Reduce: combines data, unsets key Preceding maps limit reordering Stratosphere uses tuples instead of k/v-pairs User should set keys as soon as possible for ALL Pacts Each Pact is annotated to define the "key" Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

  20. Reduce Example 16 public static class CountWords extends ReduceStub { 1 public void reduce(Iterator<PactRecord> records, 2 Collector out){ PactRecord element = null ; 3 int sum = 0; 4 while (records.hasNext()) { 5 element = records.next(); 6 PactInteger count = 7 element.getField(1, PactInteger. class ); 8 sum += count.getValue(); 9 10 } 11 element.setField(1, new PactInteger(sum)); 12 out.collect(element); 13 } 14 } 15 16 ... ReduceContract reducer = new ReduceContract( 17 CountWords. class , // <-- UDF 18 PactString. class , // <-- key class 19 0, // <-- key index 20 21 mapper); // <-- input Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012

Recommend


More recommend