Enabling Operator Reordering in Data Flow Programs Through Static Code Analysis
XLDI 2012
Fabian Hueske, Aljoscha Krettek, Kostas Tzoumas
Database Systems and Information Management, Technische Universität Berlin
aljoscha.krettek@campus.tu-berlin.de
September 9th, 2012
Agenda
◮ Motivation
◮ Operator Reordering
◮ Static Code Analysis
◮ Conclusion
Motivation: Big Data Analytics
◮ “Big Data” revolution
  ◮ Huge amounts of machine- and human-generated data, often semi-structured
  ◮ Need for “deep” analytics beyond simple BI queries
◮ A new breed of parallel data management systems
  ◮ Hadoop, Stratosphere, Asterix, SCOPE, etc.
◮ Common themes in programming models
  ◮ Data flows composed (in part) of functions written in arbitrary imperative code
  ◮ Also seen in modern MPP SQL systems (Greenplum, Aster)
  ◮ Allows more powerful analytics on diverse data sets
Stratosphere
[Figure: Stratosphere system overview. Labels: Scientific Data, Life Sciences, Linked Data, ...; Query Processor; Compiler; PACT Optimizer; Nephele. Example query shown: $res = filter $e in $emp where $e.income > 30000;]
The PACT Programming Model
◮ Generalization of MapReduce
◮ Data flow consisting of data sources, sinks, and operators
◮ Operators consist of
  ◮ A second-order function (SOF) signature from a fixed set of system-defined SOFs, the PArallelization ConTracts
  ◮ A first-order function (FOF) written by the programmer in Java
◮ Intermediate representation, but also exposed to the user
  ◮ E.g., to implement functionality not supported by the query language
[Figure: example PACT data flow, sources to sink]
  Src 1 [A, B] → Map(f1): C ← A + B → [A, B, C]
  Src 2 [D, E] → Map(f2): filter(E > 3) → [D, E]
  Match(f3, A, D) → [A, B, C, D, E]
  Reduce(f4, A): sum(B) → [A, B]
  Sink 1
Automatic Parallelization
◮ Knowledge of the PACT signature permits automatic parallelization
◮ E.g., for the Match operator
  ◮ Choice of broadcast, partition, SFR, etc.
  ◮ Sort-merge or hash-based physical implementation
◮ Cascades-style optimizer
  ◮ Partitioning strategies propagated top-down as interesting properties
[Figure: the example data flow (Src 1, Src 2, Map(f1), Map(f2), Match(f3, A, D), Reduce(f4, A), Sink 1) annotated with physical strategies: fifo, partition/sort(A), broadcast, probeHT, buildHT]
Need for Operator Reordering
◮ Operator reordering may reduce the size of intermediate data sets
◮ May introduce new opportunities for parallelization strategies
◮ For optimal execution, need to consider operator order, parallelization strategies, and physical execution in one step
◮ SOF signature is not enough: need to look inside the FOF
[Figure: reordered data flow]
  Src 1 [A, B] → Map(f1): C ← A + B → Reduce(f4, A): sum(B)
  Src 2 [D, E] → Map(f2): filter(E > 3)
  Match(f3, A, D) → Sink 1
Experimental Results
[Figure: bar chart of runtime in seconds (axis 0 to 12000) comparing Best Order and Worst Order for TPC-H Q7, Clickstream Processing, and Textmining. Annotated ratios: x10.0, x1.8, x7.1]
Reordering Conditions
We can reorder operators when we know certain properties of the user-defined code [1].
Define:
◮ Read set: attributes that might influence the FOF's output
◮ Write set: attributes that might have a different value after application of the FOF
Example, Map-Map reordering:
◮ Two Map operators can be reordered if their FOFs operate on distinct values or have only read-read conflicts
Asking the programmer to specify read and write sets is too cumbersome, so we estimate them using static code analysis on generic FOFs.
[1] Opening the Black Boxes in Data Flow Optimization (VLDB 2012)
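The Map-Map condition can be read as a disjointness test on read and write sets. The following is a minimal sketch under that reading, not the Stratosphere implementation: the class name `MapReorderCheck`, the method `canReorder`, and the attribute sets in `main` are hypothetical and chosen only for illustration.

```java
import java.util.Collections;
import java.util.Set;

public class MapReorderCheck {
    /**
     * Returns true when the two FOFs have no read-write, write-read,
     * or write-write conflict. Overlapping read sets are allowed
     * (read-read conflicts only).
     */
    static boolean canReorder(Set<String> read1, Set<String> write1,
                              Set<String> read2, Set<String> write2) {
        return Collections.disjoint(read1, write2)
            && Collections.disjoint(write1, read2)
            && Collections.disjoint(write1, write2);
    }

    public static void main(String[] args) {
        // Hypothetical sets: first FOF reads A and B and writes C,
        // second FOF only reads E.
        System.out.println(canReorder(
            Set.of("A", "B"), Set.of("C"),
            Set.of("E"), Set.of()));          // true

        // Hypothetical sets: second FOF reads C, which the first one writes.
        System.out.println(canReorder(
            Set.of("A", "B"), Set.of("C"),
            Set.of("C"), Set.of()));          // false
    }
}
```

Because the estimated sets are conservative ("might influence", "might have a different value"), a check like this can only reject a safe reordering, never accept an unsafe one.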
Example FOF
Fixed API for dealing with records: create, copy, get, set, setNull, and union.
The read set is easily determined by looking at all get statements. The write set depends on the schema of the data:
◮ Determine four other sets: origin, write, copy, projection
◮ Generate the final write set from these and schema information

    void match(Record left,
               Record right,
               Collector col) {
      Record out = copy(left);
      if (right.get(F) > 3) {
        out.set(D, right.get(D));
      } else {
        out.setNull(A);
      }
      out.set(E, right.get(E));
      out.set(F, 42);
      col.emit(out);
    }
Example FOF (cont.)
Sets derived for the match FOF of the Example FOF slide:
Schema: Left [A, B, C], Right [D, E, F]
Origin: {1}
Explicit projection l: {A}
Explicit copy r: {E}
Explicit write l: {F}
Explicit write r: {}
Final write set l: {A, F}
Final write set r: {D, F}
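The slides do not spell out how the final write sets follow from the four sets and the schema. One rule that reproduces the values above is sketched here, as an assumption rather than the authors' definition: for an origin input (copied wholesale into the output) only projected and written attributes change, while for any other input every schema attribute that is not explicitly copied counts as changed. The class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

public class FinalWriteSet {
    /**
     * Assumed rule, inferred from the slide's example only:
     *   origin input:     write ∪ projection
     *   non-origin input: write ∪ (schema \ copy)
     */
    static Set<String> finalWriteSet(boolean isOrigin, Set<String> schema,
                                     Set<String> write, Set<String> copy,
                                     Set<String> projection) {
        Set<String> result = new TreeSet<>(write);
        if (isOrigin) {
            // Everything else is carried over implicitly by copy(input).
            result.addAll(projection);
        } else {
            // Attributes not explicitly copied do not survive unchanged.
            Set<String> notCopied = new TreeSet<>(schema);
            notCopied.removeAll(copy);
            result.addAll(notCopied);
        }
        return result;
    }

    public static void main(String[] args) {
        // Left input [A, B, C] is the origin; projection {A}, write {F}.
        System.out.println(finalWriteSet(true, Set.of("A", "B", "C"),
            Set.of("F"), Set.of(), Set.of("A")));      // [A, F]

        // Right input [D, E, F] is not the origin; copy {E}, write {}.
        System.out.println(finalWriteSet(false, Set.of("D", "E", "F"),
            Set.of(), Set.of("E"), Set.of()));         // [D, F]
    }
}
```

Note that D ends up in the right input's write set even though one branch copies it: the copy happens on only one path, so the conservative answer is that D might change.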
Code Analysis
The difficult part is determining the origin, write, copy, and projection sets for a user-defined FOF from the control flow graph (CFG).
The solution is a recursive algorithm that builds the four sets:
◮ Start from the emit statements and traverse the CFG upwards
◮ The sets at one node in the CFG depend on the sets of the predecessors and the nature of the statement
[Figure: CFG of the match FOF]
  Record out = copy(left)
  if right.get(F) > 3
    branch 1: out.set(D, right.get(D))
    branch 2: out.setNull(A)
  out.set(E, right.get(E))
  out.set(F, 42)
  col.emit(out)
Code Analysis (cont.)
Final recursion cases:
  $or = create()   → (∅, ∅, ∅, ∅)
  $or = copy($ir)  → (IN($ir), ∅, ∅, ∅)
For other statements, merge the sets of the predecessors and then modify them depending on the type of statement:
  $or.set(n, $ir.get(n))  → add n to copy set
  $or.set(n, x)           → add n to write set
  $or.setNull(n)          → add n to projection set
CFG of the match FOF, annotated with (origin, write, copy, projection):
  Record out = copy(left)     ({1}, ∅, ∅, ∅)
  if right.get(F) > 3         ({1}, ∅, ∅, ∅)
  out.set(D, right.get(D))    ({1}, ∅, {D}, ∅)
  out.setNull(A)              ({1}, ∅, ∅, {A})
  out.set(E, right.get(E))    ({1}, ∅, {E}, {A})
  out.set(F, 42)              ({1}, {F}, {E}, {A})
  col.emit(out)               ({1}, {F}, {E}, {A})
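The per-statement rules can be replayed on the example as a small sketch. It applies the rules in program order to the match FOF by hand and does not implement the recursive upward CFG traversal. The merge rule is an assumption: the annotated CFG shows that a copy made on only one branch is dropped (D) while a projection on one branch is kept (A), so the sketch intersects the copy sets and unions the projection sets; treating origin like copy and write like projection is a further guess. All class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

public class FofSetsSketch {
    /** Tuple in the slide's order: (origin, write, copy, projection). */
    static final class Sets {
        final Set<Integer> origin = new TreeSet<>();
        final Set<String> write = new TreeSet<>();
        final Set<String> copy = new TreeSet<>();
        final Set<String> projection = new TreeSet<>();

        Sets duplicate() {
            Sets s = new Sets();
            s.origin.addAll(origin);
            s.write.addAll(write);
            s.copy.addAll(copy);
            s.projection.addAll(projection);
            return s;
        }

        @Override public String toString() {
            return "(" + origin + ", " + write + ", " + copy + ", " + projection + ")";
        }
    }

    // $or = create() -> all four sets empty
    static Sets create() { return new Sets(); }

    // $or = copy($ir) -> origin holds the index of the copied input
    static Sets copyOf(int inputIndex) {
        Sets s = new Sets();
        s.origin.add(inputIndex);
        return s;
    }

    // $or.set(n, $ir.get(n)) -> add n to copy set
    static Sets setFromInput(Sets in, String n) { Sets s = in.duplicate(); s.copy.add(n); return s; }

    // $or.set(n, x) -> add n to write set
    static Sets setValue(Sets in, String n) { Sets s = in.duplicate(); s.write.add(n); return s; }

    // $or.setNull(n) -> add n to projection set
    static Sets setNull(Sets in, String n) { Sets s = in.duplicate(); s.projection.add(n); return s; }

    // Assumed merge: keep origin/copy facts only if they hold on both
    // paths, keep write/projection facts if they hold on either path.
    static Sets merge(Sets a, Sets b) {
        Sets s = a.duplicate();
        s.origin.retainAll(b.origin);
        s.copy.retainAll(b.copy);
        s.write.addAll(b.write);
        s.projection.addAll(b.projection);
        return s;
    }

    /** Replays the match FOF from the slides statement by statement. */
    static Sets analyzeExample() {
        Sets start = copyOf(1);                       // Record out = copy(left)
        Sets thenBranch = setFromInput(start, "D");   // out.set(D, right.get(D))
        Sets elseBranch = setNull(start, "A");        // out.setNull(A)
        Sets joined = merge(thenBranch, elseBranch);
        Sets afterE = setFromInput(joined, "E");      // out.set(E, right.get(E))
        return setValue(afterE, "F");                 // out.set(F, 42)
    }

    public static void main(String[] args) {
        System.out.println(analyzeExample());         // ([1], [F], [E], [A])
    }
}
```

The result matches the tuple shown at col.emit(out): origin {1}, write {F}, copy {E}, projection {A}.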
Conclusion
◮ Reordering leads to potentially significant benefits
  ◮ Up to 10x for relational and non-relational tasks in our experiments
◮ Our static code analysis algorithm can automatically derive reordering properties of generic user-written Java code
  ◮ Difficulties arise in non-linear CFGs (if, loops) and also because the schema of input records changes with reordering
  ◮ Safety achieved through conservatism
◮ Related work: Manimal [2]
  ◮ Techniques are complementary
[2] Eaman Jahani, Michael J. Cafarella, Christopher Ré: Automatic Optimization for MapReduce Programs. PVLDB 4(6): 385-396 (2011)
Thank you! www.stratosphere.eu (New open source release available)