Efficient Data-Parallel Cumulative Aggregates for Large-Scale Machine Learning – Presentation Slides (BTW 2019)



  1. Efficient Data-Parallel Cumulative Aggregates for Large-Scale Machine Learning
     Matthias Boehm (1), Alexandre V. Evfimievski (2), Berthold Reinwald (2)
     (1) Graz University of Technology; Graz, Austria
     (2) IBM Research – Almaden; San Jose, CA, USA

  2. Introduction and Motivation: Large-Scale ML
      • Large-scale machine learning as a feedback loop of data, model, and usage
        - Variety of ML applications (supervised, semi-/unsupervised)
        - Large data collection (labels from feedback, weak supervision)
      • State-of-the-art ML systems
        - Batch algorithms → data-/task-parallel operations
        - Mini-batch algorithms → parameter server
      • Data-parallel distributed operations
        - Linear algebra (matrix multiplication, element-wise operations, structural and grouping aggregations, statistical functions)
        - Meta learning (e.g., cross validation, ensembles, hyper-parameters)
        - In practice: also reorganizations and cumulative aggregates

  3. Introduction and Motivation: Cumulative Aggregates
      • Example (prefix sums): Z = cumsum(X) with Z_ij = Σ_{k≤i} X_kj = X_ij + Z_(i-1)j
        e.g., X = [1 2; 1 1; 3 4; 2 1] → Z = [1 2; 2 3; 5 7; 7 8]
      • Applications
        - #1 Iterative survival analysis: Cox regression / Kaplan-Meier
        - #2 Spatial data processing via linear algebra, cumulative histograms
        - #3 Data preprocessing: subsampling of rows / removing empty rows
      • Parallelization
        - The recursive formulation looks inherently sequential
        - Classic example for parallelization via aggregation trees over ranks (message passing or shared-memory HPC systems) [figure: MPI aggregation tree]
        - Question: how to compute cumulative aggregates efficiently and data-parallel over blocked matrices stored as unordered collections in Spark or Flink?
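      For intuition, the recurrence can be checked against a library scan; a minimal NumPy sketch using the slide's example (NumPy is used for illustration only, the paper itself targets SystemML):

        import numpy as np

        X = np.array([[1, 2], [1, 1], [3, 4], [2, 1]], dtype=float)

        # Sequential recurrence: Z[i,j] = X[i,j] + Z[i-1,j]
        Z = np.zeros_like(X)
        Z[0] = X[0]
        for i in range(1, X.shape[0]):
            Z[i] = X[i] + Z[i - 1]

        assert np.array_equal(Z, np.cumsum(X, axis=0))  # [[1,2],[2,3],[5,7],[7,8]]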

  4. Outline
      • SystemML Overview and Related Work
      • Data-Parallel Cumulative Aggregates
      • System Integration
      • Experimental Results

  5. SystemML Overview and Related Work

  6. SystemML Overview and Related Work: High-Level SystemML Architecture
      • DML (Declarative Machine Learning Language) scripts; 20+ scalable algorithms
      • APIs: command line, JMLC, Spark MLContext, Spark ML
      • Stack: Language → Compiler → Runtime
      • Backends: in-memory single node (scale-up, since 2010/11); Hadoop (since 2012) or Spark (since 2015) cluster (scale-out); GPU in progress (since 2014/16)
      • Timeline: 08/2015 open-source release, 11/2015 Apache Incubator project, 05/2017 Apache top-level project
      • Selected publications: [ICDE’11,’12,’15], [PVLDB’14,’16a,’16b,’18], [DEBull’14], [SIGMOD’15,’17,’19], [PPoPP’15], [CIDR’17], [VLDBJ’18]

  7. SystemML Overview and Related Work: Basic HOP and LOP DAG Compilation
      • Example: LinregDS (direct solve); scenario X: 10^8 x 10^3 (10^11 cells), y: 10^8 x 1 (10^8 cells); cluster config: driver mem 20 GB, exec mem 60 GB

        X = read($1);
        y = read($2);
        intercept = $3;
        lambda = 0.001;
        if( intercept == 1 ) {
          ones = matrix(1, nrow(X), 1);
          X = append(X, ones);
        }
        I = matrix(1, ncol(X), 1);
        A = t(X) %*% X + diag(I) * lambda;
        b = t(X) %*% y;
        beta = solve(A, b);
        ...
        write(beta, $4);

      • HOP DAG (after rewrites): operators annotated with memory estimates (8 KB up to 1.6 TB) drive execution type selection, e.g., both ba(+*) and r(t) as SP; dg(rand), r(diag), b(+), b(solve), and write as CP
      • LOP DAG (after rewrites): t(X)%*%X as tsmm(SP) and t(X)%*%y as mapmm(SP) over X (persisted in MEM_DISK), followed by r’(CP)
      • Hybrid runtime plans: size propagation / memory estimates; integrated CP / Spark runtime
      • Distributed matrices: fixed-size (squared) matrix blocks (X_1,1, X_2,1, ..., X_m,1); data-parallel operations

  8. SystemML Overview and Related Work: Cumulative Aggregates in ML Systems (Straw-Man Scripts and Built-in Support)
      • Straw-man scripts (DML); both qualify for update in-place, but remain too slow

        # O(n^2) under copy-on-write
        cumsumN2 = function(Matrix[Double] A)
          return(Matrix[Double] B)
        {
          B = A; csums = matrix(0, 1, ncol(A));
          for( i in 1:nrow(A) ) {
            csums = csums + A[i,];
            B[i,] = csums;
          }
        }

        # O(n log n)
        cumsumNlogN = function(Matrix[Double] A)
          return(Matrix[Double] B)
        {
          B = A; m = nrow(A); k = 1;
          while( k < m ) {
            B[(k+1):m,] = B[(k+1):m,] + B[1:(m-k),];
            k = 2 * k;
          }
        }

      • ML systems
        - Update in-place: R (reference counting), SystemML (rewrites), Julia
        - Built-ins in R, MATLAB, Julia, NumPy, SystemML (since 2014): cumsum(), cummin(), cummax(), cumprod()
      • SQL
        - SELECT Rid, V, sum(V) OVER (ORDER BY Rid) AS cumsum FROM X
        - Sequential and parallelized execution (e.g., [Leis et al., PVLDB’15])
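      The doubling scheme is the classic Hillis-Steele scan; a NumPy transliteration (an editor's sketch, not from the slides) shows why ceil(log2(m)) passes suffice:

        import numpy as np

        def cumsum_nlogn(A):
            # After the p-th pass, row i holds the sum of rows
            # max(0, i - 2^p + 1) .. i, so full column-wise prefix
            # sums emerge after ceil(log2(m)) passes.
            B = A.copy()
            m, k = A.shape[0], 1
            while k < m:
                B[k:m] = B[k:m] + B[0:m - k]  # RHS evaluated before assignment
                k *= 2
            return B

        A = np.random.rand(10, 2)
        assert np.allclose(cumsum_nlogn(A), np.cumsum(A, axis=0))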

  9. Data-Parallel Cumulative Aggregates

  10. Data-Parallel Cumulative Aggregates: DistCumAgg Framework
      • Basic idea: a self-similar operator chain of forward, local, and backward steps
        - Forward: compute block-local aggregates (recursively, as aggregates of aggregates, until small enough)
        - Local: cumagg over the small matrix of aggregates
        - Backward: block-local cumagg with per-block offsets
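      A minimal sketch of the three steps for cumsum over a list of row-blocks (a Python/NumPy stand-in for a distributed block collection; function names are illustrative, not SystemML's API):

        import numpy as np

        def dist_cumsum(blocks):
            # Forward (data-parallel): per-block column sums
            aggs = np.vstack([b.sum(axis=0) for b in blocks])
            # Local: exclusive scan over the small aggregate matrix
            # gives the offset entering each block
            offsets = np.cumsum(aggs, axis=0) - aggs
            # Backward (data-parallel): block-local cumsum plus offset
            return [np.cumsum(b, axis=0) + off
                    for b, off in zip(blocks, offsets)]

        X = np.arange(8.0).reshape(8, 1)
        blocks = [X[0:3], X[3:6], X[6:8]]
        assert np.array_equal(np.vstack(dist_cumsum(blocks)),
                              np.cumsum(X, axis=0))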

  11. Data-Parallel Cumulative Aggregates: Basic Cumulative Aggregates
      • Instantiating basic cumulative aggregates:

        Operation   | Init | f_agg       | f_off                 | f_cumagg
        ------------+------+-------------+-----------------------+-----------
        cumsum(X)   |  0   | colSums(B)  | B[1,] = B[1,] + a     | cumsum(B)
        cummin(X)   |  ∞   | colMins(B)  | B[1,] = min(B[1,], a) | cummin(B)
        cummax(X)   | -∞   | colMaxs(B)  | B[1,] = max(B[1,], a) | cummax(B)
        cumprod(X)  |  1   | colProds(B) | B[1,] = B[1,] * a     | cumprod(B)

      • Example cumsum(X): offset application and block-local cumagg are fused to avoid an extra copy
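      The table rows plug into one generic scheme; a hedged Python sketch (dist_cumagg and its parameter names are illustrative, not the paper's code), instantiated here for cummax:

        import numpy as np

        def dist_cumagg(blocks, f_agg, f_cumagg, f_off, init):
            aggs = np.vstack([f_agg(b) for b in blocks])
            # Exclusive cumagg over aggregates: offset entering each block
            offs = np.vstack([np.full((1, aggs.shape[1]), init),
                              f_cumagg(aggs)[:-1]])
            out = []
            for b, a in zip(blocks, offs):
                b = b.copy()
                b[0] = f_off(b[0], a)    # fold offset into first row (f_off)
                out.append(f_cumagg(b))  # block-local cumagg (f_cumagg)
            return out

        # cummax instantiation per the table (init = -inf)
        X = np.array([[1.], [5.], [3.], [4.], [2.]])
        res = dist_cumagg([X[0:2], X[2:5]],
                          f_agg=lambda B: B.max(axis=0),
                          f_cumagg=lambda B: np.maximum.accumulate(B, axis=0),
                          f_off=np.maximum,
                          init=-np.inf)
        assert np.array_equal(np.vstack(res), np.maximum.accumulate(X, axis=0))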

  12. Data-Parallel Cumulative Aggregates: Complex Cumulative Aggregates
      • Instantiating complex recurrence equations:
        Z = cumsumprod(X) = cumsumprod(Y, W) with Z_i = Y_i + W_i * Z_(i-1), Z_0 = 0, where X = cbind(Y, W)
      • Application: exponential smoothing; example input X = [1 .2; 1 .1; 3 .0; 2 .1]

        Operation      | Init | f_agg                                  | f_off                        | f_cumagg
        ---------------+------+----------------------------------------+------------------------------+--------------
        cumsumprod(X)  |  0   | cbind(cumsumprod(B)[n,1], prod(B[,2])) | B[1,1] = B[1,1] + B[1,2] * a | cumsumprod(B)
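      A sequential reference for the recurrence (editor's Python sketch, using the example input above):

        import numpy as np

        def cumsumprod(X):
            # Z[i] = Y[i] + W[i] * Z[i-1], with Z's predecessor = 0,
            # where Y = X[:,0] and W = X[:,1]
            Y, W = X[:, 0], X[:, 1]
            Z, prev = np.zeros(len(Y)), 0.0
            for i in range(len(Y)):
                prev = Y[i] + W[i] * prev
                Z[i] = prev
            return Z

        X = np.array([[1, .2], [1, .1], [3, .0], [2, .1]])
        print(cumsumprod(X))  # [1.0, 1.1, 3.0, 2.3]

      In the distributed case, per the table, a block B is summarized by the pair (cumsumprod(B)[n,1], prod(B[,2])), and an incoming offset a is folded in via B[1,1] = B[1,1] + B[1,2] * a before the block-local cumsumprod.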

  13. System Integration

  14. System Integration: Simplification Rewrites
      • Example #1: Suffix sums
        - Problem: distributed rev() causes data shuffling
        - Compute via column aggregates and prefix sums:
          rev(cumsum(rev(X))) → X + colSums(X) - cumsum(X)
          (colSums broadcast; partitioning-preserving)
      • Example #2: Extract lower triangular [figure: 7x7 lower-triangular 0/1 matrix]
        - Problem: indexing is cumbersome/slow; cumsum is densifying
        - Use a dedicated operator:
          X * cumsum(diag(matrix(1, nrow(X), 1))) → lower.tri(X)
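      The suffix-sum identity holds because X[i,] + colSums(X) - cumsum(X)[i,] = X[i,] + Σ_{k>i} X[k,] = Σ_{k≥i} X[k,]; a quick NumPy check (illustrative only):

        import numpy as np

        X = np.random.rand(6, 3)
        suffix  = np.cumsum(X[::-1], axis=0)[::-1]           # rev(cumsum(rev(X)))
        rewrite = X + X.sum(axis=0) - np.cumsum(X, axis=0)   # X + colSums(X) - cumsum(X)
        assert np.allclose(suffix, rewrite)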

  15. System Integration: Execution Plan Generation
      • Compilation chain of cumulative aggregates
        - Execution type selection based on memory estimates
        - Physical operator configuration (broadcast, aggregation, in-place, #threads)
      • Example: the high-level operator (HOP) u(cumsum) over X with nrow(X) ≥ blocksize compiles into a chain of low-level operators (LOPs), SP cumagg (k+) → CP u(cumsum) (24 threads, in-place) → SP cumagg offset (k+, broadcast), yielding the runtime plan:

        1: ...
        2: SP ucumack+ _mVar1 _mVar2
        3: CP ucumk+ _mVar2 _mVar3 24
        4: CP rmvar _mVar2
        5: SP bcumoffk+ _mVar1 _mVar3 _mVar4 0
        6: CP rmvar _mVar1 _mVar3
        7: ...
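      The selection rule itself is simple; a hypothetical Python sketch (names and the 20 GB constant are illustrative, taken from the cluster config on slide 7, not SystemML's actual code):

        def exec_type(mem_estimate_bytes, cp_budget_bytes=20 * 1024**3):
            # Run locally in the driver (CP) if the operation's memory
            # estimate fits the budget, else as a distributed Spark
            # (SP) operator.
            return "CP" if mem_estimate_bytes <= cp_budget_bytes else "SP"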
