

  1. An Exploration of Optimization Algorithms for High Performance Tensor Completion. Shaden Smith¹ ∗, Jongsoo Park², and George Karypis¹. ¹ Department of Computer Science & Engineering, University of Minnesota. ² Parallel Computing Lab, Intel Corporation. ∗ shaden@cs.umn.edu

  2. Outline: Introduction & Preliminaries (Tensor Completion, Evaluation Criteria); Optimization Algorithms (Alternating Least Squares, Coordinate Descent, Stochastic Gradient Descent); Comparison of Optimization Methods; Conclusions

  3. Table of Contents: Introduction & Preliminaries (Tensor Completion, Evaluation Criteria); Optimization Algorithms; Comparison of Optimization Methods; Conclusions

  4. Tensor introduction ◮ Tensors are the generalization of matrices to ≥ 3 dimensions. ◮ Tensors have m dimensions (or modes). ◮ We will use dimensions I × J × K in this talk. (Figure: a three-mode tensor with user, item, and context modes.)
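As a concrete illustration (my own sketch, not from the talk), a sparse three-mode tensor of this kind is often stored in coordinate (COO) form: one (i, j, k) index triple plus a value per observed entry. The later sketches assume this layout.

    import numpy as np

    # A toy sparse 3-mode tensor (users x items x contexts) in coordinate form.
    # Only observed entries are stored; missing entries are absent, not zero.
    I, J, K = 4, 5, 3
    inds = np.array([[0, 1, 2],       # each row is one observed (i, j, k)
                     [1, 4, 0],
                     [3, 0, 1]])
    vals = np.array([5.0, 3.0, 4.0])  # the observed ratings
    nnz = len(vals)                   # nnz(R) = number of observed entries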

  5. Tensor completion ◮ Many tensors are sparse due to missing or unknown data. ◮ Missing values are not treated as zero. ◮ Assumption: the underlying data is low rank. ◮ Tensor completion estimates a low rank model to recover missing entries. ◮ Applications: precision healthcare, product recommendation, cybersecurity, and others.

  6. Tensor completion (continued) ◮ The canonical polyadic decomposition (CPD) models a tensor as the summation of rank-1 tensors. (Figure: the tensor approximated as a sum of rank-1 tensors.)

  7. Tensor completion with the CPD R(i,j,k) is written as the inner product of A(i,:), B(j,:), and C(k,:). (Figure: the factor matrices A, B, and C.)

  8. Tensor completion with the CPD (continued) We arrive at a non-convex optimization problem:

    minimize over A, B, C:   L(R, A, B, C)  +  λ ( ‖A‖²_F + ‖B‖²_F + ‖C‖²_F )
                              (loss)             (regularization)

    where  L(R, A, B, C) = (1/2) Σ_{(i,j,k) ∈ R} ( R(i,j,k) − Σ_{f=1..F} A(i,f) B(j,f) C(k,f) )²
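A minimal sketch of evaluating this objective, assuming the COO layout introduced earlier (inds, vals) and rank-F factors A, B, C; illustrative code, not the authors' implementation.

    import numpy as np

    def cpd_objective(inds, vals, A, B, C, lam):
        """Return (objective, loss) for the CPD completion model."""
        i, j, k = inds[:, 0], inds[:, 1], inds[:, 2]
        # Model prediction for each observed entry: sum_f A(i,f) B(j,f) C(k,f).
        pred = np.sum(A[i] * B[j] * C[k], axis=1)
        loss = 0.5 * np.sum((vals - pred) ** 2)
        reg = lam * ((A ** 2).sum() + (B ** 2).sum() + (C ** 2).sum())
        return loss + reg, loss

    # Toy usage with random rank-2 factors.
    rng = np.random.default_rng(0)
    inds = np.array([[0, 1, 2], [1, 4, 0], [3, 0, 1]])
    vals = np.array([5.0, 3.0, 4.0])
    A, B, C = rng.random((4, 2)), rng.random((5, 2)), rng.random((3, 2))
    objective, loss = cpd_objective(inds, vals, A, B, C, lam=0.1)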

  9. Challenges: Optimization algorithms ◮ Algorithms for matrix completion are relatively mature. ◮ How do their tensor adaptations perform on HPC systems? ◮ Several properties to consider when comparing algorithms: 1. Number of operations. 2. Convergence rate. 3. Computational intensity. 4. Parallelism.

  10. Experimental setup ◮ Source code was implemented as part of SPLATT with MPI+OpenMP. ◮ Experiments are on the Cori supercomputer at NERSC. ◮ Nodes have two sixteen-core Intel (Haswell) processors. ◮ Experiments show a rank-10 factorization of the Yahoo Music (KDD Cup) tensor: 210 million user-song-month ratings. ◮ More datasets and ranks in the paper. ◮ Root-mean-squared error (RMSE) on a test set measures solution quality:

    RMSE = √( 2 · L(R, A, B, C) / nnz(R) )
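A small sketch of the RMSE computation under the same assumed COO layout; with the loss defined above, √(2 · L / nnz) is just the root of the mean squared residual over the held-out entries.

    import numpy as np

    def rmse(test_inds, test_vals, A, B, C):
        """RMSE = sqrt(2 * L / nnz) = root of the mean squared residual."""
        i, j, k = test_inds[:, 0], test_inds[:, 1], test_inds[:, 2]
        pred = np.sum(A[i] * B[j] * C[k], axis=1)
        return np.sqrt(np.mean((test_vals - pred) ** 2))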

  11. Table of Contents: Introduction & Preliminaries; Optimization Algorithms (Alternating Least Squares, Coordinate Descent, Stochastic Gradient Descent); Comparison of Optimization Methods; Conclusions

  12. Alternating least squares (ALS) ◮ Each row of A is a linear least squares problem. ◮ H_i is an |R(i,:,:)| × F matrix: each non-zero R(i,j,k) contributes the row B(j,:) ∗ C(k,:) (elementwise multiplication). ◮ A(i,:) ← ( H_i^T H_i + λI )^(-1) H_i^T vec(R(i,:,:)), where H_i^T H_i + λI forms the normal equations.

  13. Alternating least squares (ALS) ◮ The normal equations N_i = H_i^T H_i are formed one non-zero at a time. ◮ H_i^T vec(R(i,:,:)) is similarly accumulated into a vector q_i.

    Algorithm 1 (ALS: updating A(i,:)):
      N_i ← 0_{F×F}, q_i ← 0_{F×1}
      for (i, j, k) ∈ R(i,:,:) do
        x ← B(j,:) ∗ C(k,:)
        N_i ← N_i + x^T x
        q_i ← q_i + R(i,j,k) x^T
      end for
      A(i,:) ← (N_i + λI)^(-1) q_i
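A minimal NumPy sketch of Algorithm 1 (assumed slice layout, not the SPLATT kernels): the non-zeros of slice R(i,:,:) are given as (j, k) pairs with their values.

    import numpy as np

    def als_update_row(slice_inds, slice_vals, B, C, lam):
        """One ALS row update: slice_inds holds the (j, k) pairs of R(i,:,:),
        slice_vals the matching values; returns the new row A(i,:)."""
        F = B.shape[1]
        N = np.zeros((F, F))   # normal equations N_i
        q = np.zeros(F)        # right-hand side q_i
        for (j, k), r in zip(slice_inds, slice_vals):
            x = B[j] * C[k]            # x <- B(j,:) * C(k,:), elementwise
            N += np.outer(x, x)        # N_i <- N_i + x^T x
            q += r * x                 # q_i <- q_i + R(i,j,k) x^T
        return np.linalg.solve(N + lam * np.eye(F), q)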

  14. BLAS-3 formulation ◮ The element-wise computation is an outer-product formulation: O(F²) work with O(F²) data per non-zero. ◮ Instead, append (B(j,:) ∗ C(k,:)) to a matrix Z. ◮ When Z is full, do a rank-k update: N_i ← N_i + Z^T Z.

    Algorithm 2 (ALS: updating A(i,:)):
      N_i ← 0_{F×F}, q_i ← 0_{F×1}, Z ← 0
      for (i, j, k) ∈ R(i,:,:) do
        append x ← B(j,:) ∗ C(k,:) to Z
        q_i ← q_i + R(i,j,k) x^T
      end for
      N_i ← N_i + Z^T Z
      A(i,:) ← (N_i + λI)^(-1) q_i
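A sketch of the same row update in the BLAS-3 style under the earlier assumptions, with an illustrative block size: rows are gathered into Z and N_i is accumulated with dense Z^T Z products (rank-k updates) instead of one outer product per non-zero.

    import numpy as np

    def als_update_row_blocked(slice_inds, slice_vals, B, C, lam, block=128):
        F = B.shape[1]
        N = np.zeros((F, F))
        j, k = slice_inds[:, 0], slice_inds[:, 1]
        H = B[j] * C[k]                    # every row x = B(j,:) * C(k,:)
        q = H.T @ slice_vals               # q_i = H_i^T vec(R(i,:,:))
        for start in range(0, H.shape[0], block):
            Z = H[start:start + block]     # a full (or final partial) block Z
            N += Z.T @ Z                   # rank-k update: N_i <- N_i + Z^T Z
        return np.linalg.solve(N + lam * np.eye(F), q)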

  15. Parallel ALS ◮ We impose a 1D partition on each of the factors. ◮ Non-zeros are then distributed according to the row partitionings. ◮ Only the updated rows need to be communicated. ◮ If a mode is short, cooperatively form its rows and aggregate the normal equations. (Figure: 1D partitions of the factors A, B, and C.)

  16. ALS evaluation 295× relative speedup and 153× speedup over base-ALS. (Figure: time per epoch in seconds versus number of cores, 1 to 1,024, for base-ALS and ALS.) base-ALS is a pure-MPI implementation in C++ [Karlsson ’15]. ALS is our MPI+OpenMP implementation with one MPI rank per node.

  17. Coordinate descent (CCD++) ◮ Select a variable and update it while holding all others constant. ◮ The update can be computed exactly because each one-variable subproblem is quadratic. ◮ Rank-1 factors are updated in sequence.

  18. CCD++ formulation ◮ O(F) work per non-zero. ◮ Each epoch requires 3F passes over the tensor. ◮ Heavily dependent on memory bandwidth.

    δ_ijk ← R(i,j,k) − Σ_{f=1..F} A(i,f) B(j,f) C(k,f)
    α_i ← Σ_{R(i,:,:)} δ_ijk ( B(j,f) C(k,f) )
    β_i ← Σ_{R(i,:,:)} ( B(j,f) C(k,f) )²
    A(i,f) ← α_i / ( β_i + λ )
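A minimal sketch of one CCD++ update of column f of A, following the update rules above as written and assuming the COO layout from before (illustrative only, not the SPLATT kernels).

    import numpy as np

    def ccdpp_update_column_A(inds, vals, A, B, C, f, lam):
        """Update A(:, f) in place using the alpha/beta accumulators."""
        i, j, k = inds[:, 0], inds[:, 1], inds[:, 2]
        delta = vals - np.sum(A[i] * B[j] * C[k], axis=1)  # residuals delta_ijk
        bc = B[j, f] * C[k, f]                             # B(j,f) C(k,f)
        rows = A.shape[0]
        alpha = np.bincount(i, weights=delta * bc, minlength=rows)
        beta = np.bincount(i, weights=bc * bc, minlength=rows)
        A[:, f] = alpha / (beta + lam)                     # exact 1-D solve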

  19. Compressed sparse fiber (CSF) ◮ CSF is a generalization of the CSR structure for matrices. ◮ Paths from roots to leaves encode non-zeros. ◮ CSF reduces the memory bandwidth needed to stream the tensor and also structures accesses to the factors.
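A simplified two-level illustration of the idea (my own sketch, not the actual SPLATT CSF structure): non-zeros sorted by (i, j, k) are compressed so each index-i root and each (i, j) fiber is stored once, with pointer arrays analogous to CSR's row pointer.

    import numpy as np

    def build_csf_like(inds, vals):
        """Compress sorted (i, j, k) non-zeros into root/fiber/leaf levels."""
        order = np.lexsort((inds[:, 2], inds[:, 1], inds[:, 0]))
        inds, vals = inds[order], vals[order]
        # Fibers: positions where a new (i, j) pair starts among the non-zeros.
        ij = inds[:, 0].astype(np.int64) * (inds[:, 1].max() + 1) + inds[:, 1]
        _, fib_start = np.unique(ij, return_index=True)
        # Roots: positions where a new i starts, mapped to fiber offsets.
        root_ids, root_start = np.unique(inds[:, 0], return_index=True)
        root_ptr = np.searchsorted(fib_start, root_start)
        return {
            "root_ids": root_ids,                             # unique i values
            "root_ptr": np.append(root_ptr, len(fib_start)),  # into fibers
            "fiber_j": inds[fib_start, 1],                    # j of each fiber
            "fiber_ptr": np.append(fib_start, len(vals)),     # into leaves
            "leaf_k": inds[:, 2], "leaf_vals": vals,          # the non-zeros
        }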

  20. Parallel CCD++ ◮ Shared-memory: each entry of A(:,f) is computed in parallel. ◮ Distributing non-zeros with a 3D grid limits communication to the grid layers. ◮ Distributing non-zeros requires α_i and β_i to be aggregated. ◮ Communication volume is O(IF) per process. ◮ For short modes, use a grid dimension of 1 and fully replicate the factor. (Figure: a 3D grid partitioning of the non-zeros and the factor blocks.)

  21. CCD++ shared-memory evaluation Replicating the short mode improves speedup from 4× to 16×. (Figure: speedup versus cores, 1 to 32, for CSF and CSF+replication against ideal scaling.) CSF uses a CSF representation. CSF+replication uses a CSF representation and replicates α and β for load balance.

  22. CCD++ distributed-memory evaluation 685× relative speedup and 21× speedup over base-CCD++. (Figure: time per epoch in seconds versus number of cores, 1 to 1,024, for base-CCD++ and CCD++.) base-CCD++ is a pure-MPI implementation in C++ [Karlsson ’15]. CCD++ is our MPI+OpenMP implementation with two MPI ranks per node.

  23. Stochastic gradient descent (SGD) ◮ Randomly select entry R(i,j,k) and update A, B, and C. ◮ O(F) work per non-zero.

    δ_ijk ← R(i,j,k) − Σ_{f=1..F} A(i,f) B(j,f) C(k,f)
    A(i,:) ← A(i,:) + η [ δ_ijk ( B(j,:) ∗ C(k,:) ) − λ A(i,:) ]
    B(j,:) ← B(j,:) + η [ δ_ijk ( A(i,:) ∗ C(k,:) ) − λ B(j,:) ]
    C(k,:) ← C(k,:) + η [ δ_ijk ( A(i,:) ∗ B(j,:) ) − λ C(k,:) ]

    η is the step size; typically O(10⁻³).
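A minimal sketch of one SGD update with an illustrative step size, again assuming the COO layout; the old rows are copied so all three updates use the same residual and factor values, as in the equations above.

    import numpy as np

    def sgd_step(inds, vals, A, B, C, rng, eta=1e-3, lam=0.01):
        """Apply one stochastic update to A, B, and C in place."""
        n = rng.integers(len(vals))                    # pick a random non-zero
        i, j, k = inds[n]
        a, b, c = A[i].copy(), B[j].copy(), C[k].copy()
        delta = vals[n] - np.sum(a * b * c)            # residual delta_ijk
        A[i] += eta * (delta * (b * c) - lam * a)
        B[j] += eta * (delta * (a * c) - lam * b)
        C[k] += eta * (delta * (a * b) - lam * c)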

  24. Stratified SGD [Beutel ’14] ◮ Strata identify independent blocks of non-zeros. ◮ Each stratum is processed in parallel. Limitations of stratified SGD: ◮ There is only as much parallelism as the smallest dimension. ◮ Sparsely populated strata are communication bound.
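A small sketch of how strata can be formed for three modes (my own illustration of the idea, not Beutel et al.'s code): with each mode split into P blocks, a stratum is a set of P blocks that share no block index in any mode, so they can be processed concurrently.

    def strata(P):
        """Yield P*P strata; each is a list of P mutually independent blocks
        (bi, bj, bk) that share no block index in any mode."""
        for s1 in range(P):
            for s2 in range(P):
                yield [(p, (p + s1) % P, (p + s2) % P) for p in range(P)]

    # With P = 2 there are 4 strata of 2 blocks each; an SGD epoch walks the
    # strata in sequence while the blocks inside a stratum run in parallel.
    for stratum in strata(2):
        print(stratum)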

  25. Asynchronous SGD (ASGD) ◮ Processes overlap local updates with factor exchanges to avoid divergence. ◮ Local solutions are combined via a weighted sum. ◮ Go Hogwild! on shared-memory systems. Limitations of ASGD: ◮ Convergence suffers unless updates are frequently exchanged.

  26. Hybrid stratified/asynchronous SGD ◮ Limit the number of strata to reduce communication. ◮ Assign multiple processes to the same stratum (called a team). ◮ Each process performs updates on its own local factors. ◮ At the end of a stratum, updates are exchanged within the team.

  27. Effects of stratification on SGD @ 1024 cores Hybrid stratification combines the speed of ASGD with the stability of stratification. (Figure: RMSE versus time in seconds, 0 to 80, at 1,024 cores for asynchronous, hybrid, and stratified SGD.) Hybrid uses sixteen teams of four MPI processes.

  28. Table of Contents: Introduction & Preliminaries; Optimization Algorithms; Comparison of Optimization Methods; Conclusions

  29. Strong scaling ◮ SGD exhibits an initial slowdown as strata teams are populated. ◮ All methods scale to (and past) 1,024 cores. (Figure: time per epoch in seconds versus number of cores, 32 to 1,024, for ALS, SGD, and CCD++.)
