

  1. An Exploration of Optimization Algorithms for High Performance Tensor Completion. Shaden Smith¹ ∗, Jongsoo Park², and George Karypis¹. ¹ Department of Computer Science & Engineering, University of Minnesota. ² Parallel Computing Lab, Intel Corporation. ∗ shaden@cs.umn.edu

  2. Outline: Introduction & Preliminaries (Tensor Completion, Evaluation Criteria); Optimization Algorithms (Alternating Least Squares, Coordinate Descent, Stochastic Gradient Descent); Comparison of Optimization Methods; Conclusions

  3. Table of Contents: Introduction & Preliminaries (Tensor Completion, Evaluation Criteria); Optimization Algorithms; Comparison of Optimization Methods; Conclusions

  4. Tensor introduction ◮ Tensors are the generalization of matrices to ≥ 3 dimensions. ◮ Tensors have m dimensions (or modes). ◮ We will use dimensions I × J × K in this talk. (Figure: a three-mode tensor with user, item, and context modes.)
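As a concrete illustration (my own sketch, not from the talk), a sparse three-mode tensor of this kind is often stored in coordinate (COO) form: one (i, j, k) index triple plus a value per observed entry. The later sketches assume this layout.

    import numpy as np

    # A toy sparse 3-mode tensor (users x items x contexts) in coordinate form.
    # Only observed entries are stored; missing entries are absent, not zero.
    I, J, K = 4, 5, 3
    inds = np.array([[0, 1, 2],       # each row is one observed (i, j, k)
                     [1, 4, 0],
                     [3, 0, 1]])
    vals = np.array([5.0, 3.0, 4.0])  # the observed ratings
    nnz = len(vals)                   # nnz(R) = number of observed entries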

  5. Tensor completion ◮ Many tensors are sparse due to missing or unknown data. ◮ Missing values are not treated as zero. ◮ Assumption: the underlying data is low rank. ◮ Tensor completion estimates a low rank model to recover missing entries. ◮ Applications: precision healthcare, product recommendation, cybersecurity, and others.

  6. Tensor completion (continued) ◮ The canonical polyadic decomposition (CPD) models a tensor as the summation of rank-1 tensors. (Figure: the tensor approximated as a sum of rank-1 tensors.)

  7. Tensor completion with the CPD R(i,j,k) is written as the inner product of A(i,:), B(j,:), and C(k,:). (Figure: the factor matrices A, B, and C.)

  8. Tensor completion with the CPD (continued) We arrive at a non-convex optimization problem:

    minimize over A, B, C:   L(R, A, B, C)  +  λ ( ‖A‖²_F + ‖B‖²_F + ‖C‖²_F )
                              (loss)             (regularization)

    where  L(R, A, B, C) = (1/2) Σ_{(i,j,k) ∈ R} ( R(i,j,k) − Σ_{f=1..F} A(i,f) B(j,f) C(k,f) )²
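A minimal sketch of evaluating this objective, assuming the COO layout introduced earlier (inds, vals) and rank-F factors A, B, C; illustrative code, not the authors' implementation.

    import numpy as np

    def cpd_objective(inds, vals, A, B, C, lam):
        """Return (objective, loss) for the CPD completion model."""
        i, j, k = inds[:, 0], inds[:, 1], inds[:, 2]
        # Model prediction for each observed entry: sum_f A(i,f) B(j,f) C(k,f).
        pred = np.sum(A[i] * B[j] * C[k], axis=1)
        loss = 0.5 * np.sum((vals - pred) ** 2)
        reg = lam * ((A ** 2).sum() + (B ** 2).sum() + (C ** 2).sum())
        return loss + reg, loss

    # Toy usage with random rank-2 factors.
    rng = np.random.default_rng(0)
    inds = np.array([[0, 1, 2], [1, 4, 0], [3, 0, 1]])
    vals = np.array([5.0, 3.0, 4.0])
    A, B, C = rng.random((4, 2)), rng.random((5, 2)), rng.random((3, 2))
    objective, loss = cpd_objective(inds, vals, A, B, C, lam=0.1)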

  9. Challenges: Optimization algorithms ◮ Algorithms for matrix completion are relatively mature. ◮ How do their tensor adaptations perform on HPC systems? ◮ Several properties to consider when comparing algorithms: 1. Number of operations. 2. Convergence rate. 3. Computational intensity. 4. Parallelism.

  10. Experimental setup ◮ Source code was implemented as part of SPLATT with MPI+OpenMP. ◮ Experiments are on the Cori supercomputer at NERSC. ◮ Nodes have two sixteen-core Intel (Haswell) processors. ◮ Experiments show a rank-10 factorization of the Yahoo Music (KDD Cup) tensor: 210 million user-song-month ratings. ◮ More datasets and ranks in the paper. ◮ Root-mean-squared error (RMSE) on a test set measures solution quality:

    RMSE = √( 2 · L(R, A, B, C) / nnz(R) )
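A small sketch of the RMSE computation under the same assumed COO layout; with the loss defined above, √(2 · L / nnz) is just the root of the mean squared residual over the held-out entries.

    import numpy as np

    def rmse(test_inds, test_vals, A, B, C):
        """RMSE = sqrt(2 * L / nnz) = root of the mean squared residual."""
        i, j, k = test_inds[:, 0], test_inds[:, 1], test_inds[:, 2]
        pred = np.sum(A[i] * B[j] * C[k], axis=1)
        return np.sqrt(np.mean((test_vals - pred) ** 2))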

  11. Table of Contents: Introduction & Preliminaries; Optimization Algorithms (Alternating Least Squares, Coordinate Descent, Stochastic Gradient Descent); Comparison of Optimization Methods; Conclusions

  12. Alternating least squares (ALS) ◮ Each row of A is a linear least squares problem. ◮ H_i is an |R(i,:,:)| × F matrix: each non-zero R(i,j,k) contributes the row B(j,:) ∗ C(k,:) (elementwise multiplication). ◮ A(i,:) ← ( H_i^T H_i + λI )^(-1) H_i^T vec(R(i,:,:)), where H_i^T H_i + λI forms the normal equations.

  13. Alternating least squares (ALS) ◮ The normal equations N_i = H_i^T H_i are formed one non-zero at a time. ◮ H_i^T vec(R(i,:,:)) is similarly accumulated into a vector q_i.

    Algorithm 1 (ALS: updating A(i,:)):
      N_i ← 0_{F×F}, q_i ← 0_{F×1}
      for (i, j, k) ∈ R(i,:,:) do
        x ← B(j,:) ∗ C(k,:)
        N_i ← N_i + x^T x
        q_i ← q_i + R(i,j,k) x^T
      end for
      A(i,:) ← (N_i + λI)^(-1) q_i
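A minimal NumPy sketch of Algorithm 1 (assumed slice layout, not the SPLATT kernels): the non-zeros of slice R(i,:,:) are given as (j, k) pairs with their values.

    import numpy as np

    def als_update_row(slice_inds, slice_vals, B, C, lam):
        """One ALS row update: slice_inds holds the (j, k) pairs of R(i,:,:),
        slice_vals the matching values; returns the new row A(i,:)."""
        F = B.shape[1]
        N = np.zeros((F, F))   # normal equations N_i
        q = np.zeros(F)        # right-hand side q_i
        for (j, k), r in zip(slice_inds, slice_vals):
            x = B[j] * C[k]            # x <- B(j,:) * C(k,:), elementwise
            N += np.outer(x, x)        # N_i <- N_i + x^T x
            q += r * x                 # q_i <- q_i + R(i,j,k) x^T
        return np.linalg.solve(N + lam * np.eye(F), q)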

  14. BLAS-3 formulation ◮ The element-wise computation is an outer-product formulation: O(F²) work with O(F²) data per non-zero. ◮ Instead, append (B(j,:) ∗ C(k,:)) to a matrix Z. ◮ When Z is full, do a rank-k update: N_i ← N_i + Z^T Z.

    Algorithm 2 (ALS: updating A(i,:)):
      N_i ← 0_{F×F}, q_i ← 0_{F×1}, Z ← 0
      for (i, j, k) ∈ R(i,:,:) do
        append x ← B(j,:) ∗ C(k,:) to Z
        q_i ← q_i + R(i,j,k) x^T
      end for
      N_i ← N_i + Z^T Z
      A(i,:) ← (N_i + λI)^(-1) q_i
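A sketch of the same row update in the BLAS-3 style under the earlier assumptions, with an illustrative block size: rows are gathered into Z and N_i is accumulated with dense Z^T Z products (rank-k updates) instead of one outer product per non-zero.

    import numpy as np

    def als_update_row_blocked(slice_inds, slice_vals, B, C, lam, block=128):
        F = B.shape[1]
        N = np.zeros((F, F))
        j, k = slice_inds[:, 0], slice_inds[:, 1]
        H = B[j] * C[k]                    # every row x = B(j,:) * C(k,:)
        q = H.T @ slice_vals               # q_i = H_i^T vec(R(i,:,:))
        for start in range(0, H.shape[0], block):
            Z = H[start:start + block]     # a full (or final partial) block Z
            N += Z.T @ Z                   # rank-k update: N_i <- N_i + Z^T Z
        return np.linalg.solve(N + lam * np.eye(F), q)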

  15. Parallel ALS ◮ We impose a 1D partition on each of the factors. ◮ Non-zeros are then distributed according to the row partitionings. ◮ Only the updated rows need to be communicated. ◮ If a mode is short, cooperatively form its rows and aggregate the normal equations. (Figure: 1D partitions of the factors A, B, and C.)

  16. ALS evaluation 295× relative speedup and 153× speedup over base-ALS. (Figure: time per epoch in seconds versus number of cores, 1 to 1,024, for base-ALS and ALS.) base-ALS is a pure-MPI implementation in C++ [Karlsson ’15]. ALS is our MPI+OpenMP implementation with one MPI rank per node.

  17. Coordinate descent (CCD++) ◮ Select a variable and update it while holding all others constant. ◮ The update can be computed exactly because each one-variable subproblem is quadratic. ◮ Rank-1 factors are updated in sequence.

  18. CCD++ formulation ◮ O(F) work per non-zero. ◮ Each epoch requires 3F passes over the tensor. ◮ Heavily dependent on memory bandwidth.

    δ_ijk ← R(i,j,k) − Σ_{f=1..F} A(i,f) B(j,f) C(k,f)
    α_i ← Σ_{R(i,:,:)} δ_ijk ( B(j,f) C(k,f) )
    β_i ← Σ_{R(i,:,:)} ( B(j,f) C(k,f) )²
    A(i,f) ← α_i / ( β_i + λ )
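A minimal sketch of one CCD++ update of column f of A, following the update rules above as written and assuming the COO layout from before (illustrative only, not the SPLATT kernels).

    import numpy as np

    def ccdpp_update_column_A(inds, vals, A, B, C, f, lam):
        """Update A(:, f) in place using the alpha/beta accumulators."""
        i, j, k = inds[:, 0], inds[:, 1], inds[:, 2]
        delta = vals - np.sum(A[i] * B[j] * C[k], axis=1)  # residuals delta_ijk
        bc = B[j, f] * C[k, f]                             # B(j,f) C(k,f)
        rows = A.shape[0]
        alpha = np.bincount(i, weights=delta * bc, minlength=rows)
        beta = np.bincount(i, weights=bc * bc, minlength=rows)
        A[:, f] = alpha / (beta + lam)                     # exact 1-D solve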

  19. Compressed sparse fiber (CSF) ◮ CSF is a generalization of the CSR structure for matrices. ◮ Paths from roots to leaves encode non-zeros. ◮ CSF reduces the memory bandwidth needed to stream the tensor and also structures accesses to the factors.
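A simplified two-level illustration of the idea (my own sketch, not the actual SPLATT CSF structure): non-zeros sorted by (i, j, k) are compressed so each index-i root and each (i, j) fiber is stored once, with pointer arrays analogous to CSR's row pointer.

    import numpy as np

    def build_csf_like(inds, vals):
        """Compress sorted (i, j, k) non-zeros into root/fiber/leaf levels."""
        order = np.lexsort((inds[:, 2], inds[:, 1], inds[:, 0]))
        inds, vals = inds[order], vals[order]
        # Fibers: positions where a new (i, j) pair starts among the non-zeros.
        ij = inds[:, 0].astype(np.int64) * (inds[:, 1].max() + 1) + inds[:, 1]
        _, fib_start = np.unique(ij, return_index=True)
        # Roots: positions where a new i starts, mapped to fiber offsets.
        root_ids, root_start = np.unique(inds[:, 0], return_index=True)
        root_ptr = np.searchsorted(fib_start, root_start)
        return {
            "root_ids": root_ids,                             # unique i values
            "root_ptr": np.append(root_ptr, len(fib_start)),  # into fibers
            "fiber_j": inds[fib_start, 1],                    # j of each fiber
            "fiber_ptr": np.append(fib_start, len(vals)),     # into leaves
            "leaf_k": inds[:, 2], "leaf_vals": vals,          # the non-zeros
        }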

  20. Parallel CCD++ ◮ Shared-memory: each entry of A(:,f) is computed in parallel. ◮ Distributing non-zeros with a 3D grid limits communication to the grid layers. ◮ Distributing non-zeros requires α_i and β_i to be aggregated. ◮ Communication volume is O(IF) per process. ◮ For short modes, use a grid dimension of 1 and fully replicate the factor. (Figure: a 3D grid partitioning of the non-zeros and the factor blocks.)

  21. CCD++ shared-memory evaluation Replicating the short mode improves speedup from 4× to 16×. (Figure: speedup versus cores, 1 to 32, for CSF and CSF+replication against ideal scaling.) CSF uses a CSF representation. CSF+replication uses a CSF representation and replicates α and β for load balance.

  22. CCD++ distributed-memory evaluation 685× relative speedup and 21× speedup over base-CCD++. (Figure: time per epoch in seconds versus number of cores, 1 to 1,024, for base-CCD++ and CCD++.) base-CCD++ is a pure-MPI implementation in C++ [Karlsson ’15]. CCD++ is our MPI+OpenMP implementation with two MPI ranks per node.

  23. Stochastic gradient descent (SGD) ◮ Randomly select entry R(i,j,k) and update A, B, and C. ◮ O(F) work per non-zero.

    δ_ijk ← R(i,j,k) − Σ_{f=1..F} A(i,f) B(j,f) C(k,f)
    A(i,:) ← A(i,:) + η [ δ_ijk ( B(j,:) ∗ C(k,:) ) − λ A(i,:) ]
    B(j,:) ← B(j,:) + η [ δ_ijk ( A(i,:) ∗ C(k,:) ) − λ B(j,:) ]
    C(k,:) ← C(k,:) + η [ δ_ijk ( A(i,:) ∗ B(j,:) ) − λ C(k,:) ]

    η is the step size; typically O(10⁻³).
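A minimal sketch of one SGD update with an illustrative step size, again assuming the COO layout; the old rows are copied so all three updates use the same residual and factor values, as in the equations above.

    import numpy as np

    def sgd_step(inds, vals, A, B, C, rng, eta=1e-3, lam=0.01):
        """Apply one stochastic update to A, B, and C in place."""
        n = rng.integers(len(vals))                    # pick a random non-zero
        i, j, k = inds[n]
        a, b, c = A[i].copy(), B[j].copy(), C[k].copy()
        delta = vals[n] - np.sum(a * b * c)            # residual delta_ijk
        A[i] += eta * (delta * (b * c) - lam * a)
        B[j] += eta * (delta * (a * c) - lam * b)
        C[k] += eta * (delta * (a * b) - lam * c)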

  24. Stratified SGD [Beutel ’14] ◮ Strata identify independent blocks of non-zeros. ◮ Each stratum is processed in parallel. Limitations of stratified SGD: ◮ There is only as much parallelism as the smallest dimension. ◮ Sparsely populated strata are communication bound.
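A small sketch of how strata can be formed for three modes (my own illustration of the idea, not Beutel et al.'s code): with each mode split into P blocks, a stratum is a set of P blocks that share no block index in any mode, so they can be processed concurrently.

    def strata(P):
        """Yield P*P strata; each is a list of P mutually independent blocks
        (bi, bj, bk) that share no block index in any mode."""
        for s1 in range(P):
            for s2 in range(P):
                yield [(p, (p + s1) % P, (p + s2) % P) for p in range(P)]

    # With P = 2 there are 4 strata of 2 blocks each; an SGD epoch walks the
    # strata in sequence while the blocks inside a stratum run in parallel.
    for stratum in strata(2):
        print(stratum)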

  25. Asynchronous SGD (ASGD) ◮ Processes overlap local updates with factor exchanges to avoid divergence. ◮ Local solutions are combined via a weighted sum. ◮ Go Hogwild! on shared-memory systems. Limitations of ASGD: ◮ Convergence suffers unless updates are frequently exchanged.

  26. Hybrid stratified/asynchronous SGD ◮ Limit the number of strata to reduce communication. ◮ Assign multiple processes to the same stratum (called a team). ◮ Each process performs updates on its own local factors. ◮ At the end of a stratum, updates are exchanged within the team.

  27. Effects of stratification on SGD @ 1024 cores Hybrid stratification combines the speed of ASGD with the stability of stratification. (Figure: RMSE versus time in seconds, 0 to 80, at 1,024 cores for asynchronous, hybrid, and stratified SGD.) Hybrid uses sixteen teams of four MPI processes.

  28. Table of Contents: Introduction & Preliminaries; Optimization Algorithms; Comparison of Optimization Methods; Conclusions

  29. Strong scaling ◮ SGD exhibits an initial slowdown as strata teams are populated. ◮ All methods scale to (and past) 1,024 cores. (Figure: time per epoch in seconds versus number of cores, 32 to 1,024, for ALS, SGD, and CCD++.)
