Parallel Numerical Algorithms
Chapter 6 – Structured and Low Rank Matrices
Section 6.3 – Numerical Optimization

Michael T. Heath and Edgar Solomonik
Department of Computer Science
University of Illinois at Urbana-Champaign

CS 554 / CSE 512
Outline

1  Alternating Least Squares
     Quadratic Optimization
     Parallel ALS
2  Coordinate Descent
     Coordinate Descent
     Cyclic Coordinate Descent
3  Gradient Descent
     Gradient Descent
     Stochastic Gradient Descent
     Parallel SGD
4  Nonlinear Optimization
     Nonlinear Equations
     Optimization
Quadratic Optimization: Matrix Completion

Given a subset Ω ⊆ {1, ..., m} × {1, ..., n} of the entries of a matrix A ∈ R^{m×n}, seek a rank-k approximation

    argmin_{W ∈ R^{m×k}, H ∈ R^{n×k}}  Σ_{(i,j)∈Ω} ( a_ij − Σ_l w_il h_jl )² + λ ( ||W||_F² + ||H||_F² )

where a_ij − Σ_l w_il h_jl = (A − W H^T)_ij

Problems of this type are studied in sparse approximation
Ω may be a randomly selected subset of sampled entries
Methods for this problem are typical of numerical optimization and machine learning
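As a point of reference, the objective can be evaluated directly. The following is a minimal NumPy sketch with hypothetical names (A is dense here only for illustration; omega is assumed to be a list of observed (i, j) index pairs):

```python
import numpy as np

def completion_objective(A, W, H, omega, lam):
    """Regularized squared error over the observed entries Omega."""
    err = sum((A[i, j] - W[i, :] @ H[j, :]) ** 2 for (i, j) in omega)
    return err + lam * (np.linalg.norm(W, 'fro') ** 2 + np.linalg.norm(H, 'fro') ** 2)
```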
Alternating Least Squares

Alternating least squares (ALS) fixes W and solves for H, then vice versa, repeating until convergence
Each step improves the approximation; convergence to a minimum is expected given a satisfactory starting guess
With H fixed, we have a quadratic optimization problem

    argmin_{W ∈ R^{m×k}}  Σ_{(i,j)∈Ω} ( a_ij − Σ_l w_il h_jl )² + λ ||W||_F²

The optimization problem is independent for each row of W
Letting w_i = w_{i⋆}, h_j = h_{j⋆}, and Ω_i = { j : (i,j) ∈ Ω }, seek

    argmin_{w_i ∈ R^k}  Σ_{j∈Ω_i} ( a_ij − w_i h_j^T )² + λ ||w_i||²
ALS: Quadratic Optimization

Seek the minimizer w_i of the quadratic objective

    f(w_i) = Σ_{j∈Ω_i} ( a_ij − w_i h_j^T )² + λ ||w_i||²

Differentiating with respect to w_i gives

    ∂f(w_i)/∂w_i = 2 Σ_{j∈Ω_i} ( w_i h_j^T − a_ij ) h_j + 2 λ w_i = 0

Transposing (using w_i h_j^T = h_j w_i^T) and defining G^(i) = Σ_{j∈Ω_i} h_j^T h_j,

    ( G^(i) + λ I ) w_i^T = Σ_{j∈Ω_i} h_j^T a_ij

which is a k × k symmetric linear system of equations
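As an illustration (a hypothetical function, not the slides' implementation), a minimal NumPy sketch of this per-row solve could look as follows:

```python
import numpy as np

def als_update_row(a_i, H, omega_i, lam):
    """Solve (G^(i) + lam*I) w_i^T = sum_{j in Omega_i} h_j^T a_ij for one row of W.

    a_i     : row i of A (length n); only the entries indexed by omega_i are used
    H       : n x k factor matrix
    omega_i : array of column indices j with (i, j) in Omega
    lam     : regularization parameter lambda
    """
    H_i = H[omega_i, :]                 # the |Omega_i| rows of H needed for this row
    G_i = H_i.T @ H_i                   # G^(i) = sum_j h_j^T h_j  (k x k, symmetric)
    rhs = H_i.T @ a_i[omega_i]          # sum_j h_j^T a_ij
    k = H.shape[1]
    return np.linalg.solve(G_i + lam * np.eye(k), rhs)   # new w_i as a length-k vector
```

Each of the m rows of W can be updated by an independent call of this form, which is the independence exploited by the parallel ALS formulations on the following slides.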
ALS: Iteration Cost

For updating each w_i, ALS is dominated in cost by two steps:

1  Form G^(i) = Σ_{j∈Ω_i} h_j^T h_j
     dense matrix-matrix product
     O(|Ω_i| k²) work
     logarithmic depth
2  Solve the linear system with G^(i) + λ I
     dense symmetric k × k linear solve
     O(k³) work
     typically O(k) depth

These steps can be done for all m rows of W independently
Parallel ALS

Let each task optimize a row w_i of W
Need to compute G^(i) for each task
A specific subset of the rows of H is needed for each G^(i)
Task execution is embarrassingly parallel if all of H is stored on each processor
Memory-Constrained Parallel ALS

May not have enough memory to replicate H on all processors
Communication is then required, and the pattern is data-dependent
Could rotate row blocks of H along a ring of processors
Each processor computes contributions to the G^(i) it owns as the blocks pass by
Requires Θ(p) latency cost for each iteration of ALS
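One way to picture the ring rotation is the hypothetical mpi4py sketch below; the accumulate callback, which folds a visiting block of H into the locally owned G^(i) matrices, stands in for the data-dependent part and is not specified by the slides.

```python
from mpi4py import MPI

def ring_rotate_H(H_block, accumulate, comm):
    """Rotate row blocks of H around a ring so every rank sees every block once.

    H_block    : this rank's block of rows of H
    accumulate : callback(block, owner_rank) that adds the block's contributions
                 to the G^(i) matrices owned by this rank
    """
    rank, p = comm.Get_rank(), comm.Get_size()
    buf, owner = H_block, rank
    for _ in range(p):
        accumulate(buf, owner)                          # use the block currently held
        buf = comm.sendrecv(buf, dest=(rank + 1) % p,   # pass it to the next rank and
                            source=(rank - 1) % p)      # receive one from the previous
        owner = (owner - 1) % p                         # track whose block we now hold
    # p point-to-point steps per ALS sweep -> Theta(p) latency cost per iteration
```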
Updating a Single Variable

Rather than solving for whole rows w_i, solve for individual elements of W; recall

    argmin_{W ∈ R^{m×k}}  Σ_{(i,j)∈Ω} ( a_ij − Σ_l w_il h_jl )² + λ ||W||_F²

Coordinate descent finds the best replacement μ for w_it:

    μ = argmin_μ  Σ_{j∈Ω_i} ( a_ij − μ h_jt − Σ_{l≠t} w_il h_jl )² + λ μ²

The solution is given by

    μ = [ Σ_{j∈Ω_i} h_jt ( a_ij − Σ_{l≠t} w_il h_jl ) ] / [ λ + Σ_{j∈Ω_i} h_jt² ]
Coordinate Descent

For all (i,j) ∈ Ω, maintain the residual entries r_ij of R = A − W H^T, so that we can optimize via

    μ = [ Σ_{j∈Ω_i} h_jt ( a_ij − Σ_{l≠t} w_il h_jl ) ] / [ λ + Σ_{j∈Ω_i} h_jt² ]
      = [ Σ_{j∈Ω_i} h_jt ( r_ij + w_it h_jt ) ] / [ λ + Σ_{j∈Ω_i} h_jt² ]

after which we can update R via

    r_ij ← r_ij − ( μ − w_it ) h_jt   ∀ j ∈ Ω_i

both using O(|Ω_i|) operations
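A minimal sketch of this single-element update, assuming dense NumPy arrays and hypothetical names (in practice A and R are stored sparsely over Ω):

```python
import numpy as np

def ccd_update(i, t, W, H, R, omega_i, lam):
    """Replace w_it by the optimal mu using the residuals r_ij, then update R in place."""
    h_t = H[omega_i, t]                            # h_jt for j in Omega_i
    r_i = R[i, omega_i]                            # r_ij for j in Omega_i
    mu = h_t @ (r_i + W[i, t] * h_t) / (lam + h_t @ h_t)
    R[i, omega_i] -= (mu - W[i, t]) * h_t          # r_ij <- r_ij - (mu - w_it) h_jt
    W[i, t] = mu                                   # both steps cost O(|Omega_i|) work
```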
Cyclic Coordinate Descent (CCD)

Updating w_i costs O(|Ω_i| k) operations with coordinate descent, rather than O(|Ω_i| k² + k³) operations with ALS
By solving for all of w_i at once, ALS obtains a more accurate solution than coordinate descent
Coordinate descent can use different update orderings:
    Cyclic coordinate descent (CCD) updates all columns of W, then all columns of H (ALS-like ordering)
    CCD++ alternates between columns of W and H
All entries within a column can be updated concurrently
Parallel CCD++

Yu, Hsieh, Si, and Dhillon (2013) propose using a row-blocked layout of H and W
Each processor keeps the corresponding m/p rows and n/p columns of A and R (using twice the minimal amount of memory)
Every column update in CCD++ is then fully parallelized, but an allgather of each column is required to update R
The complexity of updating all of W and all of H is then

    T_p(m, n, k) = Θ( k · T_p^allgather(m + n) + γ Q_1(m, n, k) / p )
                 = O( α k log p + β (m + n) k + γ |Ω| k / p )
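The communication pattern can be sketched with mpi4py as below. This is a hypothetical dense illustration (Ω taken to be all entries, m and n divisible by p), not the sparse implementation of Yu et al.: each rank updates its m/p entries of a column of W locally, and an allgather of the change lets it update its column block of R as well as its row block.

```python
from mpi4py import MPI
import numpy as np

def ccdpp_update_W_column(t, W_loc, h_t, R_rowblk, R_colblk, cols, lam, comm):
    """Update this rank's rows of column t of W and both residual blocks.

    W_loc    : (m/p) x k local rows of W
    h_t      : full column t of H (length n), already available on every rank
    R_rowblk : (m/p) x n row block of R owned by this rank
    R_colblk : m x (n/p) column block of R owned by this rank
    cols     : global column indices covered by R_colblk
    """
    hh = h_t @ h_t
    w_new = (R_rowblk @ h_t + W_loc[:, t] * hh) / (lam + hh)   # dense form of the CCD formula
    dw_loc = w_new - W_loc[:, t]
    dw_full = np.empty(comm.Get_size() * dw_loc.size)
    comm.Allgather(dw_loc, dw_full)                            # the allgather in the cost model
    R_rowblk -= np.outer(dw_loc, h_t)                          # R <- R - dw h_t^T, row block
    R_colblk -= np.outer(dw_full, h_t[cols])                   # ... and column block
    W_loc[:, t] = w_new
```

With sparse Ω the denominator λ + Σ_{j∈Ω_i} h_jt² differs per row, so the local update loops over the owned nonzeros instead of using dense matrix products.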
Gradient-Based Update

ALS solves for the minimizing w_i exactly; gradient descent methods only improve it
Recall that we seek to minimize

    f(w_i) = Σ_{j∈Ω_i} ( a_ij − w_i h_j^T )² + λ ||w_i||²

and use the partial derivative

    ∂f(w_i)/∂w_i = 2 Σ_{j∈Ω_i} ( w_i h_j^T − a_ij ) h_j + 2 λ w_i = 2 ( λ w_i − Σ_{j∈Ω_i} r_ij h_j )

The gradient descent method updates

    w_i ← w_i − η ∂f(w_i)/∂w_i

where the parameter η is the step size
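A small NumPy sketch of this row-wise gradient step, with hypothetical names and the residuals restricted to the observed entries Ω_i:

```python
import numpy as np

def gd_update_row(i, W, H, R, omega_i, lam, eta):
    """One gradient descent step on row w_i with step size eta."""
    grad = 2.0 * (lam * W[i, :] - R[i, omega_i] @ H[omega_i, :])   # df/dw_i
    W[i, :] -= eta * grad
    # the residuals r_ij for j in Omega_i must then be recomputed from the new w_i
```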
Stochastic Gradient Descent (SGD)

Stochastic gradient descent (SGD) performs fine-grained updates based on a single component of the gradient
Again, the full gradient is

    ∂f(w_i)/∂w_i = 2 ( λ w_i − Σ_{j∈Ω_i} r_ij h_j ) = 2 Σ_{j∈Ω_i} ( λ w_i / |Ω_i| − r_ij h_j )

SGD selects a random (i,j) ∈ Ω and updates w_i using h_j:

    w_i ← w_i − η ( λ w_i / |Ω_i| − r_ij h_j )

SGD then updates r_ij = a_ij − w_i h_j^T
Each update costs O(k) operations
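A minimal sketch of one such update, assuming omega is a list of observed (i, j) pairs, counts[i] = |Ω_i|, and dense arrays (all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng()

def sgd_step(A, W, H, R, omega, counts, lam, eta):
    """Sample one observed entry (i, j), update w_i, and refresh r_ij."""
    i, j = omega[rng.integers(len(omega))]
    W[i, :] -= eta * (lam * W[i, :] / counts[i] - R[i, j] * H[j, :])   # O(k) work
    R[i, j] = A[i, j] - W[i, :] @ H[j, :]                              # refresh the residual
```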
Asynchronous SGD

Parallelizing SGD is easy, aside from ensuring that concurrent updates do not conflict
Asynchronous shared-memory implementations of SGD are popular and achieve high performance
For a sufficiently small step size, inconsistencies among updates (e.g., duplication) are not problematic statistically
Asynchronicity can, however, slow down convergence