Parallel Numerical Algorithms
Chapter 6 – Structured and Low Rank Matrices
Section 6.3 – Numerical Optimization

Michael T. Heath and Edgar Solomonik
Department of Computer Science
University of Illinois at Urbana-Champaign

CS 554 / CSE 512
Outline

1. General Nonlinear Optimization
   - Nonlinear Equations
   - Optimization
2. Matrix Completion
   - Alternating Least Squares
   - Coordinate Descent
   - Gradient Descent
Nonlinear Equations

Potential sources of parallelism in solving a nonlinear equation f(x) = 0 include
- Evaluation of the function f and its derivatives in parallel
- Parallel implementation of linear algebra computations (e.g., solving the linear system in Newton-like methods)
- Simultaneous exploration of different regions via multiple starting points (e.g., if many solutions are sought or convergence is difficult to achieve)
Optimization

Sources of parallelism in optimization problems include
- Evaluation of objective and constraint functions and their derivatives in parallel
- Parallel implementation of linear algebra computations (e.g., solving the linear system in Newton-like methods)
- Simultaneous exploration of different regions via multiple starting points (e.g., if a global optimum is sought or convergence is difficult to achieve)
- Multi-directional searches in direct search methods
- Decomposition methods for structured problems, such as linear, quadratic, or separable programming
Nonlinear Optimization Methods

Goal is to minimize objective function f(x)

- Gradient-based (first-order) methods compute
      x^{(s+1)} = x^{(s)} − α ∇f(x^{(s)})
- Newton's method (second-order) computes
      x^{(s+1)} = x^{(s)} − H_f(x^{(s)})^{−1} ∇f(x^{(s)})
- Alternating methods fix a subset of variables x_1 at a time and minimize (via one of the above two methods)
      g^{(s)}(x_2) = f(x_1^{(s)}, x_2)
- Subgradient methods, such as stochastic gradient descent, assume f(x^{(s)}) = Σ_{i=1}^{n} f_i(x^{(s)}) and compute
      x^{(s+1)} = x^{(s)} − η ∇f_i(x^{(s)})   for i ∈ {1, …, n}
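To make the update rules concrete, here is a minimal NumPy sketch of one step of each method; f, grad_f, hess_f, and grad_f_i are hypothetical user-supplied callables, not anything defined in these slides.

```python
import numpy as np

def gradient_step(x, grad_f, alpha):
    # x^(s+1) = x^(s) - alpha * grad f(x^(s))
    return x - alpha * grad_f(x)

def newton_step(x, grad_f, hess_f):
    # x^(s+1) = x^(s) - H_f(x^(s))^{-1} grad f(x^(s)); solve rather than invert
    return x - np.linalg.solve(hess_f(x), grad_f(x))

def sgd_step(x, grad_f_i, i, eta):
    # x^(s+1) = x^(s) - eta * grad f_i(x^(s)) for one sampled term i
    return x - eta * grad_f_i(i, x)
```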
Parallelism in Nonlinear Optimization

- In gradient-based methods, parallelism is generally found within calculation of ∇f(x^{(s)}), line optimization (if any) to compute α, and the vector sum x^{(s)} − α ∇f(x^{(s)})
- In Newton's method, the main source of parallelism is the linear solve
- Alternating methods often fix x_1 so that g^{(s)}(x_2) may be decomposed into multiple independent problems
      g^{(s)}(w) = g_1^{(s)}(w_1) + · · · + g_k^{(s)}(w_k)
- Subgradient methods exploit the fact that subgradients may be independent, since ∇f_i(x^{(s)}) is generally mostly zero and depends on a subset of the elements of x^{(s)}
- The approximate/randomized nature of subgradient methods can permit chaotic/asynchronous optimization
Case Study: Matrix Completion

Given a subset Ω ⊆ {1, …, m} × {1, …, n} of the entries of matrix A ∈ R^{m×n}, seek a rank-k approximation

      argmin_{W ∈ R^{m×k}, H ∈ R^{n×k}}  Σ_{(i,j) ∈ Ω} ( a_ij − Σ_l w_il h_jl )^2 + λ ( ||W||_F^2 + ||H||_F^2 )

where a_ij − Σ_l w_il h_jl = (A − W H^T)_ij

- Problems of this type are studied in sparse approximation
- Ω may be a randomly selected sample of the entries
- Methods for this problem are typical of numerical optimization and machine learning
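As a point of reference, a short sketch (not from the slides) of evaluating this regularized objective with NumPy; omega is assumed to be a list of observed (i, j) index pairs.

```python
import numpy as np

def completion_objective(A, W, H, omega, lam):
    # sum over (i,j) in Omega of (a_ij - w_i . h_j)^2, plus Frobenius penalties
    loss = sum((A[i, j] - W[i, :] @ H[j, :]) ** 2 for (i, j) in omega)
    return loss + lam * (np.linalg.norm(W, 'fro') ** 2 +
                         np.linalg.norm(H, 'fro') ** 2)
```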
Alternating Least Squares

- Alternating least squares (ALS) fixes W and solves for H, then vice versa, until convergence
- Each step improves the approximation; convergence to a minimum is expected given a satisfactory starting guess
- With H fixed, we have a quadratic optimization problem

      argmin_{W ∈ R^{m×k}}  Σ_{(i,j) ∈ Ω} ( a_ij − Σ_l w_il h_jl )^2 + λ ||W||_F^2

- The optimization problem is independent for each row of W
- Letting w_i = w_{i⋆}, h_j = h_{j⋆}, and Ω_i = { j : (i,j) ∈ Ω }, seek

      argmin_{w_i ∈ R^k}  Σ_{j ∈ Ω_i} ( a_ij − w_i h_j^T )^2 + λ ||w_i||_2^2
ALS: Quadratic Optimization

Seek minimizer w_i for the quadratic vector equation

      f(w_i) = Σ_{j ∈ Ω_i} ( a_ij − w_i h_j^T )^2 + λ ||w_i||_2^2

Differentiating with respect to w_i gives

      ∂f(w_i)/∂w_i = 2 Σ_{j ∈ Ω_i} h_j^T ( w_i h_j^T − a_ij ) + 2 λ w_i = 0

Rotating w_i h_j^T = h_j w_i^T and defining G^{(i)} = Σ_{j ∈ Ω_i} h_j^T h_j,

      ( G^{(i)} + λ I ) w_i^T = Σ_{j ∈ Ω_i} h_j^T a_ij

which is a k × k symmetric linear system of equations
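A possible NumPy sketch of this row update, under the assumption that omega_i is an array of the observed column indices for row i; it forms G^(i) from the corresponding rows of H and solves the k × k system.

```python
import numpy as np

def als_update_row(A, H, omega_i, i, lam):
    k = H.shape[1]
    Hi = H[omega_i, :]                    # rows h_j for j in Omega_i, |Omega_i| x k
    G = Hi.T @ Hi                         # G^(i) = sum_j h_j^T h_j
    rhs = Hi.T @ A[i, omega_i]            # sum_j h_j^T a_ij
    return np.linalg.solve(G + lam * np.eye(k), rhs)   # new w_i (length k)
```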
ALS: Iteration Cost

For updating each w_i, ALS is dominated in cost by two steps:

1. Compute G^{(i)} = Σ_{j ∈ Ω_i} h_j^T h_j
   - dense matrix-matrix product
   - O(|Ω_i| k^2) work
   - logarithmic depth
2. Solve linear system with G^{(i)} + λI
   - dense symmetric k × k linear solve
   - O(k^3) work
   - typically O(k) depth

These can be done for all m rows of W independently
Parallel ALS

- Let each task optimize a row w_i of W
- Need to compute G^{(i)} for each task
- A specific subset of rows of H is needed for each G^{(i)}
- Task execution is embarrassingly parallel if all of H is stored on each processor
Memory-Constrained Parallel ALS

- May not have enough memory to replicate H on all processors
- Communication is required, and its pattern is data-dependent
- Could rotate rows of H along a ring of processors
- Each processor computes contributions to the G^{(i)} it owns
- Requires Θ(p) latency cost for each iteration of ALS
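One way this ring rotation might look with mpi4py, sketched under several assumptions: H is distributed by blocks of rows, omega_local maps each locally owned row i of W to its observed column indices, and only the Gram matrices G^(i) are accumulated here (the right-hand sides and solves would follow analogously). This is illustrative, not the authors' implementation.

```python
# Illustrative sketch only (assumes mpi4py); H is block-row distributed and
# rotated around a ring while each process accumulates its local G^(i) matrices.
from mpi4py import MPI
import numpy as np

def accumulate_grams(comm, H_block, col_offset, omega_local, k):
    """omega_local: dict mapping locally owned row i -> observed column indices j."""
    p, rank = comm.Get_size(), comm.Get_rank()
    G = {i: np.zeros((k, k)) for i in omega_local}
    block, offset = H_block, col_offset
    for _ in range(p):                               # p ring steps -> Theta(p) latency
        for i, cols in omega_local.items():
            local = [j - offset for j in cols
                     if offset <= j < offset + block.shape[0]]
            Hi = block[local, :]                     # rows h_j held in this block
            G[i] += Hi.T @ Hi                        # partial sum of h_j^T h_j
        # pass the block to the next process and receive from the previous one
        block, offset = comm.sendrecv((block, offset),
                                      dest=(rank + 1) % p,
                                      source=(rank - 1) % p)
    return G
```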
Updating a Single Variable

Rather than solving for whole rows w_i, coordinate descent solves for single elements of W; recall

      argmin_{W ∈ R^{m×k}}  Σ_{(i,j) ∈ Ω} ( a_ij − Σ_l w_il h_jl )^2 + λ ||W||_F^2

Coordinate descent finds the best replacement μ for w_it:

      μ = argmin_μ  Σ_{j ∈ Ω_i} ( a_ij − μ h_jt − Σ_{l≠t} w_il h_jl )^2 + λ μ^2

The solution is given by

      μ = Σ_{j ∈ Ω_i} h_jt ( a_ij − Σ_{l≠t} w_il h_jl ) / ( λ + Σ_{j ∈ Ω_i} h_jt^2 )
Coordinate Descent

For all (i,j) ∈ Ω, compute elements r_ij of R = A − W H^T so that we can optimize via

      μ = Σ_{j ∈ Ω_i} h_jt ( a_ij − Σ_{l≠t} w_il h_jl ) / ( λ + Σ_{j ∈ Ω_i} h_jt^2 )
        = Σ_{j ∈ Ω_i} h_jt ( r_ij + w_it h_jt ) / ( λ + Σ_{j ∈ Ω_i} h_jt^2 )

after which we can update R via

      r_ij ← r_ij − ( μ − w_it ) h_jt    ∀ j ∈ Ω_i

both using O(|Ω_i|) operations
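A small NumPy sketch of this elementwise update, assuming R is stored densely for simplicity (in practice only the entries in Ω are kept) and omega_i holds the observed column indices for row i; the helper name update_w_it is illustrative.

```python
import numpy as np

def update_w_it(R, W, H, omega_i, i, t, lam):
    # mu = sum_j h_jt (r_ij + w_it h_jt) / (lam + sum_j h_jt^2), over j in Omega_i
    h_t = H[omega_i, t]
    r_i = R[i, omega_i]
    mu = (h_t @ (r_i + W[i, t] * h_t)) / (lam + h_t @ h_t)
    R[i, omega_i] = r_i - (mu - W[i, t]) * h_t       # r_ij <- r_ij - (mu - w_it) h_jt
    W[i, t] = mu
```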
Cyclic Coordinate Descent (CCD)

- Updating w_i costs O(|Ω_i| k) operations with coordinate descent, rather than O(|Ω_i| k^2 + k^3) operations with ALS
- By solving for all of w_i at once, ALS obtains a more accurate solution than coordinate descent
- Coordinate descent can use different update orderings (see the sketch below):
  - Cyclic coordinate descent (CCD) updates all columns of W, then all columns of H (ALS-like ordering)
  - CCD++ alternates between columns of W and columns of H
- All entries within a column can be updated concurrently
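To illustrate the orderings, a sketch of one CCD sweep built on the elementwise update sketched earlier; update_h_jt is the assumed symmetric counterpart for entries of H (not shown), and omega_rows/omega_cols map rows/columns to their observed indices. CCD++ would instead interleave: for each t, update column t of W and then column t of H.

```python
def ccd_sweep(R, W, H, omega_rows, omega_cols, lam):
    m, k = W.shape
    n = H.shape[0]
    for t in range(k):              # CCD: all columns of W first ...
        for i in range(m):          # entries within a column are independent
            update_w_it(R, W, H, omega_rows[i], i, t, lam)
    for t in range(k):              # ... then all columns of H
        for j in range(n):
            update_h_jt(R, W, H, omega_cols[j], j, t, lam)   # assumed counterpart
```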