Linear and Sublinear Linear Algebra Algorithms: Preconditioning - PowerPoint PPT Presentation
  1. Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra
     Michael W. Mahoney
     ICSI and Dept. of Statistics, UC Berkeley
     (For more info, see: http://www.stat.berkeley.edu/~mmahoney or Google on “Michael Mahoney”)
     August 2015
     Joint work with Jiyan Yang, Yin-Lam Chow, and Christopher Ré

  2. Outline
     ◮ Background
     ◮ A perspective of stochastic optimization
     ◮ Main algorithm and theoretical results
     ◮ Empirical results
     ◮ Connection with coreset methods

  3. RLA and SGD
     ◮ SGD (Stochastic Gradient Descent) methods [1]
       ◮ Widely used in practice because of their scalability, efficiency, and ease of implementation.
       ◮ Work for problems with general convex objective functions.
       ◮ Usually provide asymptotic bounds on the convergence rate.
       ◮ Typically formulated in terms of differentiability assumptions, smoothness assumptions, etc.
     ◮ RLA (Randomized Linear Algebra) methods [2]
       ◮ Better worst-case theoretical guarantees and better control over solution precision.
       ◮ Less flexible (thus far), e.g., in the presence of constraints.
       ◮ E.g., may use an interior point method for solving a constrained subproblem, and this may be less efficient than SGD.
       ◮ Typically formulated (either TCS-style or NLA-style) for worst-case inputs.
     [1] SGD: iteratively solve the problem by approximating the true gradient by the gradient at a single example.
     [2] RLA: construct (with sampling/projections) a random sketch, and use that sketch to solve the subproblem or construct preconditioners for the original problem.

  4. Can we get the “best of both worlds”?
     Consider problems where both methods have something nontrivial to say.
     Definition. Given a matrix A ∈ R^{n×d}, where n ≫ d, a vector b ∈ R^n, and a number p ∈ [1, ∞], the overdetermined ℓp regression problem is
         min_{x∈Z} f(x) = ‖Ax − b‖_p.
     Important special cases:
     ◮ Least Squares: Z = R^d and p = 2.
       ◮ Solved by eigenvector methods with O(nd^2) worst-case running time, or by iterative methods whose running time depends on κ(A).
     ◮ Least Absolute Deviations: Z = R^d and p = 1.
       ◮ The unconstrained ℓ1 regression problem can be formulated as a linear program and solved by an interior-point method.
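As a point of reference for these two special cases, here is a minimal sketch assuming NumPy/SciPy: least squares via a dense direct solver, and least absolute deviations via the standard linear-programming reformulation. The problem sizes and noise level are arbitrary illustration choices.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Least squares (p = 2, Z = R^d): dense direct solver, O(n d^2) worst case.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# Least absolute deviations (p = 1, Z = R^d): LP with slack variables t,
# minimize sum(t) subject to -t <= Ax - b <= t.
c = np.concatenate([np.zeros(d), np.ones(n)])
A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + n))
x_lad = res.x[:d]
```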

  5. Outline
     ◮ Background
     ◮ A perspective of stochastic optimization
     ◮ Main algorithm and theoretical results
     ◮ Empirical results
     ◮ Connection with coreset methods

  6. Deterministic ℓp regression as stochastic optimization
     ◮ Let U ∈ R^{n×d} be a basis of the range space of A in the form U = AF, where F ∈ R^{d×d}.
     ◮ The constrained overdetermined (deterministic) ℓp regression problem is equivalent to the (stochastic) optimization problem
         min_{x∈Z} ‖Ax − b‖_p^p = min_{y∈Y} ‖Uy − b‖_p^p = min_{y∈Y} E_{ξ∼P}[H(y, ξ)],
       where H(y, ξ) = |U_ξ y − b_ξ|^p / p_ξ is the randomized integrand and ξ is a random variable over {1, . . . , n} with distribution P = {p_i}_{i=1}^n.
     ◮ The constraint set of y is given by Y = {y ∈ R^d | y = F^{-1} x, x ∈ Z}.
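A minimal numerical sketch (assuming NumPy; the row-norm-based choice of P here is just one example of a strictly positive distribution) checking that H(y, ξ) = |U_ξ y − b_ξ|^p / p_ξ is an unbiased estimate of ‖Uy − b‖_p^p:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 2000, 5, 1
U = rng.standard_normal((n, d))
b = rng.standard_normal(n)
y = rng.standard_normal(d)

# Any strictly positive sampling distribution works; here, row-norm based.
probs = np.abs(U).sum(axis=1) + 1e-12
probs /= probs.sum()

exact = np.sum(np.abs(U @ y - b) ** p)             # ||Uy - b||_p^p
idx = rng.choice(n, size=200_000, p=probs)         # draws of xi ~ P
H = np.abs(U[idx] @ y - b[idx]) ** p / probs[idx]  # H(y, xi)
print(exact, H.mean())                             # the two values should be close
```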

  7. Brief overview of stochastic optimization
     The standard stochastic optimization problem is of the form
         min_{x∈X} f(x) = E_{ξ∼P}[F(x, ξ)],   (1)
     where ξ is a random data point with underlying distribution P.
     Two computational approaches for solving stochastic optimization problems of the form (1), based on Monte Carlo sampling techniques:
     ◮ SA (Stochastic Approximation):
       ◮ Start with an initial weight x_0, and solve (1) iteratively.
       ◮ In each iteration, a new sample point ξ_t is drawn from distribution P and the current weight is updated using its information (e.g., the (sub)gradient of F(x, ξ_t)).
     ◮ SAA (Sampling Average Approximation):
       ◮ Sample n points ξ_1, . . . , ξ_n from distribution P independently, and solve the Empirical Risk Minimization (ERM) problem
           min_{x∈X} f̂(x) = (1/n) Σ_{i=1}^n F(x, ξ_i).
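To make the SA/SAA contrast concrete, here is a sketch under a simple least-squares objective F(x, ξ) = (a_ξᵀx − b_ξ)²; the helper name draw, the 1/t step size, and the sample sizes are illustrative assumptions, not choices from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
x_true = rng.standard_normal(d)

def draw(m):
    """Draw m data points (a, b) with b = a^T x_true + noise."""
    a = rng.standard_normal((m, d))
    return a, a @ x_true + 0.1 * rng.standard_normal(m)

# SA: stream samples, take one stochastic gradient step per sample.
x = np.zeros(d)
for t in range(1, 5001):
    a_t, b_t = draw(1)
    grad = 2.0 * (a_t[0] @ x - b_t[0]) * a_t[0]  # gradient of F(x, xi_t)
    x -= grad / (t + 1.0)                        # decaying step size (placeholder schedule)

# SAA: draw a batch once, then solve the resulting ERM problem exactly.
A_batch, b_batch = draw(5000)
x_saa, *_ = np.linalg.lstsq(A_batch, b_batch, rcond=None)
```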

  8. Solving ℓp regression via stochastic optimization
     To solve this stochastic optimization problem, typically one needs to answer the following three questions.
     ◮ (C1): How to sample: SAA (i.e., draw samples in a batch mode and deal with the subproblem) or SA (i.e., draw a mini-batch of samples in an online fashion and update the weight after extracting useful information)?
     ◮ (C2): Which probability distribution P (uniform distribution or not) and which basis U (preconditioning or not) to use?
     ◮ (C3): Which solver to use (e.g., how to solve the subproblem in SAA or how to update the weight in SA)?

  9. A unified framework for RLA and SGD
     (“Weighted SGD for Lp Regression with Randomized Preconditioning,” Yang, Chow, Ré, and Mahoney, 2015.)
     [Flowchart: starting from the ℓp regression problem min_x ‖Ax − b‖_p, viewed as the stochastic optimization problem min_y E_{ξ∼P}[|U_ξ y − b_ξ|^p / p_ξ], the choices (C1) how to sample (SA, online vs. SAA, batch), (C2) which U and P to use (naive: uniform P, U = Ā vs. using RLA: non-uniform P, well-conditioned U), and (C3) how to solve, lead to the resulting solvers: vanilla SGD (fast), pwSGD (fast; this presentation), and vanilla RLA with algorithmic leveraging (slow, exact solution of the subproblem).]
     ◮ SA + “naive” P and U: vanilla SGD, whose convergence rate depends (without additional niceness assumptions) on n.
     ◮ SA + “smart” P and U: pwSGD.
     ◮ SAA + “naive” P: uniform sampling RLA algorithm, which may fail if some rows are extremely important (not shown).
     ◮ SAA + “smart” P: RLA (with algorithmic leveraging or random projections), which has strong worst-case theoretical guarantees and high-quality numerical implementations.
     ◮ For unconstrained ℓ2 regression (i.e., LS), SA + “smart” P + “naive” U recovers the weighted randomized Kaczmarz algorithm [Strohmer-Vershynin].

  10. A combined algorithm: pwSGD
      (“Weighted SGD for Lp Regression with Randomized Preconditioning,” Yang, Chow, Ré, and Mahoney, 2015.)
      pwSGD (preconditioned weighted SGD) consists of two main steps:
      1. Apply RLA techniques for preconditioning and construct an importance sampling distribution.
      2. Apply an SGD-like iterative phase with weighted sampling on the preconditioned system.

  11. A closer look: “naive” choices of U and P in SA
      Consider solving ℓ1 regression, and let U = A. If we apply SGD with some distribution P = {p_i}_{i=1}^n, then the relative approximation error is
          (f(x̂) − f(x*)) / f(x̂) = O( ‖x*‖_2 · max_{1≤i≤n} (‖A_i‖_1 / p_i) / ‖Ax* − b‖_1 ),
      where f(x) = ‖Ax − b‖_1 and x* is the optimal solution.
      ◮ If {p_i}_{i=1}^n is the uniform distribution, i.e., p_i = 1/n, then
            (f(x̂) − f(x*)) / f(x̂) = O( n ‖x*‖_2 · M / ‖Ax* − b‖_1 ),
        where M = max_{1≤i≤n} ‖A_i‖_1 is the maximum ℓ1 row norm of A.
      ◮ If {p_i}_{i=1}^n is proportional to the row norms of A, i.e., p_i = ‖A_i‖_1 / Σ_{j=1}^n ‖A_j‖_1, then
            (f(x̂) − f(x*)) / f(x̂) = O( ‖x*‖_2 · ‖A‖_1 / ‖Ax* − b‖_1 ).
      In either case, the expected convergence time for SGD might blow up (i.e., grow with n) as the size of the matrix grows (unless one makes extra assumptions).
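A quick numerical illustration of this blow-up (the planted heavy row and all sizes are assumptions for illustration): when one row carries most of the ℓ1 mass, the quantity max_i ‖A_i‖_1 / p_i driving the bound is about n·M under uniform sampling but only ‖A‖_1 under row-norm sampling.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 10_000, 10
A = rng.standard_normal((n, d))
A[0] *= 1000.0                        # one extremely important row

row_norms = np.abs(A).sum(axis=1)     # ||A_i||_1
M = row_norms.max()

p_uniform = np.full(n, 1.0 / n)
p_rownorm = row_norms / row_norms.sum()

# The bound scales with max_i ||A_i||_1 / p_i (up to the common factor
# ||x*||_2 / ||Ax* - b||_1):
print("uniform :", (row_norms / p_uniform).max())   # ~ n * M
print("row-norm:", (row_norms / p_rownorm).max())   # = ||A||_1 = sum_i ||A_i||_1
```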

  12. A closer look: “smart” choices of U and P in SA
      ◮ Recall that if U is a well-conditioned basis, then (by definition) ‖U‖_1 ≤ α and ‖y*‖_∞ ≤ β ‖Uy*‖_1, for α and β depending on the small dimension d and not the large dimension n.
      ◮ If we use a well-conditioned basis U for the range space of A, and if we choose the sampling probabilities proportional to the row norms of U, i.e., the leverage scores of A, then the resulting convergence rate on the relative error of the objective becomes
            (f(x̂) − f(x*)) / f(x̂) = O( ‖y*‖_2 · ‖U‖_1 / ‖Ū y*‖_1 ),
        where y* is an optimal solution to the transformed problem.
      ◮ Since the condition number αβ of a well-conditioned basis depends only on d, the resulting SGD inherits a convergence rate that, in relative scale, depends on d and is independent of n.
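For intuition in the p = 2 case, here is a minimal sketch of a well-conditioned basis and leverage-score sampling probabilities; it uses an exact thin QR factorization rather than the fast randomized preconditioners the slides have in mind, so it illustrates the quantities involved, not the fast algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5000, 10
A = rng.standard_normal((n, d))

# Thin QR gives A = QR, so U = A R^{-1} = Q is an orthonormal (well-conditioned) basis.
Q, R = np.linalg.qr(A)
U = Q

# Leverage scores of A = squared row norms of U (they sum to d);
# normalize to get the sampling probabilities.
lev = (U ** 2).sum(axis=1)
probs = lev / lev.sum()

# Ratio of the largest sampling probability to the uniform probability 1/n:
print(probs.max() * n, "x more likely than under uniform sampling")
```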

  13. Outline
      ◮ Background
      ◮ A perspective of stochastic optimization
      ◮ Main algorithm and theoretical results
      ◮ Empirical results
      ◮ Connection with coreset methods

  14. A combined algorithm: pwSGD
      (“Weighted SGD for Lp Regression with Randomized Preconditioning,” Yang, Chow, Ré, and Mahoney, 2015.)
      1. Compute R ∈ R^{d×d} such that U = AR^{-1} is an (α, β) well-conditioned basis U for the range space of A.
      2. Compute or estimate ‖U_i‖_p^p with leverage scores λ_i, for i ∈ [n].
      3. Let p_i = λ_i / Σ_{j=1}^n λ_j, for i ∈ [n].
      4. Construct the preconditioner F ∈ R^{d×d} based on R.
      5. For t = 1, . . . , T:
         Pick ξ_t from [n] based on distribution {p_i}_{i=1}^n.
         Set
             c_t = sgn(A_{ξ_t} x_t − b_{ξ_t}) / p_{ξ_t}    if p = 1;
             c_t = 2 (A_{ξ_t} x_t − b_{ξ_t}) / p_{ξ_t}     if p = 2.
         Update x by
             x_{t+1} = x_t − η c_t H^{-1} A_{ξ_t}                                    if Z = R^d;
             x_{t+1} = arg min_{x∈Z} η c_t A_{ξ_t} x + (1/2) ‖x_t − x‖_H^2           otherwise,
         where H = (F F^⊤)^{-1}.
      6. Let x̄ ← (1/T) Σ_{t=1}^T x_t.
      7. Return x̄ for p = 1, or x_T for p = 2.
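Below is a rough sketch of these steps for the unconstrained p = 2 case (Z = R^d). The function name pwsgd_l2, the exact QR preconditioning, and the 1/t step-size schedule are assumptions for illustration; the slides intend fast randomized preconditioning and a properly tuned step size.

```python
import numpy as np

def pwsgd_l2(A, b, T=20_000, eta=0.5, seed=0):
    """Sketch of pwSGD for unconstrained l2 regression (Z = R^d, p = 2)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape

    # Steps 1-3: well-conditioned basis U = A R^{-1} and leverage-score probabilities.
    Q, R = np.linalg.qr(A)
    lev = (Q ** 2).sum(axis=1)
    probs = lev / lev.sum()

    # Step 4: take F = R^{-1}, so H = (F F^T)^{-1} = R^T R and H^{-1} = R^{-1} R^{-T}.
    R_inv = np.linalg.inv(R)

    # Step 5: weighted SGD iterations on the preconditioned system.
    x = np.zeros(d)
    idx = rng.choice(n, size=T, p=probs)
    for t, i in enumerate(idx):
        c = 2.0 * (A[i] @ x - b[i]) / probs[i]              # c_t for p = 2
        step = eta / (t + 1.0)                              # placeholder 1/t schedule
        x = x - step * c * (R_inv @ (R_inv.T @ A[i]))       # x_{t+1} = x_t - eta c_t H^{-1} A_{xi_t}
    return x                                                # step 7: return x_T for p = 2

# Usage sketch
rng = np.random.default_rng(4)
A = rng.standard_normal((2000, 8))
b = A @ rng.standard_normal(8) + 0.05 * rng.standard_normal(2000)
x_hat = pwsgd_l2(A, b)
```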
