

  1. Random Methods for Large-Scale Linear Problems, Variational Inequalities, and Convex Optimization
     Doctoral Thesis Defense
     Mengdi Wang
     Laboratory for Information and Decision Systems (LIDS), Massachusetts Institute of Technology
     April 1st, 2013

  2. A Roadmap
     1. A Roadmap
     2. Stochastic Methods for Linear Systems
     3. Stochastic Methods for Convex Optimization & Variational Inequalities
        - Motivation
        - A Unified Algorithmic Framework
        - The Coupled Convergence Process
     4. Summary
     5. Acknowledgement

  3. The Broader Context of Our Work: Large-Scale Problems
     - Linear systems: $Ax = b$ or $E[A_v] x = E[b_v]$ (inverse problems, regression, statistical learning, approximate DP)
     - Linear & quadratic programming: $\min_{Ax \le b} \; x' \sum_i Q_i x + c'x$ (approximate DP, high-performance computation)
     - Complementarity problems (equilibria, projected equations)
     - Convex problems & variational inequalities: $\min_{x \in \cap_i X_i} \sum_i f_i(x)$ (networks, data-driven problems, cooperative games, online decision making)
     Address large-scale problems by randomization/simulation.

  4. Use Stochastic Methods to Tackle Large Scale
     How to obtain random samples?
     - Importance sampling
     - Adaptive sampling
     - Monte Carlo methods
     - Application/implementation-dependent methods: asynchronous, distributed, irregular, unknown random process, etc.
     How to use random samples?
     - Stochastic approximation
     - Sample average approximation
     - Use Monte Carlo estimates to iterate
     - Modify deterministic methods to allow stochasticity

  5. Our work
     Part 1: Large-scale linear systems $Ax = b$
     - Deal with the joint effect of singularity and stochastic noise
     - Stabilizing divergent iterative methods
     Part 2: Large-scale optimization problems with complicated constraints
     - Combine optimization and feasibility methods with randomness
     - Incremental/online structure: updating based on a part of all constraint/gradient information, using minimal storage to deal with large data sets, allowing various sources of stochasticity
     - Coupled convergence: $x_k \to x^*$ vs. $x_k \to X$

  6. A Roadmap (revisited): next up is Section 2, Stochastic Methods for Linear Systems.

  7. Solving linear systems $Ax = b$ by stochastic sampling
     Assume that $A = E[A_w]$ and $b = E[b_v]$, and that a sequence of samples $\{(A_{w_k}, b_{v_k})\}$ is available.
     Stochastic Approximation (SA):
     $x_{k+1} = x_k - \alpha_k (A_{w_k} x_k - b_{v_k})$
     Using one sample per update is too slow!
     Sample Average Approximation (SAA):
     Obtain finite-sample estimates $A_k = \frac{1}{k}\sum_{t=1}^{k} A_{w_t}$ and $b_k = \frac{1}{k}\sum_{t=1}^{k} b_{v_t}$, then solve $A_k x = b_k$.
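
The contrast between SA and SAA can be made concrete on a small simulated system. Below is a minimal sketch (not from the thesis) in which the samples are assumed to be the true $A$ and $b$ corrupted by zero-mean Gaussian noise; the function name `sample_Ab` and all problem data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # a well-conditioned test matrix
b = rng.standard_normal(n)
x_true = np.linalg.solve(A, b)

def sample_Ab(noise=0.5):
    """One random sample (A_wk, b_vk) with A = E[A_w], b = E[b_v] (illustrative noise model)."""
    return A + noise * rng.standard_normal((n, n)), b + noise * rng.standard_normal(n)

# Stochastic approximation (SA): one sample per update, diminishing stepsize alpha_k = 1/k
x = np.zeros(n)
for k in range(1, 5001):
    A_k, b_k = sample_Ab()
    x = x - (1.0 / k) * (A_k @ x - b_k)
print("SA error :", np.linalg.norm(x - x_true))

# Sample average approximation (SAA): average K samples, then solve A_K x = b_K once
K = 5000
A_bar, b_bar = np.zeros((n, n)), np.zeros(n)
for _ in range(K):
    A_k, b_k = sample_Ab()
    A_bar += A_k / K
    b_bar += b_k / K
print("SAA error:", np.linalg.norm(np.linalg.solve(A_bar, b_bar) - x_true))
```

SAA typically attains a much smaller error here because it solves the system with the averaged estimates, at the cost of forming and solving $A_k x = b_k$.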

  8. Can we do better? Using Monte Carlo estimates
     Given $A_k \xrightarrow{a.s.} A$ and $b_k \xrightarrow{a.s.} b$ at a rate of $1/\sqrt{k}$, iterate as
     $x_{k+1} = x_k - \gamma G (A_k x_k - b_k)$
     If $\rho(I - \gamma G A) < 1$, then geometric convergence!
     Not working if $A$ is (close to) singular (Wang and Bertsekas, 2011). Divergence rate: $x_k \sim e^{\sqrt{k}}$ and $A x_k - b \sim e^{\sqrt{k}}$ w.p. 1.
     - Based on random samples of $A$, we cannot detect the (near) singularity.
     - We still want the nonsingular part of the system.
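
The fixed-stepsize iteration and the condition $\rho(I - \gamma G A) < 1$ can be illustrated with a hedged numerical sketch; the choices $G = A'$ (so that $GA$ is positive semidefinite) and the stepsize $\gamma$ below are illustrative assumptions, not prescriptions from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
b = rng.standard_normal(n)
G = A.T                                        # illustrative scaling matrix; G @ A is then PSD
gamma = 0.5 / np.linalg.norm(A, 2) ** 2        # small enough that rho(I - gamma*G*A) < 1 here

rho = np.max(np.abs(np.linalg.eigvals(np.eye(n) - gamma * G @ A)))
print("spectral radius of I - gamma*G*A:", rho)   # < 1  =>  geometric convergence

# Iterate with Monte Carlo estimates A_k -> A, b_k -> b (error of order 1/sqrt(k))
x = np.zeros(n)
S_A, S_b = np.zeros((n, n)), np.zeros(n)
for k in range(1, 2001):
    S_A += A + 0.5 * rng.standard_normal((n, n))   # running sums of noisy samples
    S_b += b + 0.5 * rng.standard_normal(n)
    A_k, b_k = S_A / k, S_b / k
    x = x - gamma * G @ (A_k @ x - b_k)
print("residual ||Ax - b||:", np.linalg.norm(A @ x - b))
```

With $A$ well conditioned the residual shrinks geometrically until the Monte Carlo error in $(A_k, b_k)$ dominates; the slide's point is that this picture breaks down when $A$ is singular or nearly so.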

  9. Dealing with singularity under noise
     Stabilized iterations (Wang and Bertsekas, 2011): given $A_k \xrightarrow{a.s.} A$ and $b_k \xrightarrow{a.s.} b$ at a rate of $1/\sqrt{k}$, the plain iteration $x_{k+1} = x_k - \gamma G (A_k x_k - b_k)$ may diverge. Add a stabilization term to deal with singularity and multiplicative noise:
     $x_{k+1} = (1 - \delta_k) x_k - \gamma G (A_k x_k - b_k)$,
     where $\delta_k \downarrow 0$, $\sum_k \delta_k = \infty$, and $\delta_k \gg$ noise. Then $x_k \xrightarrow{a.s.}$ some $x^*$.
     Proximal iteration naturally converges (Wang and Bertsekas, 2011):
     $x_{k+1} = \arg\min_x \|A_k x - b_k\|^2 + \lambda \|x_k - x\|^2$
     Then $A x_k - b \xrightarrow{a.s.} 0$ and we can extract a subsequence $\hat{x}_k \xrightarrow{a.s.}$ some $x^*$.
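
A minimal sketch of the stabilization idea on a deliberately singular system. The choices $\delta_k = k^{-1/3}$, $G = I$, and the noise model are illustrative assumptions; the slide's conditions only require $\delta_k \downarrow 0$, $\sum_k \delta_k = \infty$, and $\delta_k$ dominating the noise.

```python
import numpy as np

rng = np.random.default_rng(2)
# A singular but consistent system: A has a zero eigenvalue, and b lies in the range of A
A = np.diag([1.0, 0.5, 0.0])
b = np.array([1.0, 1.0, 0.0])
gamma, G = 0.5, np.eye(3)

def estimates(k, sums):
    """Running Monte Carlo estimates A_k, b_k with error of order 1/sqrt(k) (illustrative noise)."""
    sums[0] += A + 0.3 * rng.standard_normal((3, 3))
    sums[1] += b + 0.3 * rng.standard_normal(3)
    return sums[0] / k, sums[1] / k

def run(stabilized):
    x, sums = np.zeros(3), [np.zeros((3, 3)), np.zeros(3)]
    for k in range(1, 20001):
        A_k, b_k = estimates(k, sums)
        delta = k ** (-1.0 / 3.0) if stabilized else 0.0   # delta_k -> 0, sum_k delta_k = inf
        x = (1.0 - delta) * x - gamma * G @ (A_k @ x - b_k)
    return x

print("plain iteration:     ", run(stabilized=False))   # may drift along the null space of A
print("stabilized iteration:", run(stabilized=True))    # the shrinkage term suppresses that drift
```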

  10. A Roadmap (revisited): next up is Section 3, Stochastic Methods for Convex Optimization & Variational Inequalities, beginning with Motivation.

  11. The problems
      Convex Optimization Problems (COP): $\min_{x \in X} F(x)$, where $F: \Re^n \mapsto \Re$ is convex and continuously differentiable.
      Variational Inequalities (VI): find $x^*$ such that $G(x^*)'(x - x^*) \ge 0$ for all $x \in X$, where $G: \Re^n \mapsto \Re^n$ is strongly monotone.
      Strongly monotone: for some $\sigma > 0$, $(y - x)'(G(y) - G(x)) \ge \sigma \|x - y\|^2$ for all $x, y$.
      VI = COP if $G(x) = \nabla F(x)$.
      Covers equilibria, LP, projected equations, and complementarity problems.
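
To make the VI = COP correspondence concrete, here is a small numerical check (with invented data) that the minimizer of a convex $F$ over a box satisfies the VI condition with $G = \nabla F$.

```python
import numpy as np

# F(x) = 0.5*||x - c||^2 over X = [0, 1]^n; then G(x) = grad F(x) = x - c
c = np.array([1.5, -0.3, 0.4])
x_star = np.clip(c, 0.0, 1.0)            # minimizer of F over the box

# VI condition: G(x*)'(x - x*) >= 0 for all x in X; check it on random points of X
rng = np.random.default_rng(5)
G = x_star - c
vals = [G @ (x - x_star) for x in rng.uniform(0.0, 1.0, size=(1000, 3))]
print(min(vals) >= -1e-12)               # True: the VI holds at the constrained minimizer
```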

  12. We focus on large-scale problems with incremental structure
      Linearly additive objectives:
      - COP: $F(x) = \sum_i F_i(x)$ or $F(x) = E[f(x, v)]$
      - VI: $G(x) = \sum_i G_i(x)$ or $G(x) = E[g(x, v)]$
      Set intersection constraints: $X = \cap_{i=1}^{m} X_i$, where each $X_i$ is closed and convex.
      Applications: machine learning, distributed optimization, computing Nash equilibria.

  13. Difficulty with practical large-scale problems
      Operating with $X = \cap_i X_i$ is difficult, especially for:
      - Big data-driven problems with a huge number of constraints stored on external hard drives
      - Distributed problems where each agent can only access part of all constraints
      - Stochastic process-driven problems whose constraints involve a random process only available through simulation
      Question: why not replace $X$ with a single $X_i$?

  14. Putting two ideas together
      - Gradient projection
      - Alternate projection

  15. Related work
      Incremental COP: $\min_{x \in X} F(x)$ by $x_{k+1} = \Pi_X[x_k - \alpha g(x_k, v_k)]$
      - Stochastic gradient projection (Nedić and Bertsekas 2001, etc.)
      - Incremental proximal (Bertsekas 2010, etc.)
      - Incremental gradient with random projection (Nedić 2011)
      Feasibility problems: finding $x \in \cap_{i \in M} X_i$ by $x_{k+1} = \Pi_{X_{w_k}} x_k$
      - Alternate/cyclic projection (Gubin 1967, Tseng 1990, Deutsch and Hundal 2006-2008, Lewis 2008, etc.)
      - Random projection (Nedić 2010)
      - Super halfspace projection (Censor 2008, etc.)

  16. A Roadmap (revisited): next up is the subsection A Unified Algorithmic Framework.

  17. Existing methods
      Gradient/subgradient projection method for COP: $x_{k+1} = \Pi_X\big[x_k - \alpha_k \nabla F(x_k)\big]$
      Projection method for VI: $x_{k+1} = \Pi_X\big[x_k - \alpha_k G(x_k)\big]$
      Stochastic gradient projection method for COP / projection method for stochastic VI: $x_{k+1} = \Pi_X\big[x_k - \alpha_k g(x_k, v_k)\big]$
      Proximal method for COP: $x_{k+1} = \arg\min_{x \in X}\big\{F(x) + \frac{1}{2\alpha_k}\|x - x_k\|^2\big\}$
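
For concreteness, a sketch of the basic gradient projection method on a toy quadratic over a box; the constraint set $X = [0, 1]^n$ is an assumption chosen so that $\Pi_X$ has a closed form.

```python
import numpy as np

# Toy problem: minimize F(x) = 0.5*||x - c||^2 over the box X = [0, 1]^n
c = np.array([1.5, -0.3, 0.4])

def grad_F(x):
    return x - c

def project_box(x, lo=0.0, hi=1.0):
    """Euclidean projection onto [lo, hi]^n (closed form for a box)."""
    return np.clip(x, lo, hi)

x = np.zeros_like(c)
for k in range(1, 201):
    alpha = 1.0 / np.sqrt(k)              # diminishing stepsize alpha_k
    x = project_box(x - alpha * grad_F(x))
print(x)   # approaches the projection of c onto the box: [1.0, 0.0, 0.4]
```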

  18. The general random incremental algorithm
      A two-step algorithm:
      - Optimality update: $z_k = x_k - \alpha_k g(\bar{x}_k, v_k)$, with $\bar{x}_k = x_k$ or $x_{k+1}$
      - Feasibility update: $x_{k+1} = (1 - \beta_k) z_k + \beta_k \Pi_{X_{w_k}} z_k$
      When $\beta_k = 1$: $x_{k+1} = \Pi_{X_{w_k}}\big[x_k - \alpha_k g(\bar{x}_k, v_k)\big]$, with $\bar{x}_k \in \{x_k, x_{k+1}\}$.
      Analytical difficulty: $x_k$ is no longer feasible!
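
A hedged sketch of the two-step update with $\beta_k = 1$ and $\bar{x}_k = x_k$, using random halfspace constraints and an unbiased stochastic gradient; all problem data below are invented for illustration, and the helper `proj_halfspace` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 6
# X = intersection of m halfspaces {x : a_i' x <= c_i}; F(x) = average over i of 0.5*||x - t_i||^2
a = rng.standard_normal((m, n)); a /= np.linalg.norm(a, axis=1, keepdims=True)
c = np.ones(m)
targets = rng.standard_normal((m, n)) + 2.0

def proj_halfspace(z, i):
    """Projection of z onto {x : a_i' x <= c_i} (closed form, a_i has unit norm)."""
    viol = a[i] @ z - c[i]
    return z if viol <= 0 else z - viol * a[i]

x = np.zeros(n)
for k in range(1, 5001):
    alpha, beta = 1.0 / np.sqrt(k), 1.0
    i = rng.integers(m)                       # sample v_k: pick a random component
    g = x - targets[i]                        # g(x_k, v_k), unbiased for grad F(x_k)
    z = x - alpha * g                         # optimality update
    w = rng.integers(m)                       # sample w_k: pick a random constraint set
    x = (1 - beta) * z + beta * proj_halfspace(z, w)   # feasibility update
print("x_k:", x, " max constraint violation:", float(max(a @ x - c)))
```

Note that each step projects onto only one randomly chosen $X_{w_k}$, so intermediate iterates need not be feasible for $X = \cap_i X_i$, which is exactly the analytical difficulty mentioned above.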

  19. Special cases of the general algorithm
      - Projection algorithm using random projection and stochastic gradient: $x_{k+1} = \Pi_{X_{w_k}}\big[x_k - \alpha_k g(x_k, v_k)\big]$
      - Proximal algorithm using random constraint and random cost function: $x_{k+1} = \arg\min_{x \in X_{w_k}}\big\{F(x, v_k) + \frac{1}{2\alpha_k}\|x - x_k\|^2\big\}$
      - Variations that alternate between proximal and projection steps
      - Successive projection algorithm: $x_{k+1} = \Pi_{X_{w_k}} x_k$
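
For the proximal variant with a random constraint and a random cost, here is a sketch under an assumption that keeps the prox step in closed form: each sampled cost $F(x, v) = \frac{1}{2}\|x - t_v\|^2$ is an isotropic quadratic, so the constrained prox point is the projection of the unconstrained one onto $X_{w_k}$. The data and helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 2, 4
t = rng.standard_normal((m, n)) + 1.0            # F(x, v) = 0.5*||x - t_v||^2
a = rng.standard_normal((m, n)); c = np.ones(m)  # X_i = {x : a_i' x <= c_i}

def proj(z, i):
    viol = a[i] @ z - c[i]
    return z if viol <= 0 else z - viol * a[i] / (a[i] @ a[i])

x = np.zeros(n)
for k in range(1, 3001):
    alpha = 1.0 / np.sqrt(k)
    v, w = rng.integers(m), rng.integers(m)
    # argmin_{x in X_w} 0.5*||x - t_v||^2 + (1/(2*alpha))*||x - x_k||^2:
    # both terms are isotropic quadratics, so minimize the unconstrained sum, then project onto X_w
    p = (alpha * t[v] + x) / (1.0 + alpha)
    x = proj(p, w)
print("x_k:", x, " max constraint violation:", float(max(a @ x - c)))
```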

  20. Sampling schemes for $X_{w_k}$
      - Nearly independent samples, by random sampling such that $\inf_{k \ge 0} P(w_k = X_i \mid \mathcal{F}_k) > 0$ for $i = 1, \ldots, m$
      - Cyclic samples, by cyclic selection or random shuffling, such that $\{X_{w_k}\}$ consists of permutations of $\{X_1, \ldots, X_m\}$
      - Most distant constraint sets, by adaptively selecting $X_{w_k}$ such that $w_k = \arg\max_{i = 1, \ldots, m} \|x_k - \Pi_{X_i} x_k\|$
      - Markov samples, by generating $X_{w_k}$ through a recurrent Markov chain with states $\{X_i\}_{i=1}^{m}$
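
As a concrete instance of the "most distant constraint set" rule, the sketch below selects $w_k = \arg\max_i \|x_k - \Pi_{X_i} x_k\|$ over a few halfspaces and projects onto that set; the halfspace data are illustrative.

```python
import numpy as np

# Halfspaces X_i = {x : a_i' x <= c_i} (illustrative data with nonempty intersection)
a = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
c = np.array([1.0, 1.0, 1.0])

def proj(z, i):
    viol = a[i] @ z - c[i]
    return z if viol <= 0 else z - viol * a[i] / (a[i] @ a[i])

def most_distant_index(x):
    """w_k = argmax_i ||x_k - Pi_{X_i} x_k||: the constraint set x currently violates the most."""
    dists = [np.linalg.norm(x - proj(x, i)) for i in range(len(c))]
    return int(np.argmax(dists))

x = np.array([5.0, 3.0])
for _ in range(50):                      # successive projection onto the most distant set
    x = proj(x, most_distant_index(x))
print(x, a @ x - c)                      # all violations should be (numerically) <= 0
```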

  21. Sampling schemes for $g(x_k, v_k)$
      - Unbiased samples, by random sampling such that $E[g(x, v_k) \mid \mathcal{F}_k] = G(x)$ for all $x$ and $k \ge 0$, w.p. 1
      - Cyclic samples, by cyclic selection or random shuffling of component functions, such that $\mathrm{Avg}_{k \in \text{cycle}} \, E[g(x, v_k) \mid \mathcal{F}_{\text{beginning}}] = G(x)$ for all $x$, w.p. 1
      - Markov samples, by generating $v_k$ through an irreducible Markov chain with invariant distribution $\xi$, such that $E_{v \sim \xi}[g(x, v)] = G(x)$ for all $x$
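
A small sketch contrasting unbiased i.i.d. sampling with cyclic sampling by random shuffling of the component gradients of $F(x) = \sum_i F_i(x)$; the components and stepsizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 4
t = rng.standard_normal((m, 2))          # F(x) = sum_i 0.5*||x - t_i||^2, minimized at the mean of the t_i

def g(x, i):
    """Component gradient g(x, v_k) = m*(x - t_i); its uniform average over i equals G(x) = grad F(x)."""
    return m * (x - t[i])

def unbiased_sampler():
    while True:
        yield int(rng.integers(m))       # i.i.d. uniform: E[g(x, v_k) | F_k] = G(x)

def shuffled_cyclic_sampler():
    while True:
        for i in rng.permutation(m):     # each cycle visits every component exactly once,
            yield int(i)                 # so the average over a full cycle equals G(x)

for sampler in (unbiased_sampler(), shuffled_cyclic_sampler()):
    x = np.zeros(2)
    for k in range(1, 3001):
        x = x - (1.0 / (m + k)) * g(x, next(sampler))
    print(x, " target:", t.mean(axis=0))
```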
