Random Methods for Large-Scale Linear Problems, Variational Inequalities, and Convex Optimization

Doctoral Thesis Defense
Mengdi Wang
Laboratory for Information and Decision Systems (LIDS)
Massachusetts Institute of Technology
April 1, 2013
A Roadmap

1. Stochastic Methods for Linear Systems
2. Stochastic Methods for Convex Optimization & Variational Inequalities
   - Motivation
   - A Unified Algorithmic Framework
   - The Coupled Convergence Process
   - Summary
3. Acknowledgement
The Broader Context of Our Work: Large-Scale Problems

Linear systems: $Ax = b$ or $E[A_v]\, x = E[b_v]$
(inverse problems, regression, statistical learning, approximate DP)
⇓
Linear & quadratic programming: $\min_{Ax \le b}\ x' \big(\sum_i Q_i\big) x + c' x$
(approximate DP, high-performance computation)
⇓
Complementarity problems
(equilibria, projected equations)
⇓
Convex problems & variational inequalities: $\min_{x \in \cap_i X_i}\ \sum_i f_i(x)$
(networks, data-driven problems, cooperative games, online decision making)

Address large-scale problems by randomization/simulation.
Use Stochastic Methods to Tackle Large-Scale Problems

How to obtain random samples?
- Importance sampling
- Adaptive sampling
- Monte Carlo methods
- Application/implementation-dependent methods: asynchronous, distributed, irregular, unknown random process, etc.

How to use random samples?
- Stochastic approximation
- Sample average approximation
- Use Monte Carlo estimates to iterate
- Modify deterministic methods to allow stochasticity
Our Work

Part 1: Large-scale linear systems $Ax = b$
- Deal with the joint effect of singularity and stochastic noise
- Stabilize divergent iterative methods

Part 2: Large-scale optimization problems with complicated constraints
- Combine optimization and feasibility methods with randomness
- Incremental/online structure:
  - updating based on a part of all constraint/gradient information
  - using minimal storage to deal with large data sets
  - allowing various sources of stochasticity
- Coupled convergence: $x_k \to x^*$ vs. $x_k \to X$
A Roadmap — Part 1: Stochastic Methods for Linear Systems
Solving Linear Systems $Ax = b$ by Stochastic Sampling

Assume that $A = E[A_w]$, $b = E[b_v]$. Moreover, a sequence of samples $\{(A_{w_k}, b_{v_k})\}$ is available.

Stochastic Approximation (SA):
$x_{k+1} = x_k - \alpha_k (A_{w_k} x_k - b_{v_k})$
Using one sample per update is too slow!

Sample Average Approximation (SAA):
Obtain finite-sample estimates $A_k = \frac{1}{k}\sum_{t=1}^k A_{w_t}$ and $b_k = \frac{1}{k}\sum_{t=1}^k b_{v_t}$, then solve $A_k x = b_k$.
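To make the two updates concrete, here is a minimal sketch on a toy simulated system; the matrix, noise model, sample budget, and stepsize below are my own illustrative choices, not taken from the thesis.

```python
# Sketch: SA vs. SAA on a toy system Ax = b, where A = E[A_w], b = E[b_v]
# are observed only through noisy samples. All problem data are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # assumed well-conditioned here
b = rng.standard_normal(n)

def sample():
    """One noisy sample (A_w, b_v) with E[A_w] = A, E[b_v] = b."""
    return A + 0.5 * rng.standard_normal((n, n)), b + 0.5 * rng.standard_normal(n)

# Stochastic approximation: one sample per update, diminishing stepsize alpha_k = 1/k.
x_sa = np.zeros(n)
for k in range(1, 5001):
    A_k, b_k = sample()
    x_sa -= (1.0 / k) * (A_k @ x_sa - b_k)

# Sample average approximation: average all samples, then solve the estimated system once.
K = 5000
A_bar, b_bar = np.zeros((n, n)), np.zeros(n)
for _ in range(K):
    A_k, b_k = sample()
    A_bar += A_k / K
    b_bar += b_k / K
x_saa = np.linalg.solve(A_bar, b_bar)

x_star = np.linalg.solve(A, b)
print("SA error: ", np.linalg.norm(x_sa - x_star))
print("SAA error:", np.linalg.norm(x_saa - x_star))
```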
Can We Do Better? Using Monte Carlo Estimates

Given $A_k \xrightarrow{a.s.} A$, $b_k \xrightarrow{a.s.} b$ at a rate of $1/\sqrt{k}$, iterate as
$x_{k+1} = x_k - \gamma G (A_k x_k - b_k)$
If $\rho(I - \gamma G A) < 1$, this yields geometric convergence!

Does not work if $A$ is (close to) singular (Wang and Bertsekas, 2011).
Divergence rate: $A x_k - b \sim e^{\sqrt{k}}$, $x_k \sim e^{\sqrt{k}}$ w.p. 1.
- Based on random samples of $A$, we cannot detect the (near) singularity.
- We still like the nonsingular part of the system.
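A quick numeric sanity check of the spectral-radius condition in the nonsingular case; the choices of $A$, $G$, and $\gamma$ below are hypothetical illustration values.

```python
# Sketch: with exact A, b, the iteration x_{k+1} = x_k - gamma*G*(A x_k - b)
# contracts geometrically whenever rho(I - gamma*G*A) < 1. Toy data only.
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = np.eye(n) + 0.2 * rng.standard_normal((n, n))
b = rng.standard_normal(n)
G = np.eye(n)            # a simple choice of scaling matrix
gamma = 0.5

rho = max(abs(np.linalg.eigvals(np.eye(n) - gamma * G @ A)))
print("spectral radius of I - gamma*G*A:", rho)   # < 1 for this data

x = np.zeros(n)
x_star = np.linalg.solve(A, b)
for k in range(50):
    x = x - gamma * G @ (A @ x - b)
print("error after 50 iterations:", np.linalg.norm(x - x_star))
```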
Dealing with Singularity under Noise

Stabilized Iterations (Wang and Bertsekas, 2011)
Given $A_k \xrightarrow{a.s.} A$, $b_k \xrightarrow{a.s.} b$ at a rate of $1/\sqrt{k}$, the plain iteration $x_{k+1} = x_k - \gamma G (A_k x_k - b_k)$ may diverge. Add a stabilization term to deal with singularity and multiplicative noise:
$x_{k+1} = (1 - \delta_k) x_k - \gamma G (A_k x_k - b_k)$
where $\delta_k \downarrow 0$, $\sum \delta_k = \infty$, and $\delta_k \gg$ noise. Then $x_k \xrightarrow{a.s.}$ some $x^*$.

Proximal Iteration Naturally Converges (Wang and Bertsekas, 2011)
$x_{k+1} = \operatorname{argmin}_x \big\{ \|A_k x - b_k\|^2 + \lambda \|x_k - x\|^2 \big\}$
Then $A x_k - b \xrightarrow{a.s.} 0$, and we can extract a subsequence $\hat{x}_k \xrightarrow{a.s.}$ some $x^*$.
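A rough sketch of both remedies on a small singular but consistent system; the problem data and all tuning constants (noise level, $\delta_k$, $\lambda$, $\gamma$) are illustrative assumptions, not the thesis experiments.

```python
# Sketch: stabilized iteration and proximal iteration for a singular, consistent
# system Ax = b, observed only through Monte Carlo estimates A_k, b_k.
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = np.diag([1.0, 1.0, 0.0])                 # singular A (zero eigenvalue)
b = A @ np.array([1.0, -2.0, 5.0])           # consistent right-hand side
G, gamma, lam = np.eye(n), 0.5, 1.0

A_bar, b_bar = np.zeros((n, n)), np.zeros(n)
x_stab, x_prox = np.zeros(n), np.zeros(n)
for k in range(1, 10001):
    # running Monte Carlo averages: estimation noise decays like 1/sqrt(k)
    A_bar += (A + 0.2 * rng.standard_normal((n, n)) - A_bar) / k
    b_bar += (b + 0.2 * rng.standard_normal(n) - b_bar) / k

    # stabilized iteration: delta_k -> 0, sum delta_k = inf, delta_k >> noise
    delta = 1.0 / k ** 0.4
    x_stab = (1 - delta) * x_stab - gamma * G @ (A_bar @ x_stab - b_bar)

    # proximal iteration: argmin ||A_k x - b_k||^2 + lam * ||x - x_k||^2
    x_prox = np.linalg.solve(A_bar.T @ A_bar + lam * np.eye(n),
                             A_bar.T @ b_bar + lam * x_prox)

# residuals of both iterates shrink as k grows (slowly for the stabilized
# iterate, since delta_k decays slowly by design)
print("residual, stabilized:", np.linalg.norm(A @ x_stab - b))
print("residual, proximal:  ", np.linalg.norm(A @ x_prox - b))
```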
A Roadmap — Part 2: Stochastic Methods for Convex Optimization & Variational Inequalities (Motivation)
The Problems

Convex Optimization Problems (COP): $\min_{x \in X} F(x)$, where $F : \Re^n \mapsto \Re$ is convex and continuously differentiable.

Variational Inequalities (VI): find $x^* \in X$ such that
$G(x^*)'(x - x^*) \ge 0, \quad \forall x \in X$
where $G : \Re^n \mapsto \Re^n$ is strongly monotone.

Strongly monotone: for some $\sigma > 0$,
$(y - x)'\big(G(y) - G(x)\big) \ge \sigma \|x - y\|^2, \quad \forall x, y$

VI = COP if $G(x) = \nabla F(x)$.

Equilibria / LP / projected equations / complementarity problems.
We Focus on Large-Scale Problems with Incremental Structure

Linearly additive objectives
- COP: $F(x) = \sum_i F_i(x)$ or $F(x) = E[f(x, v)]$
- VI: $G(x) = \sum_i G_i(x)$ or $G(x) = E[g(x, v)]$

Set-intersection constraints
- $X = \cap_{i=1}^m X_i$, where each $X_i$ is closed and convex

Applications
- Machine learning / distributed optimization / computing Nash equilibria
Difficulty with Practical Large-Scale Problems

Operating with $X = \cap_i X_i$ is difficult, especially for:
- big data-driven problems with a huge number of constraints stored on external hard drives
- distributed problems where each agent can only access part of all constraints
- stochastic process-driven problems whose constraints involve a random process available only through simulation

Question: why not replace $X$ with a single $X_i$?
Putting Two Ideas Together
- Gradient projection
- Alternating projection
Related Works

Incremental COP: $\min_{x \in X} F(x)$ by $x_{k+1} = \Pi_X\big[x_k - \alpha g(x_k, v_k)\big]$
- stochastic gradient projection (Nedić and Bertsekas 2001, etc.)
- incremental proximal (Bertsekas 2010, etc.)
- incremental gradient with random projection (Nedić 2011)

Feasibility problems: finding $x \in \cap_{i \in M} X_i$ by $x_{k+1} = \Pi_{X_{w_k}} x_k$
- alternating/cyclic projection (Gubin 1967, Tseng 1990, Deutsch and Hundal 2006-2008, Lewis 2008, etc.)
- random projection (Nedić 2010)
- super halfspace projection (Censor 2008, etc.)
A Roadmap — Part 2: Stochastic Methods for Convex Optimization & Variational Inequalities (A Unified Algorithmic Framework)
Existing Methods

Gradient/subgradient projection method for COP:
$x_{k+1} = \Pi_X\big[x_k - \alpha_k \nabla F(x_k)\big]$

Projection method for VI:
$x_{k+1} = \Pi_X\big[x_k - \alpha_k G(x_k)\big]$

Stochastic gradient projection method for COP / projection method for stochastic VI:
$x_{k+1} = \Pi_X\big[x_k - \alpha_k g(x_k, v_k)\big]$

Proximal method for COP:
$x_{k+1} = \operatorname{argmin}_{x \in X}\Big\{F(x) + \frac{1}{2\alpha_k}\|x - x_k\|^2\Big\}$
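For concreteness, here is a minimal sketch of the first of these methods, gradient projection on a toy quadratic over a box; the objective, constraint set, and stepsize are my own example, not from the slides.

```python
# Sketch: gradient projection x_{k+1} = Pi_X[x_k - alpha_k * grad F(x_k)]
# for F(x) = 0.5 x'Qx - c'x over the box X = [0, 1]^n, where the projection
# is coordinatewise clipping. All data are invented for illustration.
import numpy as np

n = 4
Q = np.diag([1.0, 2.0, 3.0, 4.0])
c = np.array([2.0, -1.0, 1.5, 0.5])

def grad_F(x):
    return Q @ x - c

def proj_box(x):                       # Pi_X for X = [0, 1]^n
    return np.clip(x, 0.0, 1.0)

x = np.zeros(n)
alpha = 0.2                            # constant stepsize, small enough for this Q
for k in range(200):
    x = proj_box(x - alpha * grad_F(x))

# For this diagonal Q the problem separates, so the solution is the
# componentwise clip of Q^{-1} c to [0, 1].
print(x)
```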
The General Random Incremental Algorithm

A two-step algorithm:
- Optimality update: $z_k = x_k - \alpha_k g(\bar{x}_k, v_k)$, with $\bar{x}_k = x_k$ or $x_{k+1}$
- Feasibility update: $x_{k+1} = (1 - \beta_k) z_k + \beta_k \Pi_{X_{w_k}} z_k$

When $\beta_k = 1$:
$x_{k+1} = \Pi_{X_{w_k}}\big[x_k - \alpha_k g(\bar{x}_k, v_k)\big], \qquad \bar{x}_k \in \{x_k, x_{k+1}\}$

Analytical difficulty: $x_k$ is no longer feasible!
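A minimal sketch of the two-step scheme, assuming a made-up stochastic quadratic objective, halfspace constraints $X_i$, and uniform constraint sampling; none of these specific choices come from the slides.

```python
# Sketch: minimize F(x) = E[f(x, v)] over X = intersection of halfspaces
# X_i = {x : a_i'x <= b_i}, using a sampled gradient (optimality update)
# and a relaxed projection onto one randomly chosen X_{w_k} (feasibility update).
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 6
A_c = rng.standard_normal((m, n))      # constraint normals a_i
b_c = rng.uniform(0.5, 1.5, m)         # offsets b_i (so x = 0 is feasible)
target = np.array([2.0, -1.0, 1.0])    # F(x) = 0.5 * E||x - (target + v)||^2

def sampled_grad(x):
    v = 0.2 * rng.standard_normal(n)   # unbiased gradient sample g(x, v)
    return x - (target + v)

def proj_halfspace(z, a, b):           # projection onto {x : a'x <= b}
    viol = a @ z - b
    return z if viol <= 0 else z - (viol / (a @ a)) * a

x = np.zeros(n)
for k in range(1, 5001):
    alpha, beta = 1.0 / k, 0.8                    # stepsize and relaxation parameter
    z = x - alpha * sampled_grad(x)               # optimality update (here x_bar_k = x_k)
    i = rng.integers(m)                           # nearly independent sampling of X_{w_k}
    x = (1 - beta) * z + beta * proj_halfspace(z, A_c[i], b_c[i])  # feasibility update

print("final iterate:", x)
print("max constraint violation:", np.max(A_c @ x - b_c))
```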
Special Cases of the General Algorithm

Projection algorithm using random projection and stochastic gradient:
$x_{k+1} = \Pi_{X_{w_k}}\big[x_k - \alpha_k g(x_k, v_k)\big]$

Proximal algorithm using a random constraint and a random cost function:
$x_{k+1} = \operatorname{argmin}_{x \in X_{w_k}}\Big\{F(x, v_k) + \frac{1}{2\alpha_k}\|x - x_k\|^2\Big\}$

Variations that alternate between proximal and projection steps.

Successive projection algorithm:
$x_{k+1} = \Pi_{X_{w_k}} x_k$
Sampling Schemes for $X_{w_k}$

- Nearly independent samples, by random sampling such that
  $\inf_{k \ge 0} P(w_k = X_i \mid \mathcal{F}_k) > 0, \qquad i = 1, \ldots, m$
- Cyclic samples, by cyclic selection or random shuffling, such that $\{X_{w_k}\}$ consists of permutations of $\{X_1, \ldots, X_m\}$
- Most distant constraint sets, by adaptively selecting $X_{w_k}$ such that
  $w_k = \operatorname{argmax}_{i = 1, \ldots, m} \|x_k - \Pi_{X_i} x_k\|$
- Markov samples, by generating $X_{w_k}$ through a recurrent Markov chain with states $\{X_i\}_{i=1}^m$
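As an illustration, the sketch below implements two of these rules, uniform (nearly independent) sampling and the most-distant-set rule, for a collection of halfspaces; the constraint data are invented.

```python
# Sketch of two constraint-sampling rules over halfspaces X_i = {x : a_i'x <= b_i}:
# (i) uniform random sampling, so P(w_k = i) = 1/m > 0 for all i;
# (ii) the most-distant-set rule w_k = argmax_i ||x_k - Pi_{X_i} x_k||.
import numpy as np

rng = np.random.default_rng(4)
m, n = 8, 3
A_c = rng.standard_normal((m, n))
b_c = rng.uniform(0.5, 1.0, m)

def dist_to_halfspace(x, a, b):
    """Distance from x to {y : a'y <= b} (zero if x already satisfies it)."""
    return max(a @ x - b, 0.0) / np.linalg.norm(a)

def sample_uniform(x):
    return int(rng.integers(m))

def sample_most_distant(x):
    d = [dist_to_halfspace(x, A_c[i], b_c[i]) for i in range(m)]
    return int(np.argmax(d))             # the set that x_k violates the most

x = 5.0 * rng.standard_normal(n)         # some (likely infeasible) iterate
print("uniform pick:", sample_uniform(x))
print("most distant pick:", sample_most_distant(x))
```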
Sampling Schemes for $g(x_k, v_k)$

- Unbiased samples, by random sampling such that
  $E\big[g(x, v_k) \mid \mathcal{F}_k\big] = G(x), \qquad \forall x,\ k \ge 0,\ \text{w.p. } 1$
- Cyclic samples, by cyclic selection or random shuffling of component functions, such that
  $\operatorname{Avg}_{k \in \text{cycle}}\, E\big[g(x, v_k) \mid \mathcal{F}_{\text{beginning}}\big] = G(x), \qquad \forall x,\ \text{w.p. } 1$
- Markov samples, by generating $v_k$ through an irreducible Markov chain with invariant distribution $\xi$, such that
  $E_{v \sim \xi}\big[g(x, v)\big] = G(x), \qquad \forall x$
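The sketch below contrasts unbiased sampling with one cycle of random shuffling for a sum of simple component maps; the components $G_i$ are hypothetical, and $G$ is taken in its averaged form $G(x) = \frac{1}{m}\sum_i G_i(x)$.

```python
# Sketch: unbiased sampling vs. cyclic sampling (random shuffling) of component
# maps G_i for G(x) = (1/m) * sum_i G_i(x). The G_i below are invented.
import numpy as np

rng = np.random.default_rng(5)
m, n = 5, 3
targets = rng.standard_normal((m, n))
G_i = lambda i, x: x - targets[i]                # component maps

def unbiased_sample(x):
    i = rng.integers(m)                          # E[g(x, v_k) | F_k] = G(x)
    return G_i(i, x)

def cyclic_cycle(x):
    order = rng.permutation(m)                   # one random shuffle per cycle
    return [G_i(i, x) for i in order]            # cycle average equals G(x) exactly

x = np.zeros(n)
G_true = x - targets.mean(axis=0)
print(np.allclose(np.mean(cyclic_cycle(x), axis=0), G_true))   # True
print(unbiased_sample(x))                                      # a single noisy sample of G(x)
```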