Asynchronous Algorithms for Conic Programs, including Optimal, Infeasible, and Unbounded Ones
Wotao Yin
joint work with: Fei Feng, Robert Hannah, Yanli Liu, Ernest Ryu (UCLA, Math)
DIMACS: Distributed Optimization, Information Processing, and Learning
August 2017
Overview
• conic programming problem (P):
  minimize c^T x subject to Ax = b, x ∈ K,
where K is a closed convex cone
• this talk: a first-order iteration
• parallel: linear speedup, asynchronous
• keeps working even when the problem is unsolvable (infeasible or unbounded)
Approach overview
Douglas-Rachford¹ fixed-point iteration
  z^{k+1} = T z^k
T depends on A, b, c and has nice properties:
• convergence guarantees and rates
• coordinate friendly: break z into m blocks, cost(T_i) ≈ (1/m) cost(T)
• diverges nicely:
  • (P) has no primal-dual solution pair ⇔ ‖z^k‖ → ∞
  • z^{k+1} − z^k tells a whole lot
¹ equivalent to standard ADMM, but the different form is important
Douglas-Rachford splitting (Lions-Mercier'79)
• proximal mapping of a closed function h:
  prox_{γh}(x) = argmin_z { h(z) + (1/(2γ)) ‖z − x‖² }
• Douglas-Rachford splitting (DRS) method solves
  minimize f(x) + g(x)
by iterating z^{k+1} = T z^k, defined as:
  x^{k+1/2} = prox_{γg}(z^k)
  x^{k+1}   = prox_{γf}(2 x^{k+1/2} − z^k)
  z^{k+1}   = z^k + (x^{k+1} − x^{k+1/2})
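A minimal Python sketch of this iteration, assuming prox_f and prox_g are callables for prox_{γf} and prox_{γg} (γ absorbed into them); the usage lines at the bottom solve a toy nonnegativity-constrained problem:

```python
import numpy as np

def drs(prox_f, prox_g, z0, num_iters=1000):
    """Douglas-Rachford splitting z^{k+1} = T z^k for: minimize f(x) + g(x).

    prox_f and prox_g implement prox_{gamma*f} and prox_{gamma*g}
    (the step size gamma is absorbed into the callables).
    """
    z = z0.copy()
    for _ in range(num_iters):
        x_half = prox_g(z)            # x^{k+1/2} = prox_{γg}(z^k)
        x = prox_f(2 * x_half - z)    # x^{k+1}   = prox_{γf}(2x^{k+1/2} − z^k)
        z = z + (x - x_half)          # z^{k+1}   = z^k + (x^{k+1} − x^{k+1/2})
    return x_half                     # x^{k+1/2} converges to a minimizer

# usage sketch: minimize (1/2)||x − a||^2 + indicator(x >= 0), with gamma = 1
a = np.array([1.0, -2.0, 3.0])
prox_g = lambda z: (z + a) / 2.0           # prox of (1/2)||x − a||^2
prox_f = lambda z: np.maximum(z, 0.0)      # projection onto x >= 0
print(drs(prox_f, prox_g, np.zeros(3)))    # -> approx. max(a, 0) = [1, 0, 3]
```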
Apply DRS to conic programming
  minimize c^T x subject to Ax = b, x ∈ K
⇔ minimize g(x) + f(x), where g(x) = c^T x + δ_{A·=b}(x) and f(x) = δ_K(x)
• cone K is nonempty closed convex
• each iteration: project onto K, then project onto {x : Ax = b}
• per-iteration cost: O(n²) if x ∈ R^n (by pre-factorizing AA^T)
• prior work: ADMM for SDP (Wen-Goldfarb-Y.'09)
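A sketch of the two projections, assuming A ∈ R^{m×n} has full row rank so AA^T is positive definite; scipy's Cholesky routines let the one-time factorization be reused every iteration:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_affine_projector(A, b):
    """Return x -> proj_{Ax=b}(x) = x − A^T (AA^T)^{-1} (Ax − b).

    AA^T is factorized once (O(m^3)); each later projection is O(mn),
    i.e. O(n^2) when m ~ n, matching the per-iteration cost above.
    """
    factor = cho_factor(A @ A.T)          # done once, requires full row rank
    def proj(x):
        return x - A.T @ cho_solve(factor, A @ x - b)
    return proj

def proj_nonneg(x):
    """Projection onto the simplest cone K = R^n_+ (componentwise)."""
    return np.maximum(x, 0.0)
```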
Other choices of splitting
• linearized ADMM and primal-dual splitting: avoid inverting the full A
• variants of Frank-Wolfe: avoid expensive projections onto the SDP cone
• subgradient and bundle methods, ...
Coordinate friendly² (CF)
• (block) coordinate update is fast only if the subproblems are simple
• definition: T : H → H is CF if, for any z and i ∈ [m], letting
  z⁺ := (z_1, ..., (Tz)_i, ..., z_m),
it holds that
  cost[ {z, M(z)} ↦ {z⁺, M(z⁺)} ] = O( (1/m) cost[z ↦ Tz] ),
where M(z) is some quantity maintained in memory (a concrete sketch follows below)
² Peng-Wu-Xu-Yan-Y. AMSA'16
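For illustration only (this particular operator is not from the talk): take T z = z − A^T(Az − b) with A ∈ R^{p×n} and maintain M(z) = Az − b; then one coordinate of Tz, plus the cache refresh, costs O(p) instead of the O(pn) full evaluation:

```python
import numpy as np

def cf_coordinate_step(A, b, z, r, i, eta=1.0):
    """One CF coordinate update of T z = z − A^T(Az − b), with cache r = Az − b.

    A full evaluation of Tz costs O(p*n); with the cached residual
    r = M(z), one coordinate plus the cache refresh is O(p).
    """
    tz_i = z[i] - A[:, i] @ r      # (Tz)_i from the cached residual
    delta = eta * (tz_i - z[i])    # relaxed coordinate step
    z[i] += delta                  # z^+ = (z_1, ..., (Tz)_i, ..., z_n)
    r += delta * A[:, i]           # M(z^+) = M(z) + delta * A e_i, O(p)
    return z, r
```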
Composed operators
• 9 rules³ for CF T_1 ∘ T_2 cover many examples
• general principles:
  • T_1 ∘ T_2 inherits the (weaker) separability property
  • if T_1 is CF and T_2 is either cheap, easy-to-maintain, or directly CF, then T_1 ∘ T_2 is CF
  • if T_1 is separable or cheap, it is easier to make T_1 ∘ T_2 CF
³ Peng-Wu-Xu-Yan-Y. AMSA'16
Examples of CF T_1 ∘ T_2
• many convex image processing models
• portfolio optimization
• most sparse optimization problems
• all LPs, all SOCPs, and SDPs without large cones
• most ERM problems
• ...
Example: DRS for SOCP
• second-order cone: Q^n = { x ∈ R^n : x_1 ≥ ‖(x_2, ..., x_n)‖_2 }
• DRS operator has the form T = linear ∘ proj_{Q^{n_1} × ··· × Q^{n_p}}
• CF is trivial if all cones are small
• now, consider a big cone; property:
  proj_{Q^n}(x) = (α x_1, β x_2, ..., β x_n),
where α, β depend on x_1 and γ := ‖(x_2, ..., x_n)‖_2
• given γ and updating x_i, refreshing γ costs O(1)
• by maintaining γ, proj_{Q^n} is cheap, and T = linear ∘ cheap is CF
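A sketch of that bookkeeping, using the standard closed-form projection onto Q^n and caching γ² so that editing one coordinate of x costs O(1):

```python
import numpy as np

def soc_proj_coeffs(x1, gamma):
    """Closed-form projection onto Q^n: return (new first entry, beta).

    proj(x) = x                       if gamma <= x_1   (x already in the cone)
            = 0                       if gamma <= -x_1  (x in the polar cone)
            = ((x_1+gamma)/2)*(1, v/gamma)  otherwise, where v = (x_2, ..., x_n)
    """
    if gamma <= x1:
        return x1, 1.0
    if gamma <= -x1:
        return 0.0, 0.0
    s = 0.5 * (x1 + gamma)
    return s, s / gamma

class CachedSOC:
    """Maintain gamma^2 = ||v||^2 so a one-coordinate edit costs O(1)."""

    def __init__(self, x):
        self.x = np.asarray(x, dtype=float).copy()
        self.gamma_sq = float(self.x[1:] @ self.x[1:])

    def update_coord(self, i, new_val):
        if i >= 1:                                    # O(1) cache refresh
            self.gamma_sq += new_val**2 - self.x[i]**2
        self.x[i] = new_val

    def project(self):
        first, beta = soc_proj_coeffs(self.x[0], np.sqrt(self.gamma_sq))
        return np.concatenate(([first], beta * self.x[1:]))
```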
Fixed-point iterations
• full update: z^{k+1} = T z^k
• (block) coordinate update (CU): choose i_k ∈ [m],
  z_i^{k+1} = z_i^k + η ((T z^k)_i − z_i^k)  if i = i_k;   z_i^{k+1} = z_i^k otherwise
• parallel CU: p agents choose I_k ⊂ [m],
  z_i^{k+1} = z_i^k + η ((T z^k)_i − z_i^k)  if i ∈ I_k;   z_i^{k+1} = z_i^k otherwise
• η depends on properties of T, i_k, and I_k
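A sketch of one serial CU step, assuming a user-supplied T_block(z, i) that returns block i of Tz without forming all of Tz (exactly what coordinate friendliness makes cheap):

```python
import numpy as np

def coordinate_update(T_block, z, eta, m, rng):
    """One serial CU step: only block i_k of z moves.

    T_block(z, i) must return (Tz)_i; under coordinate friendliness
    this costs about (1/m) of a full application of T.
    """
    i = int(rng.integers(m))                    # sample i_k from [m]
    z[i] = z[i] + eta * (T_block(z, i) - z[i])  # relaxed update of block i_k
    return z
```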
Sync-parallel versus async-parallel
[Figure: timelines of three agents. Synchronous: faster agents must wait, leaving idle gaps. Asynchronous: all agents are non-stop.]
ARock: async-parallel CU
• p agents
• every agent continuously does: pick i_k ∈ [m],
  z_i^{k+1} = z_i^k + η ((T z^{k−d_k})_i − z_i^{k−d_k})  if i = i_k;   z_i^{k+1} = z_i^k otherwise
new notation:
• k increases after any agent completes an update
• z^{k−d_k} = (z_1^{k−d_{k,1}}, ..., z_m^{k−d_{k,m}}) may be stale
• allow inconsistent atomic read/write
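A toy Python-thread sketch of this loop (illustrative only: agents read without locking, so snapshots may be stale or inconsistent, and the counter k is implicit; production ARock codes use shared memory with atomic block writes):

```python
import threading
import numpy as np

def arock(T_block, z, eta, m, steps_per_agent=1000, num_agents=4):
    """Toy ARock: each agent repeatedly updates a random block of z
    using a possibly stale, lock-free snapshot z^{k-d_k}."""
    def agent(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps_per_agent):
            i = int(rng.integers(m))
            snapshot = [blk.copy() for blk in z]                # stale/inconsistent read
            z[i] += eta * (T_block(snapshot, i) - snapshot[i])  # write one block back
    threads = [threading.Thread(target=agent, args=(s,)) for s in range(num_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return z
```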
Various theories and meanings
• 1969 – 90s: T is contractive in ‖·‖_{w,∞}, partially/totally async
• recent in the ML community: async SG and async BCD
  • early works: random i_k, bounded delays, E f has sufficient descent, treat delays as noise, delays independent of i_k
  • state-of-the-art: allow essentially cyclic i_k, unbounded delays (tail probability decaying like t^{−4} or faster), Lyapunov analysis, delays as overdue progress, delays can depend on i_k