

  1. Averaging algorithms and distributed optimization
John N. Tsitsiklis, MIT
NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds
December 2010

  2. Outline
• Motivation and applications
• Consensus/averaging in distributed optimization
• Convergence times of consensus/averaging
  – time-invariant case
  – time-varying case

  3. The Setting
• n agents, starting values x_i(0)
• goal: reach consensus on some x*, with either:
  – min_i x_i(0) ≤ x* ≤ max_i x_i(0) (consensus)
  – x* = (x_1(0) + ··· + x_n(0)) / n (averaging)
  – averaging when x_i ∈ {−1, +1} (voting)
• interested in:
  – genuinely distributed algorithms: no synchronization, no "infrastructure" such as spanning trees
  – simple updates, such as x_i := (x_i + x_j) / 2
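The simple pairwise update on the slide can be simulated directly. A minimal sketch, assuming a hypothetical gossip schedule in which a uniformly random pair of agents averages at each step (the schedule and the starting values are illustrative, not from the talk):

```python
import random

# Sketch: n agents repeatedly pick a random pair (i, j) and apply the
# slide's update x_i, x_j := (x_i + x_j) / 2.  The sum is conserved,
# so all values drift toward the average of the starting values.
random.seed(0)
x = [1.0, 5.0, -2.0, 8.0]            # starting values x_i(0)
target = sum(x) / len(x)             # the average they should agree on

for _ in range(2000):
    i, j = random.sample(range(len(x)), 2)
    x[i] = x[j] = (x[i] + x[j]) / 2  # pairwise averaging step

spread = max(x) - min(x)             # disagreement after gossiping
```

Because each step replaces two values by their mean, the sum (and hence the average) is invariant, which is exactly the conservation property the averaging variant relies on.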

  4. Social sciences
• Merging of "expert" opinions
• Evolution of public opinion
• Evolution of reputation
• Modeling of jurors
• Language evolution
• Preference for "simple" models
  – behavior described by "rules of thumb"
  – less complex than Bayesian updating
• Interested in modeling, analysis (descriptive theory)
  – . . . and narratives

  5. Engineering
• Distributed computation and sensor networks
  – Fusion of individual estimates
  – Distributed Kalman filtering
  – Distributed optimization
  – Distributed reinforcement learning
• Networking
  – Load balancing and resource allocation
  – Clock synchronization
  – Reputation management in ad hoc networks
  – Network monitoring
• Multiagent coordination and control
  – Coverage control
  – Monitoring
  – Creating virtual coordinates for geographic routing
  – Decentralized task assignment
  – Flocking

  6. The DeGroot opinion pooling model (1974)
x_i(t+1) = Σ_j a_ij x_j(t),  a_ij ≥ 0,  Σ_j a_ij = 1
x(t+1) = A x(t),  A: stochastic matrix
• Markov chain theory + "mixing conditions":
  → convergence of A^t to a matrix with equal rows
  → convergence of each x_i to Σ_j π_j x_j(0)
  → convergence rate estimates
• Averaging algorithms
  – A doubly stochastic: 1'Ax = 1'x, where 1' = [1 1 . . . 1]
  – x_1 + ··· + x_n is conserved
  – convergence to x* = (x_1(0) + ··· + x_n(0)) / n
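The iteration x(t+1) = Ax(t) is easy to run numerically. A minimal sketch with an illustrative doubly stochastic matrix (this particular A is an assumption for the example, chosen so that both rows and columns sum to 1):

```python
import numpy as np

# DeGroot iteration x(t+1) = A x(t).  With A doubly stochastic the
# average opinion 1'x / n is conserved, and all x_i converge to it.
A = np.array([[0.5, 0.5,  0.0 ],
              [0.5, 0.25, 0.25],
              [0.0, 0.25, 0.75]])
assert np.allclose(A.sum(axis=0), 1.0)   # columns sum to 1
assert np.allclose(A.sum(axis=1), 1.0)   # rows sum to 1

x0 = np.array([3.0, -1.0, 7.0])          # initial opinions
x = x0.copy()
for _ in range(200):
    x = A @ x                            # one round of opinion pooling

consensus = x0.mean()                    # the conserved quantity
```

With a merely (row) stochastic A, the same loop would still converge, but to the weighted combination Σ_j π_j x_j(0) rather than the plain average.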

  7. Part I: Distributed Optimization

  8. Gradient-like methods
min_x f(x);  special case: f(x) = Σ_i f_i(x)
• f, f_i convex
• f smooth: work with ∇f(x)
  – update: x := x − γ ∇f(x)
  – with noise: x := x − γ (∇f(x) + w)  (stochastic approximation, γ_t → 0)
• f nonsmooth: work with a subgradient ∂f(x)
  – update: x := x − γ ∂f(x)  (γ_t → 0)
  – with noise: x := x − γ (∂f(x) + w)
• More sophisticated variants: dual averaging methods
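A minimal sketch of the noisy update x := x − γ_t (∇f(x) + w) with diminishing stepsizes, on an assumed one-dimensional objective f(x) = (x − 2)² (the objective, noise level, and seed are illustrative):

```python
import random

# Stochastic approximation: x := x - gamma_t * (f'(x) + w), gamma_t -> 0.
# Here f(x) = (x - 2)^2, so f'(x) = 2 (x - 2) and the minimizer is x = 2.
random.seed(1)
x = 10.0
for t in range(1, 20001):
    grad = 2.0 * (x - 2.0)            # exact gradient of f
    noise = random.gauss(0.0, 1.0)    # zero-mean gradient noise w
    gamma = 1.0 / t                   # diminishing stepsize gamma_t
    x -= gamma * (grad + noise)
```

With a constant stepsize the iterate would hover around the minimizer at a noise-dependent distance; the γ_t → 0 schedule is what lets the noise average out.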

  9. Smooth f; componentwise decentralization
• x_j^i: agent i's copy of component j
  – update: x_i^i := x_i^i − γ (∂f/∂x_i)(x^i)
  – reconcile: x_j^i := x_j^j  (occasionally; at most B steps apart)
• Analysis: track y = (x_1^1, . . . , x_n^n)
  ‖y − x^i‖ = O(Bγ)
  y := y − γ ∇f(y) + O(Bγ²)
• Convergence theorem for the centralized gradient method remains valid [Bertsekas, JNT, Athans, 86]
  – need γ ∼ 1/B
  – also for the stochastic approximation variant:
    x_i^i := x_i^i − γ ((∂f/∂x_i)(x^i) + w_i)
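A sketch of this scheme under assumed details (the objective, the stepsize, and the fixed reconcile period B are all illustrative choices, not the talk's exact protocol): agent i owns coordinate i, gradient-steps it using a possibly stale local copy of the full vector, and every B steps copies in the other agents' owned coordinates.

```python
import numpy as np

# Componentwise decentralization sketch.  Assumed objective:
# f(x) = sum_j (x_j - j)^2 + 0.1 * (sum_j x_j)^2  (coupled quadratic).
def grad(x):
    return 2.0 * (x - np.arange(len(x))) + 0.2 * x.sum()

n, gamma, B = 4, 0.05, 5
copies = [np.zeros(n) for _ in range(n)]   # x^i: agent i's view of x

for t in range(2000):
    for i in range(n):
        # agent i updates only its own coordinate, from its stale view
        copies[i][i] -= gamma * grad(copies[i])[i]
    if t % B == B - 1:                     # occasional reconcile step
        owned = np.array([copies[j][j] for j in range(n)])
        for i in range(n):
            copies[i] = owned.copy()       # x_j^i := x_j^j

x_final = np.array([copies[j][j] for j in range(n)])
```

The staleness between reconciles is exactly the O(Bγ) perturbation in the analysis; with γ small relative to B the iterates still drive the gradient to zero.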

  10. Smooth f; overlap and cooperate
• Assume (for simplicity) scalar x
  – subscript denotes an agent's value of x
• x_i := x_i − γ ∇f(x_i)  (redundant/useless)
• useful in the presence of noise:
  – update: x_i := x_i − γ (∇f(x_i) + w_i)
  – reconcile: x := x − γ · (1/n) Σ_i (∇f(x_i) + w_i)

  11. Smooth f; overlap and cooperate (ctd.)
• Two-phase version
  – update: x_i := x_i − γ (∇f(x_i) + w_i)
  – reconcile: run consensus algorithm x := Ax
    converges: x_i → y, ∀i, where y = Σ_j π_j x_j, π_j ≥ 0
  y := y − γ Σ_j π_j (∇f(x_j) + w_j)
• expected update direction is still a descent direction
• classical convergence results for the centralized stochastic gradient method, with γ_t → 0, remain valid
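A sketch of the two-phase version under assumed details (common objective f(x) = (x − 3)², complete-graph consensus matrix, Gaussian noise; all illustrative): each round is a noisy gradient step followed by consensus rounds.

```python
import numpy as np

# Two-phase scheme: noisy local gradient steps, then consensus x := A x.
# On the complete graph with equal weights, A averages exactly in one
# round, so the reconcile phase is particularly simple here.
rng = np.random.default_rng(0)
A = np.full((4, 4), 0.25)            # complete-graph averaging matrix
x = np.array([0.0, 10.0, -5.0, 4.0])

for t in range(1, 3001):
    gamma = 1.0 / t
    grads = 2.0 * (x - 3.0) + rng.normal(0.0, 1.0, size=4)
    x = x - gamma * grads            # update phase (per-agent, noisy)
    for _ in range(10):              # reconcile phase: consensus rounds
        x = A @ x
```

Averaging the n noisy gradients reduces the noise variance by a factor of n, which is the payoff for cooperating; the expected direction is unchanged.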

  12. Smooth f; overlap and cooperate (ctd.)
• Interleaved version
  x_i := Σ_j a_ij x_j − γ (∇f(x_i) + w_i)
  – define y = Σ_i π_i x_i
  – note: Σ_i π_i Σ_j a_ij x_j = Σ_i π_i x_i
  y := y − γ Σ_i π_i (∇f(x_i) + w_i)
• |x_i − y| = O(γT · |∇f(y)|)
  T: convergence time (time constant) of the consensus algorithm
  y := y − γ Σ_i π_i (∇f(y) + w_i) + O(γ²T · |∇f(y)|)
• convergence theorem for the centralized stochastic gradient method, with γ_t → 0, remains valid [Bertsekas, JNT, Athans, 86]
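A sketch of the interleaved version under the same assumed setup as before (common objective f(x) = (x − 3)², an illustrative doubly stochastic A on a line graph, Gaussian noise): here a single mixing step x := Ax is fused with each gradient step, instead of running consensus to convergence in between.

```python
import numpy as np

# Interleaved scheme: x_i := sum_j a_ij x_j - gamma (f'(x_i) + w_i).
# A is symmetric and doubly stochastic, so the average y = mean(x)
# follows the usual stochastic gradient recursion exactly.
rng = np.random.default_rng(0)
A = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.5, 0.5]])
x = np.array([0.0, 10.0, -5.0, 4.0])

for t in range(1, 20001):
    gamma = 1.0 / t
    grads = 2.0 * (x - 3.0) + rng.normal(0.0, 1.0, size=4)
    x = A @ x - gamma * grads        # mix and descend in one step
```

The disagreement |x_i − y| is only damped by one application of A per step, so it settles at the O(γT) level from the slide; since γ_t → 0, it still vanishes.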

  13. Smooth, additive f; overlap and cooperate
• f(x) = (1/n) Σ_i f_i(x);  optimality: Σ_i ∇f_i(x) = 0
• Two-phase version
  – update: x_i := x_i − γ ∇f_i(x_i)
  – reconcile: run consensus algorithm x := Ax
    converges: x_i → y, ∀i, where y = Σ_i π_i x_i, π_i ≥ 0
  y := y − γ Σ_i π_i ∇f_i(x_i)
• correctness requires π_i = 1/n
  – use an averaging algorithm (A: doubly stochastic)
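A sketch of the two-phase version for an additive objective, under assumed details (each agent privately holds f_i(x) = (x − c_i)², so the minimizer of f = (1/n) Σ_i f_i is the mean of the c_i; the reconcile phase is exact averaging, i.e., π_i = 1/n as correctness requires):

```python
import numpy as np

# Two-phase scheme for additive f: local gradient steps on the private
# f_i(x) = (x - c_i)^2, then exact averaging (the doubly stochastic,
# equal-weight limit), giving every agent the weight 1/n.
c = np.array([1.0, 4.0, 7.0, 8.0])   # agent i's private target c_i
x = np.zeros(4)

for t in range(1, 2001):
    gamma = 1.0 / t
    x = x - gamma * 2.0 * (x - c)    # update phase: local gradients only
    x[:] = x.mean()                  # reconcile phase: pi_i = 1/n exactly

# all agents approach the global minimizer mean(c) = 5.0
```

If the reconcile phase instead converged to unequal weights π_i, the limit would minimize Σ_i π_i f_i rather than the intended (1/n) Σ_i f_i, which is why a doubly stochastic A matters here.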

  14. Additive f; overlap and cooperate (ctd.)
• Interleaved version
  x_i := Σ_j a_ij x_j − γ ∇f_i(x_i) + w_i
  – define y = (1/n) Σ_i x_i
  y := y − γ (1/n) Σ_i ∇f_i(x_i)
• |x_i − y| = O(γT · Σ_i |∇f_i(y)|)
  – T: convergence time (time constant) of the averaging algorithm
  – for constant γ, the error does not vanish at the optimum
  – optimality possible only with γ_t → 0 (even in the absence of noise)
  – hence studied for nonsmooth f or the stochastic case [Nedic & Ozdaglar, 09; Duchi, Agarwal, & Wainwright, 10]
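A sketch of the interleaved version for the same assumed additive objective (private f_i(x) = (x − c_i)², an illustrative doubly stochastic A, no noise): one mixing step is interleaved with each local gradient step.

```python
import numpy as np

# Interleaved scheme for additive f: x_i := sum_j a_ij x_j - gamma f_i'(x_i).
# A is doubly stochastic, so every agent effectively gets weight 1/n and
# the iterates head for the minimizer of (1/n) sum_i f_i, i.e. mean(c).
c = np.array([1.0, 4.0, 7.0, 8.0])      # agent i's private target c_i
A = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.5, 0.5]])    # doubly stochastic mixing matrix
x = np.zeros(4)

for t in range(1, 10001):
    gamma = 1.0 / t
    local_grads = 2.0 * (x - c)          # agent i only evaluates grad f_i
    x = A @ x - gamma * local_grads

# all agents approach the global minimizer mean(c) = 5.0
```

Note the slide's caveat in action: with a constant γ the iterates would stall at an O(γT)-sized residual even without noise, because ∇f_i does not vanish at the common optimum; the diminishing γ_t is what removes it.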

  15. Convergence times — the big picture
• T_con(n, ε): time for the consensus/averaging algorithm to reduce disagreement from unity to ε
  – generically O(log(1/ε)) · T_con(n); focus on T_con(n)
• T_opt(n, ε): time for the centralized (sub)gradient algorithm to bring the cost gap to ε
  – hide dependence on other constants (bounds on first and second derivatives, stepsize details)
• Two-phase version: O(T_con(n) · T_opt(n, ε))
• Interleaved version: results have the same flavor [Nedic & Ozdaglar, 09; Duchi, Agarwal, & Wainwright, 10]
  – is interleaving faster or "better" than the two-phase version?
• Our mission: study and reduce T_con(n)
  – automatically better overall convergence time, e.g., [Nedic, Olshevsky, Ozdaglar & JNT, 08]

  16. Part II: Consensus and averaging

  17. Convergence time of consensus algorithms
x_i(t+1) = Σ_j a_ij x_j(t),  a_ij ≥ 0,  Σ_j a_ij = 1
x(t+1) = Ax(t),  A: stochastic matrix
• Convergence time: time to get close to "steady state"
• Equal weight to all neighbors:
  – undirected graphs: O(n³), tight; Θ(n²) for line graphs
  – directed graphs: exponential(n) (Landau and Odlyzko, 1981)
• Better results for special graphs (Erdős-Rényi, geometric, small world)

  18. Averaging algorithms
• A doubly stochastic: 1'Ax = 1'x
  – positive diagonal
  – nonzero entries are ≥ α > 0
  – convergence to x* = (x_1(0) + ··· + x_n(0)) / n
  – convergence time = O(n²/α)
• V(x) = Σ_i (x_i − x*)² is a Lyapunov function
  (Nedic, Olshevsky, Ozdaglar & JNT, 09)
• bidirectional graph, natural algorithm:
  x_i := x_i + (1/2n) Σ_{j ∈ neighbors} (x_j − x_i)
  α ∼ 1/n,  convergence time = O(n³)
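The "natural algorithm" is easy to check numerically. A minimal sketch on an assumed small bidirectional graph (a 4-cycle; the graph and starting values are illustrative), verifying both convergence to the exact average and the Lyapunov property of V:

```python
import numpy as np

# Natural averaging update on a bidirectional graph:
# x_i := x_i + (1/2n) * sum over neighbors j of (x_j - x_i).
# The update matrix is symmetric and doubly stochastic, so the values
# converge to the average and V(x) = sum_i (x_i - x*)^2 never increases.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # a 4-cycle
n = 4
x = np.array([2.0, -4.0, 6.0, 0.0])
x_star = x.mean()                          # the conserved average

V = [np.sum((x - x_star) ** 2)]
for _ in range(200):
    diffs = np.zeros(n)
    for i, j in edges:
        diffs[i] += x[j] - x[i]
        diffs[j] += x[i] - x[j]
    x = x + diffs / (2 * n)                # the 1/2n "natural" stepsize
    V.append(np.sum((x - x_star) ** 2))
```

The conservative 1/2n coefficient is what makes the smallest weight α ∼ 1/n, which is exactly where the O(n²/α) = O(n³) bound on the slide comes from.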

  19. A critique
• The consensus/averaging algorithm x := Ax assumes constant a_ij ⇒ fixed graph
  – elect a leader, form a spanning tree, accumulate on the tree
• Want simplicity and robustness in dealing with changing topologies, failures, etc.

  20. Time-Varying/Chaotic Environments
• i.i.d. random graphs: same (in expectation) as fixed graphs; convergence rate ↔ "mixing times" (Boyd et al., 2005)
• Fairly arbitrary sequence of graphs/matrices A(t): worst-case analysis
  x_i(t+1) = Σ_j a_ij(t) x_j(t)
  a_ij(t): nonzero whenever i receives a message from j
  x(t+1) = A(t) x(t)  (inhomogeneous Markov chain)

  21. Consensus convergence
x_i(t+1) = Σ_j a_ij(t) x_j(t)
• a_ii(t) > 0;  a_ij(t) > 0 ⇒ a_ij(t) ≥ α > 0
• "strong connectivity in bounded time": over B time steps the "communication graph" is strongly connected
• Convergence to consensus:
  x_i(t) → x* = convex combination of initial values, ∀i
  (JNT, Bertsekas, Athans, 86; Jadbabaie et al., 03)
• "convergence time": exponential in n and B
  – even with a symmetric graph at each time and equal weight to each neighbor (Cao, Spielman, Morse, 05)
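A sketch of consensus under a time-varying graph, with an assumed matrix sequence (illustrative, not from the talk): A(t) alternates between averaging over edge (0,1) and edge (1,2), so no single-step graph is connected, yet over any B = 2 consecutive steps the union is, and the conditions above (positive diagonals, weights ≥ α) hold.

```python
import numpy as np

# Time-varying consensus x(t+1) = A(t) x(t).  Neither matrix alone
# connects all three agents, but the union over two steps does, so the
# values still reach consensus inside [min x_i(0), max x_i(0)].
A_even = np.array([[0.5, 0.5, 0.0],
                   [0.5, 0.5, 0.0],
                   [0.0, 0.0, 1.0]])   # agents 0 and 1 average
A_odd = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.0, 0.5, 0.5]])    # agents 1 and 2 average
x0 = np.array([0.0, 3.0, 9.0])
x = x0.copy()
for t in range(200):
    x = (A_even if t % 2 == 0 else A_odd) @ x
```

In general the limit is only some convex combination of the initial values; this particular pair of matrices happens to be doubly stochastic, so here it is the average.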

  22. Averaging in Time-Varying Setting
x(t+1) = A(t) x(t)
• (Nedic, Olshevsky, Ozdaglar & JNT, 09)
  – A(t) doubly stochastic, for all t
  – O(n²/α) bound remains valid!
• Improved convergence rate
  – exchange "load" with up to two neighbors at a time
  – can use α = O(1)
  – convergence time: O(n²)
• Averaging in time-varying bidirectional graphs: O(n²), no harder than consensus on fixed graphs
• Various convergence proofs of optimization algorithms remain valid
  – improves the convergence time estimate for subgradient methods [Nedic, Olshevsky, Ozdaglar, JNT, 09]
