Distributed Approaches to Mirror Descent for Stochastic Learning over Rate-Limited Networks Matthew Nokleby, Wayne State University, Detroit MI (joint work with Waheed Bajwa, Rutgers)
Motivation: Autonomous Driving
• Network of autonomous automobiles plus one human-driven car
• Sense "anomalous" driving by the human driver
• Want to sense jointly over the communications links
Challenges:
• Need to detect and act quickly
• Wireless links have limited rate, so nodes can't exchange raw data
Questions:
• How well can devices learn jointly when the links are slow?
• What are good strategies?
Contributions of This Talk
• Frame the problem as distributed stochastic optimization: a network of devices minimizing a common objective function from streams of noisy data
• Focus on the communications aspect: how should nodes collaborate when links have limited rates?
• Define two time scales: one rate for data arrival and one for message exchanges
• Solution: distributed versions of stochastic mirror descent that carefully balance gradient averaging and mini-batching
• Derive network/rate conditions for near-optimum convergence
• Show that accelerated methods provide a substantial speedup
Distributed Stochastic Learning
• Network of m nodes; node i observes an i.i.d. data stream {ξ_i(t)}, where t indexes time
• Nodes communicate over wireless links, modeled by a graph
[Figure: six nodes on a graph, each with its own data stream (ξ_i(1), ξ_i(2), …)]
Stochastic Optimization Model
• Nodes want to solve the stochastic optimization problem
  min_{x∈X} ψ(x) = min_{x∈X} E_ξ[φ(x, ξ)]
• φ is convex, X ⊂ ℝ^d is compact and convex
• ψ has Lipschitz gradients [composite optimization later!]:
  ||∇ψ(x) − ∇ψ(y)|| ≤ L||x − y||,  x, y ∈ X
• Nodes have access to noisy gradients:
  g_i(t) := ∇φ(x_i(t), ξ_i(t)),  E_ξ[g_i(t)] = ∇ψ(x_i(t)),  E_ξ[||g_i(t) − ∇ψ(x_i(t))||²] ≤ σ²
• Each node keeps its own search point x_i(t)
[Figure: the same six-node network, each node with its data stream (ξ_i(1), ξ_i(2), …)]
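As a concrete illustration of this oracle model (not taken from the talk), the sketch below builds a noisy first-order oracle for a least-squares objective ψ(x) = E[(aᵀx − b)²]/2: each call returns an unbiased gradient estimate whose variance is bounded on the compact set X. The names and the Gaussian data model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x_star = rng.standard_normal(d)      # ground-truth minimizer (assumed data model)

def sample_xi():
    """Draw one data point ξ = (a, b) with b = a·x* + noise."""
    a = rng.standard_normal(d)
    b = a @ x_star + 0.1 * rng.standard_normal()
    return a, b

def noisy_gradient(x):
    """g_i(t) = ∇φ(x, ξ) for φ(x, ξ) = 0.5 * (a·x − b)².

    E_ξ[g] = ∇ψ(x), and the variance is bounded on a compact X,
    matching the noisy-gradient model on this slide."""
    a, b = sample_xi()
    return (a @ x - b) * a
```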
Stochastic Mirror Descent
• (Centralized) stochastic optimization is well understood
• Optimum convergence via mirror descent

Algorithm: Stochastic Mirror Descent
  Initialize x_i(0) ← 0
  for t = 1 to T:
    x_i(t) ← P_X[x_i(t−1) − γ_t g_i(t−1)]
    x_i^av(t) ← (1/t) Σ_{τ≤t} x_i(τ)
  end for

• Extensions via Bregman divergences and prox mappings
• After T rounds:
  E[ψ(x_i^av(T)) − ψ(x*)] ≤ O(1)·(L/T + σ/√T)

[Xiao, "Dual averaging methods for regularized stochastic learning and online optimization", 2010]
[Lan, "An Optimal Method for Stochastic Composite Optimization", 2012]
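A minimal Euclidean instance of the algorithm above, assuming a Euclidean projection in place of a general prox mapping and a step size γ_t ∝ 1/√t; it takes a noisy-gradient oracle (such as the hypothetical `noisy_gradient` from the previous sketch) as an argument and projects onto the ball X = {x : ||x|| ≤ R}.

```python
import numpy as np

def project_ball(x, radius=10.0):
    """Euclidean projection onto X = {x : ||x|| <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def stochastic_mirror_descent(noisy_gradient, d, T, gamma0=0.1):
    """SMD with Euclidean prox: x(t) = P_X[x(t-1) - gamma_t * g(t-1)],
    returning the running average x_av(T) = (1/T) * sum_t x(t)."""
    x = np.zeros(d)
    x_av = np.zeros(d)
    for t in range(1, T + 1):
        g = noisy_gradient(x)                         # unbiased gradient at current iterate
        x = project_ball(x - gamma0 / np.sqrt(t) * g) # projected SGD / mirror-descent step
        x_av += (x - x_av) / t                        # incremental iterate averaging
    return x_av
```

For example, `stochastic_mirror_descent(noisy_gradient, d, T=1000)` returns the averaged iterate x^av(T) for the toy least-squares oracle above.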
Stochastic Mirror Descent
• Can speed up convergence via accelerated stochastic mirror descent
• Similar SGD steps, but more complex iterate averaging
• After T rounds:
  E[ψ(x_i(T)) − ψ(x*)] ≤ O(1)·(L/T² + σ/√T)
• This convergence rate is order-optimal
• The noise term dominates in general, but accelerated SMD provides a universal solution to the stochastic optimization problem
• This will prove significant in distributed stochastic learning

[Xiao, "Dual averaging methods for regularized stochastic learning and online optimization", 2010]
[Lan, "An Optimal Method for Stochastic Composite Optimization", 2012]
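A hedged Euclidean sketch in the style of Lan's accelerated method (AC-SA): the same projected stochastic-gradient step is taken at an extrapolated point, and a separate aggregate sequence is maintained. The choices α_t = 2/(t+1) and γ_t ∝ t are the classic ones; the constant below is an illustrative assumption and should be tuned against L and σ (see Lan, 2012) to recover the O(L/T² + σ/√T) bound.

```python
import numpy as np

def accelerated_smd(noisy_gradient, d, T, L=1.0, c=None, project=lambda x: x):
    """Accelerated stochastic mirror descent (Euclidean AC-SA-style sketch).

    alpha_t = 2/(t+1), gamma_t = c*t; c is an illustrative constant here,
    not Lan's exact step-size policy. Pass a projection onto X if needed."""
    if c is None:
        c = 1.0 / (4.0 * L * np.sqrt(T))   # assumed constant for illustration
    x = np.zeros(d)       # prox-center sequence
    x_ag = np.zeros(d)    # aggregate (output) sequence
    for t in range(1, T + 1):
        alpha = 2.0 / (t + 1)
        x_md = (1 - alpha) * x_ag + alpha * x   # extrapolation point
        g = noisy_gradient(x_md)                # noisy gradient at x_md
        x = project(x - c * t * g)              # projected step with gamma_t = c*t
        x_ag = (1 - alpha) * x_ag + alpha * x   # update aggregate sequence
    return x_ag
```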
Back to Distributed Stochastic Learning
• With m nodes, after T rounds, the best possible performance is
  E[ψ(x_i(T)) − ψ(x*)] ≤ O(1)·(L/(mT)² + σ/√(mT))
• Achievable with sufficiently fast communications
• In a distributed computing environment, the noise term is achievable via gradient averaging (a sketch follows below):
  1. Use AllReduce to average the local gradients over a spanning tree
  2. Take an SMD step with the averaged gradient
• Upshot: averaging reduces gradient noise and provides a speedup
• Perfect averages are difficult to compute over wireless networks
• Alternative approaches: average consensus, incremental methods, etc.

[Dekel et al., "Optimal distributed online prediction using mini-batches", 2012]
[Duchi et al., "Dual averaging for distributed optimization…", 2012]
[Ram et al., "Incremental stochastic sub-gradient algorithms for convex optimization", 2009]
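The sketch below is a centralized simulation (an assumption for illustration, not the paper's implementation) of the averaging recipe above: each of m nodes computes a noisy gradient at a common search point, AllReduce is emulated by a simple mean, and a single SMD step is taken with the averaged gradient.

```python
import numpy as np

def averaged_smd_step(x, node_gradients, gamma, project):
    """One gradient-averaging SMD step.

    node_gradients: list of callables, one noisy-gradient oracle per node.
    Averaging the m local gradients emulates AllReduce over a spanning tree
    and reduces the gradient-noise variance from sigma^2 to sigma^2 / m."""
    g_bar = np.mean([grad(x) for grad in node_gradients], axis=0)
    return project(x - gamma * g_bar)
```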
Communications Model
• Nodes connected over an undirected graph G = (V, E)
• In every communications round, each node broadcasts a single gradient-like message m_i(r) to its neighbors
• Rate limitations are modeled by the communications ratio ρ: ρ communications rounds for every data sample that arrives
[Figure: timelines of data rounds ξ_i(t) and comms rounds m_i(r) for ρ = 1/2 (one message per two samples) and ρ = 2 (two messages per sample)]
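To make the two time scales concrete, the helper below (purely illustrative, not from the talk) converts the communications ratio ρ into a per-data-round budget of communication rounds: ρ = 2 yields two message exchanges per sample, while ρ = 1/2 yields one exchange every other sample, matching the timelines in the figure.

```python
def comms_rounds_schedule(rho, num_data_rounds):
    """Number of communication rounds available after each data round.

    Example: rho = 2   -> [2, 2, 2, ...]    (two comms rounds per sample)
             rho = 0.5 -> [0, 1, 0, 1, ...] (one comms round per two samples)"""
    schedule = []
    credit = 0.0
    for _ in range(num_data_rounds):
        credit += rho          # rho comms rounds accrue per data sample
        rounds = int(credit)   # spend whole rounds as they become available
        schedule.append(rounds)
        credit -= rounds
    return schedule
```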
Distributed Mirror Descent Outline
• Distribute stochastic MD via averaging consensus (a sketch follows below):
  1. Nodes obtain local gradients
  2. Compute distributed gradient averages via consensus
  3. Take an MD step using the averaged gradients
[Figure: timeline for ρ = 2 showing data rounds ξ_i(t), consensus rounds m_i(r), and search-point updates x_i(t)]
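A hedged sketch of the three-step outline: each node computes a local gradient, the nodes run the available consensus rounds with a doubly stochastic mixing matrix W built from the graph, and each node then takes its own projected MD step with its approximately averaged gradient. The Metropolis construction of W and the step size are assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np

def metropolis_weights(adjacency):
    """Doubly stochastic mixing matrix W for an undirected graph (Metropolis rule).

    adjacency: (m, m) 0/1 array with zero diagonal."""
    m = adjacency.shape[0]
    deg = adjacency.sum(axis=1)
    W = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if adjacency[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

def distributed_md_round(X, local_grads, W, consensus_rounds, gamma, project):
    """One data round of consensus-based distributed mirror descent.

    X: (m, d) array of search points, one row per node.
    local_grads: callable mapping (node index, x_i) -> noisy gradient g_i."""
    m = X.shape[0]
    G = np.stack([local_grads(i, X[i]) for i in range(m)])   # 1. local gradients
    for _ in range(consensus_rounds):                        # 2. averaging consensus
        G = W @ G
    return np.stack([project(X[i] - gamma * G[i]) for i in range(m)])  # 3. MD step
```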