Decentralized Stochastic Approximation, Optimization, and Multi-Agent Reinforcement Learning
Justin Romberg, Georgia Tech ECE
CAMDA/TAMIDS Seminar, Texas A&M, College Station, Texas
Streaming live from Atlanta, Georgia
March 16, 2020
October 30, 2020
Collaborators
Thinh Doan (Virginia Tech, ECE)
Siva Theja Maguluri (Georgia Tech, ISyE)
Sihan Zeng (Georgia Tech, ECE)
Reinforcement Learning
Ingredients for Distributed RL
Distributed RL is a combination of:
- stochastic approximation
- Markov decision processes
- function representation
- network consensus
(complicated probabilistic analysis)
Fixed point iterations
Classical result (Banach fixed point theorem): when $H(\cdot): \mathbb{R}^N \to \mathbb{R}^N$ is a contraction,
$$\|H(u) - H(v)\| \le \delta \|u - v\|, \quad \delta < 1,$$
then there is a unique fixed point $x^\star$ such that $x^\star = H(x^\star)$, and the iteration
$$x_{k+1} = H(x_k)$$
finds it: $\lim_{k \to \infty} x_k = x^\star$.
Easy proof
Choose any point $x_0$, then take $x_{k+1} = H(x_k)$, so
$$x_{k+1} - x^\star = H(x_k) - x^\star = H(x_k) - H(x^\star),$$
and
$$\|x_{k+1} - x^\star\| = \|H(x_k) - H(x^\star)\| \le \delta \|x_k - x^\star\| \le \delta^{k+1} \|x_0 - x^\star\|,$$
so the convergence is geometric.
Relationship to optimization
Choose any point $x_0$ and take $x_{k+1} = H(x_k)$; then
$$\|x_{k+1} - x^\star\| = \|H(x_k) - H(x^\star)\| \le \delta^{k+1} \|x_0 - x^\star\|.$$
Gradient descent takes $H(x) = x - \alpha \nabla f(x)$ for some differentiable $f$.
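To make the connection concrete, here is a minimal sketch (not from the slides; the quadratic $f$, the matrix $A$, and the step size are illustrative choices) that runs gradient descent as the fixed-point iteration $H(x) = x - \alpha \nabla f(x)$ and prints the geometric decay of $\|x_k - x^\star\|$.

```python
import numpy as np

# Fixed-point view of gradient descent on f(x) = 0.5 * x^T A x - b^T x,
# whose minimizer solves A x = b. With 0 < alpha <= 1 / lambda_max(A),
# H(x) = x - alpha * grad f(x) is a contraction and x_k -> x_star geometrically.
rng = np.random.default_rng(0)
A = np.diag([1.0, 2.0, 5.0])           # positive definite, so f is strongly convex
b = rng.standard_normal(3)
x_star = np.linalg.solve(A, b)         # the unique fixed point of H

alpha = 1.0 / 5.0                      # 1 / lambda_max(A)
H = lambda x: x - alpha * (A @ x - b)  # H(x) = x - alpha * grad f(x)

x = np.zeros(3)
for k in range(25):
    x = H(x)
    if k % 5 == 0:
        print(f"k = {k:2d}   ||x_k - x_star|| = {np.linalg.norm(x - x_star):.2e}")
```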
Fixed point iterations: variation
Take
$$x_{k+1} = x_k + \alpha \left( H(x_k) - x_k \right), \quad 0 < \alpha \le 1.$$
(More conservative: a convex combination of the new iterate and the old.) Then again
$$x_{k+1} = (1 - \alpha) x_k + \alpha H(x_k)$$
and
$$\|x_{k+1} - x^\star\| \le (1 - \alpha) \|x_k - x^\star\| + \alpha \|H(x_k) - H(x^\star)\| \le \big(1 - (1 - \delta)\alpha\big) \|x_k - x^\star\|.$$
We still converge, albeit a little more slowly for $\alpha < 1$.
What if there is noise?
If our observations of $H(\cdot)$ are noisy,
$$x_{k+1} = x_k + \alpha \left( H(x_k) - x_k + \eta_k \right), \quad \mathbb{E}[\eta_k] = 0,$$
then we don't get convergence for fixed $\alpha$, but we do converge to a "ball" around $x^\star$ at a geometric rate.
Stochastic approximation
If our observations of $H(\cdot)$ are noisy,
$$x_{k+1} = x_k + \alpha_k \left( H(x_k) - x_k + \eta_k \right), \quad \mathbb{E}[\eta_k] = 0,$$
then we need to take $\alpha_k \to 0$ as we approach the solution. If we take $\{\alpha_k\}$ such that
$$\sum_{k=0}^{\infty} \alpha_k^2 < \infty, \qquad \sum_{k=0}^{\infty} \alpha_k = \infty,$$
then we do get (much slower) convergence. Example: $\alpha_k = C/(k+1)$.
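A toy sketch of the two regimes (the scalar contraction, noise level, and step-size schedules below are my own illustrative choices, not from the talk): with a constant step the error settles into a noise ball, while the decaying schedule $\alpha_k = 1/(k+1)$ keeps improving, just more slowly.

```python
import numpy as np

# Noisy fixed-point iteration x_{k+1} = x_k + alpha_k (H(x_k) - x_k + eta_k)
# for a scalar contraction H(x) = 0.5 * x + 1, whose fixed point is x_star = 2.
rng = np.random.default_rng(1)
H = lambda x: 0.5 * x + 1.0
x_star = 2.0
sigma = 0.5                                   # noise level

def run(step_size, n_iters=20_000):
    x = 0.0
    for k in range(n_iters):
        alpha = step_size(k)
        noise = sigma * rng.standard_normal()
        x = x + alpha * (H(x) - x + noise)
    return abs(x - x_star)

print("constant alpha = 0.5    :", run(lambda k: 0.5))            # stalls in a noise ball
print("decaying alpha = 1/(k+1):", run(lambda k: 1.0 / (k + 1)))  # keeps improving, slowly
```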
Ingredients for Distributed RL (recap): stochastic approximation, Markov decision processes, function representation, network consensus (complicated probabilistic analysis).
Markov decision process
At time $t$:
1. An agent finds itself in a state $s_t$.
2. It takes action $a_t = \mu(s_t)$.
3. It moves to state $s_{t+1}$ according to $P(s_{t+1} \mid s_t, a_t)$ ...
4. ... and receives reward $R(s_t, a_t, s_{t+1})$.
Long-term reward of policy $\mu$:
$$V_\mu(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, \mu(s_t), s_{t+1}) \;\Big|\; s_0 = s \right]$$
Bellman equation: $V_\mu$ obeys
$$V_\mu(s) = \sum_{z \in \mathcal{S}} P(z \mid s, \mu(s)) \left[ R(s, \mu(s), z) + \gamma V_\mu(z) \right], \qquad \text{i.e.} \quad V_\mu = b_\mu + \gamma P_\mu V_\mu .$$
This is a fixed point equation for $V_\mu$.
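As a concrete illustration of this fixed point equation (the chain, its transition matrix, and rewards below are made up for illustration, not from the talk), one can iterate $V \leftarrow b_\mu + \gamma P_\mu V$ directly when the model is known:

```python
import numpy as np

# Policy evaluation as a fixed-point iteration V <- b_mu + gamma * P_mu @ V
# on a toy 3-state chain under a fixed policy mu.
gamma = 0.9
P_mu = np.array([[0.9, 0.1, 0.0],     # P_mu[s, z] = P(z | s, mu(s))
                 [0.0, 0.8, 0.2],
                 [0.1, 0.0, 0.9]])
b_mu = np.array([1.0, 0.0, 2.0])      # b_mu[s] = expected one-step reward from s

V = np.zeros(3)
for _ in range(500):                  # H(V) = b_mu + gamma * P_mu V is a gamma-contraction
    V = b_mu + gamma * P_mu @ V

V_exact = np.linalg.solve(np.eye(3) - gamma * P_mu, b_mu)   # closed-form fixed point
print(V)
print(V_exact)                        # the iterates match the exact solution
```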
State-action value function ($Q$ function):
$$Q_\mu(s, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, \mu(s_t), s_{t+1}) \;\Big|\; s_0 = s, \, a_0 = a \right]$$
The state-action value function of the optimal policy obeys
$$Q^\star(s, a) = \mathbb{E}\left[ R(s, a, s') + \gamma \max_{a'} Q^\star(s', a') \;\Big|\; s_0 = s, \, a_0 = a \right],$$
and we take $\mu^\star(s) = \arg\max_a Q^\star(s, a)$ ... this is another fixed point equation.
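A minimal sketch of this second fixed point equation, assuming a made-up two-state, two-action MDP: iterate the Bellman optimality operator until it settles at $Q^\star$, then read off $\mu^\star$ greedily.

```python
import numpy as np

# Q-value iteration: repeatedly apply the Bellman optimality operator
#   (T Q)(s, a) = sum_z P(z | s, a) [ R(s, a, z) + gamma * max_b Q(z, b) ],
# which is a gamma-contraction, so Q_k -> Q_star.
gamma = 0.9
P = np.array([[[0.8, 0.2],            # P[s, a, z] = P(z | s, a)
               [0.2, 0.8]],
              [[0.9, 0.1],
               [0.1, 0.9]]])
R = np.array([[0.0, 1.0],             # R[s, a] = expected reward for taking a in s
              [2.0, 0.0]])

Q = np.zeros((2, 2))
for _ in range(500):
    Q = R + gamma * np.einsum("saz,z->sa", P, Q.max(axis=1))

mu_star = Q.argmax(axis=1)            # greedy policy with respect to Q_star
print(Q)
print("optimal policy:", mu_star)
```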
Stochastic approximation for policy evaluation
Fixed point iteration for finding $V_\mu(s)$:
$$V_{t+1}(s) = V_t(s) + \alpha \Big( \underbrace{\textstyle\sum_z P(z \mid s) \left[ R(s, z) + \gamma V_t(z) \right] - V_t(s)}_{H(V_t) - V_t} \Big)$$
In practice, we don't have the model $P(z \mid s)$, only observed data $\{(s_t, s_{t+1})\}$.
Stochastic approximation iteration:
$$V_{t+1}(s_t) = V_t(s_t) + \alpha_t \underbrace{\big( R(s_t, s_{t+1}) + \gamma V_t(s_{t+1}) - V_t(s_t) \big)}_{H(V_t) - V_t + \eta_t}$$
The "noise" is that $s_{t+1}$ is sampled, rather than averaged over. This is different from stochastic gradient descent, since $H(\cdot)$ is in general not a gradient map.
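A sketch of this iteration, tabular TD(0) on the same toy chain as above (rewards here depend only on the departing state, an illustrative simplification, and the step-size schedule is my own choice):

```python
import numpy as np

# Tabular TD(0): estimate V_mu from sampled transitions of the toy chain,
# without ever forming the sum over z that appears in the model-based update.
rng = np.random.default_rng(2)
gamma = 0.9
P_mu = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.2],
                 [0.1, 0.0, 0.9]])
r = np.array([1.0, 0.0, 2.0])          # reward collected when leaving state s

V = np.zeros(3)
s = 0
for t in range(200_000):
    s_next = rng.choice(3, p=P_mu[s])                  # sample, don't average
    alpha = 100.0 / (t + 1000)                         # diminishing step size
    V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])  # TD(0) update
    s = s_next

print(V)   # approaches the fixed point (I - gamma * P_mu)^{-1} r
```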
Ingredients for Distributed RL (recap): stochastic approximation, Markov decision processes, function representation, network consensus (complicated probabilistic analysis).
Function approximation
The state space can be large (or even infinite) ... we need a natural way to parameterize/simplify.
Linear function approximation
Simple (but powerful) model: a linear representation
$$V(s; \theta) = \sum_{k=1}^{K} \theta_k \phi_k(s) = \phi(s)^T \theta, \qquad \phi(s) = \begin{bmatrix} \phi_1(s) \\ \vdots \\ \phi_K(s) \end{bmatrix}$$
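As a hypothetical instance of such a representation (the centers and width below are arbitrary choices, not from the talk), here is a small feature map built from Gaussian radial basis functions on a one-dimensional state space:

```python
import numpy as np

# Linear value-function representation V(s; theta) = phi(s)^T theta with
# K Gaussian radial basis features.
centers = np.linspace(-5.0, 5.0, num=10)       # K = 10 feature centers
width = 1.5

def phi(s):
    """Feature vector phi(s) in R^K for a scalar state s."""
    return np.exp(-0.5 * ((s - centers) / width) ** 2)

theta = np.zeros_like(centers)                 # K parameters to learn
V = lambda s: phi(s) @ theta                   # V(s; theta) = phi(s)^T theta
print(phi(0.0).shape, V(0.0))
```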
Policy evaluation with function approximation
Bellman equation:
$$V(s) = \sum_{z \in \mathcal{S}} P(z \mid s) \left[ R(s, \mu(s), z) + \gamma V(z) \right]$$
Linear approximation:
$$V(s; \theta) = \sum_{k=1}^{K} \theta_k \phi_k(s) = \phi(s)^T \theta$$
These can conflict ...
... but the following iterations
$$\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha_t \big( R(s_t, s_{t+1}) + \gamma V(s_{t+1}; \theta_t) - V(s_t; \theta_t) \big) \nabla_\theta V(s_t; \theta_t) \\
&= \theta_t + \alpha_t \, \phi(s_t) \big( R(s_t, s_{t+1}) + \gamma \phi(s_{t+1})^T \theta_t - \phi(s_t)^T \theta_t \big)
\end{aligned}$$
converge to a "near optimal" $\theta^\star$ (Tsitsiklis and Van Roy, '97).
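A sketch of this TD(0) iteration with linear features on the same toy chain; the features are chosen one-hot here purely so the learned $\theta$ can be checked against the tabular fixed point (any feature map $\phi$ could be substituted):

```python
import numpy as np

# TD(0) with linear function approximation on the toy chain from before.
rng = np.random.default_rng(3)
gamma = 0.9
P_mu = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.2],
                 [0.1, 0.0, 0.9]])
r = np.array([1.0, 0.0, 2.0])
phi = np.eye(3)                         # phi(s) = standard basis vector e_s

theta = np.zeros(3)
s = 0
for t in range(200_000):
    s_next = rng.choice(3, p=P_mu[s])
    alpha = 100.0 / (t + 1000)
    td_error = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
    theta += alpha * td_error * phi[s]  # theta_{t+1} = theta_t + alpha_t * delta_t * phi(s_t)
    s = s_next

print(theta)   # matches (I - gamma * P_mu)^{-1} r in this one-hot case
```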
Ingredients for Distributed RL (recap): stochastic approximation, Markov decision processes, function representation, network consensus (complicated probabilistic analysis).
Network consensus
Each node $i$ in a network has a number $x(i)$. We want each node to agree on the average
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x(i) = \frac{1}{N} \mathbf{1}^T x .$$
Node $i$ communicates with its neighbors $\mathcal{N}_i$. Iterate: take $v_0 = x$, then
$$v_{k+1}(i) = \sum_{j \in \mathcal{N}_i} W_{ij} \, v_k(j), \qquad \text{i.e.} \quad v_{k+1} = W v_k, \quad W \text{ doubly stochastic.}$$
Network consensus convergence
Nodes reach "consensus" quickly: with $v_{k+1} = W v_k$ and $W \mathbf{1} = \mathbf{1}$,
$$v_{k+1} - \bar{x} \mathbf{1} = W v_k - \bar{x} \mathbf{1} = W (v_k - \bar{x} \mathbf{1}),$$
so
$$\|v_{k+1} - \bar{x} \mathbf{1}\| = \|W (v_k - \bar{x} \mathbf{1})\| .$$
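A minimal sketch, assuming a 5-node ring graph and simple mixing weights of my own choosing, showing the consensus iteration $v_{k+1} = W v_k$ driving every node to the average:

```python
import numpy as np

# Average consensus on a 5-node ring with doubly stochastic mixing weights:
# each node averages itself with its two neighbors.
N = 5
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = 0.5
    W[i, (i - 1) % N] = 0.25
    W[i, (i + 1) % N] = 0.25             # rows and columns each sum to 1

rng = np.random.default_rng(4)
x = rng.standard_normal(N)
x_bar = x.mean()

v = x.copy()
for k in range(100):
    v = W @ v                            # v_{k+1} = W v_k
    if k % 25 == 0:
        print(f"k = {k:3d}   max |v(i) - x_bar| = {np.abs(v - x_bar).max():.2e}")
```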