Decentralized Stochastic Approximation, Optimization, and Multi-Agent Reinforcement Learning
Justin Romberg, Georgia Tech ECE
CAMDA/TAMIDS Seminar, Texas A&M, College Station, Texas
Streaming live from Atlanta, Georgia
March 16, 2020
October 30, 2020
Collaborators
Thinh Doan (Virginia Tech, ECE)
Siva Theja Maguluri (Georgia Tech, ISyE)
Sihan Zeng (Georgia Tech, ECE)
Reinforcement Learning
Ingredients for Distributed RL
Distributed RL is a combination of:
- stochastic approximation
- Markov decision processes
- function representation
- network consensus
(complicated probabilistic analysis)
Fixed point iterations
Classical result (Banach fixed point theorem): when $H(\cdot): \mathbb{R}^N \to \mathbb{R}^N$ is a contraction,
$$\|H(u) - H(v)\| \le \delta \|u - v\|, \quad \delta < 1,$$
then there is a unique fixed point $x^\star$ such that $x^\star = H(x^\star)$, and the iteration
$$x_{k+1} = H(x_k)$$
finds it: $\lim_{k \to \infty} x_k = x^\star$.
Easy proof
Choose any point $x_0$, then take $x_{k+1} = H(x_k)$, so
$$x_{k+1} - x^\star = H(x_k) - x^\star = H(x_k) - H(x^\star),$$
and
$$\|x_{k+1} - x^\star\| = \|H(x_k) - H(x^\star)\| \le \delta \|x_k - x^\star\| \le \delta^{k+1} \|x_0 - x^\star\|,$$
so the convergence is geometric.
Relationship to optimization
Choose any point $x_0$ and take $x_{k+1} = H(x_k)$; then
$$\|x_{k+1} - x^\star\| = \|H(x_k) - H(x^\star)\| \le \delta^{k+1} \|x_0 - x^\star\|.$$
Gradient descent takes $H(x) = x - \alpha \nabla f(x)$ for some differentiable $f$.
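To make the connection concrete, here is a minimal sketch (not from the slides; the quadratic $f$, the matrix $A$, and the step size are illustrative choices) that runs gradient descent as the fixed-point iteration $H(x) = x - \alpha \nabla f(x)$ and prints the geometric decay of $\|x_k - x^\star\|$.

```python
import numpy as np

# Fixed-point view of gradient descent on f(x) = 0.5 * x^T A x - b^T x,
# whose minimizer solves A x = b. With 0 < alpha <= 1 / lambda_max(A),
# H(x) = x - alpha * grad f(x) is a contraction and x_k -> x_star geometrically.
rng = np.random.default_rng(0)
A = np.diag([1.0, 2.0, 5.0])           # positive definite, so f is strongly convex
b = rng.standard_normal(3)
x_star = np.linalg.solve(A, b)         # the unique fixed point of H

alpha = 1.0 / 5.0                      # 1 / lambda_max(A)
H = lambda x: x - alpha * (A @ x - b)  # H(x) = x - alpha * grad f(x)

x = np.zeros(3)
for k in range(25):
    x = H(x)
    if k % 5 == 0:
        print(f"k = {k:2d}   ||x_k - x_star|| = {np.linalg.norm(x - x_star):.2e}")
```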
Fixed point iterations: variation
Take
$$x_{k+1} = x_k + \alpha \left( H(x_k) - x_k \right), \quad 0 < \alpha \le 1.$$
(More conservative: a convex combination of the new iterate and the old.) Then again
$$x_{k+1} = (1 - \alpha) x_k + \alpha H(x_k)$$
and
$$\|x_{k+1} - x^\star\| \le (1 - \alpha) \|x_k - x^\star\| + \alpha \|H(x_k) - H(x^\star)\| \le \big(1 - (1 - \delta)\alpha\big) \|x_k - x^\star\|.$$
We still converge, albeit a little more slowly for $\alpha < 1$.
What if there is noise?
If our observations of $H(\cdot)$ are noisy,
$$x_{k+1} = x_k + \alpha \left( H(x_k) - x_k + \eta_k \right), \quad \mathbb{E}[\eta_k] = 0,$$
then we don't get convergence for fixed $\alpha$, but we do converge to a "ball" around $x^\star$ at a geometric rate.
Stochastic approximation
If our observations of $H(\cdot)$ are noisy,
$$x_{k+1} = x_k + \alpha_k \left( H(x_k) - x_k + \eta_k \right), \quad \mathbb{E}[\eta_k] = 0,$$
then we need to take $\alpha_k \to 0$ as we approach the solution. If we take $\{\alpha_k\}$ such that
$$\sum_{k=0}^{\infty} \alpha_k^2 < \infty, \qquad \sum_{k=0}^{\infty} \alpha_k = \infty,$$
then we do get (much slower) convergence. Example: $\alpha_k = C/(k+1)$.
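A toy sketch of the two regimes (the scalar contraction, noise level, and step-size schedules below are my own illustrative choices, not from the talk): with a constant step the error settles into a noise ball, while the decaying schedule $\alpha_k = 1/(k+1)$ keeps improving, just more slowly.

```python
import numpy as np

# Noisy fixed-point iteration x_{k+1} = x_k + alpha_k (H(x_k) - x_k + eta_k)
# for a scalar contraction H(x) = 0.5 * x + 1, whose fixed point is x_star = 2.
rng = np.random.default_rng(1)
H = lambda x: 0.5 * x + 1.0
x_star = 2.0
sigma = 0.5                                   # noise level

def run(step_size, n_iters=20_000):
    x = 0.0
    for k in range(n_iters):
        alpha = step_size(k)
        noise = sigma * rng.standard_normal()
        x = x + alpha * (H(x) - x + noise)
    return abs(x - x_star)

print("constant alpha = 0.5    :", run(lambda k: 0.5))            # stalls in a noise ball
print("decaying alpha = 1/(k+1):", run(lambda k: 1.0 / (k + 1)))  # keeps improving, slowly
```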
Ingredients for Distributed RL (recap): stochastic approximation, Markov decision processes, function representation, network consensus (complicated probabilistic analysis).
Markov decision process
At time $t$:
1. An agent finds itself in a state $s_t$.
2. It takes action $a_t = \mu(s_t)$.
3. It moves to state $s_{t+1}$ according to $P(s_{t+1} \mid s_t, a_t)$ ...
4. ... and receives reward $R(s_t, a_t, s_{t+1})$.
Long-term reward of policy $\mu$:
$$V_\mu(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, \mu(s_t), s_{t+1}) \;\Big|\; s_0 = s \right]$$
Bellman equation: $V_\mu$ obeys
$$V_\mu(s) = \sum_{z \in \mathcal{S}} P(z \mid s, \mu(s)) \left[ R(s, \mu(s), z) + \gamma V_\mu(z) \right], \qquad \text{i.e.} \quad V_\mu = b_\mu + \gamma P_\mu V_\mu .$$
This is a fixed point equation for $V_\mu$.
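As a concrete illustration of this fixed point equation (the chain, its transition matrix, and rewards below are made up for illustration, not from the talk), one can iterate $V \leftarrow b_\mu + \gamma P_\mu V$ directly when the model is known:

```python
import numpy as np

# Policy evaluation as a fixed-point iteration V <- b_mu + gamma * P_mu @ V
# on a toy 3-state chain under a fixed policy mu.
gamma = 0.9
P_mu = np.array([[0.9, 0.1, 0.0],     # P_mu[s, z] = P(z | s, mu(s))
                 [0.0, 0.8, 0.2],
                 [0.1, 0.0, 0.9]])
b_mu = np.array([1.0, 0.0, 2.0])      # b_mu[s] = expected one-step reward from s

V = np.zeros(3)
for _ in range(500):                  # H(V) = b_mu + gamma * P_mu V is a gamma-contraction
    V = b_mu + gamma * P_mu @ V

V_exact = np.linalg.solve(np.eye(3) - gamma * P_mu, b_mu)   # closed-form fixed point
print(V)
print(V_exact)                        # the iterates match the exact solution
```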
State-action value function ($Q$ function):
$$Q_\mu(s, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, \mu(s_t), s_{t+1}) \;\Big|\; s_0 = s, \, a_0 = a \right]$$
The state-action value function of the optimal policy obeys
$$Q^\star(s, a) = \mathbb{E}\left[ R(s, a, s') + \gamma \max_{a'} Q^\star(s', a') \;\Big|\; s_0 = s, \, a_0 = a \right],$$
and we take $\mu^\star(s) = \arg\max_a Q^\star(s, a)$ ... this is another fixed point equation.
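A minimal sketch of this second fixed point equation, assuming a made-up two-state, two-action MDP: iterate the Bellman optimality operator until it settles at $Q^\star$, then read off $\mu^\star$ greedily.

```python
import numpy as np

# Q-value iteration: repeatedly apply the Bellman optimality operator
#   (T Q)(s, a) = sum_z P(z | s, a) [ R(s, a, z) + gamma * max_b Q(z, b) ],
# which is a gamma-contraction, so Q_k -> Q_star.
gamma = 0.9
P = np.array([[[0.8, 0.2],            # P[s, a, z] = P(z | s, a)
               [0.2, 0.8]],
              [[0.9, 0.1],
               [0.1, 0.9]]])
R = np.array([[0.0, 1.0],             # R[s, a] = expected reward for taking a in s
              [2.0, 0.0]])

Q = np.zeros((2, 2))
for _ in range(500):
    Q = R + gamma * np.einsum("saz,z->sa", P, Q.max(axis=1))

mu_star = Q.argmax(axis=1)            # greedy policy with respect to Q_star
print(Q)
print("optimal policy:", mu_star)
```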
Stochastic approximation for policy evaluation
Fixed point iteration for finding $V_\mu(s)$:
$$V_{t+1}(s) = V_t(s) + \alpha \Big( \underbrace{\textstyle\sum_z P(z \mid s) \left[ R(s, z) + \gamma V_t(z) \right] - V_t(s)}_{H(V_t) - V_t} \Big)$$
In practice, we don't have the model $P(z \mid s)$, only observed data $\{(s_t, s_{t+1})\}$.
Stochastic approximation iteration:
$$V_{t+1}(s_t) = V_t(s_t) + \alpha_t \underbrace{\big( R(s_t, s_{t+1}) + \gamma V_t(s_{t+1}) - V_t(s_t) \big)}_{H(V_t) - V_t + \eta_t}$$
The "noise" is that $s_{t+1}$ is sampled, rather than averaged over. This is different from stochastic gradient descent, since $H(\cdot)$ is in general not a gradient map.
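A sketch of this iteration, tabular TD(0) on the same toy chain as above (rewards here depend only on the departing state, an illustrative simplification, and the step-size schedule is my own choice):

```python
import numpy as np

# Tabular TD(0): estimate V_mu from sampled transitions of the toy chain,
# without ever forming the sum over z that appears in the model-based update.
rng = np.random.default_rng(2)
gamma = 0.9
P_mu = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.2],
                 [0.1, 0.0, 0.9]])
r = np.array([1.0, 0.0, 2.0])          # reward collected when leaving state s

V = np.zeros(3)
s = 0
for t in range(200_000):
    s_next = rng.choice(3, p=P_mu[s])                  # sample, don't average
    alpha = 100.0 / (t + 1000)                         # diminishing step size
    V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])  # TD(0) update
    s = s_next

print(V)   # approaches the fixed point (I - gamma * P_mu)^{-1} r
```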
Ingredients for Distributed RL (recap): stochastic approximation, Markov decision processes, function representation, network consensus (complicated probabilistic analysis).
Function approximation
The state space can be large (or even infinite) ... we need a natural way to parameterize/simplify.
Linear function approximation
Simple (but powerful) model: a linear representation
$$V(s; \theta) = \sum_{k=1}^{K} \theta_k \phi_k(s) = \phi(s)^T \theta, \qquad \phi(s) = \begin{bmatrix} \phi_1(s) \\ \vdots \\ \phi_K(s) \end{bmatrix}$$
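As a hypothetical instance of such a representation (the centers and width below are arbitrary choices, not from the talk), here is a small feature map built from Gaussian radial basis functions on a one-dimensional state space:

```python
import numpy as np

# Linear value-function representation V(s; theta) = phi(s)^T theta with
# K Gaussian radial basis features.
centers = np.linspace(-5.0, 5.0, num=10)       # K = 10 feature centers
width = 1.5

def phi(s):
    """Feature vector phi(s) in R^K for a scalar state s."""
    return np.exp(-0.5 * ((s - centers) / width) ** 2)

theta = np.zeros_like(centers)                 # K parameters to learn
V = lambda s: phi(s) @ theta                   # V(s; theta) = phi(s)^T theta
print(phi(0.0).shape, V(0.0))
```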
Policy evaluation with function approximation
Bellman equation:
$$V(s) = \sum_{z \in \mathcal{S}} P(z \mid s) \left[ R(s, \mu(s), z) + \gamma V(z) \right]$$
Linear approximation:
$$V(s; \theta) = \sum_{k=1}^{K} \theta_k \phi_k(s) = \phi(s)^T \theta$$
These can conflict ...
... but the following iterations
$$\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha_t \big( R(s_t, s_{t+1}) + \gamma V(s_{t+1}; \theta_t) - V(s_t; \theta_t) \big) \nabla_\theta V(s_t; \theta_t) \\
&= \theta_t + \alpha_t \, \phi(s_t) \big( R(s_t, s_{t+1}) + \gamma \phi(s_{t+1})^T \theta_t - \phi(s_t)^T \theta_t \big)
\end{aligned}$$
converge to a "near optimal" $\theta^\star$ (Tsitsiklis and Van Roy, '97).
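A sketch of this TD(0) iteration with linear features on the same toy chain; the features are chosen one-hot here purely so the learned $\theta$ can be checked against the tabular fixed point (any feature map $\phi$ could be substituted):

```python
import numpy as np

# TD(0) with linear function approximation on the toy chain from before.
rng = np.random.default_rng(3)
gamma = 0.9
P_mu = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.2],
                 [0.1, 0.0, 0.9]])
r = np.array([1.0, 0.0, 2.0])
phi = np.eye(3)                         # phi(s) = standard basis vector e_s

theta = np.zeros(3)
s = 0
for t in range(200_000):
    s_next = rng.choice(3, p=P_mu[s])
    alpha = 100.0 / (t + 1000)
    td_error = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
    theta += alpha * td_error * phi[s]  # theta_{t+1} = theta_t + alpha_t * delta_t * phi(s_t)
    s = s_next

print(theta)   # matches (I - gamma * P_mu)^{-1} r in this one-hot case
```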
Ingredients for Distributed RL (recap): stochastic approximation, Markov decision processes, function representation, network consensus (complicated probabilistic analysis).
Network consensus
Each node $i$ in a network has a number $x(i)$. We want each node to agree on the average
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x(i) = \frac{1}{N} \mathbf{1}^T x .$$
Node $i$ communicates with its neighbors $\mathcal{N}_i$. Iterate: take $v_0 = x$, then
$$v_{k+1}(i) = \sum_{j \in \mathcal{N}_i} W_{ij} \, v_k(j), \qquad \text{i.e.} \quad v_{k+1} = W v_k, \quad W \text{ doubly stochastic.}$$
Network consensus convergence
Nodes reach "consensus" quickly: with $v_{k+1} = W v_k$ and $W \mathbf{1} = \mathbf{1}$,
$$v_{k+1} - \bar{x} \mathbf{1} = W v_k - \bar{x} \mathbf{1} = W (v_k - \bar{x} \mathbf{1}),$$
so
$$\|v_{k+1} - \bar{x} \mathbf{1}\| = \|W (v_k - \bar{x} \mathbf{1})\| .$$
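A minimal sketch, assuming a 5-node ring graph and simple mixing weights of my own choosing, showing the consensus iteration $v_{k+1} = W v_k$ driving every node to the average:

```python
import numpy as np

# Average consensus on a 5-node ring with doubly stochastic mixing weights:
# each node averages itself with its two neighbors.
N = 5
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = 0.5
    W[i, (i - 1) % N] = 0.25
    W[i, (i + 1) % N] = 0.25             # rows and columns each sum to 1

rng = np.random.default_rng(4)
x = rng.standard_normal(N)
x_bar = x.mean()

v = x.copy()
for k in range(100):
    v = W @ v                            # v_{k+1} = W v_k
    if k % 25 == 0:
        print(f"k = {k:3d}   max |v(i) - x_bar| = {np.abs(v - x_bar).max():.2e}")
```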