Steps to understanding Policy-gradient methods
• Policy approximation $\pi(a|s,\theta)$
• The average-reward (reward-rate) objective $\bar{r}(\theta)$
• Stochastic gradient ascent/descent $\Delta\theta_t \approx \alpha \frac{\partial \bar{r}(\theta)}{\partial \theta}$
• The policy-gradient theorem and its proof
• Approximating the gradient
• Eligibility functions for a few cases
• A final algorithm
Policy Approximation
• Policy = a function from state to action
• How does the agent select actions?
  • In such a way that it can be affected by learning?
  • In such a way as to assure exploration?
• Approximation: there are too many states and/or actions to represent all policies
• To handle large/continuous action spaces
What is learned and stored?
1. Action-value methods: learn the value of each action; pick the max (usually)
2. Policy-gradient methods: learn the parameters $\mathbf{u}$ of a stochastic policy, updated by $\nabla_{\mathbf{u}}\,\text{Performance}$
   • including actor-critic methods, which learn both value and policy parameters
3. Dynamic Policy Programming
4. Drift-diffusion models (Psychology)
Actor-critic architecture
[diagram: actor and critic components interacting with the world]
Action-value methods
• The value of an action in a state, given a policy, is the expected future reward starting from the state, taking that first action, then following the policy thereafter:
$$q_\pi(s,a) = \mathbb{E}\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} R_t \,\middle|\, S_0 = s,\, A_0 = a\right]$$
• Policy: pick the max most of the time,
$$\hat{A}_t = \arg\max_a Q_t(S_t, a),$$
but sometimes pick at random ($\varepsilon$-greedy)
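A minimal sketch of $\varepsilon$-greedy action selection over stored action-value estimates (NumPy-based; the array `Q` of estimates and the exploration rate `epsilon` are illustrative assumptions):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng=np.random.default_rng()):
    """Pick the greedy action most of the time, a random action with probability epsilon.

    Q: 1-D array of action-value estimates Q_t(a) for the current state.
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: uniform random action
    return int(np.argmax(Q))               # exploit: greedy action
```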
Why approximate policies rather than values?
• In many problems, the policy is simpler to approximate than the value function
• In many problems, the optimal policy is stochastic
  • e.g., bluffing, POMDPs
• To enable smoother change in policies
• To avoid a search on every step (the max)
• To better relate to biology
Gradient-bandit algorithm
• Store action preferences $H_t(a)$ rather than action-value estimates $Q_t(a)$
• Instead of $\varepsilon$-greedy, pick actions by an exponential soft-max:
$$\Pr\{A_t = a\} \doteq \pi_t(a) \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$$
• Also store the sample average of rewards as $\bar{R}_t$
• Then update:
$$H_{t+1}(a) = H_t(a) + \alpha\,\big(R_t - \bar{R}_t\big)\big(\mathbb{1}_{a=A_t} - \pi_t(a)\big)$$
  where $\mathbb{1}_{a=A_t}$ is 1 or 0, depending on whether the predicate (subscript) is true
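A minimal sketch of the gradient-bandit update above for a k-armed bandit (the names `k`, `alpha`, and the `pull` reward function are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_bandit(pull, k, steps=1000, alpha=0.1, baseline=True, seed=0):
    """Run the gradient-bandit algorithm on a k-armed bandit.

    pull(a) -> reward for choosing arm a (supplied by the caller).
    """
    rng = np.random.default_rng(seed)
    H = np.zeros(k)            # action preferences H_t(a)
    R_bar = 0.0                # sample average of rewards (the baseline)
    for t in range(1, steps + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                               # soft-max action probabilities
        a = rng.choice(k, p=pi)                      # sample A_t ~ pi_t
        R = pull(a)
        if baseline:
            R_bar += (R - R_bar) / t                 # incremental sample average
        one_hot = np.zeros(k)
        one_hot[a] = 1.0
        H += alpha * (R - R_bar) * (one_hot - pi)    # preference update
    return H, pi
```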
Gradient-bandit algorithms on the 10-armed testbed
[Figure 2.6: Average performance of the gradient-bandit algorithm with and without a reward baseline on the 10-armed testbed, when the $q_*(a)$ are chosen to be near +4 rather than near zero. The plot shows % optimal action over 1000 steps for $\alpha = 0.1$ and $\alpha = 0.4$, with and without the baseline.]
Recall the quotient rule:
$$\frac{\partial}{\partial x}\left[\frac{f(x)}{g(x)}\right] = \frac{\frac{\partial f(x)}{\partial x}\, g(x) - f(x)\,\frac{\partial g(x)}{\partial x}}{g(x)^2}$$
Then:
$$\begin{aligned}
\frac{\partial \pi_t(b)}{\partial H_t(a)} &= \frac{\partial}{\partial H_t(a)}\left[\frac{e^{H_t(b)}}{\sum_{c=1}^{k} e^{H_t(c)}}\right]\\
&= \frac{\frac{\partial e^{H_t(b)}}{\partial H_t(a)} \sum_{c=1}^{k} e^{H_t(c)} - e^{H_t(b)}\, \frac{\partial \sum_{c=1}^{k} e^{H_t(c)}}{\partial H_t(a)}}{\left(\sum_{c=1}^{k} e^{H_t(c)}\right)^2} && \text{(by the quotient rule)}\\
&= \frac{\mathbb{1}_{a=b}\, e^{H_t(a)} \sum_{c=1}^{k} e^{H_t(c)} - e^{H_t(b)}\, e^{H_t(a)}}{\left(\sum_{c=1}^{k} e^{H_t(c)}\right)^2} && \left(\text{because } \tfrac{\partial e^x}{\partial x} = e^x\right)\\
&= \frac{\mathbb{1}_{a=b}\, e^{H_t(b)}}{\sum_{c=1}^{k} e^{H_t(c)}} - \frac{e^{H_t(b)}\, e^{H_t(a)}}{\left(\sum_{c=1}^{k} e^{H_t(c)}\right)^2}\\
&= \mathbb{1}_{a=b}\, \pi_t(b) - \pi_t(b)\, \pi_t(a)\\
&= \pi_t(b)\big(\mathbb{1}_{a=b} - \pi_t(a)\big). \qquad \text{Q.E.D.}
\end{aligned}$$
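A quick numerical check of the identity $\partial\pi_t(b)/\partial H_t(a) = \pi_t(b)\big(\mathbb{1}_{a=b} - \pi_t(a)\big)$, comparing it against a finite-difference approximation (a sketch; the preference vector `H` below is an arbitrary example):

```python
import numpy as np

def softmax(H):
    z = np.exp(H - H.max())
    return z / z.sum()

H = np.array([0.3, -1.2, 0.8, 0.0])
pi = softmax(H)
a, b, eps = 2, 1, 1e-6

analytic = pi[b] * ((1.0 if a == b else 0.0) - pi[a])

H_plus = H.copy(); H_plus[a] += eps             # perturb preference H(a)
numeric = (softmax(H_plus)[b] - pi[b]) / eps    # finite-difference d pi(b) / d H(a)

print(analytic, numeric)                        # the two agree to ~1e-6
```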
Steps to understanding Policy-gradient methods
• Policy approximation $\pi(a|s,\theta)$
• The average-reward (reward-rate) objective $\bar{r}(\theta)$
• Stochastic gradient ascent/descent $\Delta\theta_t \approx \alpha \frac{\partial \bar{r}(\theta)}{\partial \theta}$
• The policy-gradient theorem and its proof
• Approximating the gradient
• Eligibility functions for a few cases
• A complete algorithm
e.g., linear-exponential policies (discrete actions)
• The "preference" for action $a$ in state $s$ is linear in $\theta$ and a state–action feature vector $\phi(s,a)$
• The probability of action $a$ in state $s$ is exponential in its preference:
$$\pi(a|s,\theta) \doteq \frac{\exp\!\big(\theta^\top \phi(s,a)\big)}{\sum_b \exp\!\big(\theta^\top \phi(s,b)\big)}$$
• Corresponding eligibility function (see the sketch below):
$$\frac{\nabla_\theta\, \pi(a|s,\theta)}{\pi(a|s,\theta)} = \phi(s,a) - \sum_b \pi(b|s,\theta)\, \phi(s,b)$$
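A minimal sketch of the linear-exponential policy and its eligibility vector, with linear preferences over state–action features (the feature matrix `phi_sa` and its dimensionality are assumptions for illustration):

```python
import numpy as np

def softmax_policy(theta, phi_sa):
    """Action probabilities for a linear-exponential (soft-max) policy.

    phi_sa: array of shape (num_actions, n) -- feature vector phi(s, a) for each action.
    """
    prefs = phi_sa @ theta                 # linear preferences theta^T phi(s, a)
    z = np.exp(prefs - prefs.max())        # subtract max for numerical stability
    return z / z.sum()

def eligibility(theta, phi_sa, a):
    """grad log pi(a|s,theta) = phi(s,a) - sum_b pi(b|s,theta) phi(s,b)."""
    pi = softmax_policy(theta, phi_sa)
    return phi_sa[a] - pi @ phi_sa
```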
Policy-gradient setup
parameterized policies:
$$\pi(a|s,\theta) \doteq \Pr\{A_t = a \mid S_t = s\}$$
average-reward objective:
$$r(\pi) \doteq \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \mathbb{E}_\pi[R_t] = \sum_s d_\pi(s) \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\, r$$
steady-state distribution:
$$d_\pi(s) \doteq \lim_{t\to\infty} \Pr\{S_t = s\}$$
differential state-value function:
$$\tilde{v}_\pi(s) \doteq \sum_{k=1}^{\infty} \mathbb{E}_\pi[R_{t+k} - r(\pi) \mid S_t = s]$$
differential action-value function:
$$\tilde{q}_\pi(s,a) \doteq \sum_{k=1}^{\infty} \mathbb{E}_\pi[R_{t+k} - r(\pi) \mid S_t = s, A_t = a]$$
stochastic gradient ascent:
$$\Delta\theta_t \approx \alpha \frac{\partial r(\pi)}{\partial \theta} \doteq \alpha\, \nabla r(\pi)$$
the policy-gradient theorem:
$$\nabla r(\pi) = \sum_s d_\pi(s) \sum_a \tilde{q}_\pi(s,a)\, \nabla \pi(a|s,\theta)$$
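As a concrete illustration of $d_\pi$ and $r(\pi)$, here is a sketch that computes the steady-state distribution and the average reward of a small, made-up continuing MDP under a fixed policy (the transition tensor `P`, reward tensor `R`, and policy `pi` are arbitrary examples, not from the slides):

```python
import numpy as np

# A tiny continuing MDP: 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a, s'] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [3.0, 0.0]]])
pi = np.array([[0.7, 0.3],      # pi(a | s=0)
               [0.4, 0.6]])     # pi(a | s=1)

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) P[s, a, s']
P_pi = np.einsum('sa,sat->st', pi, P)

# Steady-state distribution d_pi: left eigenvector of P_pi with eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d_pi /= d_pi.sum()

# Average reward r(pi) = sum_s d_pi(s) sum_a pi(a|s) sum_{s'} P[s,a,s'] R[s,a,s']
r_sa = np.einsum('sat,sat->sa', P, R)            # expected reward for each (s, a)
r_pi = d_pi @ np.einsum('sa,sa->s', pi, r_sa)
print(d_pi, r_pi)
```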
Approximating the gradient given by the policy-gradient theorem:
$$\begin{aligned}
\nabla r(\pi) &= \sum_s d_\pi(s) \sum_a \tilde{q}_\pi(s,a)\, \nabla \pi(a|s,\theta)\\
&= \mathbb{E}\!\left[\big(\tilde{q}_\pi(S_t,A_t) - \tilde{v}(S_t)\big) \frac{\nabla \pi(A_t|S_t,\theta)}{\pi(A_t|S_t)} \,\middle|\, S_t \sim d_\pi,\, A_t \sim \pi(\cdot|S_t,\theta)\right]\\
&= \mathbb{E}\!\left[\big(\tilde{G}^\lambda_t - \hat{v}(S_t,\mathbf{w})\big) \frac{\nabla \pi(A_t|S_t,\theta)}{\pi(A_t|S_t)} \,\middle|\, S_t \sim d_\pi,\, A_{t:\infty} \sim \pi\right]\\
&\approx \big(\tilde{G}^\lambda_t - \hat{v}(S_t,\mathbf{w})\big) \frac{\nabla \pi(A_t|S_t,\theta)}{\pi(A_t|S_t)} \qquad \text{(by sampling under } \pi\text{)}
\end{aligned}$$
This yields the update
$$\theta_{t+1} \doteq \theta_t + \alpha \big(\tilde{G}^\lambda_t - \hat{v}(S_t,\mathbf{w})\big) \frac{\nabla \pi(A_t|S_t,\theta)}{\pi(A_t|S_t)}$$
e.g., in the one-step linear case:
$$\theta_{t+1} = \theta_t + \alpha \big(R_{t+1} - \bar{R}_t + \mathbf{w}_t^\top \phi_{t+1} - \mathbf{w}_t^\top \phi_t\big) \frac{\nabla \pi(A_t|S_t,\theta)}{\pi(A_t|S_t)} \doteq \theta_t + \alpha\, \delta_t\, e(A_t, S_t)$$
A sketch of this one-step update appears below.
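A minimal sketch of the one-step linear case above: form the average-reward TD error $\delta_t$ from a linear critic and move $\theta$ along the policy's eligibility vector. The feature vectors, step size, and the `eligibility` helper from the linear-exponential sketch earlier are assumptions:

```python
import numpy as np

def one_step_actor_update(theta, w, R, R_bar, phi_t, phi_next, phi_sa_t, a_t, alpha):
    """One-step actor update: theta <- theta + alpha * delta * grad log pi(A_t|S_t, theta).

    R        : observed reward R_{t+1}
    R_bar    : current average-reward estimate
    phi_t    : state features phi(S_t);  phi_next: phi(S_{t+1})
    phi_sa_t : state-action features phi(S_t, a) for every action (for the soft-max policy)
    a_t      : index of the action actually taken
    """
    delta = R - R_bar + w @ phi_next - w @ phi_t   # average-reward TD error
    e = eligibility(theta, phi_sa_t, a_t)          # grad log pi(A_t|S_t, theta), from the sketch above
    return theta + alpha * delta * e
```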
Deriving the policy-gradient theorem, $\nabla r(\pi) = \sum_s d_\pi(s) \sum_a \tilde{q}_\pi(s,a)\, \nabla\pi(a|s,\theta)$:
$$\begin{aligned}
\nabla \tilde{v}_\pi(s) &= \nabla \sum_a \pi(a|s,\theta)\, \tilde{q}_\pi(s,a)\\
&= \sum_a \Big[\nabla\pi(a|s,\theta)\, \tilde{q}_\pi(s,a) + \pi(a|s,\theta)\, \nabla\tilde{q}_\pi(s,a)\Big]\\
&= \sum_a \Big[\nabla\pi(a|s,\theta)\, \tilde{q}_\pi(s,a) + \pi(a|s,\theta)\, \nabla \sum_{s',r} p(s',r|s,a)\big[r - r(\pi) + \tilde{v}_\pi(s')\big]\Big]\\
&= \sum_a \Big[\nabla\pi(a|s,\theta)\, \tilde{q}_\pi(s,a) + \pi(a|s,\theta)\Big[-\nabla r(\pi) + \sum_{s'} p(s'|s,a)\, \nabla\tilde{v}_\pi(s')\Big]\Big]\\
\therefore\quad \nabla r(\pi) &= \sum_a \Big[\nabla\pi(a|s,\theta)\, \tilde{q}_\pi(s,a) + \pi(a|s,\theta) \sum_{s'} p(s'|s,a)\, \nabla\tilde{v}_\pi(s')\Big] - \nabla\tilde{v}_\pi(s)\\
\therefore\quad \sum_s d_\pi(s)\, \nabla r(\pi) &= \sum_s d_\pi(s) \sum_a \nabla\pi(a|s,\theta)\, \tilde{q}_\pi(s,a)\\
&\quad + \sum_s d_\pi(s) \sum_a \pi(a|s,\theta) \sum_{s'} p(s'|s,a)\, \nabla\tilde{v}_\pi(s') - \sum_s d_\pi(s)\, \nabla\tilde{v}_\pi(s)
\end{aligned}$$
Continuing, and exchanging the order of summation in the middle term,
$$\begin{aligned}
\sum_s d_\pi(s)\, \nabla r(\pi) &= \sum_s d_\pi(s) \sum_a \nabla\pi(a|s,\theta)\, \tilde{q}_\pi(s,a)\\
&\quad + \sum_{s'} \underbrace{\sum_s d_\pi(s) \sum_a \pi(a|s,\theta)\, p(s'|s,a)}_{d_\pi(s')} \nabla\tilde{v}_\pi(s') - \sum_s d_\pi(s)\, \nabla\tilde{v}_\pi(s)
\end{aligned}$$
The underbraced quantity is $d_\pi(s')$ because $d_\pi$ is the steady-state distribution, so the last two terms cancel; and since $\sum_s d_\pi(s) = 1$,
$$\nabla r(\pi) = \sum_s d_\pi(s) \sum_a \nabla\pi(a|s,\theta)\, \tilde{q}_\pi(s,a). \qquad \text{Q.E.D.}$$
Complete PG algorithm
Initialize the policy parameters $\theta \in \mathbb{R}^n$ and the state-value-function parameters $\mathbf{w} \in \mathbb{R}^m$
Initialize eligibility traces $e^\theta \in \mathbb{R}^n$ and $e^{\mathbf{w}} \in \mathbb{R}^m$ to 0
Initialize $\bar{R} = 0$
On each step, in state $S$:
    Choose $A$ according to $\pi(\cdot|S,\theta)$
    Take action $A$, observe $S'$, $R$
    $\delta \leftarrow R - \bar{R} + \hat{v}(S',\mathbf{w}) - \hat{v}(S,\mathbf{w})$   (form TD error from critic)
    $\bar{R} \leftarrow \bar{R} + \alpha^{\bar{R}}\, \delta$   (update average-reward estimate)
    $e^{\mathbf{w}} \leftarrow \lambda\, e^{\mathbf{w}} + \nabla_{\mathbf{w}}\, \hat{v}(S,\mathbf{w})$   (update eligibility trace for critic)
    $\mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta\, e^{\mathbf{w}}$   (update critic parameters)
    $e^\theta \leftarrow \lambda\, e^\theta + \dfrac{\nabla_\theta\, \pi(A|S,\theta)}{\pi(A|S,\theta)}$   (update eligibility trace for actor)
    $\theta \leftarrow \theta + \alpha^\theta\, \delta\, e^\theta$   (update actor parameters)
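A runnable sketch of this algorithm for a discrete-action continuing task, using a linear critic and the linear-exponential (soft-max) policy from earlier; the environment interface (`env.reset`, `env.step`) and feature functions `phi_s` / `phi_sa` are assumptions for illustration:

```python
import numpy as np

def actor_critic(env, phi_s, phi_sa, num_actions, steps=100_000,
                 alpha_r=0.01, alpha_w=0.1, alpha_theta=0.1, lam=0.8, seed=0):
    """Average-reward actor-critic with eligibility traces (a sketch of the slide's algorithm).

    phi_s(s)  -> state feature vector (length m), for the linear critic v_hat(s, w) = w @ phi_s(s)
    phi_sa(s) -> array (num_actions, n) of state-action features, for the soft-max actor
    env.reset() -> s;  env.step(a) -> (s_next, reward)   [continuing task: no terminal states]
    """
    rng = np.random.default_rng(seed)
    s = env.reset()
    n, m = phi_sa(s).shape[1], len(phi_s(s))
    theta, w = np.zeros(n), np.zeros(m)          # actor and critic parameters
    e_theta, e_w = np.zeros(n), np.zeros(m)      # eligibility traces
    R_bar = 0.0                                  # average-reward estimate

    for _ in range(steps):
        # soft-max policy pi(.|S, theta)
        prefs = phi_sa(s) @ theta
        pi = np.exp(prefs - prefs.max()); pi /= pi.sum()
        a = rng.choice(num_actions, p=pi)

        s_next, R = env.step(a)

        delta = R - R_bar + w @ phi_s(s_next) - w @ phi_s(s)   # TD error from critic
        R_bar += alpha_r * delta                               # average-reward estimate

        e_w = lam * e_w + phi_s(s)                             # critic trace (grad_w v_hat = phi(s))
        w += alpha_w * delta * e_w                             # critic update

        grad_log_pi = phi_sa(s)[a] - pi @ phi_sa(s)            # actor eligibility
        e_theta = lam * e_theta + grad_log_pi
        theta += alpha_theta * delta * e_theta                 # actor update

        s = s_next
    return theta, w, R_bar
```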
The generality of the policy-gradient strategy
• Can be applied whenever we can compute the effect of parameter changes on the action probabilities, $\nabla_\theta\, \pi(A_t|S_t,\theta)$
• E.g., has been applied to spiking neuron models
• There are many possibilities other than linear-exponential and linear-gaussian
  • e.g., a mixture of random, argmax, and fixed-width gaussian (learn the mixing weights); drift-diffusion models
e.g., linear-gaussian policies (continuous actions)
[figure: a Gaussian probability density over the continuous action, with $\mu$ and $\sigma$ linear in the state]
e.g., linear-gaussian policies (continuous actions)
• The mean and std. dev. for the action taken in state $s$ are linear and linear-exponential in $\theta$:
$$\mu(s) \doteq \theta_\mu^\top \phi(s), \qquad \sigma(s) \doteq \exp\!\big(\theta_\sigma^\top \phi(s)\big), \qquad \theta \doteq \big(\theta_\mu^\top;\, \theta_\sigma^\top\big)^\top$$
• The probability density function for the action taken in state $s$ is gaussian:
$$\pi(a|s,\theta) \doteq \frac{1}{\sigma(s)\sqrt{2\pi}} \exp\!\left(-\frac{(a-\mu(s))^2}{2\sigma(s)^2}\right)$$
Gaussian eligibility functions
$$\frac{\nabla_{\theta_\mu}\, \pi(a|s,\theta)}{\pi(a|s,\theta)} = \frac{1}{\sigma(s)^2}\big(a - \mu(s)\big)\, \phi_\mu(s)$$
$$\frac{\nabla_{\theta_\sigma}\, \pi(a|s,\theta)}{\pi(a|s,\theta)} = \left(\frac{(a-\mu(s))^2}{\sigma(s)^2} - 1\right) \phi_\sigma(s)$$
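A minimal sketch of sampling from a linear-gaussian policy and computing the two eligibility (grad-log-probability) vectors above (the state feature vectors `phi_mu` and `phi_sigma` are assumptions for illustration):

```python
import numpy as np

def gaussian_policy_step(theta_mu, theta_sigma, phi_mu, phi_sigma,
                         rng=np.random.default_rng()):
    """Sample an action from a linear-gaussian policy and return its eligibilities.

    mu(s)    = theta_mu @ phi_mu            (linear in the state features)
    sigma(s) = exp(theta_sigma @ phi_sigma) (linear-exponential in the state features)
    """
    mu = theta_mu @ phi_mu
    sigma = np.exp(theta_sigma @ phi_sigma)
    a = rng.normal(mu, sigma)                                  # A_t ~ N(mu(s), sigma(s)^2)

    elig_mu = (a - mu) / sigma**2 * phi_mu                     # grad_{theta_mu} log pi
    elig_sigma = ((a - mu)**2 / sigma**2 - 1.0) * phi_sigma    # grad_{theta_sigma} log pi
    return a, elig_mu, elig_sigma
```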