On Conservative Policy Iteration
Bruno Scherrer
INRIA Lorraine, LORIA
ICML 2014
Motivation / Context
• A large Markov Decision Process
• A policy space Π
• A reference policy π ∈ Π
• On-policy data from π

Can we compute a provably better policy?

• Conservative Policy Iteration (Kakade & Langford, 2002; Kakade, 2003)
• When local (gradient) optimization induces a (good) global performance guarantee
Outline
1 Markov Decision Processes
2 Conservative Policy Iteration
3 Practical Issues for a Guaranteed Improvement
1 Markov Decision Processes
Infinite-Horizon Markov Decision Process
(Puterman, 1994; Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998)

Markov Decision Process (MDP):
• X is the state space,
• A is the action space,
• r : X → R is the reward function   (r_t = r(x_t)),
• p : X × A → Δ_X is the transition function   (x_{t+1} ∼ p(·|x_t, a_t)).

Problem: Find a policy π : X → A that maximizes the value v_π(x) for all x:
\[
v_\pi(x) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, x_0 = x,\ \forall t,\ a_t = \pi(x_t) \right] \qquad (\gamma \in (0,1)).
\]
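The value definition above can be checked numerically. The following is a minimal sketch, not from the talk: it builds a small random tabular MDP and estimates v_π(x_0) by Monte Carlo, averaging truncated discounted returns under a fixed deterministic policy. All names (P, r, pi, mc_value) and the problem sizes are hypothetical placeholders.

```python
# Monte Carlo estimate of v_pi(x0) = E[ sum_t gamma^t r(x_t) | x_0, a_t = pi(x_t) ].
# Hypothetical small tabular MDP, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9

# Random transition kernel P[x, a] in Delta_X and reward r(x).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(size=n_states)
pi = rng.integers(n_actions, size=n_states)      # deterministic policy pi: X -> A

def mc_value(x0, n_rollouts=2000, horizon=200):
    """Average of sum_t gamma^t r(x_t) with a_t = pi(x_t), truncated at `horizon`."""
    returns = []
    for _ in range(n_rollouts):
        x, g, disc = x0, 0.0, 1.0
        for _ in range(horizon):
            g += disc * r[x]
            x = rng.choice(n_states, p=P[x, pi[x]])   # x_{t+1} ~ p(.|x_t, pi(x_t))
            disc *= gamma
        returns.append(g)
    return np.mean(returns)

print(mc_value(0))
```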
Notations
• For any policy π, v_π is the unique solution of the Bellman equation:
\[
\forall x,\quad v_\pi(x) = r(x) + \gamma \sum_{y \in X} p(y \mid x, \pi(x))\, v_\pi(y)
\;\Leftrightarrow\; v_\pi = T_\pi v_\pi
\;\Leftrightarrow\; v_\pi = r + \gamma P_\pi v_\pi
\;\Leftrightarrow\; v_\pi = (I - \gamma P_\pi)^{-1} r.
\]
• The optimal value v_* is the unique solution of the Bellman optimality equation:
\[
\forall x,\quad v_*(x) = \max_{a \in A} \Big\{ r(x) + \gamma \sum_{y \in X} p(y \mid x, a)\, v_*(y) \Big\}
\;\Leftrightarrow\; v_* = T v_*
\;\Leftrightarrow\; v_* = \max_{\pi} T_\pi v_*.
\]
• π is a greedy policy w.r.t. v, written π = G v, iff
\[
\forall x,\quad \pi(x) \in \arg\max_{a \in A} \Big\{ r(x) + \gamma \sum_{y \in X} p(y \mid x, a)\, v(y) \Big\}
\;\Leftrightarrow\; T_\pi v = T v.
\]
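As a companion to these notations, here is a small sketch of exact policy evaluation, solving v_π = (I − γ P_π)^{-1} r, and of the greedy operator G. The function names are illustrative, not from the slides; P and r follow the shapes of the previous sketch (P[x, a] is a distribution over next states).

```python
# Exact policy evaluation and the greedy operator on a tabular MDP.
# P has shape (n_states, n_actions, n_states); r has shape (n_states,).
import numpy as np

def evaluate(pi, P, r, gamma):
    """Solve the Bellman equation v_pi = r + gamma P_pi v_pi exactly."""
    n = len(r)
    P_pi = P[np.arange(n), pi]                 # row x is p(.|x, pi(x))
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r)

def greedy(v, P, r, gamma):
    """Return a policy pi = G v, i.e. T_pi v = T v (argmax over actions)."""
    q = r[:, None] + gamma * P @ v             # q[x, a] = r(x) + gamma sum_y p(y|x,a) v(y)
    return np.argmax(q, axis=1)
```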
2 Conservative Policy Iteration
Approximate Policy Iteration

(Exact) Policy Iteration:
\[
\pi_{k+1} \leftarrow G\, v_{\pi_k} \qquad \text{(where } v_{\pi_k} = T_{\pi_k} v_{\pi_k}\text{)}
\]
• Guaranteed improvement in all states.

• π is (ε, ν)-approximately greedy with respect to v, written π = G_ε(ν, v), iff
\[
\nu^T (T v - T_\pi v) = \mathbb{E}_{x \sim \nu}\big[ (Tv)(x) - (T_\pi v)(x) \big] \le \epsilon.
\]

API (Bertsekas & Tsitsiklis, 1996):
\[
\pi_{k+1} \leftarrow G_\epsilon(\nu, v_{\pi_k})
\]
• Performance may decrease in all states!
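To make the contrast concrete, the sketch below is an illustrative reconstruction (not code from the talk) of exact policy iteration, together with the ν-weighted Bellman gap ν^T(Tv − T_{π'}v) that appears in the (ε, ν)-approximately-greedy definition; all function names are hypothetical.

```python
# Exact policy iteration, plus the nu-weighted greedy gap used above.
# P has shape (n_states, n_actions, n_states); r and nu have shape (n_states,).
import numpy as np

def q_values(v, P, r, gamma):
    """q[x, a] = r(x) + gamma * sum_y p(y|x,a) v(y)  (so that T v = max_a q)."""
    return r[:, None] + gamma * P @ v

def policy_iteration(P, r, gamma, n_iter=50):
    n = len(r)
    pi = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        P_pi = P[np.arange(n), pi]
        v = np.linalg.solve(np.eye(n) - gamma * P_pi, r)      # v_pi = T_pi v_pi
        new_pi = np.argmax(q_values(v, P, r, gamma), axis=1)  # exact greedy step
        if np.array_equal(new_pi, pi):
            break                                             # fixed point: pi is optimal
        pi = new_pi
    return pi

def greedy_gap(pi_prime, v, P, r, gamma, nu):
    """nu^T (T v - T_{pi'} v); pi' = G_eps(nu, v) iff this quantity is <= epsilon."""
    q = q_values(v, P, r, gamma)
    return nu @ (q.max(axis=1) - q[np.arange(len(v)), pi_prime])
```

With an exact greedy step the gap is zero and improvement holds in every state; API only controls this gap on average under ν, which is why its performance can still decrease elsewhere.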
Conservative Policy Iteration as a Projected Gradient Ascent Algorithm
• π: current policy
• π′: alternative policy
• π_α = (1 − α) π + α π′: α-mixture of π and π′

Taylor expansion of α ↦ ν^T v_{π_α} = E_{x∼ν}[v_{π_α}(x)] around α = 0:
\[
\begin{aligned}
\nu^T (v_{\pi_\alpha} - v_\pi)
&= \nu^T \big[ (I - \gamma P_{\pi_\alpha})^{-1} r - v_\pi \big] \\
&= \nu^T (I - \gamma P_{\pi_\alpha})^{-1} \big( r - v_\pi + \gamma P_{\pi_\alpha} v_\pi \big) \\
&= \nu^T \big[ (I - \gamma P_\pi)^{-1} + o(\alpha) \big] \big( T_{\pi_\alpha} v_\pi - T_\pi v_\pi \big) \\
&= \nu^T \big[ (I - \gamma P_\pi)^{-1} + o(\alpha) \big]\, \alpha \big( T_{\pi'} v_\pi - T_\pi v_\pi \big) \\
&= \frac{\alpha}{1 - \gamma}\, d_{\nu,\pi}^T \big( T_{\pi'} v_\pi - T_\pi v_\pi \big) + o(\alpha^2)
\qquad \text{with } d_{\nu,\pi}^T = (1 - \gamma)\, \nu^T (I - \gamma P_\pi)^{-1}.
\end{aligned}
\]
• The steepest direction is π′ ∈ G v_π.
• Choosing π′ ∈ G_ε(d_{ν,π}, v_π) amounts to finding an approximately steepest direction.
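A rough sketch of one conservative update along these lines is given below, with hypothetical names (not code from the talk): it computes the discounted occupancy d_{ν,π}, takes π′ greedy with respect to v_π (exactly here, whereas CPI only requires approximate greediness under d_{ν,π}), and returns the stochastic mixture π_α = (1 − α)π + απ′. The step size α is left as a free parameter; Kakade & Langford (2002) derive a choice of α that guarantees ν^T v_{π_α} ≥ ν^T v_π.

```python
# One conservative (CPI-style) update on a tabular MDP, as a single-step sketch.
# pi is a deterministic policy (array of action indices); the returned mixture
# policy is stochastic, represented as a (state, action) probability table.
import numpy as np

def cpi_step(pi, P, r, gamma, nu, alpha=0.1):
    n, n_actions = P.shape[0], P.shape[1]
    P_pi = P[np.arange(n), pi]                              # p(.|x, pi(x)) row-wise
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, r)        # v_pi
    q = r[:, None] + gamma * P @ v                          # q(x, a) w.r.t. v_pi
    pi_prime = np.argmax(q, axis=1)                         # steepest direction: pi' in G v_pi
    # Discounted occupancy d_{nu,pi}^T = (1 - gamma) nu^T (I - gamma P_pi)^{-1},
    # the distribution under which the gradient weighs the advantage of pi'.
    d = (1 - gamma) * np.linalg.solve((np.eye(n) - gamma * P_pi).T, nu)
    # Conservative move: mix the two policies rather than switching outright.
    pi_alpha = np.zeros((n, n_actions))
    pi_alpha[np.arange(n), pi] += 1 - alpha
    pi_alpha[np.arange(n), pi_prime] += alpha
    return pi_alpha, d
```

Note that after one such step the policy is stochastic, so iterating this sketch would require an evaluation routine for stochastic policies; the point here is only the structure of the update.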